Thanks Robert, I had been trying to get your ConcatFilter to work, but I'm not sure what I need in the classpath and where Token comes from. Will check the thread you mention.
Best,
Nick

On 11 Nov 2010, at 18:13, Robert Gründler wrote:

> I've posted a ConcatFilter in my previous mail which does concatenate tokens.
> This works fine, but I realized that what I wanted to achieve is implemented
> more easily in another way (by using 2 separate field types).
>
> Have a look at a previous mail I wrote to the list and the reply from Ahmet
> Arslan (topic: "EdgeNGram relevancy").
>
> best
>
> -robert
>
> On Nov 11, 2010, at 5:27 PM, Nick Martin wrote:
>
>> Hi Robert, All,
>>
>> I have a similar problem; here is my fieldType:
>> http://paste.pocoo.org/show/289910/
>> I want to include stopword removal and lowercase the incoming terms, the
>> idea being to take "Foo Bar Baz Ltd" and turn it into "foobarbaz" for the
>> EdgeNGram filter factory.
>> If anyone can tell me a simple way to concatenate tokens into one token
>> again, similar to the KeywordTokenizer, that would be super helpful.
>>
>> Many thanks
>>
>> Nick
>>
>> On 11 Nov 2010, at 00:23, Robert Gründler wrote:
>>
>>> On Nov 11, 2010, at 1:12 AM, Jonathan Rochkind wrote:
>>>
>>>> Are you sure you really want to throw out stopwords for your use case? I
>>>> don't think autocompletion will work how you want if you do.
>>>
>>> In our case I think it makes sense. The content is targeting the
>>> electronic music / DJ scene, so we have a lot of words like "DJ" or
>>> "featuring" which make sense to throw out of the query. Also, searches for
>>> "the beastie boys" and "beastie boys" should return a match in the
>>> autocompletion.
>>>
>>>> And if you don't... then why use the WhitespaceTokenizer and then try to
>>>> jam the tokens back together? Why not just NOT tokenize in the first
>>>> place? Use the KeywordTokenizer, which really should be called the
>>>> NonTokenizingTokenizer, because it doesn't tokenize at all, it just
>>>> creates one token from the entire input string.
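Jonathan's suggestion maps onto a field type along these lines (a sketch only: the type name is made up, and stopword removal is deliberately absent, since KeywordTokenizer emits a single token for StopFilter to inspect):

```xml
<fieldType name="autocomplete" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- one token for the entire input string -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- strip whitespace and punctuation from the single token -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
</fieldType>
```

That single-token limitation is exactly the StopWord problem Robert runs into below: with KeywordTokenizer there are no word boundaries left for StopFilter to work with.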
>>>
>>> I started out with the KeywordTokenizer, which worked well, except for the
>>> StopWord problem.
>>>
>>> For now, I've come up with a quick-and-dirty custom "ConcatFilter", which
>>> does what I'm after:
>>>
>>> public class ConcatFilter extends TokenFilter {
>>>
>>>   private TokenStream tstream;
>>>
>>>   protected ConcatFilter(TokenStream input) {
>>>     super(input);
>>>     this.tstream = input;
>>>   }
>>>
>>>   @Override
>>>   public Token next() throws IOException {
>>>     Token token = new Token();
>>>     StringBuilder builder = new StringBuilder();
>>>
>>>     TermAttribute termAttribute = (TermAttribute) tstream.getAttribute(TermAttribute.class);
>>>     TypeAttribute typeAttribute = (TypeAttribute) tstream.getAttribute(TypeAttribute.class);
>>>
>>>     boolean incremented = false;
>>>
>>>     while (tstream.incrementToken()) {
>>>       if (typeAttribute.type().equals("word")) {
>>>         builder.append(termAttribute.term());
>>>       }
>>>       incremented = true;
>>>     }
>>>
>>>     token.setTermBuffer(builder.toString());
>>>
>>>     if (incremented)
>>>       return token;
>>>
>>>     return null;
>>>   }
>>> }
>>>
>>> I'm not sure if this is a safe way to do this, as I'm not familiar with the
>>> whole Solr/Lucene implementation after all.
>>>
>>> best
>>>
>>> -robert
>>>
>>>> Then lowercase, remove whitespace (or not), do whatever else you want to
>>>> do to your single token to normalize it, and then edgengram it.
>>>>
>>>> If you include whitespace in the token, then when making your queries for
>>>> auto-complete, be sure to use a query parser that doesn't do
>>>> "pre-tokenization"; the 'field' query parser should work well for this.
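Stripped of the Lucene API, the behavior ConcatFilter is meant to complete (whitespace-tokenize, lowercase, drop stopwords, strip non-letters, then join the survivors back into one token) can be sketched in plain Java. This is an illustration only, with a made-up stopword list; it does not use Lucene's TokenStream machinery:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Plain-Java sketch of the whole analysis chain, ending in concatenation.
public class ConcatDemo {

    // Hypothetical stopword list for the example; a real one comes from stopwords.txt.
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "ltd", "dj", "featuring"));

    public static String analyze(String input) {
        StringBuilder out = new StringBuilder();
        for (String token : input.split("\\s+")) {   // WhitespaceTokenizer
            String t = token.toLowerCase();          // LowerCaseFilter
            if (STOPWORDS.contains(t)) continue;     // StopFilter
            t = t.replaceAll("[^a-z]", "");          // PatternReplaceFilter
            out.append(t);                           // the ConcatFilter step
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(analyze("Foo Bar Baz Ltd"));  // foobarbaz
        System.out.println(analyze("The Beastie Boys")); // beastieboys
    }
}
```

This is the "Foo Bar Baz Ltd" → "foobarbaz" transformation Nick asks about above, with stopword removal happening while word boundaries still exist.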
>>>>
>>>> Jonathan
>>>>
>>>> ________________________________________
>>>> From: Robert Gründler [rob...@dubture.com]
>>>> Sent: Wednesday, November 10, 2010 6:39 PM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Concatenate multiple tokens into one
>>>>
>>>> Hi,
>>>>
>>>> I've created the following filterchain in a field type; the idea is to use
>>>> it for autocompletion purposes:
>>>>
>>>> <!-- create tokens separated by whitespace -->
>>>> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>> <!-- lowercase everything -->
>>>> <filter class="solr.LowerCaseFilterFactory"/>
>>>> <!-- throw out stopwords -->
>>>> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
>>>> <!-- throw out everything except a-z -->
>>>> <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
>>>>
>>>> <!-- actually, here I would like to join multiple tokens together again,
>>>>      to provide one token for the EdgeNGramFilterFactory -->
>>>>
>>>> <!-- create edgeNGram tokens for autocomplete matches -->
>>>> <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
>>>>
>>>> With that kind of filterchain, the EdgeNGramFilterFactory will receive
>>>> multiple tokens on input strings with whitespace in them. This leads to the
>>>> following results:
>>>>
>>>> Input Query: "George Cloo"
>>>> Matches:
>>>> - "George Harrison"
>>>> - "John Clooridge"
>>>> - "George Smith"
>>>> - "George Clooney"
>>>> - etc.
>>>>
>>>> However, only "George Clooney" should match in the autocompletion use case.
>>>> Therefore, I'd like to add a filter before the EdgeNGramFilterFactory which
>>>> concatenates all the tokens generated by the WhitespaceTokenizerFactory.
>>>> Are there filters which can do such a thing?
>>>>
>>>> If not, are there examples how to implement a custom TokenFilter?
>>>>
>>>> thanks!
>>>>
>>>> -robert
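The over-matching Robert describes can be reproduced in a few lines of plain Java, without Lucene: edge n-grams built per whitespace token match any name sharing a first name or a last-name prefix, while edge n-grams of one concatenated token only match the full joined prefix. A sketch under those assumptions (helper names are made up):

```java
import java.util.HashSet;
import java.util.Set;

// Why per-token edge n-grams over-match for autocomplete,
// and why concatenating first fixes it.
public class EdgeNGramDemo {

    // All prefixes ("edge n-grams") of a single string.
    static Set<String> edgeNGrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 1; i <= s.length(); i++) {
            grams.add(s.substring(0, i));
        }
        return grams;
    }

    // Index each whitespace token separately (the problematic chain).
    static Set<String> perTokenGrams(String name) {
        Set<String> grams = new HashSet<>();
        for (String token : name.toLowerCase().split("\\s+")) {
            grams.addAll(edgeNGrams(token));
        }
        return grams;
    }

    // Index the whole name as one concatenated, letters-only token.
    static Set<String> concatenatedGrams(String name) {
        return edgeNGrams(name.toLowerCase().replaceAll("[^a-z]", ""));
    }

    public static void main(String[] args) {
        // Query "George Cloo": per-token, "George Harrison" matches on the
        // "george" gram and "John Clooridge" on the "cloo" gram;
        // concatenated, only "George Clooney" contains "georgecloo".
        System.out.println(perTokenGrams("George Harrison").contains("george"));         // true (false positive)
        System.out.println(perTokenGrams("John Clooridge").contains("cloo"));            // true (false positive)
        System.out.println(concatenatedGrams("George Harrison").contains("georgecloo")); // false
        System.out.println(concatenatedGrams("George Clooney").contains("georgecloo"));  // true
    }
}
```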