Re: Concatenate multiple tokens into one

Robert Gründler Thu, 11 Nov 2010 12:00:12 -0800

this is the full source code, but be warned, i'm not a java developer, and i 
have no background in lucene/solr development:


// ConcatFilter

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class ConcatFilter extends TokenFilter {

  protected ConcatFilter(TokenStream input) 
  {
    super(input);               
  }

  @Override
  public Token next() throws IOException 
  {
    Token token = new Token();
    StringBuilder builder = new StringBuilder();

    // the filter shares its attributes with the wrapped input stream
    TermAttribute termAttribute =
        (TermAttribute) input.getAttribute(TermAttribute.class);
    TypeAttribute typeAttribute =
        (TypeAttribute) input.getAttribute(TypeAttribute.class);

    boolean hasToken = false;

    while (input.incrementToken()) 
    {
      if (typeAttribute.type().equals("word")) {
        builder.append(termAttribute.term());
        hasToken = true;
      }                 
    }

    if (hasToken) {
      token.setTermBuffer(builder.toString());
      return token;
    }
      
    return null;
  }
}
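
For anyone who wants to check the concatenation logic without Lucene on the 
classpath, here is a rough plain-Java sketch of what the loop above does. The 
ConcatSketch and Tok names are made up for this example and are not Lucene 
classes; Tok just stands in for a token's term text and type.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-in for the filter logic; no Lucene classes involved.
class ConcatSketch {

    // Hypothetical stand-in for one token: its term text plus its type.
    static final class Tok {
        final String term;
        final String type;

        Tok(String term, String type) {
            this.term = term;
            this.type = type;
        }
    }

    // Appends the term of every token whose type is "word".
    // Returns null when no such token was seen, like the "return null" branch above.
    static String concat(List<Tok> tokens) {
        StringBuilder builder = new StringBuilder();
        boolean hasToken = false;
        for (Tok t : tokens) {
            if ("word".equals(t.type)) {
                builder.append(t.term);
                hasToken = true;
            }
        }
        return hasToken ? builder.toString() : null;
    }

    public static void main(String[] args) {
        List<Tok> tokens = Arrays.asList(
            new Tok("foo", "word"),
            new Tok("bar", "word"),
            new Tok("baz", "word"));
        System.out.println(concat(tokens)); // prints foobarbaz
    }
}
```

Tokens of any other type are skipped, so a stream without a single "word" 
token produces no output token at all.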

//ConcatFilterFactory:

import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenFilterFactory;

public class ConcatFilterFactory extends BaseTokenFilterFactory {

        @Override
        public TokenStream create(TokenStream stream) {
                return new ConcatFilter(stream);                
        }
}

and in your schema.xml, you can simply add the filterfactory using this element:

<filter class="com.example.ConcatFilterFactory" />
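
The element is meant to sit right before the EdgeNGramFilterFactory in the 
chain from my first mail. To illustrate why the position matters, here is a 
plain-Java simulation of that chain (no Lucene or Solr involved). The stopword 
list, the class and the method names are assumptions made up for this sketch, 
and the "any query token matches" rule is only a rough approximation of how 
the query side behaves.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simulates the analysis chain with and without a concatenation step.
class AutocompleteSketch {

    // Hypothetical stopword list, standing in for stopwords.txt.
    static final Set<String> STOPWORDS =
        new HashSet<>(Arrays.asList("the", "dj", "featuring"));

    // Whitespace split, lowercase, stopword removal, strip everything except a-z.
    static List<String> normalize(String input) {
        List<String> out = new ArrayList<>();
        for (String raw : input.trim().split("\\s+")) {
            String t = raw.toLowerCase(Locale.ROOT);
            if (STOPWORDS.contains(t)) {
                continue;
            }
            t = t.replaceAll("[^a-z]", "");
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Leading substrings of the token with lengths min..max (capped at the token length).
    static Set<String> edgeNGrams(String token, int min, int max) {
        Set<String> grams = new LinkedHashSet<>();
        for (int len = min; len <= Math.min(max, token.length()); len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    // Without concatenation: every token is n-grammed on its own,
    // and a single matching query token counts as a hit (assumed OR semantics).
    static boolean matchesPerToken(String query, String indexed) {
        Set<String> grams = new LinkedHashSet<>();
        for (String t : normalize(indexed)) {
            grams.addAll(edgeNGrams(t, 1, 25));
        }
        for (String q : normalize(query)) {
            if (grams.contains(q)) {
                return true;
            }
        }
        return false;
    }

    // With concatenation: one token per input, so the whole query must be a prefix.
    static boolean matchesConcatenated(String query, String indexed) {
        String q = String.join("", normalize(query));
        String joined = String.join("", normalize(indexed));
        return edgeNGrams(joined, 1, 25).contains(q);
    }

    public static void main(String[] args) {
        String query = "George Cloo";
        for (String name : Arrays.asList(
                "George Harrison", "John Clooridge", "George Smith", "George Clooney")) {
            System.out.println(name
                + " perToken=" + matchesPerToken(query, name)
                + " concatenated=" + matchesConcatenated(query, name));
        }
    }
}
```

In this simulation all four names match per token, while only "George Clooney" 
matches once the tokens are concatenated before the edge n-grams are built.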

Jar files i have included in the buildpath (can be found in the solr download 
package):

apache-solr-core-1.4.1.jar
lucene-analyzers-2.9.3.jar
lucene-core-2.9.3.jar


good luck ;)


-robert




On Nov 11, 2010, at 8:45 PM, Nick Martin wrote:

> Thanks Robert, I had been trying to get your ConcatFilter to work, but I'm 
> not sure what i need in the classpath and where Token comes from.
> Will check the thread you mention.
> 
> Best
> 
> Nick
> 
> On 11 Nov 2010, at 18:13, Robert Gründler wrote:
> 
>> I've posted a ConcatFilter in my previous mail which does concatenate tokens. 
>> This works fine, but i
>> realized that what i wanted to achieve is easier to implement another way 
>> (by using 2 separate field types).
>> 
>> Have a look at a previous mail i wrote to the list and the reply from Ahmet 
>> Arslan (topic: "EdgeNGram relevancy").
>> 
>> 
>> best
>> 
>> 
>> -robert
>> 
>> 
>> 
>> 
>> On Nov 11, 2010, at 5:27 PM, Nick Martin wrote:
>> 
>>> Hi Robert, All,
>>> 
>>> I have a similar problem, here is my fieldType, 
>>> http://paste.pocoo.org/show/289910/
>>> I want to include stopword removal and lowercase the incoming terms. The 
>>> idea being to take, "Foo Bar Baz Ltd" and turn it into "foobarbaz" for the 
>>> EdgeNgram filter factory.
>>> If anyone can tell me a simple way to concatenate tokens into one token 
>>> again, similar to the KeywordTokenizer, that would be super helpful.
>>> 
>>> Many thanks
>>> 
>>> Nick
>>> 
>>> On 11 Nov 2010, at 00:23, Robert Gründler wrote:
>>> 
>>>> 
>>>> On Nov 11, 2010, at 1:12 AM, Jonathan Rochkind wrote:
>>>> 
>>>>> Are you sure you really want to throw out stopwords for your use case?  I 
>>>>> don't think autocompletion will work how you want if you do. 
>>>> 
>>>> in our case i think it makes sense. the content is targeting the 
>>>> electronic music / dj scene, so we have a lot of words like "DJ" or 
>>>> "featuring" which
>>>> make sense to throw out of the query. Also searches for "the beastie boys" 
>>>> and "beastie boys" should return a match in the autocompletion.
>>>> 
>>>>> 
>>>>> And if you don't... then why use the WhitespaceTokenizer and then try to 
>>>>> jam the tokens back together? Why not just NOT tokenize in the first 
>>>>> place. Use the KeywordTokenizer, which really should be called the 
>>>>> NonTokenizingTokenizer, because it doesn't tokenize at all, it just 
>>>>> creates one token from the entire input string. 
>>>> 
>>>> I started out with the KeywordTokenizer, which worked well, except for the 
>>>> stopword problem.
>>>> 
>>>> For now, i've come up with a quick-and-dirty custom "ConcatFilter", which 
>>>> does what i'm after:
>>>> 
>>>> public class ConcatFilter extends TokenFilter {
>>>> 
>>>>    private TokenStream tstream;
>>>> 
>>>>    protected ConcatFilter(TokenStream input) {
>>>>            super(input);
>>>>            this.tstream = input;
>>>>    }
>>>> 
>>>>    @Override
>>>>    public Token next() throws IOException {
>>>>            
>>>>            Token token = new Token();
>>>>            StringBuilder builder = new StringBuilder();
>>>>            
>>>>            TermAttribute termAttribute = (TermAttribute) 
>>>> tstream.getAttribute(TermAttribute.class);
>>>>            TypeAttribute typeAttribute = (TypeAttribute) 
>>>> tstream.getAttribute(TypeAttribute.class);
>>>>            
>>>>            boolean incremented = false;
>>>>            
>>>>            while (tstream.incrementToken()) {
>>>>                    
>>>>                    if (typeAttribute.type().equals("word")) {
>>>>                            builder.append(termAttribute.term());           
>>>>                 
>>>>                    }
>>>>                    incremented = true;
>>>>            }
>>>>            
>>>>            token.setTermBuffer(builder.toString());
>>>>            
>>>>            if (incremented == true)
>>>>                    return token;
>>>>            
>>>>            return null;
>>>>    }
>>>> }
>>>> 
>>>> I'm not sure if this is a safe way to do this, as i'm not familiar with the 
>>>> whole solr/lucene implementation after all.
>>>> 
>>>> 
>>>> best
>>>> 
>>>> 
>>>> -robert
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> 
>>>>> Then lowercase, remove whitespace (or not), do whatever else you want to 
>>>>> do to your single token to normalize it, and then edgengram it. 
>>>>> 
>>>>> If you include whitespace in the token, then when making your queries for 
>>>>> auto-complete, be sure to use a query parser that doesn't do 
>>>>> "pre-tokenization", the 'field' query parser should work well for this. 
>>>>> 
>>>>> Jonathan
>>>>> 
>>>>> 
>>>>> 
>>>>> ________________________________________
>>>>> From: Robert Gründler [rob...@dubture.com]
>>>>> Sent: Wednesday, November 10, 2010 6:39 PM
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: Concatenate multiple tokens into one
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> i've created the following filterchain in a field type, the idea is to 
>>>>> use it for autocompletion purposes:
>>>>> 
>>>>> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <!-- create tokens 
>>>>> separated by whitespace -->
>>>>> <filter class="solr.LowerCaseFilterFactory"/> <!-- lowercase everything 
>>>>> -->
>>>>> <filter class="solr.StopFilterFactory" ignoreCase="true" 
>>>>> words="stopwords.txt" enablePositionIncrements="true" />  <!-- throw out 
>>>>> stopwords -->
>>>>> <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" 
>>>>> replacement="" replace="all" />  <!-- throw out everything except a-z 
>>>>> -->
>>>>> 
>>>>> <!-- actually, here i would like to join multiple tokens together again, 
>>>>> to provide one token for the EdgeNGramFilterFactory -->
>>>>> 
>>>>> <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" 
>>>>> maxGramSize="25" /> <!-- create edgeNGram tokens for autocomplete matches 
>>>>> -->
>>>>> 
>>>>> With that kind of filterchain, the EdgeNGramFilterFactory will receive 
>>>>> multiple tokens on input strings with whitespaces in it. This leads to 
>>>>> the following results:
>>>>> Input Query: "George Cloo"
>>>>> Matches:
>>>>> - "George Harrison"
>>>>> - "John Clooridge"
>>>>> - "George Smith"
>>>>> - "George Clooney"
>>>>> - etc
>>>>> 
>>>>> However, only "George Clooney" should match in the autocompletion use 
>>>>> case.
>>>>> Therefore, i'd like to add a filter before the EdgeNGramFilterFactory, 
>>>>> which concatenates all the tokens generated by the 
>>>>> WhitespaceTokenizerFactory.
>>>>> Are there filters which can do such a thing?
>>>>> 
>>>>> If not, are there examples how to implement a custom TokenFilter?
>>>>> 
>>>>> thanks!
>>>>> 
>>>>> -robert
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>
