this is the full source code, but be warned, i'm not a java developer, and i have no background in Lucene/Solr development:
// ConcatFilter

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class ConcatFilter extends TokenFilter {

  protected ConcatFilter(TokenStream input) {
    super(input);
  }

  @Override
  public Token next() throws IOException {
    Token token = new Token();
    StringBuilder builder = new StringBuilder();

    TermAttribute termAttribute = (TermAttribute) input.getAttribute(TermAttribute.class);
    TypeAttribute typeAttribute = (TypeAttribute) input.getAttribute(TypeAttribute.class);

    boolean hasToken = false;

    // consume the whole stream, appending only tokens of type "word"
    while (input.incrementToken()) {
      if (typeAttribute.type().equals("word")) {
        builder.append(termAttribute.term());
        hasToken = true;
      }
    }

    if (hasToken) {
      token.setTermBuffer(builder.toString());
      return token;
    }
    return null;
  }
}

// ConcatFilterFactory:

import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenFilterFactory;

public class ConcatFilterFactory extends BaseTokenFilterFactory {
  @Override
  public TokenStream create(TokenStream stream) {
    return new ConcatFilter(stream);
  }
}

and in your schema.xml, you can simply add the filter factory using this element:

<filter class="com.example.ConcatFilterFactory" />

Jar files i have included in the build path (can be found in the Solr download package):

apache-solr-core-1.4.1.jar
lucene-analyzers-2.9.3.jar
lucene-core-2.9.3.jar

good luck ;)

-robert

On Nov 11, 2010, at 8:45 PM, Nick Martin wrote:

> Thanks Robert, I had been trying to get your ConcatFilter to work, but I'm
> not sure what i need in the classpath and where Token comes from.
> Will check the thread you mention.
>
> Best
>
> Nick
>
> On 11 Nov 2010, at 18:13, Robert Gründler wrote:
>
>> I've posted a ConcatFilter in my previous mail which does concatenate tokens.
>> This works fine, but i realized that what i wanted to achieve is implemented
>> more easily in another way (by using 2 separate field types).
>>
>> Have a look at a previous mail i wrote to the list and the reply from Ahmet
>> Arslan (topic: "EdgeNGram relevancy").
>>
>> best
>>
>> -robert
>>
>> On Nov 11, 2010, at 5:27 PM, Nick Martin wrote:
>>
>>> Hi Robert, All,
>>>
>>> I have a similar problem, here is my fieldType:
>>> http://paste.pocoo.org/show/289910/
>>>
>>> I want to include stopword removal and lowercase the incoming terms. The
>>> idea being to take "Foo Bar Baz Ltd" and turn it into "foobarbaz" for the
>>> EdgeNGram filter factory.
>>>
>>> If anyone can tell me a simple way to concatenate tokens into one token
>>> again, similar to the KeywordTokenizer, that would be super helpful.
>>>
>>> Many thanks
>>>
>>> Nick
>>>
>>> On 11 Nov 2010, at 00:23, Robert Gründler wrote:
>>>
>>>> On Nov 11, 2010, at 1:12 AM, Jonathan Rochkind wrote:
>>>>
>>>>> Are you sure you really want to throw out stopwords for your use case? I
>>>>> don't think autocompletion will work how you want if you do.
>>>>
>>>> in our case i think it makes sense. the content is targeting the
>>>> electronic music / dj scene, so we have a lot of words like "DJ" or
>>>> "featuring" which make sense to throw out of the query. Also searches for
>>>> "the beastie boys" and "beastie boys" should return a match in the
>>>> autocompletion.
>>>>
>>>>> And if you don't... then why use the WhitespaceTokenizer and then try to
>>>>> jam the tokens back together? Why not just NOT tokenize in the first
>>>>> place. Use the KeywordTokenizer, which really should be called the
>>>>> NonTokenizingTokenizer, because it doesn't tokenize at all, it just
>>>>> creates one token from the entire input string.
>>>>
>>>> I started out with the KeywordTokenizer, which worked well, except for the
>>>> StopWord problem.
>>>>
>>>> For now, i've come up with a quick-and-dirty custom "ConcatFilter", which
>>>> does what i'm after:
>>>>
>>>> public class ConcatFilter extends TokenFilter {
>>>>
>>>>   private TokenStream tstream;
>>>>
>>>>   protected ConcatFilter(TokenStream input) {
>>>>     super(input);
>>>>     this.tstream = input;
>>>>   }
>>>>
>>>>   @Override
>>>>   public Token next() throws IOException {
>>>>     Token token = new Token();
>>>>     StringBuilder builder = new StringBuilder();
>>>>
>>>>     TermAttribute termAttribute = (TermAttribute) tstream.getAttribute(TermAttribute.class);
>>>>     TypeAttribute typeAttribute = (TypeAttribute) tstream.getAttribute(TypeAttribute.class);
>>>>
>>>>     boolean incremented = false;
>>>>
>>>>     while (tstream.incrementToken()) {
>>>>       if (typeAttribute.type().equals("word")) {
>>>>         builder.append(termAttribute.term());
>>>>       }
>>>>       incremented = true;
>>>>     }
>>>>
>>>>     token.setTermBuffer(builder.toString());
>>>>
>>>>     if (incremented)
>>>>       return token;
>>>>
>>>>     return null;
>>>>   }
>>>> }
>>>>
>>>> I'm not sure if this is a safe way to do this, as i'm not familiar with the
>>>> whole solr/lucene implementation after all.
>>>>
>>>> best
>>>>
>>>> -robert
>>>>
>>>>> Then lowercase, remove whitespace (or not), do whatever else you want to
>>>>> do to your single token to normalize it, and then edgengram it.
>>>>>
>>>>> If you include whitespace in the token, then when making your queries for
>>>>> auto-complete, be sure to use a query parser that doesn't do
>>>>> "pre-tokenization"; the 'field' query parser should work well for this.
>>>>>
>>>>> Jonathan
>>>>>
>>>>> ________________________________________
>>>>> From: Robert Gründler [rob...@dubture.com]
>>>>> Sent: Wednesday, November 10, 2010 6:39 PM
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: Concatenate multiple tokens into one
>>>>>
>>>>> Hi,
>>>>>
>>>>> i've created the following filterchain in a field type, the idea is to
>>>>> use it for autocompletion purposes:
>>>>>
>>>>> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <!-- create tokens separated by whitespace -->
>>>>> <filter class="solr.LowerCaseFilterFactory"/> <!-- lowercase everything -->
>>>>> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" /> <!-- throw out stopwords -->
>>>>> <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" /> <!-- throw out everything except a-z -->
>>>>>
>>>>> <!-- actually, here i would like to join multiple tokens together again,
>>>>> to provide one token for the EdgeNGramFilterFactory -->
>>>>>
>>>>> <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" /> <!-- create edgeNGram tokens for autocomplete matches -->
>>>>>
>>>>> With that kind of filterchain, the EdgeNGramFilterFactory will receive
>>>>> multiple tokens on input strings with whitespace in them. This leads to
>>>>> the following results:
>>>>>
>>>>> Input Query: "George Cloo"
>>>>>
>>>>> Matches:
>>>>> - "George Harrison"
>>>>> - "John Clooridge"
>>>>> - "George Smith"
>>>>> - "George Clooney"
>>>>> - etc.
>>>>>
>>>>> However, only "George Clooney" should match in the autocompletion use case.
>>>>> Therefore, i'd like to add a filter before the EdgeNGramFilterFactory,
>>>>> which concatenates all the tokens generated by the WhitespaceTokenizerFactory.
>>>>>
>>>>> Are there filters which can do such a thing?
>>>>>
>>>>> If not, are there examples of how to implement a custom TokenFilter?
>>>>>
>>>>> thanks!
>>>>>
>>>>> -robert
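The chain discussed in this thread (whitespace tokenize, lowercase, drop stopwords, strip non-letters, concatenate into one token, then edge n-gram it) can be illustrated without any Lucene dependencies. The sketch below is plain Java that only mimics the intended behavior of that analyzer chain; it is not Solr's implementation, the class and method names are made up for this example, and the stopword set is just a small sample based on the words mentioned in the thread:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ConcatDemo {

    // Example stopwords only; a real setup would load stopwords.txt.
    static final Set<String> STOPWORDS =
        new HashSet<>(Arrays.asList("the", "dj", "featuring"));

    // Whitespace-split, lowercase, strip non a-z, drop stopwords,
    // then join the surviving tokens into a single token.
    static String concatToken(String input) {
        StringBuilder sb = new StringBuilder();
        for (String tok : input.split("\\s+")) {
            String t = tok.toLowerCase().replaceAll("[^a-z]", "");
            if (t.isEmpty() || STOPWORDS.contains(t)) {
                continue;
            }
            sb.append(t);
        }
        return sb.toString();
    }

    // Edge n-grams (minGramSize=1) of the single concatenated token.
    static List<String> edgeNGrams(String token, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int i = 1; i <= Math.min(maxGram, token.length()); i++) {
            grams.add(token.substring(0, i));
        }
        return grams;
    }

    public static void main(String[] args) {
        String token = concatToken("The Beastie Boys");
        System.out.println(token);                // beastieboys
        System.out.println(edgeNGrams(token, 5)); // [b, be, bea, beas, beast]
    }
}
```

Because the whole phrase collapses into one token before n-gramming, a prefix query like "beastie b" matches "The Beastie Boys" but a mid-phrase prefix like "boys" does not, which is exactly the single-suggestion behavior the thread is after.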