Thanks Robert, I had been trying to get your ConcatFilter to work, but I'm not sure what I need in the classpath and where Token comes from. Will check the thread you mention.
Best,
Nick

On 11 Nov 2010, at 18:13, Robert Gründler wrote:

> I've posted a ConcatFilter in my previous mail which does concatenate tokens.
> This works fine, but I realized that what I wanted to achieve is implemented
> more easily in another way (by using 2 separate field types).
>
> Have a look at a previous mail I wrote to the list and the reply from Ahmet
> Arslan (topic: "EdgeNGram relevancy").
>
> best
>
> -robert
>
> On Nov 11, 2010, at 5:27 PM, Nick Martin wrote:
>
>> Hi Robert, All,
>>
>> I have a similar problem; here is my fieldType:
>> http://paste.pocoo.org/show/289910/
>> I want to include stopword removal and lowercase the incoming terms, the
>> idea being to take "Foo Bar Baz Ltd" and turn it into "foobarbaz" for the
>> EdgeNGram filter factory.
>> If anyone can tell me a simple way to concatenate tokens into one token
>> again, similar to the KeywordTokenizer, that would be super helpful.
>>
>> Many thanks
>>
>> Nick
>>
>> On 11 Nov 2010, at 00:23, Robert Gründler wrote:
>>
>>> On Nov 11, 2010, at 1:12 AM, Jonathan Rochkind wrote:
>>>
>>>> Are you sure you really want to throw out stopwords for your use case? I
>>>> don't think autocompletion will work how you want if you do.
>>>
>>> In our case I think it makes sense. The content is targeting the
>>> electronic music / DJ scene, so we have a lot of words like "DJ" or
>>> "featuring" which make sense to throw out of the query. Also, searches for
>>> "the beastie boys" and "beastie boys" should return a match in the
>>> autocompletion.
>>>
>>>> And if you don't... then why use the WhitespaceTokenizer and then try to
>>>> jam the tokens back together? Why not just NOT tokenize in the first
>>>> place? Use the KeywordTokenizer, which really should be called the
>>>> NonTokenizingTokenizer, because it doesn't tokenize at all, it just
>>>> creates one token from the entire input string.
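Jonathan's suggestion maps onto a field type along these lines (a sketch only: the type name is made up, and stopword removal is deliberately absent, since KeywordTokenizer emits a single token for StopFilter to inspect):

```xml
<fieldType name="autocomplete" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- one token for the entire input string -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- strip whitespace and punctuation from the single token -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
</fieldType>
```

That single-token limitation is exactly the StopWord problem Robert runs into below: with KeywordTokenizer there are no word boundaries left for StopFilter to work with.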
>>>
>>> I started out with the KeywordTokenizer, which worked well, except for the
>>> StopWord problem.
>>>
>>> For now, I've come up with a quick-and-dirty custom "ConcatFilter", which
>>> does what I'm after:
>>>
>>> public class ConcatFilter extends TokenFilter {
>>>
>>>   private TokenStream tstream;
>>>
>>>   protected ConcatFilter(TokenStream input) {
>>>     super(input);
>>>     this.tstream = input;
>>>   }
>>>
>>>   @Override
>>>   public Token next() throws IOException {
>>>     Token token = new Token();
>>>     StringBuilder builder = new StringBuilder();
>>>
>>>     TermAttribute termAttribute = (TermAttribute) tstream.getAttribute(TermAttribute.class);
>>>     TypeAttribute typeAttribute = (TypeAttribute) tstream.getAttribute(TypeAttribute.class);
>>>
>>>     boolean incremented = false;
>>>
>>>     while (tstream.incrementToken()) {
>>>       if (typeAttribute.type().equals("word")) {
>>>         builder.append(termAttribute.term());
>>>       }
>>>       incremented = true;
>>>     }
>>>
>>>     token.setTermBuffer(builder.toString());
>>>
>>>     if (incremented)
>>>       return token;
>>>
>>>     return null;
>>>   }
>>> }
>>>
>>> I'm not sure if this is a safe way to do this, as I'm not familiar with the
>>> whole Solr/Lucene implementation after all.
>>>
>>> best
>>>
>>> -robert
>>>
>>>> Then lowercase, remove whitespace (or not), do whatever else you want to
>>>> do to your single token to normalize it, and then edgengram it.
>>>>
>>>> If you include whitespace in the token, then when making your queries for
>>>> auto-complete, be sure to use a query parser that doesn't do
>>>> "pre-tokenization"; the 'field' query parser should work well for this.
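Stripped of the Lucene API, the behavior ConcatFilter is meant to complete (whitespace-tokenize, lowercase, drop stopwords, strip non-letters, then join the survivors back into one token) can be sketched in plain Java. This is an illustration only, with a made-up stopword list; it does not use Lucene's TokenStream machinery:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Plain-Java sketch of the whole analysis chain, ending in concatenation.
public class ConcatDemo {

    // Hypothetical stopword list for the example; a real one comes from stopwords.txt.
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "ltd", "dj", "featuring"));

    public static String analyze(String input) {
        StringBuilder out = new StringBuilder();
        for (String token : input.split("\\s+")) {   // WhitespaceTokenizer
            String t = token.toLowerCase();          // LowerCaseFilter
            if (STOPWORDS.contains(t)) continue;     // StopFilter
            t = t.replaceAll("[^a-z]", "");          // PatternReplaceFilter
            out.append(t);                           // the ConcatFilter step
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(analyze("Foo Bar Baz Ltd"));  // foobarbaz
        System.out.println(analyze("The Beastie Boys")); // beastieboys
    }
}
```

This is the "Foo Bar Baz Ltd" → "foobarbaz" transformation Nick asks about above, with stopword removal happening while word boundaries still exist.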
>>>>
>>>> Jonathan
>>>>
>>>> ________________________________________
>>>> From: Robert Gründler [rob...@dubture.com]
>>>> Sent: Wednesday, November 10, 2010 6:39 PM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Concatenate multiple tokens into one
>>>>
>>>> Hi,
>>>>
>>>> I've created the following filterchain in a field type; the idea is to use
>>>> it for autocompletion purposes:
>>>>
>>>> <!-- create tokens separated by whitespace -->
>>>> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>> <!-- lowercase everything -->
>>>> <filter class="solr.LowerCaseFilterFactory"/>
>>>> <!-- throw out stopwords -->
>>>> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
>>>> <!-- throw out everything except a-z -->
>>>> <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
>>>>
>>>> <!-- actually, here I would like to join multiple tokens together again,
>>>>      to provide one token for the EdgeNGramFilterFactory -->
>>>>
>>>> <!-- create edgeNGram tokens for autocomplete matches -->
>>>> <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
>>>>
>>>> With that kind of filterchain, the EdgeNGramFilterFactory will receive
>>>> multiple tokens on input strings with whitespace in them. This leads to the
>>>> following results:
>>>>
>>>> Input Query: "George Cloo"
>>>> Matches:
>>>> - "George Harrison"
>>>> - "John Clooridge"
>>>> - "George Smith"
>>>> - "George Clooney"
>>>> - etc.
>>>>
>>>> However, only "George Clooney" should match in the autocompletion use case.
>>>> Therefore, I'd like to add a filter before the EdgeNGramFilterFactory which
>>>> concatenates all the tokens generated by the WhitespaceTokenizerFactory.
>>>> Are there filters which can do such a thing?
>>>>
>>>> If not, are there examples how to implement a custom TokenFilter?
>>>>
>>>> thanks!
>>>>
>>>> -robert
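The over-matching Robert describes can be reproduced in a few lines of plain Java, without Lucene: edge n-grams built per whitespace token match any name sharing a first name or a last-name prefix, while edge n-grams of one concatenated token only match the full joined prefix. A sketch under those assumptions (helper names are made up):

```java
import java.util.HashSet;
import java.util.Set;

// Why per-token edge n-grams over-match for autocomplete,
// and why concatenating first fixes it.
public class EdgeNGramDemo {

    // All prefixes ("edge n-grams") of a single string.
    static Set<String> edgeNGrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 1; i <= s.length(); i++) {
            grams.add(s.substring(0, i));
        }
        return grams;
    }

    // Index each whitespace token separately (the problematic chain).
    static Set<String> perTokenGrams(String name) {
        Set<String> grams = new HashSet<>();
        for (String token : name.toLowerCase().split("\\s+")) {
            grams.addAll(edgeNGrams(token));
        }
        return grams;
    }

    // Index the whole name as one concatenated, letters-only token.
    static Set<String> concatenatedGrams(String name) {
        return edgeNGrams(name.toLowerCase().replaceAll("[^a-z]", ""));
    }

    public static void main(String[] args) {
        // Query "George Cloo": per-token, "George Harrison" matches on the
        // "george" gram and "John Clooridge" on the "cloo" gram;
        // concatenated, only "George Clooney" contains "georgecloo".
        System.out.println(perTokenGrams("George Harrison").contains("george"));         // true (false positive)
        System.out.println(perTokenGrams("John Clooridge").contains("cloo"));            // true (false positive)
        System.out.println(concatenatedGrams("George Harrison").contains("georgecloo")); // false
        System.out.println(concatenatedGrams("George Clooney").contains("georgecloo"));  // true
    }
}
```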