Re: EdgeNGram relevancy
Thanks for the explanation. The autocompletion results are pretty good now, but we still have a small problem: when there are hits in the edgytext2 field, results that only have hits in the edgytext field should not be returned at all.

Example:

Query: Martin Sco

Current results (in that order):

- Martin Scorsese
- Martin Lawrence
- Joseph Martin

However, in an autocompletion context, only Martin Scorsese makes sense; the other two are logically incorrect.

I'm not sure whether this can be solved on the Solr side, or whether we should implement the logic in the application.

Thanks!

-robert

On Nov 12, 2010, at 12:13 AM, Jonathan Rochkind wrote:

> Without the parens, the edgytext: only applied to Mr, the default field still applied to Scorsese.
Re: EdgeNGram relevancy
It seems that adding the '+' (required) operator to each term in a multi-term query does the trick:

http://lucene.apache.org/java/2_4_0/queryparsersyntax.html#+

i.e.:

edgytext2:(+Martin +Sco)

-robert

On Nov 16, 2010, at 8:52 PM, Robert Gründler wrote:

> When there are hits in the edgytext2 fields, results which only have hits in the edgytext field should not be returned at all.
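The '+' approach above can be applied programmatically when the application builds the query from what the user has typed. The following is only a minimal sketch: the helper name `require_all_terms` is hypothetical, and it assumes the input contains no characters that are special to the Lucene query syntax (no escaping is done).

```python
def require_all_terms(user_input, field="edgytext2"):
    """Prefix every whitespace-separated term with '+' (required).

    Hypothetical helper for illustration; input is assumed to be free of
    Lucene special characters, so nothing is escaped here.
    """
    terms = user_input.split()
    if not terms:
        return ""
    return f"{field}:(" + " ".join(f"+{term}" for term in terms) + ")"


print(require_all_terms("Martin Sco"))
```

For the input "Martin Sco" this produces the same query string as in the message above.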
EdgeNGram relevancy
Hi,

consider the following fieldtype (used for autocompletion):

<fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
  </analyzer>
</fieldType>

This works fine as long as the query string is a single word. For multiple words, however, the ranking is odd.

Example:

Query string: Bill Cl

Results (in that order):

- Clyde Phillips
- Clay Rogers
- Roger Cloud
- Bill Clinton

Bill Clinton should have the highest rank in this case.

Does anyone have an idea how to configure this fieldtype so that matches in both tokens rank higher than matches in only one token?

Thanks!

-robert
Re: EdgeNGram relevancy
You can add an additional field that uses KeywordTokenizerFactory instead of WhitespaceTokenizerFactory, and query both fields with an OR operator:

edgytext:(Bill Cl) OR edgytext2:"Bill Cl"

You can even apply a boost so that begins-with matches come first.

--- On Thu, 11/11/10, Robert Gründler rob...@dubture.com wrote:

> From: Robert Gründler rob...@dubture.com
> Subject: EdgeNGram relevancy
> To: solr-user@lucene.apache.org
> Date: Thursday, November 11, 2010, 5:51 PM
>
> Has anyone an idea how to configure this fieldtype to make matches in both tokens rank higher than those who match in either token?
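To see why a second, keyword-tokenized field plus a boost can push the begins-with match to the top, here is a toy model in Python. It is not Lucene's scoring: `toy_score` simply counts word-level prefix hits (the edgytext-like field) and adds a fixed boost when the whole cleaned value starts with the whole cleaned query (the edgytext2-like field). All function names and the boost value are assumptions made for illustration, and stopword handling is left out.

```python
import re


def clean(text):
    """Lowercase and strip everything except a-z, like the filter chain."""
    return re.sub(r"[^a-z]", "", text.lower())


def word_hits(query, doc):
    """edgytext-like: count query words that are a prefix of some doc word."""
    doc_words = [clean(word) for word in doc.split()]
    query_words = [clean(word) for word in query.split()]
    return sum(
        1 for q in query_words
        if q and any(d.startswith(q) for d in doc_words)
    )


def starts_with(query, doc, max_gram=25):
    """edgytext2-like: the whole value is one token, so the cleaned query
    must be a prefix of the cleaned value (and fit within maxGramSize)."""
    q = clean(query)
    return 0 < len(q) <= max_gram and clean(doc).startswith(q)


def toy_score(query, doc, boost=2.0):
    """Toy relevance: word hits plus a boost for a begins-with match."""
    return word_hits(query, doc) + (boost if starts_with(query, doc) else 0.0)


names = ["Clyde Phillips", "Clay Rogers", "Roger Cloud", "Bill Clinton"]
ranked = sorted(names, key=lambda name: toy_score("Bill Cl", name), reverse=True)
print(ranked[0])  # Bill Clinton
```

In this model the three other names each get a single word hit from "Cl", while "Bill Clinton" gets two word hits plus the begins-with boost.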
Re: EdgeNGram relevancy
Thanks a lot, that setup works pretty well now.

The only problem now is that the stopwords do not work that well anymore. I'll provide an example, but first the two fieldtypes:

<!-- autocomplete field which finds matches inside strings ("scor" matches "Martin Scorsese") -->
<fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
  </analyzer>
</fieldType>

<!-- autocomplete field which finds startsWith matches only ("scor" matches only "Scorpio", but not "Martin Scorsese") -->
<fieldType name="edgytext2" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
  </analyzer>
</fieldType>

This setup now causes trouble with stopwords. Here's an example:

Let's say the index contains two strings: "Mr Martin Scorsese" and "Martin Scorsese". "Mr" is in the stopword list.

Query: edgytext:Mr Scorsese OR edgytext2:Mr Scorsese^2.0

This way, the only result I get is "Mr Martin Scorsese", because the strict field edgytext2 is boosted by 2.0.

Any idea why "Martin Scorsese" is not in the result at all in this case?

Thanks again!

-robert

On Nov 11, 2010, at 5:57 PM, Ahmet Arslan wrote:

> You can add an additional field, with using KeywordTokenizerFactory instead of WhitespaceTokenizerFactory. And query both these fields with an OR operator.
>
> edgytext:(Bill Cl) OR edgytext2:"Bill Cl"
>
> You can even apply boost so that begins with matches comes first.
Re: EdgeNGram relevancy
> This setup now makes troubles regarding StopWords, here's an example:
>
> Let's say the index contains 2 Strings: "Mr Martin Scorsese" and "Martin Scorsese". "Mr" is in the stopword list.
>
> Query: edgytext:Mr Scorsese OR edgytext2:Mr Scorsese^2.0
>
> This way, the only result i get is "Mr Martin Scorsese", because the strict field edgytext2 is boosted by 2.0.
>
> Any idea why in this case "Martin Scorsese" is not in the result at all?

Did you run your query without using the () and "" operators? If yes, can you try this?

q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0

If no, can you paste the output of debugQuery=on?
Re: EdgeNGram relevancy
On 12 Nov 2010, at 01:46, Ahmet Arslan iori...@yahoo.com wrote:

>> Any idea why in this case "Martin Scorsese" is not in the result at all?
>
> Did you run your query without using () and "" operators? If yes can you try this?
>
> q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0
>
> If no can you paste output of debugQuery=on

This would still not deal with the problem of removing stop words at the indexing and query analysis stages. I really need something that will allow that and give a single token as in the example below.

Best

Nick
Re: EdgeNGram relevancy
Could anyone help me understand why Clyde Phillips appears in the results for "Bill Cl"?

Clyde Phillips doesn't produce any EdgeNGram that would match "Bill Cl", so why is it even in the results?

Thanks.

--- On Thu, 11/11/10, Ahmet Arslan iori...@yahoo.com wrote:

> You can add an additional field, with using KeywordTokenizerFactory instead of WhitespaceTokenizerFactory. And query both these fields with an OR operator.
>
> edgytext:(Bill Cl) OR edgytext2:"Bill Cl"
>
> You can even apply boost so that begins with matches comes first.
Re: EdgeNGram relevancy
According to the fieldtype I posted previously, I think it's because of the following:

1. WhitespaceTokenizer splits the string "Clyde Phillips" into 2 tokens: "Clyde" and "Phillips".
2. EdgeNGramFilter gets the 2 tokens and creates edge n-grams for each token: "C", "Cl", "Cly", ... and "P", "Ph", "Phi", ...

The query string "Bill Cl" gets split up into 2 tokens, "Bill" and "Cl", by the WhitespaceTokenizer.

This creates a match between the 2nd token of the query, "Cl", and one of the subtokens the EdgeNGramFilter created: "Cl".

-robert

On Nov 11, 2010, at 21:34, Andy wrote:

> Could anyone help me understand why Clyde Phillips appears in the results for "Bill Cl"?
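The analysis steps described above can be imitated in a few lines of Python. This is a simplified sketch of the edgytext chain, not Solr code: it splits on whitespace, lowercases, strips non a-z characters and builds leading-edge n-grams, and it leaves out the stopword filter. The function names are made up for this illustration.

```python
import re


def edge_ngrams(token, min_gram=1, max_gram=25):
    """Leading-edge n-grams of a token, e.g. 'cly' -> ['c', 'cl', 'cly']."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def index_terms(text):
    """Toy index analyzer: whitespace split, lowercase, keep a-z only,
    then edge n-grams per token (stopwords are not handled here)."""
    terms = set()
    for token in text.split():
        cleaned = re.sub(r"[^a-z]", "", token.lower())
        if cleaned:
            terms.update(edge_ngrams(cleaned))
    return terms


def query_terms(text):
    """Toy query analyzer: the same chain, but without n-grams."""
    cleaned = (re.sub(r"[^a-z]", "", token.lower()) for token in text.split())
    return [term for term in cleaned if term]


def matched_terms(query, doc):
    """Query terms that hit at least one indexed term of the document."""
    indexed = index_terms(doc)
    return [term for term in query_terms(query) if term in indexed]


print(matched_terms("Bill Cl", "Clyde Phillips"))  # ['cl']
print(matched_terms("Bill Cl", "Bill Clinton"))    # ['bill', 'cl']
```

With an OR between the query terms, a single hit such as 'cl' is enough for "Clyde Phillips" to be returned, which is the behaviour discussed in this thread.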
Re: EdgeNGram relevancy
Ah, I see. Thanks for the explanation.

Could you set the defaultOperator to AND? That way both "Bill" and "Cl" must match, and that would exclude "Clyde Phillips".

--- On Thu, 11/11/10, Robert Gründler rob...@dubture.com wrote:

> From: Robert Gründler rob...@dubture.com
> Subject: Re: EdgeNGram relevancy
> To: solr-user@lucene.apache.org
> Date: Thursday, November 11, 2010, 3:51 PM
>
> This creates a match for the 2nd token Cl of the query, and one of the subtokens the EdgeNGramFilter created: Cl.
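Besides changing the default operator in the schema, the operator can also be chosen per request with the `q.op` parameter of the standard query parser. A small sketch of building such request parameters in Python follows; the helper name `build_select_params` is hypothetical, and the user input is assumed to contain no Lucene special characters.

```python
from urllib.parse import parse_qs, urlencode


def build_select_params(user_input, field="edgytext", operator="AND"):
    """Build URL-encoded query parameters that require all terms to match.

    Hypothetical helper for illustration; the input is not escaped.
    """
    query = f"{field}:({user_input})"
    return urlencode({"q": query, "q.op": operator})


params = build_select_params("Bill Cl")
print(params)
print(parse_qs(params))  # decoded view of the same parameters
```

The resulting string would be appended to the select handler URL of the Solr core in question, which is not shown in this thread.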
Re: EdgeNGram relevancy
> Did you run your query without using () and "" operators? If yes can you try this?
>
> q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0

I didn't use () and "" in my query before. Using the query with those operators works now; stopwords are thrown out as they should be, thanks.

However, I don't understand how the () and "" operators affect the StopWordFilter. Could you give a brief explanation for the above example?

Thanks!

-robert
Re: EdgeNGram relevancy
Without the parens, the edgytext: only applied to Mr; the default field still applied to Scorsese.

The double quotes are necessary in the second case (rather than parens) because, on a non-tokenized field, the standard query parser would otherwise pre-tokenize on whitespace before sending the individual whitespace-separated words to match against the index. If the index includes multi-word tokens with internal whitespace, those words will never match. But with a double-quoted phrase, the standard query parser doesn't pre-tokenize like this; it passes the whole phrase to the index intact.

Robert Gründler wrote:

> However, i don't understand how the () and "" operators affect the StopWordFilter. Could you give a brief explanation for the above example?
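The field-binding behaviour described above can be illustrated with a toy clause splitter. This is a deliberately simplified stand-in, not the Lucene query parser: it only understands `field:term`, `field:(a b)`, `field:"a b"`, the OR/AND keywords and a trailing `^boost`. The default field name "text" is an assumption, since the thread does not say what the default field is.

```python
import re

# field prefix (optional), then a (group), a "phrase" or a bare term,
# then an optional ^boost that is consumed and ignored
CLAUSE = re.compile(
    r'(?:(?P<field>\w+):)?'
    r'(?:\((?P<group>[^)]*)\)|"(?P<phrase>[^"]*)"|(?P<term>[^\s()"^]+))'
    r'(?:\^[\d.]+)?'
)


def bind_fields(query, default_field="text"):
    """Return (field, text, kind) for each clause of a simplified query."""
    clauses = []
    for match in CLAUSE.finditer(query):
        field = match.group("field") or default_field
        if match.group("group") is not None:
            # parens: every word inside is bound to the named field
            clauses.extend((field, word, "term") for word in match.group("group").split())
        elif match.group("phrase") is not None:
            # double quotes: the whole phrase stays in one piece
            clauses.append((field, match.group("phrase"), "phrase"))
        elif match.group("term") not in ("OR", "AND"):
            clauses.append((field, match.group("term"), "term"))
    return clauses


print(bind_fields('edgytext:Mr Scorsese OR edgytext2:Mr Scorsese^2.0'))
print(bind_fields('edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0'))
```

In the first query both occurrences of "Scorsese" end up on the default field. In the second, both words reach the edgytext field (where its analyzer can drop the stopword), and edgytext2 receives "Mr Scorsese" as a single phrase.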