RE: bi-grams for common terms - any analyzers do that?

Burton-West, Tom Thu, 23 Sep 2010 09:03:31 -0700

Hi all,

The CommonGrams filter is designed to only work on phrase queries.  It is 
designed to solve the problem of slow phrase queries with phrases containing 
common words, when you don't want to use stop words.  It would not make sense 
for Boolean queries. Boolean queries just get passed through unchanged.


For background on the CommonGramsFilter please see: 
http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2

There are two filters, CommonGramsFilter and CommonGramsQueryFilter. You use 
CommonGramsFilter for indexing and CommonGramsQueryFilter for query processing.  
CommonGramsFilter outputs both CommonGrams and unigrams so that Boolean queries 
(i.e. non-phrase queries) will work.  For example, "the rain" would produce 3 
tokens:
the  position 1
rain position 2
the_rain position 1
When you have a phrase query, you want Solr to search for the token "the_rain", 
so you don't want the unigrams.
When you have a Boolean query, the CommonGramsQueryFilter only gets one token 
as input and simply outputs it.
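
To make the index-side and query-side behavior concrete, here is a rough Python sketch. It is a simplified model of the idea and not the Lucene implementation: the function names, the underscore separator default, and the rule "drop any unigram that is covered by a bigram" are assumptions made for illustration.

```python
def index_tokens(words, common, sep="_"):
    """Index side (simplified): keep every unigram and add a bigram for each
    adjacent pair that contains a common word. The bigram shares the
    position of its first word."""
    out = []
    for i, word in enumerate(words):
        pos = i + 1
        out.append((word, pos))
        if i + 1 < len(words) and (word in common or words[i + 1] in common):
            out.append((word + sep + words[i + 1], pos))
    return out


def query_tokens(words, common, sep="_"):
    """Query side (simplified): emit the bigrams and drop the unigrams they
    cover. A lone token has no neighbor, so it passes through unchanged."""
    grams = {}
    covered = set()
    for i in range(len(words) - 1):
        if words[i] in common or words[i + 1] in common:
            grams[i] = words[i] + sep + words[i + 1]
            covered.update((i, i + 1))
    out = []
    for i, word in enumerate(words):
        if i in grams:
            out.append(grams[i])
        elif i not in covered:
            out.append(word)
    return out


common_words = {"the"}  # stand-in for the contents of a common-words file
print(index_tokens(["the", "rain"], common_words))
print(query_tokens(["the", "rain"], common_words))
print(query_tokens(["rain"], common_words))
```

In this model the phrase query for "the rain" becomes a single-term lookup on the joined token, while a Boolean query term arrives alone and is matched against the unigrams that were also indexed.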

Appended below is a sample config from our schema.xml.

For background on the problem with "l'art" please see: 
http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance
We used a custom filter to change all punctuation to spaces.  You could probably 
use one of the other filters to do this. (See the comments from David Smiley at 
the end of the blog post regarding possible approaches.) At the time, I just 
couldn't get WordDelimiterFilter to behave as documented with various 
combinations of parameters and was not aware of the other filters David 
mentions.

The problem with "l'art" is actually due to a bug or feature in the 
QueryParser.  Currently the QueryParser interacts with the token chain and 
decides whether the tokens coming back from a token filter should be treated as 
a phrase query based on whether or not more than one non-synonym token comes 
back from the token stream for a single 'queryparser token'.
It also splits on whitespace, which causes all CJK queries to be treated as 
phrase queries regardless of the CJK tokenizer you use. This is a contentious 
issue.  See https://issues.apache.org/jira/browse/LUCENE-2458.  There is a 
semi-workaround using PositionFilter, but it has many undesirable side effects.  
I believe Robert Muir, who is an expert on the various problems involved and 
opened LUCENE-2458, is working on a better fix.
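
The decision rule described above can be sketched in a few lines of Python. This is only a toy model of the behavior, not the QueryParser code: the analyze function is a hypothetical stand-in for an analyzer chain that turns punctuation into spaces and lowercases, and the sketch ignores synonyms (tokens with a position increment of 0) and all query syntax such as quotes and operators.

```python
import re


def analyze(chunk):
    """Hypothetical analyzer chain: lowercase, punctuation to spaces, split."""
    return re.sub(r"[^\w\s]", " ", chunk.lower()).split()


def parse(query):
    """Split the query on whitespace first, then analyze each chunk.
    A chunk that yields more than one token is treated as a phrase."""
    clauses = []
    for chunk in query.split():
        tokens = analyze(chunk)
        if len(tokens) > 1:
            clauses.append(("phrase", tokens))
        elif tokens:
            clauses.append(("term", tokens[0]))
    return clauses


print(parse("l'art moderne"))
```

Here "l'art" reaches the analyzer as one chunk, comes back as the two tokens "l" and "art", and so ends up as a phrase even though the user typed no quotes. A run of CJK text with no spaces would hit the same rule once the tokenizer splits it into several tokens.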

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search


------------
<fieldType name="CommonGramTest" class="solr.TextField" 
positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ISOLatin1AccentFilterFactory"/>
<filter class="solr.PunctuationFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.CommonGramsFilterFactory" words="new400common.txt"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ISOLatin1AccentFilterFactory"/>
<filter class="solr.PunctuationFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.CommonGramsQueryFilterFactory" words="new400common.txt"/>
</analyzer>
</fieldType>
