Hi all,

The CommonGrams filter is designed to work only on phrase queries. It solves the problem of slow phrase queries on phrases containing common words, for cases where you don't want to use stop words. It would not make sense for Boolean queries, so Boolean queries just get passed through unchanged.

For background on the CommonGramsFilter please see:
http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2

There are two filters, CommonGramsFilter and CommonGramsQueryFilter. You use CommonGramsFilter at index time and CommonGramsQueryFilter for query processing.

CommonGramsFilter outputs both CommonGrams and unigrams so that Boolean queries (i.e. non-phrase queries) will work. For example, "the rain" would produce 3 tokens:

    the       position 1
    rain      position 2
    the-rain  position 1

When you have a phrase query, you want Solr to search for the token "the-rain", so you don't want the unigrams. When you have a Boolean query, the CommonGramsQueryFilter only gets one token as input and simply outputs it.

Appended below is a sample config from our schema.xml.

For background on the problem with "l'art" please see:
http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance

We used a custom filter to change all punctuation to spaces. You could probably use one of the other filters to do this. (See the comments from David Smiley at the end of the blog post regarding possible approaches.) At the time, I just couldn't get WordDelimiterFilter to behave as documented with various combinations of parameters, and I was not aware of the other filters David mentions.

The problem with "l'art" is actually due to a bug or feature in the QueryParser. Currently the QueryParser interacts with the token chain and decides whether the tokens coming back from a token filter should be treated as a phrase query, based on whether more than one non-synonym token comes back from the token stream for a single 'queryparser token'. The QueryParser also splits on whitespace before analysis, so a run of CJK text with no spaces is a single queryparser token. That causes all CJK queries to be treated as phrase queries regardless of the CJK tokenizer you use.

This is a contentious issue. See https://issues.apache.org/jira/browse/LUCENE-2458. There is a semi-workaround using PositionFilter, but it has many undesirable side effects.

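To make the QueryParser behavior described above a bit more concrete, here is a rough Python sketch of the decision as I have described it. It is not the Lucene code. The function names (analyze, treated_as_phrase, parse) are hypothetical, and the analysis chain is a stand-in that only lowercases and turns punctuation into spaces, similar in spirit to our custom filter.

```python
import re
from typing import List, Tuple


def analyze(chunk: str) -> List[Tuple[str, int]]:
    """Hypothetical analysis chain: lowercase, punctuation -> space, split.

    Returns (term, position_increment) pairs. Every token here gets an
    increment of 1; a synonym-style token would have an increment of 0.
    """
    cleaned = re.sub(r"[^\w\s]", " ", chunk.lower())
    return [(term, 1) for term in cleaned.split()]


def treated_as_phrase(chunk: str) -> bool:
    """True when one whitespace-separated chunk yields >1 non-synonym token."""
    non_synonyms = [term for term, inc in analyze(chunk) if inc > 0]
    return len(non_synonyms) > 1


def parse(query: str) -> List[Tuple[str, bool]]:
    """Split on whitespace first, then decide phrase/non-phrase per chunk."""
    return [(chunk, treated_as_phrase(chunk)) for chunk in query.split()]


if __name__ == "__main__":
    for chunk, is_phrase in parse("l'art moderne"):
        print(chunk, "-> phrase query" if is_phrase else "-> single term")
```

In this sketch "l'art" becomes the two tokens "l" and "art" for a single chunk, so it is flagged as a phrase, while "moderne" is not. A CJK string with no spaces would likewise be one chunk that a CJK tokenizer splits into several tokens.
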
I believe Robert Muir, who is an expert on the various problems involved and opened LUCENE-2458, is working on a better fix.

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search

------------

<fieldType name="CommonGramTest" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="ISOLatin1AccentFilterFactory"/>
    <filter class="solr.PunctuationFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="new400common.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="ISOLatin1AccentFilterFactory"/>
    <filter class="solr.PunctuationFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CommonGramsQueryFilterFactory" words="new400common.txt"/>
  </analyzer>
</fieldType>
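
As a rough illustration of the index-side versus query-side token output described earlier in this message, here is a small Python sketch. It is a simplified model and not the Lucene implementation. The function names are hypothetical, the "-" separator just follows the notation used in this message (the separator in real output may differ), and the positions on the query side are illustrative only.

```python
from typing import List, Set, Tuple

Token = Tuple[str, int]  # (term, position)

SEP = "-"  # assumption: separator as written in this message


def index_tokens(words: List[str], common: Set[str]) -> List[Token]:
    """Index side: emit every unigram, plus a bigram at the same position
    for each adjacent pair in which at least one word is common."""
    out: List[Token] = []
    for i, word in enumerate(words):
        pos = i + 1
        out.append((word, pos))
        if i + 1 < len(words) and (word in common or words[i + 1] in common):
            out.append((word + SEP + words[i + 1], pos))
    return out


def query_tokens(words: List[str], common: Set[str]) -> List[Token]:
    """Query side: emit bigrams where they exist and keep a unigram only
    when no bigram covers it. A single input token passes through."""
    out: List[Token] = []
    covered_by_prev = False
    for i, word in enumerate(words):
        pos = i + 1
        has_next = i + 1 < len(words)
        forms_gram = has_next and (word in common or words[i + 1] in common)
        if forms_gram:
            out.append((word + SEP + words[i + 1], pos))
        elif not covered_by_prev:
            out.append((word, pos))
        covered_by_prev = forms_gram
    return out


if __name__ == "__main__":
    common_words = {"the"}
    print(index_tokens(["the", "rain"], common_words))
    print(query_tokens(["the", "rain"], common_words))  # phrase query
    print(query_tokens(["rain"], common_words))         # Boolean query term
```

With "the" as the only common word, the index side yields the three tokens listed earlier for "the rain", the phrase query side yields only "the-rain", and a single Boolean term such as "rain" comes out unchanged.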