[ https://issues.apache.org/jira/browse/SOLR-908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756933#action_12756933 ]
Jason Rutherglen commented on SOLR-908: --------------------------------------- This schema consistently and randomly generates query truncations. Perhaps because we're mixing the new and old tokenizing APIs? I can't figure out what state is being shared nor how to debug this. We unfortunately upgraded to Solr 1.4 trunk and so cannot revert back to 1.3. I wrote a test case that has not reproduced the bug locally. The bug happens in a distributed environment with 20+ servers. {code} <fieldType name="vCommonGrams" class="solr.TextField" positionIncrementGap="100"> <analyzer type="index"> <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/> <filter class="solr.StandardFilterFactory"/> <filter class="solr.CommonGramsFilterFactory" ignoreCase="true" words="stopwords.txt"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/> </analyzer> <analyzer type="query"> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.StandardFilterFactory"/> <filter class="solr.CommonGramsQueryFilterFactory" ignoreCase="true" words="stopwords.txt"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/> </analyzer> </fieldType> {code} > Port of Nutch CommonGrams filter to Solr > ----------------------------------------- > > Key: SOLR-908 > URL: https://issues.apache.org/jira/browse/SOLR-908 > Project: Solr > Issue Type: Wish > Components: Analysis > Reporter: Tom Burton-West > Priority: Minor > Attachments: CommonGramsPort.zip, SOLR-908.patch, SOLR-908.patch, > SOLR-908.patch, SOLR-908.patch, SOLR-908.patch, SOLR-908.patch, SOLR-908.patch > > > Phrase queries containing common words are extremely slow. We are reluctant > to just use stop words due to various problems with false hits and some > things becoming impossible to search with stop words turned on. (For example > "to be or not to be", "the who", "man in the moon" vs "man on the moon" etc.) > > Several postings regarding slow phrase queries have suggested using the > approach used by Nutch. Perhaps someone with more Java/Solr experience might > take this on. > It should be possible to port the Nutch CommonGrams code to Solr and create > a suitable Solr FilterFactory so that it could be used in Solr by listing it > in the Solr schema.xml. > "Construct n-grams for frequently occuring terms and phrases while indexing. > Optimize phrase queries to use the n-grams. Single terms are still indexed > too, with n-grams overlaid." > http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/analysis/CommonGrams.html -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.