[
https://issues.apache.org/jira/browse/SOLR-908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756933#action_12756933
]
Jason Rutherglen commented on SOLR-908:
---------------------------------------
This schema reliably, though at unpredictable times, produces
truncated queries. Perhaps it is because we're mixing the new and
old tokenizing APIs? I can't figure out what state is being shared
or how to debug this. Unfortunately, we upgraded to Solr 1.4 trunk
and cannot revert to 1.3. I wrote a test case, but it has not
reproduced the bug locally. The bug occurs in a distributed
environment with 20+ servers.
{code}
<fieldType name="vCommonGrams" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.CommonGramsFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.CommonGramsQueryFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
  </analyzer>
</fieldType>
{code}
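For readers unfamiliar with the two filters in the schema above, here
is a rough sketch of what the index-side and query-side common-grams
steps are meant to produce. It is a simplified Python illustration, not
the Solr implementation: the function names, the COMMON_WORDS set
(a stand-in for stopwords.txt), and the "_" separator are assumptions,
and it ignores the StopFilterFactory that follows in both chains.

```python
from typing import List, Set, Tuple

# Hypothetical stand-in for stopwords.txt; not the list used in the report.
COMMON_WORDS: Set[str] = {"the", "in", "on", "to", "be", "or", "not"}


def index_tokens(words: List[str], common: Set[str]) -> List[Tuple[str, int]]:
    """Return (term, position) pairs: every unigram, plus an overlaid
    bigram wherever either word of an adjacent pair is a common word."""
    out: List[Tuple[str, int]] = []
    for pos, word in enumerate(words):
        out.append((word, pos))
        if pos + 1 < len(words):
            nxt = words[pos + 1]
            if word in common or nxt in common:
                # The bigram shares the position of its first word.
                out.append((f"{word}_{nxt}", pos))
    return out


def query_tokens(words: List[str], common: Set[str]) -> List[str]:
    """Prefer bigrams; keep a unigram only if no bigram covers it."""
    out: List[str] = []
    prev_in_gram = False  # True if the previous word started a bigram
    for pos, word in enumerate(words):
        has_next = pos + 1 < len(words)
        makes_gram = has_next and (word in common or words[pos + 1] in common)
        if makes_gram:
            out.append(f"{word}_{words[pos + 1]}")
        elif not prev_in_gram:
            out.append(word)
        prev_in_gram = makes_gram
    return out


if __name__ == "__main__":
    print(index_tokens(["the", "who"], COMMON_WORDS))
    # [('the', 0), ('the_who', 0), ('who', 1)]
    print(query_tokens(["man", "on", "the", "moon"], COMMON_WORDS))
    # ['man_on', 'on_the', 'the_moon']
```

Under these assumptions, "man in the moon" and "man on the moon" yield
different query terms, which is the distinction plain stop-word removal
loses.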
> Port of Nutch CommonGrams filter to Solr
> -----------------------------------------
>
> Key: SOLR-908
> URL: https://issues.apache.org/jira/browse/SOLR-908
> Project: Solr
> Issue Type: Wish
> Components: Analysis
> Reporter: Tom Burton-West
> Priority: Minor
> Attachments: CommonGramsPort.zip, SOLR-908.patch, SOLR-908.patch,
> SOLR-908.patch, SOLR-908.patch, SOLR-908.patch, SOLR-908.patch, SOLR-908.patch
>
>
> Phrase queries containing common words are extremely slow. We are reluctant
> to just use stop words due to various problems with false hits and some
> things becoming impossible to search with stop words turned on. (For example
> "to be or not to be", "the who", "man in the moon" vs "man on the moon" etc.)
>
> Several postings regarding slow phrase queries have suggested using the
> approach used by Nutch. Perhaps someone with more Java/Solr experience might
> take this on.
> It should be possible to port the Nutch CommonGrams code to Solr and create
> a suitable Solr FilterFactory so that it could be used in Solr by listing it
> in the Solr schema.xml.
> "Construct n-grams for frequently occurring terms and phrases while indexing.
> Optimize phrase queries to use the n-grams. Single terms are still indexed
> too, with n-grams overlaid."
> http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/analysis/CommonGrams.html