[jira] Updated: (SOLR-908) Port of Nutch CommonGrams filter to Solr

Tom Burton-West (JIRA) Fri, 03 Apr 2009 16:17:34 -0700

     [ https://issues.apache.org/jira/browse/SOLR-908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Tom Burton-West updated SOLR-908:
---------------------------------

    Attachment: CommonGramsPort.zip


Attached is my first cut at a port of the Nutch CommonGrams filter to Solr. I 
still need to write tests for CommonGramsFilterFactory and 
CommonGramsQueryFilterFactory. Nutch had a method call to optimize phrase 
queries; for Solr I just wrote CommonGramsQueryFilter to handle queries. 
Preliminary tests with a relatively small* index of 100,000 full-text documents 
(index size 44GB, about 30% larger than the index without CommonGrams) 
indicate about a 10x increase in response times for phrase queries.

This post by Hoss was extremely helpful, as was his suggestion to use the Solr 
BufferedTokenStream as a base class:
http://www.nabble.com/Re%3A-Index---search-questions--special-cases-p7344056.html


Here is an example schema.xml entry:

<fieldType name="ocrCommon" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="tomwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CommonGramsQueryFilterFactory" words="tomwords.txt"/>
  </analyzer>
</fieldType>
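
To make the intent of the two filters concrete, here is a minimal sketch in
plain Java of the token lists they are meant to produce. It is not the code in
the attached zip and does not use the Lucene TokenStream API. The class and
method names, the "_" separator, and the sample common-word list are all
assumptions for illustration, and position increments (the n-grams overlaid on
the single terms) are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the attached CommonGramsFilter code.
class CommonGramsSketch {

    // Index side: keep every single term, and also emit a bigram
    // whenever either word of an adjacent pair is a common word.
    static List<String> indexTokens(List<String> terms, Set<String> common) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < terms.size(); i++) {
            String term = terms.get(i);
            out.add(term);
            if (i + 1 < terms.size()) {
                String next = terms.get(i + 1);
                if (common.contains(term) || common.contains(next)) {
                    out.add(term + "_" + next); // "_" separator is an assumption
                }
            }
        }
        return out;
    }

    // Query side: use the bigram where one exists and drop any single
    // term that is already covered by a bigram on either side.
    static List<String> queryTokens(List<String> terms, Set<String> common) {
        List<String> out = new ArrayList<>();
        boolean coveredByPrevious = false;
        for (int i = 0; i < terms.size(); i++) {
            String term = terms.get(i);
            boolean gramWithNext = i + 1 < terms.size()
                    && (common.contains(term) || common.contains(terms.get(i + 1)));
            if (gramWithNext) {
                out.add(term + "_" + terms.get(i + 1));
            } else if (!coveredByPrevious) {
                out.add(term);
            }
            coveredByPrevious = gramWithNext;
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> common = new HashSet<>(Arrays.asList("the", "in"));
        List<String> phrase = Arrays.asList("the", "who");
        System.out.println(indexTokens(phrase, common));
        System.out.println(queryTokens(phrase, common));
    }
}
```

In this sketch, for the phrase "the who" with "the" in the common-word list,
the index side keeps the, the_who and who, while the query side is left with
only the_who, so the phrase query turns into a lookup of one much rarer term.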

Tom Burton-West
University of Michigan Library
-----------------------------------
*We are working on indexing 1-6 million full-text documents. Our current 1 
million document index is about 235GB,  so in our context 100,000 docs is 
relatively small.

The files in CommonGramsPort.zip are:
CommonGramsFilter.java  
CommonGramsFilterTest.java  
CommonGramsFilterFactory.java 
CommonGramsQueryFilter.java   
CommonGramsQueryFilterFactory.java   
CommonGramsQueryFilterTest.java   
TestCommonGrams.java     (Non-junit test for input on STDIN)




> Port of Nutch  CommonGrams filter to Solr
> -----------------------------------------
>
>                 Key: SOLR-908
>                 URL: https://issues.apache.org/jira/browse/SOLR-908
>             Project: Solr
>          Issue Type: Wish
>          Components: Analysis
>            Reporter: Tom Burton-West
>            Priority: Minor
>         Attachments: CommonGramsPort.zip
>
>
> Phrase queries containing common words are extremely slow.  We are reluctant 
> to just use stop words due to various problems with false hits and some 
> things becoming impossible to search with stop words turned on. (For example 
> "to be or not to be", "the who", "man in the moon" vs "man on the moon" etc.) 
>  
> Several postings regarding slow phrase queries have suggested using the 
> approach used by Nutch.  Perhaps someone with more Java/Solr experience might 
> take this on.
> It should be possible to port the Nutch CommonGrams code to Solr  and create 
> a suitable Solr FilterFactory so that it could be used in Solr by listing it 
> in the Solr schema.xml.
> "Construct n-grams for frequently occuring terms and phrases while indexing. 
> Optimize phrase queries to use the n-grams. Single terms are still indexed 
> too, with n-grams overlaid."
> http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/analysis/CommonGrams.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
