[ http://issues.apache.org/jira/browse/SOLR-81?page=comments#action_12460405 ]

Otis Gospodnetic commented on SOLR-81:
--------------------------------------

This patch contains 3 new classes for org.apache.solr.analysis:
1. NGramTokenizerFactory
2. NGramTokenizer
3. NGramTokenizerTest (all tests pass)
+ 1 modified class:
4. BaseTokenizerFactory

I *think* the above can be configured in schema.xml as follows:

    <!-- An analyzer chain takes a single tokenizer, so only the n-gram tokenizer is listed. -->
    <fieldtype name="gram1" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.NGramTokenizerFactory" minGram="1" maxGram="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldtype>
    <fieldtype name="gram2" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.NGramTokenizerFactory" minGram="2" maxGram="2"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldtype>
    <fieldtype name="gram3" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.NGramTokenizerFactory" minGram="3" maxGram="3"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldtype>
    <!-- gram4 follows the same pattern; the gram4 field below needs this type. -->
    <fieldtype name="gram4" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.NGramTokenizerFactory" minGram="4" maxGram="4"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldtype>

And I *believe* the following fields would have to be defined (to match the 
fields in Spellchecker.java):

    <field name="word" type="string" indexed="true" stored="true" multiValued="false"/>
    <field name="start1" type="string" indexed="true" stored="true" multiValued="false"/> **
    <field name="end1" type="string" indexed="true" stored="true" multiValued="false"/> **
    <field name="start2" type="string" indexed="true" stored="true" multiValued="false"/> **
    <field name="end2" type="string" indexed="true" stored="true" multiValued="false"/> **
    <field name="start3" type="string" indexed="true" stored="true" multiValued="false"/> **
    <field name="end3" type="string" indexed="true" stored="true" multiValued="false"/> **
    <field name="start4" type="string" indexed="true" stored="true" multiValued="false"/> **
    <field name="end4" type="string" indexed="true" stored="true" multiValued="false"/> **
    <field name="gram1" type="gram1" indexed="true" stored="true" multiValued="false"/>
    <field name="gram2" type="gram2" indexed="true" stored="true" multiValued="false"/>
    <field name="gram3" type="gram3" indexed="true" stored="true" multiValued="false"/>
    <field name="gram4" type="gram4" indexed="true" stored="true" multiValued="false"/>

cf. http://wiki.apache.org/jakarta-lucene/SpellChecker
I am not sure how to configure the fields marked with ** above.
Maybe I don't even need startN/endN fields.  I am not sure how endN fields 
would be useful.  The startN are probably useful because those can get an extra 
boost.
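
To make the boost idea concrete, here is a minimal sketch of how a lookup query for one gram size might be assembled in Lucene query syntax (field:term^boost). The class name SuggestQuery, the method build and the boost value 2.0 are hypothetical and chosen only for illustration; they are not part of the patch. The sketch does no query-character escaping and leaves the endN clause unboosted.

```java
import java.util.ArrayList;
import java.util.List;

public class SuggestQuery {

    // Builds a query string for gram size n over the startN/gramN/endN fields.
    // startBoost is an illustrative value, not one taken from the patch.
    public static String build(String word, int n, float startBoost) {
        if (word.length() < n) {
            return ""; // word too short to produce any n-gram
        }
        List<String> clauses = new ArrayList<>();
        // The leading n-gram gets the extra boost.
        clauses.add("start" + n + ":" + word.substring(0, n) + "^" + startBoost);
        // Every n-gram of the word becomes a plain gramN clause.
        for (int i = 0; i + n <= word.length(); i++) {
            clauses.add("gram" + n + ":" + word.substring(i, i + n));
        }
        // The trailing n-gram is added without a boost.
        clauses.add("end" + n + ":" + word.substring(word.length() - n));
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        // prints: start2:po^2.0 gram2:po gram2:or gram2:rk end2:rk
        System.out.println(build("pork", 2, 2.0f));
    }
}
```

Whether the endN clause earns its place is exactly the open question above; dropping it only means removing one line from the sketch.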

I *think* the above config (except for ** fields, which I don't know how to 
handle) will do the following.
If the input (query string) is "pork", my ngrammer may generate the following 
uni- and bi-gram tokens:

  p o r k po or rk

And this is how I think they will get mapped to fields and indexed:
word: pork
gram1: p o r k
gram2: po or rk
start1: p **
start2: po **
end1: k **
end2: rk **

Again, not sure how to achieve **.
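
One possible way to get the ** fields, assuming they are computed on the client before the document is sent to Solr (the approach the issue description proposes), is sketched below. The class NGramFields and its methods are hypothetical names used only for illustration; this is not the NGramTokenizer from the patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NGramFields {

    // All contiguous substrings of length n, in order of appearance.
    public static List<String> grams(String word, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }

    // Field name -> value for one word, covering gram sizes minGram..maxGram.
    public static Map<String, String> fieldsFor(String word, int minGram, int maxGram) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("word", word);
        for (int n = minGram; n <= maxGram; n++) {
            List<String> g = grams(word, n);
            if (g.isEmpty()) {
                continue; // word is shorter than n, so no fields for this size
            }
            fields.put("start" + n, g.get(0));              // first n-gram
            fields.put("gram" + n, String.join(" ", g));    // all n-grams
            fields.put("end" + n, g.get(g.size() - 1));     // last n-gram
        }
        return fields;
    }

    public static void main(String[] args) {
        // Prints one "name: value" line per field for the word "pork".
        fieldsFor("pork", 1, 2).forEach((k, v) -> System.out.println(k + ": " + v));
    }
}
```

For "pork" with sizes 1 and 2 this yields start1 = p, end1 = k, start2 = po and end2 = rk, matching the mapping listed above. With this approach the startN/endN fields can stay plain string fields and need no analyzer of their own.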

I haven't actually tried this.  I am only modifying my local 
example/solr/conf/schema.xml for now, and I haven't actually indexed anything 
with the above config.

Thoughts/comments?

> Add Query Spellchecker functionality
> ------------------------------------
>
>                 Key: SOLR-81
>                 URL: http://issues.apache.org/jira/browse/SOLR-81
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Otis Gospodnetic
>            Priority: Minor
>         Attachments: SOLR-81-ngram.patch
>
>
> Use the simple approach of n-gramming outside of Solr and indexing n-gram 
> documents.  For example:
> <doc>
> <field name="word">lettuce</field>
> <field name="start3">let</field>
> <field name="gram3">let ett ttu tuc uce</field>
> <field name="end3">uce</field>
> <field name="start4">lett</field>
> <field name="gram4">lett ettu ttuc tuce</field>
> <field name="end4">tuce</field>
> </doc>
> See:
> http://www.mail-archive.com/solr-user@lucene.apache.org/msg01254.html
> Java clients: SOLR-20 (add delete commit optimize), SOLR-30 (search)
