[ http://issues.apache.org/jira/browse/SOLR-81?page=comments#action_12460405 ]
Otis Gospodnetic commented on SOLR-81:
--------------------------------------
This patch contains 3 new classes for org.apache.solr.analysis:
1. NGramTokenizerFactory
2. NGramTokenizer
3. NGramTokenizerTest (all tests pass)
+ 1 modified class:
4. BaseTokenizerFactory
I *think* the above can be configured in schema.xml as follows:
<fieldtype name="gram1" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<tokenizer class="solr.NGramTokenizerFactory" minGram="1" maxGram="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldtype>
<fieldtype name="gram2" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<tokenizer class="solr.NGramTokenizerFactory" minGram="2" maxGram="2"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldtype>
<fieldtype name="gram3" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<tokenizer class="solr.NGramTokenizerFactory" minGram="3" maxGram="3"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldtype>
And I *believe* the following fields would have to be defined (to match the
fields in Spellchecker.java):
<field name="word" type="string" indexed="true" stored="true"
multiValued="false"/>
<field name="start1" type="string" indexed="true" stored="true"
multiValued="false"/> **
<field name="end1" type="string" indexed="true" stored="true"
multiValued="false"/> **
<field name="start2" type="string" indexed="true" stored="true"
multiValued="false"/> **
<field name="end2" type="string" indexed="true" stored="true"
multiValued="false"/> **
<field name="start3" type="string" indexed="true" stored="true"
multiValued="false"/> **
<field name="end3" type="string" indexed="true" stored="true"
multiValued="false"/> **
<field name="start4" type="string" indexed="true" stored="true"
multiValued="false"/> **
<field name="end4" type="string" indexed="true" stored="true"
multiValued="false"/> **
<field name="gram1" type="gram1" indexed="true" stored="true"
multiValued="false"/>
<field name="gram2" type="gram2" indexed="true" stored="true"
multiValued="false"/>
<field name="gram3" type="gram3" indexed="true" stored="true"
multiValued="false"/>
<field name="gram4" type="gram4" indexed="true" stored="true"
multiValued="false"/>
cf. http://wiki.apache.org/jakarta-lucene/SpellChecker
I am not sure how to configure the fields marked with ** above.
Maybe I don't even need the startN/endN fields. I am not sure how the endN
fields would be useful. The startN fields are probably useful because matches
on them can get an extra boost.
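To make the boost idea concrete, here is a rough sketch (in Python, just for illustration) of how a query against these fields might weight the leading n-gram more heavily. The function name spell_query and the boost values 2.0 and 1.0 are assumptions of mine, not something taken from the patch or from Spellchecker.java.

```python
def spell_query(word, n, start_boost=2.0, end_boost=1.0):
    """Build a Lucene-style query string that boosts the leading n-gram.

    The boost values are illustrative assumptions, not tuned settings.
    """
    grams = [word[i:i + n] for i in range(len(word) - n + 1)]
    if not grams:
        return ""  # word is shorter than n, nothing to query on
    clauses = [
        "start%d:%s^%s" % (n, grams[0], start_boost),  # leading n-gram
        "end%d:%s^%s" % (n, grams[-1], end_boost),     # trailing n-gram
    ]
    # Every n-gram, including the first and last, also hits the gramN field.
    clauses.extend("gram%d:%s" % (n, g) for g in grams)
    return " ".join(clauses)


print(spell_query("prok", 2))
```

With something like this, an endN clause would at least let a matching word ending contribute to the score, even if it gets a smaller weight than startN.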
I *think* the above config (except for ** fields, which I don't know how to
handle) will do the following.
If the input (query string) is "pork", my ngrammer may generate the following
uni- and bi-gram tokens:
p o r k po or rk
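For reference, a minimal sketch of the n-gramming itself, written in Python rather than as the Java tokenizer from the patch. The function name ngrams and the shortest-size-first ordering are assumptions; the NGramTokenizer in the patch may well emit tokens in a different order.

```python
def ngrams(word, min_gram, max_gram):
    """Return all character n-grams of word, shortest size first."""
    grams = []
    for n in range(min_gram, max_gram + 1):
        # Slide a window of width n across the word, one character at a time.
        for i in range(len(word) - n + 1):
            grams.append(word[i:i + n])
    return grams


print(" ".join(ngrams("pork", 1, 2)))  # p o r k po or rk
```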
And this is how I think they will get mapped to fields and indexed:
word: pork
gram1: p o r k
gram2: po or rk
start1: p **
start2: po **
end1: k **
end2: rk **
Again, not sure how to achieve **.
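One way around the ** problem might be to compute the startN/endN values on the client, before the document is sent to Solr, as the issue description suggests. Below is a small Python sketch of that mapping; spell_fields is a hypothetical helper, and it assumes startN and endN are simply the first and last n-gram of the word.

```python
def spell_fields(word, sizes):
    """Map a word to the word/gramN/startN/endN fields for each n in sizes."""
    fields = {"word": word}
    for n in sizes:
        grams = [word[i:i + n] for i in range(len(word) - n + 1)]
        if not grams:
            continue  # word is shorter than n, so there is nothing to index
        fields["gram%d" % n] = " ".join(grams)
        fields["start%d" % n] = grams[0]   # leading n-gram
        fields["end%d" % n] = grams[-1]    # trailing n-gram
    return fields


for name, value in spell_fields("pork", (1, 2)).items():
    print("%s: %s" % (name, value))
```

Calling spell_fields("lettuce", (3, 4)) gives the same values as the example document in the issue description.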
I haven't actually tried this. I am only modifying my local
example/solr/conf/schema.xml for now, and I haven't actually indexed anything
with the above config.
Thoughts/comments?
> Add Query Spellchecker functionality
> ------------------------------------
>
> Key: SOLR-81
> URL: http://issues.apache.org/jira/browse/SOLR-81
> Project: Solr
> Issue Type: New Feature
> Components: search
> Reporter: Otis Gospodnetic
> Priority: Minor
> Attachments: SOLR-81-ngram.patch
>
>
> Use the simple approach of n-gramming outside of Solr and indexing n-gram
> documents. For example:
> <doc>
> <field name="word">lettuce</field>
> <field name="start3">let</field>
> <field name="gram3">let ett ttu tuc uce</field>
> <field name="end3">uce</field>
> <field name="start4">lett</field>
> <field name="gram4">lett ettu ttuc tuce</field>
> <field name="end4">tuce</field>
> </doc>
> See:
> http://www.mail-archive.com/[email protected]/msg01254.html
> Java clients: SOLR-20 (add delete commit optimize), SOLR-30 (search)
--
This message is automatically generated by JIRA.