RE: Suggesting broken words with solr.WordBreakSolrSpellChecker

Dyer, James Tue, 27 Jan 2015 07:22:19 -0800

I think the word break spellchecker will do what you want.  But if I were you,
I'd dial back "maxChanges" from 10 to 1 or 2.  You don't want it slicing a word
into 10 parts or trying to combine 10 adjacent words.  You also need
"minBreakLength" to be no more than 2 (you have it at 3) if you want it to
break "go" (length=2) off of "gopro".
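
As a rough, untested sketch, the "wordbreak" entry from your config might look
like this with those two values changed.  Everything else is copied from what
you posted; 1 for "maxChanges" is only one of the two values I suggested, so
adjust it to taste.

```xml
<!-- Sketch only: the "wordbreak" spellchecker from the original post,
     with two values changed. Not tested against a live index. -->
<lst name="spellchecker">
    <str name="name">wordbreak</str>
    <str name="classname">solr.WordBreakSolrSpellChecker</str>
    <str name="field">catch_all_original</str>
    <str name="combineWords">true</str>
    <str name="breakWords">true</str>
    <!-- was 10: keep the number of word breaks or combinations small -->
    <int name="maxChanges">1</int>
    <!-- was 3: must be 2 or less so "go" (length 2) can be split off "gopro" -->
    <int name="minBreakLength">2</int>
</lst>
```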


James Dyer
Ingram Content Group


-----Original Message-----
From: fabio.bozzo [mailto:f.bo...@3-w.it] 
Sent: Tuesday, January 27, 2015 2:58 AM
To: solr-user@lucene.apache.org
Subject: Suggesting broken words with solr.WordBreakSolrSpellChecker

I indexed an electronics e-commerce product catalog.

This is a typical document from my collection:


"docs": [
  {
    "prezzo_vendita_d": 39.9,
    "codice_produttore_s": "DK00150020",
    "codice_s": "5.BAT.27407",
    "descrizione": "BATTERIA GO PRO HERO ",
    "barcode_interno_s": "185323000958",
    "categoria": "Batterie",
    "prezzo_acquisto_d": 16.12,
    "marchio": "GO PRO",
    "data_aggiornamento_dt": "2012-06-21T00:00:00Z",
    "id": "27407",
    "_version_": 1491274123542790100
  },
  {
    "codice_produttore_s": "DK0052043",
    "codice_s": "05.SP.42760",
    "id": "42760",
    "marchio": "SP GADGETS",
    "barcode_interno_s": "4028017520430",
    "prezzo_acquisto_d": 34.4,
    "data_aggiornamento_dt": "2014-11-04T00:00:00Z",
    "descrizione": "SP POS CASE GOPRO OLIVE LARGE",
    "prezzo_vendita_d": 59.95,
    "_version_": 1491274406746390500
  }
...]

I want my spellchecker to suggest "go pro" to users searching "gopro"
(without whitespace).

I also want users searching "go pro" to find "gopro" products, too.

Here's a little bit of my configuration:

*schema.xml*
<field name="marchio" type="string" indexed="true" stored="true"/>
<field name="categoria" type="string" indexed="true" stored="true"/>
<field name="fornitore" type="string" indexed="true" stored="true"/>
<field name="descrizione" type="string" indexed="true" stored="true"/>

<field name="catch_all_original" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="catch_all" type="text_it" indexed="true" stored="false" multiValued="true"/>

<copyField source="marchio" dest="catch_all"/>
<copyField source="categoria" dest="catch_all"/>
<copyField source="descrizione" dest="catch_all"/>
<copyField source="fornitore" dest="catch_all"/>

<copyField source="marchio" dest="catch_all_original"/>
<copyField source="categoria" dest="catch_all_original"/>
<copyField source="descrizione" dest="catch_all_original"/>
<copyField source="fornitore" dest="catch_all_original"/>
...

<fieldType name="text_it" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="1"
                catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"
                preserveOriginal="1"/>
        <filter class="solr.ElisionFilterFactory" ignoreCase="true"
                articles="lang/contractions_it.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="lang/stopwords_it.txt" format="snowball"/>
        <filter class="solr.ItalianLightStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="1"
                catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"
                preserveOriginal="1"/>
        <filter class="solr.ElisionFilterFactory" ignoreCase="true"
                articles="lang/contractions_it.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="lang/stopwords_it.txt" format="snowball"/>
        <filter class="solr.ItalianLightStemFilterFactory"/>
        <filter class="solr.SynonymFilterFactory"
                synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    </analyzer>
</fieldType>


*solrconfig.xml*
<requestHandler name="/select" class="solr.SearchHandler">

        <lst name="defaults">
            <str name="echoParams">explicit</str>
            <int name="rows">10</int>
            <str name="df">catch_all</str>

            <str name="spellcheck">on</str>
            <str name="spellcheck.dictionary">default</str>
            <str name="spellcheck.dictionary">wordbreak</str>
            <str name="spellcheck.extendedResults">false</str>
            <str name="spellcheck.count">5</str>
            <str name="spellcheck.alternativeTermCount">2</str>
            <str name="spellcheck.maxResultsForSuggest">5</str>
            <str name="spellcheck.collate">true</str>
            <str name="spellcheck.collateExtendedResults">true</str>
            <str name="spellcheck.maxCollationTries">5</str>
            <str name="spellcheck.maxCollations">3</str>
        </lst>

        <arr name="last-components">
            <str>spellcheck</str>
        </arr>

    </requestHandler>
...
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">

        <str name="queryAnalyzerFieldType">text_general</str>

        <lst name="spellchecker">
            <str name="name">default</str>
            <str name="field">catch_all_original</str>
            <str name="classname">solr.DirectSolrSpellChecker</str>
            <str name="distanceMeasure">internal</str>
            <float name="accuracy">0.5</float>
            <int name="maxEdits">2</int>  
            <int name="minPrefix">1</int>
            <int name="maxInspections">5</int>
            <int name="minQueryLength">4</int>
            <float name="maxQueryFrequency">0.01</float>
        </lst>

        <lst name="spellchecker">
            <str name="name">wordbreak</str>
            <str name="classname">solr.WordBreakSolrSpellChecker</str>      
            <str name="field">catch_all_original</str>
            <str name="combineWords">true</str>
            <str name="breakWords">true</str>
            <int name="maxChanges">10</int>
            <int name="minBreakLength">3</int>
        </lst>

    </searchComponent>


*Is the spellchecker the right solution, or is this a case for something
else, like the "more like this" feature?*

Thank you



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Suggesting-broken-words-with-solr-WordBreakSolrSpellChecker-tp4182172.html
Sent from the Solr - User mailing list archive at Nabble.com.
