Re: Stopwords work for Solr but not for Mahout

Grant Ingersoll Sat, 02 Jan 2010 07:54:45 -0800

I assume you re-indexed, then used the VectorDumper (along with the 
dictionary) to dump the converted Vectors, and verified that they contain no 
stop words?
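
A rough way to run that check against the dictionary written by --dictOut is
sketched below. It is only a sketch under assumptions: it assumes the
dictionary is a tab-separated text file with the term in the first column
(plus a count line and a '#' header line, which are skipped), and that
stopwords.txt holds one word per line with '#' comments. The function names
are hypothetical, not part of Mahout or Solr.

```python
def load_stopwords(lines):
    """Parse stopwords.txt-style lines: one word per line, '#' starts a comment."""
    words = set()
    for line in lines:
        word = line.strip()
        if word and not word.startswith("#"):
            words.add(word.lower())
    return words


def dictionary_terms(lines, delimiter="\t"):
    """Yield the first column of each data line in the dictionary file.

    Assumption: data lines look like 'term<TAB>doc freq<TAB>idx'. Lines
    without the delimiter (such as a leading term count) and lines that
    start with '#' (the header) are skipped.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#") or delimiter not in line:
            continue
        yield line.split(delimiter, 1)[0]


def stopwords_in_dictionary(dict_lines, stopword_lines):
    """Return the sorted dictionary terms that match a stop word, ignoring case."""
    stop = load_stopwords(stopword_lines)
    return sorted(t for t in dictionary_terms(dict_lines) if t.lower() in stop)
```

To use it, open the dictionary file and stopwords.txt and pass the two file
objects to stopwords_in_dictionary; an empty result suggests the stop filter
was applied at index time. Because the index analyzer in the config below
stems and lowercases terms after the stop filter, an exact match like this is
only approximate.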


On Jan 2, 2010, at 9:03 AM, Bogdan Vatkov wrote:

> This is my Solr config:
> 
>   <field name="msg_body" type="text" termVectors="true" indexed="true"
> stored="true"/>
> 
> and the type text is as configured by default:
> 
>    <fieldType name="text" class="solr.TextField"
> positionIncrementGap="100">
>      <analyzer type="index">
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <!-- in this example, we will only use synonyms at query time
>        <filter class="solr.SynonymFilterFactory"
> synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>        -->
>        <!-- Case insensitive stop word removal.
>          add enablePositionIncrements=true in both the index and query
>          analyzers to leave a 'gap' for more accurate phrase queries.
>        -->
>        <filter class="solr.StopFilterFactory"
>                ignoreCase="true"
>                words="stopwords.txt"
>                enablePositionIncrements="true"
>                />
>        <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.SnowballPorterFilterFactory" language="English"
> protected="protwords.txt"/>
>      </analyzer>
>      <analyzer type="query">
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="true"/>
>        <filter class="solr.StopFilterFactory"
>                ignoreCase="true"
>                words="stopwords.txt"
>                enablePositionIncrements="true"
>                />
>        <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.SnowballPorterFilterFactory" language="English"
> protected="protwords.txt"/>
>      </analyzer>
>    </fieldType>
> 
> and I have entered quite a few stopwords in the stopwords.txt file.
> 
> my SolrToMahout.sh file:
> 
> #!/bin/bash
> set -x
> cd /store/dev/inst/mahout-0.2
> java -classpath \
>   /store/dev/inst/mahout-0.2/utils/target/mahout-utils-0.2.jar:$(echo /store/dev/inst/mahout-0.2/utils/target/dependency/*.jar . | sed 's/ /:/g') \
>   org.apache.mahout.utils.vectors.lucene.Driver \
>   --dir /store/dev/inst/apache-solr-1.4.0/example/solr/data/index \
>   --output /store/dev/inst/mahout-0.2/clustering-example/solr/output \
>   --field msg_body \
>   --dictOut /store/dev/inst/mahout-0.2/clustering-example/solr_dict/dict
> 
> Best regards,
> Bogdan
> 
> On Sat, Jan 2, 2010 at 3:49 PM, Grant Ingersoll <[email protected]> wrote:
> 
>> What do the relevant pieces of your Solr setup look like and how are you
>> invoking the Lucene driver?
>> 
>> -Grant

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search
