Re: Stopwords work for Solr but not for Mahout

Grant Ingersoll Sat, 02 Jan 2010 08:41:23 -0800

On Jan 2, 2010, at 11:34 AM, Bogdan Vatkov wrote:

> I re-indexed, but I cannot find a way to use the VectorDumper with a
> dictionary. I am using Mahout 0.2 rather than the latest trunk code, because
> trunk was not compiling and I had to fall back to older code.


Hmm, I'm using trunk and it is compiling.  You have to do "mvn install" from 
the root Mahout dir, if that helps at all.

If you turn on the TermVectorComponent 
(http://wiki.apache.org/solr/TermVectorComponent) in Solr, what do your vectors 
look like?  Do they have stopwords?
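
A rough way to check the Mahout side of the same question is to compare the stopword list against the dictionary written by --dictOut. The sketch below is only illustrative: the function name stopwords_in_dict is made up for this example, and it assumes the dictionary is plain text with the term in the first tab-separated column and that lines starting with '#' are comments in both files. Verify that layout against your own dict file before trusting the result.

```shell
#!/bin/bash
# Print every stopword that still shows up as a term in a dictionary file.
# Assumptions (not confirmed in this thread): the dictionary is plain text,
# the term is the first tab-separated column, and '#' starts a comment line.
stopwords_in_dict() {
  local stopwords_file="$1" dict_file="$2"
  awk -F'\t' '
    # First file: remember each non-empty, non-comment stopword, lowercased.
    NR == FNR {
      gsub(/^[ \t]+|[ \t\r]+$/, "")
      if ($0 != "" && $0 !~ /^#/) stop[tolower($0)] = 1
      next
    }
    # Second file: report dictionary terms that match a stopword exactly.
    !/^#/ && (tolower($1) in stop) { print $1 }
  ' "$stopwords_file" "$dict_file"
}
```

With the paths from this thread it could be called as: stopwords_in_dict /path/to/stopwords.txt /store/dev/inst/mahout-0.2/clustering-example/solr_dict/dict (the stopwords.txt location is a placeholder). Empty output would suggest the stop filter was applied before the terms reached the index. The comparison is exact, so terms altered by the stemmer will not match.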

> 
> On Sat, Jan 2, 2010 at 5:54 PM, Grant Ingersoll <[email protected]> wrote:
> 
>> I assume you re-indexed and you used the VectorDumper (along with the
>> dictionary) to dump out the Vectors that were converted and verified no stop
>> words?
>> 
>> On Jan 2, 2010, at 9:03 AM, Bogdan Vatkov wrote:
>> 
>>> this is my Solr config:
>>> 
>>>  <field name="msg_body" type="text" termVectors="true" indexed="true"
>>> stored="true"/>
>>> 
>>> and the type text is as configured by default:
>>> 
>>>   <fieldType name="text" class="solr.TextField"
>>> positionIncrementGap="100">
>>>     <analyzer type="index">
>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>       <!-- in this example, we will only use synonyms at query time
>>>       <filter class="solr.SynonymFilterFactory"
>>> synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>>>       -->
>>>       <!-- Case insensitive stop word removal.
>>>         add enablePositionIncrements=true in both the index and query
>>>         analyzers to leave a 'gap' for more accurate phrase queries.
>>>       -->
>>>       <filter class="solr.StopFilterFactory"
>>>               ignoreCase="true"
>>>               words="stopwords.txt"
>>>               enablePositionIncrements="true"
>>>               />
>>>       <filter class="solr.WordDelimiterFilterFactory"
>>> generateWordParts="1" generateNumberParts="1" catenateWords="1"
>>> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>       <filter class="solr.SnowballPorterFilterFactory"
>>> language="English"
>>> protected="protwords.txt"/>
>>>     </analyzer>
>>>     <analyzer type="query">
>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>> ignoreCase="true" expand="true"/>
>>>       <filter class="solr.StopFilterFactory"
>>>               ignoreCase="true"
>>>               words="stopwords.txt"
>>>               enablePositionIncrements="true"
>>>               />
>>>       <filter class="solr.WordDelimiterFilterFactory"
>>> generateWordParts="1" generateNumberParts="1" catenateWords="0"
>>> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>       <filter class="solr.SnowballPorterFilterFactory"
>>> language="English"
>>> protected="protwords.txt"/>
>>>     </analyzer>
>>>   </fieldType>
>>> 
>>> and I have entered quite a few stopwords in the stopwords.txt file
>>> 
>>> my SolrToMahout.sh file:
>>> 
>>> #!/bin/bash
>>> set -x
>>> cd /store/dev/inst/mahout-0.2
>>> java -classpath \
>>>   /store/dev/inst/mahout-0.2/utils/target/mahout-utils-0.2.jar:$( echo /store/dev/inst/mahout-0.2/utils/target/dependency/*.jar . | sed 's/ /:/g') \
>>>   org.apache.mahout.utils.vectors.lucene.Driver \
>>>   --dir /store/dev/inst/apache-solr-1.4.0/example/solr/data/index \
>>>   --output /store/dev/inst/mahout-0.2/clustering-example/solr/output \
>>>   --field msg_body \
>>>   --dictOut /store/dev/inst/mahout-0.2/clustering-example/solr_dict/dict
>>> 
>>> Best regards,
>>> Bogdan
>>> 
>>> On Sat, Jan 2, 2010 at 3:49 PM, Grant Ingersoll <[email protected]> wrote:
>>> 
>>>> What do the relevant pieces of your Solr setup look like and how are you
>>>> invoking the Lucene driver?
>>>> 
>>>> -Grant
>> 
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>> 
>> Search the Lucene ecosystem using Solr/Lucene:
>> http://www.lucidimagination.com/search
>> 
>> 
> 
> 
> -- 
> Bogdan Vatkov
> email: [email protected]

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search
