Re: Stopwords work for Solr but not for Mahout

Bogdan Vatkov Sat, 02 Jan 2010 08:57:13 -0800

If I use the TermVectorComponent, the search results do not contain stopwords,
which seems fine for now.
But when I use the Lucene Driver, I can see the stopwords in the dictionary
file itself and later in the clusters.
Is there a way to print the vectors with the real terms in place,
instead of just their indices?
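
To illustrate the kind of output meant here, a rough sketch in Python (not Mahout code). It assumes the dictionary written via --dictOut is a text file whose data lines hold term, document frequency and index separated by tabs; that layout is an assumption and is not confirmed anywhere in this thread. The function names (load_dictionary, leaked_stopwords, render_vector) and the sample data are made up for illustration.

```python
def load_dictionary(lines, delimiter="\t"):
    """Build an index -> term map from dictionary lines.

    Assumed layout (hypothetical): term<TAB>doc_freq<TAB>index per data line.
    Lines that do not match (a count line, a header line) are skipped.
    """
    index_to_term = {}
    for line in lines:
        parts = line.rstrip("\n").split(delimiter)
        if len(parts) != 3 or not parts[2].isdigit():
            continue
        term, _doc_freq, idx = parts
        index_to_term[int(idx)] = term
    return index_to_term


def leaked_stopwords(index_to_term, stopwords):
    """Return the stopwords that still appear as dictionary terms, sorted."""
    wanted = {word.lower() for word in stopwords}
    return sorted(set(index_to_term.values()) & wanted)


def render_vector(vector, index_to_term):
    """Replace term indices in a sparse {index: weight} vector with terms."""
    return {
        index_to_term.get(idx, "<unknown:%d>" % idx): weight
        for idx, weight in vector.items()
    }


if __name__ == "__main__":
    # Made-up sample standing in for the real dictionary file.
    sample = ["the\t10\t0\n", "mahout\t4\t1\n", "cluster\t2\t2\n"]
    terms = load_dictionary(sample)
    print(leaked_stopwords(terms, ["the", "and"]))
    print(render_vector({1: 0.5, 2: 1.25}, terms))
```

With the real files, load_dictionary would be fed the open dictionary file and leaked_stopwords the non-comment lines of stopwords.txt; any term it reports is a stopword that survived into the vectors.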


On Sat, Jan 2, 2010 at 6:40 PM, Grant Ingersoll <[email protected]> wrote:

>
> On Jan 2, 2010, at 11:34 AM, Bogdan Vatkov wrote:
>
> > I re-indexed but I cannot find a way to use the VectorDumper w/ Dictionary,
> > I am using mahout v 0.2 and not the very latest trunk code since the latter
> > was not compiling and I had to use older code.
>
> Hmm, I'm using trunk and it is compiling.  You have to do "mvn install"
> from the root Mahout dir, if that helps at all.
>
> If you turn on the TermVectorComponent (
> http://wiki.apache.org/solr/TermVectorComponent) in Solr, what do your
> vectors look like?  Do they have stopwords?
>
> >
> > On Sat, Jan 2, 2010 at 5:54 PM, Grant Ingersoll <[email protected]> wrote:
> >
> >> I assume you re-indexed and you used the VectorDumper (along with the
> >> dictionary) to dump out the Vectors that were converted and verified no
> >> stop words?
> >>
> >> On Jan 2, 2010, at 9:03 AM, Bogdan Vatkov wrote:
> >>
> >>> this is my Solr config:
> >>>
> >>>  <field name="msg_body" type="text" termVectors="true" indexed="true" stored="true"/>
> >>>
> >>> and the type text is as configured by default:
> >>>
> >>>   <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> >>>     <analyzer type="index">
> >>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>       <!-- in this example, we will only use synonyms at query time
> >>>       <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
> >>>       -->
> >>>       <!-- Case insensitive stop word removal.
> >>>         add enablePositionIncrements=true in both the index and query
> >>>         analyzers to leave a 'gap' for more accurate phrase queries.
> >>>       -->
> >>>       <filter class="solr.StopFilterFactory"
> >>>               ignoreCase="true"
> >>>               words="stopwords.txt"
> >>>               enablePositionIncrements="true"
> >>>               />
> >>>       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >>>       <filter class="solr.LowerCaseFilterFactory"/>
> >>>       <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
> >>>     </analyzer>
> >>>     <analyzer type="query">
> >>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> >>>       <filter class="solr.StopFilterFactory"
> >>>               ignoreCase="true"
> >>>               words="stopwords.txt"
> >>>               enablePositionIncrements="true"
> >>>               />
> >>>       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> >>>       <filter class="solr.LowerCaseFilterFactory"/>
> >>>       <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
> >>>     </analyzer>
> >>>   </fieldType>
> >>>
> >>> and I have entered quite some stopwords in the stopwords.txt file
> >>>
> >>> my SolrToMahout.sh file:
> >>>
> >>> #!/bin/bash
> >>> set -x
> >>> cd /store/dev/inst/mahout-0.2
> >>> java -classpath /store/dev/inst/mahout-0.2/utils/target/mahout-utils-0.2.jar:$( echo /store/dev/inst/mahout-0.2/utils/target/dependency/*.jar . | sed 's/ /:/g') \
> >>>   org.apache.mahout.utils.vectors.lucene.Driver \
> >>>   --dir /store/dev/inst/apache-solr-1.4.0/example/solr/data/index \
> >>>   --output /store/dev/inst/mahout-0.2/clustering-example/solr/output \
> >>>   --field msg_body \
> >>>   --dictOut /store/dev/inst/mahout-0.2/clustering-example/solr_dict/dict
> >>>
> >>> Best regards,
> >>> Bogdan
> >>>
> >>> On Sat, Jan 2, 2010 at 3:49 PM, Grant Ingersoll <[email protected]> wrote:
> >>>
> >>>> What do the relevant pieces of your Solr setup look like and how are
> >>>> you invoking the Lucene driver?
> >>>>
> >>>> -Grant
> >>
> >> --------------------------
> >> Grant Ingersoll
> >> http://www.lucidimagination.com/
> >>
> >> Search the Lucene ecosystem using Solr/Lucene:
> >> http://www.lucidimagination.com/search
> >>
> >>
> >
> >
> > --
> > Bogdan Vatkov
> > email: [email protected]
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>


-- 
Bogdan Vatkov
email: [email protected]
