I re-indexed, but I cannot find a way to use the VectorDumper with a dictionary. I am using Mahout 0.2 rather than the latest trunk code, because trunk was not compiling and I had to fall back to the older release.
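Until the VectorDumper question is resolved, one rough way to check for leftover stop words might be to compare the dictionary file against stopwords.txt directly. The sketch below is only an illustration: the function name `find_leaked_stopwords` is hypothetical, and it assumes the file written by --dictOut is tab-delimited text with the term in the first column and `#`-prefixed header lines, which I have not verified for 0.2. Since the indexed terms are lowercased and stemmed, an exact match against the stop word list is only an approximation.

```shell
#!/bin/bash
# Hypothetical helper: print dictionary terms that also appear in the stop word list.
# Assumptions (not confirmed for Mahout 0.2):
#   - the dictionary is tab-delimited text with the term in column 1
#   - lines starting with '#' in either file are comments or headers
find_leaked_stopwords() {
  local stopwords_file="$1" dict_file="$2"
  awk -F'\t' '
    FILENAME == ARGV[1] {
      # First file: collect stop words, lowercased and trimmed.
      word = tolower($0)
      gsub(/^[ \t\r]+|[ \t\r]+$/, "", word)
      if (word != "" && word !~ /^#/) stop[word] = 1
      next
    }
    /^#/ { next }                        # skip header lines in the dictionary
    (tolower($1) in stop) { print $1 }   # term is still in the stop word list
  ' "$stopwords_file" "$dict_file"
}
```

It could be called as `find_leaked_stopwords /path/to/solr/conf/stopwords.txt /store/dev/inst/mahout-0.2/clustering-example/solr_dict/dict` (the stopwords.txt location is a placeholder). Empty output would suggest that no listed stop word survived as a term.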

On Sat, Jan 2, 2010 at 5:54 PM, Grant Ingersoll <[email protected]> wrote:

> I assume you re-indexed and you used the VectorDumper (along with the
> dictionary) to dump out the Vectors that were converted and verified no stop
> words?
>
> On Jan 2, 2010, at 9:03 AM, Bogdan Vatkov wrote:
>
> > this is my Solr config:
> >
> > <field name="msg_body" type="text" termVectors="true" indexed="true"
> >        stored="true"/>
> >
> > and the type text is as configured by default:
> >
> > <fieldType name="text" class="solr.TextField"
> >            positionIncrementGap="100">
> >   <analyzer type="index">
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <!-- in this example, we will only use synonyms at query time
> >     <filter class="solr.SynonymFilterFactory"
> >             synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
> >     -->
> >     <!-- Case insensitive stop word removal.
> >          add enablePositionIncrements=true in both the index and query
> >          analyzers to leave a 'gap' for more accurate phrase queries.
> >     -->
> >     <filter class="solr.StopFilterFactory"
> >             ignoreCase="true"
> >             words="stopwords.txt"
> >             enablePositionIncrements="true"
> >             />
> >     <filter class="solr.WordDelimiterFilterFactory"
> >             generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.SnowballPorterFilterFactory" language="English"
> >             protected="protwords.txt"/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >             ignoreCase="true" expand="true"/>
> >     <filter class="solr.StopFilterFactory"
> >             ignoreCase="true"
> >             words="stopwords.txt"
> >             enablePositionIncrements="true"
> >             />
> >     <filter class="solr.WordDelimiterFilterFactory"
> >             generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >             catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.SnowballPorterFilterFactory" language="English"
> >             protected="protwords.txt"/>
> >   </analyzer>
> > </fieldType>
> >
> > and I have entered quite some stopwords in the stopwords.txt file
> >
> > my SolrToMahout.sh file:
> >
> > #!/bin/bash
> > set -x
> > cd /store/dev/inst/mahout-0.2
> > java -classpath /store/dev/inst/mahout-0.2/utils/target/mahout-utils-0.2.jar:$( echo /store/dev/inst/mahout-0.2/utils/target/dependency/*.jar . | sed 's/ /:/g') \
> >   org.apache.mahout.utils.vectors.lucene.Driver \
> >   --dir /store/dev/inst/apache-solr-1.4.0/example/solr/data/index \
> >   --output /store/dev/inst/mahout-0.2/clustering-example/solr/output \
> >   --field msg_body \
> >   --dictOut /store/dev/inst/mahout-0.2/clustering-example/solr_dict/dict
> >
> > Best regards,
> > Bogdan
> >
> > On Sat, Jan 2, 2010 at 3:49 PM, Grant Ingersoll <[email protected]> wrote:
> >
> >> What do the relevant pieces of your Solr setup look like and how are you
> >> invoking the Lucene driver?
> >>
> >> -Grant
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem using Solr/Lucene:
> http://www.lucidimagination.com/search

--
Bogdan Vatkov
email: [email protected]
