On Jan 2, 2010, at 11:34 AM, Bogdan Vatkov wrote:

> I re-indexed but I cannot find a way to use the VectorDumper w/ Dictionary,
> I am using mahout v 0.2 and not the very latest trunk code since the latter
> was not compiling and I had to use older code.
Hmm, I'm using trunk and it is compiling. You have to do "mvn install" from the root Mahout dir, if that helps at all.

If you turn on the TermVectorComponent (http://wiki.apache.org/solr/TermVectorComponent) in Solr, what do your vectors look like? Do they have stopwords?

>
> On Sat, Jan 2, 2010 at 5:54 PM, Grant Ingersoll <[email protected]> wrote:
>
>> I assume you re-indexed and you used the VectorDumper (along with the
>> dictionary) to dump out the Vectors that were converted and verified no stop
>> words?
>>
>> On Jan 2, 2010, at 9:03 AM, Bogdan Vatkov wrote:
>>
>>> this is my Solr config:
>>>
>>> <field name="msg_body" type="text" termVectors="true" indexed="true"
>>>        stored="true"/>
>>>
>>> and the type text is as configured by default:
>>>
>>> <fieldType name="text" class="solr.TextField"
>>>            positionIncrementGap="100">
>>>   <analyzer type="index">
>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>     <!-- in this example, we will only use synonyms at query time
>>>     <filter class="solr.SynonymFilterFactory"
>>>             synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>>>     -->
>>>     <!-- Case insensitive stop word removal.
>>>          add enablePositionIncrements=true in both the index and query
>>>          analyzers to leave a 'gap' for more accurate phrase queries.
>>>     -->
>>>     <filter class="solr.StopFilterFactory"
>>>             ignoreCase="true"
>>>             words="stopwords.txt"
>>>             enablePositionIncrements="true"
>>>             />
>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>             generateWordParts="1" generateNumberParts="1" catenateWords="1"
>>>             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>     <filter class="solr.SnowballPorterFilterFactory" language="English"
>>>             protected="protwords.txt"/>
>>>   </analyzer>
>>>   <analyzer type="query">
>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>             ignoreCase="true" expand="true"/>
>>>     <filter class="solr.StopFilterFactory"
>>>             ignoreCase="true"
>>>             words="stopwords.txt"
>>>             enablePositionIncrements="true"
>>>             />
>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>             generateWordParts="1" generateNumberParts="1" catenateWords="0"
>>>             catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>     <filter class="solr.SnowballPorterFilterFactory" language="English"
>>>             protected="protwords.txt"/>
>>>   </analyzer>
>>> </fieldType>
>>>
>>> and I have entered quite some stopwords in the stopwords.txt file
>>>
>>> my SolrToMahout.sh file:
>>>
>>> #!/bin/bash
>>> set -x
>>> cd /store/dev/inst/mahout-0.2
>>> java -classpath
>>> /store/dev/inst/mahout-0.2/utils/target/mahout-utils-0.2.jar:$( echo
>>> /store/dev/inst/mahout-0.2/utils/target/dependency/*.jar .
>>> | sed 's/ /:/g')
>>> org.apache.mahout.utils.vectors.lucene.Driver --dir
>>> /store/dev/inst/apache-solr-1.4.0/example/solr/data/index \
>>> --output /store/dev/inst/mahout-0.2/clustering-example/solr/output
>>> --field msg_body --dictOut
>>> /store/dev/inst/mahout-0.2/clustering-example/solr_dict/dict
>>>
>>> Best regards,
>>> Bogdan
>>>
>>> On Sat, Jan 2, 2010 at 3:49 PM, Grant Ingersoll <[email protected]> wrote:
>>>
>>>> What do the relevant pieces of your Solr setup look like and how are you
>>>> invoking the Lucene driver?
>>>>
>>>> -Grant
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem using Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>
>
>
> --
> Bogdan Vatkov
> email: [email protected]

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene:
http://www.lucidimagination.com/search
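
As a rough way to run the "verified no stop words" check discussed above without the VectorDumper, something like the following sketch might help. It only compares the file written by --dictOut against stopwords.txt. Both function names are made up for illustration, and the dictionary layout is an assumption (term in the first tab-separated field, lines without a tab skipped), so please check it against the file your Driver run actually produced.

```python
from pathlib import Path


def load_words(path):
    """Read one word per line, skipping blank lines and '#' comment lines."""
    words = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            words.add(line.lower())
    return words


def stopwords_in_dictionary(dict_path, stopwords_path):
    """Return the sorted dictionary terms that also appear in the stop word list.

    ASSUMPTION: the term is the first tab-separated field on each line of the
    --dictOut file; lines without a tab (for example a count header) are skipped.
    """
    stopwords = load_words(stopwords_path)
    leaked = []
    for line in Path(dict_path).read_text(encoding="utf-8").splitlines():
        if "\t" not in line:
            continue
        term = line.split("\t", 1)[0].strip().lower()
        if term in stopwords:
            leaked.append(term)
    return sorted(leaked)
```

With the paths from the script above, stopwords_in_dictionary("/store/dev/inst/mahout-0.2/clustering-example/solr_dict/dict", "<path to your stopwords.txt>") would list any stop words that made it into the dictionary. Note that the index analyzer stems terms with the Snowball filter after the stop filter runs, so the dictionary holds stemmed forms and this literal comparison can miss stop words whose stem differs from the listed word.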
