Yesterday I had issues mapping cluster results to dictionary entries: it turned out I was using a different dictionary, so the resulting clusters looked really strange. Once I fixed all the commands, input/output files, etc., I got very good results from a clustering point of view (the clusters are quite correct given the input documents). Unfortunately, the clusters consist mostly of words that I want to stop, words that I have placed in stopwords.txt in Solr (and I re-indexed, restarted Solr, etc.).
Where do you suggest I debug the vector creation? It seems Solr itself respects the stopwords, but the vector creation (and therefore the clustering) does not.

On Sun, Jan 3, 2010 at 4:02 PM, Grant Ingersoll <[email protected]> wrote:

>
> On Jan 3, 2010, at 8:58 AM, Bogdan Vatkov wrote:
>
> > I have stopwords.txt file with 1200+ words, i did not understand this with
> > the stemming - you mean my stopwords are somehow ignored due to some
> > stemming or ?
>
> No, stopword removal happens before stemming so it is possible that a word
> that was not stopped was then stemmed to a stopword.
>
> I thought you said yesterday you got it straightened out.
>
> >
> > On Sun, Jan 3, 2010 at 3:53 PM, Grant Ingersoll <[email protected]> wrote:
> >
> >> Are you sure you have stopwords and it is not the result of stemming some
> >> other word?
> >>
> >> On Jan 3, 2010, at 7:57 AM, Bogdan Vatkov wrote:
> >>
> >>> my Solr config is like the default one:
> >>>
> >>> <field name="msg_body" type="text" termVectors="true" indexed="true"
> >>>        stored="true"/>
> >>>
> >>> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> >>>   <analyzer type="index">
> >>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>     <filter class="solr.StopFilterFactory"
> >>>             ignoreCase="true"
> >>>             words="stopwords.txt"
> >>>             enablePositionIncrements="true"
> >>>             />
> >>>     <filter class="solr.WordDelimiterFilterFactory"
> >>>             generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >>>             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>     <filter class="solr.SnowballPorterFilterFactory" language="English"
> >>>             protected="protwords.txt"/>
> >>>   </analyzer>
> >>>   <analyzer type="query">
> >>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >>>             ignoreCase="true" expand="true"/>
> >>>     <filter class="solr.StopFilterFactory"
> >>>             ignoreCase="true"
> >>>             words="stopwords.txt"
> >>>             enablePositionIncrements="true"
> >>>             />
> >>>     <filter class="solr.WordDelimiterFilterFactory"
> >>>             generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >>>             catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> >>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>     <filter class="solr.SnowballPorterFilterFactory" language="English"
> >>>             protected="protwords.txt"/>
> >>>   </analyzer>
> >>> </fieldType>
> >>
> >>
> >
> >
> > --
> > Best regards,
> > Bogdan
>
>

--
Best regards,
Bogdan
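
A minimal sketch of the ordering effect described in Grant's reply above, in case it helps with debugging. This is not Solr or Mahout code: toy_stem is a hypothetical stand-in for the Snowball stemmer, STOPWORDS is a made-up stand-in for stopwords.txt, and the chain leaves out the word-delimiter step. It only shows how a token that passes the stop filter can be stemmed into a stopword afterwards, and how a second stop pass after stemming would catch it in this toy setup.

```python
# Toy analysis chain: stop filter -> lowercase -> stemmer, mirroring the
# filter order in the quoted config. Everything here is illustrative.

STOPWORDS = {"regard", "the"}  # hypothetical contents of stopwords.txt


def toy_stem(token):
    """Strip a few common suffixes. A crude stand-in, NOT the Snowball stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze(text, stop_after_stem=False):
    """Run the toy chain; optionally apply the stop filter again after stemming."""
    tokens = text.split()                                        # whitespace tokenizer
    tokens = [t for t in tokens if t.lower() not in STOPWORDS]   # stop filter (ignoreCase)
    tokens = [t.lower() for t in tokens]                         # lowercase filter
    tokens = [toy_stem(t) for t in tokens]                       # stemmer
    if stop_after_stem:
        # Second stop pass: catches tokens that only became stopwords after stemming.
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


if __name__ == "__main__":
    print(analyze("Best regards the cluster"))
    # -> ['best', 'regard', 'cluster']  ("regards" slipped past the stop filter)
    print(analyze("Best regards the cluster", stop_after_stem=True))
    # -> ['best', 'cluster']
```

If the unwanted cluster terms turn out to be stemmed forms of words that are not themselves in stopwords.txt, this ordering is one plausible explanation. If they are unstemmed stopwords, the vector creation step is probably not going through the same analysis at all.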
