Re: Stopwords not working as expected

Bogdan Vatkov Sun, 03 Jan 2010 06:14:03 -0800

Unfortunately, it is all classified data that I cannot share. I will try to
debug it.


On Sun, Jan 3, 2010 at 4:10 PM, Grant Ingersoll <[email protected]> wrote:

> Is there any way you could zip up a small document set and your Solr home
> and post them somewhere?
>
> On Jan 3, 2010, at 9:08 AM, Bogdan Vatkov wrote:
>
> > Yesterday I had issues with mapping cluster results to dictionary
> > entries: it turned out that I was using a different dictionary, so the
> > resulting clusters showed really strange results.
> > But once I fixed all the commands, input/output files, etc., I got very
> > good results from a clustering POV (I mean the clusters are quite correct
> > given the input documents). Unfortunately, the clusters contained mostly
> > words that I would like to stop, words which I had placed in
> > stopwords.txt in Solr (re-indexed, restarted Solr, etc.).
> >
> > Where do you suggest I debug the vector creation? It seems Solr respects
> > the stopwords, but the vector creation (and then the clustering) does not.
> >
> > On Sun, Jan 3, 2010 at 4:02 PM, Grant Ingersoll <[email protected]> wrote:
> >
> >>
> >> On Jan 3, 2010, at 8:58 AM, Bogdan Vatkov wrote:
> >>
> >>> I have a stopwords.txt file with 1200+ words. I did not understand the
> >>> point about stemming: do you mean my stopwords are somehow ignored due
> >>> to stemming?
> >>
> >> No, stopword removal happens before stemming, so it is possible that a
> >> word that was not stopped was then stemmed to a stopword.
> >>
> >> I thought you said yesterday you got it straightened out.
> >>
> >>>
> >>> On Sun, Jan 3, 2010 at 3:53 PM, Grant Ingersoll <[email protected]> wrote:
> >>>
> >>>> Are you sure you have stopwords and it is not the result of stemming
> >>>> some other word?
> >>>>
> >>>> On Jan 3, 2010, at 7:57 AM, Bogdan Vatkov wrote:
> >>>>
> >>>>> My Solr config is like the default one:
> >>>>>
> >>>>> <field name="msg_body" type="text" termVectors="true" indexed="true"
> >>>>>        stored="true"/>
> >>>>>
> >>>>> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> >>>>>   <analyzer type="index">
> >>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>>>     <filter class="solr.StopFilterFactory"
> >>>>>             ignoreCase="true"
> >>>>>             words="stopwords.txt"
> >>>>>             enablePositionIncrements="true"/>
> >>>>>     <filter class="solr.WordDelimiterFilterFactory"
> >>>>>             generateWordParts="1" generateNumberParts="1"
> >>>>>             catenateWords="1" catenateNumbers="1" catenateAll="0"
> >>>>>             splitOnCaseChange="1"/>
> >>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>     <filter class="solr.SnowballPorterFilterFactory"
> >>>>>             language="English" protected="protwords.txt"/>
> >>>>>   </analyzer>
> >>>>>   <analyzer type="query">
> >>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>>>     <filter class="solr.SynonymFilterFactory"
> >>>>>             synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> >>>>>     <filter class="solr.StopFilterFactory"
> >>>>>             ignoreCase="true"
> >>>>>             words="stopwords.txt"
> >>>>>             enablePositionIncrements="true"/>
> >>>>>     <filter class="solr.WordDelimiterFilterFactory"
> >>>>>             generateWordParts="1" generateNumberParts="1"
> >>>>>             catenateWords="0" catenateNumbers="0" catenateAll="0"
> >>>>>             splitOnCaseChange="1"/>
> >>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>     <filter class="solr.SnowballPorterFilterFactory"
> >>>>>             language="English" protected="protwords.txt"/>
> >>>>>   </analyzer>
> >>>>> </fieldType>
> >>>>
> >>>>
> >>>
> >>>
> >>> --
> >>> Best regards,
> >>> Bogdan
> >>
> >>
> >
> >
> > --
> > Best regards,
> > Bogdan
>
>


-- 
Best regards,
Bogdan
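
The point made in the thread, that stopword removal runs before stemming and
so an unstopped word can be stemmed into a stopword, can be illustrated with a
small simulation. This is only a sketch under stated assumptions: it is plain
Python, not Solr or Lucene code. The helper names (toy_stem, analyze), the
three-word stopword set, and the sample sentence are all made up for
illustration. The stemmer is a crude suffix stripper standing in for Snowball,
and the WordDelimiterFilter step from the quoted config is left out.

```python
def toy_stem(token):
    """Crude suffix stripper; a stand-in for Snowball, not its algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text, stopwords, stop_after_stem=False):
    """Mimic the filter order of the quoted index analyzer (simplified)."""
    tokens = text.split()                                        # whitespace tokenizer
    tokens = [t for t in tokens if t.lower() not in stopwords]  # stop filter, ignoreCase
    tokens = [t.lower() for t in tokens]                         # lowercase filter
    tokens = [toy_stem(t) for t in tokens]                       # stemmer
    if stop_after_stem:
        # Hypothetical extra pass: filter the stemmed forms as well.
        tokens = [t for t in tokens if t not in stopwords]
    return tokens


# Hypothetical data, not taken from the thread.
STOPWORDS = {"the", "work", "need"}
TEXT = "The team needs working code"

print(analyze(TEXT, STOPWORDS))
print(analyze(TEXT, STOPWORDS, stop_after_stem=True))
```

In the first call, "needs" and "working" pass the stop filter because only the
exact forms "need" and "work" are listed, and they are then reduced to those
very stopwords. The second call applies the stop list again after stemming,
which is one possible way to check whether terms showing up in the vectors are
stems of other words rather than stopwords that were never removed.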
