Can you give a small test file that demonstrates the problem?

-Yonik
http://www.lucidimagination.com

On Fri, Sep 25, 2009 at 5:34 AM, Kundig, Andreas <andreas.kun...@wipo.int> wrote:
> Hello
>
> I can't bring HTMLStripStandardTokenizerFactory to remove the content of the
> style tag, as the documentation says it should.
>
> A search for 'mso' returns a document where the search term only appears in
> the style tag (it's a word document saved as html). Here is the highlight
> returned by solr (by the way: the wrong word is highlighted).
>
> "vetica; \n\tpanose-1:2 11 5 4 2 2 2 2 2 4;&<em>#13</em>;\n\tmso-font-charset:0;&<em>#13</em>;\n\tmso-generic-font-family:swiss;&<em>#13</em>"
>
> I am using solr 1.3. Here is how I configured the tokenizer in schema.xml
>
> <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt" enablePositionIncrements="true"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>             catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>             ignoreCase="true" expand="true"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>             generateNumberParts="1" catenateWords="0" catenateNumbers="0"
>             catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> Am I doing something wrong?
>
> thank you
> Andréas Kündig
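
A small test case of the kind asked for at the top of this thread could be sketched roughly as follows. This is only a Python illustration of the behaviour the report expects (text inside the style tag should not be searchable); it is not Solr's HTMLStripStandardTokenizerFactory and says nothing about how Solr implements stripping. The sample document and the names SAMPLE_HTML, VisibleTextExtractor and visible_text are hypothetical, made up for this sketch.

```python
from html.parser import HTMLParser

# Hypothetical sample input: a tiny stand-in for a Word document saved as HTML,
# where "mso" occurs only inside the <style> element.
SAMPLE_HTML = """<html><head>
<style>
p.MsoNormal { mso-font-charset:0; mso-generic-font-family:swiss; }
</style>
</head><body><p class="MsoNormal">Hello world</p></body></html>"""


class VisibleTextExtractor(HTMLParser):
    """Collects character data that lies outside <style> and <script>."""

    SKIPPED = {"style", "script"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <style> or <script>.
        if self._skip_depth == 0:
            self.chunks.append(data)


def visible_text(html):
    """Return the text a reader would see, with whitespace normalized."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


if __name__ == "__main__":
    print(visible_text(SAMPLE_HTML))
```

Indexing the same sample document into a text_en field and searching for 'mso' would then show whether the analyzer in question behaves as the documentation describes: under that description, the query should return no hit.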