HTMLStripStandardTokenizerFactory doesn't interpret word boundaries on HTML 
tags correctly.
-------------------------------------------------------------------------------------------

                 Key: SOLR-678
                 URL: https://issues.apache.org/jira/browse/SOLR-678
             Project: Solr
          Issue Type: Bug
          Components: search
    Affects Versions: 1.2
         Environment: Mac OS X 10.5.4, java version "1.5.0_13"

            Reporter: Matt Connolly


The HTMLStripStandardTokenizerFactory filter does not treat HTML tags as word 
boundaries, as it should.

For example, indexing the text "<h2>title</h2><p>some comment</p>" results in 
two words being indexed, "titlesome" and "comment", when there should be three 
words: "title", "some", and "comment".

Not all tags need this. For example, it may be perfectly reasonable for 
"<b>sub</b>script" to be indexed as "subscript", since <b> is interpreted 
as an inline tag, not a block tag.

I would suggest that all block or paragraph tags be translated into spaces so 
that the text on either side of the tag is treated as separate tokens, e.g. 
p div h1 h2 h3 h4 h5 h6 br hr pre   (etc)
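
A minimal sketch of the suggested behaviour, written as plain Java with regular 
expressions rather than against the Solr API. The class name BlockTagSpacer, 
its methods, and the regex-based approach are assumptions made for 
illustration only; this is not how HTMLStripStandardTokenizerFactory is 
implemented, and the tag list is just the one given above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only: not part of Solr.
public class BlockTagSpacer {
    // Block-level tags from the list above (opening or closing, with optional attributes).
    private static final Pattern BLOCK_TAG = Pattern.compile(
        "</?(?:p|div|h[1-6]|br|hr|pre)\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Any remaining tag is treated as inline and removed without adding a space.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    // Replace block tags with a space, then drop all other tags.
    public static String strip(String html) {
        String spaced = BLOCK_TAG.matcher(html).replaceAll(" ");
        return ANY_TAG.matcher(spaced).replaceAll("");
    }

    // Split the stripped text on whitespace to show the resulting tokens.
    public static List<String> tokens(String html) {
        String text = strip(html).trim();
        if (text.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(text.split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(tokens("<h2>title</h2><p>some comment</p>")); // [title, some, comment]
        System.out.println(tokens("<b>sub</b>script"));                  // [subscript]
    }
}
```

With this approach the first example yields three tokens, while the inline <b> 
case still yields the single token "subscript". A real fix would need to handle 
comments, entities, and malformed markup, which this sketch ignores.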




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.