[jira] [Resolved] (SOLR-678) HTMLStripStandardTokenizerFactory doesn't interpret word boundaries on html tags correctly.

Erick Erickson (JIRA) Sat, 30 Nov 2013 05:26:10 -0800

     [ https://issues.apache.org/jira/browse/SOLR-678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Erick Erickson resolved SOLR-678.
---------------------------------

    Resolution: Won't Fix

2013 Old JIRA cleanup

> HTMLStripStandardTokenizerFactory doesn't interpret word boundaries on html 
> tags correctly.
> -------------------------------------------------------------------------------------------
>
>                 Key: SOLR-678
>                 URL: https://issues.apache.org/jira/browse/SOLR-678
>             Project: Solr
>          Issue Type: Bug
>          Components: search
>    Affects Versions: 1.2
>         Environment: Mac OS X 10.5.4, java version "1.5.0_13"
>            Reporter: Matt Connolly
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> The HTMLStripStandardTokenizerFactory filter does not place word boundaries 
> on HTML tags like it should.
> For example, indexing the text "<h2>title</h2><p>some comment</p>" results in 
> two words being indexed: "titlesome" and "comment" when there should be three 
> words: "title" "some" and "comment".
> Not all tags need this. For example, it may be perfectly reasonable for 
> "<b>sub</b>script" to be indexed as "subscript", since <b> is interpreted 
> as an inline tag, not a block tag.
> I would suggest that all block or paragraph tags be translated into spaces so 
> that text on either side of the tag is treated as separate tokens, e.g. p, div, 
> h1, h2, h3, h4, h5, h6, br, hr, pre (etc.).
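
The behaviour requested in the report can be illustrated with a small standalone sketch. This is not Solr's implementation and uses no Solr API. The names BLOCK_TAGS, strip_html and tokenize are hypothetical, the tag set is simply the list suggested in the report, and the regex is a deliberate simplification that ignores comments, scripts and entities.

```python
import re

# Assumption: the block-level tag set is the list suggested in the report.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "br", "hr", "pre"}

# Simplified tag matcher: opening, closing and self-closing tags only.
TAG_RE = re.compile(r"</?\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def strip_html(text):
    """Replace block-level tags with a space and drop inline tags."""
    def replace(match):
        name = match.group(1).lower()
        return " " if name in BLOCK_TAGS else ""
    return TAG_RE.sub(replace, text)


def tokenize(text):
    """Strip tags, then split on whitespace (a stand-in for a real tokenizer)."""
    return strip_html(text).split()


print(tokenize("<h2>title</h2><p>some comment</p>"))
print(tokenize("<b>sub</b>script"))
```

With this sketch the first input yields three tokens ("title", "some", "comment"), while the inline <b> example stays a single token ("subscript").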



--
This message was sent by Atlassian JIRA
(v6.1#6144)

