[ https://issues.apache.org/jira/browse/SOLR-678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Erick Erickson resolved SOLR-678.
---------------------------------

    Resolution: Won't Fix

2013 Old JIRA cleanup

> HTMLStripStandardTokenizerFactory doesn't interpret word boundaries on html tags correctly.
> -------------------------------------------------------------------------------------------
>
>                 Key: SOLR-678
>                 URL: https://issues.apache.org/jira/browse/SOLR-678
>             Project: Solr
>          Issue Type: Bug
>          Components: search
>   Affects Versions: 1.2
>         Environment: Mac OS X 10.5.4, java version "1.5.0_13"
>            Reporter: Matt Connolly
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> The HTMLStripStandardTokenizerFactory filter does not place word boundaries
> at HTML tags as it should.
> For example, indexing the text "<h2>title</h2><p>some comment</p>" results in
> two words being indexed, "titlesome" and "comment", when there should be three:
> "title", "some" and "comment".
> Not all tags need this. For example, it may be perfectly reasonable for
> "<b>sub</b>script" to be indexed as "subscript", since <b> is interpreted
> as an inline tag, not a block tag.
> I would suggest that all block or paragraph tags be translated into spaces, so
> that the text on either side of the tag is treated as separate tokens, e.g.: p div
> h1 h2 h3 h4 h5 h6 br hr pre (etc)
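The behaviour suggested in the report can be sketched outside Solr as follows. This is a minimal standalone illustration in Python, not the HTMLStripStandardTokenizerFactory code or any Solr API: the names BLOCK_TAGS, strip_html and tokenize are hypothetical, the tag list is the one given in the report (which ends with "etc", so it is not exhaustive), and entities, comments and script content are not handled.

```python
import re

# Block-level tags taken from the list in the report; assumed incomplete.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "br", "hr", "pre"}

# Matches an opening, closing, or self-closing tag and captures its name.
TAG_RE = re.compile(r"</?\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def strip_html(text):
    """Replace block-level tags with a space and drop every other tag."""
    def replace(match):
        return " " if match.group(1).lower() in BLOCK_TAGS else ""
    return TAG_RE.sub(replace, text)


def tokenize(text):
    """Strip tags, then split on whitespace (a stand-in for a real tokenizer)."""
    return strip_html(text).lower().split()


if __name__ == "__main__":
    print(tokenize("<h2>title</h2><p>some comment</p>"))
    print(tokenize("<b>sub</b>script"))
```

With this approach the first example from the report yields three tokens, while the inline <b> case still yields the single token "subscript".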