[ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556636#action_12556636 ]
Yonik Seeley commented on SOLR-42:
----------------------------------

Hmmm, this points out a deficiency in this approach... it could break up words or tokens (with whitespace) that were not originally separated (think international char in the middle of a word).

So I think this approach is probably OK for now, but a better approach would have the tokenizer get the offsets from the reader somehow (perhaps just a whitespace tokenizer with HTML stripping integrated).

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, SOLR-42.patch, SOLR-42.patch
>
>
> Indexing content that contains HTML markup, causes problems with highlighting
> if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names
> from being searchable).
>
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a
> polyorogenic terrane of NW Iberia
>
> Searching for title:fabrics with highlighting on, the highlighted version has
> the <em> tags in the wrong place - 22 characters to the left of where they
> should be (i.e. the sum of the lengths of the tags).
>
> Response from Yonik on the solr-user mailing-list:
>
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
>
> I did it this way so that HTMLStripReader could go before any
> tokenizer (like StandardTokenizer).
>
> Can you open a JIRA bug for this? The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct.
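
To make the offset problem concrete, here is a rough Python sketch (not the Solr code and not the attached patch) that reproduces the 22-character shift on the example title. The function names `tokens_after_strip` and `tokens_with_original_offsets` are made up for illustration, and the one-line tag regex is a big simplification of what HTMLStripReader handles (no entities, comments, or scripts). The first function mimics the two-phase strip-then-tokenize behaviour; the second shows one possible way an integrated whitespace tokenizer could report offsets into the original text.

```python
import re

# Simplified notion of a tag; an assumption for this sketch only.
TAG = re.compile(r"<[^>]*>")


def tokens_after_strip(text):
    """Two-phase approach: strip tags first, then split on whitespace.

    Offsets refer to the stripped text, not to the original input.
    """
    stripped = TAG.sub("", text)
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", stripped)]


def tokens_with_original_offsets(text):
    """Integrated approach: skip tags while scanning, keep original offsets.

    A tag inside a word does not split the token.
    """
    tokens, chars, start, end = [], [], None, None
    i = 0
    while i < len(text):
        tag = TAG.match(text, i)
        if tag:
            i = tag.end()  # jump over the tag without ending the token
            continue
        ch = text[i]
        if ch.isspace():
            if chars:
                tokens.append(("".join(chars), start, end))
                chars = []
        else:
            if not chars:
                start = i  # offset in the ORIGINAL text
            chars.append(ch)
            end = i + 1
        i += 1
    if chars:
        tokens.append(("".join(chars), start, end))
    return tokens


title = ("<SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic "
         "fabrics in a polyorogenic terrane of NW Iberia")

stripped_start = [s for t, s, e in tokens_after_strip(title) if t == "fabrics"][0]
original_start = [s for t, s, e in tokens_with_original_offsets(title) if t == "fabrics"][0]

# The difference is the combined length of the four tags: 5 + 6 + 5 + 6.
print(original_start - stripped_start)  # 22
```

A highlighter that applies the first function's offsets to the stored (unstripped) field value would place the <em> tags 22 characters too far left, which matches the report. The second function yields the same token text but with offsets that index directly into the original title.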