[ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556740#action_12556740 ]
Yonik Seeley commented on SOLR-42:
----------------------------------

{quote}i just have to wonder if there is a special marker character that could be used instead of whitespace{quote}

For a hack, not a bad idea... There could be a TokenFilter that removes any such characters in tokens, and it could even be automatically used by Tokenizers that use the html strip reader.

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java,
> SOLR-42.patch, SOLR-42.patch, SOLR-42.patch
>
>
> Indexing content that contains HTML markup, causes problems with highlighting
> if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names
> from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a
> polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has
> the <em> tags in the wrong place - 22 characters to the left of where they
> should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this? The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
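
For anyone following the issue, the offset drift in the description can be reproduced outside Solr with a rough Python sketch. This is not the HTMLStripReader code and not the attached patch: the regex tag stripper, the helper names (`strip_with_offset_map`, `whitespace_tokens`) and the per-character offset map are all assumptions made for illustration. It shows that token offsets computed on the stripped text land 22 characters too far left in the original title, and that carrying a map back to the original positions is one way to correct them.

```python
import re

# Crude stand-in for an HTML stripper (an assumption, not Solr's HTMLStripReader).
TAG = re.compile(r"<[^>]*>")


def strip_with_offset_map(text):
    """Remove tags and return (stripped, offsets).

    offsets[i] is the index in the original `text` of stripped[i].
    """
    out, offsets, pos = [], [], 0
    for m in TAG.finditer(text):
        # Copy the characters between the previous tag and this one.
        for i in range(pos, m.start()):
            out.append(text[i])
            offsets.append(i)
        pos = m.end()
    # Copy whatever follows the last tag.
    for i in range(pos, len(text)):
        out.append(text[i])
        offsets.append(i)
    return "".join(out), offsets


def whitespace_tokens(text):
    """Split on whitespace, keeping (term, start, end) offsets into `text`."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


title = ("<SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic "
         "fabrics in a polyorogenic terrane of NW Iberia")

stripped, offsets = strip_with_offset_map(title)
term, start, end = next(t for t in whitespace_tokens(stripped) if t[0] == "fabrics")

# Offsets from the stripped text applied directly to the original: wrong span.
naive = title[start:end]

# Offsets mapped back to the original text: the span the highlighter needs.
corrected_start = offsets[start]
corrected_end = offsets[end - 1] + 1
fixed = title[corrected_start:corrected_end]

# Number of characters the naive offsets are shifted to the left.
drift = corrected_start - start
```

Here `drift` equals the combined length of the four tags that precede "fabrics", which matches the 22 characters reported above. A per-character map is the simplest thing that works for a demonstration; a real reader would more likely record only the positions where the cumulative shift changes.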