[ https://issues.apache.org/jira/browse/SOLR-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627153#comment-13627153 ]
Steve Rowe commented on SOLR-4686:
----------------------------------

Hi Holger,

I wrote the latest version of HTMLStripCharFilter, and the behavior you describe is expected, though obviously not good.

The problem is that when a CharFilter replaces an input sequence with a differently-sized output sequence, it has to decide how to map the offsets back. All of the CharFilters I've looked at map the end offsets for smaller output sequences to the end offset of the larger input sequence. I suppose a CharFilter could make different choices, though, as long as it did so consistently.

HTMLStripCharFilter could change offset mappings for end tags to point at the offset of the *beginning* of the input sequence, while keeping offset mappings for start tags the same as they are now for all tags: at the offset of the *end* of the input sequence. {{<a>xxx</a>}} would then be highlighted as {{<a><em>xxx</em></a>}}.

But "fixing" this one issue won't solve the general problem. An example: if HTMLStripCharFilter were to change offset mappings for end tags as described above, {{<b>x</b><i>xx</i>}} would still result in {{<b><em>x</b><i>xx</em></i>}}, which is problematic in a way that modifications to HTMLStripCharFilter can't fix.

It's worth noting that HTMLTidy can fix up your example, but doesn't properly handle my example. I tested with the cmdline version on OS X.

My surface reading of the Highlighter and Formatter classes makes me think that there is no natural plugin point right now for an HTML-aware boundary insertion mechanism.

I suspect that the low complaint volume to date is a result of the lenient HTML parsing browsers do; even though the output HTML is invalid, it (usually?) looks okay anyway.
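To make the offset arithmetic in the comment above concrete, here is a small Python sketch. It is a toy model written for illustration only, not the Lucene implementation: the names `strip_with_offsets`, `highlight`, and `end_tags_map_to_start` are hypothetical, tags are recognized with a naive regular expression, and the highlighter simply wraps the mapped offsets in {{<em>}} tags. It assumes, as described above, that every output offset maps back to the end of any stripped tags that precede it, and that the proposed variant leaves the offset at the beginning of an end tag instead.

```python
import re

# Naive tag matcher; the capturing group makes re.split keep the tags.
TAG = re.compile(r"(<[^>]+>)")

def strip_with_offsets(html, end_tags_map_to_start=False):
    """Strip tags and return (text, offsets).

    offsets[k] is the input offset that output offset k maps back to.
    """
    text, offsets = [], []
    in_pos = 0  # current position in the original input
    cur = 0     # input offset the current output boundary maps to
    for i, piece in enumerate(TAG.split(html)):
        if i % 2 == 1:  # odd pieces are the captured tags
            in_pos += len(piece)
            is_end_tag = piece.startswith("</")
            # Default: the boundary moves to the end of the stripped tag.
            # Proposed variant: an end tag leaves the boundary where it was.
            if not (is_end_tag and end_tags_map_to_start):
                cur = in_pos
        else:
            for ch in piece:
                offsets.append(cur)
                text.append(ch)
                in_pos += 1
                cur = in_pos
    offsets.append(cur)  # mapping for the offset just past the last char
    return "".join(text), offsets

def highlight(html, term, **kwargs):
    """Wrap each occurrence of term in <em>, using the mapped offsets."""
    text, offsets = strip_with_offsets(html, **kwargs)
    out = html
    # Work right to left so earlier offsets stay valid after insertion.
    for m in reversed(list(re.finditer(re.escape(term), text))):
        start, end = offsets[m.start()], offsets[m.end()]
        out = out[:start] + "<em>" + out[start:end] + "</em>" + out[end:]
    return out

print(highlight("<a>xxx</a>", "xxx"))
print(highlight("<a>xxx</a>", "xxx", end_tags_map_to_start=True))
print(highlight("<b>x</b><i>xx</i>", "xxx", end_tags_map_to_start=True))
```

Under these assumptions the first call reproduces the reported {{<a><em>xxx</a></em>}} (offsets 3 and 10), the second gives the well-formed {{<a><em>xxx</em></a>}}, and the third still yields {{<b><em>x</b><i>xx</em></i>}}, which is the general problem that an offset-mapping change alone cannot fix.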

> HTMLStripCharFilter and Highlighter generates invalid HTML
> ----------------------------------------------------------
>
> Key: SOLR-4686
> URL: https://issues.apache.org/jira/browse/SOLR-4686
> Project: Solr
> Issue Type: Bug
> Components: highlighter
> Affects Versions: 4.1
> Reporter: Holger Floerke
> Labels: HTML, highlighter
>
> Using the HTMLStripCharFilter may yield an invalid HTML highlight.
> The HTMLStripCharFilter has a special treatment of inline elements (e.g. "a",
> "b", ...). For these elements the CharFilter ignores the tag and does not
> insert any split-character.
> If you index
> """
> <a>xxx</a>
> """
> you get the word "xxx" starting at position 3 ending on position 10(!)
> If you highlight a search on "xxx", you will get
> """
> <a><em>xxx</a></em>
> """
> which is invalid HTML.