[
https://issues.apache.org/jira/browse/SOLR-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627191#comment-13627191
]
Steve Rowe commented on SOLR-4686:
----------------------------------
I've read that the [Jericho HTML
parser|http://jericho.htmlparser.net/docs/index.html], implemented in Java,
reports tag offsets, unlike many other HTML parsers, and that could be useful
in implementing the HTML-aware boundary insertion mechanism I mentioned
earlier.
> HTMLStripCharFilter and Highlighter generates invalid HTML
> ----------------------------------------------------------
>
> Key: SOLR-4686
> URL: https://issues.apache.org/jira/browse/SOLR-4686
> Project: Solr
> Issue Type: Bug
> Components: highlighter
> Affects Versions: 4.1
> Reporter: Holger Floerke
> Labels: HTML, highlighter
>
> Using the HTMLStripCharFilter may yield an invalid HTML highlight.
> The HTMLStripCharFilter treats inline elements (e.g. "a", "b", ...)
> specially: for these elements the CharFilter ignores the tag and does not
> insert any split character.
> If you index
> """
> <a>xxx</a>
> """
> you get the word "xxx" starting at position 3 and ending at position 10(!).
> If you highlight a search on "xxx", you will get
> """
> <a><em>xxx</a></em>
> """
> which is invalid HTML.
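
To illustrate the report above, here is a minimal sketch in plain Java of what happens when highlight tags are inserted at the reported offsets. It uses no Lucene or Solr classes; the class and method names are hypothetical, and it assumes the highlighter simply wraps the [start, end) range of the original text in <em>...</em>, which is a simplification of what the real highlighter does.

```java
// Hypothetical illustration only: not Solr's highlighter, just offset arithmetic.
class HighlightOffsets {

    // Wrap the character range [start, end) of the original text in <em> tags.
    static String wrap(String text, int start, int end) {
        return text.substring(0, start)
                + "<em>" + text.substring(start, end) + "</em>"
                + text.substring(end);
    }

    public static void main(String[] args) {
        String html = "<a>xxx</a>";

        // Offsets as reported in the issue: the end offset (10) covers "</a>".
        System.out.println(wrap(html, 3, 10)); // prints <a><em>xxx</a></em>

        // If the end offset stopped right after "xxx" (6), nesting would be valid.
        System.out.println(wrap(html, 3, 6));  // prints <a><em>xxx</em></a>
    }
}
```

The second call shows that the markup nests correctly as soon as the end offset stops before the closing tag, so the problem lies in the reported offsets rather than in the tag insertion itself.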