[ 
https://issues.apache.org/jira/browse/SOLR-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627191#comment-13627191
 ] 

Steve Rowe commented on SOLR-4686:
----------------------------------

I've read that the [Jericho HTML 
parser|http://jericho.htmlparser.net/docs/index.html], implemented in Java, 
reports tag offsets, unlike many other HTML parsers, and that could be useful 
in implementing the HTML-aware boundary insertion mechanism I mentioned 
earlier.  
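
To illustrate why tag offsets matter here, below is a minimal sketch in plain Java, using the <a>xxx</a> case from the issue description. It is not Solr's highlighter code and it does not use Jericho: the class name, both method names, and the regex that stands in for a parser reporting tag offsets are assumptions made for illustration only. It first wraps the reported offsets (3 to 10) directly, then wraps them again after pulling the end offset back in front of any tag that ends exactly there.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; none of these names exist in Solr or Lucene.
final class HighlightOffsetSketch {

    // Crude stand-in for a real HTML parser that reports tag offsets.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /** Wraps the range [start, end) in <em>...</em> with no regard for markup. */
    static String naiveWrap(String html, int start, int end) {
        return html.substring(0, start)
                + "<em>" + html.substring(start, end) + "</em>"
                + html.substring(end);
    }

    /**
     * Moves the end offset back past any tags that end exactly at it
     * (and begin at or after start), then wraps the shortened range.
     */
    static String tagAwareWrap(String html, int start, int end) {
        int adjustedEnd = end;
        boolean moved = true;
        while (moved) {
            moved = false;
            Matcher m = TAG.matcher(html);
            while (m.find()) {
                if (m.end() == adjustedEnd && m.start() >= start) {
                    adjustedEnd = m.start(); // exclude this trailing tag
                    moved = true;
                    break;
                }
            }
        }
        return naiveWrap(html, start, adjustedEnd);
    }

    public static void main(String[] args) {
        String html = "<a>xxx</a>";
        // Offsets 3..10 are the ones given in the issue description.
        System.out.println(naiveWrap(html, 3, 10));    // <a><em>xxx</a></em>
        System.out.println(tagAwareWrap(html, 3, 10)); // <a><em>xxx</em></a>
    }
}
```

The sketch only handles tags trailing the highlighted term. Tags inside a multi-word match, or leading tags, would need the fuller boundary handling that a parser with real tag offsets could support.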
                
> HTMLStripCharFilter and Highlighter generates invalid HTML
> ----------------------------------------------------------
>
>                 Key: SOLR-4686
>                 URL: https://issues.apache.org/jira/browse/SOLR-4686
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>    Affects Versions: 4.1
>            Reporter: Holger Floerke
>              Labels: HTML, highlighter
>
> Using the HTMLStripCharFilter may yield invalid HTML in the highlighted output.
> The HTMLStripCharFilter treats inline elements (e.g. "a", "b", ...) 
> specially. For these elements the CharFilter ignores the tag and does not 
> insert any split character.
> If you index
> """
> <a>xxx</a>
> """
> you get the word "xxx" starting at position 3 and ending at position 10(!).
> If you highlight a search on "xxx", you will get
> """
> <a><em>xxx</a></em>
> """
> which is invalid HTML.

