[ 
https://issues.apache.org/jira/browse/SOLR-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627548#comment-13627548
 ] 

Holger Floerke commented on SOLR-4686:
--------------------------------------

Hi Steve,

thanks for your quick comments. 

"""
My surface reading of Highlighter and Formatter classes makes me think that 
there is no natural plugin point right now for an HTML-aware boundary insertion 
mechanism. 
"""
Do you think the highlighter/formatter has the problem, or are the offsets of 
the HTMLStripCharFilter the problem? This question may be philosophical, but in 
my opinion the HTMLStripCharFilter is responsible for writing the correct 
offsets. This isn't easy, because the filter has to "understand" the structure 
and modify start and end positions in certain cases, but I expect the problems 
to grow as more people produce XHTML output with the highlighter.

In my case, I use HTMLStripCharFilter to normalize XML input, so I would be 
happy with a switch such as "do not treat inline elements specially". 

                
> HTMLStripCharFilter and Highlighter generates invalid HTML
> ----------------------------------------------------------
>
>                 Key: SOLR-4686
>                 URL: https://issues.apache.org/jira/browse/SOLR-4686
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>    Affects Versions: 4.1
>            Reporter: Holger Floerke
>              Labels: HTML, highlighter
>
> Using the HTMLStripCharFilter may lead to an invalid HTML highlight.
> The HTMLStripCharFilter has a special treatment of inline elements (e.g. "a", 
> "b", ...). For these elements the CharFilter ignores the tag and does not 
> insert any split character.
> If you index
> """
> <a>xxx</a>
> """
> you get the word "xxx" starting at position 3 and ending at position 10(!).
> If you highlight a search on "xxx", you will get
> """
> <a><em>xxx</a></em>
> """
> which is invalid HTML.
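
The effect described above can be reproduced without Solr. The following is a minimal sketch in plain Java, not the actual Highlighter or Formatter code: the class name OffsetHighlightDemo and the helper wrap are hypothetical, and the only assumption is that a highlighter inserts its pre and post tags at the start and end offsets it is given. With the reported offsets (3, 10) the closing </em> lands after </a>; with an end offset of 6, which excludes the closing tag, the output is properly nested.

```java
// Hypothetical illustration only; this does not use or mirror any Lucene/Solr API.
class OffsetHighlightDemo {

    // Wraps source[start, end) in the given pre/post tags,
    // the way an offset-based highlighter would.
    static String wrap(String source, int start, int end, String pre, String post) {
        return source.substring(0, start)
                + pre
                + source.substring(start, end)
                + post
                + source.substring(end);
    }

    public static void main(String[] args) {
        String source = "<a>xxx</a>";

        // Offsets as reported in the issue: the end offset includes "</a>".
        System.out.println(wrap(source, 3, 10, "<em>", "</em>"));
        // prints <a><em>xxx</a></em>

        // End offset that stops before the closing tag.
        System.out.println(wrap(source, 3, 6, "<em>", "</em>"));
        // prints <a><em>xxx</em></a>
    }
}
```

Whether the fix belongs in the offsets the CharFilter reports or in the way the highlighter places its tags is the open question in this issue; the sketch only shows that the end offset alone decides whether the markup nests correctly.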

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

