[ 
https://issues.apache.org/jira/browse/SOLR-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13628645#comment-13628645
 ] 

Steve Rowe commented on SOLR-4686:
----------------------------------

bq. Do you think the highlighter/formatter has a problem, or are the offsets 
of the HTMLStripCharFilter the problem? 

The existing HTML formatters try to insert start and end tags without being 
aware of the structure into which they're inserting, and this is a problem when 
the existing intervening markup is not balanced.
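
To illustrate what "aware of the structure" could mean, here is a minimal 
Python sketch. It is not Solr or Lucene code: the function names and the naive 
tag regex are hypothetical, chosen only for illustration. Instead of inserting 
one start tag and one end tag at the raw offsets, it wraps only the text runs 
between tags that fall inside the match range, so existing markup is never 
crossed.

```python
import re

# Assumption: a deliberately naive tag matcher (no comments, CDATA, or '>'
# inside attribute values). Real HTML needs a proper parser.
TAG = re.compile(r"<[^>]+>")


def highlight_text_runs(markup, start, end, pre="<em>", post="</em>"):
    """Wrap only the text runs inside [start, end) in pre/post (hypothetical helper)."""
    out = []
    pos = 0
    for m in TAG.finditer(markup):
        out.append(_wrap(markup, pos, m.start(), start, end, pre, post))
        out.append(m.group())  # copy the existing tag through untouched
        pos = m.end()
    out.append(_wrap(markup, pos, len(markup), start, end, pre, post))
    return "".join(out)


def _wrap(markup, run_start, run_end, start, end, pre, post):
    """Return one text run, wrapping the part that overlaps [start, end)."""
    lo = max(run_start, start)
    hi = min(run_end, end)
    if lo >= hi:  # no overlap with the match range
        return markup[run_start:run_end]
    return markup[run_start:lo] + pre + markup[lo:hi] + post + markup[hi:run_end]


# Offsets 3..10 are the ones reported in the issue for '<a>xxx</a>'.
print(highlight_text_runs("<a>xxx</a>", 3, 10))      # <a><em>xxx</em></a>
print(highlight_text_runs("xxx <b>yyy</b>", 0, 14))  # <em>xxx </em><b><em>yyy</em></b>
```

The trade-off in this sketch is that a phrase spanning inline markup is 
emitted as several adjacent highlight fragments rather than one.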

As I mentioned in my previous comment, I think HTMLStripCharFilter could behave 
differently with end tags and improve output for your example, but I can think 
of examples where the current behavior works and changing it would break it, 
e.g. highlighting the phrase "xxx yyy", where the original markup is 
'xxx <b>yyy</b>'. This currently works well: '<em>xxx <b>yyy</b></em>', but 
would be imbalanced if end tag offsets were changed in the way I suggested: 
'<em>xxx <b>yyy</em></b>'.  So on balance, I'm disinclined to make any changes.
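
The trade-off can be reproduced with a small Python sketch. This is an 
illustration under stated assumptions, not the Solr highlighter: `insert_tags` 
and `is_balanced` are hypothetical helpers, and the offsets are worked out by 
hand from the two examples in this thread (end offset after the end tag 
versus before it).

```python
import re

# Assumption: a naive matcher for simple start/end tags; it does not handle
# self-closing tags, comments, or '>' inside attribute values.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")


def insert_tags(markup, start, end, pre="<em>", post="</em>"):
    """Insert pre/post at raw offsets, ignoring any markup in between."""
    return markup[:start] + pre + markup[start:end] + post + markup[end:]


def is_balanced(markup):
    """Check that every end tag closes the most recently opened tag."""
    stack = []
    for closing, name in TAG.findall(markup):
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack


# (markup, start, end offset after the end tag, end offset before the end tag)
cases = [
    ("<a>xxx</a>", 3, 10, 6),       # single term inside an inline element
    ("xxx <b>yyy</b>", 0, 14, 10),  # phrase "xxx yyy" ending inside <b>
]

for markup, start, end_after, end_before in cases:
    for label, end in (("after end tag", end_after), ("before end tag", end_before)):
        result = insert_tags(markup, start, end)
        print(f"{label:15} {result:28} balanced={is_balanced(result)}")
```

In this sketch each end-offset policy balances one example and breaks the 
other, which is the reason neither choice is a clear improvement.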

bq. In my case, I use HTMLStripCharFilter to normalize XML input, therefore I 
would be happy about a switch "do not treat inline elements".

Have you seen the XmlCharFilter on SOLR-2597?

                
> HTMLStripCharFilter and Highlighter generates invalid HTML
> ----------------------------------------------------------
>
>                 Key: SOLR-4686
>                 URL: https://issues.apache.org/jira/browse/SOLR-4686
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>    Affects Versions: 4.1
>            Reporter: Holger Floerke
>              Labels: HTML, highlighter
>
> Using the HTMLStripCharFilter may yield an invalid HTML highlight.
> The HTMLStripCharFilter has a special treatment of inline elements (e.g. "a", 
> "b", ...). For these elements the CharFilter ignores the tag and does not 
> insert any split character.
> If you index
> """
> <a>xxx</a>
> """
> you get the word "xxx" starting at position 3 and ending at position 10(!).
> If you highlight a search on "xxx", you will get
> """
> <a><em>xxx</a></em>
> """
> which is invalid HTML.

