[ 
https://issues.apache.org/jira/browse/SOLR-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627153#comment-13627153
 ] 

Steve Rowe commented on SOLR-4686:
----------------------------------

Hi Holger,

I wrote the latest version of HTMLStripCharFilter, and the behavior you 
describe is expected, though obviously not good.

The problem is that when a CharFilter replaces an input sequence with a 
differently-sized output sequence, it has to decide how to map the offsets 
back.  All of the CharFilters I've looked at map the end offsets for smaller 
output sequences to the end offset of the larger input sequence.  I suppose a 
CharFilter could make different choices, though, as long as it did so 
consistently.
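
To make the offset arithmetic concrete, here is a minimal Python sketch of a tag-stripping filter that records cumulative offset corrections and maps output offsets back to the input.  It is a toy under stated assumptions, not Lucene's code: the regex {{TAG}} and the names {{strip_tags}}, {{correct_offset}} and {{highlight}} are invented for illustration, and the real HTMLStripCharFilter handles far more than a regex can.

```python
import re
from bisect import bisect_right

# Assumption: a tag is anything between '<' and '>' (no comments, entities, etc.).
TAG = re.compile(r"<[^>]*>")

def strip_tags(text):
    """Return (stripped_text, corrections).

    corrections is a list of (output_offset, cumulative_diff) pairs: every
    output offset >= output_offset is shifted by cumulative_diff when it is
    mapped back to the original text.
    """
    out, corrections = [], []
    out_len = cum = pos = 0
    for m in TAG.finditer(text):
        chunk = text[pos:m.start()]
        out.append(chunk)
        out_len += len(chunk)
        cum += m.end() - m.start()
        if corrections and corrections[-1][0] == out_len:
            corrections[-1] = (out_len, cum)   # adjacent tags share one entry
        else:
            corrections.append((out_len, cum))
        pos = m.end()
    out.append(text[pos:])
    return "".join(out), corrections

def correct_offset(corrections, off):
    """Map an offset in the stripped text back to the original text."""
    idx = bisect_right([o for o, _ in corrections], off) - 1
    return off if idx < 0 else off + corrections[idx][1]

def highlight(text, term):
    """Wrap the first occurrence of term (found in the stripped text) in <em>."""
    stripped, corrections = strip_tags(text)
    i = stripped.find(term)
    start = correct_offset(corrections, i)
    end = correct_offset(corrections, i + len(term))
    return text[:start] + "<em>" + text[start:end] + "</em>" + text[end:]

print(highlight("<a>xxx</a>", "xxx"))
```

With the input {{<a>xxx</a>}} this sketch maps the token to start offset 3 and end offset 10, because the end offset lands after the stripped {{</a>}}, and so the closing {{</em>}} is placed after the end tag.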

HTMLStripCharFilter could change offset mappings for end tags to point at the 
offset of the *beginning* of the input sequence, while keeping offset mappings 
for start tags the same as they are now for all tags: at the offset of the 
*end* of the input sequence.  {{<a>xxx</a>}} would then be highlighted as 
{{<a><em>xxx</em></a>}}.
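
A rough Python sketch of that alternative mapping, again as a toy rather than anything in Lucene: a stripped start tag shifts the offset where it was removed (so that offset maps to the tag's end), while a stripped end tag only shifts offsets strictly after it (so the offset where it was removed maps to the tag's beginning).  The names {{strip_tags_end_tag_aware}}, {{correct_offset}} and {{highlight}} are hypothetical, and a mixed run such as an end tag immediately followed by a start tag is resolved here in favor of the start tag, which is only one of several possible choices.

```python
import re

# Assumption: group 1 is "/" for an end tag and "" for a start tag.
TAG = re.compile(r"<(/?)[^>]*>")

def strip_tags_end_tag_aware(text):
    """Strip tags, recording (output_offset, cumulative_diff) corrections.

    Start tags record their correction at the current output offset;
    end tags record it one position later, so an offset that sits right
    before an end tag is not pushed past that tag.
    """
    out, corrections = [], []
    out_len = cum = pos = 0
    for m in TAG.finditer(text):
        chunk = text[pos:m.start()]
        out.append(chunk)
        out_len += len(chunk)
        cum += len(m.group(0))
        is_end_tag = m.group(1) == "/"
        corrections.append((out_len + 1 if is_end_tag else out_len, cum))
        pos = m.end()
    out.append(text[pos:])
    return "".join(out), corrections

def correct_offset(corrections, off):
    """Shift off by the largest cumulative diff recorded at or before it."""
    return off + max((d for o, d in corrections if o <= off), default=0)

def highlight(text, term):
    """Wrap the first occurrence of term (found in the stripped text) in <em>."""
    stripped, corrections = strip_tags_end_tag_aware(text)
    i = stripped.find(term)
    start = correct_offset(corrections, i)
    end = correct_offset(corrections, i + len(term))
    return text[:start] + "<em>" + text[start:end] + "</em>" + text[end:]

print(highlight("<a>xxx</a>", "xxx"))
print(highlight("<b>x</b><i>xx</i>", "xxx"))
```

The first call yields properly nested output, while the second input still produces an {{<em>}} that crosses the {{</b>}} and {{<i>}} tags inside the token.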

But "fixing" this one issue won't solve the general problem.  An example: if 
HTMLStripCharFilter were to change offset mappings for end tags as described 
above, {{<b>x</b><i>xx</i>}} would still result in 
{{<b><em>x</b><i>xx</em></i>}}, which is problematic in a way that 
modifications to HTMLStripCharFilter can't fix.
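
One way to see why that output is problematic is a simple stack-based nesting check.  The Python sketch below is only an illustration: {{is_properly_nested}} is a hypothetical helper, it ignores void elements such as {{<br>}}, and it is much stricter than what browsers actually accept.

```python
import re

# Assumption: tags look like <name ...> or </name>; comments, void elements
# and self-closing syntax are not handled in this toy check.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

def is_properly_nested(html):
    """Return True if every end tag closes the most recently opened element."""
    stack = []
    for m in TAG.finditer(html):
        closing = m.group(1) == "/"
        name = m.group(2).lower()
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack

for candidate in ("<a><em>xxx</a></em>",
                  "<a><em>xxx</em></a>",
                  "<b><em>x</b><i>xx</em></i>"):
    print(candidate, is_properly_nested(candidate))
```

Only the second string passes: in the other two, an end tag arrives while {{<em>}} is still the innermost open element.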

It's worth noting that HTMLTidy can fix up your example, but doesn't properly 
handle my example - I tested with the cmdline version on OS X.

My surface reading of Highlighter and Formatter classes makes me think that 
there is no natural plugin point right now for an HTML-aware boundary insertion 
mechanism.  
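
For illustration only, this is roughly what I mean by HTML-aware boundary insertion, sketched in Python outside of any Lucene or Solr API.  The function {{highlight_html_aware}} and its parameters are hypothetical; it assumes the start and end offsets refer to the original text and never fall inside a tag, and it closes and reopens the highlight around every tag inside the span.

```python
import re

# Assumption: a tag is anything between '<' and '>'.
TAG = re.compile(r"<[^>]*>")

def highlight_html_aware(text, start, end, pre="<em>", post="</em>"):
    """Highlight text[start:end], wrapping each run of text between tags
    separately so that the inserted markup never crosses a tag boundary."""
    out = [text[:start]]
    pos = start
    for m in TAG.finditer(text, start, end):
        if m.start() > pos:
            out.append(pre + text[pos:m.start()] + post)
        out.append(m.group(0))      # tags are copied through, never wrapped
        pos = m.end()
    if pos < end:
        out.append(pre + text[pos:end] + post)
    out.append(text[end:])
    return "".join(out)

# Offsets as they would come out of the two mappings discussed above.
print(highlight_html_aware("<a>xxx</a>", 3, 10))
print(highlight_html_aware("<b>x</b><i>xx</i>", 3, 13))
```

The cost of this approach is that one hit can turn into several highlight fragments, which a consumer of the highlighted output would have to tolerate.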

I suspect that the low complaint volume to date is a result of the lenient 
HTML parsing browsers do; even though the output HTML is invalid, it (usually?) 
looks okay anyway.
                
> HTMLStripCharFilter and Highlighter generates invalid HTML
> ----------------------------------------------------------
>
>                 Key: SOLR-4686
>                 URL: https://issues.apache.org/jira/browse/SOLR-4686
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>    Affects Versions: 4.1
>            Reporter: Holger Floerke
>              Labels: HTML, highlighter
>
> Using the HTMLStripCharFilter may yield an invalid HTML highlight.
> The HTMLStripCharFilter has a special treatment of inline elements (e.g. "a", 
> "b", ...). For these elements the CharFilter ignores the tag and does not 
> insert any split character.
> If you index
> """
> <a>xxx</a>
> """
> you get the word "xxx" starting at position 3 ending on position 10(!) 
> If you highlight a search on "xxx", you will get
> """
> <a><em>xxx</a></em>
> """
> which is invalid HTML.

