[jira] Updated: (SOLR-1394) HTML stripper is splitting tokens

Anders Melchiorsen (JIRA) Wed, 14 Oct 2009 02:46:59 -0700

     [ 
https://issues.apache.org/jira/browse/SOLR-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Anders Melchiorsen updated SOLR-1394:
-------------------------------------

    Description: 
The Solr HTML stripper is replacing any removed HTML with whitespace. This is 
to keep offsets correct for highlighting.

However, as was already pointed out in SOLR-42, this means that any token 
containing an HTML entity will be split into several tokens. That makes the 
HTML stripper completely unreliable for international text (and any text is 
potentially international).

The current code is actually deficient for BOTH highlighting and indexing, 
where the previous incarnation (that did not insert spaces) only had problems 
with highlighting.

The only workaround is to not use entities at all, which is impossible in some 
situations and inconvenient in most situations. If the client is required to 
transform entities before handing it to Solr, it might as well be required to 
also strip tags, and then the HTML stripper would not be needed at all.

Today, we have a better solution that can be used: offset correction. We can 
then avoid inserting extra whitespace, but still get correct offsets. The 
attached patch implements just that.
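
To illustrate the difference, here is a minimal sketch in Python. It is not the attached patch and not Solr's code: the two-entry entity table and the names strip_with_padding, strip_with_offsets and correct_offset are hypothetical, chosen only for this example. It shows how space padding splits "G&uuml;nther" into two tokens, while recording offset corrections keeps the token whole and still maps its end back to the original text.

```python
import re

# Tiny entity table for illustration only (an assumption, not Solr's table).
ENTITIES = {"&uuml;": "ü", "&amp;": "&"}
ENTITY_RE = re.compile(r"&[a-z]+;")


def strip_with_padding(text):
    """Replace each entity by its character, padded with spaces so the
    output has the same length as the input (the behaviour reported here)."""
    def repl(match):
        entity = match.group(0)
        char = ENTITIES.get(entity, entity)
        return char + " " * (len(entity) - len(char))
    return ENTITY_RE.sub(repl, text)


def strip_with_offsets(text):
    """Replace each entity without padding and record offset corrections.

    Returns (stripped, corrections); corrections is a list of
    (position_in_stripped_text, cumulative_difference) pairs.
    """
    parts, corrections = [], []
    pos = out_len = diff = 0
    for match in ENTITY_RE.finditer(text):
        entity = match.group(0)
        char = ENTITIES.get(entity, entity)
        chunk = text[pos:match.start()]
        parts.extend([chunk, char])
        out_len += len(chunk) + len(char)
        diff += len(entity) - len(char)
        # From this output position onward, offsets are shifted by diff.
        corrections.append((out_len, diff))
        pos = match.end()
    parts.append(text[pos:])
    return "".join(parts), corrections


def correct_offset(offset, corrections):
    """Map an offset in the stripped text back to the original text."""
    diff = 0
    for start, cumulative in corrections:
        if offset < start:
            break
        diff = cumulative
    return offset + diff


original = "G&uuml;nther"
padded = strip_with_padding(original)
stripped, corrections = strip_with_offsets(original)

print(padded.split())                              # whitespace tokens after padding
print(stripped.split())                            # whitespace tokens without padding
print(correct_offset(len(stripped), corrections))  # token end mapped to the original
```

With padding, a whitespace tokenizer sees two tokens; with offset correction it sees one, and the corrected end offset equals the length of the original string, so a highlighter could still mark the full "G&uuml;nther" span.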


  was:
I am having problems with the Solr HTML stripper.

After some investigation, I have found the cause to be that the
stripper is replacing the removed HTML with spaces. This obviously
breaks when the HTML is in the middle of a word, like "G&uuml;nther".

So, without knowing what I was doing, I hacked together a fix that
uses offset correction instead.

That seemed to work, except that closing tags and attributes still
broke the positioning. With even less of a clue, I replaced read()
with next() in the two methods handling those.

Finally, invalid HTML also gave wrong offsets, and I fixed that by
restoring numRead when rolling back the input stream.

At this point I stopped trying to break it, so there may still be more
problems. Or I might have introduced some problem on my own. Anyway, I
have put the three patches at the bottom of this mail, in case
somebody wants to move along with this issue.



Clarified the description; the original report was just a pasted e-mail.


> HTML stripper is splitting tokens
> ---------------------------------
>
>                 Key: SOLR-1394
>                 URL: https://issues.apache.org/jira/browse/SOLR-1394
>             Project: Solr
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.4
>            Reporter: Anders Melchiorsen
>         Attachments: SOLR-1394.patch, SOLR-1394.patch
>
>
> The Solr HTML stripper is replacing any removed HTML with whitespace. This is 
> to keep offsets correct for highlighting.
> However, as was already pointed out in SOLR-42, this means that any token 
> containing an HTML entity will be split into several tokens. That makes the 
> HTML stripper completely unreliable for international text (and any text is 
> potentially international).
> The current code is actually deficient for BOTH highlighting and indexing, 
> where the previous incarnation (that did not insert spaces) only had problems 
> with highlighting.
> The only workaround is to not use entities at all, which is impossible in 
> some situations and inconvenient in most situations. If the client is 
> required to transform entities before handing it to Solr, it might as well be 
> required to also strip tags, and then the HTML stripper would not be needed 
> at all.
> Today, we have a better solution that can be used: offset correction. We can 
> then avoid inserting extra whitespace, but still get correct offsets. The 
> attached patch implements just that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
