[ 
https://issues.apache.org/jira/browse/SOLR-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16624552#comment-16624552
 ] 

Mike Sokolov commented on SOLR-1394:
------------------------------------

I'm pretty sure this issue is no longer valid. I haven't used this code 
actively for a while, so I can't speak with great authority, but I did look at 
some related issues, and it seems that LUCENE-3690 made major improvements to 
entity handling quite a while ago, so perhaps this can simply be closed as fixed?

> HTML stripper is splitting tokens
> ---------------------------------
>
>                 Key: SOLR-1394
>                 URL: https://issues.apache.org/jira/browse/SOLR-1394
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>            Reporter: Anders Melchiorsen
>            Priority: Major
>             Fix For: 1.4
>
>         Attachments: SOLR-1394.patch, SOLR-1394.patch, hex-entity.patch
>
>
> The Solr HTML stripper is replacing any removed HTML with whitespace. This is 
> to keep offsets correct for highlighting.
> However, as was already pointed out in SOLR-42, this means that any token 
> containing an HTML entity will be split into several tokens. That makes the 
> HTML stripper completely unreliable for international text (and any text is 
> potentially international).
> The current code is actually deficient for BOTH highlighting and indexing, 
> where the previous incarnation (that did not insert spaces) only had problems 
> with highlighting.
> The only workaround is to not use entities at all, which is impossible in 
> some situations and inconvenient in most situations. If the client is 
> required to transform entities before handing the text to Solr, it might as 
> well be required to also strip tags, and then the HTML stripper would not be 
> needed at all.
> Today, we have a better solution that can be used: offset correction. We can 
> then avoid inserting extra whitespace, but still get correct offsets. The 
> attached patch implements just that.
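
For readers unfamiliar with the two approaches contrasted in the description
above, here is a minimal sketch in Python. It is not the attached patch and
not Solr's or Lucene's actual code. The function names, the entity regex, and
the sample text are all assumptions made for illustration, and the padding
variant is only an approximation of the behaviour the reporter describes.

```python
import re
from html import unescape

# Hypothetical entity pattern: named, decimal, and hex character references.
ENTITY = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def strip_with_padding(text):
    """Decode each entity, then pad with spaces to the entity's original length.

    Offsets stay aligned with the input, but the padding splits tokens.
    """
    return ENTITY.sub(
        lambda m: unescape(m.group(0)).ljust(len(m.group(0))), text
    )


def strip_with_offset_correction(text):
    """Decode entities without padding and record offset corrections.

    Returns the stripped text plus a list of (output_offset, cumulative_diff)
    pairs: from output_offset onward, add cumulative_diff to get back to the
    matching position in the original input.
    """
    out, corrections = [], []
    pos = out_len = diff = 0
    for m in ENTITY.finditer(text):
        chunk = text[pos:m.start()]
        decoded = unescape(m.group(0))
        out.extend((chunk, decoded))
        out_len += len(chunk) + len(decoded)
        diff += len(m.group(0)) - len(decoded)
        corrections.append((out_len, diff))
        pos = m.end()
    out.append(text[pos:])
    return "".join(out), corrections


def correct_offset(corrections, offset):
    """Map an offset in the stripped text back to the original input."""
    diff = 0
    for start, cumulative in corrections:
        if start > offset:
            break
        diff = cumulative
    return offset + diff


if __name__ == "__main__":
    original = "R&oslash;dgr&oslash;d med fl&oslash;de"

    # Padding keeps the length but breaks words apart at every entity.
    print(strip_with_padding(original).split())

    # Offset correction keeps words whole and still locates them in the input.
    stripped, corrections = strip_with_offset_correction(original)
    start = stripped.index("fl")
    end = len(stripped)
    print(stripped.split())
    print(original[correct_offset(corrections, start):
                   correct_offset(corrections, end)])
```

With padding, whitespace tokenization of the sample yields six fragments
instead of three words. With offset correction the three words survive intact,
and a highlighter can still map a token's start and end back to the matching
span of the original markup.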



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
