[ 
https://issues.apache.org/jira/browse/SOLR-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757686#action_12757686
 ] 

Anders Melchiorsen commented on SOLR-1394:
------------------------------------------

As the author of the patch, I find that things work much better with it 
applied, and I have seen no ill effects. The exception that Jason reported 
happens *without* the patch.

My problem is that I cannot really vouch for the patch, as I have no previous 
experience with the Solr code. So it would be really nice if someone with that 
experience could take fifteen minutes to review the three tiny modifications.

When replacing read() with next() in one of the patches, I noted that I was not 
sure why it worked better. I have since figured out that read() is the external 
interface, so it should (probably?) not be used internally by the stripper.


> HTML stripper is splitting tokens
> ---------------------------------
>
>                 Key: SOLR-1394
>                 URL: https://issues.apache.org/jira/browse/SOLR-1394
>             Project: Solr
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.4
>            Reporter: Anders Melchiorsen
>         Attachments: SOLR-1394.patch
>
>
> I am having problems with the Solr HTML stripper.
> After some investigation, I have found the cause to be that the
> stripper is replacing the removed HTML with spaces. This obviously
> breaks when the HTML is in the middle of a word, like "Günther".
> So, without knowing what I was doing, I hacked together a fix that
> uses offset correction instead.
> That seemed to work, except that closing tags and attributes still
> broke the positioning. With even less of a clue, I replaced read()
> with next() in the two methods handling those.
> Finally, invalid HTML also gave wrong offsets, and I fixed that by
> restoring numRead when rolling back the input stream.
> At this point I stopped trying to break it, so there may still be more
> problems. Or I might have introduced some problem on my own. Anyway, I
> have put the three patches at the bottom of this mail, in case
> somebody wants to move along with this issue.
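
For readers who want to see the difference between the two approaches
described above, here is a minimal, self-contained sketch. It is not the
Solr code and not the attached patch: the class StripSketch, its methods,
and the sample input are hypothetical, and the tag handling is deliberately
naive (anything between '<' and '>' is dropped). It only illustrates why
padding with spaces splits a word, and how recording offset corrections
keeps the word intact while still allowing positions in the stripped text
to be mapped back to the original input.

```java
import java.util.ArrayList;
import java.util.List;

public class StripSketch {

    /** Space-padding approach: every removed character becomes a space. */
    public static String stripWithSpaces(String html) {
        StringBuilder out = new StringBuilder();
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') inTag = true;
            out.append(inTag ? ' ' : c);
            if (c == '>') inTag = false;
        }
        return out.toString();
    }

    /** Stripped text plus the corrections needed to map offsets back. */
    public static final class Stripped {
        public final String text;
        // Each entry is {offsetInStrippedText, totalCharsRemovedSoFar}.
        private final List<int[]> corrections;

        Stripped(String text, List<int[]> corrections) {
            this.text = text;
            this.corrections = corrections;
        }

        /** Maps an offset in the stripped text to an offset in the input. */
        public int correctOffset(int offset) {
            int diff = 0;
            for (int[] c : corrections) {
                if (c[0] > offset) break;
                diff = c[1];
            }
            return offset + diff;
        }
    }

    /** Offset-correction approach: drop the markup, remember how much. */
    public static Stripped stripWithOffsets(String html) {
        StringBuilder out = new StringBuilder();
        List<int[]> corrections = new ArrayList<>();
        int removed = 0;
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') inTag = true;
            if (inTag) {
                removed++;
                if (c == '>') {
                    inTag = false;
                    // Text from this output position on is shifted by 'removed'.
                    corrections.add(new int[] {out.length(), removed});
                }
            } else {
                out.append(c);
            }
        }
        return new Stripped(out.toString(), corrections);
    }

    public static void main(String[] args) {
        String html = "G<b>\u00fc</b>nther"; // hypothetical input, \u00fc is u-umlaut
        System.out.println("[" + stripWithSpaces(html) + "]");
        Stripped s = stripWithOffsets(html);
        System.out.println("[" + s.text + "]");
        System.out.println(s.correctOffset(0) + ".." + s.correctOffset(s.text.length()));
    }
}
```

With the sample input, the space-padded result contains whitespace inside
the word, so a whitespace-based tokenizer would see several tokens. The
offset-corrected result is a single word, and correctOffset() maps its
start and end back to positions in the original markup, which is what
highlighting needs.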

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
