[ https://issues.apache.org/jira/browse/SOLR-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754215#action_12754215 ]

Jason Rutherglen commented on SOLR-1394:
----------------------------------------

bq. In which situation do you get this exception?

We need to log the HTML. I'll post it when we implement the logging.

> HTML stripper is splitting tokens
> ---------------------------------
>
>                 Key: SOLR-1394
>                 URL: https://issues.apache.org/jira/browse/SOLR-1394
>             Project: Solr
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.4
>            Reporter: Anders Melchiorsen
>         Attachments: SOLR-1394.patch
>
>
> I am having problems with the Solr HTML stripper.
>
> After some investigation, I have found the cause to be that the
> stripper is replacing the removed HTML with spaces. This obviously
> breaks when the HTML is in the middle of a word, like "Günther".
>
> So, without knowing what I was doing, I hacked together a fix that
> uses offset correction instead.
>
> That seemed to work, except that closing tags and attributes still
> broke the positioning. With even less of a clue, I replaced read()
> with next() in the two methods handling those.
>
> Finally, invalid HTML also gave wrong offsets, and I fixed that by
> restoring numRead when rolling back the input stream.
>
> At this point I stopped trying to break it, so there may still be more
> problems. Or I might have introduced some problem on my own. Anyway, I
> have put the three patches at the bottom of this mail, in case
> somebody wants to move along with this issue.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
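
For readers following the quoted issue description, here is a minimal Python sketch of the difference between replacing removed HTML with spaces and removing it while recording offset corrections. It is a toy model only: the regex `TAG` and the functions `strip_with_offsets` and `correct_offset` are hypothetical names made up for illustration. They are not taken from Solr or from the attached SOLR-1394.patch, and a real HTML stripper has to handle entities, comments, and invalid markup that this sketch ignores.

```python
import re
from bisect import bisect_right

# Toy tag matcher (assumption): anything between '<' and '>' counts as a tag.
TAG = re.compile(r"<[^>]*>")


def strip_with_offsets(html):
    """Remove tags without inserting spaces and record offset corrections.

    Returns (text, corrections), where corrections is a list of
    (stripped_offset, removed_so_far) pairs sorted by stripped_offset.
    """
    parts = []
    corrections = []
    removed = 0   # total characters dropped so far
    out_len = 0   # length of the stripped text so far
    pos = 0       # read position in the original input
    for match in TAG.finditer(html):
        chunk = html[pos:match.start()]
        parts.append(chunk)
        out_len += len(chunk)
        removed += match.end() - match.start()
        corrections.append((out_len, removed))
        pos = match.end()
    parts.append(html[pos:])
    return "".join(parts), corrections


def correct_offset(offset, corrections):
    """Map an offset in the stripped text back to the original input."""
    keys = [stripped for stripped, _ in corrections]
    i = bisect_right(keys, offset)
    return offset + (corrections[i - 1][1] if i else 0)


html = "G<b>ü</b>nther rocks"

# Space replacement splits the word into three tokens.
print(TAG.sub(" ", html).split())

# Offset correction keeps the word whole and still locates tokens in the input.
text, corrections = strip_with_offsets(html)
start = text.index("rocks")
print(text, correct_offset(start, corrections))
```

With space replacement the word is tokenized as three pieces. With offset correction the stripped text stays "Günther rocks", and the start of "rocks" (offset 8 in the stripped text) maps back to offset 15 in the original markup, which is what highlighting needs.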