HTML stripper is splitting tokens
---------------------------------

                 Key: SOLR-1394
                 URL: https://issues.apache.org/jira/browse/SOLR-1394
             Project: Solr
          Issue Type: Bug
          Components: Analysis
    Affects Versions: 1.4
            Reporter: Anders Melchiorsen

I am having problems with the Solr HTML stripper. After some investigation, I found the cause: the stripper replaces the removed HTML with spaces. This breaks when the HTML is in the middle of a word, like "Günther", because the word is then split into several tokens.

So, without knowing what I was doing, I hacked together a fix that uses offset correction instead. That seemed to work, except that closing tags and attributes still broke the positioning. With even less of a clue, I replaced read() with next() in the two methods handling those.

Finally, invalid HTML also gave wrong offsets, and I fixed that by restoring numRead when rolling back the input stream.

At this point I stopped trying to break it, so there may still be more problems, or I may have introduced some of my own. Anyway, I have put the three patches at the bottom of this mail, in case somebody wants to move this issue along.
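
To make the difference between the two approaches concrete, here is a minimal, self-contained sketch. It is not the Solr code and not the attached patches: the class StripDemo, its methods, the naive tag regex, and the sample input (a u-umlaut wrapped in a <b> tag inside a word) are all assumptions made up for illustration. It only shows why padding with spaces splits a word, and how dropping the markup while recording input offsets avoids that.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; none of these names come from Solr.
class StripDemo {
    // Naive tag pattern, for illustration only (not a real HTML parser).
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Approach 1: overwrite every removed character with a space.
    // Offsets stay aligned with the input, but a tag inside a word splits it.
    static String stripWithSpaces(String html) {
        StringBuilder out = new StringBuilder(html);
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            for (int i = m.start(); i < m.end(); i++) {
                out.setCharAt(i, ' ');
            }
        }
        return out.toString();
    }

    // Approach 2: drop the removed characters and record, for each kept
    // character, its index in the input (the offset correction).
    static String stripWithOffsets(String html, List<Integer> originalOffsets) {
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(html);
        int last = 0;
        while (m.find()) {
            copy(html, last, m.start(), out, originalOffsets);
            last = m.end();
        }
        copy(html, last, html.length(), out, originalOffsets);
        return out.toString();
    }

    // Copies src[from, to) to out and appends the matching input indices.
    private static void copy(String src, int from, int to,
                             StringBuilder out, List<Integer> offsets) {
        for (int i = from; i < to; i++) {
            out.append(src.charAt(i));
            offsets.add(i);
        }
    }

    public static void main(String[] args) {
        // Assumed sample input: a u-umlaut wrapped in a bold tag mid-word.
        String html = "G<b>\u00fc</b>nther";

        String padded = stripWithSpaces(html);
        System.out.println("padded tokens:    " + padded.trim().split("\\s+").length);

        List<Integer> offsets = new ArrayList<>();
        String stripped = stripWithOffsets(html, offsets);
        System.out.println("corrected tokens: " + stripped.split("\\s+").length);
        System.out.println("input offsets:    " + offsets);
    }
}
```

With this input, whitespace splitting gives three tokens for the padded text and a single token for the offset-corrected text, while the recorded offsets still let a highlighter map each kept character back to its position in the original markup.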