[ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12625835#action_12625835 ]
Jim Murphy commented on SOLR-42:
--------------------------------

I've tracked down some background info on this issue - at least the way it was affecting me. I couldn't care less about highlighting - I'm using the HTMLStripWhitespaceTokenizerFactory during indexing to tokenize blog content, which obviously contains lots of HTML.

The pathological case I've found with our input document set: the content contains a malformed XML processing instruction in the first "page" of the buffer, in a document that holds more than one page of data. This seems to be a fairly common (maybe MS Word XML?) form of invalid HTML. Commonly it looks like this:

...valid html...<?xml:namespace prefix = o />...valid html...

Notice the PI starts with "<?xml" but terminates with a tag close ("/>"). Doh.

The issue manifests in HTMLStripReader. It causes the following code to read too much off the buffer, invalidating the mark previously set at the beginning of the tag:

private int readProcessingInstruction() throws IOException {
  // "<?" has already been read
  while ((numRead - lastMark) < readAheadLimitMinus1) {
    int ch = next();
    if (ch=='?' && peek()=='>') {
      next();
      return MATCH;
    } else if (ch==-1) {
      return MISMATCH;
    }
  }
  return MISMATCH;
}

The demoralizing part is that the special treatment (readAheadLimitMinus1) isn't enough: the loop still over-reads by 2 chars. The IOException ("Mark invalid") happens when readProcessingInstruction() returns a MISMATCH (because the entire buffer is read without finding the close of the PI) and restoreState() is called to reset the marks - which fails.

If I tweak readAheadLimitMinus1 like this:

readAheadLimitMinus1 -= 2;

(so maybe the variable should be readAheadLimitMinus3 ;) ) then the buffer limits are preserved, the exception isn't thrown, and parsing proceeds as expected.

Jim
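
To make the failure mode above concrete, here is a minimal reproduction sketch. It assumes the org.apache.solr.analysis.HTMLStripReader(Reader) constructor; the PAGE_SIZE value, the class name, and the exact padding needed to cross a buffer page are illustrative guesses rather than values taken from the Solr source.

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Assumes HTMLStripReader is the Solr 1.x class in org.apache.solr.analysis.
import org.apache.solr.analysis.HTMLStripReader;

public class MalformedPiRepro {

  // Rough stand-in for the reader's internal page/buffer size; the real
  // constant lives inside HTMLStripReader and is not copied here.
  private static final int PAGE_SIZE = 8192;

  public static void main(String[] args) throws IOException {
    StringBuilder html = new StringBuilder();
    html.append("<p>valid html</p>");
    // The malformed PI: it opens with "<?xml" but is closed like a tag
    // ("/>"), so readProcessingInstruction() never finds the "?>" it scans for.
    html.append("<?xml:namespace prefix = o />");
    // Pad the document past one buffer page so the reader must refill while
    // the mark set at the '<' of the PI is still outstanding.
    while (html.length() < PAGE_SIZE * 2) {
      html.append("<p>more valid html</p>");
    }

    Reader stripped = new HTMLStripReader(new StringReader(html.toString()));
    char[] buf = new char[1024];
    try {
      while (stripped.read(buf) != -1) {
        // Drain the stream; the bug shows up as the "invalid mark"
        // IOException thrown when restoreState() tries to reset the reader.
      }
      System.out.println("Stripped cleanly - no mark error.");
    } catch (IOException e) {
      System.out.println("Reproduced the failure: " + e.getMessage());
    } finally {
      stripped.close();
    }
  }
}

With the read-ahead limit tightened by the extra two chars described above, the same input should strip cleanly instead of failing in restoreState().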

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, HtmlStripReaderTestXmlProcessing.patch, HtmlStripReaderTestXmlProcessing.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, TokenPrinter.java
>
>
> Indexing content that contains HTML markup causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases... HTMLStripReader removes the HTML and passes the result to WhitespaceTokenizer... at that point, Tokens are generated, but the offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader could go before any tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this? The fix would be a special version of HTMLStripReader integrated with a WhitespaceTokenizer to keep offsets correct.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.