[ https://issues.apache.org/jira/browse/TIKA-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733344#comment-13733344 ]
Uwe Schindler commented on TIKA-1134:
-------------------------------------

Hi Hoss,

the "rule" in TIKA is:
- TIKA inserts ignorableWhitespace to support plain-text extraction on block elements and <br/> tags (which are also, in a sense, "empty" block elements) - see TIKA-171. Nothing else will insert ignorableWhitespace into the content handler.

This means that consumers that are only interested in the *plain text* contents of parsed files should ignore all HTML syntax elements and just treat ignorableWhitespace as significant - this is what TextOnlyContentHandler does to extract text. This was decided in TIKA-171 a long time ago.

If you are interested in *structured* HTML output, use the XHTML elements and ignore the whitespace.

> ContentHandler gets ignorable whitespace for <br> tags when parsing HTML
> ------------------------------------------------------------------------
>
>                 Key: TIKA-1134
>                 URL: https://issues.apache.org/jira/browse/TIKA-1134
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Hoss Man
>         Attachments: TIKA-1134.patch
>
>
> I'm not very knowledgeable about Tika, so it's possible I'm misunderstanding
> something here, but it appears that the way Tika parses HTML to produce XHTML
> SAX events is misinterpreting "<br>" tags as equivalent to ignorable
> whitespace containing a newline. This means that clients who ask Tika to
> parse files, and specify their own ContentHandler to capture the character
> data, can get sequences of run-on text w/o knowing that the "<br>" tag was
> present -- _unless_ they explicitly handle ignorableWhitespace and treat it
> as "real" whitespace -- but this creates a catch-22 if you really do want to
> ignore the ignorable whitespace in the HTML markup.
> The crux of the problem seems to be:
> * instead of generating a startElement event for "br", the HtmlParser treats
>   it as an xhtml.newline().
> * xhtml.newline() generates an ignorableWhitespace SAX event instead of a
>   characters SAX event.
> ...either one of these by itself might be fine, but in combination they
> don't really make any sense. If, for example, an actual newline exists in the
> HTML, it comes across as part of a characters SAX event, not as ignorable
> whitespace.
> Changing the newline() function to delegate to characters(...) seems to solve
> the problem for <br> tags in HTML, but breaks several tests -- probably
> because the newline() function is also used to intentionally add
> (synthetic) ignorableWhitespace events after elements.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
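To make the rule from the comment above concrete, here is a minimal sketch of the two kinds of consumer it distinguishes, written against the plain JDK SAX API only. The class names (WhitespaceHandlingSketch, PlainTextCollector, StructuredCollector) and the replaySampleEvents helper are hypothetical and not part of Tika. The event sequence is hand-written to approximate what the comment describes for two block elements (an ignorableWhitespace newline after each one); it was not captured from a real Tika parse.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class WhitespaceHandlingSketch {

    /** Plain-text consumer: ignores elements, treats ignorable whitespace as significant. */
    static class PlainTextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            // Keep the synthetic newlines so adjacent blocks do not run together.
            text.append(ch, start, length);
        }
    }

    /** Structure-aware consumer: records elements and characters, drops ignorable whitespace. */
    static class StructuredCollector extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            out.append('<').append(localName).append('>');
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            out.append("</").append(localName).append('>');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
        // ignorableWhitespace is not overridden: DefaultHandler's version does nothing.
    }

    /** Assumed event stream for two paragraphs "foo" and "bar" (hand-written, not from Tika). */
    static void replaySampleEvents(ContentHandler handler) {
        String xhtml = "http://www.w3.org/1999/xhtml";
        char[] newline = {'\n'};
        try {
            handler.startDocument();
            for (String word : new String[] {"foo", "bar"}) {
                handler.startElement(xhtml, "p", "p", new AttributesImpl());
                handler.characters(word.toCharArray(), 0, word.length());
                handler.endElement(xhtml, "p", "p");
                // Synthetic newline after the block element, as described in the comment.
                handler.ignorableWhitespace(newline, 0, 1);
            }
            handler.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PlainTextCollector plain = new PlainTextCollector();
        replaySampleEvents(plain);
        System.out.print(plain.text);          // two lines: foo, bar

        StructuredCollector structured = new StructuredCollector();
        replaySampleEvents(structured);
        System.out.println(structured.out);    // <p>foo</p><p>bar</p>
    }
}
```

The catch-22 raised in the issue description is visible here as well: a handler has to pick one of the two policies for ignorableWhitespace, and it cannot tell a synthetic newline from any other ignorable whitespace by looking at the event alone.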