[jira] [Commented] (TIKA-1134) ContentHandler gets ignorable whitespace for
<br> tags when parsing HTML

Uwe Schindler (JIRA) Thu, 08 Aug 2013 03:32:13 -0700

    [ https://issues.apache.org/jira/browse/TIKA-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733344#comment-13733344 ]


Uwe Schindler commented on TIKA-1134:
-------------------------------------

Hi Hoss,
the "rule" in Tika is:
- Tika inserts ignorableWhitespace to support plain-text extraction on block 
elements and <br/> tags (which are, in a sense, "empty" block elements) - see 
TIKA-171. Nothing else inserts ignorableWhitespace into the content 
handler. This means that consumers that are only interested in the *plain text* 
contents of parsed files should ignore all HTML syntax elements and treat 
ignorableWhitespace as significant - this is what TextOnlyContentHandler does 
to extract text. This was decided in TIKA-171 a long time ago. If you are 
interested in *structured* HTML output, use the XHTML elements and ignore the 
whitespace.
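
To make the two consumer styles concrete, here is a minimal sketch using only 
the JDK SAX classes. It does not call Tika: the class PlainTextCollector, the 
replay() helper and the replayed event sequence are assumptions made for 
illustration, based on the behaviour described in this issue for the input 
"foo<br>bar" (a characters event, an ignorableWhitespace event carrying a 
newline, then another characters event).

```java
import org.xml.sax.helpers.DefaultHandler;

public class WhitespaceDemo {

    /**
     * Hypothetical handler (not part of Tika): collects character data and
     * optionally treats ignorable whitespace as significant text.
     */
    static class PlainTextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final boolean keepIgnorableWhitespace;

        PlainTextCollector(boolean keepIgnorableWhitespace) {
            this.keepIgnorableWhitespace = keepIgnorableWhitespace;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            // Plain-text consumers keep this; structured consumers drop it.
            if (keepIgnorableWhitespace) {
                text.append(ch, start, length);
            }
        }

        String text() {
            return text.toString();
        }
    }

    /**
     * Replays the event sequence assumed for "foo<br>bar": the <br> is
     * represented by an ignorableWhitespace event containing "\n".
     */
    public static String replay(boolean keepIgnorableWhitespace) {
        PlainTextCollector handler = new PlainTextCollector(keepIgnorableWhitespace);
        handler.characters("foo".toCharArray(), 0, 3);
        handler.ignorableWhitespace("\n".toCharArray(), 0, 1);
        handler.characters("bar".toCharArray(), 0, 3);
        return handler.text();
    }

    public static void main(String[] args) {
        System.out.println(replay(true));   // "foo" and "bar" on separate lines
        System.out.println(replay(false));  // run-on text: foobar
    }
}
```

With keepIgnorableWhitespace set to true the line break survives in the 
extracted text; with false the handler yields the run-on text from the issue 
description, so a structured consumer would have to rely on the XHTML element 
events instead.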
                
> ContentHandler gets ignorable whitespace for <br> tags when parsing HTML
> ------------------------------------------------------------------------
>
>                 Key: TIKA-1134
>                 URL: https://issues.apache.org/jira/browse/TIKA-1134
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Hoss Man
>         Attachments: TIKA-1134.patch
>
>
> I'm not very knowledgeable about Tika, so it's possible I'm misunderstanding 
> something here, but it appears that the way Tika parses HTML to produce XHTML 
> SAX events is misinterpreting "<br>" tags as equivalent to ignorable 
> whitespace containing a newline.  This means that clients who ask Tika to 
> parse files, and specify their own ContentHandler to capture the character 
> data can get sequences of run-on text w/o knowing that the "<br>" tag was 
> present -- _unless_ they explicitly handle ignorableWhitespace and treat it 
> as "real" whitespace -- but this creates a catch-22 if you really do want to 
> ignore the ignorable whitespace in the HTML markup.
> The crux of the problem seems to be:
>  * instead of generating a startElement event for "br" the HtmlParser treats 
> it as a xhtml.newline().
>  * xhtml.newline() generates an ignorableWhitespace SAX event instead of a 
> characters SAX event
> ...either one of these by themselves might be fine, but in combination they 
> don't really make any sense.  If for example an actual newline exists in the 
> HTML, it comes across as part of a characters SAX event, not as ignorable 
> whitespace.
> Changing the newline() function to delegate to characters(...) seems to solve 
> the problem for <br> tags in HTML, but breaks several tests -- probably 
> because the newline() function is also used to intentionally add 
> (synthetic) ignorableWhitespace events after elements.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
