[jira] [Commented] (TIKA-1134) ContentHandler gets ignorable whitespace for
<br> tags when parsing HTML

Uwe Schindler (JIRA) Fri, 09 Aug 2013 06:32:52 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734769#comment-13734769
 ]


Uwe Schindler commented on TIKA-1134:
-------------------------------------

Hoss: I agree to fix this in the documentation.

On SOLR-4679 I explained in more detail *why TIKA is doing this*:

{quote}
Let me recapitulate TIKA's problems:

- TIKA decided to use XHTML as its output format to report the parsed documents 
to the consumer. This is nice, because it allows preserving some of the 
formatting (like bold fonts, paragraphs,...) from the original 
document. Of course most of this formatting is lost, but you can still "detect" 
things like emphasized text. By choosing XHTML as output format, TIKA 
must of course use XHTML formatting for new lines and similar. So whenever a 
line break is needed, the TIKA parser emits a <br/> tag or places the 
"paragraph" (in a PDF) inside a <p/> element. As we all know, HTML ignores 
formatting like newlines, tabs,... (all are treated as one single whitespace, 
i.e. as if this regex replacement were applied: {{s/\s+/ /}}).
- On the other hand, TIKA wants to make it simple for people to extract the 
*plain text* contents. With the XHTML-only approach this would be hard for the 
consumer, because to add the correct newlines the consumer has to fully 
understand XHTML, detect block elements, and replace them with \n.

To support both usages of TIKA, the idea was to embed this information, which 
is unimportant to HTML (as HTML ignores whitespace completely), as 
ignorableWhitespace, as a "convenience" for the user. A fully compliant XHTML 
consumer would not parse the ignorable stuff. As it understands HTML, it would 
detect a <p> element as a block element and format the output accordingly.

Solr unfortunately has a somewhat strange approach: It is mainly interested in 
the text-only contents, so ideally when consuming the HTML it could use 
{{WriteOutContentHandler(StringBuilder, 
BodyContentHandler(parserContentHandler))}}. In that case TIKA would do the 
right thing automatically: It would extract only text from the body element and 
would use the "convenience whitespace" to format the text in an ASCII-ART-like 
way (using tabs, newlines,...) :-)
Solr has a hybrid approach: It collects everything into a content tag (which is 
similar to the above approach), but the bug is that, in contrast to TIKA's 
official WriteOutContentHandler, it does not use the ignorable whitespace 
inserted for convenience. In addition TIKA also has a stack where it allows 
processing parts of the documents (like the title element or all <em> elements). 
In that case it has several StringBuilders in parallel that are populated with 
the contents. The problems are here too, but cannot be solved by using 
ignorable whitespace: e.g. if one indexes only the <em> elements (which are 
inline HTML elements, not block elements), there is no whitespace, so all em 
elements would be glued together in the em field of your index... I just 
mention this because, in my opinion, the SolrContentHandler needs more work to 
"correctly" understand HTML and not just collect element names in a map!

Now to your complaint: You proposed to report the newlines as real 
{{characters()}} events - but this is not the right thing to do here. As I said, 
HTML does not know these characters, they are ignored. The "formatting" is done 
by the element names (like <p>, <div>, <table>). So the "helper" whitespace for 
text-only consumers should be inserted as ignorableWhitespace only. If we 
added it to the real character data, we would report things that every HTML 
parser (like nekohtml) would never report to the consumer. Nekohtml would also 
report this useless extra whitespace as ignorable.

The convenience here is that TIKA's XHTMLContentHandler used by all parsers is 
"configured" to help the text-only user, but doesn't hurt the HTML-only user. 
This differentiation is done by reporting the HTML element names (p, div, 
table, th, td, tr, abbr, em, strong,...) but also reporting the 
ASCII-ART-text-only content like TABs inside tables, newlines after block 
elements,... This is always done as ignorableWhitespace (for convenience); a 
real HTML parser must ignore it - and it's correct to do this.
{quote}

I think we should document this in the javadocs or the "howto" page, so 
implementors of ContentHandlers know what to do!
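
As a starting point for such documentation, here is a minimal sketch of the two consumer styles described above, written against the JDK's SAX classes only. The names HandlerDemo, TextOnlyHandler and HtmlAwareHandler, the small set of block elements, and the hand-replayed event sequence are assumptions made for illustration; this is not TIKA code and does not claim to reproduce the exact events a TIKA parser emits.

```java
import java.util.Set;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class HandlerDemo {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Text-only consumer: accepts the "convenience" whitespace as formatting.
    static class TextOnlyHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    // HTML-aware consumer: leaves ignorableWhitespace() as the inherited
    // no-op and derives line breaks from block element names instead.
    static class HtmlAwareHandler extends DefaultHandler {
        // Illustrative subset of block elements, not a complete list.
        private static final Set<String> BLOCK = Set.of("p", "div", "table", "tr");
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (BLOCK.contains(localName)) {
                text.append('\n');
            }
        }
    }

    // Replays a hand-written event sequence: two paragraphs, each followed
    // by a synthetic newline reported as ignorable whitespace (assumed shape).
    static void replay(ContentHandler handler) {
        try {
            handler.startDocument();
            paragraph(handler, "first paragraph");
            paragraph(handler, "second paragraph");
            handler.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    private static void paragraph(ContentHandler handler, String s) throws SAXException {
        char[] chars = s.toCharArray();
        char[] newline = {'\n'};
        handler.startElement(XHTML, "p", "p", new AttributesImpl());
        handler.characters(chars, 0, chars.length);
        handler.endElement(XHTML, "p", "p");
        handler.ignorableWhitespace(newline, 0, newline.length);
    }

    public static void main(String[] args) {
        TextOnlyHandler textOnly = new TextOnlyHandler();
        HtmlAwareHandler htmlAware = new HtmlAwareHandler();
        replay(textOnly);
        replay(htmlAware);
        System.out.print(textOnly.text);
        System.out.print(htmlAware.text);
    }
}
```

With this event sequence both handlers end up with the same two-line text, each by its own route. A handler that overrides only characters() and understands no element names would produce the run-on text "first paragraphsecond paragraph", which is the symptom described in the issue.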
                
> ContentHandler gets ignorable whitespace for <br> tags when parsing HTML
> ------------------------------------------------------------------------
>
>                 Key: TIKA-1134
>                 URL: https://issues.apache.org/jira/browse/TIKA-1134
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Hoss Man
>         Attachments: TIKA-1134.patch
>
>
> I'm not very knowledgeable about Tika, so it's possible I'm misunderstanding 
> something here, but it appears that the way Tika parses HTML to produce XHTML 
> SAX events is misinterpreting "<br>" tags as equivalent to ignorable 
> whitespace containing a newline.  This means that clients who ask Tika to 
> parse files, and specify their own ContentHandler to capture the character 
> data can get sequences of run-on text w/o knowing that the "<br>" tag was 
> present -- _unless_ they explicitly handle ignorableWhitespace and treat it 
> as "real" whitespace -- but this creates a catch-22 if you really do want to 
> ignore the ignorable whitespace in the HTML markup.
> The crux of the problem seems to be:
>  * instead of generating a startElement event for "br" the HtmlParser treats 
> it as a xhtml.newline().
>  * xhtml.newline() generates an ignorableWhitespace SAX event instead of a 
> characters SAX event
> ...either one of these by themselves might be fine, but in combination they 
> don't really make any sense.  If for example an actual newline exists in the 
> html, it comes across as part of a characters SAX event, not as ignorable 
> whitespace.
> Changing the newline() function to delegate to characters(...) seems to solve 
> the problem for <br> tags in HTML, but breaks several tests -- probably 
> because the newline() function is also used to intentionally add 
> (synthetic) ignorableWhitespace events after elements.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
