[jira] [Commented] (TIKA-1134) ContentHandler gets ignorable whitespace for br tags when parsing HTML
[ https://issues.apache.org/jira/browse/TIKA-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734769#comment-13734769 ]

Uwe Schindler commented on TIKA-1134:

Hoss: I agree to fix this in the documentation. On SOLR-4679 I explained in more detail *why TIKA is doing this*:

{quote}
Let me recapitulate TIKA's problems:

- TIKA decided to use XHTML as its output format to report the parsed documents to the consumer. This is nice, because it preserves some of the formatting (bold fonts, paragraphs, ...) of the original document. Of course most of this formatting is lost, but you can still detect things like emphasized text. By choosing XHTML as the output format, TIKA must of course use XHTML formatting for new lines and similar. So whenever a line break is needed, the TIKA parser emits a <br/> tag or places the paragraph (in a PDF) inside a <p/> element. As we all know, HTML ignores formatting like newlines, tabs, ... (all are treated as one single whitespace, as in this regex replace: {{s/\s+/ /}}).
- On the other hand, TIKA wants to make it simple for people to extract the *plain text* contents. With the XHTML-only approach this would be hard for the consumer, because to add the correct newlines the consumer has to fully understand XHTML, detect block elements, and replace them by \n.

To support both usages of TIKA, the idea was to embed this information, which is unimportant to HTML (as HTML ignores whitespace completely), as ignorableWhitespace, as a convenience for the user. A fully compliant XHTML consumer would not parse the ignorable stuff. As it understands HTML, it would detect a p element as a block element and format the output.

Solr unfortunately has a somewhat strange approach: it is mainly interested in the text-only contents, so ideally when consuming the HTML it could use {{WriteOutContentHandler(StringBuilder, BodyContentHandler(parserContentHandler))}}. In that case TIKA would do the right thing automatically: it would extract only text from the body element and would use the convenience whitespace to format the text in an ASCII-ART-like way (using tabs, newlines, ...) :-)

Solr has a hybrid approach: it collects everything into a content tag (which is similar to the above approach), but the bug is that, in contrast to TIKA's official WriteOutContentHandler, it does not use the ignorable whitespace inserted for convenience. In addition TIKA also has a stack where it allows to process parts of the documents (like the title element or all em elements). In that case it has several StringBuilders in parallel that are populated with the contents. The problems are here too, but cannot be solved by using ignorable whitespace: e.g. if one indexes only the em elements (which are inline HTML elements, not block elements), there is no whitespace, so all em elements would be glued together in the em field of your index... I just mention this; in my opinion the SolrContentHandler needs more work to correctly understand HTML and not just collect element names in a map!

Now to your complaint: you proposed to report the newlines as real {{characters()}} events - but this is not the right thing to do here. As I said, HTML does not know these characters; they are ignored. The formatting is done by the element names (like p, div, table). So the helper whitespace for text-only consumers should be inserted as ignorableWhitespace only. If we added it to the real character data, we would report things that every HTML parser (like nekohtml) would never report to the consumer. Nekohtml would also report this useless extra whitespace as ignorable.

The convenience here is that TIKA's XHTMLContentHandler, used by all parsers, is configured to help the text-only user but doesn't hurt the HTML-only user. This differentiation is done by reporting the HTML element names (p, div, table, th, td, tr, abbr, em, strong, ...) but also reporting the ASCII-ART text-only content like TABs inside tables, newlines after block elements, ... This is always done as ignorableWhitespace (for convenience); a real HTML parser must ignore it - and it's correct to do this.
{quote}

I think we should document this in the javadocs or the howto page, so implementors of ContentHandlers know what to do!

ContentHandler gets ignorable whitespace for br tags when parsing HTML

Key: TIKA-1134
URL: https://issues.apache.org/jira/browse/TIKA-1134
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Hoss Man
Attachments: TIKA-1134.patch

I'm not very knowledgeable about Tika, so it's possible I'm misunderstanding something here, but it appears that the way Tika parses HTML to produce XHTML SAX events is misinterpreting br tags as
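The "fully compliant XHTML consumer" described in the comment above can be sketched with nothing but the JDK's SAX classes. This is a minimal illustration, not Tika or Solr code: the class name `StructureAwareCollector`, its helper methods, the short list of line-ending elements, and the hand-fed event sequence are all assumptions made for the example.

```java
import java.util.Set;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical structure-aware consumer: layout is derived from element names only. */
final class StructureAwareCollector extends DefaultHandler {
    // Assumed and deliberately incomplete: elements after which a new line starts.
    private static final Set<String> LINE_ENDING = Set.of("p", "div", "table");
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // HTML semantics: a run of whitespace counts as one space.
        // Simplification: runs that span two events are not merged.
        text.append(new String(ch, start, length).replaceAll("\\s+", " "));
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Convenience whitespace meant for text-only consumers; ignored here.
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (LINE_ENDING.contains(localName)) {
            text.append('\n');
        }
    }

    // Small helpers so events can be fed from strings.
    void chars(String s) { characters(s.toCharArray(), 0, s.length()); }
    void ws(String s) { ignorableWhitespace(s.toCharArray(), 0, s.length()); }
    String text() { return text.toString(); }

    public static void main(String[] args) {
        StructureAwareCollector h = new StructureAwareCollector();
        // Hand-written events for two paragraphs; start tags are omitted
        // because this sketch only reacts to end tags.
        h.chars("first\n   line");
        h.endElement("", "p", "p");
        h.ws("\n");                 // synthetic newline, dropped by this consumer
        h.chars("second paragraph");
        h.endElement("", "p", "p");
        h.ws("\n");
        System.out.print(h.text());
    }
}
```

The point of the sketch is only the division of labour: character data is normalised the way HTML would render it, and line breaks come from the element names rather than from the ignorableWhitespace events.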
[jira] [Commented] (TIKA-1134) ContentHandler gets ignorable whitespace for br tags when parsing HTML
[ https://issues.apache.org/jira/browse/TIKA-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733344#comment-13733344 ]

Uwe Schindler commented on TIKA-1134:

Hi Hoss,

the rule in TIKA is:

- TIKA inserts ignorableWhitespace to support plain-text extraction on block elements and <br/> tags (which are also somehow empty block elements) - see TIKA-171. Nothing else will insert ignorableWhitespace into the content handler.

This means that consumers that are only interested in the *plain text* contents of parsed files should ignore all HTML syntax elements and just treat ignorableWhitespace as significant - this is what TextOnlyContentHandler does to extract text. This was decided in TIKA-171 a long time ago. If you are interested in *structured* HTML output, use the XHTML elements and ignore the whitespace.

ContentHandler gets ignorable whitespace for br tags when parsing HTML

Key: TIKA-1134
URL: https://issues.apache.org/jira/browse/TIKA-1134
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Hoss Man
Attachments: TIKA-1134.patch

I'm not very knowledgeable about Tika, so it's possible I'm misunderstanding something here, but it appears that the way Tika parses HTML to produce XHTML SAX events is misinterpreting br tags as equivalent to ignorable whitespace containing a newline.

This means that clients who ask Tika to parse files, and specify their own ContentHandler to capture the character data, can get sequences of run-on text w/o knowing that the br tag was present -- _unless_ they explicitly handle ignorableWhitespace and treat it as real whitespace -- but this creates a catch-22 if you really do want to ignore the ignorable whitespace in the HTML markup.

The crux of the problem seems to be:

* instead of generating a startElement event for br, the HtmlParser treats it as an xhtml.newline().
* xhtml.newline() generates an ignorableWhitespace SAX event instead of a characters SAX event.

...either one of these by itself might be fine, but in combination they don't really make any sense. If, for example, an actual newline exists in the html, it comes across as part of a characters SAX event, not as ignorable whitespace.

Changing the newline() function to delegate to characters(...) seems to solve the problem for br tags in HTML, but breaks several tests -- probably because the newline() function is also used to intentionally add (synthetic) ignorableWhitespace events after elements.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
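The run-on text described in the issue can be reproduced without Tika by replaying the reported event pattern against a handler that records character data only. This is a sketch under assumptions: `CharactersOnlyCollector` is a hypothetical class, and the three events are written by hand to mirror the report (a br reported as a newline in an ignorableWhitespace event), not captured from a real parse.

```java
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that records character data only, as in the report. */
final class CharactersOnlyCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Dropped, as the SAX name suggests; this is where the line break is lost.
    }

    String text() { return text.toString(); }

    public static void main(String[] args) {
        CharactersOnlyCollector h = new CharactersOnlyCollector();
        // Events assumed for the HTML fragment "line one<br>line two".
        h.characters("line one".toCharArray(), 0, 8);
        h.ignorableWhitespace(new char[] {'\n'}, 0, 1); // stands in for xhtml.newline()
        h.characters("line two".toCharArray(), 0, 8);
        System.out.println(h.text()); // prints "line oneline two"
    }
}
```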
[jira] [Commented] (TIKA-1134) ContentHandler gets ignorable whitespace for br tags when parsing HTML
[ https://issues.apache.org/jira/browse/TIKA-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733348#comment-13733348 ]

Uwe Schindler commented on TIKA-1134:

I think this issue is Won't Fix. The issues described by Hoss are caused by user error :-) So maybe keep this open to make the javadocs of all those wrapper ContentHandlers like BodyContentHandler explicitly state that those extract plain text and add extra whitespace to support this.
[jira] [Commented] (TIKA-1134) ContentHandler gets ignorable whitespace for br tags when parsing HTML
[ https://issues.apache.org/jira/browse/TIKA-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733649#comment-13733649 ]

Hoss Man commented on TIKA-1134:

bq. keep this open to make javadocs inside all those wrapper ContentHandlers like BodyContentHandler to explicitly state that those extract plain text and add extra whitespace to support this.

If this behavior is by design, then I would definitely urge the addition of some documentation _somewhere_ to better clarify, for people writing their own ContentHandlers, when/how ignorableWhitespace events may be generated and how they should be interpreted.
[jira] [Commented] (TIKA-1134) ContentHandler gets ignorable whitespace for br tags when parsing HTML
[ https://issues.apache.org/jira/browse/TIKA-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733819#comment-13733819 ]

Hoss Man commented on TIKA-1134:

The crux of my initial confusion, and of my continued concern about needing this documented so future users avoid the same confusion, comes from these types of statements Uwe made...

bq. ... consumers that are only interested in the plain text contents of parsed files, should ignore all HTML syntax elements and just treat ignorableWhitespace as significant

...and over in SOLR-4679, Uwe added...

bq. ... ignorable whitespace is XML semantics only, in (X)HTML this does not exist (it is handled differently, but is never reported by HTML parsers), so the idea in TIKA is to reuse (it's a bit incorrect) the ignorableWhitespace SAX event to report this added whitespace.

As someone who is not a Tika expert, or an XHTML expert, or even an HTML expert -- I have no way of knowing any of this information if I'm trying to build/maintain a custom ContentHandler to parse out specific bits of information from arbitrary files.

In this specific case, I'm maintaining a ContentHandler used in Tika that attempts to be very generic and agnostic to the types of files that get parsed. So even if I was an HTML expert and understood that ignorable whitespace isn't really an (X)HTML concept, I wouldn't know if/when I should assume that was relevant in building a custom ContentHandler for Tika. All I have to go on is the general information that Tika handles the arbitrary file parsing for me and generates SAX events, combined with the org.xml.sax.ContentHandler javadocs -- which might then lead me to the XML spec's explanation of what ignorableWhitespace is, which led me to (seemingly reasonably) assume that if Tika is taking care of parsing file type $foo and mapping that to SAX events, I probably don't want ignorableWhitespace -- but the truth is I do for some (all?) file types.

That part isn't clear in any docs I've seen, and should probably be made clear somewhere.
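As an illustration of the kind of documentation being asked for, here is a sketch of a plain-text handler whose Javadoc spells out how it treats ignorableWhitespace. Everything in it is an assumption made for the example: `PlainTextCollector` is a hypothetical class built on the JDK's SAX API, the Javadoc wording is not taken from Tika, and the events are hand-written rather than produced by a parser.

```java
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical plain-text collector with its whitespace contract spelled out.
 *
 * <p>The parser feeding this handler may report synthetic line breaks and tabs
 * (for example after block elements) as {@code ignorableWhitespace} events.
 * This handler wants plain text only, so it treats those events as significant
 * and appends them like ordinary character data. A handler that wants
 * structured output would instead ignore them and rely on element names.
 */
final class PlainTextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Kept on purpose: see the class comment.
        text.append(ch, start, length);
    }

    String text() { return text.toString(); }

    public static void main(String[] args) {
        PlainTextCollector h = new PlainTextCollector();
        // Same assumed events as for "line one<br>line two".
        h.characters("line one".toCharArray(), 0, 8);
        h.ignorableWhitespace(new char[] {'\n'}, 0, 1);
        h.characters("line two".toCharArray(), 0, 8);
        System.out.println(h.text()); // prints "line one" and "line two" on separate lines
    }
}
```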