[jira] Commented: (TIKA-171) New ContentHandler for plain text output that has no problem with missing white space after XHTML block tags

Jukka Zitting (JIRA) Thu, 27 Nov 2008 17:05:07 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651443#action_12651443
 ]


Jukka Zitting commented on TIKA-171:
------------------------------------

Good idea!

Patch committed in revision 721317. Note that I changed the indenting to use 
spaces instead of tabs to match the normal practice in Tika.

Your code also doesn't comply with the Sun Java coding conventions we follow 
(most notably, if statements should always use braces), but for now I left the 
code as-is to avoid conflicts in case you already made some local changes.

We already do something similar in a number of the parsers that explicitly 
output extra whitespace like newlines and tabs at suitable places to make the 
test output look better. This generic approach is IMHO better, and so we should 
remove such per-parser code.

Instead of having XHTMLToTextContentHandler pass through only character events, 
how about if it passed all SAX events and simply inserted extra 
ignorableWhitespace events where appropriate? This would make the different 
features more orthogonal and thus easier to combine in new ways. The current 
functionality could still be achieved by combining the class with 
TextContentHandler or WriteOutContentHandler.
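
To illustrate the pass-through idea, here is a minimal sketch. It is not the 
committed code: the class name BlockWhitespaceHandler and the short tag list 
are assumptions made for the example, and it relies only on the standard SAX 
XMLFilterImpl helper, which forwards every event to the wrapped handler.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical decorator: forwards every SAX event unchanged and adds an
// ignorableWhitespace event after the end tag of each block element.
class BlockWhitespaceHandler extends XMLFilterImpl {

    // Illustrative subset of XHTML block elements, not the full list.
    private static final Set<String> BLOCK = new HashSet<String>(
            Arrays.asList("p", "div", "h1", "h2", "h3", "li", "tr", "br"));

    private static final char[] NEWLINE = { '\n' };

    BlockWhitespaceHandler(ContentHandler downstream) {
        setContentHandler(downstream);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        // Pass the end tag through first, then append the separator.
        super.endElement(uri, localName, qName);
        if (BLOCK.contains(localName)) {
            super.ignorableWhitespace(NEWLINE, 0, NEWLINE.length);
        }
    }
}
```

The wrapped handler decides what to do with the extra events, so text output 
would be obtained by wrapping a handler that writes out both characters and 
ignorableWhitespace events. The sketch assumes namespace-aware parsing, so 
that localName is populated.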

Some other potential improvements:

* Emit a double newline at the end of block elements (but only a single newline 
after </tr> or <br/>) to produce an empty line to separate paragraphs in text 
output. This makes the output easier for manual inspection and might even help 
some post-processors (that for some reason don't know how to use XHTML) to 
better detect structure in the text output.

* Detect if the incoming XHTML document already has such extra whitespace and 
either (partially) replace it with the emitted whitespace or keep it and avoid 
emitting extra whitespace.

* Avoid emitting extra whitespace for empty elements. This way we can keep the 
nicely symmetric property that an empty input stream results in an empty text 
output stream.
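
The first and third points could be sketched roughly as follows. This is an 
illustration under assumptions, not a proposal for the final code: the class 
name ParagraphSeparatorHandler and the tag lists are made up for the example, 
<br/> is treated as an explicit line break and therefore exempted from the 
empty-element rule, and detection of whitespace already present in the input 
(the second point) is left out.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical handler: double newline after non-empty block elements,
// single newline after non-empty rows and after every <br/>.
class ParagraphSeparatorHandler extends XMLFilterImpl {

    // Illustrative subset of block elements that end a paragraph.
    private static final Set<String> PARAGRAPH = new HashSet<String>(
            Arrays.asList("p", "div", "h1", "h2", "h3", "li"));

    // One flag per open element: has it produced any text so far?
    private final Deque<Boolean> hasText = new ArrayDeque<Boolean>();

    ParagraphSeparatorHandler(ContentHandler downstream) {
        setContentHandler(downstream);
    }

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes atts) throws SAXException {
        hasText.push(Boolean.FALSE);
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        if (length > 0) {
            markCurrentElementNonEmpty();
        }
        super.characters(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        super.endElement(uri, localName, qName);
        boolean nonEmpty = hasText.pop();
        if (nonEmpty) {
            // Text inside a child also counts as text of the parent.
            markCurrentElementNonEmpty();
        }
        if ("br".equals(localName)) {
            emit("\n");
        } else if (nonEmpty && PARAGRAPH.contains(localName)) {
            emit("\n\n");
        } else if (nonEmpty && "tr".equals(localName)) {
            emit("\n");
        }
    }

    private void markCurrentElementNonEmpty() {
        if (!hasText.isEmpty()) {
            hasText.pop();
            hasText.push(Boolean.TRUE);
        }
    }

    private void emit(String whitespace) throws SAXException {
        char[] ch = whitespace.toCharArray();
        super.ignorableWhitespace(ch, 0, ch.length);
    }
}
```

With this sketch a document that contains only empty elements (other than 
<br/>) produces no whitespace events at all, which keeps the "empty in, empty 
out" property.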

Anyway, these are all minor tweaks and we probably shouldn't spend too much 
effort tweaking the output. The main purpose of the text output is just to be 
indexable so even the current implementation is perfectly OK. Nice text 
rendering is only a secondary concern.

> New ContentHandler for plain text output that has no problem with missing 
> white space after XHTML block tags
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-171
>                 URL: https://issues.apache.org/jira/browse/TIKA-171
>             Project: Tika
>          Issue Type: Improvement
>          Components: general
>    Affects Versions: 0.2
>            Reporter: Uwe Schindler
>         Attachments: TIKA-171.patch
>
>
> One problem with mapping document content to plain text is incorrect 
> whitespace handling:
> The normal way to parse documents to plain text is to instantiate a parser 
> and pass the SAX events from the parser to a 
> BodyContentHandler(TextContentHandler(Writer)). This appends all output to a 
> writer (see example on web site).
> This works well for dumb parsers that just create a single <p> tag in XHTML 
> output with all content of the document in it (including newlines).
> As soon as a more intelligent parser is used (e.g. the HTML parser) that 
> creates multiple nodes and a feature-rich XHTML document, the problems begin. 
> The TextContentHandler just strips all tags away and only characters() events 
> are forwarded to the Writer. The original document (e.g. an HTML document) 
> may not contain additional whitespace and linefeeds: it is correct and 
> possible to create an XHTML document with all content in one text line, but 
> consisting of several paragraphs. In this case the </p><p> events between 
> paragraphs are stripped and there is no whitespace anymore between the two 
> paragraphs.
> My patch contains a new XHTMLToTextContentHandler, that checks the elements 
> and inserts whitespace to the output depending on the XHTML tag type. HTML 
> block tags like <p/> get a newline at the end, but HTML inline tags do not 
> add whitespace. This mapping is done by a simple Set<String> of tag names 
> extracted from the XHTML 1.0 spec. To make it even better, tables are printed 
> out with white space and tabs between cells.
> With this patch, I am able to correctly index a lot of documents with Lucene.
> The patch also changes some tests to correctly check for the '\n' at the end 
> of plain text streams (which are included because of the single <p>-paragraph 
> around plain text).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
