[ https://issues.apache.org/jira/browse/TIKA-171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved TIKA-171.
--------------------------------
Resolution: Fixed
Fix Version/s: 0.2
Assignee: Jukka Zitting
Since we are going to re-roll the 0.2 release from the current trunk, I think
it makes sense to resolve this issue as fixed for 0.2 in the current state and
perhaps create new issues for the proposed improvements.
Resolving as Fixed for 0.2.
> New ContentHandler for plain text output that has no problem with missing
> white space after XHTML block tags
> ------------------------------------------------------------------------------------------------------------
>
> Key: TIKA-171
> URL: https://issues.apache.org/jira/browse/TIKA-171
> Project: Tika
> Issue Type: Improvement
> Components: general
> Affects Versions: 0.2
> Reporter: Uwe Schindler
> Assignee: Jukka Zitting
> Fix For: 0.2
>
> Attachments: TIKA-171.patch
>
>
> One problem with mapping document content to plain text is incorrect
> whitespace handling:
> The normal way to parse documents to plain text is to instantiate a parser
> and pass the SAX events from the parser to a
> BodyContentHandler(TextContentHandler(Writer)). This appends all output to a
> writer (see example on web site).
> This works well for simple parsers that just create a single <p> tag in XHTML
> output with all content of the document in it (including newlines).
> As soon as a more intelligent parser is used (e.g. the HTML parser) that
> creates multiple nodes and a feature-rich XHTML document, the problems begin.
> The TextContentHandler just strips all tags away, and only characters() events
> are forwarded to the Writer. The original document (e.g. an HTML document)
> does not necessarily contain additional whitespace and linefeeds: it is
> correct and possible to create an XHTML document with all content on one text
> line but consisting of several paragraphs. In this case the </p><p> events
> between paragraphs are stripped, and no whitespace remains between the two
> paragraphs.
> My patch contains a new XHTMLToTextContentHandler that checks the elements
> and inserts whitespace into the output depending on the XHTML tag type. HTML
> block tags like <p> get a newline at the end, but HTML inline tags do not
> add whitespace. This mapping is done by a simple Set<String> of tag names
> extracted from the XHTML 1.0 spec. To make it even better, tables are printed
> out with white space and tabs between cells.
> With this patch, I am able to correctly index a lot of documents with Lucene.
> The patch also changes some tests to correctly check for the '\n' at the end
> of plain text streams (which is included because of the single <p> paragraph
> around plain text).
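
For illustration only, the behaviour described in the quoted issue can be
sketched with plain JDK SAX classes. This is not the code from TIKA-171.patch:
the class names BlockAwareTextHandler and TextExtractionDemo, the short list of
block tags (a small hand-picked subset, not the full XHTML 1.0 list), and the
choice to write a tab after every table cell are all assumptions made for the
example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler; a sketch of the idea, not the patch's class. */
class BlockAwareTextHandler extends DefaultHandler {
    // Assumption: a small illustrative subset of XHTML block-level tags.
    private static final Set<String> BLOCK_TAGS = new HashSet<>(Arrays.asList(
            "p", "div", "h1", "h2", "h3", "li", "pre", "blockquote", "tr"));
    // Assumption: table cells are separated by writing a tab after each cell.
    private static final Set<String> CELL_TAGS = new HashSet<>(Arrays.asList("td", "th"));

    private final StringBuilder out = new StringBuilder();
    private final boolean insertWhitespace;

    BlockAwareTextHandler(boolean insertWhitespace) {
        this.insertWhitespace = insertWhitespace;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text is always forwarded unchanged.
        out.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (!insertWhitespace) {
            return; // mimics a handler that simply strips all tags
        }
        // The default SAX parser is not namespace-aware, so qName holds the tag name.
        if (BLOCK_TAGS.contains(qName)) {
            out.append('\n');
        } else if (CELL_TAGS.contains(qName)) {
            out.append('\t');
        }
        // Inline tags such as <b> or <span> add nothing.
    }

    String text() {
        return out.toString();
    }
}

class TextExtractionDemo {
    /** Parses well-formed XHTML from a string and returns the extracted text. */
    static String toText(String xhtml, boolean insertWhitespace) {
        BlockAwareTextHandler handler = new BlockAwareTextHandler(insertWhitespace);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return handler.text();
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><p>First paragraph.</p><p>Second paragraph.</p>"
                + "<table><tr><td>a</td><td>b</td></tr></table></body></html>";
        // Tags stripped, nothing inserted: the paragraphs run together.
        System.out.println(toText(xhtml, false));
        // Newline after block tags, tab after each table cell.
        System.out.println(toText(xhtml, true));
    }
}
```

With whitespace insertion switched off, the sample input comes out as
"First paragraph.Second paragraph.ab", which is the problem the issue
describes. With it switched on, each paragraph and table row ends in a newline
and each cell is followed by a tab.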