[ https://issues.apache.org/jira/browse/TIKA-268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744436#action_12744436 ]
Uwe Schindler commented on TIKA-268:
------------------------------------
The problem is that the HTML parser strips all tags that are not in
SAFE_ELEMENTS. <TABLE> tags are replaced by <P>, and all inner tags are simply
ignored and not passed through. As all other ContentHandlers (like OOXML,
OpenXML, ...) produce XHTML table tags, the HTML parser should preserve the
table. This can be achieved by modifying the SAFE_ELEMENTS map.
If you then convert the output to text only, it will contain tabs and
newlines, as XHTMLContentHandler emits ignorableWhitespace between table tags
and newlines after HTML block tags.
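
To illustrate the effect, here is a minimal, self-contained sketch. It is not
Tika's actual HtmlParser or XHTMLContentHandler code: the class
TableTextSketch, the safeElements() helper and the exact placement of the tab
(before each cell) are assumptions made for the example, and it uses the JDK's
plain SAX parser on well-formed input. It only shows that once the table tags
are in the safe map, a text conversion can put separators between the cells.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch only; names and behaviour are assumptions, not Tika code.
final class TableTextSketch {

    // Stand-in for a SAFE_ELEMENTS-style map: source tag -> tag that is passed on.
    static Map<String, String> safeElements(boolean preserveTables) {
        Map<String, String> safe = new HashMap<>();
        safe.put("p", "p");
        if (preserveTables) {
            safe.put("table", "table");
            safe.put("tr", "tr");
            safe.put("td", "td");
            safe.put("th", "th");
        } else {
            // Behaviour described in the comment: <table> becomes <p>,
            // inner tags are not in the map and are therefore dropped.
            safe.put("table", "p");
        }
        return safe;
    }

    // Converts well-formed XHTML to plain text, honouring only mapped tags.
    static String toText(String xhtml, Map<String, String> safe) {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                String mapped = safe.get(qName);
                // Assumed convention: a tab in front of every table cell.
                if ("td".equals(mapped) || "th".equals(mapped)) {
                    out.append('\t');
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                String mapped = safe.get(qName);
                // Newline after block-level tags that survived the mapping.
                if ("tr".equals(mapped) || "p".equals(mapped)
                        || "table".equals(mapped)) {
                    out.append('\n');
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<table><tr><td>Apache TIKA</td>"
                + "<td>freaks you out!</td></tr></table>";
        System.out.print(toText(html, safeElements(false)));
        System.out.print(toText(html, safeElements(true)));
    }
}
```

With the table tags missing from the map, the two cells run together as in the
report; with them present, the cell texts are separated by a tab.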
> HTMLParser omits necessary space-characters when parsing table-data
> ---------------------------------------------------------------------
>
> Key: TIKA-268
> URL: https://issues.apache.org/jira/browse/TIKA-268
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.3, 0.4
> Environment: Win, Mac, Lin; Java 5+
> Reporter: Joachim Zittmayr
> Priority: Critical
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> When an HTML file with a table structure is passed to the Tika ecosystem, the
> HTML parser does not output space characters between table cells.
> Example:
> Input
> ------------------------------
> <table>
> <tr>
> <td>Apache LUCENE</td><td>is f****** amazing!</td>
> </tr>
> <tr>
> <td>Apache TIKA</td><td>freaks you out!</td>
> </tr>
> </table>
> ------------------------------
> Output
> ------------------------------
> Apache LUCENEis f****** amazing!
> Apache TIKAfreaks you out!
> ------------------------------
> Unfortunately, I didn't have the time to investigate this within
> HTMLParser.