[jira] Commented: (TIKA-268) HTMLParser ommits necessary space-characters when parsing table-data

Uwe Schindler (JIRA) Tue, 18 Aug 2009 02:51:50 -0700

    [ https://issues.apache.org/jira/browse/TIKA-268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744436#action_12744436 ]


Uwe Schindler commented on TIKA-268:
------------------------------------

The problem is that the HTML parser strips all tags that are not in 
SAFE_ELEMENTS. <TABLE> tags are replaced by <P>, and all inner tags are simply 
ignored and not passed through. As all other ContentHandlers (like OOXML, 
OpenXML, ...) produce XHTML table tags, the HTML parser should preserve the 
table as well. This can be achieved by modifying the SAFE_ELEMENTS map.
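
To illustrate the idea, here is a minimal standalone sketch of a safe-element
lookup before and after adding table entries. It is not Tika's actual source:
the class name SafeElementsSketch, the methods withoutTables(), withTables()
and map(), and the exact entries in the map are all assumptions made for this
example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Standalone model of a safe-element filter; NOT Tika's actual code. */
final class SafeElementsSketch {

    /** Hypothetical mapping as described above: TABLE collapses to p, cells are absent. */
    static Map<String, String> withoutTables() {
        Map<String, String> safe = new HashMap<String, String>();
        safe.put("P", "p");
        safe.put("TABLE", "p"); // the table is replaced by a paragraph
        return safe;
    }

    /** Hypothetical modified mapping: table structure is passed through. */
    static Map<String, String> withTables() {
        Map<String, String> safe = new HashMap<String, String>(withoutTables());
        safe.put("TABLE", "table");
        safe.put("TR", "tr");
        safe.put("TH", "th");
        safe.put("TD", "td");
        return safe;
    }

    /** Returns the output element name, or null when the tag itself is dropped. */
    static String map(Map<String, String> safe, String htmlTag) {
        return safe.get(htmlTag.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println("before: td -> " + map(withoutTables(), "td"));
        System.out.println("after:  td -> " + map(withTables(), "td"));
    }
}
```

With the first map a <TD> tag has no entry, so only its text survives and
adjacent cells run together; with the second map the cell boundaries reach the
downstream handler.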

If you then convert the output to text only, it will contain tabs and 
newlines, as XHTMLContentHandler adds ignorableWhitespace between table tags 
and newlines after HTML block tags.
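
A rough sketch of why this gives readable text, using only the SAX classes
from the JDK. The TextCollector handler and the emitRow() and render() helpers
are hypothetical names for this example, and the event order (a tab between
cells, a newline after each row) is an assumption about what a handler like
XHTMLContentHandler would emit, not a copy of its behaviour.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

final class TableTextSketch {

    /** Collects character data and ignorable whitespace into plain text. */
    static final class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Emits one table row: assumed tab between cells, assumed newline after the row. */
    static void emitRow(ContentHandler handler, String... cells) throws SAXException {
        Attributes none = new AttributesImpl();
        handler.startElement("", "tr", "tr", none);
        for (int i = 0; i < cells.length; i++) {
            if (i > 0) {
                handler.ignorableWhitespace(new char[] {'\t'}, 0, 1);
            }
            handler.startElement("", "td", "td", none);
            char[] chars = cells[i].toCharArray();
            handler.characters(chars, 0, chars.length);
            handler.endElement("", "td", "td");
        }
        handler.endElement("", "tr", "tr");
        handler.ignorableWhitespace(new char[] {'\n'}, 0, 1);
    }

    /** Renders rows of cells to text through the collector. */
    static String render(String[][] rows) {
        TextCollector collector = new TextCollector();
        try {
            for (String[] row : rows) {
                emitRow(collector, row);
            }
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return collector.toString();
    }

    public static void main(String[] args) {
        System.out.print(render(new String[][] {{"Apache TIKA", "freaks you out!"}}));
    }
}
```

If the table tags never reach the handler, no whitespace events are generated
between the cells, which matches the run-together output in the report.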

> HTMLParser ommits necessary space-characters when parsing table-data 
> ---------------------------------------------------------------------
>
>                 Key: TIKA-268
>                 URL: https://issues.apache.org/jira/browse/TIKA-268
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.3, 0.4
>         Environment: Win, Mac, Lin; Java 5+
>            Reporter: Joachim Zittmayr
>            Priority: Critical
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When an HTML file with a table structure is given to the TIKA ecosystem, the 
> HTML parser doesn't output space characters between table cells.
> Example:
> Input
> ------------------------------
> <table>
>   <tr>
>     <td>Apache LUCENE</td><td>is f****** amazing!</td>
>  </tr>
>  <tr>
>     <td>Apache TIKA</td><td>freaks you out!</td>
>  </tr>
> </table>
> ------------------------------
> Output
> ------------------------------
> Apache LUCENEis f****** amazing!
> Apache TIKAfreaks you out!
> ------------------------------
> Unfortunately, I didn't have the time to investigate this within 
> HTMLParser.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
