HTMLParser omits necessary space characters when parsing table data
-------------------------------------------------------------------
Key: TIKA-268
URL: https://issues.apache.org/jira/browse/TIKA-268
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 0.4, 0.5
Environment: Windows, Mac, Linux; Java 5+
Reporter: Joachim Zittmayr
Priority: Critical
When Tika parses an HTML file that contains a table, the HTML parser does not
emit any whitespace between adjacent table cells, so the text of neighboring
cells runs together in the extracted output.
Example:
Input
------------------------------
<table>
<tr>
<td>Apache LUCENE</td><td>is f****** amazing!</td>
</tr>
<tr>
<td>Apache TIKA</td><td>freaks you out!</td>
</tr>
</table>
------------------------------
Output
------------------------------
Apache LUCENEis f****** amazing!
Apache TIKAfreaks you out!
------------------------------
Unfortunately, I did not have time to investigate the HTMLParser code itself.
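The expected behavior can be illustrated outside Tika. The following is a
minimal sketch using only Python's standard html.parser module; it is not
Tika's code and says nothing about how HTMLParser is implemented. The class
name CellTextExtractor and the function extract_table_text are hypothetical
names chosen for this illustration, and the sketch assumes well-formed,
non-nested tables. It joins the cells of each row with a single space, which
is the separation the output above is missing.

```python
from html.parser import HTMLParser


class CellTextExtractor(HTMLParser):
    """Hypothetical helper: collects table text, one line per row,
    with a single space between the cells of a row."""

    def __init__(self):
        super().__init__()
        self._rows = []          # finished rows, already joined into strings
        self._cells = None       # cells of the row currently being read
        self._cell_text = None   # text chunks of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag in ("td", "th") and self._cells is not None:
            self._cell_text = []

    def handle_data(self, data):
        # Text outside of a cell (e.g. newlines between tags) is ignored.
        if self._cell_text is not None:
            self._cell_text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell_text is not None:
            self._cells.append("".join(self._cell_text).strip())
            self._cell_text = None
        elif tag == "tr" and self._cells is not None:
            # This join is the step that keeps neighboring cells apart.
            self._rows.append(" ".join(self._cells))
            self._cells = None


def extract_table_text(html):
    """Return the table text with cells separated by spaces, rows by newlines."""
    parser = CellTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser._rows)


sample = """<table>
<tr>
<td>Apache LUCENE</td><td>is f****** amazing!</td>
</tr>
<tr>
<td>Apache TIKA</td><td>freaks you out!</td>
</tr>
</table>"""

print(extract_table_text(sample))
# Apache LUCENE is f****** amazing!
# Apache TIKA freaks you out!
```

A single space is only one possible separator; a tab or a newline per cell
would also keep the cell contents apart.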