HTMLParser omits necessary space characters when parsing table data
-------------------------------------------------------------------

                 Key: TIKA-268
                 URL: https://issues.apache.org/jira/browse/TIKA-268
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.4, 0.5
         Environment: Win, Mac, Lin; Java 5+
            Reporter: Joachim Zittmayr
            Priority: Critical


When an HTML file containing a table is passed to Tika, the HTML parser does
not output any space characters between adjacent table cells, so the text of
neighbouring cells runs together.

Example:

Input
------------------------------
<table>
  <tr>
    <td>Apache LUCENE</td><td>is f****** amazing!</td>
  </tr>
  <tr>
    <td>Apache TIKA</td><td>freaks you out!</td>
  </tr>
</table>
------------------------------

Output
------------------------------

Apache LUCENEis f****** amazing!

Apache TIKAfreaks you out!

------------------------------

Unfortunately, I didn't have time to investigate this within HTMLParser.
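
To illustrate what I suspect is happening, here is a minimal sketch that uses
only the JDK's SAX classes, not Tika itself. The class CellSpacingSketch, the
TextCollector handler and the separateCells flag are hypothetical names made up
for this example; this is not the HTMLParser code and may not match its actual
cause. A handler that only collects characters() events glues cell text
together; appending a space when a td or th element ends keeps the cells apart.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class CellSpacingSketch {

    /** Hypothetical handler: collects text, optionally separating table cells. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final boolean separateCells;

        TextCollector(boolean separateCells) {
            this.separateCells = separateCells;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Character data is appended as-is; nothing marks cell boundaries.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // Assumed fix: emit one space whenever a table cell is closed.
            if (separateCells && ("td".equals(qName) || "th".equals(qName))) {
                text.append(' ');
            }
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Parses well-formed XHTML and returns the collected text. */
    static String extract(String xhtml, boolean separateCells) throws Exception {
        TextCollector handler = new TextCollector(separateCells);
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)),
                handler);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<table><tr><td>Apache TIKA</td>"
                + "<td>freaks you out!</td></tr></table>";
        System.out.println(extract(xhtml, false)); // cells run together
        System.out.println(extract(xhtml, true));  // cells separated by a space
    }
}
```

The sketch feeds well-formed XHTML to a plain XML parser, so it says nothing
about how Tika handles real-world HTML; it only shows where a separator would
have to be emitted.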
