Louic Vermeer created TIKA-3032:
-----------------------------------

             Summary: Table cells below a colspan property are shifted
                 Key: TIKA-3032
                 URL: https://issues.apache.org/jira/browse/TIKA-3032
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.23
         Environment: Linux neon 5.3.18-1-MANJARO #1 SMP PREEMPT Wed Dec 18 
18:34:35 UTC 2019 x86_64 GNU/Linux

openjdk 13.0.2 2020-01-14
OpenJDK Runtime Environment (build 13.0.2+8)
OpenJDK 64-Bit Server VM (build 13.0.2+8, mixed mode)
            Reporter: Louic Vermeer
         Attachments: table.html

When a colspan attribute is used in HTML or XML input, cells in the rows below 
the colspan are shifted to the left. It is therefore no longer possible to 
reconstruct which column a value belongs to after parsing.

In the attached example, the labels are no longer above the correct column. 
The example was inspired by the tables in the XBRL data of SEC filings. See, 
for example, the following link (22MB!) to a 10-K filing: 
https://www.sec.gov/Archives/edgar/data/1410636/000141063619000041/0001410636-19-000041.txt

Suggested solution:

Tika could insert empty cells after the cell with the colspan. While this may 
not be perfect, it would at least prevent the cells that follow from shifting 
position and ending up in the wrong column. The ideal solution (for me at 
least) would be to preserve the colspan information in XML output and to 
insert extra tabs in TXT output to keep the columns aligned.
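
To illustrate the padding idea, here is a minimal sketch in Python using only 
the standard library's html.parser. It is not Tika code and says nothing about 
how Tika's parsers are structured; the names ColspanPaddingParser, 
table_to_tsv and SAMPLE are hypothetical, and the sample table is made up 
rather than taken from the attached table.html. The sketch assumes explicit 
closing tags and ignores rowspan and nested tables.

```python
from html.parser import HTMLParser


class ColspanPaddingParser(HTMLParser):
    """Collect table rows, padding each colspan cell with empty cells."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being built, or None
        self._text = None  # text pieces of the open cell, or None
        self._span = 1     # colspan of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            value = dict(attrs).get("colspan") or "1"
            # Fall back to 1 for missing, malformed or zero values.
            self._span = max(int(value), 1) if value.isdigit() else 1
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            if self._row is not None and self._text is not None:
                self._row.append("".join(self._text).strip())
                # One empty cell for every extra column the cell spans.
                self._row.extend([""] * (self._span - 1))
            self._text = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._text = None


def table_to_tsv(html):
    """Return the table as tab-separated text, one line per row."""
    parser = ColspanPaddingParser()
    parser.feed(html)
    parser.close()
    return "\n".join("\t".join(row) for row in parser.rows)


# Made-up sample: the "Revenue" header spans two columns.
SAMPLE = """
<table>
  <tr><th>Name</th><th colspan="2">Revenue</th><th>Notes</th></tr>
  <tr><td>A</td><td>1</td><td>2</td><td>x</td></tr>
</table>
"""

if __name__ == "__main__":
    print(table_to_tsv(SAMPLE))
```

With the padding in place, both rows of the sample have four cells, so "Notes" 
stays in the same column as "x". Without it, the header row would have only 
three cells and "Notes" would line up with the second revenue value.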

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
