Louic Vermeer created TIKA-3032:
-----------------------------------
Summary: Table cells below a colspan property are shifted
Key: TIKA-3032
URL: https://issues.apache.org/jira/browse/TIKA-3032
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.23
Environment: Linux neon 5.3.18-1-MANJARO #1 SMP PREEMPT Wed Dec 18
18:34:35 UTC 2019 x86_64 GNU/Linux
openjdk 13.0.2 2020-01-14
OpenJDK Runtime Environment (build 13.0.2+8)
OpenJDK 64-Bit Server VM (build 13.0.2+8, mixed mode)
Reporter: Louic Vermeer
Attachments: table.html
When a colspan attribute is used in HTML or XML input, cells in the rows below
the colspan are shifted to the left. As a result, it is no longer possible to
reconstruct which column the values belong to after parsing.
In the attached example, the labels are no longer above the correct column.
This example was inspired by the tables in the XBRL data of SEC filings. See,
for example, the following link (22 MB!) to a 10-K filing:
https://www.sec.gov/Archives/edgar/data/1410636/000141063619000041/0001410636-19-000041.txt
Suggested solution:
Tika could insert empty cells after the cell with the colspan. While this may
not be perfect, it would at least prevent the cells that follow from shifting
position and ending up in the wrong column. The ideal solution (for me at
least) would be to preserve the colspan information in XML output and to
insert extra tabs in TXT output to keep the columns aligned.
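
To illustrate the padding idea, here is a minimal sketch in Python using only the standard library. It is not Tika code and does not reflect how Tika's parsers are structured. The names PaddedTableParser, table_to_tsv and SAMPLE are hypothetical, and SAMPLE is a made-up table, not the attached table.html. The sketch emits colspan - 1 empty cells after each spanning cell, so tab-separated output keeps later cells in their original column. It ignores rowspan.

```python
from html.parser import HTMLParser


class PaddedTableParser(HTMLParser):
    """Collect table rows, padding each colspan cell with empty cells."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being built, or None
        self._text = None   # text fragments of the open cell, or None
        self._span = 1      # colspan of the open cell

    def _close_cell(self):
        # Finish the open cell (if any) and pad it to its full width.
        if self._text is not None and self._row is not None:
            self._row.append("".join(self._text).strip())
            # Empty cells keep the following cells in the right column.
            self._row.extend([""] * (self._span - 1))
        self._text = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()
            value = dict(attrs).get("colspan") or "1"
            # Treat missing, zero, or malformed colspan as 1.
            self._span = max(1, int(value)) if value.isdigit() else 1
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr":
            self._close_row()


def table_to_tsv(html):
    """Return the table rows as tab-separated lines."""
    parser = PaddedTableParser()
    parser.feed(html)
    parser.close()
    return "\n".join("\t".join(row) for row in parser.rows)


# Hypothetical input: the first header cell spans two columns.
SAMPLE = """
<table>
  <tr><th colspan="2">2019</th><th>2018</th></tr>
  <tr><td>10</td><td>20</td><td>30</td></tr>
</table>
"""

if __name__ == "__main__":
    print(table_to_tsv(SAMPLE))
```

With this padding, the "2018" header stays in the third column, above the value 30, instead of shifting into the second column.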