[ 
https://issues.apache.org/jira/browse/TIKA-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Louic Vermeer updated TIKA-3032:
--------------------------------
    Description: 
When a colspan attribute is used in HTML or XML input, cells to the right of the 
spanning cell are shifted to the left. The structure of the table is therefore 
compromised, and it is no longer possible to reconstruct which cells belong to 
which column.

In the attached minimal example, the labels are no longer above the correct 
column after parsing to XML or plain text. This example was inspired by the 
tables in SEC filings; see, for example, the following link (22MB!) to a 10-K 
filing: 
[https://www.sec.gov/Archives/edgar/data/1410636/000141063619000041/0001410636-19-000041.txt]

Suggested solution:

Tika could insert empty cells after the cell with the colspan. While this may 
not be perfect, it would at least prevent the cells that follow from shifting 
position and ending up in the wrong column. The ideal solution (for me at 
least) would be to preserve the colspan information in XML output and to 
insert extra tabs in TXT output to keep the columns aligned.
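
To illustrate the padding idea, here is a rough sketch in Python. It is not Tika 
code and says nothing about how Tika's parsers are structured; it uses the 
standard-library html.parser module, and the names PaddedTableParser, 
table_to_tsv and SAMPLE are made up for this example. SAMPLE is a hypothetical 
table, not the attached table.html. The sketch assumes every cell has an 
explicit closing tag and ignores rowspan and nested tables.

```python
from html.parser import HTMLParser


class PaddedTableParser(HTMLParser):
    """Collects table rows and pads every colspan cell with empty cells."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being built, or None
        self._text = None   # text chunks of the open cell, or None
        self._span = 1      # colspan of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._text = []
            try:
                span = int(dict(attrs).get("colspan") or 1)
            except ValueError:
                span = 1  # unparsable colspan: treat as a normal cell
            self._span = max(span, 1)

    def handle_data(self, data):
        # Only keep text that appears inside an open cell.
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._text is not None:
            self._row.append("".join(self._text).strip())
            # The padding step: one empty cell per extra spanned column.
            self._row.extend([""] * (self._span - 1))
            self._text = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._text = None


def table_to_tsv(html):
    """Return the table rows as tab-separated lines with aligned columns."""
    parser = PaddedTableParser()
    parser.feed(html)
    parser.close()
    return "\n".join("\t".join(row) for row in parser.rows)


# Hypothetical input: the first header cell spans two columns.
SAMPLE = """
<table>
  <tr><th colspan="2">Name</th><th>2019</th><th>2018</th></tr>
  <tr><td>Revenue</td><td>USD</td><td>100</td><td>90</td></tr>
</table>
"""

print(table_to_tsv(SAMPLE))
```

With the padding in place, both rows of the sample have four tab-separated 
fields, so "2019" and "2018" stay above "100" and "90". Without the extend() 
call the header row would have only three fields and the labels would shift 
one column to the left, which is the behaviour described above.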

 

  was:
When a colspan property is used in html or xml input, cells to the right of the 
colspan are shifted to the left. Therefore the structure of the table gets 
compromised, and it is no longer possible to reconstruct which cells belong to 
which column.

In the attached example, the labels are no longer above the correct column 
after parsing to XML or plain text. This example was inspired by the tables in 
the sec filings XBRL data. See for example the following link (22MB!) to a 10-K 
filing: 
[https://www.sec.gov/Archives/edgar/data/1410636/000141063619000041/0001410636-19-000041.txt]

Suggested solution:

Tika could insert empty cells behind the cell with the colspan. While this may 
not be perfect, at least it would prevent cells after it from shifting position 
and ending up in the wrong column. The ideal solution (for me at least) would 
be to preserve the colspan information in XML output and to insert extra tabs 
in TXT output to keep the columns aligned.

 


> Table cells below a colspan property are shifted
> ------------------------------------------------
>
>                 Key: TIKA-3032
>                 URL: https://issues.apache.org/jira/browse/TIKA-3032
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.23
>         Environment: Linux neon 5.3.18-1-MANJARO #1 SMP PREEMPT Wed Dec 18 
> 18:34:35 UTC 2019 x86_64 GNU/Linux
> openjdk 13.0.2 2020-01-14
> OpenJDK Runtime Environment (build 13.0.2+8)
> OpenJDK 64-Bit Server VM (build 13.0.2+8, mixed mode)
>            Reporter: Louic Vermeer
>            Priority: Minor
>         Attachments: table.html
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h


