[ 
https://issues.apache.org/jira/browse/TIKA-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025855#comment-17025855
 ] 

Louic Vermeer edited comment on TIKA-3032 at 1/29/20 1:06 PM:
--------------------------------------------------------------

Not sure if I should have marked this as critical, but it does lead to loss of 
information when dealing with tables so I think it is important.


was (Author: louic):
Not sure if I should mark this as critical, but it does lead to loss of 
information when dealing with tables so I think it is important.

> Table cells below a colspan property are shifted
> ------------------------------------------------
>
>                 Key: TIKA-3032
>                 URL: https://issues.apache.org/jira/browse/TIKA-3032
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.23
>         Environment: Linux neon 5.3.18-1-MANJARO #1 SMP PREEMPT Wed Dec 18 
> 18:34:35 UTC 2019 x86_64 GNU/Linux
> openjdk 13.0.2 2020-01-14
> OpenJDK Runtime Environment (build 13.0.2+8)
> OpenJDK 64-Bit Server VM (build 13.0.2+8, mixed mode)
>            Reporter: Louic Vermeer
>            Priority: Critical
>         Attachments: table.html
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> When a colspan property is used in HTML or XML input, cells in the rows below 
> the colspan are shifted to the left. Therefore it is no longer possible to 
> reconstruct which column the values belong to after parsing.
> In the attached example, the labels are no longer above the correct column. 
> This example was inspired by the tables in the SEC filings XBRL data. See for 
> example the following link (22MB!) to a 10-K filing: 
> https://www.sec.gov/Archives/edgar/data/1410636/000141063619000041/0001410636-19-000041.txt
> Suggested solution:
> Tika could insert empty cells after the cell with the colspan. While this 
> may not be perfect, at least it would prevent cells after it from shifting 
> position and ending up in the wrong column. The ideal solution (for me at 
> least) would be to preserve the colspan information in XML output and to 
> insert extra tabs in TXT output to keep the columns aligned.
>  
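
The padding idea suggested above can be illustrated with a minimal sketch. It
is not Tika code and does not reflect how Tika's parsers are implemented: it
uses only Python's standard html.parser module, and the names
PaddedTableParser, table_to_tsv and SAMPLE are hypothetical. SAMPLE is a
made-up table, not the attached table.html. The sketch assumes simple,
non-nested tables and ignores rowspan.

```python
from html.parser import HTMLParser


class PaddedTableParser(HTMLParser):
    """Collect table rows, padding each colspan cell with empty cells."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being built, or None
        self._text = None   # text fragments of the current cell, or None
        self._span = 1      # colspan of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._text = []
            value = dict(attrs).get("colspan") or "1"
            # Fall back to 1 for missing, empty or non-numeric values.
            self._span = max(1, int(value)) if value.isdigit() else 1

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._text is not None:
            self._row.append("".join(self._text).strip())
            # Pad with empty cells so later cells keep their column index.
            self._row.extend([""] * (self._span - 1))
            self._text = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_tsv(html):
    """Return the table as tab-separated text, one line per row."""
    parser = PaddedTableParser()
    parser.feed(html)
    parser.close()
    return "\n".join("\t".join(row) for row in parser.rows)


# Hypothetical input: the first header cell spans two columns.
SAMPLE = """
<table>
  <tr><th colspan="2">Item</th><th>Value</th></tr>
  <tr><td>a</td><td>b</td><td>1</td></tr>
</table>
"""

if __name__ == "__main__":
    print(table_to_tsv(SAMPLE))
```

Without the padding step, the header row would contain only two cells, so
"Value" would line up with "b" instead of "1". With it, the header row gets an
extra empty cell after "Item" and both rows have three columns.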



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
