Rens Huizenga created PDFBOX-4293:
-------------------------------------

             Summary: PDFBox does not align "columns" properly
                 Key: PDFBOX-4293
                 URL: https://issues.apache.org/jira/browse/PDFBOX-4293
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 2.0.11
         Environment: Windows 7 64 
            Reporter: Rens Huizenga
             Fix For: 2.0.11
         Attachments: PDFconversieTekst CONVERTIO.txt, PDFconversieTekst.pdf, 
PDFconversieTekst.pdf.txt, PDFconversieTekst.xlsx

 I need to load data from PDFs into a database. I developed a parser that reads
.txt files, but the original data is available only as PDFs, so the .txt files
have to be created by converting the PDFs with Tika. After conversion, the .txt
data is no longer aligned with the columns in the PDF. The Tika website says to
check whether the problem also occurs in PDFBox, so I checked, and PDFBox has
the same issue.

These lines of PDF data:
a  b  c  d  e   
a  b  c  d      e

are both presented as
a  b  c  d  e

in the text file, so that numbers, for example, end up in the wrong "column".
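
The effect can be illustrated with a minimal Python sketch. It is not PDFBox or
Tika code: the x-coordinates, the char_width parameter, and the function names
collapse_line and layout_line are all assumptions made up for this example. It
only shows why joining words with a fixed separator loses the column, while
padding each word out to the position implied by its x-coordinate keeps it.

```python
def collapse_line(words):
    """Join words in x order with a single separator (column info is lost)."""
    return " ".join(text for _, text in sorted(words))


def layout_line(words, char_width=1.0):
    """Place each (x, text) word at the character column its x implies.

    char_width is a hypothetical "width of one character" in page units.
    """
    line = ""
    for x, text in sorted(words):
        col = int(round(x / char_width))
        # Never overwrite earlier text: keep at least one space between words.
        if line and col <= len(line):
            col = len(line) + 1
        line = line.ljust(col) + text
    return line


# Hypothetical positions for the two example rows; only "e" differs.
row1 = [(0, "a"), (3, "b"), (6, "c"), (9, "d"), (12, "e")]
row2 = [(0, "a"), (3, "b"), (6, "c"), (9, "d"), (16, "e")]

if __name__ == "__main__":
    print(repr(collapse_line(row1)), repr(collapse_line(row2)))
    print(repr(layout_line(row1)), repr(layout_line(row2)))
```

With the collapsing approach both rows come out identical, which is the
behaviour described above. With the position-based approach the "e" of the
second row stays further to the right, so a downstream parser can still tell
the columns apart.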

Unfortunately I cannot share business documents, but I have created an example 
in Excel, saved it as a PDF, and converted it to .txt. See the attachments.

In addition, I converted the test set online with Convertio.co. Their result is 
as expected, with enough spaces between the words/numbers to recognise the 
columns.

 



