Lee van Hooff created PDFBOX-3435:
-------------------------------------

             Summary: Text extraction - words on same line detection failing in 2.x
                 Key: PDFBOX-3435
                 URL: https://issues.apache.org/jira/browse/PDFBOX-3435
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 2.0.2, 2.0.1, 2.0.0
            Reporter: Lee van Hooff
         Attachments: text-extraction-issues.pdf

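Extracting a line of text as it appears in the PDF no longer works in the 2.x
versions of PDFBox. With -sort, 1.8.4 emits each table row on one line, while
2.0.2 emits the description on a separate line above the rest of the row.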

java -jar pdfbox-app-1.8.4.jar ExtractText -console -sort ~/Desktop/text-extraction-issues.pdf

results in:
{noformat}
. . .
Your Code        Our Code                            Description                                              Qty    Price Ex   Total Ex  
11SP             100129630       IRWIN VICE-GRIP 11 C-CLAMP SWIVEL PAD           4         00.00      000.00
IR-0352          100094584       IRWIN 600MM TOOL BAG                            1         00.00       00.00
EM81.9           100088913       EMPIRE TORPEDO LEVEL ALUMINIUM                  1         00.00       00.00
20566-618R       100023443       LENOX RECIPRO BLADE 150X20X0.9MM 18TPI 5P        3          0.00       00.00
. . .
{noformat}

while
java -jar pdfbox-app-2.0.2.jar ExtractText -console -sort ~/Desktop/text-extraction-issues.pdf

results in:
{noformat}
. . .
Your Code        Our Code                            Description                                              Qty    Price Ex   Total Ex  
IRWIN VICE-GRIP 11 C-CLAMP SWIVEL PAD    
11SP             100129630              4         00.00      000.00
IRWIN 600MM TOOL BAG                     
IR-0352          100094584              1         00.00       00.00
EMPIRE TORPEDO LEVEL ALUMINIUM           
EM81.9           100088913              1         00.00       00.00
LENOX RECIPRO BLADE 150X20X0.9MM 18TPI 5P
20566-618R       100023443              3          0.00       00.00
. . .
{noformat}
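
The difference between the two outputs could be turned into a small regression
check. The following is only a sketch: the class name SameLineCheck and the
helper onSameLine are hypothetical, and the sample strings are copied from the
first table row of the two outputs above rather than produced by PDFBox. It
checks whether the item code and its description end up on one output line.

```java
class SameLineCheck {

    // Returns true if at least one line of the extracted text contains
    // every one of the given fragments.
    public static boolean onSameLine(String extractedText, String... fragments) {
        for (String line : extractedText.split("\\R")) {
            boolean all = true;
            for (String fragment : fragments) {
                if (!line.contains(fragment)) {
                    all = false;
                    break;
                }
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // First table row as printed by 1.8.4 (one line).
        String from184 =
            "11SP             100129630       IRWIN VICE-GRIP 11 C-CLAMP SWIVEL PAD"
            + "           4         00.00      000.00\n";

        // Same row as printed by 2.0.2 (description on its own line).
        String from202 =
            "IRWIN VICE-GRIP 11 C-CLAMP SWIVEL PAD    \n"
            + "11SP             100129630              4         00.00      000.00\n";

        System.out.println(onSameLine(from184, "11SP", "IRWIN VICE-GRIP")); // true
        System.out.println(onSameLine(from202, "11SP", "IRWIN VICE-GRIP")); // false
    }
}
```

In a real test, the text would presumably come from running the extraction on
text-extraction-issues.pdf with sorting enabled instead of from hard-coded
strings.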


