Lee van Hooff created PDFBOX-3435:
-------------------------------------

             Summary: Text extraction - words on same line detection failing in 2.x
                 Key: PDFBOX-3435
                 URL: https://issues.apache.org/jira/browse/PDFBOX-3435
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 2.0.2, 2.0.1, 2.0.0
            Reporter: Lee van Hooff
         Attachments: text-extraction-issues.pdf

The ability to extract a line of text as it appears in the PDF is no longer working in the 2.x version of pdfbox.

java -jar pdfbox-app-1.8.4.jar ExtractText -console -sort ~/Desktop/text-extraction-issues.pdf

results in:

{noformat}
. . .
Your Code Our Code Description Qty Price Ex Total Ex
11SP 100129630 IRWIN VICE-GRIP 11 C-CLAMP SWIVEL PAD 4 00.00 000.00
IR-0352 100094584 IRWIN 600MM TOOL BAG 1 00.00 00.00
EM81.9 100088913 EMPIRE TORPEDO LEVEL ALUMINIUM 1 00.00 00.00
20566-618R 100023443 LENOX RECIPRO BLADE 150X20X0.9MM 18TPI 5P 3 0.00 00.00
. . .
{noformat}

while

java -jar pdfbox-app-2.0.2.jar ExtractText -console -sort ~/Desktop/text-extraction-issues.pdf

results in:

{noformat}
. . .
Your Code Our Code Description Qty Price Ex Total Ex
IRWIN VICE-GRIP 11 C-CLAMP SWIVEL PAD 11SP 100129630 4 00.00 000.00
IRWIN 600MM TOOL BAG IR-0352 100094584 1 00.00 00.00
EMPIRE TORPEDO LEVEL ALUMINIUM EM81.9 100088913 1 00.00 00.00
LENOX RECIPRO BLADE 150X20X0.9MM 18TPI 5P 20566-618R 100023443 3 0.00 00.00
. . .
{noformat}
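
One way this kind of reordering can arise is sketched below. It is a toy model only, not PDFBox's code: the class LineGroupingSketch, the Fragment type, the coordinates, and the tolerance value are all hypothetical, and the actual cause in 2.x has not been confirmed. The assumption is that the description column sits on a slightly different baseline than the other columns. If any baseline difference counts as a new line, the description is emitted ahead of the codes; if baselines within a small tolerance are grouped first and then ordered by x, the row comes out as it appears on the page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy model of "same line" detection; NOT PDFBox's implementation.
class LineGroupingSketch {

    /** A text fragment with a hypothetical left edge (x) and baseline (y, growing downward). */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups fragments into lines. A fragment starts a new line when its baseline
     * lies more than `tolerance` below the first baseline of the current line.
     * Fragments inside one line are ordered left to right.
     */
    static List<String> groupLines(List<Fragment> fragments, double tolerance) {
        List<Fragment> byY = new ArrayList<>(fragments);
        byY.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<String> lines = new ArrayList<>();
        List<Fragment> current = new ArrayList<>();
        double lineY = 0.0;
        for (Fragment f : byY) {
            if (!current.isEmpty() && f.y - lineY > tolerance) {
                lines.add(joinByX(current));
                current.clear();
            }
            if (current.isEmpty()) {
                lineY = f.y; // baseline of the first fragment on this line
            }
            current.add(f);
        }
        if (!current.isEmpty()) {
            lines.add(joinByX(current));
        }
        return lines;
    }

    /** Orders the fragments of one line by x and joins them with single spaces. */
    private static String joinByX(List<Fragment> line) {
        List<Fragment> sorted = new ArrayList<>(line);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.x));
        StringBuilder sb = new StringBuilder();
        for (Fragment f : sorted) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    /** One row from the report, with made-up coordinates: the description sits 0.5 units higher. */
    static List<Fragment> sampleRow() {
        List<Fragment> row = new ArrayList<>();
        row.add(new Fragment("IR-0352", 50, 100.0));
        row.add(new Fragment("100094584", 120, 100.0));
        row.add(new Fragment("IRWIN 600MM TOOL BAG", 200, 99.5));
        row.add(new Fragment("1", 400, 100.0));
        row.add(new Fragment("00.00", 450, 100.0));
        row.add(new Fragment("00.00", 520, 100.0));
        return row;
    }

    public static void main(String[] args) {
        System.out.println("strict:   " + groupLines(sampleRow(), 0.0));
        System.out.println("tolerant: " + groupLines(sampleRow(), 1.0));
    }
}
```

With a tolerance of 0.0 the description is pulled out ahead of the product codes, which is the token order seen in the 2.0.2 output; with a tolerance of 1.0 the row reads in page order, as in the 1.8.4 output. Whether the extra line break appears in the real output is not something this sketch can tell.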