[ https://issues.apache.org/jira/browse/PDFBOX-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965787#comment-14965787 ]
Maruan Sahyoun commented on PDFBOX-2069: ---------------------------------------- ideally as PDFBox :-) > PDF's with Tc before Tm are getting incorrect spacing in PDFTextArea > -------------------------------------------------------------------- > > Key: PDFBOX-2069 > URL: https://issues.apache.org/jira/browse/PDFBOX-2069 > Project: PDFBox > Issue Type: Bug > Components: Text extraction > Affects Versions: 1.8.5 > Environment: Windows > Reporter: Joel Hirsh > Labels: pdfbox > Attachments: PDFBOX-2609-visible.pdf, PDFBOX-2609.pdf, > PDFBox-2609-patch.zip > > Original Estimate: 1h > Remaining Estimate: 1h > > Attached PDF is getting incorrect spacing using example program > ExtractTextByArea.java as follows: > Text in the area:java.awt.Rectangle[x=10,y=500,width=600,height=200] > Transaction Activity > Date D e s c r i p t i o n Deposits W i t h d r a w a l s > 0 4 / 0 8 B E G I N N I N G BALANCE > 04 / 0 8 W I THDRAWAL - ATM 3 1 1 7 3 0 0 . 0 0 - > 62 M I L L H I L L ROAD WOODSTOCK N Y > 04 / 1 0 W I THDRAWAL - ACH 2 0 0 . 0 0 - > HUMAN RIGHTS WAT-B I L L PAYMT > 04 / 12 C K # 1 2 7 3 11 0 . 0 0 - > 0 4 / 1 5 W I THDRAWAL - ACH 2 0 2 . 5 7 - > NEW SOUTH INSURA -B I LL PAYMT > 04 / 1 5 W I THDRAWAL - ACH 3 6 . 2 6 - > WASTE CONNECTION-BILL PAYMT > 04 / 1 7 W I THDRAWAL - ACH 71 2 . 0 0 - > N PYMT T > 04 / 1 8 W I THDRAWAL - ACH 2958 9 . 0 0 3 > N PYMT T > 04 / 1 9 W I THDRAWAL - ACH 76 8 . 1 2 - > I believe this because PDF streams with Tc before Tm are having the matrix > applied to the Tc, which is contrary to my experience with graphic pipelines. > Most PDF streams seem to to have Tc after Tm, and thus do not hit this > situation. > I have attached a patch to two files that corrects the problem for this file, > and also works correctly on my test suite of about 40 files from other > sources. > The result for the attached file now becomes: > Text in the area:java.awt.Rectangle[x=10,y=500,width=600,height=200] > Transaction Activity > Date Description Deposits Withdrawals > 04/08 BEGINNING BALANCE > 04/08 WITHDRAWAL-ATM 3 117 300.00- > 62 MILL HILL ROAD WOODSTOCK NY > 04/10 WITHDRAWAL-ACH 200.00- > HUMAN RIGHTS WAT-BILL PAYMT > 04/12 CK# 1273 110.00- > 04/15 WITHDRAWAL-ACH 202.57- > NEW SOUTH INSURA-BILL PAYMT > 04/15 WITHDRAWAL-ACH 36.26- > WASTE CONNECTION-BILL PAYMT > 04/17 WITHDRAWAL-ACH 712.00- > N PYMT T > 04/18 WITHDRAWAL-ACH 29589.00 3 > N PYMT T > 04/19 WITHDRAWAL-ACH 768.12- -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org