[ 
https://issues.apache.org/jira/browse/PDFBOX-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965780#comment-14965780
 ] 

Tilman Hausherr commented on PDFBOX-2069:
-----------------------------------------

Yes... I also had another thought - let's say we output "PDFBox" with a big Tc 
value, then it would look like this:

{code}
P   D   F   B   o   x
{code}

How should this appear in text extraction? As "PDFBox" or as
"P   D   F   B   o   x"?
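
To make the question concrete, here is a minimal sketch in plain Java. It is
not PDFBox code: the class CharSpacingSketch, its Glyph type, and the methods
layout and extract are hypothetical names, and the glyph width, Tc value and
space threshold are made-up numbers. It lays out "PDFBox" with a large
character spacing and then extracts the text twice: once treating every wide
gap as a word space, and once discounting the known Tc value before comparing
the gap with the threshold.

```java
import java.util.ArrayList;
import java.util.List;

public class CharSpacingSketch {

    // Hypothetical glyph: the character, its x origin, and its own advance
    // width (the width does NOT include the character spacing Tc).
    static final class Glyph {
        final char c;
        final double x;
        final double width;

        Glyph(char c, double x, double width) {
            this.c = c;
            this.x = x;
            this.width = width;
        }
    }

    // Place glyphs left to right; after each glyph the pen moves by the
    // glyph width plus charSpacing (the role Tc plays in a content stream).
    static List<Glyph> layout(String s, double glyphWidth, double charSpacing) {
        List<Glyph> out = new ArrayList<>();
        double x = 0;
        for (char c : s.toCharArray()) {
            out.add(new Glyph(c, x, glyphWidth));
            x += glyphWidth + charSpacing;
        }
        return out;
    }

    // Emit a space whenever the gap between two glyphs, minus the spacing
    // we choose to ignore, is larger than spaceThreshold.
    static String extract(List<Glyph> glyphs, double spaceThreshold,
                          double ignoredSpacing) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                double gap = g.x - (prev.x + prev.width) - ignoredSpacing;
                if (gap > spaceThreshold) {
                    sb.append(' ');
                }
            }
            sb.append(g.c);
            prev = g;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed numbers: glyph width 6, Tc 12, space threshold 3.
        List<Glyph> glyphs = layout("PDFBox", 6.0, 12.0);
        System.out.println(extract(glyphs, 3.0, 0.0));   // purely gap-based
        System.out.println(extract(glyphs, 3.0, 12.0));  // Tc discounted
    }
}
```

With these numbers the first call prints "P D F B o x" and the second prints
"PDFBox", so the answer depends on whether the extractor treats Tc as part of
the glyph advance or as visible whitespace.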

> PDF's with Tc before Tm are getting incorrect spacing in PDFTextArea
> --------------------------------------------------------------------
>
>                 Key: PDFBOX-2069
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2069
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.5
>         Environment: Windows
>            Reporter: Joel Hirsh
>              Labels: pdfbox
>         Attachments: PDFBOX-2609-visible.pdf, PDFBOX-2609.pdf, 
> PDFBox-2609-patch.zip
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Attached PDF is getting incorrect spacing using example program 
> ExtractTextByArea.java as follows:
> Text in the area:java.awt.Rectangle[x=10,y=500,width=600,height=200]
> Transaction Activity
> Date D e s c r i p t i o n Deposits W i t h d r a w a l s
> 0 4 / 0 8  B E G I N N I N G  BALANCE
> 04 / 0 8  W I THDRAWAL - ATM  3 1 1 7 3 0 0 . 0 0 -
> 62 M I L L  H I L L  ROAD WOODSTOCK N Y
> 04 / 1 0  W I THDRAWAL - ACH 2 0 0 . 0 0 -
> HUMAN RIGHTS WAT-B I L L  PAYMT
> 04 / 12  C K #  1 2 7 3 11 0 . 0 0 -
> 0 4 / 1 5  W I THDRAWAL - ACH 2 0 2 . 5 7 -
> NEW SOUTH INSURA -B I LL PAYMT
> 04 / 1 5  W I THDRAWAL - ACH 3 6 . 2 6 -
> WASTE CONNECTION-BILL PAYMT
> 04 / 1 7  W I THDRAWAL - ACH 71 2 . 0 0 -
> N  PYMT T
> 04 / 1 8  W I THDRAWAL - ACH 2958 9 . 0 0 3
> N  PYMT T
> 04 / 1 9  W I THDRAWAL - ACH 76 8 . 1 2 -
> I believe this is because PDF streams with Tc before Tm are having the matrix 
> applied to the Tc, which is contrary to my experience with graphic pipelines. 
>  Most PDF streams seem to have Tc after Tm, and thus do not hit this 
> situation.
> I have attached a patch to two files that corrects the problem for this file, 
> and also works correctly on my test suite of about 40 files from other 
> sources.  
> The result for the attached file now becomes:
> Text in the area:java.awt.Rectangle[x=10,y=500,width=600,height=200]
> Transaction  Activity
> Date  Description Deposits  Withdrawals
> 04/08  BEGINNING  BALANCE
> 04/08  WITHDRAWAL-ATM  3 117 300.00-
> 62 MILL  HILL  ROAD  WOODSTOCK  NY
> 04/10  WITHDRAWAL-ACH 200.00-
> HUMAN RIGHTS  WAT-BILL  PAYMT
> 04/12  CK#  1273 110.00-
> 04/15  WITHDRAWAL-ACH 202.57-
> NEW SOUTH  INSURA-BILL  PAYMT
> 04/15  WITHDRAWAL-ACH 36.26-
> WASTE CONNECTION-BILL  PAYMT
> 04/17  WITHDRAWAL-ACH 712.00-
> N  PYMT T
> 04/18  WITHDRAWAL-ACH 29589.00 3
> N  PYMT T
> 04/19  WITHDRAWAL-ACH 768.12-



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
