[
https://issues.apache.org/jira/browse/PDFBOX-571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andreas Lehmkühler resolved PDFBOX-571.
---------------------------------------
Resolution: Fixed
Fix Version/s: 1.0.0
Villu's explanation seems reasonable to me, so I've tested the patch and it
works fine.
- the rendering of the attached sample pdf is more accurate (not perfect, but
better)
- the extracted text of the attached sample pdf is more accurate too
- the other test cases are working like before
Thanks to Villu for his contribution.
> Dubious handling of word spacing (Tw)
> -------------------------------------
>
> Key: PDFBOX-571
> URL: https://issues.apache.org/jira/browse/PDFBOX-571
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction, Utilities
> Affects Versions: 0.8.0-incubator
> Reporter: Villu Ruusmann
> Fix For: 1.0.0
>
> Attachments: PDFStreamEngine.patch, pg_0005.pdf, pg_0005_selectall.png
>
>
> I wanted to provide a counterexample to the current handling of word spacing.
> The sample page (pg_0005.pdf) uses a Type1C font for text rendering. The
> problem is that this Type1C font uses a custom encoding where the code values
> are assigned sequentially starting from the code value of 1. Thus the code
> value 32 is assigned to a digit "3", not to a space character " " as one
> would expect.
> The PDF producer software has (mis-)used word spacing to break up longer
> character sequences. For example, on table line 3, the character sequence
> "0.831.05" is broken into two cells "0.83" and "1.05". Other uses of this
> "optimization" can be seen when the sample page is opened in Acrobat Reader
> (tested on version 7.0) and the "Select all" operation is performed. I've
> attached the screenshot of Acrobat Reader (pg_0005_selectall.png) for your
> convenience.
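
The behaviour described in the report can be illustrated with a minimal
sketch. It assumes the rule from the PDF specification that word spacing (Tw)
is added after every single-byte character code 32, whatever glyph that code
maps to. This is not the attached patch and not PDFBox API: the class name,
the method names, the uniform glyph width, and every code value except
32 -> "3" are hypothetical and chosen only for illustration.

```java
public class WordSpacingSketch {

    // Horizontal advance of one glyph in text space:
    // (glyphWidth * fontSize + Tc + Tw) * Th, with Tw applied only to code 32.
    // glyphWidth is assumed to be already divided by 1000.
    static double advance(int code, double glyphWidth, double fontSize,
                          double charSpacing, double wordSpacing,
                          double horizontalScale) {
        // The decision is made on the raw code, not on the decoded character.
        double tw = (code == 32) ? wordSpacing : 0.0;
        return (glyphWidth * fontSize + charSpacing + tw) * horizontalScale;
    }

    // Start x position of every glyph in a string of single-byte codes.
    static double[] startPositions(int[] codes, double glyphWidth,
                                   double fontSize, double charSpacing,
                                   double wordSpacing, double horizontalScale) {
        double[] xs = new double[codes.length];
        double x = 0.0;
        for (int i = 0; i < codes.length; i++) {
            xs[i] = x;
            x += advance(codes[i], glyphWidth, fontSize, charSpacing,
                         wordSpacing, horizontalScale);
        }
        return xs;
    }

    public static void main(String[] args) {
        // Hypothetical codes for "0.831.05"; only 32 -> "3" comes from the report.
        int[] codes = {29, 15, 37, 32, 30, 15, 29, 34};
        double[] xs = startPositions(codes, 0.5, 10.0, 0.0, 20.0, 1.0);
        for (int i = 0; i < codes.length; i++) {
            System.out.println("code " + codes[i] + " starts at x = " + xs[i]);
        }
    }
}
```

With a large Tw value the glyph that follows code 32 (the "3" ending "0.83")
starts well to the right of where it otherwise would, which would be
consistent with "0.83" and "1.05" landing in separate table cells. An
extractor that looks for a decoded space character instead of the raw code
would miss this gap.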