> On 16 Nov 2016, at 10:09, Tilman Hausherr <[email protected]> wrote:
> 
> Am 16.11.2016 um 18:47 schrieb John Logan:
>> Hi,
>> 
>> I've been using PDFbox to extract text features for layout analysis, and I'm 
>> running into a file that seems to render properly, but the extracted text looks 
>> totally botched.  If I copy/paste from Acrobat Reader or Mac Preview, the 
>> same glyphs are broken.
> 
> Yes.
> 
> Have a look here:
> Root/Pages/Kids/[0]/Resources/Font/Ty7
> 
> then scroll down and look at the "unicode" column. It is empty.

Just to add a little more detail, the ff ligature in the word “suffering” uses 
font Ty7 and uses the ' character to represent the ligature. It might seem odd 
to use ' but forget that it’s a single quote and just see it as being character 
number 39, because that’s how PDF views it. If you click on the font that 
Tilman mentioned in PDFDebugger then scroll down to character 39, you’ll see 
that it has been mapped to the glyph named f_f, which is the ff ligature. 
You’ll also see that the corresponding Unicode Character column is empty, which 
means that PDFBox doesn’t know how to map that glyph name.

The reason for this is that if you look at Ty7’s ToUnicode map you’ll see this 
line:

<27><27><0000>

Which maps PDF character 39 (0x27) to Unicode character zero, i.e. nothing. So 
the problem is that the PDF contains Unicode mapping data which is wrong.
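
If it helps to see the effect of that line in isolation, here is a minimal 
sketch in Python. It is independent of PDFBox and is not how PDFBox parses 
CMaps; parse_bfrange and is_usable are hypothetical helper names, and treating 
a NUL-only destination as "missing" is an assumption made for the 
illustration. It only handles the simple <lo><hi><dst> form of a bfrange 
entry.

```python
import re

# One bfrange entry of the simple form <lo><hi><dst>, spaces optional.
BFRANGE = re.compile(
    r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>"
)


def parse_bfrange(line):
    """Return {character code: Unicode string} for one simple bfrange entry."""
    m = BFRANGE.fullmatch(line.strip())
    if m is None:
        raise ValueError(f"not a simple bfrange entry: {line!r}")
    lo, hi = int(m.group(1), 16), int(m.group(2), 16)
    # The destination is stored as UTF-16BE code units.
    dst = bytes.fromhex(m.group(3)).decode("utf-16-be")
    mapping = {}
    for offset, code in enumerate(range(lo, hi + 1)):
        # Simplification: each successive code bumps the last character by one.
        mapping[code] = dst[:-1] + chr(ord(dst[-1]) + offset)
    return mapping


def is_usable(text):
    """Assumption for this sketch: a NUL-only or empty mapping counts as missing."""
    return text.strip("\x00") != ""


mapping = parse_bfrange("<27><27><0000>")
code = ord("'")  # the byte in the content stream, i.e. character 39
print(code, repr(mapping[code]), is_usable(mapping[code]))
# prints: 39 '\x00' False
```

So the ' in the content stream is character 39, the ToUnicode entry sends it 
to U+0000, and an extractor has nothing meaningful to emit for it.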

— John

> You have to understand the difference between "glyph" and "character". A 
> glyph is just a painting of a character. If you see a "9" then it doesn't 
> have to be that you get a "9" in text extraction too, this must be defined 
> somewhere. And if it isn't, or is incorrect, then you won't get a good 
> extraction.
> 
> Tilman
> 
>> 
>> I've tried to make sense of the PDF using the debugger, but this is a bit 
>> beyond my (limited) PDF internals knowledge.  My guess is that the PDF file 
>> has some problems with the subsetted "BerlingskeSerifText-Extralight*2" font 
>> (this appears to be the font used in the example I provide below), but I 
>> can't determine why the problem glyphs appear fine inside a PDF viewer 
>> whereas the extracted text is incorrect.
>> 
>> Thanks for any guidance you can provide!  I've included a sample file and 
>> details below.
>> 
>> John
>> 
>> I've uploaded the PDF for a problem page here:
>> 
>> https://www.dropbox.com/s/05rlbmv74ya0lrg/TVL_2016_12-64.pdf?dl=0
>> 
>> The phrase "comfortable Airbus A XWB to Helsinki and suffering zero jet lag" 
>> on this page has problems with the numbers in "A350" and the ligature in 
>> "suffering".
>> 
>> If I use the PDFbox preflight app, I see three error classes:
>> 
>> 1.0.14 : Syntax error, Object {67:0} has an offset of 0
>> 3.1.4 : Invalid Font definition, UDWCAS+BerlingskeSerifCn-XBd: The Charset 
>> entry is missing for the Type1 Subset
>> 1.2.7 : Body Syntax error, Filter specified in metadata dictionnary
>> 
>> The PDF debugger dump of this part of the content is:
>> 
>> q
>>     1 0 0 1 99.60001 123.131 cm
>>     BT
>>       8.5 0 0 8.5 0 0 Tm
>>       /Ty5 1 Tf
>>       [ (c) 10 (omfort) -9.9 (able ) -24 (Airb) 5.1 (us ) -24 (A) ] TJ
>>     ET
>>   Q
>>   q
>>     1 0 0 1 99.60001 123.131 cm
>>     BT
>>       8.5 0 0 8.5 81.1988 0 Tm
>>       /Ty7 1 Tf
>>       [ ($%) 10 (&) ] TJ
>>     ET
>>   Q
>>   q
>>     1 0 0 1 99.60001 123.131 cm
>>     BT
>>       8.5 0 0 8.5 94.5778 0 Tm
>>       /Ty5 1 Tf
>>       [ ( ) -24 (XWB ) -24 ( ) -24 (to ) -24 (Helsinki ) -24 (and ) -24 (su) ] TJ
>>     ET
>>   Q
>>   q
>>     1 0 0 1 99.60001 123.131 cm
>>     BT
>>       8.5 0 0 8.5 186.9813 0 Tm
>>       /Ty7 1 Tf
>>       (') Tj
>>     ET
>>   Q
>>   q
>>     1 0 0 1 99.60001 123.131 cm
>>     BT
>>       8.5 0 0 8.5 192.0218 0 Tm
>>       /Ty5 1 Tf
>>       [ (ering ) -24 (z) 5 (er) 10 (o ) -24 (jet ) -24 (lag, ) -24 (t) -5 (ra) 10 (v) 10 (el ) -24 (is ) -24 (g) 5 (ett) -5 (ing ) -24 (undeniably ) -24 (better) 20 (. ) ] TJ
>>     ET
>>   Q
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>> 
