Hi,

I've been using PDFBox to extract text features for layout analysis, and I'm 
running into a file that seems to render properly, but the extracted text looks 
totally botched.  If I copy/paste from Acrobat Reader or Mac Preview, the same 
glyphs are broken.

I've tried to make sense of the PDF using the debugger, but this is a bit 
beyond my (limited) PDF internals knowledge.  My guess is that the PDF file has 
some problems with the subsetted "BerlingskeSerifText-Extralight*2" font (this 
appears to be the font used in the example I provide below), but I can't 
determine why the problem glyphs appear fine inside a PDF viewer whereas the 
extracted text is incorrect.  

Thanks for any guidance you can provide!  I've included a sample file and 
details below.

John

I've uploaded the PDF for a problem page here:

https://www.dropbox.com/s/05rlbmv74ya0lrg/TVL_2016_12-64.pdf?dl=0

The phrase "comfortable Airbus A350 XWB to Helsinki and suffering zero jet lag" 
on this page has problems with the digits in "A350" and the "ff" ligature in 
"suffering".

If I use the PDFBox Preflight app, I see three error classes:

1.0.14 : Syntax error, Object {67:0} has an offset of 0
3.1.4 : Invalid Font definition, UDWCAS+BerlingskeSerifCn-XBd: The Charset 
entry is missing for the Type1 Subset
1.2.7 : Body Syntax error, Filter specified in metadata dictionnary

The PDF debugger dump of this part of the content is:

q
    1 0 0 1 99.60001 123.131 cm
    BT
      8.5 0 0 8.5 0 0 Tm
      /Ty5 1 Tf
      [ (c) 10 (omfort) -9.9 (able ) -24 (Airb) 5.1 (us ) -24 (A) ] TJ
    ET
  Q
  q
    1 0 0 1 99.60001 123.131 cm
    BT
      8.5 0 0 8.5 81.1988 0 Tm
      /Ty7 1 Tf
      [ ($%) 10 (&) ] TJ
    ET
  Q
  q
    1 0 0 1 99.60001 123.131 cm
    BT
      8.5 0 0 8.5 94.5778 0 Tm
      /Ty5 1 Tf
      [ ( ) -24 (XWB ) -24 ( ) -24 (to ) -24 (Helsinki ) -24 (and ) -24 (su) ] TJ
    ET
  Q
  q
    1 0 0 1 99.60001 123.131 cm
    BT
      8.5 0 0 8.5 186.9813 0 Tm
      /Ty7 1 Tf
      (') Tj
    ET
  Q
  q
    1 0 0 1 99.60001 123.131 cm
    BT
      8.5 0 0 8.5 192.0218 0 Tm
      /Ty5 1 Tf
      [ (ering ) -24 (z) 5 (er) 10 (o ) -24 (jet ) -24 (lag, ) -24 (t) -5 (ra) 10 (v) 10 (el ) -24 (is ) -24 (g) 5 (ett) -5 (ing ) -24 (undeniably ) -24 (better) 20 (. ) ] TJ
    ET
  Q
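
To make the dump above easier to read, here is a minimal sketch in plain Java 
(no PDFBox involved) that concatenates the literal string operands of a TJ/Tj 
line and drops the kerning numbers.  The class name TjStrings and the method 
extract are made up for illustration, and the sketch assumes simple literal 
strings only: no escape sequences, nested parentheses, or hex strings.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
final class TjStrings {
    // Matches a literal string operand such as (omfort). Escapes and nested
    // parentheses are deliberately not handled in this sketch.
    private static final Pattern LITERAL = Pattern.compile("\\(([^()]*)\\)");

    // Joins the string operands of one TJ/Tj line, ignoring kerning numbers.
    static String extract(String operands) {
        StringBuilder sb = new StringBuilder();
        Matcher m = LITERAL.matcher(operands);
        while (m.find()) {
            sb.append(m.group(1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Run drawn with /Ty5
        System.out.println(extract(
            "[ (c) 10 (omfort) -9.9 (able ) -24 (Airb) 5.1 (us ) -24 (A) ] TJ"));
        // Runs drawn with /Ty7
        System.out.println(extract("[ ($%) 10 (&) ] TJ"));
        System.out.println(extract("(') Tj"));
    }
}
```

The /Ty5 run comes out as readable text, while the two /Ty7 runs come out as 
"$%&" and "'", exactly where the rendered page shows "350" and the "ff" 
ligature.  That would be consistent with /Ty7 using character codes that do not 
map back to the right Unicode values, but I have not confirmed that this is the 
cause.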
