Hi, I've been using PDFBox to extract text features for layout analysis, and I've run into a file that seems to render properly but whose extracted text is badly garbled. If I copy/paste from Acrobat Reader or Preview on the Mac, the same glyphs are broken.
I've tried to make sense of the PDF using the debugger, but this is a bit beyond my (limited) knowledge of PDF internals. My guess is that the PDF file has some problem with the subsetted "BerlingskeSerifText-Extralight*2" font (this appears to be the font used in the example I provide below), but I can't determine why the problem glyphs appear fine inside a PDF viewer whereas the extracted text is incorrect.

Thanks for any guidance you can provide! I've included a sample file and details below.

John

I've uploaded the PDF for a problem page here:

https://www.dropbox.com/s/05rlbmv74ya0lrg/TVL_2016_12-64.pdf?dl=0

The phrase "comfortable Airbus A XWB to Helsinki and suffering zero jet lag" on this page has problems with the numbers in "A350" and the ligature in "suffering".

If I use the PDFBox preflight app, I see three error classes:

  1.0.14 : Syntax error, Object {67:0} has an offset of 0
  3.1.4 : Invalid Font definition, UDWCAS+BerlingskeSerifCn-XBd: The Charset entry is missing for the Type1 Subset
  1.2.7 : Body Syntax error, Filter specified in metadata dictionnary

The PDF debugger dump of this part of the content is:

  q 1 0 0 1 99.60001 123.131 cm BT 8.5 0 0 8.5 0 0 Tm /Ty5 1 Tf [ (c) 10 (omfort) -9.9 (able ) -24 (Airb) 5.1 (us ) -24 (A) ] TJ ET Q
  q 1 0 0 1 99.60001 123.131 cm BT 8.5 0 0 8.5 81.1988 0 Tm /Ty7 1 Tf [ ($%) 10 (&) ] TJ ET Q
  q 1 0 0 1 99.60001 123.131 cm BT 8.5 0 0 8.5 94.5778 0 Tm /Ty5 1 Tf [ ( ) -24 (XWB ) -24 ( ) -24 (to ) -24 (Helsinki ) -24 (and ) -24 (su) ] TJ ET Q
  q 1 0 0 1 99.60001 123.131 cm BT 8.5 0 0 8.5 186.9813 0 Tm /Ty7 1 Tf (') Tj ET Q
  q 1 0 0 1 99.60001 123.131 cm BT 8.5 0 0 8.5 192.0218 0 Tm /Ty5 1 Tf [ (ering ) -24 (z) 5 (er) 10 (o ) -24 (jet ) -24 (lag, ) -24 (t) -5 (ra) 10 (v) 10 (el ) -24 (is ) -24 (g) 5 (ett) -5 (ing ) -24 (undeniably ) -24 (better) 20 (. ) ] TJ ET Q
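
To make the dump easier to read, here is a minimal Java sketch that concatenates the string operands of a Tj/TJ operator and drops the kerning numbers. It is plain string handling, not PDFBox API; the class name `TjStrings` and the method `rawText` are made up for illustration. It assumes simple literal strings and treats a backslash as "keep the next character", which is a simplification of the PDF string escape rules.

```java
class TjStrings {
    /**
     * Concatenates the literal "( ... )" strings found in a Tj or TJ operand
     * and skips everything outside them (brackets, kerning adjustments).
     * Simplification: "\x" is emitted as "x"; octal and \n-style escapes
     * are not decoded.
     */
    static String rawText(String operand) {
        StringBuilder out = new StringBuilder();
        int depth = 0; // nesting level of parentheses inside a string
        for (int i = 0; i < operand.length(); i++) {
            char c = operand.charAt(i);
            if (c == '\\' && depth > 0 && i + 1 < operand.length()) {
                out.append(operand.charAt(++i)); // escaped character
            } else if (c == '(') {
                if (depth++ > 0) out.append(c);  // nested "(" belongs to the string
            } else if (c == ')' && depth > 0) {
                if (--depth > 0) out.append(c);  // nested ")" belongs to the string
            } else if (depth > 0) {
                out.append(c);                   // ordinary string byte
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Operands copied from the content stream dump above.
        System.out.println("/Ty5: " + rawText("[ (c) 10 (omfort) -9.9 (able ) -24 (Airb) 5.1 (us ) -24 (A) ]"));
        System.out.println("/Ty7: " + rawText("[ ($%) 10 (&) ]"));
        System.out.println("/Ty7: " + rawText("(')"));
    }
}
```

On the operands above, the /Ty5 runs come out as readable text, while the two /Ty7 runs come out as the raw codes `$%&` and `'`, exactly where "350" and the "ff" ligature should be. So the broken characters are the ones drawn with /Ty7, and the question is how those codes are supposed to map back to Unicode.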

