Hi Peter, I get very similar results with PDFBox 1.0.1 (slightly patched following a hint from Villu Ruusmann). It seems that PDFBox gets confused by the two-column layout and the table of contents at the beginning. With PDF Kit I get something that starts with:

2010 Ročník XVIII Číslo 50A
OBCHODNÝ REGISTER
15. marca 2010 Cena 18,06 €
OBSAH
Okresný súd Bratislava I
Nové zápisy . . . . . . . . . . .
Zmeny zápisov . . . . . . . .
Okresný súd Trnava
Nové zápisy . . . . . . . . . . .
Zmeny zápisov . . . . . . . . .
................ 2
. . . . . . . . . . . . . . . .. 16
. ................ 104
. ................ 107
. ................ 118
. ................ 119
. ................ 129
. ................ 136
. ................ 156
. ................ 157
. ................ 159

So the PDF file as such is not broken (although this is not quite the desired result either).

All the best
Thomas

On 15.03.2010 at 13:40, Peter Zavadsky wrote:

> Hi,
>
> I'm new to PDFBox and I'm trying to extract text from a government
> PDF file, but some of the text isn't extracted correctly. Can anyone
> help or suggest what is wrong?
>
> Here's the PDF I'm trying to extract from:
> http://www.justice.gov.sk/kop/ovest/ov10/03/050/OV050A.pdf
>
> Here's the output from the first two pages:
>
> !"
> !#
> $%
> &
> '
> '
> (
> )
> *
> +
> *#
> *'
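
P.S. To illustrate why a two-column layout can scramble extracted text, here is a minimal, self-contained sketch. It is not PDFBox's actual algorithm; the `Fragment` class, the coordinates, and the `columnBoundary` parameter are all hypothetical. It only shows that ordering text fragments row by row interleaves the columns, while grouping by column first keeps each column together.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ColumnOrderSketch {

    /** A positioned piece of text (hypothetical page units, y grows downward). */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Row-by-row ordering: top to bottom, then left to right. Interleaves columns. */
    static String naiveOrder(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator
                .comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return join(sorted);
    }

    /** Column-aware ordering: everything left of the boundary first, each column top to bottom. */
    static String columnOrder(List<Fragment> fragments, double columnBoundary) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator
                .comparingInt((Fragment f) -> f.x < columnBoundary ? 0 : 1)
                .thenComparingDouble(f -> f.y));
        return join(sorted);
    }

    private static String join(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two columns with two lines each; the column gap is assumed to be at x = 300.
        List<Fragment> page = Arrays.asList(
                new Fragment("left-1", 50, 100),
                new Fragment("right-1", 320, 100),
                new Fragment("left-2", 50, 120),
                new Fragment("right-2", 320, 120));

        System.out.println(naiveOrder(page));       // left-1 right-1 left-2 right-2
        System.out.println(columnOrder(page, 300)); // left-1 left-2 right-1 right-2
    }
}
```

In a real document the column boundary would have to be detected or supplied per page, which is why a table of contents with dot leaders spanning the page can throw such heuristics off.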

