Hi Peter,

I get very similar results with PDFBox 1.0.1 (slightly patched following a hint 
from Villu Ruusmann).
It seems that PDFBox gets confused by the two-column layout and the table of 
contents at the beginning.
With PDF Kit I get something that starts with

2010
Roèník XVIII 􏰄 Èíslo 50A
OBCHODNÝ REGISTER
15. marca 2010  Cena 18,06 €
OBSAH
Okresný súd Bratislava I Nové zápisy    . . . . . . . . . . . Zmeny zápisov     
. . . . . . . .
Okresný súd Trnava Nové zápisy . . . . . . . . . . . Zmeny zápisov      . . . . 
. . . .
. ................      2 . . . . . . . . . . . . . . . ..      16
. ................      104 . ................  107
. ................      118 . ................  119
. ................      129 . ................  136 . ................  156
. ................      157 . ................  159

So the PDF file as such is not broken, although this is not quite the desired 
result either.
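
To compare the two outputs less subjectively, one could rate how "text-like" an 
extraction result is. The following is only a rough sketch of such a check; the 
class ExtractionCheck, its methods and the 0.5 threshold are my own assumptions 
and not part of PDFBox or PDF Kit.

```java
// Hypothetical helper, not part of PDFBox or PDF Kit:
// rates how "text-like" an extraction result is.
final class ExtractionCheck {

    private ExtractionCheck() {}

    /**
     * Returns the fraction of non-whitespace code points that are letters
     * or digits. Returns 0.0 if the text has no non-whitespace characters.
     */
    static double letterRatio(String text) {
        long visible = text.codePoints()
                .filter(cp -> !Character.isWhitespace(cp))
                .count();
        if (visible == 0) {
            return 0.0;
        }
        long alnum = text.codePoints()
                .filter(cp -> Character.isLetterOrDigit(cp))
                .count();
        return (double) alnum / visible;
    }

    /** Uses an assumed threshold of 0.5; tune it on real documents. */
    static boolean looksLikeText(String text) {
        return letterRatio(text) >= 0.5;
    }

    public static void main(String[] args) {
        // Mostly blanks and punctuation, similar to the quoted PDFBox output.
        String garbled = "   !\"   \n   !#  \n  $%  \n  &  ";
        // A readable fragment, similar to the PDF Kit output.
        String readable = "OBSAH 15. marca 2010";

        System.out.println(looksLikeText(garbled));   // prints "false"
        System.out.println(looksLikeText(readable));  // prints "true"
    }
}
```

A low ratio like the one for the first string suggests the character mapping 
failed, while a high ratio with scrambled order would rather point at the layout.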

All the best
Thomas


On 15.03.2010 at 13:40, Peter Zavadsky wrote:

> Hi,
> 
> I'm new to PDFBox and I'm trying to extract text from a government
> PDF file, but some of the text isn't extracted correctly. Can anyone help or
> suggest what is wrong?
> 
> Here's the PDF I'm trying to extract from:
> http://www.justice.gov.sk/kop/ovest/ov10/03/050/OV050A.pdf
> 
> Here's the output from the first two pages:
> 
> 
>           
> 
>       
> 
>        
>       
> 
> 
>       
>                                                             
>                                                                     
>                                                                   
>                                                          
>                                                                     
>                                       
>         
>       
>        
> 
> 
>                                                             
>                                                                     
>                                                                   
> !"
>                                                           
>                                                                     
>                                              !#
>        
>        $%
>                                                             
>                                                                     
>                                                                   
> &
>                                                           
>                                                                     
>                                              '
>        
>       
> 
>                                                             
>                                                                     
>                                                                   
> '
>                                                           
>                                                                     
>                                              (
> )
>                                                                   
>                                                                     
>                                                                     
>      *
>        
>       +
> 
>                                                             
>                                                                     
>                                                                   
> *#
>                                                           
>                                                                     
>                                              *'
>        
