Hello there,

>
> I'm new to pdfbox and I'm trying to extract text from some government
> pdf file, but some texts arent extracted correctly. Can anyone help or
> suggest me what is wrong?
>
> Here's pdf I'm trying to extract from:
> http://www.justice.gov.sk/kop/ovest/ov10/03/050/OV050A.pdf
>

The first page of this document is constructed in such a way that it
only makes sense when rendered. For example, the "Table of contents"
uses a font that does not provide a mapping from raw bytes to
characters. You can verify this by opening the document in Acrobat
Reader, selecting some text and copying it to the clipboard: you get
a handful of bytes but no human-readable text.
However, all the remaining pages appear to be suitable for text
extraction. Beware that PDFBox might not be very proficient with
exotic languages such as Slovak, but we certainly hope to improve
that over time.
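
If you want to spot such pages programmatically, one rough approach is
to check how much of the extracted string consists of ordinary
characters. The sketch below is only an illustration and uses plain
Java, not any PDFBox API; the class name ReadableTextCheck, the method
names and the threshold value are my own assumptions, and the
heuristic will not catch every badly mapped font.

```java
// Hypothetical helper, not part of PDFBox: estimates whether a string
// returned by a text extractor looks like human-readable text.
final class ReadableTextCheck {

    // Returns the fraction of code points that are neither control
    // characters, private-use, unassigned, lone surrogates, nor the
    // U+FFFD replacement character. Whitespace always counts as readable.
    static double readableRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int readable = 0;
        int total = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            if (Character.isWhitespace(cp)) {
                readable++;
                continue;
            }
            int type = Character.getType(cp);
            boolean suspicious = type == Character.CONTROL
                    || type == Character.PRIVATE_USE
                    || type == Character.UNASSIGNED
                    || type == Character.SURROGATE
                    || cp == 0xFFFD;
            if (!suspicious) {
                readable++;
            }
        }
        return (double) readable / total;
    }

    // The threshold is an assumption; tune it for your documents.
    static boolean looksReadable(String text, double threshold) {
        return readableRatio(text) >= threshold;
    }
}
```

You would pass in the text extracted for a single page and treat the
page as "render only" when looksReadable(pageText, 0.9) returns false.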

This document makes heavy use of Type1C fonts. The "native" support
for Type1C fonts was introduced in PDFBox 1.0.0. You might get
different results (maybe even better?) if you try to perform text
extraction with an older PDFBox version such as 0.8.0.

I've filed this incident in PDFBox's JIRA as PDFBOX-664:
https://issues.apache.org/jira/browse/PDFBOX-664


VR
