> Hello there,
>
> I'm new to PDFBox and I'm trying to extract text from a government
> PDF file, but some of the text isn't extracted correctly. Can anyone
> help or suggest what is wrong?
>
> Here's the PDF I'm trying to extract from:
> http://www.justice.gov.sk/kop/ovest/ov10/03/050/OV050A.pdf
>
The first page of this document is constructed in such a way that it only makes sense when rendered. For example, the "Table of contents" uses a font which does not provide a translation from raw bytes to characters. You can verify this by opening the document in Acrobat Reader, selecting some text and copying it to the clipboard: you will get a handful of bytes but no human-readable text.

However, all the remaining pages appear to be suitable for text extraction. Beware that PDFBox might not be very proficient with exotic languages such as Slovak, but we certainly hope to improve over time.

This document makes heavy use of Type1C fonts. "Native" support for Type1C fonts was introduced in PDFBox 1.0.0. You might get different results (maybe even better ones?) if you try to perform text extraction with an older PDFBox version such as 0.8.0.

I've filed this incident in PDFBox's JIRA as PDFBOX-664:
https://issues.apache.org/jira/browse/PDFBOX-664

VR
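
P.S. If you want to detect pages like the first one automatically instead of copying text by hand in Acrobat Reader, a rough heuristic is to measure how much of the extracted string is actually human-readable. The sketch below is only an illustration and uses nothing but the Java standard library: the class name `ReadableTextCheck`, its methods and the threshold are made up for this example and are not part of PDFBox. It assumes you already have the extracted text of a page as a `String`.

```java
/**
 * Hypothetical helper (not part of PDFBox): estimates whether a piece of
 * extracted text is human-readable or just raw bytes from a font that has
 * no byte-to-character mapping.
 */
final class ReadableTextCheck {

    private ReadableTextCheck() {
    }

    /** Returns the fraction of code points that look like ordinary text (0.0 for empty input). */
    static double readableRatio(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int readable = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            if (isReadable(cp)) {
                readable++;
            }
        }
        return (double) readable / total;
    }

    /** Letters (including accented ones), digits, whitespace and common punctuation count as readable. */
    static boolean isReadable(int cp) {
        if (Character.isLetterOrDigit(cp) || Character.isWhitespace(cp)) {
            return true;
        }
        switch (Character.getType(cp)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
            case Character.MATH_SYMBOL:
            case Character.CURRENCY_SYMBOL:
                return true;
            default:
                // Control characters, private-use and unassigned code points end up here.
                return false;
        }
    }

    /** True if at least the given fraction of the text is readable; the threshold is an arbitrary choice. */
    static boolean looksReadable(String text, double threshold) {
        return readableRatio(text) >= threshold;
    }
}
```

You could call `ReadableTextCheck.looksReadable(pageText, 0.9)` on the text of each page and skip or flag the pages that fail. The 0.9 threshold is a guess and would need tuning against real documents.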

