Re: Problems with Java PDFBox

Gilad Denneboom Sun, 09 Sep 2012 04:01:58 -0700

I believe it's because that text is set in a non-standard font called
"TTE1890348t00", which is only partially embedded in the file, so there is
presumably no usable mapping from its glyphs back to Unicode characters.
You can see this for yourself by opening the file in Acrobat and copying
that text with the text selection tool: the result is just a string of
unreadable Unicode symbols. The other text in the file uses Arial or other
standard fonts, and can therefore be extracted without trouble.
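
If you want to detect this situation from code rather than by eye, here is a
minimal sketch that inspects the string returned by pdfStripper.getText(...).
It uses only the standard Java library, not PDFBox itself. The class name
ExtractedTextCheck and both method names are made up for this example, and
treating private-use, unassigned, control and replacement characters as
"unreadable" is my own assumption about what the garbled output looks like.

```java
final class ExtractedTextCheck {
    private ExtractedTextCheck() {}

    /** Assumption: these categories are what "unreadable symbols" look like. */
    static boolean isSuspicious(int codePoint) {
        if (codePoint == 0xFFFD) {
            return true; // U+FFFD REPLACEMENT CHARACTER
        }
        switch (Character.getType(codePoint)) {
            case Character.PRIVATE_USE:
            case Character.UNASSIGNED:
            case Character.CONTROL:
            case Character.SURROGATE:
                return true;
            default:
                return false;
        }
    }

    /** Fraction of non-whitespace code points that look unreadable (0.0 for empty text). */
    static double suspiciousRatio(String text) {
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // line breaks and spaces say nothing about the font
            }
            total++;
            if (isSuspicious(cp)) {
                suspicious++;
            }
        }
        return total == 0 ? 0.0 : (double) suspicious / total;
    }

    /** Checks for an expected phrase, ignoring differences in whitespace and line breaks. */
    static boolean containsPhrase(String text, String phrase) {
        String normalizedText = text.replaceAll("\\s+", " ");
        String normalizedPhrase = phrase.replaceAll("\\s+", " ").trim();
        return normalizedText.contains(normalizedPhrase);
    }
}
```

You would call ExtractedTextCheck.containsPhrase(texto, "Semana del 3 al 7 de
Septiembre de 2012") and ExtractedTextCheck.suspiciousRatio(texto) on the
extracted text. Keep in mind that this is only a heuristic: a font with a wrong
mapping can also produce ordinary-looking but incorrect letters, so a low ratio
does not prove the text came out right.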


On Sun, Sep 9, 2012 at 11:13 AM, Natalia Gómez García <
[email protected]> wrote:

> Hello,
>
> I am a computer science student and I'm using your library PDFBox in Java
> to extract text data from some pdf files.
>
> In this project, I am having difficulty extracting the text from this
> pdf: http://www.escet.urjc.es/alumnos/horarios/GR_Biologia_2012-13.pdf.
> Specifically, I can't extract the text "Semana del 3 al 7 de
> Septiembre de 2012".
>
> Why might this be happening? Could you please give me some pointers on how
> to extract this data?
>
> The code I'm using right now is the following:
> pdfDoc = PDDocument.load(url);
> pdfStripper = new PDFTextStripper();
> texto = pdfStripper.getText(pdfDoc);
> pdfDoc.close();
>
> Thanks for your attention
> Natalia
>
