I believe it's because that text is set in a non-standard font, "TTE1890348t00", which is only partially embedded in the file. You can see this for yourself by opening the file in Acrobat and copying that text with the text selection tool: the result is just a string of unreadable Unicode symbols. The other text in the file uses Arial or other standard fonts, so it can be extracted without problems.
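If it helps, here is a rough sketch of how you might detect this situation in your own code, so you at least know when the extracted string is unusable. It is only a heuristic and rests on an assumption of mine: that the unreadable output consists mostly of private-use, unassigned, or control code points. `GarbledTextCheck`, `looksGarbled` and the 0.5 threshold are hypothetical names and values, not part of PDFBox, and the check will miss a font that maps its glyphs to valid but wrong characters.

```java
// Hypothetical helper, not part of PDFBox: flags extracted text that is
// mostly made of code points a reader would not recognise as ordinary text.
final class GarbledTextCheck {

    // Assumed cut-off: more than half of the non-whitespace code points
    // must be suspicious before the text is reported as garbled.
    static final double THRESHOLD = 0.5;

    // A code point is "suspicious" if it is private-use, unassigned, a lone
    // surrogate, a non-whitespace control character, or U+FFFD.
    static boolean isSuspicious(int cp) {
        int type = Character.getType(cp);
        return type == Character.PRIVATE_USE
            || type == Character.UNASSIGNED
            || type == Character.SURROGATE
            || (type == Character.CONTROL && !Character.isWhitespace(cp))
            || cp == 0xFFFD;
    }

    // Returns true when the share of suspicious code points exceeds THRESHOLD.
    // Whitespace is ignored; empty or all-whitespace text is not garbled.
    static boolean looksGarbled(String text) {
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue;
            }
            total++;
            if (isSuspicious(cp)) {
                suspicious++;
            }
        }
        return total > 0 && (double) suspicious / total > THRESHOLD;
    }
}
```

You could call `GarbledTextCheck.looksGarbled(texto)` on the string returned by `pdfStripper.getText(pdfDoc)` and, when it returns true, fall back to another approach for that text.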
On Sun, Sep 9, 2012 at 11:13 AM, Natalia Gómez García <[email protected]> wrote:
> Hello,
>
> I am a computer science student and I'm using your library PDFBox in Java
> to extract text data from some pdf files.
>
> In this project, I am having difficulties extracting the text from this
> pdf: http://www.escet.urjc.es/alumnos/horarios/GR_Biologia_2012-13.pdf.
> Specifically, I can't get to extract the text "Semana del 3 al 7 de
> Septiembre de 2012".
>
> Why can this be happening? Could you please give me some directions on how
> to extract this data?
>
> The code I'm using right now is the following:
> pdfDoc = PDDocument.load(url);
> pdfStripper = new PDFTextStripper();
> texto=pdfStripper.getText(pdfDoc);
> pdfDoc.close();
>
> Thanks for your attention
> Natalia

