Hi,

I have now tried it, but without success. I experimented with the following settings (with varying values):

textStripper.setSpacingTolerance(0.5f);
textStripper.setAverageCharTolerance(0.3f);

What would be reasonable values? I also tried 0.999 for both.

Thanks so far
Dirk

2012/2/10 Hesham G. <heshamgne...@gmail.com>

> Dirk,
>
> Did you try to use PDFTextStripper.setAverageCharTolerance( float ) ?
>
> Best regards,
> Hesham
>
> ---------------------------------------------
> Included message :
>
>> Hello,
>>
>> I use pdfbox 1.6.0 to extract text from PDFs, which often works fine.
>>
>> Unfortunately it seems to insert a space character when there are
>> soft-hyphens in the content of the PDF.
>> Thus the extracted text is sometimes very fragmented. For example the word
>> Medizin is extracted as Me di zin.
>> I also tried to set the new option "parser.setEnableAutoSpace(false);".
>> But this had no effect on the output.
>>
>> Does anyone have a suggestion how to extract the content of PDFs containing
>> soft-hyphens without fragmenting it?
>>
>> As I use the output of pdfbox for searching with Apache Solr my search
>> results are sometimes getting very strange...
>>
>> Best regards
>> Dirk
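
One possible workaround for the Solr indexing case described above, offered only as a sketch: clean the extracted string after PDFBox has produced it, instead of tuning the tolerances. This rests on an assumption that the thread does not confirm, namely that the soft hyphen character (U+00AD) is still present in the extracted text directly before the inserted space. If the extraction drops the character and leaves only the space, this will not help. The class and method names are hypothetical, and the code uses no PDFBox API at all.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: post-processes already extracted text.
class SoftHyphenCleanup {

    // Matches one U+00AD (SOFT HYPHEN) plus any whitespace that follows it.
    // Assumption: the unwanted space sits directly after the soft hyphen.
    private static final Pattern SOFT_HYPHEN_GAP = Pattern.compile("\u00AD\\s*");

    // Removes every soft hyphen together with the whitespace after it,
    // which rejoins the word fragments on either side.
    static String clean(String extracted) {
        return SOFT_HYPHEN_GAP.matcher(extracted).replaceAll("");
    }

    public static void main(String[] args) {
        // "Me", "di", "zin" separated by soft hyphen + space, as assumed above.
        String sample = "Me\u00AD di\u00AD zin";
        System.out.println(clean(sample)); // prints "Medizin"
    }
}
```

Text without soft hyphens passes through unchanged, so ordinary spaces between words are not touched. A soft hyphen followed by a line break is joined as well, which is usually what a search index wants.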