Re: Softhyphens / white space

Dirk Högemann Fri, 10 Feb 2012 07:27:57 -0800

Hi,
I have tried it now, but without success.
I experimented with the following settings (varying the values):


textStripper.setSpacingTolerance(0.5f);
textStripper.setAverageCharTolerance(0.3f);

What could be reasonable values? I also tried 0.999 for both.

Thanks so far
Dirk
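
If the tolerance settings do not help, one possible workaround is to clean the extracted text before handing it to Solr. The following is only a minimal sketch, not part of pdfbox: it assumes the soft hyphens survive extraction as the character U+00AD followed by the unwanted space, which this thread does not confirm. The class name SoftHyphenCleaner and the method clean are hypothetical names chosen for illustration.

```java
public class SoftHyphenCleaner {

    // U+00AD SOFT HYPHEN (assumption: it is still present in the extracted text)
    private static final char SOFT_HYPHEN = '\u00AD';

    /**
     * Removes every soft hyphen together with any whitespace that directly
     * follows it, so that fragments of a hyphenated word are joined again.
     */
    public static String clean(String extracted) {
        StringBuilder out = new StringBuilder(extracted.length());
        int i = 0;
        while (i < extracted.length()) {
            char c = extracted.charAt(i);
            if (c == SOFT_HYPHEN) {
                // Skip the soft hyphen itself ...
                i++;
                // ... and any whitespace inserted right after it.
                while (i < extracted.length()
                        && Character.isWhitespace(extracted.charAt(i))) {
                    i++;
                }
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical extractor output: soft hyphen plus an inserted space
        System.out.println(clean("Me\u00AD di\u00AD zin"));
    }
}
```

If the extractor has already dropped the soft hyphens and left only the spaces, this sketch changes nothing, because a space inside a word can then no longer be told apart from a real word boundary.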

2012/2/10 Hesham G. <heshamgne...@gmail.com>

> Dirk ,
>
> Did you try to use PDFTextStripper.setAverageCharTolerance( float ) ?
>
>
>
> Best regards ,
> Hesham
>
>
> ---------------------------------------------
> Included message :
>
>
>  Hello,
>>
>> I use pdfbox 1.6.0 to extract text from PDFs, which often works fine.
>>
>> Unfortunately, it seems to insert a space character wherever there is a
>> soft hyphen in the content of the PDF.
>> As a result, the extracted text is sometimes very fragmented. For example,
>> the word Medizin is extracted as Me di zin.
>> I also tried setting the new option "parser.setEnableAutoSpace(false);",
>> but this had no effect on the output.
>>
>> Does anyone have a suggestion for how to extract the content of PDFs
>> containing soft hyphens without fragmenting it?
>>
>> As I use the output of pdfbox for searching with Apache Solr, my search
>> results sometimes get very strange...
>>
>> Best regards
>> Dirk
>>
>>
