Sorry I couldn't help.

-----Original Message-----
From: Augusto Ribeiro Silva [mailto:a...@unsilo.com] 
Sent: Tuesday, May 31, 2016 9:10 AM
To: user@tika.apache.org
Subject: Re: Weird spacing in words

Hi, 

I do get the same result using pdfbox. I will open an issue over there.
Thanks for the help.

Best regards,
Augusto

> On 31 May 2016, at 14:35, Allison, Timothy B. <talli...@mitre.org> wrote:
> 
> PDFs don't necessarily include spaces.  In some (many?) cases, code has to do 
> the calculation of character widths and locations on the page to determine 
> whether or not to insert spaces.  If something goes wrong with the coordinate 
> calculations, you can get extra or missing spaces.
> 
> You could experiment with changing enableAutoSpace to false via the 
> PDFParserConfig, but I doubt that would fix the problem.
> 
> If you run PDFBox's standalone app [1] directly:
> 
> java -jar pdfbox-app...jar ExtractText file.pdf
> 
> Do you get the same spacing?  If so, please open an issue on PDFBox's issue 
> tracker.
> 
> 
> [1] http://mirror.reverse.net/pub/apache/pdfbox/2.0.1/pdfbox-app-2.0.1.jar
> 
> -----Original Message-----
> From: Augusto Ribeiro Silva [mailto:a...@unsilo.com] 
> Sent: Tuesday, May 31, 2016 7:36 AM
> To: user@tika.apache.org
> Subject: Weird spacing in words 
> 
> Hi all,
> 
> I am using the Tika Java library to read the content of some PDFs, and it 
> seems to insert some weird (hyphen-like) spacing. For example:
> The es tab lish ment of an in te grated Part ner Re la tion ship Man age ment 
> (PRM) sys tem can po ten tially ad dress sev eral as pets
> 
> I tried to extract text from the same PDF using the pdftotext command line 
> utility, and it extracts the text correctly:
> The establishment of an integrated Partner Relationship Management (PRM) 
> system can potentially address several aspects 
> 
> Does anybody have any idea why Tika behaves this way, or any tips for 
> fixing it?
> 
> Best regards, 
> Augusto
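
To illustrate the point quoted above about spaces being computed from character widths and positions, here is a minimal sketch of a gap-based heuristic. This is not PDFBox's or Tika's actual code: the class, the method names, and the 0.3 threshold are all hypothetical and chosen only for illustration. It shows that when the reported glyph width is smaller than the real advance, every character appears to be followed by a gap, and spaces get inserted inside words.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; NOT the PDFBox or Tika implementation.
final class SpaceHeuristicSketch {

    /** One positioned character: its text, left x coordinate, and reported width. */
    static final class Glyph {
        final String text;
        final double x;
        final double width;

        Glyph(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Places each non-space character at i * advance and tags it with
     * reportedWidth. Space characters are skipped on purpose, to mimic a PDF
     * that stores no explicit spaces, only positions.
     */
    static List<Glyph> layout(String text, double advance, double reportedWidth) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != ' ') {
                glyphs.add(new Glyph(String.valueOf(c), i * advance, reportedWidth));
            }
        }
        return glyphs;
    }

    /**
     * Joins glyphs left to right, inserting a space whenever the gap between
     * the end of the previous glyph and the start of the next one exceeds
     * gapFactor times the previous glyph's reported width.
     */
    static String joinGlyphs(List<Glyph> glyphs, double gapFactor) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                double gap = g.x - (prev.x + prev.width);
                if (gap > gapFactor * prev.width) {
                    out.append(' ');
                }
            }
            out.append(g.text);
            prev = g;
        }
        return out.toString();
    }
}
```

With consistent widths, joinGlyphs(layout("an apple", 5.0, 5.0), 0.3) recovers the single word gap. Passing a reported width of 3.0 for the same positions makes every gap exceed the threshold, so each letter is separated by a space, which is the same kind of symptom as in the original report.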
