Sorry I couldn't help.

-----Original Message-----
From: Augusto Ribeiro Silva [mailto:a...@unsilo.com]
Sent: Tuesday, May 31, 2016 9:10 AM
To: user@tika.apache.org
Subject: Re: Weird spacing in words

Hi,

I do get the same result using pdfbox. I will open an issue over there.
Thanks for the help.

Best regards,
Augusto

> On 31 May 2016, at 14:35, Allison, Timothy B. <talli...@mitre.org> wrote:
>
> PDFs don't necessarily include spaces. In some (many?) cases, code has to do
> the calculation of character widths and locations on the page to determine
> whether or not to insert spaces. If something goes wrong with the coordinate
> calculations, you can get extra or missing spaces.
>
> You could experiment with changing enableAutoSpace to false via the
> PDFParserConfig, but I doubt that would fix the problem.
>
> If you run straight PDFBox's app [1]
>
> java -jar pdfbox-app...jar ExtractText file.pdf
>
> Do you get the same spacing? If so, please open an issue on PDFBox's issue
> tracker.
>
>
> [1] http://mirror.reverse.net/pub/apache/pdfbox/2.0.1/pdfbox-app-2.0.1.jar
>
> -----Original Message-----
> From: Augusto Ribeiro Silva [mailto:a...@unsilo.com]
> Sent: Tuesday, May 31, 2016 7:36 AM
> To: user@tika.apache.org
> Subject: Weird spacing in words
>
> Hi all,
>
> I am using TIKA java library to read the content of some PDFs and it seems
> like it inserts some weird (hyphen-like) spacing. For example:
> The es tab lish ment of an in te grated Part ner Re la tion ship Man age ment
> (PRM) sys tem can po ten tially ad dress sev eral as pets
>
> I tried to extract text from the same PDF using the pdftotext command line
> utility it extracts the text correctly:
> The establishment of an integrated Partner Relationship Management (PRM)
> system can potentially address several aspects
>
> Does somebody have any idea why TIKA behaves in this way and any tips to
> fixing it?
>
> Best regards,
> Augusto
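
For readers of this thread, here is a rough, self-contained Java sketch of the kind of calculation described above: deciding from glyph positions and widths whether a space belongs between two characters. It is not PDFBox's or Tika's actual algorithm. The class SpaceHeuristicSketch, its methods, the widthScale parameter (which stands in for a wrong width calculation) and the 0.3 threshold are all hypothetical names and values chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a toy model of gap-based space insertion, not PDFBox code.
final class SpaceHeuristicSketch {

    /** A positioned glyph: the character, its left edge, and its advance width. */
    static final class Glyph {
        final char ch;
        final double x;
        final double width;

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Places glyphs left to right. A ' ' in the input only moves the pen,
     * mimicking a PDF that stores no explicit space characters.
     */
    static List<Glyph> layout(String text, double charWidth, double spaceWidth) {
        List<Glyph> glyphs = new ArrayList<>();
        double pen = 0.0;
        for (char c : text.toCharArray()) {
            if (c == ' ') {
                pen += spaceWidth;
            } else {
                glyphs.add(new Glyph(c, pen, charWidth));
                pen += charWidth;
            }
        }
        return glyphs;
    }

    /**
     * Rebuilds text, inserting a space when the gap between where the previous
     * glyph is believed to end and where the next one starts exceeds
     * gapFraction of the believed glyph width. widthScale = 1.0 models correct
     * width calculations; other values model a miscalculation.
     */
    static String rebuild(List<Glyph> glyphs, double widthScale, double gapFraction) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                double believedWidth = prev.width * widthScale;
                double gap = g.x - (prev.x + believedWidth);
                if (gap > gapFraction * believedWidth) {
                    out.append(' ');
                }
            }
            out.append(g.ch);
            prev = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = layout("PRM system", 5.0, 2.5);
        System.out.println(rebuild(glyphs, 1.0, 0.3)); // prints "PRM system"
        System.out.println(rebuild(glyphs, 0.5, 0.3)); // prints "P R M s y s t e m"
    }
}
```

With correct widths the word gap is the only place a space appears. Once the widths are underestimated, every glyph boundary looks like a gap, which is one way extra spaces of the sort reported in this thread could arise.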