Weird spacing in words

2016-05-31 Thread Augusto Ribeiro Silva
Hi all, I am using the Tika Java library to read the content of some PDFs, and it seems to insert some weird (hyphen-like) spacing. For example: The es tab lish ment of an in te grated Part ner Re la tion ship Man age ment (PRM) sys tem can po ten tially ad dress sev eral as pets I tried to ex

RE: Weird spacing in words

2016-05-31 Thread Allison, Timothy B.
PDFs don't necessarily include spaces. In some (many?) cases, code has to do the calculation of character widths and locations on the page to determine whether or not to insert spaces. If something goes wrong with the coordinate calculations, you can get extra or missing spaces. You could exp
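
To make the width-and-position calculation described above concrete, here is a minimal sketch in Java. It is not the actual PDFBox or Tika algorithm; the class SpaceHeuristic, the Glyph type, and the tolerance parameter are all hypothetical names chosen for illustration. It assumes each glyph arrives with a left x coordinate and an advance width, and it inserts a space whenever the gap to the previous glyph exceeds a fraction of that glyph's width.

```java
import java.util.List;

// Hypothetical illustration only: not the PDFBox/Tika implementation.
class SpaceHeuristic {

    /** A positioned glyph: the character, its left x coordinate, and its advance width. */
    static final class Glyph {
        final char c;
        final float x;
        final float width;

        Glyph(char c, float x, float width) {
            this.c = c;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Rebuilds text from positioned glyphs on one line. A space is inserted when
     * the horizontal gap between the end of the previous glyph and the start of
     * the current one is larger than tolerance * (previous glyph width).
     */
    static String join(List<Glyph> glyphs, float tolerance) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                float gap = g.x - (prev.x + prev.width);
                if (gap > tolerance * prev.width) {
                    out.append(' ');
                }
            }
            out.append(g.c);
            prev = g;
        }
        return out.toString();
    }
}
```

In this sketch, if the widths reported for a font are smaller than the real advances, every computed gap grows and spaces appear inside words, which is one way the symptom in the original message could arise. In PDFBox itself, PDFTextStripper exposes setSpacingTolerance and setAverageCharTolerance, which may be worth experimenting with, though whether they help for a given PDF is not something this thread establishes.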

Re: Weird spacing in words

2016-05-31 Thread Augusto Ribeiro Silva
Hi, I do get the same result using pdfbox. I will open an issue over there. Thanks for the help. Best regards, Augusto

RE: Weird spacing in words

2016-05-31 Thread Allison, Timothy B.
Sorry I couldn't help.