I’ve seen this before on a few documents. You might experiment with setting PDFParserConfig’s suppressDuplicateOverlappingText to true. If that doesn’t work, I’d recommend running the pure PDFBox app’s ExtractText on the document. If you get the same doubling of letters, ask over on u...@pdfbox.apache.org. If you don’t, let us know!
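
In case it helps, here is a minimal sketch of the first suggestion, assuming the Tika 1.x Java API. The class name DedupPdfText and the helper contextWithDedup are made up for illustration, and the PDF path is taken from the first command-line argument. It is a starting point rather than a confirmed fix for this document.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical example class; only the Tika types and setters are real API.
class DedupPdfText {

    // Builds a ParseContext that asks the PDF parser to drop duplicate,
    // overlapping text (often produced by shadow or fake-bold effects).
    static ParseContext contextWithDedup() {
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setSuppressDuplicateOverlappingText(true);

        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);
        return context;
    }

    public static void main(String[] args) throws Exception {
        // args[0] is assumed to be the path to the problem PDF.
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            new AutoDetectParser().parse(in, handler, new Metadata(), contextWithDedup());
        }
        System.out.println(handler.toString());
    }
}
```

The config travels in the ParseContext, so the same context can be passed to whichever parser you already use; the PDF parser picks it up from there.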

Best,
Tim

From: Mohammad Ghufran [mailto:emghuf...@gmail.com]
Sent: Tuesday, October 07, 2014 8:37 AM
To: user@tika.apache.org
Subject: Problem with content extraction

Hello,

I am using Tika to extract the content of documents, but I've run into a problem. In some documents, the characters in the output are repeated several times. For example, while processing a PDF file, the text "FORMATION" is transformed into "FFOORRMMAATTIIOONN", and so on.

I tried looking through the mailing lists but didn't find any reference to this. I also tried the latest version of Tika, but it produces the same output. The only thing I can notice, in case it is useful, is that the document seems to have text written with some kind of shadow.

I would like to know whether someone has encountered this problem before and what the possible solutions are, if any.

Best Regards,
Ghufran