I’ve seen this before on a few documents.  You might experiment with setting 
PDFParserConfig’s suppressDuplicateOverlappingText to true.  If that doesn’t 
work, I’d recommend running the pure PDFBox app’s ExtractText on the document.  
If you get the same doubling of letters, ask over on 
u...@pdfbox.apache.org.  If you don’t, let us 
know!
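
In case it helps, here is a rough sketch of the first suggestion. It assumes 
tika-parsers is on the classpath and that you call the parser from Java; the 
class name DedupPdfText, the helper contextWithDedup and the command-line 
argument are only illustrative, so adapt them to your own code.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Illustrative class name, not part of Tika.
public class DedupPdfText {

    // Builds a ParseContext that carries a PDFParserConfig with
    // suppressDuplicateOverlappingText switched on.
    static ParseContext contextWithDedup() {
        PDFParserConfig config = new PDFParserConfig();
        config.setSuppressDuplicateOverlappingText(true);
        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, config);
        return context;
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the problem PDF (assumed input for this sketch).
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            // -1 disables the handler's default write limit.
            BodyContentHandler handler = new BodyContentHandler(-1);
            new AutoDetectParser().parse(stream, handler, new Metadata(), contextWithDedup());
            System.out.println(handler.toString());
        }
    }
}
```

If the doubled letters are still there with this setting, that points toward 
the PDFBox check described above.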

Best,

           Tim


From: Mohammad Ghufran [mailto:emghuf...@gmail.com]
Sent: Tuesday, October 07, 2014 8:37 AM
To: user@tika.apache.org
Subject: Problem with content extraction

Hello,

I am using Tika to extract the content of documents, but I've run into a 
problem. In some documents, the characters in the output are repeated several 
times. For example, while processing a PDF file, the text "FORMATION" is 
transformed into "FFOORRMMAATTIIOONN" and so on.

I looked through the mailing lists but didn't find any reference to 
this. I also tried the latest version of Tika, but it produces the same 
output.

The only thing I can notice is that the document seems to have text written 
with some shadow, in case that is useful.

I would like to know if someone has encountered this problem before and what 
the possible solutions are, if any.

Best Regards,
Ghufran
