marek kapowicki created TIKA-3202:
-------------------------------------

             Summary: Tika duplicates the ocr text
                 Key: TIKA-3202
                 URL: https://issues.apache.org/jira/browse/TIKA-3202
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.24.1
            Reporter: marek kapowicki
         Attachments: text_and_image.pdf

I'm using Tika 1.24.1 together with Tesseract from the Docker image apache/tika:1.24-full.

Setting the header X-Tika-PDFocrStrategy: OCR_AND_TEXT triggers the issue: the output from PDF processing is duplicated.

The output from the attached PDF file is:
{code:java}
There is some text [image: image0.jpg] There is some textT here is an image!!
{code}

The curl command to reproduce:
{code:java}
curl -H "X-Tika-PDFextractInlineImages: true" -H "X-Tika-PDFocrStrategy: OCR_AND_TEXT" -T text_and_image.pdf http://localhost:9998/tika
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)