[ https://issues.apache.org/jira/browse/TIKA-3202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
marek kapowicki closed TIKA-3202.
---------------------------------
    Resolution: Works for Me

> Tika duplicates the ocr text
> ----------------------------
>
>                 Key: TIKA-3202
>                 URL: https://issues.apache.org/jira/browse/TIKA-3202
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.24.1
>            Reporter: marek kapowicki
>            Priority: Major
>         Attachments: text_and_image.pdf
>
>
> I'm using Tika 1.24.1 together with Tesseract from the Docker image apache/tika:1.24-full.
> The header X-Tika-PDFocrStrategy: OCR_AND_TEXT causes the issue:
> the output from PDF processing is duplicated.
> The output from the attached PDF file is:
> {code:java}
> There is some text
> [image: image0.jpg]
> There is some textT
> here is an image!!
> {code}
> The curl command to reproduce:
> {code:java}
> curl -H "X-Tika-PDFextractInlineImages: true" -H "X-Tika-PDFocrStrategy: OCR_AND_TEXT" -T text_and_image.pdf http://localhost:9998/tika
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)