Hi Tim,

Yes, I will use this repo to try to replicate the problem and let you know what I observe. Thanks for the help.

Regards,
Sandeep Kulkarni

-----Original Message-----
From: Tim Allison <[email protected]>
Sent: Thursday, May 5, 2022 2:18 AM
To: [email protected]
Subject: [External] Re: Tesseract OCR text extraction issue on Debian Bullseye

I created a very small repo with a version of tika-app that has log level set for debug: https://github.com/tballison/tika-addons/tree/main/tika-docker-play

I'm not able to replicate your problem. To be clear, I trust you are having your problem! Can you work with that repo and see if you can get it to fail? Maybe add a stripped down version of your tika-config.xml?

On Wed, May 4, 2022 at 4:02 PM Tim Allison <[email protected]> wrote:
>
> I just added extra debugging to the ExternalParser to figure out why
> it doesn't find tesseract. Are you able to add the logging to a local
> build of Tika or use a snapshot?
>
> You aren't running in a Turkish locale, by chance? See: TIKA-1526
>
> On Wed, May 4, 2022 at 9:32 AM Sandeep Kulkarni
> <[email protected]> wrote:
> >
> > Hi Luís,
> >
> > Yes, I have double checked that today and tesseract dictionaries are
> > present. I am able to do the text extraction from an image from command
> > line within docker container.
> >
> > I have given examples in reply to Tim.
> >
> > Regards,
> > Sandeep Kulkarni
> >
> > From: Luís Filipe Nassif <[email protected]>
> > Sent: Tuesday, May 3, 2022 5:40 PM
> > To: [email protected]
> > Subject: [External] Re: Tesseract OCR text extraction issue on
> > Debian Bullseye
> >
> > Just some guesses: were new tesseract dictionaries installed?
> > Were you able to OCR an image from cmd line with newer tesseract?
> >
> > On Tue, May 3, 2022 at 02:17, Sandeep Kulkarni
> > <[email protected]> wrote:
> >
> > Hi,
> >
> > Ours is a Java based application which uses Tika via AutoDetectParser. We
> > init TesseractOCRConfig with tesseractPath and tessdataPath (and a few more
> > parameters) and set it into the context before invoking ParsingReader.
> >
> > I am currently using Tika 2.2.1 with Tesseract OCR 4.0.0 (default version
> > for this distro) on the Debian Buster docker base image
> > openjdk:8u312-jre-buster. Things work as expected and I am able to get text
> > extracted from images.
> >
> > We are now trying to upgrade Tesseract and have started facing some issues.
> > We tried to move to the Debian Bullseye based openjdk:8u332-jre-bullseye and
> > Tesseract 4.1.1 (default version for this distro) and image extraction
> > stopped working. We have not changed anything else in the configuration for
> > Tika and Tesseract.
> >
> > With debug logging enabled for TesseractOCRParser, I can see that
> > hasTesseract is not working now and is not finding tesseract at
> > /usr/bin/tesseract.
> >
> > 2022-05-02 10:55:26,053 DEBUG [TesseractOCRParser] hasTesseract (path: [/usr/bin/tesseract]): false
> >
> > Because of this, Tesseract OCR does not get invoked. If I take a look at the
> > path at which the Tesseract binary is present, I can see it at
> > /usr/bin/tesseract itself.
> >
> > root@vic:/# which tesseract
> > /usr/bin/tesseract
> > root@vic # tesseract -v
> > tesseract 4.1.1
> > leptonica-1.79.0
> > libgif 5.1.9 : libjpeg 6b (libjpeg-turbo 2.0.6) : libpng 1.6.37 : libtiff 4.2.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.4.0
> > Found AVX2
> > Found AVX
> > Found FMA
> > Found SSE
> > Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8
> >
> > Whereas earlier it was working, with the logs below:
> >
> > 2022-05-02 09:55:08,275 DEBUG [TesseractOCRParser] hasTesseract (path: [/usr/bin/tesseract]): true
> > 2022-05-02 09:55:08,450 INFO [Tika Parser-1] [TesseractOCRParser] Tesseract is installed and is being invoked. This can add greatly to processing time. If you do not want tesseract to be applied to your files see: https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
> > 2022-05-02 09:55:08,451 DEBUG [Tika Parser-1] [TesseractOCRParser] Tesseract command: /usr/bin/tesseract /tmp/apache-tika-1769393331829017331.tmp /tmp/apache-tika-8424718595969554950.tmp --psm 1 -l eng -c page_separator= -c preserve_interword_spaces=0 txt
> > 2022-05-02 09:55:09,222 DEBUG [Thread-29] [TesseractOCRParser]
> > 2022-05-02 09:55:09,222 DEBUG [Thread-30] [TesseractOCRParser] Tesseract Open Source OCR Engine v4.0.0 with Leptonica
> >
> > We use the Tesseract OCR settings below (both earlier and now).
> >
> > tesseractPath=/usr/bin/
> > tessdataPath=/usr/share/tesseract-ocr/4.00/tessdata/
> >
> > We are also facing the same issue with Ubuntu based VMs that we upgraded from
> > 16.04 to 20.04 recently.
> >
> > Finally, we use a simple 'apt install tesseract-ocr' command to install
> > Tesseract OCR while building the docker image as well as on the Ubuntu VMs.
> > As Ubuntu is based on Debian, it is possible that the issues we are facing
> > are related.
> >
> > FYI, we are not facing any issue on Windows with Tesseract OCR 4.0.0, 4.1.0
> > and 5.0.1. There we install the Tesseract OCR builds available at
> > https://github.com/UB-Mannheim/tesseract/wiki
> > and the paths for the tesseract binary and tessdata are as below:
> >
> > tesseractPath=C:\Program Files\Tesseract-OCR\
> > tessdataPath=C:\Program Files\Tesseract-OCR\tessdata\
> >
> > Any help would be appreciated. I also wanted to ask whether there is a
> > compatibility matrix for supported Tesseract OCR versions against Tika. We
> > also plan to move to 5.x in the near future.
> >
> > Regards,
> > Sandeep Kulkarni
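
For anyone trying to reproduce the hasTesseract result reported in the original message, the following is a minimal sketch of an independent check that can be run in the same container, as the same user and with the same JVM as the application. It is not Tika's own detection code; the class name TesseractProbe and its helper methods are hypothetical, and the only assumptions taken from the thread are the directory-style tesseractPath (/usr/bin/) and the `tesseract -v` command. It also prints the JVM default locale, since a Turkish locale was raised as a possible cause.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of Tika: probes the tesseract binary from inside the JVM.
class TesseractProbe {

    // Joins the configured directory and the binary name,
    // tolerating a directory given with or without a trailing separator.
    static String resolveBinary(String tesseractPath, String binaryName) {
        if (tesseractPath == null || tesseractPath.isEmpty()) {
            return binaryName; // fall back to a PATH lookup by the OS
        }
        boolean hasSeparator = tesseractPath.endsWith("/") || tesseractPath.endsWith("\\");
        return hasSeparator
                ? tesseractPath + binaryName
                : tesseractPath + File.separator + binaryName;
    }

    // Runs "<binary> -v", echoes its output, and returns the exit code.
    // Returns -1 if the process could not be started or the wait was interrupted.
    static int probe(String binary) {
        ProcessBuilder pb = new ProcessBuilder(binary, "-v").redirectErrorStream(true);
        try {
            Process process = pb.start();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println("  " + line);
                }
            }
            return process.waitFor();
        } catch (IOException e) {
            System.out.println("could not start process: " + e.getMessage());
            return -1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }

    public static void main(String[] args) {
        // The directory defaults to the tesseractPath value quoted in the thread.
        String dir = args.length > 0 ? args[0] : "/usr/bin/";
        String binary = resolveBinary(dir, "tesseract");

        System.out.println("default locale: " + Locale.getDefault());
        System.out.println("binary: " + binary);
        System.out.println("exists: " + new File(binary).exists()
                + ", executable: " + new File(binary).canExecute());
        System.out.println("exit code: " + probe(binary));
    }
}
```

Running it once on the Buster image and once on the Bullseye image and comparing the output may help narrow down whether the difference lies in the binary itself or in how it is detected.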
