Hi Luís,

Yes, I double-checked that today and the Tesseract dictionaries are present. I am able to extract text from an image from the command line within the Docker container.
I have given examples in my reply to Tim.

Regards,
Sandeep Kulkarni

From: Luís Filipe Nassif <[email protected]>
Sent: Tuesday, May 3, 2022 5:40 PM
To: [email protected]
Subject: [External] Re: Tesseract OCR text extraction issue on Debian Bullseye

Just some guesses: were the new Tesseract dictionaries installed? Were you able to OCR an image from the command line with the newer Tesseract?

On Tue, May 3, 2022 at 02:17, Sandeep Kulkarni <[email protected]> wrote:

Hi,

Ours is a Java-based application which uses Tika via AutoDetectParser. We initialize TesseractOCRConfig with tesseractPath and tessdataPath (and a few more parameters) and set it into the context before invoking ParsingReader.

I am currently using Tika 2.2.1 with Tesseract OCR 4.0.0 (the default version for this distro) on the Debian Buster Docker base image openjdk:8u312-jre-buster. Things work as expected and I am able to get text extracted from images.

We are now trying to upgrade Tesseract and have started facing some issues. We tried to move to the Debian Bullseye based openjdk:8u332-jre-bullseye and Tesseract 4.1.1 (the default version for this distro), and image extraction stopped working. We have not changed anything else in the configuration for Tika and Tesseract.

With debug logging enabled for TesseractOCRParser, I can see that the hasTesseract check now fails and does not find tesseract at /usr/bin/tesseract.

    2022-05-02 10:55:26,053 DEBUG [TesseractOCRParser] hasTesseract (path: [/usr/bin/tesseract]): false

Because of this, Tesseract OCR does not get invoked. If I look at the path where the Tesseract binary is present, I can see it at /usr/bin/tesseract itself.
    root@vic:/# which tesseract
    /usr/bin/tesseract
    root@vic # tesseract -v
    tesseract 4.1.1
     leptonica-1.79.0
      libgif 5.1.9 : libjpeg 6b (libjpeg-turbo 2.0.6) : libpng 1.6.37 : libtiff 4.2.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.4.0
     Found AVX2
     Found AVX
     Found FMA
     Found SSE
     Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8

Earlier, it was working and produced the logs below:

    2022-05-02 09:55:08,275 DEBUG [TesseractOCRParser] hasTesseract (path: [/usr/bin/tesseract]): true
    2022-05-02 09:55:08,450 INFO [Tika Parser-1] [TesseractOCRParser] Tesseract is installed and is being invoked. This can add greatly to processing time. If you do not want tesseract to be applied to your files see: https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
    2022-05-02 09:55:08,451 DEBUG [Tika Parser-1] [TesseractOCRParser] Tesseract command: /usr/bin/tesseract /tmp/apache-tika-1769393331829017331.tmp /tmp/apache-tika-8424718595969554950.tmp --psm 1 -l eng -c page_separator= -c preserve_interword_spaces=0 txt
    2022-05-02 09:55:09,222 DEBUG [Thread-29] [TesseractOCRParser]
    2022-05-02 09:55:09,222 DEBUG [Thread-30] [TesseractOCRParser] Tesseract Open Source OCR Engine v4.0.0 with Leptonica

We use the Tesseract OCR settings below (both earlier and now):

    tesseractPath=/usr/bin/
    tessdataPath=/usr/share/tesseract-ocr/4.00/tessdata/

We are also facing the same issue with Ubuntu-based VMs that we recently upgraded from 16.04 to 20.04.
Finally, we use a simple 'apt install tesseract-ocr' command to install Tesseract OCR when building the Docker image as well as on the Ubuntu VMs. As Ubuntu is based on Debian, it is possible that the issues we are facing are related.

FYI, we are not facing this issue on Windows with Tesseract OCR 4.0.0, 4.1.0 and 5.0.1. There we install the Tesseract OCR build available at https://github.com/UB-Mannheim/tesseract/wiki and the paths for the tesseract binary and tessdata are as below:

    tesseractPath=C:\Program Files\Tesseract-OCR\
    tessdataPath=C:\Program Files\Tesseract-OCR\tessdata\

Any help would be appreciated. I also wanted to ask whether there is a compatibility matrix of supported Tesseract OCR versions against Tika. We also plan to move to 5.x in the near future.

Regards,
Sandeep Kulkarni
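
For anyone reproducing this, the two configured paths can be checked from inside the same JVM and container that runs Tika. The sketch below uses only the Java standard library and is not Tika's hasTesseract implementation. The class name TesseractPathCheck, its method names, and the choice of checks (directory exists, binary is executable, "tesseract -v" can be started) are assumptions made for illustration. The default paths are the ones quoted in this thread.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical diagnostic helper; this is NOT how Tika itself detects Tesseract.
public class TesseractPathCheck {

    /** Returns one message per failed check; an empty list means all checks passed. */
    public static List<String> findProblems(String tesseractPath, String tessdataPath, String program) {
        List<String> problems = new ArrayList<>();
        Path binDir = Paths.get(tesseractPath);
        Path binary = binDir.resolve(program);
        Path tessdata = Paths.get(tessdataPath);

        if (!Files.isDirectory(binDir)) {
            problems.add("tesseractPath is not a directory: " + binDir);
        }
        if (!Files.isExecutable(binary)) {
            problems.add("binary is missing or not executable by this user: " + binary);
        }
        if (!Files.isDirectory(tessdata)) {
            problems.add("tessdataPath is not a directory: " + tessdata);
        }
        return problems;
    }

    /** Runs "<binary> -v"; returns its exit code, or -1 if it could not be started or timed out. */
    public static int runVersion(Path binary) {
        try {
            // inheritIO() sends the child's output to this process's console,
            // so the child cannot block on a full pipe buffer.
            Process process = new ProcessBuilder(binary.toString(), "-v").inheritIO().start();
            if (!process.waitFor(10, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                return -1;
            }
            return process.exitValue();
        } catch (IOException e) {
            return -1; // the binary could not be started at all
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }

    public static void main(String[] args) {
        // Defaults are the settings quoted in this thread; override via arguments.
        String tesseractPath = args.length > 0 ? args[0] : "/usr/bin/";
        String tessdataPath = args.length > 1 ? args[1] : "/usr/share/tesseract-ocr/4.00/tessdata/";

        for (String problem : findProblems(tesseractPath, tessdataPath, "tesseract")) {
            System.out.println("PROBLEM: " + problem);
        }
        int exit = runVersion(Paths.get(tesseractPath).resolve("tesseract"));
        System.out.println("exit code of 'tesseract -v' (-1 means not started): " + exit);
    }
}
```

Running it as the same user as the application (for example `java TesseractPathCheck /usr/bin/ /usr/share/tesseract-ocr/4.00/tessdata/`) on both the Buster and the Bullseye image would show whether either path differs between the two. A clean result only rules out these particular checks.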
