Hi,

Ours is a Java-based application that uses Tika via AutoDetectParser. We initialize TesseractOCRConfig with tesseractPath and tessdataPath (and a few more parameters) and set it into the context before invoking ParsingReader.

I am currently using Tika 2.2.1 with Tesseract OCR 4.0.0 (the default version for this distro) on the Debian Buster Docker base image openjdk:8u312-jre-buster. Things work as expected and I am able to extract text from images.

We are now trying to upgrade Tesseract and have started facing some issues. We moved to the Debian Bullseye based image openjdk:8u332-jre-bullseye with Tesseract 4.1.1 (the default version for this distro), and image extraction stopped working. We have not changed anything else in the configuration for Tika or Tesseract.

With debug logging enabled for TesseractOCRParser, I can see that the hasTesseract check now fails and does not find tesseract at /usr/bin/tesseract:

2022-05-02 10:55:26,053 DEBUG [TesseractOCRParser] hasTesseract (path: [/usr/bin/tesseract]): false

Because of this, Tesseract OCR does not get invoked. However, the Tesseract binary is present at exactly that path:

root@vic:/# which tesseract
/usr/bin/tesseract
root@vic # tesseract -v
tesseract 4.1.1
leptonica-1.79.0
libgif 5.1.9 : libjpeg 6b (libjpeg-turbo 2.0.6) : libpng 1.6.37 : libtiff 4.2.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.4.0
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8

Earlier, with Tesseract 4.0.0, it was working and produced the logs below:

2022-05-02 09:55:08,275 DEBUG [TesseractOCRParser] hasTesseract (path: [/usr/bin/tesseract]): true
2022-05-02 09:55:08,450 INFO [Tika Parser-1] [TesseractOCRParser] Tesseract is installed and is being invoked. This can add greatly to processing time.
If you do not want tesseract to be applied to your files see: https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
2022-05-02 09:55:08,451 DEBUG [Tika Parser-1] [TesseractOCRParser] Tesseract command: /usr/bin/tesseract /tmp/apache-tika-1769393331829017331.tmp /tmp/apache-tika-8424718595969554950.tmp --psm 1 -l eng -c page_separator= -c preserve_interword_spaces=0 txt
2022-05-02 09:55:09,222 DEBUG [Thread-29] [TesseractOCRParser]
2022-05-02 09:55:09,222 DEBUG [Thread-30] [TesseractOCRParser] Tesseract Open Source OCR Engine v4.0.0 with Leptonica

We use the following Tesseract OCR settings (both earlier and now):

tesseractPath=/usr/bin/
tessdataPath=/usr/share/tesseract-ocr/4.00/tessdata/

We are also facing the same issue on Ubuntu-based VMs that we recently upgraded from 16.04 to 20.04. We install Tesseract OCR with a simple 'apt install tesseract-ocr' command, both while building the Docker image and on the Ubuntu VMs. As Ubuntu is based on Debian, it is possible that the two issues are related.

FYI, we are not facing this issue on Windows at all with Tesseract OCR 4.0.0, 4.1.0 and 5.0.1. There we install the Tesseract OCR builds available at https://github.com/UB-Mannheim/tesseract/wiki, and the paths for the tesseract binary and tessdata are as below:

tesseractPath=C:\Program Files\Tesseract-OCR\
tessdataPath=C:\Program Files\Tesseract-OCR\tessdata\

Any help would be appreciated.

I also wanted to ask whether there is a compatibility matrix of supported Tesseract OCR versions for Tika. We plan to move to 5.x in the near future.

Regards,
Sandeep Kulkarni
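
P.S. For anyone trying to reproduce this: a JDK-only check along the following lines could help separate a path problem from a process-launch problem inside the container. It is only a sketch and is not Tika's actual hasTesseract logic. The class name TesseractProbe, its method names, and the 10-second timeout are made up for illustration; the directory /usr/bin/ is the tesseractPath value from the settings above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration only; not part of Tika.
final class TesseractProbe {

    // Joins the configured directory (e.g. tesseractPath=/usr/bin/) and the binary name.
    static Path resolveBinary(String tesseractDir, String binaryName) {
        return Paths.get(tesseractDir).resolve(binaryName);
    }

    // Launches "<binary> -v" and reports whether it exited with code 0 within 10 seconds.
    static boolean canRun(Path binary) {
        try {
            Process process = new ProcessBuilder(binary.toString(), "-v")
                    .inheritIO() // show the version banner (or any loader error) on the console
                    .start();
            if (!process.waitFor(10, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                return false;
            }
            return process.exitValue() == 0;
        } catch (IOException e) {
            // Binary missing, not executable, or otherwise not launchable.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed default: the tesseractPath directory quoted in this mail.
        String dir = args.length > 0 ? args[0] : "/usr/bin/";
        Path binary = resolveBinary(dir, "tesseract");
        System.out.println("resolved path : " + binary);
        System.out.println("exists        : " + Files.exists(binary));
        System.out.println("is executable : " + Files.isExecutable(binary));
        System.out.println("runs with -v  : " + canRun(binary));
    }
}
```

If the file exists and is executable but the launch itself fails, that would suggest looking at how the process is started (environment, exit code, timeout) rather than at the path setting; if the resolved path is already wrong, the path setting is the first suspect.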
