Tesseract OCR text extraction issue on Debian Bullseye

Sandeep Kulkarni Mon, 02 May 2022 22:17:45 -0700

Hi,

Ours is a Java-based application that uses Tika via AutoDetectParser. We 
initialize TesseractOCRConfig with tesseractPath and tessdataPath (and a few 
more parameters) and set it on the parse context before invoking ParsingReader.


I am currently using Tika 2.2.1 with Tesseract OCR 4.0.0 (the default version 
for this distro) on the Debian Buster based Docker image 
openjdk:8u312-jre-buster. Things work as expected and text is extracted from 
images.

We are now trying to upgrade Tesseract and have run into issues. We moved to 
the Debian Bullseye based image openjdk:8u332-jre-bullseye with Tesseract 4.1.1 
(the default version for this distro), and text extraction from images stopped 
working. We have not changed anything else in the Tika or Tesseract 
configuration.

With debug logging enabled for TesseractOCRParser, I can see that hasTesseract 
now reports false for the binary at /usr/bin/tesseract:

2022-05-02 10:55:26,053 DEBUG [TesseractOCRParser] hasTesseract (path: 
[/usr/bin/tesseract]): false

Because of this, Tesseract OCR does not get invoked. However, when I look for 
the Tesseract binary, it is present at /usr/bin/tesseract and runs from the 
shell:

root@vic:/# which tesseract
/usr/bin/tesseract
root@vic # tesseract -v
tesseract 4.1.1
leptonica-1.79.0
  libgif 5.1.9 : libjpeg 6b (libjpeg-turbo 2.0.6) : libpng 1.6.37 : libtiff 
4.2.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.4.0
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 
libzstd/1.4.8
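
For reference, the shell check above could be repeated from inside the JVM, 
which is closer to the environment Tika runs in. The following is only a 
minimal sketch under stated assumptions, not Tika's hasTesseract 
implementation: the class TesseractProbe and its method names are hypothetical, 
and running the binary with -v simply mirrors the shell command above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical diagnostic helper; not part of Tika.
public class TesseractProbe {

    // Join the configured directory and the program name,
    // e.g. "/usr/bin/" + "tesseract".
    public static Path resolveBinary(String tesseractPath, String program) {
        return Paths.get(tesseractPath).resolve(program);
    }

    // Launch "<binary> -v" and return its exit code. Returns -1 if the
    // process could not be started, its output could not be read, or the
    // wait was interrupted.
    public static int versionExitCode(Path binary) {
        ProcessBuilder pb = new ProcessBuilder(binary.toString(), "-v");
        pb.redirectErrorStream(true); // merge stderr into stdout
        try {
            Process p = pb.start();
            try (BufferedReader r = new BufferedReader(new InputStreamReader(
                    p.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = r.readLine()) != null) {
                    System.out.println("  " + line); // echo the version banner
                }
            }
            return p.waitFor();
        } catch (IOException e) {
            System.out.println("  could not run: " + e.getMessage());
            return -1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }

    public static void main(String[] args) {
        // Directory defaults to the value used in this report.
        String dir = args.length > 0 ? args[0] : "/usr/bin/";
        Path binary = resolveBinary(dir, "tesseract");
        System.out.println("binary:     " + binary);
        System.out.println("exists:     " + Files.exists(binary));
        System.out.println("executable: " + Files.isExecutable(binary));
        System.out.println("exit code:  " + versionExitCode(binary));
    }
}
```

Running this inside the same container, as the same user as the application, 
would show whether the JVM itself can see and launch the binary.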

Earlier, with Tesseract 4.0.0, it was working and produced the logs below:

2022-05-02 09:55:08,275 DEBUG [TesseractOCRParser] hasTesseract (path: 
[/usr/bin/tesseract]): true
2022-05-02 09:55:08,450 INFO  [Tika Parser-1] [TesseractOCRParser] Tesseract is 
installed and is being invoked. This can add greatly to processing time.  If 
you do not want tesseract to be applied to your files see: 
https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
2022-05-02 09:55:08,451 DEBUG [Tika Parser-1] [TesseractOCRParser] Tesseract 
command: /usr/bin/tesseract /tmp/apache-tika-1769393331829017331.tmp 
/tmp/apache-tika-8424718595969554950.tmp --psm 1 -l eng -c page_separator= -c 
preserve_interword_spaces=0 txt
2022-05-02 09:55:09,222 DEBUG [Thread-29] [TesseractOCRParser]
2022-05-02 09:55:09,222 DEBUG [Thread-30] [TesseractOCRParser] Tesseract Open 
Source OCR Engine v4.0.0 with Leptonica

We use the following Tesseract OCR settings (both before and after the upgrade):

tesseractPath=/usr/bin/
tessdataPath=/usr/share/tesseract-ocr/4.00/tessdata/
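
As a sanity check on the tessdataPath setting, one could verify that the 
language model used in the logged command (-l eng) is actually present in that 
directory. The sketch below is an illustration only: the class TessdataCheck 
and its methods are hypothetical names, and it relies solely on the fact that 
Tesseract stores language models as <lang>.traineddata files.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for validating the tessdataPath setting; not part of Tika.
public class TessdataCheck {

    // Tesseract language models are stored as "<lang>.traineddata"
    // inside the tessdata directory.
    public static Path trainedDataFile(String tessdataPath, String lang) {
        return Paths.get(tessdataPath).resolve(lang + ".traineddata");
    }

    // True only if the model file exists and is a regular file.
    public static boolean hasLanguage(String tessdataPath, String lang) {
        return Files.isRegularFile(trainedDataFile(tessdataPath, lang));
    }

    public static void main(String[] args) {
        // Directory defaults to the value used in this report.
        String tessdataPath = args.length > 0
                ? args[0]
                : "/usr/share/tesseract-ocr/4.00/tessdata/";
        System.out.println(trainedDataFile(tessdataPath, "eng")
                + " present: " + hasLanguage(tessdataPath, "eng"));
    }
}
```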

We are also facing the same issue on Ubuntu-based VMs that we recently upgraded 
from 16.04 to 20.04.

Finally, we install Tesseract OCR with a plain 'apt install tesseract-ocr' 
command, both when building the Docker image and on the Ubuntu VMs. As Ubuntu 
is based on Debian, it is possible that the two issues are related.

FYI, we are not facing this issue on Windows at all with Tesseract OCR 4.0.0, 
4.1.0 and 5.0.1. There we install the Tesseract OCR builds available at 
https://github.com/UB-Mannheim/tesseract/wiki and the paths for the tesseract 
binary and tessdata are as below:

tesseractPath=C:\Program Files\Tesseract-OCR\
tessdataPath=C:\Program Files\Tesseract-OCR\tessdata\

Any help would be appreciated. I also wanted to ask whether there is a 
compatibility matrix of supported Tesseract OCR versions for Tika. We plan to 
move to 5.x in the near future.

Regards,
Sandeep Kulkarni
