Re: Tesseract OCR text extraction issue on Debian Bullseye

Sandeep Kulkarni Wed, 04 May 2022 06:32:37 -0700

Hi Luís,

Yes, I double-checked that today and the Tesseract dictionaries are present. I 
am able to extract text from an image from the command line inside the Docker 
container.

I have given examples in my reply to Tim.

Regards,
Sandeep Kulkarni

From: Luís Filipe Nassif <[email protected]>
Sent: Tuesday, May 3, 2022 5:40 PM
To: [email protected]
Subject: [External] Re: Tesseract OCR text extraction issue on Debian Bullseye

Just some guesses: were the new Tesseract dictionaries installed? Were you able 
to OCR an image from the command line with the newer Tesseract?

On Tue, May 3, 2022 at 02:17, Sandeep Kulkarni <[email protected]> wrote:
Hi,

Ours is a Java-based application which uses Tika via AutoDetectParser. We 
initialize TesseractOCRConfig with tesseractPath and tessdataPath (and a few 
more parameters) and set it into the context before invoking ParsingReader.

I am currently using Tika 2.2.1 with Tesseract OCR 4.0.0 (default version for 
this distro) on Debian Buster docker base image for openjdk:8u312-jre-buster. 
Things work as expected and I am able to get text extracted from images.

We are now trying to upgrade Tesseract and have started facing some issues. We 
tried to move to the Debian Bullseye based openjdk:8u332-jre-bullseye image and 
Tesseract 4.1.1 (default version for this distro), and image extraction stopped 
working. We have not changed anything else in the configuration for Tika and 
Tesseract.

With debug logging enabled for TesseractOCRParser, I can see that hasTesseract 
now returns false, i.e. it does not find Tesseract at /usr/bin/tesseract.

2022-05-02 10:55:26,053 DEBUG [TesseractOCRParser] hasTesseract (path: 
[/usr/bin/tesseract]): false

Because of this, Tesseract OCR does not get invoked. If I take a look at the 
path at which the Tesseract binary is present, I can see it at 
/usr/bin/tesseract itself.

root@vic:/# which tesseract
/usr/bin/tesseract
root@vic # tesseract -v
tesseract 4.1.1
leptonica-1.79.0
  libgif 5.1.9 : libjpeg 6b (libjpeg-turbo 2.0.6) : libpng 1.6.37 : libtiff 
4.2.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.4.0
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 
libzstd/1.4.8
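
The log line only reports the result of hasTesseract, not why it is false. A 
minimal sketch, assuming nothing about Tika's internals, that repeats the shell 
checks above from inside the JVM: it resolves the binary from a 
tesseractPath-style directory, checks that it exists and is executable, and 
runs it with -v. The class TesseractProbe, its method names and the 10-second 
timeout are hypothetical and chosen for illustration only.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.TimeUnit;

// Hypothetical diagnostic helper; this is NOT how Tika implements hasTesseract.
class TesseractProbe {

    // Joins the configured directory and the program name, as a
    // "tesseractPath=/usr/bin/" style setting implies.
    static Path resolveBinary(String tesseractPath, String progName) {
        return Paths.get(tesseractPath, progName);
    }

    // Starts the binary with one argument and returns its exit code, or -1 if
    // the process could not be started or did not finish within 10 seconds.
    static int exitCodeOf(Path binary, String arg) {
        ProcessBuilder pb = new ProcessBuilder(binary.toString(), arg);
        try {
            Process p = pb.inheritIO().start();
            if (!p.waitFor(10, TimeUnit.SECONDS)) {
                p.destroyForcibly();
                return -1;
            }
            return p.exitValue();
        } catch (IOException e) {
            // Typically "No such file or directory" or "Permission denied".
            return -1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }

    public static void main(String[] args) {
        Path bin = resolveBinary("/usr/bin/", "tesseract");
        System.out.println("binary:     " + bin);
        System.out.println("exists:     " + Files.exists(bin));
        System.out.println("executable: " + Files.isExecutable(bin));
        System.out.println("exit code:  " + exitCodeOf(bin, "-v"));
    }
}
```

Running this as the same user and with the same JVM as the application shows 
whether the Java process itself can start the binary and which exit code it 
sees.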

Whereas earlier it was working, with the logs below:

2022-05-02 09:55:08,275 DEBUG [TesseractOCRParser] hasTesseract (path: 
[/usr/bin/tesseract]): true
2022-05-02 09:55:08,450 INFO  [Tika Parser-1] [TesseractOCRParser] Tesseract is 
installed and is being invoked. This can add greatly to processing time.  If 
you do not want tesseract to be applied to your files see: 
https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
2022-05-02 09:55:08,451 DEBUG [Tika Parser-1] [TesseractOCRParser] Tesseract 
command: /usr/bin/tesseract /tmp/apache-tika-1769393331829017331.tmp 
/tmp/apache-tika-8424718595969554950.tmp --psm 1 -l eng -c page_separator= -c 
preserve_interword_spaces=0 txt
2022-05-02 09:55:09,222 DEBUG [Thread-29] [TesseractOCRParser]
2022-05-02 09:55:09,222 DEBUG [Thread-30] [TesseractOCRParser] Tesseract Open 
Source OCR Engine v4.0.0 with Leptonica
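
The logged command can also be replayed outside Tika to see whether the binary 
behaves the same when started from Java. The sketch below only mirrors the 
argument order of the command in the log above; TesseractCommand, build and the 
sample file paths are hypothetical placeholders, not Tika code.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that rebuilds the command line seen in the debug log.
class TesseractCommand {

    // tesseractPath is the directory setting (with trailing separator),
    // outputBase is the output file name without the ".txt" extension.
    static List<String> build(String tesseractPath, String input,
                              String outputBase, String lang) {
        List<String> cmd = new ArrayList<>();
        cmd.add(tesseractPath + "tesseract");
        cmd.add(input);
        cmd.add(outputBase);
        cmd.addAll(Arrays.asList(
                "--psm", "1",
                "-l", lang,
                "-c", "page_separator=",
                "-c", "preserve_interword_spaces=0",
                "txt"));
        return cmd;
    }

    public static void main(String[] args) throws InterruptedException {
        // Placeholder paths; replace with a real image inside the container.
        List<String> cmd = build("/usr/bin/", "/tmp/sample.png", "/tmp/sample-out", "eng");
        System.out.println(String.join(" ", cmd));
        try {
            Process p = new ProcessBuilder(cmd).inheritIO().start();
            System.out.println("exit code: " + p.waitFor());
        } catch (IOException e) {
            System.out.println("could not start process: " + e.getMessage());
        }
    }
}
```

If the process starts and exits with 0, the text should be in 
/tmp/sample-out.txt, since tesseract appends the extension to the output base.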

We use the Tesseract OCR settings below (both earlier and now).

tesseractPath=/usr/bin/
tessdataPath=/usr/share/tesseract-ocr/4.00/tessdata/
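
Because the same tessdataPath is used on both images, it may be worth 
confirming that the directory still holds the model for the language passed 
with -l. A minimal sketch, assuming the usual <lang>.traineddata file naming; 
TessdataCheck and hasLanguage are hypothetical names.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical check that a tessdata directory contains a language model.
class TessdataCheck {

    // Returns true if <tessdataDir>/<lang>.traineddata is a regular file.
    static boolean hasLanguage(Path tessdataDir, String lang) {
        return Files.isRegularFile(tessdataDir.resolve(lang + ".traineddata"));
    }

    public static void main(String[] args) {
        // Directory taken from the settings above.
        Path dir = Paths.get("/usr/share/tesseract-ocr/4.00/tessdata/");
        System.out.println(dir + " has eng: " + hasLanguage(dir, "eng"));
    }
}
```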

We are also facing the same issue with Ubuntu-based VMs that we recently 
upgraded from 16.04 to 20.04.

Finally, we use a simple 'apt install tesseract-ocr' command to install 
Tesseract OCR while building the Docker image as well as on the Ubuntu VMs. As 
Ubuntu is based on Debian, it is possible that the two issues are related.

FYI, we are not facing the issue on Windows at all with Tesseract OCR 4.0.0, 
4.1.0 and 5.0.1. There we install the Tesseract OCR builds available at 
https://github.com/UB-Mannheim/tesseract/wiki
and the paths for the tesseract binary and tessdata are as below:

tesseractPath=C:\Program Files\Tesseract-OCR\
tessdataPath=C:\Program Files\Tesseract-OCR\tessdata\

Any help would be appreciated. I also wanted to ask whether there is a 
compatibility matrix for supported Tesseract OCR versions against Tika. We also 
plan to move to 5.x in the near future.

Regards,
Sandeep Kulkarni
