Re: Tesseract OCR text extraction issue on Debian Bullseye

Tim Allison Tue, 03 May 2022 03:10:48 -0700

Are you able to share your full Dockerfiles? Are you running
tika-server or a custom application?
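
For the permissions question quoted below, one way to see what the JVM
process itself can do is a small standalone check, independent of Tika.
This is only a minimal sketch, not Tika's own detection logic: the class
name TesseractCheck and its methods are hypothetical, and it assumes
nothing beyond the directory setting from the original report
(tesseractPath=/usr/bin/) and the `tesseract -v` command shown there. It
joins the directory with the binary name, checks that the file exists and
is executable for the current user, then starts it and reports the exit
code and output.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical diagnostic helper; not part of Tika or Tesseract.
public class TesseractCheck {

    // Joins the configured directory (e.g. "/usr/bin/") with the binary name.
    public static Path resolveBinary(String tesseractDir, boolean windows) {
        String name = windows ? "tesseract.exe" : "tesseract";
        return Paths.get(tesseractDir).resolve(name);
    }

    // Reports whether the binary exists, is executable by the current user,
    // and what "<binary> -v" returns when started from this JVM.
    public static String describe(Path binary) {
        if (!Files.exists(binary)) {
            return "missing: " + binary;
        }
        if (!Files.isExecutable(binary)) {
            return "not executable by this user: " + binary;
        }
        try {
            Process p = new ProcessBuilder(binary.toString(), "-v")
                    .redirectErrorStream(true) // merge stderr into stdout
                    .start();
            StringBuilder out = new StringBuilder();
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = r.readLine()) != null) {
                    out.append(line).append('\n');
                }
            }
            int exit = p.waitFor();
            return "exit code " + exit + "\n" + out;
        } catch (IOException e) {
            return "failed to start: " + e.getMessage();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return "interrupted while waiting for: " + binary;
        }
    }

    public static void main(String[] args) {
        // The default mirrors the tesseractPath value from the report.
        String dir = args.length > 0 ? args[0] : "/usr/bin/";
        boolean windows = System.getProperty("os.name").toLowerCase().startsWith("windows");
        System.out.println(describe(resolveBinary(dir, windows)));
    }
}
```

Compiled and run inside the failing container as the same user as the
application (for example `java TesseractCheck /usr/bin/`), this would show
whether the binary can be started from Java at all, and with which exit
code.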


On Tue, May 3, 2022 at 5:40 AM Tim Allison <[email protected]> wrote:
>
> Does the application have permission to run tesseract on the Linux systems
> where it is not working?
>
> On Tue, May 3, 2022 at 1:17 AM Sandeep Kulkarni 
> <[email protected]> wrote:
>>
>> Hi,
>>
>>
>>
>> Ours is a Java-based application that uses Tika via AutoDetectParser. We
>> initialize TesseractOCRConfig with tesseractPath and tessdataPath (and a few
>> more parameters) and set it on the context before invoking ParsingReader.
>>
>>
>>
>> I am currently using Tika 2.2.1 with Tesseract OCR 4.0.0 (the default version
>> for this distro) on the Debian Buster based Docker image
>> openjdk:8u312-jre-buster. Things work as expected and I am able to get text
>> extracted from images.
>>
>>
>>
>> We are now trying to upgrade Tesseract and have run into a problem. We moved
>> to the Debian Bullseye based image openjdk:8u332-jre-bullseye with Tesseract
>> 4.1.1 (the default version for this distro), and text extraction from images
>> stopped working. We have not changed anything else in the Tika or Tesseract
>> configuration.
>>
>>
>>
>> With debug logging enabled for TesseractOCRParser, I can see that
>> hasTesseract now returns false for /usr/bin/tesseract:
>>
>>
>>
>> 2022-05-02 10:55:26,053 DEBUG [TesseractOCRParser] hasTesseract (path: 
>> [/usr/bin/tesseract]): false
>>
>>
>>
>> Because of this, Tesseract OCR does not get invoked. If I check where the
>> Tesseract binary is installed, I can see that it is present at
>> /usr/bin/tesseract:
>>
>>
>>
>> root@vic:/# which tesseract
>>
>> /usr/bin/tesseract
>>
>> root@vic # tesseract -v
>>
>> tesseract 4.1.1
>>
>> leptonica-1.79.0
>>
>>   libgif 5.1.9 : libjpeg 6b (libjpeg-turbo 2.0.6) : libpng 1.6.37 : libtiff 
>> 4.2.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.4.0
>>
>> Found AVX2
>>
>> Found AVX
>>
>> Found FMA
>>
>> Found SSE
>>
>> Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 
>> libzstd/1.4.8
>>
>>
>>
>> Earlier it worked, with the logs below:
>>
>>
>>
>> 2022-05-02 09:55:08,275 DEBUG [TesseractOCRParser] hasTesseract (path: 
>> [/usr/bin/tesseract]): true
>>
>> 2022-05-02 09:55:08,450 INFO  [Tika Parser-1] [TesseractOCRParser] Tesseract 
>> is installed and is being invoked. This can add greatly to processing time.  
>> If you do not want tesseract to be applied to your files see: 
>> https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
>>
>> 2022-05-02 09:55:08,451 DEBUG [Tika Parser-1] [TesseractOCRParser] Tesseract 
>> command: /usr/bin/tesseract /tmp/apache-tika-1769393331829017331.tmp 
>> /tmp/apache-tika-8424718595969554950.tmp --psm 1 -l eng -c page_separator= 
>> -c preserve_interword_spaces=0 txt
>>
>> 2022-05-02 09:55:09,222 DEBUG [Thread-29] [TesseractOCRParser]
>>
>> 2022-05-02 09:55:09,222 DEBUG [Thread-30] [TesseractOCRParser] Tesseract 
>> Open Source OCR Engine v4.0.0 with Leptonica
>>
>>
>>
>> We use the Tesseract OCR settings below (the same before and after the upgrade).
>>
>>
>>
>> tesseractPath=/usr/bin/
>>
>> tessdataPath=/usr/share/tesseract-ocr/4.00/tessdata/
>>
>>
>>
>> We are also facing the same issue on Ubuntu-based VMs that we recently
>> upgraded from 16.04 to 20.04.
>>
>>
>>
>> Finally, we use a simple ‘apt install tesseract-ocr’ command to install
>> Tesseract OCR, both when building the Docker image and on the Ubuntu VMs. As
>> Ubuntu is based on Debian, it is possible that the two issues are related.
>>
>>
>>
>> FYI, we are not facing this issue on Windows with Tesseract OCR 4.0.0, 4.1.0
>> or 5.0.1. There we install the Tesseract OCR builds available at
>> https://github.com/UB-Mannheim/tesseract/wiki, and the paths for the tesseract
>> binary and tessdata are as below:
>>
>>
>>
>> tesseractPath=C:\Program Files\Tesseract-OCR\
>> tessdataPath=C:\Program Files\Tesseract-OCR\tessdata\
>>
>>
>>
>> Any help would be appreciated. I also wanted to ask whether there is a
>> compatibility matrix of supported Tesseract OCR versions for Tika. We also
>> plan to move to 5.x in the near future.
>>
>>
>>
>> Regards,
>>
>> Sandeep Kulkarni
>>
>>
