Re: Tesseract OCR text extraction issue on Debian Bullseye

Luís Filipe Nassif Tue, 03 May 2022 05:09:59 -0700

Just some guesses: were the new Tesseract dictionaries installed? Were you
able to OCR an image from the command line with the newer Tesseract?
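
To separate a Tesseract problem from a Tika problem, it may also help to launch the binary from the same JVM and user that run the application. Below is a minimal sketch that uses only the JDK and no Tika classes. The class name TesseractCheck and its helper methods are made up for illustration, and this is not necessarily how Tika performs its own hasTesseract check. It assumes the configured tesseractPath is the directory that contains the binary.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tika: runs "<tesseractPath>/tesseract -v"
// from Java and reports whether the process could be started.
public class TesseractCheck {

    // Builds the command line from the configured directory, e.g. "/usr/bin/".
    public static List<String> buildCommand(String tesseractPath) {
        String binary = Paths.get(tesseractPath, "tesseract").toString();
        return Arrays.asList(binary, "-v");
    }

    // Runs the command, echoes its output, and returns the exit code.
    // Returns -1 if the process could not be started or its output not read.
    public static int run(List<String> command) throws InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true); // merge stderr into stdout
        try {
            Process process = pb.start();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
            return process.waitFor();
        } catch (IOException e) {
            System.out.println("Could not run " + command + ": " + e.getMessage());
            return -1;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Defaults to the tesseractPath value quoted in the message below.
        String tesseractPath = args.length > 0 ? args[0] : "/usr/bin/";
        System.out.println("exit code: " + run(buildCommand(tesseractPath)));
    }
}
```

If this prints the version banner when started inside the Bullseye container, the binary is reachable from Java and the problem is more likely in how the check is evaluated. If it fails to start, the error message should hint at the cause.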


On Tue, 3 May 2022 at 02:17, Sandeep Kulkarni <
[email protected]> wrote:

> Hi,
>
>
>
> Ours is a Java-based application which uses Tika via AutoDetectParser. We
> init TesseractOCRConfig with tesseractPath and tessdataPath (and a few more
> parameters) and set it into the context before invoking ParsingReader.
>
>
>
> I am currently using Tika 2.2.1 with Tesseract OCR 4.0.0 (default version
> for this distro) on Debian Buster docker base image for
> openjdk:8u312-jre-buster. Things work as expected and I am able to get text
> extracted from images.
>
>
>
> We are now trying to upgrade Tesseract and have started facing some
> issues. We tried to move to Debian Bullseye based
> openjdk:8u332-jre-bullseye and Tesseract 4.1.1 (default version for this
> distro) and image extraction stopped working. We have not changed anything
> else within configuration for Tika and Tesseract.
>
>
>
> With debug logging enabled for TesseractOCRParser, I can see that
> hasTesseract is not working now and is not finding tesseract at
> /usr/bin/tesseract.
>
>
>
> 2022-05-02 10:55:26,053 DEBUG [TesseractOCRParser] hasTesseract (path:
> [/usr/bin/tesseract]): false
>
>
>
> Because of this, Tesseract OCR does not get invoked. If I take a look at
> the path at which the Tesseract binary is present, I can see it at
> /usr/bin/tesseract itself.
>
>
>
> root@vic:/# which tesseract
>
> /usr/bin/tesseract
>
> root@vic # tesseract -v
>
> tesseract 4.1.1
>
> leptonica-1.79.0
>
>   libgif 5.1.9 : libjpeg 6b (libjpeg-turbo 2.0.6) : libpng 1.6.37 :
> libtiff 4.2.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.4.0
>
> Found AVX2
>
> Found AVX
>
> Found FMA
>
> Found SSE
>
> Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3
> libzstd/1.4.8
>
>
>
> Earlier, it was working, with the logs below:
>
>
>
> 2022-05-02 09:55:08,275 DEBUG [TesseractOCRParser] hasTesseract (path:
> [/usr/bin/tesseract]): true
>
> 2022-05-02 09:55:08,450 INFO  [Tika Parser-1] [TesseractOCRParser]
> Tesseract is installed and is being invoked. This can add greatly to
> processing time.  If you do not want tesseract to be applied to your files
> see:
> https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
>
> 2022-05-02 09:55:08,451 DEBUG [Tika Parser-1] [TesseractOCRParser]
> Tesseract command: /usr/bin/tesseract
> /tmp/apache-tika-1769393331829017331.tmp
> /tmp/apache-tika-8424718595969554950.tmp --psm 1 -l eng -c page_separator=
> -c preserve_interword_spaces=0 txt
>
> 2022-05-02 09:55:09,222 DEBUG [Thread-29] [TesseractOCRParser]
>
> 2022-05-02 09:55:09,222 DEBUG [Thread-30] [TesseractOCRParser] Tesseract
> Open Source OCR Engine v4.0.0 with Leptonica
>
>
>
> We use the Tesseract OCR settings below (both earlier and now).
>
>
>
> tesseractPath=/usr/bin/
>
> tessdataPath=/usr/share/tesseract-ocr/4.00/tessdata/
>
>
>
> We are also facing the same issue with Ubuntu-based VMs that we upgraded
> from 16.04 to 20.04 recently.
>
>
>
> Finally, we use a simple ‘apt install tesseract-ocr’ command to install
> Tesseract OCR while building the docker image as well as on the Ubuntu VMs.
> As Ubuntu is based on Debian, it is possible that the issues we are facing
> are related.
>
>
>
> FYI, we are not facing this issue at all on Windows with Tesseract OCR
> 4.0.0, 4.1.0 and 5.0.1. There we install the Tesseract OCR builds
> available at https://github.com/UB-Mannheim/tesseract/wiki and the paths
> for the tesseract binary and tessdata are as below:
>
>
>
> tesseractPath=C:\Program Files\Tesseract-OCR\
> tessdataPath=C:\Program Files\Tesseract-OCR\tessdata\
>
>
>
> Any help would be appreciated. Also wanted to ask whether there is a
> compatibility matrix for supported Tesseract OCR versions against Tika.
> We also plan to move to 5.x in the near future.
>
>
>
> Regards,
>
> Sandeep Kulkarni
>
>
>
