Just some guesses: were the new Tesseract dictionaries installed? Were you able to OCR an image from the command line with the newer Tesseract?
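If the command line works but the application still logs hasTesseract as false, it may help to try launching the binary from Java inside the same container. Below is a minimal sketch of that, not Tika's own check: `TesseractCheck` and `canLaunch` are hypothetical names, the default directory is the `tesseractPath` from your mail, and the binary name `tesseract` assumes Linux.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.TimeUnit;

// Hypothetical diagnostic helper; it is not part of Tika.
class TesseractCheck {

    // Returns true if the command starts and exits with status 0 within 10 seconds.
    static boolean canLaunch(String... command) {
        try {
            // inheritIO() sends the child's output straight to this JVM's console.
            Process process = new ProcessBuilder(command).inheritIO().start();
            if (!process.waitFor(10, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                return false;
            }
            return process.exitValue() == 0;
        } catch (IOException e) {
            // Typically thrown when the binary is missing or not executable.
            System.out.println("could not start: " + e.getMessage());
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed default: the tesseractPath directory quoted in this thread.
        String dir = args.length > 0 ? args[0] : "/usr/bin/";
        Path binary = Paths.get(dir, "tesseract");
        System.out.println("exists:     " + Files.exists(binary));
        System.out.println("executable: " + Files.isExecutable(binary));
        System.out.println("launches:   " + canLaunch(binary.toString(), "-v"));
    }
}
```

Running it as the same user and in the same image as the application should show whether the JVM can start the binary at all. If all three lines print true, the problem is more likely in how the paths reach Tika than in the Tesseract install itself.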
On Tue, May 3, 2022 at 02:17, Sandeep Kulkarni <[email protected]> wrote:

> Hi,
>
> Ours is a Java based application which uses Tika via AutoDetectParser. We
> init TesseractOCRConfig with tesseractPath and tessdataPath (and a few more
> parameters) and set it into context before invoking ParsingReader.
>
> I am currently using Tika 2.2.1 with Tesseract OCR 4.0.0 (default version
> for this distro) on the Debian Buster docker base image
> openjdk:8u312-jre-buster. Things work as expected and I am able to get text
> extracted from images.
>
> We are now trying to upgrade Tesseract and have started facing some
> issues. We tried to move to the Debian Bullseye based
> openjdk:8u332-jre-bullseye and Tesseract 4.1.1 (default version for this
> distro) and image extraction stopped working. We have not changed anything
> else within the configuration for Tika and Tesseract.
>
> With debug logging enabled for TesseractOCRParser, I can see that
> hasTesseract is not working now and is not finding tesseract at
> /usr/bin/tesseract.
>
> 2022-05-02 10:55:26,053 DEBUG [TesseractOCRParser] hasTesseract (path: [/usr/bin/tesseract]): false
>
> Because of this, Tesseract OCR does not get invoked. If I take a look at
> the path at which the Tesseract binary is present, I can see it at
> /usr/bin/tesseract itself.
>
> root@vic:/# which tesseract
> /usr/bin/tesseract
> root@vic # tesseract -v
> tesseract 4.1.1
> leptonica-1.79.0
> libgif 5.1.9 : libjpeg 6b (libjpeg-turbo 2.0.6) : libpng 1.6.37 : libtiff 4.2.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.4.0
> Found AVX2
> Found AVX
> Found FMA
> Found SSE
> Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8
>
> Whereas earlier it was working, with the logs below:
>
> 2022-05-02 09:55:08,275 DEBUG [TesseractOCRParser] hasTesseract (path: [/usr/bin/tesseract]): true
> 2022-05-02 09:55:08,450 INFO [Tika Parser-1] [TesseractOCRParser] Tesseract is installed and is being invoked. This can add greatly to processing time. If you do not want tesseract to be applied to your files see: https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
> 2022-05-02 09:55:08,451 DEBUG [Tika Parser-1] [TesseractOCRParser] Tesseract command: /usr/bin/tesseract /tmp/apache-tika-1769393331829017331.tmp /tmp/apache-tika-8424718595969554950.tmp --psm 1 -l eng -c page_separator= -c preserve_interword_spaces=0 txt
> 2022-05-02 09:55:09,222 DEBUG [Thread-29] [TesseractOCRParser]
> 2022-05-02 09:55:09,222 DEBUG [Thread-30] [TesseractOCRParser] Tesseract Open Source OCR Engine v4.0.0 with Leptonica
>
> We use the Tesseract OCR settings below (earlier and now).
>
> tesseractPath=/usr/bin/
> tessdataPath=/usr/share/tesseract-ocr/4.00/tessdata/
>
> We are also facing the same issue with Ubuntu based VMs that we upgraded
> from 16.04 to 20.04 recently.
>
> Finally, we use a simple ‘apt install tesseract-ocr’ command to install
> Tesseract OCR while building the docker image as well as on the Ubuntu VMs.
> As Ubuntu is based on Debian, it is possible that the issues we are facing
> are related.
>
> FYI, we are not facing this issue on Windows with Tesseract OCR 4.0.0,
> 4.1.0 and 5.0.1. There we are installing the Tesseract OCR available at
> https://github.com/UB-Mannheim/tesseract/wiki and the paths for the
> tesseract binary and tessdata are as below:
>
> tesseractPath=C:\Program Files\Tesseract-OCR\
> tessdataPath=C:\Program Files\Tesseract-OCR\tessdata\
>
> Any help would be appreciated. I also wanted to ask whether there is a
> compatibility matrix for supported Tesseract OCR versions against Tika. We
> also plan to move to 5.x in the near future.
>
> Regards,
> Sandeep Kulkarni
