Are you able to share your full docker files? Are you running tika-server or a custom application?
On Tue, May 3, 2022 at 5:40 AM Tim Allison <[email protected]> wrote:
>
> Does the application have permissions to run tesseract in the Linuxes that
> are not working?
>
> On Tue, May 3, 2022 at 1:17 AM Sandeep Kulkarni
> <[email protected]> wrote:
>>
>> Hi,
>>
>> Ours is a Java-based application which uses Tika via AutoDetectParser. We
>> init TesseractOCRConfig with tesseractPath and tessdataPath (and a few more
>> parameters) and set it into the context before invoking ParsingReader.
>>
>> I am currently using Tika 2.2.1 with Tesseract OCR 4.0.0 (default version
>> for this distro) on the Debian Buster docker base image for
>> openjdk:8u312-jre-buster. Things work as expected and I am able to get text
>> extracted from images.
>>
>> We are now trying to upgrade Tesseract and have started facing some issues.
>> We tried to move to the Debian Bullseye based openjdk:8u332-jre-bullseye and
>> Tesseract 4.1.1 (default version for this distro) and image extraction
>> stopped working. We have not changed anything else within the configuration
>> for Tika and Tesseract.
>>
>> With debug logging enabled for TesseractOCRParser, I can see that
>> hasTesseract is not working now and is not finding tesseract at
>> /usr/bin/tesseract.
>>
>> 2022-05-02 10:55:26,053 DEBUG [TesseractOCRParser] hasTesseract (path: [/usr/bin/tesseract]): false
>>
>> Because of this, Tesseract OCR does not get invoked. If I take a look at the
>> path at which the Tesseract binary is present, I can see it at
>> /usr/bin/tesseract itself.
>>
>> root@vic:/# which tesseract
>> /usr/bin/tesseract
>> root@vic # tesseract -v
>> tesseract 4.1.1
>> leptonica-1.79.0
>> libgif 5.1.9 : libjpeg 6b (libjpeg-turbo 2.0.6) : libpng 1.6.37 : libtiff 4.2.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.4.0
>> Found AVX2
>> Found AVX
>> Found FMA
>> Found SSE
>> Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8
>>
>> Earlier it was working, with the logs below:
>>
>> 2022-05-02 09:55:08,275 DEBUG [TesseractOCRParser] hasTesseract (path: [/usr/bin/tesseract]): true
>> 2022-05-02 09:55:08,450 INFO [Tika Parser-1] [TesseractOCRParser] Tesseract is installed and is being invoked. This can add greatly to processing time. If you do not want tesseract to be applied to your files see: https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
>> 2022-05-02 09:55:08,451 DEBUG [Tika Parser-1] [TesseractOCRParser] Tesseract command: /usr/bin/tesseract /tmp/apache-tika-1769393331829017331.tmp /tmp/apache-tika-8424718595969554950.tmp --psm 1 -l eng -c page_separator= -c preserve_interword_spaces=0 txt
>> 2022-05-02 09:55:09,222 DEBUG [Thread-29] [TesseractOCRParser]
>> 2022-05-02 09:55:09,222 DEBUG [Thread-30] [TesseractOCRParser] Tesseract Open Source OCR Engine v4.0.0 with Leptonica
>>
>> We use the Tesseract OCR settings below (earlier and now).
>>
>> tesseractPath=/usr/bin/
>> tessdataPath=/usr/share/tesseract-ocr/4.00/tessdata/
>>
>> We are also facing the same issue with Ubuntu based VMs that we upgraded
>> from 16.04 to 20.04 recently.
>>
>> Finally, we use a simple ‘apt install tesseract-ocr’ command to install
>> Tesseract OCR when building the docker image as well as on the Ubuntu VMs.
>> As Ubuntu is based on Debian, it is possible that the issues we are facing
>> are related.
>>
>> FYI, we are not facing this issue at all on Windows with Tesseract OCR
>> 4.0.0, 4.1.0 and 5.0.1. There we install the Tesseract OCR build available
>> at https://github.com/UB-Mannheim/tesseract/wiki and the paths for the
>> tesseract binary and tessdata are as below:
>>
>> tesseractPath=C:\Program Files\Tesseract-OCR\
>> tessdataPath=C:\Program Files\Tesseract-OCR\tessdata\
>>
>> Any help would be appreciated. Also wanted to ask whether there is a
>> compatibility matrix for supported Tesseract OCR versions against Tika. We
>> also plan to move to 5.x in the near future.
>>
>> Regards,
>> Sandeep Kulkarni
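
One way to test the permissions question from inside the same JVM and container, independent of Tika, is a small stand-alone check along the lines below. This is only a sketch: `TesseractCheck`, `resolve` and `canRun` are hypothetical names (not Tika API), it uses nothing but the JDK, and it assumes that `tesseract -v` exits with status 0 when the binary can be started, as the shell output quoted above suggests.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Tika: checks whether the user running
// this JVM can actually launch a binary such as /usr/bin/tesseract.
class TesseractCheck {

    // Joins the configured directory (e.g. "/usr/bin/") with the binary name.
    static Path resolve(String dir, String name) {
        return Paths.get(dir).resolve(name);
    }

    // Returns true only if the path is a regular file, is executable by the
    // current user, and the process exits with status 0 within 10 seconds.
    static boolean canRun(Path binary, String arg) {
        if (!Files.isRegularFile(binary) || !Files.isExecutable(binary)) {
            return false;
        }
        try {
            // inheritIO() sends the child's output to this process's console,
            // so the child cannot block on a full pipe buffer.
            Process p = new ProcessBuilder(binary.toString(), arg)
                    .inheritIO()
                    .start();
            if (!p.waitFor(10, TimeUnit.SECONDS)) {
                p.destroyForcibly();
                return false;
            }
            return p.exitValue() == 0;
        } catch (IOException e) {
            // Typical causes: missing shared library loader, permission denied.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: the directory defaults to the tesseractPath from this thread.
        String dir = args.length > 0 ? args[0] : "/usr/bin/";
        Path binary = resolve(dir, "tesseract");
        System.out.println("user:       " + System.getProperty("user.name"));
        System.out.println("binary:     " + binary);
        System.out.println("executable: " + Files.isExecutable(binary));
        System.out.println("runs (-v):  " + canRun(binary, "-v"));
    }
}
```

Running it in the failing image as the same user as the application (for example `java TesseractCheck /usr/bin/`) should show whether the problem is on the operating system side (the last line prints `false`) or somewhere in how the paths reach Tika.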
