RE: [External] Re: Tesseract OCR text extraction issue on Debian Bullseye

Sandeep Kulkarni Wed, 04 May 2022 06:27:39 -0700

Hi Tim,

Ours is a custom application and does not use tika-server. We use the Dropwizard 
framework to expose a few REST endpoints, and docker-maven-plugin 
(https://github.com/fabric8io/docker-maven-plugin) to generate our Docker image.


Here is a snippet of the plugin configuration we use to build the image. I have 
included all the necessary information and left out only a few details that are 
very specific to the application.

<groupId>io.fabric8</groupId>
<artifactId>docker-maven-plugin</artifactId>
<version>0.33.0</version>
<configuration>
  <sourceDirectory>src/main/assemblies</sourceDirectory>
  <autoPull>always</autoPull>
  <images>
        <image>
          <alias>${docker.alias}</alias>
          <name>${docker.repository.name}:${project.version}</name>
          <build>
                <from>openjdk:8u332-jre-bullseye</from>
                <optimise>true</optimise><!-- compress all the runCmds into a single RUN directive so that only one image layer is created -->
                <runCmds>
                  <!-- Commands run when the image is being built (not when the container is started) -->
                  <run>apt -q -y update</run>
                  <run>apt -q -y install tesseract-ocr</run>
                  <run>unzip -u -o application-linux64.zip</run><!-- unzip the application binaries for linux -->
                  <run>rm -f application-linux64.zip</run><!-- remove the original application binaries for linux zip file -->
                </runCmds>
                <assembly>
                  <targetDir>/</targetDir>
                  <descriptor>docker-assembly.xml</descriptor>
                </assembly>
                <ports>
                  <port>${docker.app.port}</port>
                </ports>
                <workdir>/</workdir>
                <cmd>
                  <shell>java <some application specific params> -jar ${project.version}.jar</shell>
                </cmd>
          </build>
          <run>
                <namingStrategy>alias</namingStrategy>
                <hostname>${docker.alias}</hostname>
                <privileged>false</privileged>
                <ports>
                  <port>${docker.app.port}:${docker.app.port}</port>
                </ports>
                <volumes>
                  <bind>
                        <volume>/app-local-data</volume>
                        <volume>app-shared-data:/app-shared-data</volume>
                  </bind>
                </volumes>              
                <log>
                  <enabled>true</enabled>
                  <color>cyan</color>
                </log>
          </run>
        </image>
  </images>
</configuration>

Within the Docker container, the application runs as the root user, as the ps 
output below confirms:

root@vic:/# ps aux | grep -i java
root         1  0.0  0.0   2412   528 ?        Ss   May02   0:00 /bin/sh -c java -jar application-4.0.0-SNAPSHOT.jar
root         7  0.2  8.5 6301188 877188 ?      Sl   May02   7:03 java -jar application-4.0.0-SNAPSHOT.jar

And root definitely has execute permission on the binary, as shown below:

root@vic:/# ls -la /usr/bin/tesseract
-rwxr-xr-x 1 root root 34904 Feb  4  2021 /usr/bin/tesseract
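
The ls output shows the permission bits as seen from the shell. The JVM's view of 
the same path can be checked separately. Below is a minimal sketch, assuming the 
directory and program name are simply joined, as the tesseractPath setting 
suggests. The class and method names are hypothetical, and this is not Tika's 
own check.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tika: reports how the JVM sees the tesseract binary.
class TesseractPathCheck {

    // Joins a directory such as "/usr/bin/" with a program name such as "tesseract".
    static Path resolveBinary(String tesseractDir, String programName) {
        return Paths.get(tesseractDir).resolve(programName);
    }

    // Builds a one-line summary of the file attributes relevant to launching the binary.
    static String describe(Path binary) {
        return binary
                + " exists=" + Files.exists(binary)
                + " regularFile=" + Files.isRegularFile(binary)
                + " executable=" + Files.isExecutable(binary);
    }

    public static void main(String[] args) {
        // The default directory is the value from this thread; pass another as the first argument.
        String dir = args.length > 0 ? args[0] : "/usr/bin/";
        System.out.println(describe(resolveBinary(dir, "tesseract")));
    }
}
```

Running it with the same user and inside the same container as the application 
keeps the result comparable to the ls output above.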

The tessdata location also has traineddata files available:

root@vic:/ # ls -la /usr/share/tesseract-ocr/4.00/tessdata/
total 14356
drwxr-xr-x 4 root root     4096 May  2 10:50 .
drwxr-xr-x 3 root root     4096 May  2 10:50 ..
drwxr-xr-x 2 root root     4096 May  2 10:50 configs
-rw-r--r-- 1 root root  4113088 Sep 15  2017 eng.traineddata
-rw-r--r-- 1 root root 10562727 Sep 15  2017 osd.traineddata
-rw-r--r-- 1 root root      572 Feb  4  2021 pdf.ttf
drwxr-xr-x 2 root root     4096 May  2 10:50 tessconfigs

Also, when I open a shell inside the Docker container, I am able to run the 
tesseract command and get output. Here is an example:

root@vic:/ # /usr/bin/tesseract /images/imageTest.JPG - -l eng
I enjoyed it more back in 1990, though I'd be
hard-pressed to recall exactly why -- viewing it
now, it's only a marginally satisfying
experience, with some potent set-pieces
diluted by an almost juvenile predilection on
Lynch's part. Posted Jun 6, 2021 8:09 PM UTC
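
The same binary can also be started from inside the JVM, which is closer to how 
the application reaches tesseract than an interactive shell is. Below is a 
minimal sketch using only java.lang.ProcessBuilder. The class name and the 
30-second timeout are illustrative assumptions, and the code does not reproduce 
TesseractOCRParser's own hasTesseract logic.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.TimeUnit;

// Hypothetical diagnostic, not part of Tika: launches a command from the JVM
// and reports its exit code and combined stdout/stderr.
class TesseractProbe {

    // Returns the process exit code, or -1 if the process could not be started,
    // did not exit within the timeout, or the calling thread was interrupted.
    static int run(StringBuilder output, String... command) {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true); // merge stderr into stdout
        try {
            Process p = pb.start();
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                // Reads until the process closes its output stream.
                while ((line = r.readLine()) != null) {
                    output.append(line).append('\n');
                }
            }
            // The timeout applies only after the output stream has been closed.
            if (!p.waitFor(30, TimeUnit.SECONDS)) {
                p.destroyForcibly();
                return -1;
            }
            return p.exitValue();
        } catch (IOException e) {
            output.append("could not start: ").append(e.getMessage()).append('\n');
            return -1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }

    public static void main(String[] args) {
        // The default path is the one from this thread; pass another as the first argument.
        String binary = args.length > 0 ? args[0] : "/usr/bin/tesseract";
        StringBuilder out = new StringBuilder();
        int exit = run(out, binary, "-v");
        System.out.println("exit code: " + exit);
        System.out.print(out);
    }
}
```

Comparing the exit code and output on the Buster and Bullseye images may show 
whether the binary behaves differently when launched from Java.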

Regards,
Sandeep Kulkarni

-----Original Message-----
From: Tim Allison <[email protected]> 
Sent: Tuesday, May 3, 2022 3:40 PM
To: [email protected]
Subject: [External] Re: Tesseract OCR text extraction issue on Debian Bullseye

Are you able to share your full docker files? Are you running tika-server or a 
custom application?

On Tue, May 3, 2022 at 5:40 AM Tim Allison <[email protected]> wrote:
>
> Does the application have permissions to run tesseract in the Linuxes that 
> are not working?
>
> On Tue, May 3, 2022 at 1:17 AM Sandeep Kulkarni 
> <[email protected]> wrote:
>>
>> Hi,
>>
>>
>>
>> Ours is a Java-based application which uses Tika via AutoDetectParser. We 
>> initialize TesseractOCRConfig with tesseractPath and tessdataPath (and a few 
>> more parameters) and set it into the context before invoking ParsingReader.
>>
>>
>>
>> I am currently using Tika 2.2.1 with Tesseract OCR 4.0.0 (default version 
>> for this distro) on the Debian Buster based Docker base image 
>> openjdk:8u312-jre-buster. Things work as expected and I am able to get text 
>> extracted from images.
>>
>>
>>
>> We are now trying to upgrade Tesseract and have started facing some issues. 
>> We tried to move to the Debian Bullseye based openjdk:8u332-jre-bullseye image 
>> and Tesseract 4.1.1 (default version for this distro), and text extraction 
>> from images stopped working. We have not changed anything else in the 
>> configuration for Tika and Tesseract.
>>
>>
>>
>> With debug logging enabled for TesseractOCRParser, I can see that 
>> hasTesseract now returns false, i.e. it is not finding tesseract at 
>> /usr/bin/tesseract.
>>
>>
>>
>> 2022-05-02 10:55:26,053 DEBUG [TesseractOCRParser] hasTesseract 
>> (path: [/usr/bin/tesseract]): false
>>
>>
>>
>> Because of this, Tesseract OCR does not get invoked. If I take a look at the 
>> path where the Tesseract binary is present, I can see it at 
>> /usr/bin/tesseract itself.
>>
>>
>>
>> root@vic:/# which tesseract
>>
>> /usr/bin/tesseract
>>
>> root@vic # tesseract -v
>>
>> tesseract 4.1.1
>>
>> leptonica-1.79.0
>>
>>   libgif 5.1.9 : libjpeg 6b (libjpeg-turbo 2.0.6) : libpng 1.6.37 : 
>> libtiff 4.2.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.4.0
>>
>> Found AVX2
>>
>> Found AVX
>>
>> Found FMA
>>
>> Found SSE
>>
>> Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 
>> liblz4/1.9.3 libzstd/1.4.8
>>
>>
>>
>> Earlier, it was working and produced the logs below:
>>
>>
>>
>> 2022-05-02 09:55:08,275 DEBUG [TesseractOCRParser] hasTesseract 
>> (path: [/usr/bin/tesseract]): true
>>
>> 2022-05-02 09:55:08,450 INFO  [Tika Parser-1] [TesseractOCRParser] 
>> Tesseract is installed and is being invoked. This can add greatly to 
>> processing time.  If you do not want tesseract to be applied to your 
>> files see: 
>> https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
>>
>> 2022-05-02 09:55:08,451 DEBUG [Tika Parser-1] [TesseractOCRParser] 
>> Tesseract command: /usr/bin/tesseract 
>> /tmp/apache-tika-1769393331829017331.tmp 
>> /tmp/apache-tika-8424718595969554950.tmp --psm 1 -l eng -c 
>> page_separator= -c preserve_interword_spaces=0 txt
>>
>> 2022-05-02 09:55:09,222 DEBUG [Thread-29] [TesseractOCRParser]
>>
>> 2022-05-02 09:55:09,222 DEBUG [Thread-30] [TesseractOCRParser] 
>> Tesseract Open Source OCR Engine v4.0.0 with Leptonica
>>
>>
>>
>> We use the Tesseract OCR settings below (both earlier and now).
>>
>>
>>
>> tesseractPath=/usr/bin/
>>
>> tessdataPath=/usr/share/tesseract-ocr/4.00/tessdata/
>>
>>
>>
>> We are also facing the same issue with Ubuntu based VMs that we upgraded from 
>> 16.04 to 20.04 recently.
>>
>>
>>
>> Finally, we use a simple 'apt install tesseract-ocr' command to install 
>> Tesseract OCR while building the Docker image as well as on the Ubuntu VMs. As 
>> Ubuntu is based on Debian, it is possible that the issues we are facing are 
>> related.
>>
>>
>>
>> FYI, we are not facing this issue at all on Windows with Tesseract OCR 4.0.0, 
>> 4.1.0 and 5.0.1. There we install the Tesseract OCR builds available at 
>> https://github.com/UB-Mannheim/tesseract/wiki and the paths for the tesseract 
>> binary and tessdata are as below:
>>
>>
>>
>> tesseractPath=C:\Program Files\Tesseract-OCR\
>>
>> tessdataPath=C:\Program Files\Tesseract-OCR\tessdata\
>>
>>
>>
>> Any help would be appreciated. I also wanted to ask whether there is a 
>> compatibility matrix for supported Tesseract OCR versions against Tika. We 
>> also plan to move to 5.x in the near future.
>>
>>
>>
>> Regards,
>>
>> Sandeep Kulkarni
>>
>>
