Bug#1027985: tesseract-ocr: document gets rotated on its side when converting from jpg to pdf

debbug . tesseract Thu, 05 Jan 2023 07:12:15 -0800

Package: tesseract-ocr
Version: 4.1.1-2.1
Severity: normal
X-Debbugs-Cc: debbug.tesser...@sideload.33mail.com


When tesseract is given a JPG image of an upright document and
instructed to produce a searchable PDF, it turns the image on its
side. The rotation apparently happens before OCR is performed, judging
from the text layer: pdf2txt shows it as one character per line.
This is the command used:

  $ tesseract color_document.jpg sideways_doc -l eng+nld pdf

The workaround is quite ugly:

  $ pdftk sideways_doc.pdf cat 1-r1east output upright_doc.pdf
  $ ocrmypdf --force-ocr -l eng+nld upright_doc.pdf proper.pdf

I don’t think this bug affects every document. Perhaps tesseract is
trying to be smart, detecting the orientation of the document and
misjudging it. If that’s true, it’s a shame that it does this
automatically and beyond the control of the user. As far as I can
tell, there is no option to force tesseract to leave the orientation
as-is.
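
The orientation-detection guess can be checked directly. What follows is
only a sketch, assuming that the OSD data (tesseract-ocr-osd) is
installed and that "tesseract ... --psm 0" prints a report containing a
"Rotate:" line. The helper name osd_rotation is made up for
illustration and is not part of tesseract.

```shell
#!/bin/sh
# osd_rotation: hypothetical helper, not part of tesseract.
# Reads an orientation/script detection (OSD) report on stdin and prints
# the value of its "Rotate:" line, i.e. the correction in degrees that
# tesseract believes the page needs. Prints nothing if no such line exists.
osd_rotation() {
    awk -F': *' '$1 == "Rotate" { print $2; exit }'
}

# Intended use (needs tesseract plus the osd traineddata):
#   tesseract color_document.jpg stdout --psm 0 | osd_rotation
```

If this prints 0 for an image whose PDF still comes out sideways, then
misjudged orientation detection is probably not the cause. Note also
that "tesseract --help-psm" describes the default mode (3) as "Fully
automatic page segmentation, but no OSD", so the plain command above
may not be running orientation detection at all.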

-- System Information:
Debian Release: 11.5
  APT prefers stable-updates
  APT policy: (990, 'stable-updates'), (990, 'stable-security'), (990, 'testing'), (990, 'stable')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 5.10.0-19-amd64 (SMP w/2 CPU threads)
Kernel taint flags: TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages tesseract-ocr depends on:
ii  libarchive13         3.4.3-2+deb11u1
ii  libc6                2.31-13+deb11u5
ii  libcairo2            1.16.0-5
ii  libfontconfig1       2.13.1-4.2
ii  libgcc-s1            10.2.1-6
ii  libglib2.0-0         2.66.8-1
ii  libicu67             67.1-7
ii  liblept5             1.79.0-1.1
ii  libpango-1.0-0       1.46.2-3
ii  libpangocairo-1.0-0  1.46.2-3
ii  libpangoft2-1.0-0    1.46.2-3
ii  libstdc++6           10.2.1-6
ii  libtesseract4        4.1.1-2.1
ii  tesseract-ocr-eng    1:4.00~git30-7274cfa-1.1
ii  tesseract-ocr-osd    1:4.00~git30-7274cfa-1.1

tesseract-ocr recommends no packages.

tesseract-ocr suggests no packages.

-- no debconf information
