Package: tesseract-ocr
Version: 4.1.1-2.1
Severity: normal
X-Debbugs-Cc: debbug.tesser...@sideload.33mail.com

When tesseract is fed a JPG image of an upright document and instructed
to produce a searchable PDF, it flips the image on its side. Judging
from the text produced, the rotation apparently happens before OCR is
performed: pdf2txt shows the text as one character per line.

This is the syntax used:

  $ tesseract color_document.jpg sideways_doc -l eng+nld pdf

The workaround is quite ugly:

  $ pdftk doc_sideways_doc.pdf cat 1-r1east output upright_doc.pdf
  $ ocrmypdf --force-ocr -l eng+nld upright_doc.pdf proper.pdf

I don’t think this bug affects every document. Perhaps tesseract is
trying to be smart by detecting the orientation of the document and
misjudging it. If that’s true, it’s a shame that tesseract does this
automatically and beyond the control of the user. As far as I can tell,
there is no option to force tesseract to leave the orientation as-is.

-- System Information:
Debian Release: 11.5
  APT prefers stable-updates
  APT policy: (990, 'stable-updates'), (990, 'stable-security'), (990, 'testing'), (990, 'stable')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 5.10.0-19-amd64 (SMP w/2 CPU threads)
Kernel taint flags: TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages tesseract-ocr depends on:
ii  libarchive13         3.4.3-2+deb11u1
ii  libc6                2.31-13+deb11u5
ii  libcairo2            1.16.0-5
ii  libfontconfig1       2.13.1-4.2
ii  libgcc-s1            10.2.1-6
ii  libglib2.0-0         2.66.8-1
ii  libicu67             67.1-7
ii  liblept5             1.79.0-1.1
ii  libpango-1.0-0       1.46.2-3
ii  libpangocairo-1.0-0  1.46.2-3
ii  libpangoft2-1.0-0    1.46.2-3
ii  libstdc++6           10.2.1-6
ii  libtesseract4        4.1.1-2.1
ii  tesseract-ocr-eng    1:4.00~git30-7274cfa-1.1
ii  tesseract-ocr-osd    1:4.00~git30-7274cfa-1.1

tesseract-ocr recommends no packages.

tesseract-ocr suggests no packages.

-- no debconf information