Your message dated Mon, 5 Nov 2012 11:10:49 -0800
with message-id
<CAHjiUbqW72oXsOdeeu3pnoKuQ=mmylndluulztt2jumpcca...@mail.gmail.com>
and subject line fixed upstream
has caused the Debian Bug report #671262,
regarding tesseract-ocr: produces non-UTF8 output despite declaring otherwise
to be marked as done.
This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.
(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact [email protected]
immediately.)
--
671262: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=671262
Debian Bug Tracking System
Contact [email protected] with problems
--- Begin Message ---
Package: gscan2pdf
Version: 1.0.3-1
Severity: normal
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256
Hi,
running gscan2pdf --debug and trying to do OCR I got:
INFO - echo tessedit_create_hocr 1 > hocr.config;tesseract /tmp/TZHdnZXQb3.tif /tmp/7DDKyZd_Rl -l deu +hocr.config 2> /dev/null;rm hocr.config
Tesseract Open Source OCR Engine v3.02 with Leptonica
utf8 "\xC0" does not map to Unicode at /usr/share/perl5/Gscan2pdf.pm line 921, <> chunk 1.
*** unhandled exception in callback:
*** Malformed UTF-8 character (fatal) at /usr/share/perl5/Gscan2pdf/Page.pm line 114.
*** ignoring at /usr/bin/gscan2pdf line 10729.
I could reproduce this with several documents and resolutions. When I ran
tesseract manually on the .tif file, I did indeed see non-UTF-8 bytes in the
generated HTML.
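
For anyone who wants to confirm the same symptom on their own output, here is a
minimal sketch of a UTF-8 validity check. It is not part of gscan2pdf or
tesseract; the function names and the example path in the note below are
hypothetical, and it only reports the first offending byte, not every one.

```python
from typing import Optional, Tuple


def first_invalid_utf8(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (offset, byte value) of the first byte that breaks UTF-8
    decoding, or None if the whole buffer is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte the decoder rejected.
        return err.start, data[err.start]
    return None


def check_file(path: str) -> Optional[Tuple[int, int]]:
    """Hypothetical helper: run the check on a file read in binary mode."""
    with open(path, "rb") as handle:
        return first_invalid_utf8(handle.read())
```

Calling check_file() on the HTML file that tesseract wrote (for example
check_file("/tmp/out.html"), a made-up path) should return None for clean
output, or the offset and value of the first bad byte otherwise.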
Regards,
Thomas Koch
- -- System Information:
Debian Release: wheezy/sid
APT prefers testing
APT policy: (500, 'testing')
Architecture: amd64 (x86_64)
Kernel: Linux 3.2.0-2-amd64 (SMP w/4 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Versions of packages gscan2pdf depends on:
ii graphicsmagick-imagemagick-compat [imagemagick] 1.3.12-1.1
ii libconfig-general-perl 2.50-1
ii libgoo-canvas-perl 0.06-1+b2
ii libgtk2-ex-simple-list-perl 0.50-2
ii libgtk2-imageview-perl 0.05-1+b2
ii libhtml-parser-perl 3.69-2
ii liblocale-gettext-perl 1.05-7+b1
ii liblog-log4perl-perl 1.29-1
ii libpdf-api2-perl 2.019-1
ii libproc-processtable-perl 0.45-3+b1
ii libreadonly-perl 1.03-3
ii librsvg2-common 2.36.1-1
ii libsane-perl 0.05-1
ii libset-intspan-perl 1.16-1
ii libtiff-tools 4.0.1-5
ii perlmagick 8:6.7.4.0-5
ii sane-utils 1.0.22-7.1
Versions of packages gscan2pdf recommends:
ii cuneiform <none>
ii djvulibre-bin 3.5.25.2-4
ii gocr <none>
ii libgtk2-ex-podviewer-perl 0.18-1
ii sane 1.0.14-9
ii tesseract-ocr 3.02.01-4
ii unpaper 0.3-1
ii xdg-utils 1.1.0~rc1+git20111210-6
gscan2pdf suggests no packages.
- -- no debconf information
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAEBCAAGBQJPnTTyAAoJEAf8SJEEK6ZaGlAP/28TRSC7BbGOPQ0OGa2Gmu49
iqHgd2QfABV4cZQJj8tvk663vsPiEO5Sc6go5AWOobSWGMyERUQo5ZkOLSvEWjTv
ZNumcwSCa/84H7x4K/t9aQljdX9p/hued9vPkGwRU0eH1AHc0bkzpfwo3shJ6F2g
AMyJNuLxWPm0D8Mh4/Dil/usJennCaxOAN5BFVUmn2Vuhd79xRDJSWW9eF6IFxdo
4knxmDC20Y7YZ2rBVegFiA++BGjN0dgYsgZINtMWvOeHtnx6SLlxf4yN1/CMnEQw
5TfImCZrI/+alUGu1KTSQmfgeVK+mteDiZFND1+aLjbTgIiLZq+ghFGpeebVAIPj
5kiZn2oCCVLEbQCKuYL00RH2MoRcEeWBS9rv250xM/fxPukM8ahMX6NFwSf4ZSA/
43zXDMA3oyNbX1PVgDp0MgoU7l7mKcZyefJvkUyaSNo4BzK02XCtvcp/5atXGehv
XW434tBb/WEr9y0UESLo54BnAiCCri4FGv/6KIa+Cuw26WpNh8vFRuMCZQyMMGrV
bSpuiu96B1VLvcNj+3gK4aNXgetsFO10V6u7XJ+t2W8XUTdU+lKFE55QxfKsQQvv
ttPr3P1kMrYC/sEXyzxq4Hk4tvvLQFe9I/qC5liUgymzpe9qwsrJFq5AQ2vbgKgq
2MQ3FAoqhN6szCs9yYdY
=6uA9
-----END PGP SIGNATURE-----
--- End Message ---
--- Begin Message ---
This is believed to be fixed in 3.02.02, which was recently uploaded to Debian.
Please reopen the bug if the problem persists.
--- End Message ---