Your message dated Sun, 19 Sep 2021 14:25:01 +0000
with message-id <[email protected]>
and subject line Bug#987057: fixed in gscan2pdf 2.12.3-1
has caused the Debian Bug report #987057,
regarding gscan2pdf: fixes to hOCR
to be marked as done.

This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact [email protected]
immediately.)


-- 
987057: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=987057
Debian Bug Tracking System
Contact [email protected] with problems
--- Begin Message ---
Package: gscan2pdf
Version: 2.11.0-1
Severity: wishlist
Tags: patch upstream

Hi,

Please find attached three patches improving hOCR support:
- recognize more tags when reading hOCR
  (this also helps with rotated text, which is covered in a separate report)
- preserve more properties when writing hOCR
- fix indentation of the closing tags of intermediate non-leaf elements

The patches are against upstream version 2.11.2.

Please consider incorporating them into gscan2pdf's next version.

Thanks in advance
Peter

-- System Information:
Debian Release: bullseye/sid
  APT prefers testing
  APT policy: (990, 'testing'), (500, 'unstable'), (500, 'stable'), (1, 'experimental')
Architecture: amd64 (x86_64)

Kernel: Linux 5.10.0-5-amd64 (SMP w/12 CPU threads)
Kernel taint flags: TAINT_CRAP
Locale: LANG=de_DE.UTF-8, LC_CTYPE=de_DE.UTF-8 (charmap=UTF-8), LANGUAGE=en_GB
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages gscan2pdf depends on:
ii  imagemagick                            8:6.9.11.60+dfsg-1
ii  imagemagick-6.q16 [imagemagick]        8:6.9.11.60+dfsg-1
ii  libconfig-general-perl                 2.63-1
ii  libdate-calc-perl                      6.4-1.1
ii  libfilesys-df-perl                     0.92-6+b6
ii  libgoocanvas2-perl                     0.06-2
ii  libgtk3-imageview-perl                 6-1
ii  libgtk3-perl                           0.038-1
ii  libgtk3-simplelist-perl                0.21-1
ii  libhtml-parser-perl                    3.75-1+b1
ii  libimage-magick-perl                   8:6.9.11.60+dfsg-1
ii  libimage-sane-perl                     5-1+b1
ii  liblist-moreutils-perl                 0.430-2
ii  liblocale-codes-perl                   3.66-1
ii  liblocale-gettext-perl                 1.07-4+b1
ii  liblog-log4perl-perl                   1.54-1
ii  libossp-uuid-perl [libdata-uuid-perl]  1.6.2-1.5+b9
ii  libpdf-builder-perl                    3.021-2
ii  libproc-processtable-perl              0.59-2+b1
ii  libreadonly-perl                       2.050-3
ii  librsvg2-common                        2.50.3+dfsg-1
ii  libset-intspan-perl                    1.19-1.1
ii  libtiff-tools                          4.2.0-1
ii  libtry-tiny-perl                       0.30-1
hi  sane-utils                             1.0.31-4pm1

Versions of packages gscan2pdf recommends:
ii  djvulibre-bin       3.5.28-1
ii  gocr                0.52-3
ii  pdftk-java [pdftk]  3.2.2-1
ii  tesseract-ocr       4.1.1-2.1
ii  unpaper             6.1-2+b2
ii  xdg-utils           1.1.3-4

gscan2pdf suggests no packages.

-- no debconf information
From f9d32fbeb11619637ea6263881b6333afc4ebeaf Mon Sep 17 00:00:00 2001
From: Peter Marschall <[email protected]>
Date: Wed, 14 Apr 2021 17:41:56 +0200
Subject: [PATCH 1/3] Bboxtree: preserve more information in to_hocr()

Keep 'textangle' and 'baseline' properties in to_hocr() method.

Signed-off-by: Peter Marschall <[email protected]>
---
 lib/Gscan2pdf/Bboxtree.pm | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/lib/Gscan2pdf/Bboxtree.pm b/lib/Gscan2pdf/Bboxtree.pm
index 59edaf61..2ee47834 100644
--- a/lib/Gscan2pdf/Bboxtree.pm
+++ b/lib/Gscan2pdf/Bboxtree.pm
@@ -514,6 +514,12 @@ EOS
         $string .= $SPACE x ( 2 + $bbox->{depth} ) . "<$tag class='$type'";
         if ( defined $bbox->{id} ) { $string .= " id='$bbox->{id}'" }
         $string .= " title='bbox $x1 $y1 $x2 $y2";
+        if ( defined $bbox->{baseline} ) {
+            $string .= '; baseline ' . join( $SPACE, @{ $bbox->{baseline} } );
+        }
+        if ( defined $bbox->{textangle} ) {
+            $string .= "; textangle $bbox->{textangle}";
+        }
         if ( defined $bbox->{confidence} ) {
             $string .= "; x_wconf $bbox->{confidence}";
         }
-- 
2.30.2
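
For readers of the archive, the effect of this first patch on the hOCR
title attribute can be shown in isolation. The following is a minimal
standalone sketch, not gscan2pdf code: the helper name hocr_title and the
'bbox' array key are assumptions made for illustration, while the
'baseline', 'textangle' and 'confidence' keys and the order of the
properties follow the hunk above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of gscan2pdf): build the value of an hOCR
# title attribute from a hash shaped like the boxes in the patch above.
# The 'bbox' key holding [x1, y1, x2, y2] is an assumption of this sketch.
sub hocr_title {
    my ($bbox) = @_;
    my $title = 'bbox ' . join( q{ }, @{ $bbox->{bbox} } );

    # Optional properties are appended only when present, separated by '; '.
    if ( defined $bbox->{baseline} ) {
        $title .= '; baseline ' . join( q{ }, @{ $bbox->{baseline} } );
    }
    if ( defined $bbox->{textangle} ) {
        $title .= "; textangle $bbox->{textangle}";
    }
    if ( defined $bbox->{confidence} ) {
        $title .= "; x_wconf $bbox->{confidence}";
    }
    return $title;
}

print hocr_title(
    {
        bbox       => [ 10, 20, 110, 40 ],
        baseline   => [ -0.003, -4 ],
        textangle  => 90,
        confidence => 93,
    }
  ),
  "\n";
# prints: bbox 10 20 110 40; baseline -0.003 -4; textangle 90; x_wconf 93
```

A box without the optional keys yields only the 'bbox ...' part, so the
output for existing files should stay unchanged under these assumptions.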

From 11ed93483c800082525cd6a3afcfb08f0e15be9c Mon Sep 17 00:00:00 2001
From: Peter Marschall <[email protected]>
Date: Wed, 14 Apr 2021 18:02:16 +0200
Subject: [PATCH 2/3] Bboxtree: recognize more hOCR elements in _hocr2boxes()

Recognize additional elements 'ocr_header', 'ocr_footer', 'ocr_caption'
as well as their 'ocrx_...' counterparts when parsing hOCR into a Bboxtree.

Recent versions of tesseract seem to generate some of these elements
instead of 'ocrx_line'.

As the line-like elements contain important information, gscan2pdf needs
to recognize them in order to
* properly display the OCR'ed text
* preserve as much information as possible when storing hOCR files

Signed-off-by: Peter Marschall <[email protected]>
---
 lib/Gscan2pdf/Bboxtree.pm | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/lib/Gscan2pdf/Bboxtree.pm b/lib/Gscan2pdf/Bboxtree.pm
index 2ee47834..e8fa83d3 100644
--- a/lib/Gscan2pdf/Bboxtree.pm
+++ b/lib/Gscan2pdf/Bboxtree.pm
@@ -138,6 +138,15 @@ sub _hocr2boxes {
                         when (/_par$/xsm) {
                             $data->{type} = 'para';
                         }
+                        when (/_header$/xsm) {
+                            $data->{type} = 'header';
+                        }
+                        when (/_footer$/xsm) {
+                            $data->{type} = 'footer';
+                        }
+                        when (/_caption$/xsm) {
+                            $data->{type} = 'caption';
+                        }
                         when (/_line$/xsm) {
                             $data->{type} = 'line';
                         }
-- 
2.30.2
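
The class-to-type mapping that this patch extends can also be written as
a lookup table. The following is a standalone sketch of that alternative,
not the given/when code gscan2pdf uses: the helper name type_for_class and
the table name are hypothetical, and only the five suffixes visible in the
hunk above are listed.

```perl
use strict;
use warnings;

# Suffix => type pairs taken from the hunk above. Classes that the hunk
# does not show are deliberately left out of this sketch.
my %TYPE_FOR_SUFFIX = (
    par     => 'para',
    header  => 'header',
    footer  => 'footer',
    caption => 'caption',
    line    => 'line',
);

# Hypothetical helper: return the box type for an hOCR class such as
# 'ocr_header' or 'ocrx_line', or undef if the suffix is not in the table.
sub type_for_class {
    my ($class) = @_;
    if ( $class =~ /_([[:lower:]]+)\z/xsm ) {
        return $TYPE_FOR_SUFFIX{$1};
    }
    return undef;
}

for my $class (qw(ocr_header ocr_footer ocr_caption ocr_line ocr_par)) {
    printf "%-12s -> %s\n", $class, type_for_class($class);
}
```

Because only the suffix after the last underscore is looked up, the
'ocr_' and 'ocrx_' variants of a class map to the same type, which is
also what the suffix regexes in the patch do.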

From 32b921376cb61ab8750ecf11c2152ec30517a579 Mon Sep 17 00:00:00 2001
From: Peter Marschall <[email protected]>
Date: Wed, 14 Apr 2021 18:41:47 +0200
Subject: [PATCH 3/3] Bboxtree: fix indentation of intermediate closing tags in
 to_hocr()

When writing hOCR files using to_hocr(), make sure closing tags
of intermediate non-leaf elements are correctly indented, even
if sibling elements follow.

This makes sure to also keep the "visual" structure of hOCR files
generated by the OCR engines.

Signed-off-by: Peter Marschall <[email protected]>
---
 lib/Gscan2pdf/Bboxtree.pm | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/lib/Gscan2pdf/Bboxtree.pm b/lib/Gscan2pdf/Bboxtree.pm
index e8fa83d3..4c267e02 100644
--- a/lib/Gscan2pdf/Bboxtree.pm
+++ b/lib/Gscan2pdf/Bboxtree.pm
@@ -495,8 +495,10 @@ EOS
     while ( my $bbox = $iter->() ) {
         if ( defined $prev_depth ) {
             if ( $prev_depth >= $bbox->{depth} ) {
+                if (@tags) { $string .= '</' . pop(@tags) . ">\n" }
+                $prev_depth--;
                 while ( $prev_depth-- >= $bbox->{depth} ) {
-                    $string .= '</' . pop(@tags) . ">\n";
+                    $string .= $SPACE x ( 2 + $prev_depth + 1 ) . '</' . pop(@tags) . ">\n";
                 }
             }
             else {
-- 
2.30.2
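
The idea behind this third patch, indenting each closing tag to the depth
of its opening tag, can be illustrated with an explicit stack. This is a
simplified standalone sketch, not the to_hocr() implementation: the
function name render and the node layout are assumptions, attributes and
text are omitted, and every element (including leaves) is closed on a
line of its own.

```perl
use strict;
use warnings;

# Hypothetical sketch: serialise a flat, depth-annotated list of elements.
# Each node is a hash with the keys 'depth' and 'tag' (assumed layout).
sub render {
    my (@nodes) = @_;
    my $out  = q{};
    my @open = ();    # stack of [ depth, tag ] pairs for unclosed elements

    for my $node (@nodes) {

        # Close every open element that is not an ancestor of this node,
        # indenting the closing tag to the depth it was opened at.
        while ( @open and $open[-1][0] >= $node->{depth} ) {
            my ( $depth, $tag ) = @{ pop @open };
            $out .= q{ } x $depth . "</$tag>\n";
        }
        $out .= q{ } x $node->{depth} . "<$node->{tag}>\n";
        push @open, [ $node->{depth}, $node->{tag} ];
    }

    # Close whatever is still open at the end of the input.
    while (@open) {
        my ( $depth, $tag ) = @{ pop @open };
        $out .= q{ } x $depth . "</$tag>\n";
    }
    return $out;
}

print render(
    { depth => 0, tag => 'div' },
    { depth => 1, tag => 'p' },
    { depth => 2, tag => 'span' },
    { depth => 2, tag => 'span' },
    { depth => 1, tag => 'p' },
);
```

Storing the depth next to the tag name on the stack means the indentation
of a closing tag never has to be recomputed from a running counter, which
is the part the patch adjusts in the original loop.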


--- End Message ---
--- Begin Message ---
Source: gscan2pdf
Source-Version: 2.12.3-1
Done: Jeffrey Ratcliffe <[email protected]>

We believe that the bug you reported is fixed in the latest version of
gscan2pdf, which is due to be installed in the Debian FTP archive.

A summary of the changes between this version and the previous one is
attached.

Thank you for reporting the bug, which will now be closed.  If you
have further comments please address them to [email protected],
and the maintainer will reopen the bug report if appropriate.

Debian distribution maintenance software
pp.
Jeffrey Ratcliffe <[email protected]> (supplier of updated gscan2pdf package)

(This message was generated automatically at their request; if you
believe that there is a problem with it please contact the archive
administrators by mailing [email protected])


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Format: 1.8
Date: Fri, 17 Sep 2021 19:26:26 +0200
Source: gscan2pdf
Architecture: source
Version: 2.12.3-1
Distribution: unstable
Urgency: medium
Maintainer: Jeffrey Ratcliffe <[email protected]>
Changed-By: Jeffrey Ratcliffe <[email protected]>
Closes: 987057 987058 987059 987211 987212
Changes:
 gscan2pdf (2.12.3-1) unstable; urgency=medium
 .
   * New upstream release
     Closes: #987057 (gscan2pdf: fixes to hOCR)
     Closes: #987058 (gscan2pdf: fix displaying rotated text)
     Closes: #987211
     (gscan2pdf: separate tab for Post-processing options in Scan dialog)
     Closes: #987212 (gscan2pdf: visually align 'Threshold before OCR')
     Closes: #987059 (gscan2pdf: POD and manpage improvements)
   * Update depends libpdf-builder-perl to require 3.022 or better
Checksums-Sha1:
 fafc7dd89551355cb7eaf0f2b49ccd89486f71ea 2923 gscan2pdf_2.12.3-1.dsc
 7b23d4f121b28a35906af36643bd1481827d42ac 504168 gscan2pdf_2.12.3.orig.tar.xz
 c756efffc1a27080bfa9bc7f9bfe3d03e3b30267 833 gscan2pdf_2.12.3.orig.tar.xz.asc
 1c2eb2612549958ef55b6765333e1ab0d5205460 12608 gscan2pdf_2.12.3-1.debian.tar.xz
 fa0c400626588e27ef2f9a30378fb87a5c90926d 5739 gscan2pdf_2.12.3-1_source.buildinfo
Checksums-Sha256:
 2d6bfe43e6f6cc34a794f4c9fca77710420baeed72bf2b77f4c234f6b3983bc5 2923 gscan2pdf_2.12.3-1.dsc
 b5d5d372823b0e7ac1b17b57474af4ee1cf49437008d925a77ea38c30c4770e9 504168 gscan2pdf_2.12.3.orig.tar.xz
 64b1ebb833d01da9be7072d70572654c25ddb53aec435ad5b5a19a227bb34202 833 gscan2pdf_2.12.3.orig.tar.xz.asc
 7d09ed2d6c49322544acbdb15c01c6f70dc38a0a4fa7f1dff76dbb534a57bd0e 12608 gscan2pdf_2.12.3-1.debian.tar.xz
 ce5e851166a5ba504acd168b6472c4a4815a1c61bd10c449ac1124d00bb4755e 5739 gscan2pdf_2.12.3-1_source.buildinfo
Files:
 a0efa77411e6e7180d8a5868b962dd1f 2923 utils optional gscan2pdf_2.12.3-1.dsc
 14e5940132874c8ec69edb91c86ae7a9 504168 utils optional gscan2pdf_2.12.3.orig.tar.xz
 344947c0173cc70b47200bfe41fd7aef 833 utils optional gscan2pdf_2.12.3.orig.tar.xz.asc
 6b654d4b02ef703ecbac66894f44ce3b 12608 utils optional gscan2pdf_2.12.3-1.debian.tar.xz
 a5a28a805dbd510541e8343aebc7da9c 5739 utils optional gscan2pdf_2.12.3-1_source.buildinfo

-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEERjKT5K4zhxhG8wInsyHyAxEPyvMFAmFHPMUACgkQsyHyAxEP
yvO2FRAAlL6WzXP9/XelEj02lo8BBjfgHIe8eOB1MUJ4KYirpJxTI+Jl1Fkzp+Eq
tjI+pKy0Vwz2h1fI97Ye0JBGJUAGyeG6gEnz3CEJhPKNBh5iRQOPW2L4sm7U59wq
KFOK3AGCgjcfmMBPqZZugxT0dIbfMBwcKxf09YMs/AaVOarOWqPoW7DV+roL3J/Z
QOcWbhQFAGOiPpFc1PG/qjXTTTcO1seyNyS2hwregdNwG0ajqg4CHcAeBG0Ym4mK
FuxRQp+LkAMj+AT45eqF+VM4gkFwnFj/A52cNRRqgpskV+jtRe5an2DZmZ1VdfV7
VoVQn1bP1F6R7w9KC482OCqlhMzRRwteyt8bkw1IgTil6JsSlEaCAEL3NpPr+vFz
n4M3Ij1fpc37M73Qp7fA+ZRPzHsnz5DPRUVwtr+lS9t/v9/0ieyFValG2DHaSNyv
i1/+4RJw3JWMm9k6yTYYuxLPCgYjAMbxdJBJF7r62Xxk6c42ggmqCjXyvQ7qp2RQ
M61NNLvh/bLtJKRmo3IsM4/wU+woymV6WKDGvAMHQ8C1uiyNGmYgrTRMYWPD0522
ulSfO05x4KfHwI79duJBi8UKxoi1m+KNPQpc34WIrYr5G3Kvw+Nv0PM+9HhkljZs
jIpk7oJVD7tCHynQaSZy7NkEov5wYs4/fg8gyxiWCNxHuTKHpK0=
=JzfY
-----END PGP SIGNATURE-----

--- End Message ---
