Bug#849094: liblept5: Broken on s390x (+ other big endian archs)

2018-05-05 Thread Stefan Weil
On Wed, 25 Apr 2018 10:51:09 +0200 Graham Inggs wrote:
> I agree, and with the upload of leptonlib 1.75.3-4, the tests are now
> run during the build [3], and failure here will result in a failed
> build. Builds of leptonlib were successful on all architectures [4]
> except sparc64, which is notoriously sensitive to unaligned memory access.

Leptonica 1.76.0 built without problems and passed nearly all tests on
sparc64.

With a small fix for a misaligned memory read, only a single failing
test remains:
https://github.com/DanBloomberg/leptonica/pull/342



Bug#849094: liblept5: Broken on s390x (+ other big endian archs)

2018-05-04 Thread Sean Whitton
Hello,

On Fri, May 04 2018, Jeff Breidenbach wrote:

> Tesseract in Debian just added a build time smoke test AND Stefan
> fixed the big-endian problem. Should be live tomorrow in Sid. Assuming
> that works well, Graham should be able to re-activate the disabled
> OCRMyPDF tests. Note that this is Debian only; Ubuntu 18.04 is out the
> door so no longer pertinent.

They weren't ever deactivated in Debian.

Thank you to everyone who made this patch happen!  Great news!

-- 
Sean Whitton




Bug#849094: liblept5: Broken on s390x (+ other big endian archs)

2018-05-04 Thread Jeff Breidenbach
Tesseract in Debian just added a build time smoke test AND Stefan fixed
the big-endian problem. Should be live tomorrow in Sid. Assuming that works
well, Graham should be able to re-activate the disabled OCRMyPDF tests. Note
that this is Debian only; Ubuntu 18.04 is out the door so no longer pertinent.


Bug#849094: liblept5: Broken on s390x (+ other big endian archs)

2018-04-25 Thread Sean Whitton
Hello Graham,

On Wed, Apr 25 2018, Graham Inggs wrote:

> We agreed on marking the tests failing on big-endian XFAIL as a
> short-term solution.
>
> Sean, I'm not sure if you want to include this in the Debian packaging
> of ocrmypdf as well, but here [1] it is.

Thank you (and Jeff) for your work and for the patch.

I would like to hear what upstream thinks of the xfail approach before
uploading that to Debian.

>> Thinking about the failures, I suspect that the endian issues are now
>> within Tesseract not Leptonica.
>
> I agree, and with the upload of leptonlib 1.75.3-4, the tests are now
> run during the build [3], and failure here will result in a failed
> build.  Builds of leptonlib were successful on all architectures [4]
> except sparc64, which is notoriously sensitive to unaligned memory
> access.

It seems like this bug should be reassigned, then.

-- 
Sean Whitton




Bug#849094: liblept5: Broken on s390x (+ other big endian archs)

2018-04-25 Thread Graham Inggs

Hi

On 24/04/2018 23:32, Jeff Breidenbach wrote:
> Recommend we talk in real time. Will send you contact details by
> private email.

Jeff, thanks for contacting me, it is much appreciated!

We agreed on marking the tests failing on big-endian XFAIL as a 
short-term solution.
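
A conditional expected failure of this kind can be sketched as follows. This is only an illustration: it uses the standard library's unittest rather than the pytest markers that OCRmyPDF's suite uses, and the helper name xfail_on_big_endian and the placeholder test are hypothetical.

```python
import sys
import unittest


def xfail_on_big_endian(test_func):
    """Mark a test as an expected failure, but only on big-endian hosts.

    Hypothetical helper: on little-endian machines the test is returned
    unchanged, so a real regression there still fails the run.
    """
    if sys.byteorder == "big":
        return unittest.expectedFailure(test_func)
    return test_func


class AutorotateTests(unittest.TestCase):
    @xfail_on_big_endian
    def test_placeholder(self):
        # Stand-in for a test whose result depends on Tesseract's output.
        self.assertEqual(sys.byteorder, "little")
```

Keeping the condition tied to the byte order means the tests still guard little-endian architectures, and the markers can be dropped once the underlying problem is fixed.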


Sean, I'm not sure if you want to include this in the Debian packaging 
of ocrmypdf as well, but here [1] it is.


On 25/04/2018 10:13, James R Barlow wrote:
> Great to see the most recent test run passed, even if it is with liberal
> application of "expect failure". The Canonical powers that be should be
> appeased for the moment. I appreciate the last minute effort to get this,
> and Tesseract, into the next Ubuntu.

In this case the Canonical power was Britney, and she can be a harsh
mistress, but she was appeased, and ocrmypdf is marked for release in
Bionic Beaver [2].

> Thinking about the failures, I suspect that the endian issues are now
> within Tesseract not Leptonica.


I agree, and with the upload of leptonlib 1.75.3-4, the tests are now 
run during the build [3], and failure here will result in a failed 
build.  Builds of leptonlib were successful on all architectures [4] 
except sparc64, which is notoriously sensitive to unaligned memory access.


Regards
Graham


[1] 
https://launchpadlibrarian.net/367200176/ocrmypdf_6.1.2-1_6.1.2-1ubuntu1.diff.gz

[2] https://launchpad.net/ubuntu/+source/ocrmypdf
[3] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=895864#25
[4] https://buildd.debian.org/status/package.php?p=leptonlib&suite=unstable



Bug#849094: liblept5: Broken on s390x (+ other big endian archs)

2018-04-25 Thread James R Barlow
Great to see the most recent test run passed, even if it is with liberal
application of "expect failure". The Canonical powers that be should be
appeased for the moment. I appreciate the last minute effort to get this,
and Tesseract, into the next Ubuntu.

Thinking about the failures, I suspect that the endian issues are now
within Tesseract not Leptonica. test_deskew passes, and this test skips
Tesseract entirely. It uses Leptonica to deskew a monochrome image and
confirm it was deskewed. I think it's extremely unlikely all that bit
twiddling would work if Leptonica were in the wrong endian. (Although there
could be individual Leptonica APIs that might not work on big endian.)

The failure that surprised me is "test_tesseract_config_notfound". It
passes Tesseract a configuration file that doesn't exist, but it turns out
Tesseract proceeds with OCR rather than aborts in this case, so this isn't
informative.

Based on the failures I suspect the following command line will exit with
SIGSEGV:
  tesseract -l eng -c textonly_pdf=1 --user-words wordlist.txt \
      tests/resources/crom.png out pdf txt

where wordlist.txt is a file containing some words separated by newlines
and tests/resources/crom.png is distributed with OCRmyPDF. (If it does work
the PDF will be a blank page containing text with no image.)
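
To tell a crash from an ordinary error exit when trying this, a small wrapper can help. The sketch below assumes a POSIX system, where Python's subprocess module reports death by signal N as return code -N; the function names are made up for illustration.

```python
import signal
import subprocess


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable string."""
    if returncode < 0:
        # On POSIX, subprocess reports death by signal N as return code -N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = "unknown signal"
        return "killed by %s (%d)" % (name, -returncode)
    return "exited with status %d" % returncode


def run_and_describe(argv):
    """Run a command, discard its output, and describe how it ended."""
    proc = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return describe_exit(proc.returncode)
```

Passing the tesseract command above to run_and_describe as a list of arguments would then report either an exit status or the signal that killed the process.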

Most of the failing tests have something to do with setting non-default
configuration variables for Tesseract.


Bug#849094: liblept5: Broken on s390x (+ other big endian archs)

2018-04-25 Thread Stefan Weil
Am 25.04.2018 um 03:37 schrieb Dan Bloomberg:
> There's an endianness.h file in leptonica/src.  Does it say BIG_ENDIAN
> or LITTLE_ENDIAN on your s390?

It says BIG_ENDIAN on my emulated S390X with Debian Testing.

Is arrayaccess.h correct for big endian machines? The 16 and 32 bit
accessors look strange because they swap the words, but not the bytes.
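
For reference, here is a small sketch of why an accessor might swap positions within a word without swapping bytes. It does not reproduce arrayaccess.h. It assumes image data is held in native 32-bit words with the leftmost pixel in the most significant bits, and all names in it are invented for the illustration.

```python
import struct
import sys

WORD = 0x11223344  # bytes from most to least significant: 11 22 33 44


def logical_byte(word, n):
    """Byte n of a 32-bit word, counting from the most significant byte."""
    return (word >> (8 * (3 - n))) & 0xFF


def logical_halfword(word, n):
    """16-bit half n of a 32-bit word, counting from the most significant half."""
    return (word >> (16 * (1 - n))) & 0xFFFF


big = WORD.to_bytes(4, "big")        # memory layout on a big-endian host
little = WORD.to_bytes(4, "little")  # memory layout on a little-endian host

# Big endian: logical index and memory index coincide.
assert [big[n] for n in range(4)] == [logical_byte(WORD, n) for n in range(4)]

# Little endian, 8-bit access: logical byte n sits at memory index n ^ 3.
assert [little[n ^ 3] for n in range(4)] == [logical_byte(WORD, n) for n in range(4)]

# Little endian, 16-bit access: a native 16-bit load already puts the two
# bytes in the right order, so only the position of the half is swapped (n ^ 1).
halves = struct.unpack("<2H", little)
assert [halves[n ^ 1] for n in range(2)] == [logical_halfword(WORD, n) for n in range(2)]

print("host byte order:", sys.byteorder)
```

Under that assumption, swapping only the position of a 16-bit half is consistent on a little-endian host, and no swapping at all would be expected on a big-endian one. Whether the header actually selects the right variant on s390x is the open question.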



Bug#849094: liblept5: Broken on s390x (+ other big endian archs)

2018-04-24 Thread Jeff Breidenbach
Given the date, it sounds like we have an emergency situation.

I'm really stuck here. My only known access to a big endian machine is an
emulator with Wheezy.

   http://create.stephan-brumme.com/big-endian/

That's good enough for checking suspicious parts of Leptonica. I tried and
found nothing. But it is not good enough to check Tesseract 4.0.

Your OCRmyPDF s390 test provides strong evidence that Tesseract
is broken on big endian.

http://autopkgtest.ubuntu.com/packages/o/ocrmypdf/bionic/s390x

However, if I wipe out that package on big-endian, it will cause a cascading
dependency failure affecting hundreds of other packages. Not great to do
36 hours before release.

Recommend we talk in real time. Will send you contact details by private
email.


Bug#849094: liblept5: Broken on s390x (+ other big endian archs)

2018-04-22 Thread Graham Inggs
Hi Jeff

On 3 January 2017 at 20:24, Jeff Breidenbach wrote:
> Tesseract 4 is known to not work on big endian. Stefan (on CC) is excited to
> take a look if someone can give him access to a big endian machine.

I note this bug is still open, so I assume this is still the case.

If so, would you please consider specifying only little-endian
architectures in tesseract's debian/control?
This will prevent ocrmypdf from being installable on big-endian
architectures, and this in turn, will prevent ocrmypdf's testsuite
from being run on s390x in Ubuntu and failing [1], which currently
prevents ocrmypdf from being included in Ubuntu's 18.04 LTS release,
scheduled for April 26.

Regards
Graham

[1] http://autopkgtest.ubuntu.com/packages/o/ocrmypdf/bionic/s390x



Bug#849094: liblept5: Broken on s390x (+ other big endian archs)

2017-01-04 Thread Jeff Breidenbach
Sorry, I wasn't aware of the guest account thing. Probably my fault for not
reading email carefully enough. I am a Debian Developer and will sponsor this
request. Fill out the information "Information guest needs to supply to
sponsoring DD" and I will sign it.

https://dsa.debian.org/doc/guest-account/

Asking around, another option appears to be Oregon State University which
provides access to a big endian PowerPC machine. I do not know which approach
is easier. People who work on the Go computer language use this.

http://osuosl.org/services/powerdev/request_hosting/


Bug#849094: liblept5: Broken on s390x (+ other big endian archs)

2017-01-04 Thread Stefan Weil

On 01/04/17 08:03, Graham Inggs wrote:
> On 3 January 2017 at 20:24, Jeff Breidenbach wrote:
>> Tesseract 4 is known to not work on big endian. Stefan (on CC) is excited to
>> take a look if someone can give him access to a big endian machine.
>
> It is possible for non-DDs to request temporary access to porterboxes,
> see https://dsa.debian.org/doc/guest-account/

"People who are not yet DMs or NMs will need to find a DD who is willing
to sponsor their request". That's what I tried to do.


Stefan



Bug#849094: liblept5: Broken on s390x (+ other big endian archs)

2017-01-03 Thread Graham Inggs
On 3 January 2017 at 20:24, Jeff Breidenbach wrote:
> Tesseract 4 is known to not work on big endian. Stefan (on CC) is excited to
> take a look if someone can give him access to a big endian machine.

It is possible for non-DDs to request temporary access to porterboxes,
see https://dsa.debian.org/doc/guest-account/



Bug#849094: liblept5: Broken on s390x (+ other big endian archs)

2017-01-03 Thread Jeff Breidenbach
I've just uploaded 1.74.1-1 to Debian, which contains something
similar to Sean's patch.


Bug#849094: liblept5: Broken on s390x (+ other big endian archs)

2017-01-03 Thread Jeff Breidenbach
Tesseract 4 is known to not work on big endian. Stefan (on CC) is excited to
take a look if someone can give him access to a big endian machine.

There are no known endian problems with Tesseract 3 or Leptonica, but if any
are definitively found they will get immediate attention.

I am not going to apply this patch in Debian right now. Instead will send it
upstream for consideration.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=849094


Bug#849094: liblept5: Broken on s390x (+ other big endian archs)

2016-12-28 Thread James R Barlow
I'm the ocrmypdf upstream author.

First, be aware that the output of OCR and autorotate is cached in the test
suite and the results are persisted between test cases and runs of the test
suite in the tests/cache folder. The cache hit/miss check is not smart
enough to pick up changes that aren't reflected in leptonica's version
number, that is, debian changes. However, it looks to me like the test
suite is being run to target a temporary folder and that should remove
cache effects. Nuke the tests/cache folder between test suite runs to be
sure.
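
Clearing the cache between runs can be scripted. This is a minimal sketch; the function name is hypothetical and the tests/cache path is taken from the description above.

```python
import shutil
from pathlib import Path


def clear_test_cache(cache_dir):
    """Delete a cache directory and recreate it empty.

    Returns True if a cache directory existed and was removed,
    False if there was nothing to remove.
    """
    cache_dir = Path(cache_dir)
    existed = cache_dir.is_dir()
    if existed:
        shutil.rmtree(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    return existed
```

Calling clear_test_cache("tests/cache") from the source tree before each run should rule out stale results when comparing different leptonica builds.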

All the failing tests relate to "check_monochrome_correlation", a function
that checks for close but not identical visual output compared to a
reference. Because of a now-fixed leptonica bug in one of the underlying
functions, I actually have a separate test that validates this helper
function, and that passes on big endian.
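
To make "close but not identical" concrete, here is a pure-Python sketch of one possible correlation measure for 1-bit images. It is an assumption for illustration only: it scores two images as |A AND B|^2 / (|A| * |B|), where |X| is the number of set pixels, and the real check_monochrome_correlation helper may compute something different.

```python
def binary_correlation(a, b):
    """Correlation of two equally sized 1-bit images given as rows of 0/1.

    Uses |A AND B|^2 / (|A| * |B|), where |X| is the number of set pixels.
    Identical images score 1.0; images with no overlap score 0.0.
    """
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same dimensions")
    count_a = sum(map(sum, a))
    count_b = sum(map(sum, b))
    if count_a == 0 or count_b == 0:
        return 0.0
    overlap = sum(pa & pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return overlap * overlap / (count_a * count_b)
```

With a measure like this, a correctly rotated page compared against the reference scores close to 1, while a page left in the wrong orientation scores close to 0, which matches the kind of threshold assertion the failing tests make.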

The log shows that tesseract failed to properly detect page orientation and
came back with a low confidence answer. I interpret that to mean there are
endian issues in either tesseract or leptonica; the test isn't able to
distinguish.

It seems that the problem may be either a big endian issue in tesseract
alone (perhaps affecting multiple versions, since tesseract does not have
much of a test suite) or it's some leptonica API that tesseract invokes while
doing a page orientation check. Tesseract's test suite is very limited and
probably doesn't check for consistency here.

It looks like the patch is safe to apply and would be a net improvement even
though it doesn't fix all of the issues my test suite finds.


You can check orientation (skipping full OCR) in tesseract 3.04.01 with:

$ tesseract -l eng -psm 0 test_image.png stdout

The output for LinnSequencer.jpg on my macOS-x64 machine is:

$ tesseract -l eng -psm 0 tests/resources/LinnSequencer.jpg stdout
Warning in pixReadMemJpeg: work-around: writing to a temp file
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 31.48
Script: Latin
Script confidence: 100.95

From the logs, tesseract reports (orientation, confidence) = (0, 1.32) for
the same page on big endian, which means whatever data it is examining is
much noisier, i.e. probably corrupted by endian swizzling. Quite likely the
OCR output is garbage as well.
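
For comparing runs across architectures, the (orientation, confidence) pair can be pulled out of that output mechanically. A minimal sketch follows, using the macOS output quoted above as sample input; the function name is hypothetical.

```python
def parse_orientation_report(text):
    """Extract (orientation_degrees, confidence) from orientation-only output."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return (
        int(fields["Orientation in degrees"]),
        float(fields["Orientation confidence"]),
    )


SAMPLE = """\
Warning in pixReadMemJpeg: work-around: writing to a temp file
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 31.48
Script: Latin
Script confidence: 100.95
"""

orientation, confidence = parse_orientation_report(SAMPLE)
print(orientation, confidence)
```

Running the same parser on the big endian output would make the drop in confidence (31.48 versus 1.32 for the same page) easy to tabulate.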

It might be interesting to see what the behavior differences are for
leptonica 1.73-patched, 1.74 and tesseract 3.04.01 and 4.00alpha all on big
endian. The results matrix from those combinations would probably indicate
whether to blame tesseract or leptonica.
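
Such a results matrix could be tabulated along these lines. This is a sketch only: run_tests stands in for whatever builds and tests a given pair of versions, and the suspect function is a crude heuristic invented here, not an established method.

```python
from itertools import product

LEPTONICA = ["1.73-patched", "1.74"]
TESSERACT = ["3.04.01", "4.00alpha"]


def results_matrix(run_tests):
    """Map each (leptonica, tesseract) pair to True (pass) or False (fail)."""
    return {
        combo: bool(run_tests(*combo))
        for combo in product(LEPTONICA, TESSERACT)
    }


def version_decides(matrix, axis):
    """True if the outcome depends only on the version at position `axis`."""
    by_version = {}
    for combo, passed in matrix.items():
        by_version.setdefault(combo[axis], set()).add(passed)
    consistent = all(len(outcomes) == 1 for outcomes in by_version.values())
    differs = len(set().union(*by_version.values())) > 1
    return consistent and differs


def suspect(matrix):
    """Crude heuristic for which component to look at first."""
    if version_decides(matrix, 0):
        return "leptonica"
    if version_decides(matrix, 1):
        return "tesseract"
    return "inconclusive"
```

If every combination fails, the heuristic returns "inconclusive", which is the situation where both libraries would still need to be examined.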


Bug#849094: liblept5: Broken on s390x (+ other big endian archs)

2016-12-25 Thread Graham Inggs
Control: tags -1 - patch

I've just tried running ocrmypdf's test suite against the recently
released leptonlib 1.74.0 on powerpc and I get the same results I did
with 1.73 and Sean's patch, i.e. the following three tests fail:

test_autorotate[hocr]
test_autorotate[tesseract]
test_autorotate_threshold_low

Untagging the patch as it is incomplete.



Bug#849094: liblept5: Broken on s390x (+ other big endian archs)

2016-12-23 Thread Graham Inggs
I built leptonlib 1.73-6 including Sean's patch on powerpc and s390x.
I then ran ocrmypdf's test suite against it.

Test results went from:

tests/test_hocrtransform.py .
tests/test_main.py
...F..ss.F.
tests/test_pageinfo.py 

to:

tests/test_hocrtransform.py .
tests/test_main.py
..ss...FFF.
tests/test_pageinfo.py 

Tests on little-endian architectures remained successful, so it seems
to be a step in the right direction, but we aren't quite there yet.

Output of new failing tests attached.
============================= test session starts ==============================
platform linux -- Python 3.5.2+, pytest-2.9.2, py-1.4.31, pluggy-0.3.1
rootdir: /data/adttmp/autopkgtest-virt-lxc.shared.hw7c_x0x/downtmp/build.hfb/ocrmypdf-4.3.4, inifile: pytest.ini
collected 85 items

test_requirements.txt s
tests/test_hocrtransform.py .
tests/test_main.py 
..ss...FFF.
tests/test_pageinfo.py 

=================================== FAILURES ===================================
____________________________ test_autorotate[hocr] _____________________________

spoof_tesseract_cache = {'ADTTMP': '/data/adttmp/autopkgtest-virt-lxc.shared.hw7c_x0x/downtmp/autopkgtest_tmp', 'ADT_ARTIFACTS': '/data/adttmp...untu1', 'AUTOPKGTEST_ARTIFACTS': '/data/adttmp/autopkgtest-virt-lxc.shared.hw7c_x0x/downtmp/test-suite-artifacts', ...}
renderer = 'hocr'

    @pytest.mark.parametrize('renderer', [
        'hocr',
        'tesseract',
        ])
    def test_autorotate(spoof_tesseract_cache, renderer):
        # cardinal.pdf contains four copies of an image rotated in each cardinal
        # direction - these ones are "burned in" not tagged with /Rotate
        out = check_ocrmypdf('cardinal.pdf', 'test_autorotate_%s.pdf' % renderer,
                             '-r', '-v', '1', env=spoof_tesseract_cache)
        for n in range(1, 4+1):
            correlation = check_monochrome_correlation(
                reference_pdf=_infile('cardinal.pdf'),
                reference_pageno=1,
                test_pdf=out,
                test_pageno=n)
>           assert correlation > 0.80
E           assert 0.0562746599316597 > 0.8

tests/test_main.py:401: AssertionError
----------------------------- Captured stdout call -----------------------------
  DEBUG - ocrmypdf 4.3.4
  DEBUG - 
os.symlink(/data/adttmp/autopkgtest-virt-lxc.shared.hw7c_x0x/downtmp/build.hfb/ocrmypdf-4.3.4/tests/resources/cardinal.pdf,
 /tmp/com.github.ocrmypdf.5hxi1p4t/origin)
  DEBUG - os.symlink(/tmp/com.github.ocrmypdf.5hxi1p4t/origin, 
/tmp/com.github.ocrmypdf.5hxi1p4t/origin.pdf)
  DEBUG - [{'images': [{'dpi_h': Decimal('300.000'), 'type': 'image', 'width': 
2550, 'bpc': 1, 'color': 'gray', 'dpi': Decimal('300.000'), 'name': '/Im0', 
'comp': 1, 'dpi_w': Decimal('300.000'), 'enc': 'jbig2', 'height': 3300}], 
'height_inches': Decimal('11'), 'rotate': 0, 'has_text': False, 'width_pixels': 
2550, 'width_inches': Decimal('8.5'), 'xres': Decimal('300.000'), 'yres': 
Decimal('300.000'), 'pageno': 0, 'height_pixels': 3300}, {'images': [{'dpi_h': 
Decimal('300.000'), 'type': 'image', 'width': 2550, 'bpc': 1, 'color': 'gray', 
'dpi': Decimal('300.000'), 'name': '/Im0', 'comp': 1, 'dpi_w': 
Decimal('300.000'), 'enc': 'jbig2', 'height': 3300}], 'height_inches': 
Decimal('8.5'), 'rotate': 0, 'has_text': False, 'width_pixels': 3300, 
'width_inches': Decimal('11'), 'xres': Decimal('300.000'), 'yres': 
Decimal('300.000'), 'pageno': 1, 'height_pixels': 2550}, {'images': [{'dpi_h': 
Decimal('300.000'), 'type': 'image', 'width': 2550, 'bpc': 1, 'color': 'gray', 
'dpi': Decimal('300.000'), 'name': '/Im0', 'comp': 1, 'dpi_w': 
Decimal('300.000'), 'enc': 'jbig2', 'height': 3300}], 'height_inches': 
Decimal('11'), 'rotate': 0, 'has_text': False, 'width_pixels': 2550, 
'width_inches': Decimal('8.5'), 'xres': Decimal('300.000'), 'yres': 
Decimal('300.000'), 'pageno': 2, 'height_pixels': 3300}, {'images': [{'dpi_h': 
Decimal('300.000'), 'type': 'image', 'width': 2550, 'bpc': 1, 'color': 'gray', 
'dpi': Decimal('300.000'), 'name': '/Im0', 'comp': 1, 'dpi_w': 
Decimal('300.000'), 'enc': 'jbig2', 'height': 3300}], 'height_inches': 
Decimal('8.5'), 'rotate': 0, 'has_text': False, 'width_pixels': 3300, 
'width_inches': Decimal('11'), 'xres': Decimal('300.000'), 'yres': 
Decimal('300.000'), 'pageno': 3, 'height_pixels': 2550}]
  DEBUG - os.symlink(/tmp/com.github.ocrmypdf.5hxi1p4t/01.page.pdf, 
/tmp/com.github.ocrmypdf.5hxi1p4t/01.ocr.page.pdf)
  DEBUG - os.symlink(/tmp/com.github.ocrmypdf.5hxi1p4t/02.page.pdf, 
/tmp/com.github.ocrmypdf.5hxi1p4t/02.ocr.page.pdf)
  DEBUG - os.symlink(/tmp/com.github.ocrmypdf.5hxi1p4t/03.page.pdf, 
/tmp/com.github.ocrmypdf.5hxi1p4t/03.ocr.page.pdf)
  DEBUG - os.symlink(/tmp/com.github.ocrmypdf.5hxi1p4t/04.page.pdf, 

Bug#849094: liblept5: Broken on s390x (+ other big endian archs)

2016-12-22 Thread Sean Whitton
Package: liblept5
Version: 1.73-6
Severity: normal
Tags: patch

Dear maintainer,

liblept looks to be broken on big endian architectures.  This was
discovered by means of the OCRmyPDF test suite.  It's failing on
s390x,[1] the broken files are emitted at the stage where OCRmyPDF
invokes liblept code, and the broken files are highly suggestive of
endianness issues.

I believe the attached backported patch will fix the problem, though
I've only been able to confirm that it doesn't break building the
package.

(Many thanks to James R. Barlow, OCRmyPDF's upstream author, for
examining the broken files, and to Mattia Rizzolo for running the tests
on an s390x porterbox.)

[1] http://autopkgtest.ubuntu.com/packages/o/ocrmypdf/zesty/s390x

-- System Information:
Debian Release: stretch/sid
  APT prefers testing
  APT policy: (900, 'testing')
Architecture: i386 (i686)

Kernel: Linux 4.8.0-2-686-pae (SMP w/2 CPU cores)
Locale: LANG=en_GB.utf8, LC_CTYPE=en_GB.utf8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)

Versions of packages liblept5 depends on:
ii  libc6            2.24-8
ii  libgif7          5.1.4-0.4
ii  libjpeg62-turbo  1:1.5.1-2
ii  libopenjp2-7     2.1.2-1
ii  libpng16-16      1.6.26-6
ii  libtiff5         4.0.7-1
ii  libwebp6         0.5.1-4
ii  zlib1g           1:1.2.8.dfsg-2+b3

liblept5 recommends no packages.

liblept5 suggests no packages.

-- no debconf information

-- 
Sean Whitton
Description: Fix endian detection across all build methods
 Unfortunately Leptonica has been broken on big endian systems when
 built with autotools because endianness.h was never used.
 
 This commit ensures it is generated by Autotools, CMake, and the
 static Makefile, so it has therefore been included unconditionally in
 alltypes.h. This will break the native Visual Studio build but that
 hasn't been maintained in a long time. For the static Makefile, the
 detection should now work under any Make implementation, not just GNU.
 
 With the detected endianness now preserved in a header, it is no
 longer necessary to define L_LITTLE_ENDIAN or L_BIG_ENDIAN manually
 when building against Leptonica. This is a substantial improvement as
 forgetting to do so would have resulted in broken behaviour on big
 endian systems despite a complete lack of errors or warnings. Note
 that it will still respect your choice if you do decide to define
 these manually.
 
 Building universal binaries for OS X should theoretically work with
 all build methods but this hasn't been tested. Feedback would be
 appreciated.
Author: James Le Cuirot 
Reviewed-by: Sean Whitton 
Origin: upstream, fd252ce0a17561b74f8cc02726601e5be121ac58
Forwarded: not-needed

---
diff --git a/.gitignore b/.gitignore
index 72aa855..4dc5b4b 100644
--- a/.gitignore
+++ b/.gitignore
@@ -4,3 +4,5 @@
 # build directories
 /build*
 /win*
+
+/src/endianness.h
diff --git a/endiantest.c b/endiantest.c
deleted file mode 100644
index e45f27f..000
--- a/endiantest.c
+++ /dev/null
@@ -1,50 +0,0 @@
-/**
- -  Copyright (C) 2001 Leptonica.  All rights reserved.
- -
- -  Redistribution and use in source and binary forms, with or without
- -  modification, are permitted provided that the following conditions
- -  are met:
- -  1. Redistributions of source code must retain the above copyright
- - notice, this list of conditions and the following disclaimer.
- -  2. Redistributions in binary form must reproduce the above
- - copyright notice, this list of conditions and the following
- - disclaimer in the documentation and/or other materials
- - provided with the distribution.
- -
- -  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
- -  ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
- -  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
- -  A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL ANY
- -  CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
- -  EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
- -  PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
- -  PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
- -  OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
- -  NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- -  SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- **/
-
-/*
- *endiantest.c
- *
- *This test was contributed by Bill Janssen.  When used with the
- *gnu compiler, it allows efficient computation of the endian
- *flag as part of the normal compilation process.  As a result,
- *it is not necessary to set this flag either manually or
- *through the configure Makefile generator.
- */
-
-#include 
-
-int main()
-{
-/*