Bug#849094: liblept5: Broken on s390x (+ other big endian archs)
On Wed, 25 Apr 2018 10:51:09 +0200 Graham Inggs wrote:
> I agree, and with the upload of leptonlib 1.75.3-4, the tests are now
> run during the build [3], and failure here will result in a failed
> build. Builds of leptonlib were successful on all architectures [4]
> except sparc64, which is notoriously sensitive to unaligned memory access.

Leptonica 1.76.0 built without problems and passed nearly all tests on sparc64. With a small fix for a misaligned memory read, only a single failing test remains:
https://github.com/DanBloomberg/leptonica/pull/342
Bug#849094: liblept5: Broken on s390x (+ other big endian archs)
Hello,

On Fri, May 04 2018, Jeff Breidenbach wrote:
> Tesseract in Debian just added a build time smoke test AND Stefan
> fixed the big-endian problem. Should be live tomorrow in Sid. Assuming
> that works well, Graham should be able to re-activate the disabled
> OCRMyPDF tests. Note that this is Debian only; Ubuntu 18.04 is out the
> door so no longer pertinent.

They weren't ever deactivated in Debian.

Thank you to everyone who made this patch happen! Great news!

-- 
Sean Whitton
Bug#849094: liblept5: Broken on s390x (+ other big endian archs)
Tesseract in Debian just added a build time smoke test AND Stefan fixed the big-endian problem. Should be live tomorrow in Sid. Assuming that works well, Graham should be able to re-activate the disabled OCRMyPDF tests. Note that this is Debian only; Ubuntu 18.04 is out the door so no longer pertinent.
Bug#849094: liblept5: Broken on s390x (+ other big endian archs)
Hello Graham,

On Wed, Apr 25 2018, Graham Inggs wrote:
> We agreed on marking the tests failing on big-endian XFAIL as a
> short-term solution.
>
> Sean, I'm not sure if you want to include this in the Debian packaging
> of ocrmypdf as well, but here [1] it is.

Thank you (and Jeff) for your work and for the patch. I would like to hear what upstream thinks of the xfail approach before uploading that to Debian.

>> Thinking about the failures, I suspect that the endian issues are now
>> within Tesseract not Leptonica.
>
> I agree, and with the upload of leptonlib 1.75.3-4, the tests are now
> run during the build [3], and failure here will result in a failed
> build. Builds of leptonlib were successful on all architectures [4]
> except sparc64, which is notoriously sensitive to unaligned memory
> access.

It seems like this bug should be reassigned, then.

-- 
Sean Whitton
Bug#849094: liblept5: Broken on s390x (+ other big endian archs)
Hi

On 24/04/2018 23:32, Jeff Breidenbach wrote:
> Recommend we talk in real time. Will send you contact details by
> private email.

Jeff, thanks for contacting me, it is much appreciated!

We agreed on marking the tests failing on big-endian XFAIL as a short-term solution.

Sean, I'm not sure if you want to include this in the Debian packaging of ocrmypdf as well, but here [1] it is.

On 25/04/2018 10:13, James R Barlow wrote:
> Great to see the most recent test run passed, even if it is with liberal
> application of "expect failure". The Canonical powers that be should be
> appeased for the moment. I appreciate the last minute effort to get this,
> and Tesseract, into the next Ubuntu.

In this case the Canonical power was Britney, and she can be a harsh mistress, but she was appeased, and ocrmypdf is marked for release in Bionic Beaver [2].

> Thinking about the failures, I suspect that the endian issues are now
> within Tesseract not Leptonica.

I agree, and with the upload of leptonlib 1.75.3-4, the tests are now run during the build [3], and failure here will result in a failed build. Builds of leptonlib were successful on all architectures [4] except sparc64, which is notoriously sensitive to unaligned memory access.

Regards
Graham

[1] https://launchpadlibrarian.net/367200176/ocrmypdf_6.1.2-1_6.1.2-1ubuntu1.diff.gz
[2] https://launchpad.net/ubuntu/+source/ocrmypdf
[3] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=895864#25
[4] https://buildd.debian.org/status/package.php?p=leptonlib=unstable
Bug#849094: liblept5: Broken on s390x (+ other big endian archs)
Great to see the most recent test run passed, even if it is with liberal application of "expect failure". The Canonical powers that be should be appeased for the moment. I appreciate the last minute effort to get this, and Tesseract, into the next Ubuntu.

Thinking about the failures, I suspect that the endian issues are now within Tesseract not Leptonica.

test_deskew passes, and this test skips Tesseract entirely. It uses Leptonica to deskew a monochrome image and confirm it was deskewed. I think it's extremely unlikely all that bit twiddling would work if Leptonica were in the wrong endian. (Although there could be individual Leptonica APIs that might not work on big endian.)

The failure that surprised me is "test_tesseract_config_notfound". It passes Tesseract a configuration file that doesn't exist, but it turns out Tesseract proceeds with OCR rather than aborting in this case, so this isn't informative.

Based on the failures I suspect the following command line will exit with SIGSEGV:

    tesseract -l eng -c textonly_pdf=1 --user-words wordlist.txt tests/resources/crom.png out pdf txt

where wordlist.txt is a file containing some words separated by newlines and tests/resources/crom.png is distributed with OCRmyPDF. (If it does work the PDF will be a blank page containing text with no image.)

Most of the failing tests have something to do with setting non-default configuration variables for Tesseract.
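To check the SIGSEGV suspicion from a script rather than by eye, something like the following minimal sketch could be used. It assumes a POSIX system, where Python's subprocess module reports a process killed by signal N as return code -N. The helper names build_command, describe_exit and run_reproducer are hypothetical; the Tesseract arguments are copied from the command line quoted in this message.

```python
import signal
import subprocess


def build_command(wordlist="wordlist.txt", image="tests/resources/crom.png"):
    """Return the tesseract command line from this message as an argv list."""
    return ["tesseract", "-l", "eng", "-c", "textonly_pdf=1",
            "--user-words", wordlist, image, "out", "pdf", "txt"]


def describe_exit(returncode):
    """Turn a subprocess return code into a short description (POSIX only)."""
    if returncode < 0:
        # On POSIX, subprocess reports "killed by signal N" as -N.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status %d" % returncode


def run_reproducer():
    """Run the command; needs tesseract and both input files to be present."""
    result = subprocess.run(build_command())
    return describe_exit(result.returncode)
```

Calling run_reproducer() on a big-endian host would then return "killed by SIGSEGV" if the suspicion is right, and "exited with status 0" if Tesseract completes normally.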
Bug#849094: liblept5: Broken on s390x (+ other big endian archs)
On 25.04.2018 at 03:37, Dan Bloomberg wrote:
> There's an endianness.h file in leptonica/src. Does it say BIG_ENDIAN
> or LITTLE_ENDIAN on your s390?

It says BIG_ENDIAN on my emulated S390X with Debian Testing.

Is arrayaccess.h correct for big endian machines? The 16 and 32 bit accessors look strange because they swap the words, but not the bytes.
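To illustrate the difference between swapping words and swapping bytes, here is a small sketch. It is not Leptonica's code, and whether arrayaccess.h behaves this way is not confirmed anywhere in this thread. It assumes image data is stored as 32-bit words in the host's native byte order, and that "logical" order means most-significant-first within each word. The function names are hypothetical.

```python
import struct


def byte_in_word_stream(raw, n, byteorder):
    """Return logical byte n (most-significant-first within each 32-bit
    word) from `raw`, which holds 32-bit words stored in `byteorder`."""
    if byteorder == "big":
        return raw[n]      # memory order already matches logical order
    return raw[n ^ 3]      # little endian: bytes are reversed within a word


def halfword_in_word_stream(raw, n, byteorder):
    """Return logical 16-bit value n (two per 32-bit word)."""
    fmt = ">" if byteorder == "big" else "<"
    # On little endian the two half-words of each word sit in swapped
    # positions, so only the *index* is adjusted (n ^ 1). The bytes inside
    # each half-word are handled by the native-order load itself.
    index = n if byteorder == "big" else n ^ 1
    return struct.unpack_from(fmt + "H", raw, 2 * index)[0]


# The same two 32-bit words laid out in memory for each byte order.
words = [0x11223344, 0xAABBCCDD]
big = struct.pack(">2I", *words)
little = struct.pack("<2I", *words)
```

Under these assumptions a 16-bit accessor that adjusts the index but never swaps bytes by hand gives the same values on both layouts, which is one possible reason such accessors look odd at first glance without being wrong.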
Bug#849094: liblept5: Broken on s390x (+ other big endian archs)
Given the date, it sounds like we have an emergency situation. I'm really stuck here. My only known access to a big endian machine is an emulator running Wheezy.

http://create.stephan-brumme.com/big-endian/

That's good enough for checking suspicious parts of Leptonica. I tried and found nothing. But it is not good enough to check Tesseract 4.0. Your OCRMyPdf s390 test provides strong evidence that Tesseract is broken on big endian.

http://autopkgtest.ubuntu.com/packages/o/ocrmypdf/bionic/s390x

However, if I wipe out that package on big-endian, it will cause a cascading dependency failure affecting hundreds of other packages. Not great to do 36 hours before release.

Recommend we talk in real time. Will send you contact details by private email.
Bug#849094: liblept5: Broken on s390x (+ other big endian archs)
Hi Jeff

On 3 January 2017 at 20:24, Jeff Breidenbach wrote:
> Tesseract 4 is known to not work on big endian. Stefan (on CC) is excited to
> take a look if someone can give him access to a big endian machine.

I note this bug is still open, so I assume this is still the case.

If so, would you please consider specifying only little-endian architectures in tesseract's debian/control?

This will prevent ocrmypdf from being installable on big-endian architectures, and this, in turn, will prevent ocrmypdf's testsuite from being run on s390x in Ubuntu and failing [1], which currently prevents ocrmypdf from being included in Ubuntu's 18.04 LTS release, scheduled for April 26.

Regards
Graham

[1] http://autopkgtest.ubuntu.com/packages/o/ocrmypdf/bionic/s390x
Bug#849094: liblept5: Broken on s390x (+ other big endian archs)
Sorry, I wasn't aware of the guest account thing. Probably my fault for not reading email carefully enough. I am a Debian Developer and will sponsor this request. Fill out the information "Information guest needs to supply to sponsoring DD" and I will sign it. https://dsa.debian.org/doc/guest-account/ Asking around, another option appears to be Oregon State University which provides access to a big endian PowerPC machine. I do not know which approach is easier. People who work on the Go computer language use this. http://osuosl.org/services/powerdev/request_hosting/
Bug#849094: liblept5: Broken on s390x (+ other big endian archs)
On 01/04/17 08:03, Graham Inggs wrote:
> On 3 January 2017 at 20:24, Jeff Breidenbach wrote:
>> Tesseract 4 is known to not work on big endian. Stefan (on CC) is excited to
>> take a look if someone can give him access to a big endian machine.
>
> It is possible for non-DDs to request temporary access to porterboxes, see
> https://dsa.debian.org/doc/guest-account/

"People who are not yet DMs or NMs will need to find a DD who is willing to sponsor their request".

That's what I tried to do.

Stefan
Bug#849094: liblept5: Broken on s390x (+ other big endian archs)
On 3 January 2017 at 20:24, Jeff Breidenbach wrote:
> Tesseract 4 is known to not work on big endian. Stefan (on CC) is excited to
> take a look if someone can give him access to a big endian machine.

It is possible for non-DDs to request temporary access to porterboxes, see
https://dsa.debian.org/doc/guest-account/
Bug#849094: liblept5: Broken on s390x (+ other big endian archs)
I've just uploaded 1.74.1-1 to Debian, which contains something similar to Sean's patch.
Bug#849094: liblept5: Broken on s390x (+ other big endian archs)
Tesseract 4 is known to not work on big endian. Stefan (on CC) is excited to take a look if someone can give him access to a big endian machine. There are no known endian problems with Tesseract 3 or Leptonica, but if any are definitively found they will get immediate attention. I am not going to apply this patch in Debian right now. Instead will send it upstream for consideration. https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=849094
Bug#849094: liblept5: Broken on s390x (+ other big endian archs)
I'm the ocrmypdf upstream author.

First, be aware that the output of OCR and autorotate is cached in the test suite and the results are persisted between test cases and runs of the test suite in the tests/cache folder. The cache hit/miss check is not smart enough to pick up changes that aren't reflected in leptonica's version number, that is, debian changes. However, it looks to me like the test suite is being run to target a temporary folder and that should remove cache effects. Nuke the test/cache folder between test suite runs to be sure.

All the failing tests relate to "check_monochrome_correlation", a function that checks for close but not identical visual output compared to a reference. Because of a now-fixed leptonica bug in one of the underlying functions, I actually have a separate test that validates this helper function, and that passes on big endian.

The log shows that tesseract failed to properly detect page orientation and came back with a low confidence answer. I interpret that to mean there are endian issues in either tesseract or leptonica; the test isn't able to distinguish.

It seems that the problem may be either a big endian issue in tesseract alone (perhaps affecting multiple versions, since tesseract does not have much of a test suite) or it's some leptonica API that tesseract invokes while doing a page orientation check. Tesseract's test suite is very limited and probably doesn't check for consistency here.

It looks like the patch is safe to apply and would be a net improvement even though it doesn't fix all of the issues my test suite finds.
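For readers unfamiliar with this kind of check, the following is a rough sketch of a foreground correlation between two monochrome images. It is not the test suite's implementation: the formula used here, (|A AND B|)^2 / (|A| * |B|), is an assumption chosen for illustration, and the function name binary_correlation and the row-of-0/1 image representation are hypothetical.

```python
def binary_correlation(a, b):
    """Correlation in [0, 1] of two same-sized monochrome images.

    Each image is a list of rows, each row a list of 0/1 pixels, where 1
    is foreground. Assumed formula: (|A AND B|)^2 / (|A| * |B|).
    """
    count_a = sum(sum(row) for row in a)
    count_b = sum(sum(row) for row in b)
    if count_a == 0 or count_b == 0:
        return 0.0  # no foreground in one image: nothing to correlate
    overlap = sum(pa & pb
                  for row_a, row_b in zip(a, b)
                  for pa, pb in zip(row_a, row_b))
    return overlap * overlap / (count_a * count_b)
```

With a measure like this, identical pages score 1.0 and pages whose foreground pixels do not overlap score 0.0, so a threshold well below 1.0 tolerates small rendering differences while still catching a page left in the wrong orientation.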
You can check orientation (skipping full OCR) in tesseract 3.04.01 with:

    $ tesseract -l eng -psm 0 test_image.png stdout

The output for LinnSequencer.jpg on my macOS-x64 machine is:

    $ tesseract -l eng -psm 0 tests/resources/LinnSequencer.jpg stdout
    Warning in pixReadMemJpeg: work-around: writing to a temp file
    Page number: 0
    Orientation in degrees: 0
    Rotate: 0
    Orientation confidence: 31.48
    Script: Latin
    Script confidence: 100.95

From the logs, tesseract reports (orientation, confidence) = (0, 1.32) for the same page on big endian, which means whatever data it is examining is much noisier, i.e. probably corrupted by endian swizzling. Quite likely the OCR output is garbage as well.

It might be interesting to see what the behavior differences are for leptonica 1.73-patched, 1.74 and tesseract 3.04.01 and 4.00alpha all on big endian. The results matrix from those combinations would probably indicate whether to blame tesseract or leptonica.
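If someone does build that results matrix, the (orientation, confidence) pair can be pulled out of each run with a few lines of Python. This is only a sketch: it assumes the report uses the same field labels as the output quoted in this message, and the names parse_orientation and SAMPLE are hypothetical.

```python
import re


def parse_orientation(output):
    """Extract (orientation_degrees, confidence) from the text report
    printed by the orientation check quoted in this message."""
    degrees = re.search(r"Orientation in degrees:\s*(\d+)", output)
    confidence = re.search(r"Orientation confidence:\s*([0-9.]+)", output)
    if degrees is None or confidence is None:
        raise ValueError("no orientation report found")
    return int(degrees.group(1)), float(confidence.group(1))


# The little-endian report quoted in this message, used as sample input.
SAMPLE = """Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 31.48
Script: Latin
Script confidence: 100.95
"""
```

Running parse_orientation on the output of each leptonica/tesseract combination gives one cell of the matrix per run, which makes the little-endian and big-endian confidence values easy to compare side by side.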
Bug#849094: liblept5: Broken on s390x (+ other big endian archs)
Control: tags -1 - patch

I've just tried running ocrmypdf's test suite against the recently released leptonlib 1.74.0 on powerpc and I get the same results I did with 1.73 and Sean's patch, i.e. the following three tests fail:

test_autorotate[hocr]
test_autorotate[tesseract]
test_autorotate_threshold_low

Untagging the patch as it is incomplete.
Bug#849094: liblept5: Broken on s390x (+ other big endian archs)
I built leptonlib 1.73-6 including Sean's patch on powerpc and s390x. I then ran ocrmypdf's test suite against it.

Test results went from:

tests/test_hocrtransform.py .
tests/test_main.py ...F..ss.F.
tests/test_pageinfo.py

to:

tests/test_hocrtransform.py .
tests/test_main.py ..ss...FFF.
tests/test_pageinfo.py

Tests on little-endian architectures remained successful, so it seems to be a step in the right direction, but we aren't quite there yet. Output of new failing tests attached.

= test session starts ==
platform linux -- Python 3.5.2+, pytest-2.9.2, py-1.4.31, pluggy-0.3.1
rootdir: /data/adttmp/autopkgtest-virt-lxc.shared.hw7c_x0x/downtmp/build.hfb/ocrmypdf-4.3.4, inifile: pytest.ini
collected 85 items

test_requirements.txt s
tests/test_hocrtransform.py .
tests/test_main.py ..ss...FFF.
tests/test_pageinfo.py

=== FAILURES ===
 test_autorotate[hocr] _

spoof_tesseract_cache = {'ADTTMP': '/data/adttmp/autopkgtest-virt-lxc.shared.hw7c_x0x/downtmp/autopkgtest_tmp', 'ADT_ARTIFACTS': '/data/adttmp...untu1', 'AUTOPKGTEST_ARTIFACTS': '/data/adttmp/autopkgtest-virt-lxc.shared.hw7c_x0x/downtmp/test-suite-artifacts', ...}
renderer = 'hocr'

    @pytest.mark.parametrize('renderer', [
        'hocr',
        'tesseract',
        ])
    def test_autorotate(spoof_tesseract_cache, renderer):
        # cardinal.pdf contains four copies of an image rotated in each cardinal
        # direction - these ones are "burned in" not tagged with /Rotate
        out = check_ocrmypdf('cardinal.pdf', 'test_autorotate_%s.pdf' % renderer,
                             '-r', '-v', '1', env=spoof_tesseract_cache)
        for n in range(1, 4+1):
            correlation = check_monochrome_correlation(
                reference_pdf=_infile('cardinal.pdf'),
                reference_pageno=1,
                test_pdf=out,
                test_pageno=n)
>           assert correlation > 0.80
E           assert 0.0562746599316597 > 0.8

tests/test_main.py:401: AssertionError
- Captured stdout call -
DEBUG - ocrmypdf 4.3.4
DEBUG - os.symlink(/data/adttmp/autopkgtest-virt-lxc.shared.hw7c_x0x/downtmp/build.hfb/ocrmypdf-4.3.4/tests/resources/cardinal.pdf, /tmp/com.github.ocrmypdf.5hxi1p4t/origin)
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.5hxi1p4t/origin, /tmp/com.github.ocrmypdf.5hxi1p4t/origin.pdf) DEBUG - [{'images': [{'dpi_h': Decimal('300.000'), 'type': 'image', 'width': 2550, 'bpc': 1, 'color': 'gray', 'dpi': Decimal('300.000'), 'name': '/Im0', 'comp': 1, 'dpi_w': Decimal('300.000'), 'enc': 'jbig2', 'height': 3300}], 'height_inches': Decimal('11'), 'rotate': 0, 'has_text': False, 'width_pixels': 2550, 'width_inches': Decimal('8.5'), 'xres': Decimal('300.000'), 'yres': Decimal('300.000'), 'pageno': 0, 'height_pixels': 3300}, {'images': [{'dpi_h': Decimal('300.000'), 'type': 'image', 'width': 2550, 'bpc': 1, 'color': 'gray', 'dpi': Decimal('300.000'), 'name': '/Im0', 'comp': 1, 'dpi_w': Decimal('300.000'), 'enc': 'jbig2', 'height': 3300}], 'height_inches': Decimal('8.5'), 'rotate': 0, 'has_text': False, 'width_pixels': 3300, 'width_inches': Decimal('11'), 'xres': Decimal('300.000'), 'yres': Decimal('300.000'), 'pageno': 1, 'height_pixels': 2550}, {'images': [{'dpi_h': Decimal('300.000'), 'type': 'image', 'width': 2550, 'bpc': 1, 'color': 'gray', 'dpi': Decimal('300.000'), 'name': '/Im0', 'comp': 1, 'dpi_w': Decimal('300.000'), 'enc': 'jbig2', 'height': 3300}], 'height_inches': Decimal('11'), 'rotate': 0, 'has_text': False, 'width_pixels': 2550, 'width_inches': Decimal('8.5'), 'xres': Decimal('300.000'), 'yres': Decimal('300.000'), 'pageno': 2, 'height_pixels': 3300}, {'images': [{'dpi_h': Decimal('300.000'), 'type': 'image', 'width': 2550, 'bpc': 1, 'color': 'gray', 'dpi': Decimal('300.000'), 'name': '/Im0', 'comp': 1, 'dpi_w': Decimal('300.000'), 'enc': 'jbig2', 'height': 3300}], 'height_inches': Decimal('8.5'), 'rotate': 0, 'has_text': False, 'width_pixels': 3300, 'width_inches': Decimal('11'), 'xres': Decimal('300.000'), 'yres': Decimal('300.000'), 'pageno': 3, 'height_pixels': 2550}] DEBUG - os.symlink(/tmp/com.github.ocrmypdf.5hxi1p4t/01.page.pdf, /tmp/com.github.ocrmypdf.5hxi1p4t/01.ocr.page.pdf) DEBUG - 
os.symlink(/tmp/com.github.ocrmypdf.5hxi1p4t/02.page.pdf, /tmp/com.github.ocrmypdf.5hxi1p4t/02.ocr.page.pdf) DEBUG - os.symlink(/tmp/com.github.ocrmypdf.5hxi1p4t/03.page.pdf, /tmp/com.github.ocrmypdf.5hxi1p4t/03.ocr.page.pdf) DEBUG - os.symlink(/tmp/com.github.ocrmypdf.5hxi1p4t/04.page.pdf,
Bug#849094: liblept5: Broken on s390x (+ other big endian archs)
Package: liblept5
Version: 1.73-6
Severity: normal
Tags: patch

Dear maintainer,

liblept looks to be broken on big endian architectures. This was discovered by means of the OCRmyPDF test suite. It's failing on s390x,[1] the broken files are emitted at the stage where OCRmyPDF invokes liblept code, and the broken files are highly suggestive of endianness issues.

I believe the attached backported patch will fix the problem, though I've only been able to confirm that it doesn't break building the package.

(Many thanks to James R. Barlow, OCRmyPDF's upstream author, for examining the broken files, and to Mattia Rizzolo for running the tests on an s390x porterbox.)

[1] http://autopkgtest.ubuntu.com/packages/o/ocrmypdf/zesty/s390x

-- System Information:
Debian Release: stretch/sid
  APT prefers testing
  APT policy: (900, 'testing')
Architecture: i386 (i686)

Kernel: Linux 4.8.0-2-686-pae (SMP w/2 CPU cores)
Locale: LANG=en_GB.utf8, LC_CTYPE=en_GB.utf8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)

Versions of packages liblept5 depends on:
ii  libc6            2.24-8
ii  libgif7          5.1.4-0.4
ii  libjpeg62-turbo  1:1.5.1-2
ii  libopenjp2-7     2.1.2-1
ii  libpng16-16      1.6.26-6
ii  libtiff5         4.0.7-1
ii  libwebp6         0.5.1-4
ii  zlib1g           1:1.2.8.dfsg-2+b3

liblept5 recommends no packages.

liblept5 suggests no packages.

-- no debconf information

-- 
Sean Whitton

Description: Fix endian detection across all build methods
 Unfortunately Leptonica has been broken on big endian systems when built with autotools because endianness.h was never used. This commit ensures it is generated by Autotools, CMake, and the static Makefile, so it has therefore been included unconditionally in alltypes.h. This will break the native Visual Studio build but that hasn't been maintained in a long time. For the static Makefile, the detection should now work under any Make implementation, not just GNU.
 With the detected endianness now preserved in a header, it is no longer necessary to define L_LITTLE_ENDIAN or L_BIG_ENDIAN manually when building against Leptonica. This is a substantial improvement as forgetting to do so would have resulted in broken behaviour on big endian systems despite a complete lack of errors or warnings. Note that it will still respect your choice if you do decide to define these manually.
 .
 Building universal binaries for OS X should theoretically work with all build methods but this hasn't been tested. Feedback would be appreciated.
Author: James Le Cuirot
Reviewed-by: Sean Whitton
Origin: upstream, fd252ce0a17561b74f8cc02726601e5be121ac58
Forwarded: not-needed
---
diff --git a/.gitignore b/.gitignore
index 72aa855..4dc5b4b 100644
--- a/.gitignore
+++ b/.gitignore
@@ -4,3 +4,5 @@
 # build directories
 /build*
 /win*
+
+/src/endianness.h
diff --git a/endiantest.c b/endiantest.c
deleted file mode 100644
index e45f27f..000
--- a/endiantest.c
+++ /dev/null
@@ -1,50 +0,0 @@
-/**
- - Copyright (C) 2001 Leptonica. All rights reserved.
- -
- - Redistribution and use in source and binary forms, with or without
- - modification, are permitted provided that the following conditions
- - are met:
- - 1. Redistributions of source code must retain the above copyright
- - notice, this list of conditions and the following disclaimer.
- - 2. Redistributions in binary form must reproduce the above
- - copyright notice, this list of conditions and the following
- - disclaimer in the documentation and/or other materials
- - provided with the distribution.
- -
- - THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
- - ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
- - LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
- - A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL ANY
- - CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
- - EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
- - PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
- - PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
- - OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
- - NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- - SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- **/
-
-/*
- *endiantest.c
- *
- *This test was contributed by Bill Janssen. When used with the
- *gnu compiler, it allows efficient computation of the endian
- *flag as part of the normal compilation process. As a result,
- *it is not necessary to set this flag either manually or
- *through the configure Makefile generator.
- */
-
-#include
-
-int main()
-{
-/*