Bug#888917: ocrmypdf fails to run it's testsuite

2018-03-26 Thread James R Barlow
v6.0.0 should fix this issue, as it includes a cache that allows most OCR
to be skipped.


Bug#888917: ocrmypdf fails to run it's testsuite

2018-02-15 Thread Sean Whitton
control: tag -1 +upstream

Hello James,

That's some impressive detective work.  Thank you for taking the time to
write up your conclusions.  I assume you don't need an upstream bug
report, but let me know if you want me to file one.

-- 
Sean Whitton




Bug#888917: ocrmypdf fails to run it's testsuite

2018-02-15 Thread James R Barlow
Tesseract 4 is now in Debian unstable. When running on a processor that
lacks the AVX2 extensions (introduced with the Intel Haswell microarchitecture,
around 2013), it falls back on a slower code path (SSE, I believe), which is
so much slower that the tests regularly hit the timeout. (Some of these
failures are less than graceful and I will fix that.)

The reason I think this is the case is that ci-worker[01,02].debian.net and
Sean's laptop consistently fail, and they fail on different tests at
different times, always by hitting timeouts. My 2013 desktop, which has a
Haswell processor, and Matthias' new laptop are fine.
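
One way to check whether a given machine is likely to hit the slow path is to look for the avx2 flag in /proc/cpuinfo. The following is a minimal sketch, assuming a Linux x86 host; the function names are made up for illustration and are not part of ocrmypdf or Tesseract.

```python
def cpu_has_flag(cpuinfo_text, flag):
    """Return True if any 'flags' line in /proc/cpuinfo-style text lists `flag`."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags" and flag in value.split():
            return True
    return False


def host_has_avx2(path="/proc/cpuinfo"):
    """Check the running host; returns False if the file cannot be read."""
    try:
        with open(path, encoding="utf-8", errors="replace") as fh:
            return cpu_has_flag(fh.read(), "avx2")
    except OSError:
        return False
```

Calling host_has_avx2() on a CI worker or laptop would then indicate which of the two groups described above the machine probably belongs to.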

For the CI workers I looked over every pass and failure back to Jan 30, and
every failing test log came from worker 01 or 02. It wouldn't be surprising
for the lowest numbered boxes to be the oldest ones.
https://ci.debian.net/packages/o/ocrmypdf/unstable/amd64/

To confirm, I compiled a version of Tesseract 4 with AVX2 disabled and ran
it with the "best quality" training set. The results were as follows (the
ratios are what matter).

Tesseract 4, AVX2, best quality training data: 5s
Tesseract 4, AVX2 disabled, best quality training data: 32s
Tesseract 4, AVX2 disabled, fast training data: 10s
Tesseract 3.05: 4s
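
To make the ratios explicit, the timings above can be normalised against the AVX2 build. This is only arithmetic on the four figures quoted; the dictionary labels are shortened for illustration.

```python
# Timings in seconds, copied from the measurements above.
timings = {
    "Tesseract 4, AVX2, best": 5,
    "Tesseract 4, no AVX2, best": 32,
    "Tesseract 4, no AVX2, fast": 10,
    "Tesseract 3.05": 4,
}

baseline = timings["Tesseract 4, AVX2, best"]
slowdown = {name: secs / baseline for name, secs in timings.items()}

for name, ratio in slowdown.items():
    print(f"{name}: {ratio:.1f}x")
# Tesseract 4, AVX2, best: 1.0x
# Tesseract 4, no AVX2, best: 6.4x
# Tesseract 4, no AVX2, fast: 2.0x
# Tesseract 3.05: 0.8x
```

So a test whose timeout was chosen with the AVX2 build in mind can take more than six times as long on a machine without AVX2.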

So I will need to fix this because the test suite should be consistent even
if Tesseract isn't. I'll revise how the existing test cache works so that I
can ship precalculated OCR files with it.
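
The idea of reusing precalculated OCR results could look roughly like the sketch below. This is not ocrmypdf's actual cache: cache_key, cached_ocr and run_ocr are hypothetical names, and keying on a SHA-256 of the input plus the arguments is an assumption made for the example.

```python
import hashlib
from pathlib import Path


def cache_key(input_bytes, args):
    """Derive a stable key from the input bytes and the OCR arguments."""
    h = hashlib.sha256()
    h.update(input_bytes)
    for arg in args:
        h.update(b"\0" + arg.encode("utf-8"))
    return h.hexdigest()


def cached_ocr(cache_dir, input_bytes, args, run_ocr):
    """Return cached output if present; otherwise call run_ocr and store it.

    run_ocr is a hypothetical callable (input_bytes, args) -> bytes that
    stands in for the real, slow Tesseract invocation.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry = cache_dir / cache_key(input_bytes, args)
    if entry.exists():
        return entry.read_bytes()
    output = run_ocr(input_bytes, args)
    entry.write_bytes(output)
    return output
```

If the cache directory is shipped with the test suite already populated, the slow call is skipped for every input that has an entry, regardless of how fast Tesseract is on the host.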





Bug#888917: ocrmypdf fails to run it's testsuite

2018-02-12 Thread Sean Whitton
control: retitle -1 Test suite failures

Hello James,

On Fri, Feb 02 2018, James R. Barlow wrote:

> Do you think you could take a few minutes to identify which test is
> taking this long and report it? This may be an upstream bug, if some
> input triggers an infinite loop.

I ran the test suite on one of Debian's machines, in an up-to-date
Debian unstable chroot.  It took 100 minutes and there were many
failures.  Some of the tests failed due to timeouts, and some of them
failed for other reasons.  I'm attaching the full log.

I see you have released 5.6.0, but from the release notes it seems
likely there would be the same failures.

Please let me know if you still need me to run individual tests and see
how long they take.

> I have my suspicions. My guess is that:
>
> pytest tests/test_qpdf.py # will never finish
>
> and
>
> pytest -n0 tests/test_qpdf.py # will fail in 15 seconds
>
> If so, you might have qpdf < 7.0.0 and upgrading to qpdf >= 7.0.0 will
> fix it.

We have qpdf 7.1.1 in Debian unstable right now, so this can't be it.

-- 
Sean Whitton


ocrmypdf_5.5_tests.log
Description: Binary data




Bug#888917: ocrmypdf fails to run it's testsuite

2018-02-04 Thread Sean Whitton
Hello,

On Fri, Feb 02 2018, James R Barlow wrote:

> Do you think you could take a few minutes to identify which test is
> taking this long and report it? This may be an upstream bug, if some
> input triggers an infinite loop.
>
> I have my suspicions. [...]

Thanks for the hints.  I will attempt to reproduce the hang at some
point in the next few weeks, and report back, with an upstream bug
report if I can pinpoint the issue.

-- 
Sean Whitton




Bug#888917: ocrmypdf fails to run it's testsuite

2018-02-04 Thread Sean Whitton
control: tag -1 -moreinfo

Hello,

On Thu, Feb 01 2018, Matthias Klose wrote:

>> 4) Your implicit comment that I lied in the changelog and disabled
>> the test suite because I knew it would fail is entirely uncalled for.
>> Please do not treat fellow package maintainers like that.
>
> well, looking at the changelog is the way to see what is changed, and
> sometimes why. I didn't see that in 2). But yes, I was feeling that
> you didn't tell everything.

I'll add some more text to the 5.5-2 changelog entry in the next upload.

> Now reduced the severity. Feel free to close it if you think that any
> tests during the build are not necessary.

I'll attempt to reproduce the various failures you report in your most
recent message on a porterbox.  I'll keep the bug open until then, at
least.

-- 
Sean Whitton




Bug#888917: ocrmypdf fails to run it's testsuite

2018-02-02 Thread James R Barlow
Hello Sean,

On Wed, 31 Jan 2018 22:06:42 -0700 Sean Whitton wrote:
> I further suspect that the test suite took 30 seconds only because so
> many tests failed early.  In recent upstream versions, the test suite
> has never finished running on my laptop after leaving it for multiple
> hours.  When you run the test suite on a totally ordinary file system,
> please report how long it takes, and whether your laptop is very
> new/high spec.

Do you think you could take a few minutes to identify which test is taking
this long and report it? This may be an upstream bug, if some input
triggers an infinite loop.
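
For anyone trying to identify the slow tests, one approach is to have pytest write a JUnit XML report (its --junitxml option) and sort the test cases by their recorded time. The sketch below assumes such a report is available; slowest_tests is a name chosen for illustration.

```python
import xml.etree.ElementTree as ET


def slowest_tests(junit_xml_text, top=5):
    """Return the `top` slowest (seconds, test id) pairs from a JUnit XML report."""
    root = ET.fromstring(junit_xml_text)
    timed = []
    for case in root.iter("testcase"):
        seconds = float(case.get("time", "0"))
        test_id = f"{case.get('classname', '')}::{case.get('name', '')}"
        timed.append((seconds, test_id))
    timed.sort(reverse=True)
    return timed[:top]
```

pytest's own --durations option reports similar information directly. Either way, a test that truly never finishes will prevent the report from being written, so this only helps when tests end or are killed by a timeout.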

I have my suspicions. My guess is that:

pytest tests/test_qpdf.py # will never finish

and

pytest -n0 tests/test_qpdf.py # will fail in 15 seconds

If so, you might have qpdf < 7.0.0 and upgrading to qpdf >= 7.0.0 will fix
it.

But I'd appreciate if you can confirm.
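
To tell a hang apart from a quick failure without watching the terminal, the two commands above can be run under a time limit. This is a minimal sketch using only the standard library; classify_run is a hypothetical helper, and the limit to use is up to whoever runs it.

```python
import subprocess


def classify_run(cmd, timeout_s):
    """Run cmd and return 'timeout', 'ok', or 'failed' depending on the outcome."""
    try:
        result = subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return "timeout"
    return "ok" if result.returncode == 0 else "failed"
```

For example, classify_run(["pytest", "-n0", "tests/test_qpdf.py"], 600) would return "failed" if the run fails within ten minutes and "timeout" if it is still running after that. The -n option comes from the pytest-xdist plugin; when it spawns worker processes, those workers may outlive the killed parent, so some manual cleanup may be needed after a timeout.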

Thanks.


Bug#888917: ocrmypdf fails to run it's testsuite

2018-02-01 Thread Matthias Klose
Control: severity -1 important

On 01.02.2018 06:06, Sean Whitton wrote:
> control: tag -1 +moreinfo
> 
> Dear Matthias,
> 
> On Wed, Jan 31 2018, Matthias Klose wrote:
> 
>> The recent changelog reads:
>>
>>   * Disable test suite at package build time.
>>     It now takes a prohibitively long time to run, so we are relying on
>>     autopkgtest instead.
>>
>> Sorry, but this is one of the most lame excuses I have ever seen. Trying
>> to run it on my laptop in unstable needs 30 seconds.  However re-enabling
>> it and running it reveals
>>
>>   === 122 failed, 24 passed, 4 skipped in 33.92 seconds
>>
>> these results are after adding tesseract-ocr qpdf unpaper as build
>> dependencies.
> 
> Looking at the errors, I strongly suspect that this is because you are
> running the test suite on a tmpfs -- we have seen these permission
> errors before under those conditions.  Could you try running the test
> suite on a totally ordinary file system, please?

I tried running these in a chroot created with debootstrap, and manually
entering the chroot.

[sid]
type=directory
description=debian (sid)
directory=/srv/chroot/sid
users=doko
groups=sbuild

So this should be on an ordinary file system, unless the test suite uses a
/tmp that is mounted as a tmpfs?
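
Whether a given directory sits on a tmpfs can be checked from /proc/mounts. The following is a minimal sketch, assuming a Linux system where /proc is mounted; fs_type_for is a name made up for illustration, and mount points containing spaces (which /proc/mounts escapes) are not handled.

```python
import os


def fs_type_for(path, mounts_text):
    """Return the filesystem type of the longest mount point containing path.

    mounts_text is /proc/mounts-style text: device, mount point, type, options.
    """
    path = os.path.abspath(path)
    best_point, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        point, fstype = fields[1], fields[2]
        prefix = point.rstrip("/") + "/"
        inside = path == point or path.startswith(prefix)
        # ">=" lets a later mount on the same point override an earlier one.
        if inside and len(point) >= len(best_point):
            best_point, best_type = point, fstype
    return best_type
```

Passing tempfile.gettempdir() and the contents of /proc/mounts would show whether the default temporary directory, which pytest uses for its temporary paths, is a tmpfs. Whether the ocrmypdf tests write there is an assumption that would still need checking.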

> I further suspect that the test suite took 30 seconds only because so
> many tests failed early.  In recent upstream versions, the test suite
> has never finished running on my laptop after leaving it for multiple
> hours.  When you run the test suite on a totally ordinary file system,
> please report how long it takes, and whether your laptop is very
> new/high spec.

well, it's a new one, two cores.

> I note that Policy does not require that a package be buildable under a
> tmpfs, and certainly does not require that its test suite run under a
> tmpfs.
> 
>> doubting that the primary reason for this change was build time ...
> 
> Several things:
> 
> 1) I ran the test suite using deb-o-matic[1] before uploading.  Needless
>    to say, I would not have uploaded had there been failures.

which only runs on amd64 afaik.

> 2) I should have mentioned in the changelog that another reason for this
>    change was to reduce the number of heavy build dependencies.
> 
>    A further reason is that it reduces the amount of fragile code in
>    d/rules needed to get the test suite running -- upstream's test suite
>    is designed to be run on the installed package.
> 
> 3) I am of the view that very heavy test suites are better run under
>    autopkgtest.  We will soon have testing migration gating on
>    autopkgtest, and it is not clear to me that it makes sense for the
>    process of stitching the .deb to abort when a single integration test
>    fails.

I'm not sure about that.  I dislike packages, like the whole of KDE, which
disable testing during the build and then rebuild and run the tests in the
autopkgtests.  That doesn't keep broken packages out of the archive.

>    (Ideally tests would be separated into those that should abort the
>    build and those that should not, but in the absence of this work
>    being done, it is reasonable not to run any of them.)

I'm a bit biased here, because I saw the autopkgtest failures first in
launchpad:
http://autopkgtest.ubuntu.com/packages/o/ocrmypdf/bionic/amd64

https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-bionic/bionic/amd64/o/ocrmypdf/20180130_155249_1faa5@/log.gz

but yes, there seem to be fewer failures

> 4) Your implicit comment that I lied in the changelog and disabled the
>    test suite because I knew it would fail is entirely uncalled for.
>    Please do not treat fellow package maintainers like that.

well, looking at the changelog is the way to see what is changed, and sometimes
why. I didn't see that in 2). But yes, I was feeling that you didn't tell
everything.

Now reduced the severity. Feel free to close it if you think that any tests
during the build are not necessary.



Bug#888917: ocrmypdf fails to run it's testsuite

2018-01-31 Thread Sean Whitton
Hello James,

On Wed, Jan 31 2018, James R Barlow wrote:

> Upstream here.

Thanks for the info.

> The reason the suite fails like that is that mandatory-for-testing
> dependencies were also removed.
>
> The test suite runs on Travis CI in 10-12 minutes. On Debian CI, 15
> minutes. For comparison ffmpeg, another compute intensive CLI program,
> takes 10 minutes.
>
> This is an OCR program and OCR takes a long time. There are
> opportunities to speed up testing on my end but no low hanging fruit
> without removing tests. I've done the obvious: use all cores, use
> caches and dummies where possible. Some OCR on the fly is essential
> because Tesseract is complex enough that output is not identical
> across platforms.
>
> Preserving the dynamically created tests/cache/ folder between test
> runs, if possible in Debian CI, would speed it up a lot.

Unfortunately not possible.

> I could mark a subset of essential tests for packagers so that Debian
> CI can specify it only wants those. There's a number of tests that are
> very unlikely to pass upstream testing (macOS and Ubuntu) then somehow
> fail downstream in Debian.

Just to be clear, this bug is about the tests run during the package
build, which is completely independent of Debian CI (in our terminology,
"autopkgtest" refers to Debian CI).

-- 
Sean Whitton




Bug#888917: ocrmypdf fails to run it's testsuite

2018-01-31 Thread Sean Whitton
control: tag -1 +moreinfo

Dear Matthias,

On Wed, Jan 31 2018, Matthias Klose wrote:

> The recent changelog reads:
>
>   * Disable test suite at package build time.
>     It now takes a prohibitively long time to run, so we are relying on
>     autopkgtest instead.
>
> Sorry, but this is one of the most lame excuses I have ever seen. Trying
> to run it on my laptop in unstable needs 30 seconds.  However re-enabling
> it and running it reveals
>
>   === 122 failed, 24 passed, 4 skipped in 33.92 seconds
>
> these results are after adding tesseract-ocr qpdf unpaper as build
> dependencies.

Looking at the errors, I strongly suspect that this is because you are
running the test suite on a tmpfs -- we have seen these permission
errors before under those conditions.  Could you try running the test
suite on a totally ordinary file system, please?

I further suspect that the test suite took 30 seconds only because so
many tests failed early.  In recent upstream versions, the test suite
has never finished running on my laptop after leaving it for multiple
hours.  When you run the test suite on a totally ordinary file system,
please report how long it takes, and whether your laptop is very
new/high spec.

I note that Policy does not require that a package be buildable under a
tmpfs, and certainly does not require that its test suite run under a
tmpfs.

> doubting that the primary reason for this change was build time ...

Several things:

1) I ran the test suite using deb-o-matic[1] before uploading.  Needless
   to say, I would not have uploaded had there been failures.

2) I should have mentioned in the changelog that another reason for this
   change was to reduce the number of heavy build dependencies.

   A further reason is that it reduces the amount of fragile code in
   d/rules needed to get the test suite running -- upstream's test suite
   is designed to be run on the installed package.

3) I am of the view that very heavy test suites are better run under
   autopkgtest.  We will soon have testing migration gating on
   autopkgtest, and it is not clear to me that it makes sense for the
   process of stitching the .deb to abort when a single integration test
   fails.

   (Ideally tests would be separated into those that should abort the
   build and those that should not, but in the absence of this work
   being done, it is reasonable not to run any of them.)

4) Your implicit comment that I lied in the changelog and disabled the
   test suite because I knew it would fail is entirely uncalled for.
   Please do not treat fellow package maintainers like that.

[1]  http://debomatic-amd64.debian.net/

-- 
Sean Whitton




Bug#888917: ocrmypdf fails to run it's testsuite

2018-01-31 Thread James R Barlow
Upstream here.

The reason the suite fails like that is that mandatory-for-testing
dependencies were also removed.

The test suite runs on Travis CI in 10-12 minutes and on Debian CI in 15
minutes. For comparison, ffmpeg, another compute-intensive CLI program,
takes 10 minutes.

This is an OCR program and OCR takes a long time. There are opportunities
to speed up testing on my end but no low hanging fruit without removing
tests. I've done the obvious: use all cores, use caches and dummies where
possible. Some OCR on the fly is essential because Tesseract is complex
enough that output is not identical across platforms.

Preserving the dynamically created tests/cache/ folder between test runs,
if possible in Debian CI, would speed it up a lot.

I could mark a subset of essential tests for packagers so that Debian CI
can specify it only wants those. There are a number of tests that are very
unlikely to pass upstream testing (macOS and Ubuntu) and then somehow fail
downstream in Debian.


Bug#888917: ocrmypdf fails to run it's testsuite

2018-01-30 Thread Matthias Klose
Package: src:ocrmypdf
Version: 5.5-2
Severity: serious
Tags: sid buster

The recent changelog reads:

  * Disable test suite at package build time.
    It now takes a prohibitively long time to run, so we are relying on
    autopkgtest instead.

Sorry, but this is one of the most lame excuses I have ever seen. Trying to run
it on my laptop in unstable needs 30 seconds.  However re-enabling it and
running it reveals

  === 122 failed, 24 passed, 4 skipped in 33.92 seconds 

these results are after adding tesseract-ocr qpdf unpaper as build dependencies.

doubting that the primary reason for this change was build time ...

dpkg-buildpackage: info: source package ocrmypdf
dpkg-buildpackage: info: source version 5.5-2
dpkg-buildpackage: info: source distribution unstable
dpkg-buildpackage: info: source changed by Sean Whitton 

 dpkg-source --before-build ocrmypdf-5.5
dpkg-buildpackage: info: host architecture amd64
dpkg-source: info: using options from ocrmypdf-5.5/debian/source/options: --single-debian-patch --auto-commit --extend-diff-ignore=\.git_archival\.txt
 fakeroot debian/rules clean
dh clean --with python3,sphinxdoc --buildsystem=pybuild
   dh_auto_clean -O--buildsystem=pybuild
I: pybuild base:184: python3.6 setup.py clean
Skipping external program tests because of --force
running clean
removing '/home/packages/tmp/ocrmypdf-5.5/.pybuild/pythonX.Y_3.6/build' (and everything under it)
'build/bdist.linux-amd64' does not exist -- can't clean it
'build/scripts-3.6' does not exist -- can't clean it
   dh_clean -O--buildsystem=pybuild
 debian/rules build
dh build --with python3,sphinxdoc --buildsystem=pybuild
   dh_update_autotools_config -O--buildsystem=pybuild
   dh_autoreconf -O--buildsystem=pybuild
   dh_auto_configure -O--buildsystem=pybuild
I: pybuild base:184: python3.6 setup.py config
Skipping external program tests because of --force
running config
   debian/rules override_dh_auto_build
make[1]: Entering directory '/home/packages/tmp/ocrmypdf-5.5'
mkdir -p debian/.debhelper
cp -R ocrmypdf debian/.debhelper
sed -i debian/.debhelper/ocrmypdf/__init__.py -e \
"s|^__version__ =.*|__version__ = \"5.5\"|"
PYTHONPATH=debian/.debhelper sphinx-build docs html
Running Sphinx v1.6.6
making output directory...
loading pickled environment... not yet created
building [mo]: targets for 0 po files that are out of date
building [html]: targets for 10 source files that are out of date
updating environment: 10 added, 0 changed, 0 removed
reading sources... [ 10%] advanced
reading sources... [ 20%] batch
reading sources... [ 30%] cookbook
reading sources... [ 40%] errors
reading sources... [ 50%] index
reading sources... [ 60%] installation
reading sources... [ 70%] introduction
reading sources... [ 80%] languages
reading sources... [ 90%] release_notes
reading sources... [100%] security

/home/packages/tmp/ocrmypdf-5.5/docs/installation.rst:2: WARNING: Duplicate explicit target name: "docker".
/home/packages/tmp/ocrmypdf-5.5/docs/introduction.rst:108: WARNING: Unknown target name: "using ocrmypdf online".
looking for now-outdated files... none found
pickling environment... done
checking consistency... /home/packages/tmp/ocrmypdf-5.5/docs/installation.rst: WARNING: document isn't included in any toctree
done
preparing documents... done
writing output... [ 10%] advanced
writing output... [ 20%] batch
writing output... [ 30%] cookbook
writing output... [ 40%] errors
writing output... [ 50%] index
writing output... [ 60%] installation
writing output... [ 70%] introduction
writing output... [ 80%] languages
writing output... [ 90%] release_notes
writing output... [100%] security

generating indices... genindex
writing additional pages... search
copying images... [100%] bitmap_vs_svg.svg

copying static files... WARNING: html_static_path entry '/home/packages/tmp/ocrmypdf-5.5/docs/_static' does not exist
done
copying extra files... done
dumping search index in English (code: en) ... done
dumping object inventory... done
build succeeded, 4 warnings.
dh_auto_build -O--buildsystem=pybuild
I: pybuild base:184: /usr/bin/python3 setup.py build
Skipping external program tests because of --force
running build
running build_py
creating /home/packages/tmp/ocrmypdf-5.5/.pybuild/pythonX.Y_3.6/build/ocrmypdf
copying ocrmypdf/_unicodefun.py -> /home/packages/tmp/ocrmypdf-5.5/.pybuild/pythonX.Y_3.6/build/ocrmypdf
copying ocrmypdf/__main__.py -> /home/packages/tmp/ocrmypdf-5.5/.pybuild/pythonX.Y_3.6/build/ocrmypdf
copying ocrmypdf/pdfa.py -> /home/packages/tmp/ocrmypdf-5.5/.pybuild/pythonX.Y_3.6/build/ocrmypdf
copying ocrmypdf/leptonica.py -> /home/packages/tmp/ocrmypdf-5.5/.pybuild/pythonX.Y_3.6/build/ocrmypdf
copying ocrmypdf/__init__.py -> /home/packages/tmp/ocrmypdf-5.5/.pybuild/pythonX.Y_3.6/build/ocrmypdf
copying ocrmypdf/hocrtransform.py -> /home/packages/tmp/ocrmypdf-5.5/.pybuild/pythonX.Y_3.6/build/ocrmypdf
copying ocrmypdf/helpers.py ->