Bug#1009680: ghostscript breaks ocrmypdf autopkgtest: seemingly multiple issues

james Thu, 14 Apr 2022 11:27:19 -0700

Ghostscript 9.56.0 introduced a serious bug from ocrmypdf’s perspective.
Upgrading to ocrmypdf 13.4.2 would work, as would moving to a newer
Ghostscript, if one has been released.
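
For anyone who wants to check a machine for the problematic combination, here
is a minimal sketch in Python. It assumes that only the exact 9.56.0 release
named above is affected; whether later point releases behave differently is not
something this message establishes. The function names and the
AFFECTED_GS_VERSIONS set are made up for illustration and are not part of
ocrmypdf.

```python
import re
import subprocess

# Assumption: only the exact release named in this thread is flagged.
AFFECTED_GS_VERSIONS = {(9, 56, 0)}


def parse_version(text):
    """Return the first dotted version number found in text as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def is_affected_ghostscript(version_text):
    """True if version_text names a Ghostscript release in the flagged set."""
    return parse_version(version_text) in AFFECTED_GS_VERSIONS


def installed_ghostscript_version():
    """Ask the local gs binary for its version string (needs gs on PATH)."""
    result = subprocess.run(
        ["gs", "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()
```

Calling is_affected_ghostscript(installed_ghostscript_version()) reports
whether the local Ghostscript is the flagged release. The parser also accepts a
Debian version string or a Ghostscript banner line, since it only looks at the
first dotted number it finds.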


> On Apr 14, 2022, at 02:15, Paul Gevers <elb...@debian.org> wrote:
> 
> Source: ghostscript, ocrmypdf
> Control: found -1 ghostscript/9.56.0~dfsg-1
> Control: found -1 ocrmypdf/13.4.0+dfsg-1
> Severity: serious
> Tags: sid bookworm
> User: debian...@lists.debian.org
> Usertags: breaks needs-update
> 
> Dear maintainer(s),
> 
> With a recent upload of ghostscript the autopkgtest of ocrmypdf fails in 
> testing when that autopkgtest is run with the binary packages of ghostscript 
> from unstable. It passes when run with only packages from testing. In tabular 
> form:
> 
>                       pass            fail
> ghostscript            from testing    9.56.0~dfsg-1
> ocrmypdf               from testing    13.4.0+dfsg-1
> all others             from testing    from testing
> 
> I copied some of the output at the bottom of this report.
> 
> Currently this regression is blocking the migration of ghostscript to testing 
> [1]. Due to the nature of this issue, I filed this bug report against both 
> packages. Can you please investigate the situation and reassign the bug to 
> the right package?
> 
> More information about this bug and the reason for filing it can be found on
> https://wiki.debian.org/ContinuousIntegration/RegressionEmailInformation
> 
> Paul
> 
> [1] https://qa.debian.org/excuses.php?package=ghostscript
> 
> https://ci.debian.net/data/autopkgtest/testing/amd64/o/ocrmypdf/20818050/log.gz
> 
> =================================== FAILURES 
> ===================================
> ________________________________ test_force_ocr 
> ________________________________
> 
> resources = 
> PosixPath('/tmp/autopkgtest-lxc.zdbcipww/downtmp/build.V8r/src/tests/resources')
> outpdf = PosixPath('/tmp/pytest-of-debci/pytest-0/test_force_ocr0/out.pdf')
> 
>    def test_force_ocr(resources, outpdf):
>        out = check_ocrmypdf(
>            resources / 'graph_ocred.pdf',
>            outpdf,
>            '-f',
>            '--plugin',
>            'tests/plugins/tesseract_cache.py',
>        )
>        pdfinfo = PdfInfo(out)
>>      assert pdfinfo[0].has_text
> E       assert False
> E        +  where False = <PageInfo pageno=0 
> 7.573333333333333333333333333"x6.16" rotation=0 dpi=400.000000x400.000000 
> has_text=False>.has_text
> 
> tests/test_main.py:83: AssertionError
> ----------------------------- Captured stderr call 
> -----------------------------
> 
> Scanning contents:   0%|          | 0/1 [00:00<?, ?page/s]
> Scanning contents: 100%|██████████| 1/1 [00:00<00:00, 62.30page/s]
> 
> OCR:   0%|          | 0.0/1.0 [00:00<?, ?page/s]
> OCR:  50%|█████     | 0.5/1.0 [00:02<00:02,  5.47s/page]
> OCR: 100%|██████████| 1.0/1.0 [00:02<00:00,  2.75s/page]
> 
> PDF/A conversion:   0%|          | 0/1 [00:00<?, ?page/s]
> 
> Recompressing JPEGs: 0image [00:00, ?image/s]
> Recompressing JPEGs: 0image [00:00, ?image/s]
> 
> 
> Deflating JPEGs:   0%|          | 0/1 [00:00<?, ?image/s]
> Deflating JPEGs: 100%|██████████| 1/1 [00:00<00:00, 74.34image/s]
> 
> 
> JBIG2: 0item [00:00, ?item/s]
> JBIG2: 0item [00:00, ?item/s]
> ------------------------------ Captured log call 
> -------------------------------
> INFO     ocrmypdf._pipeline:_pipeline.py:275 page already has text! - 
> rasterizing text and running OCR anyway
> INFO     ocrmypdf._sync:_sync.py:301 Postprocessing...
> WARNING  ocrmypdf._pipeline:_pipeline.py:776 Some input metadata could not be 
> copied because it is not permitted in PDF/A. You may wish to examine the 
> output PDF's XMP metadata.
> INFO     ocrmypdf.optimize:optimize.py:665 Optimize ratio: 1.52 savings: 34.1%
> INFO     ocrmypdf._sync:_sync.py:399 Output file is a PDF/A-2B (as expected)
> WARNING  ocrmypdf._validation:_validation.py:381 The output file size is 
> 2.45× larger than the input file.
> Possible reasons for this include:
> The argument --force-ocr was issued, causing transcoding.
> The optional dependency 'jbig2' was not found, so some image optimizations 
> could not be attempted.
> PDF/A conversion was enabled. (Try `--output-type pdf`.)
> Plugins were used.
> --------------------------- Captured stderr teardown 
> ---------------------------
> 
> PDF/A conversion: 100%|██████████| 1/1 [00:01<00:00,  1.20s/page]
> ________________________________ test_skip_ocr 
> _________________________________
> 
> resources = 
> PosixPath('/tmp/autopkgtest-lxc.zdbcipww/downtmp/build.V8r/src/tests/resources')
> outpdf = PosixPath('/tmp/pytest-of-debci/pytest-0/test_skip_ocr0/out.pdf')
> 
>    def test_skip_ocr(resources, outpdf):
>        out = check_ocrmypdf(
>            resources / 'graph_ocred.pdf',
>            outpdf,
>            '-s',
>            '--plugin',
>            'tests/plugins/tesseract_cache.py',
>        )
>        pdfinfo = PdfInfo(out)
>>      assert pdfinfo[0].has_text
> E       assert False
> E        +  where False = <PageInfo pageno=0 
> 7.573333333333333333333333333"x6.16" rotation=0 dpi=150.000000x150.000000 
> has_text=False>.has_text
> 
> tests/test_main.py:95: AssertionError
> ----------------------------- Captured stderr call 
> -----------------------------
> 
> Scanning contents:   0%|          | 0/1 [00:00<?, ?page/s]
> Scanning contents: 100%|██████████| 1/1 [00:00<00:00, 70.71page/s]
> 
> OCR:   0%|          | 0.0/1.0 [00:00<?, ?page/s]
> OCR: 100%|██████████| 1.0/1.0 [00:00<00:00, 47.12page/s]
> 
> PDF/A conversion:   0%|          | 0/1 [00:00<?, ?page/s]
> 
> Recompressing JPEGs: 0image [00:00, ?image/s]
> Recompressing JPEGs: 0image [00:00, ?image/s]
> 
> 
> Deflating JPEGs:   0%|          | 0/1 [00:00<?, ?image/s]
> Deflating JPEGs: 100%|██████████| 1/1 [00:00<00:00, 235.24image/s]
> 
> 
> JBIG2: 0item [00:00, ?item/s]
> JBIG2: 0item [00:00, ?item/s]
> ------------------------------ Captured log call 
> -------------------------------
> INFO     ocrmypdf._pipeline:_pipeline.py:287 skipping all processing on this 
> page
> INFO     ocrmypdf._sync:_sync.py:301 Postprocessing...
> WARNING  ocrmypdf._pipeline:_pipeline.py:776 Some input metadata could not be 
> copied because it is not permitted in PDF/A. You may wish to examine the 
> output PDF's XMP metadata.
> INFO     ocrmypdf.optimize:optimize.py:665 Optimize ratio: 1.14 savings: 12.6%
> INFO     ocrmypdf._sync:_sync.py:399 Output file is a PDF/A-2B (as expected)
> --------------------------- Captured stderr teardown 
> ---------------------------
> 
> PDF/A conversion: 100%|██████████| 1/1 [00:00<00:00,  4.16page/s]
> ________________________________ test_redo_ocr 
> _________________________________
> 
> resources = 
> PosixPath('/tmp/autopkgtest-lxc.zdbcipww/downtmp/build.V8r/src/tests/resources')
> outpdf = PosixPath('/tmp/pytest-of-debci/pytest-0/test_redo_ocr0/out.pdf')
> 
>    def test_redo_ocr(resources, outpdf):
>        in_ = resources / 'graph_ocred.pdf'
>        before = PdfInfo(in_, detailed_analysis=True)
>        out = outpdf
>        out = check_ocrmypdf(in_, out, '--redo-ocr')
>        after = PdfInfo(out, detailed_analysis=True)
>>      assert before[0].has_text and after[0].has_text
> E       assert (True and False)
> E        +  where True = <PageInfo pageno=0 
> 7.573333333333333333333333333"x6.16" rotation=0 dpi=150.000000x150.000000 
> has_text=True>.has_text
> E        +  and   False = <PageInfo pageno=0 
> 7.573333333333333333333333333"x6.16" rotation=0 dpi=150.000000x150.000000 
> has_text=False>.has_text
> 
> tests/test_main.py:104: AssertionError
> ----------------------------- Captured stderr call 
> -----------------------------
> 
> Scanning contents:   0%|          | 0/1 [00:00<?, ?page/s]
> Scanning contents: 100%|██████████| 1/1 [00:00<00:00, 20.63page/s]
> 
> OCR:   0%|          | 0.0/1.0 [00:00<?, ?page/s]
> OCR:  50%|█████     | 0.5/1.0 [00:04<00:04,  8.64s/page]
> OCR: 100%|██████████| 1.0/1.0 [00:04<00:00,  4.35s/page]
> 
> PDF/A conversion:   0%|          | 0/1 [00:00<?, ?page/s]
> 
> Recompressing JPEGs: 0image [00:00, ?image/s]
> Recompressing JPEGs: 0image [00:00, ?image/s]
> 
> 
> Deflating JPEGs:   0%|          | 0/1 [00:00<?, ?image/s]
> Deflating JPEGs: 100%|██████████| 1/1 [00:00<00:00, 254.88image/s]
> 
> 
> JBIG2: 0item [00:00, ?item/s]
> JBIG2: 0item [00:00, ?item/s]
> ------------------------------ Captured log call 
> -------------------------------
> INFO     ocrmypdf._pipeline:_pipeline.py:284 redoing OCR
> INFO     ocrmypdf._sync:_sync.py:301 Postprocessing...
> ERROR    ocrmypdf._exec.ghostscript:ghostscript.py:277 GPL Ghostscript 9.56.0 
> (2022-03-29)
> Copyright (C) 2022 Artifex Software, Inc.  All rights reserved.
> This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
> see the file COPYING for details.
> Processing pages 1 through 1.
> Page 1
> 
> The following warnings were encountered at least once while processing this 
> file:
>    number uses illegal exponent form
> 
>   ERROR    ocrmypdf._exec.ghostscript:ghostscript.py:277  This file had 
> errors that were repaired or ignored.
>   ERROR    ocrmypdf._exec.ghostscript:ghostscript.py:277  The file was 
> produced by:    ERROR    ocrmypdf._exec.ghostscript:ghostscript.py:277 >>>> 
> GPL Ghostscript 9.15 <<<<
>   ERROR    ocrmypdf._exec.ghostscript:ghostscript.py:277  Please notify the 
> author of the software that produced this
>   ERROR    ocrmypdf._exec.ghostscript:ghostscript.py:277  file that it does 
> not conform to Adobe's published PDF
>   ERROR    ocrmypdf._exec.ghostscript:ghostscript.py:277  specification.
> 
> 
> WARNING  ocrmypdf._pipeline:_pipeline.py:776 Some input metadata could not be 
> copied because it is not permitted in PDF/A. You may wish to examine the 
> output PDF's XMP metadata.
> INFO     ocrmypdf.optimize:optimize.py:665 Optimize ratio: 1.14 savings: 12.6%
> INFO     ocrmypdf._sync:_sync.py:399 Output file is a PDF/A-2B (as expected)
> --------------------------- Captured stderr teardown 
> ---------------------------
> 
> PDF/A conversion: 100%|██████████| 1/1 [00:00<00:00,  3.91page/s]
> =========================== short test summary info 
> ============================
> FAILED tests/test_main.py::test_force_ocr - assert False
> FAILED tests/test_main.py::test_skip_ocr - assert False
> FAILED tests/test_main.py::test_redo_ocr - assert (True and False)
> ======= 3 failed, 274 passed, 37 skipped, 4 xfailed in 397.41s (0:06:37) 
> =======
> autopkgtest [08:17:33]: test test-suite
>
