Ghostscript 9.56.0 introduced a serious bug from ocrmypdf’s perspective. Upgrading to ocrmypdf 13.4.2 should fix this, as should a newer Ghostscript, if one has been released.
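
For anyone who wants to check whether a machine has the release in question, here is a minimal sketch. It assumes the `gs` binary is on PATH and that `gs --version` prints a plain version number; the helper names (`parse_version`, `is_affected`, `installed_ghostscript_version`) are made up for illustration and are not part of ocrmypdf. It flags only 9.56.0, since that is the only release named in this report.

```python
import re
import shutil
import subprocess

# Assumption: 9.56.0 is the only release named as problematic in this thread.
AFFECTED = (9, 56, 0)


def parse_version(text):
    """Parse '9.56.0', '9.56' or '9.56.0~dfsg-1' into a tuple such as (9, 56, 0)."""
    match = re.match(r"\s*(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    major, minor, patch = match.groups()
    # A missing patch component is treated as 0.
    return (int(major), int(minor), int(patch or 0))


def is_affected(version_text):
    """True only for the release named in the bug report."""
    return parse_version(version_text) == AFFECTED


def installed_ghostscript_version():
    """Return the output of `gs --version`, or None if gs is not on PATH."""
    gs = shutil.which("gs")
    if gs is None:
        return None
    result = subprocess.run(
        [gs, "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


if __name__ == "__main__":
    version = installed_ghostscript_version()
    if version is None:
        print("gs not found on PATH")
    elif is_affected(version):
        print(f"Ghostscript {version}: affected, consider ocrmypdf 13.4.2")
    else:
        print(f"Ghostscript {version}: not the release named in this report")
```

The comparison is deliberately an exact match rather than a range, because the report does not say which later Ghostscript release, if any, contains a fix.
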

> On Apr 14, 2022, at 02:15, Paul Gevers <elb...@debian.org> wrote:
>
> Source: ghostscript, ocrmypdf
> Control: found -1 ghostscript/9.56.0~dfsg-1
> Control: found -1 ocrmypdf/13.4.0+dfsg-1
> Severity: serious
> Tags: sid bookworm
> User: debian...@lists.debian.org
> Usertags: breaks needs-update
>
> Dear maintainer(s),
>
> With a recent upload of ghostscript the autopkgtest of ocrmypdf fails in
> testing when that autopkgtest is run with the binary packages of ghostscript
> from unstable. It passes when run with only packages from testing. In tabular
> form:
>
>                  pass            fail
> ghostscript      from testing    9.56.0~dfsg-1
> ocrmypdf         from testing    13.4.0+dfsg-1
> all others       from testing    from testing
>
> I copied some of the output at the bottom of this report.
>
> Currently this regression is blocking the migration of ghostscript to testing
> [1]. Due to the nature of this issue, I filed this bug report against both
> packages. Can you please investigate the situation and reassign the bug to
> the right package?
>
> More information about this bug and the reason for filing it can be found on
> https://wiki.debian.org/ContinuousIntegration/RegressionEmailInformation
>
> Paul
>
> [1] https://qa.debian.org/excuses.php?package=ghostscript
>
> https://ci.debian.net/data/autopkgtest/testing/amd64/o/ocrmypdf/20818050/log.gz
>
> =================================== FAILURES ===================================
> ________________________________ test_force_ocr ________________________________
>
> resources = PosixPath('/tmp/autopkgtest-lxc.zdbcipww/downtmp/build.V8r/src/tests/resources')
> outpdf = PosixPath('/tmp/pytest-of-debci/pytest-0/test_force_ocr0/out.pdf')
>
>     def test_force_ocr(resources, outpdf):
>         out = check_ocrmypdf(
>             resources / 'graph_ocred.pdf',
>             outpdf,
>             '-f',
>             '--plugin',
>             'tests/plugins/tesseract_cache.py',
>         )
>         pdfinfo = PdfInfo(out)
>>        assert pdfinfo[0].has_text
> E       assert False
> E        +  where False = <PageInfo pageno=0 7.573333333333333333333333333"x6.16" rotation=0 dpi=400.000000x400.000000 has_text=False>.has_text
>
> tests/test_main.py:83: AssertionError
> ----------------------------- Captured stderr call -----------------------------
>
> Scanning contents: 0%| | 0/1 [00:00<?, ?page/s]
> Scanning contents: 100%|██████████| 1/1 [00:00<00:00, 62.30page/s]
>
> OCR: 0%| | 0.0/1.0 [00:00<?, ?page/s]
> OCR: 50%|█████ | 0.5/1.0 [00:02<00:02, 5.47s/page]
> OCR: 100%|██████████| 1.0/1.0 [00:02<00:00, 2.75s/page]
>
> PDF/A conversion: 0%| | 0/1 [00:00<?, ?page/s]
>
> Recompressing JPEGs: 0image [00:00, ?image/s]
> Recompressing JPEGs: 0image [00:00, ?image/s]
>
>
> Deflating JPEGs: 0%| | 0/1 [00:00<?, ?image/s]
> Deflating JPEGs: 100%|██████████| 1/1 [00:00<00:00, 74.34image/s]
>
>
> JBIG2: 0item [00:00, ?item/s]
> JBIG2: 0item [00:00, ?item/s]
> ------------------------------ Captured log call -------------------------------
> INFO ocrmypdf._pipeline:_pipeline.py:275 page already has text! - rasterizing text and running OCR anyway
> INFO ocrmypdf._sync:_sync.py:301 Postprocessing...
> WARNING ocrmypdf._pipeline:_pipeline.py:776 Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
> INFO ocrmypdf.optimize:optimize.py:665 Optimize ratio: 1.52 savings: 34.1%
> INFO ocrmypdf._sync:_sync.py:399 Output file is a PDF/A-2B (as expected)
> WARNING ocrmypdf._validation:_validation.py:381 The output file size is 2.45× larger than the input file.
> Possible reasons for this include:
> The argument --force-ocr was issued, causing transcoding.
> The optional dependency 'jbig2' was not found, so some image optimizations could not be attempted.
> PDF/A conversion was enabled. (Try `--output-type pdf`.)
> Plugins were used.
> --------------------------- Captured stderr teardown ---------------------------
>
> PDF/A conversion: 100%|██████████| 1/1 [00:01<00:00, 1.20s/page]
> ________________________________ test_skip_ocr _________________________________
>
> resources = PosixPath('/tmp/autopkgtest-lxc.zdbcipww/downtmp/build.V8r/src/tests/resources')
> outpdf = PosixPath('/tmp/pytest-of-debci/pytest-0/test_skip_ocr0/out.pdf')
>
>     def test_skip_ocr(resources, outpdf):
>         out = check_ocrmypdf(
>             resources / 'graph_ocred.pdf',
>             outpdf,
>             '-s',
>             '--plugin',
>             'tests/plugins/tesseract_cache.py',
>         )
>         pdfinfo = PdfInfo(out)
>>        assert pdfinfo[0].has_text
> E       assert False
> E        +  where False = <PageInfo pageno=0 7.573333333333333333333333333"x6.16" rotation=0 dpi=150.000000x150.000000 has_text=False>.has_text
>
> tests/test_main.py:95: AssertionError
> ----------------------------- Captured stderr call -----------------------------
>
> Scanning contents: 0%| | 0/1 [00:00<?, ?page/s]
> Scanning contents: 100%|██████████| 1/1 [00:00<00:00, 70.71page/s]
>
> OCR: 0%| | 0.0/1.0 [00:00<?, ?page/s]
> OCR: 100%|██████████| 1.0/1.0 [00:00<00:00, 47.12page/s]
>
> PDF/A conversion: 0%| | 0/1 [00:00<?, ?page/s]
>
> Recompressing JPEGs: 0image [00:00, ?image/s]
> Recompressing JPEGs: 0image [00:00, ?image/s]
>
>
> Deflating JPEGs: 0%| | 0/1 [00:00<?, ?image/s]
> Deflating JPEGs: 100%|██████████| 1/1 [00:00<00:00, 235.24image/s]
>
>
> JBIG2: 0item [00:00, ?item/s]
> JBIG2: 0item [00:00, ?item/s]
> ------------------------------ Captured log call -------------------------------
> INFO ocrmypdf._pipeline:_pipeline.py:287 skipping all processing on this page
> INFO ocrmypdf._sync:_sync.py:301 Postprocessing...
> WARNING ocrmypdf._pipeline:_pipeline.py:776 Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
> INFO ocrmypdf.optimize:optimize.py:665 Optimize ratio: 1.14 savings: 12.6%
> INFO ocrmypdf._sync:_sync.py:399 Output file is a PDF/A-2B (as expected)
> --------------------------- Captured stderr teardown ---------------------------
>
> PDF/A conversion: 100%|██████████| 1/1 [00:00<00:00, 4.16page/s]
> ________________________________ test_redo_ocr _________________________________
>
> resources = PosixPath('/tmp/autopkgtest-lxc.zdbcipww/downtmp/build.V8r/src/tests/resources')
> outpdf = PosixPath('/tmp/pytest-of-debci/pytest-0/test_redo_ocr0/out.pdf')
>
>     def test_redo_ocr(resources, outpdf):
>         in_ = resources / 'graph_ocred.pdf'
>         before = PdfInfo(in_, detailed_analysis=True)
>         out = outpdf
>         out = check_ocrmypdf(in_, out, '--redo-ocr')
>         after = PdfInfo(out, detailed_analysis=True)
>>        assert before[0].has_text and after[0].has_text
> E       assert (True and False)
> E        +  where True = <PageInfo pageno=0 7.573333333333333333333333333"x6.16" rotation=0 dpi=150.000000x150.000000 has_text=True>.has_text
> E        +  and   False = <PageInfo pageno=0 7.573333333333333333333333333"x6.16" rotation=0 dpi=150.000000x150.000000 has_text=False>.has_text
>
> tests/test_main.py:104: AssertionError
> ----------------------------- Captured stderr call -----------------------------
>
> Scanning contents: 0%| | 0/1 [00:00<?, ?page/s]
> Scanning contents: 100%|██████████| 1/1 [00:00<00:00, 20.63page/s]
>
> OCR: 0%| | 0.0/1.0 [00:00<?, ?page/s]
> OCR: 50%|█████ | 0.5/1.0 [00:04<00:04, 8.64s/page]
> OCR: 100%|██████████| 1.0/1.0 [00:04<00:00, 4.35s/page]
>
> PDF/A conversion: 0%| | 0/1 [00:00<?, ?page/s]
>
> Recompressing JPEGs: 0image [00:00, ?image/s]
> Recompressing JPEGs: 0image [00:00, ?image/s]
>
>
> Deflating JPEGs: 0%| | 0/1 [00:00<?, ?image/s]
> Deflating JPEGs: 100%|██████████| 1/1 [00:00<00:00, 254.88image/s]
>
>
> JBIG2: 0item [00:00, ?item/s]
> JBIG2: 0item [00:00, ?item/s]
> ------------------------------ Captured log call -------------------------------
> INFO ocrmypdf._pipeline:_pipeline.py:284 redoing OCR
> INFO ocrmypdf._sync:_sync.py:301 Postprocessing...
> ERROR ocrmypdf._exec.ghostscript:ghostscript.py:277 GPL Ghostscript 9.56.0 (2022-03-29)
> Copyright (C) 2022 Artifex Software, Inc. All rights reserved.
> This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
> see the file COPYING for details.
> Processing pages 1 through 1.
> Page 1
>
> The following warnings were encountered at least once while processing this file:
> number uses illegal exponent form
>
> ERROR ocrmypdf._exec.ghostscript:ghostscript.py:277 This file had errors that were repaired or ignored.
> ERROR ocrmypdf._exec.ghostscript:ghostscript.py:277 The file was produced by:
> ERROR ocrmypdf._exec.ghostscript:ghostscript.py:277 >>>> GPL Ghostscript 9.15 <<<<
> ERROR ocrmypdf._exec.ghostscript:ghostscript.py:277 Please notify the author of the software that produced this
> ERROR ocrmypdf._exec.ghostscript:ghostscript.py:277 file that it does not conform to Adobe's published PDF
> ERROR ocrmypdf._exec.ghostscript:ghostscript.py:277 specification.
>
>
> WARNING ocrmypdf._pipeline:_pipeline.py:776 Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
> INFO ocrmypdf.optimize:optimize.py:665 Optimize ratio: 1.14 savings: 12.6%
> INFO ocrmypdf._sync:_sync.py:399 Output file is a PDF/A-2B (as expected)
> --------------------------- Captured stderr teardown ---------------------------
>
> PDF/A conversion: 100%|██████████| 1/1 [00:00<00:00, 3.91page/s]
> =========================== short test summary info ============================
> FAILED tests/test_main.py::test_force_ocr - assert False
> FAILED tests/test_main.py::test_skip_ocr - assert False
> FAILED tests/test_main.py::test_redo_ocr - assert (True and False)
> ======= 3 failed, 274 passed, 37 skipped, 4 xfailed in 397.41s (0:06:37) =======
> autopkgtest [08:17:33]: test test-suite
>