Package: imagemagick
Version: 8:6.9.11.60+dfsg-1.3
Severity: normal
Tags: upstream
X-Debbugs-Cc: debbug.imagemag...@sideload.33mail.com

Metadata from a TIFF file is being transferred to the *body* of the
target file when “converting” to a PDF file. This results in a PDF file
that falsely appears to have searchable text. One side effect is that
OCR programs raise errors saying the PDF has already been OCR-processed.

Steps to reproduce:

① Use GIMP to save a TIFF file. The options to save metadata should
  probably be enabled.

② Verify that the “PageName” field is populated:

  $ tiffinfo gimp_output.tif
  TIFFReadDirectory: Warning, Unknown field with tag 326 (0x146) encountered.
  TIFFReadDirectory: Warning, Unknown field with tag 327 (0x147) encountered.
  TIFFReadDirectory: Warning, Unknown field with tag 328 (0x148) encountered.
  TIFF Directory at offset 0x8 (8)
    Image Width: 3544 Image Length: 6240
    Resolution: 204, 196 pixels/inch
    Bits/Sample: 1
    Sample Format: unsigned integer
    Compression Scheme: None
    Photometric Interpretation: min-is-white
    Orientation: row 0 top, col 0 lhs
    Samples/Pixel: 1
    Rows/Strip: 128
    Planar Configuration: single image plane
    SubIFD Offsets: 5392
    PageName: pg04-5.tiff
    Software: GIMP 2.10.22
    DateTime: 2023:08:05 20:24:13
    XMLPacket (XMP Metadata):

③ Use ImageMagick's convert to produce a PDF:

  $ convert gimp_output.tif imagemagick_output.pdf
  convert-im6.q16: Unknown field with tag 326 (0x146) encountered. `TIFFReadDirectory' @ warning/tiff.c/TIFFWarnings/985.
  convert-im6.q16: Unknown field with tag 327 (0x147) encountered. `TIFFReadDirectory' @ warning/tiff.c/TIFFWarnings/985.
  convert-im6.q16: Unknown field with tag 328 (0x148) encountered. `TIFFReadDirectory' @ warning/tiff.c/TIFFWarnings/985.
  convert-im6.q16: Unknown field with tag 327 (0x147) encountered. `TIFFReadDirectory' @ warning/tiff.c/TIFFWarnings/985.
  convert-im6.q16: Unknown field with tag 328 (0x148) encountered. `TIFFReadDirectory' @ warning/tiff.c/TIFFWarnings/985.

④ Use pdf2txt to see the stray text that was injected into the PDF body:

  $ pdf2txt imagemagick_output.pdf
  pg04-5.tiff

⑤ Use pdfinfo to prove that the TIFF metadata (“PageName:”) did not make
  it into the PDF metadata:

  $ pdfinfo imagemagick_output.pdf
  Title:          imagemagick_output
  Producer:       https://imagemagick.org
  CreationDate:   Sun Aug 6 10:14:34 2023 CEST
  ModDate:        Sun Aug 6 10:14:34 2023 CEST
  Tagged:         no
  UserProperties: no
  Suspects:       no
  Form:           none
  JavaScript:     no
  Pages:          1
  Encrypted:      no
  Page size:      1250.82 x 2292.24 pts
  Page rot:       0
  File size:      27485613 bytes
  Optimized:      no
  PDF version:    1.7

⑥ Use ocrmypdf to attempt making the text contained within the PDF
  searchable:

  $ ocrmypdf imagemagick_output.pdf searchable.pdf
  Scanning contents: 100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 68.27page/s]
  Using Tesseract OpenMP thread limit 2
  OCR: 0%| | 0.0/1.0 [00:00<?, ?page/s]
  PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr

Workaround:

Of course the workaround for this particular workflow is to pass the
--force-ocr option to ocrmypdf. This may not be an option in other
situations.

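For anyone who wants to check their own files for this symptom before running OCR, here is a minimal sketch of the comparison done by hand in steps ② and ④. The helper names pagename_from_tiffinfo and pagename_leaked are hypothetical (they are not part of ImageMagick, libtiff, or pdfminer), and the sketch assumes tiffinfo prints the tag on a line of the form "PageName: <value>", as in the output above. Both helpers work on plain text only, so they do not themselves call tiffinfo or pdf2txt.

```shell
#!/bin/sh
# Sketch only: detect whether a TIFF PageName value appears in the text
# extracted from a PDF. Helper names are made up for this illustration.

# Read tiffinfo-style output on stdin and print the value of the first
# "PageName:" line (leading whitespace is ignored).
pagename_from_tiffinfo() {
    sed -n 's/^[[:space:]]*PageName:[[:space:]]*//p' | head -n 1
}

# $1 = page name, $2 = text extracted from the PDF.
# Exit status 0 if the extracted text contains the page name as a literal
# substring; an empty page name never counts as a match.
pagename_leaked() {
    [ -n "$1" ] || return 1
    printf '%s\n' "$2" | grep -Fq -- "$1"
}
```

With the files from this report, it could be used roughly as follows (assuming tiffinfo and pdf2txt are installed):

  $ name=$(tiffinfo gimp_output.tif 2>/dev/null | pagename_from_tiffinfo)
  $ pagename_leaked "$name" "$(pdf2txt imagemagick_output.pdf)" && echo "PageName leaked into PDF body"

A substring match is used rather than an exact line match because text extractors may append whitespace or a form feed to the output.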
-- Package-specific info:
ImageMagick program version
---------------------------
animate:   ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
compare:   ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
convert:   ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
composite: ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
conjure:   ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
display:   ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
identify:  ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
import:    ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
mogrify:   ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
montage:   ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
stream:    ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org

-- System Information:
Debian Release: 11.5
  APT prefers oldstable-updates
  APT policy: (990, 'oldstable-updates'), (990, 'oldstable-security'), (990, 'testing'), (990, 'oldstable')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 5.10.0-19-amd64 (SMP w/2 CPU threads)
Kernel taint flags: TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages imagemagick depends on:
ii  imagemagick-6.q16  8:6.9.11.60+dfsg-1.3

imagemagick recommends no packages.

imagemagick suggests no packages.

-- no debconf information