Re: [Bug 1778988] Re: After PDF Files created by cups-pdf, cannot extract text from them

Bob Swanson Sun, 14 Feb 2021 08:20:45 -0800

I tested on my current system, and the results match those described in
my previous comment. Previously, I was unable to determine the version of
CUPS. The following is the output of "dpkg-query -l", filtered for "cups" only:



ii  cups               2.3.1-9ubuntu1.1  amd64  Common UNIX Printing System(tm) - PPD/driver support, web interface
ii  cups-browsed       1.27.4-1          amd64  OpenPrinting CUPS Filters - cups-browsed
ii  cups-bsd           2.3.1-9ubuntu1.1  amd64  Common UNIX Printing System(tm) - BSD commands
ii  cups-client        2.3.1-9ubuntu1.1  amd64  Common UNIX Printing System(tm) - client programs (SysV)
ii  cups-common        2.3.1-9ubuntu1.1  all    Common UNIX Printing System(tm) - common files
ii  cups-core-drivers  2.3.1-9ubuntu1.1  amd64  Common UNIX Printing System(tm) - driverless printing
ii  cups-daemon        2.3.1-9ubuntu1.1  amd64  Common UNIX Printing System(tm) - daemon


Dated 2/14/2021 on my computer.

So, per your message, I am not running CUPS 3.0, but rather
2.3.1 as packaged by Ubuntu.
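
In case it helps anyone else confirm which CUPS they are running, here is a
rough sketch of how the version check could be scripted. It assumes rows in
the usual "dpkg-query -l" layout (status, name, version, architecture,
description). The function names are my own and purely illustrative.

```python
import subprocess


def parse_dpkg_list(text):
    """Map package name -> version for installed ('ii') rows of `dpkg-query -l` output."""
    versions = {}
    for line in text.splitlines():
        # Expected row layout: ii  name  version  arch  description
        fields = line.split(None, 4)
        if len(fields) >= 4 and fields[0] == "ii":
            versions[fields[1]] = fields[2]
    return versions


def upstream_version(debian_version):
    """Drop an optional epoch and the Debian revision: '2.3.1-9ubuntu1.1' -> '2.3.1'."""
    without_epoch = debian_version.split(":", 1)[-1]
    return without_epoch.rsplit("-", 1)[0]


def installed_cups_versions():
    """Run dpkg-query for packages matching 'cups*' (only works on a dpkg-based system)."""
    result = subprocess.run(
        ["dpkg-query", "-l", "cups*"],
        capture_output=True, text=True, check=True,
    )
    return parse_dpkg_list(result.stdout)
```

On a Debian or Ubuntu machine, upstream_version(installed_cups_versions()["cups"])
should give the upstream CUPS release, provided the "cups" package is installed.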

On 2/14/21, Martin-Éric Racine <1778...@bugs.launchpad.net> wrote:
> Thanks. Let's close it.
>
> ** Changed in: cups-pdf (Ubuntu)
>        Status: New => Invalid
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1778988
>
> Title:
>   After PDF Files created by cups-pdf, cannot extract text from them
>
> Status in cups-pdf package in Ubuntu:
>   Invalid
>
> Bug description:
>   PDF Creation Problem
>
>   Bob Swanson
>   bobswans...@gmail.com
>   26 June 2018
>
>   This file is part of the test package:
>
>   http://swansongrp.com/misc/testcase.zip
>
>
>   I have been able to demonstrate PDF
>   printing issues with LibreOffice and web
>   browsers. (For contrast, I have also
>   used the "wkhtmltopdf" command-line utility
>   output.)
>
>
>   USING LIBREOFFICE
>   -----------------
>
>   (Base file: mytest.odt)
>
>   This problem was originally associated
>   with a LibreOffice file containing mixed
>   font usages. When printed with "cups-pdf",
>   most of the displayed text could not be
>   selected in "evince", and extracted text was
>   garbage. (The PDFBox Java code could not extract
>   reasonable text from the PDF file.) See:
>
>   https://issues.apache.org/jira/browse/PDFBOX-4250
>
>   (working environment described in that
>   bug report)
>
>   It is much easier to demonstrate this
>   problem without using PDFBox Java.
>
>   To demonstrate, view any of the resulting PDF
>   files with "evince". When viewing a PDF
>   test file, simply press CTRL-A to select all
>   text. Then "paste" the selected area into a text
>   editor (Gedit or VIM, for instance) to see the
>   resulting plain text. (Okular did no better
>   than evince.)
>
>   The following results occur:
>
>   o) With original testcase: mytest3_cups_pdf.pdf,
>   created from LibreOffice using cups-pdf
>
>   Only one line highlights, but its text is correct.
>   This particular line uses a "standard" PDF font. The
>   other lines are not highlighted, and are not placed on
>   clipboard. This was the test that failed in PDFBox Java
>   text extraction.
>
>   Evince shows the many embedded fonts.
>
>
>   o) With original testcase: mytest_libreoffice_direct.pdf
>   created from LibreOffice using its "built in" PDF
>   creation option
>
>   ALL lines highlight, and when pasted, all text is
>   present.
>
>   Evince shows the many embedded fonts.
>
>
>   USING CHROMIUM BROWSER
>   ----------------------
>
>   (Base file: mytest.html)
>
>   I created several lines using different fonts,
>   as an HTML file. Viewed in Chromium browser,
>   then printed.
>
>   o) File: mytest_html_cups_pdf.pdf,
>   was printed from Chromium using the "cups-pdf"
>   "printer". All lines appear in the PDF, and
>   can be selected. But when pasted all resulting text
>   is garbage. Only one font embedded: "No name".
>
>   o) File: mytest_html_save_as_pdf.pdf,
>   was printed from Chromium using the "save as file"
>   option. All lines appear in the PDF, and can be
>   selected. All text (including text added by
>   the PDF creator) is present.
>
>   Evince shows the many embedded fonts.
>
>   (In the HTML cases, fonts used are no doubt
>   those already installed on my Ubuntu system.
>   The HTML code asked for fonts that may not
>   be present, and probably were substituted.)
>
>
>   USING BRAVE BROWSER
>   -------------------
>
>   (Base file: mytest.html)
>
>   Same testcase as for Chromium browser. Viewed
>   in Brave browser, then printed.
>
>   o) File: mytest_html_brave_cups_pdf.pdf,
>   was printed from Brave using the "cups-pdf"
>   "printer". All lines appear in the PDF. However,
>   when all text is selected, every character is highlighted
>   EXCEPT the initial "T" on the first line.  When
>   pasted, all resulting text is garbage. Only one
>   font embedded: "No name".
>
>   o) File: mytest_html_brave_save_as_pdf.pdf,
>   was printed from Brave using the "save as file"
>   option. All lines appear in the PDF, and can be
>   selected. All text is present. (No text added
>   by Brave).
>
>   Evince shows the many embedded fonts.
>
>   (Same notes may apply regarding fonts installed
>   on Ubuntu system)
>
>
>   USING WKHTMLTOPDF COMMAND
>   -------------------------
>
>   (Base file: mytest.html)
>
>   Same testcase as for browsers. I'm using
>   this example to show that multiple font
>   test output can be created in different ways.
>
>   Command:
>
>   wkhtmltopdf mytest.html mytest_wk.pdf
>
>   o) File: mytest_wk.pdf,
>   All lines appear in the PDF, and fonts
>   are sometimes quite different from those shown in
>   the browsers (they may actually be more correct).
>
>   All content can be highlighted and can be pasted
>   as text. Several text lines, however, contain
>   additional whitespace (tabs).
>
>   Evince shows the many embedded fonts.
>
>
>   NOTES
>   -----
>
>   The "creator" name embedded in the metadata for
>   these PDF files varies considerably, and it is
>   unclear to me whether the same engine is being
>   used by these various packages. It is clear, at
>   least, that cups-pdf is using Ghostscript for
>   PDF creation.
>
>   ProblemType: Bug
>   DistroRelease: Ubuntu 16.04
>   Package: cups-pdf 2.6.1-21
>   ProcVersionSignature: Ubuntu 4.13.0-45.50~16.04.1-generic 4.13.16
>   Uname: Linux 4.13.0-45-generic x86_64
>   ApportVersion: 2.20.1-0ubuntu2.18
>   Architecture: amd64
>   CurrentDesktop: Unity
>   Date: Wed Jun 27 15:36:32 2018
>   InstallationDate: Installed on 2017-05-16 (406 days ago)
>   InstallationMedia: Ubuntu 16.04.2 LTS "Xenial Xerus" - Release amd64 (20170215.2)
>   SourcePackage: cups-pdf
>   UpgradeStatus: No upgrade log present (probably fresh install)
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/ubuntu/+source/cups-pdf/+bug/1778988/+subscriptions
>
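
The manual evince select-and-paste check in the report could also be scripted.
What follows is only a sketch under some assumptions: it uses the "pdftotext"
utility from poppler-utils (not one of the tools named in the report), and it
compares the extracted text against wording you already know is in the source
document. The function names and the idea of scoring by recovered words are
mine, not part of the original test package.

```python
import re
import subprocess


def words(text):
    """Lower-cased set of alphabetic words found in text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def fraction_recovered(extracted, expected):
    """Share of the expected words that also appear in the extracted text (0.0 to 1.0)."""
    want = words(expected)
    if not want:
        return 1.0
    return len(want & words(extracted)) / len(want)


def extract_text(pdf_path):
    """Extract text with pdftotext; assumes poppler-utils is installed and on PATH."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" sends the text to stdout
        capture_output=True, encoding="utf-8", errors="replace", check=True,
    )
    return result.stdout
```

A score near 1.0 for mytest_libreoffice_direct.pdf and near 0.0 for
mytest3_cups_pdf.pdf would be consistent with what the report describes,
though I have not run this against the test package.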

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1778988

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
