Bug#486115: PDF files from gscan2pdf are huge
tag 486115 pending
thanks

I've worked up a patch for this. It will go in with the next upstream release.

The logic ended up being slightly different, as imagemagick gives the depth per channel:

if Depth = 1, compression = LZW
otherwise if TrueColor, compression = JPG
otherwise compression = PNG

When I upload the new release, please test the results to see if this logic really does produce files of a reasonable size, and report any corner cases you find.

-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED] with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]
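The three-way rule in the message above can be sketched as a small function. This is only an illustration of the decision order, not the actual gscan2pdf patch (which is not shown in this thread); the function name and its parameters are hypothetical, and the image type strings assume ImageMagick-style type names such as "Bilevel", "Grayscale", "Palette" and "TrueColor".

```python
def choose_compression(depth_per_channel: int, image_type: str) -> str:
    """Pick a PDF image compression from per-channel depth and image type.

    Hypothetical helper mirroring the rule quoted in the thread:
    depth 1 -> LZW, else TrueColor -> JPG, else PNG.
    """
    # The depth check comes first, so a 1-bit page never becomes a JPG.
    if depth_per_channel == 1:
        return "LZW"
    # Full-colour pages are treated as photo-like and compressed lossily.
    if image_type == "TrueColor":
        return "JPG"
    # Everything else (greyscale, palette images) stays lossless.
    return "PNG"


if __name__ == "__main__":
    for depth, kind in [(1, "Bilevel"), (8, "TrueColor"), (8, "Grayscale")]:
        print(depth, kind, choose_compression(depth, kind))
```

Because the choice is made per image, a document with pages of different depths would get a different compression for each page.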
Bug#486115: PDF files from gscan2pdf are huge
Just did some experimentation with photos.

Took a JPG, converted it to tiff

$ convert ~/Photos/2008/01/12/IMG_4063.JPG IMG_4063.tif

forced the compression with libtiff

$ tiffcp -c jpeg:75 IMG_4063.tif IMG_4063.75.tif

wrote it as a PDF

$ tiff2pdf -o IMG_4063.75.pdf IMG_4063.75.tif

imported the same tiff into gscan2pdf and wrote it as a PDF:

$ ls -l ~/Photos/2008/01/12/IMG_4063.JPG IMG_4063.tif IMG_4063.75*
-rwxr-xr-x 1 jeff jeff 3286311 2008-01-13 17:31 /home/jeff/Photos/2008/01/12/IMG_4063.JPG
-rw-r--r-- 1 jeff jeff  632922 2008-06-15 09:41 IMG_4063.75g.pdf
-rw-r--r-- 1 jeff jeff  694867 2008-06-15 09:38 IMG_4063.75.pdf
-rw-r--r-- 1 jeff jeff  700976 2008-06-15 09:34 IMG_4063.75.tif
-rw-r--r-- 1 jeff jeff 7805388 2008-06-15 09:31 IMG_4063.tif

If I import the JPG directly into gscan2pdf, this becomes:

-rw-r--r-- 1 jeff jeff  637825 2008-06-15 09:47 IMG_4063.75g.pdf

which is still better than libtiff.

So - as you have already said, it is possible to get small PDFs from gscan2pdf if you choose the appropriate compression.

The question then is - how best to help the user choose the compression? Count the depth of the image - 1 bit = LZW, 2-3 bit = PNG, >3 bit = JPG? Have this as an extra "automatic" compression level? This at least would be a sane way to deal with several pages with different depths.
Bug#486115: PDF files from gscan2pdf are huge
Jeffrey Ratcliffe wrote:
> So - as you have already said, it is possible to get small PDFs from
> gscan2pdf if you choose the appropriate compression.
>
> The question then is - how best to help the user choose the
> compression? Count the depth of the image - 1 bit = LZW, 2-3 bit = PNG,
> >3 bit = JPG? Have this as an extra "automatic" compression level? This
> at least would be a sane way to deal with several pages with different
> depths.

That seems reasonable.
Bug#486115: PDF files from gscan2pdf are huge
2008/6/14 Jeff Licquia [EMAIL PROTECTED]:
> I don't think I'd quibble over the few kilobytes between JPEG and
> command-line. But I don't think I'm the only one who will be taken
> aback at just how big the PNG version is. The verdict from most people
> won't be "well, I need to tweak the settings to get it just right";
> it'll be "gscan2pdf sucks" or, more likely, "Linux scanning sucks".

Most of the scanning I do is BW (i.e. 1-bit). PNG does better there, and JPG does worst. DjVu is worth the extra hassle, though.

-rw-r--r-- 1 jeff jeff   13181 2008-06-14 07:55 5PsYMkaHJ3.djvu
-rw-r--r-- 1 jeff jeff   48858 2008-06-14 07:55 5PsYMkaHJ3.g4.pdf
-rw-r--r-- 1 jeff jeff   48858 2008-06-14 07:54 5PsYMkaHJ3.g3.pdf
-rw-r--r-- 1 jeff jeff   48858 2008-06-14 07:54 5PsYMkaHJ3.packbits.pdf
-rw-r--r-- 1 jeff jeff   48858 2008-06-14 07:54 5PsYMkaHJ3.zip.pdf
-rw-r--r-- 1 jeff jeff  432705 2008-06-14 07:53 5PsYMkaHJ3.jpg.pdf
-rw-r--r-- 1 jeff jeff 1042517 2008-06-14 07:52 5PsYMkaHJ3.pnm
-rw-r--r-- 1 jeff jeff   56204 2008-06-14 07:51 5PsYMkaHJ3.png.pdf
-rw-r--r-- 1 jeff jeff   48858 2008-06-14 07:51 5PsYMkaHJ3.lzw.pdf

I will investigate why JPG doesn't do as well as the command-line tools (I have an idea).
Bug#486115: PDF files from gscan2pdf are huge
Package: gscan2pdf
Version: 0.9.24-1
Severity: normal

-rw-r--r-- 1 jeff jeff  840424 2008-06-13 09:12 july.pdf
-rw-r--r-- 1 jeff jeff 3758009 2008-06-13 08:57 july-vacation-request.pdf

These are both 1-page PDFs, using the same piece of paper, and the same scan settings.

The first was scanned using xsane to PNM, and converted with the following command line:

[EMAIL PROTECTED]:~/tmp/timeoff$ pnmtops july.pnm | ps2pdf14 - july.pdf

The second was generated with gscan2pdf, using mostly default processing settings; I did turn off OCR, since I don't need it here.

The PNM file generated from the scan was smaller than the resulting PDF file from gscan2pdf. And the manually converted scan doesn't have the benefit of processing with unpaper, so it has the paper borders, little spots, etc. Otherwise, the appearance of the two scans is not discernibly different.

I'm sure there are settings I could have chosen that would duplicate the results I got manually. Those settings should be the default.

-- System Information:
Debian Release: lenny/sid
  APT prefers testing
  APT policy: (990, 'testing'), (500, 'unstable'), (500, 'stable'), (1, 'experimental')
Architecture: i386 (i686)

Kernel: Linux 2.6.24-1-686 (SMP w/1 CPU core)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash

Versions of packages gscan2pdf depends on:
ii  imagemagick         7:6.3.7.9.dfsg1-2+b2 image manipulation programs
ii  libconfig-general-p 2.38-1               Generic Configuration Module
ii  libgtk2-ex-simple-l 0.50-1.1             A simple interface to Gtk2's compl
ii  libgtk2-imageview-p 0.04-1+b1            Perl bindings for the GtkImageView
ii  liblocale-gettext-p 1.05-4               Using libc functions for internati
ii  libpdf-api2-perl    0.69-2               create or modify PDF documents in
ii  librsvg2-common     2.22.2-2             SAX-based renderer library for SVG
ii  libsane             1.0.19-10            API library for scanners
ii  libtiff-tools       3.8.2-8              TIFF manipulation and conversion t
ii  perlmagick          7:6.3.7.9.dfsg1-2+b2 Perl interface to the libMagick gr
ii  sane-utils          1.0.19-10            API library for scanners -- utilit

Versions of packages gscan2pdf recommends:
ii  djvulibre-bin             3.5.20-6  Utilities for the DjVu image forma
ii  gocr                      0.41-1+b1 A command line OCR
ii  libgtk2-ex-podviewer-perl 0.17-2    Perl Gtk2 widget for displaying Pl
ii  sane                      1.0.14-6  scanner graphical frontends
ii  tesseract-ocr             2.03-1    Command line OCR tool
ii  unpaper                   0.3-1     post-processing tool for scanned p
ii  xdg-utils                 1.0.2-4   desktop integration utilities from

-- no debconf information
Bug#486115: PDF files from gscan2pdf are huge
2008/6/13 Jeff Licquia [EMAIL PROTECTED]:
> I'm sure there are settings I could have chosen that would duplicate
> the results I got manually. Those settings should be the default.

That is easy to say, but it depends a great deal on what sort of scan it is. For BW scans, LZW seems to be the best, but some people prefer the fax type G3 or G4 compression. For colour or greyscale, use JPG or PNG.

So even a bit of logic to guess the type of scan wouldn't necessarily produce the best results. It would have to try all the sensible possibilities, which would not be very quick.

What happens if you have several pages with different types of scan?

gscan2pdf does remember the settings you used last.

But DjVu beats PDF hands down on file size.

Regards

Jeff
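The "try all the sensible possibilities" approach mentioned above amounts to encoding the page with every candidate and keeping the smallest result. A minimal sketch follows, assuming each candidate can be expressed as a function from bytes to bytes. The function name is hypothetical, and zlib and lzma from the Python standard library are only stand-ins here: the compressions actually discussed in the thread (LZW, G3/G4, JPG, PNG) are not part of the standard library.

```python
import lzma
import zlib
from typing import Callable, Dict, Tuple


def smallest_encoding(
    data: bytes, encoders: Dict[str, Callable[[bytes], bytes]]
) -> Tuple[str, int]:
    """Encode data with every candidate and return (name, size) of the smallest."""
    # Every encoder runs on the full input, which is why this is slow.
    sizes = {name: len(encode(data)) for name, encode in encoders.items()}
    best = min(sizes, key=sizes.get)
    return best, sizes[best]


# Stand-in encoders (assumption): real code would call the image compressors.
ENCODERS: Dict[str, Callable[[bytes], bytes]] = {
    "raw": lambda d: d,
    "zlib": zlib.compress,
    "lzma": lzma.compress,
}

if __name__ == "__main__":
    page = bytes(50_000)  # a blank "page" of zero bytes
    print(smallest_encoding(page, ENCODERS))
```

The cost grows with the number of candidates times the number of pages, which is the speed concern raised in the message.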
Bug#486115: PDF files from gscan2pdf are huge
Jeffrey Ratcliffe wrote:
> That is easy to say, but it depends a great deal on what sort of scan
> it is. For BW scans, LZW seems to be the best, but some people prefer
> the fax type G3 or G4 compression. For colour or greyscale, use JPG or
> PNG.

I suppose that would make more sense if the lossless PNM scan wasn't a megabyte smaller than the gscan2pdf PDF, or if I had passed exotic command-line arguments to the command-line convert tools.

For the record, these were all 300-dpi scans in color, although the source was just a black-and-white form. (Greyscale seems to have a different bug not related to gscan2pdf, which makes it unusable.)

> What happens if you have several pages with different types of scan?

Dunno; will try this later, as well as a few other things.

> But DjVu beats PDF hands down on file size.

True, but PDF is more readily available, unfortunately.
Bug#486115: PDF files from gscan2pdf are huge
2008/6/13 Jeff Licquia [EMAIL PROTECTED]:
> I suppose that would make more sense if the lossless PNM scan wasn't a
> megabyte smaller than the gscan2pdf PDF, or if I had passed exotic
> command-line arguments to the command-line convert tools.

Default seems to be PNG, which isn't bad for BW scans, and also for scans with limited numbers of colours. As your scans were colour, JPG would give the best size.

>> But DjVu beats PDF hands down on file size.
>
> True, but PDF is more readily available, unfortunately.

Evince reads DjVu files, so reading your own DjVu files is no problem. If you later want to give a scan to somebody else as a PDF, gscan2pdf will reimport its own DjVu files and write a PDF.
Bug#486115: PDF files from gscan2pdf are huge
Jeffrey Ratcliffe wrote:
> Default seems to be PNG, which isn't bad for BW scans, and also for
> scans with limited numbers of colours. As your scans were colour, JPG
> would give the best size.

I did some experimentation along those lines, and also to explore your earlier suggestion.

First of all, "a megabyte smaller than the uncompressed PNM" isn't correct; my mistake. (Order of magnitude off; didn't count the places.)

Here's my original, done a few new ways. For gscan2pdf, I imported the original PNM to ensure that the results would be comparable.

-rw-r--r-- 1 jeff jeff  1248907 2008-06-13 17:48 july-jpegcomp-2.pdf
-rw-r--r-- 1 jeff jeff   960727 2008-06-13 17:42 july-jpegcomp.pdf
-rw-r--r-- 1 jeff jeff   840424 2008-06-13 09:12 july.pdf
-rw-r--r-- 1 jeff jeff  4496216 2008-06-13 17:41 july-pngcomp.pdf
-rw-r-     1 jeff jeff 26810741 2008-06-13 09:07 july.pnm

As you can see, JPEG did do better than the default (PNG). The first was done at 75% quality, the second at 90%. But neither was as good as my command line.

For the second test, I scanned the front cover of this month's Linux Journal, with a photo of Matt Mullenweg on it; this, I thought, would be more photo-like. Here are those results:

-rw-r-     1 jeff jeff 26810741 2008-06-13 09:07 july.pnm
-rw-r--r-- 1 jeff jeff  3144781 2008-06-13 18:03 test-cmdline.pdf
-rw-r--r-- 1 jeff jeff  3558691 2008-06-13 18:00 test-jpegcomp.pdf
-rw-r--r-- 1 jeff jeff 22993707 2008-06-13 17:59 test-pngcomp.pdf
-rw-r-     1 jeff jeff 26810741 2008-06-13 17:58 test.pnm

Again, PNG was dismal, JPEG respectable, and the command line won again. Here's the command line I used to create that version:

cat july.pnm test.pnm | pnmtops | ps2pdf14 - test-cmdline.pdf

Note that at no time did I tell any of the utilities what kind of compression to use. And it seems to make the right decision with all my scans.

I don't think I'd quibble over the few kilobytes between JPEG and command-line. But I don't think I'm the only one who will be taken aback at just how big the PNG version is. The verdict from most people won't be "well, I need to tweak the settings to get it just right"; it'll be "gscan2pdf sucks" or, more likely, "Linux scanning sucks".
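To put the first listing above in proportion, each PDF can be expressed as a percentage of the uncompressed PNM. The byte counts below are copied from that listing; the helper name is hypothetical and the snippet is only arithmetic on the reported sizes, not a new measurement.

```python
# Sizes in bytes, copied from the first listing in the message above.
PNM_SIZE = 26810741
PDF_SIZES = {
    "july.pdf (command line)": 840424,
    "july-jpegcomp.pdf": 960727,
    "july-jpegcomp-2.pdf": 1248907,
    "july-pngcomp.pdf": 4496216,
}


def percent_of(size: int, reference: int) -> float:
    """Return size as a percentage of reference."""
    return 100.0 * size / reference


if __name__ == "__main__":
    # Print smallest first so the ranking is visible at a glance.
    for name, size in sorted(PDF_SIZES.items(), key=lambda kv: kv[1]):
        print(f"{name:26s} {percent_of(size, PNM_SIZE):5.1f}% of the PNM")
```

On these figures the PNG-compressed PDF is several times larger than either JPEG variant, while all four remain well below the raw PNM.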