Bug#486115: PDF files from gscan2pdf are huge

2008-07-24 Thread Jeffrey Ratcliffe
tag 486115 pending
thanks

I've worked up a patch for this. It will go in with the next upstream release.

The logic ended up being slightly different, as ImageMagick gives the
depth per channel:

if Depth = 1, compression = LZW
otherwise
if TrueColor, compression = JPG
otherwise
compression = PNG
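
The rule above can be written as a small function. This is only a sketch of the decision as described in this message, not the code from the patch: the function name, its arguments and the type strings ("Bilevel", "TrueColor", "Palette", modelled on ImageMagick's image type names) are assumptions for illustration.

```python
def choose_compression(depth_per_channel: int, image_type: str) -> str:
    """Pick a compression from the per-channel depth and the image type.

    Mirrors the three-way rule in the message: 1-bit images get LZW,
    TrueColor images get JPG, everything else gets PNG.
    """
    if depth_per_channel == 1:
        return "LZW"
    if image_type == "TrueColor":
        return "JPG"
    return "PNG"


# Hypothetical pages: (depth per channel, image type)
for depth, image_type in [(1, "Bilevel"), (8, "TrueColor"), (8, "Palette")]:
    print(depth, image_type, "->", choose_compression(depth, image_type))
```

Note that under this rule an 8-bit greyscale page falls through to PNG, which is one of the corner cases worth testing.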

When I upload the new release, please test the results to see if this
logic really does produce files of a reasonable size, and report any
corner cases you find.



-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]



Bug#486115: PDF files from gscan2pdf are huge

2008-06-15 Thread Jeffrey Ratcliffe
Just did some experimentation with photos. Took a JPG, converted it to tiff

$ convert ~/Photos/2008/01/12/IMG_4063.JPG IMG_4063.tif

forced the compression with libtiff

$ tiffcp -c jpeg:75 IMG_4063.tif IMG_4063.75.tif

wrote it as a PDF

$ tiff2pdf -o IMG_4063.75.pdf IMG_4063.75.tif

imported the same tiff into gscan2pdf and wrote it as a PDF:

$ ls -l ~/Photos/2008/01/12/IMG_4063.JPG IMG_4063.tif IMG_4063.75*
-rwxr-xr-x 1 jeff jeff 3286311 2008-01-13 17:31 /home/jeff/Photos/2008/01/12/IMG_4063.JPG
-rw-r--r-- 1 jeff jeff  632922 2008-06-15 09:41 IMG_4063.75g.pdf
-rw-r--r-- 1 jeff jeff  694867 2008-06-15 09:38 IMG_4063.75.pdf
-rw-r--r-- 1 jeff jeff  700976 2008-06-15 09:34 IMG_4063.75.tif
-rw-r--r-- 1 jeff jeff 7805388 2008-06-15 09:31 IMG_4063.tif

If I import the JPG directly into gscan2pdf, this becomes:

-rw-r--r-- 1 jeff jeff  637825 2008-06-15 09:47 IMG_4063.75g.pdf

which is still better than libtiff.

So - as you have already said, it is possible to get small PDFs from
gscan2pdf if you choose the appropriate compression.

The question then is - how best to help the user choose the
compression? Count the depth of the image - 1bit = LZW, 2-3 bit = PNG,
and >3bit = JPG? Have this as an extra automatic compression level? This
at least would be a sane way to deal with several pages with different
depths.
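
A rough sketch of that idea, applied page by page so that a document with mixed depths gets a sensible choice for each page. The function names and the list of page depths are hypothetical, and the last threshold is read here as "more than 3 bits", which is an assumption.

```python
def compression_for_depth(bits: int) -> str:
    """Map an image depth in bits to a compression, per the proposal above."""
    if bits < 1:
        raise ValueError("depth must be at least 1 bit")
    if bits == 1:
        return "LZW"
    if bits <= 3:
        return "PNG"
    return "JPG"  # assumed: anything deeper than 3 bits


def plan_document(page_depths):
    """Choose a compression independently for every page."""
    return [compression_for_depth(bits) for bits in page_depths]


# Hypothetical three-page document: a BW form, a 2-bit page and a colour photo
print(plan_document([1, 2, 24]))
```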






Bug#486115: PDF files from gscan2pdf are huge

2008-06-15 Thread Jeff Licquia

Jeffrey Ratcliffe wrote:

> So - as you have already said, it is possible to get small PDFs from
> gscan2pdf if you choose the appropriate compression.
>
> The question then is - how best to help the user choose the
> compression? Count the depth of the image - 1bit = LZW,
> 2-3 bit = PNG, >3bit = JPG? Have this as an extra automatic
> compression level? This at least would be a sane way to deal with
> several pages with different depths.


That seems reasonable.






Bug#486115: PDF files from gscan2pdf are huge

2008-06-14 Thread Jeffrey Ratcliffe
2008/6/14 Jeff Licquia [EMAIL PROTECTED]:
> I don't think I'd quibble over the few kilobytes between JPEG and
> command-line.  But I don't think I'm the only one who will be taken aback at
> just how big the PNG version is.  The verdict from most people won't be that
> "well, I need to tweak the settings to get it just right"; it'll be that
> "gscan2pdf sucks" or more likely "Linux scanning sucks".

Most of the scanning I do is BW (i.e. 1-bit). PNG does better there,
and JPG worst. DjVu is worth the extra hassle, though.

-rw-r--r-- 1 jeff jeff   13181 2008-06-14 07:55 5PsYMkaHJ3.djvu
-rw-r--r-- 1 jeff jeff   48858 2008-06-14 07:55 5PsYMkaHJ3.g4.pdf
-rw-r--r-- 1 jeff jeff   48858 2008-06-14 07:54 5PsYMkaHJ3.g3.pdf
-rw-r--r-- 1 jeff jeff   48858 2008-06-14 07:54 5PsYMkaHJ3.packbits.pdf
-rw-r--r-- 1 jeff jeff   48858 2008-06-14 07:54 5PsYMkaHJ3.zip.pdf
-rw-r--r-- 1 jeff jeff  432705 2008-06-14 07:53 5PsYMkaHJ3.jpg.pdf
-rw-r--r-- 1 jeff jeff 1042517 2008-06-14 07:52 5PsYMkaHJ3.pnm
-rw-r--r-- 1 jeff jeff   56204 2008-06-14 07:51 5PsYMkaHJ3.png.pdf
-rw-r--r-- 1 jeff jeff   48858 2008-06-14 07:51 5PsYMkaHJ3.lzw.pdf
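
To put the listing above in proportion, the sizes can be expressed relative to the uncompressed PNM. The byte counts below are copied from the listing; the dictionary keys and the script itself are only an illustrative sketch.

```python
# File sizes in bytes, taken from the listing above
sizes = {
    "pnm": 1042517,
    "jpg.pdf": 432705,
    "png.pdf": 56204,
    "lzw.pdf": 48858,  # g3, g4, packbits and zip are listed with the same size
    "djvu": 13181,
}

raw = sizes["pnm"]
# How many times smaller each output is than the uncompressed PNM
ratios = {name: raw / size for name, size in sizes.items() if name != "pnm"}

for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:8s} {sizes[name]:8d} bytes  {ratio:5.1f}x smaller than the PNM")
```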

I will investigate why JPG doesn't do as well as the commandline tools
(I have an idea).






Bug#486115: PDF files from gscan2pdf are huge

2008-06-13 Thread Jeff Licquia
Package: gscan2pdf
Version: 0.9.24-1
Severity: normal

-rw-r--r-- 1 jeff jeff  840424 2008-06-13 09:12 july.pdf
-rw-r--r-- 1 jeff jeff 3758009 2008-06-13 08:57 july-vacation-request.pdf

These are both 1-page PDFs, using the same piece of paper, and the
same scan settings.  The first was scanned using xsane to PNM, and
converted with the following command line:

[EMAIL PROTECTED]:~/tmp/timeoff$ pnmtops  july.pnm | ps2pdf14 - july.pdf

The second was generated with gscan2pdf, using mostly default
processing settings; I did turn off OCR, since I don't need it here.

The PNM file generated from the scan was smaller than the resulting
PDF file from gscan2pdf.  And the manually converted scan doesn't have
the benefit of processing with unpaper, so it has the paper borders,
little spots, etc.  Otherwise, the appearance of the two scans is not
discernibly different.

I'm sure there are settings I could have chosen that would duplicate
the results I got manually.  Those settings should be the default.

-- System Information:
Debian Release: lenny/sid
  APT prefers testing
  APT policy: (990, 'testing'), (500, 'unstable'), (500, 'stable'), (1, 'experimental')
Architecture: i386 (i686)

Kernel: Linux 2.6.24-1-686 (SMP w/1 CPU core)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash

Versions of packages gscan2pdf depends on:
ii  imagemagick         7:6.3.7.9.dfsg1-2+b2 image manipulation programs
ii  libconfig-general-p 2.38-1               Generic Configuration Module
ii  libgtk2-ex-simple-l 0.50-1.1             A simple interface to Gtk2's compl
ii  libgtk2-imageview-p 0.04-1+b1            Perl bindings for the GtkImageView
ii  liblocale-gettext-p 1.05-4               Using libc functions for internati
ii  libpdf-api2-perl    0.69-2               create or modify PDF documents in
ii  librsvg2-common     2.22.2-2             SAX-based renderer library for SVG
ii  libsane             1.0.19-10            API library for scanners
ii  libtiff-tools       3.8.2-8              TIFF manipulation and conversion t
ii  perlmagick          7:6.3.7.9.dfsg1-2+b2 Perl interface to the libMagick gr
ii  sane-utils          1.0.19-10            API library for scanners -- utilit

Versions of packages gscan2pdf recommends:
ii  djvulibre-bin             3.5.20-6   Utilities for the DjVu image forma
ii  gocr                      0.41-1+b1  A command line OCR
ii  libgtk2-ex-podviewer-perl 0.17-2     Perl Gtk2 widget for displaying Pl
ii  sane                      1.0.14-6   scanner graphical frontends
ii  tesseract-ocr             2.03-1     Command line OCR tool
ii  unpaper                   0.3-1      post-processing tool for scanned p
ii  xdg-utils                 1.0.2-4    desktop integration utilities from

-- no debconf information






Bug#486115: PDF files from gscan2pdf are huge

2008-06-13 Thread Jeffrey Ratcliffe
2008/6/13 Jeff Licquia [EMAIL PROTECTED]:
> I'm sure there are settings I could have chosen that would duplicate
> the results I got manually.  Those settings should be the default.

That is easy to say, but it depends a great deal on what sort of scan
it is. For BW scans, LZW seems to be the best, but some people prefer
the fax type G3 or G4 compression. For colour or greyscale, use JPG or
PNG.

So even a bit of logic to guess the type of scan wouldn't necessarily
produce the best results. It would have to try all the sensible
possibilities, which would not be very quick.
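
The "try everything" approach would look roughly like the sketch below: encode the page with every candidate and keep the smallest result. The standard-library compressors zlib, bz2 and lzma are used purely as stand-ins for the real image codecs (LZW, G3/G4, JPG, PNG), and the function name and sample data are hypothetical. The cost is one full encode per candidate per page, which is why it would not be quick.

```python
import bz2
import lzma
import zlib

# Stand-in encoders; a real implementation would write the page with each
# image compression instead of these general-purpose compressors.
ENCODERS = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def smallest_encoding(data: bytes, encoders=ENCODERS):
    """Encode data with every candidate and return the smallest one's name."""
    sizes = {name: len(encode(data)) for name, encode in encoders.items()}
    best = min(sizes, key=sizes.get)
    return best, sizes


# Hypothetical "page": a blank run followed by some repeating content
page = bytes(1000) + bytes(range(256)) * 4
best, sizes = smallest_encoding(page)
print(best, sizes)
```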

What happens if you have several pages with different types of scan?

gscan2pdf does remember the settings you used last.

But DjVu beats PDF hands down on file size.

Regards

Jeff






Bug#486115: PDF files from gscan2pdf are huge

2008-06-13 Thread Jeff Licquia

Jeffrey Ratcliffe wrote:

> That is easy to say, but it depends a great deal on what sort of scan
> it is. For BW scans, LZW seems to be the best, but some people prefer
> the fax type G3 or G4 compression. For colour or greyscale, use JPG or
> PNG.


I suppose that would make more sense if the lossless PNM scan wasn't a 
megabyte smaller than the gscan2pdf PDF, or if I had passed exotic 
command-line arguments to the command-line convert tools.


For the record, these were all 300-dpi scans in color, although the 
source was just a black-and-white form.  (Greyscale seems to have a 
different bug not related to gscan2pdf, which makes it unusable.)



> What happens if you have several pages with different types of scan?


Dunno; will try this later, as well as a few other things.


> But DjVu beats PDF hands down on file size.


True, but PDF is more readily available, unfortunately.







Bug#486115: PDF files from gscan2pdf are huge

2008-06-13 Thread Jeffrey Ratcliffe
2008/6/13 Jeff Licquia [EMAIL PROTECTED]:
> I suppose that would make more sense if the lossless PNM scan wasn't a
> megabyte smaller than the gscan2pdf PDF, or if I had passed exotic
> command-line arguments to the command-line convert tools.

Default seems to be PNG, which isn't bad for BW scans, and also for
scans with limited numbers of colours. As your scans were colour, JPG
would give the best size.

>> But DjVu beats PDF hands down on file size.
>
> True, but PDF is more readily available, unfortunately.

Evince reads DjVu files, so reading your own DjVu files is no problem.
If you later want to give the scan to somebody else as a PDF, gscan2pdf
will reimport its own DjVu files and write a PDF.






Bug#486115: PDF files from gscan2pdf are huge

2008-06-13 Thread Jeff Licquia

Jeffrey Ratcliffe wrote:

> Default seems to be PNG, which isn't bad for BW scans, and also for
> scans with limited numbers of colours. As your scans were colour, JPG
> would give the best size.


I did some experimentation along those lines, and also to explore your 
earlier suggestion.


First of all, "a megabyte smaller than the uncompressed PNM" isn't
correct; my mistake.  (Order of magnitude off; didn't count the places.)


Here's my original, done a few new ways.  For gscan2pdf, I imported the 
original PNM to ensure that the results would be comparable.


-rw-r--r-- 1 jeff jeff  1248907 2008-06-13 17:48 july-jpegcomp-2.pdf
-rw-r--r-- 1 jeff jeff   960727 2008-06-13 17:42 july-jpegcomp.pdf
-rw-r--r-- 1 jeff jeff   840424 2008-06-13 09:12 july.pdf
-rw-r--r-- 1 jeff jeff  4496216 2008-06-13 17:41 july-pngcomp.pdf
-rw-r- 1 jeff jeff 26810741 2008-06-13 09:07 july.pnm

As you can see JPEG did do better than the default (PNG).  The first was 
done at 75% quality, the second at 90%.  But neither was as good as my 
command line.


For the second test, I scanned the front cover of this month's Linux 
Journal, with a photo of Matt Mullenweg on it; this, I thought, would be 
more photo-like.  Here's those results:


-rw-r- 1 jeff jeff 26810741 2008-06-13 09:07 july.pnm
-rw-r--r-- 1 jeff jeff  3144781 2008-06-13 18:03 test-cmdline.pdf
-rw-r--r-- 1 jeff jeff  3558691 2008-06-13 18:00 test-jpegcomp.pdf
-rw-r--r-- 1 jeff jeff 22993707 2008-06-13 17:59 test-pngcomp.pdf
-rw-r- 1 jeff jeff 26810741 2008-06-13 17:58 test.pnm

Again, PNG was dismal, JPEG respectable, and the command line won again.
Here's the command line I used to create that version:


cat july.pnm test.pnm | pnmtops | ps2pdf14 - test-cmdline.pdf

Note that at no time did I tell any of the utilities what kind of 
compression to use.  And it seems to make the right decision with all my 
scans.


I don't think I'd quibble over the few kilobytes between JPEG and
command-line.  But I don't think I'm the only one who will be taken
aback at just how big the PNG version is.  The verdict from most people
won't be that "well, I need to tweak the settings to get it just right";
it'll be that "gscan2pdf sucks" or more likely "Linux scanning sucks".



