On Sat, 17 Sep 2005 11:25:59 -0700, David Forbes <[EMAIL PROTECTED]> wrote:
>Hi,
>
>I have placed raw 200DPI 8bit GIFs of the HP103AR manual on my website.
>
>There's no index file; the scans are at:
>
>http://www.nixiebunny.com/hp103ar/hp103ar01.gif
>
>through
>
>http://www.nixiebunny.com/hp103ar/hp103ar39.gif
>
>so you can wget them easily. The files average 2 megabytes each, so
>there are many extra ones and zeroes in there.
>
>Leading up to the next question:
>
>What's a good post-processing program to shrink the scans of text
>pages, possibly OCRing them, and make one big PDF file out of the
>lot? I know that a few folks on this list have done this work, but I
>don't know how they did it.
>
>If there's free/cheap software that works well, I'll get it and
>proceed, otherwise would one of the folks with such software step up
>to the plate and complete the job for us?
>
>If it's many hours of work, then some automated script that can
>shrink the text-only pages would be sufficient for that work, and a
>simple PDF maker would handle the rest.
>
>I await your suggestions.

I use Ulead Photoimpact. It's mainly targeted at editing photographs,
but has all the tools you need to clean the scans up. It has a lot of
the functionality of Photoshop but is a lot less expensive. I'm sure
there are other programs that could do the same, but I know and like
this one.

It does take a fair amount of work for something like the HP docs. I
cleaned up what you have, a good bit, and it took me 2-3 hrs. It also
is a complicated program. I'm pretty efficient with it now, but it took
a long time using it to get to know what's there and how to use it.

I could have done 80% of the improvement with batch mode commands, but
took the time to do more. The first thing I did was use
brightness/contrast tools to remove most of the gray. I also took the
time to manually edit out the punch holes and some other noise. One of
the keys to shrinking was to save them as gif files again but using
'grayscale 128' mode.

The pages with photographic pictures took extra work.
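
For anyone who would rather script the two batch-able steps above
(removing the gray and saving with 128 gray levels), here is a rough
Python sketch of the idea. It is not what Photoimpact does internally;
the function names and the black/white thresholds of 60 and 200 are
made up for illustration and would need tuning per scan. It works on a
plain list of 8-bit gray values, so reading and writing the actual
image files is left out.

```python
def stretch_levels(pixels, black=60, white=200):
    """Push near-black to 0 and near-white to 255, stretching in between.

    `black` and `white` are hypothetical thresholds, not values taken
    from the original scans.
    """
    span = white - black
    out = []
    for p in pixels:
        if p <= black:
            out.append(0)        # dark ink becomes solid black
        elif p >= white:
            out.append(255)      # gray paper background becomes white
        else:
            out.append(round((p - black) * 255 / span))
    return out


def quantize_gray(pixels, levels=128):
    """Reduce 8-bit gray values to `levels` evenly spaced shades."""
    step = 256 // levels         # 2 when levels is 128
    return [(p // step) * step for p in pixels]


if __name__ == "__main__":
    row = [30, 90, 120, 180, 230]   # one made-up row of scanned pixels
    print(quantize_gray(stretch_levels(row)))
```

Flattening the background this way gives long runs of identical
pixels, which is the kind of data GIF compression handles well.
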
There was a lot of artifacting from scanning the halftone images. I
used a combination of despeckle and blur filters to smooth them out
before increasing contrast. I also stitched together the scans of the
wide pages.

The end result is that the whole document is about 9.8 MB, as opposed
to 2-3 MB per page before. That could probably be cut about in half by
reducing resolution on the pages -- in most cases I think they could go
down about 50% and still be quite readable. I didn't do that, though.

The resulting gif files can be had in one zip file here:

http://www.xertech.net/data/hp103ar.zip

Feel free to copy it and share it anywhere.

A lot can theoretically be improved by adjusting settings at scan
time -- contrast, resolution, moire. My previous scanner let me control
a lot of stuff. My current HP seems to think I should not get involved.
It irritates me that I need to adjust contrast on every image after the
fact, rather than choosing a good scanner setting before I scan all
those pages.

As you suggest, OCRing the docs would shrink them the most, but every
OCR tool I have tried needs a HUGE amount of hand holding and
corrections to get anything close to the original text and format. I
have done it for a few things in the past, but it is very painful at
best.

Thanks for taking the time to do the scanning. Lots of good helpful
people here in this group.

_______________________________________________
time-nuts mailing list
time-nuts@febo.com
https://www.febo.com/cgi-bin/mailman/listinfo/time-nuts