On January 23, 2014, Levi Pearson wrote:
> There's a nice program called Scan Tailor, which is a GUI wrapper
> around some command line tools, that helps to turn raw pages into
> a clean DJVu or PDF archive of the book/magazine.

I'll have to look at that one, thanks! BTW, What's DJVu? PDF is obvious. I
don't think I've heard of DJVu before. :)

--- Dan

On Thu, Jan 23, 2014 at 9:26 PM, Levi Pearson <[email protected]> wrote:

> On Thu, Jan 23, 2014 at 2:16 AM, Dan Egli <[email protected]> wrote:
> > I was letting my mind wander last night, and I got to thinking about all
> > these older magazines that I have stashed in various places. I kept them
> > because they had interesting articles and the like. I was wondering if
> > there was an easy way to convert them into a digital format, so I can
> > recycle the dead trees. It's not just the idea of capturing the text and
> > OCRing it, though. Many have pictures that are a vital part of the
> > article.
> > My first thought was to write a PDF, but I don't see an easy way to do
> > that. All I can think of on that is to scan each page into an image file
> > (jpeg or similar), then import them into a LibreOffice document, then
> > save that document in PDF format. I imagine that would work, but it would
> > also kill text searching, I'd think. I suppose I could scan the images
> > in, then scan the text in, OCR it, and then re-format it for the PDF, but
> > that seems like a LOT of work. Especially as I think I have over 50 old
> > issues of various magazines lying around in storage that I'd like to
> > convert.
> >
> > Does anyone know of any easy methods for converting articles on paper,
> > with images, into something digitally readable? I don't care if it's PDF
> > or ePub or something else, as long as it looks decent on the computer
> > screen, and I can search text within the article.
>
> There's a nice program called Scan Tailor, which is a GUI wrapper
> around some command line tools, that helps to turn raw page scans into
> a clean DJVu or PDF archive of the book/magazine. I've not scanned
> any books myself, but I did use it to clean up and shrink some messy
> scans I've downloaded from others. It lets you automate the splitting
> of two-up scans, de-skew the pages, crop out margins, and re-center
> with consistent margins. You can also run de-speckle algorithms,
> convert to mono *for text-only regions* and blank out flaws/marks on
> pages. You end up with a directory of cleaned-up uncompressed page
> images, which you can then compile with some other command-line tools
> into your preferred container format (PDF or DJVu), possibly with an
> OCR phase to embed a textual representation as well, which enables
> searchability. There are some relatively automated open source OCR
> programs that can fit in this workflow and embed text for
> searchability into your PDF, but I haven't got to the point of doing
> that yet.
>
> Regarding DJVu vs. PDF: It used to be that only DJVu supported
> compression mechanisms that allowed you to layer and compose mono and
> grayscale/color page regions, which meant that you could get a much
> more efficient archive with DJVu at the cost of significantly reducing
> the set of programs that would read your archive. But sometime in the
> last couple of years PDF gained some additional compression mechanisms
> for bitmaps that allowed it to reach near-parity with DJVu in file
> sizes (see archive.org for a whole lot of book scans in various
> formats [https://archive.org/details/dasleidenunsersh00bras for
> example]). The big advantage of PDF is that just about everyone has
> PDF viewing software installed already, including phones and tablets.
> Today's hi-res 10" tablet displays make wonderful PDF-reading
> machines, and hopefully someday we'll have nice 10" or so e-ink
> displays in affordable tablets as well.
>
> I've included some links below to some resources I found useful.
>
>         --Levi
>
> [Scan Tailor]: http://scantailor.sourceforge.net/
> [Scan Tailor Guide]:
> http://sourceforge.net/apps/mediawiki/scantailor/index.php?title=User_Guide
> [DIY Book Scanning Info]: http://www.diybookscanner.org/
>
> /*
> PLUG: http://plug.org, #utah on irc.freenode.net
> Unsubscribe: http://plug.org/mailman/options/plug
> Don't fear the penguin.
> */
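
The last step Levi describes, compiling the cleaned-up page images into a
searchable PDF, could be scripted along these lines. This is only a sketch
under assumptions that are not from the thread: the tools are not named
above, so it assumes Tesseract (a build that supports the `pdf` output
config) for the OCR phase and pdfunite from poppler-utils for merging, and
it assumes the cleaned pages are .tif files whose names sort in page order.
The function names and the `ocr_pages` working directory are hypothetical.

```python
import subprocess
from pathlib import Path


def build_commands(page_dir, output_pdf, work_dir="ocr_pages"):
    """Return the argv lists that would OCR each page and merge the results.

    Assumptions (none of these come from the thread): cleaned pages are
    .tif files whose names sort in page order, `tesseract` supports the
    `pdf` output config, and `pdfunite` (poppler-utils) is installed.
    """
    pages = sorted(Path(page_dir).glob("*.tif"))
    if not pages:
        raise ValueError(f"no .tif pages found in {page_dir}")

    work = Path(work_dir)
    commands = []
    page_pdfs = []
    for page in pages:
        base = work / page.stem
        # tesseract <image> <output base> pdf  -> writes <output base>.pdf
        commands.append(["tesseract", str(page), str(base), "pdf"])
        page_pdfs.append(str(base) + ".pdf")

    # pdfunite <in1.pdf> <in2.pdf> ... <out.pdf>
    commands.append(["pdfunite", *page_pdfs, str(output_pdf)])
    return commands


def run(commands, work_dir="ocr_pages"):
    """Create the working directory, then execute the commands in order.

    check=True makes the loop stop at the first command that fails.
    """
    Path(work_dir).mkdir(parents=True, exist_ok=True)
    for argv in commands:
        subprocess.run(argv, check=True)
```

Calling build_commands("cleaned_pages", "magazine.pdf") only builds the
command lines so they can be inspected first; passing the result to run()
executes them. Keeping the two apart makes it easy to swap in a different
OCR or merge tool without touching the page-ordering logic.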
