On Thu, Jan 23, 2014 at 2:16 AM, Dan Egli <[email protected]> wrote:
> I was letting my mind wander last night, and I got to thinking about all
> these older magazines that I have stashed in various places. I kept them
> because they had interesting articles and the like. I was wondering if
> there was an easy way to convert them into a digital format, so I can
> recycle the dead trees. It's not just the idea of capturing the text and
> OCRing it, though. Many have pictures that are a vital part of the article.
> My first thought was to write a PDF, but I don't see an easy way to do
> that. All I can think of on that is to scan each page into an image file
> (jpeg or similar), then import them into a LibreOffice document, then save
> that document in PDF format. I imagine that would work, but it would also
> kill text searching, I'd think. I suppose I could scan the images in, then
> scan the text in, OCR it, and then re-format it for the PDF, but that seems
> like a LOT of work. Especially as I think I have over 50 old issues of
> various magazines lying around in storage that I'd like to convert.
>
> Does anyone know of any easy methods for converting articles on paper, with
> images, into something digitally readable? I don't care if it's PDF or ePub
> or something else, as long as it looks decent on the computer screen, and I
> can search text within the article.
There's a nice program called Scan Tailor, which is a GUI wrapper around some command-line tools, that helps to turn raw page scans into a clean DjVu or PDF archive of the book/magazine. I've not scanned any books myself, but I did use it to clean up and shrink some messy scans I've downloaded from others.

It lets you automate the splitting of two-up scans, de-skew the pages, crop out margins, and re-center with consistent margins. You can also run de-speckle algorithms, convert to mono *for text-only regions*, and blank out flaws/marks on pages.

You end up with a directory of cleaned-up, uncompressed page images, which you can then compile with other command-line tools into your preferred container format (PDF or DjVu), possibly with an OCR phase that embeds a textual representation as well, which makes the archive searchable. There are some relatively automated open source OCR programs that can fit into this workflow and embed text into your PDF, but I haven't got to the point of doing that yet.

Regarding DjVu vs. PDF: It used to be that only DjVu supported compression mechanisms that allowed you to layer and compose mono and grayscale/color page regions, which meant that you could get a much more efficient archive with DjVu at the cost of significantly reducing the set of programs that would read your archive. But sometime in the last couple of years PDF gained some additional compression mechanisms for bitmaps that allowed it to reach near-parity with DjVu in file sizes (see archive.org for a whole lot of book scans in various formats [https://archive.org/details/dasleidenunsersh00bras for example]).

The big advantage of PDF is that just about everyone has PDF viewing software installed already, including on phones and tablets. Today's hi-res 10" tablet displays make wonderful PDF-reading machines, and hopefully someday we'll have nice 10" or so e-ink displays in affordable tablets as well.

I've included some links below to some resources I found useful.
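For the compile-and-OCR step described above, something like the following Python sketch might work. It assumes the Tesseract OCR command-line tool is installed with its `pdf` output config available, that the cleaned pages are TIFF files whose names sort into page order, and that a single searchable PDF is the goal. The function names and the `pages.txt` file name are made up for illustration; none of this comes from Scan Tailor itself.

```python
import subprocess
from pathlib import Path


def collect_pages(out_dir, pattern="*.tif"):
    """Return the cleaned page images sorted by file name.

    Assumption: the file names sort into the correct page order.
    """
    return sorted(Path(out_dir).glob(pattern))


def write_page_list(pages, list_path):
    """Write one image path per line.

    Tesseract accepts a text file like this in place of a single image
    and then produces one multi-page output document.
    """
    list_path = Path(list_path)
    list_path.write_text("".join(f"{page}\n" for page in pages))
    return list_path


def build_command(list_path, output_base, lang="eng"):
    """Build the tesseract call.

    The trailing "pdf" config asks for a PDF holding the page images plus
    a hidden text layer. Tesseract appends ".pdf" to output_base itself.
    """
    return ["tesseract", str(list_path), str(output_base), "-l", lang, "pdf"]


def make_searchable_pdf(out_dir, output_base, lang="eng"):
    """Hypothetical helper tying the steps together; requires tesseract on PATH."""
    pages = collect_pages(out_dir)
    if not pages:
        raise FileNotFoundError(f"no page images found in {out_dir}")
    list_path = write_page_list(pages, Path(out_dir) / "pages.txt")
    subprocess.run(build_command(list_path, output_base, lang), check=True)
```

Calling `make_searchable_pdf("out", "magazine")` should leave a `magazine.pdf` containing the page images with a searchable text layer, provided Tesseract is installed. Handing Tesseract one list file keeps the page order explicit and avoids a separate PDF-merging step.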
--Levi

[Scan Tailor]: http://scantailor.sourceforge.net/
[Scan Tailor Guide]: http://sourceforge.net/apps/mediawiki/scantailor/index.php?title=User_Guide
[DIY Book Scanning Info]: http://www.diybookscanner.org/

/*
PLUG: http://plug.org, #utah on irc.freenode.net
Unsubscribe: http://plug.org/mailman/options/plug
Don't fear the penguin.
*/
