On Thu, Jan 23, 2014 at 2:16 AM, Dan Egli wrote:
> I was letting my mind wander last night, and I got to thinking about all
> these older magazines that I have stashed in various places. I kept them
> because they had interesting articles and the like. I was wondering if
> there was an easy way to convert them into a digital format, so I can
> recycle the dead trees. It's not just the idea of capturing the text and
> OCRing it, though. Many have pictures that are a vital part of the article.
> My first thought was to write a PDF, but I don't see an easy way to do
> that. All I can think of on that is to scan each page into an image file
> (jpeg or similar), then import them into a LibreOffice document, then save
> that document in PDF format. I imagine that would work, but it would also
> kill text searching, I'd think. I suppose I could scan the images in, then
> scan the text in, OCR it, and then re-format it for the PDF, but that seems
> like a LOT of work. Especially as I think I have over 50 old issues of
> various magazines lying around in storage that I'd like to convert.
>
>
>
> Does anyone know of any easy methods for converting articles on paper, with
> images, into something digitally readable? I don't care if it's PDF or ePub
> or something else, as long as it looks decent on the computer screen, and I
> can search text within the article.

There's a nice program called Scan Tailor, which is a GUI wrapper
around some command line tools, that helps to turn raw page scans into
a clean DjVu or PDF archive of the book/magazine.  I've not scanned
any books myself, but I did use it to clean up and shrink some messy
scans I've downloaded from others.  It lets you automate the splitting
of two-up scans, de-skew the pages, crop out margins, and re-center
with consistent margins.  You can also run de-speckle algorithms,
convert to mono *for text-only regions*, and blank out flaws/marks on
pages.  You end up with a directory of cleaned-up, uncompressed page
images, which you can then compile into your preferred container
format (PDF or DjVu) with some other command-line tools, possibly
with an OCR phase to embed a textual representation, which is what
makes the result searchable.  There are some relatively automated
open source OCR programs that can fit into this workflow, but I
haven't got to the point of doing that yet.
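
For what it's worth, here is a minimal sketch of what that OCR and
compile step might look like.  It is only an illustration, not
something I've run against a real magazine.  It assumes the
`tesseract` and `pdfunite` (poppler-utils) command-line tools are
installed, and that the cleaned-up page images sit in one directory;
the directory names, the output file name, and the helper function
names are all made up for the example.

```python
import re
import subprocess
from pathlib import Path


def natural_key(name):
    """Sort key so that page10.tif sorts after page9.tif."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def build_commands(page_images, work_dir, output_pdf, lang="eng"):
    """Return (ocr_cmds, merge_cmd) as argv lists, without running them.

    Assumption: `tesseract IMAGE OUTPUTBASE -l LANG pdf` writes
    OUTPUTBASE.pdf containing the page image plus a hidden text layer,
    and `pdfunite A.pdf B.pdf ... OUT.pdf` concatenates PDFs in order.
    """
    pages = sorted(page_images, key=natural_key)
    ocr_cmds = []
    page_pdfs = []
    for image in pages:
        base = str(Path(work_dir) / Path(image).stem)
        ocr_cmds.append(["tesseract", image, base, "-l", lang, "pdf"])
        page_pdfs.append(base + ".pdf")
    merge_cmd = ["pdfunite", *page_pdfs, output_pdf]
    return ocr_cmds, merge_cmd


if __name__ == "__main__":
    # Hypothetical layout: cleaned-up page images in ./out,
    # per-page searchable PDFs written to ./ocr.
    images = [str(p) for p in Path("out").glob("*.tif")]
    Path("ocr").mkdir(exist_ok=True)
    ocr_cmds, merge_cmd = build_commands(images, "ocr", "magazine.pdf")
    for cmd in ocr_cmds:
        subprocess.run(cmd, check=True)
    subprocess.run(merge_cmd, check=True)
```

Building the command lists separately from running them makes it easy
to print and inspect them before committing to a long OCR run over 50
issues' worth of pages.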

Regarding DjVu vs. PDF: It used to be that only DjVu supported
compression mechanisms that allowed you to layer and compose mono and
grayscale/color page regions, which meant that you could get a much
more efficient archive with DjVu at the cost of significantly reducing
the set of programs that would read your archive.  But sometime in the
last couple of years PDF gained some additional compression mechanisms
for bitmaps that allowed it to reach near-parity with DjVu in file
sizes (see archive.org for a whole lot of book scans in various
formats [https://archive.org/details/dasleidenunsersh00bras for
example]). The big advantage of PDF is that just about everyone has
PDF viewing software installed already, including phones and tablets.
Today's hi-res 10" tablet displays make wonderful PDF-reading
machines, and hopefully someday we'll have nice 10" or so e-ink
displays in affordable tablets as well.

I've included some links below to some resources I found useful.

    --Levi

[Scan Tailor]: http://scantailor.sourceforge.net/
[Scan Tailor Guide]: http://sourceforge.net/apps/mediawiki/scantailor/index.php?title=User_Guide
[DIY Book Scanning Info]: http://www.diybookscanner.org/

/*
PLUG: http://plug.org, #utah on irc.freenode.net
Unsubscribe: http://plug.org/mailman/options/plug
Don't fear the penguin.
*/
