My (much more primitive) version of the same thing involves reading and annotating articles using my Tablet PC. Although I do get a variety of print publications, I find I don't tend to annotate them as much anymore. I used to use EndNote to do the metadata, then I switched to Zotero. I hadn't thought to try to create a full-text search of the articles -- hmm.
--
Danielle Cunniff Plumer, Coordinator
Texas Heritage Digitization Initiative
Texas State Library and Archives Commission
512.463.5852 (phone) / 512.936.2306 (fax)
dplu...@tsl.state.tx.us
dcplu...@gmail.com

On Tue, Sep 15, 2009 at 8:31 AM, Eric Lease Morgan <emor...@nd.edu> wrote:
> I have been having fun recently indexing PDF files.
>
> For the past six months or so I have been keeping the articles I've read
> in a pile, and I was rather amazed at the size of the pile. It was about a
> foot tall. When I read these articles I "actively" read them -- meaning, I
> write, scribble, highlight, and annotate the text with my own special
> notation denoting names, keywords, definitions, citations, quotations, list
> items, examples, etc. This active reading process: 1) makes for better
> comprehension on my part, and 2) makes the articles easier to review and
> pick out the ideas I thought were salient. Being the librarian I am, I
> thought it might be cool ("kewl") to make the articles into a collection.
> Thus, the beginnings of Highlights & Annotations: A Value-Added Reading
> List.
>
> The techno-weenie process for creating and maintaining the content is
> something this community might find interesting:
>
>   1. Print article and read it actively.
>
>   2. Convert the printed article into a PDF
>      file -- complete with embedded OCR --
>      with my handy-dandy ScanSnap scanner. [1]
>
>   3. Use MyLibrary to create metadata (author,
>      title, date published, date read, note,
>      keywords, facet/term combinations, local
>      and remote URLs, etc.) describing the
>      article. [2]
>
>   4. Save the PDF to my file system.
>
>   5. Use pdftotext to extract the OCRed text
>      from the PDF and index it along with
>      the MyLibrary metadata using Solr. [3, 4]
>
>   6. Provide a searchable/browsable user
>      interface to the collection through a
>      mod_perl module. [5, 6]
>
> Software is never done, and if it were then it would be called hardware.
> Accordingly, I know there are some things I need to do before I can truly
> deem the system version 1.0. At the same time my excitement is overflowing
> and I thought I'd share some geekdom with my fellow hackers. Fun with PDF
> files and open source software.
>
> [1] ScanSnap - http://tinyurl.com/oafgwe
> [2] MyLibrary screen dump - http://infomotions.com/tmp/mylibrary.png
> [3] pdftotext - http://www.foolabs.com/xpdf/
> [4] Solr - http://lucene.apache.org/solr/
> [5] module source code - http://infomotions.com/highlights/Highlights.pl
> [6] user interface - http://infomotions.com/highlights/highlights.cgi
>
> --
> Eric Lease Morgan
> University of Notre Dame
>
> --
> Eric Lease Morgan
> Head, Digital Access and Information Architecture Department
> Hesburgh Libraries, University of Notre Dame
>
> (574) 631-8604
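
For anyone wanting to try the full-text part, step 5 of the quoted message (extract the OCRed text with pdftotext, then index it alongside the metadata in Solr) might be sketched roughly as below. This is not the code behind the Highlights module, which is a mod_perl program; it is a minimal Python illustration. The field names (id, fulltext, title, keywords), the helper names, and the idea of posting JSON to a Solr update handler are all assumptions made for the example, and the Solr schema would need matching fields.

```python
import json
import subprocess
import urllib.request


def extract_text(pdf_path):
    """Return the text layer of a PDF by running the pdftotext CLI."""
    # A final "-" makes pdftotext write to stdout instead of a .txt file.
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def build_solr_doc(doc_id, metadata, fulltext):
    """Merge metadata and extracted text into one Solr document (a dict).

    The field names "id" and "fulltext" are hypothetical; use whatever
    the local Solr schema defines.
    """
    doc = dict(metadata)
    doc["id"] = doc_id
    # Collapse newlines and the form feeds pdftotext emits between pages.
    doc["fulltext"] = " ".join(fulltext.split())
    return doc


def post_to_solr(update_url, docs):
    """POST a list of documents as JSON to a Solr update handler."""
    request = urllib.request.Request(
        update_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def index_pdf(pdf_path, doc_id, metadata, update_url):
    """Extract, build, and post one article in a single call."""
    doc = build_solr_doc(doc_id, metadata, extract_text(pdf_path))
    return post_to_solr(update_url, [doc])
```

A call might look like index_pdf("article.pdf", "article-001", {"title": "Some article", "keywords": ["ocr"]}, "http://localhost:8983/solr/highlights/update?commit=true"), where the file name, the ID, and the "highlights" core are placeholders. Keeping build_solr_doc separate from the extraction and posting steps makes it easy to check the document shape without a scanner, a PDF, or a running Solr instance.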