Re: [CODE4LIB] indexing pdf files

danielle plumer Tue, 15 Sep 2009 07:52:26 -0700

My (much more primitive) version of the same thing involves reading and
annotating articles using my Tablet PC. Although I do get a variety of print
publications, I find I don't tend to annotate them as much anymore. I used
to use EndNote to do the metadata, then I switched to Zotero. I hadn't
thought to try to create a full-text search of the articles -- hmm.


-- 
Danielle Cunniff Plumer, Coordinator
Texas Heritage Digitization Initiative
Texas State Library and Archives Commission
512.463.5852 (phone) / 512.936.2306 (fax)
dplu...@tsl.state.tx.us
dcplu...@gmail.com


On Tue, Sep 15, 2009 at 8:31 AM, Eric Lease Morgan <emor...@nd.edu> wrote:

> I have been having fun recently indexing PDF files.
>
> For the past six months or so I have been keeping the articles I've read
> in a pile, and I was rather amazed at the size of the pile. It was about a
> foot tall. When I read these articles I "actively" read them -- meaning, I
> write, scribble, highlight, and annotate the text with my own special
> notation denoting names, keywords, definitions, citations, quotations, list
> items, examples, etc. This active reading process: 1) makes for better
> comprehension on my part, and 2) makes the articles easier to review and
> pick out the ideas I thought were salient. Being the librarian I am, I
> thought it might be cool ("kewl") to make the articles into a collection.
> Thus, the beginnings of Highlights & Annotations: A Value-Added Reading
> List.
>
> The techno-weenie process for creating and maintaining the content is
> something this community might find interesting:
>
>  1. Print article and read it actively.
>
>  2. Convert the printed article into a PDF
>    file -- complete with embedded OCR --
>    with my handy-dandy ScanSnap scanner. [1]
>
>  3. Use MyLibrary to create metadata (author,
>    title, date published, date read, note,
>    keywords, facet/term combinations, local
>    and remote URLs, etc.) describing the
>    article. [2]
>
>  4. Save the PDF to my file system.
>
>  5. Use pdftotext to extract the OCRed text
>    from the PDF and index it along with
>    the MyLibrary metadata using Solr. [3, 4]
>
>  6. Provide a searchable/browsable user
>    interface to the collection through a
>    mod_perl module. [5, 6]
>
> Software is never done, and if it were then it would be called hardware.
> Accordingly, I know there are some things I need to do before I can truly
> deem the system version 1.0. At the same time my excitement is overflowing
> and I thought I'd share some geekdom with my fellow hackers. Fun with PDF
> files and open source software.
>
>
> [1] ScanSnap - http://tinyurl.com/oafgwe
> [2] MyLibrary screen dump - http://infomotions.com/tmp/mylibrary.png
> [3] pdftotext - http://www.foolabs.com/xpdf/
> [4] Solr - http://lucene.apache.org/solr/
> [5] module source code - http://infomotions.com/highlights/Highlights.pl
> [6] user interface - http://infomotions.com/highlights/highlights.cgi
>
> --
> Eric Lease Morgan
> University of Notre Dame
>
>
>
>
> --
> Eric Lease Morgan
> Head, Digital Access and Information Architecture Department
> Hesburgh Libraries, University of Notre Dame
>
> (574) 631-8604
>
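
For anyone wanting to try steps 4 and 5 of the quoted workflow (extract the OCRed text with pdftotext, then index it with Solr alongside the hand-made metadata), here is a minimal sketch in Python. It is not Eric's code: his implementation is in Perl with MyLibrary, and this assumes a Solr version that has the JSON update handler, which is newer than what was available in 2009. The field names ("fulltext", "title"), the function names, and the Solr URL in the usage note are all hypothetical and would need to match your own schema and install.

```python
import json
import subprocess
import urllib.request


def extract_text(pdf_path):
    """Run pdftotext and return the PDF's embedded text layer as a string."""
    # Passing "-" as the output file asks pdftotext to write to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def build_solr_doc(doc_id, metadata, fulltext):
    """Merge hand-made metadata and extracted text into one Solr document.

    The "id" and "fulltext" field names are assumptions, not a known schema.
    """
    doc = dict(metadata)
    doc["id"] = doc_id                    # the explicit id always wins
    doc["fulltext"] = fulltext.strip()
    return doc


def build_update_request(solr_update_url, docs):
    """Build (but do not send) a JSON update request for Solr."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        solr_update_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_pdf(pdf_path, doc_id, metadata, solr_update_url):
    """Extract text from one PDF and post it to Solr; returns the HTTP status."""
    doc = build_solr_doc(doc_id, metadata, extract_text(pdf_path))
    request = build_update_request(solr_update_url, [doc])
    with urllib.request.urlopen(request) as response:
        return response.status
```

A call might look like index_pdf("article.pdf", "article-001", {"title": "Some article"}, "http://localhost:8983/solr/highlights/update?commit=true"), where the core name "highlights" is made up. Keeping the document-building step separate from the network call makes it easy to check what would be sent before touching the index.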
