Hi all,
As Matt's problem is related to parsing citations, I would definitely have a
look at the tools Cindy cited, because going the regexp route will quickly
become a nightmare. Even if the citations were created following a common
reference style, there will inevitably be inconsistencies, and the OCR process
amplifies them. These tools already try to deal with that, so give them a try
(FreeCite lists other tools and libraries that attempt the same thing).
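
For instance, here is a minimal Python sketch of what calling FreeCite's demo
endpoint might look like. The /citations/create URL, the "citation" form
field, and the JSON response shape are taken from FreeCite's documentation,
but treat them as assumptions (and the demo server may no longer be up):

    # Send one citation string to the FreeCite demo server and print
    # the structured fields it returns.
    import requests

    FREECITE_URL = "http://freecite.library.brown.edu/citations/create"

    citation = ("Councill, I., Giles, C. L., & Kan, M. (2008). ParsCit: "
                "an open-source CRF reference string parsing package.")

    response = requests.post(
        FREECITE_URL,
        data={"citation": citation},             # one citation per request
        headers={"Accept": "application/json"},  # JSON, not the HTML demo page
        timeout=30,
    )
    response.raise_for_status()
    for parsed in response.json():               # a list, one dict per citation
        print(parsed)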

Looks like a fun project btw!

Regards,
Sylvain

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Harper, 
Cynthia
Sent: 18 June 2015 19:49
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Desiring Advice for Converting OCR Text into Metadata 
and/or a Database

Eric or others, do you know of any utility that converts a PDF to text while
retaining markup for where the font or font style changes? Or one that
converts a web page with its associated CSS and notes where font styles and
HTML text blocks stop and start? It seems that would be the starting point for
recognizing citation entities. I've seen the websites for FreeCite
http://freecite.library.brown.edu/ and ParsCit
http://aye.comp.nus.edu.sg/parsCit/ through web searches, but I don't know how
close they got to the Grail before becoming legend.
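
Something like the following Python sketch is the kind of thing I'm imagining,
as a starting point. It uses pdfminer.six, which does expose per-character
font names and sizes; the run-grouping logic here is purely illustrative, the
file name is a placeholder, and it assumes the PDF actually has a text layer
(a raw image scan would yield nothing):

    # Walk a PDF's text layer and yield runs of text that share the
    # same (fontname, size) pair -- a rough signal of style boundaries.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer

    def font_runs(path):
        current, buffer = None, []
        for page in extract_pages(path):
            for element in page:
                if not isinstance(element, LTTextContainer):
                    continue
                for line in element:
                    for char in line:
                        if not isinstance(char, LTChar):
                            continue  # skip LTAnno spacing markers
                        style = (char.fontname, round(char.size, 1))
                        if style != current and buffer:
                            yield current, "".join(buffer)
                            buffer = []
                        current = style
                        buffer.append(char.get_text())
        if buffer:
            yield current, "".join(buffer)

    for style, text in font_runs("bibliography.pdf"):
        print(style, repr(text[:60]))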

Cindy Harper

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Eric 
Lease Morgan
Sent: Thursday, June 18, 2015 1:04 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Desiring Advice for Converting OCR Text into Metadata 
and/or a Database

On Jun 18, 2015, at 12:02 PM, Matt Sherman <matt.r.sher...@gmail.com> wrote:

> I am working with a colleague on a side project which involves some
> scanned bibliographies and making them more web
> searchable/sortable/browse-able.
> While I am quite familiar with the metadata and organization aspects
> we need, I am at a bit of a loss on how to automate the process of
> putting the bibliography in a more structured format so that we can
> avoid going through hundreds of pages by hand.  I am pretty sure
> regular expressions are needed, but I have not had an instance where I
> needed to automate extracting data from one file type (PDF OCR or text
> extracted to a Word doc) and placing it into another (either a database
> or an XML file) with some enrichment.  I would appreciate any
> suggestions for approaches or tools to look into.  Thanks for any
> help/thoughts people can give.


If I understand your question correctly, then you have two problems to address:
1) converting PDF, Word, etc. files into plain text, and 2) marking up the
result (which is a bibliography) into structured data. Correct?

If so, and if your PDF documents have already been OCRed (or you have other
text-bearing files), then you can probably feed them to Tika to quickly and
easily extract the underlying plain text. [1] I wrote a brain-dead shell script
to run Tika in server mode and then convert Word (.docx) files. [2]
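
In case shell scripts are not your thing, here is a minimal Python sketch of
the same idea. It assumes a Tika server is already listening on its default
port (9998), started with something like "java -jar tika-server-standard.jar"
(the jar name varies across Tika releases), and the file name is just a
placeholder:

    # PUT one file to a running Tika server and print the plain text
    # Tika extracts from it.
    import requests

    TIKA_URL = "http://localhost:9998/tika"    # tika-server's default endpoint

    with open("bibliography.docx", "rb") as handle:
        response = requests.put(
            TIKA_URL,
            data=handle,                       # raw file body; Tika sniffs the type
            headers={"Accept": "text/plain"},  # ask for plain text back
            timeout=60,
        )
    response.raise_for_status()
    print(response.text)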

When it comes to marking up the result into structured data, well, good luck. I
think such an application is something Library Land has sought for a long time.
Can you say "Holy Grail"?

[1] Tika - https://tika.apache.org
[2] brain-dead script - 
https://gist.github.com/ericleasemorgan/c4e34ffad96c0221f1ff

—
Eric

