Re: [CODE4LIB] Citation parsing?

Steve Toub Wed, 18 Jul 2007 03:39:01 -0700

Godmar Back wrote:

A year or so ago a couple of students looked into this for LibX. There
are a number of systems that people have published about, although
some are not available and none worked very well or were easy to get
to work. The systems also varied in their computational complexity,
with some not suitable for interactive use. Google for "libx citation
sensing", or generally for citation extraction, automatic record
boundary detection or extraction. (Unfortunately, pubs.dlib.vt.edu
appears to be down at the moment - otherwise, Suresh Menon's report
contains a useful bibliography of work. I'll ping them.)


I've tested ParaTools
<http://search.cpan.org/src/MJEWELL/Biblio-Document-Parser-1.10/docs/html/intro.html>
but after it choked on most of its own examples, I tried looking elsewhere.

Inera's eXtyles refXpress claims to do this. You can see it in action
at: <http://www.crossref.org/SimpleTextQuery/>. Better than ParaTools
but still missed a lot of things I thought would have been obvious.
Inera said most of the issues I picked out were a problem with
CrossRef's implementation, but the cost of the product was so great that
I didn't explore further.

There was an interesting paper at JCDL 2007 on an unsupervised way of
doing this that had promising results
<http://doi.acm.org/10.1145/1255175.1255219> but I haven't found any of
their code online.

For citations that contain item titles (which is true for a majority,
but definitely not all citation styles) LibX's magic button uses
Scholar as a hidden backend to produce an actionable OpenURL. Combined
with a similarity analysis, this "magic button" functionality
produces a usable OpenURL in (on average) 81% of cases for a set of
400 randomly chosen citations from 4 widely read journals from 4
different areas published in 2006 [1].  With some fixes, we could
probably get this number up to 90%. Obviously, this approach only
works for individual use; Google would object to large-scale batch
use.
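
To make the "similarity analysis" step concrete, here is a minimal
sketch in Python. It is not LibX's actual code, which this thread does
not describe; it only illustrates one plausible check, namely measuring
how much of a title returned by a search backend actually appears in
the pasted citation. The function names and the 0.8 threshold are
assumptions made for illustration.

```python
import re


def normalize(text):
    """Lowercase the text, replace punctuation with spaces, split into words."""
    return re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split()


def title_coverage(citation, candidate_title):
    """Fraction of the candidate title's words that occur in the citation."""
    cited_words = set(normalize(citation))
    title_words = normalize(candidate_title)
    if not title_words:
        return 0.0
    hits = sum(1 for word in title_words if word in cited_words)
    return hits / len(title_words)


def accept(citation, candidate_title, threshold=0.8):
    """Hypothetical acceptance rule: keep the hit only if coverage is high."""
    return title_coverage(citation, candidate_title) >= threshold
```

A hit whose title is mostly absent from the pasted text would be
rejected rather than turned into an OpenURL. A real implementation
would likely also need to handle stop words and word order.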


Agreed that a lookup against something like Google Scholar, Web of
Science, or a set of federated search targets may yield better
results. We've discussed it but haven't done any testing.
       --SET

- Godmar

[1] Annette Bailey and Godmar Back, Retrieving Known Items with LibX.
The Serials Librarian, 2007. To appear.

On 7/17/07, Jonathan Rochkind <[EMAIL PROTECTED]> wrote:

Does anyone have any decent open source code to parse a citation? I'm
talking about a completely narrative citation like someone might
cut-and-paste from a bibliography or web page. I realize there are a
number of different formats this could be in (not to mention the human
error problems that always occur from human-entered free text)--but
thinking about it, I suspect that with some work you could get something
that worked reasonably well (if not perfect). So I'm wondering if anyone
has done this work.

(One of the commercial legal products--I forget if it's Lexis or
West--does this with legal citations--a more limited domain--quite
well.  I'm not sure if any of the commercial bibliographic citation
management software does this?)

The goal, as you can probably guess, is a box that the user can paste a
citation into; make an OpenURL out of it; show the user where to get the
citation.  I'm pretty confident something useful could be created here,
with enough time put into it. But sadly, it's probably more time than
anyone has individually. Unless someone's done it already?
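
The last step of that pipeline, turning already-parsed fields into an
OpenURL, is the easy part and can be sketched in a few lines of Python.
This assumes the hard parsing work has already produced a dictionary of
fields; the resolver address below is a placeholder, and the function
name is hypothetical. The keys follow the OpenURL 1.0 (Z39.88-2004)
key/encoded-value format for journal articles.

```python
from urllib.parse import urlencode


def build_openurl(fields, resolver="https://resolver.example.edu/openurl"):
    """Build an OpenURL 1.0 KEV query from parsed citation fields.

    `fields` maps journal-format keys (atitle, jtitle, aulast, date,
    volume, spage, ...) to strings. The resolver URL is a placeholder.
    """
    params = {
        "url_ver": "Z39.88-2004",
        "ctx_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
    }
    for key, value in fields.items():
        if value:  # skip fields the parser could not fill in
            params["rft." + key] = value
    return resolver + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_openurl({
        "atitle": "Retrieving Known Items with LibX",
        "jtitle": "The Serials Librarian",
        "date": "2007",
    }))
```

Empty fields are simply left out, so a partial parse still yields a
link the resolver can try to match.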

Hopefully,
Jonathan
