Re: [CODE4LIB] Citation parsing?

2007-07-20 Thread Joe Atzberger
On 7/20/07, Eric Hellman <[EMAIL PROTECTED]> wrote: Have people been able to do a decent job of identifying parts of speech in natural language? I think trying to import broad NLP findings into our narrower problem of citation parsing is not likely to be fruitful but on the other hand ste

Re: [CODE4LIB] Citation parsing?

2007-07-20 Thread Nathan Vack
On Jul 20, 2007, at 9:14 AM, Eric Hellman wrote: Heuristics are perhaps the only way to deal with lack of consistent format. (i.e. "a cluster of words including "journal of" is likely to contain a journal name") You're right; in a lot of ways, it depends on what you consider a heuristic; every
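
The "journal of" heuristic quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `guess_journal_segments`, the decision to split the citation on punctuation, and the sample citation string are all made up for this sketch and are not taken from any parser mentioned in the thread.

```python
import re


def guess_journal_segments(citation: str) -> list[str]:
    """Return the punctuation-delimited segments that contain 'journal of'.

    Hypothetical helper: a single hand-written heuristic, not a full parser.
    """
    # Split on a period, comma, or semicolon followed by whitespace.
    segments = re.split(r"[.,;]\s+", citation)
    # Keep only segments that mention "journal of" (case-insensitive).
    return [
        segment.strip()
        for segment in segments
        if re.search(r"\bjournal of\b", segment, re.IGNORECASE)
    ]


# Made-up sample citation, for illustration only.
sample = "Smith, J. (1999). Parsing references. Journal of Documentation, 55(3), 1-10."
print(guess_journal_segments(sample))
```

A rule like this only fires on journals whose names happen to contain the phrase, and a misspelled or abbreviated name slips past it, which is the kind of inconsistency the thread describes.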

Re: [CODE4LIB] Citation parsing?

2007-07-20 Thread Eric Hellman
On Jul 18, 2007, at 10:04 PM, Eric Hellman wrote: Also, even in (many) scholarly journals, editorial consistency is almost unbelievably poor -- lots of times, the rules just aren't followed. Punctuation gets missed, journal names (especially abbreviations!) are misspelled... and so on. Rule-based

Re: [CODE4LIB] Citation parsing?

2007-07-19 Thread Nathan Vack
On Jul 18, 2007, at 10:04 PM, Eric Hellman wrote: Anyway, almost all parsers rely on a set of heuristics. I have not seen any parsers that do a good job of managing their heuristics in a scalable way. A successful open-source attack on this problem would have the following characteristics: 1. a

Re: [CODE4LIB] Citation parsing?

2007-07-18 Thread Eric Hellman
Having written a pretty decent citation parser 10 years ago (in AppleScript!), and having seen a lot of people take whacks at the problem, I have to say that it's pretty easy to write one that works on 70-80% of citations, particularly if you stick to one scholarly subject area. On the other hand,

Re: [CODE4LIB] Citation parsing?

2007-07-18 Thread Nathan Vack
It's on our list of Big Problems To Solve; I'm hoping to have time to tackle it later this year :) -n On Jul 18, 2007, at 12:57 PM, Jonathan Rochkind wrote: Ha! If it's not too difficult, then with all the time you've spent "looking at it extensively", how come you don't have a solution yet?

Re: [CODE4LIB] Citation parsing?

2007-07-18 Thread Jonathan Rochkind
Ha! If it's not too difficult, then with all the time you've spent "looking at it extensively", how come you don't have a solution yet? Just kidding. :) Jonathan Nathan Vack wrote: We've looked at this pretty extensively, and we're pretty certain there's nothing downloadable that does a "good

Re: [CODE4LIB] Citation parsing?

2007-07-18 Thread Nathan Vack
We've looked at this pretty extensively, and we're pretty certain there's nothing downloadable that does a "good enough" job. However, it's by no means impossible -- it seems to be undergrad thesis-level work in Singapore: http://wing.comp.nus.edu.sg/parsCit/ There used to be a paper describing

Re: [CODE4LIB] Citation parsing?

2007-07-18 Thread Alberto Accomazzi
Hi Jonathan, There is a Perl module by Mike Jewell which was written for this purpose: http://search.cpan.org/~mjewell/Biblio-Citation-Parser-1.10/ I am not using the code, so I can't comment on how well it may work for your purpose, but it's probably worth a look. -- Alberto On 7/17/07, Jon

Re: [CODE4LIB] Citation parsing?

2007-07-18 Thread Godmar Back
On 7/18/07, Jonathan Rochkind <[EMAIL PROTECTED]> wrote: Nice, that might be what I need. Maybe I'll take a look at the LibX code, it's open source, right? Google Scholar has no API--you're screen scraping it? Yes and yes. - Godmar

Re: [CODE4LIB] Citation parsing?

2007-07-18 Thread Godmar Back
On 7/18/07, Steve Toub <[EMAIL PROTECTED]> wrote: Agreed that a lookup against something like Google Scholar, Web of Science, or a set of federated search targets instance may yield better results. We've discussed but haven't done any testing. Use your LibX edition, Steve. I can also send a dra

Re: [CODE4LIB] Citation parsing?

2007-07-18 Thread Jonathan Rochkind
Nice, that might be what I need. Maybe I'll take a look at the LibX code, it's open source, right? Google Scholar has no API--you're screen scraping it? Jonathan Godmar Back wrote: A year or so ago a couple of students looked into this for LibX. There are a number of systems that people have p

Re: [CODE4LIB] Citation parsing?

2007-07-18 Thread Steve Toub
Godmar Back wrote: A year or so ago a couple of students looked into this for LibX. There are a number of systems that people have published about, although some are not available and none worked very well or were easy to get to work. The systems also varied in their computational complexity, wit

Re: [CODE4LIB] Citation parsing?

2007-07-17 Thread Godmar Back
A year or so ago a couple of students looked into this for LibX. There are a number of systems that people have published about, although some are not available and none worked very well or were easy to get to work. The systems also varied in their computational complexity, with some not suitable

[CODE4LIB] Citation parsing?

2007-07-17 Thread Jonathan Rochkind
Does anyone have any decent open source code to parse a citation? I'm talking about a completely narrative citation like someone might cut-and-paste from a bibliography or web page. I realize there are a number of different formats this could be in (not to mention the human error problems that alw