Re: [CODE4LIB] anyone know about Inera?

2008-07-14 Thread Erik Hetzner
At Sat, 12 Jul 2008 10:46:06 -0400,
Godmar Back [EMAIL PROTECTED] wrote:

 Min, Eric, and others working in this domain -

 have you considered designing your software as a scalable web service
 from the get-go, using such frameworks as Google App Engine? You may
 be able to use Montepython for the CRF computations
 (http://montepython.sourceforge.net/)

 I know Min offers a WSDL wrapper around their software, but that's
 simply a gateway to one single-machine installation, and it's not
 intended as a production service at that.

Thanks for the link to montepython. It looks like it might be a good
tool for me to learn more about machine learning.

As for my citation metadata extractor, once the training data is
generated it would be trivial to scale it; there is no shared state.
All that is really needed is an implementation of the Viterbi
algorithm, and there is one (in pure Python) on the Wikipedia page; it
is about 20 lines of code. So presumably it could be scaled on the
Google app engine pretty easily. But it could be scaled on anything
pretty easily; all you need is a load balancer and however many
servers are necessary (not many, I would think).
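(For the curious, here is a toy sketch of that Viterbi decoder, roughly the ~20-line pure-Python shape found on the Wikipedia page -- the two states and all the probabilities below are invented for illustration, not taken from any real citation model:)

```python
# Toy Viterbi decoder.  States and probabilities are invented for
# illustration; a real citation model would have many more states and
# learned parameters.
def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s] = (probability of best path ending in state s at step t, path)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            V[t][s] = max(
                (V[t - 1][prev][0] * trans_p[prev][s] * emit_p[s][obs[t]],
                 V[t - 1][prev][1] + [s])
                for prev in states)
    return max(V[-1].values())  # (best probability, best state sequence)

states = ('author', 'title')
start_p = {'author': 0.7, 'title': 0.3}
trans_p = {'author': {'author': 0.6, 'title': 0.4},
           'title':  {'author': 0.1, 'title': 0.9}}
emit_p = {'author': {'name': 0.8, 'word': 0.2},
          'title':  {'name': 0.1, 'word': 0.9}}

prob, path = viterbi(['name', 'name', 'word'], states, start_p, trans_p, emit_p)
# path == ['author', 'author', 'title']
```

Since each call shares no state, any number of these can run in parallel behind a load balancer.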

best,
Erik Hetzner
;; Erik Hetzner, California Digital Library
;; gnupg key id: 1024D/01DB07E3




Re: [CODE4LIB] anyone know about Inera?

2008-07-13 Thread Min-Yen Kan
[I forgot to CC: this to the list, I've edited my reply a bit from the
original email to Godmar.]

Hi Godmar, all:

We'd love to do this and may consider doing so in the future.
As we are primarily a research unit, offering such services is
wonderful, but only possible when staff have time.

Just FYI, the web service we offer is running on one machine for the
public, but internally in our group we have a number of machines that
handle ParsCit web service calls that are brokered by a load balancing
mechanism; however, we cannot spare the computational bandwidth for our
public interfaces.  For us in-house, it is a production system (though
we have yet to really stress test this).  This is why we are hoping
others will find the system useful and install it on their own.

If someone does make ParsCit available in a scalable web service
environment free of charge, we'd certainly link to it from the main
ParsCit website.  Any takers for some volunteer work?

Cheers,

Min

PS - Godmar suggested that we (and others providing like web services)
consider designing our web services in a scalable way from the
beginning so that we don't have to worry or focus bandwidth on making
our services scalable.  I'm very happy to learn such technologies, if
someone can point us in the appropriate direction -- Google App or
otherwise.

On Sat, Jul 12, 2008 at 10:46 PM, Godmar Back [EMAIL PROTECTED] wrote:
 […]



-- 
Min-Yen KAN (Dr) :: Assistant Professor :: National University of
Singapore :: School of Computing, AS6 05-12, Law Link, Singapore
117590 :: 65-6516 1885(DID) :: 65-6779 4580 (Fax) ::
[EMAIL PROTECTED] (E) :: www.comp.nus.edu.sg/~kanmy (W)

Important: This email is confidential and may be privileged. If you
are not the intended recipient, please delete it and notify us
immediately; you should not copy or use it for any purpose, nor
disclose its contents to any other person. Thank you.


Re: [CODE4LIB] anyone know about Inera?

2008-07-12 Thread Min-Yen Kan
Hi Steve, all:

I'm the key developer of ParsCit.  I'm glad to hear your feedback
about what doesn't work with ParsCit.  Erik is correct in saying that
we have only trained the system for what data we have correct answers
for, namely computer science.  As such it doesn't perform well with
other data (especially health sciences citations, on which we have
also done some pilot tests).  I note that there are other citation
parsers out there, including Erik's own HMM parser (I think Erik
mentioned it as well; it is available from his website here:
http://gales.cdlib.org/~egh/hmm-citation-extractor/).

Anyways, I've tried your citation too, and got the same results from
the demo -- it doesn't handle the authors correctly in this case.  I
would very much love to have as many example cases of incorrectly
parsed citations as the community is willing to share with us so we
can improve ParsCit (it's open source so all can benefit from
improvements to ParsCit).

We are trying to be as proactive as possible about maintaining and
improving ParsCit.  I know of at least two groups that have said they
are willing to contribute more citations (with correct markings) to us
so that we can re-train ParsCit, and there is interest in porting it
to other languages (e.g. German right now).  We would love to get
samples of your data too, where the program does go wrong, to help
improve our system, and to get feedback on other fields that need to
be parsed as well: ISSN, ISBN, volume, and issue.

We are also looking to make the output of the ParsCit system
compatible with EndNote and BibTeX.  We actually have an internal project
to try to hook up ParsCit to find references on arbitrary web pages
(to form something like Zotero that's not site specific and
non-template based).  If and when this project comes to fruition we'll
be announcing it to the list.

If anyone has used ParsCit and has feedback on what can be further
improved we'd love to hear from you.  You are our target audience!

Cheers,

Min

-- 
Min-Yen KAN (Dr) :: Assistant Professor :: National University of
Singapore :: School of Computing, AS6 05-12, Law Link, Singapore
117590 :: 65-6516 1885(DID) :: 65-6779 4580 (Fax) ::
[EMAIL PROTECTED] (E) :: www.comp.nus.edu.sg/~kanmy (W)

PS: Hi Erik, still planning on studying your HMM package for improving
ParsCit ... It's on my agenda.
Thanks again.

On Sat, Jul 12, 2008 at 5:36 AM, Steve Oberg [EMAIL PROTECTED] wrote:
 Yeah, I am beginning to wonder, based on these really helpful replies, if I
 need to scale back to what is doable and reasonable. And reassess
 ParsCit.

 Thanks to all for this additional information.

 Steve

 On Fri, Jul 11, 2008 at 4:18 PM, Nate Vack [EMAIL PROTECTED] wrote:
 […]




Re: [CODE4LIB] anyone know about Inera?

2008-07-12 Thread Godmar Back
Min, Eric, and others working in this domain -

have you considered designing your software as a scalable web service
from the get-go, using such frameworks as Google App Engine? You may
be able to use Montepython for the CRF computations
(http://montepython.sourceforge.net/)

I know Min offers a WSDL wrapper around their software, but that's
simply a gateway to one single-machine installation, and it's not
intended as a production service at that.

 - Godmar

On Sat, Jul 12, 2008 at 3:20 AM, Min-Yen Kan [EMAIL PROTECTED] wrote:
 […]




[CODE4LIB] anyone know about Inera?

2008-07-11 Thread Steve Oberg
I recently became aware of a company that provides what it terms "reference
correction" software: Inera.  This is the company that powers the CrossRef
Simple Text Query box (http://www.crossref.org/freeTextQuery).

See http://www.inera.com/refcorrection.shtml for more details.

Does anyone on this list have any knowledge of this company? I'm just
wondering if it would be better to use what they have rather than continue
to possibly reinvent the wheel for citation parsing.

Steve


Re: [CODE4LIB] anyone know about Inera?

2008-07-11 Thread Jason Ronallo
Steve,
If you need citation parsing, rather than reference correction, maybe
this will work for you:
http://aye.comp.nus.edu.sg/parsCit/

I haven't had a chance to try it yet, though.

Jason

On Fri, Jul 11, 2008 at 11:51 AM, Steve Oberg [EMAIL PROTECTED] wrote:
 […]



Re: [CODE4LIB] anyone know about Inera?

2008-07-11 Thread Steve Oberg
Jason,

Thanks, yes, I knew of this effort and have actually spent a lot of time
working with this same software (or rather the same underlying software).
But I'm not sure it does enough or does it well enough for me at this point.
I'd like to take a list of one or two, up to hundreds of citations and dump
it into a web form and output SFX URLs as a result.

Steve

On Fri, Jul 11, 2008 at 1:51 PM, Jason Ronallo [EMAIL PROTECTED] wrote:

 […]



Re: [CODE4LIB] anyone know about Inera?

2008-07-11 Thread Nate Vack
Just out of curiosity, what makes parscit not optimal for this
purpose? Is it too slow? Not accurate enough?

I ask, as I've thought of doing similar things but haven't explored
the software deeply enough to know if it'd work.

Cheers,
-Nate

On Fri, Jul 11, 2008 at 2:44 PM, Steve Oberg [EMAIL PROTECTED] wrote:
 […]


Re: [CODE4LIB] anyone know about Inera?

2008-07-11 Thread Ross Singer
Actually, SFX is probably not going to care what the title is.

It's much more likely to care about the ISSN, volume and issue.

Now, if the matching targets are EBSCO or Proquest, you might have a
problem (since they accept inbound OpenURLs from SFX), but I'm not
sure, exactly.

How many of these things do you have?
-Ross.

On Fri, Jul 11, 2008 at 3:55 PM, Steve Oberg [EMAIL PROTECTED] wrote:
 One example:

 Here's the citation I have in hand:

 Noordzij M, Korevaar JC, Boeschoten EW, Dekker FW, Bos WJ, Krediet RT et al.
 The Kidney Disease Outcomes Quality Initiative (K/DOQI) Guideline for Bone
 Metabolism and Disease in CKD: association with mortality in dialysis
 patients. American Journal of Kidney Diseases 2005; 46(5):925-932.

 Here's the output from ParsCit. Note the problem with the article title:

 <algorithm name="ParsCit" version="1.0">
 <citationList>
 <citation>
 <authors>
 <author>M Noordzij</author>
 <author>Korevaar JC</author>
 </authors>
 <volume>2005</volume>
 <title>Boeschoten EW, Dekker FW, Bos WJ, Krediet RT et al. The Kidney
 Disease Outcomes Quality Initiative (K/DOQI) Guideline for Bone
 Metabolism and Disease in CKD: association with mortality in dialysis
 patients</title>
 <journal>American Journal of Kidney Diseases</journal>
 <pages>46--5</pages>
 </citation>
 </citationList>
 </algorithm>

 There's more but basically it isn't accurate enough. It's very good but not
 good enough for what I need at this juncture.  OpenURL resolvers like SFX
 are generally only as good as the metadata they are given to parse.  I need
 a high level of accuracy.

 Maybe that's a pipe dream.

 Steve

 On Fri, Jul 11, 2008 at 2:48 PM, Nate Vack [EMAIL PROTECTED] wrote:

 […]




Re: [CODE4LIB] anyone know about Inera?

2008-07-11 Thread Steve Oberg
Ross,


 Actually, SFX is probably not going to care what the title is.
 It's much more likely to care about the ISSN, volume and issue.


Yes, true. But linking to full text is only partly the issue when it comes
to using SFX in this way.  I also want to ensure that those articles that we
don't already have available in full text are directly routed to our
internal doc. delivery form (in SFX speak, using an svc.ill=yes in the
OpenURL).  This would of course mean that however the citation got parsed is
how that form is filled out.  Incorrect title information is a problem in
this case.
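
A rough sketch of the sort of URL construction involved (the base URL, sid value, and the ISSN shown are placeholders -- real SFX installs and OpenURL versions use differing key names):

```python
# Sketch: build an SFX OpenURL from parsed citation fields, forcing the
# request into the document delivery workflow with svc.ill=yes.
# Base URL, sid, and the ISSN value are illustrative placeholders.
from urllib.parse import urlencode

def sfx_url(base, fields, force_ill=False):
    params = {'sid': 'local:bulkparser', 'genre': 'article'}
    params.update({k: v for k, v in fields.items() if v})
    if force_ill:
        # route straight to the internal doc delivery form
        params['svc.ill'] = 'yes'
    return base + '?' + urlencode(params)

citation = {'issn': '0272-6386', 'volume': '46', 'issue': '5',
            'spage': '925', 'date': '2005'}
url = sfx_url('http://sfx.example.edu/local', citation, force_ill=True)
# url contains e.g. genre=article&issn=0272-6386 ... &svc.ill=yes
```

Of course, a bad parse upstream fills these fields with garbage, which is exactly the problem.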

 Now, if the matching targets are EBSCO or Proquest, you might have a
 problem (since they accept inbound OpenURLs from SFX), but I'm not
 sure, exactly.

 How many of these things do you have?


Literally, possibly thousands.  I can't divulge much detail about the
exact use (there we go again with that restriction on info).  But let's
just say there are many very large documents (in PDF or Word), each of
which contains between 100 and 400 article citations, that I am working with.
Why on earth try to provide article-level OpenURLs? Well, for many reasons.
I fully realize how much of a risk that is in terms of reliability and
maintenance.  But right now I just want a way to do this in bulk with a high
level of accuracy.

Steve


Re: [CODE4LIB] anyone know about Inera?

2008-07-11 Thread Erik Hetzner
At Fri, 11 Jul 2008 14:55:18 -0500,
Steve Oberg [EMAIL PROTECTED] wrote:
 
 One example:
 
 Here's the citation I have in hand:
 
 Noordzij M, Korevaar JC, Boeschoten EW, Dekker FW, Bos WJ, Krediet RT et al.
 The Kidney Disease Outcomes Quality Initiative (K/DOQI) Guideline for Bone
 Metabolism and Disease in CKD: association with mortality in dialysis
 patients. American Journal of Kidney Diseases 2005; 46(5):925-932.
 
 Here's the output from ParsCit. Note the problem with the article title:

 […]

The output is a little different from what I get from the parsCit web
service. The parsCit authors recently published a new paper on a new
version of their system with a new engine, which you might want to
look at [1].

 There's more but basically it isn't accurate enough. It's very good but not
 good enough for what I need at this juncture.  OpenURL resolvers like SFX
 are generally only as good as the metadata they are given to parse.  I need
 a high level of accuracy.
 
 Maybe that's a pipe dream.

I doubt that the software provided by Inera performs better than
parsCit. Inera does find a DOI for that citation but that is not
nearly so hard as determining which parts of a citation are which.
parsCit is pretty cutting edge and provides some of the best numbers I
have seen. The Flux-CiM system [2] also has pretty good numbers, but
the code for it is not available. I’ve also done a little bit of work
on this, which you might want to have a look at. [3]

One of the problems may be that the parsCit you are dealing with has
been trained on the Cora dataset of computer science citations. It is
a reasonably heterogeneous dataset of citations but it doesn’t have a
lot that looks like that health sciences format. If your citations are
largely drawn from the health sciences you might see about training it
on a health sciences dataset; you will probably get much better
results.

best,
Erik Hetzner

1. Isaac G. Councill, C. Lee Giles, Min-Yen Kan. (2008) ParsCit: An
open-source CRF reference string parsing package. In Proceedings of
the Language Resources and Evaluation Conference (LREC 08), Marrakech,
Morocco, May. Available from http://wing.comp.nus.edu.sg/parsCit/#p

2. Eli Cortez C. Vilarinho, Altigran Soares da Silva, Marcos André
Gonçalves, Filipe de Sá Mesquita, Edleno Silva de Moura. FLUX-CIM:
flexible unsupervised extraction of citation metadata. In Proceedings
of the 7th ACM/IEEE Joint Conference on Digital Libraries (JCDL 2007),
pp. 215-224.

3. Erik Hetzner. A simple method for citation metadata extraction
using hidden Markov models. In Proc. of the Joint Conf. on Digital
Libraries (JCDL 2008), Pittsburgh, Pa., 2008.
http://gales.cdlib.org/~egh/hmm-citation-extractor/
;; Erik Hetzner, California Digital Library
;; gnupg key id: 1024D/01DB07E3




Re: [CODE4LIB] anyone know about Inera?

2008-07-11 Thread Nate Vack
On Fri, Jul 11, 2008 at 3:57 PM, Steve Oberg [EMAIL PROTECTED] wrote:

 I fully realize how much of a risk that is in terms of reliability and
 maintenance.  But right now I just want a way to do this in bulk with a high
 level of accuracy.

How bad is it, really, if you get some (5%?) bad requests into your
document delivery system? Customers submit poor quality requests by
hand with some frequency, last I checked...

Especially if you can hack your system to deliver the original
citation all the way into your doc delivery system, you may be able to
make the case that 'this is a good service to offer; let's just deal
with the bad parses manually.'

Trying to solve this via pure technology is gonna get into a world of
diminishing returns. A surprising number of citations in references
sections are wrong. Some correct citations are really hard to parse,
even by humans who look at a lot of citations.

ParsCit has, in my limited testing, worked as well as anything I've
seen (commercial or OSS), and much better than most.

My $0.02,
-Nate