Re: [CODE4LIB] anyone know about Inera?
At Sat, 12 Jul 2008 10:46:06 -0400, Godmar Back [EMAIL PROTECTED] wrote:

> Min, Eric, and others working in this domain - have you considered designing your software as a scalable web service from the get-go, using such frameworks as Google App Engine? You may be able to use Montepython for the CRF computations (http://montepython.sourceforge.net/). I know Min offers a WSDL wrapper around their software, but that's simply a gateway to one single-machine installation, and it's not intended as a production service at that.

Thanks for the link to Montepython. It looks like it might be a good tool for me to learn more about machine learning.

As for my citation metadata extractor, once the training data is generated it would be trivial to scale: there is no shared state. All that is really needed is an implementation of the Viterbi algorithm; there is one (in pure Python, about 20 lines of code) on the Wikipedia page. So presumably it could be scaled on Google App Engine pretty easily. But it could be scaled on anything pretty easily; all you need is a load balancer and however many servers are necessary (not many, I would think).

best, Erik Hetzner

;; Erik Hetzner, California Digital Library
;; gnupg key id: 1024D/01DB07E3
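[Editor's note: for readers unfamiliar with the algorithm Erik mentions, here is a minimal pure-Python Viterbi decoder in the spirit of the Wikipedia version he refers to. The toy states, observations, and probability tables below are illustrative only, not taken from any actual citation-parsing system.]

```python
# A minimal Viterbi decoder, along the lines of the pure-Python version on
# the Wikipedia page. The toy model below (AUTHOR/TITLE states over crude
# token classes) is a made-up illustration of citation-field tagging.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most probable state sequence."""
    # V[t][s]: probability of the best path ending in state s at time t.
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    path = {s: [s] for s in states}
    for t in range(1, len(obs)):
        V.append({})
        new_path = {}
        for s in states:
            # Pick the predecessor that maximizes the path probability.
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            V[t][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    prob, final = max((V[-1][s], s) for s in states)
    return prob, path[final]

# Toy example: tag citation tokens as AUTHOR or TITLE.
states = ("AUTHOR", "TITLE")
start_p = {"AUTHOR": 0.9, "TITLE": 0.1}
trans_p = {"AUTHOR": {"AUTHOR": 0.6, "TITLE": 0.4},
           "TITLE":  {"AUTHOR": 0.1, "TITLE": 0.9}}
emit_p = {"AUTHOR": {"name": 0.8, "word": 0.2},
          "TITLE":  {"name": 0.2, "word": 0.8}}

prob, tags = viterbi(["name", "name", "word", "word"],
                     states, start_p, trans_p, emit_p)
# tags -> ["AUTHOR", "AUTHOR", "TITLE", "TITLE"]
```

Since each input sequence is decoded independently, this is exactly the "no shared state" property Erik describes: any number of such decoders can run behind a load balancer.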
Re: [CODE4LIB] anyone know about Inera?
[I forgot to CC: this to the list; I've edited my reply a bit from the original email to Godmar.]

Hi Godmar, all:

We'd love to do this and may consider doing so in the future. As we are primarily a research unit, offering such services is wonderful, but only when staff have time.

Just FYI, the web service we offer runs on one machine for the public; internally, our group has a number of machines that handle ParsCit web service calls, brokered by a load-balancing mechanism, but we cannot spare that computational bandwidth for our public interfaces. For us in-house it is a production system (though we have yet to really stress-test it). This is why we are hoping others will find the system useful and install it on their own.

If someone does make ParsCit available in a scalable web service environment free of charge, we'd certainly link to it from the main ParsCit website. Any takers for some volunteer work?

Cheers, Min

PS - Godmar suggested that we (and others providing similar web services) consider designing our web services in a scalable way from the beginning, so that we don't have to worry about or spend bandwidth on making our services scalable. I'm very happy to learn such technologies if someone can point us in the appropriate direction -- Google App Engine or otherwise.

On Sat, Jul 12, 2008 at 10:46 PM, Godmar Back [EMAIL PROTECTED] wrote:

> Min, Eric, and others working in this domain - have you considered designing your software as a scalable web service from the get-go, using such frameworks as Google App Engine? [...]
> - Godmar

--
Min-Yen KAN (Dr) :: Assistant Professor :: National University of Singapore
:: School of Computing, AS6 05-12, Law Link, Singapore 117590
:: 65-6516 1885 (DID) :: 65-6779 4580 (Fax)
:: [EMAIL PROTECTED] (E) :: www.comp.nus.edu.sg/~kanmy (W)
Re: [CODE4LIB] anyone know about Inera?
Hi Steve, all:

I'm the key developer of ParsCit. I'm glad to hear your feedback about what doesn't work with ParsCit. Erik is correct in saying that we have only trained the system on the data we have correct answers for, namely computer science. As such it doesn't perform well with other data (especially health-sciences citations, on which we have also done some pilot tests). I note that there are other citation parsers out there, including Erik's own HMM parser (I think Erik mentioned it as well; it is available from his website here: http://gales.cdlib.org/~egh/hmm-citation-extractor/).

Anyway, I've tried your citation too, and got the same results from the demo -- it doesn't handle the authors correctly in this case. I would very much love to have as many example cases of incorrectly parsed citations as the community is willing to share with us, so we can improve ParsCit (it's open source, so all can benefit from improvements). We are trying to be as proactive as possible about maintaining and improving ParsCit. I know of at least two groups that have said they are willing to contribute more citations (with correct markings) to us so that we can re-train ParsCit, and there is interest in porting it to other languages (e.g., German right now). We would love to get samples of your data too, where the program does go wrong, to help improve our system, and to get feedback on other fields that need to be parsed as well: ISSNs, ISBNs, volumes, and issues.

We are also looking to make the output of the ParsCit system compatible with EndNote and BibTeX. We actually have an internal project to try to hook ParsCit up to find references on arbitrary web pages (to form something like Zotero that's not site-specific and not template-based). If and when this project comes to fruition, we'll announce it to the list. If anyone has used ParsCit and has feedback on what can be further improved, we'd love to hear from you. You are our target audience!
Cheers,
Min

--
Min-Yen KAN (Dr) :: National University of Singapore :: www.comp.nus.edu.sg/~kanmy

PS: Hi Erik, I'm still planning on studying your HMM package for improving ParsCit ... it's on my agenda. Thanks again.

On Sat, Jul 12, 2008 at 5:36 AM, Steve Oberg [EMAIL PROTECTED] wrote:

> Yeah, I am beginning to wonder, based on these really helpful replies, if I need to scale back to what is doable and reasonable. And reassess ParsCit. Thanks to all for this additional information. [...]
Re: [CODE4LIB] anyone know about Inera?
Min, Eric, and others working in this domain - have you considered designing your software as a scalable web service from the get-go, using such frameworks as Google App Engine? You may be able to use Montepython for the CRF computations (http://montepython.sourceforge.net/).

I know Min offers a WSDL wrapper around their software, but that's simply a gateway to one single-machine installation, and it's not intended as a production service at that.

- Godmar

On Sat, Jul 12, 2008 at 3:20 AM, Min-Yen Kan [EMAIL PROTECTED] wrote:

> Hi Steve, all: I'm the key developer of ParsCit. I'm glad to hear your feedback about what doesn't work with ParsCit. [...]
[CODE4LIB] anyone know about Inera?
I recently became aware of a company that provides what it terms "reference correction" software: Inera. This is the company that powers the CrossRef Simple Text Query box (http://www.crossref.org/freeTextQuery). See http://www.inera.com/refcorrection.shtml for more details.

Does anyone on this list have any knowledge of this company? I'm just wondering if it would be better to use what they have rather than continue to possibly reinvent the wheel for citation parsing.

Steve
Re: [CODE4LIB] anyone know about Inera?
Steve,

If you need citation parsing, rather than reference correction, maybe this will work for you: http://aye.comp.nus.edu.sg/parsCit/ (I haven't had a chance to try it yet, though.)

Jason

On Fri, Jul 11, 2008 at 11:51 AM, Steve Oberg [EMAIL PROTECTED] wrote:

> I recently became aware of a company that provides what it terms reference correction software: Inera. [...]
Re: [CODE4LIB] anyone know about Inera?
Jason,

Thanks, yes, I knew of this effort and have actually spent a lot of time working with this same software (or rather the same underlying software). But I'm not sure it does enough, or does it well enough, for me at this point. I'd like to take a list of anywhere from one or two up to hundreds of citations, dump it into a web form, and get SFX URLs as output.

Steve

On Fri, Jul 11, 2008 at 1:51 PM, Jason Ronallo [EMAIL PROTECTED] wrote:

> Steve, If you need citation parsing, rather than reference correction, maybe this will work for you: http://aye.comp.nus.edu.sg/parsCit/ [...]
Re: [CODE4LIB] anyone know about Inera?
Just out of curiosity, what makes ParsCit not optimal for this purpose? Is it too slow? Not accurate enough? I ask, as I've thought of doing similar things but haven't explored the software deeply enough to know if it'd work.

Cheers,
-Nate

On Fri, Jul 11, 2008 at 2:44 PM, Steve Oberg [EMAIL PROTECTED] wrote:

> Jason, Thanks, yes, I knew of this effort and have actually spent a lot of time working with this same software (or rather the same underlying software). But I'm not sure it does enough or does it well enough for me at this point. [...]
Re: [CODE4LIB] anyone know about Inera?
Actually, SFX is probably not going to care what the title is. It's much more likely to care about the ISSN, volume, and issue. Now, if the matching targets are EBSCO or ProQuest, you might have a problem (since they accept inbound OpenURLs from SFX), but I'm not sure, exactly. How many of these things do you have?

-Ross.

On Fri, Jul 11, 2008 at 3:55 PM, Steve Oberg [EMAIL PROTECTED] wrote:

> One example. Here's the citation I have in hand:
>
> Noordzij M, Korevaar JC, Boeschoten EW, Dekker FW, Bos WJ, Krediet RT et al. The Kidney Disease Outcomes Quality Initiative (K/DOQI) Guideline for Bone Metabolism and Disease in CKD: association with mortality in dialysis patients. American Journal of Kidney Diseases 2005; 46(5):925-932.
>
> Here's the output from ParsCit. Note the problem with the article title:
>
> <algorithm name="ParsCit" version="1.0">
>   <citationList>
>     <citation>
>       <authors>
>         <author>M Noordzij</author>
>         <author>Korevaar JC</author>
>       </authors>
>       <volume>2005</volume>
>       <title>Boeschoten EW, Dekker FW, Bos WJ, Krediet RT et al. The Kidney Disease Outcomes Quality Initiative (K/DOQI) Guideline for Bone Metabolism and Disease in CKD: association with mortality in dialysis patients</title>
>       <journal>American Journal of Kidney Diseases</journal>
>       <pages>46--5</pages>
>     </citation>
>   </citationList>
> </algorithm>
>
> There's more, but basically it isn't accurate enough. It's very good, but not good enough for what I need at this juncture. OpenURL resolvers like SFX are generally only as good as the metadata they are given to parse. I need a high level of accuracy. Maybe that's a pipe dream.
>
> Steve
>
> On Fri, Jul 11, 2008 at 2:48 PM, Nate Vack [EMAIL PROTECTED] wrote:
>
>> Just out of curiosity, what makes ParsCit not optimal for this purpose? Is it too slow? Not accurate enough? [...]
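[Editor's note: for anyone wanting to post-process such output in bulk, here is a sketch of pulling fields out of ParsCit-style XML with the Python standard library. It assumes only the element names visible in Steve's sample (`citation`, `author`, `title`, `journal`, `volume`, `pages`); other fields ParsCit may emit are not assumed.]

```python
# Sketch: extract citation fields from ParsCit-style XML using only the
# element names seen in the sample output quoted in this thread.

import xml.etree.ElementTree as ET

PARSCIT_XML = """
<algorithm name="ParsCit" version="1.0">
  <citationList>
    <citation>
      <authors>
        <author>M Noordzij</author>
        <author>Korevaar JC</author>
      </authors>
      <volume>2005</volume>
      <title>The Kidney Disease Outcomes Quality Initiative (K/DOQI) Guideline</title>
      <journal>American Journal of Kidney Diseases</journal>
      <pages>46--5</pages>
    </citation>
  </citationList>
</algorithm>
"""

def extract_citations(xml_text):
    """Return a list of dicts, one per <citation> element."""
    root = ET.fromstring(xml_text)
    results = []
    for cit in root.iter("citation"):
        fields = {"authors": [a.text for a in cit.iter("author")]}
        for tag in ("title", "journal", "volume", "pages"):
            el = cit.find(tag)
            if el is not None:
                fields[tag] = el.text
        results.append(fields)
    return results

citations = extract_citations(PARSCIT_XML)
```

A batch script along these lines could flag suspect parses automatically, e.g. a `volume` that looks like a year, or `pages` that don't match the `start-end` pattern, before anything reaches a link resolver.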
Re: [CODE4LIB] anyone know about Inera?
Ross,

> Actually, SFX is probably not going to care what the title is. It's much more likely to care about the ISSN, volume and issue.

Yes, true. But linking to full text is only partly the issue when it comes to using SFX in this way. I also want to ensure that those articles we don't already have available in full text are directly routed to our internal doc delivery form (in SFX speak, using svc.ill=yes in the OpenURL). This of course means that however the citation got parsed is how that form gets filled out. Incorrect title information is a problem in this case.

> Now, if the matching targets are EBSCO or ProQuest, you might have a problem (since they accept inbound OpenURLs from SFX), but I'm not sure, exactly. How many of these things do you have?

Literally, possibly thousands. I can't divulge a great amount of detail (there we go again with that restriction on info.) of the exact use. But let's just say there are many very large documents (in PDF or Word), each of which contains between 100 and 400 article citations, that I am working with. Why on earth try to provide article-level OpenURLs? Well, for many reasons. I fully realize how much of a risk that is in terms of reliability and maintenance. But right now I just want a way to do this in bulk with a high level of accuracy.

Steve
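[Editor's note: the routing Steve describes can be sketched as below. The resolver base URL is a placeholder, and the query keys follow common OpenURL 0.1 conventions (genre, atitle, title, volume, ...); the exact keys and the svc.ill flag behavior should be checked against your own SFX instance.]

```python
# Sketch: build an SFX OpenURL from parsed citation fields, optionally adding
# the svc.ill=yes service flag to route the request to document delivery.
# SFX_BASE is a placeholder; key names follow common OpenURL 0.1 usage.

from urllib.parse import urlencode

SFX_BASE = "http://sfx.example.edu/sfx_local"  # placeholder resolver URL

def citation_to_openurl(fields, request_ill=False):
    params = {
        "genre": "article",
        "atitle": fields.get("title", ""),
        "title": fields.get("journal", ""),
        "volume": fields.get("volume", ""),
        "issue": fields.get("issue", ""),
        "pages": fields.get("pages", ""),
        "date": fields.get("date", ""),
    }
    # Drop empty values so the query string stays clean.
    params = {k: v for k, v in params.items() if v}
    if request_ill:
        # Route straight to the internal doc delivery form.
        params["svc.ill"] = "yes"
    return SFX_BASE + "?" + urlencode(params)

url = citation_to_openurl(
    {"title": "K/DOQI Guideline", "journal": "American Journal of Kidney Diseases",
     "volume": "46", "issue": "5", "pages": "925-932", "date": "2005"},
    request_ill=True,
)
```

This is also where parse quality bites, as Steve notes: whatever lands in `atitle` is what the doc delivery form will show, so a misparsed title propagates all the way through.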
Re: [CODE4LIB] anyone know about Inera?
At Fri, 11 Jul 2008 14:55:18 -0500, Steve Oberg [EMAIL PROTECTED] wrote:

> One example: Here's the citation I have in hand: Noordzij M, Korevaar JC, Boeschoten EW, Dekker FW, Bos WJ, Krediet RT et al. The Kidney Disease Outcomes Quality Initiative (K/DOQI) Guideline for Bone Metabolism and Disease in CKD: association with mortality in dialysis patients. American Journal of Kidney Diseases 2005; 46(5):925-932. Here's the output from ParsCit. Note the problem with the article title: […]

The output is a little different from what I get from the parsCit web service. The parsCit authors recently published a paper on a new version of their system with a new engine, which you might want to look at [1].

> There's more but basically it isn't accurate enough. It's very good but not good enough for what I need at this juncture. OpenURL resolvers like SFX are generally only as good as the metadata they are given to parse. I need a high level of accuracy. Maybe that's a pipe dream.

I doubt that the software provided by Inera performs better than parsCit. Inera does find a DOI for that citation, but that is not nearly so hard as determining which parts of a citation are which. parsCit is pretty cutting edge and provides some of the best numbers I have seen. The Flux-CiM system [2] also has pretty good numbers, but its code is not available. I've also done a little bit of work on this, which you might want to have a look at [3].

One of the problems may be that the parsCit you are dealing with has been trained on the Cora dataset of computer science citations. It is a reasonably heterogeneous dataset of citations, but it doesn't have a lot that looks like that health-sciences format. If your citations are largely drawn from the health sciences, you might see about training it on a health-sciences dataset; you will probably get much better results.

best, Erik Hetzner

1. Isaac G. Councill, C. Lee Giles, Min-Yen Kan (2008). ParsCit: an open-source CRF reference string parsing package. In Proceedings of the Language Resources and Evaluation Conference (LREC 08), Marrakech, Morocco, May. Available from http://wing.comp.nus.edu.sg/parsCit/#p
2. Eli Cortez C. Vilarinho, Altigran Soares da Silva, Marcos André Gonçalves, Filipe de Sá Mesquita, Edleno Silva de Moura. FLUX-CiM: flexible unsupervised extraction of citation metadata. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2007), pp. 215-224.
3. Erik Hetzner. A simple method for citation metadata extraction using hidden Markov models. In Proceedings of the Joint Conference on Digital Libraries (JCDL 2008), Pittsburgh, PA, 2008. http://gales.cdlib.org/~egh/hmm-citation-extractor/

;; Erik Hetzner, California Digital Library
;; gnupg key id: 1024D/01DB07E3
Re: [CODE4LIB] anyone know about Inera?
On Fri, Jul 11, 2008 at 3:57 PM, Steve Oberg [EMAIL PROTECTED] wrote:

> I fully realize how much of a risk that is in terms of reliability and maintenance. But right now I just want a way to do this in bulk with a high level of accuracy.

How bad is it, really, if you get some (5%?) bad requests into your document delivery system? Customers submit poor quality requests by hand with some frequency, last I checked... Especially if you can hack your system to deliver the original citation all the way into your doc delivery system, you may be able to make the case that 'this is a good service to offer; let's just deal with the bad parses manually.'

Trying to solve this via pure technology is gonna get into a world of diminishing returns. A surprising number of citations in references sections are wrong. Some correct citations are really hard to parse, even by humans who look at a lot of citations. ParsCit has, in my limited testing, worked as well as anything I've seen (commercial or OSS), and much better than most.

My $0.02,
-Nate