Min, Erik, and others working in this domain - have you considered designing your software as a scalable web service from the get-go, using frameworks such as Google App Engine? You may be able to use Montepython for the CRF computations (http://montepython.sourceforge.net/).
I know Min offers a WSDL wrapper around their software, but that's simply a gateway to one single-machine installation, and it's not intended as a production service at that.

 - Godmar

On Sat, Jul 12, 2008 at 3:20 AM, Min-Yen Kan <[EMAIL PROTECTED]> wrote:
> Hi Steve, all:
>
> I'm the key developer of ParsCit. I'm glad to hear your feedback
> about what doesn't work with ParsCit. Erik is correct in saying that
> we have only trained the system for what data we have correct answers
> for, namely computer science. As such it doesn't perform well with
> other data (especially health sciences citations, which we have also
> done some pilot tests on). I note that there are other citation
> parsers out there, including Erik's own HMM parser (I think Erik
> mentioned it as well, available from his website here:
> http://gales.cdlib.org/~egh/hmm-citation-extractor/).
>
> Anyways, I've tried your citation too, and got the same results from
> the demo -- it doesn't handle the authors correctly in this case. I
> would very much love to have as many example cases of incorrectly
> parsed citations as the community is willing to share with us so we
> can improve ParsCit (it's open source so all can benefit from
> improvements to ParsCit).
>
> We are trying to be as proactive as possible about maintaining and
> improving ParsCit. I know of at least two groups that have said they
> are willing to contribute more citations (with correct markings) to us
> so that we can re-train ParsCit, and there is interest in porting it
> to other languages (i.e. German right now). We would love to get
> samples of your data too, where the program does go wrong, to help
> improve our system. And to get feedback on other fields that need to
> be parsed as well: ISSN, ISBNs, volume, and issues.
>
> We are also looking to make the output of the ParsCit system
> compatible with EndNote and BibTeX.
> We actually have an internal project
> to try to hook up ParsCit to find references on arbitrary web pages
> (to form something like Zotero that's not site specific and
> non-template based). If and when this project comes to fruition we'll
> be announcing it to the list.
>
> If anyone has used ParsCit and has feedback on what can be further
> improved we'd love to hear from you. You are our target audience!
>
> Cheers,
>
> Min
>
> --
> Min-Yen KAN (Dr) :: Assistant Professor :: National University of
> Singapore :: School of Computing, AS6 05-12, Law Link, Singapore
> 117590 :: 65-6516 1885(DID) :: 65-6779 4580 (Fax) ::
> [EMAIL PROTECTED] (E) :: www.comp.nus.edu.sg/~kanmy (W)
>
> PS: Hi Erik, still planning on studying your HMM package for improving
> ParsCit ... It's on my agenda.
> Thanks again.
>
> On Sat, Jul 12, 2008 at 5:36 AM, Steve Oberg <[EMAIL PROTECTED]> wrote:
>> Yeah, I am beginning to wonder, based on these really helpful replies, if I
>> need to scale back to what is "doable" and "reasonable." And reassess
>> ParsCit.
>>
>> Thanks to all for this additional information.
>>
>> Steve
>>
>> On Fri, Jul 11, 2008 at 4:18 PM, Nate Vack <[EMAIL PROTECTED]> wrote:
>>
>>> On Fri, Jul 11, 2008 at 3:57 PM, Steve Oberg <[EMAIL PROTECTED]> wrote:
>>>
>>> > I fully realize how much of a risk that is in terms of reliability and
>>> > maintenance. But right now I just want a way to do this in bulk with a
>>> > high level of accuracy.
>>>
>>> How bad is it, really, if you get some (5%?) bad requests into your
>>> document delivery system? Customers submit poor quality requests by
>>> hand with some frequency, last I checked...
>>>
>>> Especially if you can hack your system to deliver the original
>>> citation all the way into your doc delivery system, you may be able to
>>> make the case that 'this is a good service to offer; let's just deal
>>> with the bad parses manually.'
>>>
>>> Trying to solve this via pure technology is gonna get into a world of
>>> diminishing returns. A surprising number of citations in references
>>> sections are wrong. Some correct citations are really hard to parse,
>>> even by humans who look at a lot of citations.
>>>
>>> ParsCit has, in my limited testing, worked as well as anything I've
>>> seen (commercial or OSS), and much better than most.
>>>
>>> My $0.02,
>>> -Nate
>>>
>>
>