If someone else were getting started and didn't want to assemble their
own training data -- do you think it would likely be useful for them to
aggregate your training data _and_ Brown's training data together and
generate a new model? Was there a particular reason you chose not to
use Brown's training data and add on to it, but to start over from scratch?
Forgive me if this is a stupid question, I'm still trying to learn about
this stuff.
And start to figure out how I'm going to deal with it when I get around
to using FreeCite, which I surely will. Would it maybe make sense to
actually separate the training data and trained model into a separate
library, so people could even pick and choose which pre-built trained
model they want to use, or build their own, without dealing with repo
conflicts?
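
To make the aggregation question concrete, here is a minimal sketch in
Python. It assumes each training set can be exported as a list of tagged
citation strings; the tag format and the function names are hypothetical
and are not FreeCite's actual schema or API. It only shows merging two
sets while dropping duplicates, not the retraining step itself.

```python
import re


def normalize(tagged):
    """Strip tags, then collapse case and whitespace so duplicates compare equal."""
    plain = re.sub(r"<[^>]+>", "", tagged)
    return " ".join(plain.lower().split())


def merge_training_sets(*sets):
    """Concatenate training sets, keeping the first copy of each citation."""
    seen = set()
    merged = []
    for training_set in sets:
        for tagged in training_set:
            key = normalize(tagged)
            if key not in seen:
                seen.add(key)
                merged.append(tagged)
    return merged


# Hypothetical sample data standing in for the two exported training sets.
brown = ["<author>Smith, J.</author> <title>On Parsing</title> <date>1999</date>"]
fork = [
    "<author>Smith, J.</author>  <title>On parsing</title> <date>1999</date>",
    "<author>Doe, A.</author> <title>Citations</title> <date>2005</date>",
]
merged = merge_training_sets(brown, fork)
```

Keeping the first copy means the earlier set wins when both contain the
same citation with different tagging, which may or may not be the right
call for a given pair of training sets.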
The training data is not currently under source control (it's in the
database), but the trained model is.
That's, admittedly, a bit of a downside to my fork (although the model
being checked into git is true of the original, as well) since you'd
always be in conflict with my trained model if you train your own.
-Ross.
On Monday, October 17, 2011, Jonathan Rochkind wrote:
> When you say you've added to the training data, have you shared your
> additions back with Brown, or is your new, improved training data only
> in your fork? Or is it held only locally by you and isn't even in your
> GitHub fork? Please clarify, thanks!
>
> On 10/13/2011 8:52 PM, Ross Singer wrote:
>>
>> Yeah, we've been doing a lot with (and putting a lot of updates into)
>> FreeCite. We only use the webservice (although we don't use the
>> OpenURL context object and instead added a JSON response). It works
>> pretty well (not always great, but certainly better than nothing) -
>> especially for giving us something "good enough" to throw against some
>> OpenLibrary and Crossref data to look for matches. Basically what
>> we're using it for is to go from a citation string to an RDF graph.
>>
>> BTW, there have been no problems with post-2000 dates (not to say that
>> there aren't plenty of other problems) - this might have been either a
>> training issue or something a later version of CRF++ worked out. We
>> also add the citations it couldn't parse correctly to its training
>> data, which might help this.
>>
>> Anyway, yeah, if anybody is interested, feel free to try it out. One
>> thing my fork does is remove the PostgreSQL dependency, if that's an
>> issue for anybody. It's kind of handy to be able to just use SQLite
>> or MySQL or whatever to try it out.
>>
>> -Ross.
>>
>> On Thu, Oct 13, 2011 at 7:42 PM, Avram Lyon wrote:
>>>
>>> On Thu, Oct 13, 2011 at 2:33 PM, Will Kurt wrote:
>>>>
>>>> I always think that Brown's FreeCite API is underutilized.
>>>> http://freecite.library.brown.edu/
>>>> It's far from perfect, but I'm sure more use could be made of it.
>>>>
>>>> A few months back I threw together a copy/paste citation look-up
>>>> with it:
>>>> CiteBox
>>>> http://willkurt.github.com/CiteBox/
>>>>
>>>> Of course I don't think anyone is really making use of it, but I've
>>>> also done nothing to really promote it either ;)
>>>
>>> The FreeCite parser had major issues for a while with post-2000 dates,
>>> and I believe the installation at Brown still does, but, to judge by
>>> the GitHub activity (most active fork here:
>>> https://github.com/rsinger/free_cite/), some enterprising folks have
>>> picked it up after a period of apparent dormancy. This is great to
>>> see, and vital to any project that hopes to use its API for anything
>>> serious.
>>>
>>> By the way, the rarely-used XML representation of OpenURL
>>> ContextObjects that FreeCite produces is supported by Zotero as a
>>> full-fledged input format, a fact that might come in handy if you're
>>> hoping to have your API produce something that Zotero users can
>>> import.
>>>
>>> Avram
>>>
>>> UCLA Slavic, Zotero community dev
>>>
>