Thanks Daniel for chiming in, this was really helpful. I hope you don't mind a few more comments/questions?
On Friday, November 29, 2013, 4:27:38 PM, you wrote:

> (a) finding a publication on a site other than the publisher's does
> not necessarily mean that file is legally there, or even that it's
> easy to determine (let alone algorithmically) whether that is the case

At least for the STEM fields, the consolidation of publishers is really convenient. What I intend to do is to crawl the publishers' sites (not that many, due to consolidation) slowly, as if the crawler were a person (or many students) from essentially all participating libraries. The metadata is with the publishers and easy to read. The crawler then strips all the tags that could identify the source of the article (i.e., everything except content), such that each article looks as if it had been submitted by the author(s). Then, new tags are added to all the articles to create a database of all the articles libraries have access to. One tag would denote whether the article is unambiguously open access or not. Every article that is not unambiguously open access, public domain, or the like is not accessible, not even visible from the outside (except, of course, its metadata).

> (b) mining those files may be prohibited by the copyright holder's
> terms and conditions.

Another tag would take care of this: miners would only see articles where mining is unambiguously legal. If it can't be determined, no mining.

> Instead of crawling for the publications themselves, it may be less
> problematic from a legal point of view to just have a platform that
> aggregates metadata of publications, along with a link to a legal copy
> (green or gold).

Without standardized mark-up of the articles, their usefulness would be severely curtailed. This would only be a last-resort option, IMHO.

> (f) the official repositories are far from interoperable

That is one of the things that need to be remedied ASAP!
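To make the tagging idea above a bit more concrete, here is a minimal sketch, not an existing system. All names in it (`Article`, `Status`, `public_view`, `minable`) are hypothetical and only illustrate the rule that anything not unambiguously cleared stays hidden from full-text access and from miners, while metadata remains visible.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    """Tri-state tag value (hypothetical): only YES counts as cleared."""
    YES = "yes"          # unambiguously permitted
    NO = "no"            # unambiguously not permitted
    UNKNOWN = "unknown"  # could not be determined


@dataclass
class Article:
    metadata: dict   # title, authors, DOI, ... (always visible)
    content: str     # full text with source-identifying tags stripped
    open_access: Status = Status.UNKNOWN
    mining_allowed: Status = Status.UNKNOWN


def public_view(article):
    """What an outside visitor sees: metadata always, full text only if open."""
    view = {"metadata": article.metadata}
    if article.open_access is Status.YES:
        view["content"] = article.content
    return view


def minable(articles):
    """Articles a text miner may see: only those unambiguously cleared."""
    return [a for a in articles if a.mining_allowed is Status.YES]
```

Defaulting both tags to `UNKNOWN` means a newly crawled article is hidden and unminable until someone positively establishes its status.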
> For these reasons, I think it is best to develop crawling
> infrastructure around the clearly licensed literature first (which is
> a rather small subset at present),

Doesn't copyright expire? Don't the green mandates work retroactively? So everything covered by green mandates (essentially everything with a public funder acknowledged), or where the publisher is not opposed to green deposition, or where copyright has expired or never existed (e.g. US government work, etc.) should be fair game. All of this seems like a lot - is it really not that much?

Cheers,

Bjoern

-- 
Björn Brembs
---------------------------------------------
http://brembs.net
Neurogenetics
Universität Regensburg
Germany
_______________________________________________ GOAL mailing list GOAL@eprints.org http://mailman.ecs.soton.ac.uk/mailman/listinfo/goal