Thanks, Daniel, for chiming in; this was really helpful. I hope you don't mind 
a few more comments/questions.

On Friday, November 29, 2013, 4:27:38 PM, you wrote:

> (a) finding a publication on a site other than the publisher's does
> not necessarily mean that file is legally there, or even that it's
> easy to determine (let alone algorithmically) whether that is the case

At least for the STEM fields, the consolidation of publishers is really 
convenient. What I would intend to do is crawl the publishers' sites (not that 
many, due to consolidation) slowly, as if the crawler were a person (or many 
students) from essentially all participating libraries. The metadata is with 
the publishers and easy to read. The crawler then strips all the tags that 
could identify the source of the article (i.e., everything except content), 
such that each article looks as if it had been submitted by the author(s). 
Then, new tags are added to all the articles to create a database of all the 
articles the libraries have access to. One tag would denote whether the 
article is unambiguously open access or not.
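
Just to make that concrete, here is a minimal sketch in Python. It is purely 
illustrative: the metadata fields, tag names, and stripping heuristic are my 
assumptions, not an actual crawler design.

import re
import time
import random

def strip_source_markup(html):
    """Remove everything except content, so the article no longer
    reveals which publisher's site it came from."""
    # Drop publisher-specific chrome (scripts, styles, header/footer/nav).
    text = re.sub(r"<(script|style|header|footer|nav)[^>]*>.*?</\1>",
                  "", html, flags=re.S | re.I)
    # Strip any remaining tags, leaving bare content.
    return re.sub(r"<[^>]+>", " ", text).strip()

def retag(content, meta):
    """Add the database's own uniform tags to the bare content."""
    return {
        "content": content,
        "doi": meta.get("doi"),
        # True / False / None -- None means "could not be determined".
        "open_access": meta.get("open_access"),
        "mining_allowed": meta.get("mining_allowed"),
    }

def crawl(pages):
    """Walk (html, metadata) pairs at a human-like pace."""
    db = []
    for html, meta in pages:
        time.sleep(random.uniform(1, 5))  # slow, person-like crawling
        db.append(retag(strip_source_markup(html), meta))
    return db

The three-valued flags (True/False/None) are what make the "unambiguous" 
rules below trivial to enforce.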

Every article that is not unambiguously open access, public domain, or 
similar would not be accessible, not even visible from the outside (except, 
of course, for its metadata).
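
In the sketch above, that rule is a one-liner (same made-up field names): 
anything undetermined counts as not open access.

def visible_outside(article):
    """Full text leaves the system only if unambiguously open access;
    None (undetermined) counts as no. Metadata is always visible."""
    return article["open_access"] is True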

> (b) mining those files may be prohibited by the copyright holder's
> terms and conditions.

Another tag would take care of this: miners would only see articles where 
mining is unambiguously legal. If it can't be determined, no mining.
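
The same pattern again, hypothetical field names as before:

def minable(article):
    """Miners only see articles where mining is unambiguously legal;
    if that can't be determined, no mining."""
    return article["mining_allowed"] is True

def miner_view(db):
    # Everything else stays invisible to miners (metadata aside).
    return [a for a in db if minable(a)]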

> Instead of crawling for the publications themselves, it may be less
> problematic from a legal point of view to just have a platform that
> aggregates metadata of publications, along with a link to a legal copy
> (green or gold).

Without standardized markup of the articles, their usefulness would be 
severely curtailed. This would only be a last-resort option, IMHO.

> (f) the official repositories are far from interoperable

That is one of the things that need to be remedied ASAP!

> For these reasons, I think it is best to develop crawling
> infrastructure around the clearly licensed literature first (which is
> a rather small subset at present),

Doesn't copyright expire? Don't the green mandates work retroactively? So 
everything covered by green mandates (essentially everything with a public 
funder acknowledged), everything where the publisher is not opposed to green 
deposit, and everything where copyright has expired or never existed (e.g., 
US government works) should be fair game.
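
Sketched as a predicate in the same vein (the field names are invented; the 
real test would need actual funder and licensing data), any one criterion 
suffices:

def fair_game(article):
    """Rough sketch of the proposed criteria."""
    return bool(
        article.get("green_mandate")             # public funder's mandate applies
        or article.get("publisher_allows_green")
        or article.get("copyright_expired")
        or article.get("public_domain")          # e.g. US government works
    )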

All of this seems like quite a lot to me - is it really such a small subset?


Cheers,

Bjoern



-- 
Björn Brembs
---------------------------------------------
http://brembs.net
Neurogenetics
Universität Regensburg
Germany

