Take a look at "Best Practices for Shareable Metadata": http://webservices.itcs.umich.edu/mediawiki/oaibp/index.php/ShareableMetadataPublic
There is a specific section on "Linking from a Record to a Resource and Other Linking Issues".

Regards,
Tom

> -----Original Message-----
> From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
> Joe Hourcle
> Sent: Monday, February 27, 2012 10:43 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] "Repositories", OAI-PMH and web crawling
>
> On Feb 27, 2012, at 10:51 AM, Godmar Back wrote:
>
> > On Mon, Feb 27, 2012 at 8:31 AM, Diane Hillmann
> > <metadata.ma...@gmail.com> wrote:
> >
> >> On Mon, Feb 27, 2012 at 5:25 AM, Owen Stephens
> >> <o...@ostephens.com> wrote:
> >>
> >>> This issue is certainly not unique to VT - we've come across this as
> >>> part of our project. While the OAI-PMH record may point at the PDF,
> >>> it can also point to an intermediary page. This seems to be standard
> >>> practice in some instances - I think because there is a desire, or
> >>> even a requirement, that a user should see the intermediary page
> >>> (which may contain rights information etc.) before viewing the
> >>> full-text item. There may also be an issue where multiple files
> >>> exist for the same item - maybe several data files and a PDF of the
> >>> thesis attached to the same metadata record - as the metadata via
> >>> OAI-PMH may not describe each asset.
> >>
> >> This has been an issue since the early days of OAI-PMH, and many
> >> large providers provide such intermediate pages (arxiv.org, for
> >> instance). The other issue driving providers towards intermediate
> >> pages is that they allow providers to continue to derive statistics
> >> from usage of their materials, which direct access URIs and multiple
> >> web caches don't. For providers dependent on external funding, this
> >> is a biggie.
> >
> > Why do you place direct access URIs and multiple web caches into the
> > same category? I follow your argument re: usage statistics for web
> > caches, but as long as the item remains hosted in the repository,
> > direct access URIs should still be counted (provided proper
> > cache-control headers are sent). Perhaps it would require server-side
> > statistics rather than client-based GA.
>
> I'd agree -- if you can't get good statistics from direct linking,
> something's wrong with the methods you're using to collect usage
> information. Google Analytics and similar tools might produce pretty
> reports, but they're really meant for tracking web sites and won't work
> when someone has JavaScript turned off, has specifically blacklisted
> the analytics server, or on anything that's not HTML.
>
> You *really* need to analyze the server logs directly, as you can't be
> sure that all access is going to go through the intermediate 'landing
> pages' or that it'd be tracked even if they did.
>
> ...
>
> I admit, the stuff I'm serving is a little different than most people
> on this list, but we also have the issue that the collections are so
> large that we don't want people retrieving the files unless they really
> need them. We serve multiple TB per day -- I'd rather a person figure
> out if they want a file *before* they retrieve it, rather than download
> a few GB of data and find out it won't serve their purposes.
>
> It might not help our 'look how much we serve!' metrics to justify our
> funding, but it helps keep our costs down, and I personally believe it
> helps with good will in our designated community, as they don't spend a
> day (or more) downloading only to find it's not what they thought. (And
> it fits in with Ranganathan's 4th law better than saving them from an
> extra click.)
>
> -Joe
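The server-log approach Joe describes can be sketched in a few lines. This is a minimal illustration, not anyone's production code: it assumes Apache/nginx "combined" format access logs, and the `/etds/` path prefix and the sample log lines are hypothetical. It counts successful direct GETs per item (including `206` partial-content responses) while skipping obvious crawlers, which is roughly the counting that client-side analytics misses.

```python
import re
from collections import Counter

# Combined log format:
#   IP - - [date] "METHOD /path HTTP/1.x" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

BOT_HINTS = ("bot", "crawler", "spider")  # crude user-agent filter


def count_downloads(lines, prefix="/etds/"):
    """Count successful GETs of items under `prefix`, skipping obvious bots."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # line didn't parse; skip it
        if m.group("method") != "GET":
            continue
        if m.group("status") not in ("200", "206"):  # 206 = partial content
            continue
        if any(h in m.group("agent").lower() for h in BOT_HINTS):
            continue
        if m.group("path").startswith(prefix):
            counts[m.group("path")] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical log lines: one real reader, one crawler.
    sample = [
        '1.2.3.4 - - [27/Feb/2012:10:43:00 -0500] '
        '"GET /etds/thesis123.pdf HTTP/1.1" 200 1048576 "-" "Mozilla/5.0"',
        '5.6.7.8 - - [27/Feb/2012:10:44:00 -0500] '
        '"GET /etds/thesis123.pdf HTTP/1.1" 200 1048576 "-" "Googlebot/2.1"',
    ]
    print(count_downloads(sample))
```

A real deployment would need a better bot list and deduplication of repeat requests, but even this much captures direct-URI traffic that never touches a landing page.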