Re: [CODE4LIB] DIY aggregate index

2010-07-02 Thread Blake, Miriam E
And it's true that if you get the article metadata directly from the publishers,
you avoid the issues with duplication that we have with the secondary databases,
which all re-format and add data to each record they receive.  However, I would
guess this requires many more negotiations (many more publishers) than
dealing with the A&I vendors.
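
(Purely to illustrate the kind of de-duplication those re-formatted records force
on you: a toy match-key sketch in Python, with made-up field names, nothing like
production code.)

    import re
    from collections import defaultdict

    def match_key(record):
        """Crude duplicate-detection key: DOI if present, otherwise a
        normalized title + first-author surname + year."""
        doi = (record.get("doi") or "").lower().strip()
        if doi:
            return "doi:" + doi
        title = re.sub(r"[^a-z0-9]+", "", (record.get("title") or "").lower())
        author = (record.get("first_author") or "").split(",")[0].lower()
        year = str(record.get("year") or "")
        return "tay:" + title[:40] + "|" + author + "|" + year

    def dedupe(records):
        """Group records from different sources by match key and keep one
        record per group, folding in any extra subject headings."""
        groups = defaultdict(list)
        for rec in records:
            groups[match_key(rec)].append(rec)
        merged = []
        for recs in groups.values():
            keeper = recs[0]
            for other in recs[1:]:
                keeper.setdefault("subjects", []).extend(other.get("subjects", []))
            merged.append(keeper)
        return merged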

Miriam
LANL


On 7/2/10 6:57 AM, Laurence Lockton l.g.lock...@bath.ac.uk wrote:

Eric is right: a few European institutions have been doing this for
several years. At the University of Bath we've been using ELIN
(http://elin.lub.lu.se/elinInfo), which Lund University in Sweden had been
operating since 2001 (until recently - it's now been effectively spun
off). This is also what underlies the DOAJ site, http://www.doaj.org/

It seems to me that there are two approaches to building these
aggregated indexes:
  (1) load whole databases (mostly A&I) and catalogues, as an
alternative to federated search, and
  (2) collect article-level metadata, mostly from primary publishers, to
build an index of the library's e-journals collection, then possibly add
the print catalogue.

LANL sounds like it's taken the first approach; ELIN and Journal TOCs
http://www.journaltocs.hw.ac.uk/ are based on the second. The approach
taken by the commercial vendors is somewhat blurred between the two, but
I would suggest that EBSCO Discovery Service and OCLC WorldCat Local are
broadly based on the first approach and Serials Solutions Summon and Ex
Libris Primo Central are more focussed on the second. I think this is an
important consideration for anyone selecting a service, or contemplating
building their own.

Laurence Lockton
University of Bath
UK


Re: [CODE4LIB] DIY aggregate index

2010-07-01 Thread Jonathan Rochkind
The argument I've tried to make to content vendors (just in casual 
conversation, never in actual negotiations) is that we'll still send the 
user to their platform for actually accessing the text, we just want the 
metadata (possibly including textual fulltext for searching) for 
_searching_.  So they can still meter and completely control actual 
article access.
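
To make that concrete, a rough sketch of what the loaded record might look like -
hypothetical field names, and it assumes a Solr schema.xml where the fulltext
field is indexed="true" stored="false", so it is searchable but can never be
returned to the user:

    import requests
    from xml.sax.saxutils import escape

    SOLR_UPDATE = "http://localhost:8983/solr/articles/update"   # hypothetical core

    def add_article(doc_id, title, fulltext, platform_url):
        """Load metadata plus fulltext for searching only; given the schema
        assumption above, the only route to the article itself is the stored
        link back to the vendor's platform."""
        fields = [("id", doc_id), ("title", title),
                  ("fulltext", fulltext),            # searchable, never returned
                  ("platform_url", platform_url)]    # where we send the user
        doc = "".join('<field name="%s">%s</field>' % (name, escape(value))
                      for name, value in fields)
        resp = requests.post(SOLR_UPDATE,
                             data=("<add><doc>%s</doc></add>" % doc).encode("utf-8"),
                             headers={"Content-Type": "text/xml"})
        resp.raise_for_status()
        requests.get(SOLR_UPDATE, params={"commit": "true"}).raise_for_status()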


Sadly, in casual conversation, I have not generally found this to be 
persuasive with content vendors. Especially ones which are metadata 
aggregators only without any fulltext in the first place, heh.  
Publishers are more open to this -- but then publishers may have been 
ensnared in exclusive contracts with aggregators that leave them unable 
to do it even if they wanted to. (See EBSCO.)


I wrote an article on this a couple of years ago in Library Journal. In 
retrospect, I think my article is over-optimistic about the technical 
feasibility of doing this -- running a Solr instance isn't that bad, but 
the technical issues of maintaining the regular flow of updates from 
dozens of content providers, and normalizing all the data to go into the 
same index, are non-trivial, I think now.


http://www.libraryjournal.com/article/CA6413442.html
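
To give a flavor of where the non-trivial part lives, a toy sketch (invented
providers and field names) of the per-source mapping code you end up writing and
babysitting for every content provider:

    def from_provider_a(rec):
        """Provider A ships Dublin-Core-ish fields."""
        return {
            "id": "a:" + rec["identifier"],
            "title": rec.get("title", ""),
            "authors": rec.get("creator", []),
            "year": rec.get("date", "")[:4],
            "issn": rec.get("source_issn"),
        }

    def from_provider_b(rec):
        """Provider B ships something MARC-flavored (toy structure here)."""
        return {
            "id": "b:" + rec["001"],
            "title": rec.get("245", {}).get("a", ""),
            "authors": [rec.get("100", {}).get("a", "")],
            "year": rec.get("260", {}).get("c", "")[:4],
            "issn": rec.get("022", {}).get("a"),
        }

    MAPPERS = {"provider_a": from_provider_a, "provider_b": from_provider_b}

    def normalize(source, rec):
        """One common internal schema, whatever the provider sends."""
        return MAPPERS[source](rec)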

Owen Stephens wrote:

As others have suggested, I think much of this is around the practicalities
of negotiating access, and the server power and expertise needed to run the
service - it's simply more efficient to do this in one place.

For me, the change that we need to open this up is for publishers to start
pushing out a lot more of this data to all comers, rather than having to
have this conversation several times over with individual sites or
suppliers. How practical this is, I'm not sure - especially as we are talking
about indexing full-text where available (I guess). I think the Google News
model (5-clicks free) is an interesting one - but I'm not sure whether this, or
a similar approach, would work in a niche market which may not be so
interested in total traffic.

It seems (to me) so obviously in the publishers' interest for their content to
be as easily discoverable as possible that I am optimistic they will
gradually become more open to sharing more data that aids this - at least
metadata. I'd hope that this would eventually open up the market to a
broader set of suppliers, as well as to institutions doing their own thing.

Owen

On Thu, Jul 1, 2010 at 2:37 AM, Eric Lease Morgan emor...@nd.edu wrote:

  

On Jun 30, 2010, at 8:43 PM, Blake, Miriam E wrote:



We have locally loaded records from the ISI databases, INSPEC,
BIOSIS, and the Department of Energy (as well as from full-text
publishers, but that is another story and system entirely). Aside
from the contracts, I can also attest to the major amount of
work it has been. We have 95M bibliographic records, stored in
75 TB of disk, and counting. It's all running on Solr, with a local
interface and the distributed aDORe repository on the backend. ~2
FTE keep it running in production now.
  

I definitely think what is outlined above -- local indexing -- is the way
to go in the long run. Get the data. Index it. Integrate it into your other
system. Know that you have it when you change or drop the license. No
renting of data. And, "We don't need no stinkin' interfaces!" I believe a
number of European institutions have been doing this for a number of years.
I hear a few of us in the United States following suit.  ++

--
Eric Morgan
University of Notre Dame.






  


Re: [CODE4LIB] DIY aggregate index

2010-07-01 Thread Cory Rockliff
I'm planning on moving ahead with a proof-of-concept in the next year, 
after which I will certainly consider writing it up.


I really hope I can get the go-ahead from database vendors. It's good to 
hear that a few institutions have successfully negotiated with 
them--anyone from Los Alamos, the Scholars Portals, or any other local 
indexers feel free to give me pointers on smooth-talking the vendors! :)


I also hope you're wrong in maintaining, in the article you linked to, 
that using controlled vocabularies for retrieval will never work well 
across databases that use different vocabularies. The (admittedly 
arduous and complex) work of crosswalking library-created controlled 
vocabularies like LCSH to periodical index thesauri and other formal and 
less-formal indexing languages out in the wild is *exactly* what I think 
librarians should be spending their time doing. Catalogers (and I 
include myself) spend a lot of time making largely irrelevant tweaks to 
already-existing MARC records before exporting them into our local 
ILSes, but article-level metadata from vendors is generally served up to 
the user as-is.
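
The sort of thing I have in mind is mechanically simple - a crosswalk table
applied at query-expansion (or indexing) time. A toy sketch, with invented
mappings purely for illustration:

    # Toy crosswalk: LCSH heading -> preferred terms in two (hypothetical)
    # vendor thesauri. Real crosswalks would be far larger and messier.
    CROSSWALK = {
        "Motion pictures": {"vendor_x": "Films", "vendor_y": "Cinema"},
        "Automobiles":     {"vendor_x": "Cars",  "vendor_y": "Motor vehicles"},
    }

    def expand_subject_query(lcsh_heading):
        """Expand an LCSH heading into the equivalent terms used by each
        source database, for OR'ing together at search time."""
        terms = [lcsh_heading]
        terms += CROSSWALK.get(lcsh_heading, {}).values()
        return sorted(set(terms))

    # e.g. expand_subject_query("Motion pictures")
    #      -> ['Cinema', 'Films', 'Motion pictures']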


I think Roy Tennant, as quoted in your article, is spot-on when he says 
that our inability to do any preprocessing of the data is a major 
hindrance. We should treat the data sources we subscribe to as starting 
points for generating a user experience, rather than letting the vendors 
decide what the discovery process is going to be like.


Cory

On 7/1/2010 11:39 AM, Jonathan Rochkind wrote:
I am eager to see you try it, Cory. Please consider writing up your 
results for the Code4Lib Journal. I'd be curious to hear the complete 
story, from issues of getting metadata, to issues of the technical 
infrastructure, any metadata normalization you need to do, issues of 
continuing to get the metadata on a regular basis, etc.
Whether you succeed or fail, but especially if you succeed, your 
project with just a couple databases could serve as a useful pilot 
for people considering doing it with more.


Jonathan


--
Cory Rockliff
Technical Services Librarian
Bard Graduate Center: Decorative Arts, Design History, Material Culture
18 West 86th Street
New York, NY 10024
T: (212) 501-3037
rockl...@bgc.bard.edu



Re: [CODE4LIB] DIY aggregate index

2010-06-30 Thread Jonathan Rochkind

Cory Rockliff wrote:
Do libraries opt for these commercial 'pre-indexed' services simply 
because they're a good value proposition compared to all the work of 
indexing multiple resources from multiple vendors into one local index, 
or is it that companies like iii and Ex Libris are the only ones with 
enough clout to negotiate access to otherwise-unavailable database 
vendors' content?
  
A little bit of both, I think. A library probably _could_ negotiate 
access to that content... but it would be a heck of a lot of work. When 
the staff time for negotiations is factored in, it becomes a good value 
proposition, regardless of how much the licensing would cost you.  And 
yeah, then there's the staff time to actually ingest and normalize and 
troubleshoot data-flows for all that stuff on a regular basis -- I've 
heard stories of libraries that tried to do that in the early 90s and it 
was nightmarish.


So, actually, I guess I've arrived at convincing myself it's mostly a 
good value proposition, in that a library probably can't afford to do 
that on their own, with or without licensing issues.


But I'd really love to see you try anyway, maybe I'm wrong. :)

Can I assume that if a database vendor has exposed their content to me 
as a subscriber, whether via z39.50 or a web service or whatever, that 
I'm free to cache and index all that metadata locally if I so choose? Is 
this something to be negotiated on a vendor-by-vendor basis, or is it an 
impossibility?
  

I doubt you can assume that.  I don't think it's an impossibility.

Jonathan
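
Mechanically - license questions entirely aside - pulling records from, say, an
SRU endpoint into a local cache is the easy part. A rough sketch (hypothetical
endpoint, no error handling):

    import requests, sqlite3

    def cache_sru_records(base_url, cql_query, db_path="cache.db", page=50):
        """Page through an SRU searchRetrieve endpoint (license permitting!)
        and stash the raw response XML locally for later normalization."""
        db = sqlite3.connect(db_path)
        db.execute("CREATE TABLE IF NOT EXISTS pages (pos INTEGER PRIMARY KEY, xml TEXT)")
        start = 1
        while True:
            resp = requests.get(base_url, params={
                "operation": "searchRetrieve", "version": "1.1",
                "query": cql_query, "startRecord": start, "maximumRecords": page,
            })
            resp.raise_for_status()
            body = resp.text
            if "<recordData>" not in body:   # crude end-of-results check
                break
            db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (start, body))
            db.commit()
            start += page
        db.close()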


Re: [CODE4LIB] DIY aggregate index

2010-06-30 Thread Walker, David
You might also need to factor in an extra server or three (in the cloud or 
otherwise) into that equation, given that we're talking 100s of millions of 
records that will need to be indexed.

 companies like iii and Ex Libris are the only ones with
 enough clout to negotiate access

I don't think III is doing any kind of aggregated indexing, hence their 
decision to try and leverage APIs.  I could be wrong.

--Dave

==
David Walker
Library Web Services Manager
California State University
http://xerxes.calstate.edu



Re: [CODE4LIB] DIY aggregate index

2010-06-30 Thread Cory Rockliff
Well, this is the thing: we're a small, highly-specialized collection, 
so I'm not talking about indexing the whole range of content which a 
university like JHU or even a small liberal arts college would need 
to--it's really a matter of a few key databases in our field(s). Don't 
get me wrong, it's still a slightly crazy idea, but I'm dissatisfied 
enough with existing solutions that I'd like to try it.


On 6/30/2010 4:15 PM, Jonathan Rochkind wrote:
A little bit of both, I think. A library probably _could_ negotiate 
access to that content... but it would be a heck of a lot of work. 
When the staff time for negotiations is factored in, it becomes a good 
value proposition, regardless of how much the licensing would cost you.  And 
yeah, then there's the staff time to actually ingest and normalize and 
troubleshoot data-flows for all that stuff on a regular basis -- 
I've heard stories of libraries that tried to do that in the early 90s 
and it was nightmarish.


I wonder if they would, in fact, demand licensing fees. I mean, we're 
already paying a subscription, and they're already exposing their 
content as a target for federated search applications (which probably do 
some caching for performance)...

--
Cory Rockliff
Technical Services Librarian
Bard Graduate Center: Decorative Arts, Design History, Material Culture
18 West 86th Street
New York, NY 10024
T: (212) 501-3037
rockl...@bgc.bard.edu



Re: [CODE4LIB] DIY aggregate index

2010-06-30 Thread Cory Rockliff
We're looking at an infrastructure based on MarkLogic running on Amazon 
EC2, so the scale of data to be indexed shouldn't actually be that big 
of an issue. Also, as I said to Jonathan, I only see myself indexing a 
handful of highly-relevant resources, so we're talking millions, rather 
than 100s of millions, of records.


On 6/30/2010 4:22 PM, Walker, David wrote:

You might also need to factor in an extra server or three (in the cloud or 
otherwise) into that equation, given that we're talking 100s of millions of 
records that will need to be indexed.

   

companies like iii and Ex Libris are the only ones with
enough clout to negotiate access
 

I don't think III is doing any kind of aggregated indexing, hence their 
decision to try and leverage APIs.  I could be wrong.

--Dave

==
David Walker
Library Web Services Manager
California State University
http://xerxes.calstate.edu



--
Cory Rockliff
Technical Services Librarian
Bard Graduate Center: Decorative Arts, Design History, Material Culture
18 West 86th Street
New York, NY 10024
T: (212) 501-3037
rockl...@bgc.bard.edu



Re: [CODE4LIB] DIY aggregate index

2010-06-30 Thread Blake, Miriam E
We are one of those institutions that did this - negotiated for lots of content 
YEARS ago (before the providers really knew what they or we were in for).

We have locally loaded records from the ISI databases, INSPEC, BIOSIS, and the 
Department of Energy (as well as from full-text publishers, but that is another 
story and system entirely).  Aside from the contracts, I can also attest to the 
major amount of work it has been.  We have 95M bibliographic records, stored in 
75 TB of disk, and counting.  It's all running on Solr, with a local interface 
and the distributed aDORe repository on the backend.  ~2 FTE keep it running in 
production now.

Over the 15 years we've been loading this, we've had to migrate it 3 times, and 
deal with all the dirty metadata, duplication, and other difficult issues around 
scale and the lack of content-provider interest in supporting the few of us who 
do this kind of stuff.  We believe we have now achieved a standardized format 
(MPEG-21 DIDL and MARCXML, with some other standards mixed in), accessible 
through protocol-based services (OpenURL, REST, OAI-PMH, etc.), so that we hope 
we won't have to mess with the data records again and can move on to other more 
interesting things.
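
For anyone wondering what consuming those services looks like, a bare-bones
OAI-PMH harvest - hypothetical endpoint, default oai_dc metadata prefix, no
error handling or incremental date ranges - is roughly this:

    import requests
    import xml.etree.ElementTree as ET

    OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

    def harvest(base_url, metadata_prefix="oai_dc"):
        """Walk an OAI-PMH ListRecords response set, following
        resumptionTokens, yielding each <record> for downstream indexing."""
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        while True:
            resp = requests.get(base_url, params=params)
            resp.raise_for_status()
            root = ET.fromstring(resp.content)
            for record in root.iter(OAI_NS + "record"):
                yield record
            token = root.find(".//" + OAI_NS + "resumptionToken")
            if token is None or not (token.text or "").strip():
                break
            params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

    # e.g. for rec in harvest("http://repository.example.org/oai"): ...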

It is nice to have, and very fast - it very much beats federated search - and it 
allows us (finally) to begin to build neat services (for licensed users only!). 
Data mining?  Of course that's a goal, but talk about sticky areas of contract 
negotiation.  And in the end, you never have everything someone needs when they 
want all content about something specific.  And yes, local loading is expensive, 
for a lot of reasons.

Ex Libris, Summon, etc. are now getting into the game from this angle.  We so 
feel their pain, but I hope technology and content-provider engagement have 
improved to make it a bit easier for them!  And it definitely adds a level of 
usability much improved over federated search.

My .02,

Miriam Blake
Los Alamos National Laboratory Research Library




On 6/30/10 3:20 PM, Rosalyn Metz rosalynm...@gmail.com wrote:

I know that there are institutions that have negotiated contracts for just
the content, sans interface.  But those that I know of have TONS of money
and are using a third-party interface that ingests the data for them.  I'm not
sure what the terms of that contract were or how they get the data, but it
can be done.


