Re: [CODE4LIB] DIY aggregate index
And it's true that if you get the article metadata directly from the publishers, you avoid the issues with duplication that we have with the secondary databases, who all re-format and add data to each record they receive. However, I would guess this requires many more negotiations (many more publishers) than dealing with the A&I vendors.

Miriam
LANL

On 7/2/10 6:57 AM, Laurence Lockton l.g.lock...@bath.ac.uk wrote:

Eric is right, a few European institutions have been doing this for several years. At the University of Bath we've been using ELIN http://elin.lub.lu.se/elinInfo, which Lund University in Sweden had been operating since 2001 (until recently - it's now been effectively spun off). This is also what underlies the DOAJ site http://www.doaj.org/

It seems to me that there are two approaches to building these aggregated indexes: (1) load whole databases (mostly A&I) and catalogues, as an alternative to federated search, and (2) collect article-level metadata, mostly from primary publishers, to build an index of the library's e-journals collection, then possibly add the print catalogue. LANL sounds like it's taken the first approach; ELIN and Journal TOCs http://www.journaltocs.hw.ac.uk/ are based on the second. The approach taken by the commercial vendors is somewhat blurred between the two, but I would suggest that EBSCO Discovery Service and OCLC WorldCat Local are broadly based on the first approach, and Serials Solutions Summon and Ex Libris Primo Central are more focussed on the second. I think this is an important consideration for anyone selecting a service, or contemplating building their own.

Laurence Lockton
University of Bath, UK
Re: [CODE4LIB] DIY aggregate index
The argument I've tried to make to content vendors (just in casual conversation, never in actual negotiations) is that we'll still send the user to their platform for actually accessing the text; we just want the metadata (possibly including textual fulltext for searching) for _searching_. So they can still meter and completely control actual article access. Sadly, in casual conversation, I have not generally found this to be persuasive with content vendors. Especially ones which are metadata aggregators only, without any fulltext in the first place, heh. Publishers are more open to this -- but then publishers may have been ensnared in exclusive contracts with aggregators that leave them unable to do it even if they wanted to. (See EBSCO.)

I wrote an article on this a couple of years ago in Library Journal. In retrospect, I think my article is over-optimistic about the technical feasibility of doing this -- running a Solr instance isn't that bad, but the technical issues of maintaining the regular flow of updates from dozens of content providers, and normalizing all data to go in the same index, are non-trivial, I think now. http://www.libraryjournal.com/article/CA6413442.html

Owen Stephens wrote:

As others have suggested, I think much of this is around the practicalities of negotiating access, and the server power and expertise needed to run the service - it's simply more efficient to do this in one place. For me, the change that we need to open this up is for publishers to start pushing out a lot more of this data to all comers, rather than having to have this conversation several times over with individual sites or suppliers. How practical this is I'm not sure - especially as we are talking about indexing full-text where available (I guess). I think the Google News model (5-clicks free) is an interesting one - but I'm not sure whether this, or a similar approach, would work in a niche market which may not be so interested in total traffic. It seems (to me) so obviously in the publishers' interest for their content to be as easily discoverable as possible that I am optimistic they will gradually become more open to sharing more data that aids this - at least metadata. I'd hope that this would eventually open up the market to a broader set of suppliers, as well as institutions doing their own thing.

Owen

On Thu, Jul 1, 2010 at 2:37 AM, Eric Lease Morgan emor...@nd.edu wrote:

On Jun 30, 2010, at 8:43 PM, Blake, Miriam E wrote:

We have locally loaded records from the ISI databases, INSPEC, BIOSIS, and the Department of Energy (as well as from full-text publishers, but that is another story and system entirely). Aside from the contracts, I can also attest to the major amount of work it has been. We have 95M bibliographic records, stored in 75TB of disk, and counting. It's all running on Solr, with a local interface and the distributed aDORe repository on the backend. ~2 FTE keep it running in production now.

I definitely think what is outlined above -- local indexing -- is the way to go in the long run. Get the data. Index it. Integrate it into your other system. Know that you have it when you change or drop the license. No renting of data. And, we don't need no stinkin' interfaces! I believe a number of European institutions have been doing this for a number of years. I hear a few of us in the United States following suit. ++

--
Eric Morgan
University of Notre Dame
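To make the normalization problem Jonathan points to concrete, here is a minimal sketch of the kind of pipeline involved: map each vendor's field names onto one shared schema, then post the results to a local Solr core via Solr's JSON update handler. The field names, core name, and vendor IDs below are all invented for illustration; real vendor feeds are far messier than this.

```python
import requests

# Hypothetical local Solr core; the JSON update handler accepts a list of docs.
SOLR_UPDATE_URL = "http://localhost:8983/solr/articles/update?commit=true"

# Per-vendor maps from source field -> shared index field (invented names).
FIELD_MAPS = {
    "vendor_a": {"TI": "title", "AU": "author", "SO": "journal", "PY": "year"},
    "vendor_b": {"ArticleTitle": "title", "Creator": "author",
                 "JournalTitle": "journal", "PubYear": "year"},
}

def normalize(record, vendor):
    """Rename a vendor record's fields to the shared schema."""
    mapping = FIELD_MAPS[vendor]
    doc = {target: record[source]
           for source, target in mapping.items() if source in record}
    # Namespace the ID so two vendors' identifiers can never collide.
    doc["id"] = "%s:%s" % (vendor, record["id"])
    return doc

def index_batch(records, vendor):
    """Normalize one vendor's batch and send it to Solr."""
    docs = [normalize(r, vendor) for r in records]
    requests.post(SOLR_UPDATE_URL, json=docs, timeout=60).raise_for_status()

index_batch([{"id": "123", "TI": "On DIY Indexes", "PY": "2010"}], "vendor_a")
```

The hard part isn't this code; it's keeping dozens of such maps correct, on a schedule, as every provider changes its formats independently.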
Re: [CODE4LIB] DIY aggregate index
I'm planning on moving ahead with a proof-of-concept in the next year, after which I will certainly consider writing it up. I really hope I can get the go-ahead from database vendors. It's good to hear that a few institutions have successfully negotiated with them -- anyone from Los Alamos, the Scholars Portals, or any other local indexers, feel free to give me pointers on smooth-talking the vendors! :)

I also hope you're wrong in maintaining, in the article you linked to, that using controlled vocabularies for retrieval will never work well across databases that use different vocabularies. The (admittedly arduous and complex) work of crosswalking library-created controlled vocabularies like LCSH to periodical index thesauri and other formal and less-formal indexing languages out in the wild is *exactly* what I think librarians should be spending their time doing. Catalogers (and I include myself) spend a lot of time making largely irrelevant tweaks to already-existing MARC records before exporting them into our local ILSes, but article-level metadata from vendors is generally served up to the user as-is. I think Roy Tennant, as quoted in your article, is spot-on when he says that our inability to do any preprocessing of the data is a major hindrance. The data sources we subscribe to should be seen as starting points for generating a user experience, rather than letting the vendors decide what the discovery process is going to be like.

Cory

On 7/1/2010 11:39 AM, Jonathan Rochkind wrote:

I am eager to see you try it, Cory. Please consider writing up your results for the Code4Lib Journal. I'd be curious to hear the complete story, from issues of getting metadata, to issues of the technical infrastructure, any metadata normalization you need to do, issues of continuing to get the metadata on a regular basis, etc. Whether you succeed or fail, but especially if you succeed, your project with just a couple of databases could serve as a useful pilot for people considering doing it with more.

Jonathan

--
Cory Rockliff
Technical Services Librarian
Bard Graduate Center: Decorative Arts, Design History, Material Culture
18 West 86th Street
New York, NY 10024
T: (212) 501-3037
rockl...@bgc.bard.edu
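The crosswalk work Cory describes could start as small as a lookup table applied at ingest, so that a single LCSH subject search spans databases indexed with different thesauri. A toy sketch follows; the terms and mappings are invented for illustration, and real crosswalks are many-to-many and need ongoing curation.

```python
# Hypothetical vendor-thesaurus term -> LCSH heading(s).
CROSSWALK = {
    "furniture, american": ["Furniture--United States"],
    "decorative arts": ["Decoration and ornament", "Art objects"],
}

def add_lcsh(record):
    """Copy mapped LCSH headings into the record alongside the native terms."""
    native = [t.lower() for t in record.get("subjects", [])]
    lcsh = sorted({h for t in native for h in CROSSWALK.get(t, [])})
    return dict(record, subjects_lcsh=lcsh)

print(add_lcsh({"id": "a1", "subjects": ["Decorative arts"]}))
# -> {'id': 'a1', 'subjects': ['Decorative arts'],
#     'subjects_lcsh': ['Art objects', 'Decoration and ornament']}
```

Keeping both the native terms and the mapped headings in the index preserves each database's own retrieval behavior while adding the cross-database layer on top.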
Re: [CODE4LIB] DIY aggregate index
Cory Rockliff wrote:

Do libraries opt for these commercial 'pre-indexed' services simply because they're a good value proposition compared to all the work of indexing multiple resources from multiple vendors into one local index, or is it that companies like iii and Ex Libris are the only ones with enough clout to negotiate access to otherwise-unavailable database vendors' content?

A little bit of both, I think. A library probably _could_ negotiate access to that content... but it would be a heck of a lot of work. When the staff time for negotiations comes in, it becomes a good value proposition, regardless of how much the licensing would cost you. And yeah, then there's the staff time to actually ingest and normalize and troubleshoot data-flows for all that stuff on a regular basis -- I've heard stories of libraries that tried to do that in the early 90s and it was nightmarish.

So, actually, I guess I've arrived at convincing myself it's mostly a good value proposition, in that a library probably can't afford to do that on their own, with or without licensing issues. But I'd really love to see you try anyway; maybe I'm wrong. :)

Can I assume that if a database vendor has exposed their content to me as a subscriber, whether via z39.50 or a web service or whatever, that I'm free to cache and index all that metadata locally if I so choose? Is this something to be negotiated on a vendor-by-vendor basis, or is it an impossibility?

I doubt you can assume that. I don't think it's an impossibility.

Jonathan
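Whether you *may* cache vendor metadata is exactly the licensing question raised above; the technical side is comparatively straightforward. A minimal sketch: page through an SRU endpoint (the web-service cousin of Z39.50) and store the raw responses locally for later indexing. The endpoint URL is hypothetical; the request parameters are standard SRU 1.1.

```python
import pathlib
import requests

SRU_BASE = "https://sru.example-vendor.com/db"  # hypothetical endpoint
CACHE = pathlib.Path("sru_cache")
CACHE.mkdir(exist_ok=True)

def cache_results(query, page_size=50, max_records=500):
    """Fetch result pages and write each raw XML response to disk."""
    for start in range(1, max_records + 1, page_size):
        resp = requests.get(SRU_BASE, params={
            "operation": "searchRetrieve",  # standard SRU 1.1 parameters
            "version": "1.1",
            "query": query,
            "startRecord": start,
            "maximumRecords": page_size,
        }, timeout=60)
        resp.raise_for_status()
        slug = "".join(c if c.isalnum() else "_" for c in query)
        (CACHE / ("%s_%d.xml" % (slug, start))).write_text(resp.text)

cache_results('dc.title = "aggregate index"')
```

Caching raw responses first, and indexing from the cache, also means a change to your index schema doesn't require re-harvesting from the vendor.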
Re: [CODE4LIB] DIY aggregate index
You might also need to factor in an extra server or three (in the cloud or otherwise) into that equation, given that we're talking 100s of millions of records that will need to be indexed.

"companies like iii and Ex Libris are the only ones with enough clout to negotiate access"

I don't think III is doing any kind of aggregated indexing, hence their decision to try and leverage APIs. I could be wrong.

--Dave

==
David Walker
Library Web Services Manager
California State University
http://xerxes.calstate.edu
Re: [CODE4LIB] DIY aggregate index
Well, this is the thing: we're a small, highly-specialized collection, so I'm not talking about indexing the whole range of content which a university like JHU or even a small liberal arts college would need to -- it's really a matter of a few key databases in our field(s). Don't get me wrong, it's still a slightly crazy idea, but I'm dissatisfied enough with existing solutions that I'd like to try it.

On 6/30/2010 4:15 PM, Jonathan Rochkind wrote:

A little bit of both, I think. A library probably _could_ negotiate access to that content... but it would be a heck of a lot of work. When the staff time for negotiations comes in, it becomes a good value proposition, regardless of how much the licensing would cost you. And yeah, then there's the staff time to actually ingest and normalize and troubleshoot data-flows for all that stuff on a regular basis -- I've heard stories of libraries that tried to do that in the early 90s and it was nightmarish.

I wonder if they would, in fact, demand licensing fees. I mean, we're already paying a subscription, and they're already exposing their content as a target for federated search applications (which probably do some caching for performance)...

So, actually, I guess I've arrived at convincing myself it's mostly a good value proposition, in that a library probably can't afford to do that on their own, with or without licensing issues.

--
Cory Rockliff
Technical Services Librarian
Bard Graduate Center: Decorative Arts, Design History, Material Culture
18 West 86th Street
New York, NY 10024
T: (212) 501-3037
rockl...@bgc.bard.edu
Re: [CODE4LIB] DIY aggregate index
We're looking at an infrastructure based on MarkLogic running on Amazon EC2, so the scale of data to be indexed shouldn't actually be that big of an issue. Also, as I said to Jonathan, I only see myself indexing a handful of highly-relevant resources, so we're talking millions, rather than 100s of millions, of records.

On 6/30/2010 4:22 PM, Walker, David wrote:

You might also need to factor in an extra server or three (in the cloud or otherwise) into that equation, given that we're talking 100s of millions of records that will need to be indexed.

--
Cory Rockliff
Technical Services Librarian
Bard Graduate Center: Decorative Arts, Design History, Material Culture
18 West 86th Street
New York, NY 10024
T: (212) 501-3037
rockl...@bgc.bard.edu
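A back-of-envelope check supports Cory's intuition that millions (not hundreds of millions) of records is a modest indexing problem. All three figures below are assumptions for illustration, not measurements:

```python
records = 5_000_000          # assumed: "a handful of highly-relevant resources"
bytes_per_record = 4 * 1024  # assumed average metadata size per record
overhead = 3                 # assumed multiplier for index structures + copies

total_gb = records * bytes_per_record * overhead / 1024**3
print("~%.0f GB" % total_gb)  # ~57 GB: one well-provisioned node, not a cluster
```

Compare that with LANL's 95M records in 75TB, which includes far richer records and years of accumulated versions; the two workloads are different orders of magnitude.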
Re: [CODE4LIB] DIY aggregate index
We are one of those institutions that did this -- negotiated for lots of content YEARS ago (before the providers really knew what they or we were in for). We have locally loaded records from the ISI databases, INSPEC, BIOSIS, and the Department of Energy (as well as from full-text publishers, but that is another story and system entirely).

Aside from the contracts, I can also attest to the major amount of work it has been. We have 95M bibliographic records, stored in 75TB of disk, and counting. It's all running on Solr, with a local interface and the distributed aDORe repository on the backend. ~2 FTE keep it running in production now. Over the 15 years we've been loading this, we've had to migrate it 3 times, and deal with all the dirty metadata, duplication, and other difficult issues around scale and the lack of content-provider interest in supporting the few of us who do this kind of stuff. We believe we have now achieved a standardized format (MPEG-21 DIDL and MARCXML, with some other standards mixed in), accessible through protocol-based services (OpenURL, REST, OAI-PMH, etc.), so we hope we won't have to mess with the data records again and can move on to other, more interesting things.

It is nice to have, and very fast -- it very much beats federated search -- and it allows us (finally) to begin to build neat services (for licensed users only!). Data mining? Of course a goal, but talk about sticky areas of contract negotiation. And in the end, you never have everything someone needs when they want all content about something specific.

And yes, local loading is expensive, for a lot of reasons. Ex Libris, Summon, etc. are now getting into the game from this angle. We do so feel their pain, but I hope technology and content-provider engagement have improved to make it a bit easier for them! And it definitely adds a level of usability much improved over federated search.

My .02,

Miriam Blake
Los Alamos National Laboratory Research Library

On 6/30/10 3:20 PM, Rosalyn Metz rosalynm...@gmail.com wrote:

I know that there are institutions that have negotiated contracts for just the content, sans interface. But those that I know of have TONS of money and are using a 3rd-party interface that ingests the data for them. I'm not sure what the terms of that contract were or how they get the data, but it can be done.

On Wed, Jun 30, 2010 at 5:07 PM, Cory Rockliff rockl...@bgc.bard.edu wrote:

We're looking at an infrastructure based on MarkLogic running on Amazon EC2, so the scale of data to be indexed shouldn't actually be that big of an issue. Also, as I said to Jonathan, I only see myself indexing a handful of highly-relevant resources, so we're talking millions, rather than 100s of millions, of records.
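Miriam mentions that the LANL aggregate is exposed over OpenURL, REST, and OAI-PMH. OAI-PMH is also a natural answer to the "regular flow of updates" problem raised earlier in the thread: harvest incrementally by date and follow resumption tokens. A minimal harvester sketch using only the protocol itself; the endpoint URL is hypothetical, and a production harvester would also need error handling and deleted-record support.

```python
import requests
import xml.etree.ElementTree as ET

OAI = "https://repository.example.org/oai"  # hypothetical endpoint
NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def harvest(from_date):
    """Yield raw <record> elements changed since from_date, following
    resumption tokens until the repository says it is done."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc",
              "from": from_date}
    while True:
        resp = requests.get(OAI, params=params, timeout=60)
        resp.raise_for_status()
        root = ET.fromstring(resp.content)
        yield from root.iter("{%s}record" % NS["oai"])
        token = root.find(".//oai:resumptionToken", NS)
        if token is None or not (token.text or "").strip():
            return
        # Per the OAI-PMH spec, follow-up requests carry only the token.
        params = {"verb": "ListRecords", "resumptionToken": token.text}

for rec in harvest("2010-07-01"):
    header = rec.find("oai:header/oai:identifier", NS)
    print(header.text if header is not None else "(no identifier)")
```

Run nightly against each provider that offers OAI-PMH, this is the skeleton of the update flow that Jonathan calls non-trivial; the protocol part is simple, and the ongoing per-provider normalization is where the 2 FTE go.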