[CODE4LIB] OCA API
Hi Folks,

The University Library at UNC-Chapel Hill has created an OCA API. We have harvested (and continue to harvest) standard bibliographic identifiers and link them to OCA identifiers. The API is deliberately modeled after Google's for ease of implementation.

Here is a subject search in UNC's catalog for "North Carolina" limited to the 19th century.

http://search.lib.unc.edu/search?Ntk=Subject&Ne=2+200043+206475+206590+11&N=206596&Ntt=north%20carolina

You will see links to OCA as well as Google. (The full record has an OCA icon if you want to look.)

Right now we are only banging against the API with OCLC numbers, but ISSNs, ISBNs and LC numbers are in there.

We are looking for a couple of partners to work with to take us beyond our local OPAC. You would be ideal if:

you are interested,
you already use the Google API,
you have a significant corpus of pre-1923 works in your catalog.

As the Google API is familiar to many of you, it would be easy to figure out how to implement UNC's without working with us. Please hold off until we are ready to open it up all the way? This is why we've not yet put up documentation.

Caveats and other notes (feel free to skip):

*We realize that Open Library has an API, but we had already gone a goodly distance and we are finding relatively meaningful differences in coverage and utility.

*We collect the data from OCA as it comes in (the data should be up to date within a half hour or so)...but they occasionally need to correct/remove works. Right now we are actively working on this issue, but do not yet have a great mechanism to pull deletes and update corrected identifiers.

*The data is only as good as the data we harvest. There are a small number of bad links. See above.
*Excerpt from a developer on UNC's holdings (we are an OCA Scribe site):

...I decided to run the same script against the [production] database as well to see how much the matching is changing over time with continual updates:

- 429311 OCLC's tested
- 72350 matched
- 2599 of the matches were scanned by UNC

So that's 808 new matches since the end of March, not too bad for one month.

Effectively we are now linking to ~72 K digitized works that we were not previously able to provide (though as Google digitized books are being added to OCA, there is significant overlap).

*When we do open it up it is the API we are offering, we are not prepared to be crawled for data. If you want the data, get in touch and we will see what we can do.

If you are interested in being an early partner, please drop me a line and I will be in touch.

Tim

+++
Tim Shearer
Web Development Coordinator
The University Library
University of North Carolina at Chapel Hill
sh...@ils.unc.edu
919-962-1288
+++
Re: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?
On Fri, Mar 14, 2008 at 10:31 AM, Emily Lynema <[EMAIL PROTECTED]> wrote:

> available. I have to admit it seems odd to me to include so much
> attribute information in a single element, but I suppose that
> would be helpful in identifying what specific manifestation is being
> referred to in the URL?

We made that design choice to be largely compatible with OCLC Research's version of the xISBN service, and this kind of "flat" structure also helps us to easily disseminate other formats, such as csv or json serialization:

http://xisbn.worldcat.org/webservices/xid/isbn/0689868847?library=oca&fl=*&format=csv
http://xisbn.worldcat.org/webservices/xid/isbn/0689868847?library=oca&fl=*&format=python

> In this scenario, is there a way to indicate free vs. licensed somewhere
> in the entry? I'm assuming that the Netlibrary audio book is
> *not* free. We have very few mechanisms to do that within MARC records;
> it would be great to think about that here as most libraries will be
> interested in *free* links to digitized content available from anywhere
> (google book search, oca, etc.).

We support a "library=freeebook" flag to limit the search scope to free ebooks:

http://xisbn.worldcat.org/webservices/xid/isbn/0689868847?library=freeebook&fl=*

Currently the free ebook collection is rather small (a few thousand titles); hopefully we can grow the collection soon to make it more useful. You can find more statistical information at http://xisbn.worldcat.org/xisbnadmin/doc/stat.htm

> Also, have you considered the response for multiple digitized sources
> for the same ISBN?

If an ISBN has multiple digitized sources, they are put in the "url" attribute, separated by spaces, e.g.

http://xisbn.worldcat.org/webservices/xid/isbn/0596002815?library=ebook&fl=title,url

Xiaoming
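For anyone consuming the space-separated "url" attribute Xiaoming describes, here is a minimal Python sketch of splitting it back into individual links. The sample XML is made up for illustration: the thread only says that the attributes live on a single element, so the element names (`rsp`, `isbn`), the title, and the example.org links are assumptions, not the service's documented response.

```python
import xml.etree.ElementTree as ET

# Hypothetical response, shaped after the description in the thread:
# one element per ISBN, with everything else carried as attributes.
SAMPLE_RESPONSE = """<rsp stat="ok">
  <isbn title="Example title" url="http://example.org/a http://example.org/b">0596002815</isbn>
</rsp>"""


def digitized_urls(xml_text):
    """Map each ISBN in the response to its list of digitized-source URLs."""
    root = ET.fromstring(xml_text)
    urls_by_isbn = {}
    for element in root.iter("isbn"):
        isbn = (element.text or "").strip()
        # Multiple sources arrive in one attribute, separated by spaces.
        urls_by_isbn[isbn] = element.get("url", "").split()
    return urls_by_isbn


if __name__ == "__main__":
    print(digitized_urls(SAMPLE_RESPONSE))
```

Splitting on whitespace is safe only as long as the links themselves contain no unescaped spaces, which is worth checking against real responses.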
Re: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?
On Fri, Mar 14, 2008 at 10:31 AM, Emily Lynema <[EMAIL PROTECTED]> wrote:

> > http://xisbn.worldcat.org/webservices/xid/isbn/0689868847?library=oca&fl=*
>
> I have to admit it seems odd to me to include so much
> attribute information in a single element, but I suppose that
> would be helpful in identifying what specific manifestation is being
> referred to in the URL?

You can also choose to return any combination of attributes by using the fl parameter. For example, to include just the year and the url, use:

http://xisbn.worldcat.org/webservices/xid/isbn/0689868847?library=oca&fl=year,url

Keith
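The request URLs quoted in this thread all follow one pattern: an ISBN in the path, then `library`, `fl` and optionally `format` in the query string. A small sketch of a builder for that pattern follows. The function name and its argument names are invented for illustration; only the URL shape and the parameter names come from the messages above, and the sketch only builds the string rather than calling the service.

```python
from urllib.parse import urlencode

# Base path as it appears in the URLs quoted in this thread.
XISBN_BASE = "http://xisbn.worldcat.org/webservices/xid/isbn/"


def xisbn_url(isbn, library=None, fields=("*",), fmt=None):
    """Build an xISBN request URL (hypothetical helper, not an official client)."""
    params = {}
    if library:
        params["library"] = library      # e.g. "oca", "ebook", "freeebook"
    params["fl"] = ",".join(fields)      # "*" for all attributes, or a subset
    if fmt:
        params["format"] = fmt           # e.g. "csv"
    # Keep "*" and "," unescaped so the result matches the URLs in the thread.
    return XISBN_BASE + isbn + "?" + urlencode(params, safe="*,")


if __name__ == "__main__":
    print(xisbn_url("0689868847", library="oca", fields=("year", "url")))
```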
[CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?
xiaoming,

This is cool. just a couple of thoughts.

> If a user is interested in "The Golden Fleece and the Heroes Who Lived
> Before Achilles" with ISBN:0689868847, he can limit the search to
> "OCA" by issuing xISBN request with "library=oca", such as:
>
> http://xisbn.worldcat.org/webservices/xid/isbn/0689868847?library=oca&fl=*
>
> This query limits search scope to OCA, and the result returns an ISBN
> match with its URL link to:
>
> http://www.archive.org/details/goldenfleecehero00colu

I think many libraries, both those currently involved in OCA and those outside the project, would be happy to see something like this made available. I have to admit it seems odd to me to include so much attribute information in a single element, but I suppose that would be helpful in identifying what specific manifestation is being referred to in the URL?

> Similarly, a user can request same ISBN with the library limiting to
> "ebook", such as:
>
> http://xisbn.worldcat.org/webservices/xid/isbn/0689868847?library=ebook&fl=*
>
> It returns both OCA match and a Netlibrary Audio book match.

In this scenario, is there a way to indicate free vs. licensed somewhere in the entry? I'm assuming that the Netlibrary audio book is *not* free. We have very few mechanisms to do that within MARC records; it would be great to think about that here as most libraries will be interested in *free* links to digitized content available from anywhere (google book search, oca, etc.).

Also, have you considered the response for multiple digitized sources for the same ISBN?

-emily

--
Emily Lynema
Systems Librarian for Digital Projects
Information Technology, NCSU Libraries
919-513-8031
[EMAIL PROTECTED]
Re: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?
Jonathan,

We are using the public FRBR algorithm developed by OCLC Research.

Since we have just loaded a limited set of OCA records into the xISBN service, it might be interesting to illustrate what can be done in the current system.

If a user is interested in "The Golden Fleece and the Heroes Who Lived Before Achilles" with ISBN:0689868847, he can limit the search to "OCA" by issuing an xISBN request with "library=oca", such as:

http://xisbn.worldcat.org/webservices/xid/isbn/0689868847?library=oca&fl=*

This query limits the search scope to OCA, and the result returns an ISBN match with its URL link to:

http://www.archive.org/details/goldenfleecehero00colu

Similarly, a user can request the same ISBN with the library limited to "ebook", such as:

http://xisbn.worldcat.org/webservices/xid/isbn/0689868847?library=ebook&fl=*

It returns both an OCA match and a Netlibrary Audio book match.

Given that we only have a very limited number of ISBN matches (over 1000 titles) with OCA, perhaps the result is not good enough for practical use. I believe the result will be significantly improved once we have xoclcnum in place.

xiaoming

On Wed, Mar 12, 2008 at 5:36 PM, Jonathan Rochkind <[EMAIL PROTECTED]> wrote:

> This is great stuff. I am interested in what algorithms you are using to
> group works. It sounds like you are doing that, above what OCA does
> (which is nothing, I think). Have you gotten that far yet? What are you
> thinking? Oh wait, you're from OCLC, you guys have already got all sorts
> of stuff to do that, I guess.
>
> Jonathan
>
> Tim McCormick wrote:
> > In our office we too have been investigating the e-book material at
> > Internet Archive / OCA.
> >
> > We'd like to build just the sort of OCA index / id-switcher that Tim
> > Shearer and others have described on this list -- in order to, among
> > other things, add this type of capability to our xID (aka xISBN)
> > service, and to WorldCat.
> >
> > So, I thought I'd report on results so far, and what we're working on.
> > > > Data: > > 1) First, we used the Internet Archive's OAI interface to harvest > > brief records of all items categorized as "text". We found that this > > yielded only very brief records, though -- author, title, and OCA > > unique identifier (e.g. "northcarolinayea1910rale"). > > 2) Then we used the OCA identifier to check for, and harvest, MARC-XML > > records when available, using the lookup method described by Chris > > Freeland on Code4Lib on Feb 25. > > 3) The MARC files were examined for ISBNs and OCLCnums. (yes, we may > > look for other identifiers later). > > > > That yielded: > > - 290,756 total OCA "text" records found > > - 198,826 of those had MARC records > > - 1773 had ISBNs > > - 88537 had OCLC numbers (identified by record position & format, > > but not yet verified against WorldCat). > > > > Switching: > > In xID we currently support ISBN, have recently added LCCN, and we > > plan to release ISSN and OCLCnum support in upcoming releases. So, > > when those are fully phased in, the goal is that you could submit an > > identifier of any supported type, and get back all identifiers of > > whichever type that represent versions of the same "work"; or, when > > appropriate, the same manifestation. > > Therefore, the 88.537 OCLCnums will likely map to a much larger > > set of identifiers over all, allowing a lot of book records -- in > > library catalogs or elsewhere -- to hook into OCA materials. > > > > Free-text service: > > We imagine a service which, given an identifier, attempts to decide if > > a free-text version of the described work is available at OCA/IA: and > > if so, returns an access URL for that resource. > > > > Other work: > > We are investigating the case of free/open resources that lack > > standard identifiers -- for example, possibly, the 2/3 of IA texts for > > which we didn't find OCLCnum or ISBN. 
Here, we are looking at doing > > "best-guess" lookup of related identifiers, based on author and title > > information in the brief record. This might allow substantially > > broader indexing of open content materials, but the reliability of the > > identifier association is lower. > > > > Any tips, questions, suggestions, requests are welcome. > > thanks to Xiaoming Liu and Tom Ventimiglia in OCLC New Jersey office > > for work on this. > > > > Tim > > > > -- > > Tim McCormick > > Product Manager (xID), OCLC New Jersey > > Email: mccormit (at) oclc.org > > 2 Broad St., Suite 208, Bloomfield, New Jersey 07003 USA > > Phone: +1.973.868.5694 | Skype: tim_mccormick > > http://www.oclc.org/ > > > > > > -- > Jonathan Rochkind > Digital Services Software Engineer > The Sheridan Libraries > Johns Hopkins University > 410.516.8886 > rochkind (at) jhu.edu >
Re: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?
This is great stuff. I am interested in what algorithms you are using to group works. It sounds like you are doing that, above what OCA does (which is nothing, I think). Have you gotten that far yet? What are you thinking? Oh wait, you're from OCLC, you guys have already got all sorts of stuff to do that, I guess.

Jonathan

--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu
Re: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?
This is pretty good stuff. Consider submitting an article proposal to Code4Lib Journal about it. :)

Jonathan

--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu
Re: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?
Tim -

This is awesome work! One thing to be aware of is that IA takes a non-hierarchical view of scanned books - there is no Title->Item (Bib->Item) relationship. When they scan a serial or multivolume monograph the MARCXML file for the Title is deposited in each scanned Item. For instance, the MARCXML for "The transactions of the Academy of Science of St. Louis" is dropped into this item, which is volume 21:

http://www.archive.org/details/transactionsofac21acad
(Click the FTP link along the left, then the _marc.xml file)

and this item, which is volume 22:

http://www.archive.org/details/transactionsofac22acad

You'll see they are identical files. So, your number of 198,826 MARC files does not correspond to 198,826 titles. You will need to group those MARC files by to get a true count of titles. This is what BHL does when we ingest materials from http://www.archive.org/details/biodiversity into http://www.biodiversitylibrary.org/

Chris

-Original Message-
From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of Tim McCormick
Sent: Wednesday, March 12, 2008 3:58 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?

In our office we too have been investigating the e-book material at Internet Archive / OCA.

We'd like to build just the sort of OCA index / id-switcher that Tim Shearer and others have described on this list -- in order to, among other things, add this type of capability to our xID (aka xISBN) service, and to WorldCat.

So, I thought I'd report on results so far, and what we're working on.

Data:
1) First, we used the Internet Archive's OAI interface to harvest brief records of all items categorized as "text". We found that this yielded only very brief records, though -- author, title, and OCA unique identifier (e.g. "northcarolinayea1910rale").
2) Then we used the OCA identifier to check for, and harvest, MARC-XML records when available, using the lookup method described by Chris Freeland on Code4Lib on Feb 25. 3) The MARC files were examined for ISBNs and OCLCnums. (yes, we may look for other identifiers later). That yielded: - 290,756 total OCA "text" records found - 198,826 of those had MARC records - 1773 had ISBNs - 88537 had OCLC numbers (identified by record position & format, but not yet verified against WorldCat). Switching: In xID we currently support ISBN, have recently added LCCN, and we plan to release ISSN and OCLCnum support in upcoming releases. So, when those are fully phased in, the goal is that you could submit an identifier of any supported type, and get back all identifiers of whichever type that represent versions of the same "work"; or, when appropriate, the same manifestation. Therefore, the 88.537 OCLCnums will likely map to a much larger set of identifiers over all, allowing a lot of book records -- in library catalogs or elsewhere -- to hook into OCA materials. Free-text service: We imagine a service which, given an identifier, attempts to decide if a free-text version of the described work is available at OCA/IA: and if so, returns an access URL for that resource. Other work: We are investigating the case of free/open resources that lack standard identifiers -- for example, possibly, the 2/3 of IA texts for which we didn't find OCLCnum or ISBN. Here, we are looking at doing "best-guess" lookup of related identifiers, based on author and title information in the brief record. This might allow substantially broader indexing of open content materials, but the reliability of the identifier association is lower. Any tips, questions, suggestions, requests are welcome. thanks to Xiaoming Liu and Tom Ventimiglia in OCLC New Jersey office for work on this. 
Tim -- Tim McCormick Product Manager (xID), OCLC New Jersey Email: mccormit (at) oclc.org 2 Broad St., Suite 208, Bloomfield, New Jersey 07003 USA Phone: +1.973.868.5694 | Skype: tim_mccormick http://www.oclc.org/
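Chris Freeland's point about counting titles rather than scanned items can be sketched in a few lines of Python. The grouping key is missing from his message as archived, so this sketch falls back on his remark that the title-level MARCXML files are identical across volumes and groups items by a hash of the record text. That choice, the function name, and the placeholder record bodies are all assumptions for illustration, not BHL's actual method.

```python
import hashlib
from collections import defaultdict


def group_items_by_marc(items):
    """Group OCA item identifiers whose MARCXML records have identical content.

    items: iterable of (oca_identifier, marc_xml_text) pairs.
    Returns a dict mapping a content digest to the list of item identifiers.
    """
    groups = defaultdict(list)
    for oca_id, marc_xml in items:
        # Collapse whitespace so trivial formatting differences do not split a title.
        normalized = " ".join(marc_xml.split())
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        groups[digest].append(oca_id)
    return dict(groups)


if __name__ == "__main__":
    # Placeholder record bodies; real input would be the harvested _marc.xml files.
    serial_record = "<record><title>Transactions (placeholder)</title></record>"
    harvested = [
        ("transactionsofac21acad", serial_record),
        ("transactionsofac22acad", serial_record),
        ("northcarolinayea1910rale", "<record><title>Other (placeholder)</title></record>"),
    ]
    groups = group_items_by_marc(harvested)
    print(len(harvested), "items,", len(groups), "distinct title records")
```

A content hash only catches byte-identical records. If the per-item copies ever diverge, grouping on a stable field inside the record would be the sturdier choice.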
Re: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?
In our office we too have been investigating the e-book material at Internet Archive / OCA.

We'd like to build just the sort of OCA index / id-switcher that Tim Shearer and others have described on this list -- in order to, among other things, add this type of capability to our xID (aka xISBN) service, and to WorldCat.

So, I thought I'd report on results so far, and what we're working on.

Data:
1) First, we used the Internet Archive's OAI interface to harvest brief records of all items categorized as "text". We found that this yielded only very brief records, though -- author, title, and OCA unique identifier (e.g. "northcarolinayea1910rale").
2) Then we used the OCA identifier to check for, and harvest, MARC-XML records when available, using the lookup method described by Chris Freeland on Code4Lib on Feb 25.
3) The MARC files were examined for ISBNs and OCLCnums. (yes, we may look for other identifiers later).

That yielded:
- 290,756 total OCA "text" records found
- 198,826 of those had MARC records
- 1773 had ISBNs
- 88537 had OCLC numbers (identified by record position & format, but not yet verified against WorldCat).

Switching:
In xID we currently support ISBN, have recently added LCCN, and we plan to release ISSN and OCLCnum support in upcoming releases. So, when those are fully phased in, the goal is that you could submit an identifier of any supported type, and get back all identifiers of whichever type that represent versions of the same "work"; or, when appropriate, the same manifestation.
Therefore, the 88,537 OCLCnums will likely map to a much larger set of identifiers over all, allowing a lot of book records -- in library catalogs or elsewhere -- to hook into OCA materials.

Free-text service:
We imagine a service which, given an identifier, attempts to decide if a free-text version of the described work is available at OCA/IA: and if so, returns an access URL for that resource.

Other work:
We are investigating the case of free/open resources that lack standard identifiers -- for example, possibly, the 2/3 of IA texts for which we didn't find OCLCnum or ISBN. Here, we are looking at doing "best-guess" lookup of related identifiers, based on author and title information in the brief record. This might allow substantially broader indexing of open content materials, but the reliability of the identifier association is lower.

Any tips, questions, suggestions, requests are welcome.
thanks to Xiaoming Liu and Tom Ventimiglia in OCLC New Jersey office for work on this.

Tim

--
Tim McCormick
Product Manager (xID), OCLC New Jersey
Email: mccormit (at) oclc.org
2 Broad St., Suite 208, Bloomfield, New Jersey 07003 USA
Phone: +1.973.868.5694 | Skype: tim_mccormick
http://www.oclc.org/
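Step 3 above (examining MARC files for ISBNs and OCLCnums) might look roughly like the following Python sketch. It is not the OCLC team's method: Tim says they identified OCLC numbers "by record position & format", whereas this sketch assumes the common cataloging convention of ISBNs in field 020 subfield a and OCLC numbers in field 035 subfield a with an "(OCoLC)" prefix. The sample record and its identifier values are made up.

```python
import re
import xml.etree.ElementTree as ET

# MARC21 slim namespace used by MARC-XML records.
NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Made-up record for illustration only.
SAMPLE_MARC = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="020" ind1=" " ind2=" ">
    <subfield code="a">0000000000 (pbk.)</subfield>
  </datafield>
  <datafield tag="035" ind1=" " ind2=" ">
    <subfield code="a">(OCoLC)ocm00012345</subfield>
  </datafield>
</record>"""


def extract_identifiers(marc_xml):
    """Return ISBNs and OCLC numbers found in one MARC-XML record."""
    root = ET.fromstring(marc_xml)
    isbns, oclc_numbers = [], []
    # Assumption: ISBNs live in 020$a, possibly followed by a qualifier.
    for sub in root.findall(".//marc:datafield[@tag='020']/marc:subfield[@code='a']", NS):
        match = re.match(r"[0-9Xx-]+", (sub.text or "").strip())
        if match:
            isbns.append(match.group(0).replace("-", "").upper())
    # Assumption: OCLC numbers live in 035$a, prefixed with "(OCoLC)".
    for sub in root.findall(".//marc:datafield[@tag='035']/marc:subfield[@code='a']", NS):
        match = re.match(r"\(OCoLC\)\D*0*(\d+)", (sub.text or "").strip())
        if match:
            oclc_numbers.append(match.group(1))
    return {"isbn": isbns, "oclc": oclc_numbers}


if __name__ == "__main__":
    print(extract_identifiers(SAMPLE_MARC))
```

Records contributed by different libraries may carry the OCLC number elsewhere (or not label it at all), so a harvest based on this sketch would likely find fewer matches than a position-and-format heuristic.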
[CODE4LIB] more musing/clarification on oca apiRe: [CODE4LIB] oca api?
ld you like it? I've not built a queryable webservice and am going on record as ignorant. Is there a query language I should lean toward? A return data structure that I should adopt?

All this stems from my belief that what I can see of the architecture indicates a split between what the participating library sends and what the oca system uses. It appears that their index/record of record simply ignores all those hooks people have been adding to bib records. If both parts were wrapped and offered up with a Solr interface I could get on with putting links into my catalog.

Still, they make both their record (with their identifier) available, and my record (with the rest) available. So, like I said in an earlier post, I'm glad for the opportunity to be frustrated.

Whew. If I'd been hacking instead of writing I'd have something to show and y'all would be less bored.

Thanks!
-t

Like Karen and Bess and others have said, I recommend that you coordinate this with the Open Library project. At the meeting last Friday, it did sound like they would be interested in providing identifier disambiguation types of service - give them an ISBN, and they'll give you the records associated with it.

Also, there was discussion about building an Open Library API (to enable some cool integration with wikipedia), and I suggested that libraries using an API would want the search results to include information about whether the title has a digitized copy. So I would hope the service that you're envisioning is something that would be provided by an Open Library API (but we don't know when that might come about).

As OCA moves forward, folks may well be digitizing identical books. So there may not be a one to one relationship between unique catalog identifier, unique oca identifier, and isbn/lccn/oclc number.

-emily

--
Date: Thu, 6 Mar 2008 08:47:04 -0500
From: Tim Shearer <[EMAIL PROTECTED]>
Subject: musing on oca apiRe: [CODE4LIB] oca api?

Howdy folks,

I've been playing and thinking. I'd like to have what amounts to a unique identifier index to oca digitized texts. I want to be able to pull all the records that have oclc numbers, issns, isbns, etc. I want it to be lightweight, fast, searchable.

Would anyone else want/use such a thing?

I'm thinking about building something like this. If I do, it would be ideal if it wouldn't be a duplication of effort, so anyone got this in the works? And if it would meet the needs of others.

My basic notion is to crawl the site (starting with "americana", the American Libraries). Pull the oca unique identifier (e.g. northcarolinayea1910rale) and associate it with

unique identifiers (oclc numbers, issns, isbns, lc numbers)
contributing institution's alias and unique catalog identifier
upload date

That's all I was thinking of. Then there's what you might be able to do with it:

Give me all the oca unique identifiers that have oclc numbers
Give me all the oca unique identifiers with isbns that were uploaded between x and y date
Give me the oca unique identifier for this oclc number

Planning to do:

keep crawling it and keep it up to date.

Things I wasn't planning to do:

worry about other unique ids (you'd have to go to xISBN or ThingISBN yourself)
worry about storing anything else from oca.

It would be good for being able to add an 856 to matches in your catalog. It would not be good for grabbing all marc records for all of oca.

Anyhow, is this duplication of effort? Would you like something like this? What else would you like it to do (keeping in mind this is an unfunded pet project)? How would you want to talk to it? I was thinking of a web service, but hadn't thought too much about how to query it or how I'd deliver results.

Of course I'm being an idiot and trying out new tools at the same time (python to see what the buzz is all about, sqlite just to learn it (it may not work out)).

Thoughts? Vicious criticism?
-t -- Date:Thu, 6 Mar 2008 11:05:41 -0500 From:Jodi Schneider <[EMAIL PROTECTED]> Subject: Re: musing on oca apiRe: [CODE4LIB] oca api? Great idea, Tim! The open library tech list that Bess mentions is [EMAIL PROTECTED], described at http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech -Jodi Jodi Schneider Science Library Specialist Amherst College 413-542-2076 -- Date:Thu, 6 Mar 2008 08:32:43 -0800 From:Karen Coyle <[EMAIL PROTECTED]> Subject: Re: musing on oca apiRe: [CODE4LIB] oca api? We talked about something like this at the Open Library meeting last Friday. The ol list is [EMAIL PROTECTED] (join at http://mail.archive.org/cgi-bin/mailman/listinfo/ol-lib). I thi
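Since Tim Shearer mentions python and sqlite for his identifier index, here is one way the index and his three example queries could be sketched. The table layout, column names, and function names are invented for illustration, and every sample value other than the OCA identifier quoted in his message is made up. Lookups return lists because, as Emily points out, the same book may be digitized more than once.

```python
import sqlite3


def build_index(rows):
    """Create an in-memory index; rows are
    (oca_id, id_type, id_value, contributor, catalog_id, upload_date)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE oca_ids (
               oca_id      TEXT NOT NULL,  -- e.g. northcarolinayea1910rale
               id_type     TEXT NOT NULL,  -- 'oclc', 'issn', 'isbn' or 'lccn'
               id_value    TEXT NOT NULL,
               contributor TEXT,           -- contributing institution's alias
               catalog_id  TEXT,           -- contributor's own catalog identifier
               upload_date TEXT            -- ISO date, so BETWEEN works on text
           )"""
    )
    conn.execute("CREATE INDEX idx_lookup ON oca_ids (id_type, id_value)")
    conn.executemany("INSERT INTO oca_ids VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn


def oca_ids_for(conn, id_type, id_value):
    """'Give me the oca unique identifier for this oclc number' (may be several)."""
    cur = conn.execute(
        "SELECT DISTINCT oca_id FROM oca_ids "
        "WHERE id_type = ? AND id_value = ? ORDER BY oca_id",
        (id_type, id_value),
    )
    return [row[0] for row in cur]


def oca_ids_with(conn, id_type, start=None, end=None):
    """All oca identifiers having an id of this type, optionally by upload date."""
    sql = "SELECT DISTINCT oca_id FROM oca_ids WHERE id_type = ?"
    args = [id_type]
    if start is not None and end is not None:
        sql += " AND upload_date BETWEEN ? AND ?"
        args += [start, end]
    sql += " ORDER BY oca_id"
    return [row[0] for row in conn.execute(sql, args)]


SAMPLE_ROWS = [
    ("northcarolinayea1910rale", "oclc", "00000001", "unc", "b0000001", "2008-02-01"),
    ("northcarolinayea1910rale", "lccn", "00-000001", "unc", "b0000001", "2008-02-01"),
    ("exampleitem00demo", "oclc", "00000001", "other", "x0000001", "2008-03-01"),
    ("exampleitem00demo", "isbn", "0000000000", "other", "x0000001", "2008-03-01"),
]

if __name__ == "__main__":
    conn = build_index(SAMPLE_ROWS)
    print(oca_ids_for(conn, "oclc", "00000001"))
    print(oca_ids_with(conn, "isbn", "2008-03-01", "2008-03-31"))
```

A single long table of (item, identifier type, value) rows keeps the crawler simple: each newly found identifier is one insert, and adding a new identifier type needs no schema change.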
Re: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?
> Nope. ISBN was created in 1966. LCCNs exist for many resources
> published before 1966. Even after 1966, not every single item that may
> have been cataloged by the Library of Congress was necessarily assigned
> an ISBN by its publisher

All true. What I meant was that _if_ an isbn exists for an item and LC cataloged it, the LC record should have both. Many resources with and without isbns may not be in LC, so the lccn cannot be used as a substitute, but LC records can be considered a reasonably authoritative source of isbns for the stuff that they have.

> Nope. I think you mean all items that have an LCCN should also have an
> OCLC number. Probably true (mostly). But all items that have an OCLC
> number will not necessarily have an LCCN. You say so below "items that
> were not cataloged by lc" will have oclc numbers but probably not
> lccns.

I misspoke but it appears you see what I mean. The relationship between oclc numbers and lccns is similar to that between lccns and isbns. The oclc number is not a substitute for an lccn, but if a record that has an oclc number also contains an lccn, the oclc record can be considered an authoritative source for the lccn -- and an isbn if one exists.

> I do not believe this is the case. But let us admit that our cooperative
> cataloging corpus in fact IS not very reliable, it is full of incorrect
> information. But we've got to deal with it anyway. A record that is
> _missing_ an applicable identifier that it _could_ have contained may be
> reliable in other respects, I wouldn't automatically assume it is not.

The quality is variable, but it's the best we have and it's worth using the most reliable data available. Otherwise, inappropriate linkages start popping up. If there are relatively few, that's not a big deal, but once you get too much bad data in the system you have a real problem.

kyle

--
Kyle Banerjee
Digital Services Program Manager
Orbis Cascade Alliance
[EMAIL PROTECTED] / 541.359.9599
Re: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?
Kyle Banerjee wrote:

> > I want to be able to pull all the records that have oclc numbers, issns, isbns, etc. I want it to be lightweight, fast, searchable. Would anyone else want/use such a thing?...
>
> I like the idea, but in the long term, I just don't know how useful this will be. By and large, these identifiers are designed for dead tree resources. Although they are sometimes assigned to electronic resources, I find it hard to believe that the containers these identifiers are associated with will contain more than a tiny proportion of the information users want/need. The book structure just doesn't make nearly as much sense in an online environment.

The utility that I see is that as things are digitized the "dead tree" identifier is often included in the metadata that accompanies the digital file. This makes it possible to go from legacy data (read: library catalogs) to the digital data.

> Not sure I understand the use case (i.e. the value of retrieving another identifier).

Because the same "dead tree" item is being digitized multiple times in different locations under different projects. It's an interesting situation because where we once had an ISBN that identified EVERY copy of that "manifestation" we will now have many different copies (different because they were digitized separately). Those copies will probably have a variety of identifiers associated with them.

> One thing to keep in mind is that although the numbering schemes are independent, they can be thought of as hierarchical. Anything that has an lccn number should already have an isbn because of the standards lc catalogs to. And they put their holdings in OCLC, so all numbers that have an oclc number should contain these other identifiers. Items with oclc numbers that were not cataloged by lc should also have isbns. When such conditions are not met, it is a sign of a record containing unreliable information.

Not the case. First, ISBNs only came into being in 1968. Nothing before that has one. Many items have NOT been cataloged by LC, many are NOT in OCLC, and oftentimes the records that you are working with have munged, stripped out, or lost the identity of the identifiers that are left. It's great luck if you find one clearly marked identifier in a bib record.

kc

> kyle
> --
> Kyle Banerjee
> Digital Services Program Manager
> Orbis Cascade Alliance
> [EMAIL PROTECTED] / 541.359.9599

--
Karen Coyle / Digital Library Consultant
[EMAIL PROTECTED] http://www.kcoyle.net
ph.: 510-540-7596 skype: kcoylenet
fx.: 510-848-3913
mo.: 510-435-8234
Re: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?
Kyle Banerjee wrote:

> I like the idea, but in the long term, I just don't know how useful this will be. By and large, these identifiers are designed for dead tree resources.

Only time will tell, but it's what we've got now, and I don't see our existing legacy records going away. So we will continue to need to try and match existing records to digitized resources representing those existing records. (Keep in mind that OCA for now is mostly only digitizing out of copyright stuff!) The more identifiers, the more likely we can successfully make such a match.

> One thing to keep in mind is that although the numbering schemes are independent, they can be thought of as hierarchical. Anything that has an lccn number should already have an isbn because of the standards lc catalogs to.

Nope. ISBN was created in 1966. LCCNs exist for many resources published before 1966. Even after 1966, not every single item that may have been cataloged by the Library of Congress was necessarily assigned an ISBN by its publisher. (One obvious overlooked example---non-print resources, like music or videos! LC doesn't catalog very many of these, but any they have aren't going to have ISBNs! Other examples---foreign publishers, self-published stuff, the first few years after 66 when the ISBN adoption curve was still on the way up, etc.)

> And they put their holdings in OCLC, so all numbers that have an oclc number should contain these other identifiers.

Nope. I think you mean all items that have an LCCN should also have an OCLC number. Probably true (mostly). But all items that have an OCLC number will not necessarily have an LCCN. You say so below: "items that were not cataloged by lc" will have oclc numbers but probably not lccns. And once we get away from LC, the chances of a cataloged item (with an OCLC number) not having an ISBN go up even more (any musical CD, for instance, not usually held by LC but held by public libraries across the US).

> Items with oclc numbers that were not cataloged by lc should also have isbns. When such conditions are not met, it is a sign of a record containing unreliable information.

I do not believe this is the case. But let us admit that our cooperative cataloging corpus in fact IS not very reliable; it is full of incorrect information. But we've got to deal with it anyway. A record that is _missing_ an applicable identifier that it _could_ have contained may be reliable in other respects; I wouldn't automatically assume it is not.

Jonathan

> kyle
> -- --
> Kyle Banerjee Digital Services Program Manager Orbis Cascade Alliance [EMAIL PROTECTED] / 541.359.9599

--
Jonathan Rochkind Digital Services Software Engineer The Sheridan Libraries Johns Hopkins University 410.516.8886 rochkind (at) jhu.edu
Re: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?
> I want to be able to pull all the > records that have oclc numbers, issns, isbns, etc. I want it to be > lightweight, fast, searchable. > > Would anyone else want/use such a thing?... I like the idea, but in the long term, I just don't know how useful this will be. By and large, these identifiers are designed for dead tree resources. Although they are sometimes assigned to electronic resources, I find it hard to believe that the containers these identifiers are associated with will contain more than a tiny proportion of the information users want/need. The book structure just doesn't make nearly as much sense in an online environment. > My basic notion is to crawl the site (starting with "americana", the American > Libraries. Pull the oca unique identifier (e.g. northcarolinayea1910rale) > and > associate it with > > unique identifiers (oclc numbers, issns, isbns, lc numbers) > contributing institution's alias and unique catalog identifier > upload date > > That's all I was thinking of. Then there's what you might be able to do with > it: > > Give me all the oca unique identifiers that have oclc numbers > Give me all the oca unique identifiers with isbns that were > uploaded between x and y date > Give me the oca unique identifier for this oclc number Not sure I understand the use case (i.e. the value of retrieving another identifier). One thing to keep in mind is that although the numbering schemes are independent, they can be thought of as hierarchical. Anything that has an lccn number should already have an isbn because of the standards lc catalogs to. And they put their holdings in OCLC, so all numbers that have an oclc number should contain these other identifiers. Items with oclc numbers that were not cataloged by lc should also have isbns. When such conditions are not met, it is a sign of a record containing unreliable information. kyle -- -- Kyle Banerjee Digital Services Program Manager Orbis Cascade Alliance [EMAIL PROTECTED] / 541.359.9599
Re: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?
I see a whole lot of "not invented here" syndrome from IA, honestly. They seem to want to re-invent everything themselves, rather than try to use existing conventions. Even if they come up with something slightly better than SRU, is it worth the pain to developers who would like to implement client code, and can't use their existing SRU client code to do so?

Seems to me, and I tried to tell Brewster this when talking to him after his keynote at the conference, if the IA is serious about trying to get external developers to engage with IA stuff (which the IA folks at the conf mentioned was indeed a goal of theirs), then there are certain things the IA should put their resources into in order to facilitate and encourage this. Mainly:

1) Documenting their interfaces. Right now as far as I can tell everything is available on a "if you happen to notice it's there and then reverse engineer it yourself, and who knows if it might change and break your code" basis. I don't really have time for that.

2) When they make machine interfaces, use existing conventions and standards in use by the community of developers they want to target. [If the community of developers they want to target is not necessarily library programmers, and that community they wish to target doesn't in fact use SRU at all right now, I suppose that might be fair. I dunno].

3) Best of all, actually talk to people in this community of developers _before_ developing their stuff, to see what their needs are. "User centered development", right? You don't produce a giant piece of software without talking to those who you want to use it, and then wonder why they don't seem interested in using it.

When I bring this up, I'm generally told "Oh, all that is YOUR responsibility. If you wanted it bad enough, you'd deal with it. We just make it available, the rest is up to you." That's fine, like I said, they can prioritize their resource allocation however they want.
But they shouldn't be so surprised when they're having trouble getting external-developer-community adoption of their stuff when this is their attitude. That's what I would have said if I had been able to make the meeting last week. So maybe they're changing their approach a bit with regard to some of these things. They did meet with library developers, at least. I don't see much evidence of 1 or 2 yet though. Jonathan Eric Lease Morgan wrote: On Mar 7, 2008, at 8:22 AM, Emily Lynema wrote: Also, there was discussion about building an Open Library API (to enable some cool integration with wikipedia), and I suggested a that libraries using an API would want the search results to include information about whether the title has a digitized copy. So I would hope the service that you're envisioning is something that would be provided by an Open Library API (but we don't know when that might come about). I sat in on this discussion at the Meeting. It was driven by a consultant-type who is working for Wikipedia. His desire was to create an API that allowed people to authoritatively and consistently cite content from Wikipedia to Open Library. Ultimately, this API would allow a person to: * search Open Library via word, phrase, or key * return list of hits * select item * create "citation" * insert citation into Wikipedia article * regularly check the validity of the citation Regarding the first two items I tried to suggest the use of SRU. Regarding the last item, I tried to suggest OAI. In both cases I was shot down. "Too complicated", at the same time, they were outlining API's that had the *exact* functionality of SRU and OAI. I sort of saw his point. "Library" protocols are usually overly-complicated, yet he was totally unaware of either protocol. I also think he was suffering a bit from the Not Invented Here Syndrome. We also got into a bit of a religious war regarding the definition of REST-ful Web Services. 
In the end we talked a lot about JSON and a tiny bit about ATOM. -- Eric Lease Morgan University Libraries of Notre Dame -- Jonathan Rochkind Digital Services Software Engineer The Sheridan Libraries Johns Hopkins University 410.516.8886 rochkind (at) jhu.edu
Re: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?
On Mar 7, 2008, at 8:22 AM, Emily Lynema wrote: Also, there was discussion about building an Open Library API (to enable some cool integration with wikipedia), and I suggested a that libraries using an API would want the search results to include information about whether the title has a digitized copy. So I would hope the service that you're envisioning is something that would be provided by an Open Library API (but we don't know when that might come about). I sat in on this discussion at the Meeting. It was driven by a consultant-type who is working for Wikipedia. His desire was to create an API that allowed people to authoritatively and consistently cite content from Wikipedia to Open Library. Ultimately, this API would allow a person to: * search Open Library via word, phrase, or key * return list of hits * select item * create "citation" * insert citation into Wikipedia article * regularly check the validity of the citation Regarding the first two items I tried to suggest the use of SRU. Regarding the last item, I tried to suggest OAI. In both cases I was shot down. "Too complicated", at the same time, they were outlining API's that had the *exact* functionality of SRU and OAI. I sort of saw his point. "Library" protocols are usually overly-complicated, yet he was totally unaware of either protocol. I also think he was suffering a bit from the Not Invented Here Syndrome. We also got into a bit of a religious war regarding the definition of REST-ful Web Services. In the end we talked a lot about JSON and a tiny bit about ATOM. -- Eric Lease Morgan University Libraries of Notre Dame
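For readers who have not met the two protocols Eric suggested, here is a rough Python sketch of what the requests look like on the wire. The parameter names (operation, version, query, startRecord, maximumRecords for SRU; verb, metadataPrefix, from for OAI-PMH) come from the published specifications. The base URLs and the CQL index name are placeholders, and nothing in this thread says Open Library or IA exposes an SRU endpoint, so treat the endpoints as assumptions.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, start=1, maximum=10):
    """Build an SRU 1.1 searchRetrieve URL (search, return a list of hits)."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,          # a CQL query string
        "startRecord": start,        # 1-based offset into the result set
        "maximumRecords": maximum,   # page size
    }
    return base_url + "?" + urlencode(params)


def oai_list_records_url(base_url, from_date, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords URL (what changed since from_date)."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": from_date,           # YYYY-MM-DD
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Both base URLs are hypothetical placeholders.
    print(sru_search_url("http://example.org/sru", 'dc.title = "north carolina"'))
    print(oai_list_records_url("http://example.org/oai", "2008-03-01"))
```

The first function covers the "search and return list of hits" steps, the second the "regularly check the validity of the citation" step, which is roughly the split Eric describes.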
Re: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?
Tim, It sounds like you want to be able to search on standard identifiers and are frustrated that the Internet Archive's access doesn't allow it (although it looks like they do have an ISBN search)? And I'm curious, why would you want or need to pull down only records that have OCLC numbers or ISBNs in particular? What is it you need to do that makes only those records useful? Like Karen and Bess and others have said, I recommend that you coordinate this with the Open Library project. At the meeting last Friday, it did sound like they would be interested in providing identifier disambiguation types of service - give them an ISBN, and they'll give you the records associated with it. Also, there was discussion about building an Open Library API (to enable some cool integration with wikipedia), and I suggested that libraries using an API would want the search results to include information about whether the title has a digitized copy. So I would hope the service that you're envisioning is something that would be provided by an Open Library API (but we don't know when that might come about). As OCA moves forward, folks may well be digitizing identical books. So there may not be a one to one relationship between unique catalog identifier, unique oca identifier, and isbn/lccn/oclc number. -emily -- Date:Thu, 6 Mar 2008 08:47:04 -0500 From:Tim Shearer <[EMAIL PROTECTED]> Subject: musing on oca apiRe: [CODE4LIB] oca api? Howdy folks, I've been playing and thinking. I'd like to have what amounts to a unique identifier index to oca digitized texts. I want to be able to pull all the records that have oclc numbers, issns, isbns, etc. I want it to be lightweight, fast, searchable. Would anyone else want/use such a thing? I'm thinking about building something like this. If I do, it would be ideal if wouldn't be a duplication of effort, so anyone got this in the works? And if it would meet the needs of others.
My basic notion is to crawl the site (starting with "americana", the American Libraries. Pull the oca unique identifier (e.g. northcarolinayea1910rale) and associate it with unique identifiers (oclc numbers, issns, isbns, lc numbers) contributing institution's alias and unique catalog identifier upload date That's all I was thinking of. Then there's what you might be able to do with it: Give me all the oca unique identifiers that have oclc numbers Give me all the oca unique identifiers with isbns that were uploaded between x and y date Give me the oca unique identifier for this oclc number Planning to do: keep crawling it and keep it up to date. Things I wasn't planning to do: worry about other unique ids (you'd have to go to xISBN or ThingISBN yourself) worry about storing anything else from oca. It would be good for being able to add an 856 to matches in your catalog. It would not be good for grabbing all marc records for all of oca. Anyhow, is this duplication of effort? Would you like something like this? What else would you like it to do (keeping in mind this is an unfunded pet project)? How would you want to talk to it? I was thinking of a web service, but hadn't thought too much about how to query it or how I'd deliver results. Of course I'm being an idiot and trying out new tools at the same time (python to see what the buzz is all about, sqlite just to learn it (it may not work out)). Thoughts? Vicious criticism? -t -- Date:Thu, 6 Mar 2008 11:05:41 -0500 From:Jodi Schneider <[EMAIL PROTECTED]> Subject: Re: musing on oca apiRe: [CODE4LIB] oca api? Great idea, Tim! The open library tech list that Bess mentions is [EMAIL PROTECTED], described at http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech -Jodi Jodi Schneider Science Library Specialist Amherst College 413-542-2076 -- Date:Thu, 6 Mar 2008 08:32:43 -0800 From:Karen Coyle <[EMAIL PROTECTED]> Subject: Re: musing on oca apiRe: [CODE4LIB] oca api? 
We talked about something like this at the Open Library meeting last Friday. The ol list is [EMAIL PROTECTED] (join at http://mail.archive.org/cgi-bin/mailman/listinfo/ol-lib). I think of this as a (or one or more) translate service between IDs. It's a realization that we will never have a unique ID that everyone agrees on, that most bibliographic items are really more than one thing, but that since we have data about the bibliographic item we have many opportunities to make connections even though people have used different identifiers. So we could use an "ID-switcher" to move among data stores and services. Is that the kind of thing you are thinking of? kc -- Emily Lynema Systems Librarian for Digital Projects Information Technology, NCSU Libraries 919-513-8031 [EMAIL PROTECTED]
[CODE4LIB] oca api
For some reason, the code4lib listservs reject my [EMAIL PROTECTED] mail (undoubtedly having to do with the domain name -- any ideas welcome) so I'll try to keep track of the list from this account. Meanwhile... We talked about something like this at the Open Library meeting last Friday in a group that included Rob Styles, who has thought long and hard about identifiers. I think of this as a (or one or more) translate service between IDs, aka xISBN on steroids. It's a realization that we will never have a unique ID that everyone agrees on, that most bibliographic items are really more than one thing, but that since we have data about the bibliographic item we have many opportunities to make connections even though people have used different identifiers. So we could use an "ID-switcher" to move among data stores and services. Is that the kind of thing folks are thinking of? kc -- --- Karen Coyle / Digital Library Consultant [EMAIL PROTECTED] http://www.kcoyle.net ph.: 510-540-7596 skype: kcoylenet fx.: 510-848-3913 mo.: 510-435-8234
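A minimal sketch of what such an "ID-switcher" might look like, assuming nothing more than an in-memory table. The class name, method names, and the identifier values are all made up for illustration; they are not any existing service's API. Identifiers that have ever been seen together are merged into one group, and a lookup returns a list because, as noted elsewhere in the thread, one catalog record may correspond to several digitized copies.

```python
class IdSwitcher:
    """Toy translate service between identifier schemes (hypothetical)."""

    def __init__(self):
        # Maps (scheme, value) -> the shared set of all linked identifiers.
        self._groups = {}

    def link(self, *ids):
        """Record that all the given (scheme, value) pairs describe one item."""
        merged = set(ids)
        for ident in ids:
            merged |= self._groups.get(ident, set())
        for ident in merged:
            self._groups[ident] = merged

    def translate(self, scheme, value, target_scheme):
        """Return every known identifier in target_scheme for the given one."""
        group = self._groups.get((scheme, value), set())
        return sorted(v for s, v in group if s == target_scheme)


if __name__ == "__main__":
    switcher = IdSwitcher()
    # Identifier values below are invented for the example.
    switcher.link(("oclc", "1234567"), ("oca", "examplebook00smit"))
    switcher.link(("oca", "examplebook00smit"), ("lccn", "01-23456"))
    switcher.link(("oclc", "1234567"), ("oca", "examplebook01smit"))
    print(switcher.translate("lccn", "01-23456", "oca"))
```

Linking is transitive here, which is convenient but also means a single bad link in harvested data would merge two unrelated items; a real service would probably want to keep the provenance of each link.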
Re: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?
We talked about something like this at the Open Library meeting last Friday. The ol list is [EMAIL PROTECTED] (join at http://mail.archive.org/cgi-bin/mailman/listinfo/ol-lib). I think of this as a (or one or more) translate service between IDs. It's a realization that we will never have a unique ID that everyone agrees on, that most bibliographic items are really more than one thing, but that since we have data about the bibliographic item we have many opportunities to make connections even though people have used different identifiers. So we could use an "ID-switcher" to move among data stores and services. Is that the kind of thing you are thinking of? kc Tim Shearer wrote: Howdy folks, I've been playing and thinking. I'd like to have what amounts to a unique identifier index to oca digitized texts. I want to be able to pull all the records that have oclc numbers, issns, isbns, etc. I want it to be lightweight, fast, searchable. Would anyone else want/use such a thing? I'm thinking about building something like this. If I do, it would be ideal if wouldn't be a duplication of effort, so anyone got this in the works? And if it would meet the needs of others. My basic notion is to crawl the site (starting with "americana", the American Libraries. Pull the oca unique identifier (e.g. northcarolinayea1910rale) and associate it with unique identifiers (oclc numbers, issns, isbns, lc numbers) contributing institution's alias and unique catalog identifier upload date That's all I was thinking of. Then there's what you might be able to do with it: Give me all the oca unique identifiers that have oclc numbers Give me all the oca unique identifiers with isbns that were uploaded between x and y date Give me the oca unique identifier for this oclc number Planning to do: keep crawling it and keep it up to date. Things I wasn't planning to do: worry about other unique ids (you'd have to go to xISBN or ThingISBN yourself) worry about storing anything else from oca. 
It would be good for being able to add an 856 to matches in your catalog. It would not be good for grabbing all marc records for all of oca. Anyhow, is this duplication of effort? Would you like something like this? What else would you like it to do (keeping in mind this is an unfunded pet project)? How would you want to talk to it? I was thinking of a web service, but hadn't thought too much about how to query it or how I'd deliver results. Of course I'm being an idiot and trying out new tools at the same time (python to see what the buzz is all about, sqlite just to learn it (it may not work out)). Thoughts? Vicious criticism? -t -- --- Karen Coyle / Digital Library Consultant [EMAIL PROTECTED] http://www.kcoyle.net ph.: 510-540-7596 skype: kcoylenet fx.: 510-848-3913 mo.: 510-435-8234
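Since Tim mentions python and sqlite, here is a minimal sketch of the index he describes, using only Python's standard sqlite3 module. The table layout, function names, and the sample OCLC number and catalog id are assumptions made for illustration; only the OCA identifier northcarolinayea1910rale comes from his message. It covers the three queries he lists: all OCA ids that have a given kind of identifier, the same limited to an upload date range, and the OCA id(s) for one specific identifier.

```python
import sqlite3


def open_index(path=":memory:"):
    """Create (or open) the identifier index. Schema is a guess, not Tim's."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS item (
            oca_id      TEXT PRIMARY KEY,
            contributor TEXT,   -- contributing institution's alias
            catalog_id  TEXT,   -- that institution's catalog identifier
            upload_date TEXT    -- ISO date, so string comparison sorts correctly
        );
        CREATE TABLE IF NOT EXISTS identifier (
            oca_id TEXT NOT NULL REFERENCES item(oca_id),
            scheme TEXT NOT NULL,   -- 'oclc', 'issn', 'isbn', 'lccn'
            value  TEXT NOT NULL,
            PRIMARY KEY (oca_id, scheme, value)
        );
        CREATE INDEX IF NOT EXISTS identifier_lookup
            ON identifier (scheme, value);
    """)
    return conn


def add_item(conn, oca_id, contributor, catalog_id, upload_date, ids):
    """Store one crawled item and its (scheme, value) identifiers."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT OR REPLACE INTO item VALUES (?, ?, ?, ?)",
                     (oca_id, contributor, catalog_id, upload_date))
        conn.executemany("INSERT OR IGNORE INTO identifier VALUES (?, ?, ?)",
                         [(oca_id, s, v) for s, v in ids])


def oca_ids_with(conn, scheme, start=None, end=None):
    """All OCA ids having an identifier of this scheme, optionally by date."""
    sql = ("SELECT DISTINCT i.oca_id FROM identifier i "
           "JOIN item t ON t.oca_id = i.oca_id WHERE i.scheme = ?")
    args = [scheme]
    if start is not None:
        sql += " AND t.upload_date >= ?"
        args.append(start)
    if end is not None:
        sql += " AND t.upload_date <= ?"
        args.append(end)
    return [row[0] for row in conn.execute(sql + " ORDER BY i.oca_id", args)]


def oca_ids_for(conn, scheme, value):
    """The OCA id(s) for one specific identifier, e.g. one OCLC number."""
    rows = conn.execute(
        "SELECT oca_id FROM identifier WHERE scheme = ? AND value = ? "
        "ORDER BY oca_id", (scheme, value))
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = open_index()
    # The OCLC number and catalog id are invented for the example.
    add_item(conn, "northcarolinayea1910rale", "unc", "b0000001",
             "2008-03-01", [("oclc", "1234567")])
    print(oca_ids_for(conn, "oclc", "1234567"))
    print(oca_ids_with(conn, "oclc", "2008-01-01", "2008-12-31"))
```

Lookups return lists rather than a single id because the same book may be digitized more than once, so one OCLC number can map to several OCA identifiers.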
Re: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?
Great idea, Tim! The open library tech list that Bess mentions is [EMAIL PROTECTED], described at http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech -Jodi Jodi Schneider Science Library Specialist Amherst College 413-542-2076 >-Original Message- >From: Code for Libraries [mailto:[EMAIL PROTECTED] On >Behalf Of Tim Shearer >Sent: Thursday, March 06, 2008 8:47 AM >To: CODE4LIB@LISTSERV.ND.EDU >Subject: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api? > >Howdy folks, > >I've been playing and thinking. I'd like to have what amounts >to a unique >identifier index to oca digitized texts. I want to be able to >pull all the >records that have oclc numbers, issns, isbns, etc. I want it to be >lightweight, fast, searchable. > >Would anyone else want/use such a thing? > >I'm thinking about building something like this. > >If I do, it would be ideal if wouldn't be a duplication of >effort, so anyone >got this in the works? And if it would meet the needs of others. > >My basic notion is to crawl the site (starting with >"americana", the American >Libraries. Pull the oca unique identifier (e.g. >northcarolinayea1910rale) and >associate it with > >unique identifiers (oclc numbers, issns, isbns, lc numbers) >contributing institution's alias and unique catalog identifier >upload date > >That's all I was thinking of. Then there's what you might be >able to do with >it: > >Give me all the oca unique identifiers that have oclc numbers >Give me all the oca unique identifiers with isbns that were >uploaded between x and y date >Give me the oca unique identifier for this oclc number > >Planning to do: > >keep crawling it and keep it up to date. > >Things I wasn't planning to do: > >worry about other unique ids (you'd have to go to xISBN or >ThingISBN yourself) >worry about storing anything else from oca. > >It would be good for being able to add an 856 to matches in >your catalog. It >would not be good for grabbing all marc records for all of oca. 
> >Anyhow, is this duplication of effort? Would you like >something like this? >What else would you like it to do (keeping in mind this is an >unfunded pet >project)? How would you want to talk to it? I was thinking >of a web service, >but hadn't thought too much about how to query it or how I'd >deliver results. > >Of course I'm being an idiot and trying out new tools at the >same time (python >to see what the buzz is all about, sqlite just to learn it (it >may not work >out)). > >Thoughts? Vicious criticism? > >-t > > >On Tue, 26 Feb 2008, Chris Freeland wrote: > >> My guess is that, yes, the query interface we've been discussing here >> and the 'all sorts of interfaces that none of us knew about' are the >> same. It's not documented that I'm aware of. We've found >out about it >> by literally sitting next to IA developers and asking questions. >> >> Chris >> -Original Message- >> From: Code for Libraries [mailto:[EMAIL PROTECTED] >On Behalf Of >> Jonathan Rochkind >> Sent: Tuesday, February 26, 2008 12:18 PM >> To: CODE4LIB@LISTSERV.ND.EDU >> Subject: Re: [CODE4LIB] oca api? >> >> So in answer to my question here at the Code4Lib conference, after >> Brewster's keynote, Brewster suggests there are all sorts of >interfaces >> that none of us knew about. Or at least I didn't know about, >and haven't >> been able to figure out in months of trying! I'm going to try and >> corner him and ask for an email of who we should contact. >> >> Perhaps it's the XML interface that you guys know about >already. Is that >> documented anywhere? How the heck did you find out about it? >> >> Jonathan >> >> >>>>> Steve Toub <[EMAIL PROTECTED]> 02/25/08 9:41 PM >>> >> I'll add that when IA told me about >> http://www.archive.org/services/search.php interface to return >> XML, they asked that we not send more than 100 records at time since >> doing more would adversely >> affect production services. Which made it seem like OAI-PMH >was a better >> way to go. 
>> >> Chris, can you explain a bit more about what this means: "We >found their >> OAI interface to pull >> scanned items inconsistently based on date of scanning"? >I'm having >> trouble parsing. >> >> >> --SET >> >> >> >> >> --- Chris Fr
Re: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?
I would absolutely want and use such a thing. I don't know of anyone else doing that, although I have been thinking about it too (but don't really have time to do much with it). The approach and issues you have identified match what I've been thinking, and I don't have much to add. Are you thinking of providing an index that you'd let the rest of us search? That would be great. Although there's always an issue with sustainability there; if I have my software use your index, what happens when you leave your job and your employer stops supporting it? It might make sense to try to find a more "neutral" host site for such a thing, and try to get together a small 'committee' to support it, so if you stop working on it for whatever reason a year from now, it is more likely to continue to work. Jonathan Tim Shearer wrote: Howdy folks, I've been playing and thinking. I'd like to have what amounts to a unique identifier index to oca digitized texts. I want to be able to pull all the records that have oclc numbers, issns, isbns, etc. I want it to be lightweight, fast, searchable. Would anyone else want/use such a thing? I'm thinking about building something like this. If I do, it would be ideal if wouldn't be a duplication of effort, so anyone got this in the works? And if it would meet the needs of others. My basic notion is to crawl the site (starting with "americana", the American Libraries. Pull the oca unique identifier (e.g. northcarolinayea1910rale) and associate it with unique identifiers (oclc numbers, issns, isbns, lc numbers) contributing institution's alias and unique catalog identifier upload date That's all I was thinking of. Then there's what you might be able to do with it: Give me all the oca unique identifiers that have oclc numbers Give me all the oca unique identifiers with isbns that were uploaded between x and y date Give me the oca unique identifier for this oclc number Planning to do: keep crawling it and keep it up to date.
Things I wasn't planning to do: worry about other unique ids (you'd have to go to xISBN or ThingISBN yourself) worry about storing anything else from oca. It would be good for being able to add an 856 to matches in your catalog. It would not be good for grabbing all marc records for all of oca. Anyhow, is this duplication of effort? Would you like something like this? What else would you like it to do (keeping in mind this is an unfunded pet project)? How would you want to talk to it? I was thinking of a web service, but hadn't thought too much about how to query it or how I'd deliver results. Of course I'm being an idiot and trying out new tools at the same time (python to see what the buzz is all about, sqlite just to learn it (it may not work out)). Thoughts? Vicious criticism? -t On Tue, 26 Feb 2008, Chris Freeland wrote: My guess is that, yes, the query interface we've been discussing here and the 'all sorts of interfaces that none of us knew about' are the same. It's not documented that I'm aware of. We've found out about it by literally sitting next to IA developers and asking questions. Chris -Original Message- From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of Jonathan Rochkind Sent: Tuesday, February 26, 2008 12:18 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] oca api? So in answer to my question here at the Code4Lib conference, after Brewster's keynote, Brewster suggests there are all sorts of interfaces that none of us knew about. Or at least I didn't know about, and haven't been able to figure out in months of trying! I'm going to try and corner him and ask for an email of who we should contact. Perhaps it's the XML interface that you guys know about already. Is that documented anywhere? How the heck did you find out about it? 
Jonathan Steve Toub <[EMAIL PROTECTED]> 02/25/08 9:41 PM >>> I'll add that when IA told me about http://www.archive.org/services/search.php interface to return XML, they asked that we not send more than 100 records at time since doing more would adversely affect production services. Which made it seem like OAI-PMH was a better way to go. Chris, can you explain a bit more about what this means: "We found their OAI interface to pull scanned items inconsistently based on date of scanning"? I'm having trouble parsing. --SET --- Chris Freeland <[EMAIL PROTECTED]> wrote: Jonathan - No, I don't believe it's documented - at least not anywhere publicly. If any IA/OCA folks are lurking, here's an opportunity to make a bunch of techies happy... Chris -Original Message- From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behal
Re: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?
PS: I'd want to use it for more targeted queries too. Instead of "give me all the records that have OCLC numbers", "Give me any record that has OCLC number X", or "give me any record that has ISBN Y", or "LCCN Z". Jonathan Tim Shearer wrote: Howdy folks, I've been playing and thinking. I'd like to have what amounts to a unique identifier index to oca digitized texts. I want to be able to pull all the records that have oclc numbers, issns, isbns, etc. I want it to be lightweight, fast, searchable. Would anyone else want/use such a thing? I'm thinking about building something like this. If I do, it would be ideal if wouldn't be a duplication of effort, so anyone got this in the works? And if it would meet the needs of others. My basic notion is to crawl the site (starting with "americana", the American Libraries. Pull the oca unique identifier (e.g. northcarolinayea1910rale) and associate it with unique identifiers (oclc numbers, issns, isbns, lc numbers) contributing institution's alias and unique catalog identifier upload date That's all I was thinking of. Then there's what you might be able to do with it: Give me all the oca unique identifiers that have oclc numbers Give me all the oca unique identifiers with isbns that were uploaded between x and y date Give me the oca unique identifier for this oclc number Planning to do: keep crawling it and keep it up to date. Things I wasn't planning to do: worry about other unique ids (you'd have to go to xISBN or ThingISBN yourself) worry about storing anything else from oca. It would be good for being able to add an 856 to matches in your catalog. It would not be good for grabbing all marc records for all of oca. Anyhow, is this duplication of effort? Would you like something like this? What else would you like it to do (keeping in mind this is an unfunded pet project)? How would you want to talk to it? I was thinking of a web service, but hadn't thought too much about how to query it or how I'd deliver results.
Of course I'm being an idiot and trying out new tools at the same time (python to see what the buzz is all about, sqlite just to learn it (it may not work out)). Thoughts? Vicious criticism? -t On Tue, 26 Feb 2008, Chris Freeland wrote: My guess is that, yes, the query interface we've been discussing here and the 'all sorts of interfaces that none of us knew about' are the same. It's not documented that I'm aware of. We've found out about it by literally sitting next to IA developers and asking questions. Chris -Original Message- From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of Jonathan Rochkind Sent: Tuesday, February 26, 2008 12:18 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] oca api? So in answer to my question here at the Code4Lib conference, after Brewster's keynote, Brewster suggests there are all sorts of interfaces that none of us knew about. Or at least I didn't know about, and haven't been able to figure out in months of trying! I'm going to try and corner him and ask for an email of who we should contact. Perhaps it's the XML interface that you guys know about already. Is that documented anywhere? How the heck did you find out about it? Jonathan Steve Toub <[EMAIL PROTECTED]> 02/25/08 9:41 PM >>> I'll add that when IA told me about http://www.archive.org/services/search.php interface to return XML, they asked that we not send more than 100 records at time since doing more would adversely affect production services. Which made it seem like OAI-PMH was a better way to go. Chris, can you explain a bit more about what this means: "We found their OAI interface to pull scanned items inconsistently based on date of scanning"? I'm having trouble parsing. --SET --- Chris Freeland <[EMAIL PROTECTED]> wrote: Jonathan - No, I don't believe it's documented - at least not anywhere publicly. If any IA/OCA folks are lurking, here's an opportunity to make a bunch of techies happy... 
Chris -Original Message- From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of Jonathan Rochkind Sent: Monday, February 25, 2008 2:48 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] oca api? I hadn't known this "custom query interface" existed! This is welcome news. Is this documented anywhere? Jonathan Chris Freeland <[EMAIL PROTECTED]> 02/25/08 2:51 PM >>> Steve & Tim, I'm the tech director for the Biodiversity Heritage Library (BHL), which is a consortium of 10 natural history libraries who have partnered with Internet Archive (IA)/OCA for scanning our collections. We've just launched our revamped porta
Re: [CODE4LIB] musing on oca api Re: [CODE4LIB] oca api?
Tim,

I think this is a fantastic idea and the only suggestion I would make is to make sure you get on the Open Library developers list (I'm looking for the URL... I'll email when I find it unless someone else beats me to it) and discuss this there. (You may already have done this, I don't know.) They may be interested in hosting such a project, and of course it would be helpful to have their knowledge of the collections and APIs on call. They seem to be keen on involving developers from outside the Internet Archive's staff, and this seems like a perfect opportunity.

I would be very interested in helping you test such a service, though, and I would definitely put links into our library catalogue.

Bess

Elizabeth (Bess) Sadler
Research and Development Librarian
Digital Scholarship Services
Box 400129
Alderman Library
University of Virginia
Charlottesville, VA 22904
[EMAIL PROTECTED]
(434) 243-2305

On Mar 6, 2008, at 8:47 AM, Tim Shearer wrote:

Howdy folks,

I've been playing and thinking. I'd like to have what amounts to a unique identifier index to oca digitized texts. I want to be able to pull all the records that have oclc numbers, issns, isbns, etc. I want it to be lightweight, fast, searchable.

Would anyone else want/use such a thing?

I'm thinking about building something like this. If I do, it would be ideal if it wouldn't be a duplication of effort, so anyone got this in the works? And if it would meet the needs of others.

My basic notion is to crawl the site (starting with "americana", the American Libraries). Pull the oca unique identifier (e.g. northcarolinayea1910rale) and associate it with

unique identifiers (oclc numbers, issns, isbns, lc numbers)
contributing institution's alias and unique catalog identifier
upload date

That's all I was thinking of.

Then there's what you might be able to do with it:

Give me all the oca unique identifiers that have oclc numbers
Give me all the oca unique identifiers with isbns that were uploaded between x and y date
Give me the oca unique identifier for this oclc number

Planning to do: keep crawling it and keep it up to date.

Things I wasn't planning to do:

worry about other unique ids (you'd have to go to xISBN or ThingISBN yourself)
worry about storing anything else from oca.

It would be good for being able to add an 856 to matches in your catalog. It would not be good for grabbing all marc records for all of oca.

Anyhow, is this duplication of effort? Would you like something like this? What else would you like it to do (keeping in mind this is an unfunded pet project)? How would you want to talk to it? I was thinking of a web service, but hadn't thought too much about how to query it or how I'd deliver results.

Of course I'm being an idiot and trying out new tools at the same time (python to see what the buzz is all about, sqlite just to learn it (it may not work out)).

Thoughts? Vicious criticism?

-t
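The identifier index Tim describes could be sketched roughly like this, using Python and sqlite since those are the tools he mentions trying. This is only an illustration under assumptions: the table layout, column names, function names, contributor aliases, and the sample OCLC/ISBN values are all made up here, and nothing below reflects how any real service was built.

```python
import sqlite3

# Hypothetical schema: one row per (OCA identifier, standard identifier) pair.
SCHEMA = """
CREATE TABLE IF NOT EXISTS oca_ids (
    oca_id      TEXT NOT NULL,   -- e.g. northcarolinayea1910rale
    id_type     TEXT NOT NULL,   -- 'oclc', 'issn', 'isbn' or 'lccn'
    id_value    TEXT NOT NULL,
    contributor TEXT,            -- contributing institution's alias
    catalog_id  TEXT,            -- contributor's own catalog identifier
    uploaded    TEXT,            -- ISO date, so plain string comparison works
    PRIMARY KEY (oca_id, id_type, id_value)
);
CREATE INDEX IF NOT EXISTS idx_lookup ON oca_ids (id_type, id_value);
"""


def open_index(path=":memory:"):
    """Open (or create) the index database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_record(conn, oca_id, id_type, id_value,
               contributor=None, catalog_id=None, uploaded=None):
    """Store one identifier pairing harvested by the crawler."""
    conn.execute(
        "INSERT OR REPLACE INTO oca_ids VALUES (?, ?, ?, ?, ?, ?)",
        (oca_id, id_type, id_value, contributor, catalog_id, uploaded),
    )


def with_id_type(conn, id_type, start=None, end=None):
    """All OCA identifiers that have an id of this type, optionally by upload date."""
    sql = "SELECT DISTINCT oca_id FROM oca_ids WHERE id_type = ?"
    params = [id_type]
    if start is not None and end is not None:
        sql += " AND uploaded BETWEEN ? AND ?"
        params += [start, end]
    sql += " ORDER BY oca_id"
    return [row[0] for row in conn.execute(sql, params)]


def lookup(conn, id_type, id_value):
    """OCA identifiers for one specific OCLC number, ISBN, ISSN or LCCN."""
    rows = conn.execute(
        "SELECT oca_id FROM oca_ids WHERE id_type = ? AND id_value = ? "
        "ORDER BY oca_id",
        (id_type, id_value),
    )
    return [row[0] for row in rows]


conn = open_index()
# Sample rows; the identifier values and aliases are invented for illustration.
add_record(conn, "northcarolinayea1910rale", "oclc", "1234567",
           "unc", "b100", "2008-01-15")
add_record(conn, "annalesacademiae21univ", "isbn", "0000000000",
           "mblwhoi", "c200", "2008-02-20")
print(lookup(conn, "oclc", "1234567"))
print(with_id_type(conn, "isbn", "2008-02-01", "2008-02-29"))
```

`lookup` covers the "give me the oca unique identifier for this oclc number" case, and `with_id_type` covers the two "give me all..." cases. A web service would presumably wrap these two calls, but how results get delivered is left open, as it is in the message.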
[CODE4LIB] musing on oca api Re: [CODE4LIB] oca api?
Howdy folks,

I've been playing and thinking. I'd like to have what amounts to a unique identifier index to oca digitized texts. I want to be able to pull all the records that have oclc numbers, issns, isbns, etc. I want it to be lightweight, fast, searchable.

Would anyone else want/use such a thing?

I'm thinking about building something like this. If I do, it would be ideal if it wouldn't be a duplication of effort, so anyone got this in the works? And if it would meet the needs of others.

My basic notion is to crawl the site (starting with "americana", the American Libraries). Pull the oca unique identifier (e.g. northcarolinayea1910rale) and associate it with

unique identifiers (oclc numbers, issns, isbns, lc numbers)
contributing institution's alias and unique catalog identifier
upload date

That's all I was thinking of.

Then there's what you might be able to do with it:

Give me all the oca unique identifiers that have oclc numbers
Give me all the oca unique identifiers with isbns that were uploaded between x and y date
Give me the oca unique identifier for this oclc number

Planning to do: keep crawling it and keep it up to date.

Things I wasn't planning to do:

worry about other unique ids (you'd have to go to xISBN or ThingISBN yourself)
worry about storing anything else from oca.

It would be good for being able to add an 856 to matches in your catalog. It would not be good for grabbing all marc records for all of oca.

Anyhow, is this duplication of effort? Would you like something like this? What else would you like it to do (keeping in mind this is an unfunded pet project)? How would you want to talk to it? I was thinking of a web service, but hadn't thought too much about how to query it or how I'd deliver results.

Of course I'm being an idiot and trying out new tools at the same time (python to see what the buzz is all about, sqlite just to learn it (it may not work out)).

Thoughts? Vicious criticism?
-t On Tue, 26 Feb 2008, Chris Freeland wrote: My guess is that, yes, the query interface we've been discussing here and the 'all sorts of interfaces that none of us knew about' are the same. It's not documented that I'm aware of. We've found out about it by literally sitting next to IA developers and asking questions. Chris -Original Message- From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of Jonathan Rochkind Sent: Tuesday, February 26, 2008 12:18 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] oca api? So in answer to my question here at the Code4Lib conference, after Brewster's keynote, Brewster suggests there are all sorts of interfaces that none of us knew about. Or at least I didn't know about, and haven't been able to figure out in months of trying! I'm going to try and corner him and ask for an email of who we should contact. Perhaps it's the XML interface that you guys know about already. Is that documented anywhere? How the heck did you find out about it? Jonathan Steve Toub <[EMAIL PROTECTED]> 02/25/08 9:41 PM >>> I'll add that when IA told me about http://www.archive.org/services/search.php interface to return XML, they asked that we not send more than 100 records at time since doing more would adversely affect production services. Which made it seem like OAI-PMH was a better way to go. Chris, can you explain a bit more about what this means: "We found their OAI interface to pull scanned items inconsistently based on date of scanning"? I'm having trouble parsing. --SET --- Chris Freeland <[EMAIL PROTECTED]> wrote: Jonathan - No, I don't believe it's documented - at least not anywhere publicly. If any IA/OCA folks are lurking, here's an opportunity to make a bunch of techies happy... Chris -Original Message----- From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of Jonathan Rochkind Sent: Monday, February 25, 2008 2:48 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] oca api? 
I hadn't known this "custom query interface" existed! This is welcome news. Is this documented anywhere? Jonathan Chris Freeland <[EMAIL PROTECTED]> 02/25/08 2:51 PM >>> Steve & Tim, I'm the tech director for the Biodiversity Heritage Library (BHL), which is a consortium of 10 natural history libraries who have partnered with Internet Archive (IA)/OCA for scanning our collections. We've just launched our revamped portal, complete with more than 7,500 books & 2.8 million pages scanned by IA & other digitization partners, at: http://www.biodiversitylibrary.org To build this portal we ingest metadata from IA. We found their OAI interface to pull scanned items inconsistently based on date of scanni
Re: [CODE4LIB] oca api?
Because the IA hasn't devoted resources to documenting this stuff, I guess. If they actually want their stuff to be used by folks like us, then seems to me resources devoted to such would be resources well spent. Jonathan K.G. Schneider wrote: But why are there hurdles? Karen G. Schneider On Wed, 27 Feb 2008 07:29:57 -0600, "Chris Freeland" <[EMAIL PROTECTED]> said: Roy, do you have an answer in mind? To me & my project it's the content that is open, which is why it's worth the hurdles. Once you 'crack the nut' you can grab metadata, scans, and derivatives and ingest, parse, recombine, remix...as we've done for BHL. Access to OCA content may not be standards-based, but it works. Chris -Original Message- From: "Roy Tennant" <[EMAIL PROTECTED]> To: "CODE4LIB@LISTSERV.ND.EDU" Sent: 2/27/2008 5:28 AM Subject: Re: [CODE4LIB] oca api? So what, exactly, is "open" about this? Anyone care to guess? Roy On 2/26/08 10:29 AM, "Chris Freeland" <[EMAIL PROTECTED]> wrote: My guess is that, yes, the query interface we've been discussing here and the 'all sorts of interfaces that none of us knew about' are the same. It's not documented that I'm aware of. We've found out about it by literally sitting next to IA developers and asking questions. Chris -Original Message- From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of Jonathan Rochkind Sent: Tuesday, February 26, 2008 12:18 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] oca api? So in answer to my question here at the Code4Lib conference, after Brewster's keynote, Brewster suggests there are all sorts of interfaces that none of us knew about. Or at least I didn't know about, and haven't been able to figure out in months of trying! I'm going to try and corner him and ask for an email of who we should contact. Perhaps it's the XML interface that you guys know about already. Is that documented anywhere? How the heck did you find out about it? 
Jonathan Steve Toub <[EMAIL PROTECTED]> 02/25/08 9:41 PM >>> I'll add that when IA told me about http://www.archive.org/services/search.php interface to return XML, they asked that we not send more than 100 records at time since doing more would adversely affect production services. Which made it seem like OAI-PMH was a better way to go. Chris, can you explain a bit more about what this means: "We found their OAI interface to pull scanned items inconsistently based on date of scanning"? I'm having trouble parsing. --SET --- Chris Freeland <[EMAIL PROTECTED]> wrote: Jonathan - No, I don't believe it's documented - at least not anywhere publicly. If any IA/OCA folks are lurking, here's an opportunity to make a bunch of techies happy... Chris -Original Message- From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of Jonathan Rochkind Sent: Monday, February 25, 2008 2:48 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] oca api? I hadn't known this "custom query interface" existed! This is welcome news. Is this documented anywhere? Jonathan Chris Freeland <[EMAIL PROTECTED]> 02/25/08 2:51 PM >>> Steve & Tim, I'm the tech director for the Biodiversity Heritage Library (BHL), which is a consortium of 10 natural history libraries who have partnered with Internet Archive (IA)/OCA for scanning our collections. We've just launched our revamped portal, complete with more than 7,500 books & 2.8 million pages scanned by IA & other digitization partners, at: http://www.biodiversitylibrary.org To build this portal we ingest metadata from IA. We found their OAI interface to pull scanned items inconsistently based on date of scanning, so we switched to using their custom query interface. 
Here's an example of a query we fire off:

http://www.archive.org/services/search.php?query=collection:(biodiversity)+AND+updatedate:%5b2007-10-31+TO+2007-11-30%5d+AND+-contributor:(MBLWHOI%20Library)&limit=10&submit=submit

This is returning scanned items from the "biodiversity" collection, updated between 10/31/2007 - 11/30/2007, restricted to one of our contributing libraries (MBLWHOI Library), and limited to 10 results.

The results are styled in the browser; view source to see the good stuff. We use this list to grab the identifiers we've yet to ingest.

Some background: When a book is scanned through IA/OCA scanning, they create their own unique identifier (like "annalesacademiae21univ") and grab a MARC record from the contributing library's catalog. All of the scanned files, derivatives, and metadata files are stored on IA's clusters in a directory named with the identifier.

Steve mentioned using
Re: [CODE4LIB] oca api?
I see it as open in the way that google books is not. But, a huge part of being open is the provision of *access* and so having easy, documented APIs (in the way that google often does) would make this a whole lot easier to leverage.

Still, it's a "good thing" and I'm pleased to have the opportunity to be frustrated!

-t

On Wed, 27 Feb 2008, Sebastian Hammer wrote:

I concur. The content is open; and the OCA's use of MARC is open... I think they're waiting for the community to chip in the means and mechanisms to support whatever open APIs or protocols are deemed useful. We built a free Z39.50/SRU service based on a crawl through their text collection, incorporating MARC data where available.. it'd be great to see other organizations contribute funding and/or sweat to build additional services and tools. (our stuff is at http://indexdata.com/opencontent/)

--Sebastian

Chris Freeland wrote:

Roy, do you have an answer in mind?

To me & my project it's the content that is open, which is why it's worth the hurdles. Once you 'crack the nut' you can grab metadata, scans, and derivatives and ingest, parse, recombine, remix...as we've done for BHL.

Access to OCA content may not be standards-based, but it works.

Chris

-----Original Message-----
From: "Roy Tennant" <[EMAIL PROTECTED]>
To: "CODE4LIB@LISTSERV.ND.EDU"
Sent: 2/27/2008 5:28 AM
Subject: Re: [CODE4LIB] oca api?

So what, exactly, is "open" about this? Anyone care to guess?
Roy

On 2/26/08 10:29 AM, "Chris Freeland" <[EMAIL PROTECTED]> wrote:

My guess is that, yes, the query interface we've been discussing here and the 'all sorts of interfaces that none of us knew about' are the same. It's not documented that I'm aware of. We've found out about it by literally sitting next to IA developers and asking questions.

Chris

-----Original Message-----
From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of Jonathan Rochkind
Sent: Tuesday, February 26, 2008 12:18 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] oca api?

So in answer to my question here at the Code4Lib conference, after Brewster's keynote, Brewster suggests there are all sorts of interfaces that none of us knew about. Or at least I didn't know about, and haven't been able to figure out in months of trying! I'm going to try and corner him and ask for an email of who we should contact. Perhaps it's the XML interface that you guys know about already. Is that documented anywhere? How the heck did you find out about it? Jonathan Steve Toub <[EMAIL PROTECTED]> 02/25/08 9:41 PM >>> I'll add that when IA told me about http://www.archive.org/services/search.php interface to return XML, they asked that we not send more than 100 records at time since doing more would adversely affect production services. Which made it seem like OAI-PMH was a better way to go. Chris, can you explain a bit more about what this means: "We found their OAI interface to pull scanned items inconsistently based on date of scanning"? I'm having trouble parsing. --SET --- Chris Freeland <[EMAIL PROTECTED]> wrote: Jonathan - No, I don't believe it's documented - at least not anywhere publicly. If any IA/OCA folks are lurking, here's an opportunity to make a bunch of techies happy... Chris -Original Message- From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of Jonathan Rochkind Sent: Monday, February 25, 2008 2:48 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] oca api? I hadn't known this "custom query interface" existed! This is welcome news. Is this documented anywhere? Jonathan Chris Freeland <[EMAIL PROTECTED]> 02/25/08 2:51 PM >>> Steve & Tim, I'm the tech director for the Biodiversity Heritage Library (BHL), which is a consortium of 10 natural history libraries who have partnered with Internet Archive (IA)/OCA for scanning our collections. 
We've just launched our revamped portal, complete with more than 7,500 books & 2.8 million pages scanned by IA & other digitization partners, at: http://www.biodiversitylibrary.org

To build this portal we ingest metadata from IA. We found their OAI interface to pull scanned items inconsistently based on date of scanning, so we switched to using their custom query interface. Here's an example of a query we fire off:

http://www.archive.org/services/search.php?query=collection:(biodiversity)+AND+updatedate:%5b2007-10-31+TO+2007-11-30%5d+AND+-contributor:(MBLWHOI%20Library)&limit=10&submit=submit

This is returning scanned items from the "biodiversity" collection, updated between 10/31/2007 - 11/30/2007, restricted to one of our contributing libraries (MBLWHOI Library)
Re: [CODE4LIB] oca api?
I concur. The content is open; and the OCA's use of MARC is open... I think they're waiting for the community to chip in the means and mechanisms to support whatever open APIs or protocols are deemed useful. We built a free Z39.50/SRU service based on a crawl through their text collection, incorporating MARC data where available.. it'd be great to see other organizations contribute funding and/or sweat to build additional services and tools. (our stuff is at http://indexdata.com/opencontent/) --Sebastian Chris Freeland wrote: Roy, do you have an answer in mind? To me & my project it's the content that is open, which is why it's worth the hurdles. Once you 'crack the nut' you can grab metadata, scans, and derivatives and ingest, parse, recombine, remix...as we've done for BHL. Access to OCA content may not be standards-based, but it works. Chris -Original Message- From: "Roy Tennant" <[EMAIL PROTECTED]> To: "CODE4LIB@LISTSERV.ND.EDU" Sent: 2/27/2008 5:28 AM Subject: Re: [CODE4LIB] oca api? So what, exactly, is "open" about this? Anyone care to guess? Roy On 2/26/08 10:29 AM, "Chris Freeland" <[EMAIL PROTECTED]> wrote: My guess is that, yes, the query interface we've been discussing here and the 'all sorts of interfaces that none of us knew about' are the same. It's not documented that I'm aware of. We've found out about it by literally sitting next to IA developers and asking questions. Chris -Original Message- From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of Jonathan Rochkind Sent: Tuesday, February 26, 2008 12:18 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] oca api? So in answer to my question here at the Code4Lib conference, after Brewster's keynote, Brewster suggests there are all sorts of interfaces that none of us knew about. Or at least I didn't know about, and haven't been able to figure out in months of trying! I'm going to try and corner him and ask for an email of who we should contact. 
Perhaps it's the XML interface that you guys know about already. Is that documented anywhere? How the heck did you find out about it? Jonathan Steve Toub <[EMAIL PROTECTED]> 02/25/08 9:41 PM >>> I'll add that when IA told me about http://www.archive.org/services/search.php interface to return XML, they asked that we not send more than 100 records at time since doing more would adversely affect production services. Which made it seem like OAI-PMH was a better way to go. Chris, can you explain a bit more about what this means: "We found their OAI interface to pull scanned items inconsistently based on date of scanning"? I'm having trouble parsing. --SET --- Chris Freeland <[EMAIL PROTECTED]> wrote: Jonathan - No, I don't believe it's documented - at least not anywhere publicly. If any IA/OCA folks are lurking, here's an opportunity to make a bunch of techies happy... Chris -----Original Message- From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of Jonathan Rochkind Sent: Monday, February 25, 2008 2:48 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] oca api? I hadn't known this "custom query interface" existed! This is welcome news. Is this documented anywhere? Jonathan Chris Freeland <[EMAIL PROTECTED]> 02/25/08 2:51 PM >>> Steve & Tim, I'm the tech director for the Biodiversity Heritage Library (BHL), which is a consortium of 10 natural history libraries who have partnered with Internet Archive (IA)/OCA for scanning our collections. We've just launched our revamped portal, complete with more than 7,500 books & 2.8 million pages scanned by IA & other digitization partners, at: http://www.biodiversitylibrary.org To build this portal we ingest metadata from IA. We found their OAI interface to pull scanned items inconsistently based on date of scanning, so we switched to using their custom query interface. 
Here's an example of a query we fire off:

http://www.archive.org/services/search.php?query=collection:(biodiversity)+AND+updatedate:%5b2007-10-31+TO+2007-11-30%5d+AND+-contributor:(MBLWHOI%20Library)&limit=10&submit=submit

This is returning scanned items from the "biodiversity" collection, updated between 10/31/2007 - 11/30/2007, restricted to one of our contributing libraries (MBLWHOI Library), and limited to 10 results.

The results are styled in the browser; view source to see the good stuff. We use this list to grab the identifiers we've yet to ingest.

Some background: When a book is scanned through IA/OCA scanning, they create their own unique identifier (like "annalesacademiae21univ") and grab a MARC record from the contributing library's catalog.
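The example query above can also be put together programmatically. Here is a rough sketch in Python using only the standard urllib.parse module. The function name and its arguments are invented for illustration, and the parameter names (query, limit, submit) and field names (collection, updatedate, contributor) are taken solely from Chris's example URL, since the interface is undocumented. One caveat: if search.php follows Lucene-style syntax, the leading minus on "-contributor" would normally exclude that contributor, which seems at odds with "restricted to" in the description. The thread doesn't settle which was meant, so the sketch makes it an explicit flag. It only builds the URL and does not fetch anything.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint as quoted in the thread; everything else here is an assumption.
SEARCH_URL = "http://www.archive.org/services/search.php"


def build_search_url(collection, start, end,
                     contributor=None, exclude=False, limit=10):
    """Build a search.php URL using the same fields as the example query."""
    if limit > 100:
        # Steve reports IA asked for no more than 100 records per request.
        raise ValueError("limit should stay at or below 100")
    clauses = [
        f"collection:({collection})",
        f"updatedate:[{start} TO {end}]",
    ]
    if contributor:
        # exclude=True reproduces the "-contributor:(...)" form in the example.
        prefix = "-" if exclude else ""
        clauses.append(f"{prefix}contributor:({contributor})")
    params = {
        "query": " AND ".join(clauses),
        "limit": limit,
        "submit": "submit",
    }
    # urlencode percent-encodes more characters than the hand-written URL,
    # but the decoded query string is the same.
    return SEARCH_URL + "?" + urlencode(params)


url = build_search_url("biodiversity", "2007-10-31", "2007-11-30",
                       contributor="MBLWHOI Library", exclude=True)
print(url)
# Decode it again to check what the server would actually receive.
print(parse_qs(urlsplit(url).query)["query"][0])
```

The limit check is there because of the request, mentioned earlier in the thread, to keep each call to 100 records or fewer. Harvesting more than that would presumably mean issuing several calls over narrower date ranges.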
Re: [CODE4LIB] oca api?
But why are there hurdles? Karen G. Schneider On Wed, 27 Feb 2008 07:29:57 -0600, "Chris Freeland" <[EMAIL PROTECTED]> said: > Roy, do you have an answer in mind? > > To me & my project it's the content that is open, which is why it's worth > the hurdles. Once you 'crack the nut' you can grab metadata, scans, and > derivatives and ingest, parse, recombine, remix...as we've done for BHL. > > Access to OCA content may not be standards-based, but it works. > > Chris > > -Original Message- > From: "Roy Tennant" <[EMAIL PROTECTED]> > To: "CODE4LIB@LISTSERV.ND.EDU" > Sent: 2/27/2008 5:28 AM > Subject: Re: [CODE4LIB] oca api? > > So what, exactly, is "open" about this? Anyone care to guess? > Roy > > > On 2/26/08 10:29 AM, "Chris Freeland" <[EMAIL PROTECTED]> wrote: > > > My guess is that, yes, the query interface we've been discussing here > > and the 'all sorts of interfaces that none of us knew about' are the > > same. It's not documented that I'm aware of. We've found out about it > > by literally sitting next to IA developers and asking questions. > > > > Chris > > -Original Message- > > From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of > > Jonathan Rochkind > > Sent: Tuesday, February 26, 2008 12:18 PM > > To: CODE4LIB@LISTSERV.ND.EDU > > Subject: Re: [CODE4LIB] oca api? > > > > So in answer to my question here at the Code4Lib conference, after > > Brewster's keynote, Brewster suggests there are all sorts of interfaces > > that none of us knew about. Or at least I didn't know about, and haven't > > been able to figure out in months of trying! I'm going to try and > > corner him and ask for an email of who we should contact. > > > > Perhaps it's the XML interface that you guys know about already. Is that > > documented anywhere? How the heck did you find out about it? 
> > > > Jonathan > > > > > >>>> Steve Toub <[EMAIL PROTECTED]> 02/25/08 9:41 PM >>> > > I'll add that when IA told me about > > http://www.archive.org/services/search.php interface to return > > XML, they asked that we not send more than 100 records at time since > > doing more would adversely > > affect production services. Which made it seem like OAI-PMH was a better > > way to go. > > > > Chris, can you explain a bit more about what this means: "We found their > > OAI interface to pull > > scanned items inconsistently based on date of scanning"? I'm having > > trouble parsing. > > > > > >--SET > > > > > > > > > > --- Chris Freeland <[EMAIL PROTECTED]> wrote: > > > >> Jonathan - No, I don't believe it's documented - at least not anywhere > >> publicly. If any IA/OCA folks are lurking, here's an opportunity to > >> make a bunch of techies happy... > >> > >> Chris > >> > >> -Original Message- > >> From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf > > Of > >> Jonathan Rochkind > >> Sent: Monday, February 25, 2008 2:48 PM > >> To: CODE4LIB@LISTSERV.ND.EDU > >> Subject: Re: [CODE4LIB] oca api? > >> > >> I hadn't known this "custom query interface" existed! This is welcome > >> news. Is this documented anywhere? > >> > >> Jonathan > >> > >> > >>>>> Chris Freeland <[EMAIL PROTECTED]> 02/25/08 2:51 PM >>> > >> Steve & Tim, > >> > >> I'm the tech director for the Biodiversity Heritage Library (BHL), > > which > >> is a consortium of 10 natural history libraries who have partnered > > with > >> Internet Archive (IA)/OCA for scanning our collections. We've just > >> launched our revamped portal, complete with more than 7,500 books & > > 2.8 > >> million pages scanned by IA & other digitization partners, at: > >> http://www.biodiversitylibrary.org > >> > >> To build this portal we ingest metadata from IA. We found their OAI > >> interface to pull scanned items inconsistently based on date of > >> scanning, so we switched to using their custom query interface. 
> > Here's > >> an example of a query we fire off: > >> > >> > > http://www.archive.org/services/search.php?query=collection:(biodiversit > >> > > y)+AND+upda
Re: [CODE4LIB] oca api?
Roy, do you have an answer in mind? To me & my project it's the content that is open, which is why it's worth the hurdles. Once you 'crack the nut' you can grab metadata, scans, and derivatives and ingest, parse, recombine, remix...as we've done for BHL. Access to OCA content may not be standards-based, but it works. Chris -Original Message- From: "Roy Tennant" <[EMAIL PROTECTED]> To: "CODE4LIB@LISTSERV.ND.EDU" Sent: 2/27/2008 5:28 AM Subject: Re: [CODE4LIB] oca api? So what, exactly, is "open" about this? Anyone care to guess? Roy On 2/26/08 10:29 AM, "Chris Freeland" <[EMAIL PROTECTED]> wrote: > My guess is that, yes, the query interface we've been discussing here > and the 'all sorts of interfaces that none of us knew about' are the > same. It's not documented that I'm aware of. We've found out about it > by literally sitting next to IA developers and asking questions. > > Chris > -Original Message- > From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of > Jonathan Rochkind > Sent: Tuesday, February 26, 2008 12:18 PM > To: CODE4LIB@LISTSERV.ND.EDU > Subject: Re: [CODE4LIB] oca api? > > So in answer to my question here at the Code4Lib conference, after > Brewster's keynote, Brewster suggests there are all sorts of interfaces > that none of us knew about. Or at least I didn't know about, and haven't > been able to figure out in months of trying! I'm going to try and > corner him and ask for an email of who we should contact. > > Perhaps it's the XML interface that you guys know about already. Is that > documented anywhere? How the heck did you find out about it? > > Jonathan > > >>>> Steve Toub <[EMAIL PROTECTED]> 02/25/08 9:41 PM >>> > I'll add that when IA told me about > http://www.archive.org/services/search.php interface to return > XML, they asked that we not send more than 100 records at time since > doing more would adversely > affect production services. Which made it seem like OAI-PMH was a better > way to go. 
> > Chris, can you explain a bit more about what this means: "We found their > OAI interface to pull > scanned items inconsistently based on date of scanning"? I'm having > trouble parsing. > > >--SET > > > > > --- Chris Freeland <[EMAIL PROTECTED]> wrote: > >> Jonathan - No, I don't believe it's documented - at least not anywhere >> publicly. If any IA/OCA folks are lurking, here's an opportunity to >> make a bunch of techies happy... >> >> Chris >> >> -Original Message- >> From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf > Of >> Jonathan Rochkind >> Sent: Monday, February 25, 2008 2:48 PM >> To: CODE4LIB@LISTSERV.ND.EDU >> Subject: Re: [CODE4LIB] oca api? >> >> I hadn't known this "custom query interface" existed! This is welcome >> news. Is this documented anywhere? >> >> Jonathan >> >> >>>>> Chris Freeland <[EMAIL PROTECTED]> 02/25/08 2:51 PM >>> >> Steve & Tim, >> >> I'm the tech director for the Biodiversity Heritage Library (BHL), > which >> is a consortium of 10 natural history libraries who have partnered > with >> Internet Archive (IA)/OCA for scanning our collections. We've just >> launched our revamped portal, complete with more than 7,500 books & > 2.8 >> million pages scanned by IA & other digitization partners, at: >> http://www.biodiversitylibrary.org >> >> To build this portal we ingest metadata from IA. We found their OAI >> interface to pull scanned items inconsistently based on date of >> scanning, so we switched to using their custom query interface. > Here's >> an example of a query we fire off: >> >> > http://www.archive.org/services/search.php?query=collection:(biodiversit >> > y)+AND+updatedate:%5b2007-10-31+TO+2007-11-30%5d+AND+-contributor:(MBLWH >> OI%20Library)&limit=10&submit=submit >> >> This is returning scanned items from the "biodiversity" collection, >> updated between 10/31/2007 - 11/30/2007, restricted to one of our >> contributing libraries (MBLWHOI Library), and limited to 10 results. 
>> >> The results are styled in the browser; view source to see the good >> stuff. We use this list to grab the identifiers we've yet to ingest. >> >> Some background: When a book is scanned through IA/OCA scanning, they >> create their own unique identifier (like "a
Re: [CODE4LIB] oca api?
So what, exactly, is "open" about this? Anyone care to guess?

Roy

On 2/26/08 10:29 AM, "Chris Freeland" <[EMAIL PROTECTED]> wrote:

> My guess is that, yes, the query interface we've been discussing here
> and the 'all sorts of interfaces that none of us knew about' are the
> same. It's not documented that I'm aware of. We've found out about it
> by literally sitting next to IA developers and asking questions.
>
> Chris
Re: [CODE4LIB] oca api?
It is the same interface Chris described. I had emailed with Brewster directly to learn about it. In that email exchange I got the sense that OAI-PMH was better. And my questions about a staging instance went unanswered.

But in standing in here when Jonathan cornered Brewster, I got the sense he prefers the query interface. He didn't set concrete guidance about how many queries is too much but he was conscious of performance.

--SET

--- Chris Freeland <[EMAIL PROTECTED]> wrote:

> My guess is that, yes, the query interface we've been discussing here
> and the 'all sorts of interfaces that none of us knew about' are the
> same. It's not documented that I'm aware of. We've found out about it
> by literally sitting next to IA developers and asking questions.
>
> Chris
Re: [CODE4LIB] oca api?
My guess is that, yes, the query interface we've been discussing here and the 'all sorts of interfaces that none of us knew about' are the same. It's not documented that I'm aware of. We've found out about it by literally sitting next to IA developers and asking questions.

Chris

-Original Message-
From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of Jonathan Rochkind
Sent: Tuesday, February 26, 2008 12:18 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] oca api?

So in answer to my question here at the Code4Lib conference, after Brewster's keynote, Brewster suggests there are all sorts of interfaces that none of us knew about. Or at least I didn't know about, and haven't been able to figure out in months of trying! I'm going to try and corner him and ask for an email of who we should contact.

Perhaps it's the XML interface that you guys know about already. Is that documented anywhere? How the heck did you find out about it?

Jonathan
Re: [CODE4LIB] oca api?
So in answer to my question here at the Code4Lib conference, after Brewster's keynote, Brewster suggests there are all sorts of interfaces that none of us knew about. Or at least I didn't know about, and haven't been able to figure out in months of trying! I'm going to try and corner him and ask for an email of who we should contact.

Perhaps it's the XML interface that you guys know about already. Is that documented anywhere? How the heck did you find out about it?

Jonathan

>>> Steve Toub <[EMAIL PROTECTED]> 02/25/08 9:41 PM >>>

I'll add that when IA told me about http://www.archive.org/services/search.php interface to return XML, they asked that we not send more than 100 records at time since doing more would adversely affect production services. Which made it seem like OAI-PMH was a better way to go.

Chris, can you explain a bit more about what this means: "We found their OAI interface to pull scanned items inconsistently based on date of scanning"? I'm having trouble parsing.

--SET
Re: [CODE4LIB] oca api?
On Feb 26, 2008, at 12:21 PM, Chris Freeland wrote:

> The biggest problem we found with the OAI implementation had to do
> with pulling incremental updates. If you ask for a date range like
> Dec 1 - 5 you get all of Dec. When we discussed this with IA we were
> shown the query interface and just decided to use that instead since
> we're doing mostly incremental updates.

Incidentally, I was asked a few months ago about incorporating Open Library and/or Internet Archive material into a service I (barely) maintain called Ockham Alert. I told them I would be happy to do so, but since Ockham Alert relies on OAI date ranges, and their date ranges did not work, I was unable to oblige them. I suppose the date issue with their OAI implementation is a known issue.

--
Eric Lease Morgan
University Libraries of Notre Dame

(574) 631-8604
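
For anyone scripting the incremental pull Chris describes, here is a minimal sketch of building a query-interface URL for a date window. It assumes the search.php endpoint and the collection / updatedate fields that appear in the example query elsewhere in this thread; the interface is undocumented, so treat those names as observed rather than guaranteed. The function name and the ceiling of 100 (taken from the request IA reportedly made about record counts) are illustrative, not anything IA publishes.

```python
from datetime import date
from urllib.parse import urlencode

# Endpoint as quoted in this thread; undocumented, so it may change.
SEARCH_URL = "http://www.archive.org/services/search.php"

# Assumption: stay at or below the 100-record ceiling mentioned on the list.
MAX_LIMIT = 100


def build_update_query(collection, start, end, limit=10):
    """Return a search URL for items in `collection` updated between two dates.

    Hypothetical helper: it only builds the URL and does not contact IA.
    """
    if start > end:
        raise ValueError("start date must not be after end date")
    if not 1 <= limit <= MAX_LIMIT:
        raise ValueError(f"limit must be between 1 and {MAX_LIMIT}")

    # Lucene-style range query, mirroring the example in the thread.
    query = (
        f"collection:({collection}) AND "
        f"updatedate:[{start.isoformat()} TO {end.isoformat()}]"
    )
    # urlencode percent-encodes the brackets and turns spaces into '+'.
    params = {"query": query, "limit": limit, "submit": "submit"}
    return SEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_update_query("biodiversity", date(2007, 10, 31), date(2007, 11, 30)))
```

Keeping the window small and the limit low means each harvest run asks only for what changed since the last one, which is the reason given above for preferring this interface to OAI date ranges.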
Re: [CODE4LIB] oca api?
Steve - I'm not sure about the scalability of the query interface, so hopefully someone from IA can comment.

The biggest problem we found with the OAI implementation had to do with pulling incremental updates. If you ask for a date range like Dec 1 - 5 you get all of Dec. When we discussed this with IA we were shown the query interface and just decided to use that instead since we're doing mostly incremental updates. The date inconsistency might not be enough to drive folks away from OAI if you're looking to do one-time, or infrequent, harvests.

Chris

-Original Message-
From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of Steve Toub
Sent: Monday, February 25, 2008 8:41 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] oca api?

I'll add that when IA told me about http://www.archive.org/services/search.php interface to return XML, they asked that we not send more than 100 records at time since doing more would adversely affect production services. Which made it seem like OAI-PMH was a better way to go.

Chris, can you explain a bit more about what this means: "We found their OAI interface to pull scanned items inconsistently based on date of scanning"? I'm having trouble parsing.

--SET
Re: [CODE4LIB] oca api?
I'll add that when IA told me about http://www.archive.org/services/search.php interface to return XML, they asked that we not send more than 100 records at time since doing more would adversely affect production services. Which made it seem like OAI-PMH was a better way to go.

Chris, can you explain a bit more about what this means: "We found their OAI interface to pull scanned items inconsistently based on date of scanning"? I'm having trouble parsing.

--SET

--- Chris Freeland <[EMAIL PROTECTED]> wrote:

> Jonathan - No, I don't believe it's documented - at least not anywhere
> publicly. If any IA/OCA folks are lurking, here's an opportunity to
> make a bunch of techies happy...
>
> Chris
>
> -Original Message-
> From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
> Jonathan Rochkind
> Sent: Monday, February 25, 2008 2:48 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] oca api?
>
> I hadn't known this "custom query interface" existed! This is welcome
> news. Is this documented anywhere?
>
> Jonathan
Re: [CODE4LIB] oca api?
Yup, Chris' email was exactly what I was hoping for.

Now if there were a nice way to pre-screen for records that don't have empty (isbn|issn|oclc#) without all the work of looking per record (and the overhead for the server, and the overhead if more than one organization starts to do this). I guess I want to search for uniqueID != NULL and only get their unique id back, and script from there.

Still and all, this now seems a very doable thing. Chris, many thanks!

-t

On Mon, 25 Feb 2008, Tennant,Roy wrote:

Well, from where Chris left off it would be fairly easy to check for a file in the directory with an "marc.xml" filename extension, then XSLT for: 39004822 If such exists, and then you'll have the ISBN. To sweeten it further, send that into xISBN or ThingISBN and get other ISBNs for the same work.

This seems completely scriptable to me. Perhaps someone at c4l will have it done before the conference is over. And Tim, the example above is one that's in your catalog.

Roy

-Original Message-
From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of Chris Freeland
Sent: Monday, February 25, 2008 11:51 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] oca api?

Steve & Tim,

I'm the tech director for the Biodiversity Heritage Library (BHL), which is a consortium of 10 natural history libraries who have partnered with Internet Archive (IA)/OCA for scanning our collections. We've just launched our revamped portal, complete with more than 7,500 books & 2.8 million pages scanned by IA & other digitization partners, at:
http://www.biodiversitylibrary.org

To build this portal we ingest metadata from IA. We found their OAI interface to pull scanned items inconsistently based on date of scanning, so we switched to using their custom query interface. Here's an example of a query we fire off:

http://www.archive.org/services/search.php?query=collection:(biodiversity)+AND+updatedate:%5b2007-10-31+TO+2007-11-30%5d+AND+-contributor:(MBLWHOI%20Library)&limit=10&submit=submit

This is returning scanned items from the "biodiversity" collection, updated between 10/31/2007 - 11/30/2007, restricted to one of our contributing libraries (MBLWHOI Library), and limited to 10 results.

The results are styled in the browser; view source to see the good stuff. We use this list to grab the identifiers we've yet to ingest.

Some background: When a book is scanned through IA/OCA scanning, they create their own unique identifier (like "annalesacademiae21univ") and grab a MARC record from the contributing library's catalog. All of the scanned files, derivatives, and metadata files are stored on IA's clusters in a directory named with the identifier.

Steve mentioned using their /details/ directive, then sniffing the page to get the cluster location and the files for downloading. An easier method is to use their /download/ directive, as in:

http://www.archive.org/download/ID$, or in the example above:
http://www.archive.org/download/annalesacademiae21univ

That automatically does a lookup on the cluster, which means you don't have to scrape info off pages. You can also address any files within that directory, as in:
http://www.archive.org/download/annalesacademiae21univ/annalesacademiae21univ_marc.xml

The only way to get standard identifiers (ISBN, ISSN, OCLC, LCCN) for these scanned books is to grab them out of the MARC record. So the long-winded answer to your question, Tim, is no, there's no simple way to crossref what IA has scanned with your catalog - THAT I KNOW OF. Big caveat on that last part.

Happy to help with any other questions I can,

Chris Freeland

-Original Message-
From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of Steve Toub
Sent: Sunday, February 24, 2008 11:20 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] oca api?

--- Tim Shearer <[EMAIL PROTECTED]> wrote:

> Hi Folks,
>
> I'm looking into tapping the texts in the Open Content Alliance.
>
> A few questions...
>
> As near as I can tell, they don't expose (perhaps even store?) any
> common unique identifiers (oclc number, issn, isbn, loc number).

I poked around in this world a few months ago in my previous job at California Digital Library, also an OCA partner.

The unique key seems to be text string identifier (one that seems to be completely different from the text string identifier in Open Library). Apparently there was talk at the last partner meeting about moving to ISBNs:
http://dilettantes.code4lib.org/2007/10/22/tales-from-the-open-content-alliance/

To obtain identifiers in bulk, I think the recommended approach is the OAI-PMH interface, which seems more reliable in recent months:
http://www.archive.org/services/oai.php?verb=Identify
http://www.archive.org/services/oai.php?verb=ListIdentifiers&metadataPrefix=oai_dc&set=collection:cdl
etc.
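
The per-record lookup discussed above (build the /download/ URL for an item's MARC file, then pull standard identifiers out of the record) can be sketched roughly as follows. This is a sketch under assumptions, not IA's or anyone's actual script: it assumes the _marc.xml file is MARCXML, and that identifiers sit in the usual MARC 21 fields (010 for LCCN, 020 for ISBN, 022 for ISSN, and 035 with an "(OCoLC)" prefix for OCLC numbers). Records that carry the OCLC number somewhere else, such as field 001, would be missed. The function names are hypothetical and the sample record is made up.

```python
import xml.etree.ElementTree as ET

# MARC 21 tags whose $a subfield commonly holds a standard identifier.
ID_FIELDS = {"010": "lccn", "020": "isbn", "022": "issn", "035": "oclc"}

# Assumption: OCLC numbers in 035 $a carry this conventional prefix.
OCLC_PREFIX = "(OCoLC)"

# Made-up record used only to demonstrate the parser.
SAMPLE_RECORD = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="010" ind1=" " ind2=" "><subfield code="a">   00000000 </subfield></datafield>
  <datafield tag="035" ind1=" " ind2=" "><subfield code="a">(OCoLC)1234567</subfield></datafield>
  <datafield tag="035" ind1=" " ind2=" "><subfield code="a">(XX)local-99</subfield></datafield>
  <datafield tag="245" ind1="0" ind2="0"><subfield code="a">Made-up title</subfield></datafield>
</record>"""


def marc_xml_url(identifier):
    """URL of an item's MARC XML file, following the /download/ pattern in the thread."""
    return f"http://www.archive.org/download/{identifier}/{identifier}_marc.xml"


def _local(tag):
    """Strip any '{namespace}' prefix so namespaced and bare records both work."""
    return tag.rsplit("}", 1)[-1]


def extract_identifiers(marc_xml):
    """Return a dict of identifier lists found in one MARCXML document."""
    found = {name: [] for name in ID_FIELDS.values()}
    root = ET.fromstring(marc_xml)
    for field in root.iter():
        if _local(field.tag) != "datafield":
            continue
        name = ID_FIELDS.get(field.get("tag"))
        if name is None:
            continue
        for sub in field:
            if _local(sub.tag) != "subfield" or sub.get("code") != "a":
                continue
            value = (sub.text or "").strip()
            if name == "oclc":
                # Keep only 035 values that are marked as OCLC numbers.
                if not value.startswith(OCLC_PREFIX):
                    continue
                value = value[len(OCLC_PREFIX):].strip()
            if value:
                found[name].append(value)
    return found


if __name__ == "__main__":
    print(marc_xml_url("annalesacademiae21univ"))
    print(extract_identifiers(SAMPLE_RECORD))
```

Because this still means one request per item, it does not address the pre-screening wish above; it only shows that the per-record step is small once the identifier is known.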
Re: [CODE4LIB] oca api?
Well, from where Chris left off it would be fairly easy to check for a file in the directory with a "marc.xml" filename extension, then XSLT for: 39004822 If such exists, then you'll have the ISBN. To sweeten it further, send that into xISBN or ThingISBN and get other ISBNs for the same work.

This seems completely scriptable to me. Perhaps someone at c4l will have it done before the conference is over. And Tim, the example above is one that's in your catalog.

Roy

-----Original Message-----
From: Code for Libraries [mailto:[EMAIL PROTECTED]] On Behalf Of Chris Freeland
Sent: Monday, February 25, 2008 11:51 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] oca api?

You can also address any files within that directory, as in:
http://www.archive.org/download/annalesacademiae21univ/annalesacademiae21univ_marc.xml

The only way to get standard identifiers (ISBN, ISSN, OCLC, LCCN) for these scanned books is to grab them out of the MARC record.
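The MARC step described above can be sketched roughly as follows. This uses Python in place of XSLT, and it assumes the `_marc.xml` file is MARCXML with the standard MARC fields: 010$a (LCCN), 020$a (ISBN), 022$a (ISSN), and 035$a (system control numbers, where OCLC numbers are often recorded). The function name, the result labels, and the sample record are all hypothetical, and the sample values are made up.

```python
import xml.etree.ElementTree as ET

# MARC tags of interest mapped to result labels (the labels are illustrative).
WANTED = {"010": "lccn", "020": "isbn", "022": "issn", "035": "system_number"}


def local_name(tag):
    """Drop any '{namespace}' prefix so records parse with or without one."""
    return tag.rsplit("}", 1)[-1]


def standard_identifiers(marcxml_text):
    """Collect the $a subfield of each wanted datafield in a MARCXML record."""
    found = {label: [] for label in WANTED.values()}
    root = ET.fromstring(marcxml_text)
    for field in root.iter():
        if local_name(field.tag) != "datafield":
            continue
        label = WANTED.get(field.get("tag"))
        if label is None:
            continue
        for sub in field:
            is_a = local_name(sub.tag) == "subfield" and sub.get("code") == "a"
            if is_a and sub.text:
                found[label].append(sub.text.strip())
    return found


# Made-up record for illustration only; not taken from any real item.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="020" ind1=" " ind2=" ">
    <subfield code="a">0000000000</subfield>
  </datafield>
  <datafield tag="035" ind1=" " ind2=" ">
    <subfield code="a">(OCoLC)00000001</subfield>
  </datafield>
</record>"""

if __name__ == "__main__":
    print(standard_identifiers(SAMPLE))
```

Any ISBNs found this way could then be passed to a service such as xISBN or ThingISBN; that call is left out here because its interface is not described in the thread.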
Re: [CODE4LIB] oca api?
Jonathan - No, I don't believe it's documented - at least not anywhere publicly. If any IA/OCA folks are lurking, here's an opportunity to make a bunch of techies happy...

Chris

-----Original Message-----
From: Code for Libraries [mailto:[EMAIL PROTECTED]] On Behalf Of Jonathan Rochkind
Sent: Monday, February 25, 2008 2:48 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] oca api?

I hadn't known this "custom query interface" existed! This is welcome news. Is this documented anywhere?

Jonathan
Re: [CODE4LIB] oca api?
I hadn't known this "custom query interface" existed! This is welcome news. Is this documented anywhere?

Jonathan

>>> Chris Freeland <[EMAIL PROTECTED]> 02/25/08 2:51 PM >>>
We found their OAI interface to pull scanned items inconsistently based on date of scanning, so we switched to using their custom query interface. Here's an example of a query we fire off:
http://www.archive.org/services/search.php?query=collection:(biodiversity)+AND+updatedate:%5b2007-10-31+TO+2007-11-30%5d+AND+-contributor:(MBLWHOI%20Library)&limit=10&submit=submit
Re: [CODE4LIB] oca api?
Steve & Tim,

I'm the tech director for the Biodiversity Heritage Library (BHL), a consortium of 10 natural history libraries that have partnered with Internet Archive (IA)/OCA to scan our collections. We've just launched our revamped portal, complete with more than 7,500 books & 2.8 million pages scanned by IA & other digitization partners, at:
http://www.biodiversitylibrary.org

To build this portal we ingest metadata from IA. We found their OAI interface to pull scanned items inconsistently based on date of scanning, so we switched to using their custom query interface. Here's an example of a query we fire off:
http://www.archive.org/services/search.php?query=collection:(biodiversity)+AND+updatedate:%5b2007-10-31+TO+2007-11-30%5d+AND+-contributor:(MBLWHOI%20Library)&limit=10&submit=submit

This is returning scanned items from the "biodiversity" collection, updated between 10/31/2007 - 11/30/2007, restricted to one of our contributing libraries (MBLWHOI Library), and limited to 10 results. The results are styled in the browser; view source to see the good stuff. We use this list to grab the identifiers we've yet to ingest.

Some background: When a book is scanned through IA/OCA scanning, they create their own unique identifier (like "annalesacademiae21univ") and grab a MARC record from the contributing library's catalog. All of the scanned files, derivatives, and metadata files are stored on IA's clusters in a directory named with the identifier.

Steve mentioned using their /details/ directive, then sniffing the page to get the cluster location and the files for downloading. An easier method is to use their /download/ directive, as in:
http://www.archive.org/download/$ID
or in the example above:
http://www.archive.org/download/annalesacademiae21univ

That automatically does a lookup on the cluster, which means you don't have to scrape info off pages. You can also address any files within that directory, as in:
http://www.archive.org/download/annalesacademiae21univ/annalesacademiae21univ_marc.xml

The only way to get standard identifiers (ISBN, ISSN, OCLC, LCCN) for these scanned books is to grab them out of the MARC record. So the long-winded answer to your question, Tim, is no, there's no simple way to crossref what IA has scanned with your catalog - THAT I KNOW OF. Big caveat on that last part.

Happy to help with any other questions I can,
Chris Freeland

-----Original Message-----
From: Code for Libraries [mailto:[EMAIL PROTECTED]] On Behalf Of Steve Toub
Sent: Sunday, February 24, 2008 11:20 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] oca api?

Once you know the identifier, you can grab the content files, using this syntax:
http://www.archive.org/details/$ID
Like so:
http://www.archive.org/details/chemicallecturee00newtrich
And then sniff the page to find the FTP link:
ftp://ia340915.us.archive.org/2/items/chemicallecturee00newtrich
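The two URL patterns in Chris's message (the custom query interface and the /download/ directive) can be assembled along these lines. This is only a sketch: the endpoints and parameter names are copied from the example URLs in the message and may have changed since; the function names are hypothetical; and the contributor clause of the example query is left out, since the thread does not document the query syntax.

```python
from urllib.parse import quote, urlencode

# Endpoints as they appear in the message above; treat them as historical.
SEARCH_URL = "http://www.archive.org/services/search.php"
DOWNLOAD_URL = "http://www.archive.org/download"


def search_url(collection, start_date, end_date, limit=10):
    """Build a query for items in a collection updated within a date range."""
    query = f"collection:({collection}) AND updatedate:[{start_date} TO {end_date}]"
    params = {"query": query, "limit": limit, "submit": "submit"}
    return SEARCH_URL + "?" + urlencode(params)


def download_url(identifier, filename=None):
    """Address an item's directory, or one file inside it, via /download/."""
    url = f"{DOWNLOAD_URL}/{quote(identifier)}"
    if filename:
        url += "/" + quote(filename)
    return url


def marc_url(identifier):
    """Address the item's MARC record, following the <identifier>_marc.xml pattern."""
    return download_url(identifier, f"{identifier}_marc.xml")


if __name__ == "__main__":
    print(search_url("biodiversity", "2007-10-31", "2007-11-30"))
    print(marc_url("annalesacademiae21univ"))
```

`urlencode` percent-encodes the colons, parentheses, and brackets in the query, so the resulting URL looks different from the hand-written one in the message but should decode to the same query string.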
Re: [CODE4LIB] oca api?
--- Tim Shearer <[EMAIL PROTECTED]> wrote:
> Hi Folks,
>
> I'm looking into tapping the texts in the Open Content Alliance.
>
> A few questions...
>
> As near as I can tell, they don't expose (perhaps even store?) any common
> unique identifiers (oclc number, issn, isbn, loc number).

I poked around in this world a few months ago in my previous job at California Digital Library, also an OCA partner. The unique key seems to be a text string identifier (one that seems to be completely different from the text string identifier in Open Library).

Apparently there was talk at the last partner meeting about moving to ISBNs:
http://dilettantes.code4lib.org/2007/10/22/tales-from-the-open-content-alliance/

To obtain identifiers in bulk, I think the recommended approach is the OAI-PMH interface, which seems more reliable in recent months:
http://www.archive.org/services/oai.php?verb=Identify
http://www.archive.org/services/oai.php?verb=ListIdentifiers&metadataPrefix=oai_dc&set=collection:cdl
etc.

Additional instructions if you want to grab the content files.

From any book's metadata page (e.g., http://www.archive.org/details/chemicallecturee00newtrich) click through on the "Usage Rights: See Terms" link; the rights are on a pane on the left-hand side.

Once you know the identifier, you can grab the content files, using this syntax:
http://www.archive.org/details/$ID
Like so:
http://www.archive.org/details/chemicallecturee00newtrich
And then sniff the page to find the FTP link:
ftp://ia340915.us.archive.org/2/items/chemicallecturee00newtrich
But I think they prefer to use HTTP for these, not the FTP, so switch this to:
http://ia340915.us.archive.org/2/items/chemicallecturee00newtrich

Hope this helps!

--SET
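For the bulk route Steve mentions, a harvest loop over ListIdentifiers could start from something like the sketch below. The base URL and the `collection:cdl` set come from his message; the namespace and the rule that `resumptionToken` is sent on its own are from the OAI-PMH 2.0 specification. The function names are hypothetical, the sample response is made up (including the identifier format), and the actual HTTP fetch is omitted.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_BASE = "http://www.archive.org/services/oai.php"  # from the message above
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_identifiers_url(set_spec=None, token=None):
    """Build a ListIdentifiers request; a resumption token replaces other args."""
    if token:
        params = {"verb": "ListIdentifiers", "resumptionToken": token}
    else:
        params = {"verb": "ListIdentifiers", "metadataPrefix": "oai_dc"}
        if set_spec:
            params["set"] = set_spec
    return OAI_BASE + "?" + urlencode(params)


def parse_identifiers(response_text):
    """Return (identifiers, resumption_token or None) from one response page."""
    root = ET.fromstring(response_text)
    ids = [
        el.text.strip()
        for el in root.iterfind(".//oai:header/oai:identifier", OAI_NS)
        if el.text
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    has_token = token_el is not None and token_el.text
    return ids, (token_el.text.strip() if has_token else None)


# Made-up response page for illustration; real responses carry more elements.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:archive.org:chemicallecturee00newtrich</identifier>
      <datestamp>2008-01-01</datestamp>
    </header>
    <resumptionToken>abc123</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_identifiers_url("collection:cdl"))
    print(parse_identifiers(SAMPLE))
```

A full harvest would fetch `list_identifiers_url(set_spec)`, parse the page, and keep requesting `list_identifiers_url(token=token)` until no token comes back.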
[CODE4LIB] oca api?
Hi Folks,

I'm looking into tapping the texts in the Open Content Alliance.

A few questions...

As near as I can tell, they don't expose (perhaps even store?) any common unique identifiers (oclc number, issn, isbn, loc number).

We're a contributor so I can use curl to grab our records via http (and regexp my way to our local catalog identifiers, which they do store/expose).

I've played a bit with the z39.50 interface at indexdata (http://www.indexdata.dk/opencontent/), but I'm not confident about the content behind it. I get very limited results; for instance, I can't find any UNC records and we're fairly new to the game.

Again, I'm looking for unique identifiers in what I can get back and it's slim pickings.

Anyone cracked this nut? Got any life lessons for me?

Thanks!
Tim

+++
Tim Shearer

Web Development Coordinator
The University Library
University of North Carolina at Chapel Hill
[EMAIL PROTECTED]
919-962-1288
+++