[CODE4LIB] OCA API

2009-05-15 Thread Tim Shearer

Hi Folks,

The University Library at UNC-Chapel Hill has created an OCA API.  We have 
harvested (and continue to harvest) standard bibliographic identifiers and 
link them to OCA identifiers.  The API is deliberately modeled after 
Google's for ease of implementation.


Here is a subject search in UNC's catalog for North Carolina limited to 
the 19th century.


http://search.lib.unc.edu/search?Ntk=Subject&Ne=2+200043+206475+206590+11&N=206596&Ntt=north%20carolina

You will see links to OCA as well as Google.  (The full record has an OCA 
icon if you want to look.)  Right now we are only banging against the API 
with OCLC numbers, but ISSNs, ISBNs and LC numbers are in there.
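For anyone who has not used a Google-style "bibkeys" lookup, the general pattern looks roughly like the sketch below. No documentation for the UNC service has been published, so everything specific here is an assumption made for illustration: the sample reply, its field names (bib_key, info_url), the callback name handleOca and the OCLC number 1234567 are all hypothetical. It only shows how prefixed keys are built and how a JSONP reply keyed by those keys can be unpacked.

```python
import json

def build_bibkeys(oclc=(), isbn=(), issn=(), lccn=()):
    """Build a comma-separated list of prefixed keys, e.g. 'OCLC:1,ISBN:2'."""
    keys = []
    for prefix, values in (("OCLC", oclc), ("ISBN", isbn), ("ISSN", issn), ("LCCN", lccn)):
        keys.extend(f"{prefix}:{value}" for value in values)
    return ",".join(keys)

def parse_jsonp(text):
    """Strip the callback wrapper from a JSONP reply and decode the JSON inside."""
    start = text.index("(") + 1
    end = text.rindex(")")
    return json.loads(text[start:end])

# Hypothetical reply; the real service's fields may well differ.
sample = (
    'handleOca({"OCLC:1234567": {"bib_key": "OCLC:1234567", '
    '"info_url": "http://www.archive.org/details/northcarolinayea1910rale"}});'
)

bibkeys = build_bibkeys(oclc=["1234567"], isbn=["0689868847"])
links = {key: record["info_url"] for key, record in parse_jsonp(sample).items()}
print(bibkeys)
print(links)
```

Keys that come back in the reply are the ones with a digitized copy; keys that are absent simply had no match.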


We are looking for a couple of partners to work with to take us beyond 
our local OPAC.  You would be ideal if you are interested, you already 
use the Google API, and you have a significant corpus of pre-1923 works in 
your catalog.


As the Google API is familiar to many of you, it would be easy to figure 
out how to implement UNC's without working with us.  Please hold off until 
we are ready to open it up all the way.  This is why we've not yet put up 
documentation.


Caveats and other notes (feel free to skip):

*We realize that Open Library has an API, but we had already gone a goodly 
distance and we are finding relatively meaningful differences in coverage 
and utility.


*We collect the data from OCA as it comes in (the data should be up to 
date within a half hour or so)...but they occasionally have need to 
correct/remove works.  Right now we are actively working on this issue, 
but do not yet have a great mechanism to pull deletes and update corrected 
identifiers.
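One possible way to catch deletes and corrected identifiers (a sketch only, not necessarily the mechanism UNC will settle on) is to compare a fresh full harvest against the stored one. The snapshot shape used here, a dict from OCA identifier to a set of "type:value" strings, and the sample item names are assumptions made for illustration.

```python
def diff_snapshots(old, new):
    """Compare two {oca_id: frozenset(identifiers)} snapshots.

    Returns three sorted lists: items that disappeared, items that are new,
    and items whose identifier set changed between the two harvests.
    """
    old_ids, new_ids = set(old), set(new)
    deleted = sorted(old_ids - new_ids)
    added = sorted(new_ids - old_ids)
    changed = sorted(k for k in old_ids & new_ids if old[k] != new[k])
    return deleted, added, changed

# Hypothetical harvests taken at two different times.
previous = {
    "exampleitem0001": frozenset({"oclc:1"}),
    "exampleitem0002": frozenset({"oclc:2"}),
    "exampleitem0003": frozenset({"isbn:3"}),
}
current = {
    "exampleitem0001": frozenset({"oclc:1"}),
    "exampleitem0002": frozenset({"oclc:22"}),  # corrected identifier
    "exampleitem0004": frozenset({"oclc:4"}),   # newly scanned
}

deleted, added, changed = diff_snapshots(previous, current)
print(deleted, added, changed)
```

A full re-harvest is expensive, so this would presumably run far less often than the half-hourly incremental pull.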


*The data is only as good as the data we harvest.  There are a small 
number of bad links.  See above.


*Excerpt from a developer on UNC's holdings (we are an OCA Scribe site):

...I decided to run the same script against the [production] database as 
well to see how much the matching is changing over time with continual 
updates:

- 429311 OCLC's tested
- 72350 matched
- 2599 of the matches were scanned by UNC

So that's 808 new matches since the end of March, not too bad for one 
month.


Effectively we are now linking to ~72 K digitized works that we were not 
previously able to provide (though as Google digitized books are being 
added to OCA, there is significant overlap).
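To put the developer's figures in proportion, the short calculation below derives the match rate, the share of matches scanned by UNC, and the match count at the end of March. It uses only the numbers quoted above; the variable names are mine.

```python
tested = 429311          # OCLC numbers tested
matched = 72350          # matched to an OCA item
scanned_by_unc = 2599    # matches that UNC itself scanned
new_since_march = 808    # new matches since the end of March

match_rate = matched / tested
unc_share = scanned_by_unc / matched
matched_end_of_march = matched - new_since_march

print(f"{match_rate:.1%} of tested OCLC numbers matched")
print(f"{unc_share:.1%} of matches were scanned by UNC")
print(f"{matched_end_of_march} matches at the end of March")
```

So roughly one tested record in six has a digitized counterpart, and only a few percent of those come from UNC's own scanning.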


*When we do open it up, it is the API we are offering; we are not prepared 
to be crawled for data.  If you want the data, get in touch and we will 
see what we can do.


If you are interested in being an early partner, please drop me a line and 
I will be in touch.


Tim

+++
Tim Shearer

Web Development Coordinator
The University Library
University of North Carolina at Chapel Hill
sh...@ils.unc.edu
919-962-1288
+++


Re: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?

2008-03-14 Thread Xiaoming Liu
On Fri, Mar 14, 2008 at 10:31 AM, Emily Lynema [EMAIL PROTECTED]
wrote:

<snip/>

available. I have to admit it seems odd to me to include so much
 attribute information in a single isbn element, but I suppose that
 would be helpful in identifying what specific manifestation is being
 referred to in the URL?



We made that design choice to be largely compatible with OCLC Research's
version of the xISBN service, and this kind of flat structure also helps us
easily disseminate other formats, such as csv or json serialization:

http://xisbn.worldcat.org/webservices/xid/isbn/0689868847?library=oca&fl=*&format=csv
http://xisbn.worldcat.org/webservices/xid/isbn/0689868847?library=oca&fl=*&format=python



 In this scenario, is there a way to indicate free vs. licensed somewhere
 in the isbn entry? I'm assuming that the Netlibrary audio book is
 *not* free. We have very few mechanisms to do that within MARC records;
 it would be great to think about that here as most libraries will be
 interested in *free* links to digitized content available from anywhere
 (google book search, oca, etc.).



We support a library=freeebook flag to limit the search scope to free ebooks:

http://xisbn.worldcat.org/webservices/xid/isbn/0689868847?library=freeebook&fl=*

Currently the free ebook collection is rather small (a few thousand
titles); hopefully we can grow the collection soon to make it more useful.
You can find more statistical information at
http://xisbn.worldcat.org/xisbnadmin/doc/stat.htm





 Also, have you considered the response for multiple digitized sources
 for the same ISBN?



If an ISBN has multiple digitized sources, they are put in the url attribute,
separated by spaces, e.g.

http://xisbn.worldcat.org/webservices/xid/isbn/0596002815?library=ebook&fl=title,url
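A client has to split that attribute back into individual links. The sketch below does so for an XML reply. The sample document is hypothetical: its namespace URI, title and example.org URLs are assumptions for illustration, and only the "isbn element with a space-separated url attribute" shape comes from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical reply, shaped after the description in this message.
sample = """<rsp xmlns="http://worldcat.org/xid/isbn/" stat="ok">
  <isbn title="Example title" url="http://example.org/a http://example.org/b">0596002815</isbn>
</rsp>"""

def digitized_urls(xml_text):
    """Return {isbn: [url, ...]} from isbn elements with a url attribute."""
    root = ET.fromstring(xml_text)
    result = {}
    for element in root.iter():
        # Compare the local name only, so the sketch does not depend on
        # the exact namespace URI the service uses.
        if element.tag.rsplit("}", 1)[-1] == "isbn":
            isbn = (element.text or "").strip()
            result[isbn] = element.get("url", "").split()
    return result

print(digitized_urls(sample))
```

Splitting on whitespace works because a well-formed URL cannot contain an unescaped space.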


Xiaoming


Re: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?

2008-03-12 Thread Chris Freeland
Tim - This is awesome work!  One thing to be aware of is that IA takes a
non-hierarchical view of scanned books - there is no Title-Item
(Bib-Item) relationship.  When they scan a serial or multivolume
monograph the MARCXML file for the Title is deposited in each scanned
Item.

For instance, the MARCXML for The transactions of the Academy of
Science of St. Louis is dropped into this item, which is volume 21:
http://www.archive.org/details/transactionsofac21acad
-(Click the FTP link along the left, then the _marc.xml file)

and this item, which is volume 22:
http://www.archive.org/details/transactionsofac22acad

You'll see they are identical files.  So, your number of 198,826 MARC
files does not correspond to 198,826 titles.  You will need to group
those MARC files by leader to get a true count of titles.  This is
what BHL does when we ingest materials from
http://www.archive.org/details/biodiversity into
http://www.biodiversitylibrary.org/

Chris

-Original Message-
From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
Tim McCormick
Sent: Wednesday, March 12, 2008 3:58 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?

In our office we too have been investigating the e-book material at
Internet Archive / OCA.

We'd like to build just the sort of OCA index / id-switcher that Tim
Shearer and others have described on this list -- in order to, among
other things, add this type of capability to our xID (aka xISBN)
service, and to WorldCat.

So, I thought I'd report on results so far, and what we're working on.

Data:
1) First, we used the Internet Archive's OAI interface to harvest
brief records of all items categorized as text.  We found that this
yielded only very brief records, though -- author, title, and OCA
unique identifier (e.g. northcarolinayea1910rale).
2) Then we used the OCA identifier to check for, and harvest, MARC-XML
records when available, using the lookup method described by Chris
Freeland on Code4Lib on Feb 25.
3) The MARC files were examined for ISBNs and OCLCnums.  (yes, we may
look for other identifiers later).

That yielded:
  - 290,756 total OCA text records found
  - 198,826 of those had MARC records
  - 1773 had ISBNs
  - 88537 had OCLC numbers (identified by record position & format,
but not yet verified against WorldCat).
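Step 3 can be sketched as follows. This is not the harvester described above, only a minimal illustration under stated assumptions: it reads ISBNs from 020 $a and treats an 035 $a value starting with "(OCoLC)" as an OCLC number, which is one common convention but not necessarily the "record position & format" test used here. The sample record and its OCLC number are made up.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://www.loc.gov/MARC21/slim"}

# Made-up record for illustration.
sample = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="020" ind1=" " ind2=" "><subfield code="a">0689868847 (pbk.)</subfield></datafield>
  <datafield tag="035" ind1=" " ind2=" "><subfield code="a">(OCoLC)1234567</subfield></datafield>
</record>"""

def extract_identifiers(marcxml):
    """Return (isbns, oclc_numbers) found in one MARCXML record."""
    root = ET.fromstring(marcxml)
    isbns, oclc_numbers = [], []
    for field in root.findall(".//m:datafield", NS):
        tag = field.get("tag")
        for sub in field.findall("m:subfield[@code='a']", NS):
            value = (sub.text or "").strip()
            if tag == "020" and value:
                # Drop qualifiers such as "(pbk.)" that follow the number.
                isbns.append(value.split()[0])
            elif tag == "035" and value.startswith("(OCoLC)"):
                oclc_numbers.append(value[len("(OCoLC)"):].strip())
    return isbns, oclc_numbers

print(extract_identifiers(sample))
```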

Switching:
In xID we currently support ISBN, have recently added LCCN, and we
plan to release ISSN and OCLCnum support in upcoming releases.  So,
when those are fully phased in, the goal is that you could submit an
identifier of any supported type, and get back all identifiers of
whichever type that represent versions of the same work;  or, when
appropriate, the same manifestation.
Therefore, the 88,537 OCLCnums will likely map to a much larger
set of identifiers overall, allowing a lot of book records -- in
library catalogs or elsewhere -- to hook into OCA materials.

Free-text service:
We imagine a service which, given an identifier, attempts to decide if
a free-text version of the described work is available at OCA/IA: and
if so, returns an access URL for that resource.

Other work:
We are investigating the case of free/open resources that lack
standard identifiers -- for example, possibly, the 2/3 of IA texts for
which we didn't find OCLCnum or ISBN.  Here, we are looking at doing
best-guess lookup of related identifiers, based on author and title
information in the brief record.   This might allow substantially
broader indexing of open content materials, but the reliability of the
identifier association is lower.
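A best-guess lookup of this kind usually starts from a normalized match key. The sketch below is one simple way to build such a key from author and title; it is an assumption about how the matching might work, not a description of the approach actually being investigated, and the normalization rules are deliberately crude.

```python
import re
import unicodedata

def match_key(author, title):
    """Build a crude comparison key: no accents, no punctuation, lower case."""
    text = unicodedata.normalize("NFKD", f"{author} {title}")
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())

# Two differently punctuated descriptions of the same work share a key.
brief_record = match_key("Brontë, Charlotte", "Jane Eyre.")
catalog_record = match_key("Bronte, Charlotte", "Jane Eyre")
print(brief_record, brief_record == catalog_record)
```

Equal keys only suggest a match. Different editions, or different works with the same short title, will collide, which is the lower reliability mentioned above.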

Any tips, questions, suggestions, requests are welcome.
thanks to Xiaoming Liu and Tom Ventimiglia in OCLC New Jersey office
for work on this.

Tim

--
Tim McCormick
Product Manager (xID), OCLC New Jersey
Email: mccormit (at) oclc.org
2 Broad St., Suite 208, Bloomfield, New Jersey 07003 USA
Phone: +1.973.868.5694  |  Skype:  tim_mccormick
http://www.oclc.org/


Re: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?

2008-03-07 Thread Emily Lynema

Tim,

It sounds like you want to be able to search on standard identifiers and
are frustrated that the Internet Archive's access doesn't allow it
(although it looks like they do have an ISBN search)? And I'm curious,
why would you want or need to pull down only records that have OCLC
numbers or ISBNs in particular? What is it you need to do that makes
only those records useful?

Like Karen and Bess and others have said, I recommend that you
coordinate this with the Open Library project. At the meeting last
Friday, it did sound like they would be interested in providing
identifier disambiguation types of service - give them an ISBN, and
they'll give you the records associated with it.

Also, there was discussion about building an Open Library API (to enable
some cool integration with wikipedia), and I suggested that libraries
using an API would want the search results to include information about
whether the title has a digitized copy. So I would hope the service that
you're envisioning is something that would be provided by an Open
Library API (but we don't know when that might come about).

As OCA moves forward, folks may well be digitizing identical books. So
there may not be a one to one relationship between unique catalog
identifier, unique oca identifier, and isbn/lccn/oclc number.

-emily



--

Date:Thu, 6 Mar 2008 08:47:04 -0500
From:Tim Shearer [EMAIL PROTECTED]
Subject: musing on oca apiRe: [CODE4LIB] oca api?

Howdy folks,

I've been playing and thinking.  I'd like to have what amounts to a unique
identifier index to oca digitized texts.  I want to be able to pull all the
records that have oclc numbers, issns, isbns, etc.  I want it to be
lightweight, fast, searchable.

Would anyone else want/use such a thing?

I'm thinking about building something like this.

If I do, it would be ideal if it wouldn't be a duplication of effort, so has anyone
got this in the works?  And would it meet the needs of others?

My basic notion is to crawl the site (starting with americana, the American
Libraries).  Pull the oca unique identifier (e.g. northcarolinayea1910rale) and
associate it with

unique identifiers (oclc numbers, issns, isbns, lc numbers)
contributing institution's alias and unique catalog identifier
upload date

That's all I was thinking of.  Then there's what you might be able to do with
it:

Give me all the oca unique identifiers that have oclc numbers
Give me all the oca unique identifiers with isbns that were
uploaded between x and y date
Give me the oca unique identifier for this oclc number

Planning to do:

keep crawling it and keep it up to date.

Things I wasn't planning to do:

worry about other unique ids (you'd have to go to xISBN or
ThingISBN yourself)
worry about storing anything else from oca.

It would be good for being able to add an 856 to matches in your catalog. It
would not be good for grabbing all marc records for all of oca.

Anyhow, is this duplication of effort?  Would you like something like this?
What else would you like it to do (keeping in mind this is an unfunded pet
project)?  How would you want to talk to it?  I was thinking of a web service,
but hadn't thought too much about how to query it or how I'd deliver results.

Of course I'm being an idiot and trying out new tools at the same time (python
to see what the buzz is all about, sqlite just to learn it (it may not work
out)).

Thoughts?  Vicious criticism?

-t


--

Date:Thu, 6 Mar 2008 11:05:41 -0500
From:Jodi Schneider [EMAIL PROTECTED]
Subject: Re: musing on oca apiRe: [CODE4LIB] oca api?

Great idea, Tim!

The open library tech list that Bess mentions is [EMAIL PROTECTED],
described at
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech

-Jodi

Jodi Schneider
Science Library Specialist
Amherst College
413-542-2076





--

Date:Thu, 6 Mar 2008 08:32:43 -0800
From:Karen Coyle [EMAIL PROTECTED]
Subject: Re: musing on oca apiRe: [CODE4LIB] oca api?

We talked about something like this at the Open Library meeting last
Friday. The ol list is [EMAIL PROTECTED] (join at
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-lib). I think of
this as a (or one or more) translate service between IDs. It's a
realization that we will never have a unique ID that everyone agrees on,
that most bibliographic items are really more than one thing, but that
since we have data about the bibliographic item we have many
opportunities to make connections even though people have used different
identifiers. So we could use an ID-switcher to move among data stores
and services. Is that the kind of thing you are thinking of?

kc




--
Emily Lynema
Systems Librarian for Digital Projects
Information Technology, NCSU Libraries
919-513-8031
[EMAIL PROTECTED]


Re: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?

2008-03-07 Thread Eric Lease Morgan

On Mar 7, 2008, at 8:22 AM, Emily Lynema wrote:


Also, there was discussion about building an Open Library API (to enable
some cool integration with wikipedia), and I suggested that libraries
using an API would want the search results to include information about
whether the title has a digitized copy. So I would hope the service that
you're envisioning is something that would be provided by an Open
Library API (but we don't know when that might come about).



I sat in on this discussion at the Meeting. It was driven by a
consultant-type who is working for Wikipedia. His desire was to
create an API that allowed people to authoritatively and consistently
cite content from Wikipedia to Open Library. Ultimately, this API
would allow a person to:

  * search Open Library via word, phrase, or key
  * return list of hits
  * select item
  * create citation
  * insert citation into Wikipedia article
  * regularly check the validity of the citation

Regarding the first two items I tried to suggest the use of SRU.
Regarding the last item, I tried to suggest OAI. In both cases I was
shot down as "too complicated", yet at the same time they were outlining
APIs that had the *exact* functionality of SRU and OAI. I sort of
saw his point. Library protocols are usually overly complicated,
yet he was totally unaware of either protocol. I also think he was
suffering a bit from the Not Invented Here Syndrome. We also got into
a bit of a religious war regarding the definition of REST-ful Web
Services.
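For context on what the SRU suggestion would have meant in practice: the "search" and "return list of hits" steps are a single HTTP GET with a handful of standard parameters, as sketched below. The base URL and the CQL index name dc.title are assumptions for illustration. Open Library did not offer an SRU endpoint at the time of this discussion.

```python
from urllib.parse import urlencode

def sru_search_url(base_url, cql_query, max_records=10):
    """Build an SRU 1.1 searchRetrieve request URL."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": cql_query,           # a CQL query string
        "maximumRecords": max_records,
    }
    return f"{base_url}?{urlencode(params)}"

# Hypothetical endpoint; only the parameter names come from the SRU standard.
url = sru_search_url("http://example.org/sru", 'dc.title = "north carolina"')
print(url)
```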

In the end we talked a lot about JSON and a tiny bit about ATOM.

--
Eric Lease Morgan
University Libraries of Notre Dame


Re: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?

2008-03-07 Thread Jonathan Rochkind

I see a whole lot of "not invented here" syndrome from IA, honestly.
They seem to want to re-invent everything themselves, rather than try to
use existing conventions.  Even if they come up with something slightly
better than SRU, is it worth the pain to developers who would like to
implement client code, and can't use their existing SRU client code to
do so?

Seems to me, and I tried to tell Brewster this when talking to him after
his keynote at the conference, if the IA is serious about trying to get
external developers to engage with IA stuff (which the IA folks at the
conf mentioned was indeed a goal of theirs), then there are certain
things the IA should put their resources into in order to facilitate
and encourage this. Mainly:

1) Documenting their interfaces.  Right now as far as I can tell
everything is available on an "if you happen to notice it's there and
then reverse engineer it yourself, and who knows if it might change and
break your code" basis.  I don't really have time for that.

2) When they make machine interfaces, use existing conventions and
standards in use by the community of developers they want to target. [If
the community of developers they want to target is not necessarily
library programmers, and that community they wish to target doesn't in
fact use SRU at all right now, I suppose that might be fair. I dunno].

3) Best of all, actually talk to people in this community of developers
_before_ developing their stuff, to see what their needs are. User
centered development, right? You don't produce a giant piece of
software without talking to those who you want to use it, and then
wonder why they don't seem interested in  using it.

When I bring this up, I'm generally told "Oh, all that is YOUR
responsibility. If you wanted it bad enough, you'd deal with it. We just
make it available, the rest is up to you."  That's fine, like I said,
they can prioritize their resource allocation however they want.  But
they shouldn't be so surprised when they're having trouble getting
external-developer-community adoption of their stuff when this is their
attitude.   That's what I would have said if I had been able to make the
meeting last week.

So maybe they're changing their approach a bit with regard to some of
these things. They did meet with library developers, at least. I don't
see much evidence of 1 or 2 yet though.

Jonathan

--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu


Re: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?

2008-03-07 Thread Kyle Banerjee
   I want to be able to pull all the
  records that have oclc numbers, issns, isbns, etc.  I want it to be
  lightweight, fast, searchable.

  Would anyone else want/use such a thing?...

I like the idea, but in the long term, I just don't know how useful
this will be. By and large, these identifiers are designed for dead
tree resources. Although they are sometimes assigned to electronic
resources,  I find it hard to believe that the containers these
identifiers are associated with will contain more than a tiny
proportion of the information users want/need. The book structure just
doesn't make nearly as much sense in an online environment.


  My basic notion is to crawl the site (starting with americana, the American
  Libraries.  Pull the oca unique identifier (e.g. northcarolinayea1910rale) and
  associate it with

  unique identifiers (oclc numbers, issns, isbns, lc numbers)
  contributing institution's alias and unique catalog identifier
  upload date

  That's all I was thinking of.  Then there's what you might be able to do with
  it:

 Give me all the oca unique identifiers that have oclc numbers
 Give me all the oca unique identifiers with isbns that were
 uploaded between x and y date
 Give me the oca unique identifier for this oclc number

Not sure I understand the use case (i.e. the value of retrieving
another identifier).

One thing to keep in mind is that although the numbering schemes are
independent, they can be thought of as hierarchical. Anything that has
an lccn number should already have an isbn because of the standards lc
catalogs to. And they put their holdings in OCLC, so all numbers that
have an oclc number should contain these other identifiers. Items with
oclc numbers that were not cataloged by lc should also have isbns.
When such conditions are not met, it is a sign of a record containing
unreliable information.

kyle
--
--
Kyle Banerjee
Digital Services Program Manager
Orbis Cascade Alliance
[EMAIL PROTECTED] / 541.359.9599


Re: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?

2008-03-07 Thread Jonathan Rochkind

Kyle Banerjee wrote:


I like the idea, but in the long term, I just don't know how useful
this will be. By and large, these identifiers are designed for dead
tree resources.

Only time will tell, but it's what we've got now, and I don't see our
existing legacy records going away. So we will continue to need to try
and match existing records to digitized resources representing those
existing records. (Keep in mind that OCA for now is mostly only
digitizing out of copyright stuff!) The more identifiers, the more likely
we can successfully make such a match.

One thing to keep in mind is that although the numbering schemes are
independent, they can be thought of as hierarchical. Anything that has
an lccn number should already have an isbn because of the standards lc
catalogs to.

Nope. ISBN was created in 1966.  LCCNs exist for many resources
published before 1966.  Even after 1966, not every single item that may
have been cataloged by the Library of Congress was necessarily assigned
an ISBN by its publisher. (One obvious overlooked example---non-print
resources, like music or videos! LC doesn't catalog very many of these,
but any they have aren't going to have ISBNs! Other examples---foreign
publishers, self-published stuff, the first few years after 66 when the ISBN
adoption curve was still on the way up, etc.)


And they put their holdings in OCLC, so all numbers that
have an oclc number should contain these other identifiers.

Nope. I think you mean all items that have an LCCN should also have an
OCLC number. Probably true (mostly). But all items that have an OCLC
number will not necessarily have an LCCN. You say so below: items that
"were not cataloged by lc" will have oclc numbers but probably not
lccns.  And once we get away from LC, the chances of a cataloged item
(with an OCLC number) not having an ISBN go up even more (any musical
CD, for instance, not usually held by LC but held by public libraries
across the US).


Items with
oclc numbers that were not cataloged by lc should also have isbns.
When such conditions are not met, it is a sign of a record containing
unreliable information.


I do not believe this is the case. But let us admit that our cooperative
cataloging corpus in fact IS not very reliable, it is full of incorrect
information. But we've got to deal with it anyway.  A record that is
_missing_ an applicable identifier that it _could_ have contained may be
reliable in other respects, I wouldn't automatically assume it is not.

Jonathan




--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu


Re: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?

2008-03-07 Thread Karen Coyle

Kyle Banerjee wrote:

  I want to be able to pull all the
 records that have oclc numbers, issns, isbns, etc.  I want it to be
 lightweight, fast, searchable.

 Would anyone else want/use such a thing?...


I like the idea, but in the long term, I just don't know how useful
this will be. By and large, these identifiers are designed for dead
tree resources. Although they are sometimes assigned to electronic
resources,  I find it hard to believe that the containers these
identifiers are associated with will contain more than a tiny
proportion of the information users want/need. The book structure just
doesn't make nearly as much sense in an online environment.


The utility that I see is that as things are digitized the dead tree
identifier is often included in the metadata that accompanies the
digital file. This makes it possible to go from legacy data (read:
library catalogs) to the digital data.



Not sure I understand the use case (i.e. the value of retrieving
another identifier).


Because the same dead tree item is being digitized multiple times in
different locations under different projects. It's an interesting
situation because where we once had an ISBN that identified EVERY copy
of that manifestation we will now have many different copies
(different because they were digitized separately). Those copies will
probably have a variety of identifiers associated with them.



One thing to keep in mind is that although the numbering schemes are
independent, they can be thought of as hierarchical. Anything that has
an lccn number should already have an isbn because of the standards lc
catalogs to. And they put their holdings in OCLC, so all numbers that
have an oclc number should contain these other identifiers. Items with
oclc numbers that were not cataloged by lc should also have isbns.
When such conditions are not met, it is a sign of a record containing
unreliable information.


Not the case. First, ISBNs only came into being in 1968. Nothing before
that has one. Many items have NOT been cataloged by LC, many are NOT in
OCLC, and oftentimes the records that you are working with have munged,
stripped out, or lost the identity of the identifiers that are left.
It's great luck if you find one clearly marked identifier in a bib record.

kc





--
---
Karen Coyle / Digital Library Consultant
[EMAIL PROTECTED] http://www.kcoyle.net
ph.: 510-540-7596   skype: kcoylenet
fx.: 510-848-3913
mo.: 510-435-8234



Re: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?

2008-03-07 Thread Kyle Banerjee
  Nope. ISBN was created in 1966.  LCCNs exist for many resources
  published before 1966.  Even after 1966, not every single item that may
  have been cataloged by the Library of Congress was necessarily assigned
  an ISBN by its publisher

All true. What I meant was that _if_ an isbn exists for an item and LC
cataloged it, the LC record should have both. Many resources with and
without isbns may not be in LC, so the lccn cannot be used as a
substitute, but LC records can be considered a reasonably
authoritative source of isbns for the stuff that they have.

  Nope. I think you mean all items that have an LCCN should also have an
  OCLC number. Probably true (mostly). But all items that have an OCLC
  number will not necessarily have an LCCN. You say so below: items that
  "were not cataloged by lc" will have oclc numbers but probably not
  lccns.

I misspoke but it appears you see what I mean. The relationship
between oclc numbers and lccns is similar to that between lccns
and isbns. The oclc number is not a substitute for an lccn, but if a
record that has an oclc number also contains an lccn, the oclc record
can be considered an authoritative source for the lccn -- and an isbn
if one exists.

  I do not believe this is the case. But let us admit that our cooperative
  cataloging corpus in fact IS not very reliable, it is full of incorrect
  information. But we've got to deal with it anyway.  A record that is
  _missing_ an applicable identifier that it _could_ have contained may be
  reliable in other respects, I wouldn't automatically assume it is not.

The quality is variable, but it's the best we have and it's worth
using the most reliable data available. Otherwise, inappropriate
linkages start popping up. If there are relatively few, that's not a
big deal, but once you get too much bad data in the system you have a
real problem.

kyle


--
--
Kyle Banerjee
Digital Services Program Manager
Orbis Cascade Alliance
[EMAIL PROTECTED] / 541.359.9599


Re: [CODE4LIB] musing on oca apiRe: [CODE4LIB] oca api?

2008-03-06 Thread Bess Sadler

Tim, I think this is a fantastic idea and the only suggestion I would
make is to make sure you get on the Open Library developers list (I'm
looking for the URL... I'll email when I find it unless someone else
beats me to it) and discuss this there. (You may already have done
this, I don't know.) They may be interested in hosting such a
project, and of course it would be helpful to have their knowledge of
the collections and apis on call. They seem to be keen on involving
developers from outside the Internet Archive staff, and this seems
like a perfect opportunity.

I would be very interested in helping you test such a service,
though, and I would definitely put links into our library catalogue.

Bess

Elizabeth (Bess) Sadler
Research and Development Librarian
Digital Scholarship Services
Box 400129
Alderman Library
University of Virginia
Charlottesville, VA 22904

[EMAIL PROTECTED]
(434) 243-2305


[CODE4LIB] musing on oca api Re: [CODE4LIB] oca api?

2008-03-06 Thread Tim Shearer

Howdy folks,

I've been playing and thinking.  I'd like to have what amounts to a unique
identifier index to oca digitized texts.  I want to be able to pull all the
records that have oclc numbers, issns, isbns, etc.  I want it to be
lightweight, fast, searchable.

Would anyone else want/use such a thing?

I'm thinking about building something like this.

If I do, it would be ideal if it weren't a duplication of effort, so has
anyone got this in the works?  And it would be ideal if it met the needs of
others.

My basic notion is to crawl the site (starting with americana, the American
Libraries collection), pull the oca unique identifier (e.g.
northcarolinayea1910rale), and associate it with

unique identifiers (oclc numbers, issns, isbns, lc numbers)
contributing institution's alias and unique catalog identifier
upload date

That's all I was thinking of.  Then there's what you might be able to do with
it:

   Give me all the oca unique identifiers that have oclc numbers
   Give me all the oca unique identifiers with isbns that were
   uploaded between x and y date
   Give me the oca unique identifier for this oclc number
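
A minimal sketch of how such an index and the three lookups above might look,
using Python's standard sqlite3 module.  The table and column names, the
helper functions, and every sample value (contributor alias, catalog id,
dates, identifier numbers, the second item) are made up for illustration;
this is one possible layout, not a description of anything that exists.

```python
import sqlite3

# Hypothetical schema: one row per OCA item, one row per standard identifier.
SCHEMA = """
CREATE TABLE items (
    oca_id      TEXT PRIMARY KEY,  -- e.g. northcarolinayea1910rale
    contributor TEXT,              -- contributing institution's alias
    catalog_id  TEXT,              -- contributor's own catalog identifier
    upload_date TEXT               -- ISO 8601 date, so text comparison sorts correctly
);
CREATE TABLE identifiers (
    oca_id   TEXT NOT NULL REFERENCES items(oca_id),
    id_type  TEXT NOT NULL,        -- 'oclc', 'issn', 'isbn' or 'lccn'
    id_value TEXT NOT NULL
);
CREATE INDEX idx_identifiers_lookup ON identifiers (id_type, id_value);
"""


def build_demo_index():
    """Create an in-memory index holding two made-up sample items."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO items VALUES (?, ?, ?, ?)",
        [
            ("northcarolinayea1910rale", "example_library", "b0000001", "2008-01-15"),
            ("examplebook00demo", "example_library", "b0000002", "2008-02-20"),
        ],
    )
    conn.executemany(
        "INSERT INTO identifiers VALUES (?, ?, ?)",
        [
            ("northcarolinayea1910rale", "oclc", "0000001"),
            ("examplebook00demo", "isbn", "9780000000002"),
        ],
    )
    return conn


def oca_ids_with(conn, id_type):
    """All OCA identifiers that have at least one identifier of id_type."""
    rows = conn.execute(
        "SELECT DISTINCT oca_id FROM identifiers WHERE id_type = ? ORDER BY oca_id",
        (id_type,),
    )
    return [row[0] for row in rows]


def oca_ids_with_between(conn, id_type, start, end):
    """OCA identifiers with an id_type identifier, uploaded between start and end."""
    rows = conn.execute(
        "SELECT DISTINCT i.oca_id FROM items i "
        "JOIN identifiers x ON x.oca_id = i.oca_id "
        "WHERE x.id_type = ? AND i.upload_date BETWEEN ? AND ? "
        "ORDER BY i.oca_id",
        (id_type, start, end),
    )
    return [row[0] for row in rows]


def oca_id_for(conn, id_type, id_value):
    """The OCA identifier for one standard identifier, or None if unknown."""
    row = conn.execute(
        "SELECT oca_id FROM identifiers WHERE id_type = ? AND id_value = ?",
        (id_type, id_value),
    ).fetchone()
    return row[0] if row else None
```

Keeping the standard identifiers in their own table, rather than as columns on
the item row, lets one item carry several ISBNs or OCLC numbers without a
schema change.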

Planning to do:

   keep crawling it and keep it up to date.

Things I wasn't planning to do:

   worry about other unique ids (you'd have to go to xISBN or
   ThingISBN yourself)
   worry about storing anything else from oca.

It would be good for being able to add an 856 to matches in your catalog. It
would not be good for grabbing all marc records for all of oca.
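
For the 856 use case, a rough sketch of turning one match from the index into
a link field.  The helper names, the dict layout, and the wording of the
public note are assumptions made for illustration; the URL uses the
archive.org /details/ path discussed elsewhere in this thread.  A real catalog
load would more likely go through a MARC library than a plain dict.

```python
def make_856(oca_id):
    """Return a MARC 856 (electronic location and access) field as a plain dict."""
    return {
        "tag": "856",
        # First indicator 4 = HTTP access; second indicator 1 = version of the resource
        "indicators": ("4", "1"),
        "subfields": [
            ("u", "http://www.archive.org/details/" + oca_id),  # URI
            ("z", "Digitized copy via the Open Content Alliance"),  # assumed public note
        ],
    }


def to_mnemonic(field):
    """Render the field as one readable line of text, e.g. '=856  41$u...'."""
    ind1, ind2 = field["indicators"]
    body = "".join("$" + code + value for code, value in field["subfields"])
    return "=" + field["tag"] + "  " + ind1 + ind2 + body
```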

Anyhow, is this duplication of effort?  Would you like something like this?
What else would you like it to do (keeping in mind this is an unfunded pet
project)?  How would you want to talk to it?  I was thinking of a web service,
but hadn't thought too much about how to query it or how I'd deliver results.

Of course I'm being an idiot and trying out new tools at the same time (python
to see what the buzz is all about, sqlite just to learn it (it may not work
out)).

Thoughts?  Vicious criticism?

-t


On Tue, 26 Feb 2008, Chris Freeland wrote:


My guess is that, yes, the query interface we've been discussing here
and the 'all sorts of interfaces that none of us knew about' are the
same.  It's not documented that I'm aware of.  We've found out about it
by literally sitting next to IA developers and asking questions.

Chris
-Original Message-
From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
Jonathan Rochkind
Sent: Tuesday, February 26, 2008 12:18 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] oca api?

So in answer to my question here at the Code4Lib conference, after
Brewster's keynote, Brewster suggests there are all sorts of interfaces
that none of us knew about. Or at least I didn't know about, and haven't
been able to figure out in months of trying!  I'm going to try and
corner him and ask for an email of who we should contact.

Perhaps it's the XML interface that you guys know about already. Is that
documented anywhere? How the heck did you find out about it?

Jonathan



Steve Toub [EMAIL PROTECTED] 02/25/08 9:41 PM 

I'll add that when IA told me about the
http://www.archive.org/services/search.php interface to return XML, they
asked that we not request more than 100 records at a time, since doing more
would adversely affect production services. Which made it seem like OAI-PMH
was a better way to go.

Chris, can you explain a bit more about what this means: "We found their
OAI interface to pull scanned items inconsistently based on date of
scanning"? I'm having trouble parsing.


  --SET




--- Chris Freeland [EMAIL PROTECTED] wrote:


Jonathan - No, I don't believe it's documented - at least not anywhere
publicly.  If any IA/OCA folks are lurking, here's an opportunity to
make a bunch of techies happy...

Chris

-Original Message-
From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
Jonathan Rochkind
Sent: Monday, February 25, 2008 2:48 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] oca api?

I hadn't known this custom query interface existed! This is welcome
news. Is this documented anywhere?

Jonathan



Chris Freeland [EMAIL PROTECTED] 02/25/08 2:51 PM 

Steve & Tim,

I'm the tech director for the Biodiversity Heritage Library (BHL), which
is a consortium of 10 natural history libraries who have partnered with
Internet Archive (IA)/OCA for scanning our collections.  We've just
launched our revamped portal, complete with more than 7,500 books & 2.8
million pages scanned by IA & other digitization partners, at:
http://www.biodiversitylibrary.org

To build this portal we ingest metadata from IA.  We found their OAI
interface to pull scanned items inconsistently based on date of
scanning, so we switched to using their custom query interface.  Here's
an example of a query we fire off:

http://www.archive.org/services/search.php?query=collection:(biodiversity)+AND+updatedate:%5b2007-10-31+TO+2007-11-30%5d+AND+-contributor:(MBLWHOI

Re: [CODE4LIB] musing on oca api Re: [CODE4LIB] oca api?

2008-03-06 Thread Jonathan Rochkind

I would absolutely want and use such a thing.

I don't know of anyone else doing that, although I have been thinking
about it too (but don't really have time to do much with it). The
approach and issues you have identified match what I've been thinking,
and I don't have much to add.

Are you thinking of providing an index that you'd let the rest of us
search?  That would be great. Although there's always an issue with
sustainability there; if I have my software use your index, what happens
when you leave your job and your employer stops supporting it? It might
make sense to try to find a more neutral host site for such a thing,
and try to get together a small 'committee' to support it, so if you
stop working on it for whatever reason a year from now, it is more
likely to continue to work.

Jonathan

Tim Shearer wrote:

Howdy folks,

I've been playing and thinking.  I'd like to have what amounts to a unique
identifier index to oca digitized texts.  I want to be able to pull all the
records that have oclc numbers, issns, isbns, etc.  I want it to be
lightweight, fast, searchable.

Would anyone else want/use such a thing?

I'm thinking about building something like this.

If I do, it would be ideal if it weren't a duplication of effort, so has
anyone got this in the works?  And it would be ideal if it met the needs of
others.

My basic notion is to crawl the site (starting with americana, the American
Libraries collection), pull the oca unique identifier (e.g.
northcarolinayea1910rale), and associate it with

unique identifiers (oclc numbers, issns, isbns, lc numbers)
contributing institution's alias and unique catalog identifier
upload date

That's all I was thinking of.  Then there's what you might be able to do with
it:

   Give me all the oca unique identifiers that have oclc numbers
   Give me all the oca unique identifiers with isbns that were
   uploaded between x and y date
   Give me the oca unique identifier for this oclc number

Planning to do:

   keep crawling it and keep it up to date.

Things I wasn't planning to do:

   worry about other unique ids (you'd have to go to xISBN or
   ThingISBN yourself)
   worry about storing anything else from oca.

It would be good for being able to add an 856 to matches in your catalog. It
would not be good for grabbing all marc records for all of oca.

Anyhow, is this duplication of effort?  Would you like something like this?
What else would you like it to do (keeping in mind this is an unfunded pet
project)?  How would you want to talk to it?  I was thinking of a web service,
but hadn't thought too much about how to query it or how I'd deliver results.

Of course I'm being an idiot and trying out new tools at the same time (python
to see what the buzz is all about, sqlite just to learn it (it may not work
out)).

Thoughts?  Vicious criticism?

-t


On Tue, 26 Feb 2008, Chris Freeland wrote:


My guess is that, yes, the query interface we've been discussing here
and the 'all sorts of interfaces that none of us knew about' are the
same.  It's not documented that I'm aware of.  We've found out about it
by literally sitting next to IA developers and asking questions.

Chris
-Original Message-
From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
Jonathan Rochkind
Sent: Tuesday, February 26, 2008 12:18 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] oca api?

So in answer to my question here at the Code4Lib conference, after
Brewster's keynote, Brewster suggests there are all sorts of interfaces
that none of us knew about. Or at least I didn't know about, and haven't
been able to figure out in months of trying!  I'm going to try and
corner him and ask for an email of who we should contact.

Perhaps it's the XML interface that you guys know about already. Is that
documented anywhere? How the heck did you find out about it?

Jonathan



Steve Toub [EMAIL PROTECTED] 02/25/08 9:41 PM 

I'll add that when IA told me about the
http://www.archive.org/services/search.php interface to return XML, they
asked that we not request more than 100 records at a time, since doing more
would adversely affect production services. Which made it seem like OAI-PMH
was a better way to go.

Chris, can you explain a bit more about what this means: "We found their
OAI interface to pull scanned items inconsistently based on date of
scanning"? I'm having trouble parsing.


  --SET




--- Chris Freeland [EMAIL PROTECTED] wrote:


Jonathan - No, I don't believe it's documented - at least not anywhere
publicly.  If any IA/OCA folks are lurking, here's an opportunity to
make a bunch of techies happy...

Chris

-Original Message-
From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
Jonathan Rochkind
Sent: Monday, February 25, 2008 2:48 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] oca api?

I hadn't known this custom query interface existed! This is welcome
news. Is this documented anywhere?

Jonathan



Chris

[CODE4LIB] oca api

2008-03-06 Thread Karen Coyle
For some reason, the code4lib listservs reject my [EMAIL PROTECTED] mail
(undoubtedly having to do with the domain name -- any ideas welcome) so I'll
try to keep track of the list from this account.

Meanwhile...

We talked about something like this at the Open Library meeting last Friday
in a group that included Rob Styles, who has thought long and hard about
identifiers.  I think of this as a translation service (or more than one)
between IDs, aka xISBN on steroids. It's a realization that we will never
have a unique ID that everyone agrees on, that most bibliographic items are
really more than one thing, but that since we have data about the
bibliographic item we have many opportunities to make connections even
though people have used different identifiers. So we could use an
ID-switcher to move among data stores and services. Is that the kind of
thing folks are thinking of?
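
To make the ID-switcher idea concrete, here is a small sketch in Python.  The
class name, the (scheme, value) tuple convention, and the sample values are
all hypothetical; it only illustrates the notion of grouping identifiers that
describe the same item and translating between schemes, not any existing
service.

```python
class IdSwitcher:
    """Toy translation table between identifier schemes (oclc, isbn, lccn, oca, ...)."""

    def __init__(self):
        # Maps each (scheme, value) pair to the shared set of its equivalents.
        self._cluster = {}

    def link(self, a, b):
        """Record that identifiers a and b describe the same bibliographic item."""
        merged = self._cluster.get(a, {a}) | self._cluster.get(b, {b})
        # Point every member at the merged set so links are transitive.
        for ident in merged:
            self._cluster[ident] = merged

    def translate(self, ident, scheme):
        """Return the known values in `scheme` equivalent to ident, sorted."""
        return sorted(v for s, v in self._cluster.get(ident, ()) if s == scheme)
```

Because `translate` returns a list, an item that is "really more than one
thing" can map to several identifiers in the target scheme.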

kc

--
---
Karen Coyle / Digital Library Consultant
[EMAIL PROTECTED] http://www.kcoyle.net
ph.: 510-540-7596   skype: kcoylenet
fx.: 510-848-3913
mo.: 510-435-8234



Re: [CODE4LIB] oca api?

2008-02-27 Thread Chris Freeland
Roy, do you have an answer in mind?

To me & my project it's the content that is open, which is why it's worth
the hurdles.  Once you 'crack the nut' you can grab metadata, scans, and
derivatives and ingest, parse, recombine, remix...as we've done for BHL.

Access to OCA content may not be standards-based, but it works.

Chris

-Original Message-
From: Roy Tennant [EMAIL PROTECTED]
To: CODE4LIB@LISTSERV.ND.EDU CODE4LIB@LISTSERV.ND.EDU
Sent: 2/27/2008 5:28 AM
Subject: Re: [CODE4LIB] oca api?

So what, exactly, is open about this? Anyone care to guess?
Roy


On 2/26/08 10:29 AM, Chris Freeland [EMAIL PROTECTED] wrote:

 My guess is that, yes, the query interface we've been discussing here
 and the 'all sorts of interfaces that none of us knew about' are the
 same.  It's not documented that I'm aware of.  We've found out about it
 by literally sitting next to IA developers and asking questions.

 Chris
 -Original Message-
 From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
 Jonathan Rochkind
 Sent: Tuesday, February 26, 2008 12:18 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] oca api?

 So in answer to my question here at the Code4Lib conference, after
 Brewster's keynote, Brewster suggests there are all sorts of interfaces
 that none of us knew about. Or at least I didn't know about, and haven't
 been able to figure out in months of trying!  I'm going to try and
 corner him and ask for an email of who we should contact.

 Perhaps it's the XML interface that you guys know about already. Is that
 documented anywhere? How the heck did you find out about it?

 Jonathan


 Steve Toub [EMAIL PROTECTED] 02/25/08 9:41 PM 
 I'll add that when IA told me about the
 http://www.archive.org/services/search.php interface to return XML, they
 asked that we not request more than 100 records at a time, since doing more
 would adversely affect production services. Which made it seem like OAI-PMH
 was a better way to go.

 Chris, can you explain a bit more about what this means: "We found their
 OAI interface to pull scanned items inconsistently based on date of
 scanning"? I'm having trouble parsing.


--SET




 --- Chris Freeland [EMAIL PROTECTED] wrote:

 Jonathan - No, I don't believe it's documented - at least not anywhere
 publicly.  If any IA/OCA folks are lurking, here's an opportunity to
 make a bunch of techies happy...

 Chris

 -Original Message-
 From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
 Jonathan Rochkind
 Sent: Monday, February 25, 2008 2:48 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] oca api?

 I hadn't known this custom query interface existed! This is welcome
 news. Is this documented anywhere?

 Jonathan


 Chris Freeland [EMAIL PROTECTED] 02/25/08 2:51 PM 
 Steve & Tim,

 I'm the tech director for the Biodiversity Heritage Library (BHL), which
 is a consortium of 10 natural history libraries who have partnered with
 Internet Archive (IA)/OCA for scanning our collections.  We've just
 launched our revamped portal, complete with more than 7,500 books & 2.8
 million pages scanned by IA & other digitization partners, at:
 http://www.biodiversitylibrary.org

 To build this portal we ingest metadata from IA.  We found their OAI
 interface to pull scanned items inconsistently based on date of
 scanning, so we switched to using their custom query interface.  Here's
 an example of a query we fire off:

 http://www.archive.org/services/search.php?query=collection:(biodiversity)+AND+updatedate:%5b2007-10-31+TO+2007-11-30%5d+AND+-contributor:(MBLWHOI%20Library)&limit=10&submit=submit

 This is returning scanned items from the biodiversity collection,
 updated between 10/31/2007 - 11/30/2007, restricted to one of our
 contributing libraries (MBLWHOI Library), and limited to 10 results.
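
A sketch of assembling a query URL like the one above in Python rather than
by hand.  The function name and its arguments are hypothetical, and the query
syntax is copied as-is from the example; the interface was undocumented, so
how the server treats each parameter is an assumption.  The cap of 100
reflects Steve's note that IA asked for no more than 100 records at a time.

```python
from urllib.parse import urlencode

SEARCH_URL = "http://www.archive.org/services/search.php"
MAX_LIMIT = 100  # IA reportedly asked for no more than 100 records per request


def build_search_url(collection, start, end, contributor, limit=10):
    """Build a search.php URL in the shape of the example query (assumed syntax)."""
    query = "collection:({0}) AND updatedate:[{1} TO {2}] AND -contributor:({3})".format(
        collection, start, end, contributor
    )
    params = [
        ("query", query),
        ("limit", min(limit, MAX_LIMIT)),  # never ask for more than the cap
        ("submit", "submit"),
    ]
    # urlencode percent-encodes brackets, colons and parentheses and turns spaces into '+'
    return SEARCH_URL + "?" + urlencode(params)
```

Letting `urlencode` do the escaping avoids hand-typed `%5b`/`%5d` sequences
and keeps the `&` separators between parameters intact.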

 The results are styled in the browser; view source to see the good
 stuff.  We use this list to grab the identifiers we've yet to ingest.

 Some background: When a book is scanned through IA/OCA scanning, they
 create their own unique identifier (like annalesacademiae21univ) and
 grab a MARC record from the contributing library's catalog.  All of the
 scanned files, derivatives, and metadata files are stored on IA's
 clusters in a directory named with the identifier.

 Steve mentioned using their /details/ directive, then sniffing the page
 to get the cluster location and the files for downloading.  An easier
 method is to use their /download/ directive, as in:

 http://www.archive.org/download/ID$, or in the example above:
 http://www.archive.org/download/annalesacademiae21univ

 That automatically does a lookup on the cluster, which means you don't
 have to scrape info off pages.  You can also address any files within
 that directory, as in:

 http://www.archive.org/download/annalesacademiae21univ/annalesacademiae21univ_marc.xml
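
The /download/ pattern is simple enough to wrap in a couple of helpers.  This
is only a sketch: the function names are made up, and the
`<identifier>_marc.xml` file name is inferred from the single example URL
above, so other items may not follow it.

```python
DOWNLOAD_BASE = "http://www.archive.org/download/"


def download_url(oca_id, filename=None):
    """URL of an item's directory, or of one file inside it."""
    url = DOWNLOAD_BASE + oca_id
    if filename:
        url += "/" + filename
    return url


def marc_xml_url(oca_id):
    """URL of the item's MARC XML file, assuming the <identifier>_marc.xml naming."""
    return download_url(oca_id, oca_id + "_marc.xml")
```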

 The only way to get standard identifiers (ISBN, ISSN, OCLC, LCCN) for
 these scanned books is to grab them out

Re: [CODE4LIB] oca api?

2008-02-27 Thread Sebastian Hammer

I concur. The content is open; and the OCA's use of MARC is open... I
think they're waiting for the community to chip in the means and
mechanisms to support whatever open APIs or protocols are deemed useful.

We built a free Z39.50/SRU service based on a crawl through their text
collection, incorporating MARC data where available. It'd be great to
see other organizations contribute funding and/or sweat to build
additional services and tools.

(our stuff is at http://indexdata.com/opencontent/)

--Sebastian

Chris Freeland wrote:

Roy, do you have an answer in mind?

To me & my project it's the content that is open, which is why it's worth
the hurdles.  Once you 'crack the nut' you can grab metadata, scans, and
derivatives and ingest, parse, recombine, remix...as we've done for BHL.

Access to OCA content may not be standards-based, but it works.

Chris

-Original Message-
From: Roy Tennant [EMAIL PROTECTED]
To: CODE4LIB@LISTSERV.ND.EDU CODE4LIB@LISTSERV.ND.EDU
Sent: 2/27/2008 5:28 AM
Subject: Re: [CODE4LIB] oca api?

So what, exactly, is open about this? Anyone care to guess?
Roy


On 2/26/08 10:29 AM, Chris Freeland [EMAIL PROTECTED] wrote:



My guess is that, yes, the query interface we've been discussing here
and the 'all sorts of interfaces that none of us knew about' are the
same.  It's not documented that I'm aware of.  We've found out about it
by literally sitting next to IA developers and asking questions.

Chris
-Original Message-
From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
Jonathan Rochkind
Sent: Tuesday, February 26, 2008 12:18 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] oca api?

So in answer to my question here at the Code4Lib conference, after
Brewster's keynote, Brewster suggests there are all sorts of interfaces
that none of us knew about. Or at least I didn't know about, and haven't
been able to figure out in months of trying!  I'm going to try and
corner him and ask for an email of who we should contact.

Perhaps it's the XML interface that you guys know about already. Is that
documented anywhere? How the heck did you find out about it?

Jonathan




Steve Toub [EMAIL PROTECTED] 02/25/08 9:41 PM 


I'll add that when IA told me about the
http://www.archive.org/services/search.php interface to return XML, they
asked that we not request more than 100 records at a time, since doing more
would adversely affect production services. Which made it seem like OAI-PMH
was a better way to go.

Chris, can you explain a bit more about what this means: "We found their
OAI interface to pull scanned items inconsistently based on date of
scanning"? I'm having trouble parsing.


   --SET




--- Chris Freeland [EMAIL PROTECTED] wrote:



Jonathan - No, I don't believe it's documented - at least not anywhere
publicly.  If any IA/OCA folks are lurking, here's an opportunity to
make a bunch of techies happy...

Chris

-Original Message-
From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
Jonathan Rochkind
Sent: Monday, February 25, 2008 2:48 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] oca api?

I hadn't known this custom query interface existed! This is welcome
news. Is this documented anywhere?

Jonathan




Chris Freeland [EMAIL PROTECTED] 02/25/08 2:51 PM 


Steve & Tim,

I'm the tech director for the Biodiversity Heritage Library (BHL), which
is a consortium of 10 natural history libraries who have partnered with
Internet Archive (IA)/OCA for scanning our collections.  We've just
launched our revamped portal, complete with more than 7,500 books & 2.8
million pages scanned by IA & other digitization partners, at:
http://www.biodiversitylibrary.org

To build this portal we ingest metadata from IA.  We found their OAI
interface to pull scanned items inconsistently based on date of
scanning, so we switched to using their custom query interface.  Here's
an example of a query we fire off:

http://www.archive.org/services/search.php?query=collection:(biodiversity)+AND+updatedate:%5b2007-10-31+TO+2007-11-30%5d+AND+-contributor:(MBLWHOI%20Library)&limit=10&submit=submit

This is returning scanned items from the biodiversity collection,
updated between 10/31/2007 - 11/30/2007, restricted to one of our
contributing libraries (MBLWHOI Library), and limited to 10 results.

The results are styled in the browser; view source to see the good
stuff.  We use this list to grab the identifiers we've yet to ingest.

Some background: When a book is scanned through IA/OCA scanning, they
create their own unique identifier (like annalesacademiae21univ) and
grab a MARC record from the contributing library's catalog.  All of the
scanned files, derivatives, and metadata files are stored on IA's
clusters in a directory named with the identifier.

Steve mentioned using their /details/ directive, then sniffing the page
to get the cluster location and the files for downloading.  An easier
method is to use their /download/ directive

Re: [CODE4LIB] oca api?

2008-02-27 Thread K.G. Schneider
But why are there hurdles?

Karen G. Schneider

On Wed, 27 Feb 2008 07:29:57 -0600, Chris Freeland
[EMAIL PROTECTED] said:
 Roy, do you have an answer in mind?

 To me & my project it's the content that is open, which is why it's worth
 the hurdles.  Once you 'crack the nut' you can grab metadata, scans, and
 derivatives and ingest, parse, recombine, remix...as we've done for BHL.

 Access to OCA content may not be standards-based, but it works.

 Chris

 -Original Message-
 From: Roy Tennant [EMAIL PROTECTED]
 To: CODE4LIB@LISTSERV.ND.EDU CODE4LIB@LISTSERV.ND.EDU
 Sent: 2/27/2008 5:28 AM
 Subject: Re: [CODE4LIB] oca api?

 So what, exactly, is open about this? Anyone care to guess?
 Roy


 On 2/26/08 10:29 AM, Chris Freeland [EMAIL PROTECTED] wrote:

  My guess is that, yes, the query interface we've been discussing here
  and the 'all sorts of interfaces that none of us knew about' are the
  same.  It's not documented that I'm aware of.  We've found out about it
  by literally sitting next to IA developers and asking questions.
 
  Chris
  -Original Message-
  From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
  Jonathan Rochkind
  Sent: Tuesday, February 26, 2008 12:18 PM
  To: CODE4LIB@LISTSERV.ND.EDU
  Subject: Re: [CODE4LIB] oca api?
 
  So in answer to my question here at the Code4Lib conference, after
  Brewster's keynote, Brewster suggests there are all sorts of interfaces
  that none of us knew about. Or at least I didn't know about, and haven't
  been able to figure out in months of trying!  I'm going to try and
  corner him and ask for an email of who we should contact.
 
  Perhaps it's the XML interface that you guys know about already. Is that
  documented anywhere? How the heck did you find out about it?
 
  Jonathan
 
 
  Steve Toub [EMAIL PROTECTED] 02/25/08 9:41 PM 
  I'll add that when IA told me about the
  http://www.archive.org/services/search.php interface to return XML, they
  asked that we not request more than 100 records at a time, since doing more
  would adversely affect production services. Which made it seem like OAI-PMH
  was a better way to go.

  Chris, can you explain a bit more about what this means: "We found their
  OAI interface to pull scanned items inconsistently based on date of
  scanning"? I'm having trouble parsing.
 
 
 --SET
 
 
 
 
  --- Chris Freeland [EMAIL PROTECTED] wrote:
 
  Jonathan - No, I don't believe it's documented - at least not anywhere
  publicly.  If any IA/OCA folks are lurking, here's an opportunity to
  make a bunch of techies happy...
 
  Chris
 
  -Original Message-
  From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
  Jonathan Rochkind
  Sent: Monday, February 25, 2008 2:48 PM
  To: CODE4LIB@LISTSERV.ND.EDU
  Subject: Re: [CODE4LIB] oca api?
 
  I hadn't known this custom query interface existed! This is welcome
  news. Is this documented anywhere?
 
  Jonathan
 
 
  Chris Freeland [EMAIL PROTECTED] 02/25/08 2:51 PM 
  Steve & Tim,

  I'm the tech director for the Biodiversity Heritage Library (BHL), which
  is a consortium of 10 natural history libraries who have partnered with
  Internet Archive (IA)/OCA for scanning our collections.  We've just
  launched our revamped portal, complete with more than 7,500 books & 2.8
  million pages scanned by IA & other digitization partners, at:
  http://www.biodiversitylibrary.org

  To build this portal we ingest metadata from IA.  We found their OAI
  interface to pull scanned items inconsistently based on date of
  scanning, so we switched to using their custom query interface.  Here's
  an example of a query we fire off:

  http://www.archive.org/services/search.php?query=collection:(biodiversity)+AND+updatedate:%5b2007-10-31+TO+2007-11-30%5d+AND+-contributor:(MBLWHOI%20Library)&limit=10&submit=submit
 
  This is returning scanned items from the biodiversity collection,
  updated between 10/31/2007 - 11/30/2007, restricted to one of our
  contributing libraries (MBLWHOI Library), and limited to 10 results.
 
  The results are styled in the browser; view source to see the good
  stuff.  We use this list to grab the identifiers we've yet to ingest.
 
  Some background: When a book is scanned through IA/OCA scanning, they
  create their own unique identifier (like annalesacademiae21univ) and
  grab a MARC record from the contributing library's catalog.  All of the
  scanned files, derivatives, and metadata files are stored on IA's
  clusters in a directory named with the identifier.

  Steve mentioned using their /details/ directive, then sniffing the page
  to get the cluster location and the files for downloading.  An easier
  method is to use their /download/ directive, as in:
 
  http://www.archive.org/download/ID$, or in the example above:
  http://www.archive.org/download/annalesacademiae21univ
 
  That automatically does a lookup on the cluster, which means you don't
  have to scrape info off pages.  You can also

Re: [CODE4LIB] oca api?

2008-02-27 Thread Tim Shearer

I see it as open in the way that google books is not.

But, a huge part of being open is the provision of *access* and so having
easy, documented APIs (in the way that google often does) would make this
a whole lot easier to leverage.

Still, it's a good thing and I'm pleased to have the opportunity to be
frustrated!

-t

On Wed, 27 Feb 2008, Sebastian Hammer wrote:


I concur. The content is open; and the OCA's use of MARC is open... I
think they're waiting for the community to chip in the means and
mechanisms to support whatever open APIs or protocols are deemed useful.

We built a free Z39.50/SRU service based on a crawl through their text
collection, incorporating MARC data where available. It'd be great to
see other organizations contribute funding and/or sweat to build
additional services and tools.

(our stuff is at http://indexdata.com/opencontent/)

--Sebastian

Chris Freeland wrote:

Roy, do you have an answer in mind?

To me & my project it's the content that is open, which is why it's worth
the hurdles.  Once you 'crack the nut' you can grab metadata, scans, and
derivatives and ingest, parse, recombine, remix...as we've done for BHL.

Access to OCA content may not be standards-based, but it works.

Chris

-Original Message-
From: Roy Tennant [EMAIL PROTECTED]
To: CODE4LIB@LISTSERV.ND.EDU CODE4LIB@LISTSERV.ND.EDU
Sent: 2/27/2008 5:28 AM
Subject: Re: [CODE4LIB] oca api?

So what, exactly, is open about this? Anyone care to guess?
Roy


On 2/26/08 10:29 AM, Chris Freeland [EMAIL PROTECTED] wrote:



My guess is that, yes, the query interface we've been discussing here
and the 'all sorts of interfaces that none of us knew about' are the
same.  It's not documented that I'm aware of.  We've found out about it
by literally sitting next to IA developers and asking questions.

Chris
-Original Message-
From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
Jonathan Rochkind
Sent: Tuesday, February 26, 2008 12:18 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] oca api?

So in answer to my question here at the Code4Lib conference, after
Brewster's keynote, Brewster suggests there are all sorts of interfaces
that none of us knew about. Or at least I didn't know about, and haven't
been able to figure out in months of trying!  I'm going to try and
corner him and ask for an email of who we should contact.

Perhaps it's the XML interface that you guys know about already. Is that
documented anywhere? How the heck did you find out about it?

Jonathan




Steve Toub [EMAIL PROTECTED] 02/25/08 9:41 PM 


I'll add that when IA told me about the
http://www.archive.org/services/search.php interface to return XML, they
asked that we not request more than 100 records at a time, since doing more
would adversely affect production services. Which made it seem like OAI-PMH
was a better way to go.

Chris, can you explain a bit more about what this means: "We found their
OAI interface to pull scanned items inconsistently based on date of
scanning"? I'm having trouble parsing.


   --SET




--- Chris Freeland [EMAIL PROTECTED] wrote:



Jonathan - No, I don't believe it's documented - at least not anywhere
publicly.  If any IA/OCA folks are lurking, here's an opportunity to
make a bunch of techies happy...

Chris

-Original Message-
From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
Jonathan Rochkind
Sent: Monday, February 25, 2008 2:48 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] oca api?

I hadn't known this custom query interface existed! This is welcome
news. Is this documented anywhere?

Jonathan




Chris Freeland [EMAIL PROTECTED] 02/25/08 2:51 PM 


Steve & Tim,

I'm the tech director for the Biodiversity Heritage Library (BHL), which
is a consortium of 10 natural history libraries who have partnered with
Internet Archive (IA)/OCA for scanning our collections.  We've just
launched our revamped portal, complete with more than 7,500 books & 2.8
million pages scanned by IA & other digitization partners, at:
http://www.biodiversitylibrary.org

To build this portal we ingest metadata from IA.  We found their OAI
interface to pull scanned items inconsistently based on date of
scanning, so we switched to using their custom query interface.  Here's
an example of a query we fire off:

http://www.archive.org/services/search.php?query=collection:(biodiversity)+AND+updatedate:%5b2007-10-31+TO+2007-11-30%5d+AND+-contributor:(MBLWHOI%20Library)&limit=10&submit=submit

This is returning scanned items from the biodiversity collection,
updated between 10/31/2007 - 11/30/2007, restricted to one of our
contributing libraries (MBLWHOI Library), and limited to 10 results.

The results are styled in the browser; view source to see the good
stuff.  We use this list to grab the identifiers we've yet to ingest.

Some background: When a book is scanned through IA/OCA scanning, they
create their own unique identifier (like annalesacademiae21univ) and
grab a MARC

Re: [CODE4LIB] oca api?

2008-02-26 Thread Chris Freeland
Steve - I'm not sure about the scalability of the query interface, so
hopefully someone from IA can comment.

The biggest problem we found with the OAI implementation had to do with
pulling incremental updates.  If you ask for a date range like Dec 1 - 5
you get all of Dec.  When we discussed this with IA we were shown the
query interface and just decided to use that instead since we're doing
mostly incremental updates.

The date inconsistency might not be enough to drive folks away from OAI
if you're looking to do one-time, or infrequent, harvests.

Chris
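
A minimal sketch of one way to live with the date-range behavior described above: ask for the window with the standard OAI-PMH from/until arguments, then re-check each datestamp locally in case the server returns a wider range. The function names are hypothetical, and the set name is only the one quoted elsewhere in this thread.

```python
from datetime import date
from urllib.parse import urlencode

# Base URL as quoted elsewhere in this thread.
OAI_BASE = "http://www.archive.org/services/oai.php"


def list_identifiers_url(start, end, oai_set=None):
    """Build an OAI-PMH ListIdentifiers URL bounded by from/until dates."""
    params = {
        "verb": "ListIdentifiers",
        "metadataPrefix": "oai_dc",
        "from": start.isoformat(),
        "until": end.isoformat(),
    }
    if oai_set:
        params["set"] = oai_set
    return OAI_BASE + "?" + urlencode(params)


def within_range(headers, start, end):
    """Keep only (identifier, datestamp) pairs dated inside [start, end].

    This is a client-side guard for a server that returns more than
    the requested window (for example, a whole month).
    """
    kept = []
    for identifier, datestamp in headers:
        day = date.fromisoformat(datestamp[:10])  # tolerate full timestamps
        if start <= day <= end:
            kept.append((identifier, datestamp))
    return kept
```

The local filter costs a little bandwidth when the server over-returns, so it suits infrequent harvests better than tight incremental ones.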

-Original Message-
From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
Steve Toub
Sent: Monday, February 25, 2008 8:41 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] oca api?

I'll add that when IA told me about the
http://www.archive.org/services/search.php interface to return XML,
they asked that we not send more than 100 records at a time, since
doing more would adversely affect production services. Which made it
seem like OAI-PMH was a better way to go.

Chris, can you explain a bit more about what this means: "We found their
OAI interface to pull scanned items inconsistently based on date of
scanning"? I'm having trouble parsing.


   --SET




--- Chris Freeland [EMAIL PROTECTED] wrote:

 Jonathan - No, I don't believe it's documented - at least not anywhere
 publicly.  If any IA/OCA folks are lurking, here's an opportunity to
 make a bunch of techies happy...

 Chris

 -Original Message-
 From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
 Jonathan Rochkind
 Sent: Monday, February 25, 2008 2:48 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] oca api?

 I hadn't known this custom query interface existed! This is welcome
 news. Is this documented anywhere?

 Jonathan


Re: [CODE4LIB] oca api?

2008-02-26 Thread Eric Lease Morgan

On Feb 26, 2008, at 12:21 PM, Chris Freeland wrote:


The biggest problem we found with the OAI implementation had to do with
pulling incremental updates.  If you ask for a date range like Dec 1 - 5
you get all of Dec.  When we discussed this with IA we were shown the
query interface and just decided to use that instead since we're doing
mostly incremental updates.




Incidentally, I was asked a few months ago about incorporating Open
Library and/or Internet Archive material into a service I (barely)
maintain called Ockham Alert. I told them I would be happy to do so,
but since Ockham Alert relies on OAI date ranges, and their date
ranges did not work, I was unable to oblige them. I suppose the date
issue with their OAI implementation is a known issue.

--
Eric Lease Morgan
University Libraries of Notre Dame

(574) 631-8604


Re: [CODE4LIB] oca api?

2008-02-26 Thread Jonathan Rochkind
So in answer to my question here at the Code4Lib conference, after Brewster's 
keynote, Brewster suggests there are all sorts of interfaces that none of us 
knew about. Or at least I didn't know about, and haven't been able to figure 
out in months of trying!  I'm going to try and corner him and ask for an email 
of who we should contact.

Perhaps it's the XML interface that you guys know about already. Is that 
documented anywhere? How the heck did you find out about it?

Jonathan


Re: [CODE4LIB] oca api?

2008-02-25 Thread Steve Toub
--- Tim Shearer [EMAIL PROTECTED] wrote:

 Hi Folks,

 I'm looking into tapping the texts in the Open Content Alliance.

 A few questions...

 As near as I can tell, they don't expose (perhaps even store?) any common
 unique identifiers (oclc number, issn, isbn, loc number).

I poked around in this world a few months ago in my previous job at
California Digital Library, also an OCA partner.

The unique key seems to be a text string identifier (one that seems to be
completely different from the text string identifier in Open Library).
Apparently there was talk at the last partner meeting about moving to
ISBNs:
http://dilettantes.code4lib.org/2007/10/22/tales-from-the-open-content-alliance/

To obtain identifiers in bulk, I think the recommended approach is the
OAI-PMH interface, which seems more reliable in recent months:

http://www.archive.org/services/oai.php?verb=Identify

http://www.archive.org/services/oai.php?verb=ListIdentifiers&metadataPrefix=oai_dc&set=collection:cdl

etc.
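
As a rough sketch of what consuming a ListIdentifiers response could look like: the parser below pulls the identifiers and any resumptionToken out of an OAI-PMH reply. The element names and namespace come from the OAI-PMH 2.0 spec; the sample XML is hand-written, and the "oai:archive.org:" identifier prefix and the token value are assumptions, not something confirmed in this thread.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def parse_list_identifiers(xml_text):
    """Return (identifiers, resumption_token) from a ListIdentifiers reply."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text.strip()
        for el in root.findall(".//oai:header/oai:identifier", OAI_NS)
        if el.text
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    has_token = token_el is not None and token_el.text
    token = token_el.text.strip() if has_token else None
    return identifiers, token


# Hand-written sample; identifier prefix and token are illustrative only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:archive.org:chemicallecturee00newtrich</identifier>
      <datestamp>2007-11-02</datestamp>
    </header>
    <resumptionToken>hypothetical-token</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""

identifiers, token = parse_list_identifiers(SAMPLE)
```

When a token comes back, the next request would pass it as the resumptionToken argument in place of metadataPrefix and set, per the protocol.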


Additional instructions if you want to grab the content files.

From any book's metadata page (e.g.,
http://www.archive.org/details/chemicallecturee00newtrich),
click through on the "Usage Rights: See Terms" link; the rights are on a
pane on the left-hand side.

Once you know the identifier, you can grab the content files, using this syntax:
http://www.archive.org/details/$ID
Like so:
http://www.archive.org/details/chemicallecturee00newtrich

And then sniff the page to find the FTP link:
ftp://ia340915.us.archive.org/2/items/chemicallecturee00newtrich

But I think they prefer to use HTTP for these, not the FTP, so switch this to:
http://ia340915.us.archive.org/2/items/chemicallecturee00newtrich
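
The two URL steps above could be sketched like this. Finding the FTP link on the details page is left out, since the page markup isn't described here; the helper names are hypothetical.

```python
from urllib.parse import urlsplit


def details_url(identifier):
    """Metadata page for an item, following the /details/$ID pattern."""
    return "http://www.archive.org/details/" + identifier


def ftp_to_http(link):
    """Swap an ftp:// item link for its http:// equivalent.

    Host and path are kept as-is; links with any other scheme are
    returned unchanged.
    """
    parts = urlsplit(link)
    if parts.scheme != "ftp":
        return link
    return parts._replace(scheme="http").geturl()


http_link = ftp_to_http(
    "ftp://ia340915.us.archive.org/2/items/chemicallecturee00newtrich"
)
```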

Hope this helps!

  --SET


 We're a contributor so I can use curl to grab our records via http (and
 regexp my way to our local catalog identifiers, which they do
 store/expose).

 I've played a bit with the z39.50 interface at indexdata
 (http://www.indexdata.dk/opencontent/), but I'm not confident about the
 content behind it.  I get very limited results, for instance I can't find
 any UNC records and we're fairly new to the game.

 Again, I'm looking for unique identifiers in what I can get back and it's
 slim pickings.

 Anyone cracked this nut?  Got any life lessons for me?

 Thanks!
 Tim

 +++
 Tim Shearer

 Web Development Coordinator
 The University Library
 University of North Carolina at Chapel Hill
 [EMAIL PROTECTED]
 919-962-1288
 +++



Re: [CODE4LIB] oca api?

2008-02-25 Thread Tennant,Roy
Well, from where Chris left off it would be fairly easy to check for a
file in the directory with a marc.xml filename extension, then XSLT
for:

<datafield tag="010" ind1=" " ind2=" ">
  <subfield code="a">39004822</subfield>
</datafield>

If such exists, then you'll have the LCCN (tag 010; an ISBN, where
present, would be in tag 020). To sweeten it further, send an ISBN into
xISBN or ThingISBN and get other ISBNs for the same work.
This seems completely scriptable to me. Perhaps someone at c4l will have
it done before the conference is over. And Tim, the example above is one
that's in your catalog.
Roy
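
A sketch of the same lookup in Python rather than XSLT, assuming the _marc.xml files use the standard MARCXML namespace (not verified against IA's files). The tag table follows MARC 21: 010 is the LCCN, 020 the ISBN, 022 the ISSN. OCLC numbers are left out because where they sit varies by catalog.

```python
import xml.etree.ElementTree as ET

MARC_URI = "http://www.loc.gov/MARC21/slim"
MARC_NS = {"marc": MARC_URI}

# MARC 21 tags for the standard identifiers discussed in this thread.
ID_TAGS = {"010": "lccn", "020": "isbn", "022": "issn"}


def standard_identifiers(marcxml_text):
    """Collect subfield $a values for LCCN/ISBN/ISSN fields in a record."""
    root = ET.fromstring(marcxml_text)
    found = {}
    for field in root.iter("{%s}datafield" % MARC_URI):
        name = ID_TAGS.get(field.get("tag"))
        if name is None:
            continue
        for sub in field.findall("marc:subfield[@code='a']", MARC_NS):
            value = (sub.text or "").strip()
            if value:
                found.setdefault(name, []).append(value)
    return found


# Hand-built record using the field quoted above.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="010" ind1=" " ind2=" ">
    <subfield code="a">   39004822 </subfield>
  </datafield>
</record>"""

ids = standard_identifiers(SAMPLE)
```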

-Original Message-
From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
Chris Freeland
Sent: Monday, February 25, 2008 11:51 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] oca api?

Steve & Tim,

I'm the tech director for the Biodiversity Heritage Library (BHL), which
is a consortium of 10 natural history libraries who have partnered with
Internet Archive (IA)/OCA for scanning our collections.  We've just
launched our revamped portal, complete with more than 7,500 books & 2.8
million pages scanned by IA & other digitization partners, at:
http://www.biodiversitylibrary.org

To build this portal we ingest metadata from IA.  We found their OAI
interface to pull scanned items inconsistently based on date of
scanning, so we switched to using their custom query interface.  Here's
an example of a query we fire off:

http://www.archive.org/services/search.php?query=collection:(biodiversity)+AND+updatedate:%5b2007-10-31+TO+2007-11-30%5d+AND+-contributor:(MBLWHOI%20Library)&limit=10&submit=submit

This is returning scanned items from the biodiversity collection,
updated between 10/31/2007 - 11/30/2007, restricted to one of our
contributing libraries (MBLWHOI Library), and limited to 10 results.
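
A hedged sketch of assembling that query in code, assuming the query, limit and submit parameters are joined with "&" in the usual way. The function name and its arguments are hypothetical. One caution: in Lucene-style syntax a leading "-" normally excludes a term, so whether the contributor clause restricts or excludes is left to the caller here.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint as given in the message above.
SEARCH_ENDPOINT = "http://www.archive.org/services/search.php"


def build_search_url(collection, start, end, contributor=None,
                     exclude_contributor=False, limit=10):
    """Build a search.php URL for items updated between two dates."""
    clauses = [
        "collection:(%s)" % collection,
        "updatedate:[%s TO %s]" % (start, end),
    ]
    if contributor:
        prefix = "-" if exclude_contributor else ""
        clauses.append("%scontributor:(%s)" % (prefix, contributor))
    params = {
        "query": " AND ".join(clauses),
        "limit": limit,
        "submit": "submit",
    }
    # urlencode percent-encodes more characters than the hand-written
    # URL does, but the decoded query string is the same.
    return SEARCH_ENDPOINT + "?" + urlencode(params)


url = build_search_url("biodiversity", "2007-10-31", "2007-11-30",
                       contributor="MBLWHOI Library",
                       exclude_contributor=True)
decoded = parse_qs(urlsplit(url).query)
```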

The results are styled in the browser; view source to see the good
stuff.  We use this list to grab the identifiers we've yet to ingest.

Some background: When a book is scanned through IA/OCA scanning, they
create their own unique identifier (like annalesacademiae21univ) and
grab a MARC record from the contributing library's catalog.  All of the
scanned files, derivatives, and metadata files are stored on IA's
clusters in a directory named with the identifier.

Steve mentioned using their /details/ directive, then sniffing the page
to get the cluster location and the files for downloading.  An easier
method is to use their /download/ directive, as in:

http://www.archive.org/download/$ID, or in the example above:
http://www.archive.org/download/annalesacademiae21univ

That automatically does a lookup on the cluster, which means you don't
have to scrape info off pages.  You can also address any files within
that directory, as in:
http://www.archive.org/download/annalesacademiae21univ/annalesacademiae21univ_marc.xml

The only way to get standard identifiers (ISBN, ISSN, OCLC, LCCN) for
these scanned books is to grab them out of the MARC record.  So the
long-winded answer to your question, Tim, is no, there's no simple way
to crossref what IA has scanned with your catalog - THAT I KNOW OF.  Big
caveat on that last part.
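
The /download/ pattern described above is easy to wrap in a helper. This is only a sketch: the "<identifier>_marc.xml" naming is inferred from the single example in this message and may not hold for every item.

```python
DOWNLOAD_BASE = "http://www.archive.org/download/"


def download_url(identifier, filename=None):
    """URL for an item's directory, or for one file inside it."""
    url = DOWNLOAD_BASE + identifier
    if filename:
        url += "/" + filename
    return url


def marc_xml_url(identifier):
    """Guess the MARCXML file URL (naming pattern is an assumption)."""
    return download_url(identifier, identifier + "_marc.xml")


marc_url = marc_xml_url("annalesacademiae21univ")
```

Fetching that URL and reading the identifier fields out of the record would be the per-item step for any crosswalk.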

Happy to help with any other questions I can,

Chris Freeland


Re: [CODE4LIB] oca api?

2008-02-25 Thread Tim Shearer

Yup,

Chris' email was exactly what I was hoping for.  Now if there were a nice
way to pre-screen for records that don't have empty (isbn|issn|oclc#)
without all the work of looking per record (and the overhead for the
server, and the overhead if more than one organization starts to do this).

I guess I want to search for uniqueID != NULL and only get their unique id
back, and script from there.

Still and all, this now seems a very doable thing.

Chris, many thanks!
-t
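
This doesn't give the server-side "uniqueID != NULL" search wished for above, but a local table built once from harvested records would avoid repeating per-record lookups. A minimal sketch, with hypothetical names and made-up identifier values:

```python
ID_KINDS = ("isbn", "issn", "oclc")


def has_standard_id(identifiers):
    """True if at least one ISBN, ISSN, or OCLC number is present."""
    return any(identifiers.get(kind) for kind in ID_KINDS)


def build_crosswalk(records):
    """Map (kind, value) to an IA identifier, skipping empty records."""
    crosswalk = {}
    for ia_id, identifiers in records:
        if not has_standard_id(identifiers):
            continue
        for kind in ID_KINDS:
            for value in identifiers.get(kind, []):
                crosswalk[(kind, value)] = ia_id
    return crosswalk


# IA identifiers are from this thread; the OCLC value is invented.
RECORDS = [
    ("annalesacademiae21univ", {"oclc": ["000000001"]}),
    ("chemicallecturee00newtrich", {"isbn": [], "issn": []}),
]

crosswalk = build_crosswalk(RECORDS)
```

A catalog page could then look up its own OCLC number or ISBN in the table and link straight to the matching item.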

-Original Message-
From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
Steve Toub
Sent: Sunday, February 24, 2008 11:20 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] oca api?

--- Tim Shearer [EMAIL PROTECTED] wrote:


Hi Folks,

I'm looking into tapping the texts in the Open Content Alliance.

A few questions...

As near as I can tell, they don't expose (perhaps even store?) any common
unique identifiers (oclc number, issn, isbn, loc number).


I poked around in this world a few months ago in my previous job at
California Digital Library, also an OCA partner.

The unique key seems to be a text string identifier (one that seems to be
completely different from the text string identifier in Open Library).
Apparently there was talk at the last partner meeting about moving to
ISBNs:
http://dilettantes.code4lib.org/2007/10/22/tales-from-the-open-content-alliance/

To obtain identifiers in bulk, I think the recommended approach is the
OAI-PMH interface, which seems more reliable in recent months:

http://www.archive.org/services/oai.php?verb=Identify

http://www.archive.org/services/oai.php?verb=ListIdentifiers&metadataPrefix=oai_dc&set=collection:cdl

etc.


Additional instructions if you want to grab the content files.


From any book's metadata page (e.g