[CODE4LIB] musing on oca api (Re: [CODE4LIB] oca api?)

Tim Shearer Thu, 06 Mar 2008 05:51:04 -0800

Howdy folks,

I've been playing and thinking.  I'd like to have what amounts to a unique
identifier index to oca digitized texts.  I want to be able to pull all the
records that have oclc numbers, issns, isbns, etc.  I want it to be
lightweight, fast, searchable.


Would anyone else want/use such a thing?

I'm thinking about building something like this.

If I do, it would be ideal if it weren't a duplication of effort, so does anyone
have this in the works?  It would also be ideal if it met the needs of others.

My basic notion is to crawl the site (starting with "americana", the American
Libraries), pull the oca unique identifier (e.g. northcarolinayea1910rale), and
associate it with

unique identifiers (oclc numbers, issns, isbns, lc numbers)
contributing institution's alias and unique catalog identifier
upload date

That's all I was thinking of.  Then there's what you might be able to do with
it:

       Give me all the oca unique identifiers that have oclc numbers
       Give me all the oca unique identifiers with isbns that were
               uploaded between x and y date
       Give me the oca unique identifier for this oclc number
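
A minimal sketch of how such an index and the three example queries might look in sqlite, driven from Python. The table names, column names, and helper functions are hypothetical and chosen only for illustration; nothing here comes from OCA or IA.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative only.
SCHEMA = """
CREATE TABLE item (
    oca_id      TEXT PRIMARY KEY,   -- e.g. northcarolinayea1910rale
    contributor TEXT,               -- contributing institution's alias
    catalog_id  TEXT,               -- contributor's own catalog identifier
    upload_date TEXT                -- ISO 8601 (YYYY-MM-DD), so text comparison sorts by date
);
CREATE TABLE std_id (
    oca_id  TEXT NOT NULL REFERENCES item(oca_id),
    id_type TEXT NOT NULL,          -- 'oclc', 'issn', 'isbn' or 'lccn'
    value   TEXT NOT NULL
);
CREATE INDEX std_id_lookup ON std_id (id_type, value);
"""


def open_index(path=":memory:"):
    """Open (and initialise) the index; defaults to an in-memory database."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_item(conn, oca_id, contributor, catalog_id, upload_date, ids):
    """Store one crawled item; ids is a list of (id_type, value) pairs."""
    conn.execute("INSERT INTO item VALUES (?, ?, ?, ?)",
                 (oca_id, contributor, catalog_id, upload_date))
    conn.executemany("INSERT INTO std_id VALUES (?, ?, ?)",
                     [(oca_id, id_type, value) for id_type, value in ids])


def ids_with(conn, id_type, start=None, end=None):
    """OCA identifiers that have an id of the given type, optionally limited
    to an inclusive upload-date range."""
    sql = ("SELECT DISTINCT item.oca_id FROM item "
           "JOIN std_id ON std_id.oca_id = item.oca_id "
           "WHERE std_id.id_type = ?")
    args = [id_type]
    if start is not None and end is not None:
        sql += " AND item.upload_date BETWEEN ? AND ?"
        args += [start, end]
    return [row[0] for row in conn.execute(sql + " ORDER BY item.oca_id", args)]


def lookup(conn, id_type, value):
    """OCA identifiers matching one standard identifier, e.g. an OCLC number."""
    rows = conn.execute(
        "SELECT oca_id FROM std_id WHERE id_type = ? AND value = ? ORDER BY oca_id",
        (id_type, value))
    return [row[0] for row in rows]
```

With this layout, ids_with(conn, "oclc") covers the first query, ids_with(conn, "isbn", x, y) the second, and lookup(conn, "oclc", number) the third. Keeping the standard identifiers in their own table lets one item carry several ISBNs or none at all.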

Planning to do:

       keep crawling it and keep it up to date.

Things I wasn't planning to do:

       worry about other unique ids (you'd have to go to xISBN or
               ThingISBN yourself)
       worry about storing anything else from oca.

It would be good for adding an 856 to matches in your catalog. It
would not be good for grabbing all marc records for all of oca.

Anyhow, is this duplication of effort?  Would you like something like this?
What else would you like it to do (keeping in mind this is an unfunded pet
project)?  How would you want to talk to it?  I was thinking of a web service,
but hadn't thought too much about how to query it or how I'd deliver results.

Of course I'm being an idiot and trying out new tools at the same time (python
to see what the buzz is all about, sqlite just to learn it (it may not work
out)).

Thoughts?  Vicious criticism?

-t


On Tue, 26 Feb 2008, Chris Freeland wrote:

My guess is that, yes, the query interface we've been discussing here
and the 'all sorts of interfaces that none of us knew about' are the
same.  It's not documented that I'm aware of.  We've found out about it
by literally sitting next to IA developers and asking questions.

Chris
-----Original Message-----
From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
Jonathan Rochkind
Sent: Tuesday, February 26, 2008 12:18 PM
To: [email protected]
Subject: Re: [CODE4LIB] oca api?

So in answer to my question here at the Code4Lib conference, after
Brewster's keynote, Brewster suggests there are all sorts of interfaces
that none of us knew about. Or at least I didn't know about, and haven't
been able to figure out in months of trying!  I'm going to try and
corner him and ask for an email of who we should contact.

Perhaps it's the XML interface that you guys know about already. Is that
documented anywhere? How the heck did you find out about it?

Jonathan

Steve Toub <[EMAIL PROTECTED]> 02/25/08 9:41 PM >>>

I'll add that when IA told me about the
http://www.archive.org/services/search.php interface that returns XML, they
asked that we not request more than 100 records at a time, since doing more
would adversely affect production services. Which made it seem like OAI-PMH
was a better way to go.

Chris, can you explain a bit more about what this means: "We found their
OAI interface to pull scanned items inconsistently based on date of
scanning...."? I'm having trouble parsing.


  --SET




--- Chris Freeland <[EMAIL PROTECTED]> wrote:

Jonathan - No, I don't believe it's documented - at least not anywhere
publicly.  If any IA/OCA folks are lurking, here's an opportunity to
make a bunch of techies happy...

Chris

-----Original Message-----
From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
Jonathan Rochkind
Sent: Monday, February 25, 2008 2:48 PM
To: [email protected]
Subject: Re: [CODE4LIB] oca api?

I hadn't known this "custom query interface" existed! This is welcome
news. Is this documented anywhere?

Jonathan

Chris Freeland <[EMAIL PROTECTED]> 02/25/08 2:51 PM >>>

Steve & Tim,

I'm the tech director for the Biodiversity Heritage Library (BHL), which
is a consortium of 10 natural history libraries who have partnered with
Internet Archive (IA)/OCA for scanning our collections.  We've just
launched our revamped portal, complete with more than 7,500 books & 2.8
million pages scanned by IA & other digitization partners, at:
http://www.biodiversitylibrary.org

To build this portal we ingest metadata from IA.  We found their OAI
interface to pull scanned items inconsistently based on date of
scanning, so we switched to using their custom query interface.  Here's
an example of a query we fire off:

http://www.archive.org/services/search.php?query=collection:(biodiversity)+AND+updatedate:%5b2007-10-31+TO+2007-11-30%5d+AND+-contributor:(MBLWHOI%20Library)&limit=10&submit=submit

This is returning scanned items from the "biodiversity" collection,
updated between 10/31/2007 - 11/30/2007, restricted to one of our
contributing libraries (MBLWHOI Library), and limited to 10 results.
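
A small sketch of assembling a query URL like the one above in Python. The endpoint and the parameter names (query, limit, submit) are taken from the example URL; the function name and its arguments are hypothetical. One caution: the example URL has a leading "-" on the contributor clause, and in Lucene-style query syntax, which this query resembles, a leading "-" usually excludes matches. The sketch therefore passes any extra clause through unchanged and does not guess which was intended.

```python
from urllib.parse import urlencode

# Endpoint as quoted in the thread; it is undocumented and may change.
SEARCH_URL = "http://www.archive.org/services/search.php"


def build_search_url(collection, start, end, limit=10, extra_clause=None):
    """Build a search.php URL for items in a collection updated in a date range.

    start and end are YYYY-MM-DD strings. extra_clause is an optional raw
    query fragment (for example a contributor clause) appended with AND.
    """
    clauses = [
        "collection:(%s)" % collection,
        "updatedate:[%s TO %s]" % (start, end),
    ]
    if extra_clause:
        clauses.append(extra_clause)
    params = [
        ("query", " AND ".join(clauses)),
        ("limit", str(limit)),       # the thread mentions a request to stay at or below 100
        ("submit", "submit"),
    ]
    # urlencode percent-encodes reserved characters and turns spaces into "+".
    return SEARCH_URL + "?" + urlencode(params)
```

The generated URL percent-encodes the colons, parentheses, and brackets, so it will not be byte-for-byte identical to the hand-written example, but it decodes to the same query string.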

The results are styled in the browser; view source to see the good
stuff.  We use this list to grab the identifiers we've yet to ingest.

Some background: When a book is scanned through IA/OCA scanning, they
create their own unique identifier (like "annalesacademiae21univ") and
grab a MARC record from the contributing library's catalog.  All of the
scanned files, derivatives, and metadata files are stored on IA's
clusters in a directory named with the identifier.

Steve mentioned using their /details/ directive, then sniffing the page
to get the cluster location and the files for downloading.  An easier
method is to use their /download/ directive, as in:

http://www.archive.org/download/$ID, or in the example above:
http://www.archive.org/download/annalesacademiae21univ

That automatically does a lookup on the cluster, which means you don't
have to scrape info off pages.  You can also address any files within
that directory, as in:

http://www.archive.org/download/annalesacademiae21univ/annalesacademiae21univ_marc.xml
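
The /download/ addressing described above can be sketched as two small Python helpers. The base URL comes from the thread; the function names are hypothetical, and the "<identifier>_marc.xml" naming is inferred from the single example above, so treat it as an assumption and not a documented rule.

```python
from urllib.parse import quote

# Base URL as quoted in the thread.
DOWNLOAD_BASE = "http://www.archive.org/download"


def download_url(identifier, filename=None):
    """URL of an item's directory, or of one file inside it."""
    base = "%s/%s" % (DOWNLOAD_BASE, quote(identifier, safe=""))
    if filename is None:
        return base
    return "%s/%s" % (base, quote(filename))


def marc_xml_url(identifier):
    """URL of the item's MARC XML file.

    Assumes the file is named <identifier>_marc.xml, as in the thread's example.
    """
    return download_url(identifier, identifier + "_marc.xml")
```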

The only way to get standard identifiers (ISBN, ISSN, OCLC, LCCN) for
these scanned books is to grab them out of the MARC record.  So the
long-winded answer to your question, Tim, is no, there's no simple way
to crossref what IA has scanned with your catalog - THAT I KNOW OF.  Big
caveat on that last part.
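
A hedged sketch of pulling those identifiers out of a MARC XML record with Python's standard library. It assumes the _marc.xml file is MARCXML (datafield and subfield elements) and relies on the usual MARC 21 placements: LCCN in 010 $a, ISBN in 020 $a, ISSN in 022 $a, and OCLC numbers in 035 $a with an "(OCoLC)" prefix. Records that keep the OCLC number only in the 001 control field would be missed. The function name is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# MARC 21 tag for each identifier type; subfield $a holds the value.
TAGS = {"010": "lccn", "020": "isbn", "022": "issn", "035": "oclc"}


def _local(tag):
    """Strip any XML namespace so namespaced and bare records both work."""
    return tag.rsplit("}", 1)[-1]


def standard_ids(marc_xml):
    """Return (id_type, value) pairs found in a MARCXML string, in document order."""
    found = []
    for field in ET.fromstring(marc_xml).iter():
        if _local(field.tag) != "datafield":
            continue
        id_type = TAGS.get(field.get("tag"))
        if id_type is None:
            continue
        for sub in field:
            if _local(sub.tag) != "subfield" or sub.get("code") != "a":
                continue
            value = (sub.text or "").strip()
            if id_type == "oclc":
                # 035 holds many kinds of system numbers; keep only OCLC ones
                # and drop any letter prefix such as "ocm".
                match = re.match(r"\(OCoLC\)\D*(\d+)", value)
                if not match:
                    continue
                value = match.group(1)
            elif id_type == "isbn" and value:
                # 020 $a may carry a qualifier, e.g. "0123456789 (pbk.)".
                value = value.split()[0]
            if value:
                found.append((id_type, value))
    return found
```

The pairs it returns are in the same (type, value) shape a lightweight identifier index would store.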

Happy to help with any other questions I can,

Chris Freeland


-----Original Message-----
From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
Steve Toub
Sent: Sunday, February 24, 2008 11:20 PM
To: [email protected]
Subject: Re: [CODE4LIB] oca api?

--- Tim Shearer <[EMAIL PROTECTED]> wrote:

Hi Folks,

I'm looking into tapping the texts in the Open Content Alliance.

A few questions...

As near as I can tell, they don't expose (perhaps even store?) any common
unique identifiers (oclc number, issn, isbn, loc number).


I poked around in this world a few months ago in my previous job at
California Digital Library,
also an OCA partner.

The unique key seems to be a text string identifier (one that seems to be
completely different from the text string identifier in Open Library).
Apparently there was talk at the last partner meeting about moving to ISBNs:

http://dilettantes.code4lib.org/2007/10/22/tales-from-the-open-content-alliance/

To obtain identifiers in bulk, I think the recommended approach is the
OAI-PMH interface, which seems more reliable in recent months:

http://www.archive.org/services/oai.php?verb=Identify

http://www.archive.org/services/oai.php?verb=ListIdentifiers&metadataPrefix=oai_dc&set=collection:cdl

etc.
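
For harvesting in bulk, a ListIdentifiers response may arrive in pages linked by a resumptionToken, as the OAI-PMH 2.0 protocol allows. Below is a minimal Python sketch of building the request URLs and parsing one response. The endpoint is the one quoted above; the function names are hypothetical, and fetching is left out so the sketch stays independent of how the server behaves.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_URL = "http://www.archive.org/services/oai.php"  # endpoint quoted in the thread
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"    # namespace defined by OAI-PMH 2.0


def list_identifiers_url(set_spec=None, resumption_token=None):
    """URL for the first ListIdentifiers request, or for a follow-up page."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is exclusive: no other arguments besides verb.
        params = [("verb", "ListIdentifiers"),
                  ("resumptionToken", resumption_token)]
    else:
        params = [("verb", "ListIdentifiers"), ("metadataPrefix", "oai_dc")]
        if set_spec:
            params.append(("set", set_spec))
    return OAI_URL + "?" + urlencode(params)


def parse_list_identifiers(xml_text):
    """Return (identifiers, resumption_token) from one ListIdentifiers response.

    resumption_token is None when the response carries no further page.
    """
    root = ET.fromstring(xml_text)
    ids = [el.text.strip() for el in root.iter(OAI_NS + "identifier") if el.text]
    token_el = root.find(".//" + OAI_NS + "resumptionToken")
    token = None
    if token_el is not None and token_el.text and token_el.text.strip():
        token = token_el.text.strip()
    return ids, token
```

A harvester would call list_identifiers_url("collection:cdl"), fetch and parse the response, then keep requesting with the returned token until it comes back as None.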


Additional instructions if you want to grab the content files.

From any book's metadata page (e.g.,
http://www.archive.org/details/chemicallecturee00newtrich)
click through on the "Usage Rights: See Terms" link; the rights are on the
pane on the left-hand side.

Once you know the identifier, you can grab the content files, using this
syntax:
    http://www.archive.org/details/$ID
Like so:
    http://www.archive.org/details/chemicallecturee00newtrich

And then sniff the page to find the FTP link:
    ftp://ia340915.us.archive.org/2/items/chemicallecturee00newtrich

But I think they prefer to use HTTP for these, not the FTP, so switch
this to:
    http://ia340915.us.archive.org/2/items/chemicallecturee00newtrich
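
The FTP-to-HTTP switch above can be sketched in Python as a one-step rewrite. It assumes, as in the example, that the HTTP server exposes the same host and path as the FTP link; the function name is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def ftp_to_http(url):
    """Return the same host and path with the scheme switched from ftp to http.

    URLs that do not use the ftp scheme are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.scheme != "ftp":
        return url
    return urlunsplit(("http", parts.netloc, parts.path, parts.query, parts.fragment))
```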

Hope this helps!

  --SET

We're a contributor so I can use curl to grab our records via http (and
regexp my way to our local catalog identifiers, which they do
store/expose).

I've played a bit with the z39.50 interface at indexdata
(http://www.indexdata.dk/opencontent/), but I'm not confident about the
content behind it.  I get very limited results, for instance I can't find
any UNC records and we're fairly new to the game.

Again, I'm looking for unique identifiers in what I can get back and it's
slim pickings.

Anyone cracked this nut?  Got any life lessons for me?

Thanks!
Tim

+++++++++++++++++++++++++++++++++++++++++++
Tim Shearer

Web Development Coordinator
The University Library
University of North Carolina at Chapel Hill
[EMAIL PROTECTED]
919-962-1288
+++++++++++++++++++++++++++++++++++++++++++
