Re: [CODE4LIB] Software used in Panama Papers Analysis

2016-04-12 Thread Owen Stephens
Another interesting post on this - this one from Le Monde (in French)
http://data.blog.lemonde.fr/2016/04/08/panama-papers-un-defi-technique-pour-le-journalisme-de-donnees/
 

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

> On 12 Apr 2016, at 16:05, Tom Cramer <tcra...@stanford.edu> wrote:
> 
> The IJNet article is particularly interesting—thanks for posting this. 
> Excerpts like the one below make me wonder if there is a “Code4News” 
> community, and if so, how do we find and connect with them. It seems we have 
> a lot in common, and maybe a lot to offer each other.
> 
> 
> MC: What we’ve achieved is pretty remarkable. Newsrooms are in an economic 
> crisis. No newsroom right now--except for maybe The New York Times and a few 
> others--have the capability to do something major like this at a global 
> scale. But we’re showing it’s possible. We share data, we produce tools for 
> communication, we share our stories and our interactives, to make it happen.
> 
> - Tom
> 
> 
> 
> 
> 
> 
> On Apr 7, 2016, at 7:24 AM, Gregory Markus 
> <gmar...@beeldengeluid.nl> wrote:
> 
> Hey Sebastian,
> 
> They go into a lot of detail in this article
> 
> https://ijnet.org/en/blog/how-icij-pulled-large-scale-cross-border-investigative-collaboration
> 
> Indeed this is pretty interesting stuff and a good shout out for Blacklight
> and other OS tools!
> 
> -greg
> 
> On Thu, Apr 7, 2016 at 4:21 PM, Sebastian Karcher <
> karc...@u.northwestern.edu> wrote:
> 
> Hi everyone,
> 
> from one of the New York Times stories on the Panama Papers:
> "The ICIJ made a number of powerful research tools available to the
> consortium that the group had developed for previous leak investigations.
> Those included a secure, Facebook-type forum where reporters could post the
> fruits of their research, as well as database search program called
> “Blacklight” that allowed the teams to hunt for specific names, countries
> or sources."
> 
> http://www.nytimes.com/2016/04/06/business/media/how-a-cryptic-message-interested-in-data-led-to-the-panama-papers.html
> 
> I assume this is http://projectblacklight.org/, which is pretty cool to
> see
> used that way. Does anyone know or have read anything about the other tools
> they used? What did they use for OCR? Did they use qualitative data
> analysis software? Some type of annotation tools? It seems like there's a
> lot to learn from this effort.
> 
> Thanks,
> 
> --
> Sebastian Karcher, PhD
> Qualitative Data Repository, Syracuse University
> qdr.syr.edu
> 
> 
> 
> 
> --
> 
> *Gregory Markus*
> 
> Project Assistant
> 
> *Netherlands Institute for Sound and Vision*
> *Media Parkboulevard 1, 1217 WE  Hilversum | Postbus 1060, 1200 BB
> Hilversum | *
> *beeldengeluid.nl* <http://www.beeldengeluid.nl/>
> *T* 0612350556
> 
> *Aanwezig:* - ma, di, wo, do, vr
> 


Re: [CODE4LIB] searching metadata vs searching content

2016-01-28 Thread Owen Stephens
To share the practice from a project I work on - the Jisc Historical Texts 
platform[1], which provides searching across digitised texts from the 16th to 
19th centuries. In this case we had the option to build the search application 
from scratch, rather than using a product such as ContentDM. I should say 
that all the technical work was done by K-Int [2] and Gooii [3]; I was there to 
advise on metadata and user requirements, so the following is based on my 
understanding of how the system works, and any errors are down to me :)

There are currently three major collections within the Historical Texts 
platform, with different data sources behind each one. In general the data we 
have for each collection consists of MARC metadata records, full text in XML 
documents (either from transcription or from OCR processes) and image files of 
the pages. 

The platform is built using the Elasticsearch [4] (ES) indexing software (as 
with Solr, this is built on top of Lucene).

We structure the data we index in ES in two layers - the ‘publication’ record, 
which is essentially where all the MARC metadata lives (although not as MARC - 
we transform this to an internal scheme), and the ‘page’ records - one record 
per page in the item. The text content lives in the page record, along with 
links to the image files for the page. The ‘page’ records are all what ES calls 
‘child’ records of the relevant publication record. We make this relationship 
through shared IDs in the MARC records and the XML fulltext documents.
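
To make the publication/page structure concrete, here's a rough sketch of how that 
kind of parent/child relationship can be set up, talking to Elasticsearch's REST API 
directly from Python with requests. The index, field and ID names are invented for 
illustration - the actual Historical Texts schema isn't public, and the platform was 
built on an earlier ES version where this was done with a _parent mapping rather than 
the join field shown here:

    import requests

    ES = "http://localhost:9200"

    # One index holds both 'publication' and 'page' documents,
    # related through a join field.
    requests.put(ES + "/historicaltexts", json={
        "mappings": {
            "properties": {
                "title":    {"type": "text"},
                "author":   {"type": "text"},
                "fulltext": {"type": "text"},
                "image":    {"type": "keyword"},
                "doc_type": {"type": "join",
                             "relations": {"publication": "page"}},
            }
        }
    })

    # Publication record: this is where the (transformed) MARC metadata lives.
    requests.put(ES + "/historicaltexts/_doc/pub1", json={
        "title": "An example title",
        "author": "An example author",
        "doc_type": "publication",
    })

    # Page record: full text plus a link to the page image, stored as a child
    # of the publication. Children must be routed to the parent's shard.
    requests.put(ES + "/historicaltexts/_doc/pub1-page1?routing=pub1", json={
        "fulltext": "...OCR or transcribed text of page 1...",
        "image": "https://images.example.org/pub1/0001.jpg",
        "doc_type": {"name": "page", "parent": "pub1"},
    })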

We create a whole range of indexes from this data. Obviously field-specific 
searches like title or author only search the relevant metadata fields. But we 
also have a (default) ’search all’ option which searches through all the 
metadata and fulltext. If the user wants to search the text only, they check an 
option and we limit the search to only text from records of the ‘page’ type.

The results the user gets initially are always the publication level records - 
so essentially your results list is a list of books. For each result you can 
view ‘matches in text’ which shows snippets of where your search term appears 
in the fulltext. You can then either click to view the whole book, or click the 
relevant page from the list of snippets. When you view the book, the software 
retrieves all the ‘page’ records for the book, and from the page records can 
retrieve the image files. When the user goes to the book viewer, we also carry 
over the search terms from their search, so they can see the same text snippets 
of where the terms appear alongside the book viewer - so the user can navigate 
to the pages which contain the search terms easily.
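
And, still as a sketch with the same invented names (not the platform's actual 
queries), a single search can return publication-level hits while pulling back 
highlighted snippets from matching child 'page' records - roughly the shape of the 
'matches in text' behaviour described above. Dropping the multi_match clause gives 
you the 'text only' style of search:

    import requests

    query = {
        "query": {
            "bool": {
                "should": [
                    # match on the publication-level metadata fields...
                    {"multi_match": {"query": "whale",
                                     "fields": ["title", "author"]}},
                    # ...and/or on the full text held in child 'page' records
                    {"has_child": {
                        "type": "page",
                        "query": {"match": {"fulltext": "whale"}},
                        "inner_hits": {   # per-page matches, with snippets
                            "highlight": {"fields": {"fulltext": {}}}
                        },
                    }},
                ]
            }
        }
    }

    resp = requests.post("http://localhost:9200/historicaltexts/_search",
                         json=query)
    for hit in resp.json()["hits"]["hits"]:
        print(hit["_id"], hit["_source"].get("title"))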

For more on the ES indexing side of this, Rob Tice from Knowledge Integration 
did a talk about the use of ES in this context at the London Elasticsearch 
user group [5]. Unfortunately the interface itself requires a login, but if you 
want to get a feel for how this all works, there is also a screencast available 
which gives an overview of the UI [6].

Best wishes,

Owen

1. https://historicaltexts.jisc.ac.uk
2. http://www.k-int.com
3. http://www.gooii.com
4. https://www.elastic.co
5. 
http://www.k-int.com/Rob-Tice-Elastic-London-complex-modelling-of-rich-text-data-in-Elasticsearch
6. http://historicaltexts.jisc.ac.uk/support

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

> On 27 Jan 2016, at 00:30, Laura Buchholz <laura.buchh...@reed.edu> wrote:
> 
> Hi all,
> 
> I'm trying to understand how digital library systems work when there is a
> need to search both metadata and item text content (plain text/full text),
> and when the item is made up of more than one file (so, think a digitized
> multi-page yearbook or newspaper). I'm not looking for answers to a
> specific problem, really, just looking to know what is the current state of
> community practice.
> 
> In our current system (ContentDM), the "full text" of something lives in
> the metadata record, so it is indexed and searched along with the metadata,
> and essentially treated as if it were metadata. (Correct?). This causes
> problems in advanced searching and muddies the relationship between what is
> typically a descriptive metadata record and the file that is associated
> with the record. It doesn't seem like a great model for the average digital
> library. True? I know the answer is "it depends", but humor me... :)
> 
> If it isn't great, and there are better models, what are they? I was taught
> METS in school, and based on that, I'd approach the metadata in a METS or
> METS-like fashion. But I'm unclear on the steps from having a bunch of METS
> records that include descriptive metadata and pointers to text files of the
> OCR (we don't, but if we did...) to indexing and providing results to
> users. I think anot

Re: [CODE4LIB] Job: Wine Loving Developer at University of California, Davis

2015-12-11 Thread Owen Stephens
That may well be true, but ‘getting the job done’ isn’t the only aspect of a 
crowdsourcing project. It can be used to engage an audience more deeply in the 
collection and give them some investment in it. This can help with the overall 
visibility of the collection on the web (through those who have engaged sharing 
what they are doing/seeing), can encourage future use, and can provide a platform 
for further projects.

A project like this could also offer a way of experimenting with crowdsourcing 
in a low-risk way. And of course the developer is needed for the visualisation 
aspect anyway, so the recruitment needs to happen and a wage needs to be paid 
regardless ...

Whether all this balances out against the economics/efficiency of getting the 
job done in the cheapest possible way is a judgement that needs to be made, but 
I don’t think the simple economic argument is the only one in play here.

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

> On 10 Dec 2015, at 23:42, James Morley <james.mor...@europeana.eu> wrote:
> 
> I agree with Thomas's logic, if not the maths (surely $2,000?)
> 
> I was going to do a few myself but it looks like comments have been disabled 
> on the Flickr images?
> 
> 
> From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] on behalf of Thomas 
> Krichel [kric...@openlib.org]
> Sent: 10 December 2015 23:17
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Job: Wine Loving Developer  at University of 
> California, Davis
> 
>  j...@code4lib.org writes
> 
> 
>> **PROJECT DETAILS**
>> The UC Davis University Library is launching a project to digitize the
>> [Amerine wine label collection](https://www.flickr.com/photos/brantley/sets/72157655817440104/with/21116552632/)
> 
>  Some look hard to read.
> 
>> and engage the public to transcribe the information contained on the
>> labels and associated annotations.
> 
>  This may take a long time. I suggest rather than doing that, take
>  somebody in a low-income country who speaks French, say, and who will
>  type all the data in. That way you get consistency in the data.  I
>  live in Siberia, I can find somebody there. Once this data is in a
>  simple text file, you can use in-house staff to attach it to the
>  label images in your systems.
> 
>  Crowdsourcing sounds cool, but for 4000 labels it makes no sense.
>  If the typist gets $10/h, and gets 20 labels done in 1h, we
>  are talking $200. The visit you are planning for your developer
>  will cost that much.
> --
> 
>  Cheers,
> 
>  Thomas Krichel  http://openlib.org/home/krichel
>  skype:thomaskrichel


Re: [CODE4LIB] Protocol-relative URLs in MARC

2015-08-17 Thread Owen Stephens
In theory the 1st indicator dictates the protocol used, and 4 = HTTP. However, in 
all the examples on http://www.loc.gov/marc/bibliographic/bd856.html, despite the 
indicator being used, the protocol part of the URI is still repeated in the 
$u subfield.

You can put ‘7’ in the 1st indicator, then use subfield $2 to define other 
methods.

Since ‘http’ is one of the preset protocols but ‘https’ is not, I guess in theory 
this means you should use something like:

856 70 $uhttps://example.com$2https

I’d be pretty surprised if in practice people don’t just do:

856 40 $uhttps://example.com
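
For what it's worth, here's a quick sketch of the two options using pymarc (this is 
the pre-5.0 flat subfield syntax; the URL is just the placeholder from the examples 
above):

    from pymarc import Field

    # 'Strict' reading of the standard: indicator 7 plus $2 naming the method
    strict = Field(
        tag="856",
        indicators=["7", "0"],
        subfields=["u", "https://example.com", "2", "https"],
    )

    # What I suspect most people do in practice: indicator 4 (HTTP) regardless
    pragmatic = Field(
        tag="856",
        indicators=["4", "0"],
        subfields=["u", "https://example.com"],
    )

    print(strict)     # prints something like: =856  70$uhttps://example.com$2https
    print(pragmatic)  # prints something like: =856  40$uhttps://example.com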

Owen


Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

 On 17 Aug 2015, at 21:41, Stuart A. Yeates syea...@gmail.com wrote:
 
 I'm in the middle of some work which includes touching the 856s in lots of
 MARC records pointing to websites we control. The websites are available on
 both https://example.org/ and http://example.org/
 
 Can I put //example.org/ in the MARC or is this contrary to the standard?
 
 Note that there is a separate question about whether various software
 systems support this, but that's entirely secondary to the question of the
 standard.
 
 cheers
 stuart
 --
 ...let us be heard from red core to black sky


Re: [CODE4LIB] Processing Circ data

2015-08-05 Thread Owen Stephens
Another option might be to use OpenRefine http://openrefine.org - this should 
easily handle 250,000 rows. I find it good for basic data analysis, and there 
are extensions which offer some visualisations (e.g. the VIB BITs extension 
which will plot simple data using d3 
https://www.bits.vib.be/index.php/software-overview/openrefine)

I’ve written an introduction to OpenRefine available at 
http://www.meanboyfriend.com/overdue_ideas/2014/11/working-with-data-using-openrefine/

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

 On 5 Aug 2015, at 21:07, Harper, Cynthia char...@vts.edu wrote:
 
 Hi all. What are you using to process circ data for ad-hoc queries.  I 
 usually extract csv or tab-delimited files - one row per item record, with 
 identifying bib record data, then total checkouts over the given time 
 period(s).  I have been importing these into Access then grouping them by bib 
 record. I think that I've reached the limits of scalability for Access for 
 this project now, with 250,000 item records.  Does anyone do this in R?  My 
 other go-to- software for data processing is RapidMiner free version.  Or do 
 you just use MySQL or other SQL database?  I was looking into doing it in R 
 with RSQLite (just read about this and sqldf  
 http://www.r-bloggers.com/make-r-speak-sql-with-sqldf/ ) because I'm sure my 
 IT department will be skeptical of letting me have MySQL on my desktop.  
 (I've moved into a much more users-don't-do-real-computing kind of 
 environment).  I'm rusty enough in R that if anyone will give me some 
 start-off data import code, that would be great.
 
 Cindy Harper
 E-services and periodicals librarian
 Virginia Theological Seminary
 Bishop Payne Library
 3737 Seminary Road
 Alexandria VA 22304
 char...@vts.edu
 703-461-1794


Re: [CODE4LIB] Desiring Advice for Converting OCR Text into Metadata and/or a Database

2015-06-18 Thread Owen Stephens
It may depend on the format of the PDF, but I’ve used the Scraperwiki Python 
module’s ‘pdftoxml’ function to extract text data from PDFs in the past. There is 
a write-up (not by me) at 
http://schoolofdata.org/2013/08/16/scraping-pdfs-with-python-and-the-scraperwiki-module/, 
and an example of how I’ve used it at 
https://github.com/ostephens/british_library_directory_of_library_codes/blob/master/scraper.py
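
As a rough illustration of the approach - not tested against your PDFs, and assuming 
the scraperwiki and lxml packages are installed (the filename is a placeholder):

    import scraperwiki
    import lxml.etree

    with open("bibliography.pdf", "rb") as f:
        pdfdata = f.read()

    # pdftoxml() wraps pdftohtml and returns XML in which each <text> element
    # is a fragment of text together with its position on the page
    xmldata = scraperwiki.pdftoxml(pdfdata)
    root = lxml.etree.fromstring(xmldata.encode("utf-8"))

    # The top/left coordinates are what make it possible to reconstruct lines
    # and entries before doing the regex/enrichment work
    for el in root.findall(".//text"):
        print(el.attrib.get("top"), el.attrib.get("left"), el.xpath("string()"))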

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

 On 18 Jun 2015, at 17:02, Matt Sherman matt.r.sher...@gmail.com wrote:
 
 Hi Code4Libbers,
 
 I am working with colleague on a side project which involves some scanned
 bibliographies and making them more web searchable/sortable/browse-able.
 While I am quite familiar with the metadata and organization aspects we
 need, I am at a bit of a loss on how to automate the process of putting
 the bibliography in a more structured format so that we can avoid going
 through hundreds of pages by hand.  I am pretty sure regular expressions
 are needed, but I have not had an instance where I need to automate
 extracting data from one file type (PDF OCR or text extracted to Word doc)
 and place it into another (either a database or an XML file) with some
 enrichment.  I would appreciate any suggestions for approaches or tools to
 look into.  Thanks for any help/thoughts people can give.
 
 Matt Sherman


Re: [CODE4LIB] eebo [perfect texts]

2015-06-09 Thread Owen Stephens
And some of the researchers definitely care about this (authority control, high 
quality descriptive metadata). I went to a hack day focussing on the EEBO-TCP 
Phase 1 release (these texts). I mentioned to one of the researchers (not a 
librarian) that I had access to some MARC records which described the works. 
Their immediate response was “Ah - but which MARC records, because they aren’t 
all of the same quality”!

There are good cataloguing records for the works but they have not been made 
available under an open licence alongside the transcribed texts. Probably the 
highest quality records are those in the English Short Title Catalogue (ESTC) 
http://estc.bl.uk.

There have been some great steps forward in the last few years, but I still 
feel libraries need to increase the amount they are doing to publish metadata 
under explicitly open licences.

Owen


Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

 On 8 Jun 2015, at 23:23, Stuart A. Yeates syea...@gmail.com wrote:
 
 Another thing that could usefully be done is significantly better authority
 control. Authors, works, geographical places, subjects, etc, etc.
 
 Good core librarianship stuff that is essentially orthogonal to all the
 other work that appears to be happening.
 
 cheers
 stuart
 
 --
 ...let us be heard from red core to black sky
 
 On Tue, Jun 9, 2015 at 12:42 AM, Eric Lease Morgan emor...@nd.edu wrote:
 
 On Jun 8, 2015, at 7:32 AM, Owen Stephens o...@ostephens.com wrote:
 
 I’ve just seen another interesting take based (mainly) on data in the
 TCP-EEBO release:
 
 
 https://scalablereading.northwestern.edu/2015/06/07/shakespeare-his-contemporaries-shc-released/
 
 It includes mention of MorphAdorner[1] which does some clever stuff
 around tagging parts of speech, spelling variations, lemmata etc. and
 another tool which I hadn’t come across before, AnnoLex[2], “for the
 correction and annotation of lexical data in Early Modern texts”.
 
 This paper[3] from Alistair Baron and Andrew Hardie at the University of
 Lancaster in the UK about preparing EEBO-TCP texts for corpus-based
 analysis may also be of interest, and the team at Lancaster have developed
 a tool called VARD which supports pre-processing texts[4]
 
 [1] http://morphadorner.northwestern.edu
 [2] http://annolex.at.northwestern.edu
 [3] http://eprints.lancs.ac.uk/60272/1/Baron_Hardie.pdf
 [4] http://ucrel.lancs.ac.uk/vard/about/
 
 
 All of this is really very interesting. Really. At the same time, there
 seems to be a WHOLE lot of effort spent on cleaning and normalizing data,
 and very little done to actually analyze it beyond “close reading”. The
 final goal of all these interfaces seem to be refined search. Frankly, I
 don’t need search. And the only community who will want this level of
 search will be the scholarly scholar. “What about the undergraduate
 student? What about the just more than casual reader? What about the
 engineer?” Most people don’t know how or why parts-of-speech are important
 let alone what a lemma is. Nor do they care. I can find plenty of things. I
 need (want) analysis. Let’s assume the data is clean — or rather, accept
 the fact that there is dirty data akin to the dirty data created through
 OCR and there is nothing a person can do about it — let’s see some automated
 comparisons between texts. Examples might include:
 
  * this one is longer
  * this one is shorter
  * this one includes more action
   * this one discusses such & such theme more than this one
   * so & so theme came and went during a particular time period
  * the meaning of this phrase changed over time
  * the author’s message of this text is…
  * this given play asserts the following facts
  * here is a map illustrating where the protagonist went when
  * a summary of this text includes…
  * this work is fiction
  * this work is non-fiction
  * this work was probably influenced by…
 
 We don’t need perfect texts before analysis can be done. Sure, perfect
 texts help, but they are not necessary. Observations and generalization can
 be made even without perfectly transcribed texts.
 
 —
 ELM
 


[CODE4LIB] Global Open Knowledgebase APIs

2015-06-08 Thread Owen Stephens
Dear all,

GOKb, the Global Open Knowledgebase, is a community-managed project that aims 
to describe electronic journals and books, publisher packages, and platforms in 
a way that will be familiar to librarians who have worked with electronic 
resources. I’ve been working on the project since it started, working with 
others to gather requirements, develop the underlying data models and specify 
functionality for the system.

GOKb opened to ‘public preview’ in January 2015, and you can sign up for an 
account and access the service at https://gokb.kuali.org/gokb/

Several hundred ejournal packages, and associated information about the 
ejournal titles, platforms and organisations have been added to the 
knowledgebase over the past few months. Alongside this work of adding content 
we have also opened up APIs to interact with the service.

We are interested in:

* Understanding how people would like to use data from GOKb via APIs (or other 
mechanisms)
* Getting some use of the initial APIs and getting feedback on these 
* Getting feedback on other APIs people would like to see

The current APIs we support are:

The ‘Coreference’ service
The main aim of this API is to return the list of identifiers associated 
with a title. The service allows you to provide a journal identifier (such as 
an ISSN) and get back basic information about the journal, including title and 
other identifiers associated with the journal (other ISSNs, DOIs, publisher 
identifiers etc.). 

Documentation: https://github.com/k-int/gokb-phase1/wiki/Co-referencing-Detail
Access: https://gokb.kuali.org/gokb/coreference/index

OAI Interfaces
The main aim of this API is to enable other services to obtain data from GOKb 
on an ongoing basis. Information about ejournal packages, titles and 
organisations can be obtained via this service.

Documentation: https://github.com/k-int/gokb-phase1/wiki/OAI-Interfaces-for-Synchronization
Access: http://gokb.kuali.org/gokb/oai
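
As a starting point, here's a minimal harvesting sketch using plain OAI-PMH requests. 
I'm assuming the standard oai_dc prefix simply because every OAI-PMH endpoint has to 
support it - the documentation linked above gives the actual endpoints, metadata 
formats and sets to use for packages, titles and organisations:

    import requests
    import xml.etree.ElementTree as ET

    BASE = "http://gokb.kuali.org/gokb/oai"
    OAI = {"oai": "http://www.openarchives.org/OAI/2.0/"}

    resp = requests.get(BASE, params={"verb": "ListRecords",
                                      "metadataPrefix": "oai_dc"})
    root = ET.fromstring(resp.content)

    # Print the identifier and datestamp of each record in the first page of
    # results; a full harvester would also follow resumptionTokens.
    for header in root.findall(".//oai:header", OAI):
        print(header.findtext("oai:identifier", namespaces=OAI),
              header.findtext("oai:datestamp", namespaces=OAI))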

Add/Update API
This API supports adding and updating data in GOKb. You can add new, or update 
existing, Organisations and Platforms. You can add additional identifiers to 
Journal titles.

Documentation: https://github.com/k-int/gokb-phase1/wiki/Integration---Telling-GOKb-about-new-or-corresponding-resources-and-local-identifiers

We also have a SPARQL endpoint available on our test service (which contains 
test data only). The SPARQL endpoint is at http://test-gokb.kuali.org/sparql, 
and a set of example queries is given at 
https://github.com/k-int/gokb-phase1/wiki/Sample-SPARQL
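
And a quick way to poke at the test SPARQL endpoint from Python using SPARQLWrapper - 
the query here is deliberately generic (it just lists a handful of triples); the wiki 
page above has examples that use the actual GOKb ontology:

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://test-gokb.kuali.org/sparql")
    sparql.setQuery("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10")
    sparql.setReturnFormat(JSON)

    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["s"]["value"], row["p"]["value"], row["o"]["value"])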

Feedback on any/all of this would be very welcome - either to the list for 
discussion, or directly to me. We want to make sure we can provide useful data 
and services and hope you can help us do this.

Best wishes,

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936


Re: [CODE4LIB] eebo [developments]

2015-06-08 Thread Owen Stephens
Great stuff Eric.

I’ve just seen another interesting take based (mainly) on data in the TCP-EEBO 
release 
https://scalablereading.northwestern.edu/2015/06/07/shakespeare-his-contemporaries-shc-released/

It includes mention of MorphAdorner[1], which does some clever stuff around 
tagging parts of speech, spelling variations, lemmata etc., and another tool 
which I hadn’t come across before, AnnoLex[2], “for the correction and annotation 
of lexical data in Early Modern texts”.

This paper[3] from Alistair Baron and Andrew Hardie at the University of 
Lancaster in the UK about preparing EEBO-TCP texts for corpus-based analysis 
may also be of interest, and the team at Lancaster have developed a tool called 
VARD which supports pre-processing texts[4]

Owen

[1] http://morphadorner.northwestern.edu
[2] http://annolex.at.northwestern.edu
[3] http://eprints.lancs.ac.uk/60272/1/Baron_Hardie.pdf
[4] http://ucrel.lancs.ac.uk/vard/about/

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

 On 7 Jun 2015, at 18:48, Eric Lease Morgan emor...@nd.edu wrote:
 
 Here some of developments with my playing with the EEBO data. 
 
 I used the repository on Box to get my content, and I mirrored it locally. 
 [1, 2] I then looped through the content using XPath to extract rudimentary 
 metadata, thus creating a “catalog” (index). Along the way I calculated the 
 number of words in each document and saved that as a field of each record. 
 Being a tab-delimited file, it is trivial to import the catalog into my 
 favorite spreadsheet, database, editor, or statistics program. This allowed 
 me to browse the collection. I then used grep to search my catalog, and save 
 the results to a file. [5] I searched for Richard Baxter. [6, 7, 8]. I then 
 used an R script to graph the numeric data of my search results. Currently, 
 there are only two types: 1) dates, and 2) number of words. [9, 10, 11, 12] 
 From these graphs I can tell that Baxter wrote a lot of relatively short 
 things, and I can easily see when he published many of his works. (He 
 published a lot around 1680 but little in 1665.) I then transformed the 
 search results into a browsable HTML table. [13] The table has hidden features. (Can you 
 say, “Usability?”) For example, you can click on table headers to sort. This 
 is cool because I want to sort things by number of words. (Number of pages 
 doesn’t really tell me anything about length.) There is also a hidden link to 
 the left of each record. Upon clicking on the blank space you can see 
 subjects, publisher, language, and a link to the raw XML. 
 
 For a good time, I then repeated the process for things Shakespeare and 
 things astronomy. [14, 15] Baxter took me about twelve hours worth of work, 
 not counting the caching of the data. Combined, Shakespeare and astronomy 
 took me less than five minutes. I then got tired.
 
 My next steps are multi-faceted and presented in the following incomplete 
 unordered list:
 
  * create browsable lists - the TEI metadata is clean and
consistent. The authors and subjects lend themselves very well to
the creation of browsable lists.
 
  * CGI interface - The ability to search via Web interface is
imperative, and indexing is a prerequisite.
 
  * transform into HTML - TEI/XML is cool, but…
 
  * create sets - The collection as a whole is very interesting,
but many scholars will want sub-sets of the collection. I will do
this sort of work, akin to my work with the HathiTrust. [16]
 
  * do text analysis - This is really the whole point. Given the
full text combined with the inherent functionality of a computer,
additional analysis and interpretation can be done against the
corpus or its subsets. This analysis can be based the counting of
words, the association of themes, parts-of-speech, etc. For
example, I plan to give each item in the collection a colors,
“big” names, and “great” ideas coefficient. These are scores
denoting the use of researcher-defined “themes”. [17, 18, 19] You
can see how these themes play out against the complete writings
of “Dead White Men With Three Names”. [20, 21, 22]
 
 Fun with TEI/XML, text mining, and the definition of librarianship.
 
 
 [1] Box - http://bit.ly/1QcvxLP
 [2] mirror - http://dh.crc.nd.edu/sandbox/eebo-tcp/xml/
 [3] xpath script - http://dh.crc.nd.edu/sandbox/eebo-tcp/bin/xml2tab.pl
 [4] catalog (index) - http://dh.crc.nd.edu/sandbox/eebo-tcp/catalog.txt
 [5] search results - http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/baxter.txt
 [6] Baxter at VIAF - http://viaf.org/viaf/54178741
 [7] Baxter at WorldCat - http://www.worldcat.org/wcidentities/lccn-n50-5510
 [8] Baxter at Wikipedia - http://en.wikipedia.org/wiki/Richard_Baxter
 [9] box plot of dates - 
 http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/boxplot-dates.png
 [10] box plot of words - 
 http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/boxplot-words.png

Re: [CODE4LIB] eebo

2015-06-05 Thread Owen Stephens
Hi Eric,

I’ve worked with EEBO as part of the Jisc Historical Texts 
(https://historicaltexts.jisc.ac.uk/home) platform - which provides access to 
EEBO and other collections for UK Universities. My work was around the metadata, 
the search of metadata and full text, and the display of results. I was mainly 
looking at metadata but did some digging into the TEI files to see how the 
markup could be used to extract metadata (e.g. presence of illustrations in the 
text).

I was lucky (?!) enough to have access to the MARC records, but I did also do 
some work looking at the metadata included in the TEI files.

If there is anything I can help with I’d be happy to.

The people who worked with the files in detail were a UK software development 
company, Knowledge Integration (http://www.k-int.com/) - I can give you a 
contact there if that would be helpful.

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

 On 5 Jun 2015, at 13:10, Eric Lease Morgan emor...@nd.edu wrote:
 
 Does anybody here have experience reading the SGML/XML files representing the 
 content of EEBO? 
 
 I’ve gotten my hands on approximately 24 GB of SGML/XML files representing 
 the content of EEBO (Early English Books Online). This data does not include 
 page images. Instead it includes metadata of various ilks as well as the 
 transcribed full text. I desire to reverse engineer the SGML/XML in order to: 
 1) provide an alternative search/browse interface to the collection, and 2) 
 support various types of text mining services. 
 
 While I am making progress against the data, it would be nice to learn of 
 other people’s experience so I do not not re-invent the wheel (too many 
 times). ‘Got ideas?
 
 —
 Eric Lease Morgan
 University Of Notre Dame


Re: [CODE4LIB] linked data question

2015-02-26 Thread Owen Stephens
I highly recommend Chapter 6 of the Linked Data book, which details different 
design approaches for Linked Data applications - section 6.3 
(http://linkeddatabook.com/editions/1.0/#htoc84) summarises the approaches as:

1. Crawling Pattern
2. On-the-fly dereferencing pattern
3. Query federation pattern

Generally my view would be that (1) and (2) are viable approaches for different 
applications, but that (3) is generally a bad idea (having been through 
federated search before!)
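
As a small illustration of why (2) is workable: the usual compromise is to 
dereference a URI the first time you see it and cache the result locally. A sketch 
using rdflib, assuming the URI serves RDF via content negotiation (the VIAF URI is 
just an example):

    import rdflib

    _cache = {}

    def get_label(uri):
        """Dereference uri once, cache an rdfs:label (or fall back to the URI)."""
        if uri not in _cache:
            g = rdflib.Graph()
            g.parse(uri)  # fetches and parses the remote RDF on first use
            label = g.value(rdflib.URIRef(uri), rdflib.RDFS.label)
            _cache[uri] = str(label) if label else uri
        return _cache[uri]

    print(get_label("http://viaf.org/viaf/54178741"))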

Owen



Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

 On 26 Feb 2015, at 14:40, Eric Lease Morgan emor...@nd.edu wrote:
 
 On Feb 25, 2015, at 2:48 PM, Esmé Cowles escow...@ticklefish.org wrote:
 
 In the non-techie library world, linked data is being talked about (perhaps 
 only in listserv traffic) as if the data (bibliographic data, for instance) 
 will reside on remote sites (as a SPARQL endpoint??? We don't know the 
 technical implications of that), and be displayed by your local 
 catalog/the centralized inter-national catalog by calling data from that 
 remote site. But the original question was how the data on those remote 
 sites would be access points - how can I start my search by searching for 
 that remote content?  I assume there has to be a database implementation 
 that visits that data and pre-indexes it for it to be searchable, and 
 therefore the index has to be local (or global a la Google or OCLC or its 
 bibliographic-linked-data equivalent). 
 
 I think there are several options for how this works, and different 
 applications may take different approaches.  The most basic approach would 
 be to just include the URIs in your local system and retrieve them any time 
 you wanted to work with them.  But the performance of that would be 
 terrible, and your application would stop working if it couldn't retrieve 
 the URIs.
 
 So there are lots of different approaches (which could be combined):
 
 - Retrieve the URIs the first time, and then cache them locally.
 - Download an entire data dump of the remote vocabulary and host it locally.
 - Add text fields in parallel to the URIs, so you at least have a label for 
 it.
 - Index the data in Solr, Elasticsearch, etc. and use that most of the time, 
 esp. for read-only operations.
 
 
 Yes, exactly. I believe Esmé has articulated the possible solutions well. 
 escowles++  —ELM


Re: [CODE4LIB] Code4LibCon video crew thanks

2015-02-17 Thread Owen Stephens
Apologies for a +1 message, but you know... +1 and some

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

 On 13 Feb 2015, at 18:00, Cary Gordon listu...@chillco.com wrote:
 
 I want to deeply thank Ashley Blewer, Steven Anderson and Josh Wilson for 
 running the video streaming and capture at Code4LibCon in Portland. Because 
 of you, we had great video in real time (and I got to actually watch the 
 presentations). I also want to again thank Riley Childs, who could not make 
 it this year. Riley moved the bar up last year by putting together our 
 YouTube presence.
 
 For the second year running, we requested and were not allowed to setup and 
 test the day before, and for the second year running lost part of the opening 
 session. Fortunately, we did capture most of what did not get streamed on 
 Tuesday, and I will put that online next week. There is always next year.
 
 Thanks,
 
 Cary


Re: [CODE4LIB] Automatically updating documentation with screenshots

2015-01-26 Thread Owen Stephens
... and further to this I've just found a neat Chrome plugin which will record 
a set of actions/tests as a CasperJS script, including screenshots - my first 
impressions are pretty positive, and the code produced looks pretty clean.

The plugin is called 'Resurrectio' [https://github.com/ebrehault/resurrectio]

Cheers

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

 On 26 Jan 2015, at 13:48, Owen Stephens o...@ostephens.com wrote:
 
 Thanks all - I'm looking at both Selenium and Casperjs now.
 
 I also came across a plugin for 'Robot Framework' [http://robotframework.org] 
 which allows you to grab screenshots (via Selenium) and annotate with notes - 
 along the lines that Ross suggested. The plugin is 'Selenium2Screenshots' 
 [https://github.com/datakurre/robotframework-selenium2screenshots]
 
 Owen
 
 Owen Stephens
 Owen Stephens Consulting
 Web: http://www.ostephens.com
 Email: o...@ostephens.com
 Telephone: 0121 288 6936
 
 On 26 Jan 2015, at 13:16, Mads Villadsen m...@statsbiblioteket.dk wrote:
 
 I have used casperjs for this purpose. A small script that loads urls at 
 multiple different resolutions/user agents and takes a screenshot of each of 
 them.
 
 Regards
 
 -- 
 Mads Villadsen m...@statsbiblioteket.dk
 Statsbiblioteket
 It-udvikler
 


[CODE4LIB] Automatically updating documentation with screenshots

2015-01-26 Thread Owen Stephens
I work on a web application and when we release a new version there are often 
updates to make to existing user documentation - especially screenshots, where 
unrelated changes (e.g. the addition of a new top-level menu item) can make 
redoing whole sets of screenshots desirable across all the documentation.

I'm looking at whether we could automate the generation of screenshots somehow, 
which has taken me into documentation tools such as Sphinx 
[http://sphinx-doc.org] and Dexy [http://dexy.it]. However, ideally I want 
something simple enough for the application support staff to be able to use.

Anyone done/tried anything like this?
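
For concreteness, the kind of thing I have in mind is along these lines, using 
Selenium's Python bindings - the URLs and filenames below are placeholders rather 
than our actual application, and I haven't tried this yet:

    from selenium import webdriver

    PAGES = {
        "login": "https://app.example.org/login",
        "search": "https://app.example.org/search",
        "admin-menu": "https://app.example.org/admin",
    }

    driver = webdriver.Chrome()          # assumes chromedriver is available
    driver.set_window_size(1280, 800)    # keep screenshots a consistent size

    for name, url in PAGES.items():
        driver.get(url)
        driver.save_screenshot(name + ".png")

    driver.quit()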

Cheers

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936


Re: [CODE4LIB] Stack Overflow

2014-11-04 Thread Owen Stephens
Another option would be a 'code4lib QA' site. Becky Yoose set up one for 
Coding/Cataloguing and so can comment on how much effort it's been. In terms of 
asking/answering questions the use is clearly low, but I think the content that 
is there is (generally) good quality and useful.

I guess the hard part of any project like this is going to be building the 
community around it. The first things that occur to me are how you encourage 
people to ask questions on the new site rather than via existing methods, 
and how you build enough community activity around housekeeping such as 
noting duplicate questions and merging/closing. The latter might be a nice 
problem to have, but the former is where both the Library / LIS SE and the 
Digital Preservation SE fell down, and libcatcode suffers the same problem - 
just not enough activity to be a go-to destination.

I'm supportive of the idea, but I'd hate to see this go through the pain of the 
SE process only to fail for the same reasons as previous efforts in this area. 
I think we need to think about this underlying problem - but I'm not sure what 
the solution is/solutions are.

Owen


Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

 On 4 Nov 2014, at 15:34, Schulkins, Joe joseph.schulk...@liverpool.ac.uk 
 wrote:
 
 To be honest I absolutely hate the whole reputation and badge system for 
 exactly the reasons you outline, but I can't deny that I do find the family 
 of Stack Exchange sites extremely useful and by comparison Listservs just 
 seem very archaic to me as it's all too easy for a question (and/or its 
 answer) to drop through the cracks of a popular discussion. Are Listservs 
 really the best way to deal with help? I would even prefer a Drupal site...   
 
 
 Joseph Schulkins| Systems Librarian| University of Liverpool Library| PO Box 
 123 | Liverpool L69 3DA | joseph.schulk...@liverpool.ac.uk| T 0151 794 3844 
 
 Follow us: @LivUniLibrary Like us: LivUniLibrary Visit us: 
 http://www.liv.ac.uk/library 
 Special Collections  Archives blog: http://manuscriptsandmore.liv.ac.uk
 
 
 
 
 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of 
 Joshua Welker
 Sent: 04 November 2014 14:43
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Stack Overflow
 
 The concept of a library technology Stack Exchange site as a google-able 
 repository of information sounds great. However, I do have quite a few 
 reservations.
 
 1. Stack Exchange sites seem to naturally lead to gatekeeping, snobbishness, 
 and other troll behaviors. The reputation system built into those sites 
 really go to a lot of folks' heads. High-ranking users seem to take pleasure 
 in shutting down questions as off-topic, redundant, etc.
 Argument and one-upmanship are actively promoted--The previous answer sucks. 
 Here's my better answer!  This  tends to attract certain (often
 male) personalities and to repel certain (often female) personalities.
 This seems very contrary to the direction the Code4Lib community has tried to 
 move in the last few years of being more inclusive and inviting to women 
 instead of just promoting the stereotypical IT guy qualities that dominate 
 most IT-related discussions on the Internet. More here:
 
 http://www.banane.com/2012/06/20/there-are-no-women-on-stackoverflow-or-ar
 e-there/
 http://michael.richter.name/blogs/why-i-no-longer-contribute-to-stackoverf
 low
 
 2. Having a Stack Exchange site might fragment the already quite small and 
 nascent library technology community. This might be an unfounded worry, but 
 it's worth consideration. A lot of QA takes place on this listserv, and it 
 would be awkward to try to have all this information in both places. That 
 said, searching StackExchange is much easier than searching a listserv.
 
 3. I echo your concerns about vendors. Libraries have a culture of protecting 
 vendors from criticism. Sure, we do lots of criticism behind closed doors, 
 but nowhere that leaves an online footprint. Often, our contracts include a 
 clause that we have to keep certain kinds of information private. I don't 
 think this is a very positive aspect of librarian culture, but it is there.
 
 I think a year or two ago that there was a pretty long discussion on this 
 listserv about creating a Stack Exchange site.
 
 Josh Welker
 
 
 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of 
 Schulkins, Joe
 Sent: Tuesday, November 04, 2014 8:12 AM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: [CODE4LIB] Stack Overflow
 
 Presumably I'm not alone in this, but I find Stack Overflow a valuable 
 resource for various bits of web development and I was wondering whether 
 anyone has given any thought about proposing a Library Technology site to 
 Stack Exchange's Area 51 (http://area51.stackexchange.com/)? Doing a search 
 of the proposals shows there was one

Re: [CODE4LIB] Stack Overflow

2014-11-04 Thread Owen Stephens
Thanks for that Mark. That's running on 'question2answer', which looks to have a 
reasonable amount of development going on around it: 
https://github.com/q2a/question2answer/graphs/contributors (given Becky's 
comments about OSQA, which still hold true).

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

 On 4 Nov 2014, at 16:05, Mark A. Matienzo mark.matie...@gmail.com wrote:
 
 On Tue, Nov 4, 2014 at 11:00 AM, Owen Stephens o...@ostephens.com wrote:
 
 Another option would be a 'code4lib QA' site. Becky Yoose set up one for
 Coding/Cataloguing and so can comment on how much effort its been. In terms
 of asking/answering questions the use is clearly low but I think the
 content that is there is (generally) good quality and useful.
 
 I guess the hard part of any project like this is going to be building the
 community around it. The first things that occur to me is how you encourage
 people to ask the question on this new site, rather than via existing
 methods and how do you build enough community activity around housekeeping
 such as noting duplicate questions and merging/closing. The latter might be
 a nice problem to have, but the former is where both the Library / LIS SE
 and the Digital Preservation SE fell down, and libcatcode suffers the same
 problem - just not enough activity to be a go-to destination.
 
 
 I would add that the Digital Preservation SE has been reinstantiated as
 Digital Preservation QA http://qanda.digipres.org/, which is organized
 and supported by the Open Planets Foundation and the National Digital
 Stewardship Alliance.
 
 Mark A. Matienzo m...@matienzo.org
 Director of Technology, Digital Public Library of America


Re: [CODE4LIB] MARC reporting engine

2014-11-03 Thread Owen Stephens
The MARC XML seemed to be an archive within an archive - I had to gunzip to get 
innzmetadata.xml, then rename it to innzmetadata.xml.gz and gunzip again to get 
the actual XML.
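
For anyone hitting the same thing, the double decompression can be done in one go - 
e.g. in Python (a sketch only; gzip is in the standard library and the filenames are 
as per the download):

    import gzip

    with open("innzmetadata.xml.gz", "rb") as f:
        once = gzip.decompress(f.read())   # first layer of gzip

    xml_bytes = gzip.decompress(once)      # the file inside is itself gzipped

    with open("innzmetadata.xml", "wb") as out:
        out.write(xml_bytes)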

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

 On 3 Nov 2014, at 22:38, Robert Haschart rh...@virginia.edu wrote:
 
 I was going to echo Eric Hatcher's recommendation of Solr and SolrMarc, since 
 I'm the creator of SolrMarc.
 It does provide many of the same tools as are described in the toolset you 
 linked to,  but it is designed to write to Solr rather than to a SQL style 
 database.   Solr may or may not be more suitable for your needs then a SQL 
 database.   However I decided to download the data to see whether SolrMarc 
 could handle it.   I started with the MARCXML.gz data, ungzipped it to get a 
 .XML file, but the resulting file causes SolrMarc to blow chunks.   Either 
 I'm missing something or there is something way wrong with that data.The 
 gzipped binary MARC file work fine with the SolrMarc tools.
 
 Creating a SolrMarc script to extract the 700 fields, plus a bash script to 
 cluster and count them, and sort by frequency took about 20 minutes.
 
 -Bob Haschart
 
 
 On 11/3/2014 3:00 PM, Stuart Yeates wrote:
 Thank you to all who responded with software suggestions. 
 https://github.com/ubleipzig/marctools is looking like the most promising 
 candidate so far. The more I read through the recommendations the more it 
 dawned on me that I don't want to have to configure yet another java 
 toolchain (yes I know, that may be personal bias).
 
 Thank you to all who responded about the challenges of authority control in 
 such collections. I'm aware of these issues. The current project is about 
 marshalling resources for editors to make informed decisions about rather 
 than automating the creation of articles, because there is human judgement 
 involved in the last step I can afford to take a few authority control 
 'risks'
 
 cheers
 stuart
 
 --
 I have a new phone number: 04 463 5692
 
 
 From: Code for LibrariesCODE4LIB@LISTSERV.ND.EDU  on behalf of raffaele 
 messutiraffaele.mess...@gmail.com
 Sent: Monday, 3 November 2014 11:39 p.m.
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] MARC reporting engine
 
 Stuart Yeates wrote:
 Do any of these have built-in indexing? 800k records isn't going to fit in 
 memory and if building my own MARC indexer is 'relatively straightforward' 
 then you're a better coder than I am.
 you could try marcdb[1] from marctools[2]
 
 [1] https://github.com/ubleipzig/marctools#marcdb
 [2] https://github.com/ubleipzig/marctools
 
 
 --
 raffaele


Re: [CODE4LIB] Linux distro for librarians

2014-10-21 Thread Owen Stephens
This triggered a memory of a project that was putting together a ready-to-go 
toolset for Digital Humanities - which I then couldn't remember the details of 
- but luckily Twitter was able to remember it for me (thanks to @mackymoo 
https://twitter.com/mackymoo)

The project is DH Box http://dhbox.org which tries to put together an 
environment suitable for DH work. I think that originally this was to be done 
via installation on the user's local machine, but due to the challenges of dealing 
with the variation in local environments they've now moved to a 'box in the cloud' 
approach (the change of direction is noted at 
http://dhbox.commons.gc.cuny.edu/blog/2014/dh-box-new-friend-new-platform#sthash.27THWR6E.dpbs).
To be honest I'm not 100% sure where the project is right now, as it looks like 
not much has been updated since May 2014.

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

 On 21 Oct 2014, at 15:42, Brad Coffield bcoffield.libr...@gmail.com wrote:
 
 Is what you're really after is an environment pre-loaded with useful tools
 for various types of librarians? If so, maybe instead of rolling your own
 distro (and all the work and headache that involves, like a second
 full-time job) maybe create software bundles for linux? Have a website
 where you have lists of software by librarian type. Then make it easy for
 linux users to install them (repo's and what not) ((I haven't been active
 in linux for a while))
 
 Just thinking out loud.
 
 
 -- 
 Brad Coffield, MLIS
 Assistant Information and Web Services Librarian
 Saint Francis University
 814-472-3315
 bcoffi...@francis.edu


Re: [CODE4LIB] ISSN lists?

2014-10-17 Thread Owen Stephens
It may depend on exactly what you need.

- The ISSN Centre offer licensed access to their ISSN portal at a cost 
(http://www.issn.org) - my experience is that this is pretty comprehensive.
- The ISSN Centre also offer a free download of ISSN-L tables (although you have 
to state what you intend to do with it before you can download). This is just 
ISSNs (mapped to their ISSN-Ls), but if you don't need bibliographic details 
then it would be a good source.
- As well as WorldCat you could also try Suncat, which offers a Z39.50 connection 
(http://www.suncat.ac.uk/support/z-target.shtml), but obviously this has the same 
issue as the WorldCat approach.
- GOKb and KB+ are both initiatives trying to build knowledgebases containing 
many ISSNs, with data to be made available under a CC0 declaration. Both of 
these are focussed on describing bundles/packages of journals. GOKb is going to 
be going into preview imminently (http://gokb.org/news) and KB+ already offers 
downloads at http://www.kbplus.ac.uk/kbplus/publicExport. KB+ currently has 
details of around 25k journals.
- There may also be some large-scale open data initiatives that give you a 
reasonably good set of ISSNs, for example the RLUK release of 60m+ records at 
http://www.theeuropeanlibrary.org/tel4/access/data/lod, or the 12 million 
records released by Harvard at http://openmetadata.lib.harvard.edu/bibdata 
(both CC0).

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 17 Oct 2014, at 03:16, Stuart Yeates stuart.yea...@vuw.ac.nz wrote:

 My understanding is that there is no universal ISSN list but that worldcat 
 allows querying of their database by ISSN. 
 
 Which method of sampling the ISSN namespace is going to cause least pain? 
 http://www.worldcat.org/ISSN/ seems to be the one talked about, but is there 
 another that's less resource intensive? Maybe someone's already exported this 
 data?
 
 cheers
 stuart
 --
 I have a new phone number: 04 463 5692


Re: [CODE4LIB] Python or Perl script for reading RDF/XML, Turtle, or N-triples Files

2014-09-30 Thread Owen Stephens
I've not tried using the LCNAF RDF files, and I've not used RDFLib, but a 
couple of things from (a relatively small amount of) experience parsing RDF:

- Don't try to parse the RDF/XML; use n-triples instead.
- As Kyle mentioned, you might want to use command line tools to strip down the 
n-triples so that you only deal with the data you actually want (there's a 
minimal sketch of this idea below).
- Rapper and the Redland RDF libraries are a good place to start, and have 
bindings to Perl, PHP, Python and Ruby (http://librdf.org/raptor/rapper.html 
and http://librdf.org). This StackOverflow Q&A might help getting started: 
http://stackoverflow.com/questions/5678623/how-to-parse-big-datasets-using-rdflib
- If you want to move between RDF formats, an alternative to Rapper is 
http://www.l3s.de/~minack/rdf2rdf/ - this succeeded in converting a file of 48 
million triples in ttl to ntriples where Rapper failed with an 'out of memory' 
error (once in ntriples, Rapper can be used for further parsing).
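
Here's the minimal sketch mentioned above of doing the 'strip it down first' step in 
Python rather than grep/sed - streaming the (potentially multi-gigabyte) .nt file 
line by line and keeping only the triples whose predicate you care about. The 
filenames and predicate URI are illustrative only:

    # n-triples is one triple per line, so a simple substring test per line is
    # enough to cut the file down before loading anything into rdflib.
    WANTED = "<http://www.w3.org/2004/02/skos/core#prefLabel>"

    with open("lcnaf.nt") as src, open("lcnaf-preflabels.nt", "w") as dst:
        for line in src:
            if WANTED in line:
                dst.write(line)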


Some slightly random advice there, but maybe some of it will be useful!

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 30 Sep 2014, at 15:54, Jeremy Nelson jeremy.nel...@coloradocollege.edu 
wrote:

 Hi Jean,
 I've found rdflib (https://github.com/RDFLib/rdflib) on the Python side 
 exceeding simple to work with and use. For example, to load the current 
 BIBFRAME vocabulary as an RDF graph using a Python shell:
 
 import rdflib
 bf_vocab = rdflib.Graph().parse('http://bibframe.org/vocab/')
 len(bf_vocab) # Total number of triples
 1683
 set([s for s in bf_vocab]) # A set of all unique subjects in the graph
 
 
 This module offers RDF/XML, Turtle, or N-triples support and with various 
 options for retrieving and manipulating the graph's subjects, predicate, and 
 objects. I would advise installing the JSON-LD 
 (https://github.com/RDFLib/rdflib-jsonld) extension as well.
 
 Jeremy Nelson
 Metadata and Systems Librarian
 Colorado College
 
 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Jean 
 Roth
 Sent: Tuesday, September 30, 2014 8:14 AM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: [CODE4LIB] Python or Perl script for reading RDF/XML, Turtle, or 
 N-triples Files
 
 Thank you so much for the reply.
 
 I have not investigated the LCNAF data set thoroughly.  However, my 
 default/ideal is to read in all variables from a dataset.  
 
 So, I was wondering if any one had an example Python or Perl script for 
 reading RDF/XML, Turtle, or N-triples file.  A simple/partial example would 
 be fine.
 
 Thanks,
 
 Jean
 
 On Mon, 29 Sep 2014, Kyle Banerjee wrote:
 
 KB The best way to handle them depends on what you want to do. You need 
 KB to actually download the NAF files rather than countries or other 
 KB small files as different kinds of data will be organized 
 KB differently. Just don't try to read multigigabyte files in a text 
 KB editor :)
 KB 
 KB If you start with one of the giant XML files, the first thing you'll 
 KB probably want to do is extract just the elements that are 
 KB interesting to you. A short string parsing or SAX routine in your 
 KB language of choice should let you get the information in a format you 
 like.
 KB 
 KB If you download the linked data files and you're interested in 
 KB actual headings (as opposed to traversing relationships), grep and 
 KB sed in combination with the join utility are handy for extracting 
 KB the elements you want and flattening the relationships into 
 KB something more convenient to work with. But there are plenty of other 
 tools that you could also use.
 KB 
 KB If you don't already have a convenient environment to work on, I'm a  
 KB fan of virtualbox. You can drag and drop things into and out of your 
 KB regular desktop or even access it directly. That way you can 
 KB view/manipulate files with the linux utilities without having to 
 KB deal with a bunch of clunky file transfer operations involving 
 KB another machine. Very handy for when you have to deal with multigigabyte 
 files.
 KB 
 KB kyle
 KB 
 KB On Mon, Sep 29, 2014 at 11:19 AM, Jean Roth jr...@nber.org wrote:
 KB 
 KB  Thank you!  It looks like the files are available as  RDF/XML, 
 KB  Turtle, or N-triples files.
 KB 
 KB  Any examples or suggestions for reading any of these formats?
 KB 
 KB  The MARC Countries file is small, 31-79 kb.  I assume a script 
 KB  that would read a small file like that would at least be a start 
 KB  for the LCNAF
 KB 
 KB 
 KB 


Re: [CODE4LIB] IFTTT and barcodes

2014-09-11 Thread Owen Stephens
As noted by Tara, when using IFTTT (or similar tools like Bip.io and WappWolf) 
you are limited to the channels/services the tool has already integrated. You 
are also in the position of having to give a third party service access to 
personal information and the ability to read/write certain services.

I was investigating these types of services very briefly for a recent workshop 
and I came across an open source alternative called Huginn which you can run on 
your own server and of course can extend to work with whatever 
services/channels you want. I thought it looked interesting - available from 
https://github.com/cantino/huginn

Overkill for this particular problem but may be of more general interest

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 11 Sep 2014, at 08:21, Sylvain Machefert smachef...@u-bordeaux3.fr wrote:

 Hello,
 maybe an easier solution, more IFTTT-related, would be to develop a 
 Yahoo pipe, using the ISBN & querying the webpac should be easy for Yahoo 
 Pipes, you can then search in the page using xpath or things like that. Should 
 be easier than developing a custom script (if you have no development 
 knowledge, otherwise it should be scripted easily in PHP, python, whatever).
 
 I haven't used YPipes in a long time but I think it's worth looking at it.
 
 Sylvain
 
 
 Le 10/09/2014 21:48, Ian Walls a écrit :
 I don't think IFTTT is the right tool, but the basic idea is sound.
 
 With a spot of custom scripting on some server somewhere, one could take in
 an ISBN, lookup via the III WebPac, assess eligibility conditions, then
 return yes or no.  Barcode Scanner on Android has the ability to do custom
 search URLs, so if your yes/no script can accept URL params, then you should
 be all set.
 
 Barring a script, just a lookup of the MARC record may be possible, and if
 it was styled in a mobile-friendly manner, perhaps you could quickly glean
 whether it's okay or not for copy cataloging.
 
 Side question: is there connectivity in the stacks for doing this kind of
 lookup?  I know in my library, that's not always the case.
 
 
 -Ian
 
 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
 Riley Childs
 Sent: Wednesday, September 10, 2014 3:31 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] IFTTT and barcodes
 
 Webhooks via the WordPress channel?
 
 Riley Childs
 Senior
 Charlotte United Christian Academy
 Library Services Administrator
 IT Services
 (704) 497-2086
 rileychilds.net
 @rowdychildren
 
 From: Tara Robertsonmailto:trobert...@langara.bc.ca
 Sent: ‎9/‎10/‎2014 3:03 PM
 To: CODE4LIB@LISTSERV.ND.EDUmailto:CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] IFTTT and barcodes
 
 Hi,
 
 I don't think this is possible using IFTTT right now, as the necessary channels
 don't exist to create a recipe. I'm trying to think of what those channels
 would be and can't quite...I don't think IFTTT is the best tool for this
 task.
 
 What ILS are you using? Could you hook a barcode scanner up to a tablet and
 scan, then check the MARC...nah, that's seeming almost as time consuming as
 taking it to your desk (depending on how far your desk is).
 I recall at an Evergreen hackfest that someone was tweaking the web
 interface for an inventory type exercise, where it would show red or green
 depending on some condition.
 
 Cheers,
 Tara
 
 On 10/09/2014 11:52 AM, Harper, Cynthia wrote:
 Now that someone has mentioned IFTTT, I'm reading up on it and wonder if
 it could make this task possible:
 One of my tasks is copy cataloging. I'm only authorized to do LC copy,
 which involves opening the record (already downloaded in the acq process),
 and checking to see that 490 doesn't exist (I can't handle series), and
 looking for DLC in the 040 |a and |c.
 It's discouraging when I take 10 books back to my desk from the cataloging
 shelf, and all 10 are not eligible for copy cataloging.
 S...  could I take my phone to the cataloging shelf, use IFTTT to scan
 my ISBN, search in the III Webpac, look at the MARC record and tell me
 whether it's LC copy?
 Empower the frontline workers! :)
 
 Cindy Harper
 Electronic Services and Serials Librarian Virginia Theological
 Seminary
 3737 Seminary Road
 Alexandria VA 22304
 703-461-1794
 char...@vts.edu
 
 --
 
 Tara Robertson
 
 Accessibility Librarian, CAPER-BC http://caperbc.ca/ T  604.323.5254 F
 604.323.5954 trobert...@langara.bc.ca
 mailto:tara%20robertson%20%3ctrobert...@langara.bc.ca%3E
 
 Langara. http://www.langara.bc.ca
 
 100 West 49th Avenue, Vancouver, BC, V5Y 2Z6


Re: [CODE4LIB] Automated searching of Copac/Worldcat

2014-08-13 Thread Owen Stephens
The worksheets I circulated earlier in the week include examples of how to take 
a list of ISBNs from a spreadsheet/csv file and search on Worldcat (see the 
'Automated Love Examples' docs in http://bit.ly/automatedlovefolder)
What these examples don't do is show how to check the outcome of the search 
automatically and record that.

I think it would be relatively easy to extend the iMacros example to extract a 
hit count / 'no hits' message and write this to a file using the iMacros SAVEAS 
command, but I haven't tried this. For a 'no results' check you'd want to look 
for the presence of (or extract the contents of) a div with id=div-results-none. 
For a results count you'd want to look for the contents of a table within 
the div with class=resultsinfo.

Alternatively you could look at the Selenium IDE extension for Firefox, which is 
more complex but allows a more sophisticated approach to checking and writing out 
information about text that is present or absent in the web pages retrieved.
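
If you're comfortable scripting rather than recording, the same check is only a 
few lines with Selenium WebDriver in Python - a rough sketch (the WorldCat search 
URL pattern and the file names are assumptions; the div id is the one mentioned 
above):

import csv
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
with open('isbns.csv') as src, open('results.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for row in csv.reader(src):
        isbn = row[0]
        # the search URL is an assumption - adjust to the search you actually use
        driver.get('https://www.worldcat.org/search?q=bn%3A' + isbn)
        no_hits = bool(driver.find_elements(By.ID, 'div-results-none'))
        writer.writerow([isbn, 'no match' if no_hits else 'match'])
driver.quit()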

Hope that is of some help

Owen



Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 13 Aug 2014, at 11:20, Nicholas Brown nbr...@iniva.org wrote:

 Apologies for cross posting
 
 Dear collective wisdom,
 
 I'm interested in using automation software such as Macro Express or iMacros 
 to feed a list of ISBNs from a spreadsheet into Copac or Worldcat and output 
 a list of those that return no matches in the results screen. The idea would 
 be to create a tool that can quickly, although rather roughly, identify rare 
 items in a collection (though obviously this would be limited to items with 
 ISBNs or other unique identifiers). I can write a macro which will 
 sequentially search either catalogue for a list of ISBNs but am struggling 
 with how to have the macro identify items with no matches (I have a vague 
 idea about searching the results screen for the text "Sorry, there are no 
 search results") and to compile them back into a spreadsheet.
 
 I'd be keen to hear if anyone has attempted something similar, general 
 advice, any potential pitfalls in the method outlined above or suggestions 
 for a better way to achieve the same results. If something useful comes of it 
 I'd be happy to share the results. 
 
 Many thanks for your help,
 Nick 
 
 Nicholas Brown
 Library and Information Manager
 nbr...@iniva.org
 +44 (0)20 7749  1125
 www.iniva.org


[CODE4LIB] Automation tools - session at the Pi and Mash unconference

2014-08-11 Thread Owen Stephens
Dear all,

A month or so ago I asked for recommendations for automation tools that people 
used in libraries to help inform a session I was going to run. The unconference 
event (Pi and Mash) ran this weekend, and I just wanted to share the materials 
I wrote for the session in case they are of any help. The materials consist of 
a slidedeck called Automated Love Presentation (available as Keynote, 
Powerpoint and PDF) and some examples and exercises you can work through in a 
document called Automated Love Examples (available as Pages, Word doc, PDF 
and ePub). There are also two accompanying files 'ISBNs.xlsx' and 'isbns.csv' 
which are used in the examples/exercises.

All materials are available at http://bit.ly/automatedlovefolder

Thanks to all who made suggestions which contributed towards the session.

Best wishes,

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936


Re: [CODE4LIB] 'automation' tools

2014-07-08 Thread Owen Stephens
Thanks again all,

I love OpenRefine - I've been working on the GOKb project (http://gokb.org) 
where K-Int (a UK based company) have developed an extension for OpenRefine 
which helps with the cleaning of data about electronic resources (esp. 
journals) from publishers and then pushes it into the GOKb database. The 
extension is fully integrated into the GOKb database but if anyone wants a look 
code is at https://github.com/k-int/gokb-phase1/tree/dev/refine. The extension 
checks the data and reports errors as well as offering ways of fixing common 
issues - there's more on the wiki 
https://wiki.kuali.org/display/OLE/OpenRefine+How-Tos

I did pitch an OpenRefine workshop for the same event as a 'data 
wrangling/cleaning' tool but the 'automation' session got the vote in the end - 
although there is definitely overlap. However I am delivering an OpenRefine 
workshop at the British Library next week - and great to see it is getting used 
across libraries.

The Google Doc Spreadsheets is also a great tip - I've run a course at the 
British Library which uses this to introduce the concept of APIs to 
non-techies. I blogged the original tutorial at 
http://www.meanboyfriend.com/overdue_ideas/2013/02/introduction-to-apis/ but a 
change to the BL open data platform means this no longer works :((

Thanks all again - I'll be trying to put stuff from the automation workshop 
online at some point and I'll post here when there is something up.

Best wishes,

Owen


Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 8 Jul 2014, at 03:52, davesgonechina davesgonech...@gmail.com wrote:

 +1 to OpenRefine. Some extensions, like RDF Refine http://refine.deri.ie/,
 currently only work with the old Google Refine (still available here
 https://code.google.com/p/google-refine/). There's a good deal of
 interesting projects for OpenRefine on GitHub and GitHub Gist.
 
 Google Docs Spreadsheets also has a surprising amount of functionality,
 such as importXML if you're willing to get your hands dirty with regular
 expressions.
 
 Dave
 
 
 On Tue, Jul 8, 2014 at 3:12 AM, Tillman, Ruth K. (GSFC-272.0)[CADENCE GROUP
 ASSOC] ruth.k.till...@nasa.gov wrote:
 
 Definite cosign on Open Refine. It's intuitive and spreadsheet-like enough
 that a lot of people can understand it. You can do anything from
 standardizing state names you get from a patron form to normalizing
 metadata keywords for a database, so I think it'd be useful even for
 non-techies.
 
 Ruth Kitchin Tillman
 Metadata Librarian, Cadence Group
 NASA Goddard Space Flight Center Library, Code 272
 Greenbelt, MD 20771
 Goddard Library Repository: http://gsfcir.gsfc.nasa.gov/
 301.286.6246
 
 
 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
 Terry Brady
 Sent: Monday, July 07, 2014 1:35 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] 'automation' tools
 
 I learned about Open Refine http://openrefine.org/ at the Code4Lib
 conference, and it looks like it would be a great tool for normalizing
 data.  I worked on a few projects in the past in which this would have been
 very helpful.
 


Re: [CODE4LIB] 'automation' tools

2014-07-07 Thread Owen Stephens
Thanks Riley and Andrew for these pointers - some great stuff in there

Other tools and examples still very welcome :)

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 4 Jul 2014, at 15:04, Andrew Weidner metaweid...@gmail.com wrote:

 Great idea for a workshop, Owen.
 
 My staff and I use AutoHotkey every day. We have some apps for data
 cleaning in the CONTENTdm Project Client that I presented on recently:
 http://scholarcommons.sc.edu/cdmusers/cdmusersMay2014/May2014/13/. I'll be
 talking about those in more detail at the Upper Midwest Digital Collections
 Conference http://www.wils.org/news-events/wilsevents/umdcc/ if anyone is
 interested.
 
 I did an in-house training session for our ILS and database management
 folks on a simple AHK app that they now use for repetitive data entry:
 https://github.com/metaweidner/AutoType. When I was working with digital
 newspapers I developed a suite of tools for making repetitive quality
 review tasks easier: https://github.com/drewhop/AutoHotkey/wiki/NDNP_QR
 
 Basic AHK scripts are really great for text wrangling. Just yesterday I
 wrote a script to grab some values from a spreadsheet, remove commas from
 the numbers, and dump them into a tab delimited file in the format that we
 need. That script will become part of our regular workflow. Wrote another
 one-off script to transform labels on our wiki into links. It wrapped the
 labels in the wiki link syntax, and then I copied and pasted the unique
 URLs into the appropriate spots.
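 
 (For comparison, the comma-stripping step above is only a few lines in Python 
 too - a sketch, with made-up file names, that strips commas from every cell, 
 which is cruder than the real script:)
 
 import csv
 
 with open('export.csv', newline='') as src, \
         open('cleaned.tsv', 'w', newline='') as dest:
     writer = csv.writer(dest, delimiter='\t')
     for row in csv.reader(src):
         # strip thousands separators so the numbers import cleanly
         writer.writerow([cell.replace(',', '') for cell in row])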
 
 It's also useful for keeping things organized. I have a set of scripts that
 open up frequently used network drive folders and applications, and I
 packaged them as drop down menu choices in a little GUI that's always open
 on the desktop. We have a few search scripts that either grab values from a
 spreadsheet or input box and then run a search for those terms in a web
 database (e.g. id.loc.gov).
 
 You might check out Selenium IDE for working with web forms:
 http://docs.seleniumhq.org/projects/ide/. The recording feature makes it
 really easy to get started with as an automation tool. I've used it
 extensively for automated metadata editing:
 http://digital.library.unt.edu/ark:/67531/metadc86138/m1/1/
 
 Cheers!
 
 Andrew
 
 
 On Fri, Jul 4, 2014 at 6:54 AM, Riley Childs ri...@tfsgeo.com wrote:
 
 Don't forget AutoIT (auto IT, pretty clever eh?)
 http://www.autoitscript.com/site/autoit/
 
 Riley Childs
 Student
 Asst. Head of IT Services
 Charlotte United Christian Academy
 (704) 497-2086
 RileyChilds.net
 Sent from my Windows Phone, please excuse mistakes
 
 -Original Message-
 From: Owen Stephens o...@ostephens.com
 Sent: ‎7/‎4/‎2014 4:55 AM
 To: CODE4LIB@LISTSERV.ND.EDU CODE4LIB@LISTSERV.ND.EDU
 Subject: [CODE4LIB] 'automation' tools
 
 I'm doing a workshop in the UK at a library tech unconference-style event
 (Pi and Mash http://piandmash.info) on automating computer based tasks.
 I want to cover tools that are usable by non-programmers and that would
 work in a typical library environment. The types of tools I'm thinking of
 are:
 
 MacroExpress
 AutoHotKey
 iMacros for Firefox
 
 While I'm hoping workshop attendees will bring ideas about tasks they
 would like to automate the type of thing I have in mind are things like:
 
 Filling out a set of standard data on a GUI or Web form (e.g. standard set
 of budget codes for an order)
 Processing a list of item barcodes from a spreadsheet and doing something
 with them on the library system (e.g. change loan status, check for holds)
 Similarly for User IDs
 Navigating to a web page and doing some task
 
 Clearly some of these tasks would be better automated with appropriate
 APIs and scripts, but I want to try to introduce those without programming
 skills to some of the concepts and tools and essentially how they can work
 around problems themselves to some extent.
 
 What tools do you use for this kind of automation task, and what kind of
 tasks do they best deal with?
 
 Thanks,
 
 Owen
 
 Owen Stephens
 Owen Stephens Consulting
 Web: http://www.ostephens.com
 Email: o...@ostephens.com
 Telephone: 0121 288 6936
 


Re: [CODE4LIB] coders who library? [was: Let me shadow you, librarians who code!]

2014-07-07 Thread Owen Stephens
I'm a librarian first, and a slightly poor excuse for a coder second. I've always 
focussed on the IT/tech side of librarianship in my career and did at one point 
cross from libraries into more general IT management - then firmly put myself 
back into libraries. To a certain extent I left library employment to freelance 
as a consultant to get out of the academic library career path that kept taking 
me into management - which I realised, after several years doing it, was just 
not what got me out of bed in the morning.

There is a name for people without an MLS who can still quote MARC subfields or 
write MODS XML freehand. http://shambrarian.org :)


Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 7 Jul 2014, at 15:36, Miles Fidelman mfidel...@meetinghouse.net wrote:

 This recent spate of message leads me to wonder: How many folks here who 
 code for libraries have a library science degree/background, vs. folks who 
 come from other backgrounds?  What about folks who end up in technology 
 management/direction positions for libraries?
 
 Personally: Computer scientist and systems engineer, did some early 
 Internet-in-public library deployments, got to write a book about it.  Not 
 actively doing library related work at the moment.
 
 Miles Fidelman
 
 
 Dot Porter wrote:
 I'm a medieval manuscripts curator who codes, in Philadelphia, and I'd be
 happy to talk to you as well.
 
 Dot
 
 
 On Tue, Jul 1, 2014 at 10:30 AM, David Mayo pobo...@gmail.com wrote:
 
 If you'd like to talk to someone who did a library degree, and currently
 works as a web developer supporting an academic library, I'd be happy to
 talk with you.
 
 - Dave Mayo
   Software Engineer @ Harvard  HUIT  LTS
 
 
 On Tue, Jul 1, 2014 at 10:12 AM, Steven Anderson 
 stevencander...@hotmail.com wrote:
 
 Jennie,
 As with others, I'm not a librarian as I lack a library degree, but I do
 Digital Repository Development for the Boston Public Library
 (specifically:
 https://www.digitalcommonwealth.org/). Feel free to let me know you want
 to chat for your masters paper.
 Sincerely,Steven AndersonWeb Services - Digital Library Repository
 developer617-859-2393sander...@bpl.org
 
 Date: Tue, 1 Jul 2014 13:51:07 +
 From: mschofi...@nova.edu
 Subject: Re: [CODE4LIB] Let me shadow you, librarians who code!
 To: CODE4LIB@LISTSERV.ND.EDU
 
 Hey Jennie,
 
 I'm waaay south of MA but I'm pretty addicted to talking about coding
 as
 a library job O_o. If you are still in want of guinea-pigs, I'd love to
 skype / hangout.
 Michael Schofield
 // mschofi...@nova.edu
 
 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf
 Of
 Jennie Rose Halperin
 Sent: Monday, June 30, 2014 3:58 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: [CODE4LIB] Let me shadow you, librarians who code!
 
 hey Code4Lib,
 
 Do you work in a library and also like coding?  Do you do coding as
 part
 of your job?
 I'm writing my masters paper for the University of North Carolina at
 Chapel Hill and I'd like to shadow and interview up to 10 librarians and
 archivists who also work with code in some way in the Boston area for the
 next two weeks.
 I'd come by and chat for about 2 hours, and the whole thing will not
 take up too much of your time.
 Not in Massachusetts?  Want to skype? Let me know and that would be
 possible.
 I know that this list has a pretty big North American presence, but I
 will be in Berlin beginning July 14, and could potentially shadow anyone
 in
 Germany as well.
 Best,
 
 Jennie Rose Halperin
 jennie.halpe...@gmail.com
 
 
 
 
 
 -- 
 In theory, there is no difference between theory and practice.
 In practice, there is.    Yogi Berra


[CODE4LIB] 'automation' tools

2014-07-04 Thread Owen Stephens
I'm doing a workshop in the UK at a library tech unconference-style event (Pi 
and Mash http://piandmash.info) on automating computer based tasks.
I want to cover tools that are usable by non-programmers and that would work in 
a typical library environment. The types of tools I'm thinking of are:

MacroExpress
AutoHotKey
iMacros for Firefox

While I'm hoping workshop attendees will bring ideas about tasks they would 
like to automate the type of thing I have in mind are things like:

Filling out a set of standard data on a GUI or Web form (e.g. standard set of 
budget codes for an order)
Processing a list of item barcodes from a spreadsheet and doing something with 
them on the library system (e.g. change loan status, check for holds)
Similarly for User IDs
Navigating to a web page and doing some task 

Clearly some of these tasks would be better automated with appropriate APIs and 
scripts, but I want to try to introduce those without programming skills to 
some of the concepts and tools and essentially how they can work around 
problems themselves to some extent.

What tools do you use for this kind of automation task, and what kind of tasks 
do they best deal with?

Thanks,

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936


Re: [CODE4LIB] Is ISNI / ISO 27729:2012 a name identifier or an entity identifier?

2014-06-19 Thread Owen Stephens
An aside but interesting to see how some of this identity stuff seems to be 
playing out in the wild now. Google for Catherine Sefton:

https://www.google.co.uk/search?q=catherine+sefton

The Knowledge Graph displays information about Martin Waddell. Catherine Sefton 
is a pseudonym of Martin Waddell. It is impossible to know, but the most likely 
source of this knowledge is Wikipedia which includes the ISNI for Catherine 
Sefton in the Wikipedia page for Martin Waddell 
(http://en.wikipedia.org/wiki/Martin_Waddell) (although oddly not the ISNI for 
Martin Waddell under his own name).

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 18 Jun 2014, at 23:28, Stuart Yeates stuart.yea...@vuw.ac.nz wrote:

 My reading of that suggests that 
 http://isni-url.oclc.nl/isni/000122816316 shouldn't have both "Bell, 
 Currer" and "Brontë, Charlotte", which it clearly does...
 
 Is this a case of one of our sources of truth not distinguishing between 
 identities and entities, and of us allowing it to pollute our data?
 
 If that source of truth is wikipedia, we can fix that.
 
 cheers
 stuart
 
 On 06/19/2014 12:11 AM, Richard Wallis wrote:
 Hi all,
 
 Seeing this thread I checked with the ISNI team and got the following
 answer from Janifer Gatenby who asked me to post it on her behalf:
 
 ISNI identifies “public identities”. The scope as stated in the standard is
 
 
 
 “This International Standard specifies the International Standard name
 identifier (ISNI) for the identification of public identities of parties;
 that is, the identities used publicly by parties involved throughout the
 media content industries in the creation, production, management, and
 content distribution chains.”
 
 
 
 The relevant definitions are:
 
 
 
 *3.1*
 
 *party*
 
 natural person or legal person, whether or not incorporated, or a group of
 either
 
 *3.3*
 
 *public identity*
 
 Identity of a *party *(3.1) or a fictional character that is or was
 presented to the public
 
 *3.4*
 
 *name*
 
 character string by which a *public identity *(3.3) is or was commonly
 referenced
 
 
 
 A party may have multiple public identities and a public identity may have
 multiple names (e.g. pseudonyms)
 
 
 
 ISNI data is available as linked data.  There are currently 8 million ISNIs
 assigned and 16 million links.
 
 
 
 Example:
 
 
 
 [image: image001.png]
 
 ~Richard.
 
 
 On 16 June 2014 10:54, Ben Companjen ben.compan...@dans.knaw.nl wrote:
 
 Hi Stuart,
 
 I don't have a copy of the official standard, but from the documents on
 the ISNI website I remember that there are name variations and 'public
 identities' (as the lemma on Wikipedia also uses). I'm not sure where the
 borderline is or who decides when different names are different identities.
 
 If it were up to me: pseudonyms are definitely different public
 identities, name changes after marriage probably not, name change after
 gender change could mean a different public identity. Different public
 identities get different ISNIs; the ISNI organisation says the ISNI system
 can keep track of connected public identities.
 
 Discussions about name variations or aliases are not new, of course. I
 remember the discussions about 'aliases' vs 'Artist Name Variations' that
 are/were happening on Discogs.com, e.g. 'is J Dilla an alias or an ANV of
 Jay Dee?' It appears the users on Discogs finally went with aliases, but
 VIAF put the names/identities together: http://viaf.org/viaf/32244000 -
 and there is no ISNI (yet).
 
 It gets more confusing when you look at Washington Irving who had several
 pseudonyms: they are just listed under one ISNI. Maybe because he is dead,
 or because all other databases already know and connected the pseudonyms
 to the birth name? (I just sent a comment asking about the record at
 http://isni.org/isni/000121370797 )
 
 
 [Here goes the reference list…]
 
 Hope this helps :)
 
 Groeten van Ben
 
 On 15-06-14 23:11, Stuart Yeates stuart.yea...@vuw.ac.nz wrote:
 
 Could someone with access to the official text of ISO 27729:2012 tell me
 whether an ISNI is a name identifier or an entity identifier? That is,
 if someone changes their name (adopts a pseudonym, changes their name due 
 to marriage, transitions gender, etc.), should they be assigned a new
 identifier?
 
 If the answer is 'No' why is this called a 'name identifier'?
 
 Ideally someone with access to the official text would update the
 article at
 https://en.wikipedia.org/wiki/International_Standard_Name_Identifier
 With a brief quote referenced to the standard with a page number.
 
 [The context of this is ORCID, which is being touted as an entity
 identifier, while not being clear on whether it's a name or entity
 identifier.]
 
 cheers
 stuart
 
 
 
 


Re: [CODE4LIB] Any good introduction to SPARQL workshops out there?

2014-05-01 Thread Owen Stephens
I contributed to a session like this in the UK aimed at cataloguers/metadata 
librarians 
http://www.cilip.org.uk/cataloguing-and-indexing-group/events/linked-data-what-cataloguers-need-know-cig-event.
All the slide decks used are available at 
http://www.cilip.org.uk/cataloguing-and-indexing-group/linked-data-what-cataloguers-need-know
Specifically my introduction to SPARQL slides are at 
http://www.slideshare.net/ostephens/selecting-with-sparql-using-british-national-bibliography-as,
 and link to various example SPARQL queries that can be run on the BNB SPARQL 
endpoint (SPARQL examples are all Gists at https://gist.github.com/ostephens)
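
If you want to run queries from a script rather than through the endpoint's web 
form, here's a minimal Python sketch using SPARQLWrapper - note the endpoint URL 
is the BNB one as I remember it (check the slides for the current address), and 
the query is just a generic "what classes are in here" example rather than one 
of the linked Gists:

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper('http://bnb.data.bl.uk/sparql')  # endpoint URL: assumption
sparql.setQuery("""
    SELECT ?type (COUNT(?s) AS ?n)
    WHERE { ?s a ?type }
    GROUP BY ?type
    ORDER BY DESC(?n)
    LIMIT 10
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results['results']['bindings']:
    print(row['n']['value'], row['type']['value'])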

Not sure about the practicalities of bringing this to staff in the US, although 
planning is in progress for another event in the UK along the same lines and 
I'd be happy to put you in touch with the relevant people on the committee to 
see if there is any possibility of having it webcast if there was interest.

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 1 May 2014, at 17:23, Hutt, Arwen ah...@ucsd.edu wrote:

 We're interested in an introduction to SPARQL workshop for a smallish group 
 of staff.  Specifically an introduction for fairly tech comfortable 
 non-programmers (in our case metadata librarians), as well as a refresher for 
 programmers who aren't using it regularly.
 
 Ideally (depending on cost) we'd like to bring the workshop to our staff, 
 since it'll allow more people to attend, but any recommendations for good 
 introductory workshops or tutorials would be welcome!
 
 Thanks!
 Arwen
 
 
 Arwen Hutt
 Head, Digital Object Metadata Management Unit
 Metadata Services, Geisel Library
 University of California, San Diego
 


Re: [CODE4LIB] barriers to open metadata?

2014-04-30 Thread Owen Stephens
Hi Laura,

I've done some work on this in the UK[1][2] and there have been a number of 
associated projects looking at the open release of library, archive and museum 
metadata[3].

For libraries (it is different of archives and museums) I think I'd sum up the 
reasons in three ways - in order of how commonly I think they apply

a. Ignorance/lack of thought - libraries don't tend to licence their metadata, 
and often make no statement about how it can be used - my experience is that 
often no-one has even asked the questions about licencing/data release
b. No business case - in the UK we talked to a group of university librarians 
and found that they didn't see a compelling business case for making open data 
releases of their catalogue records
c. Concern about breaking contractual agreements or impinging on 3rd party 
copyright over records. The Comet project at the University of Cambridge did a 
lot of work in this area[4]

As Roy notes, there have been some significant changes recently with OCLC and 
many national libraries releasing data under open licences. However, while this 
changes (c) it doesn't impact so much on (a) and (b) - so these remain as 
fundamental issues and I have a (unsubstantiated) concern that big data 
releases lead to libraries taking less interest ("someone else is doing this 
for us") rather than taking advantage of the clarity and openness these big data 
releases and associated announcements bring.

A final point - looking at libraries behaviour in relation to 
institutional/open access repositories, where you'd expect at least (a) to be 
considered, unfortunately when I looked a couple of years ago I found similar 
issues. Working for the CORE project at the Open University[5] I found that 
OpenDOAR[6] listed Metadata re-use policy explicitly undefined for 57 out of 
125 UK repositories with OAI-PMH services. Only 18 repositories were listed as 
permitting commerical re-use of metadata. Hopefully this has improved in the 
intervening 2 years!

Hope some of this is helpful

Owen

1 Jisc Guide to Open Bibliographic Data http://obd.jisc.ac.uk
2 Jisc Discovery principles http://discovery.ac.uk/businesscase/principles/
3 Jisc Discovery Case studies http://guidance.discovery.ac.uk
4 COMET  http://cul-comet.blogspot.co.uk/p/ownership-of-marc-21-records.html
5 CORE blog http://core-project.kmi.open.ac.uk/node/32
6 OpenDOAR http://www.opendoar.org/

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 29 Apr 2014, at 21:06, Ben Companjen ben.compan...@dans.knaw.nl wrote:

 Hi Laura,
 
 Here are some reasons I may have overheard.
 
 Stuck halfway: We have an OAI-PMH endpoint, so we're open, right?
 
 Lack of funding for sorting out our own rights: We gathered metadata from
 various sources and integrated the result - we even call ourselves Open
 L*y - but we [don't have manpower to figure out what we can do with
 it, so we added a disclaimer].
 
 Cultural: We're not sure how to prevent losing the records' provenance
 after we released our metadata.
 
 
 Groeten van Ben
 
 On 29-04-14 19:02, Laura Krier laura.kr...@gmail.com wrote:
 
 Hi Code4Libbers,
 
 I'd like to find out from as many people as are interested what barriers
 you feel exist right now to you releasing your library's bibliographic
 metadata openly. I'm curious about all kinds of barriers: technical,
 political, financial, cultural. Even if it seems obvious, I'd like to hear
 about it.
 
 Thanks in advance for your feedback! You can send it to me privately if
 you'd prefer.
 
 Laura
 
 -- 
 Laura Krier
 
 laurapants.comhttp://laurapants.com/?utm_source=email_sigutm_medium=emai
 lutm_campaign=email


Re: [CODE4LIB] distributed responsibility for web content

2014-04-18 Thread Owen Stephens
I'd second the suggestions from Erin with regard to establishing style guides and 
Ross's suggestion of peer review. While not quite directly about the issue you 
have, Paul Boag, a UK web designer, has spoken and blogged on how relying on 
quantitative measures can help establish clear policies and 
(perhaps!) take some of the emotion out of decision making - e.g. see 
http://boagworld.com/business-strategy/website-animal/ - perhaps a similar 
approach might help here as well.

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 18 Apr 2014, at 15:15, Erin White erwh...@vcu.edu wrote:

 Develop a brief content and design style guide, then have it approved by
 your leadership team and share it with your organization. (Easier
 said than done, I know.) Bonus points if you work with your (typically)
 print-focused communications person to develop this guide and get his/her
 buy-in on creating content for the web.
 
 A style guide sets expectations across the board and helps you when you
 need to play the heavy. As you need, you can e-mail folks with a link to
 the style guide, ask them to revise, and offer assistance or suggestions if
 they want.
 
 Folks are grumpy about this at first, but generally appreciate the overall
 strategy to make the website more consistent and professional-looking. It
 ain't the wild wild west anymore - our web content is both functional and
 part of an overall communications strategy, and we need to treat it
 accordingly.
 
 --
 Erin White
 Web Systems Librarian, VCU Libraries
 804-827-3552 | erwh...@vcu.edu | www.library.vcu.edu
 
 
 On Fri, Apr 18, 2014 at 9:39 AM, Pikas, Christina K. 
 christina.pi...@jhuapl.edu wrote:
 
 Laughing and feeling your pain... we have a communications person (that's
 her job) who keeps using bold, italics, h1, in pink (yes pink), randomly in
 pages... luckily she only does internal pages, and not external.
 
 You could schedule some writing for the web sessions, but I don't know
 that it will help. You could remove any text formatting... In the end, you
 probably should just do as I do: close the page, breathe deeply, get up and
 take a walk, and get on with other things.
 
 Christina
 
 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@listserv.nd.edu] On Behalf Of
 Simon LeFranc
 Sent: Thursday, April 17, 2014 7:43 PM
 To: CODE4LIB@listserv.nd.edu
 Subject: [CODE4LIB] distributed responsibility for web content
 
 My organization has recently adopted an enterprise Content Management
 System. For the first time, staff across 8 divisions became web authors,
 given responsibility for their division's web pages. Training on the
 software, which has a WYSIWYG interface for editing, is available and with
 practice, all are capable of mastering the basic tools. Some simple style
 decisions were made for them, however, it is extremely difficult to get
 these folks not to elaborate on or improvise new styles.  Examples:
 
 1. making text red or another color in the belief that color will draw readers' attention
 2. making text bold and/or italic and/or the size of a war-is-declared headline (see 1)
 3. using images that are too small to be effective
 4. adding a few more images that are too small to be effective
 5. attempting to emphasize statements using ! or !! or !
 6. writing in a too-informal tone (Come on in outta the rain!) [We are a research organization and museum.]
 7. feeling compelled to ornament pages with clipart, curlicues, et al.
 8. centering everything
 There is no one person in the organization with the time or authority to
 act as editorial overseer. What are some techniques for ensuring that the
 site maintains a clean, professional appearance?
 
 Simon
 
 
 


[CODE4LIB] Research Libraries UK Hack day

2014-04-04 Thread Owen Stephens
Just over a year and a half ago I posted about some work I was doing on behalf 
of Research Libraries UK (RLUK) who were looking at the potential of publishing 
several million of their bibliographic records (drawn from the major research 
libraries in the UK) as linked open data. In August last year RLUK announced it 
would join The European Library (TEL)[1], and would work with the team at TEL 
to publish RLUK data, along with other data held by The European Library, as 
linked open data. I'm happy to say that they are now very close to making the 
(approximately) 17 million RLUK records available. 

To start the process of working with the wider community of librarians, 
developers, and anyone interested in exploiting this data, RLUK is holding a 
hack day in London on 14th May. Here the RLUK Linked Open Data will be 
introduced, along with the TEL API (OpenSearch based). There will be prizes (to 
be announced) for hacks in the following areas which represent areas of 
interest to RLUK and TEL:

• Linking Up datasets - a prize for work that combines data from 
multiple data sets
• WWI 
• Eastern Europe
• Delivering a valuable hack for RLUK members

The event is free and you can sign up now at 
https://www.eventbrite.co.uk/e/rluk-hack-day-rlukhack-tickets-11197529111 - I 
hope to see some of you there

Best wishes

Owen

1. http://www.rluk.ac.uk/news/rluk-joins-european-library/

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936


Re: [CODE4LIB] semantic web browsers

2014-03-22 Thread Owen Stephens
Your findings reflect my experience - there isn't much out there, and what there 
is is basic or doesn't work at all.
Link Sailor (http://linksailor.com) is another, but I suspect it is not actively 
maintained (it was developed by Ian Davis when he was at Talis doing linked data work).

I think the Graphite based browser from Southampton *does* support 
content-negotiation - what makes you think it doesn't?
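
(And if all you need is the 'quick and dirty' behaviour - fetch some RDF and list 
every triple in it - a few lines of rdflib will do it. A sketch, using one of the 
example data URLs from the message below:)

import rdflib

g = rdflib.Graph()
# one of the example RDF files listed below; any URL returning RDF/XML works
g.parse('http://infomotions.com/sandbox/liam/data/mum432.rdf', format='xml')
for s, p, o in g:
    print(s, p, o)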

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 22 Mar 2014, at 20:49, Eric Lease Morgan emor...@nd.edu wrote:

 Do you know of any working Semantic Web browsers?
 
 Below is a small set of easy-to-use Semantic Web browsers. Give them URIs and 
 they allow you to follow and describe the links they include.
 
  * LOD Browser Switch (http://browse.semanticweb.org) - This is
really a gateway to other Semantic Web browsers. Feed it a URI
and it will create lists of URLs pointing to Semantic Web
interfaces, but many of the URLs (Semantic Web interfaces) do not
seem to work. Some of the resulting URLs point to RDF
serialization converters
 
  * LodLive (http://en.lodlive.it) - This Semantic Web browser
allows you to feed it a URI and interactively follow the links
associated with it. URIs can come from DBedia, Freebase, or one
of your own.
 
  * Open Link Data Explorer
(http://demo.openlinksw.com/rdfbrowser2/) - The most
sophisticated Semantic Web browser in this set. Given a URI it
creates various views of the resulting triples associated with
including lists of all its properties and objects, networks
graphs, tabular views, and maps (if the data includes geographic
points).
 
  * Quick and Dirty RDF browser
(http://graphite.ecs.soton.ac.uk/browser/) - Given the URL
pointing to a file of RDF statements, this tool returns all the
triples in the file and verbosely lists each of their predicate
 and object values. Quick and easy. This is good for reading
everything about a particular resource. The tool does not seem
to support content negotiation.
 
 If you need some URIs to begin with, then try some of these:
 
  * Ray Family Papers - http://infomotions.com/sandbox/liam/data/mum432.rdf
  * Catholics and Jews - 
 http://infomotions.com/sandbox/liam/data/shumarc681792.rdf
  * Walt Disney via VIAF - http://viaf.org/viaf/36927108/
  * origami via the Library of Congress - 
 http://id.loc.gov/authorities/subjects/sh85095643
  * Paris from DBpedia - http://dbpedia.org/resource/Paris
 
 To me, this seems like a really small set of browser possibilities. I’ve seen 
 others but could not get them to work very well. Do you know of others? Am I 
 missing something significant?
 
 —
 Eric Lease Morgan


Re: [CODE4LIB] tool for finding close matches in vocabular list

2014-03-21 Thread Owen Stephens
As Roy suggests, Open Refine is designed for this type of work and could easily 
deal with the volume you are talking about here. It can cluster terms using a 
variety of algorithms and easily apply a set of standard transformations.

The screencasts and info at http://freeyourmetadata.org/cleanup/ might be a 
good starting point if you want to see what Refine can do
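
If you just want to see the general idea, here's a rough Python sketch of the 
kind of key-collision ('fingerprint') clustering Refine uses - normalise each 
term to a key and group the terms whose keys collide (a simplification of what 
Refine actually does, using a couple of the terms from your message):

import re
from collections import defaultdict

def fingerprint(term):
    # lower-case, strip punctuation, sort and de-duplicate the tokens
    tokens = re.sub(r'[^\w\s]', ' ', term.lower()).split()
    return ' '.join(sorted(set(tokens)))

terms = ['Irwin, Ken', 'Irwin, Kenneth', 'Basketball - Women',
         "Basketball - Women's", 'Basketball-Women', "Basketball-Women's"]
clusters = defaultdict(list)
for t in terms:
    clusters[fingerprint(t)].append(t)

for members in clusters.values():
    if len(members) > 1:
        print(members)   # candidate clusters to review by hand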

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 21 Mar 2014, at 18:24, Ken Irwin kir...@wittenberg.edu wrote:

 Hi folks,
 
 I'm looking for a tool that can look at a list of all of subject terms in a 
 poorly-controlled index as possible candidates for term consolidation. Our 
 student newspaper index has about 16,000 subject terms and they include a lot 
 of meaningless typographical and nomenclatural difference, e.g.:
 
 Irwin, Ken
 Irwin, Kenneth
 Irwin, Mr. Kenneth
 Irwin, Kenneth R.
 
 Basketball - Women
 Basketball - Women's
 Basketball-Women
 Basketball-Women's
 
 I would love to have some sort of pattern-matching tool that's smart about 
 this sort of thing that could go through the list of terms (as a text list, 
 database, xml file, or whatever structure it wants to ingest) and spit out 
 some clusters of possible matches.
 
 Does anyone know of a tool that's good for that sort of thing?
 
 The index is just a bunch of MySQL tables - there is no real controlled-vocab 
 system, though I've recently built some systems to suggest known SH's to 
 reduce this sort of redundancy.
 
 Any ideas?
 
 Thanks!
 Ken


Re: [CODE4LIB] Retrieving ISSN using a DOI

2014-03-05 Thread Owen Stephens
You should be able to use the content negotiation support on Crossref to get 
the metadata, which does include the ISSNs - or at least has the potential to 
if they are available. E.g. 

curl -LH "Accept: application/rdf+xml;q=0.5, application/vnd.citationstyles.csl+json;q=1.0" http://dx.doi.org/10.1126/science.169.3946.635

Gives 

{
  "subtitle": [],
  "subject": [
    "General"
  ],
  "issued": {
    "date-parts": [
      [
        1970,
        8,
        14
      ]
    ]
  },
  "score": 1.0,
  "prefix": "http://id.crossref.org/prefix/10.1126",
  "author": [
    {
      "family": "Frank",
      "given": "H. S."
    }
  ],
  "container-title": "Science",
  "page": "635-641",
  "deposited": {
    "date-parts": [
      [
        2011,
        6,
        27
      ]
    ],
    "timestamp": 130913280
  },
  "issue": 3946,
  "title": "The Structure of Ordinary Water: New data and interpretations are yielding new insights into this fascinating substance",
  "type": "journal-article",
  "DOI": "10.1126/science.169.3946.635",
  "ISSN": [
    "0036-8075",
    "1095-9203"
  ],
  "URL": "http://dx.doi.org/10.1126/science.169.3946.635",
  "source": "CrossRef",
  "publisher": "American Association for the Advancement of Science (AAAS)",
  "indexed": {
    "date-parts": [
      [
        2013,
        11,
        7
      ]
    ],
    "timestamp": 1383796678887
  },
  "volume": 169
}
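
For anyone who'd rather do this from a script than with curl, the same lookup is 
a few lines of Python with the requests library (a sketch; the printed result is 
just the ISSN array from the JSON above):

import requests

def issns_for_doi(doi):
    # content negotiation: ask dx.doi.org for CSL JSON and pull out the ISSNs
    r = requests.get('http://dx.doi.org/' + doi,
                     headers={'Accept': 'application/vnd.citationstyles.csl+json'},
                     timeout=30)
    r.raise_for_status()
    return r.json().get('ISSN', [])

print(issns_for_doi('10.1126/science.169.3946.635'))
# ['0036-8075', '1095-9203']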


Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 5 Mar 2014, at 12:30, Graham, Stephen s.grah...@herts.ac.uk wrote:

 OK, I've received a couple of emails telling me that the ISSN is not always 
 included in the DOI - that it depends on the publisher. So, I guess my 
 original question still stands!
 
 Stephen
 
 From: Graham, Stephen
 Sent: 05 March 2014 12:25
 To: 'CODE4LIB@LISTSERV.ND.EDU'
 Subject: RE: Retrieving ISSN using a DOI
 
 Sorry - I've answered my own question. The ISSN is actually contained in the 
 DOI. Didn't realise this! D'oh!
 
 Stephen
 
 From: Graham, Stephen
 Sent: 05 March 2014 12:14
 To: 'CODE4LIB@LISTSERV.ND.EDU'
 Subject: Retrieving ISSN using a DOI
 
 Hi All - is there a service/API that will return the ISSN if I provide the 
 DOI? I was hoping that the Crossref API would do this, but I can't see the 
 ISSN in the JSON it returns.
 
 I'm adding a DOI field to our OPAC ILL form, so if the user has the DOI they 
 can use this to populate the form rather than add all the data manually. When 
 the user submits the form I'm querying our openURL resolver API to see if we 
 have access to the article. If we do then the form will alert the user and 
 provide a link. The query to the openURL resolver works better if we have the 
 ISSN, but if the user has used a DOI the ISSN is frustratingly never there.
 
 Stephen
 
 Stephen Graham
 Online Information Manager
 Information Collections and Services
 University of Hertfordshire, Hatfield.  AL10 9AB
 Tel. 01707 286111
 Email s.grah...@herts.ac.ukmailto:s.grah...@herts.ac.uk


Re: [CODE4LIB] Library of Congress

2013-10-01 Thread Owen Stephens
+1

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 1 Oct 2013, at 14:21, Doran, Michael D do...@uta.edu wrote:

 As far as I can tell the LOC is up and the offices are closed. HORRAY!!
 Let's celebrate!
 
 Before we start celebrating, let's consider our friends and colleagues at the 
 LOC (some of who are code4lib people) who aren't able to work and aren't 
 getting paid starting today.
 
 -- Michael
 
 # Michael Doran, Systems Librarian
 # University of Texas at Arlington
 # 817-272-5326 office
 # 817-688-1926 mobile
 # do...@uta.edu
 # http://rocky.uta.edu/doran/
 
 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
 Riley Childs
 Sent: Tuesday, October 01, 2013 5:28 AM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: [CODE4LIB] Library of Congress
 
 As far as I can tell the LOC is up and the offices are closed. HORRAY!!
 Let's celebrate!
 
 Riley Childs
 Junior and Library Tech Manager
 Charlotte United Christian Academy
 +1 (704) 497-2086
 Sent from my iPhone
 Please excuse mistakes


Re: [CODE4LIB] Open Source ERM

2013-09-20 Thread Owen Stephens
I'm involved in the GOKb project, and also a related project in the UK called 
'KB+' which is a national service providing a knowledgebase and the ability to 
manage subscriptions/licences.
As Adam said - GOKb is definitely more of a service; although the software 
could be run by anyone, it isn't designed with ERM functionality in mind. 
GOKb is a community managed knowledgebase - and so far much 
of the work has been to build a set of tools for bringing in data from 
publishers and content providers, and to store and manage that data. In the not 
too distant future GOKb will provide data via APIs for use in downstream 
systems.

Two specific downstream systems GOKb is going to be working with are the Kuali 
OLE system (https://www.kuali.org/ole) and the KB+ system mentioned above. KB+ 
started with very similar ideas to GOKb in terms of building a community 
managed knowledgebase, but with the UK HE community specifically in mind. 
However it is clear that collaborating with GOKb will have significant benefits 
and help the community focus its effort in a single knowledgebase, and so it is 
expected that eventually KB+ will consume data from GOKb, and the community 
will contribute to the data managed in GOKb.

However KB+ also provides more ERM style functionality available to UK 
Universities. Each institution can setup its own subscriptions and licenses, 
drawing on the shared knowledgebase information which is managed centrally by a 
team at Jisc Collections (who negotiate licenses for much of the content in the 
UK, among other things). I think the KB+ software could work as a standalone 
ERMs in terms of functionality, but its strength is as a multi-institution 
system with a shared knowledgebase. We are releasing v3.3 next week which 
brings integration with various discussion forum software - hoping we can put 
community discussion and collaboration at the heart of the product

Development on both KB+ and GOKb is being done by a UK software house called 
Knowledge Integration, and while licenses for the respective code bases have 
not yet been implemented, both should be released under an open licence in the 
future. However the code is already on Github if anyone is interested
http://github.com/k-int/KBPlus/
https://github.com/k-int/gokb-phase1

In both cases they are web apps written in Groovy. GOKb has the added 
complication/interest of also having an Open (formerly Google) Refine extension, as 
this is the tool chosen for loading messy e-journal data into the system

Sorry to go on, hope the above is of some interest

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 20 Sep 2013, at 16:26, Karl Holten khol...@switchinc.org wrote:

 A couple of months ago our organization began looking at new ERM solutions / 
 link resolvers, so I thought I'd share my thoughts based on my research of 
 the topic. Unfortunately, I think this is one area where open source 
 offerings are a bit thin. Many offerings look promising at first but are no 
 longer under development. I'd be careful about adopting something that's no 
 longer supported. Out of all the options that are no longer developed, I 
 thought the CUFTS/GODOT combo was the most promising. Out of the options that 
 seem to still be under development, there were two options that stood out: 
 CORAL and GOKb. Neither includes a link resolver, so they weren't good for 
 our needs. CORAL has the advantage of being out on the market right now. GOKb 
 is backed by some pretty big institutions and looks more sophisticated, but 
 other than some slideshows there's not a lot to look at to actually evaluate 
 it at the moment. 
 
 Ultimately, I came to the conclusion that nothing out there right now matches 
 the proprietary software, especially in terms of link resolvers and in terms 
 of a knowledge base. If I were forced to go open source I'd say the GOKb and 
 CORAL look the most promising. Hope that helps narrow things down at least a 
 little bit.
 
 Regards,
 Karl Holten
 Systems Integration Specialist
 SWITCH Consortium
 6801 North Yates Road
 Milwaukee, WI 53217
 http://topcat.switchinc.org/ 
 
 
 
 
 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of 
 Riesner, Giles W.
 Sent: Thursday, September 19, 2013 5:33 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Open Source ERM
 
 Thank you, Peter.  I took a quick look at the list and found ERMes there as 
 well as a few others.
 Not everything under this category really fits what I'm looking for (e.g. 
 Calibre). I'll look a little deeper.
 
 Regards,
 
 
 Giles W. Riesner, Jr., Lead Library Technician, Library Technology Community 
 College of Baltimore County
 800 S. Rolling Road  Baltimore, MD 21228
 gries...@ccbcmd.edu   1-443-840-2736
 
 
 
 From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] on behalf

Re: [CODE4LIB] What do you want to learn about linked data?

2013-09-04 Thread Owen Stephens
Just a recommendation for a source of information - I've found 
http://linkeddatabook.com/editions/1.0/ very useful especially in thinking 
about the practicalities of linked data publication and consumption in 
applications

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 4 Sep 2013, at 15:13, Akerman, Laura lib...@emory.edu wrote:

 Karen,
 
 It's hard to say what "basics" are. We had a learning group at Emory that 
 covered a lot of the "what is it", including mostly what you've listed but 
 also the environment (library and cultural heritage, and larger environment), 
 but we had a harder time getting to the "what do you do with it", which is 
 what would really motivate and empower people to go ahead and get beyond 
 basics.
 
 Maybe add:
 
 How do you embed linked data in web pages using RDFa
 (Difference between RDFa and schema.org/other microdata)
 How do you harvest linked data from web pages, endpoints, or other modes of 
 delivery?
 Different serializations and how to convert
 How do you establish relations between different vocabularies (classes and 
 properties) using RDFS and OWL?
 (Demo) New answers to your questions enabled by combining and querying linked 
 data!
 
 Maybe a step toward what can you do with it would be to show (or have an 
 exercise):
 
 How can a web application interface with linked data?
 
 I suspect there are a lot of people who've read about it and/or have had 
 tutorials here and there, and who really want to get their hands in it.  
 That's where there's a real dearth of training.
 
 An intermediate level workshop addressing (but not necessarily answering!) 
 questions like:
 
 Do you need a triplestore or will a relational database do?
 Do you need to store your data as RDF or can you do everything you need with 
 XML or some other format, converting on the way out or in?
 Should you query external endpoints in real time in your application, or 
 cache the data?
 Other than SPARQL, how do you search linked data?  Indexing strategies...  
 tools...
 If asserting  OWL sameAs is too dangerous in your context, what other 
 strategies for expressing close to it relationships between resources 
 (concepts) might work for you?
 Advanced SPARQL using regular expressions, CREATE, etc.
 Care and feeding of triplestores (persistence, memory, )
 Costing out linked data applications:
   How much additional server space and bandwidth will I (my institution) need 
 to provision in order to work with this stuff?
   Open source, free, vs. commercial management systems?
 Backward conversion -transformations from linked data to other data 
 serializations (e.g. metadata standards in XML).
 What else?
 
 Unfortunately (or maybe that's just how it is) no one has built an interface that 
 hides all the programming and technical details from people but lets them 
 experience/experiment with this stuff (have they?).  So some knowledge is 
 necessary.  What are prerequisites and how could we make the burden of 
 knowing them not so onerous to people who don't have much experience in web 
 programming or system administration, so they could get value from a 
 tutorial?
 
 Laura
 
 Laura Akerman
 Technology and Metadata Librarian
 Room 208, Robert W. Woodruff Library
 Emory University, Atlanta, Ga. 30322
 (404) 727-6888
 lib...@emory.edu
 
 
 
 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Karen 
 Coyle
 Sent: Wednesday, September 04, 2013 4:59 AM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] What do you want to learn about linked data?
 
 All,
 
 I had a few off-list requests for basics - what are the basic things that 
 librarians need to know about linked data? I have a site where I am putting 
 up a somewhat crudely designed tutorial (with exercises):
 
 http://kcoyle.net/metadata/
 
 As you can see, it is incomplete, but I work away on it when so inspired. It 
 includes what I consider to be the basic knowledge:
 
 1. What is metadata?
 2. Data vs. text
 3. Identifiers (esp. URIs)
 4. Statements (not records) (read: triples) 5. Semantic Web basics 6. URIs 
 (more in depth) 7. Ontologies 8. Vocabularies
 
 I intend to link various slide sets to this, and anyone is welcome to make 
 use of the content there. It would be GREAT for it to become an actual 
 tutorial, perhaps using better software, but I haven't found anything yet 
 that I like working with.
 
 If you have basics to add, please let me know!
 
 kc
 
 
 
 On 9/1/13 5:37 PM, Karen Coyle wrote:
 I'm thinking about training needs around linked data -- yes, that
 includes basic concepts, but at the moment I'm wondering what specific
 technologies or tasks people would like to learn about? Some obvious
 examples are: how to do SPARQL queries; how to use triples in
 databases; maybe how to use Protege (free software) [1] to create an
 ontology. Those are just a quick shot across the bow

Re: [CODE4LIB] netflix search mashups w/ library tools?

2013-08-19 Thread Owen Stephens
From the Netflix API Terms of Use: "Titles and Title Metadata may be stored for 
no more than twenty four (24) hours."
http://developer.netflix.com/page/Api_terms_of_use

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 19 Aug 2013, at 16:59, Ken Irwin kir...@wittenberg.edu wrote:

 Thanks Karen,
 
 This goes in a bit of a different direction from what I'm hoping for, and your project 
 does suggest that some matching to build such searches might be possible. 
 
 What I really want is to apply LCSH and related data to the Netflix search 
 process, essentially dropping Netflix holdings into a library catalog 
 interface. I suspect you'd have to build a local cache of the OCLC data for 
 known Netflix items to do so, and maybe a local cache of the Netflix title 
 list. I wonder if either or both of those actions would violate the TOS for 
 the respective services. 
 
 Ken
 
 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Karen 
 Coombs
 Sent: Monday, August 19, 2013 11:26 AM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] netflix search mashups w/ library tools?
 
 Ken,
 
 I did a mashup that took Netflix's top 100 movies and looked to see if a 
 specific library had that item.
 http://www.oclc.org/developer/applications/netflix-my-library
 
 You might think about doing the following. Search WorldCat for titles on a 
 particular topic and then check to see if the title is available via Netflix. 
 Netflix API for searching their catalog is pretty limited though so it might 
 not give you what you want. It looks like it only allows you to search their 
 streamable content.
 
 Also I had a lot of trouble with trying to match Netflix titles and library 
 holdings. Because there isn't a good match point. DVDs don't have ISBNs and 
 if you use title you can get into trouble because movies get remade. So title 
 + date seems to work best if you can get the information.
 
 Karen
 
 On Mon, Aug 19, 2013 at 8:54 AM, Ken Irwin kir...@wittenberg.edu wrote:
 Hi folks,
 
 Is anyone out there using library-like tools for searching Netflix? I'm 
 imagining a world in which Netflix data gets mashed up with OCLC data or 
 something like it to populate a more robustly searchable Netflix title list.
 
 Does anything like this exist?
 
 What I really want at the moment is a list of Netflix titles dealing with 
 Islamic topics (Muhammed, the Qu'ran, the history of Islamic civilizations, 
 the Hajj, Ramadan, etc.) for doing beyond-the-library readers' advisory in 
 connection with our ALA/NEH Muslim Journey's Bookshelf. Netflix's own search 
 tool is singularly awful, and I thought that the library world might have an 
 interest in doing better.
 
 Any ideas?
 Thanks
 Ken


Re: [CODE4LIB] Releasing library holdings metadata openly on the web (was: Libraries and IT Innovation)

2013-07-24 Thread Owen Stephens
On the holdings front also see the work being done on a holding ontology at 
https://github.com/dini-ag-kim/holding-ontology (and related mailing list 
http://lists.d-nb.de/mailman/listinfo/dini-ag-kim-bestandsdaten) - discussion 
all in English

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 23 Jul 2013, at 21:14, Dan Scott deni...@gmail.com wrote:

 Hi Laura:
 
 On Tue, Jul 23, 2013 at 12:36 PM, Laura Krier laura.kr...@gmail.com wrote:
 
 snip
 
 The area where I'm most involved right now is in releasing library holdings
 metadata openly on the web, in discoverable and re-usable forms. It's
 amazing to me that we still don't do this. Imagine the things that could be
 created by users and software developers if they had access to information
 about which libraries hold which resources.
 
 I'm really interested in your efforts on this front, and where this
 work is taking place, as that's what I'm trying to do as part of my
 participation in the W3C Schema Bib Extend Community Group at
 http://www.w3.org/community/schemabibex/
 
 See the thread starting around
 http://lists.w3.org/Archives/Public/public-schemabibex/2013Jul/0068.html
 where we're trying to work out how best to surface library holdings in
 schema.org structured data, with one effort focusing on reusing the
 Offer class. There are many open questions, of course, but one of
 the end goals (at least for me) is to get the holdings into a place
 where regular people are most likely to find them: in search results
 served up by search engines like Google and Bing.
 
 If you're not involved in the W3C community group, maybe you should
 be! And it would be great if you could point out where your work is
 taking place so that we can combine forces.
 
 Dan


Re: [CODE4LIB] Anyone have access to well-disambiguated sets of publication data?

2013-07-09 Thread Owen Stephens
I'd echo the other comments that finding reliable data is problematic, but as a 
suggestion of reasonably good data you could try:

Names was a Jisc-funded project that, as far as I know, isn't currently active, 
but the data available should be of reasonable quality I think. More details on 
the project are available at 
http://names.mimas.ac.uk/files/Final_Report_Names_Phase_Two_September_2011.pdf

Names: for author names + identifiers - e.g. 
http://names.mimas.ac.uk/individual/25256.html?outputfields=identifiers (this 
one has an ISNI)
Names also provides links to Journal articles - e.g. for same person 
http://names.mimas.ac.uk/individual/25256.html?outputfields=resultpublications
You could then use the Crossref DOI lookup service to get journal identifiers

Not sure this will get you what you need but might be worth a look
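As a purely illustrative sketch of that last step (not part of the Names service itself), here is Python against the current CrossRef REST API at api.crossref.org - the citation string is a placeholder, and the endpoint and fields are assumptions worth verifying before relying on them:

import requests

def lookup_doi(citation):
    # Ask CrossRef for its best bibliographic match to a free-text citation
    # and return the DOI of the top hit, if any.
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": citation, "rows": 1},
        timeout=30,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    return items[0]["DOI"] if items else None

print(lookup_doi("Placeholder article title, Placeholder Journal, 2011"))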

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 9 Jul 2013, at 16:32, Paul Albert paa2...@med.cornell.edu wrote:

 I am exploring methods for author disambiguation, and I would like to have 
 access to one or more well-disambiguated data sets containing:
 – a unique author identifier (email address, institutional identifier)
 – a unique article identifier (PMID, DOI, etc.)
 – a unique journal identifier (ISSN)
 
 Definition for well-disambiguated – for a given set of authors, you know 
 the identity of their journal articles to a precision and recall of greater 
 than 90-95%.
 
 Any ideas?
 
 thanks,
 Paul
 
 
 Paul Albert
 Project Manager, VIVO
 Weill Cornell Medical Library
 646.962.2551


Re: [CODE4LIB] best way to make MARC files available to anyone

2013-06-13 Thread Owen Stephens
On 13 Jun 2013, at 02:57, Dana Pearson dbpearsonm...@gmail.com wrote:

 quick followup on the thread..
 
 github:  I looked at the cooperhewitt collection but don't see a way to
 download the content...I could copy and paste their content but that may
 not be the best approach for my files...documentation is thin, seems I
 would have to provide email addresses for those seeking access...but
 clearly that is not the case with how the cooperhewitt archive is
 configured...
 
 My primary concern has been to make it as simple a process as possible for
 libraries which have limited technical expertise. 

I suspect from what you say that GitHub is not what you want in this case. 
However, I just wanted to clarify that you can download files as a Zip file 
(e.g. for Cooper Hewitt 
https://github.com/cooperhewitt/collection/archive/master.zip), and that this 
link is towards the top left on each screen in GitHub. The repository is a 
public one (which is the default, and only option unless you have a paid 
account on GitHub) and you do not need to provide email addresses or anything 
else to access the files on a public repository
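If anyone wants to script the download rather than click the link, a minimal Python sketch using the Zip URL above (assuming only the requests library):

import io
import zipfile

import requests

# Fetch the public repository archive and list the first few files it contains.
url = "https://github.com/cooperhewitt/collection/archive/master.zip"
resp = requests.get(url, timeout=120)
resp.raise_for_status()

with zipfile.ZipFile(io.BytesIO(resp.content)) as archive:
    for name in archive.namelist()[:10]:
        print(name)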

Owen


Re: [CODE4LIB] best way to make MARC files available to anyone

2013-06-12 Thread Owen Stephens
Putting the files on GitHub might be an option - free for public repositories, 
and 38Mb should not be a problem to host there

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 12 Jun 2013, at 02:24, Dana Pearson dbpearsonm...@gmail.com wrote:

 I have crosswalked the Project Gutenberg RDF/DC metadata to MARC.  I would
 like to make these files available to any library that is interested.
 
 I thought that I would put them on my website via FTP but don't know if
 that is the best way.  Don't have an ftp client myself so was thinking that
 that may be now passé.
 
 I tried using Google Drive with access available via the link to two
 versions of the files, UTF8 and MARC8.  However, it seems that that is not
 a viable solution.  I can access the files with the URLs provided by
 setting the access to anyone with the URL, but it doesn't work for some of
 those testing it for me, or with the links I have on my webpage.
 
 I have five folders with files of about 38 MB total.  I have separated the
 ebooks, audio books, juvenile content, miscellaneous and non-Latin scripts
 such as Chinese, Modern Greek.  Most of the content is in the ebook folder.
 
 I would like to make access as easy as possible.
 
 Google Drive seems to work for me.  Here's the link to my page with the
 links in case you would like to look at the folders.  Works for me but not
 for everyone who's tried it.
 
 http://dbpearsonmlis.com/ProjectGutenbergMarcRecords.html
 
 thanks,
 dana
 
 -- 
 Dana Pearson
 dbpearsonmlis.com


Re: [CODE4LIB] best way to make MARC files available to anyone

2013-06-12 Thread Owen Stephens
On 12 Jun 2013, at 14:06, Dana Pearson dbpearsonm...@gmail.com wrote:

 Thanks for the replies... I had looked at GitHub but thought it was something
 different, i.e. collaborative software development... I will look again

Yes - that's the main use (git is version control software, GitHub hosts git 
repositories) - but of course git doesn't care what types of files you have 
under version control. It came to mind because I know it's been used to 
distribute metadata files before - e.g. this set of metadata from the Cooper 
Hewitt National Design Museum https://github.com/cooperhewitt/collection

There could be some additional benefits gained through using git to version 
control this type of file, and GitHub to distribute them if you were 
interested, but it can act as simply a place to put the files and make them 
available for download. But of course the other suggestions would do this 
simpler task just as well.

Owen


Re: [CODE4LIB] DOI scraping

2013-05-17 Thread Owen Stephens
I'd say yes to the investment in jQuery generally - not too difficult to get 
the basics if you already use javascript, and makes some things a lot easier

It sounds like you are trying to do something not dissimilar to LibX 
http://libx.org ? (except via bookmarklet rather than as a browser plugin).
Also looking for custom database scrapers it might be worth looking at Zotero 
translators, as they already exist for many major sources and I guess will be 
grabbing the DOI where it exists if they can 
http://www.zotero.org/support/dev/translators
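To give a flavour of what option 3 involves, here is a rough sketch of the scraping logic in Python rather than bookmarklet JavaScript. It assumes - and this is an assumption, not something every database does - that the page exposes a citation_doi (or Dublin Core identifier) meta tag, with a regex over the raw HTML as a fallback:

import re
import requests
from bs4 import BeautifulSoup

DOI_PATTERN = re.compile(r'10\.\d{4,9}/[^\s"<>]+')

def find_doi(url):
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # Many (not all) publisher pages carry the DOI in a meta tag.
    for name in ("citation_doi", "dc.identifier", "DC.Identifier"):
        tag = soup.find("meta", attrs={"name": name})
        if tag and tag.get("content"):
            match = DOI_PATTERN.search(tag["content"])
            if match:
                return match.group(0)
    # Fall back to the first DOI-shaped string anywhere in the page.
    match = DOI_PATTERN.search(html)
    return match.group(0) if match else None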

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 17 May 2013, at 05:32, Fitchett, Deborah deborah.fitch...@lincoln.ac.nz 
wrote:

 Kia ora koutou,
 
 I’m wanting to create a bookmarklet that will let people on a journal article 
 webpage just click the bookmarklet and get a permalink to that article, 
 including our proxy information so it can be accessed off-campus.
 
 Once I’ve got a DOI (or other permalink, but I’ll cross that bridge later), 
 the rest is easy. The trouble is getting the DOI. The options seem to be:
 
 1.   Require the user to locate and manually highlight the DOI on the 
 page. This is very easy to code, not so easy for the user who may not even 
 know what a DOI is let alone how to find it; and some interfaces make it hard 
 to accurately select (I’m looking at you, ScienceDirect).
 
 2.   Live in hope of universal CoiNS implementation. I might be waiting a 
 long time.
 
 3.   Work out, for each database we use, how to scrape the relevant 
 information from the page. Harder/tedious to code, but makes it easy for the 
 user.
 
 I’ve been looking around for existing code that something like #3. So far 
 I’ve found:
 
 · CiteULike’s bookmarklet (jQuery at http://www.citeulike.org/bm - 
 afaik it’s all rights reserved)
 
 · AltMetrics’ bookmarklet (jQuery at 
 http://altmetric-bookmarklet.dsci.it/assets/content.js - MIT licensed)
 
 Can anyone think of anything else I should be looking at for inspiration?
 
 Also on a more general matter: I have the general level of Javascript that 
 one gets by poking at things and doing small projects and then getting 
 distracted by other things and then coming back some months later for a 
 different small project and having to relearn it all over again. I’ve long 
 had jQuery on my “I guess I’m going to have to learn this someday but, um, 
 today I just wanna stick with what I know” list. So is this the kind of thing 
 where it’s going to be quicker to learn something about jQuery before I get 
 started, or can I just as easily muddle along with my existing limited 
 Javascript? (What really are the pros and cons here?)
 
 Nāku noa, nā
 
 Deborah Fitchett
 Digital Access Coordinator
 Library, Teaching and Learning
 
 p +64 3 423 0358
 e deborah.fitch...@lincoln.ac.nzmailto:deborah.fitch...@lincoln.ac.nz | w 
 library.lincoln.ac.nzhttp://library.lincoln.ac.nz/
 
 Lincoln University, Te Whare Wānaka o Aoraki
 New Zealand's specialist land-based university
 
 
 
 Please consider the environment before you print this email.
 The contents of this e-mail (including any attachments) may be confidential 
 and/or subject to copyright. Any unauthorised use, 
 distribution, or copying of the contents is expressly prohibited.  If you 
 have received this e-mail in error, please advise the sender 
 by return e-mail or telephone and then delete this e-mail together with all 
 attachments from your system.
 


[CODE4LIB] British Library Directory of Libraries (probably of interest to UK only)

2013-04-23 Thread Owen Stephens
The British Library has a directory of library codes used by UK registered 
users of it's Document Supply service. The Directory of Library Codes enables 
British Library customers to convert into names and addresses the library codes 
they are given in response to location searches. It also indicates each 
library's supply and charging policies. More information at 
http://www.bl.uk/reshelp/atyourdesk/docsupply/help/replycodes/dirlibcodes/

As far as I know the only format this data has ever been made available in is 
PDF. I've always thought this a shame, so I've written a scraper on scraperwiki 
to extract the data from the PDF and make it available as structured, 
query-able, data. The scraper and output is at 
https://scraperwiki.com/scrapers/british_library_directory_of_library_codes/
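The scraper linked above is the real thing; purely for illustration, the general shape of the parsing looks something like the sketch below (using pdfminer.six, and assuming a made-up line layout of 'CODE  Library name' that will not match the actual PDF exactly):

import re
from pdfminer.high_level import extract_text

# Hypothetical pattern: a short upper-case code, then the library name.
LINE = re.compile(r"^([A-Z0-9/]{2,10})\s{2,}(.+)$")

def parse_directory(pdf_path):
    rows = []
    for line in extract_text(pdf_path).splitlines():
        m = LINE.match(line.strip())
        if m:
            rows.append({"code": m.group(1), "library": m.group(2).strip()})
    return rows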

Just in case anyone would find it useful. Also any suggestions for improving 
the scraper welcome (I don't usually write Python so the code is probably even 
ropier than my normal code :)

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936


Re: [CODE4LIB] You *are* a coder. So what am I?

2013-02-13 Thread Owen Stephens
Shambrarian: Someone who knows enough truth about how libraries really work, 
but not enough to go insane or be qualified as a real librarian. (See more at 
http://m.urbandictionary.com/#define?term=Shambrarian)

More information available at http://shambrarian.org/

And Dave Pattern has published a handy guide to Librarian/Shambrarian 
interactions
(DO NOT bore the librarian by showing them your Roy Tennant Fan Club 
membership card)
http://daveyp.wordpress.com/2011/07/21/librarianshambrarian-venn-diagram/

Tongue firmly in cheek,

Owen 

On 14 Feb 2013, at 00:22, Maccabee Levine levi...@uwosh.edu wrote:

 Andromeda's talk this afternoon really struck a chord, as I shared with her
 afterwards, because I have the same issue from the other side of the fence.
 I'm among the 1/3 of the crowd today with a CS degree and an IT
 background (and no MLS).  I've worked in libraries for years, but when I
 have a point to make about how technology can benefit instruction or
 reference or collection development, I generally preface it with "I'm not a
 librarian, but..."  I shouldn't have to be defensive about that.
 
 Problem is, 'coder' doesn't imply a particular degree -- just the
 experience from doing the task, and as Andromeda said, she and most C4Lers
 definitely are coders.  But 'librarian' *does* imply MLS/MSLS/etc., and I
 respect that.
 
 What's a library word I can use in the same way as coder?
 
 Maccabee
 
 -- 
 Maccabee Levine
 Head of Library Technology Services
 University of Wisconsin Oshkosh
 levi...@uwosh.edu
 920-424-7332


Re: [CODE4LIB] Directories of OAI-PMH repositories

2013-02-08 Thread Owen Stephens
Also see OpenDOAR

http://www.opendoar.org
 
We used this listing when building Core http://core.kmi.open.ac.uk/search - 
which aggregates and does full-text analysis and similarity matching across OA 
repositories

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 7 Feb 2013, at 23:19, Wilhelmina Randtke rand...@gmail.com wrote:

 Thanks!  The list of lists is very helpful.
 
 -Wilhelmina Randtke
 
 On Thu, Feb 7, 2013 at 2:40 PM, Habing, Thomas Gerald
 thab...@illinois.eduwrote:
 
 Here is a registry of OAI-PMH repositories that we maintain (sporadically)
 here at Illinois:  http://gita.grainger.uiuc.edu/registry/
 
 Tom
 
 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
 Phillips, Mark
 Sent: Thursday, February 07, 2013 2:13 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Directories of OAI-PMH repositories
 
 You could start here.
 
 http://www.openarchives.org/pmh/
 
 Mark
 
 From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] on behalf of
 Wilhelmina Randtke [rand...@gmail.com]
 Sent: Thursday, February 07, 2013 2:03 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: [CODE4LIB] Directories of OAI-PMH repositories
 
 Is there a central listing of places that track and list OAI-PMH
 repository
 feeds?  I have an OAI-PMH compliant repository, so now am looking for
 places to list that so that harvesters or anyone who is interested can
 find it.
 
 -Wilhelmina Randtke
 


Re: [CODE4LIB] XMP Metadata to tab-delemited file

2013-01-14 Thread Owen Stephens
I'm not familiar with what XMP RDF/XML looks like but it might be worth using 
an RDF parser rather than using XSLT?

Graphite (http://graphite.ecs.soton.ac.uk/) is pretty easy to use if you are 
comfortable with PHP
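If PHP isn't a requirement, the same idea in Python with rdflib is only a few lines - a minimal sketch, assuming the XMP packet has already been extracted to an RDF/XML file (the filename is a placeholder):

from rdflib import Graph

g = Graph()
g.parse("example-xmp.rdf", format="xml")  # placeholder filename

# Walk every triple the parser found, however the XMP was serialised,
# and emit tab-delimited subject / predicate / object.
for subject, predicate, obj in g:
    print(f"{subject}\t{predicate}\t{obj}")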

Owen

On 14 Jan 2013, at 19:09, Kyle Banerjee kyle.baner...@gmail.com wrote:

 On Sat, Jan 12, 2013 at 1:36 PM, Michael Hopwood mich...@editeur.orgwrote:
 
 I got as far as producing XMP RDF/XML files but the problem then remains;
 how to usefully manage these via XSLT transforms?
 
 The problem is that XMP uses an RDF syntax that comes in many flavours and
 doesn't result in a predictable set of xpaths to apply the XSLT to.
 
 XSLT is not a good tool for many kinds of XML processing. In your
 situation, string processing or scanning for what tags are present and then
 outputting in delimited text so you know what is where is probably a better
 way to go.
 
 kyle


Re: [CODE4LIB] What is a coder?

2012-11-30 Thread Owen Stephens
I've been involved in running library/tech unconferences in the UK (the Mashed 
Library events http://mashedlibrary.com). For the second event (organised by 
Dave Pattern and others at the University of Huddersfield) we put together a 
very short list of things you could expect to get out the event 
(http://mashlib09.wordpress.com/2009/04/28/event-info-why-come-to-mashed-libary/)
 - the idea being these were things that could go on requests to attend the 
event.

More recently we realised there was a lot of interest from staff on the 
cataloguing/metadata side of libraries to attend a more 'tech' oriented event 
but that institutions were often limiting the number of people who could 
attend, and it was these staff who often lost out as the event was judged to be 
more appropriate for others. Working with Tom Meehan at UCL and Celine Carty at 
the University of Cambridge (and others) we were able to put on an event that 
while still attracting tech staff was also squarely aimed at getting 
cataloguers/metadata people along - and this definitely worked in terms of the 
make up of attendees of that particular event.

All of which is a preamble to saying - it might be worth putting together 
either a theoretical list, or direct testimonials, from people who have 
attended the conference in the past, ideally from a variety of library roles, 
with what they can/did get out of the conference. This could provide much 
needed evidence when applying to attend/travel?

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 29 Nov 2012, at 22:51, William Denton w...@pobox.com wrote:

 On 29 November 2012, Cary Gordon wrote:
 
 Obviously, we need to offer trainings on how to get funding to attend
  conferences. They should be collocated with the conferences.
 
 This is a good idea; this should be a BOF or something---how to hack your 
 system to get funding---maybe report back with a lightning talk?  Some folks 
 have good funding support, which is great.  Some don't, but given the 
 different problems or constraints, what's worked or could work to get people 
 to a Code4Lib conference (major or chapter)?
 
 I know some people pay their own way and some use vacation time to go ... be 
 good to hear that approach too.  If someone's looking to change what they're 
 doing in the library/technology world, getting to Code4Lib however they can 
 is something to seriously consider.
 
 Bill
 -- 
 William Denton
 Toronto, Canada
 http://www.miskatonic.org/


Re: [CODE4LIB] COinS

2012-11-21 Thread Owen Stephens
Agreed.

The SchemaBibex group is having some of this discussion, and I think the 
'appropriate copy' problem is one the library community can potentially bring 
to the table. There are no guarantees, and it could be we end up with yet 
another set of standards/guidelines/practices that the wider world/web doesn't 
care about - but I think there is an opportunity to position this so that other 
services can see the benefits of pushing relevant data out, and search engines 
can see how it can be used to enhance their services. I suspect that discussing 
this and coming up with proposals in the context of Schema.org is the best bet 
(for the moment at least) at moving this kind of work from the current niche to 
a more mainstream position.

I'd argue that matching resources (via descriptions) to availability is now 
a more general problem than when OpenURL was conceived, as the growth of 
subscription-based services like Netflix/Kindle lending/Spotify etc. leads to 
the same issues. This is expressed on the SchemaBibex wiki 
http://www.w3.org/community/schemabibex/wiki/Why_Extend. Also several of the 
use cases described are in this area - 
http://www.w3.org/community/schemabibex/wiki/Use_Cases#Use_case:_Describe_library_holding.2Favailability_information,
 alongside use cases that look at how to describe scholarly articles 
http://www.w3.org/community/schemabibex/wiki/Use_Cases#Use_case:_journal_articles_and_other_periodical_publications

If we are going to see adoption, I strongly believe the outcomes we are 
describing have to be compelling to search engines, and their users, as well as 
publishers and other service providers. It would be great to get more 
discussion of what a compelling proposal might look like on the SchemaBibex 
list http://lists.w3.org/Archives/Public/public-schemabibex/ or wiki 
http://www.w3.org/community/schemabibex/wiki/Main_Page

Owen


Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 21 Nov 2012, at 07:37, Dave Caroline dave.thearchiv...@gmail.com wrote:

 
 
 In terms of vocabulary, Schema.org is “extensible” via several mechanisms 
 including mashups with other vocabularies or, ideally, direct integration 
 into the Schema.org namespace such as we’ve seen with RNews 
 http://blog.schema.org/2011/09/extended-schemaorg-news-support.html , 
 JobPostings 
 http://blog.schema.org/2011/11/schemaorg-support-for-job-postings.html , 
 and GoodRelations 
 http://blog.schema.org/2012/11/good-relations-and-schemaorg.html . This is 
 a win/win scenario, but it requires communities to prove they can articulate 
 a sensible set of extensions and deliver the information in that model. 
 Within the “bibliographic” community, this is the mandate set for the 
 http://www.w3.org/community/schemabibex/ group. If you are disappointed with 
 OpenURL metadata formats, poor support for COinS, and disappointing 
 probabilities for content resolution, here’s your chance for leveraging SEO 
 for those purposes.
 
 But... it is no good choosing a random extension if the Search engine
 is or will be blind to that particular method.
 As someone who likes to leverage SEO the right way so one does not
 get penalised, some standardisation  is needed.
 
 Dave Caroline, waiting


Re: [CODE4LIB] OpenURL linking but from the content provider's point of view

2012-11-21 Thread Owen Stephens
The only difference between COinS and a full OpenURL is the addition of a link 
resolver address. Most databases that provide OpenURL links directly (rather 
than simply COinS) use some profile information - usually set by the 
subscribing library, although some is based on information supplied by an 
individual user. If set by the library this is then linked to specific users by 
IP or by login.

There are a couple(?) of generic base URLs you can use which will try to 
redirect to an appropriate link resolver based on IP range of the requester, 
with fallback options if it can't find an appropriate resolver (I think this is 
how the WorldCat resolver works? The 'OpenURL Router' in the UK definitely 
works like this)

The LibX toolbar allows users to set their link resolver address, and then 
translates COinS into OpenURLs when you view a page - all user driven, no need 
for the data publisher to do anything beyond COinS
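As a sketch of what that LibX-style conversion amounts to (not LibX's actual code): find each COinS span, take its title attribute (the encoded ContextObject) and prefix the user's resolver base URL. In Python, with the resolver address as a placeholder:

import requests
from bs4 import BeautifulSoup

RESOLVER_BASE = "http://resolver.example.ac.uk/openurl"  # placeholder base URL

def coins_to_openurls(page_url):
    soup = BeautifulSoup(requests.get(page_url, timeout=30).text, "html.parser")
    links = []
    # COinS convention: an empty span with class Z3988 whose title attribute
    # holds the URL-encoded OpenURL ContextObject.
    for span in soup.find_all("span", class_="Z3988"):
        context_object = span.get("title")
        if context_object:
            links.append(f"{RESOLVER_BASE}?{context_object}")
    return links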

There is also the 'cookie pusher' solution which arXiv uses - where the user 
can set a cookie containing the base URL, and this is picked up and used by 
arXiv (http://arxiv.org/help/openurl)

Owen

PS it occurs to me that the other part of the question is 'what metadata should 
be included in the OpenURL to give it the best chance of working with a link 
resolver'?

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 20 Nov 2012, at 19:39, David Lawrence david.lawre...@sdsu.edu wrote:

 I have some experience with the library side of link resolver code.
 However, we want to implement OpenURL hooks on our open access literature
 database and I can not find where to begin.
 
 SafetyLit is a free service of San Diego State University in cooperation
 with the World Health Organization. We already provide embedded metadata in
 both COinS and unAPI formats to allow its capture by Mendeley, Papers,
 Zotero, etc. Over the past few months, I have emailed or talked with many
 people and read everything I can get my hands on about this but I'm clearly
 not finding the right people or information sources.
 
 Please help me to find references to examples of the code that is required
 on the literature database server that will enable library link resolvers
 to recognize the SafetyLit.org metadata and allow appropriate linking to
 full text.
 
 SafetyLit.org receives more than 65,000 unique (non-robot) visitors and the
 database responds to almost 500,000 search queries every week. The most
 frequently requested improvement is to add link resolver capacity.
 
 I hope that code4lib users will be able to help.
 
 Best regards,
 
 David
 
 David W. Lawrence, PhD, MPH, Director
 Center for Injury Prevention Policy and Practice
 San Diego State University, School of Public Health
 6475 Alvarado Road, Suite 105
 San Diego, CA  92120  USA   david.lawre...@sdsu.edu
 V 619 594 1994   F 619 594 1995   Skype: DWL-SDCA   www.CIPPP.org  --
 www.SafetyLit.org


Re: [CODE4LIB] OpenURL linking but from the content provider's point of view

2012-11-21 Thread Owen Stephens
The failure rate on resolving DOIs via CrossRef is high enough that I'd argue for 
belt & braces

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 21 Nov 2012, at 15:08, Young,Jeff (OR) jyo...@oclc.org wrote:

 If the referent has a DOI, then I would argue that
 rft_id=http://dx.doi.org/10.1145/2132176.2132212 is all you need. The
 descriptive information that typically goes in the ContextObject can be
 obtained (if necessary) by content-negotiating for application/rdf+xml.
 OTOH, if someone pokes this same URI from a browser instead, you will
 generally get redirected to the publisher's web site with the full-text
 close at hand.
 
 The same principle should apply for any bibliographic resource that has
 a Linked Data identifier.
 
 Jeff
 
 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf
 Of
 Owen Stephens
 Sent: Wednesday, November 21, 2012 9:55 AM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] OpenURL linking but from the content
 provider's
 point of view
 
 The only difference between COinS and a full OpenURL is the addition
 of
 a link resolver address. Most databases that provide OpenURL links
 directly (rather than simply COinS) use some profile information -
 usually set by the subscribing library, although some based on
 information supplied by an individual user. If set by the library this
 is then linked to specific users by IP or by login.
 
 There are a couple(?) of generic base URLs you can use which will try
 to redirect to an appropriate link resolver based on IP range of the
 requester, with fallback options if it can't find an appropriate
 resolver (I think this is how the WorldCat resolver works? The
 'OpenURL
 Router' in the UK definitely works like this)
 
 The LibX toolbar allows users to set their link resolver address, and
 then translates COinS into OpenURLs when you view a page - all user
 driven, no need for the data publisher to do anything beyond COinS
 
 There is also the 'cookie pusher' solution which ArXiv uses - where
 the
 user can set a cookie containing the base URL, and this is picked up
 and used by ArXiV (http://arxiv.org/help/openurl)
 
 Owen
 
 PS it occurs to me that the other part of the question is 'what
 metadata should be included in the OpenURL to give it the best chance
 of working with a link resolver'?
 
 Owen Stephens
 Owen Stephens Consulting
 Web: http://www.ostephens.com
 Email: o...@ostephens.com
 Telephone: 0121 288 6936
 
 On 20 Nov 2012, at 19:39, David Lawrence david.lawre...@sdsu.edu
 wrote:
 
 I have some experience with the library side of link resolver code.
 However, we want to implement OpenURL hooks on our open access
 literature database and I can not find where to begin.
 
 SafetyLit is a free service of San Diego State University in
 cooperation with the World Health Organization. We already provide
 embedded metadata in both COinS and unAPI formats to allow its
 capture
 by Mendeley, Papers, Zotero, etc. Over the past few months, I have
 emailed or talked with many people and read everything I can get my
 hands on about this but I'm clearly not finding the right people or
 information sources.
 
 Please help me to find references to examples of the code that is
 required on the literature database server that will enable library
 link resolvers to recognize the SafetyLit.org metadata and allow
 appropriate linking to full text.
 
 SafetyLit.org receives more than 65,000 unique (non-robot) visitors
 and the database responds to almost 500,000 search queries every
 week.
 The most frequently requested improvement is to add link resolver
 capacity.
 
 I hope that code4lib users will be able to help.
 
 Best regards,
 
 David
 
 David W. Lawrence, PhD, MPH, Director
 Center for Injury Prevention Policy and Practice San Diego State
 University, School of Public Health
 6475 Alvarado Road, Suite 105
 San Diego, CA  92120  USA   david.lawre...@sdsu.edu
 V 619 594 1994   F 619 594 1995   Skype: DWL-SDCA   www.CIPPP.org  --
 www.SafetyLit.org


Re: [CODE4LIB] SRU MARC fields with indicators

2012-11-07 Thread Owen Stephens
Thanks Karen - probably should have known that! That's the nice thing about 
MARC - always some new thing to cope with :)

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 6 Nov 2012, at 19:37, Karen Coyle li...@kcoyle.net wrote:

 The 9s are available in all indicator positions for local use as defined in 
 the MARC record (not MARC21) spec. [1] So what is in the MARC21 spec under a 
 particular tag is the non-local values. I suspect that most systems just 
 ignore any '9's they encounter unless those are defined as part of local 
 system processing.
 
 kc
 [1] http://www.loc.gov/marc/specifications/specrecstruc.html
 
 
 On 11/6/12 10:20 AM, Owen Stephens wrote:
 According to the MARC spec, 035 doesn't support '9' as a valid indicator. My 
 very uneducated guess would be the invalid indicator is causing the 
 underlying system not to index it?
 
 Owen Stephens
 Owen Stephens Consulting
 Web: http://www.ostephens.com
 Email: o...@ostephens.com
 Telephone: 0121 288 6936
 
 On 6 Nov 2012, at 17:43, Alevtina Verbovetskaya 
 alevtina.verbovetsk...@mail.cuny.edu wrote:
 
 Let's say I've defined these indexes in pqf.properties on the SRU server:
 index.marc.020 = 1=7 # ISBN
 index.marc.035:1 = 1=1211 # OCLC/utility number where first indicator is 
 non-blank
 index.marc.100:1 = 1=1 # author where first indicator is non-blank
 
 I can use the ISBN index to search for records, e.g.:
 http://apps.appl.cuny.edu:5661/CENTRAL?version=1.1&operation=searchRetrieve&query=marc.020=9780801449437&startRecord=1&maximumRecords=15
 
 I can also use the author index to search for records, e.g.:
 http://apps.appl.cuny.edu:5661/CENTRAL?version=1.1&operation=searchRetrieve&query=marc.100:1=Armenteros&startRecord=1&maximumRecords=15
 
 So why can't I search for records by utility number (035) with a non-blank 
 first indicator?
 http://apps.appl.cuny.edu:5661/CENTRAL?version=1.1&operation=searchRetrieve&query=marc.035:1=ebr10488669&startRecord=1&maximumRecords=15
 
 If you're playing along, you'll notice that these all point to the same 
 record. However, when I try to search for it with 
 query=marc.035:1=util_num, I get no results. I thought maybe this was 
 because there's already another 035 field (with blank indicators) that's an 
 OCLC number so I temporarily removed it... but that didn't solve the issue.
 
 Anyone have any experience with this? I need to be able to search by 0359# 
 and I can't figure out what I'm doing wrong. I would greatly appreciate 
 some assistance!
 
 Thank you,
 Allie
 
 --
 Alevtina (Allie) Verbovetskaya
 Web and Mobile Systems Librarian (Substitute)
 Office of Library Services
 City University of New York
 555 W 57th St, 13th fl.
 New York, NY 10019
 T: 646-313-8158
 F: 646-216-7064
 alevtina.verbovetsk...@mail.cuny.edu
 
 -- 
 Karen Coyle
 kco...@kcoyle.net http://kcoyle.net
 ph: 1-510-540-7596
 m: 1-510-435-8234
 skype: kcoylenet


Re: [CODE4LIB] SRU MARC fields with indicators

2012-11-06 Thread Owen Stephens
According to the MARC spec, 035 doesn't support '9' as a valid indicator. My 
very uneducated guess would be the invalid indicator is causing the underlying 
system not to index it?

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 6 Nov 2012, at 17:43, Alevtina Verbovetskaya 
alevtina.verbovetsk...@mail.cuny.edu wrote:

 Let's say I've defined these indexes in pqf.properties on the SRU server:
 index.marc.020 = 1=7 # ISBN
 index.marc.035:1 = 1=1211 # OCLC/utility number where first indicator is 
 non-blank
 index.marc.100:1 = 1=1 # author where first indicator is non-blank
 
 I can use the ISBN index to search for records, e.g.:
 http://apps.appl.cuny.edu:5661/CENTRAL?version=1.1&operation=searchRetrieve&query=marc.020=9780801449437&startRecord=1&maximumRecords=15
 
 I can also use the author index to search for records, e.g.:
 http://apps.appl.cuny.edu:5661/CENTRAL?version=1.1&operation=searchRetrieve&query=marc.100:1=Armenteros&startRecord=1&maximumRecords=15
 
 So why can't I search for records by utility number (035) with a non-blank 
 first indicator?
 http://apps.appl.cuny.edu:5661/CENTRAL?version=1.1&operation=searchRetrieve&query=marc.035:1=ebr10488669&startRecord=1&maximumRecords=15
 
 If you're playing along, you'll notice that these all point to the same 
 record. However, when I try to search for it with 
 query=marc.035:1=util_num, I get no results. I thought maybe this was 
 because there's already another 035 field (with blank indicators) that's an 
 OCLC number so I temporarily removed it... but that didn't solve the issue.
 
 Anyone have any experience with this? I need to be able to search by 0359# 
 and I can't figure out what I'm doing wrong. I would greatly appreciate some 
 assistance!
 
 Thank you,
 Allie
 
 --
 Alevtina (Allie) Verbovetskaya
 Web and Mobile Systems Librarian (Substitute)
 Office of Library Services
 City University of New York
 555 W 57th St, 13th fl.
 New York, NY 10019
 T: 646-313-8158
 F: 646-216-7064
 alevtina.verbovetsk...@mail.cuny.edu


Re: [CODE4LIB] open circ data

2012-10-26 Thread Owen Stephens
The University of Huddersfield released circulation data - see 
http://library.hud.ac.uk/data/usagedata/_readme.html
The University of Lincoln also released some data, linked from 
http://library.hud.ac.uk/wikis/mosaic/index.php/Project_Data (along with the 
Huddersfield data in a different format I think)
The SALT project offers some data - although the project involves the University of 
Manchester and the University of Cambridge as well as Huddersfield and Lincoln, I 
think the data offered for download is just from Manchester, but I could be 
wrong - data at http://vm-salt.mimas.ac.uk/data/ - and a recommender API based 
on the data http://copac.ac.uk/innovations/activity-data/?page_id=227

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 26 Oct 2012, at 23:04, Jimmy Ghaphery jghap...@vcu.edu wrote:

 Are there any other repositories of circ data similar to the OhioLINK/OCLC
 project (http://www.oclc.org/research/activities/ohiolink/circulation.html
 ).
 
 I seem to remember a large set of British data, but I can't track that
 down. We have some eager IS grad students looking for data to use for a
 recommender engine and I'm looking forward to see what they might come up
 with.
 
 thanks for any pointers!
 
 -Jimmy
 
 -- 
 Jimmy Ghaphery
 Head, Library Information Systems
 VCU Libraries
 804-827-3551


Re: [CODE4LIB] Q.: software for vendor title list processing

2012-10-17 Thread Owen Stephens
Are there any examples of data in this format in the wild we can look at?

Also given KBART and ONIX for Serials Online Holdings have NISO involvement, is 
there any view on how these two activities complement each other?

Thanks,

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 17 Oct 2012, at 09:47, Michael Hopwood mich...@editeur.org wrote:

 Hi Godmar,
 
 There is also ONIX for Serials Online Holdings 
 (http://www.editeur.org/120/ONIX-SOH/). I'm copying in Tim Devenport who 
 might say more.
 
 Best wishes,
 
 Michael
 
 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Owen 
 Stephens
 Sent: 16 October 2012 23:09
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Q.: software for vendor title list processing
 
 I'm working on the JISC KB+ project that Tom mentioned.
 
 As part of the project we've been collating journal title lists from various 
 sources. We've been working with members of the KBART steering group and have 
 used KBART where possible, although we've been collecting data not covered by 
 KBART.
 
 All the data we have at this level is published under a CC0 licence at 
 http://www.kbplus.ac.uk/kbplus/publicExport - including a csv that uses the 
 KBART data elements. The focus so far has been on packages negotiated by JISC 
 in the UK - although in many cases the title lists may be the same as are 
 made available in other markets. We also include what we call 'Master lists' 
 which are an attempt to capture the complete list of titles and coverage 
 offered by a content provider. We'd very much welcome any feedback on these 
 exports, and of course be interested to know if anyone makes use of them.
 
 So far a lot of the work on collating/converting/standardising the data has 
 been done by hand - which is clearly not ideal. In the next phase of the 
 project the KB+ project is going to work with the GoKB project 
 http://gokb.org - as part of this collaboration we are currently working on 
 ways of streamlining the data processing from publisher files or other 
 sources, to standardised data. While we are still working on how this is 
 going to be implemented, we are currently investigating the possibility of 
 using Google/Open Refine to capture and re-run sets of rules across data sets 
 from specific sources. We should be making progress on this in the next 
 couple of months.
 
 Hope that's helpful
 
 Owen
 
 Owen Stephens
 Owen Stephens Consulting
 Web: http://www.ostephens.com
 Email: o...@ostephens.com
 Telephone: 0121 288 6936
 
 On 16 Oct 2012, at 20:23, Tom Pasley tom.pas...@gmail.com wrote:
 
 You might also be interested in the work at http://www.kbplus.ac.uk . 
 The site is up at the moment, but I can't reach it for some reason... 
 they have a public export page which you might want to know about 
 http://www.kbplus.ac.uk/kbplus/publicExport
 
 Tom
 
 On Wed, Oct 17, 2012 at 8:12 AM, Jonathan Rochkind rochk...@jhu.edu wrote:
 
 I think KBART is such an effort.  As with most library standards 
 groups, there may not be online documentation of their most recent 
 efforts or successes, but: http://www.uksg.org/kbart
 
 http://www.uksg.org/kbart/s5/**guidelines/data_formathttp://www.uksg
 .org/kbart/s5/guidelines/data_format
 
 
 
 On 10/16/2012 2:16 PM, Godmar Back wrote:
 
 Hi,
 
 at our library, there's an emerging need to process title lists from 
 vendors for various purposes, such as checking that the titles 
 purchased can be discovered via discovery system and/or OPAC. It 
 appears that the formats in which those lists are provided are 
 non-uniform, as is the process of obtaining them.
 
 For example, one vendor - let's call them Expedition Scrolls - 
 provides title lists for download to Excel, but which upon closer 
 inspection turn out to be HTML tables. They are encoded using an odd 
 mixture of CP1250 and HTML entities. Other vendors use entirely different 
 formats.
 
 My question is whether there are efforts, software, or anything 
 related to streamlining the acquisition and processing of vendor 
 title lists in software systems that aid in the collection 
 development and maintenance process. Any pointers would be appreciated.
 
 - Godmar
 
 
 


Re: [CODE4LIB] Q.: software for vendor title list processing

2012-10-17 Thread Owen Stephens
There are things that could be improved about the KBART guidelines (and you've 
picked on one here that I definitely agree with). 

There is an interest group mailing list which can be used for 
discussion/feedback http://www.niso.org/lists/kbart_interest/

I suspect that for both approaches at the moment the question of 
uptake/compliance is the bigger issue.

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 17 Oct 2012, at 14:48, Jonathan Rochkind rochk...@jhu.edu wrote:

 I've always been a fan of ONIX for SOH, although never had the chance to use 
 it -- but the spec is written nicely; based on my experience with this stuff, 
 it actually accomplishes the goal of machine-readable statement of serial 
 holdings (theoretically useful for print or online holdings) well.
 
 KBART, I have some concerns about, when it comes to holdings. Is there a 
 place to send feedback to KBART?  Just on a quick skim of the parts of 
 interest to me, I am filled with alarm at how much missing the point this is: 
 "we recommend that the ISO 8601 date syntax should be used...  For 
 simplicity, '365D' will always be equivalent to one year, and '30D' will 
 always be equivalent to one month, even in leap years and months that do not 
 have 30 days."
 
 Totally missing the point of ISO 8601 to allow/encourage this when 1Y and 1M 
 are available -- dealing with calendar dates is harder than one might naively 
 think, and by trying to 'improve' on ISO 8601 like this, you just create a 
 mess of ambiguous and difficult to deal with data.
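 To make the difference concrete (an illustration, not anything from the KBART text), in Python a calendar year and 365 days only coincide when the span avoids 29 February:

from datetime import date, timedelta
from dateutil.relativedelta import relativedelta

start = date(2015, 7, 1)
print(start + relativedelta(years=1))  # 2016-07-01 - one calendar year ('P1Y')
print(start + timedelta(days=365))     # 2016-06-30 - 365 days ('P365D'), a day short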
 
 On 10/17/2012 5:11 AM, Owen Stephens wrote:
 Are there any examples of data in this format in the wild we can look at?
 
 Also given KBART and ONIX for Serials Online Holdings have NISO involvement, 
 is there any view on how these two activities complement each other?
 
 Thanks,
 
 Owen
 
 Owen Stephens
 Owen Stephens Consulting
 Web: http://www.ostephens.com
 Email: o...@ostephens.com
 Telephone: 0121 288 6936
 
 On 17 Oct 2012, at 09:47, Michael Hopwood mich...@editeur.org wrote:
 
 Hi Godmar,
 
 There is also ONIX for Serials Online Holdings 
 (http://www.editeur.org/120/ONIX-SOH/). I'm copying in Tim Devenport who 
 might say more.
 
 Best wishes,
 
 Michael
 
 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of 
 Owen Stephens
 Sent: 16 October 2012 23:09
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Q.: software for vendor title list processing
 
 I'm working on the JISC KB+ project that Tom mentioned.
 
 As part of the project we've been collating journal title lists from 
 various sources. We've been working with members of the KBART steering 
 group and have used KBART where possible, although we've been collecting 
 data not covered by KBART.
 
 All the data we have at this level is published under a CC0 licence at 
 http://www.kbplus.ac.uk/kbplus/publicExport - including a csv that uses the 
 KBART data elements. The focus so far has been on packages negotiated by 
 JISC in the UK - although in many cases the title lists may be the same as 
 are made available in other markets. We also include what we call 'Master 
 lists' which are an attempt to capture the complete list of titles and 
 coverage offered by a content provider. We'd very much welcome any feedback 
 on these exports, and of course be interested to know if anyone makes use 
 of them.
 
  So far a lot of the work on collating/converting/standardising the data has 
 been done by hand - which is clearly not ideal. In the next phase of the 
 project the KB+ project is going to work with the GoKB project 
 http://gokb.org - as part of this collaboration we are currently working on 
 ways of streamlining the data processing from publisher files or other 
 sources, to standardised data. While we are still working on how this is 
 going to be implemented, we are currently investigating the possibility of 
 using Google/Open Refine to capture and re-run sets of rules across data 
 sets from specific sources. We should be making progress on this in the 
 next couple of months.
 
 Hope that's helpful
 
 Owen
 
 Owen Stephens
 Owen Stephens Consulting
 Web: http://www.ostephens.com
 Email: o...@ostephens.com
 Telephone: 0121 288 6936
 
 On 16 Oct 2012, at 20:23, Tom Pasley tom.pas...@gmail.com wrote:
 
 You might also be interested in the work at http://www.kbplus.ac.uk .
 The site is up at the moment, but I can't reach it for some reason...
 they have a public export page which you might want to know about
 http://www.kbplus.ac.uk/kbplus/publicExport
 
 Tom
 
 On Wed, Oct 17, 2012 at 8:12 AM, Jonathan Rochkind rochk...@jhu.edu 
 wrote:
 
 I think KBART is such an effort.  As with most library standards
 groups, there may not be online documentation of their most recent
 efforts or successes, but: http://www.uksg.org/kbart
 
 http://www.uksg.org/kbart/s5/**guidelines/data_formathttp://www.uksg

Re: [CODE4LIB] Q.: software for vendor title list processing

2012-10-17 Thread Owen Stephens
 
 This leads to three follow-up questions.
 
 First, is there software to translate/normalize existing vendor lists from
 vendors that have not yet adopted either of these standards into these
 formats? I'm thinking of a collection of adapters or converters, perhaps.
 Each would likely constitute small effort, but there would be benefits from
 sharing development and maintenance.

Not that I'm aware of, but if I understand you then this is one of the tasks 
GoKB is undertaking in partnership with KB+ (the work I mentioned using Refine)

 
 Second, if holdings lists were provided in, or converted to, for instance
 the KBART format, what software understands these formats to further
 process them? In other words, is there immediate bang for the buck of
 adopting these standards?

The KBART format was aimed at Link Resolver population - so I'd hope there was 
some immediate payback on this front, but I don't have any information on this

 
  Third, unsurprisingly, these efforts arose in the management of serials
 because holdings there change frequently depending on purchase agreements,
 etc. It is my understanding that eBooks are now posing similar collection
 management challenges. Are there separate normative efforts for eBooks or
 is it believed that efforts such as KBART/ONIX can encompass eBooks as well?
 

KBART definitely has ambitions to encompass eBooks as well. There are already 
some hooks for this (e.g. 'first author' field), and the working group is 
looking at how ebooks will work I think

 - Godmar


Re: [CODE4LIB] Q.: software for vendor title list processing

2012-10-16 Thread Owen Stephens
I'm working on the JISC KB+ project that Tom mentioned.

As part of the project we've been collating journal title lists from various 
sources. We've been working with members of the KBART steering group and have 
used KBART where possible, although we've been collecting data not covered by 
KBART.

All the data we have at this level is published under a CC0 licence at 
http://www.kbplus.ac.uk/kbplus/publicExport - including a csv that uses the 
KBART data elements. The focus so far has been on packages negotiated by JISC 
in the UK - although in many cases the title lists may be the same as are made 
available in other markets. We also include what we call 'Master lists' which 
are an attempt to capture the complete list of titles and coverage offered by a 
content provider. We'd very much welcome any feedback on these exports, and of 
course be interested to know if anyone makes use of them.
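For anyone who wants to play with those exports programmatically, a minimal Python sketch for reading a KBART-style title list; KBART specifies tab-delimited text with column names like publication_title and date_first_issue_online, but check the actual KB+ export (which may be comma-separated) before relying on this:

import csv

def read_title_list(path, delimiter="\t"):
    # KBART specifies tab-delimited text; pass delimiter="," if the export
    # you downloaded is a plain CSV instead.
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter=delimiter):
            yield row

for title in read_title_list("package.txt"):  # placeholder filename
    print(title.get("publication_title"), title.get("date_first_issue_online"))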

So far a lot of the work on collating/converting/standardising the data has been 
done by hand - which is clearly not ideal. In the next phase of the project the 
KB+ project is going to work with the GoKB project http://gokb.org - as part of 
this collaboration we are currently working on ways of streamlining the data 
processing from publisher files or other sources, to standardised data. While 
we are still working on how this is going to be implemented, we are currently 
investigating the possibility of using Google/Open Refine to capture and re-run 
sets of rules across data sets from specific sources. We should be making 
progress on this in the next couple of months.

Hope that's helpful

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 16 Oct 2012, at 20:23, Tom Pasley tom.pas...@gmail.com wrote:

 You might also be interested in the work at http://www.kbplus.ac.uk . The
 site is up at the moment, but I can't reach it for some reason... they have
 a public export page which you might want to know about
 http://www.kbplus.ac.uk/kbplus/publicExport
 
 Tom
 
 On Wed, Oct 17, 2012 at 8:12 AM, Jonathan Rochkind rochk...@jhu.edu wrote:
 
 I think KBART is such an effort.  As with most library standards groups,
 there may not be online documentation of their most recent efforts or
 successes, but: http://www.uksg.org/kbart
 
 http://www.uksg.org/kbart/s5/**guidelines/data_formathttp://www.uksg.org/kbart/s5/guidelines/data_format
 
 
 
 On 10/16/2012 2:16 PM, Godmar Back wrote:
 
 Hi,
 
 at our library, there's an emerging need to process title lists from
 vendors for various purposes, such as checking that the titles purchased
 can be discovered via discovery system and/or OPAC. It appears that the
 formats in which those lists are provided are non-uniform, as is the
 process of obtaining them.
 
 For example, one vendor - let's call them Expedition Scrolls - provides
 title lists for download to Excel, but which upon closer inspection turn
 out to be HTML tables. They are encoded using an odd mixture of CP1250 and
 HTML entities. Other vendors use entirely different formats.
 
 My question is whether there are efforts, software, or anything related to
 streamlining the acquisition and processing of vendor title lists in
 software systems that aid in the collection development and maintenance
 process. Any pointers would be appreciated.
 
  - Godmar
 
 
 


Re: [CODE4LIB] Citation manager -- ??? -- BePress Bulk-upload Excel spreadsheet

2012-10-15 Thread Owen Stephens
No idea if this is useful, but just to note that RefWorks also has an API in 
case that offers any more options to you in terms of pushing the data around 
http://rwt.refworks.com/rwapireference/

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 15 Oct 2012, at 13:37, Mita Williams mita.willi...@gmail.com wrote:

 Here's the summary of the summary of what we've found out
 (the full summary is here:
 http://librarian.newjackalmanac.ca/2012/10/bibliographic-software-bepress.html
 )
 
 Roy Tennant, some years ago, created a Perl script that collects the latest
 citations in Web of Science via their WSDL. Lisa Schiff kindly found this
 script and sent it my way. While the script is admittedly out of date (and
 Web of Science now provides an API for this sort of thing) it still will
 prove useful if and when we get to a point in which we want to automate and
 script our workflow. [Thank you Roy, Kirk Hastings, and Lisa Schiff!]
 
 And here are some things that we’ve figured out ourselves:
 
 Between Zotero and RefWorks, Zotero exports the cleanest results to Excel
 format. RefWorks can easily export to Excel, while Zotero requires the use
 of an SQLite extension and this script (
 https://github.com/RoyceKimmons/Zotero-to-Excel-SQLite-Export/blob/master/export.sql)
 kindly provided by Royce Kimmons. On this page (
 http://royce.kimmons.me/node/24) Kimmons explains how one can select one or
 more Zotero collection/folder for export.
 
 No one that we know of has created an Excel macro to automate transferring
 the result of an export to Excel from RefWorks or Zotero to ease the
 cutting and pasting necessary to get the information into BePress’s Excel
 Spreadsheet.
 
 An alternative means of sharing citations is to avoid Excel exporting
 altogether and instead, have staff make their papers available on
 Zotero.org in a public library and have the IR coordinator use Zotero to
 download the citations using that are either tagged as appropriate (e.g.
 https://www.zotero.org/copystar/items/tag/publisherPDF) or those that have
 been placed in a given collection folder (e.g.
 https://www.zotero.org/copystar/items/collectionKey/THDEN26X).
 
 Papers on BePress can be added to Zotero on each item level page but not on
 a collection page.  Improving this capability would require creating a
 special Zotero translator for BePress:
 https://github.com/zotero/translators/issues/212
 
 Thank you everyone who has helped us work through this. I hope what we’ve
 learned proves useful to you as well.
 
 
 On Fri, Oct 5, 2012 at 3:20 PM, Mita Williams mita.willi...@gmail.comwrote:
 
 Yes, a partner in crime has asked a similar question in the bepress list
 and I've been talking to a Zotero developer as well.
 
 Once I get this pieces into context, I will definitely share back with the
 rest of the list.  It's the least I can do.  Much thanks all
 
 
 On Fri, Oct 5, 2012 at 12:43 PM, lindsey danis danis@gmail.comwrote:
 
 There is a discussion on this  topic right now in the Digital Commons
 Google Group, fyi.
 
 On Fri, Oct 5, 2012 at 12:40 PM, Sam Kome sam_k...@cuc.claremont.edu
 wrote:
 
 At some point bring it back to the list, please. Enquiring minds want to
 know...
 
 Thanks,
 
 SK
 
 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
 Roy Tennant
 Sent: Thursday, October 04, 2012 10:44 AM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Citation manager -- ??? -- BePress Bulk-upload
 Excel spreadsheet
 
 Mita,
 A while back (I mean at least six years ago) I wrote some code to take
 citations downloaded from an index provider, reformat them into bepress
 spreadsheet format, and bulk upload them. The purpose of the project
 was to
 identify published articles by University of California faculty, email
 them
 that we had citations of their work in our system, and wouldn't they
 like
 to upload their copy of their article into the repository? I don't have
 the
 numbers on that project, but I recall that it did boost submissions.
 
 Unfortunately, I think the code, which was likely crappy anyway, has
 long
 since moldered to dust on a server somewhere that I no longer have
 access
 to, but I can put you in touch with someone at UC who might be doing
 something like this. I'll email you off-list.
 Roy
 
 On Thu, Oct 4, 2012 at 9:32 AM, Mita Williams mita.willi...@gmail.com
 wrote:
 
 We're trying to figure out a workflow for our BePress IR and was
 curious if anyone in code4libland has developed something (an Excel
 macro? a Zotero export function?) that could take formatted citations
 and put them in the proper order so they could be bulk added to the
 BePress bulk upload Excel spreadsheet.  Or perhaps there's an
 altogether different way of going about collecting, formatting, and
 adding such things for BePress.
 
 Everything counts in large amounts.
 Mita
 
 
 
 
 


Re: [CODE4LIB] CODE4LIB equivalent in UK?

2012-10-08 Thread Owen Stephens
Code4lib is a many-headed beast :) It may depend on what you are looking for 
(mailing list, IRC, conference, journal, etc.).

Mashed Library is a set of events (I ran the first one and I've been involved 
in many of the subsequent ones), partially bourne out of my frustration that I 
never got to go to the Code4Lib or Access conferences in North America. There 
is no organising committee or particular restriction on using the name - so 
anyone can run a 'mashed library' event, and they can be whatever format you 
want. The events have tended to be one day, cheap or free to attend, and have 
at least some 'unconference' element, and often some 'hands on' time/practical 
sessions. The events have tended to target a mixture of developers and 'tech 
interested' people - of course the mix varies between events. The last one was 
in Cambridge this summer and was focussed very much on cataloguing/metadata - 
there is a collection of presentations and blog posts at 
http://www.mashcat.info if you want a flavour of this.

After the first event I discussed with a few others the idea of having a 
mailing list etc. but in the end the question is always - why duplicate the 
code4lib mailing list? The original question asked about an equivalent 'British 
list', and I guess I've never really been sure what the point of it would be? 
What would be 'British' about it - what are the UK specific needs that can't be 
addressed on Code4lib? We use the same s/w generally, have the same code at our 
disposal etc.

To cover off the other thing mentioned DevCSI is a JISC funded initiative which 
has run a wide variety of events. The focus is coders in UK HE - so not 
repositories specifically, nor libraries specifically - however there are 
regular events run by DevCSI that are in these spaces. DevCSI have also 
supported several of the Mashed Library events - they are interested in making 
events happen and generally supporing the developer community, not necessarily 
always running things themselves.

They have run a big annual event 'Dev8D' for the last few years 
http://dev8d.org which has been a week long usually in February in London. 
They've also run one student developer event, DevXS http://devxs.org - I'm not 
clear if this will be repeated.

Back to my question above - what is it that the code4lib list doesn't satisfy 
that people would like to see from a UK based list?

Owen


Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 8 Oct 2012, at 09:14, Richard Wallis richard.wal...@dataliberate.com wrote:

 The Mashed Library folks might be fertile ground for gaining interest in a
 code4libuk.
 http://www.mashedlibrary.com
 
 ~Richard.
 
 On 7 October 2012 16:28, Tim Hill th...@astreetpress.com wrote:
 
 Here's another lurking UK code4libber! I work for a UK/US company, but
 I spend the bulk of my time in the UK (and never enough in the US to
 coincide with a code4lib meetup). I'd certainly be interested in
 getting the/a community more active in the UK.
 
 Tim Hill
 
 On Tue, Oct 2, 2012 at 9:12 AM, Simeon Warner simeon.war...@cornell.edu
 wrote:
 Have a look at http://devcsi.ukoln.ac.uk/ . This is mainly focused on
 repositories but seems somewhat similar from an outside view.
 
 Cheers,
 Simeon (lurking expat Brit)
 
 
 On 10/2/12 4:11 AM, Michael Hopwood wrote:
 
 Yes - my question was implicitly aimed at lurking UKavians.
 
 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
 Dave Caroline
 Sent: 02 October 2012 09:08
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] CODE4LIB equivalent in UK?
 
 On Tue, Oct 2, 2012 at 8:55 AM, Michael Hopwood mich...@editeur.org
 wrote:
 
 I know that CODE4LIB isn't per se in the USA but it seems like a
 large
 number of its active users are.
 
 Is there an equivalent list that you folks know of?
 
 
 I don't know of an equivalent British list but there are a few of us Brits
 about lurking in #code4lib too (archivist)
 
 Dave Caroline
 
 
 
 
 
 
 -- 
 Richard Wallis
 Founder, Data Liberate
 http://dataliberate.com
 Tel: +44 (0)7767 886 005
 
 Linkedin: http://www.linkedin.com/in/richardwallis
 Skype: richard.wallis1
 Twitter: @rjw
 IM: rjw3...@hotmail.com


Re: [CODE4LIB] Seeking examples of outstanding discovery layers

2012-09-21 Thread Owen Stephens
The stuff by Mitchell Whitelaw on Generous Interfaces (and he cites some 
aspects of Trove as an example of a generous interface) seems relevant to this 
discussion:

Slides: http://www.slideshare.net/mtchl/generous-interfaces
Paper: 
http://www.ica2012.com/files/data/Full%20papers%20upload/ica12Final00423.pdf

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936


Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google

2012-08-28 Thread Owen Stephens
The JISC funded CLOCK project did some thinking around cataloguing processes 
and tracking changes to statements and/or records - e.g. 
http://clock.blogs.lincoln.ac.uk/2012/05/23/its-a-model-and-its-looking-good/

Not solutions of course, but hopefully of interest

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 28 Aug 2012, at 19:43, Simon Spero sesunc...@gmail.com wrote:

 On Aug 28, 2012, at 2:17 PM, Joe Hourcle wrote:
 
 I seem to recall seeing a presentation a couple of years ago from someone in 
 the intelligence community, where they'd keep all of their intelligence, but 
 they stored RDF quads so they could track the source.
 
 They'd then assign a confidence level to each source, so they could get an 
 overall level of confidence on their inferences.
 […]
 It's possible that it was in the context of provenance, but I'm getting 
 bogged down in too many articles about people storing provenance information 
 using RDF-triples (without actually tracking the provenance of the triple 
 itself)
 
 Provenance is of great importance in the IC and related sectors.   
 
 A good overview of the nature of evidential reasoning is David A Schum 
 (1994;2001). Evidential Foundations of Probabilistic Reasoning. Wiley & Sons, 
 1994; Northwestern University Press, 2001 [Paperback edition].
 
 There are usually papers on provenance and associated semantics at the GMU 
 Semantic Technology for Intelligence, Defense, and Security (STIDS).  This 
 years conference is 23 - 26 October 2012; see http://stids.c4i.gmu.edu/ for 
 more details. 
 
 Simon
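
A minimal sketch of the quads-with-provenance idea described in the quoted
discussion above - hypothetical code using rdflib named graphs, so each
statement carries its source as the fourth element of the quad, with a
confidence level asserted about the source itself (all names are examples):

    from rdflib import Dataset, URIRef, Literal, Namespace
    from rdflib.namespace import DCTERMS

    EX = Namespace("http://example.org/")
    ds = Dataset()

    # Each source gets its own named graph, so every statement is stored
    # as a quad whose graph component identifies where it came from.
    src = URIRef("http://example.org/source/report-42")
    g = ds.graph(src)
    g.add((EX.book1, DCTERMS.title, Literal("An Example Title")))

    # A confidence level attached to the source itself (default graph).
    ds.add((src, EX.confidenceLevel, Literal(0.8)))

    print(ds.serialize(format="nquads"))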


[CODE4LIB] Open data and Research Libraries UK

2012-08-03 Thread Owen Stephens
Hello all

I've been commissioned by Research Libraries UK (RLUK) to look at the 
possibility of making RLUK data openly available, and the related issues and 
challenges. As part of this work it is important for us to understand who the 
audience for such open data might be, how they might use the data, and what 
licences, formats and mechanisms will best support this use. I hope you are 
able to help by completing the survey linked below.

To give a bit more detail on the data we are talking about. Research Libraries 
UK, through JISC and MIMAS, makes available a large database of bibliographic 
data. RLUK estimates that approximately 16 million bibliographic records in its 
database are free from restrictions in terms of redistribution and open 
licensing. 

RLUK is committed to the principle of open bibliographic data, and is a 
signatory to the JISC Discovery Open Metadata Principles 
(http://discovery.ac.uk/businesscase/principles/). RLUK would therefore like to 
determine the most effective way of publishing the available records as open 
metadata, with an emphasis on enabling reuse. 

The survey should only take about 10 minutes to complete and is available at: 
https://www.surveymonkey.com/s/5RH8KH8

Thanks and best wishes

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936


Re: [CODE4LIB] Recommendations for a teaching OPAC?

2012-08-03 Thread Owen Stephens
On 3 Aug 2012, at 15:56, Joseph Montibello joseph.montibe...@dartmouth.edu 
wrote:
 search, you could probably do worse than to install Blacklight.  It
 probably doesn't really meet the simple criteria - there's a lot more to
 it than I could talk about.  But getting it out of the box, turned on, and
 searching against a few records is something that you and students could
 probably manage. I've got a year of unix/ssh/command line experience and
 with a bit of mucking about, googling, and asking for help I was able to
 get a local (non-production) instance up and running, so it's definitely
 easy enough.

I'd agree - either Blacklight http://projectblacklight.org or VuFind 
http://vufind.org are straightforward to get running. I've found Blacklight 
setup using the Ruby Gem very easy both on Windows and OS X. Since they are 
both powered by Solr and use SolrMARC there are a lot of similarities on the 
indexing/searching side. However on the interface side they differ in terms of 
setup - so it might be this that would sway you one way of the other (or a 
preference for PHP (VuFind) or Ruby (Blacklight)).



 
 Lesson: Interfaces, usability, accessibility
 Exercise: Use the OPAC, populate it with some data, assess its usability

Once you've got VuFind/Blacklight set up, populating with data is a matter of 
uploading some MARC21 records - Blacklight comes with some test records 
bundled, and I suspect VuFind does too but can't remember.
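
Students could also take a look inside the raw MARC21 before loading it - a
minimal sketch, assuming Python and pymarc are available (the filename is a
placeholder):

    from pymarc import MARCReader

    # Print titles and subject headings from a batch of MARC21 records,
    # just to see what will end up in the index.
    with open("test_records.mrc", "rb") as fh:
        for record in MARCReader(fh):
            print(record.leader)
            for field in record.get_fields("245") + record.get_fields("650"):
                print(field.tag, field.value())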

 
 Lesson: HTML/CSS
 Exercise: Use CSS to skin the OPAC, customize the HTML for your site

This is slightly more complex I guess - both systems can be highly customised, 
but in either case it isn't necessarily just a matter of editing CSS or HTML. 
Both use templating systems and both have configuration files that control 
certain aspects of the interface (e.g. what is searched, how facets display). 
CSS is probably more straightforward - with VuFind you can just drop in CSS to 
override the defaults - not sure about Blacklight.

 
 Lesson: Data management, search, IR
 Exercise: See if we can peek under the hood about how the OPAC's search
 works
 

I think this would be the real strength of using Blacklight/VuFind - 
Solr/Lucene is a powerful combination, and used widely outside the library 
sector. You can also configure the indexing to a high degree - lots of options, 
the most basic of which I explore in 
http://www.meanboyfriend.com/overdue_ideas/2012/07/marc-and-solrmarc/

The thing I really like about this is students would see some of the complexity 
of MARC as well as some of its utility - and where it doesn't work well.
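
Once records are indexed, the underlying Solr can also be queried directly,
which makes a nice 'under the hood' exercise - a minimal sketch, where the core
name and field names (title_t, subject_facet) are assumptions that depend
entirely on the local SolrMARC configuration:

    import requests

    # Query the Solr index behind the discovery interface directly.
    SOLR = "http://localhost:8983/solr/blacklight-core/select"

    resp = requests.get(SOLR, params={
        "q": "title_t:railways",   # field names depend on indexing config
        "rows": 5,
        "facet": "true",
        "facet.field": "subject_facet",
        "wt": "json",
    })
    data = resp.json()
    print("hits:", data["response"]["numFound"])
    for doc in data["response"]["docs"]:
        print(doc.get("id"), doc.get("title_t"))
    print(data["facet_counts"]["facet_fields"]["subject_facet"][:10])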

 Lesson: Interfaces to data: databases, XML, SQL
 Exercise: Use the OPAC as an living example to work with those interfaces

This is less well served by Blacklight/VuFind - no database, no SQL.

 
 This idea primarily came from trying to get some simple XML/SQL
 exercises that didn't suck (the setup for these environments is almost
 as involved as any exercises itself), and the fact the previous classes
 really liked dissecting the nextgen catalogs we've explored from a
 software selection and 2.0 integration perspective.

Unfortunately it may be that Blacklight/VuFind don't work for your scenario 
because they don't provide an environment for SQL. You could do some XML stuff 
(there are configuration files, and Solr can be updated via XML messages) - but 
I'm not clear whether this is the kind of XML work you want. However, I do 
think they open up some other avenues that are well worth exploring, and use 
technologies that are going to become more relevant in the future.

Another option might be BibServer, which uses Elasticsearch rather than Solr - 
but I've never tried installing it 
http://bibserver.readthedocs.org/en/latest/install.html


[CODE4LIB] Code and Catalogue data event

2012-04-26 Thread Owen Stephens
I'm happy to announce a new 'mashed library' event focussing on 
cataloguing data. The event (#mashcat) will be held in Cambridge (UK) on 
5th July, and is free to attend. We hope to encourage a mixture of 
developers, cataloguers and metadata specialists to come along, 
exchanging ideas and knowledge. The programme has not yet been finalised 
but is likely to be a mixture of talks, and time to pursue ideas, 
discussions and projects.


Details of the event are available from http://www.mashcat.info/ and you 
can register at http://mashcat.eventbrite.co.uk/


#mashcat is being supported by DevCSI (http://devcsi.ukoln.ac.uk/about/)

I hope some of you can make it

Owen

--
Owen Stephens Consulting
http://ostephens.com
e: o...@ostephens.com
t: 0121 288 6936
skype: owen.stephens


Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-03-01 Thread Owen Stephens
Thanks Jason and Ed,

I suspect within this project we'll keep using OAI-PMH because we've got tight 
deadlines and the other project strands (which do stuff with the harvested 
content) need time from the developer. At the moment it looks like we will 
probably combine OAI-PMH with web crawling (using nutch) - so use data from the 

However, that said, one of the things we are meant to be doing is offering 
recommendations or good practice guidelines back to the (repository) community 
based on our experience. If we have time I would love to tackle the questions 
(a)-(d) that you highlight here - perhaps especially (a) and (c). Since this 
particular project is part of the wider JISC 'Discovery' programme 
(http://discovery.ac.uk and tech principles at 
http://technicalfoundations.ukoln.info/guidance/technical-principles-discovery-ecosystem)
 - from which one of the main themes might be summarised as 'work with the web' 
these questions are definitely relevant.

I need to look at Jason's stuff again as I think this definitely has parallels 
with some of the Discovery work, as, of course, does some of the recent 
discussion on here about the question of the indexing of library catalogues by 
search engines.

Thanks again to all who have contributed to the discussion - very useful

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 1 Mar 2012, at 11:42, Ed Summers wrote:

 On Mon, Feb 27, 2012 at 12:15 PM, Jason Ronallo jrona...@gmail.com wrote:
 I'd like to bring this back to your suggestion to just forget OAI-PMH
 and crawl the web. I think that's probably the long-term way forward.
 
 I definitely had the same thoughts while reading this thread. Owen,
 are you forced to stay within the context of OAI-PMH because you are
 working with existing institutional repositories? I don't know if it's
 appropriate, or if it has been done before, but as part of your work
 it would be interesting to determine:
 
 a) how many IRs allow crawling (robots.txt or lack thereof)
 b) how many IRs support crawling with a sitemap
 c) how many IR HTML splashpages use the rel-license [1] pattern
 d) how many IRs support syndication (RSS/Atom) to publish changes
 
 If you could do this in a semi-automated way for the UK it would be
 great if you could then apply it to IRs around the world. It would
 also align really nicely with the sort of work that Jason has been
 doing around CAPS [2].
 
 It seems to me that there might be an opportunity to educate digital
 repository managers about better aligning their content w/ the Web ...
 instead of trying to cook up new standards. I imagine this is way out
 of scope for what you are currently doing--if so, maybe this can be
 your next grant :-)
 
 //Ed
 
 [1] http://microformats.org/wiki/rel-license
 [2] https://github.com/jronallo/capsys
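
A rough sketch of how checks (a)-(d) above might be semi-automated -
hypothetical code, using only robots.txt, the conventional sitemap location and
naive string checks on a splash page (the URLs are placeholders):

    import urllib.robotparser
    from urllib.parse import urljoin

    import requests

    def check_repository(base_url, splash_url=None):
        """Rough version of checks (a)-(d): robots.txt, sitemap,
        rel-license and feed autodiscovery on a splash page."""
        report = {}

        # (a) does robots.txt allow crawling (here: of the splash page)?
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(urljoin(base_url, "/robots.txt"))
        rp.read()
        report["crawl_allowed"] = rp.can_fetch("*", splash_url or base_url)

        # (b) is a sitemap published at the conventional location?
        report["has_sitemap"] = requests.get(
            urljoin(base_url, "/sitemap.xml"), timeout=10).status_code == 200

        if splash_url:
            html = requests.get(splash_url, timeout=10).text
            # (c) rel-license and (d) RSS/Atom autodiscovery - naive string
            # checks; a real survey would parse the HTML properly.
            report["rel_license"] = 'rel="license"' in html
            report["has_feed"] = ("application/rss+xml" in html
                                  or "application/atom+xml" in html)
        return report

    print(check_repository("http://repository.example.ac.uk",
                           "http://repository.example.ac.uk/123/"))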


Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-03-01 Thread Owen Stephens
Thanks Ian,

Agree that it is clear from this discussion that there are differing viewpoints 
and also very different requirements depending on the context and desired 
outcomes.

I think I said earlier in the thread - I'm not against niche solutions, they 
just make me want to double check that they are justified. For me I'd say the 
jury is still out on 'crawl' vs 'harvest' - but I think it definitely needs 
more investigation and thought - and of course different problems require 
different solutions. It would be interesting to try to go through the case for 
OAI-PMH, especially specific examples where it has achieved something that 
would have been difficult/impossible to do with more general solutions. Not 
sure if that could be done here on list, or better/easier through other 
discussion - or both (possibly over that beer? :)

From the CORE project, any 'best practice' would be focussed on institutional 
research publication repositories, and I it seems highly unlikely to make a 
recommendation on 'crawl' vs 'harvest' - we just won't have time to do enough 
work on this to understand the pros/cons of these even from our own singular 
perspective. I think any recommendations are more along the lines of ensuring 
robots.txt is consistent with other policies; the impact of using splash pages 
as opposed to links to actual resources in the OAI-PMH feed; configuring 
access to embargoed papers (as per Raffaele's suggestion); how to deal with 
multi-part resources etc. Anything coming out of the project would, of course, 
be just one projects recommendations for JISC to consider not more than that. 

Cheers,

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 1 Mar 2012, at 14:38, Ian Ibbotson wrote:

 Owen...
 
 Just wanted to say that, whilst I've been silent since my initial response,
 I'm not sure I agree with all the viewpoints presented here.. From a point
 of view of (for example, CultureGrid) I'm not sure what has been done could
 have been pragmatically achieved solely with web crawling as it's described
 in this thread. Don't have a problem with anything that's been written here.
 It certainly represents a great cross-section of viewpoints. However, from a
 jisc discovery perspective, I don't want to contribute to any confirmation
 bias that we could dispose of pesky old OAI. I'd be interested in providing
 a counter-point to any Best practice document that suggested we could.
 
 Ian.
 
 On Thu, Mar 1, 2012 at 12:36 PM, Owen Stephens o...@ostephens.com wrote:
 
 Thanks Jason and Ed,
 
 I suspect within this project we'll keep using OAI-PMH because we've got
 tight deadlines and the other project strands (which do stuff with the
 harvested content) need time from the developer. At the moment it looks
 like we will probably combine OAI-PMH with web crawling (using nutch) - so
 use data from the
 
 However, that said, one of the things we are meant to be doing is offering
 recommendations or good practice guidelines back to the (repository)
 community based on our experience. If we have time I would love to tackle
 the questions (a)-(d) that you highlight here - perhaps especially (a) and
 (c). Since this particular project is part of the wider JISC 'Discovery'
 programme (http://discovery.ac.uk and tech principles at
 http://technicalfoundations.ukoln.info/guidance/technical-principles-discovery-ecosystem)
 - from which one of the main themes might be summarised as 'work with the
 web' these questions are definitely relevant.
 
 I need to look at Jason's stuff again as I think this definitely has
 parallels with some of the Discovery work, as, of course, does some of the
 recent discussion on here about the question of the indexing of library
 catalogues by search engines.
 
 Thanks again to all who have contributed to the discussion - very useful
 
 Owen
 
 Owen Stephens
 Owen Stephens Consulting
 Web: http://www.ostephens.com
 Email: o...@ostephens.com
 Telephone: 0121 288 6936
 
 On 1 Mar 2012, at 11:42, Ed Summers wrote:
 
 On Mon, Feb 27, 2012 at 12:15 PM, Jason Ronallo jrona...@gmail.com
 wrote:
 I'd like to bring this back to your suggestion to just forget OAI-PMH
 and crawl the web. I think that's probably the long-term way forward.
 
 I definitely had the same thoughts while reading this thread. Owen,
 are you forced to stay within the context of OAI-PMH because you are
 working with existing institutional repositories? I don't know if it's
 appropriate, or if it has been done before, but as part of your work
 it would be interesting to determine:
 
 a) how many IRs allow crawling (robots.txt or lack thereof)
 b) how many IRs support crawling with a sitemap
 c) how many IR HTML splashpages use the rel-license [1] pattern
 d) how many IRs support syndication (RSS/Atom) to publish changes
 
 If you could do this in a semi-automated way for the UK it would be
 great if you could then apply it to IRs around the world

Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-02-27 Thread Owen Stephens
 of
 full-text indexing in such indexes as Google or Summon, nor does it allow
 them to restrict where copies are served from.  Similarly, the dc:rights
 section in the OAI-PMH records address copyright only.  In practice, Google
 crawls, indexes, and serves full-text copies of our dissertations.
 


Of course, it is absolutely reasonable that some content either not be open or 
have an embargo period - in which case I'd expect it to either not be added to 
the repository, or added and protected by some security which prevents public 
access. I know that in some cases authors wish to delay release of the thesis 
in order to publish a book which may draw on the PhD research - and this can 
take several years, although different institutions set different limits on 
this. I also know of at least one case where a PhD contained information that 
was deemed so confidential, it was agreed never to release it (I wasn't allowed 
to know what the information was!)

In theory copyright could be seen as sufficient to cover the use of the 
full-text item by third parties - either Google is protected by fair use (in 
the US anyway) or not. Unfortunately (and this would certainly be true in the 
UK) - the only way of really discovering if you have a case against Google 
would be to take them to court. Google would say (as they did to the 
newspapers) it's easy to request we don't index/cache your content - we obey 
robots.txt. Which sort of brings me back to the starting point of the project 
I'm working on - while two wrongs don't make a right, it seems to us that if 
repositories are not preventing Google (or others - for example notably 
CiteSeerX is in the business of crawling repositories 
http://csxstatic.ist.psu.edu/about/crawler) crawling/indexing/caching their 
content, then we hope that a non-profit, publicly funded, service should feel 
able to do the same in the interests of making the content of repositories more 
discoverable and more widely disseminated.

Owen


Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936


Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-02-27 Thread Owen Stephens
On 27 Feb 2012, at 13:31, Diane Hillmann wrote:

 On Mon, Feb 27, 2012 at 5:25 AM, Owen Stephens o...@ostephens.com wrote:
 
 
 providers provide such intermediate pages (arxiv.org, for instance). The
 other issue driving providers towards intermediate pages is that it allows
 them to continue to derive statistics from usage of their materials, which
 direct access URIs and multiple web caches don't.  For providers dependent
 on external funding, this is a biggie.
 

Definitely proof of use is a big issue - and one I've seen in other scenarios 
(for example, museums discussing whether to open up access to collections 
online) although it really feels like the tail wagging the dog. However, if 
this is *the* key issue for repositories then it would be good to look at 
alternative approaches - for example it would be possible to provide an API 
back from services with usage stats per paper/URI, or possibly simply pass on 
'clicks' when a cached paper is accessed.

I realise that this depends on cooperation of the third party, and you aren't 
going to always get this - but then, get perfection when tracking use is never 
going to happen. Perhaps we need to both be more robust in justifying open 
access as part of a public good mission (otherwise you could just leave it to 
the publishers?) and consider the question of measuring and reporting impact of 
offering papers in repositories in a more sophisticated way.

On the other hand, it may be that repository managers/institutions have other 
reasons for not wanting the full-text to be directly accessed - e.g. they 
believe that it would be against some of the terms and conditions set by 
publishers regarding self-archiving (or seen to be encouraging others to break 
the TC?).

Owen


Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-02-26 Thread Owen Stephens
On 24 Feb 2012, at 16:52, Ian Ibbotson wrote:

 Sorry.. late to the discussion...
 
 Isn't this a little apples and oranges?
 
 Surely robots.txt exists because many static resources are served directly
 from a tree structured filesystem?
 
 (Nearly) all OAI requests are responded to by specific service applications
 which are perfectly capable of deciding, on a resource by resource basis if
 an anonymous user should or should not see a given resource. As has been
 said, why would you list a resource in OAI if you didn't think **someone**
 would find it useful. If you want to take something out of circulation, you
 mark it deleted so that clients connecting for updates know it should be
 removed.
 
 OAI isn't about fully enumerating a tree on every visit to see what's new,
 it's about a short and efficient visit to say What, if anything, has
 changed since I was last here. I don't want to have to walk an entire
 repository of 3 million items to discover item 299 was deleted.. I want
 a message to say Oh, item 299 was removed on X.
 

I agree about OAI being an efficient way of harvesting content & finding 
changes, and perhaps for repositories on the scale of millions of items it 
would be needed (although if you get to that scale, perhaps other approaches 
like dumps of data and deltas would be even better?) - however, most 
Institutional repositories aren't close to this scale (yet?).

I also agree there is a bit of apples and oranges here - they aren't exactly 
the same thing. However, in some scenarios - and I think really the main ones - 
the intended outcome seems to be the same. Google Scholar seems to me to be the 
main point of comparison - this harvests metadata (if correctly embedded in 
html meta tags) but does it via crawling web pages not OAI-PMH. Because of the 
advantages of being in Google Scholar (people use it!) repositories support 
this mechanism anyway - making OAI-PMH an additional overhead. My 
investigations so far definitely suggest these multiple routes lead to 
inconsistencies in configuration of different mechanisms.
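
For comparison, the metadata Google Scholar relies on is just the citation_*
meta tags embedded in the splash page, which are easy to pull out - a minimal
sketch (the repository URL is a placeholder):

    import re
    import requests
    from bs4 import BeautifulSoup

    # Fetch a repository splash page and collect the citation_* meta tags
    # that Google Scholar reads.
    html = requests.get("http://repository.example.ac.uk/123/", timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    meta = {}
    for tag in soup.find_all("meta", attrs={"name": re.compile(r"^citation_")}):
        meta.setdefault(tag["name"], []).append(tag.get("content", ""))

    print(meta.get("citation_title"))
    print(meta.get("citation_author"))
    print(meta.get("citation_pdf_url"))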

I don't think my thoughts on it are completely clear either! But OAI-PMH is 
clearly 'niche' compared to the web, and while niche is sometimes needed, it 
always makes me slightly jumpy :)

Owen


Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-02-26 Thread Owen Stephens
On 24 Feb 2012, at 18:20, Joe Hourcle wrote:

 On Feb 24, 2012, at 9:25 AM, Kyle Banerjee wrote:
 I see it like the people who request that their pages not be cached elsewhere 
 -- they want to make their object 'discoverable', but they want to control 
 the access to those objects -- so it's one thing for a search engine to get a 
 copy, but they don't want that search engine being an agent to distribute 
 copies to others.
 
That's maybe true - certainly some repositories publish policy statements that 
imply this type of thinking - e.g. a typical phrase used is "Full items must 
not be harvested by robots except transiently for full-text indexing or 
citation analysis". This type of policy is usually made available via OAI-PMH 
'Identify'. There are some issues with this. Firstly, textual policy statements 
like this don't help when you want to machine harvest many repositories. 
Secondly, these statements won't ever be seen by a web crawler. Thirdly 
'transiently' is not defined. Lastly, the limitation to two specific uses seems 
odd - for instance it would seem to me that semantic analysis of the text would 
not strictly be covered by this - but was this the intention of those framing 
the policy, or did they just want to say don't copy our stuff and serve it up 
from your own application (of course, different repositories will have 
different views on this).

Also some of the policies go further than this. For example the University of 
Cambridge policy states that *for metadata* The metadata must not be re-used 
in any medium for commercial purposes without formal permission - but does not 
block search engines from crawling in robots.txt - this is the kind of thing I 
see as inconsistent. I realise robots.txt is just a request to search engines, 
and isn't equivalent to a policy on reuse (e.g. a permissive robots.txt doesn't 
imply there is no copyright in the content being made available) - but there is 
no doubt that Google use the content they harvest for commercial purposes. So, 
this is a mixed message to some extent - meaning a well behaved OAI-PMH 
harvester might feel more constrained than a well behaved web crawler (even 
though I guess the legal situation would be pretty much the same for both in 
terms of actual rights to using the data harvested).

Again, I don't mean to pick on Cambridge - they aren't the only institution to 
run this kind of policy, but they are one everyone will have heard of :)

 Eg, all of the journal publishers who charge access fees -- they want people 
 to find that they have a copy of that article that you're interested in ... 
 but they want to collect their $35 for you to read it.

Agreed - this type of issue came up with Google News and led to the 
introduction of the 'first click free' programme 
(http://googlenewsblog.blogspot.com/2009/12/update-to-first-click-free.html) - 
although I'm not sure this is still in action?

 
 In the case of scientific data, the problem is that to make stuff 
 discoverable, we often have to perform some lossy transformation to fit some 
 metadata standard, and those standards rarely have mechanisms for describing 
 error (accuracy, precision, etc.).  You can do some science with the catalog 
 records, but it's going to introduce some bias into your results, so you're 
 typically better off getting the data from the archive.  (and sometimes, they 
 have nice clean catalogs in FITS, VOTable, CDF, NetCDF, HDF or whatever their 
 discipline's preferred data format is)

This is going into areas I'm not so familiar with - at the moment the project 
I'm working on is looking at article level data only (so mostly pdfs with 
straightforward metadata)
 
 ...
 
 Also, I don't know if things have changed in the last year, but I seem to 
 remember someone mentioning at last year's RDAP (Research Data Access & 
 Preservation) summit that Google had coordinated with some libraries for 
 feeds from their catalogs, but was only interested in books, not other 
 objects.
 
 I don't know how other search engines might use data from OAI-PMH, or if 
 they'd filter it because they didn't consider it to be information they cared 
 about.
 
I don't think that Google ever used OAI-PMH to harvest metadata like this, 
although they did use it for sitemaps for a short time 
http://googlewebmastercentral.blogspot.com/2008/04/retiring-support-for-oai-pmh-in.html.
 It may be they have used it in specific cases to get library catalogue 
records, but I'm not aware of it.

Thanks

Owen


Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-02-26 Thread Owen Stephens
Thanks Peter - this sounds very interesting.

My main plea would be that some consideration is given to how web search 
engines interact with the same data. If web search engines feel free to ignore 
policies, and are left to it by publishers (and I realise NISO doesn't have 
control over this!) then we end up with a 'might is right' scenario. So I 
believe we should be aiming at:

Policies expressed in machine readable formats
Policies that are realistically implementable on a (semi-) automated basis 
(that probably means 'not very nuanced')
A single mechanism that both web crawlers, and any other mechanisms like 
OAI-PMH can follow

I realise these may not be achievable, but just my thoughts 

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 25 Feb 2012, at 22:18, Peter Noerr wrote:

 This post veers nearer to something I was going to add as an FYI, so here 
 goes...
 
 FYI: NISO has recently started a working group to study best practices for 
 discovery services. The ODI (=Open Discovery Initiative) working group is 
 hoping to look at exactly this issue (how should a content provider tell a 
 content requestor what it can have) among others (how to convey commercial 
 restrictions, how to produce statistics meaningful to providers, discovery 
 services, and consumers of the discovery service), and hopefully produce 
 guidelines on procedures and formats, etc. for this. 
 
 This is a new working group and its timescale doesn't expect any deliverables 
 until Q3 of 2012, so it is a bit late to help Owen, but anyone who is 
 interested in this may want to follow, from time to time, the NISO progress. 
 Look at www.niso.org and find the ODI working group. If you're really 
 interested contact the group to offer thoughts. And many of you may be 
 contacted by a survey to find out your thoughts as part of the process, 
 anyway. Just like the long reach of OCLC, there is no escaping NISO.
 
 Peter   
 
 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Joe 
 Hourcle
 Sent: Friday, February 24, 2012 10:20 AM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Repositories, OAI-PMH and web crawling
 
 On Feb 24, 2012, at 9:25 AM, Kyle Banerjee wrote:
 
 
 One of the questions this raises is what we are/aren't allowed to do
 in terms of harvesting full-text. While I realise we could get into
 legal stuff here, at the moment we want to put that question to one
 side. Instead we want to consider what Google, and other search
 engines, do, the mechanisms available to control this, and what we
 do, and the equivalent mechanisms - our starting point is that we
 don't feel we should be at a disadvantage to a web search engine in
 our harvesting and use of repository records.
 
 Of course, Google and other crawlers can crawl the bits of the
 repository that are on the open web, and 'good' crawlers will obey
 the contents of robots.txt We use OAI-PMH, and while we often see
 (usually general and sometimes
 contradictory) statements about what we can/can't do with the
 contents of a repository (or a specific record), it feels like there
 isn't a nice simple mechanism for a repository to say don't harvest this 
 bit.
 
 
 I would argue there is -- the whole point of OAI-PMH is to make stuff
 available for harvesting. If someone goes to the trouble of making
 things available via a protocol that exists only to make things
 harvestable and then doesn't want it harvested, you can dismiss them
 as being totally mental.
 
 I see it like the people who request that their pages not be cached 
 elsewhere -- they want to make
 their object 'discoverable', but they want to control the access to those 
 objects -- so it's one thing
 for a search engine to get a copy, but they don't want that search engine 
 being an agent to distribute
 copies to others.
 
 Eg, all of the journal publishers who charge access fees -- they want people 
 to find that they have a
 copy of that article that you're interested in ... but they want to collect 
 their $35 for you to read
 it.
 
 In the case of scientific data, the problem is that to make stuff 
 discoverable, we often have to
 perform some lossy transformation to fit some metadata standard, and those 
 standards rarely have
 mechanisms for describing error (accuracy, precision, etc.).  You can do 
 some science with the catalog
 records, but it's going to introduce some bias into your results, so you're 
 typically better off
 getting the data from the archive.  (and sometimes, they have nice clean 
 catalogs in FITS, VOTable,
 CDF, NetCDF, HDF or whatever their discipline's preferred data format is)
 
 ...
 
 Also, I don't know if things have changed in the last year, but I seem to 
 remember someone mentioning
 at last year's RDAP (Research Data Access & Preservation) summit that Google 
 had coordinated with some
 libraries for feeds from their catalogs

Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-02-26 Thread Owen Stephens
On 24 Feb 2012, at 18:20, Joe Hourcle wrote:
 
 I see it like the people who request that their pages not be cached elsewhere 
 -- they want to make their object 'discoverable', but they want to control 
 the access to those objects -- so it's one thing for a search engine to get a 
 copy, but they don't want that search engine being an agent to distribute 
 copies to others.
Also meant to say that Google (and others) support a 'Noarchive' instruction 
(not quite sure if this can be implemented in robots.txt or only via robots 
meta tags and x-robots-tags - if anyone can tell me I'd be grateful) which I 
think would fulfil this type of instruction - index, but don't keep a copy.

Owen

Re: [CODE4LIB] URL checking for the catalog

2012-02-24 Thread Owen Stephens
It's not quite the same thing, but I worked on a project a couple of years ago 
integrating references/citations into a learning environment (called Telstar 
http://www8.open.ac.uk/telstar/) , and looked at the question of how to deal 
with broken links from references.

We proposed a more reactive mechanism than running link checking software. This 
clearly has some disadvantages, but I think a major advantage is the targetting 
of staff time towards those links that are being used. The mechanism proposed 
was to add a level of redirection, with an intermediary script checking the 
availability of the destination URL before either:

a) passing the user on to the destination
b) finding the destination URL unresponsive (e.g. 404), automatically reporting 
the issue to library staff, and directing the user to a page explaining that 
the resource was not currently responding and that library staff had been 
informed

Particularly we proposed putting the destination URL into the rft_id of an 
OpenURL to achieve this, but this was only because it allowed us to piggyback 
on existing infrastructure using a standard approach - you could do the same 
with a simple script, with the destination URL as a parameter (if you are 
really interested, we created a new Source parser in SFX to do (a) and (b) ). 
Because we didn't necessarily have control over the URL in the reference, we 
also built a table that allowed us to map broken URLs being used in the 
learning environment to alternative URLs so we could offer a temporary redirect 
while we worked with the relevant staff to get corrections made to the 
reference link.
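
A minimal sketch of that kind of intermediary - hypothetical Flask code, where
the parameter name, the timeout and the reporting hook are all assumptions
rather than what Telstar actually did in SFX:

    import requests
    from flask import Flask, redirect, request, render_template_string

    app = Flask(__name__)

    # Hypothetical lookup table mapping known-broken URLs to alternatives.
    REMAPPED = {}

    def report_broken_link(url, status):
        # Placeholder: in practice this would email or ticket library staff.
        print("Broken link reported: %s (%s)" % (url, status))

    @app.route("/resolve")
    def resolve():
        target = request.args.get("url")
        target = REMAPPED.get(target, target)
        try:
            # Some servers reject HEAD; a real version might fall back to GET.
            status = requests.head(target, allow_redirects=True,
                                   timeout=10).status_code
        except requests.RequestException:
            status = None
        if status and status < 400:
            return redirect(target)             # (a) pass the user on
        report_broken_link(target, status)      # (b) tell library staff
        return render_template_string(
            "<p>This resource is not responding at the moment; "
            "library staff have been informed.</p>"), 502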

There's some more on this at 
http://www.open.ac.uk/blogs/telstar/remit-toc/remit-the-open-university-approach/remit-providing-links-to-resources-from-references/6-8-3-telstar-approach/
 although for some reason (my fault) this doesn't include a write up of the 
link checking process/code we created.

Of course, this approach is in no way incompatible with regular proactive link 
checking.

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 23 Feb 2012, at 17:02, Tod Olson wrote:

 There's been some recent discussion at our site about revi(s|v)ing URL 
 checking in our catalog, and I was wondering if other sites have any 
 strategies that they have found to be effective.
 
 We used to run some home-grown link checking software. It fit nicely into a 
 shell pipeline, so it was easy to filter out sites that didn't want to be 
 link checked. But still the reports had too many spurious errors. And with 
 over a million links in the catalog, there are some issues of scale, both for 
 checking the links and consuming any report.
 
 Anyhow, if you have some system you use as part of catalog link maintenance, 
 or if there's some link checking software that you've had good experiences 
 with, or if there's some related experience you'd like to share, I'd like to 
 hear about it.
 
 Thanks,
 
 -Tod
 
 
 Tod Olson t...@uchicago.edu
 Systems Librarian 
 University of Chicago Library


Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-02-24 Thread Owen Stephens
Thanks both...

Kyle said: "If someone goes to the trouble of making things available via a 
protocol that exists only to make things harvestable and
then doesn't want it harvested, you can dismiss them ..."

True - but that's essentially what Southampton's configuration seems to say.

Thomas said: "The M in PMH still stands for Metadata, right?  So opening an 
OAI-PMH server implicitly says you're willing to share metadata.  I can 
certainly sympathize with sites wanting to do that but not necessarily wanting 
to offer anything more than normal end-user access to full text."

This is a fair point - but I've yet to see an example of a robots.txt file that 
makes this distinction - that is, in general Google is not being told to not 
crawl and cache pdfs, while being granted explicit permission to crawl the 
metadata, no matter what the OAI-PMH situation.

Kyle said: "OAI-PMH runs on top of HTTP, so anything robots.txt already applies 
-- i.e. if they want you to crawl metadata only but not download the objects 
themselves because they don't want to deal with the load or bandwidth charges, 
this should be indicated for all crawlers."

OK - this suggests a way forward for me. Although I don't think we can regard 
robots.txt as applying across the board for OAI-PMH (as in the Southampton 
example, the OAI-PMH endpoint is disallowed by robots.txt), it seems to make 
sense that given a resource identifier in the metadata we could use robots.txt 
(and I guess potentially x-robots-tag, assuming most of the resources are not 
simple html) to see whether a web crawler is permitted to crawl it, and so make 
the right decision about what we do.
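
A minimal sketch of that check - given a resource URL from harvested metadata,
ask the host's robots.txt whether a crawler may fetch it (the user-agent string
and URL are hypothetical):

    import urllib.robotparser
    from urllib.parse import urlsplit, urlunsplit

    def crawler_may_fetch(resource_url, user_agent="example-harvester"):
        """Check the host's robots.txt before fetching a resource URL
        taken from harvested metadata."""
        parts = urlsplit(resource_url)
        robots_url = urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(robots_url)
        rp.read()
        return rp.can_fetch(user_agent, resource_url)

    print(crawler_may_fetch("http://eprints.example.ac.uk/123/1/paper.pdf"))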

That sounds vaguely sensible (although I'm still left thinking maybe we should 
just use a web crawler and ignore OAI-PMH - but I guess this way we get the best 
of both worlds).

Thanks again (and of course further thoughts welcome)

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 24 Feb 2012, at 14:45, Thomas Dowling wrote:

 On 02/24/2012 09:25 AM, Kyle Banerjee wrote:
 
 We use OAI-PMH, and while we often see (usually general and sometimes
 contradictory) statements about what we can/can't do with the contents of a
 repository (or a specific record), it feels like there isn't a nice simple
 mechanism for a repository to say don't harvest this bit.
 
 
 I would argue there is -- the whole point of OAI-PMH is to make stuff
 available for harvesting. If someone goes to the trouble of making things
 available via a protocol that exists only to make things harvestable and
 then doesn't want it harvested, you can dismiss them as being totally
 mental.
 
 The M in PMH still stands for Metadata, right?  So opening an OAI-PMH
 server implicitly says you're willing to share metadata.  I can certainly
 sympathize with sites wanting to do that but not necessarily wanting to
 offer anything more than normal end-user access to full text.
 
 That said, in a world with unfriendly bots, the repository should still be
 making informed choices about controlling full text crawlers (robots.txt,
 meta tags, HTTP cache directives, etc etc.).
 
 
 -- 
 Thomas Dowling
 thomas.dowl...@gmail.com


Re: [CODE4LIB] Namespace management, was Models of MARC in RDF

2011-12-12 Thread Owen Stephens
The other issue that the 'modelling' brings (IMO) is that the model influences 
use - or better the other way round, the intended use and/or audience should 
influence the model. This raises questions for me about the value of a 
'neutral' model - which is what I perceive libraries as aiming for - treating 
users as a homogenous mass with needs that will be met by a single approach. 
Obviously there are resource implications to developing multiple models for 
different uses/audiences, and once again I'd argue that an advantage of the 
linked data approach is that it allows for the effort to be distributed amongst 
the relevant communities.

To be provocative - has the time come for us to abandon the idea that 
'libraries' act as one where cataloguing is concerned, and our metadata serves 
the same purpose in all contexts? (I can't decide if I'm serious about this or 
not!)

Owen



Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 11 Dec 2011, at 23:47, Karen Coyle wrote:

 Quoting Richard Wallis richard.wal...@talis.com:
 
 
 You get the impression that the BL chose a subset of their current
 bibliographic data to expose as LD - it was kind of the other way around.
 Having modeled the 'things' in the British National Bibliography domain
 (plus those in related domain vocabularis such as VIAF, LCSH, Geonames,
 Bio, etc.), they then looked at the information held in their [Marc] bib
 records to identify what could be extracted to populate it.
 
 Richard, I've been thinking of something along these lines myself, especially 
 as I see the number of translating X to RDF projects go on. I begin to 
 wonder what there is in library data that is *unique*, and my conclusion is: 
 not much. Books, people, places, topics: they all exist independently of 
 libraries, and libraries cannot take the credit for creating any of them. So 
 we should be able to say quite a bit about the resources in libraries using 
 shared data points -- and by that I mean, data points that are also used by 
 others. So once you decide on a model (as BL did), then it is a matter of 
 looking *outward* for the data to re-use.
 
 I maintain, however, as per my LITA Forum talk [1] that the subject headings 
 (without talking about quality thereof) and classification designations that 
 libraries provide are an added value, and we should do more to make them 
 useful for discovery.
 
 
 
 I know it is only semantics (no pun intended), but we need to stop using
 the word 'record' when talking about the future description of 'things' or
 entities that are then linked together.   That word has so many built in
 assumptions, especially in the library world.
 
 I'll let you battle that one out with Simon :-), but I am often at a loss for 
 a better term to describe the unit of metadata that libraries may create in 
 the future to describe their resources. Suggestions highly welcome.
 
 kc
 [1] http://kcoyle.net/presentations/lita2011.html
 
 
 
 
 
 -- 
 Karen Coyle
 kco...@kcoyle.net http://kcoyle.net
 ph: 1-510-540-7596
 m: 1-510-435-8234
 skype: kcoylenet


Re: [CODE4LIB] Namespace management, was Models of MARC in RDF

2011-12-12 Thread Owen Stephens
On 11 Dec 2011, at 23:30, Richard Wallis wrote:

 
 There is no document I am aware of, but I can point you at the blog post by
 Tim Hodson [
 http://consulting.talis.com/2011/07/british-library-data-model-overview/]
 who helped the BL get to grips with and start thinking Linked Data.
 Another by the BL's Neil Wilson [
 http://consulting.talis.com/2011/10/establishing-the-connection/] filling
 in the background around his recent presentations about their work.

Neil Wilson at the BL has indicated a few times that in principle the BL has no 
problem sharing the software they used to extract the relevant data from the 
MARC records, but that there are licensing issues around the s/w due to the use 
of a proprietary compiler (sorry, I don't have any more details so I can't 
explain any more than this). I'm not sure whether this extends to sharing the 
source that would tell us what exactly was happening, but I think this would be 
worth more discussion with Neil - I'll try to pursue it with him when I get a 
chance

Owen


Re: [CODE4LIB] Models of MARC in RDF

2011-12-07 Thread Owen Stephens
Fair point. Just instinct on my part that putting it in a triple is a bit ugly 
:)

It probably doesn't make any difference, although I don't think storing in a 
triple ensures that it sticks to the object (you could store the triple 
anywhere as well)

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 6 Dec 2011, at 22:43, Fleming, Declan wrote:

 Hi - point at it where?  We could point back to the library catalog that we 
 harvested in the MARC to MODS to RDF process, but what if that goes away?  
 Why not write ourselves a 1K insurance policy that sticks with the object for 
 its life?
 
 D
 
 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Owen 
 Stephens
 Sent: Tuesday, December 06, 2011 8:06 AM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Models of MARC in RDF
 
 I'd suggest that rather than shove it in a triple it might be better to point 
 at alternative representations, including MARC if desirable (keep meaning to 
 blog some thoughts about progressively enhanced metadata...)
 
 Owen
 
 Owen Stephens
 Owen Stephens Consulting
 Web: http://www.ostephens.com
 Email: o...@ostephens.com
 Telephone: 0121 288 6936
 
 On 6 Dec 2011, at 15:44, Karen Coyle wrote:
 
 Quoting Fleming, Declan dflem...@ucsd.edu:
 
 Hi - I'll note that the mapping decisions were made by our metadata 
 services (then Cataloging) group, not by the tech folks making it all 
 work, though we were all involved in the discussions.  One idea that 
 came up was to do a, perhaps, lossy translation, but also stuff one 
 triple with a text dump of the whole MARC record just in case we 
 needed to grab some other element out we might need.  We didn't do 
 that, but I still like the idea.  Ok, it was my idea.  ;)
 
 I like that idea! Now that disk space is no longer an issue, it makes good 
 sense to keep around the original state of any data that you transform, 
 just in case you change your mind. I hadn't thought about incorporating the 
 entire MARC record string in the transformation, but as I recall the average 
 size of a MARC record is somewhere around 1K, which really isn't all that 
 much by today's standards.
 
 (As an old-timer, I remember running the entire Univ. of California 
 union catalog on 35 megabytes, something that would now be considered 
 a smallish email attachment.)
 
 kc
 
 
 D
 
 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf 
 Of Esme Cowles
 Sent: Monday, December 05, 2011 11:22 AM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Models of MARC in RDF
 
 I looked into this a little more closely, and it turns out it's a little 
 more complicated than I remembered.  We built support for transforming to 
 MODS using the MODS21slim2MODS.xsl stylesheet, but don't use that.  
 Instead, we use custom Java code to do the mapping.
 
 I don't have a lot of public examples, but there's at least one public 
 object which you can view the MARC from our OPAC:
 
 http://roger.ucsd.edu/search/.b4827884/.b4827884/1,1,1,B/detlmarc~123
 4567FF=1,0,
 
 The public display in our digital collections site:
 
 http://libraries.ucsd.edu/ark:/20775/bb0648473d
 
 The RDF for the MODS looks like:
 
   <mods:classification rdf:parseType="Resource">
     <mods:authority>local</mods:authority>
     <rdf:value>FVLP 222-1</rdf:value>
   </mods:classification>
   <mods:identifier rdf:parseType="Resource">
     <mods:type>ARK</mods:type>
     <rdf:value>http://libraries.ucsd.edu/ark:/20775/bb0648473d</rdf:value>
   </mods:identifier>
   <mods:name rdf:parseType="Resource">
     <mods:namePart>Brown, Victor W</mods:namePart>
     <mods:type>personal</mods:type>
   </mods:name>
   <mods:name rdf:parseType="Resource">
     <mods:namePart>Amateur Film Club of San Diego</mods:namePart>
     <mods:type>corporate</mods:type>
   </mods:name>
   <mods:originInfo rdf:parseType="Resource">
     <mods:dateCreated>[196-]</mods:dateCreated>
   </mods:originInfo>
   <mods:originInfo rdf:parseType="Resource">
     <mods:dateIssued>2005</mods:dateIssued>
     <mods:publisher>Film and Video Library, University of California, San Diego, La Jolla, CA 92093-0175 http://orpheus.ucsd.edu/fvl/FVLPAGE.HTM</mods:publisher>
   </mods:originInfo>
   <mods:physicalDescription rdf:parseType="Resource">
     <mods:digitalOrigin>reformatted digital</mods:digitalOrigin>
     <mods:note>16mm; 1 film reel (25 min.) :; sd., col. ;</mods:note>
   </mods:physicalDescription>
   <mods:subject rdf:parseType="Resource">
     <mods:authority>lcsh</mods:authority>
     <mods:topic>Ranching</mods:topic>
   </mods:subject>
 
 etc.
 
 
 There is definitely some loss in the conversion process -- I don't know 
 enough about the MARC leader and control fields to know if they are 
 captured in the MODS and/or RDF in some way.  But there are quite

Re: [CODE4LIB] Models of MARC in RDF

2011-12-07 Thread Owen Stephens
When I did a project converting records from UKMARC to MARC21 we kept the 
UKMARC records for a period (about 5 years I think) while we assured ourselves 
that we hadn't missed anything vital. We did occasionally refer back to the 
older record to check things, but having not found any major issues with the 
conversion after that period we felt confident disposing of the record. This is 
the type of usage I was imagining for a copy of the MARC record in this 
scenario.

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 7 Dec 2011, at 01:52, Montoya, Gabriela wrote:

 One critical thing to consider with MARC records (or any metadata, for that 
matter) is that they are not stagnant, so what is the value of storing 
 entire record strings into one triple if we know that metadata is volatile? 
 As an example, UCSD has over 200,000 art images that had their metadata 
 records ingested into our local DAMS over five years ago. Since then, many of 
 these records have been edited/massaged in our OPAC (and ARTstor), but these 
 updated records have not been refreshed in our DAMS. Now we find ourselves 
 needing to desperately have the What is our database of record? 
 conversation.
 
 I'd much rather see resources invested in data synching than spending it in 
 saving text dumps that will most likely not be referred to again.
 
 Dream Team for Building a MARC  RDF Model: Karen Coyle, Alistair Miles, 
 Diane Hillman, Ed Summers, Bradley Westbrook.
 
 Gabriela
 
 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Karen 
 Coyle
 Sent: Tuesday, December 06, 2011 7:44 AM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Models of MARC in RDF
 
 Quoting Fleming, Declan dflem...@ucsd.edu:
 
 Hi - I'll note that the mapping decisions were made by our metadata 
 services (then Cataloging) group, not by the tech folks making it all 
 work, though we were all involved in the discussions.  One idea that 
 came up was to do a, perhaps, lossy translation, but also stuff one 
 triple with a text dump of the whole MARC record just in case we 
 needed to grab some other element out we might need.  We didn't do 
 that, but I still like the idea.  Ok, it was my idea.  ;)
 
 I like that idea! Now that disk space is no longer an issue, it makes good 
 sense to keep around the original state of any data that you transform, 
 just in case you change your mind. I hadn't thought about incorporating the 
 entire MARC record string in the transformation, but as I recall the average 
 size of a MARC record is somewhere around 1K, which really isn't all that 
 much by today's standards.
 
 (As an old-timer, I remember running the entire Univ. of California union 
 catalog on 35 megabytes, something that would now be considered a smallish 
 email attachment.)
 
 kc
 
 
 D
 
 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf 
 Of Esme Cowles
 Sent: Monday, December 05, 2011 11:22 AM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Models of MARC in RDF
 
 I looked into this a little more closely, and it turns out it's a 
 little more complicated than I remembered.  We built support for 
 transforming to MODS using the MODS21slim2MODS.xsl stylesheet, but 
 don't use that.  Instead, we use custom Java code to do the mapping.
 
 I don't have a lot of public examples, but there's at least one public 
 object which you can view the MARC from our OPAC:
 
 http://roger.ucsd.edu/search/.b4827884/.b4827884/1,1,1,B/detlmarc~1234
 567FF=1,0,
 
 The public display in our digital collections site:
 
 http://libraries.ucsd.edu/ark:/20775/bb0648473d
 
 The RDF for the MODS looks like:
 
   <mods:classification rdf:parseType="Resource">
     <mods:authority>local</mods:authority>
     <rdf:value>FVLP 222-1</rdf:value>
   </mods:classification>
   <mods:identifier rdf:parseType="Resource">
     <mods:type>ARK</mods:type>
     <rdf:value>http://libraries.ucsd.edu/ark:/20775/bb0648473d</rdf:value>
   </mods:identifier>
   <mods:name rdf:parseType="Resource">
     <mods:namePart>Brown, Victor W</mods:namePart>
     <mods:type>personal</mods:type>
   </mods:name>
   <mods:name rdf:parseType="Resource">
     <mods:namePart>Amateur Film Club of San Diego</mods:namePart>
     <mods:type>corporate</mods:type>
   </mods:name>
   <mods:originInfo rdf:parseType="Resource">
     <mods:dateCreated>[196-]</mods:dateCreated>
   </mods:originInfo>
   <mods:originInfo rdf:parseType="Resource">
     <mods:dateIssued>2005</mods:dateIssued>
     <mods:publisher>Film and Video Library, University of California, San Diego, La Jolla, CA 92093-0175 http://orpheus.ucsd.edu/fvl/FVLPAGE.HTM</mods:publisher>
   </mods:originInfo>
   <mods:physicalDescription rdf:parseType="Resource">
     <mods:digitalOrigin>reformatted digital

Re: [CODE4LIB] Namespace management, was Models of MARC in RDF

2011-12-07 Thread Owen Stephens
On 7 Dec 2011, at 00:38, Alexander Johannesen wrote:

 Hiya,
 
 Karen Coyle li...@kcoyle.net wrote:
 I wonder how easy it will be to
 manage a metadata scheme that has cherry-picked from existing ones, so
 something like:
 
 dc:title
 bibo:chapter
 foaf:depiction
 
 Yes, you're right in pointing out this as a problem. And my answer is;
 it's complicated. My previous rant on this list was about data
 models*, and dangnabbit if this isn't related as well.
 
 What your example is doing is pointing out a new model based on bits
 of other models. This works fine, for the most part, when the concepts
 are simple; simple to understand, simple to extend. Often you'll find
 that what used to be unclear has grown clear over time (as more and
 more have used FOAF, you'll find some things are more used and better
 understood, while other parts of it fade into 'we don't really use
 that anymore')
 
 But when things get complicated, it *can* render your model unusable.
 Mixed data models can be good, but can also lead directly to meta data
 hell. For example ;
 
  dc:title
  foaf:title
 
 Ouch. Although not a biggie, I see this kind of discrepancy all the
 time, so the argument against mixed models is of course that the power
 of definition lies with you rather than some third-party that might
 change their mind (albeit rare) or have similar terms that differ
 (more often).
 
 I personally would say that the library world should define RDA as you
 need it to be, and worry less about reuse at this stage unless you
 know for sure that the external models do bibliographic meta data
 well.
 

I agree this is a risk, and I suspect there is a further risk around simply the 
feeling of 'ownership' by the community - perhaps it is easier to feel 
ownership over an entire ontology than an 'application profile' of some kind.
It may be that mapping is the solution to this, but if this is really going to 
work I suspect it needs to be done from the very start - otherwise it is just 
another crosswalk, and we'll get varying views on how much one thing maps to 
another (but perhaps that's OK - I'm not looking for perfection).

That said, I believe we need absolutely to be aiming for a world in which we 
work with mixed ontologies - no matter what we do other, relevant, data sources 
will use FOAF, Bibo etc.. I'm convinced that this gives us the opportunity to 
stop treating what are very mixed materials in a single way, while still 
exploiting common properties. For example, musical materials are really not well 
catered for in MARC, and we know there are real issues with applying FRBR to 
them - and I see the implementation of RDF/Linked Data as an opportunity to 
tackle this issue by adopting alternative ontologies where it makes sense, 
while still assigning common properties (dc:title) where they apply.
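
To illustrate the sort of mixing I mean, here's a toy rdflib sketch (invented 
URIs, with the dc/bibo/foaf properties borrowed from Karen's earlier example) 
describing a single resource with properties drawn from several ontologies at once:

from rdflib import Graph, Literal, Namespace, URIRef

DC = Namespace("http://purl.org/dc/elements/1.1/")
BIBO = Namespace("http://purl.org/ontology/bibo/")
FOAF = Namespace("http://xmlns.com/foaf/0.1/")

g = Graph()
for prefix, ns in (("dc", DC), ("bibo", BIBO), ("foaf", FOAF)):
    g.bind(prefix, ns)

# Invented identifier for a (musical) resource
score = URIRef("http://example.org/id/score/456")

# A common property everyone can query on...
g.add((score, DC.title, Literal("An example musical score")))
# ...plus properties cherry-picked from more specialised ontologies
g.add((score, BIBO.chapter, Literal("3")))
g.add((score, FOAF.depiction, URIRef("http://example.org/images/score-456.jpg")))

print(g.serialize(format="turtle"))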


 HOWEVER!
 
 When we're done talking about ontologies and vocabularies, we need to
 talk about identifiers, and there I would swing the other way and let
 reuse govern, because it is when you reuse an identifier you start
 thinking about what that identifiers means to *both* parties. Or, put
 differently ;
 
 It's remarkably easier to get this right if the identifier is a
 number, rather than some word. And for that reason I'd say reuse
 identifiers (subject proxies) as they are easier to get right and
 bring a lot of benefits, but not ontologies (model proxies) as they
 can be very difficult to get right and don't necessarily give you what
 you want.

Agreed :)


Re: [CODE4LIB] Models of MARC in RDF

2011-12-06 Thread Owen Stephens
I'd suggest that rather than shove it in a triple it might be better to point 
at alternative representations, including MARC if desirable (keep meaning to 
blog some thoughts about progressively enhanced metadata...)
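
As a toy illustration of what I mean (an rdflib sketch with made-up URIs - not 
anyone's actual data model): the description carries its own properties and 
simply points at a MARCXML serialisation as another format, rather than 
embedding the record string as a literal.

from rdflib import Graph, Literal, Namespace, URIRef

DCTERMS = Namespace("http://purl.org/dc/terms/")

g = Graph()
g.bind("dcterms", DCTERMS)

# Made-up URIs, purely for illustration
book = URIRef("http://example.org/id/book/123")
marcxml = URIRef("http://example.org/marc/123.marcxml")

g.add((book, DCTERMS.title, Literal("An example title")))
# Point at the original MARC as another representation rather than
# stuffing the whole record string into a literal
g.add((book, DCTERMS.hasFormat, marcxml))

print(g.serialize(format="turtle"))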

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 6 Dec 2011, at 15:44, Karen Coyle wrote:

 Quoting Fleming, Declan dflem...@ucsd.edu:
 
 Hi - I'll note that the mapping decisions were made by our metadata services 
 (then Cataloging) group, not by the tech folks making it all work, though we 
 were all involved in the discussions.  One idea that came up was to do a, 
 perhaps, lossy translation, but also stuff one triple with a text dump of 
 the whole MARC record just in case we needed to grab some other element out 
 we might need.  We didn't do that, but I still like the idea.  Ok, it was my 
 idea.  ;)
 
 I like that idea! Now that disk space is no longer an issue, it makes good 
 sense to keep around the original state of any data that you transform, 
 just in case you change your mind. I hadn't thought about incorporating the 
 entire MARC record string in the transformation, but as I recall the average 
 size of a MARC record is somewhere around 1K, which really isn't all that 
 much by today's standards.
 
 (As an old-timer, I remember running the entire Univ. of California union 
 catalog on 35 megabytes, something that would now be considered a smallish 
 email attachment.)
 
 kc
 
 
 D
 
 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Esme 
 Cowles
 Sent: Monday, December 05, 2011 11:22 AM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Models of MARC in RDF
 
 I looked into this a little more closely, and it turns out it's a little 
 more complicated than I remembered.  We built support for transforming to 
MODS using the MARC21slim2MODS.xsl stylesheet, but don't use that.  Instead, 
 we use custom Java code to do the mapping.
 
 I don't have a lot of public examples, but there's at least one public 
 object which you can view the MARC from our OPAC:
 
 http://roger.ucsd.edu/search/.b4827884/.b4827884/1,1,1,B/detlmarc~1234567FF=1,0,
 
 The public display in our digital collections site:
 
 http://libraries.ucsd.edu/ark:/20775/bb0648473d
 
 The RDF for the MODS looks like:
 
<mods:classification rdf:parseType="Resource">
  <mods:authority>local</mods:authority>
  <rdf:value>FVLP 222-1</rdf:value>
</mods:classification>
<mods:identifier rdf:parseType="Resource">
  <mods:type>ARK</mods:type>
  <rdf:value>http://libraries.ucsd.edu/ark:/20775/bb0648473d</rdf:value>
</mods:identifier>
<mods:name rdf:parseType="Resource">
  <mods:namePart>Brown, Victor W</mods:namePart>
  <mods:type>personal</mods:type>
</mods:name>
<mods:name rdf:parseType="Resource">
  <mods:namePart>Amateur Film Club of San Diego</mods:namePart>
  <mods:type>corporate</mods:type>
</mods:name>
<mods:originInfo rdf:parseType="Resource">
  <mods:dateCreated>[196-]</mods:dateCreated>
</mods:originInfo>
<mods:originInfo rdf:parseType="Resource">
  <mods:dateIssued>2005</mods:dateIssued>
  <mods:publisher>Film and Video Library, University of California, San Diego, La Jolla, CA 92093-0175 http://orpheus.ucsd.edu/fvl/FVLPAGE.HTM</mods:publisher>
</mods:originInfo>
<mods:physicalDescription rdf:parseType="Resource">
  <mods:digitalOrigin>reformatted digital</mods:digitalOrigin>
  <mods:note>16mm; 1 film reel (25 min.) :; sd., col. ;</mods:note>
</mods:physicalDescription>
<mods:subject rdf:parseType="Resource">
  <mods:authority>lcsh</mods:authority>
  <mods:topic>Ranching</mods:topic>
</mods:subject>
 
 etc.
 
 
 There is definitely some loss in the conversion process -- I don't know 
 enough about the MARC leader and control fields to know if they are captured 
 in the MODS and/or RDF in some way.  But there are quite a few local and 
 note fields that aren't present in the RDF.  Other fields (e.g. 300 and 505) 
 are mapped to MODS, but not displayed in our access system (though they are 
 indexed for searching).
 
 I agree it's hard to quantify lossy-ness.  Counting fields or characters 
 would be the most objective, but has obvious problems with control 
 characters sometimes containing a lot of information, and then the relative 
 importance of different fields to the overall description.  There are other 
 issues too -- some fields in this record weren't migrated because they 
 duplicated collection-wide values, which are formulated slightly differently 
 from the MARC record.  Some fields weren't migrated because they concern the 
 physical object, and therefore don't really apply to the digital object.  So 
 that really seems like a morass to me.
 
 -Esme
 --
 Esme Cowles escow...@ucsd.edu
 
 Necessity is the plea for every infringement of human freedom. It is the
 argument of tyrants; it is the creed of slaves. -- William Pitt, 1783

Re: [CODE4LIB] Models of MARC in RDF

2011-12-06 Thread Owen Stephens
I think the strength of adopting RDF is that it doesn't tie us to a single 
vocab/schema. That isn't to say it isn't desirable for us to establish common 
approaches, but that we need to think slightly differently about how this is 
done - more application profiles than 'one true schema'.

This is why RDA worries me - because it (seems to?) suggest that we define a 
schema that stands alone from everything else and that is used by the library 
community. I'd prefer to see the library community adopting the best of what 
already exists and then enhancing where the existing ontologies are lacking. If 
we are going to have a (web of) linked data, then re-use of ontologies and IDs 
is needed. For example in the work I did at the Open University in the UK we 
ended up using only a single property from a specific library ontology (the draft 
ISBD http://metadataregistry.org/schemaprop/show/id/1957.html has place of 
publication, production, distribution).

I think it is interesting that many of the MARC-RDF mappings so far have 
adopted many of the same ontologies (although no doubt partly because there is 
a 'follow the leader' element to this - or at least there was for me when 
looking at the transformation at the Open University)

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 5 Dec 2011, at 18:56, Jonathan Rochkind wrote:

 On 12/5/2011 1:40 PM, Karen Coyle wrote:
 
 This brings up another point that I haven't fully grokked yet: the use of 
 MARC kept library data consistent across the many thousands of libraries 
 that had MARC-based systems. 
 
 Well, only somewhat consistent, but, yeah.
 
 What happens if we move to RDF without a standard? Can we rely on linking to 
 provide interoperability without that rigid consistency of data models?
 
 Definitely not. I think this is a real issue.  There is no magic to linking 
 or RDF that provides interoperability for free; it's all about the 
 vocabularies/schemata -- whether in MARC or in anything else.   (Note 
 different national/regional  library communities used different schemata in 
 MARC, which made interoperability infeasible there. Some still do, although 
 gradually people have moved to Marc21 precisely for this reason, even when 
 Marc21 was less powerful than the MARC variant they started with).
 
 That is to say, if we just used MARC's own implicit vocabularies, but output 
 them as RDF, sure, we'd still have consistency, although we wouldn't really 
 _gain_ much.On the other hand, if we switch to a new better vocabulary -- 
 we've got to actually switch to a new better vocabulary.  If it's just 
 whatever anyone wants to use, we've made it VERY difficult to share data, 
 which is something pretty darn important to us.
 
 Of course, the goal of the RDA process (or one of em) was to create a new 
 schema for us to consistently use. That's the library community effort to 
 maintain a common schema that is more powerful and flexible than MARC.  If 
 people are using other things instead, apparently that failed, or at least 
 has not yet succeeded.


Re: [CODE4LIB] Models of MARC in RDF

2011-12-02 Thread Owen Stephens
Hi Esme - thanks for this. Do you have any documentation on which predicates 
you've used and MODS-RDF transformation?

Owen

On 2 Dec 2011, at 16:07, Esme Cowles escow...@ucsd.edu wrote:

 Owen-
 
 Another strategy for capturing MARC data in RDF is to convert it to MODS (we 
 do this using the LoC MARC to MODS stylesheet: 
 http://www.loc.gov/standards/marcxml/xslt/MARC21slim2MODS.xsl).  From there, 
 it's pretty easy to incorporate into RDF.  There are some issues to be aware 
 of, such as how to map the MODS XML names to predicates and how to handle 
 elements that can appear in multiple places in the hierarchy.
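 
 (Purely as an illustration of that step - it isn't what we actually run, and 
 the file names are placeholders - something like this Python/lxml sketch 
 applies the stylesheet; note it may also need supporting stylesheets such as 
 MARC21slimUtils.xsl alongside it.)
 
 from lxml import etree
 
 # Placeholder file names: a local copy of the LoC stylesheet and a MARCXML record
 transform = etree.XSLT(etree.parse("MARC21slim2MODS.xsl"))
 marcxml = etree.parse("record-marcxml.xml")
 mods = transform(marcxml)
 
 print(etree.tostring(mods, pretty_print=True).decode("utf-8"))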
 
 -Esme
 --
 Esme Cowles escow...@ucsd.edu
 
 Necessity is the plea for every infringement of human freedom. It is the
 argument of tyrants; it is the creed of slaves. -- William Pitt, 1783
 
 On 11/28/2011, at 8:25 AM, Owen Stephens wrote:
 
 It would be great to start collecting transforms together - just a quick 
 brain dump of some I'm aware of
 
 MARC21 transformations
 Cambridge University Library - http://data.lib.cam.ac.uk - transformation 
 made available (in code) from same site
 Open University - http://data.open.ac.uk - specific transform for materials 
 related to teaching, code available at 
 http://code.google.com/p/luceroproject/source/browse/trunk%20luceroproject/OULinkedData/src/uk/ac/open/kmi/lucero/rdfextractor/RDFExtractor.java
  (MARC transform is in libraryRDFExtraction method)
 COPAC - small set of records from the COPAC Union catalogue - data and 
 transform not yet published
 Podes Projekt - LinkedAuthors - documentation at 
 http://bibpode.no/linkedauthors/doc/Pode-LinkedAuthors-Documentation.pdf - 2 
 stage transformation firstly from MARC to FRBRized version of data, then 
 from FRBRized data to RDF. These linked from documentation
 Podes Project - LinkedNonFiction - documentation at 
 http://bibpode.no/linkednonfiction/doc/Pode-LinkedNonFiction-Documentation.pdf
  - MARC data transformed using xslt 
 https://github.com/pode/LinkedNonFiction/blob/master/marcslim2n3.xsl
 
 British Library British National Bibliography - 
 http://www.bl.uk/bibliographic/datafree.html - data model documented, but no 
 code available
 Libris.se - some notes in various presentations/blogposts (e.g. 
 http://dc2008.de/wp-content/uploads/2008/09/malmsten.pdf) but can't find 
 explicit transformation
 Hungarian National library - 
 http://thedatahub.org/dataset/hungarian-national-library-catalog and 
 http://nektar.oszk.hu/wiki/Semantic_web#Implementation - some information on 
 ontologies used but no code or explicit transformation (not 100% sure this 
 is from MARC)
 Talis - implemented in several live catalogues including 
 http://catalogue.library.manchester.ac.uk/  - no documentation or code afaik 
 although some notes in 
 
 MAB transformation
 HBZ - some of the transformation documented at 
 https://wiki1.hbz-nrw.de/display/SEM/Converting+the+Open+Data+from+the+hbz+to+BIBO,
  don't think any code published?
 
 Would be really helpful if more projects published their transformations (or 
 someone told me where to look!)
 
 Owen
 
 Owen Stephens
 Owen Stephens Consulting
 Web: http://www.ostephens.com
 Email: o...@ostephens.com
 Telephone: 0121 288 6936
 
 On 26 Nov 2011, at 15:58, Karen Coyle wrote:
 
 A few of the code4lib talk proposals mention projects that have or will 
 transform MARC records into RDF. If any of you have documentation and/or 
 examples of this, I would be very interested to see them, even if they are 
 under construction.
 
 Thanks,
 kc
 
 -- 
 Karen Coyle
 kco...@kcoyle.net http://kcoyle.net
 ph: 1-510-540-7596
 m: 1-510-435-8234
 skype: kcoylenet


Re: [CODE4LIB] Models of MARC in RDF

2011-12-02 Thread Owen Stephens
Oh - and perhaps just/more importantly - how do you create URIs for you data 
and how do you reconcile against other sources?

Owen

On 2 Dec 2011, at 16:07, Esme Cowles escow...@ucsd.edu wrote:

 Owen-
 
 Another strategy for capturing MARC data in RDF is to convert it to MODS (we 
 do this using the LoC MARC to MODS stylesheet: 
 http://www.loc.gov/standards/marcxml/xslt/MARC21slim2MODS.xsl).  From there, 
 it's pretty easy to incorporate into RDF.  There are some issues to be aware 
 of, such as how to map the MODS XML names to predicates and how to handle 
 elements that can appear in multiple places in the hierarchy.
 
 -Esme
 --
 Esme Cowles escow...@ucsd.edu
 
 Necessity is the plea for every infringement of human freedom. It is the
 argument of tyrants; it is the creed of slaves. -- William Pitt, 1783
 
 On 11/28/2011, at 8:25 AM, Owen Stephens wrote:
 
 It would be great to start collecting transforms together - just a quick 
 brain dump of some I'm aware of
 
 MARC21 transformations
 Cambridge University Library - http://data.lib.cam.ac.uk - transformation 
 made available (in code) from same site
 Open University - http://data.open.ac.uk - specific transform for materials 
 related to teaching, code available at 
 http://code.google.com/p/luceroproject/source/browse/trunk%20luceroproject/OULinkedData/src/uk/ac/open/kmi/lucero/rdfextractor/RDFExtractor.java
  (MARC transform is in libraryRDFExtraction method)
 COPAC - small set of records from the COPAC Union catalogue - data and 
 transform not yet published
 Podes Projekt - LinkedAuthors - documentation at 
 http://bibpode.no/linkedauthors/doc/Pode-LinkedAuthors-Documentation.pdf - 2 
 stage transformation firstly from MARC to FRBRized version of data, then 
 from FRBRized data to RDF. These linked from documentation
 Podes Project - LinkedNonFiction - documentation at 
 http://bibpode.no/linkednonfiction/doc/Pode-LinkedNonFiction-Documentation.pdf
  - MARC data transformed using xslt 
 https://github.com/pode/LinkedNonFiction/blob/master/marcslim2n3.xsl
 
 British Library British National Bibliography - 
 http://www.bl.uk/bibliographic/datafree.html - data model documented, but no 
 code available
 Libris.se - some notes in various presentations/blogposts (e.g. 
 http://dc2008.de/wp-content/uploads/2008/09/malmsten.pdf) but can't find 
 explicit transformation
 Hungarian National library - 
 http://thedatahub.org/dataset/hungarian-national-library-catalog and 
 http://nektar.oszk.hu/wiki/Semantic_web#Implementation - some information on 
 ontologies used but no code or explicit transformation (not 100% sure this 
 is from MARC)
 Talis - implemented in several live catalogues including 
 http://catalogue.library.manchester.ac.uk/  - no documentation or code afaik 
 although some notes in 
 
 MAB transformation
 HBZ - some of the transformation documented at 
 https://wiki1.hbz-nrw.de/display/SEM/Converting+the+Open+Data+from+the+hbz+to+BIBO,
  don't think any code published?
 
 Would be really helpful if more projects published their transformations (or 
 someone told me where to look!)
 
 Owen
 
 Owen Stephens
 Owen Stephens Consulting
 Web: http://www.ostephens.com
 Email: o...@ostephens.com
 Telephone: 0121 288 6936
 
 On 26 Nov 2011, at 15:58, Karen Coyle wrote:
 
 A few of the code4lib talk proposals mention projects that have or will 
 transform MARC records into RDF. If any of you have documentation and/or 
 examples of this, I would be very interested to see them, even if they are 
 under construction.
 
 Thanks,
 kc
 
 -- 
 Karen Coyle
 kco...@kcoyle.net http://kcoyle.net
 ph: 1-510-540-7596
 m: 1-510-435-8234
 skype: kcoylenet


Re: [CODE4LIB] Models of MARC in RDF

2011-11-28 Thread Owen Stephens
It would be great to start collecting transforms together - just a quick brain 
dump of some I'm aware of

MARC21 transformations
Cambridge University Library - http://data.lib.cam.ac.uk - transformation made 
available (in code) from same site
Open University - http://data.open.ac.uk - specific transform for materials 
related to teaching, code available at 
http://code.google.com/p/luceroproject/source/browse/trunk%20luceroproject/OULinkedData/src/uk/ac/open/kmi/lucero/rdfextractor/RDFExtractor.java
 (MARC transform is in libraryRDFExtraction method)
COPAC - small set of records from the COPAC Union catalogue - data and 
transform not yet published
Podes Projekt - LinkedAuthors - documentation at 
http://bibpode.no/linkedauthors/doc/Pode-LinkedAuthors-Documentation.pdf - 2 
stage transformation firstly from MARC to FRBRized version of data, then from 
FRBRized data to RDF. These linked from documentation
Podes Project - LinkedNonFiction - documentation at 
http://bibpode.no/linkednonfiction/doc/Pode-LinkedNonFiction-Documentation.pdf 
- MARC data transformed using xslt 
https://github.com/pode/LinkedNonFiction/blob/master/marcslim2n3.xsl

British Library British National Bibliography - 
http://www.bl.uk/bibliographic/datafree.html - data model documented, but no 
code available
Libris.se - some notes in various presentations/blogposts (e.g. 
http://dc2008.de/wp-content/uploads/2008/09/malmsten.pdf) but can't find 
explicit transformation
Hungarian National library - 
http://thedatahub.org/dataset/hungarian-national-library-catalog and 
http://nektar.oszk.hu/wiki/Semantic_web#Implementation - some information on 
ontologies used but no code or explicit transformation (not 100% sure this is 
from MARC)
Talis - implemented in several live catalogues including 
http://catalogue.library.manchester.ac.uk/  - no documentation or code afaik 
although some notes in 

MAB transformation
HBZ - some of the transformation documented at 
https://wiki1.hbz-nrw.de/display/SEM/Converting+the+Open+Data+from+the+hbz+to+BIBO,
 don't think any code published?

Would be really helpful if more projects published their transformations (or 
someone told me where to look!)

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 26 Nov 2011, at 15:58, Karen Coyle wrote:

 A few of the code4lib talk proposals mention projects that have or will 
 transform MARC records into RDF. If any of you have documentation and/or 
 examples of this, I would be very interested to see them, even if they are 
 under construction.
 
 Thanks,
 kc
 
 -- 
 Karen Coyle
 kco...@kcoyle.net http://kcoyle.net
 ph: 1-510-540-7596
 m: 1-510-435-8234
 skype: kcoylenet


[CODE4LIB] Mobile technologies in libraries - fact finding survey

2011-11-24 Thread Owen Stephens
The m-libraries support project (http://www.m-libraries.info/) is part of 
JISC’s Mobile Infrastructure for Libraries programme 
(http://infteam.jiscinvolve.org/wp/2011/10/11/mobile-infrastructure-for-libraries-new-projects/)
 running from November 2011 until September 2012.

The project aims to build a collection of useful resources and case studies 
based on current developments using mobile technologies in libraries, and to 
foster a community for those working in the m-library area or interested in 
learning more.

A brief introductory survey has been devised to help inform the project - as a 
way of starting to gather information, to discover what information is needed 
to help libraries decide on a way forward, and to begin to understand what an 
m-libraries community could offer to help.

The survey should only take 5-10 minutes and all questions are optional. 

This is an open survey - please pass the survey link on to anyone else you 
think might be interested via email or social media: http://svy.mk/mlibs1 

If you’re interested in mobile technologies in libraries and would like to 
receive updates about the project, please visit our project blog at 
http://m-libraries.info and subscribe to updates (links in the right hand side 
for RSS or email subscriptions).

Thanks and best wishes,

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936


Re: [CODE4LIB] CIS students, service learning, and the library

2011-10-14 Thread Owen Stephens
I was going to point to that too, and also note that the DevXS event was the 
brainchild of two students at the University of Lincoln, who went onto work at 
the University - including developing 'Jerome' a library search interface using 
MongoDB and the Sphinx index/search s/w http://jerome.library.lincoln.ac.uk/

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 13 Oct 2011, at 23:04, Robert Robertson wrote:

 Hi Ellen,
 
 The event hasn't been held yet but it might be worth taking a look at what 
 DevCSI are doing with their DevXS event http://devxs.org/ and seeing what 
 comes out of it after the fact.
 
 The DevCSI initiative (http://devcsi.ukoln.ac.uk/blog/) has run quite a few 
 hackday events (inlcuding dev8D ) as part of an effort to build a stronger 
 community of developers in HE in the UK and some of their events and 
 challenges have been around library data. 
 
 DevXS is their first major foray into trying the same idea with CS and other 
 students but it might offer some ideas for events that could raise interest 
 in longer term service learning projects or tackle specific tasks.
 
 cheers,
 John
 
 
 R. John Robertson
 skype: rjohnrobertson
 Research Fellow/ Open Education Resources programme support officer (JISC 
 CETIS),
 Centre for Academic Practice and Learning Enhancement
 University of Strathclyde
 Tel:+44 (0) 141 548 3072
 http://blogs.cetis.ac.uk/johnr/
 The University of Strathclyde is a charitable body, registered in Scotland, 
 with registration number SC015263
 
 From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Ellen K. 
 Wilson [ewil...@jaguar1.usouthal.edu]
 Sent: Thursday, October 13, 2011 9:29 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: [CODE4LIB] CIS students, service learning, and the library
 
 I am wondering if anyone has experience working with students
 (particularly CIS students) in service learning projects involving the
 library. I am currently supervising four first-year students who are
 working on a brief (10 hour) project involving the usability and
 redesign of the homepage as part of a first year seminar course.
 Obviously we won't get the whole thing done, but it is providing us with
 some valuable student insight into what should be on the page, etc.
 
 I anticipate the CIS department's first-year experience program will
 want to continue this collaboration, so I'm trying to brainstorm some
 projects that might be useful for future semesters particularly for
 freshmen who are just beginning their course of study in computer
 science, information technology, or information systems. This semester's
 project was thrown together in only a few days and I would like to not
 do that again! Ideas would be appreciated.
 
 Best regards,
 
 Ellen
 
 --
 Ellen Knowlton Wilson
 Instructional Services Librarian
 Room 250, University Library
 University of South Alabama
 5901 USA Drive North
 Mobile, AL 36688
 (251) 460-6045
 ewil...@jaguar1.usouthal.edu


[CODE4LIB] Show reuse of library/archive/museum data and win prizes

2011-08-08 Thread Owen Stephens
 fully the benefits of sharing it and 
improve our services. Please contact metad...@bl.uk if you wish to share your 
experiences with us and those that are using this service. Give Credit Where 
Credit is Due: The British Library has a responsibility to maintain its 
bibliographic data on the nation’s behalf. Please credit all use of this data 
to the British Library and link back to www.bl.uk/bibliographic/datafree.html 
in order that this information can be shared and developed with today’s 
Internet users as well as future generations. Duplicate of package:bluk-bnb

Tyne and Wear Museums Collections (Imagine)
Part of the Europeana Linked Open Data, this is a collection of metadata 
describing (and linking to digital copies where appropriate) items in the Tyne 
and Wear Museums Collections.

Cambridge University Library dataset #1
This data marks the first major output of the COMET project. COMET is a JISC 
funded collaboration between Cambridge University Library and CARET, University 
of Cambridge. It is funded under the JISC Infrastructure for Resource Discovery 
programme. It represents work over a 20+ year period which contains a number of 
changes in practices and cataloguing tools. No attempt has been made to screen 
for quality of records other than the Voyager export process. This data also 
includes the 180,000 'Tower Project' records published under the JISC Open 
Bibliography Project. 

JISC MOSAIC Activity Data
The JISC MOSAIC (www.sero.co.uk/jisc-mosaic.html) project gathered together 
data covering user activity in a few UK Higher Education libraries. The data is 
available for download and via an API and contains information on books 
borrowed during specific time periods, and where available describes links 
between books, courses, and year of study.

OpenURL Router Data (EDINA)
EDINA is making the OpenURL Router Data available from April 2011. It is 
derived from the logs of the OpenURL Router, which directs user requests for 
academic papers to the appropriate institutional resolver. It enables 
institutions to register their resolver once only, at 
http://openurl.ac.uk (the OpenURL Router), and service 
providers may then use openurl.ac.uk as the “base URL” for OpenURL links for UK 
HE and FE customers. This is the product of JISC-funded project activity, and 
provides a unique data set. The data captured varies from request to request 
since different users enter different information into requests. Further 
information on the details of the data set, sample files and the data itself is 
available at 
http://openurl.ac.uk/doc/data/data.html (OpenURL Router Data). The team would like to thank all the institutions 
involved in this initiative for their participation. The data are made 
available under the Open Data Commons (ODC) Public Domain Dedication and 
Licence and the ODC Attribution Sharealike Community Norms.



Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936


[CODE4LIB] Developer Competition using Library/Archive/Museum data

2011-07-04 Thread Owen Stephens
Celebrate Liberation – A worldwide competition for open software developers & 
open data
UK Discovery (http://discovery.ac.uk/) and the Developer Community Supporting 
Innovation (DevCSI) project based at UKOLN are running a global Developer 
Competition throughout July 2011 to build open source software applications / 
tools, using at least one of our 10 open data sources collected from libraries, 
museums and archives.
Enter simply by blogging about your application and emailing the blog post URI 
to joy.pal...@manchester.ac.uk by the deadline of 2359 (your local time) on 
Monday 1 August 2011.
Full details of the competition, the data sets and how to enter are at 
http://discovery.ac.uk/developers/competition/
There are 13 prizes including 
Best entry for each dataset – there are 10 datasets so there could be 10 
winners of £30 Amazon vouchers and an aggregation could win more than one!

Data Munging – Best example of Consolidating or Aggregating or De-duplicating 
or Entity matching or … one prize of £100 Amazon voucher.

Overall winners – An EEE Pad Transformer for the overall winner and a £200 
Amazon voucher for the Runner Up.

And you can win more than once :)
Specific competition tag on twitter is #discodev, but #devcsi and #ukdiscovery 
also good to follow/use
Excited to see what people come up with - hope some of you are able to enter
Owen
Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936


[CODE4LIB] PDF-text extraction

2011-06-21 Thread Owen Stephens
The CORE project at The Open University in the UK is doing some work on finding 
similarity between papers in institutional repositories (see 
http://core-project.kmi.open.ac.uk/ for more info).  The first step in the 
process is extracting text from the (mainly) pdf documents harvested from 
repositories
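
(To give a flavour of that step, here's a minimal Python sketch using 
pdfminer.six over a placeholder directory of PDFs - purely an illustration of 
the batch extraction involved, not a tool we've benchmarked.)

import os
from pdfminer.high_level import extract_text

pdf_dir = "harvested-pdfs"   # placeholder directory of harvested PDFs

for name in sorted(os.listdir(pdf_dir)):
    if not name.lower().endswith(".pdf"):
        continue
    path = os.path.join(pdf_dir, name)
    try:
        text = extract_text(path)
    except Exception as exc:   # malformed PDFs are common in repository content
        print("failed: %s (%s)" % (name, exc))
        continue
    with open(path + ".txt", "w", encoding="utf-8") as out:
        out.write(text)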

We've tried iText but had issues with quality
We moved to PDFBox but are having performance issues

Any other suggestions/experience?

Thanks,

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936


Re: [CODE4LIB] RDF for opening times/hours?

2011-06-07 Thread Owen Stephens
I'd suggest having a look at the Good Relations ontology 
http://wiki.goodrelations-vocabulary.org/Quickstart - it's aimed at businesses 
but the OpeningHours specification might do what you need 
http://www.heppnetz.de/ontologies/goodrelations/v1.html#OpeningHoursSpecification
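
Something along these lines, perhaps - a rough rdflib sketch in which the URIs 
are invented and the gr: term names are as I read them from the spec above, so 
worth double-checking before relying on it:

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

GR = Namespace("http://purl.org/goodrelations/v1#")

g = Graph()
g.bind("gr", GR)

# Invented URIs for a library building and its weekday opening hours
library = URIRef("http://example.org/id/library#building")
weekday_hours = URIRef("http://example.org/id/library#weekday-hours")

# Term names as I read them from the GoodRelations spec - check before use
g.add((library, RDF.type, GR.Location))
g.add((library, GR.hasOpeningHoursSpecification, weekday_hours))
g.add((weekday_hours, RDF.type, GR.OpeningHoursSpecification))
g.add((weekday_hours, GR.hasOpeningHoursDayOfWeek, GR.Monday))
g.add((weekday_hours, GR.hasOpeningHoursDayOfWeek, GR.Tuesday))
g.add((weekday_hours, GR.opens, Literal("09:00:00", datatype=XSD.time)))
g.add((weekday_hours, GR.closes, Literal("17:00:00", datatype=XSD.time)))

print(g.serialize(format="turtle"))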

While handling public holidays etc is not immediately obvious it is covered in 
this mail 
http://ebusiness-unibw.org/pipermail/goodrelations/2010-October/000261.html

Picking up on the previous comment Good Relations in RDFa is one of the formats 
Google use for Rich Snippets and it is also picked up by Yahoo

Owen

On 7 Jun 2011, at 23:05, Tom Keays tomke...@gmail.com wrote:

 There was a time, about 5 years ago, when I assumed that microformats
 were the way to go and spent a bit of time looking at hCalendar for
 representing iCalendar-formatted event information.
 
 http://microformats.org/wiki/hcalendar
 
 Not long after that, there was a lot of talk about RDF and RDFa for
 this same purpose. Now I was confused as to whether to change my
 strategy or not, but RDF Calendar seemed to be a good idea. The latter
 also was nice because it could be used to syndicate event information
 via RSS.
 
 http://pemberton-vandf.blogspot.com/2008/06/how-to-do-hcalendar-in-rdfa.html
 http://www.w3.org/TR/rdfcal/
 
 These days it seems to be all about HTML5 microdata, especially
 because of Rich Snippets and Google's support for this approach.
 
 http://html5doctor.com/microdata/#microdata-action
 
 All three approaches allow you to embed iCalendar formatted event
 information on a web page. All three of them do it differently. I'm
 even more confused now than I was 5 years ago. This should not be this
 hard, yet there is still no definitive way to deploy this information
 and preserve the semantics of the event information. Part of this may
 be because the iCalendar format, although widely used, is itself
 insufficient.
 
 Tom


Re: [CODE4LIB] [dpla-discussion] Rethinking the library part of DPLA

2011-04-10 Thread Owen Stephens
I guess that people may already be familiar with the Candide 2.0 project at 
NYPL http://candide.nypl.org/text/ - this sounds not dissimilar to the type of 
approach being suggested

This document is built using Wordpress with the Digress.it plugin 
(http://digress.it/)

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 10 Apr 2011, at 17:35, Nate Hill wrote:

 Eric, thanks for finding enough merit in my post on the DPLA listserv
 to repost it here.
 
 Karen and Peter, I completely agree with your feelings-
 But my point in throwing this idea out there was that despite all of
 the copyright issues, we don't really do a great job making a simple,
 intuitive, branded interface for the works that *are* available - the
 public domain stuff.  Instead we seem to be content with knowing that
 this content is out there, and letting vendors add it to their
 difficult-to-use interfaces.
 
 I guess my hope, seeing this reposted here is that someone might have
 a suggestion as to why I would not host public domain ebooks on my own
 library's site.  Are there technical hurdles to consider?
 
 I feel like I see a tiny little piece of the ebook access problem that
 we *can* solve here, while some of the larger issues will indeed be
 debated in forums like the DPLA for quite a while.  By solving a small
 problem along the way, perhaps when the giant 1923-2011 problem is
 resolved we'll have a clearer path as to what type of access we might
 provide.
 
 
 On 4/10/11, Peter Murray peter.mur...@lyrasis.org wrote:
 I, too, have been struggling with this aspect of the discussion. (I'm on the
 DPLA list as well.) There seems to be this blind spot within the leadership
 of the group to ignore the copyright problem and any interaction with
 publishers of popular materials. One of the great hopes that I have for this
 group, with all of the publicity it is generating, is to serve as a voice
 and a focal point to bring authors, publishers and librarians together to
 talk about a new digital ownership and sharing model.
 
 That doesn't seem to be happening.
 
 
 Peter
 
 On Apr 10, 2011, at 10:05, Karen Coyle li...@kcoyle.net wrote:
 
 I appreciate the spirit of this, but despair at the idea that
 libraries organize their services around public domain works, thus
 becoming early 20th century institutions. The gap between 1923 and
 2011 is huge, and it makes no sense to users that a library provide
 services based on publication date, much less that enhanced services
 stop at 1923.
 
 kc
 
 Quoting Eric Hellman e...@hellman.net:
 
 The DPLA listserv is probably too impractical for most of Code4Lib,
 but Nate Hill (who's on this list as well) made this contribution
 there, which I think deserves attention from library coders here.
 
 On Apr 5, 2011, at 11:15 AM, Nate Hill wrote:
 
 It is awesome that the project Gutenberg stuff is out there, it is
 a great start.  But libraries aren't using it right.  There's been
 talk on this list about the changing role of the public library in
 people's lives, there's been talk about the library brand, and some
 talk about what 'local' might mean in this context.  I'd suggest
 that we should find ways to make reading library ebooks feel local
 and connected to an immediate community.  Brick and mortar library
 facilities are public spaces, and librarians are proud of that.  We
 have collections of materials in there, and we host programs and
 events to give those materials context within the community.
 There's something special about watching a child find a good book,
 and then show it to his  or her friend and talk about how awesome
 it is.  There's also something special about watching a senior
 citizens book group get together and discuss a new novel every
 month.  For some reason, libraries really struggle with treating
 their digital spaces the same way.
 
 I'd love to see libraries creating online conversations around
 ebooks in much the same way.  Take a title from project Gutenberg:
 The Adventures of Huckleberry Finn.  Why not host that book
 directly on my library website so that it can be found at an
 intuitive URL, www.sjpl.org/the-adventures-of-huckleberry-finn and
 then create a forum for it?  The URL itself takes care of the
 'local' piece; certainly my most likely visitors will be San Jose
 residents- especially if other libraries do this same thing.  The
 brand remains intact, when I launch this web page that holds the
 book I can promote my library's identity.  The interface is no
 problem because I can optimize the page to load well on any device
 and I can link to different formats of the book.  Finally, and most
 importantly, I've created a local digital space for this book so
 that people can converse about it via comments, uploaded pictures,
 video, whatever.  I really think this community conversation and
 context-creation around materials is a big part of what makes
 public libraries special

Re: [CODE4LIB] LCSH and Linked Data

2011-04-08 Thread Owen Stephens
Thanks for all the information and discussion.

I don't think I'm familiar enough with Authority file formats to completely
comprehend - but I certainly understand the issues around the question of
'place' vs 'histo-geo-poltical entity'. Some of this makes me worry about
the immediate applicability of the LC Authority files in the Linked Data
space - someone said to me recently 'SKOS is just a way of avoiding dealing
with the real semantics' :)

Anyway - putting that to one side, the simplest approach for me at the
moment seems to only look at authorised LCSH as represented on id.loc.gov.
Picking up on Andy's first response:

On Thu, Apr 7, 2011 at 3:46 PM, Houghton,Andrew hough...@oclc.org wrote:

 After having done numerous matching and mapping projects, there are some
 issues that you will face with your strategy, assuming I understand it
 correctly. Trying to match a heading starting at the left most subfield and
 working forward will not necessarily produce correct results when matching
 against the LCSH authority file. Using your example:



 650 _0 $a Education $z England $x Finance



 is a good example of why processing the heading starting at the left will
 not necessarily produce the correct results.  Assuming I understand your
 proposal you would first search for:



 150 __ $a Education



 and find the heading with LCCN sh85040989. Next you would look for:



 181 __ $z England



 and you would NOT find this heading in LCSH.


OK - ignoring the question of where the best place to look for this is - I
can live with not matching it for now. Later (perhaps when I understand it
better, or when these headings are added to id.loc.gov we can revisit this)


 The second issue using your example is that you want to find the “longest”
 matching heading. While the pieces parts are there, so is the enumerated
 authority heading:



 150 __ $a Education $z England



 as LCCN sh2008102746. So your heading is actually composed of the
 enumerated headings:



 sh2008102746150 __ $a Education $z England

 sh2002007885180 __ $x Finance



 and not the separate headings:



 sh85040989 150 __ $a Education

 n82068148   150 __ $a England

 sh2002007885180 __ $x Finance



 Although one could argue that either analysis is correct depending upon
 what you are trying to accomplish.




What I'm interested in is representing the data as RDF/Linked Data in a way
that opens up the best opportunities for both understanding and querying the
data. Unfortunately at the moment there isn't a good way of representing
LCSH directly in RDF (the MADS work may help I guess but to be honest at the
moment I see that as overly complex - but that's another discussion).

What I can do is make statements that an item is 'about' a subject (probably
using dc:subject) and then point at an id.loc.gov URI. However, if I only
express individual headings:
Education
England (natch)
Finance

Then obviously I lose the context of the full heading - so I also want to
look for
Education--England--Finance (which I won't find on id.loc.gov as not
authorised)

At this point I could stop, but my feeling is that it is useful to also look
for other combinations of the terms:

Education--England (not authorised)
Education--Finance (authorised! http://id.loc.gov/authorities/sh85041008)

My theory is that as long as I stick to combinations that start with a
topical term I'm not going to make startlingly inaccurate statements?


 The matching algorithm I have used in the past contains two routines. The
 first f(a) will accept a heading as a parameter, scrub the heading, e.g.,
 remove unnecessary subfield like $0, $3, $6, $8, etc. and do any other
 pre-processing necessary on the heading, then call the second function f(b).
 The f(b) function accepts a heading as a parameter and recursively calls
 itself until it builds up the list LCCNs that comprise the heading. It first
 looks for the given heading when it doesn’t find it, it removes the **last
 ** subfield and recursively calls itself, otherwise it appends the found
 LCCN to the returned list and exits. This strategy will find the longest
 match.


Unless I've misunderstood this, this strategy would not find
'Education--Finance'? Instead I need to remove each *subdivision* in turn
(no matter where it appears in the heading order) and try all possible
combinations checking each for a match on id.loc.gov. Again, I can do this
without worrying about possible invalid headings, as these wouldn't have
been authorised anyway...

I can check the number of variations around this but I guess that in my
limited set of records (only 30k) there will be a relatively small number of
possible patterns to check.

Does that make sense?
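
To make that concrete, here's a rough Python sketch (requests + rdflib, with an 
invented record URI) of the check-and-link step. It assumes the id.loc.gov 
'known label' lookup at /authorities/label/..., which as I understand it 
redirects to the authority URI when a label is authorised - the exact path and 
response headers would need checking:

import urllib.parse
import requests
from rdflib import Graph, Namespace, URIRef

DC = Namespace("http://purl.org/dc/elements/1.1/")

def lookup_lcsh(label):
    # The 'known label' lookup as I understand it: a redirect means the
    # label is authorised, a 404 means it isn't. Treat as a sketch.
    url = "http://id.loc.gov/authorities/label/" + urllib.parse.quote(label)
    resp = requests.get(url, allow_redirects=False)
    if resp.status_code in (302, 303):
        return resp.headers.get("X-Uri") or resp.headers.get("Location")
    return None

g = Graph()
g.bind("dc", DC)
record = URIRef("http://example.org/id/record/1")   # invented record URI

for candidate in ("Education--England--Finance",
                  "Education--England",
                  "Education--Finance",
                  "Education"):
    uri = lookup_lcsh(candidate)
    if uri:
        g.add((record, DC.subject, URIRef(uri)))

print(g.serialize(format="turtle"))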


Re: [CODE4LIB] LCSH and Linked Data

2011-04-08 Thread Owen Stephens
Thanks Ross - I have been pushing some cataloguing folk to comment on some
of this as well (and have some feedback) - but I take the point that wider
consultation via autocat could be a good idea. (for some reason this makes
me slightly nervous!)s

In terms of whether Education--England--Finance is authorised or not - I
think I took from Andy's response that it wasn't, but also looking at it on
authorities.loc.gov it isn't marked as 'authorised'. Anyway - the relevant
thing for me at this stage is that I won't find a match via id.loc.gov - so
I can't get a URI for it anyway.

There are clearly quite a few issues with interacting with LCSH as Linked
Data at the moment - I'm not that keen on how this currently works, and my
reaction to the MADS/RDF ontology is similar to that of Bruce D'Arcus (see
http://metadata.posterous.com/lcs-madsrdf-ontology-and-the-future-of-the-se),
but on the otherhand I want to embrace the opportunity to start joining some
stuff up and seeing what happens :)

Owen

On Fri, Apr 8, 2011 at 3:10 PM, Ross Singer rossfsin...@gmail.com wrote:

 On Fri, Apr 8, 2011 at 5:02 AM, Owen Stephens o...@ostephens.com wrote:

  Then obviously I lose the context of the full heading - so I also want to
  look for
  Education--England--Finance (which I won't find on id.loc.gov as not
  authorised)
 
  At this point I could stop, but my feeling is that it is useful to also
 look
  for other combinations of the terms:
 
  Education--England (not authorised)
  Education--Finance (authorised! http://id.loc.gov/authorities/sh85041008
 )
 
  My theory is that as long as I stick to combinations that start with a
  topical term I'm not going to make startlingly inaccurate statements?

 I would definitely ask this question somewhere other than Code4lib
 (autocat, maybe?), since I think the answer is more complicated than
 this (although they could validate/invalidate your assumption about
 whether or not this approach would get you close enough).

 My understanding is that Education--England--Finance *is* authorized,
 because Education--Finance is and England is a free-floating
 geographic subdivision.  Because it's also an authorized heading,
 Education--England--Finance is, in fact, an authority.  The problem
 is that free-floating subdivisions cause an almost infinite number of
 permutations, so there aren't LCCNs issued for them.

 This is where things get super-wonky.  It's also the reason I
 initially created lcsubjects.org, specifically to give these (and,
 ideally, locally controlled subject headings) a publishing
 platform/centralized repository, but it quickly grew to be more than
 just a side project.  There were issues of how the data would be
 constructed (esp. since, at the time, I had no access to the NAF), how
 to reconcile changes, provenance, etc.  Add to the fact that 2 years
 ago, there wasn't much linked library data going on, it was really
 hard to justify the effort.

 But, yeah, it would be worth running your ideas by a few catalogers to
 see what they think.

 -Ross.




-- 
Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com


[CODE4LIB] LCSH and Linked Data

2011-04-07 Thread Owen Stephens
We are working on converting some MARC library records to RDF, and looking
at how we handle links to LCSH (id.loc.gov) - and I'm looking for feedback
on how we are proposing to do this...

I'm not 100% confident about the approach, and to some extent I'm trying to
work around the nature of how LCSH interacts with RDF at the moment I
guess... but here goes - I would very much appreciate
feedback/criticism/being told why what I'm proposing is wrong:

I guess what I want to do is preserve aspects of the faceted nature of LCSH
in a useful way, give useful links back to id.loc.gov where possible, and
give access to a wide range of facets on which the data set could be
queried. Because of this I'm proposing not just expressing the whole of the
650 field as an LCSH heading and checking for its existence on id.loc.gov, but also
checking for various combinations of topical term and subdivisions from the
650 field. So for any 650 field I'm proposing we should check on
id.loc.gov for labels matching:

check(650$$a) -- topical term
check(650$$b) -- topical term
check(650$$v) -- Form subdivision
check(650$$x) -- General subdivision
check(650$$y) -- Chronological subdivision
check(650$$z) -- Geographic subdivision

Then using whichever elements exist (all as topical terms):
Check(650$$a--650$$b)
Check(650$$a--650$$v)
Check(650$$a--650$$x)
Check(650$$a--650$$y)
Check(650$$a--650$$z)
Check(650$$a--650$$b--650$$v)
Check(650$$a--650$$b--650$$x)
Check(650$$a--650$$b--650$$y)
Check(650$$a--650$$b--650$$z)
Check(650$$a--650$$b--650$$x--650$$v)
Check(650$$a--650$$b--650$$x--650$$y)
Check(650$$a--650$$b--650$$x--650$$z)
Check(650$$a--650$$b--650$$x--650$$z--650$$v)
Check(650$$a--650$$b--650$$x--650$$z--650$$y)
Check(650$$a--650$$b--650$$x--650$$z--650$$y--650$$v)


As an example given:

650 00 $$aPopular music$$xHistory$$y20th century

We would be checking id.loc.gov for

'Popular music' as a topical term (http://id.loc.gov/authorities/sh85088865)
'History' as a general subdivision (http://id.loc.gov/authorities/sh99005024
)
'20th century' as a chronological subdivision (
http://id.loc.gov/authorities/sh2002012476)
'Popular music--History and criticism' as a topical term (
http://id.loc.gov/authorities/sh2008109787)
'Popular music--20th century' as a topical term (not authorised)
'Popular music--History and criticism--20th century' as a topical term (not
authorised)


And expressing all matches in our RDF.

My understanding of LCSH isn't what it might be - but the ordering of terms
in the combined string checking is based on what I understand to be the
usual order - is this correct, and should we be checking for alternative
orderings?

Thanks

Owen


-- 
Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com


Re: [CODE4LIB] LCSH and Linked Data

2011-04-07 Thread Owen Stephens
Thanks Tom - very helpful

Perhaps this suggests that rather than using a fixed order we should check
combinations while preserving the order of the original 650 field (I assume
this should in theory always be correct - or at least done to the best of
the cataloguer's knowledge)?

So for:

650 _0 $$a Education $$z England $$x Finance.

check:

Education
England (subdiv)
Finance (subdiv)
Education--England
Education--Finance
Education--England--Finance

While for 650 _0 $$a Education $$x Economic aspects $$z England we check

Education
Economic aspects (subdiv)
England (subdiv)
Education--Economic aspects
Education--England
Education--Economic aspects--England


 - It is possible for other orders in special circumstances, e.g. with
 language dictionaries which can go something like:

 650 _0 $$a English language $$v Dictionaries $$x Albanian.


This possiblity would also covered by preserving the order - check:

English Language
Dictionaries (subdiv)
Albanian (subdiv)
English Language--Dictionaries
English Language--Albanian
English Language--Dictionaries--Albanian

Creating possibly invalid headings isn't necessarily a problem - as we won't
get a match on id.loc.gov anyway. (Instinctively English Language--Albanian
doesn't feel right)
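
As a quick sketch of the generation step this implies (Python, assuming the $a 
topical term always comes first in the field):

from itertools import combinations

def candidate_headings(subfields):
    # subfields: 650 subfields in field order, with the $a topical term
    # first, e.g. [("a", "Education"), ("z", "England"), ("x", "Finance")]
    head = subfields[0][1]
    rest = [value for _, value in subfields[1:]]
    candidates = [head] + rest   # each term/subdivision on its own
    for size in range(1, len(rest) + 1):
        for combo in combinations(rest, size):   # preserves original order
            candidates.append("--".join([head] + list(combo)))
    return candidates

print(candidate_headings([("a", "Education"), ("z", "England"), ("x", "Finance")]))
# ['Education', 'England', 'Finance', 'Education--England',
#  'Education--Finance', 'Education--England--Finance']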



 - Some of these are repeatable, so you can have two $$vs following each
 other (e.g. Biography--Dictionaries); two $$zs (very common), as in
 Education--England--London; two $xs (e.g. Biography--History and criticism).

 OK - that's fine, we can use each individually and in combination for any
repeated headings I think


 - I'm not sure I've ever come across a lot of $$bs in 650s. Do you have a lot of
 them in the database?

 Hadn't checked until you asked! We have 1 in the dataset in question (c.30k
records) :)


 I'm not sure how possible it would be to come up with a definitive list of
 (reasonable) possible combinations.

 You are probably right - but I'm not too bothered about aiming at
'definitive' at this stage anyway - but I do want to get something
relatively functional/useful


 Tom

 Thomas Meehan
 Head of Current Cataloguing
 University College London Library Services

 Owen Stephens wrote:

 We are working on converting some MARC library records to RDF, and looking
 at how we handle links to LCSH (id.loc.gov http://id.loc.gov) - and I'm
 looking for feedback on how we are proposing to do this...


 I'm not 100% confident about the approach, and to some extent I'm trying
 to work around the nature of how LCSH interacts with RDF at the moment I
 guess... but here goes - I would very much appreciate
 feedback/criticism/being told why what I'm proposing is wrong:

 I guess what I want to do is preserve aspects of the faceted nature of
 LCSH in a useful way, give useful links back to id.loc.gov 
 http://id.loc.gov where possible, and give access to a wide range of
 facets on which the data set could be queried. Because of this I'm proposing
 not just expressing the whole of the 650 field as a LCSH and checking for
 it's existence on id.loc.gov http://id.loc.gov, but also checking for
 various combinations of topical term and subdivisions from the 650 field. So
 for any 650 field I'm proposing we should check on id.loc.gov 
 http://id.loc.gov for labels matching:


 check(650$$a) -- topical term
 check(650$$b) -- topical term
 check(650$$v) -- Form subdivision
 check(650$$x) -- General subdivision
 check(650$$y) -- Chronological subdivision
 check(650$$z) -- Geographic subdivision

 Then using whichever elements exist (all as topical terms):
 Check(650$$a--650$$b)
 Check(650$$a--650$$v)
 Check(650$$a--650$$x)
 Check(650$$a--650$$y)
 Check(650$$a--650$$z)
 Check(650$$a--650$$b--650$$v)
 Check(650$$a--650$$b--650$$x)
 Check(650$$a--650$$b--650$$y)
 Check(650$$a--650$$b--650$$z)
 Check(650$$a--650$$b--650$$x--650$$v)
 Check(650$$a--650$$b--650$$x--650$$y)
 Check(650$$a--650$$b--650$$x--650$$z)
 Check(650$$a--650$$b--650$$x--650$$z--650$$v)
 Check(650$$a--650$$b--650$$x--650$$z--650$$y)
 Check(650$$a--650$$b--650$$x--650$$z--650$$y--650$$v)


 As an example given:

 650 00 $$aPopular music$$xHistory$$y20th century

 We would be checking id.loc.gov http://id.loc.gov for


 'Popular music' as a topical term (
 http://id.loc.gov/authorities/sh85088865)
 'History' as a general subdivision (
 http://id.loc.gov/authorities/sh99005024)
 '20th century' as a chronological subdivision (
 http://id.loc.gov/authorities/sh2002012476)
 'Popular music--History and criticism' as a topical term (
 http://id.loc.gov/authorities/sh2008109787)
 'Popular music--20th century' as a topical term (not authorised)
 'Popular music--History and criticism--20th century' as a topical term
 (not authorised)


 And expressing all matches in our RDF.

 My understanding of LCSH isn't what it might be - but the ordering of
 terms in the combined string checking is based on what I understand to be
 the usual order - is this correct, and should we be checking for alternative
 orderings?

 Thanks

 Owen


 --
 Owen

Re: [CODE4LIB] LCSH and Linked Data

2011-04-07 Thread Owen Stephens
Still digesting Andrew's response (thanks Andrew), but

On Thu, Apr 7, 2011 at 4:17 PM, Ya'aqov Ziso yaaq...@gmail.com wrote:

 *Currently under id.loc.gov you will not find name authority records, but
 you can find them at viaf.org*.
 *[YZ]*  viaf.org does not include geographic names. I just checked there
 England.


Is this not the relevant VIAF entry
http://viaf.org/viaf/142995804


-- 
Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com


Re: [CODE4LIB] LCSH and Linked Data

2011-04-07 Thread Owen Stephens
I'm out of my depth here :)

But... this is what I understood Andrew to be saying. In this instance
(?because 'England' is a Name Authority?) rather than create a separate LCSH
authority record for 'England' (as the 151), rather the LCSH subdivision is
recorded in the 781 of the existing Name Authority record.

Searching on http://authorities.loc.gov for England, I find an Authorised
heading, marked as a LCSH - but when I go to that record what I get is the
name authority record n 82068148 - the name authority record as represented
on VIAF by http://viaf.org/viaf/142995804/ (which links to
http://errol.oclc.org/laf/n%20%2082068148.html)

Just as this is getting interesting time differences mean I'm about to head
home :)

Owen

On Thu, Apr 7, 2011 at 4:34 PM, LeVan,Ralph le...@oclc.org wrote:

 If you look at the fields those names come from, I think they mean
 England as a corporation, not England as a place.

 Ralph

  -Original Message-
  From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf
 Of
  Owen Stephens
  Sent: Thursday, April 07, 2011 11:28 AM
  To: CODE4LIB@LISTSERV.ND.EDU
  Subject: Re: [CODE4LIB] LCSH and Linked Data
 
  Still digesting Andrew's response (thanks Andrew), but
 
  On Thu, Apr 7, 2011 at 4:17 PM, Ya'aqov Ziso yaaq...@gmail.com
 wrote:
 
   *Currently under id.loc.gov you will not find name authority
 records, but
   you can find them at viaf.org*.
   *[YZ]*  viaf.org does not include geographic names. I just checked
 there
   England.
  
 
  Is this not the relevant VIAF entry
  http://viaf.org/viaf/142995804
 
 
  --
  Owen Stephens
  Owen Stephens Consulting
  Web: http://www.ostephens.com
  Email: o...@ostephens.com




-- 
Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com

