Re: [CODE4LIB] Software used in Panama Papers Analysis
Another interesting post on this - this one from Le Monde (in French) http://data.blog.lemonde.fr/2016/04/08/panama-papers-un-defi-technique-pour-le-journalisme-de-donnees/ <http://data.blog.lemonde.fr/2016/04/08/panama-papers-un-defi-technique-pour-le-journalisme-de-donnees/> Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 > On 12 Apr 2016, at 16:05, Tom Cramer <tcra...@stanford.edu> wrote: > > The IJNet article is particularly interesting—thanks for posting this. > Excerpts like the one below make me wonder if there is a “Code4News” > community, and if so, how do we find and connect with them. It seems we have > a lot in common, and maybe a lot to offer each other. > > > MC: What we’ve achieved is pretty remarkable. Newsrooms are in an economic > crisis. No newsroom right now--except for maybe The New York Times and a few > others--have the capability to do something major like this at a global > scale. But we’re showing it’s possible. We share data, we produce tools for > communication, we share our stories and our interactives, to make it happen. > > - Tom > > > > > > > On Apr 7, 2016, at 7:24 AM, Gregory Markus > <gmar...@beeldengeluid.nl<mailto:gmar...@beeldengeluid.nl>> wrote: > > Hey Sebastian, > > They go into a lot of detail in this article > > https://ijnet.org/en/blog/how-icij-pulled-large-scale-cross-border-investigative-collaboration > > Indeed this is pretty interesting stuff and a good shout out for Blacklight > and other OS tools! > > -greg > > On Thu, Apr 7, 2016 at 4:21 PM, Sebastian Karcher < > karc...@u.northwestern.edu> wrote: > > Hi everyone, > > from one of the New York Times stories on the Panama Papers: > "The ICIJ made a number of powerful research tools available to the > consortium that the group had developed for previous leak investigations. > Those included a secure, Facebook-type forum where reporters could post the > fruits of their research, as well as database search program called > “Blacklight” that allowed the teams to hunt for specific names, countries > or sources." > > http://www.nytimes.com/2016/04/06/business/media/how-a-cryptic-message-interested-in-data-led-to-the-panama-papers.html > > I assume this is http://projectblacklight.org/, which is pretty cool to > see > used that way. Does anyone know or have read anything about the other tools > they used? What did they use for OCR? Did they use qualitative data > analysis software? Some type of annotation tools? It seems like there's a > lot to learn from this effort. > > Thanks, > > -- > Sebastian Karcher, PhD > Qualitative Data Repository, Syracuse University > qdr.syr.edu > > > > > -- > > *Gregory Markus* > > Project Assistant > > *Netherlands Institute for Sound and Vision* > *Media Parkboulevard 1, 1217 WE Hilversum | Postbus 1060, 1200 BB > Hilversum | * > *beeldengeluid.nl* <http://www.beeldengeluid.nl/> > *T* 0612350556 > > *Aanwezig:* - ma, di, wo, do, vr >
Re: [CODE4LIB] searching metadata vs searching content
To share the practice from a project I work on - the Jisc Historical Texts platform [1], which provides searching across digitised texts from the 16th to 19th centuries. In this case we had the option to build the search application from scratch, rather than using a product such as ContentDM etc. I should say that all the technical work was done by K-Int [2] and Gooii [3]; I was there to advise on metadata and user requirements, and so the following is based on my understanding of how the system works, and any errors are down to me :)

There are currently three major collections within the Historical Texts platform, with different data sources behind each one. In general the data we have for each collection consists of MARC metadata records, full text in XML documents (either from transcription or from OCR processes) and image files of the pages.

The platform is built using the Elasticsearch [4] (ES) indexing software (as with Solr this is built on top of Lucene). We structure the data we index in ES in two layers - the ‘publication’ record, which is essentially where all the MARC metadata lives (although not as MARC - we transform this to an internal scheme), and the ‘page’ records - one record per page in the item. The text content lives in the page record, along with links to the image files for the page. The ‘page’ records are all what ES calls ‘child’ records of the relevant publication record. We make this relationship through shared IDs in the MARC records and the XML fulltext documents.

We create a whole range of indexes from this data. Obviously field-specific searches like title or author only search the relevant metadata fields. But we also have a (default) ’search all’ option which searches through all the metadata and fulltext. If the user wants to search the text only, they check an option and we limit the search to only text from records of the ‘page’ type.

The results the user gets initially are always the publication-level records - so essentially your results list is a list of books. For each result you can view ‘matches in text’, which shows snippets of where your search term appears in the fulltext. You can then either click to view the whole book, or click the relevant page from the list of snippets. When you view the book, the software retrieves all the ‘page’ records for the book, and from the page records can retrieve the image files. When the user goes to the book viewer, we also carry over the search terms from their search, so they can see the same text snippets of where the terms appear alongside the book viewer - so the user can navigate easily to the pages which contain the search terms.

For more on the ES indexing side of this, Rob Tice from Knowledge Integration did a talk about the use of ES in this context at the London Elasticsearch user group [5]. Unfortunately the interface itself requires a login, but if you want to get a feel for how this all works in the UI, there is also a screencast which gives an overview of the UI available [6].

Best wishes,

Owen

1. https://historicaltexts.jisc.ac.uk
2. http://www.k-int.com
3. http://www.gooii.com
4. https://www.elastic.co
5. http://www.k-int.com/Rob-Tice-Elastic-London-complex-modelling-of-rich-text-data-in-Elasticsearch
6.
http://historicaltexts.jisc.ac.uk/support Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 > On 27 Jan 2016, at 00:30, Laura Buchholz <laura.buchh...@reed.edu> wrote: > > Hi all, > > I'm trying to understand how digital library systems work when there is a > need to search both metadata and item text content (plain text/full text), > and when the item is made up of more than one file (so, think a digitized > multi-page yearbook or newspaper). I'm not looking for answers to a > specific problem, really, just looking to know what is the current state of > community practice. > > In our current system (ContentDM), the "full text" of something lives in > the metadata record, so it is indexed and searched along with the metadata, > and essentially treated as if it were metadata. (Correct?). This causes > problems in advanced searching and muddies the relationship between what is > typically a descriptive metadata record and the file that is associated > with the record. It doesn't seem like a great model for the average digital > library. True? I know the answer is "it depends", but humor me... :) > > If it isn't great, and there are better models, what are they? I was taught > METS in school, and based on that, I'd approach the metadata in a METS or > METS-like fashion. But I'm unclear on the steps from having a bunch of METS > records that include descriptive metadata and pointers to text files of the > OCR (we don't, but if we did...) to indexing and providing results to > users. I think anot
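Going back to the parent/child indexing described above: a minimal sketch of what a search along those lines can look like using the elasticsearch-py client. The index, type and field names ("publications", "page", "fulltext", "title") are invented for illustration and are not the real Historical Texts schema, and the has_child / inner_hits syntax shown is the Elasticsearch 1.x/2.x flavour current at the time of this thread.

    # Sketch only: metadata matches happen on the publication record, full-text
    # matches on the child 'page' records, and the hits returned are still
    # publication-level, so the results list stays a list of books.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    query = {
        "query": {
            "bool": {
                "should": [
                    {"match": {"title": "astronomy"}},   # metadata on the publication record
                    {"has_child": {                      # full text on the child page records
                        "type": "page",
                        "query": {"match": {"fulltext": "astronomy"}},
                        "inner_hits": {"highlight": {"fields": {"fulltext": {}}}}
                    }}
                ]
            }
        }
    }

    results = es.search(index="publications", body=query)
    for hit in results["hits"]["hits"]:
        print(hit["_source"].get("title"))

The point is the shape of the query rather than the specifics: snippets for "matches in text" come back via the inner_hits highlighting on the page records.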
Re: [CODE4LIB] Job: Wine Loving Developer at University of California, Davis
That may well be true, but ‘getting the job done’ isn’t the only aspect of a crowdsourcing project. It can be used to engage an audience more deeply in the collection and give them some investment in it. This can help with overall visibility of the collection on the web (through those people who have engaged sharing what they are doing/seeing etc.), and future use, and be a platform for further projects. A project like this could also offer a way of experimenting with crowdsourcing in a low risk way. And of course the developer is needed for the visualisation aspect anyway, so the recruitment needs to happen and a wage needs to be paid anyway ... Whether all this balances out against the economics/efficiency of getting the job done in the cheapest possible way is a judgement that needs to be made, but I don’t think the simple economic argument is the only one in play here. Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 > On 10 Dec 2015, at 23:42, James Morley <james.mor...@europeana.eu> wrote: > > I agree with Thomas's logic, if not the maths (surely $2,000?) > > I was going to do a few myself but it looks like comments have been disabled > on the Flickr images? > > > From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] on behalf of Thomas > Krichel [kric...@openlib.org] > Sent: 10 December 2015 23:17 > To: CODE4LIB@LISTSERV.ND.EDU > Subject: Re: [CODE4LIB] Job: Wine Loving Developer at University of > California, Davis > > j...@code4lib.org writes > > >> **PROJECT DETAILS** >> The UC Davis University Library is launching a project to digitize the >> [Amerine wine label >> collection](https://www.flickr.com/photos/brantley/sets/72 >> 157655817440104/with/21116552632/) > > Some look like hard to read. > >> and engage the public to transcribe the information contained on the >> labels and associated annotations. > > This may take a long time. I suggest rather than doing that, take > somebody in a low-income country who speaks French, say, and who will > type all the data in. That way you get consistency in the data. I > live in Siberia, I can find somebody there. Once this data is in a > simple text file, you can use in-house staff to attach it to the > label images in your systems. > > Crowdsource sounds cool, but for 4000 label it makes no sense. > If the typist gets $10/h, and gets 20 labels done in 1h, we > are talking $200. The visit you are planning for your developer > will cost that much. > -- > > Cheers, > > Thomas Krichel http://openlib.org/home/krichel > skype:thomaskrichel
Re: [CODE4LIB] Protocol-relative URLs in MARC
In theory the 1st indicator dictates the protocol used, and 4 = HTTP. However, in all the examples on http://www.loc.gov/marc/bibliographic/bd856.html, despite the indicator being used, the protocol part of the URI is still repeated in the $u field.

You can put ‘7’ in the 1st indicator, then use subfield $2 to define other methods. Since ‘http’ is one of the preset protocols but ‘https’ is not, I guess in theory this means you should use something like:

856 70 $uhttps://example.com$2https

I’d be pretty surprised if in practice people don’t just do:

856 40 $uhttps://example.com

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 17 Aug 2015, at 21:41, Stuart A. Yeates syea...@gmail.com wrote:

I'm in the middle of some work which includes touching the 856s in lots of MARC records pointing to websites we control. The websites are available on both https://example.org/ and http://example.org/ Can I put //example.org/ in the MARC or is this contrary to the standard? Note that there is a separate question about whether various software systems support this, but that's entirely secondary to the question of the standard.

cheers
stuart
--
...let us be heard from red core to black sky
Re: [CODE4LIB] Processing Circ data
Another option might be to use OpenRefine http://openrefine.org - this should easily handle 250,000 rows. I find it good for basic data analysis, and there are extensions which offer some visualisations (e.g. the VIB BITs extension, which will plot simple data using d3: https://www.bits.vib.be/index.php/software-overview/openrefine)

I’ve written an introduction to OpenRefine, available at http://www.meanboyfriend.com/overdue_ideas/2014/11/working-with-data-using-openrefine/

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 5 Aug 2015, at 21:07, Harper, Cynthia char...@vts.edu wrote:

Hi all. What are you using to process circ data for ad-hoc queries? I usually extract csv or tab-delimited files - one row per item record, with identifying bib record data, then total checkouts over the given time period(s). I have been importing these into Access then grouping them by bib record. I think that I've reached the limits of scalability for Access for this project now, with 250,000 item records.

Does anyone do this in R? My other go-to software for data processing is the RapidMiner free version. Or do you just use MySQL or other SQL database? I was looking into doing it in R with RSQLite (just read about this and sqldf http://www.r-bloggers.com/make-r-speak-sql-with-sqldf/ ) because I'm sure my IT department will be skeptical of letting me have MySQL on my desktop. (I've moved into a much more users-don't-do-real-computing kind of environment). I'm rusty enough in R that if anyone will give me some start-off data import code, that would be great.

Cindy Harper
E-services and periodicals librarian
Virginia Theological Seminary
Bishop Payne Library
3737 Seminary Road
Alexandria VA 22304
char...@vts.edu
703-461-1794
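If the grouping itself is the sticking point, a spreadsheet-sized alternative to Access (separate from the OpenRefine route above) is a few lines of Python/pandas. This is only a sketch: the file name and the column names ("bib_id", "title", "checkouts") are invented and would need to match the real export.

    # Group a 250,000-row item-level export by bib record and total the checkouts.
    import pandas as pd

    items = pd.read_csv("circ_export.csv")   # or sep="\t" for a tab-delimited export

    by_bib = (items
              .groupby(["bib_id", "title"], as_index=False)["checkouts"]
              .sum()
              .sort_values("checkouts", ascending=False))

    by_bib.to_csv("checkouts_by_bib.csv", index=False)
    print(by_bib.head(20))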
Re: [CODE4LIB] Desiring Advice for Converting OCR Text into Metadata and/or a Database
It may depend on the format of the PDF, but I’ve used the Scraperwiki Python module’s ‘pdf2xml’ function to extract text data from PDFs in the past. There is a write-up (not by me) at http://schoolofdata.org/2013/08/16/scraping-pdfs-with-python-and-the-scraperwiki-module/, and an example of how I’ve used it at https://github.com/ostephens/british_library_directory_of_library_codes/blob/master/scraper.py

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 18 Jun 2015, at 17:02, Matt Sherman matt.r.sher...@gmail.com wrote:

Hi Code4Libbers,

I am working with a colleague on a side project which involves some scanned bibliographies and making them more web searchable/sortable/browse-able. I am quite familiar with the metadata and organization aspects we need, but I am at a bit of a loss on how to automate the process of putting the bibliography in a more structured format so that we can avoid going through hundreds of pages by hand. I am pretty sure regular expressions are needed, but I have not had an instance where I need to automate extracting data from one file type (PDF OCR or text extracted to Word doc) and place it into another (either a database or an XML file) with some enrichment. I would appreciate any suggestions for approaches or tools to look into. Thanks for any help/thoughts people can give.

Matt Sherman
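On the regular-expressions side of the question, a toy sketch of the kind of script involved: take plain text extracted from the OCR'd PDF (one bibliography entry per line) and split each entry into rough fields. The pattern below assumes entries shaped roughly like "Author, A. Title of work. Place: Publisher, 1923." - a real bibliography will need its own pattern(s) and plenty of manual review, which is why the raw line is kept alongside the parsed fields.

    import csv
    import re

    ENTRY = re.compile(
        r"^(?P<author>[^.]+)\.\s+"        # everything up to the first full stop
        r"(?P<title>[^.]+)\.\s+"          # the next sentence, treated as a rough title
        r"(?P<imprint>.*?)"               # place / publisher, loosely
        r"(?P<year>(?:1[5-9]|20)\d{2})"   # a four-digit year
    )

    with open("bibliography.txt") as src, open("bibliography.csv", "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=["author", "title", "imprint", "year", "raw"])
        writer.writeheader()
        for line in src:
            line = line.strip()
            if not line:
                continue
            m = ENTRY.match(line)
            row = m.groupdict() if m else {"author": "", "title": "", "imprint": "", "year": ""}
            row["raw"] = line              # keep the original entry for manual checking
            writer.writerow(row)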
Re: [CODE4LIB] eebo [perfect texts]
And some of the researchers definitely care about this (authority control, high quality descriptive metadata). I went to a hack day focussing on the EEBO-TCP Phase 1 release (these texts). I mentioned to one of the researchers (not a librarian) that I had access to some MARC records which described the works. Their immediate response was “Ah - but which MARC records, because they aren’t all of the same quality”! There are good cataloguing records for the works but they have not been made available under an open licence alongside the transcribed texts. Probably the highest quality records are those in the English Short Title Catalogue (ESTC) http://estc.bl.uk. There have been some great steps forward in the last few years, but I still feel libraries need to increase the amount they are doing to publish metadata under explicitly open licences. Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 8 Jun 2015, at 23:23, Stuart A. Yeates syea...@gmail.com wrote: Another thing that could usefully be done is significantly better authority control. Authors, works, geographical places, subjects, etc, etc. Good core librarianship stuff that is essentially orthogonal to all the other work that appears to be happening. cheers stuart -- ...let us be heard from red core to black sky On Tue, Jun 9, 2015 at 12:42 AM, Eric Lease Morgan emor...@nd.edu wrote: On Jun 8, 2015, at 7:32 AM, Owen Stephens o...@ostephens.com wrote: I’ve just seen another interesting take based (mainly) on data in the TCP-EEBO release: https://scalablereading.northwestern.edu/2015/06/07/shakespeare-his-contemporaries-shc-released/ It includes mention of MorphAdorner[1] which does some clever stuff around tagging parts of speech, spelling variations, lemmata etc. and another tool which I hadn’t come across before AnnoLex[2] for the correction and annotation of lexical data in Early Modern texts”. This paper[3] from Alistair Baron and Andrew Hardie at the University of Lancaster in the UK about preparing EEBO-TCP texts for corpus-based analysis may also be of interest, and the team at Lancaster have developed a tool called VARD which supports pre-processing texts[4] [1] http://morphadorner.northwestern.edu [2] http://annolex.at.northwestern.edu [3] http://eprints.lancs.ac.uk/60272/1/Baron_Hardie.pdf [4] http://ucrel.lancs.ac.uk/vard/about/ All of this is really very interesting. Really. At the same time, there seems to be a WHOLE lot of effort spent on cleaning and normalizing data, and very little done to actually analyze it beyond “close reading”. The final goal of all these interfaces seem to be refined search. Frankly, I don’t need search. And the only community who will want this level of search will be the scholarly scholar. “What about the undergraduate student? What about the just more than casual reader? What about the engineer?” Most people don’t know how or why parts-of-speech are important let alone what a lemma is. Nor do they care. I can find plenty of things. I need (want) analysis. Let’s assume the data is clean — or rather, accept the fact that there is dirty data akin to the dirty data created through OCR and there is nothing a person can do about it — lets see some automated comparisons between texts. 
Examples might include: * this one is longer * this one is shorter * this one includes more action * this one discusses such such theme more than this one * so so theme came and went during a particular time period * the meaning of this phrase changed over time * the author’s message of this text is… * this given play asserts the following facts * here is a map illustrating where the protagonist went when * a summary of this text includes… * this work is fiction * this work is non-fiction * this work was probably influenced by… We don’t need perfect texts before analysis can be done. Sure, perfect texts help, but they are not necessary. Observations and generalization can be made even without perfectly transcribed texts. — ELM
[CODE4LIB] Global Open Knowledgebase APIs
Dear all,

GOKb, the Global Open Knowledgebase, is a community-managed project that aims to describe electronic journals and books, publisher packages, and platforms in a way that will be familiar to librarians who have worked with electronic resources. I’ve been working on the project since it started, working with others to gather requirements, develop the underlying data models and specify functionality for the system.

GOKb opened to ‘public preview’ in January 2015, and you can sign up for an account and access the service at https://gokb.kuali.org/gokb/

Several hundred ejournal packages, and associated information about the ejournal titles, platforms and organisations, have been added to the knowledgebase over the past few months. Alongside this work of adding content we have also opened up APIs to interact with the service. We are interested in:

* Understanding how people would like to use data from GOKb via APIs (or other mechanisms)
* Getting some use of the initial APIs and getting feedback on these
* Getting feedback on other APIs people would like to see

The current APIs we support are:

The ‘Coreference’ service
The main aim of this API is to provide back a list of identifiers associated with a title. The service allows you to provide a journal identifier (such as an ISSN) and get back basic information about the journal, including title and other identifiers associated with the journal (other ISSNs, DOIs, publisher identifiers etc.).
Documentation: https://github.com/k-int/gokb-phase1/wiki/Co-referencing-Detail
Access: https://gokb.kuali.org/gokb/coreference/index

OAI Interfaces
The main aim of this API is to enable other services to obtain data from GOKb on an ongoing basis. Information about ejournal packages, titles and organisations can be obtained via this service.
Documentation: https://github.com/k-int/gokb-phase1/wiki/OAI-Interfaces-for-Synchronization
Access: http://gokb.kuali.org/gokb/oai

Add/Update API
This API supports adding and updating data in GOKb. You can add new, or update existing, Organisations and Platforms. You can add additional identifiers to Journal titles.
Documentation: https://github.com/k-int/gokb-phase1/wiki/Integration---Telling-GOKb-about-new-or-corresponding-resources-and-local-identifiers

We also have a SPARQL endpoint available on our test service (which contains test data only). The SPARQL endpoint is at http://test-gokb.kuali.org/sparql, and a set of example queries is given at https://github.com/k-int/gokb-phase1/wiki/Sample-SPARQL

Feedback on any/all of this would be very welcome - either to the list for discussion, or directly to me. We want to make sure we can provide useful data and services, and hope you can help us do this.

Best wishes,

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936
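As a starting point for the OAI interface, a minimal harvesting sketch in Python using requests and ElementTree. The base URL is the one documented above and the OAI-PMH verbs are standard, but the available metadata prefixes, sets and the exact endpoint layout should be taken from the GOKb wiki page rather than from this sketch.

    import requests
    import xml.etree.ElementTree as ET

    OAI_BASE = "http://gokb.kuali.org/gokb/oai"
    OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

    # 1. Ask the endpoint which metadata formats it supports (no prefix needed).
    resp = requests.get(OAI_BASE, params={"verb": "ListMetadataFormats"})
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    prefixes = [el.text for el in root.iter(OAI_NS + "metadataPrefix")]
    print("Available prefixes:", prefixes)

    # 2. Harvest a first page of records using one of the advertised prefixes.
    if prefixes:
        resp = requests.get(OAI_BASE, params={"verb": "ListRecords", "metadataPrefix": prefixes[0]})
        root = ET.fromstring(resp.content)
        for header in root.iter(OAI_NS + "header"):
            print(header.findtext(OAI_NS + "identifier"))
        # A resumptionToken element, if present, is what you pass back to get the next page.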
Re: [CODE4LIB] eebo [developments]
Great stuff Eric. I’ve just seen another interesting take based (mainly) on data in the TCP-EEBO release https://scalablereading.northwestern.edu/2015/06/07/shakespeare-his-contemporaries-shc-released/ It includes mention of MorphAdorner[1] which does some clever stuff around tagging parts of speech, spelling variations, lemmata etc. and another tool which I hadn’t come across before AnnoLex[2] for the correction and annotation of lexical data in Early Modern texts”. This paper[3] from Alistair Baron and Andrew Hardie at the University of Lancaster in the UK about preparing EEBO-TCP texts for corpus-based analysis may also be of interest, and the team at Lancaster have developed a tool called VARD which supports pre-processing texts[4] Owen [1] http://morphadorner.northwestern.edu [2] http://annolex.at.northwestern.edu [3] http://eprints.lancs.ac.uk/60272/1/Baron_Hardie.pdf [4] http://ucrel.lancs.ac.uk/vard/about/ Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 7 Jun 2015, at 18:48, Eric Lease Morgan emor...@nd.edu wrote: Here some of developments with my playing with the EEBO data. I used the repository on Box to get my content, and I mirrored it locally. [1, 2] I then looped through the content using XPath to extract rudimentary metadata, thus creating a “catalog” (index). Along the way I calculated the number of words in each document and saved that as a field of each record. Being a tab-delimited file, it is trivial to import the catalog into my favorite spreadsheet, database, editor, or statistics program. This allowed me to browse the collection. I then used grep to search my catalog, and save the results to a file. [5] I searched for Richard Baxter. [6, 7, 8]. I then used an R script to graph the numeric data of my search results. Currently, there are only two types: 1) dates, and 2) number of words. [9, 10, 11, 12] From these graphs I can tell that Baxter wrote a lot of relatively short things, and I can easily see when he published many of his works. (He published a lot around 1680 but little in 1665.) I then transformed the search resu! lt! s into a browsable HTML table. [13] The table has hidden features. (Can you say, “Usability?”) For example, you can click on table headers to sort. This is cool because I want sort things by number of words. (Number of pages doesn’t really tell me anything about length.) There is also a hidden link to the left of each record. Upon clicking on the blank space you can see subjects, publisher, language, and a link to the raw XML. For a good time, I then repeated the process for things Shakespeare and things astronomy. [14, 15] Baxter took me about twelve hours worth of work, not counting the caching of the data. Combined, Shakespeare and astronomy took me less than five minutes. I then got tired. My next steps are multi-faceted and presented in the following incomplete unordered list: * create browsable lists - the TEI metadata is clean and consistent. The authors and subjects lend themselves very well to the creation of browsable lists. * CGI interface - The ability to search via Web interface is imperative, and indexing is a prerequisite. * transform into HTML - TEI/XML is cool, but… * create sets - The collection as a whole is very interesting, but many scholars will want sub-sets of the collection. I will do this sort of work, akin to my work with the HathiTrust. [16] * do text analysis - This is really the whole point. 
Given the full text combined with the inherent functionality of a computer, additional analysis and interpretation can be done against the corpus or its subsets. This analysis can be based the counting of words, the association of themes, parts-of-speech, etc. For example, I plan to give each item in the collection a colors, “big” names, and “great” ideas coefficient. These are scores denoting the use of researcher-defined “themes”. [17, 18, 19] You can see how these themes play out against the complete writings of “Dead White Men With Three Names”. [20, 21, 22] Fun with TEI/XML, text mining, and the definition of librarianship. [1] Box - http://bit.ly/1QcvxLP [2] mirror - http://dh.crc.nd.edu/sandbox/eebo-tcp/xml/ [3] xpath script - http://dh.crc.nd.edu/sandbox/eebo-tcp/bin/xml2tab.pl [4] catalog (index) - http://dh.crc.nd.edu/sandbox/eebo-tcp/catalog.txt [5] search results - http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/baxter.txt [6] Baxter at VIAF - http://viaf.org/viaf/54178741 [7] Baxter at WorldCat - http://www.worldcat.org/wcidentities/lccn-n50-5510 [8] Baxter at Wikipedia - http://en.wikipedia.org/wiki/Richard_Baxter [9] box plot of dates - http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/boxplot-dates.png [10] box plot of words - http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/boxplot-words.png
Re: [CODE4LIB] eebo
Hi Eric, I’ve worked with EEBO as part of the Jisc Historical Texts (https://historicaltexts.jisc.ac.uk/home) platform - which provides access to EEBO and other collections for UK Universities. My work was around the metadata and search of metadata and full text and display of results. I was mainly looking at metadata but did some digging into the TEI files to see how the markup could be used to extract metadata (e.g. presence of illustrations in the text). I was lucky (?!) enough to have access to the MARC records, but I did also do some work looking at the metadata included in the TEI files. If there is anything I can help with I’d be happy to. The people who worked with the files in detail were a UK s/w development company Knowledge Integration (http://www.k-int.com/) - I can give you a contact there if that would be helpful. Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 5 Jun 2015, at 13:10, Eric Lease Morgan emor...@nd.edu wrote: Does anybody here have experience reading the SGML/XML files representing the content of EEBO? I’ve gotten my hands on approximately 24 GB of SGML/XML files representing the content of EEBO (Early English Books Online). This data does not include page images. Instead it includes metadata of various ilks as well as the transcribed full text. I desire to reverse engineer the SGML/XML in order to: 1) provide an alternative search/browse interface to the collection, and 2) support various types of text mining services. While I am making progress against the data, it would be nice to learn of other people’s experience so I do not not re-invent the wheel (too many times). ‘Got ideas? — Eric Lease Morgan University Of Notre Dame
Re: [CODE4LIB] linked data question
I highly recommend Chapter 6 of the Linked Data book which details different design approaches for Linked Data applications - sections 6.3 (http://linkeddatabook.com/editions/1.0/#htoc84) summarises the approaches as: 1. Crawling Pattern 2. On-the-fly dereferencing pattern 3. Query federation pattern Generally my view would be that (1) and (2) are viable approaches for different applications, but that (3) is generally a bad idea (having been through federated search before!) Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 26 Feb 2015, at 14:40, Eric Lease Morgan emor...@nd.edu wrote: On Feb 25, 2015, at 2:48 PM, Esmé Cowles escow...@ticklefish.org wrote: In the non-techie library world, linked data is being talked about (perhaps only in listserv traffic) as if the data (bibliographic data, for instance) will reside on remote sites (as a SPARQL endpoint??? We don't know the technical implications of that), and be displayed by your local catalog/the centralized inter-national catalog by calling data from that remote site. But the original question was how the data on those remote sites would be access points - how can I start my search by searching for that remote content? I assume there has to be a database implementation that visits that data and pre-indexes it for it to be searchable, and therefore the index has to be local (or global a la Google or OCLC or its bibliographic-linked-data equivalent). I think there are several options for how this works, and different applications may take different approaches. The most basic approach would be to just include the URIs in your local system and retrieve them any time you wanted to work with them. But the performance of that would be terrible, and your application would stop working if it couldn't retrieve the URIs. So there are lots of different approaches (which could be combined): - Retrieve the URIs the first time, and then cache them locally. - Download an entire data dump of the remote vocabulary and host it locally. - Add text fields in parallel to the URIs, so you at least have a label for it. - Index the data in Solr, Elasticsearch, etc. and use that most of the time, esp. for read-only operations. Yes, exactly. I believe Esmé has articulated the possible solutions well. escowles++ —ELM
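As an illustration of the first of Esmé's options above - retrieve the URIs the first time and then cache them locally - a small rdflib sketch. The URI used here is the VIAF identifier for Richard Baxter that appears elsewhere in this digest; whether a given remote service plays nicely with rdflib's content negotiation, and which predicates (labels, names, ...) you then keep, varies from vocabulary to vocabulary, so this just caches whatever triples come back.

    import os
    from rdflib import Graph

    CACHE_DIR = "rdf_cache"

    def get_graph(uri):
        """Return a graph for the URI, fetching it only if not already cached."""
        os.makedirs(CACHE_DIR, exist_ok=True)
        cache_file = os.path.join(CACHE_DIR, uri.replace("/", "_").replace(":", "_") + ".ttl")
        g = Graph()
        if os.path.exists(cache_file):
            g.parse(cache_file, format="turtle")   # use the local copy
        else:
            g.parse(uri)                           # dereference the live URI
            g.serialize(destination=cache_file, format="turtle")
        return g

    g = get_graph("http://viaf.org/viaf/54178741")
    print(len(g), "triples cached")
    for s, p, o in list(g)[:10]:
        print(p, o)

The same pattern extends naturally to the other options Esmé lists: swap the per-URI cache for a bulk data dump, or push the cached labels into Solr/Elasticsearch for read-mostly use.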
Re: [CODE4LIB] Code4LibCon video crew thanks
Apologies for a +1 message, but you know... +1 and some Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 13 Feb 2015, at 18:00, Cary Gordon listu...@chillco.com wrote: I want to deeply thank Ashley Blewer, Steven Anderson and Josh Wilson for running the video streaming and capture at Code4LibCon in Portland. Because of you, we had great video in real time (and I got to actually watch the presentations). I also want to again thank Riley Childs, who could not make it this year. Riley moved the bar up last year by putting together our YouTube presence. For the second year running, we requested and were not allowed to setup and test the day before, and for the second year running lost part of the opening session. Fortunately, we did capture most of what did not get streamed on Tuesday, and I will put that online next week. There is always next year. Thanks, Cary
Re: [CODE4LIB] Automatically updating documentation with screenshots
... and further to this I've just found a neat Chrome plugin which will record a set of actions/tests as a CasperJS script, including screenshots - my first impressions are pretty positive - the code produced looks pretty clean. The plugin is called 'Resurrectio' [https://github.com/ebrehault/resurrectio]

Cheers

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 26 Jan 2015, at 13:48, Owen Stephens o...@ostephens.com wrote:

Thanks all - I'm looking at both Selenium and CasperJS now. I also came across a plugin for 'Robot Framework' [http://robotframework.org] which allows you to grab screenshots (via Selenium) and annotate them with notes - along the lines that Ross suggested. The plugin is 'Selenium2Screenshots' [https://github.com/datakurre/robotframework-selenium2screenshots]

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 26 Jan 2015, at 13:16, Mads Villadsen m...@statsbiblioteket.dk wrote:

I have used casperjs for this purpose. A small script that loads urls at multiple different resolutions/user agents and takes a screenshot of each of them.

Regards
--
Mads Villadsen m...@statsbiblioteket.dk
Statsbiblioteket
It-udvikler
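For anyone who ends up on the plain Selenium WebDriver route rather than CasperJS/Resurrectio, the core of a screenshot-regeneration script is only a few lines of Python. The page URLs and output paths below are placeholders, not a real application.

    # Walk a list of documented pages and save a screenshot of each, so the
    # shots can be regenerated after every release.
    from selenium import webdriver

    PAGES = {
        "login": "https://app.example.org/login",
        "search": "https://app.example.org/search",
        "admin-menu": "https://app.example.org/admin",
    }

    driver = webdriver.Firefox()
    driver.set_window_size(1280, 900)      # keep screenshots a consistent size
    try:
        for name, url in PAGES.items():
            driver.get(url)
            driver.save_screenshot("docs/screenshots/{}.png".format(name))
    finally:
        driver.quit()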
[CODE4LIB] Automatically updating documentation with screenshots
I work on a web application, and when we release a new version there are often updates to make to existing user documentation - especially screenshots, where unrelated changes (e.g. the addition of a new top-level menu item) can make whole new sets of screenshots desirable across all the documentation. I'm looking at whether we could automate the generation of screenshots somehow, which has taken me into documentation tools such as Sphinx [http://sphinx-doc.org] and Dexy [http://dexy.it]. However, ideally I want something simple enough for the application support staff to be able to use.

Anyone done/tried anything like this?

Cheers

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936
Re: [CODE4LIB] Stack Overflow
Another option would be a 'code4lib QA' site. Becky Yoose set up one for Coding/Cataloguing and so can comment on how much effort its been. In terms of asking/answering questions the use is clearly low but I think the content that is there is (generally) good quality and useful. I guess the hard part of any project like this is going to be building the community around it. The first things that occur to me is how you encourage people to ask the question on this new site, rather than via existing methods and how do you build enough community activity around housekeeping such as noting duplicate questions and merging/closing. The latter might be a nice problem to have, but the former is where both the Library / LIS SE and the Digital Preservation SE fell down, and libcatcode suffers the same problem - just not enough activity to be a go-to destination. I'm supportive of the idea, but I'd hate to see this go through the pain of the SE process only to fail for the same reasons as previous efforts in this area. I think we need to think about this underlying problem - but I'm not sure what the solution is/solutions are. Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 4 Nov 2014, at 15:34, Schulkins, Joe joseph.schulk...@liverpool.ac.uk wrote: To be honest I absolutely hate the whole reputation and badge system for exactly the reasons you outline, but I can't deny that I do find the family of Stack Exchange sites extremely useful and by comparison Listservs just seem very archaic to me as it's all too easy for a question (and/or its answer) to drop through the cracks of a popular discussion. Are Listservs really the best way to deal with help? I would even prefer a Drupal site... Joseph Schulkins| Systems Librarian| University of Liverpool Library| PO Box 123 | Liverpool L69 3DA | joseph.schulk...@liverpool.ac.uk| T 0151 794 3844 Follow us: @LivUniLibrary Like us: LivUniLibrary Visit us: http://www.liv.ac.uk/library Special Collections Archives blog: http://manuscriptsandmore.liv.ac.uk -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Joshua Welker Sent: 04 November 2014 14:43 To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Stack Overflow The concept of a library technology Stack Exchange site as a google-able repository of information sounds great. However, I do have quite a few reservations. 1. Stack Exchange sites seem to naturally lead to gatekeeping, snobbishness, and other troll behaviors. The reputation system built into those sites really go to a lot of folks' heads. High-ranking users seem to take pleasure in shutting down questions as off-topic, redundant, etc. Argument and one-upmanship are actively promoted--The previous answer sucks. Here's my better answer! This tends to attract certain (often male) personalities and to repel certain (often female) personalities. This seems very contrary to the direction the Code4Lib community has tried to move in the last few years of being more inclusive and inviting to women instead of just promoting the stereotypical IT guy qualities that dominate most IT-related discussions on the Internet. More here: http://www.banane.com/2012/06/20/there-are-no-women-on-stackoverflow-or-ar e-there/ http://michael.richter.name/blogs/why-i-no-longer-contribute-to-stackoverf low 2. Having a Stack Exchange site might fragment the already quite small and nascent library technology community. 
This might be an unfounded worry, but it's worth consideration. A lot of QA takes place on this listserv, and it would be awkward to try to have all this information in both places. That said, searching StackExchange is much easier than searching a listserv. 3. I echo your concerns about vendors. Libraries have a culture of protecting vendors from criticism. Sure, we do lots of criticism behind closed doors, but nowhere that leaves an online footprint. Often, our contracts include a clause that we have to keep certain kinds of information private. I don't think this is a very positive aspect of librarian culture, but it is there. I think a year or two ago that there was a pretty long discussion on this listserv about creating a Stack Exchange site. Josh Welker -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Schulkins, Joe Sent: Tuesday, November 04, 2014 8:12 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: [CODE4LIB] Stack Overflow Presumably I'm not alone in this, but I find Stack Overflow a valuable resource for various bits of web development and I was wondering whether anyone has given any thought about proposing a Library Technology site to Stack Exchange's Area 51 (http://area51.stackexchange.com/)? Doing a search of the proposals shows there was one
Re: [CODE4LIB] Stack Overflow
Thanks for that Mark. That's running on 'question2answer' which looks to have a reasonable amount of development going on around it https://github.com/q2a/question2answer/graphs/contributors (given Becky's comments about OSQA which still hold true) Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 4 Nov 2014, at 16:05, Mark A. Matienzo mark.matie...@gmail.com wrote: On Tue, Nov 4, 2014 at 11:00 AM, Owen Stephens o...@ostephens.com wrote: Another option would be a 'code4lib QA' site. Becky Yoose set up one for Coding/Cataloguing and so can comment on how much effort its been. In terms of asking/answering questions the use is clearly low but I think the content that is there is (generally) good quality and useful. I guess the hard part of any project like this is going to be building the community around it. The first things that occur to me is how you encourage people to ask the question on this new site, rather than via existing methods and how do you build enough community activity around housekeeping such as noting duplicate questions and merging/closing. The latter might be a nice problem to have, but the former is where both the Library / LIS SE and the Digital Preservation SE fell down, and libcatcode suffers the same problem - just not enough activity to be a go-to destination. I would add that the Digital Preservation SE has been reinstantiated as Digital Preservation QA http://qanda.digipres.org/, which is organized and supported by the Open Planets Foundation and the National Digital Stewardship Alliance. Mark A. Matienzo m...@matienzo.org Director of Technology, Digital Public Library of America
Re: [CODE4LIB] MARC reporting engine
The MARC XML seemed to be an archive within an archive - I had to gunzip to get innzmetadata.xml then rename to innzmetadata.xml.gz and gunzip again to get the actual xml Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 3 Nov 2014, at 22:38, Robert Haschart rh...@virginia.edu wrote: I was going to echo Eric Hatcher's recommendation of Solr and SolrMarc, since I'm the creator of SolrMarc. It does provide many of the same tools as are described in the toolset you linked to, but it is designed to write to Solr rather than to a SQL style database. Solr may or may not be more suitable for your needs then a SQL database. However I decided to download the data to see whether SolrMarc could handle it. I started with the MARCXML.gz data, ungzipped it to get a .XML file, but the resulting file causes SolrMarc to blow chunks. Either I'm missing something or there is something way wrong with that data.The gzipped binary MARC file work fine with the SolrMarc tools. Creating a SolrMarc script to extract the 700 fields, plus a bash script to cluster and count them, and sort by frequency took about 20 minutes. -Bob Haschart On 11/3/2014 3:00 PM, Stuart Yeates wrote: Thank you to all who responded with software suggestions. https://github.com/ubleipzig/marctools is looking like the most promising candidate so far. The more I read through the recommendations the more it dawned on me that I don't want to have to configure yet another java toolchain (yes I know, that may be personal bias). Thank you to all who responded about the challenges of authority control in such collections. I'm aware of these issues. The current project is about marshalling resources for editors to make informed decisions about rather than automating the creation of articles, because there is human judgement involved in the last step I can afford to take a few authority control 'risks' cheers stuart -- I have a new phone number: 04 463 5692 From: Code for LibrariesCODE4LIB@LISTSERV.ND.EDU on behalf of raffaele messutiraffaele.mess...@gmail.com Sent: Monday, 3 November 2014 11:39 p.m. To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] MARC reporting engine Stuart Yeates wrote: Do any of these have built-in indexing? 800k records isn't going to fit in memory and if building my own MARC indexer is 'relatively straightforward' then you're a better coder than I am. you could try marcdb[1] from marctools[2] [1] https://github.com/ubleipzig/marctools#marcdb [2] https://github.com/ubleipzig/marctools -- raffaele
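For anyone who wants the same 700-field extract-and-count without setting up Solr/SolrMarc at all, a small pymarc sketch along the same lines as the SolrMarc-plus-bash approach described above (reading the gunzipped binary MARC file; the file name is a placeholder):

    # Read the binary MARC file, pull out every 700 field, and count by frequency.
    from collections import Counter
    from pymarc import MARCReader

    counts = Counter()
    with open("innzmetadata.mrc", "rb") as fh:
        for record in MARCReader(fh):
            if record is None:        # newer pymarc yields None for records it cannot parse
                continue
            for field in record.get_fields("700"):
                counts[field.value()] += 1

    for heading, n in counts.most_common(50):
        print("{}\t{}".format(n, heading))

A Counter over 800k records fits comfortably in memory; the clustering of near-duplicate headings (the authority-control side of the problem) is still left to the editors.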
Re: [CODE4LIB] Linux distro for librarians
This triggered a memory of a project that was putting together a ready-to-go toolset for Digital Humanities - which I then couldn't remember the details of - but luckily Twitter was able to remember it for me (thanks to @mackymoo https://twitter.com/mackymoo).

The project is DH Box http://dhbox.org which tries to put together an environment suitable for DH work. I think that originally this was to be done via installation on the user's local machine, but due to the challenges of dealing with the variation in local environments they've now moved to a 'box in the cloud' approach (the change of direction is noted at http://dhbox.commons.gc.cuny.edu/blog/2014/dh-box-new-friend-new-platform#sthash.27THWR6E.dpbs). To be honest I'm not 100% sure where the project is right now, as it looks like not much has been updated since May 2014.

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 21 Oct 2014, at 15:42, Brad Coffield bcoffield.libr...@gmail.com wrote:

Is what you're really after an environment pre-loaded with useful tools for various types of librarians? If so, maybe instead of rolling your own distro (and all the work and headache that involves, like a second full-time job) maybe create software bundles for linux? Have a website where you have lists of software by librarian type. Then make it easy for linux users to install them (repo's and what not) ((I haven't been active in linux for a while))

Just thinking out loud.

--
Brad Coffield, MLIS
Assistant Information and Web Services Librarian
Saint Francis University
814-472-3315
bcoffi...@francis.edu
Re: [CODE4LIB] ISSN lists?
It may depend on exactly what you need. The ISSN Centre offer licensed access to their ISSN portal at a cost http://www.issn.org - my experience is that this is pretty comprehensive The ISSN Centre also offer a download of ISSN-L tables - this is available for free (although you have to state what you intend to do with it before you can download) - this is just ISSNs (mapped to their ISSN-Ls) but if you don't need bibliographic details then it would be a good source As well as WorldCat you could also try Suncat which offers a z39.50 connection http://www.suncat.ac.uk/support/z-target.shtml, but obviously this has the same issue as the WorldCat approach GOKb and KB+ are both initiatives trying to build knowledgebases containing many ISSNs with data to be made available under a CC0 declaration. Both of these are focussed on describing bundles/packages of journals. GOKb is going to be going into preview imminently (http://gokb.org/news) and KB+ already offers downloads http://www.kbplus.ac.uk/kbplus/publicExport. KB+ currently has details of around 25k journals. There may also be some largescale open data initiatives that give you a reasonably good set of ISSNs. For example the RLUK release of 60m+ records at http://www.theeuropeanlibrary.org/tel4/access/data/lod, or the 12million records released by Harvard http://openmetadata.lib.harvard.edu/bibdata (both CC0) Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 17 Oct 2014, at 03:16, Stuart Yeates stuart.yea...@vuw.ac.nz wrote: My understanding is that there is no universal ISSN list but that worldcat allows querying of their database by ISSN. Which method of sampling the ISSN namespace is going to cause least pain? http://www.worldcat.org/ISSN/ seems to be the one talked about, but is there another that's less resource intensive? Maybe someone's already exported this data? cheers stuart -- I have a new phone number: 04 463 5692
Re: [CODE4LIB] Python or Perl script for reading RDF/XML, Turtle, or N-triples Files
I've not tried using the LCNAF RDF files, and I've not used RDFLib, but a couple of things from (a relatively small amount of) experience parsing RDF: Don't try to parse the RDF/XML, use n-triples instead As Kyle mentioned, you might want to use command line tools to strip down the n-triples to only deal with data you actually want Rapper and the Redland RDF libraries are a good place to start, and have bindings to Perl, PHP, Python and Ruby (http://librdf.org/raptor/rapper.html and http://librdf.org). This StackOverflow QA might help getting started http://stackoverflow.com/questions/5678623/how-to-parse-big-datasets-using-rdflib If you want to move between RDF formats an alternative to Rapper is http://www.l3s.de/~minack/rdf2rdf/ - this succeeded converting a file of 48 million triples in ttl to ntriples where Rapper failed with an 'out of memory' error (once in ntriples, Rapper can be used for further parsing) Some slightly random advice there, but maybe some of it will be useful! Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 30 Sep 2014, at 15:54, Jeremy Nelson jeremy.nel...@coloradocollege.edu wrote: Hi Jean, I've found rdflib (https://github.com/RDFLib/rdflib) on the Python side exceeding simple to work with and use. For example, to load the current BIBFRAME vocabulary as an RDF graph using a Python shell: import rdflib bf_vocab = rdflib.Graph().parse('http://bibframe.org/vocab/') len(bf_vocab) # Total number of triples 1683 set([s for s in bf_vocab]) # A set of all unique subjects in the graph This module offers RDF/XML, Turtle, or N-triples support and with various options for retrieving and manipulating the graph's subjects, predicate, and objects. I would advise installing the JSON-LD (https://github.com/RDFLib/rdflib-jsonld) extension as well. Jeremy Nelson Metadata and Systems Librarian Colorado College -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Jean Roth Sent: Tuesday, September 30, 2014 8:14 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: [CODE4LIB] Python or Perl script for reading RDF/XML, Turtle, or N-triples Files Thank you so much for the reply. I have not investigated the LCNAF data set thoroughly. However, my default/ideal is to read in all variables from a dataset. So, I was wondering if any one had an example Python or Perl script for reading RDF/XML, Turtle, or N-triples file. A simple/partial example would be fine. Thanks, Jean On Mon, 29 Sep 2014, Kyle Banerjee wrote: KB The best way to handle them depends on what you want to do. You need KB to actually download the NAF files rather than countries or other KB small files as different kinds of data will be organized KB differently. Just don't try to read multigigabyte files in a text KB editor :) KB KB If you start with one of the giant XML files, the first thing you'll KB probably want to do is extract just the elements that are KB interesting to you. A short string parsing or SAX routine in your KB language of choice should let you get the information in a format you like. KB KB If you download the linked data files and you're interested in KB actual headings (as opposed to traversing relationships), grep and KB sed in combination with the join utility are handy for extracting KB the elements you want and flattening the relationships into KB something more convenient to work with. But there are plenty of other tools that you could also use. 
KB KB If you don't already have a convenient environment to work on, I'm a KB fan of virtualbox. You can drag and drop things into and out of your KB regular desktop or even access it directly. That way you can KB view/manipulate files with the linux utilities without having to KB deal with a bunch of clunky file transfer operations involving KB another machine. Very handy for when you have to deal with multigigabyte files. KB KB kyle KB KB On Mon, Sep 29, 2014 at 11:19 AM, Jean Roth jr...@nber.org wrote: KB KB Thank you! It looks like the files are available as RDF/XML, KB Turtle, or N-triples files. KB KB Any examples or suggestions for reading any of these formats? KB KB The MARC Countries file is small, 31-79 kb. I assume a script KB that would read a small file like that would at least be a start KB for the LCNAF KB KB KB
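Picking up the advice above about preferring the N-Triples download and stripping it to just the statements you want: rdflib reads N-Triples directly, but it holds the whole graph in memory, so for a multi-gigabyte LCNAF dump you would pre-filter with grep/sed (as Kyle suggests) and feed rdflib only the slice you need. In the sketch below the file name is a placeholder, and the use of skos:prefLabel assumes a SKOS-flavoured dump - check which vocabulary (SKOS, MADS/RDF, ...) your file actually uses before relying on that predicate.

    from rdflib import Graph
    from rdflib.namespace import SKOS

    g = Graph()
    g.parse("lcnaf-sample.nt", format="nt")   # parse a (small!) N-Triples file

    # Pull out just URI -> preferred label pairs.
    for subj, label in g.subject_objects(SKOS.prefLabel):
        print(subj, label)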
Re: [CODE4LIB] IFTTT and barcodes
As noted by Tara, when using IFTTT (or similar tools like Bip.io and WappWolf) you are limited to the channels/services the tool has already integrated. You are also in the position of having to give a third party service access to personal information and the ability to read/write certain services. I was investigating these types of services very briefly for a recent workshop and I came across an open source alternative called Huginn which you can run on your own server and of course can extend to work with whatever services/channels you want. I thought it looked interesting - available from https://github.com/cantino/huginn Overkill for this particular problem but may be of more general interest Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 11 Sep 2014, at 08:21, Sylvain Machefert smachef...@u-bordeaux3.fr wrote: Hello, maybe that an easier solution, more IFTTT related, would be to develop a Yahoo pipe, using the ISBN querying the webpac should be easy for Yahoo Pipes, you can then search in the page using xpath or thing like that. Should be easier thant developping a custom script (if you have no development knowledge, ortherwise it should be scripted easily in PHP, python, whatever). I haven't used YPipes in a long time but I think it's worth looking at it. Sylvain Le 10/09/2014 21:48, Ian Walls a écrit : I don't think IFTTT is the right tool, but the basic idea is sound. With a spot of custom scripting on some server somewhere, one could take in an ISBN, lookup via the III WebPac, assess eligibility conditions, then return yes or no. Barcode Scanner on Android has the ability to do custom search URLs, so if your yes/no script can accept URL params, then you should be all set. Barring a script, just a lookup of the MARC record may be possible, and if it was styled in a mobile-friendly manner, perhaps you could quickly glean whether it's okay or not for copy cataloging. Side question: is there connectivity in the stacks for doing this kind of lookup? I know in my library, that's not always the case. -Ian -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Riley Childs Sent: Wednesday, September 10, 2014 3:31 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] IFTTT and barcodes Webhooks via the WordPress channel? Riley Childs Senior Charlotte United Christian Academy Library Services Administrator IT Services (704) 497-2086 rileychilds.net @rowdychildren From: Tara Robertsonmailto:trobert...@langara.bc.ca Sent: 9/10/2014 3:03 PM To: CODE4LIB@LISTSERV.ND.EDUmailto:CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] IFTTT and barcodes Hi, I don't think this is possible using IFTTT right now as existing channels don't exist to create a recipe. I'm trying to think of what those channels would be and can't quite...I don't think IFTTT is the best tool for this task. What ILS are you using? Could you hook a barcode scanner up to a tablet and scan, then check the MARC...nah, that's seeming almost as time consuming as taking it to your desk (depending on how far your desk is). I recall at an Evergreen hackfest that someone was tweaking the web interface for an inventory type exercise, where it would show red or green depending on some condition. Cheers, Tara On 10/09/2014 11:52 AM, Harper, Cynthia wrote: Now that someone has mentioned IFTTT, I'm reading up on it and wonder if it could make this task possible: One of my tasks is copy cataloging. 
I'm only authorized to do LC copy, which involves opening the record (already downloaded in the acq process), and checking to see that 490 doesn't exist (I can't handle series), and looking for DLC in the 040 |a and |c. It's discouraging when I take 10 books back to my desk from the cataloging shelf, and all 10 are not eligible for copy cataloging. S... could I take my phone to the cataloging shelf, use IFTTT to scan my ISBN, search in the III Webpac, look at the MARc record and tell me whether it's LC copy? Empower the frontline workers! :) Cindy Harper Electronic Services and Serials Librarian Virginia Theological Seminary 3737 Seminary Road Alexandria VA 22304 703-461-1794 char...@vts.edu -- Tara Robertson Accessibility Librarian, CAPER-BC http://caperbc.ca/ T 604.323.5254 F 604.323.5954 trobert...@langara.bc.ca mailto:tara%20robertson%20%3ctrobert...@langara.bc.ca%3E Langara. http://www.langara.bc.ca 100 West 49th Avenue, Vancouver, BC, V5Y 2Z6
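Setting IFTTT/Huginn aside for a moment, the eligibility test itself is simple enough to express directly - for example as the pymarc sketch below, which encodes the rule described above (no 490 field, and DLC in both 040 $a and $c). The file name is a placeholder; how the scanned ISBN gets turned into a downloaded MARC record is the part a webhook or lookup script would still need to supply.

    from pymarc import MARCReader

    def lc_copy_eligible(record):
        if record.get_fields("490"):        # series statement present: not eligible
            return False
        for f040 in record.get_fields("040"):
            if "DLC" in f040.get_subfields("a") and "DLC" in f040.get_subfields("c"):
                return True
        return False

    with open("downloaded_record.mrc", "rb") as fh:
        for record in MARCReader(fh):
            print("OK for copy cataloguing" if lc_copy_eligible(record)
                  else "Send to original cataloguing")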
Re: [CODE4LIB] Automated searching of Copac/Worldcat
The worksheets I circulated earlier in the week include examples of how to take a list of ISBNs from a spreadsheet/csv file and search on Worldcat (see the 'Automated Love Examples' docs in http://bit.ly/automatedlovefolder). What these examples don't include is how to check the outcome of the search automatically and record that. I think it would be relatively easy to add to the iMacros example to extract a hit count / no hits message and write this to a file using the iMacros SAVEAS command, but I haven't tried this. For a 'no results' option you'd want to look for the presence of, or extract the contents of, a div with id=div-results-none. For a results count you'd want to look for the contents of a table within the div with class=resultsinfo. Alternatively you could look at the Selenium IDE extension for Firefox, which is more complex but allows a more sophisticated approach to checking and writing out information about text present/absent in the web pages retrieved. Hope that is of some help Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 13 Aug 2014, at 11:20, Nicholas Brown nbr...@iniva.org wrote: Apologies for cross posting Dear collective wisdom, I'm interested in using automation software such as Macro Express or iMacros to feed a list of ISBNs from a spreadsheet into Copac or Worldcat and output a list of those that return no matches in the results screen. The idea would be to create a tool that can quickly, although rather roughly, identify rare items in a collection (though obviously this would be limited to items with ISBNs or other unique identifiers). I can write a macro which will sequentially search either catalogue for a list of ISBNs but am struggling with how to have the macro identify items with no matches (I have a vague idea about searching the results screen for the text 'Sorry, there are no search results') and to compile them back into a spreadsheet. I'd be keen to hear if anyone has attempted something similar, general advice, any potential pitfalls in the method outlined above or suggestions for a better way to achieve the same results. If something useful comes of it I'd be happy to share the results. Many thanks for your help, Nick Nicholas Brown Library and Information Manager nbr...@iniva.org +44 (0)20 7749 1125 www.iniva.org
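If iMacros or Selenium feel heavyweight, the same check can also be sketched in a few lines of Python. This is only an illustration of the approach described above - the WorldCat search URL pattern and the div-results-none / resultsinfo markup are taken from the message and may well have changed - so treat it as a starting point rather than working code.

import csv
import requests
from bs4 import BeautifulSoup

def worldcat_has_match(isbn):
    """Search WorldCat for an ISBN and report whether any results came back.
    Based on the markup described above (assumed, not verified)."""
    resp = requests.get('http://www.worldcat.org/search',
                        params={'q': 'isbn:' + isbn},
                        timeout=30)
    soup = BeautifulSoup(resp.text, 'html.parser')
    # the 'no results' page carries a div with id="div-results-none"
    return soup.find('div', id='div-results-none') is None

# Read ISBNs from the first column of a CSV and write out match / no match
with open('isbns.csv') as infile, open('results.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    for row in csv.reader(infile):
        isbn = row[0].strip()
        writer.writerow([isbn, 'match' if worldcat_has_match(isbn) else 'no match'])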
[CODE4LIB] Automation tools - session at the Pi and Mash unconference
Dear all, A month or so ago I asked for recommendations for automation tools that people used in libraries to help inform a session I was going to run. The unconference event (Pi and Mash) ran this weekend, and I just wanted to share the materials I wrote for the session in case they are of any help. The materials consist of a slidedeck called Automated Love Presentation (available as Keynote, Powerpoint and PDF) and some examples and exercises you can work through in a document called Automated Love Examples (available as Pages, Word doc, PDF and ePub). There are also two accompanying files 'ISBNs.xlsx' and 'isbns.csv' which are used in the examples/exercises. All materials are available at http://bit.ly/automatedlovefolder Thanks to all who made suggestions which contributed towards the session. Best wishes, Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936
Re: [CODE4LIB] 'automation' tools
Thanks again all, I love OpenRefine - I've been working on the GOKb project (http://gokb.org) where K-Int (a UK based company) have developed an extension for OpenRefine which helps with the cleaning of data about electronic resources (esp. journals) from publishers and then pushes it into the GOKb database. The extension is fully integrated into the GOKb database, but if anyone wants a look, the code is at https://github.com/k-int/gokb-phase1/tree/dev/refine. The extension checks the data and reports errors as well as offering ways of fixing common issues - there's more on the wiki https://wiki.kuali.org/display/OLE/OpenRefine+How-Tos. I did pitch an OpenRefine workshop for the same event as a 'data wrangling/cleaning' tool but the 'automation' session got the vote in the end - although there is definitely overlap. However I am delivering an OpenRefine workshop at the British Library next week - and it's great to see it getting used across libraries. Google Docs Spreadsheets is also a great tip - I've run a course at the British Library which uses this to introduce the concept of APIs to non-techies. I blogged the original tutorial at http://www.meanboyfriend.com/overdue_ideas/2013/02/introduction-to-apis/ but a change to the BL open data platform means this no longer works :(( Thanks all again - I'll be trying to put stuff from the automation workshop online at some point and I'll post here when there is something up. Best wishes, Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 8 Jul 2014, at 03:52, davesgonechina davesgonech...@gmail.com wrote: +1 to OpenRefine. Some extensions, like RDF Refine http://refine.deri.ie/, currently only work with the old Google Refine (still available here https://code.google.com/p/google-refine/). There's a good deal of interesting projects for OpenRefine on GitHub and GitHub Gist. Google Docs Spreadsheets also has a surprising amount of functionality, such as importXML, if you're willing to get your hands dirty with regular expressions. Dave On Tue, Jul 8, 2014 at 3:12 AM, Tillman, Ruth K. (GSFC-272.0)[CADENCE GROUP ASSOC] ruth.k.till...@nasa.gov wrote: Definite cosign on Open Refine. It's intuitive and spreadsheet-like enough that a lot of people can understand it. You can do anything from standardizing state names you get from a patron form to normalizing metadata keywords for a database, so I think it'd be useful even for non-techies. Ruth Kitchin Tillman Metadata Librarian, Cadence Group NASA Goddard Space Flight Center Library, Code 272 Greenbelt, MD 20771 Goddard Library Repository: http://gsfcir.gsfc.nasa.gov/ 301.286.6246 -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Terry Brady Sent: Monday, July 07, 2014 1:35 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] 'automation' tools I learned about Open Refine http://openrefine.org/ at the Code4Lib conference, and it looks like it would be a great tool for normalizing data. I worked on a few projects in the past in which this would have been very helpful.
Re: [CODE4LIB] 'automation' tools
Thanks Riley and Andrew for these pointers - some great stuff in there Other tools and examples still very welcome :) Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 4 Jul 2014, at 15:04, Andrew Weidner metaweid...@gmail.com wrote: Great idea for a workshop, Owen. My staff and I use AutoHotkey every day. We have some apps for data cleaning in the CONTENTdm Project Client that I presented on recently: http://scholarcommons.sc.edu/cdmusers/cdmusersMay2014/May2014/13/. I'll be talking about those in more detail at the Upper Midwest Digital Collections Conference http://www.wils.org/news-events/wilsevents/umdcc/ if anyone is interested. I did an in-house training session for our ILS and database management folks on a simple AHK app that they now use for repetitive data entry: https://github.com/metaweidner/AutoType. When I was working with digital newspapers I developed a suite of tools for making repetitive quality review tasks easier: https://github.com/drewhop/AutoHotkey/wiki/NDNP_QR Basic AHK scripts are really great for text wrangling. Just yesterday I wrote a script to grab some values from a spreadsheet, remove commas from the numbers, and dump them into a tab delimited file in the format that we need. That script will become part of our regular workflow. Wrote another one-off script to transform labels on our wiki into links. It wrapped the labels in the wiki link syntax, and then I copied and pasted the unique URLs into the appropriate spots. It's also useful for keeping things organized. I have a set of scripts that open up frequently used network drive folders and applications, and I packaged them as drop down menu choices in a little GUI that's always open on the desktop. We have a few search scripts that either grab values from a spreadsheet or input box and then run a search for those terms in a web database (e.g. id.loc.gov). You might check out Selenium IDE for working with web forms: http://docs.seleniumhq.org/projects/ide/. The recording feature makes it really easy to get started with as an automation tool. I've used it extensively for automated metadata editing: http://digital.library.unt.edu/ark:/67531/metadc86138/m1/1/ Cheers! Andrew On Fri, Jul 4, 2014 at 6:54 AM, Riley Childs ri...@tfsgeo.com wrote: Don't forget AutoIT (auto IT, pretty clever eh?) http://www.autoitscript.com/site/autoit/ Riley Childs Student Asst. Head of IT Services Charlotte United Christian Academy (704) 497-2086 RileyChilds.net Sent from my Windows Phone, please excuse mistakes -Original Message- From: Owen Stephens o...@ostephens.com Sent: 7/4/2014 4:55 AM To: CODE4LIB@LISTSERV.ND.EDU CODE4LIB@LISTSERV.ND.EDU Subject: [CODE4LIB] 'automation' tools I'm doing a workshop in the UK at a library tech unconference-style event (Pi and Mash http://piandmash.info) on automating computer based tasks. I want to cover tools that are usable by non-programmers and that would work in a typical library environment. The types of tools I'm thinking of are: MacroExpress AutoHotKey iMacros for Firefox While I'm hoping workshop attendees will bring ideas about tasks they would like to automate the type of thing I have in mind are things like: Filling out a set of standard data on a GUI or Web form (e.g. standard set of budget codes for an order) Processing a list of item barcodes from a spreadsheet and doing something with them on the library system (e.g. 
change loan status, check for holds) Similarly for User IDs Navigating to a web page and doing some task Clearly some of these tasks would be better automated with appropriate APIs and scripts, but I want to try to introduce those without programming skills to some of the concepts and tools and essentially how they can work around problems themselves to some extent. What tools do you use for this kind of automation task, and what kind of tasks do they best deal with? Thanks, Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936
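The "grab some values from a spreadsheet, remove commas from the numbers, and dump them into a tab delimited file" task Andrew describes above is also a nice first scripting exercise outside a macro tool. A rough Python equivalent, with made-up file names, might look like this:

import csv

# Read a comma-separated export, strip thousands separators from numeric values,
# and write the result out tab-delimited - roughly what the AutoHotkey
# text-wrangling script described above does.
with open('export.csv', newline='') as infile, \
     open('cleaned.tsv', 'w', newline='') as outfile:
    writer = csv.writer(outfile, delimiter='\t')
    for row in csv.reader(infile):
        cleaned = [value.replace(',', '') if value.replace(',', '').isdigit() else value
                   for value in row]
        writer.writerow(cleaned)

The point of the workshop stands, though: tools like AutoHotkey, AutoIT and iMacros let non-programmers get the same result without writing code.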
Re: [CODE4LIB] coders who library? [was: Let me shadow you, librarians who code!]
I'm a librarian, and a slightly poor excuse for a coder second. I've always focussed on the IT/tech side of librarianship in my career and did at one point cross from libraries into more general IT management - then firmly put myself back into libraries. To a certain extent I left library employment to freelance as a consultant to get out of the academic library career path that kept taking me into management - which I realised, after several years doing it, was just not what got me out of bed in the morning. There is a name for people without an MLS who can still quote MARC subfields or write MODS XML freehand. http://shambrarian.org :) Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 7 Jul 2014, at 15:36, Miles Fidelman mfidel...@meetinghouse.net wrote: This recent spate of message leads me to wonder: How many folks here who code for libraries have a library science degree/background, vs. folks who come from other backgrounds? What about folks who end up in technology management/direction positions for libraries? Personally: Computer scientist and systems engineer, did some early Internet-in-public library deployments, got to write a book about it. Not actively doing library related work at the moment. Miles Fidelman Dot Porter wrote: I'm a medieval manuscripts curator who codes, in Philadelphia, and I'd be happy to talk to you as well. Dot On Tue, Jul 1, 2014 at 10:30 AM, David Mayo pobo...@gmail.com wrote: If you'd like to talk to someone who did a library degree, and currently works as a web developer supporting an academic library, I'd be happy to talk with you. - Dave Mayo Software Engineer @ Harvard HUIT LTS On Tue, Jul 1, 2014 at 10:12 AM, Steven Anderson stevencander...@hotmail.com wrote: Jennie, As with others, I'm not a librarian as I lack a library degree, but I do Digital Repository Development for the Boston Public Library (specifically: https://www.digitalcommonwealth.org/). Feel free to let me know you want to chat for your masters paper. Sincerely,Steven AndersonWeb Services - Digital Library Repository developer617-859-2393sander...@bpl.org Date: Tue, 1 Jul 2014 13:51:07 + From: mschofi...@nova.edu Subject: Re: [CODE4LIB] Let me shadow you, librarians who code! To: CODE4LIB@LISTSERV.ND.EDU Hey Jennie, I'm waaay south of MA but I'm pretty addicted to talking about coding as a library job O_o. If you are still in want of guinea-pigs, I'd love to skype / hangout. Michael Schofield // mschofi...@nova.edu -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Jennie Rose Halperin Sent: Monday, June 30, 2014 3:58 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: [CODE4LIB] Let me shadow you, librarians who code! hey Code4Lib, Do you work in a library and also like coding? Do you do coding as part of your job? I'm writing my masters paper for the University of North Carolina at Chapel Hill and I'd like to shadow and interview up to 10 librarians and archivists who also work with code in some way in the Boston area for the next two weeks. I'd come by and chat for about 2 hours, and the whole thing will not take up too much of your time. Not in Massachusetts? Want to skype? Let me know and that would be possible. I know that this list has a pretty big North American presence, but I will be in Berlin beginning July 14, and could potentially shadow anyone in Germany as well. 
Best, Jennie Rose Halperin jennie.halpe...@gmail.com -- In theory, there is no difference between theory and practice. In practice, there is. Yogi Berra
[CODE4LIB] 'automation' tools
I'm doing a workshop in the UK at a library tech unconference-style event (Pi and Mash http://piandmash.info) on automating computer based tasks. I want to cover tools that are usable by non-programmers and that would work in a typical library environment. The types of tools I'm thinking of are: MacroExpress AutoHotKey iMacros for Firefox While I'm hoping workshop attendees will bring ideas about tasks they would like to automate the type of thing I have in mind are things like: Filling out a set of standard data on a GUI or Web form (e.g. standard set of budget codes for an order) Processing a list of item barcodes from a spreadsheet and doing something with them on the library system (e.g. change loan status, check for holds) Similarly for User IDs Navigating to a web page and doing some task Clearly some of these tasks would be better automated with appropriate APIs and scripts, but I want to try to introduce those without programming skills to some of the concepts and tools and essentially how they can work around problems themselves to some extent. What tools do you use for this kind of automation task, and what kind of tasks do they best deal with? Thanks, Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936
Re: [CODE4LIB] Is ISNI / ISO 27729:2012 a name identifier or an entity identifier?
An aside but interesting to see how some of this identity stuff seems to be playing out in the wild now. Google for Catherine Sefton: https://www.google.co.uk/search?q=catherine+sefton The Knowledge Graph displays information about Martin Waddell. Catherine Sefton is a pseudonym of Martin Waddell. It is impossible to know, but the most likely source of this knowledge is Wikipedia, which includes the ISNI for Catherine Sefton in the Wikipedia page for Martin Waddell (http://en.wikipedia.org/wiki/Martin_Waddell) (although oddly not the ISNI for Martin Waddell under his own name). Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 18 Jun 2014, at 23:28, Stuart Yeates stuart.yea...@vuw.ac.nz wrote: My reading of that suggests that http://isni-url.oclc.nl/isni/000122816316 shouldn't have both Bell, Currer and Brontë, Charlotte, which it clearly does... Is this a case where one of our sources of truth doesn't distinguish between identities and entities and we're allowing it to pollute our data? If that source of truth is wikipedia, we can fix that. cheers stuart On 06/19/2014 12:11 AM, Richard Wallis wrote: Hi all, Seeing this thread I checked with the ISNI team and got the following answer from Janifer Gatenby who asked me to post it on her behalf: ISNI identifies “public identities”. The scope as stated in the standard is “This International Standard specifies the International Standard name identifier (ISNI) for the identification of public identities of parties; that is, the identities used publicly by parties involved throughout the media content industries in the creation, production, management, and content distribution chains.” The relevant definitions are: 3.1 party: natural person or legal person, whether or not incorporated, or a group of either. 3.3 public identity: identity of a party (3.1) or a fictional character that is or was presented to the public. 3.4 name: character string by which a public identity (3.3) is or was commonly referenced. A party may have multiple public identities and a public identity may have multiple names (e.g. pseudonyms). ISNI data is available as linked data. There are currently 8 million ISNIs assigned and 16 million links. ~Richard. On 16 June 2014 10:54, Ben Companjen ben.compan...@dans.knaw.nl wrote: Hi Stuart, I don't have a copy of the official standard, but from the documents on the ISNI website I remember that there are name variations and 'public identities' (as the lemma on Wikipedia also uses). I'm not sure where the borderline is or who decides when different names are different identities. If it were up to me: pseudonyms are definitely different public identities, name changes after marriage probably not, name change after gender change could mean a different public identity. Different public identities get different ISNIs; the ISNI organisation says the ISNI system can keep track of connected public identities. Discussions about name variations or aliases are not new, of course. I remember the discussions about 'aliases' vs 'Artist Name Variations' that are/were happening on Discogs.com, e.g. 'is J Dilla an alias or an ANV of Jay Dee?' It appears the users on Discogs finally went with aliases, but VIAF put the names/identities together: http://viaf.org/viaf/32244000 - and there is no ISNI (yet). It gets more confusing when you look at Washington Irving, who had several pseudonyms: they are just listed under one ISNI.
Maybe because he is dead, or because all other databases already know and have connected the pseudonyms to the birth name? (I just sent a comment asking about the record at http://isni.org/isni/000121370797 ) [Here goes the reference list…] Hope this helps :) Groeten van Ben On 15-06-14 23:11, Stuart Yeates stuart.yea...@vuw.ac.nz wrote: Could someone with access to the official text of ISO 27729:2012 tell me whether an ISNI is a name identifier or an entity identifier? That is, if someone changes their name (adopts a pseudonym, changes their name due to marriage, transitions gender, etc), should they be assigned a new identifier? If the answer is 'No', why is this called a 'name identifier'? Ideally someone with access to the official text would update the article at https://en.wikipedia.org/wiki/International_Standard_Name_Identifier with a brief quote referenced to the standard, with a page number. [The context of this is ORCID, which is being touted as an entity identifier, while not being clear on whether it's a name or entity identifier.] cheers stuart
Re: [CODE4LIB] Any good introduction to SPARQL workshops out there?
I contributed to a session like this in the UK aimed at cataloguers/metadata librarians http://www.cilip.org.uk/cataloguing-and-indexing-group/events/linked-data-what-cataloguers-need-know-cig-event. All the slide decks used are available at http://www.cilip.org.uk/cataloguing-and-indexing-group/linked-data-what-cataloguers-need-know Specifically my introduction to SPARQL slides are at http://www.slideshare.net/ostephens/selecting-with-sparql-using-british-national-bibliography-as, and link to various example SPARQL queries that can be run on the BNB SPARQL endpoint (SPARQL examples are all Gists at https://gist.github.com/ostephens) Not sure about the practicalities of bringing this to staff in the US, although planning is in progress for another event in the UK along the same lines and I'd be happy to put you in touch with the relevant people on the committee to see if there is any possibility of having it webcast if there was interest. Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 1 May 2014, at 17:23, Hutt, Arwen ah...@ucsd.edu wrote: We're interested in an introduction to SPARQL workshop for a smallish group of staff. Specifically an introduction for fairly tech comfortable non-programmers (in our case metadata librarians), as well as a refresher for programmers who aren't using it regularly. Ideally (depending on cost) we'd like to bring the workshop to our staff, since it'll allow more people to attend, but any recommendations for good introductory workshops or tutorials would be welcome! Thanks! Arwen Arwen Hutt Head, Digital Object Metadata Management Unit Metadata Services, Geisel Library University of California, San Diego
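For anyone who wants to try a query outside the slides, here is a minimal Python sketch of calling a SPARQL endpoint over HTTP. The endpoint address and the query itself are illustrative only (check the current BNB documentation for the exact endpoint URL and vocabulary); any endpoint that speaks the standard SPARQL protocol and returns JSON results should respond the same way.

import requests

ENDPOINT = 'http://bnb.data.bl.uk/sparql'  # assumed - check the current BNB docs

QUERY = """
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?book ?title
WHERE { ?book dct:title ?title }
LIMIT 10
"""

resp = requests.get(ENDPOINT,
                    params={'query': QUERY},
                    headers={'Accept': 'application/sparql-results+json'},
                    timeout=30)
resp.raise_for_status()
for binding in resp.json()['results']['bindings']:
    print(binding['book']['value'], '-', binding['title']['value'])

The same request pattern works from a spreadsheet, a web page or a workshop exercise, which is part of what makes SPARQL approachable for non-programmers once the query syntax itself has been introduced.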
Re: [CODE4LIB] barriers to open metadata?
Hi Laura, I've done some work on this in the UK[1][2] and there have been a number of associated projects looking at the open release of library, archive and museum metadata[3]. For libraries (it is different for archives and museums) I think I'd sum up the reasons in three ways - in order of how commonly I think they apply: a. Ignorance/lack of thought - libraries don't tend to licence their metadata, and often make no statement about how it can be used - my experience is that often no-one has even asked the questions about licensing/data release b. No business case - in the UK we talked to a group of university librarians and found that they didn't see a compelling business case for making open data releases of their catalogue records c. Concern about breaking contractual agreements or impinging on 3rd party copyright over records. The Comet project at the University of Cambridge did a lot of work in this area[4] As Roy notes, there have been some significant changes recently with OCLC and many national libraries releasing data under open licences. However, while this changes (c) it doesn't impact so much on (a) and (b) - so these remain as fundamental issues, and I have a (unsubstantiated) concern that big data releases lead to libraries taking less interest (someone else is doing this for us) rather than taking advantage of the clarity and openness these big data releases and associated announcements bring. A final point - looking at libraries' behaviour in relation to institutional/open access repositories, where you'd expect at least (a) to be considered, unfortunately when I looked a couple of years ago I found similar issues. Working for the CORE project at the Open University[5] I found that OpenDOAR[6] listed 'Metadata re-use policy explicitly undefined' for 57 out of 125 UK repositories with OAI-PMH services. Only 18 repositories were listed as permitting commercial re-use of metadata. Hopefully this has improved in the intervening 2 years! Hope some of this is helpful Owen 1 Jisc Guide to Open Bibliographic Data http://obd.jisc.ac.uk 2 Jisc Discovery principles http://discovery.ac.uk/businesscase/principles/ 3 Jisc Discovery Case studies http://guidance.discovery.ac.uk 4 COMET http://cul-comet.blogspot.co.uk/p/ownership-of-marc-21-records.html 5 CORE blog http://core-project.kmi.open.ac.uk/node/32 6 OpenDOAR http://www.opendoar.org/ Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 29 Apr 2014, at 21:06, Ben Companjen ben.compan...@dans.knaw.nl wrote: Hi Laura, Here are some reasons I may have overheard. Stuck halfway: We have an OAI-PMH endpoint, so we're open, right? Lack of funding for sorting out our own rights: We gathered metadata from various sources and integrated the result - we even call ourselves Open L*y - but we [don't have manpower to figure out what we can do with it, so we added a disclaimer]. Cultural: We're not sure how to prevent losing the records' provenance after we released our metadata. Groeten van Ben On 29-04-14 19:02, Laura Krier laura.kr...@gmail.com wrote: Hi Code4Libbers, I'd like to find out from as many people as are interested what barriers you feel exist right now to you releasing your library's bibliographic metadata openly. I'm curious about all kinds of barriers: technical, political, financial, cultural. Even if it seems obvious, I'd like to hear about it. Thanks in advance for your feedback! You can send it to me privately if you'd prefer.
Laura -- Laura Krier laurapants.comhttp://laurapants.com/?utm_source=email_sigutm_medium=emai lutm_campaign=email
Re: [CODE4LIB] distributed responsibility for web content
I'd second the suggestions from Erin with regard to establishing style guides and Ross's suggestion of peer review. While not quite directly about the issue you have, Paul Boag, a UK web designer, has spoken and blogged on how policies relying on quantitative measures can help (perhaps!) take some of the emotion out of decision making - e.g. see http://boagworld.com/business-strategy/website-animal/ - perhaps a similar approach might help here as well. Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 18 Apr 2014, at 15:15, Erin White erwh...@vcu.edu wrote: Develop a brief content and design style guide, then have it approved by your leadership team and share it with your organization. (Easier said than done, I know.) Bonus points if you work with your (typically) print-focused communications person to develop this guide and get his/her buy-in on creating content for the web. A style guide sets expectations across the board and helps you when you need to play the heavy. As you need, you can e-mail folks with a link to the style guide, ask them to revise, and offer assistance or suggestions if they want. Folks are grumpy about this at first, but generally appreciate the overall strategy to make the website more consistent and professional-looking. It ain't the wild wild west anymore - our web content is both functional and part of an overall communications strategy, and we need to treat it accordingly. -- Erin White Web Systems Librarian, VCU Libraries 804-827-3552 | erwh...@vcu.edu | www.library.vcu.edu On Fri, Apr 18, 2014 at 9:39 AM, Pikas, Christina K. christina.pi...@jhuapl.edu wrote: Laughing and feeling your pain... we have a communications person (that's her job) who keeps using bold, italics, h1, in pink (yes pink), randomly in pages... luckily she only does internal pages, and not external. You could schedule some 'writing for the web' sessions, but I don't know that it will help. You could remove any text formatting... In the end, you probably should just do as I do: close the page, breathe deeply, get up and take a walk, and get on with other things. Christina -Original Message- From: Code for Libraries [mailto:CODE4LIB@listserv.nd.edu] On Behalf Of Simon LeFranc Sent: Thursday, April 17, 2014 7:43 PM To: CODE4LIB@listserv.nd.edu Subject: [CODE4LIB] distributed responsibility for web content My organization has recently adopted an enterprise Content Management System. For the first time, staff across 8 divisions became web authors, given responsibility for their division's web pages. Training on the software, which has a WYSIWYG interface for editing, is available and with practice, all are capable of mastering the basic tools. Some simple style decisions were made for them; however, it is extremely difficult to get these folks not to elaborate on or improvise new styles. Examples: 1) making text red or another color in the belief that color will draw readers' attention; 2) making text bold and/or italic and/or the size of a war-is-declared headline (see 1); 3) using images that are too small to be effective; 4) adding a few more images that are too small to be effective; 5) attempting to emphasize statements using ! or !! or !; 6) writing in a too-informal tone (Come on in outta the rain!)
[We are a research organization and museum.]; 7) feeling compelled to ornament pages with clipart, curlicues, et al.; 8) centering everything. There is no one person in the organization with the time or authority to act as editorial overseer. What are some techniques for ensuring that the site maintains a clean, professional appearance? Simon
[CODE4LIB] Research Libraries UK Hack day
Just over a year and a half ago I posted about some work I was doing on behalf of Research Libraries UK (RLUK) who were looking at the potential of publishing several million of their bibliographic records (drawn from the major research libraries in the UK) as linked open data. In August last year RLUK announced it would join The European Library (TEL)[1], and would work with the team at TEL to publish RLUK data, along with other data held by The European Library, as linked open data. I'm happy to say that they are now very close to making the (approximately) 17 million RLUK records available. To start the process of working with the wider community of librarians, developers, and anyone interested in exploiting this data, RLUK is holding a hack day in London on 14th May. Here the RLUK Linked Open Data will be introduced, along with the TEL API (OpenSearch based). There will be prizes (to be announced) for hacks in the following areas, which represent areas of interest to RLUK and TEL: • Linking Up datasets - a prize for work that combines data from multiple data sets • WWI • Eastern Europe • Delivering a valuable hack for RLUK members The event is free and you can sign up now at https://www.eventbrite.co.uk/e/rluk-hack-day-rlukhack-tickets-11197529111 - I hope to see some of you there Best wishes Owen 1. http://www.rluk.ac.uk/news/rluk-joins-european-library/ Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936
Re: [CODE4LIB] semantic web browsers
Your findings reflect my experience - there isn't much out there, and what there is, is basic or doesn't work at all. Link Sailor (http://linksailor.com) is another, but I suspect it is not actively maintained (it was developed by Ian Davis when he was at Talis doing linked data work). I think the Graphite based browser from Southampton *does* support content-negotiation - what makes you think it doesn't? Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 22 Mar 2014, at 20:49, Eric Lease Morgan emor...@nd.edu wrote: Do you know of any working Semantic Web browsers? Below is a small set of easy-to-use Semantic Web browsers. Give them URIs and they allow you to follow and describe the links they include. * LOD Browser Switch (http://browse.semanticweb.org) - This is really a gateway to other Semantic Web browsers. Feed it a URI and it will create lists of URLs pointing to Semantic Web interfaces, but many of the URLs (Semantic Web interfaces) do not seem to work. Some of the resulting URLs point to RDF serialization converters. * LodLive (http://en.lodlive.it) - This Semantic Web browser allows you to feed it a URI and interactively follow the links associated with it. URIs can come from DBpedia, Freebase, or one of your own. * Open Link Data Explorer (http://demo.openlinksw.com/rdfbrowser2/) - The most sophisticated Semantic Web browser in this set. Given a URI it creates various views of the resulting triples associated with it, including lists of all its properties and objects, network graphs, tabular views, and maps (if the data includes geographic points). * Quick and Dirty RDF browser (http://graphite.ecs.soton.ac.uk/browser/) - Given the URL pointing to a file of RDF statements, this tool returns all the triples in the file and verbosely lists each of their predicate and object values. Quick and easy. This is good for reading everything about a particular resource. The tool does not seem to support content negotiation. If you need some URIs to begin with, then try some of these: * Ray Family Papers - http://infomotions.com/sandbox/liam/data/mum432.rdf * Catholics and Jews - http://infomotions.com/sandbox/liam/data/shumarc681792.rdf * Walt Disney via VIAF - http://viaf.org/viaf/36927108/ * origami via the Library of Congress - http://id.loc.gov/authorities/subjects/sh85095643 * Paris from DBpedia - http://dbpedia.org/resource/Paris To me, this seems like a really small set of browser possibilities. I’ve seen others but could not get them to work very well. Do you know of others? Am I missing something significant? — Eric Lease Morgan
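On the content negotiation point, it is easy to check from a script what a given URI actually returns, rather than relying on a browser. A minimal Python sketch, using one of the URIs listed above; the Accept value and the 303 redirect are the usual Linked Data conventions rather than anything specific to these particular services.

import requests

uri = 'http://dbpedia.org/resource/Paris'

# Ask for RDF (Turtle) rather than the HTML page; Linked Data servers
# typically answer a resource URI with a 303 redirect to a data document.
resp = requests.get(uri,
                    headers={'Accept': 'text/turtle'},
                    allow_redirects=True,
                    timeout=30)

print('Final URL:', resp.url)
print('Content-Type:', resp.headers.get('Content-Type'))
print(resp.text[:500])  # first few hundred characters of the returned data

Running the same request with Accept: text/html shows what a browser-style client gets instead, which is a quick way to test whether a browser or endpoint really honours the header.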
Re: [CODE4LIB] tool for finding close matches in vocabular list
As Roy suggests, Open Refine is designed for this type of work and could easily deal with the volume you are talking about here. It can cluster terms using a variety of algorithms and easily apply a set of standard transformations. The screencasts and info at http://freeyourmetadata.org/cleanup/ might be a good starting point if you want to see what Refine can do Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 21 Mar 2014, at 18:24, Ken Irwin kir...@wittenberg.edu wrote: Hi folks, I'm looking for a tool that can look at a list of all of subject terms in a poorly-controlled index as possible candidates for term consolidation. Our student newspaper index has about 16,000 subject terms and they include a lot of meaningless typographical and nomenclatural difference, e.g.: Irwin, Ken Irwin, Kenneth Irwin, Mr. Kenneth Irwin, Kenneth R. Basketball - Women Basketball - Women's Basketball-Women Basketball-Women's I would love to have some sort of pattern-matching tool that's smart about this sort of thing that could go through the list of terms (as a text list, database, xml file, or whatever structure it wants to ingest) and spit out some clusters of possible matches. Does anyone know of a tool that's good for that sort of thing? The index is just a bunch of MySQL tables - there is no real controlled-vocab system, though I've recently built some systems to suggest known SH's to reduce this sort of redundancy. Any ideas? Thanks! Ken
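To see the idea behind Refine's "key collision" clustering without leaving Python, here is a toy sketch: normalise each term to a fingerprint (lower-case, drop possessives and punctuation, sort the words) and group terms whose fingerprints collide. OpenRefine's real algorithms are more sophisticated, but even this catches the Basketball variants above; name variants like Ken vs Kenneth need the fuzzier, nearest-neighbour methods Refine also offers.

import re
from collections import defaultdict

def fingerprint(term):
    """Very crude 'key collision' fingerprint: lower-case, drop possessives
    and punctuation, then sort the remaining words."""
    term = term.lower().replace("'s", "").replace("'", "")
    words = re.sub(r"[^a-z0-9]+", " ", term).split()
    return " ".join(sorted(set(words)))

# Sample terms from the message above - in practice, pull these from MySQL
terms = ["Irwin, Ken", "Irwin, Kenneth", "Basketball - Women",
         "Basketball - Women's", "Basketball-Women", "Basketball-Women's"]

clusters = defaultdict(list)
for term in terms:
    clusters[fingerprint(term)].append(term)

for key, members in clusters.items():
    if len(members) > 1:
        print(key, "->", members)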
Re: [CODE4LIB] Retrieving ISSN using a DOI
You should be able to use the content negotiation support on Crossref to get the metadata, which does include the ISSNs - or at least has the potential to if they are available. E.g.

curl -LH "Accept: application/rdf+xml;q=0.5, application/vnd.citationstyles.csl+json;q=1.0" http://dx.doi.org/10.1126/science.169.3946.635

gives:

{
  "subtitle": [],
  "subject": ["General"],
  "issued": {"date-parts": [[1970, 8, 14]]},
  "score": 1.0,
  "prefix": "http://id.crossref.org/prefix/10.1126",
  "author": [{"family": "Frank", "given": "H. S."}],
  "container-title": "Science",
  "page": "635-641",
  "deposited": {"date-parts": [[2011, 6, 27]], "timestamp": 130913280},
  "issue": "3946",
  "title": "The Structure of Ordinary Water: New data and interpretations are yielding new insights into this fascinating substance",
  "type": "journal-article",
  "DOI": "10.1126/science.169.3946.635",
  "ISSN": ["0036-8075", "1095-9203"],
  "URL": "http://dx.doi.org/10.1126/science.169.3946.635",
  "source": "CrossRef",
  "publisher": "American Association for the Advancement of Science (AAAS)",
  "indexed": {"date-parts": [[2013, 11, 7]], "timestamp": 1383796678887},
  "volume": "169"
}

Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 5 Mar 2014, at 12:30, Graham, Stephen s.grah...@herts.ac.uk wrote: OK, I've received a couple of emails telling me that the ISSN is not always included in the DOI - that it depends on the publisher. So, I guess my original question still stands! Stephen From: Graham, Stephen Sent: 05 March 2014 12:25 To: 'CODE4LIB@LISTSERV.ND.EDU' Subject: RE: Retrieving ISSN using a DOI Sorry - I've answered my own question. The ISSN is actually contained in the DOI. Didn't realise this! D'oh! Stephen From: Graham, Stephen Sent: 05 March 2014 12:14 To: 'CODE4LIB@LISTSERV.ND.EDU' Subject: Retrieving ISSN using a DOI Hi All - is there a service/API that will return the ISSN if I provide the DOI? I was hoping that the Crossref API would do this, but I can't see the ISSN in the JSON it returns. I'm adding a DOI field to our OPAC ILL form, so if the user has the DOI they can use this to populate the form rather than add all the data manually. When the user submits the form I'm querying our openURL resolver API to see if we have access to the article. If we do then the form will alert the user and provide a link. The query to the openURL resolver works better if we have the ISSN, but if the user has used a DOI the ISSN is frustratingly never there. Stephen Stephen Graham Online Information Manager Information Collections and Services University of Hertfordshire, Hatfield. AL10 9AB Tel. 01707 286111 Email s.grah...@herts.ac.uk
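For anyone wiring this into a form handler rather than the command line, the same lookup is a few lines of Python - a minimal sketch using content negotiation against dx.doi.org and reading the ISSN list out of the CSL JSON (field names as in the response above).

import requests

def issns_for_doi(doi):
    """Resolve a DOI via content negotiation and return its ISSNs (may be empty)."""
    resp = requests.get('http://dx.doi.org/' + doi,
                        headers={'Accept': 'application/vnd.citationstyles.csl+json'},
                        allow_redirects=True,
                        timeout=30)
    resp.raise_for_status()
    return resp.json().get('ISSN', [])

print(issns_for_doi('10.1126/science.169.3946.635'))
# expected, per the response above: ['0036-8075', '1095-9203']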
Re: [CODE4LIB] Library of Congress
+1 Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 1 Oct 2013, at 14:21, Doran, Michael D do...@uta.edu wrote: As far as I can tell the LOC is up and the offices are closed. HORRAY!! Let's celebrate! Before we start celebrating, let's consider our friends and colleagues at the LOC (some of who are code4lib people) who aren't able to work and aren't getting paid starting today. -- Michael # Michael Doran, Systems Librarian # University of Texas at Arlington # 817-272-5326 office # 817-688-1926 mobile # do...@uta.edu # http://rocky.uta.edu/doran/ -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Riley Childs Sent: Tuesday, October 01, 2013 5:28 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: [CODE4LIB] Library of Congress As far as I can tell the LOC is up and the offices are closed. HORRAY!! Let's celebrate! Riley Childs Junior and Library Tech Manager Charlotte United Christian Academy +1 (704) 497-2086 Sent from my iPhone Please excuse mistakes
Re: [CODE4LIB] Open Source ERM
I'm involved in the GOKb project, and also a related project in the UK called 'KB+' which is a national service providing a knowledgebase and the ability to manage subscriptions/licences. As Adam said, GOKb is definitely more of a service: although the software could be run by anyone, it isn't designed with ERM functionality in mind. GOKb is a community managed knowledgebase, and so far much of the work has been to build a set of tools for bringing in data from publishers and content providers, and to store and manage that data. In the not too distant future GOKb will provide data via APIs for use in downstream systems. Two specific downstream systems GOKb is going to be working with are the Kuali OLE system (https://www.kuali.org/ole) and the KB+ system mentioned above. KB+ started with very similar ideas to GOKb in terms of building a community managed knowledgebase, but with the UK HE community specifically in mind. However it is clear that collaborating with GOKb will have significant benefits and help the community focus its effort in a single knowledgebase, and so it is expected that eventually KB+ will consume data from GOKb, and the community will contribute to the data managed in GOKb. However KB+ also provides more ERM-style functionality, available to UK universities. Each institution can set up its own subscriptions and licences, drawing on the shared knowledgebase information which is managed centrally by a team at Jisc Collections (who negotiate licences for much of the content in the UK, among other things). I think the KB+ software could work as a standalone ERM in terms of functionality, but its strength is as a multi-institution system with a shared knowledgebase. We are releasing v3.3 next week which brings integration with various discussion forum software - hoping we can put community discussion and collaboration at the heart of the product. Development on both KB+ and GOKb is being done by a UK software house called Knowledge Integration, and while open licences have not yet been applied to the respective code bases, both should be released under an open licence in the future. However the code is already on Github if anyone is interested: http://github.com/k-int/KBPlus/ and https://github.com/k-int/gokb-phase1 In both cases they are web apps written in Groovy. GOKb has the added complication/interest of also having an Open (formerly Google) Refine extension, as this is the tool chosen for loading messy e-journal data into the system. Sorry to go on, hope the above is of some interest Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 20 Sep 2013, at 16:26, Karl Holten khol...@switchinc.org wrote: A couple of months ago our organization began looking at new ERM solutions / link resolvers, so I thought I'd share my thoughts based on my research of the topic. Unfortunately, I think this is one area where open source offerings are a bit thin. Many offerings look promising at first but are no longer under development. I'd be careful about adopting something that's no longer supported. Out of all the options that are no longer developed, I thought the CUFTS/GODOT combo was the most promising. Out of the options that seem to still be under development, there were two options that stood out: CORAL and GOKb. Neither includes a link resolver, so they weren't good for our needs. CORAL has the advantage of being out on the market right now.
GOKb is backed by some pretty big institutions and looks more sophisticated, but other than some slideshows there's not a lot to look at to actually evaluate it at the moment. Ultimately, I came to the conclusion that nothing out there right now matches the proprietary software, especially in terms of link resolvers and in terms of a knowledge base. If I were forced to go open source I'd say the GOKb and CORAL look the most promising. Hope that helps narrow things down at least a little bit. Regards, Karl Holten Systems Integration Specialist SWITCH Consortium 6801 North Yates Road Milwaukee, WI 53217 http://topcat.switchinc.org/ -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Riesner, Giles W. Sent: Thursday, September 19, 2013 5:33 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Open Source ERM Thank you, Peter. I took a quick look at the list and found ERMes there as well as a few others. Not everything under this category really fits what I'm looking for (e.g. Calibre). I'll look a little deeper. Regards, Giles W. Riesner, Jr., Lead Library Technician, Library Technology Community College of Baltimore County 800 S. Rolling Road Baltimore, MD 21228 gries...@ccbcmd.edu 1-443-840-2736 From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] on behalf
Re: [CODE4LIB] What do you want to learn about linked data?
Just a recommendation for a source of information - I've found http://linkeddatabook.com/editions/1.0/ very useful especially in thinking about the practicalities of linked data publication and consumption in applications Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 4 Sep 2013, at 15:13, Akerman, Laura lib...@emory.edu wrote: Karen, It's hard to say what basics are. We had a learning group at Emory that covered a lot of the what is it, including mostly what you've listed but also the environment (library and cultural heritage, and larger environment), but we had a harder time getting to the what do you do with it which is what would really motivate and empower people to go ahead and get beyond basics. Maybe add: How do you embed linked data in web pages using RDFa (Difference between RDFa and schema.org/other microdata) How do you harvest linked data from web pages, endpoints, or other modes of delivery? Different serializations and how to convert How do you establish relations between different vocabularies (classes and properties) using RDFS and OWL? (Demo) New answers to your questions enabled by combining and querying linked data! Maybe a step toward what can you do with it would be to show (or have an exercise): How can a web application interface with linked data? I suspect there are a lot of people who've read about it and/or have had tutorials here and there, and who really want to get their hands in it. That's where there's a real dearth of training. An intermediate level workshop addressing (but not necessarily answering!) questions like: Do you need a triplestore or will a relational database do? Do you need to store your data as RDF or can you do everything you need with XML or some other format, converting on the way out or in? Should you query external endpoints in real time in your application, or cache the data? Other than SPARQL, how do you search linked data? Indexing strategies... tools... If asserting OWL sameAs is too dangerous in your context, what other strategies for expressing close to it relationships between resources (concepts) might work for you? Advanced SPARQL using regular expressions, CREATE, etc. Care and feeding of triplestores (persistence, memory, ) Costing out linked data applications: How much additional server space and bandwidth will I (my institution) need to provision in order to work with this stuff? Open source, free, vs. commercial management systems? Backward conversion -transformations from linked data to other data serializations (e.g. metadata standards in XML). What else? Unfortunately (or maybe just, how it is) no one has built an interface that hides all the programming and technical details from people but lets them experience/experiment with this stuff (have they?). So some knowledge is necessary. What are prerequisites and how could we make the burden of knowing them not so onerous to people who don't have much experience in web programming or system administration, so they could get value from a tutorial,? Laura Laura Akerman Technology and Metadata Librarian Room 208, Robert W. Woodruff Library Emory University, Atlanta, Ga. 30322 (404) 727-6888 lib...@emory.edu -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Karen Coyle Sent: Wednesday, September 04, 2013 4:59 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] What do you want to learn about linked data? 
All, I had a few off-list requests for basics - what are the basic things that librarians need to know about linked data? I have a site where I am putting up a somewhat crudely designed tutorial (with exercises): http://kcoyle.net/metadata/ As you can see, it is incomplete, but I work away on it when so inspired. It includes what I consider to be the basic knowledge: 1. What is metadata? 2. Data vs. text 3. Identifiers (esp. URIs) 4. Statements (not records) (read: triples) 5. Semantic Web basics 6. URIs (more in depth) 7. Ontologies 8. Vocabularies I intend to link various slide sets to this, and anyone is welcome to make use of the content there. It would be GREAT for it to become an actual tutorial, perhaps using better software, but I haven't found anything yet that I like working with. If you have basics to add, please let me know! kc On 9/1/13 5:37 PM, Karen Coyle wrote: I'm thinking about training needs around linked data -- yes, that includes basic concepts, but at the moment I'm wondering what specific technologies or tasks people would like to learn about? Some obvious examples are: how to do SPARQL queries; how to use triples in databases; maybe how to use Protege (free software) [1] to create an ontology. Those are just a quick shot across the bow
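As a small taster for the "different serializations and how to convert" item in the list above, here is a minimal rdflib sketch. The example URI is just one that is generally known to serve RDF via content negotiation; substitute a URI from your own data, and note that older rdflib versions return bytes rather than a string from serialize().

from rdflib import Graph

# Fetch the RDF description of a resource and convert between serialisations.
# rdflib negotiates an RDF format when given a resource URI; check that the
# URI you use actually serves RDF this way.
g = Graph()
g.parse("http://dbpedia.org/resource/Paris")

print(len(g), "triples loaded")
print(g.serialize(format="turtle")[:1000])  # first part of the Turtle output

Whether an exercise like this belongs in a basics tutorial or an intermediate one is exactly the kind of judgement the thread is about, but it does show how little code sits between "what is a triple?" and "here are some triples".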
Re: [CODE4LIB] netflix search mashups w/ library tools?
From the Netflix API Terms of Use: "Titles and Title Metadata may be stored for no more than twenty four (24) hours." http://developer.netflix.com/page/Api_terms_of_use Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 19 Aug 2013, at 16:59, Ken Irwin kir...@wittenberg.edu wrote: Thanks Karen, This goes in a bit of a different direction from what I'm hoping for, but your project does suggest that some matching to build such searches might be possible. What I really want is to apply LCSH and related data to the Netflix search process, essentially dropping Netflix holdings into a library catalog interface. I suspect you'd have to build a local cache of the OCLC data for known Netflix items to do so, and maybe a local cache of the Netflix title list. I wonder if either or both of those actions would violate the TOS for the respective services. Ken -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Karen Coombs Sent: Monday, August 19, 2013 11:26 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] netflix search mashups w/ library tools? Ken, I did a mashup that took Netflix's top 100 movies and looked to see if a specific library had that item. http://www.oclc.org/developer/applications/netflix-my-library You might think about doing the following. Search WorldCat for titles on a particular topic and then check to see if the title is available via Netflix. The Netflix API for searching their catalog is pretty limited, though, so it might not give you what you want. It looks like it only allows you to search their streamable content. Also I had a lot of trouble trying to match Netflix titles and library holdings, because there isn't a good match point. DVDs don't have ISBNs, and if you use title you can get into trouble because movies get remade. So title + date seems to work best if you can get the information. Karen On Mon, Aug 19, 2013 at 8:54 AM, Ken Irwin kir...@wittenberg.edu wrote: Hi folks, Is anyone out there using library-like tools for searching Netflix? I'm imagining a world in which Netflix data gets mashed up with OCLC data or something like it to populate a more robustly searchable Netflix title list. Does anything like this exist? What I really want at the moment is a list of Netflix titles dealing with Islamic topics (Muhammed, the Qur'an, the history of Islamic civilizations, the Hajj, Ramadan, etc.) for doing beyond-the-library readers' advisory in connection with our ALA/NEH Muslim Journeys Bookshelf. Netflix's own search tool is singularly awful, and I thought that the library world might have an interest in doing better. Any ideas? Thanks Ken
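On Karen's "title + date seems to work best" point, a hedged sketch of what that matching might look like in Python: normalise the titles, pair them with the release year, and compare. The records and field names here are invented for illustration; real Netflix and WorldCat responses would need mapping into this shape first, and the 24-hour storage limit quoted above constrains how long any cache of the Netflix side can live.

import re

def match_key(title, year):
    """Build a crude (title, year) match key: lower-case, drop punctuation and articles."""
    words = re.sub(r'[^a-z0-9\s]', ' ', title.lower()).split()
    words = [w for w in words if w not in ('a', 'an', 'the')]
    return (' '.join(words), year)

# Hypothetical records - real data would come from the Netflix and WorldCat APIs
netflix_titles = [{'title': 'The Message', 'year': 1976},
                  {'title': 'Koran by Heart', 'year': 2011}]
library_holdings = [{'title': 'Message, The', 'year': 1976}]

held = {match_key(item['title'], item['year']) for item in library_holdings}
for movie in netflix_titles:
    status = 'in the library' if match_key(movie['title'], movie['year']) in held else 'Netflix only'
    print(movie['title'], '-', status)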
Re: [CODE4LIB] Releasing library holdings metadata openly on the web (was: Libraries and IT Innovation)
On the holdings front also see the work being done on a holding ontology at https://github.com/dini-ag-kim/holding-ontology (and related mailing list http://lists.d-nb.de/mailman/listinfo/dini-ag-kim-bestandsdaten) - discussion all in English Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 23 Jul 2013, at 21:14, Dan Scott deni...@gmail.com wrote: Hi Laura: On Tue, Jul 23, 2013 at 12:36 PM, Laura Krier laura.kr...@gmail.com wrote: snip The area where I'm most involved right now is in releasing library holdings metadata openly on the web, in discoverable and re-usable forms. It's amazing to me that we still don't do this. Imagine the things that could be created by users and software developers if they had access to information about which libraries hold which resources. I'm really interested in your efforts on this front, and where this work is taking place, as that's what I'm trying to do as part of my participation in the W3C Schema Bib Extend Community Group at http://www.w3.org/community/schemabibex/ See the thread starting around http://lists.w3.org/Archives/Public/public-schemabibex/2013Jul/0068.html where we're trying to work out how best to surface library holdings in schema.org structured data, with one effort focusing on reusing the Offer class. There are many open questions, of course, but one of the end goals (at least for me) is to get the holdings into a place where regular people are most likely to find them: in search results served up by search engines like Google and Bing. If you're not involved in the W3C community group, maybe you should be! And it would be great if you could point out where your work is taking place so that we can combine forces. Dan
Re: [CODE4LIB] Anyone have access to well-disambiguated sets of publication data?
I'd echo the other comments that finding reliable data is problematic, but as a suggestion of reasonably good data you could try Names. Names was a Jisc funded project that, as far as I know, isn't currently active, but the data available should be of reasonable quality I think. More details on the project are available at http://names.mimas.ac.uk/files/Final_Report_Names_Phase_Two_September_2011.pdf Names: for author names + identifiers - e.g. http://names.mimas.ac.uk/individual/25256.html?outputfields=identifiers (this one has an ISNI) Names also provides links to journal articles - e.g. for the same person http://names.mimas.ac.uk/individual/25256.html?outputfields=resultpublications You could then use the Crossref DOI lookup service to get journal identifiers. Not sure this will get you what you need but it might be worth a look Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 9 Jul 2013, at 16:32, Paul Albert paa2...@med.cornell.edu wrote: I am exploring methods for author disambiguation, and I would like to have access to one or more well-disambiguated data sets containing: – a unique author identifier (email address, institutional identifier) – a unique article identifier (PMID, DOI, etc.) – a unique journal identifier (ISSN) Definition for well-disambiguated – for a given set of authors, you know the identity of their journal articles to a precision and recall of greater than 90-95%. Any ideas? thanks, Paul Paul Albert Project Manager, VIVO Weill Cornell Medical Library 646.962.2551
Re: [CODE4LIB] best way to make MARC files available to anyone
On 13 Jun 2013, at 02:57, Dana Pearson dbpearsonm...@gmail.com wrote: quick followup on the thread.. github: I looked at the cooperhewitt collection but don't see a way to download the content...I could copy and paste their content but that may not be the best approach for my files...documentation is thin, seems i would have to provide email addresses for those seeking access...but clearly that is not the case with how the cooperhewitt archive is configured.. My primary concern has been to make it as simple a process as possible for libraries which have limited technical expertise. I suspect from what you say that GitHub is not what you want in this case. However, I just wanted to clarify that you can download files as a Zip file (e.g. for Cooper Hewitt https://github.com/cooperhewitt/collection/archive/master.zip), and that this link is towards the top left on each screen in GitHub. The repository is a public one (which is the default, and only option unless you have a paid account on GitHub) and you do not need to provide email addresses or anything else to access the files on a public repository Owen
Re: [CODE4LIB] best way to make MARC files available to anyone
Putting the files on GitHub might be an option - free for public repositories, and 38Mb should not be a problem to host there Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 12 Jun 2013, at 02:24, Dana Pearson dbpearsonm...@gmail.com wrote: I have crosswalked the Project Gutenberg RDF/DC metadata to MARC. I would like to make these files available to any library that is interested. I thought that I would put them on my website via FTP but don't know if that is the best way. Don't have an ftp client myself so was thinking that that may be now passé. I tried using Google Drive with access available via the link to two versions of the files, UTF8 and MARC8. However, it seems that that is not a viable solution. I can access the files with the URLs provided by setting the access to anyone with the URL but doesn't work for some of those testing it for me or with the links I have on my webpage.. I have five folders with files of about 38 MB total. I have separated the ebooks, audio books, juvenile content, miscellaneous and non-Latin scripts such as Chinese, Modern Greek. Most of the content is in the ebook folder. I would like to make access as easy as possible. Google Drive seems to work for me. Here's the link to my page with the links in case you would like to look at the folders. Works for me but not for everyone who's tried it. http://dbpearsonmlis.com/ProjectGutenbergMarcRecords.html thanks, dana -- Dana Pearson dbpearsonmlis.com
Re: [CODE4LIB] best way to make MARC files available to anyone
On 12 Jun 2013, at 14:06, Dana Pearson dbpearsonm...@gmail.com wrote: Thanks for the replies..I had looked at GitHub but thought it something different, ie, collaborative software development...I will look again Yes - that's the main use (git is version control software, GitHub hosts git repositories) - but of course git doesn't care what types of files you have under version control. It came to mind because I know it's been used to distribute metadata files before - e.g. this set of metadata from the Cooper Hewitt National Design Museum https://github.com/cooperhewitt/collection There could be some additional benefits gained through using git to version control this type of file, and GitHub to distribute them if you were interested, but it can act as simply a place to put the files and make them available for download. But of course the other suggestions would do this simpler task just as well. Owen
Re: [CODE4LIB] DOI scraping
I'd say yes to the investment in jQuery generally - not too difficult to get the basics if you already use javascript, and makes some things a lot easier It sounds like you are trying to do something not dissimilar to LibX http://libx.org ? (except via bookmarklet rather than as a browser plugin). Also looking for custom database scrapers it might be worth looking at Zotero translators, as they already exist for many major sources and I guess will be grabbing the DOI where it exists if they can http://www.zotero.org/support/dev/translators Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 17 May 2013, at 05:32, Fitchett, Deborah deborah.fitch...@lincoln.ac.nz wrote: Kia ora koutou, I’m wanting to create a bookmarklet that will let people on a journal article webpage just click the bookmarklet and get a permalink to that article, including our proxy information so it can be accessed off-campus. Once I’ve got a DOI (or other permalink, but I’ll cross that bridge later), the rest is easy. The trouble is getting the DOI. The options seem to be: 1. Require the user to locate and manually highlight the DOI on the page. This is very easy to code, not so easy for the user who may not even know what a DOI is let alone how to find it; and some interfaces make it hard to accurately select (I’m looking at you, ScienceDirect). 2. Live in hope of universal CoiNS implementation. I might be waiting a long time. 3. Work out, for each database we use, how to scrape the relevant information from the page. Harder/tedious to code, but makes it easy for the user. I’ve been looking around for existing code that something like #3. So far I’ve found: · CiteULike’s bookmarklet (jQuery at http://www.citeulike.org/bm - afaik it’s all rights reserved) · AltMetrics’ bookmarklet (jQuery at http://altmetric-bookmarklet.dsci.it/assets/content.js - MIT licensed) Can anyone think of anything else I should be looking at for inspiration? Also on a more general matter: I have the general level of Javascript that one gets by poking at things and doing small projects and then getting distracted by other things and then coming back some months later for a different small project and having to relearn it all over again. I’ve long had jQuery on my “I guess I’m going to have to learn this someday but, um, today I just wanna stick with what I know” list. So is this the kind of thing where it’s going to be quicker to learn something about jQuery before I get started, or can I just as easily muddle along with my existing limited Javascript? (What really are the pros and cons here?) Nāku noa, nā Deborah Fitchett Digital Access Coordinator Library, Teaching and Learning p +64 3 423 0358 e deborah.fitch...@lincoln.ac.nzmailto:deborah.fitch...@lincoln.ac.nz | w library.lincoln.ac.nzhttp://library.lincoln.ac.nz/ Lincoln University, Te Whare Wānaka o Aoraki New Zealand's specialist land-based university P Please consider the environment before you print this email. The contents of this e-mail (including any attachments) may be confidential and/or subject to copyright. Any unauthorised use, distribution, or copying of the contents is expressly prohibited. If you have received this e-mail in error, please advise the sender by return e-mail or telephone and then delete this e-mail together with all attachments from your system.
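The bookmarklet itself would be JavaScript, but the per-page scraping logic under discussion can be sketched separately. Below is a rough, hedged illustration in Python: look for a citation_doi / dc.identifier meta tag (which many, but not all, platforms emit), then fall back to a DOI-shaped pattern in the page text. The meta tag names and regex are assumptions, not a description of any particular database.

import re
import requests

META_RE = re.compile(
    r'<meta[^>]+name=["\'](?:citation_doi|dc\.identifier)["\'][^>]+content=["\']([^"\']+)["\']',
    re.IGNORECASE,
)
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"\'<>]+')

def find_doi(url):
    # Try an explicit DOI meta tag first, then fall back to a DOI-shaped string.
    page = requests.get(url, timeout=30).text
    m = META_RE.search(page)
    if m and m.group(1).strip().startswith("10."):
        return m.group(1).strip()
    m = DOI_RE.search(page)
    return m.group(0).rstrip(".,;") if m else None

# Hypothetical article page:
# print(find_doi("https://example.com/journal/article/123"))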
[CODE4LIB] British Library Directory of Libraries (probably of interest to UK only)
The British Library has a directory of library codes used by UK registered users of its Document Supply service. The Directory of Library Codes enables British Library customers to convert into names and addresses the library codes they are given in response to location searches. It also indicates each library's supply and charging policies. More information at http://www.bl.uk/reshelp/atyourdesk/docsupply/help/replycodes/dirlibcodes/ As far as I know the only format this data has ever been made available in is PDF. I've always thought this a shame, so I've written a scraper on ScraperWiki to extract the data from the PDF and make it available as structured, query-able data. The scraper and output are at https://scraperwiki.com/scrapers/british_library_directory_of_library_codes/ Just in case anyone would find it useful. Also, any suggestions for improving the scraper are welcome (I don't usually write Python so the code is probably even ropier than my normal code :) Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936
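The actual scraper lives on ScraperWiki (linked above) and is not reproduced here. Purely as an illustration of the general approach - pull the text out of the PDF and match each directory entry with a pattern - here is a sketch using the pdfplumber library. The assumed line layout (a code followed by the library name) is invented for the example; the real PDF layout would need to be inspected first.

import csv
import re
import pdfplumber

ENTRY_RE = re.compile(r"^([A-Z]{2}/[A-Z0-9-]+)\s+(.+)$")  # hypothetical code pattern

def scrape_pdf(path, out_csv):
    # Walk the pages, keep any line that looks like "<CODE> <Name...>".
    rows = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            for line in (page.extract_text() or "").splitlines():
                m = ENTRY_RE.match(line.strip())
                if m:
                    rows.append(m.groups())
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["code", "name"])
        writer.writerows(rows)
    return len(rows)

# Hypothetical filenames:
# print(scrape_pdf("dirlibcodes.pdf", "library_codes.csv"))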
Re: [CODE4LIB] You *are* a coder. So what am I?
Shambrarian: Someone who knows enough truth about how libraries really work, but not enough to go insane or be qualified as a real librarian. (See more at http://m.urbandictionary.com/#define?term=Shambrarian) More information available at http://shambrarian.org/ And Dave Pattern has published a handy guide to Librarian/Shambrarian interactions (DO NOT bore the librarian by showing them your Roy Tennant Fan Club membership card) http://daveyp.wordpress.com/2011/07/21/librarianshambrarian-venn-diagram/ Tongue firmly in cheek, Owen On 14 Feb 2013, at 00:22, Maccabee Levine levi...@uwosh.edu wrote: Andromeda's talk this afternoon really struck a chord, as I shared with her afterwards, because I have the same issue from the other side of the fence. I'm among the 1/3 of the crowd today with a CS degree and and IT background (and no MLS). I've worked in libraries for years, but when I have a point to make about how technology can benefit instruction or reference or collection development, I generally preface it with I'm not a librarian, but I shouldn't have to be defensive about that. Problem is, 'coder' doesn't imply a particular degree -- just the experience from doing the task, and as Andromeda said, she and most C4Lers definitely are coders. But 'librarian' *does* imply MLS/MSLS/etc., and I respect that. What's a library word I can use in the same way as coder? Maccabee -- Maccabee Levine Head of Library Technology Services University of Wisconsin Oshkosh levi...@uwosh.edu 920-424-7332
Re: [CODE4LIB] Directories of OAI-PMH repositories
Also see OpenDOAR http://www.opendoar.org We used this listing when building Core http://core.kmi.open.ac.uk/search - which aggregates and does full-text analysis and similarity matching across OA repositories Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 7 Feb 2013, at 23:19, Wilhelmina Randtke rand...@gmail.com wrote: Thanks! The list of lists is very helpful. -Wilhelmina Randtke On Thu, Feb 7, 2013 at 2:40 PM, Habing, Thomas Gerald thab...@illinois.eduwrote: Here is a registry of OAI-PMH repositories that we maintain (sporadically) here at Illinois: http://gita.grainger.uiuc.edu/registry/ Tom -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Phillips, Mark Sent: Thursday, February 07, 2013 2:13 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Directories of OAI-PMH repositories You could start here. http://www.openarchives.org/pmh/ Mark From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] on behalf of Wilhelmina Randtke [rand...@gmail.com] Sent: Thursday, February 07, 2013 2:03 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: [CODE4LIB] Directories of OAI-PMH repositories Is there a central listing of places that track and list OAI-PMH repository feeds? I have an OAI-PMH compliant repository, so now am looking for places to list that so that harvesters or anyone who is interested can find it. -Wilhelmina Randtke
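For anyone starting from a directory such as OpenDOAR, the harvesting step itself is fairly mechanical. A minimal sketch using the Sickle OAI-PMH client follows; the endpoint URL is a placeholder, and a real aggregator would loop over a list of base URLs taken from the registry, with proper error handling.

from sickle import Sickle

def harvest(base_url, limit=100):
    # Pull Dublin Core records from one OAI-PMH endpoint.
    sickle = Sickle(base_url)
    records = sickle.ListRecords(metadataPrefix="oai_dc", ignore_deleted=True)
    harvested = []
    for i, record in enumerate(records):
        if i >= limit:
            break
        md = record.metadata  # dict of lists, e.g. md.get("title"), md.get("identifier")
        harvested.append({
            "title": (md.get("title") or [""])[0],
            "identifier": (md.get("identifier") or [""])[0],
        })
    return harvested

# Placeholder endpoint:
# for rec in harvest("https://repository.example.ac.uk/oai"):
#     print(rec["title"], rec["identifier"])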
Re: [CODE4LIB] XMP Metadata to tab-delemited file
I'm not familiar with what XMP RDF/XML looks like but it might be worth using an RDF parser rather than using XSLT? Graphite (http://graphite.ecs.soton.ac.uk/) is pretty easy to use if you are comfortable with PHP Owen On 14 Jan 2013, at 19:09, Kyle Banerjee kyle.baner...@gmail.com wrote: On Sat, Jan 12, 2013 at 1:36 PM, Michael Hopwood mich...@editeur.orgwrote: I got as far as producing XMP RDF/XML files but the problem then remains; how to usefully manage these via XSLT transforms? The problem is that XMP uses an RDF syntax that comes in many flavours and doesn't result in a predictable set of xpaths to apply the XSLT to. XSLT is not a good tool for many kinds of XML processing. In your situation, string processing or scanning for what tags are present and then outputting in delimited text so you know what is where is probably a better way to go. kyle
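Graphite is PHP; if Python is closer to hand, rdflib does the same job. A hedged sketch of the "parse the RDF, then flatten to tab-delimited" idea follows. Note that XMP packets usually wrap the RDF in an xpacket/x:xmpmeta envelope, so the rdf:RDF element may need cutting out first - the crude slicing shown is a simplification.

import csv
from rdflib import Graph

def xmp_to_tsv(xmp_path, tsv_path):
    # Strip the XMP envelope (roughly), parse the RDF/XML, write triples as TSV.
    with open(xmp_path, encoding="utf-8", errors="replace") as f:
        raw = f.read()
    start, end = raw.find("<rdf:RDF"), raw.rfind("</rdf:RDF>")
    rdf_xml = raw[start:end + len("</rdf:RDF>")] if start != -1 else raw

    g = Graph()
    g.parse(data=rdf_xml, format="xml")

    with open(tsv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["subject", "predicate", "object"])
        for s, p, o in g:
            writer.writerow([s, p, o])
    return len(g)

# Hypothetical filenames:
# print(xmp_to_tsv("photo.xmp", "photo_metadata.tsv"))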
Re: [CODE4LIB] What is a coder?
I've been involved in running library/tech unconferences in the UK (the Mashed Library events http://mashedlibrary.com). For the second event (organised by Dave Pattern and others at the University of Huddersfield) we put together a very short list of things you could expect to get out the event (http://mashlib09.wordpress.com/2009/04/28/event-info-why-come-to-mashed-libary/) - the idea being these were things that could go on requests to attend the event. More recently we realised there was a lot of interest from staff on the cataloguing/metadata side of libraries to attend a more 'tech' oriented event but that institutions were often limiting the number of people who could attend, and it was these staff who often lost out as the event was judged to be more appropriate for others. Working with Tom Meehan at UCL and Celine Carty at the University of Cambridge (and others) we were able to put on an event that while still attracting tech staff was also squarely aimed at getting cataloguers/metadata people along - and this definitely worked in terms of the make up of attendees of that particular event. All of which is a preamble to saying - it might be worth putting together either a theoretical list, or direct testimonials, from people who have attended the conference in the past, ideally from a variety of library roles, with what they can/did get out of the conference. This could provide much needed evidence when applying to attend/travel? Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 29 Nov 2012, at 22:51, William Denton w...@pobox.com wrote: On 29 November 2012, Cary Gordon wrote: Obviously, we need to offer trainings on how to get funding to attend conferences. The should be collocated with the conferences. This is a good idea; this should be a BOF or something---how to hack your system to get funding---maybe report back with a lightning talk? Some folks have good funding support, which is great. Some don't, but given the different problems or constraints, what's worked or could work to get people to a Code4Lib conference (major or chapter)? I know some people pay their own way and some use vacation time to go ... be good to hear that approach too. If someone's looking to change what they're doing in the library/technology world, getting to Code4Lib however they can is something to seriously consider. Bill -- William Denton Toronto, Canada http://www.miskatonic.org/
Re: [CODE4LIB] COinS
Agreed. The SchemaBibex group is having some of this discussion, and I think the 'appropriate copy' problem is one the library community can potentially bring to the table. There are no guarantees, and it could be we end up with yet another set of standards/guidelines/practices that the wider world/web doesn't care about - but I think there is an opportunity to position this so that other services can see the benefits of pushing relevant data out, and search engines can see how it can be used to enhance their services. I suspect that discussing this and coming up with proposals in the context of Schema.org is the best bet (for the moment at least) at moving this kind of work from the current niche to a more mainstream position. I'd argue that matching resources (via descriptions) to availability to is now a more general problem than when OpenURL was conceived as the growth of subscription based services like Netflix/Kindle lending/Spotify etc. lead to the same issues. This is expressed on the SchemaBibex wiki http://www.w3.org/community/schemabibex/wiki/Why_Extend. Also several of the use cases described are in this area - http://www.w3.org/community/schemabibex/wiki/Use_Cases#Use_case:_Describe_library_holding.2Favailability_information, alongside use cases that look at how to describe scholarly articles http://www.w3.org/community/schemabibex/wiki/Use_Cases#Use_case:_journal_articles_and_other_periodical_publications If we are going to see adoption, I strongly believe the outcomes we are describing have to be compelling to search engines, and their users, as well as publishers and other service providers. It would be great to get more discussion of what a compelling proposal might look like on the SchemaBibex list http://lists.w3.org/Archives/Public/public-schemabibex/ or wiki http://www.w3.org/community/schemabibex/wiki/Main_Page Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 21 Nov 2012, at 07:37, Dave Caroline dave.thearchiv...@gmail.com wrote: In terms of vocabulary, Schema.org is “extensible” via several mechanisms including mashups with other vocabularies or, ideally, direct integration into the Schema.org namespace such as we’ve seen with RNews http://blog.schema.org/2011/09/extended-schemaorg-news-support.html , JobPostings http://blog.schema.org/2011/11/schemaorg-support-for-job-postings.html , and GoodRelations http://blog.schema.org/2012/11/good-relations-and-schemaorg.html . This is a win/win scenario, but it requires communities to prove they can articulate a sensible set of extensions and deliver the information in that model. Within the “bibliographic” community, this is the mandate set for the http://www.w3.org/community/schemabibex/ group. If you are disappointed with OpenURL metadata formats, poor support for COinS, and disappointing probabilities for content resolution, here’s your chance for leveraging SEO for those purposes. But... it is no good choosing a random extension if the Search engine is or will be blind to that particular method. As someone who likes to leverage SEO the right way so one does not get penalised, some standardisation is needed. Dave Caroline, waiting
Re: [CODE4LIB] OpenURL linking but from the content provider's point of view
The only difference between COinS and a full OpenURL is the addition of a link resolver address. Most databases that provide OpenURL links directly (rather than simply COinS) use some profile information - usually set by the subscribing library, although some based on information supplied by an individual user. If set by the library this is then linked to specific users by IP or by login. There are a couple(?) of generic base URLs you can use which will try to redirect to an appropriate link resolver based on IP range of the requester, with fallback options if it can't find an appropriate resolver (I think this is how the WorldCat resolver works? The 'OpenURL Router' in the UK definitely works like this) The LibX toolbar allows users to set their link resolver address, and then translates COinS into OpenURLs when you view a page - all user driven, no need for the data publisher to do anything beyond COinS There is also the 'cookie pusher' solution which ArXiv uses - where the user can set a cookie containing the base URL, and this is picked up and used by ArXiV (http://arxiv.org/help/openurl) Owen PS it occurs to me that the other part of the question is 'what metadata should be included in the OpenURL to give it the best chance of working with a link resolver'? Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 20 Nov 2012, at 19:39, David Lawrence david.lawre...@sdsu.edu wrote: I have some experience with the library side of link resolver code. However, we want to implement OpenURL hooks on our open access literature database and I can not find where to begin. SafetyLit is a free service of San Diego State University in cooperation with the World Health Organization. We already provide embedded metadata in both COinS and unAPI formats to allow its capture by Mendeley, Papers, Zotero, etc. Over the past few months, I have emailed or talked with many people and read everything I can get my hands on about this but I'm clearly not finding the right people or information sources. Please help me to find references to examples of the code that is required on the literature database server that will enable library link resolvers to recognize the SafetyLit.org metadata and allow appropriate linking to full text. SafetyLit.org receives more than 65,000 unique (non-robot) visitors and the database responds to almost 500,000 search queries every week. The most frequently requested improvement is to add link resolver capacity. I hope that code4lib users will be able to help. Best regards, David David W. Lawrence, PhD, MPH, Director Center for Injury Prevention Policy and Practice San Diego State University, School of Public Health 6475 Alvarado Road, Suite 105 San Diego, CA 92120 usadavid.lawre...@sdsu.edu V 619 594 1994 F 619 594 1995 Skype: DWL-SDCAwww.CIPPP.org -- www.SafetyLit.org
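Since the point above is that a COinS is just an OpenURL ContextObject without a resolver address, the translation step is small. A minimal sketch follows (in Python rather than the JavaScript a LibX-style tool would actually use); the resolver base URL is a placeholder that would be set per user or per library.

import html
import re
import requests

COINS_RE = re.compile(
    r'<span[^>]+class=["\']Z3988["\'][^>]+title=["\']([^"\']+)["\']',
    re.IGNORECASE,
)

def coins_to_openurls(page_url, resolver_base="https://resolver.example.ac.uk/openurl"):
    # Find COinS spans and prefix each encoded ContextObject with a resolver base URL.
    page = requests.get(page_url, timeout=30).text
    openurls = []
    for match in COINS_RE.finditer(page):
        context_object = html.unescape(match.group(1))
        openurls.append(f"{resolver_base}?{context_object}")
    return openurls

# Hypothetical page carrying COinS:
# for link in coins_to_openurls("https://example.org/article/abc"):
#     print(link)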
Re: [CODE4LIB] OpenURL linking but from the content provider's point of view
Failure rate on resolving DOIs via CrossRef is high enough that I'd argue for belt braces Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 21 Nov 2012, at 15:08, Young,Jeff (OR) jyo...@oclc.org wrote: If the referent has a DOI, then I would argue that rft_id=http://dx.doi.org/10.1145/2132176.2132212 is all you need. The descriptive information that typically goes in the ContextObject can be obtained (if necessary) by content-negotiating for application/rdf+xml. OTOH, if someone pokes this same URI from a browser instead, you will generally get redirected to the publisher's web site with the full-text close at hand. The same principle should apply for any bibliographic resource that has a Linked Data identifier. Jeff -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Owen Stephens Sent: Wednesday, November 21, 2012 9:55 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] OpenURL linking but from the content provider's point of view The only difference between COinS and a full OpenURL is the addition of a link resolver address. Most databases that provide OpenURL links directly (rather than simply COinS) use some profile information - usually set by the subscribing library, although some based on information supplied by an individual user. If set by the library this is then linked to specific users by IP or by login. There are a couple(?) of generic base URLs you can use which will try to redirect to an appropriate link resolver based on IP range of the requester, with fallback options if it can't find an appropriate resolver (I think this is how the WorldCat resolver works? The 'OpenURL Router' in the UK definitely works like this) The LibX toolbar allows users to set their link resolver address, and then translates COinS into OpenURLs when you view a page - all user driven, no need for the data publisher to do anything beyond COinS There is also the 'cookie pusher' solution which ArXiv uses - where the user can set a cookie containing the base URL, and this is picked up and used by ArXiV (http://arxiv.org/help/openurl) Owen PS it occurs to me that the other part of the question is 'what metadata should be included in the OpenURL to give it the best chance of working with a link resolver'? Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 20 Nov 2012, at 19:39, David Lawrence david.lawre...@sdsu.edu wrote: I have some experience with the library side of link resolver code. However, we want to implement OpenURL hooks on our open access literature database and I can not find where to begin. SafetyLit is a free service of San Diego State University in cooperation with the World Health Organization. We already provide embedded metadata in both COinS and unAPI formats to allow its capture by Mendeley, Papers, Zotero, etc. Over the past few months, I have emailed or talked with many people and read everything I can get my hands on about this but I'm clearly not finding the right people or information sources. Please help me to find references to examples of the code that is required on the literature database server that will enable library link resolvers to recognize the SafetyLit.org metadata and allow appropriate linking to full text. SafetyLit.org receives more than 65,000 unique (non-robot) visitors and the database responds to almost 500,000 search queries every week. 
The most frequently requested improvement is to add link resolver capacity. I hope that code4lib users will be able to help. Best regards, David David W. Lawrence, PhD, MPH, Director Center for Injury Prevention Policy and Practice San Diego State University, School of Public Health 6475 Alvarado Road, Suite 105 San Diego, CA 92120 usadavid.lawre...@sdsu.edu V 619 594 1994 F 619 594 1995 Skype: DWL-SDCAwww.CIPPP.org -- www.SafetyLit.org
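The content negotiation described in the quoted message can be tried directly against doi.org. A small sketch follows; the Accept header value is one example among several, not every registration agency supports every format, and - given the resolution failures mentioned above - the code treats a non-200 response as something to fall back from rather than a surprise.

import requests

def doi_metadata(doi, accept="application/rdf+xml"):
    # Ask the DOI resolver for machine-readable metadata via content negotiation.
    resp = requests.get(
        f"https://doi.org/{doi}",
        headers={"Accept": accept},
        timeout=30,
        allow_redirects=True,
    )
    if resp.status_code != 200:
        return None  # resolution failed - fall back to metadata you already hold
    return resp.text

# Using the DOI quoted in the thread:
# print(doi_metadata("10.1145/2132176.2132212"))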
Re: [CODE4LIB] SRU MARC fields with indicators
Thanks Karen - probably should have known that! That's the nice thing about MARC - always some new thing to cope with :) Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 6 Nov 2012, at 19:37, Karen Coyle li...@kcoyle.net wrote: The 9s are available in all indicator positions for local use as defined in the MARC record (not MARC21) spec. [1] So what is in the MARC21 spec under a particular tag is the non-local values. I suspect that most systems just ignore any '9's they encounter unless those are defined as part of local system processing. kc [1] http://www.loc.gov/marc/specifications/specrecstruc.html On 11/6/12 10:20 AM, Owen Stephens wrote: According to the MARC spec, 035 doesn't support '9' as a valid indicator. My very uneducated guess would be the invalid indicator is causing the underlying system not to index it? Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 6 Nov 2012, at 17:43, Alevtina Verbovetskaya alevtina.verbovetsk...@mail.cuny.edu wrote: Let's say I've defined these indexes in pqf.properties on the SRU server: index.marc.020 = 1=7 # ISBN index.marc.035:1 = 1=1211 # OCLC/utility number where first indicator is non-blank index.marc.100:1 = 1=1 # author where first indicator is non-blank I can use the ISBN index to search for records, e.g.: http://apps.appl.cuny.edu:5661/CENTRAL?version=1.1operation=searchRetrievequery=marc.020=9780801449437startRecord=1maximumRecords=15 I can also use the author index to search for records, e.g.: http://apps.appl.cuny.edu:5661/CENTRAL?version=1.1operation=searchRetrievequery=marc.100:1=ArmenterosstartRecord=1maximumRecords=15 So why can't I search for records by utility number (035) with a non-blank first indicator? http://apps.appl.cuny.edu:5661/CENTRAL?version=1.1operation=searchRetrievequery=marc.035:1=ebr10488669startRecord=1maximumRecords=15 If you're playing along, you'll notice that these all point to the same record. However, when I try to search for it with query=marc.035:1=util_num, I get no results. I thought maybe this was because there's already another 035 field (with blank indicators) that's an OCLC number so I temporarily removed it... but that didn't solve the issue. Anyone have any experience with this? I need to be able to search by 0359# and I can't figure out what I'm doing wrong. I would greatly appreciate some assistance! Thank you, Allie -- Alevtina (Allie) Verbovetskaya Web and Mobile Systems Librarian (Substitute) Office of Library Services City University of New York 555 W 57th St, 13th fl. New York, NY 10019 T: 646-313-8158 F: 646-216-7064 alevtina.verbovetsk...@mail.cuny.edu -- Karen Coyle kco...@kcoyle.net http://kcoyle.net ph: 1-510-540-7596 m: 1-510-435-8234 skype: kcoylenet
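For anyone following the SRU debugging above, the searchRetrieve request itself is just an HTTP GET. A small sketch follows; the base URL and query are the ones discussed in the thread, used only as an example of the request shape (the server in question may no longer be available).

import requests
import xml.etree.ElementTree as ET

SRU_NS = {"sru": "http://www.loc.gov/zing/srw/"}

def sru_count(base_url, query):
    # Issue a searchRetrieve request and report how many records matched.
    resp = requests.get(
        base_url,
        params={
            "version": "1.1",
            "operation": "searchRetrieve",
            "query": query,
            "maximumRecords": 0,
        },
        timeout=30,
    )
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    count = root.find("sru:numberOfRecords", SRU_NS)
    return int(count.text) if count is not None else 0

# Example query shape from the thread (index defined in pqf.properties):
# print(sru_count("http://apps.appl.cuny.edu:5661/CENTRAL", "marc.020=9780801449437"))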
Re: [CODE4LIB] SRU MARC fields with indicators
According to the MARC spec, 035 doesn't support '9' as a valid indicator. My very uneducated guess would be the invalid indicator is causing the underlying system not to index it? Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 6 Nov 2012, at 17:43, Alevtina Verbovetskaya alevtina.verbovetsk...@mail.cuny.edu wrote: Let's say I've defined these indexes in pqf.properties on the SRU server: index.marc.020 = 1=7 # ISBN index.marc.035:1 = 1=1211 # OCLC/utility number where first indicator is non-blank index.marc.100:1 = 1=1 # author where first indicator is non-blank I can use the ISBN index to search for records, e.g.: http://apps.appl.cuny.edu:5661/CENTRAL?version=1.1operation=searchRetrievequery=marc.020=9780801449437startRecord=1maximumRecords=15 I can also use the author index to search for records, e.g.: http://apps.appl.cuny.edu:5661/CENTRAL?version=1.1operation=searchRetrievequery=marc.100:1=ArmenterosstartRecord=1maximumRecords=15 So why can't I search for records by utility number (035) with a non-blank first indicator? http://apps.appl.cuny.edu:5661/CENTRAL?version=1.1operation=searchRetrievequery=marc.035:1=ebr10488669startRecord=1maximumRecords=15 If you're playing along, you'll notice that these all point to the same record. However, when I try to search for it with query=marc.035:1=util_num, I get no results. I thought maybe this was because there's already another 035 field (with blank indicators) that's an OCLC number so I temporarily removed it... but that didn't solve the issue. Anyone have any experience with this? I need to be able to search by 0359# and I can't figure out what I'm doing wrong. I would greatly appreciate some assistance! Thank you, Allie -- Alevtina (Allie) Verbovetskaya Web and Mobile Systems Librarian (Substitute) Office of Library Services City University of New York 555 W 57th St, 13th fl. New York, NY 10019 T: 646-313-8158 F: 646-216-7064 alevtina.verbovetsk...@mail.cuny.edu
Re: [CODE4LIB] open circ data
The University of Huddersfield released circulation data - see http://library.hud.ac.uk/data/usagedata/_readme.html The University of Lincoln also release some data linked from http://library.hud.ac.uk/wikis/mosaic/index.php/Project_Data (along with the Huddersfield data in a different format I think) The SALT project offers some data - although the project involves University of Manchester, University of Cambridge as well as Huddersfield and Lincoln, I think the data offered for download is just from Manchester but I could be wrong - data at http://vm-salt.mimas.ac.uk/data/ - and a recommender API based on the data http://copac.ac.uk/innovations/activity-data/?page_id=227 Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 26 Oct 2012, at 23:04, Jimmy Ghaphery jghap...@vcu.edu wrote: Are there any other repositories of circ data similar to the OhioLINK/OCLC project (http://www.oclc.org/research/activities/ohiolink/circulation.html ). I seem to remember a large set of British data, but I can't track that down. We have some eager IS grad students looking for data to use for a recommender engine and I'm looking forward to see what they might come up with. thanks for any pointers! -Jimmy -- Jimmy Ghaphery Head, Library Information Systems VCU Libraries 804-827-3551
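For students building a recommender from this kind of circulation data, the simplest starting point is item-to-item co-occurrence ("people who borrowed this also borrowed..."). A sketch follows; it assumes the loan data has already been reduced to (user, item) pairs, which is roughly the shape of the released data sets, although the exact fields differ between them.

from collections import defaultdict
from itertools import combinations

def build_cooccurrence(loans):
    # loans: iterable of (user_id, item_id) tuples.
    items_by_user = defaultdict(set)
    for user_id, item_id in loans:
        items_by_user[user_id].add(item_id)

    cooc = defaultdict(lambda: defaultdict(int))
    for items in items_by_user.values():
        for a, b in combinations(sorted(items), 2):
            cooc[a][b] += 1
            cooc[b][a] += 1
    return cooc

def recommend(cooc, item_id, n=5):
    # Rank other items by how often they were borrowed alongside item_id.
    related = cooc.get(item_id, {})
    return sorted(related, key=related.get, reverse=True)[:n]

# Made-up loan data for illustration:
cooc = build_cooccurrence([("u1", "bookA"), ("u1", "bookB"), ("u2", "bookA"), ("u2", "bookC")])
print(recommend(cooc, "bookA"))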
Re: [CODE4LIB] Q.: software for vendor title list processing
Are there any examples of data in this format in the wild we can look at? Also given KBART and ONIX for Serials Online Holdings have NISO involvement, is there any view on how these two activities complement each other? Thanks, Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 17 Oct 2012, at 09:47, Michael Hopwood mich...@editeur.org wrote: Hi Godmar, There is also ONIX for Serials Online Holdings (http://www.editeur.org/120/ONIX-SOH/). I'm copying in Tim Devenport who might say more. Best wishes, Michael -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Owen Stephens Sent: 16 October 2012 23:09 To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Q.: software for vendor title list processing I'm working on the JISC KB+ project that Tom mentioned. As part of the project we've been collating journal title lists from various sources. We've been working with members of the KBART steering group and have used KBART where possible, although we've been collecting data not covered by KBART. All the data we have at this level is published under a CC0 licence at http://www.kbplus.ac.uk/kbplus/publicExport - including a csv that uses the KBART data elements. The focus so far has been on packages negotiated by JISC in the UK - although in many cases the title lists may be the same as are made available in other markets. We also include what we call 'Master lists' which are an attempt to capture the complete list of titles and coverage offered by a content provider. We'd very much welcome any feedback on these exports, and of course be interested to know if anyone makes use of them. So far a lot of the work on collating/coverting/standardising the data has been done by hand - which is clearly not ideal. In the next phase of the project the KB+ project is going to work with the GoKB project http://gokb.org - as part of this collaboration we are currently working on ways of streamlining the data processing from publisher files or other sources, to standardised data. While we are still working on how this is going to be implemented, we are currently investigating the possibility of using Google/Open Refine to capture and re-run sets of rules across data sets from specific sources. We should be making progress on this in the next couple of months. Hope that's helpful Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 16 Oct 2012, at 20:23, Tom Pasley tom.pas...@gmail.com wrote: You might also be interested in the work at http://www.kbplus.ac.uk . The site is up at the moment, but I can't reach it for some reason... they have a public export page which you might want to know about http://www.kbplus.ac.uk/kbplus/publicExport Tom On Wed, Oct 17, 2012 at 8:12 AM, Jonathan Rochkind rochk...@jhu.edu wrote: I think KBART is such an effort. As with most library standards groups, there may not be online documentation of their most recent efforts or successes, but: http://www.uksg.org/kbart http://www.uksg.org/kbart/s5/**guidelines/data_formathttp://www.uksg .org/kbart/s5/guidelines/data_format On 10/16/2012 2:16 PM, Godmar Back wrote: Hi, at our library, there's an emerging need to process title lists from vendors for various purposes, such as checking that the titles purchased can be discovered via discovery system and/or OPAC. 
It appears that the formats in which those lists are provided are non-uniform, as is the process of obtaining them. For example, one vendor - let's call them Expedition Scrolls - provides title lists for download to Excel, but which upon closer inspection turn out to be HTML tables. They are encoded using an odd mixture of CP1250 and HTML entities. Other vendors use entirely different formats. My question is whether there are efforts, software, or anything related to streamlining the acquisition and processing of vendor title lists in software systems that aid in the collection development and maintenance process. Any pointers would be appreciated. - Godmar
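To make the problem Godmar describes concrete, here is a rough sketch of the kind of adapter being asked about: read one vendor's "Excel" file that is really a cp1250-encoded HTML table, unescape the entities, and write a small subset of KBART-style columns as tab-separated output. The column positions in the vendor table and the particular KBART fields chosen are assumptions for illustration - each vendor file would need its own mapping.

import csv
import html
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    # Collect the text of each <td> into rows.
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell, self._in_cell = [], [], [], False
    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell, self._cell = True, []
    def handle_endtag(self, tag):
        if tag == "td":
            self._row.append(html.unescape("".join(self._cell)).strip())
            self._in_cell = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = []
    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)

def vendor_html_to_kbart(in_path, out_path):
    parser = TableExtractor()
    with open(in_path, encoding="cp1250", errors="replace") as f:
        parser.feed(f.read())
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        w = csv.writer(out, delimiter="\t")
        w.writerow(["publication_title", "print_identifier", "online_identifier", "title_url"])
        for row in parser.rows:
            if len(row) >= 4:  # assumed column order: title, print ISSN, eISSN, URL
                w.writerow(row[:4])

# Hypothetical filenames:
# vendor_html_to_kbart("vendor_list.xls.html", "vendor_list_kbart.tsv")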
Re: [CODE4LIB] Q.: software for vendor title list processing
There are things that could be improved about the KBART guidelines (and you've picked on one here I definitely agree with). There is an interest group mailing list which can be used for discussion/feedback http://www.niso.org/lists/kbart_interest/ I suspect that for both approaches at the moment the question of uptake/compliance is the bigger issue. Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 17 Oct 2012, at 14:48, Jonathan Rochkind rochk...@jhu.edu wrote: I've always been a fan of ONIX for SOH, although never had the chance to use it -- but the spec is written nicely, based on my experience with this stuff, it actually accomplishes the goal of machine-readable statement of serial holdings (theoretically useful for print or online holdings) well. KBART, I have some concerns about, when it comes to holdings. Is there a place to send feedback to KBART? Just on a quick skim of the parts of interest to me, I am filled with alarm at how much missing the point this is: we recommend that the ISO 8601 date syntax should be used... For simplicity, '365D' will always be equivalent to one year, and '30D' will always be equivalent to one month, even in leap years and months that do not have 30 days. Totally missing the point of ISO 8601 to allow/encourage this when 1Y and 1M are available -- dealing with calendar dates is harder than one might naively think, and by trying to 'improve' on ISO 8601 like this, you just create a mess of ambiguous and difficult to deal with data. On 10/17/2012 5:11 AM, Owen Stephens wrote: Are there any examples of data in this format in the wild we can look at? Also given KBART and ONIX for Serials Online Holdings have NISO involvement, is there any view on how these two activities complement each other? Thanks, Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 17 Oct 2012, at 09:47, Michael Hopwood mich...@editeur.org wrote: Hi Godmar, There is also ONIX for Serials Online Holdings (http://www.editeur.org/120/ONIX-SOH/). I'm copying in Tim Devenport who might say more. Best wishes, Michael -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Owen Stephens Sent: 16 October 2012 23:09 To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Q.: software for vendor title list processing I'm working on the JISC KB+ project that Tom mentioned. As part of the project we've been collating journal title lists from various sources. We've been working with members of the KBART steering group and have used KBART where possible, although we've been collecting data not covered by KBART. All the data we have at this level is published under a CC0 licence at http://www.kbplus.ac.uk/kbplus/publicExport - including a csv that uses the KBART data elements. The focus so far has been on packages negotiated by JISC in the UK - although in many cases the title lists may be the same as are made available in other markets. We also include what we call 'Master lists' which are an attempt to capture the complete list of titles and coverage offered by a content provider. We'd very much welcome any feedback on these exports, and of course be interested to know if anyone makes use of them. So far a lot of the work on collating/coverting/standardising the data has been done by hand - which is clearly not ideal. 
In the next phase of the project the KB+ project is going to work with the GoKB project http://gokb.org - as part of this collaboration we are currently working on ways of streamlining the data processing from publisher files or other sources, to standardised data. While we are still working on how this is going to be implemented, we are currently investigating the possibility of using Google/Open Refine to capture and re-run sets of rules across data sets from specific sources. We should be making progress on this in the next couple of months. Hope that's helpful Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 16 Oct 2012, at 20:23, Tom Pasley tom.pas...@gmail.com wrote: You might also be interested in the work at http://www.kbplus.ac.uk . The site is up at the moment, but I can't reach it for some reason... they have a public export page which you might want to know about http://www.kbplus.ac.uk/kbplus/publicExport Tom On Wed, Oct 17, 2012 at 8:12 AM, Jonathan Rochkind rochk...@jhu.edu wrote: I think KBART is such an effort. As with most library standards groups, there may not be online documentation of their most recent efforts or successes, but: http://www.uksg.org/kbart http://www.uksg.org/kbart/s5/**guidelines/data_formathttp://www.uksg
Re: [CODE4LIB] Q.: software for vendor title list processing
This leads to three follow-up questions. First, is there software to translate/normalize existing vendor lists from vendors that have not yet adopted either of these standards into these formats? I'm thinking of a collection of adapters or converters, perhaps. Each would likely constitute small effort, but there would be benefits from sharing development and maintenance. Not that I'm aware of, but if I understand you then this is one of the tasks GoKB is undertaking in partnership with KB+ (the work I mentioned using Refine) Second, if holdings lists were provided in, or converted to, for instance the KBART format, what software understands these formats to further process them? In other words, is there immediate bang for the buck of adopting these standards? The KBART format was aimed at Link Resolver population - so I'd hope there was some immediate payback on this front, but I don't have any information on this Third, unsurprisingly, these efforts arose in the managements of serials because holdings there change frequently depending on purchase agreements, etc. It is my understanding that eBooks are now posing similar collection management challenges. Are there separate normative efforts for eBooks or is it believed that efforts such as KBART/ONIX can encompass eBooks as well? KBART definitely has ambitions to encompass eBooks as well. There are already some hooks for this (e.g. 'first author' field), and the working group is looking at how ebooks will work I think - Godmar
Re: [CODE4LIB] Q.: software for vendor title list processing
I'm working on the JISC KB+ project that Tom mentioned. As part of the project we've been collating journal title lists from various sources. We've been working with members of the KBART steering group and have used KBART where possible, although we've been collecting data not covered by KBART. All the data we have at this level is published under a CC0 licence at http://www.kbplus.ac.uk/kbplus/publicExport - including a csv that uses the KBART data elements. The focus so far has been on packages negotiated by JISC in the UK - although in many cases the title lists may be the same as are made available in other markets. We also include what we call 'Master lists' which are an attempt to capture the complete list of titles and coverage offered by a content provider. We'd very much welcome any feedback on these exports, and of course be interested to know if anyone makes use of them. So far a lot of the work on collating/coverting/standardising the data has been done by hand - which is clearly not ideal. In the next phase of the project the KB+ project is going to work with the GoKB project http://gokb.org - as part of this collaboration we are currently working on ways of streamlining the data processing from publisher files or other sources, to standardised data. While we are still working on how this is going to be implemented, we are currently investigating the possibility of using Google/Open Refine to capture and re-run sets of rules across data sets from specific sources. We should be making progress on this in the next couple of months. Hope that's helpful Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 16 Oct 2012, at 20:23, Tom Pasley tom.pas...@gmail.com wrote: You might also be interested in the work at http://www.kbplus.ac.uk . The site is up at the moment, but I can't reach it for some reason... they have a public export page which you might want to know about http://www.kbplus.ac.uk/kbplus/publicExport Tom On Wed, Oct 17, 2012 at 8:12 AM, Jonathan Rochkind rochk...@jhu.edu wrote: I think KBART is such an effort. As with most library standards groups, there may not be online documentation of their most recent efforts or successes, but: http://www.uksg.org/kbart http://www.uksg.org/kbart/s5/**guidelines/data_formathttp://www.uksg.org/kbart/s5/guidelines/data_format On 10/16/2012 2:16 PM, Godmar Back wrote: Hi, at our library, there's an emerging need to process title lists from vendors for various purposes, such as checking that the titles purchased can be discovered via discovery system and/or OPAC. It appears that the formats in which those lists are provided are non-uniform, as is the process of obtaining them. For example, one vendor - let's call them Expedition Scrolls - provides title lists for download to Excel, but which upon closer inspection turn out to be HTML tables. They are encoded using an odd mixture of CP1250 and HTML entities. Other vendors use entirely different formats. My question is whether there are efforts, software, or anything related to streamlining the acquisition and processing of vendor title lists in software systems that aid in the collection development and maintenance process. Any pointers would be appreciated. - Godmar
Re: [CODE4LIB] Citation manager -- ??? -- BePress Bulk-upload Excel spreadsheet
No idea if this is useful, but just to note that RefWorks also has an API in case that offers any more options to you in terms of pushing the data around http://rwt.refworks.com/rwapireference/ Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 15 Oct 2012, at 13:37, Mita Williams mita.willi...@gmail.com wrote: Here's the summary of the summary of what we've found out (the full summary is here: http://librarian.newjackalmanac.ca/2012/10/bibliographic-software-bepress.html ) Roy Tennant, some years ago, created a Perl script that collects the latest citations in Web of Science via their WSDL. Lisa Schiff kindly found this script and sent it my way. While the script is admittedly out of date (and Web of Science now provides an API for this sort of thing) it still will prove useful if and when we get to a point in which we want to automate and script our workflow. [Thank you Roy, Kirk Hastings, and Lisa Schiff!] And here are some things that we’ve figured out ourselves: Between Zotero and RefWorks, Zotero exports the cleanest results to excel format. RefWorks can easily export to Excel, while Zotero requires the use of an SQLite extension and this script ( https://github.com/RoyceKimmons/Zotero-to-Excel-SQLite-Export/blob/master/export.sql) kindly provided by Royce Kimmons. On this page ( http://royce.kimmons.me/node/24) Kimmons explains how one can select one or more Zotero collection/folder for export. No one that we know of has created an Excel macro to automate transferring the result of an export to Excel from RefWorks or Zotero to ease the cutting and pasting necessary to get the information into BePress’s Excel Spreadsheet. An alternative means of sharing citations is to avoid Excel exporting altogether and instead, have staff make their papers available on Zotero.org in a public library and have the IR coordinator use Zotero to download the citations using that are either tagged as appropriate (e.g. https://www.zotero.org/copystar/items/tag/publisherPDF) or those that have been placed in a given collection folder (e.g. https://www.zotero.org/copystar/items/collectionKey/THDEN26X). Papers on BePress can be added to Zotero on each item level page but not on a collection page. Improving this capability would require creating a special Zotero translator for BePress: https://github.com/zotero/translators/issues/212 Thank you everyone who has helped us work through this. I hope what we’ve learned proves useful to you as well. On Fri, Oct 5, 2012 at 3:20 PM, Mita Williams mita.willi...@gmail.comwrote: Yes, a partner in crime has asked a similar question in the bepress list and I've been talking to a Zotero developer as well. Once I get this pieces into context, I will definitely share back with the rest of the list. It's the least I can do. Much thanks all On Fri, Oct 5, 2012 at 12:43 PM, lindsey danis danis@gmail.comwrote: There is a discussion on this topic right now in the Digital Commons Google Group, fyi. On Fri, Oct 5, 2012 at 12:40 PM, Sam Kome sam_k...@cuc.claremont.edu wrote: At some point bring it back to the list, please. Enquiring minds want to know... Thanks, SK -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Roy Tennant Sent: Thursday, October 04, 2012 10:44 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Citation manager -- ??? 
-- BePress Bulk-upload Excel spreadsheet Mita, A while back (I mean at least six years ago) I wrote some code to take citations downloaded from an index provider, reformat them into bepress spreadsheet format, and bulk upload them. The purpose of the project was to identify published articles by University of California faculty, email them that we had citations of their work in our system, and wouldn't they like to upload their copy of their article into the repository? I don't have the numbers on that project, but I recall that it did boost submissions. Unfortunately, I think the code, which was likely crappy anyway, has long since moldered to dust on a server somewhere that I no longer have access to, but I can put you in touch with someone at UC who might be doing something like this. I'll email you off-list. Roy On Thu, Oct 4, 2012 at 9:32 AM, Mita Williams mita.willi...@gmail.com wrote: We're trying to figure out a workflow for our BePress IR and was curious if anyone in code4libland has developed something (an Excel macro? a Zotero export function?) that could take formatted citations and put them in the proper order so they could be bulk added to the BePress bulk upload Excel spreadsheet. Or perhaps there's an altogether different way of going about collecting, formatting, and adding such things for BePress. Everything counts in large amounts. Mita
Re: [CODE4LIB] CODE4LIB equivalent in UK?
Code4lib is a many headed beast :) It may depend on what you are looking for. (mailing list, IRC, conference, journal, etc.) Mashed Library is a set of events (I ran the first one and I've been involved in many of the subsequent ones), partially bourne out of my frustration that I never got to go to the Code4Lib or Access conferences in North America. There is no organising committee or particular restriction on using the name - so anyone can run a 'mashed library' event, and they can be whatever format you want. The events have tended to be one day, cheap or free to attend, and have at least some 'unconference' element, and often some 'hands on' time/practical sessions. The events have tended to target a mixture of developers and 'tech interested' people - of course the mix varies between events. The last one was in Cambridge this summer and was focussed very much on cataloguing/metadata - there is a collection of presentations and blog posts at http://www.mashcat.info if you want a flavour of this. After the first event I discussed with a few others the idea of having a mailing list etc. but in the end the question is always - why duplicate the code4lib mailing list? The original question asked about an equivalent 'British list', and I guess I've never really been sure what the point of it would be? What would be 'British' about it - what are the UK specific needs that can't be addressed on Code4lib? We use the same s/w generally, have the same code at our displosal etc. To cover off the other thing mentioned DevCSI is a JISC funded initiative which has run a wide variety of events. The focus is coders in UK HE - so not repositories specifically, nor libraries specifically - however there are regular events run by DevCSI that are in these spaces. DevCSI have also supported several of the Mashed Library events - they are interested in making events happen and generally supporing the developer community, not necessarily always running things themselves. They have run a big annual event 'Dev8D' for the last few years http://dev8d.org which has been a week long usually in February in London. They've also run one student developer event DevXS http://devxs.org - I'm not clear if this will be repeated Back to my question above - what is it that the code4lib list doesn't satisfy that people would like to see from a UK based list? Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 8 Oct 2012, at 09:14, Richard Wallis richard.wal...@dataliberate.com wrote: The Mashed Library folks might be fertile ground for gaining interest in a code4libuk. http://www.mashedlibrary.com ~Richard. On 7 October 2012 16:28, Tim Hill th...@astreetpress.com wrote: Here's another lurking UK code4libber! I work for a UK/US company, but I spend the bulk of my time in the UK (and never enough in the US to coincide with a code4lib meetup). I'd certainly be interested in getting the/a community more active in the UK. Tim Hill On Tue, Oct 2, 2012 at 9:12 AM, Simeon Warner simeon.war...@cornell.edu wrote: Have a look at http://devcsi.ukoln.ac.uk/ . This is mainly focused on repositories but seems somewhat similar from an outside view. Cheers, Simeon (lurking expat Brit) On 10/2/12 4:11 AM, Michael Hopwood wrote: Yes - my question was implicitly aimed at lurking UKavians. 
-Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Dave Caroline Sent: 02 October 2012 09:08 To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] CODE4LIB equivalent in UK? On Tue, Oct 2, 2012 at 8:55 AM, Michael Hopwood mich...@editeur.org wrote: I know that CODE4LIB isn't per se in the USA but it seems like a large number of its active users are. Is there an equivalent list that you folks know of? I dont know of an equivalent British list but there are a few of us brits about lurking in #cod4lib too (archivist) Dave Caroline -- Richard Wallis Founder, Data Liberate http://dataliberate.com Tel: +44 (0)7767 886 005 Linkedin: http://www.linkedin.com/in/richardwallis Skype: richard.wallis1 Twitter: @rjw IM: rjw3...@hotmail.com
Re: [CODE4LIB] Seeking examples of outstanding discovery layers
The stuff by Mitchell Whitelaw on Generous Interfaces (and he cites some aspects of Trove as an example of a generous interface) seems relevant to this discussion: Slides: http://www.slideshare.net/mtchl/generous-interfaces Paper: http://www.ica2012.com/files/data/Full%20papers%20upload/ica12Final00423.pdf Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936
Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google
The JISC funded CLOCK project did some thinking around cataloguing processes and tracking changes to statements and/or records - e.g. http://clock.blogs.lincoln.ac.uk/2012/05/23/its-a-model-and-its-looking-good/ Not solutions of course, but hopefully of interest Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 28 Aug 2012, at 19:43, Simon Spero sesunc...@gmail.com wrote: On Aug 28, 2012, at 2:17 PM, Joe Hourcle wrote: I seem to recall seeing a presentation a couple of years ago from someone in the intelligence community, where they'd keep all of their intelligence, but they stored RDF quads so they could track the source. They'd then assign a confidence level to each source, so they could get an overall level of confidence on their inferences. […] It's possible that it was in the context of provenance, but I'm getting bogged down in too many articles about people storing provenance information using RDF-triples (without actually tracking the provenance of the triple itself) Provenance is of great importance in the IC and related sectors. An good overview of the nature of evidential reasoning is David A Schum (1994;2001). Evidential Foundations of Probabilistic Reasoning. Wiley Sons, 1994; Northwestern University Press, 2001 [Paperback edition]. There are usually papers on provenance and associated semantics at the GMU Semantic Technology for Intelligence, Defense, and Security (STIDS). This years conference is 23 - 26 October 2012; see http://stids.c4i.gmu.edu/ for more details. Simon
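The "store quads so you can track the source" idea in the quoted discussion maps directly onto named graphs. Purely as an illustration (not what the systems mentioned actually used), here is a small rdflib sketch where each source gets its own graph, so every statement can be traced back to where it came from; the URIs and statements are made up.

from rdflib import Dataset, Literal, Namespace, URIRef

EX = Namespace("http://example.org/")

ds = Dataset()

# Statements from source A go into graph A, source B into graph B.
graph_a = ds.graph(URIRef("http://example.org/source/reportA"))
graph_a.add((EX.item1, EX.claims, Literal("published 1923")))

graph_b = ds.graph(URIRef("http://example.org/source/catalogueB"))
graph_b.add((EX.item1, EX.claims, Literal("published 1932")))

# Ask "who said what": iterate quads, the context identifies the source.
for s, p, o, context in ds.quads((EX.item1, None, None, None)):
    print(s, p, o, "-- asserted in", context)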
[CODE4LIB] Open data and Research Libraries UK
Hello all I've been commissioned by Research Libraries UK (RLUK) to look at the possibility of making RLUK data openly available, and the related issues and challenges. As part of this work it is important for us to understand who the audience for such open data might be, how they might use the data, and what licences, formats and mechanisms will best support this use. I hope you are able to help by completing the survey linked below. To give a bit more detail on the data we are talking about. Research Libraries UK, through JISC and MIMAS, makes available a large database of bibliographic data. RLUK estimates that approximately 16 million bibliographic records in its database are free from restrictions in terms of redistribution and open licensing. RLUK is committed to the principle of open bibliographic data, and is a signatory to the JISC Discovery Open Metadata Principles (http://discovery.ac.uk/businesscase/principles/). RLUK would therefore like to determine the most effective way of publishing the available records as open metadata, with an emphasis on enabling reuse. The survey should only take about 10 minutes to complete and is available at: https://www.surveymonkey.com/s/5RH8KH8 Thanks and best wishes Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936
Re: [CODE4LIB] Recommendations for a teaching OPAC?
On 3 Aug 2012, at 15:56, Joseph Montibello joseph.montibe...@dartmouth.edu wrote: search, you could probably do worse than to install Blacklight. It probably doesn't really meet the simple criteria - there's a lot more to it than I could talk about. But getting it out of the box, turned on, and searching against a few records is something that you and students could probably manage. I've got a year of unix/ssh/command line experience and with a bit of mucking about, googling, and asking for help I was able to get a local (non-production) instance up and running, so it's definitely easy enough. I'd agree - either Blacklight http://projectblacklight.org or VuFind http://vufind.org are straightforward to get running. I've found Blacklight setup using the Ruby Gem very easy both on Windows and OS X. Since they are both powered by Solr and use SolrMARC there are a lot of similarities on the indexing/searching side. However on the interface side they differ in terms of setup - so it might be this that would sway you one way or the other (or a preference for PHP (VuFind) or Ruby (Blacklight)). Lesson: Interfaces, usability, accessibility Exercise: Use the OPAC, populate it with some data, assess its usability Once you've got VuFind/Blacklight set up, populating with data is a matter of uploading some MARC21 records - Blacklight comes with some test records bundled, I suspect VuFind does too but can't remember Lesson: HTML/CSS Exercise: Use CSS to skin the OPAC, customize the HTML for your site This is slightly more complex I guess - both systems can be highly customised, but in either case it isn't necessarily just a matter of editing CSS or HTML. Both use templating systems and both have configuration files that control certain aspects of the interface (e.g. what is searched, how facets display). CSS is probably more straightforward - VuFind you can just drop in CSS to override the default - not sure about Blacklight Lesson: Data management, search, IR Exercise: See if we can peek under the hood at how the OPAC's search works I think this would be the real strength of using Blacklight/VuFind - Solr/Lucene is a powerful combination, and used widely outside the library sector. You can also configure the indexing to a high degree - lots of options, the most basic of which I explore in http://www.meanboyfriend.com/overdue_ideas/2012/07/marc-and-solrmarc/ The thing I really like about this is students would see some of the complexity of MARC as well as some of its utility - and where it doesn't work well Lesson: Interfaces to data: databases, XML, SQL Exercise: Use the OPAC as a living example to work with those interfaces This is less well served by Blacklight/VuFind - no database, no SQL. This idea primarily came from trying to get some simple XML/SQL exercises that didn't suck (the setup for these environments is almost as involved as the exercises themselves), and the fact the previous classes really liked dissecting the nextgen catalogs we've explored from a software selection and 2.0 integration perspective. Unfortunately it may be that Blacklight/VuFind don't work for your scenario because they don't provide an environment for SQL. You could do some XML stuff (there are configuration files, and Solr can be updated via XML messages) - but I'm not clear whether this is the kind of XML work you want. However, I do think they open up some other avenues that are well worth exploring, and use technologies that are going to become more relevant in the future.
Another option might be BibServer, which uses Elasticsearch rather than Solr - but I've never tried installing it: http://bibserver.readthedocs.org/en/latest/install.html
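To give a flavour of the 'peek under the hood' lesson mentioned above, here is a rough Python sketch of querying the Solr index behind a Blacklight or VuFind install directly. The Solr URL and the field names are placeholders - they depend entirely on how your instance and SolrMARC config are set up - so treat this as an illustration rather than working instructions.

import json
import urllib.request, urllib.parse

SOLR_SELECT = "http://localhost:8983/solr/select"   # placeholder - point at your own Solr core

params = {
    "q": "dickens",            # free-text query
    "rows": 5,                 # just the first few results
    "facet": "true",           # ask for facet counts too
    "facet.field": "format",   # a facet field commonly configured by SolrMARC
    "wt": "json",              # JSON response writer
}
url = SOLR_SELECT + "?" + urllib.parse.urlencode(params)
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

print("Total matches:", data["response"]["numFound"])
for doc in data["response"]["docs"]:
    # field names depend on your indexing config - adjust to whatever holds the title
    print(doc.get("id"), "-", doc.get("title_display"))

Comparing a few hand-built queries like this with what the OPAC interface actually sends to Solr is a cheap way to show students what the discovery layer is doing.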
[CODE4LIB] Code and Catalogue data event
I'm happy to announce a new 'mashed library' event focussing on cataloguing data. The event (#mashcat) will be held in Cambridge (UK) on 5th July, and is free to attend. We hope to encourage a mixture of developers, cataloguers and metadata specialists to come along, exchanging ideas and knowledge. The programme has not yet been finalised but is likely to be a mixture of talks, and time to pursue ideas, discussions and projects. Details of the event are available from http://www.mashcat.info/ and you can register at http://mashcat.eventbrite.co.uk/ #mashcat is being supported by DevCSI (http://devcsi.ukoln.ac.uk/about/) I hope some of you can make it Owen -- Owen Stephens Consulting http://ostephens.com e: o...@ostephens.com t: 0121 288 6936 skype: owen.stephens
Re: [CODE4LIB] Repositories, OAI-PMH and web crawling
Thanks Jason and Ed, I suspect within this project we'll keep using OAI-PMH because we've got tight deadlines and the other project strands (which do stuff with the harvested content) need time from the developer. At the moment it looks like we will probably combine OAI-PMH with web crawling (using nutch) - so use data from the However, that said, one of the things we are meant to be doing is offering recommendations or good practice guidelines back to the (repository) community based on our experience. If we have time I would love to tackle the questions (a)-(d) that you highlight here - perhaps especially (a) and (c). Since this particular project is part of the wider JISC 'Discovery' programme (http://discovery.ac.uk and tech principles at http://technicalfoundations.ukoln.info/guidance/technical-principles-discovery-ecosystem) - from which one of the main themes might be summarised as 'work with the web' these questions are definitely relevant. I need to look at Jason's stuff again as I think this definitely has parallels with some of the Discovery work, as, of course, does some of the recent discussion on here about the question of the indexing of library catalogues by search engines. Thanks again to all who have contributed to the discussion - very useful Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 1 Mar 2012, at 11:42, Ed Summers wrote: On Mon, Feb 27, 2012 at 12:15 PM, Jason Ronallo jrona...@gmail.com wrote: I'd like to bring this back to your suggestion to just forget OAI-PMH and crawl the web. I think that's probably the long-term way forward. I definitely had the same thoughts while reading this thread. Owen, are you forced to stay within the context of OAI-PMH because you are working with existing institutional repositories? I don't know if it's appropriate, or if it has been done before, but as part of your work it would be interesting to determine: a) how many IRs allow crawling (robots.txt or lack thereof) b) how many IRs support crawling with a sitemap c) how many IR HTML splashpages use the rel-license [1] pattern d) how many IRs support syndication (RSS/Atom) to publish changes If you could do this in a semi-automated way for the UK it would be great if you could then apply it to IRs around the world. It would also align really nicely with the sort of work that Jason has been doing around CAPS [2]. It seems to me that there might be an opportunity to educate digital repository managers about better aligning their content w/ the Web ... instead of trying to cook up new standards. I imagine this is way out of scope for what you are currently doing--if so, maybe this can be your next grant :-) //Ed [1] http://microformats.org/wiki/rel-license [2] https://github.com/jronallo/capsys
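For reference, the basic OAI-PMH harvesting loop being discussed in this thread looks roughly like this in Python - a sketch only, not the CORE project's code, and the endpoint URL is invented:

import urllib.request, urllib.parse
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
BASE_URL = "http://repository.example.ac.uk/cgi/oai2"   # invented endpoint

def list_records(base_url, metadata_prefix="oai_dc"):
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        tree = ET.parse(urllib.request.urlopen(url))
        for record in tree.iter(OAI + "record"):
            yield record
        token = tree.find(".//" + OAI + "resumptionToken")
        if token is None or not (token.text or "").strip():
            break
        # follow-up requests carry only the verb and the resumption token
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

for rec in list_records(BASE_URL):
    header = rec.find(OAI + "header")
    print(header.findtext(OAI + "identifier"))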
Re: [CODE4LIB] Repositories, OAI-PMH and web crawling
Thanks Ian, Agree that it is clear from this discussion that there are differing viewpoints and also very different requirements depending on the context and desired outcomes. I think I said earlier in the thread - I'm not against niche solutions, they just make me want to double check that they are justified. For me I'd say the jury is still out on 'crawl' vs 'harvest' - but I think it definitely needs more investigation and thought - and of course different problems require different solutions. It would be interesting to try to go through the case for OAI-PMH, especially specific examples where it has achieved something that would have been difficult/impossible to do with more general solutions. Not sure if that could be done here on list, or better/easier through other discussion - or both (possibly over that beer? :) From the CORE project, any 'best practice' would be focussed on institutional research publication repositories, and it seems highly unlikely that it will make a recommendation on 'crawl' vs 'harvest' - we just won't have time to do enough work on this to understand the pros/cons of these even from our own singular perspective. I think any recommendations are more along the lines of ensuring robots.txt is consistent with other policies; the impact of using splash pages as opposed to links to actual resources in the OAI-PMH feed; configuring access to embargoed papers (as per Raffaele's suggestion); how to deal with multi-part resources etc. Anything coming out of the project would, of course, be just one project's recommendations for JISC to consider, not more than that. Cheers, Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 1 Mar 2012, at 14:38, Ian Ibbotson wrote: Owen... Just wanted to say that, whilst I've been silent since my initial response, I'm not sure I agree with all the viewpoints presented here.. From a point of view of (for example, CultureGrid) I'm not sure what has been done could have been pragmatically achieved solely with web crawling as it's described in this thread. Don't have a problem with anything that's been written here. It certainly represents a great cross-section of viewpoints. However, from a jisc discovery perspective, I don't want to contribute to any confirmation bias that we could dispose of pesky old OAI. I'd be interested in providing a counter-point to any Best practice document that suggested we could. Ian. On Thu, Mar 1, 2012 at 12:36 PM, Owen Stephens o...@ostephens.com wrote: Thanks Jason and Ed, I suspect within this project we'll keep using OAI-PMH because we've got tight deadlines and the other project strands (which do stuff with the harvested content) need time from the developer. At the moment it looks like we will probably combine OAI-PMH with web crawling (using nutch) - so use data from the However, that said, one of the things we are meant to be doing is offering recommendations or good practice guidelines back to the (repository) community based on our experience. If we have time I would love to tackle the questions (a)-(d) that you highlight here - perhaps especially (a) and (c). Since this particular project is part of the wider JISC 'Discovery' programme (http://discovery.ac.uk and tech principles at http://technicalfoundations.ukoln.info/guidance/technical-principles-discovery-ecosystem) - from which one of the main themes might be summarised as 'work with the web' these questions are definitely relevant.
I need to look at Jason's stuff again as I think this definitely has parallels with some of the Discovery work, as, of course, does some of the recent discussion on here about the question of the indexing of library catalogues by search engines. Thanks again to all who have contributed to the discussion - very useful Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 1 Mar 2012, at 11:42, Ed Summers wrote: On Mon, Feb 27, 2012 at 12:15 PM, Jason Ronallo jrona...@gmail.com wrote: I'd like to bring this back to your suggestion to just forget OAI-PMH and crawl the web. I think that's probably the long-term way forward. I definitely had the same thoughts while reading this thread. Owen, are you forced to stay within the context of OAI-PMH because you are working with existing institutional repositories? I don't know if it's appropriate, or if it has been done before, but as part of your work it would be interesting to determine: a) how many IRs allow crawling (robots.txt or lack thereof) b) how many IRs support crawling with a sitemap c) how many IR HTML splashpages use the rel-license [1] pattern d) how many IRs support syndication (RSS/Atom) to publish changes If you could do this in a semi-automated way for the UK it would be great if you could then apply it to IRs around the world
Re: [CODE4LIB] Repositories, OAI-PMH and web crawling
of full-text indexing in such indexes as Google or Summon, nor does it allow them to restrict where copies are served from. Similarly, the dc:rights section in the OAI-PMH records addresses copyright only. In practice, Google crawls, indexes, and serves full-text copies of our dissertations. Of course, it is absolutely reasonable that some content either not be open or have an embargo period - in which case I'd expect it to either not be added to the repository, or added and protected by some security which prevents public access. I know that in some cases authors wish to delay release of the thesis in order to publish a book which may draw on the PhD research - and this can take several years, although different institutions set different limits on this. I also know of at least one case where a PhD contained information that was deemed so confidential, it was agreed never to release it (I wasn't allowed to know what the information was!) In theory copyright could be seen as sufficient to cover the use of the full-text item by third parties - either Google is protected by fair use (in the US anyway) or not. Unfortunately (and this would certainly be true in the UK) - the only way of really discovering if you have a case against Google would be to take them to court. Google would say (as they did to the newspapers) "it's easy to request we don't index/cache your content - we obey robots.txt". Which sort of brings me back to the starting point of the project I'm working on - while two wrongs don't make a right, it seems to us that if repositories are not preventing Google (or others - for example notably CiteSeerX is in the business of crawling repositories http://csxstatic.ist.psu.edu/about/crawler) crawling/indexing/caching their content, then we hope that a non-profit, publicly funded, service should feel able to do the same in the interests of making the content of repositories more discoverable and more widely disseminated. Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936
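The "would a well-behaved crawler be allowed to fetch this?" check described here is easy to automate with Python's standard library; a minimal sketch (both URLs are invented):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://repository.example.ac.uk/robots.txt")   # invented repository
rp.read()

pdf_url = "http://repository.example.ac.uk/123/1/thesis.pdf"
for agent in ("Googlebot", "SomeAggregatorBot"):
    print(agent, "may fetch the PDF:", rp.can_fetch(agent, pdf_url))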
Re: [CODE4LIB] Repositories, OAI-PMH and web crawling
On 27 Feb 2012, at 13:31, Diane Hillmann wrote: On Mon, Feb 27, 2012 at 5:25 AM, Owen Stephens o...@ostephens.com wrote: providers provide such intermediate pages (arxiv.org, for instance). The other issue driving providers towards intermediate pages is that it allows them to continue to derive statistics from usage of their materials, which direct access URIs and multiple web caches don't. For providers dependent on external funding, this is a biggie. Definitely proof of use is a big issue - and one I've seen in other scenarios (for example, museums discussing whether to open up access to collections online) although it really feels like the tail wagging the dog. However, if this is *the* key issue for repositories then it would be good to look at alternative approaches - for example it would be possible to provide an API back from services with usage stats per paper/URI, or possibly simply pass on 'clicks' when a cached paper is accessed. I realise that this depends on cooperation of the third party, and you aren't going to always get this - but then, perfect tracking of use is never going to happen. Perhaps we need to both be more robust in justifying open access as part of a public good mission (otherwise you could just leave it to the publishers?) and consider the question of measuring and reporting impact of offering papers in repositories in a more sophisticated way. On the other hand, it may be that repository managers/institutions have other reasons for not wanting the full-text to be directly accessed - e.g. they believe that it would be against some of the terms and conditions set by publishers regarding self-archiving (or seen to be encouraging others to break the T&Cs?). Owen
Re: [CODE4LIB] Repositories, OAI-PMH and web crawling
On 24 Feb 2012, at 16:52, Ian Ibbotson wrote: Sorry.. late to the discussion... Isn't this a little apples and oranges? Surely robots.txt exists because many static resources are served directly from a tree structured filesystem? (Nearly) all OAI requests are responded to by specific service applications which are perfectly capable of deciding, on a resource-by-resource basis, if an anonymous user should or should not see a given resource. As has been said, why would you list a resource in OAI if you didn't think **someone** would find it useful? If you want to take something out of circulation, you mark it deleted so that clients connecting for updates know it should be removed. OAI isn't about fully enumerating a tree on every visit to see what's new, it's about a short and efficient visit to ask "What, if anything, has changed since I was last here?" I don't want to have to walk an entire repository of 3 million items to discover item 299 was deleted.. I want a message to say "Oh, item 299 was removed on X." I agree about OAI being an efficient way of finding changes to harvested content, and perhaps for repositories on the scale of millions of items it would be needed (although if you get to that scale, perhaps other approaches like dumps of data and deltas would be even better?) - however, most institutional repositories aren't close to this scale (yet?). I also agree there is a bit of apples and oranges here - they aren't exactly the same thing. However, in some scenarios - and I think really the main ones - the intended outcome seems to be the same. Google Scholar seems to me to be the main point of comparison - this harvests metadata (if correctly embedded in HTML meta tags) but does it via crawling web pages, not OAI-PMH. Because of the advantages of being in Google Scholar (people use it!) repositories support this mechanism anyway - making OAI-PMH an additional overhead. My investigations so far definitely suggest these multiple routes lead to inconsistencies in configuration of different mechanisms. I don't think my thoughts on it are completely clear either! But OAI-PMH is clearly 'niche' compared to the web, and while niche is sometimes needed, it always makes me slightly jumpy :) Owen
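For comparison, the "what has changed since I was last here" pattern is a single OAI-PMH request using the from argument, with deletions signalled by status="deleted" on the record header. A sketch (endpoint invented, resumption tokens omitted for brevity):

import urllib.request, urllib.parse
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
BASE_URL = "http://repository.example.ac.uk/cgi/oai2"   # invented endpoint
params = {"verb": "ListRecords", "metadataPrefix": "oai_dc", "from": "2012-02-01"}

tree = ET.parse(urllib.request.urlopen(BASE_URL + "?" + urllib.parse.urlencode(params)))
for record in tree.iter(OAI + "record"):
    header = record.find(OAI + "header")
    identifier = header.findtext(OAI + "identifier")
    if header.get("status") == "deleted":
        print("remove from local index:", identifier)
    else:
        print("add or update:", identifier)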
Re: [CODE4LIB] Repositories, OAI-PMH and web crawling
On 24 Feb 2012, at 18:20, Joe Hourcle wrote: On Feb 24, 2012, at 9:25 AM, Kyle Banerjee wrote: I see it like the people who request that their pages not be cached elsewhere -- they want to make their object 'discoverable', but they want to control the access to those objects -- so it's one thing for a search engine to get a copy, but they don't want that search engine being an agent to distribute copies to others. That's maybe true - certainly some repositories publish policy statements that imply this type of thinking - e.g. a typical phrase used is "Full items must not be harvested by robots except transiently for full-text indexing or citation analysis". This type of policy is usually made available via OAI-PMH 'Identify'. There are some issues with this. Firstly, textual policy statements like this don't help when you want to machine harvest many repositories. Secondly, these statements won't ever be seen by a web crawler. Thirdly, 'transiently' is not defined. Lastly, the limitation to two specific uses seems odd - for instance it would seem to me that semantic analysis of the text would not strictly be covered by this - but was this the intention of those framing the policy, or did they just want to say "don't copy our stuff and serve it up from your own application" (of course, different repositories will have different views on this). Also some of the policies go further than this. For example the University of Cambridge policy states that *for metadata* "The metadata must not be re-used in any medium for commercial purposes without formal permission" - but does not block search engines from crawling in robots.txt - this is the kind of thing I see as inconsistent. I realise robots.txt is just a request to search engines, and isn't equivalent to a policy on reuse (e.g. a permissive robots.txt doesn't imply there is no copyright in the content being made available) - but there is no doubt that Google use the content they harvest for commercial purposes. So, this is a mixed message to some extent - meaning a well-behaved OAI-PMH harvester might feel more constrained than a well-behaved web crawler (even though I guess the legal situation would be pretty much the same for both in terms of actual rights to using the data harvested). Again, I don't mean to pick on Cambridge - they aren't the only institution to run this kind of policy, but they are one everyone will have heard of :) Eg, all of the journal publishers who charge access fees -- they want people to find that they have a copy of that article that you're interested in ... but they want to collect their $35 for you to read it. Agreed - this type of issue came up with Google News and led to the introduction of the 'first click free' programme (http://googlenewsblog.blogspot.com/2009/12/update-to-first-click-free.html) - although I'm not sure this is still in action? In the case of scientific data, the problem is that to make stuff discoverable, we often have to perform some lossy transformation to fit some metadata standard, and those standards rarely have mechanisms for describing error (accuracy, precision, etc.). You can do some science with the catalog records, but it's going to introduce some bias into your results, so you're typically better off getting the data from the archive.
(and sometimes, they have nice clean catalogs in FITS, VOTable, CDF, NetCDF, HDF or whatever their discipline's preferred data format is) This is going into areas I'm not so familiar with - at the moment the project I'm working on is looking at article level data only (so mostly pdfs with straightforward metadata) ... Also, I don't know if things have changed in the last year, but I seem to remember someone mentioning at last year's RDAP (Research Data Access Preservation) summit that Google had coordinated with some libraries for feeds from their catalogs, but was only interested in books, not other objects. I don't know how other search engines might use data from OAI-PMH, or if they'd filter it because they didn't consider it to be information they cared about. I don't think that Google ever used OAI-PMH to harvest metadata like this, although they did use it for sitemaps for a short time http://googlewebmastercentral.blogspot.com/2008/04/retiring-support-for-oai-pmh-in.html. It may be they have used it in specific cases to get library catalogue records, but I'm not aware of it. Thanks Owen
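As an aside, the textual policy statements mentioned earlier that repositories publish via the OAI-PMH Identify response can at least be pulled out mechanically, even if they can't be interpreted automatically; a minimal sketch (the endpoint is invented):

import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
BASE_URL = "http://repository.example.ac.uk/cgi/oai2"   # invented endpoint

tree = ET.parse(urllib.request.urlopen(BASE_URL + "?verb=Identify"))
print(tree.findtext(".//" + OAI + "repositoryName"))
for desc in tree.iter(OAI + "description"):
    # description blocks can contain arbitrary XML (eprints policy, oai-identifier, rights...),
    # so just flatten the text for a human to read
    print(" ".join(t.strip() for t in desc.itertext() if t.strip()))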
Re: [CODE4LIB] Repositories, OAI-PMH and web crawling
Thanks Peter - this sounds very interesting. My main plea would be that some consideration is given to how web search engines interact with the same data. If web search engines feel free to ignore policies, and are left to it by publishers (and I realise NISO doesn't have control over this!) then we end up with a 'might is right' scenario. So I believe we should be aiming at: Policies expressed in machine readable formats Policies that are realistically implementable on a (semi-) automated basis (that probably means 'not very nuanced') A single mechanism that both web crawlers, and any other mechanisms like OAI-PMH can follow I realise these may not be achievable, but just my thoughts Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 25 Feb 2012, at 22:18, Peter Noerr wrote: This post veers nearer to something I was going to add as an FYI, so here goes... FYI: NISO has recently started a working group to study best practices for discovery services. The ODI (=Open Discovery Initiative) working group is hoping to look at exactly this issue (how should a content provider tell a content requestor what it can have) among others (how to convey commercial restrictions, how to produce statistics meaningful to providers, discovery services, and consumers of the discovery service), and hopefully produce guidelines on procedures and formats, etc. for this. This is a new working group and its timescale doesn't expect any deliverables until Q3 of 2012, so it is a bit late to help Owen, but anyone who is interested in this may want to follow, from time to time, the NISO progress. Look at www.niso.org and find the ODI working group. If you're really interested contact the group to offer thoughts. And many of you may be contacted by a survey to find out your thoughts as part of the process, anyway. Just like the long reach of OCLC, there is no escaping NISO. Peter -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Joe Hourcle Sent: Friday, February 24, 2012 10:20 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Repositories, OAI-PMH and web crawling On Feb 24, 2012, at 9:25 AM, Kyle Banerjee wrote: One of the questions this raises is what we are/aren't allowed to do in terms of harvesting full-text. While I realise we could get into legal stuff here, at the moment we want to put that question to one side. Instead we want to consider what Google, and other search engines, do, the mechanisms available to control this, and what we do, and the equivalent mechanisms - our starting point is that we don't feel we should be at a disadvantage to a web search engine in our harvesting and use of repository records. Of course, Google and other crawlers can crawl the bits of the repository that are on the open web, and 'good' crawlers will obey the contents of robots.txt We use OAI-PMH, and while we often see (usually general and sometimes contradictory) statements about what we can/can't do with the contents of a repository (or a specific record), it feels like there isn't a nice simple mechanism for a repository to say don't harvest this bit. I would argue there is -- the whole point of OAI-PMH is to make stuff available for harvesting. If someone goes to the trouble of making things available via a protocol that exists only to make things harvestable and then doesn't want it harvested, you can dismiss them as being totally mental. 
I see it like the people who request that their pages not be cached elsewhere -- they want to make their object 'discoverable', but they want to control the access to those objects -- so it's one thing for a search engine to get a copy, but they don't want that search engine being an agent to distribute copies to others. Eg, all of the journal publishers who charge access fees -- they want people to find that they have a copy of that article that you're interested in ... but they want to collect their $35 for you to read it. In the case of scientific data, the problem is that to make stuff discoverable, we often have to perform some lossy transformation to fit some metadata standard, and those standards rarely have mechanisms for describing error (accuracy, precision, etc.). You can do some science with the catalog records, but it's going to introduce some bias into your results, so you're typically better off getting the data from the archive. (and sometimes, they have nice clean catalogs in FITS, VOTable, CDF, NetCDF, HDF or whatever their discipline's preferred data format is) ... Also, I don't know if things have changed in the last year, but I seem to remember someone mentioning at last year's RDAP (Research Data Access Preservation) summit that Google had coordinated with some libraries for feeds from their catalogs
Re: [CODE4LIB] Repositories, OAI-PMH and web crawling
On 24 Feb 2012, at 18:20, Joe Hourcle wrote: I see it like the people who request that their pages not be cached elsewhere -- they want to make their object 'discoverable', but they want to control the access to those objects -- so it's one thing for a search engine to get a copy, but they don't want that search engine being an agent to distribute copies to others. Also meant to say that Google (and others) support a 'Noarchive' instruction (not quite sure if this can be implemented in robots.txt or only via robots meta tags and x-robots-tags - if anyone can tell me I'd be grateful) which I think would fulfil this type of instruction - index, but don't keep a copy. Owen
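For what it's worth, my understanding is that noarchive cannot be expressed in robots.txt; it is carried in a robots meta tag in HTML pages or, for non-HTML resources such as PDFs, in an X-Robots-Tag response header. A quick way to check a given resource for the header (the URL is invented):

import urllib.request

url = "http://repository.example.ac.uk/123/1/thesis.pdf"   # invented URL
req = urllib.request.Request(url, method="HEAD")
with urllib.request.urlopen(req) as resp:
    directives = resp.headers.get("X-Robots-Tag", "")
print("noarchive requested:", "noarchive" in directives.lower())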
Re: [CODE4LIB] URL checking for the catalog
It's not quite the same thing, but I worked on a project a couple of years ago integrating references/citations into a learning environment (called Telstar, http://www8.open.ac.uk/telstar/), and looked at the question of how to deal with broken links from references. We proposed a more reactive mechanism than running link checking software. This clearly has some disadvantages, but I think a major advantage is the targeting of staff time towards those links that are being used. The mechanism proposed was to add a level of redirection, with an intermediary script checking the availability of the destination URL before either: a) passing the user on to the destination b) finding the destination URL unresponsive (e.g. 404), automatically reporting the issue to library staff, and directing the user to a page explaining that the resource was not currently responding and that library staff had been informed. Particularly, we proposed putting the destination URL into the rft_id of an OpenURL to achieve this, but this was only because it allowed us to piggyback on existing infrastructure using a standard approach - you could do the same with a simple script, with the destination URL as a parameter (if you are really interested, we created a new Source parser in SFX to do (a) and (b) ). Because we didn't necessarily have control over the URL in the reference, we also built a table that allowed us to map broken URLs being used in the learning environment to alternative URLs so we could offer a temporary redirect while we worked with the relevant staff to get corrections made to the reference link. There's some more on this at http://www.open.ac.uk/blogs/telstar/remit-toc/remit-the-open-university-approach/remit-providing-links-to-resources-from-references/6-8-3-telstar-approach/ although for some reason (my fault) this doesn't include a write-up of the link checking process/code we created. Of course, this approach is in no way incompatible with regular proactive link checking. Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 23 Feb 2012, at 17:02, Tod Olson wrote: There's been some recent discussion at our site about revi(s|v)ing URL checking in our catalog, and I was wondering if other sites have any strategies that they have found to be effective. We used to run some home-grown link checking software. It fit nicely into a shell pipeline, so it was easy to filter out sites that didn't want to be link checked. But still the reports had too many spurious errors. And with over a million links in the catalog, there are some issues of scale, both for checking the links and consuming any report. Anyhow, if you have some system you use as part of catalog link maintenance, or if there's some link checking software that you've had good experiences with, or if there's some related experience you'd like to share, I'd like to hear about it. Thanks, -Tod Tod Olson t...@uchicago.edu Systems Librarian University of Chicago Library
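A minimal sketch of that reactive check-then-redirect step, in Python; none of this is the actual Telstar/SFX code, and the reporting function is a stand-in for whatever alerting mechanism you use:

import urllib.request, urllib.error

def report_to_library_staff(url, problem):
    # stand-in for the real reporting step (email, helpdesk ticket, database row...)
    print("REPORT broken link:", url, problem)

def resolve(destination_url):
    """Check the destination before sending the user on to it."""
    try:
        req = urllib.request.Request(destination_url, method="HEAD")
        urllib.request.urlopen(req, timeout=10)
        return ("redirect", destination_url)
    except (urllib.error.HTTPError, urllib.error.URLError) as err:
        report_to_library_staff(destination_url, str(err))
        return ("error", "This resource is not currently responding; library staff have been informed.")

print(resolve("http://example.org/some/reference/target"))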
Re: [CODE4LIB] Repositories, OAI-PMH and web crawling
Thanks both... Kyle said: If someone goes to the trouble of making things available via a protocol that exists only to make things harvestable and then doesn't want it harvested, you can dismiss them ... True - but that's essentially what Southampton's configuration seems to say. Thomas said: The M in PMH still stands for Metadata, right? So opening an OAI-PMH server implicitly says you're willing to share metadata. I can certainly sympathize with sites wanting to do that but not necessarily wanting to offer anything more than normal end-user access to full text. This is a fair point - but I've yet to see an example of a robots.txt file that makes this distinction - that is, in general Google is not being told to not crawl and cache PDFs, while being granted explicit permission to crawl the metadata, no matter what the OAI-PMH situation. Kyle said: OAI-PMH runs on top of HTTP, so anything in robots.txt already applies -- i.e. if they want you to crawl metadata only but not download the objects themselves because they don't want to deal with the load or bandwidth charges, this should be indicated for all crawlers. OK - this suggests a way forward for me. Although I don't think we can regard robots.txt applying across the board for OAI-PMH (as in the Southampton example, the OAI-PMH endpoint is disallowed by robots.txt), it seems to make sense that given a resource identifier in the metadata we could use robots.txt (and I guess potentially x-robots-tag, assuming most of the resources are not simple HTML) to see whether a web crawler is permitted to crawl it, and so make the right decision about what we do. That sounds vaguely sensible (although I'm still left thinking, maybe we should just use a web crawler and ignore OAI-PMH but I guess this way we maybe get the best of both worlds). Thanks again (and of course further thoughts welcome) Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 24 Feb 2012, at 14:45, Thomas Dowling wrote: On 02/24/2012 09:25 AM, Kyle Banerjee wrote: We use OAI-PMH, and while we often see (usually general and sometimes contradictory) statements about what we can/can't do with the contents of a repository (or a specific record), it feels like there isn't a nice simple mechanism for a repository to say don't harvest this bit. I would argue there is -- the whole point of OAI-PMH is to make stuff available for harvesting. If someone goes to the trouble of making things available via a protocol that exists only to make things harvestable and then doesn't want it harvested, you can dismiss them as being totally mental. The M in PMH still stands for Metadata, right? So opening an OAI-PMH server implicitly says you're willing to share metadata. I can certainly sympathize with sites wanting to do that but not necessarily wanting to offer anything more than normal end-user access to full text. That said, in a world with unfriendly bots, the repository should still be making informed choices about controlling full text crawlers (robots.txt, meta tags, HTTP cache directives, etc etc.). -- Thomas Dowling thomas.dowl...@gmail.com
Re: [CODE4LIB] Namespace management, was Models of MARC in RDF
The other issue that the 'modelling' brings (IMO) is that the model influences use - or better the other way round, the intended use and/or audience should influence the model. This raises questions for me about the value of a 'neutral' model - which is what I perceive libraries as aiming for - treating users as a homogeneous mass with needs that will be met by a single approach. Obviously there are resource implications to developing multiple models for different uses/audiences, and once again I'd argue that an advantage of the linked data approach is that it allows for the effort to be distributed amongst the relevant communities. To be provocative - has the time come for us to abandon the idea that 'libraries' act as one where cataloguing is concerned, and our metadata serves the same purpose in all contexts? (I can't decide if I'm serious about this or not!) Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 11 Dec 2011, at 23:47, Karen Coyle wrote: Quoting Richard Wallis richard.wal...@talis.com: You get the impression that the BL chose a subset of their current bibliographic data to expose as LD - it was kind of the other way around. Having modeled the 'things' in the British National Bibliography domain (plus those in related domain vocabularies such as VIAF, LCSH, Geonames, Bio, etc.), they then looked at the information held in their [Marc] bib records to identify what could be extracted to populate it. Richard, I've been thinking of something along these lines myself, especially as I see the number of 'translating X to RDF' projects go on. I begin to wonder what there is in library data that is *unique*, and my conclusion is: not much. Books, people, places, topics: they all exist independently of libraries, and libraries cannot take the credit for creating any of them. So we should be able to say quite a bit about the resources in libraries using shared data points -- and by that I mean, data points that are also used by others. So once you decide on a model (as BL did), then it is a matter of looking *outward* for the data to re-use. I maintain, however, as per my LITA Forum talk [1] that the subject headings (without talking about quality thereof) and classification designations that libraries provide are an added value, and we should do more to make them useful for discovery. I know it is only semantics (no pun intended), but we need to stop using the word 'record' when talking about the future description of 'things' or entities that are then linked together. That word has so many built-in assumptions, especially in the library world. I'll let you battle that one out with Simon :-), but I am often at a loss for a better term to describe the unit of metadata that libraries may create in the future to describe their resources. Suggestions highly welcome. kc [1] http://kcoyle.net/presentations/lita2011.html -- Karen Coyle kco...@kcoyle.net http://kcoyle.net ph: 1-510-540-7596 m: 1-510-435-8234 skype: kcoylenet
Re: [CODE4LIB] Namespace management, was Models of MARC in RDF
On 11 Dec 2011, at 23:30, Richard Wallis wrote: There is no document I am aware of, but I can point you at the blog post by Tim Hodson [ http://consulting.talis.com/2011/07/british-library-data-model-overview/] who helped the BL get to grips with and start thinking Linked Data. Another by the BL's Neil Wilson [ http://consulting.talis.com/2011/10/establishing-the-connection/] filling in the background around his recent presentations about their work. Neil Wilson at the BL has indicated a few times that in principle the BL has no problem sharing the software they used to extract the relevant data from the MARC records, but that there are licensing issues around the s/w due to the use of a proprietary compiler (sorry, I don't have any more details so I can't explain any more than this). I'm not sure whether this extends to sharing the source that would tell us what exactly was happening, but I think this would be worth more discussion with Neil - I'll try to pursue it with him when I get a chance Owen
Re: [CODE4LIB] Models of MARC in RDF
Fair point. Just instinct on my part that putting it in a triple is a bit ugly :) It probably doesn't make any difference, although I don't think storing in a triple ensures that it sticks to the object (you could store the triple anywhere as well) Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 6 Dec 2011, at 22:43, Fleming, Declan wrote: Hi - point at it where? We could point back to the library catalog that we harvested in the MARC to MODS to RDF process, but what if that goes away? Why not write ourselves a 1K insurance policy that sticks with the object for its life? D -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Owen Stephens Sent: Tuesday, December 06, 2011 8:06 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Models of MARC in RDF I'd suggest that rather than shove it in a triple it might be better to point at alternative representations, including MARC if desirable (keep meaning to blog some thoughts about progressively enhanced metadata...) Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 6 Dec 2011, at 15:44, Karen Coyle wrote: Quoting Fleming, Declan dflem...@ucsd.edu: Hi - I'll note that the mapping decisions were made by our metadata services (then Cataloging) group, not by the tech folks making it all work, though we were all involved in the discussions. One idea that came up was to do a, perhaps, lossy translation, but also stuff one triple with a text dump of the whole MARC record just in case we needed to grab some other element out we might need. We didn't do that, but I still like the idea. Ok, it was my idea. ;) I like that idea! Now that disk space is no longer an issue, it makes good sense to keep around the original state of any data that you transform, just in case you change your mind. I hadn't thought about incorporating the entire MARC record string in the transformation, but as I recall the average size of a MARC record is somewhere around 1K, which really isn't all that much by today's standards. (As an old-timer, I remember running the entire Univ. of California union catalog on 35 megabytes, something that would now be considered a smallish email attachment.) kc D -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Esme Cowles Sent: Monday, December 05, 2011 11:22 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Models of MARC in RDF I looked into this a little more closely, and it turns out it's a little more complicated than I remembered. We built support for transforming to MODS using the MARC21slim2MODS.xsl stylesheet, but don't use that. Instead, we use custom Java code to do the mapping.
I don't have a lot of public examples, but there's at least one public object which you can view the MARC from our OPAC: http://roger.ucsd.edu/search/.b4827884/.b4827884/1,1,1,B/detlmarc~1234567FF=1,0, The public display in our digital collections site: http://libraries.ucsd.edu/ark:/20775/bb0648473d The RDF for the MODS looks like:
<mods:classification rdf:parseType="Resource"> <mods:authority>local</mods:authority> <rdf:value>FVLP 222-1</rdf:value> </mods:classification>
<mods:identifier rdf:parseType="Resource"> <mods:type>ARK</mods:type> <rdf:value>http://libraries.ucsd.edu/ark:/20775/bb0648473d</rdf:value> </mods:identifier>
<mods:name rdf:parseType="Resource"> <mods:namePart>Brown, Victor W</mods:namePart> <mods:type>personal</mods:type> </mods:name>
<mods:name rdf:parseType="Resource"> <mods:namePart>Amateur Film Club of San Diego</mods:namePart> <mods:type>corporate</mods:type> </mods:name>
<mods:originInfo rdf:parseType="Resource"> <mods:dateCreated>[196-]</mods:dateCreated> </mods:originInfo>
<mods:originInfo rdf:parseType="Resource"> <mods:dateIssued>2005</mods:dateIssued> <mods:publisher>Film and Video Library, University of California, San Diego, La Jolla, CA 92093-0175 http://orpheus.ucsd.edu/fvl/FVLPAGE.HTM</mods:publisher> </mods:originInfo>
<mods:physicalDescription rdf:parseType="Resource"> <mods:digitalOrigin>reformatted digital</mods:digitalOrigin> <mods:note>16mm; 1 film reel (25 min.) :; sd., col. ;</mods:note> </mods:physicalDescription>
<mods:subject rdf:parseType="Resource"> <mods:authority>lcsh</mods:authority> <mods:topic>Ranching</mods:topic> </mods:subject>
etc. There is definitely some loss in the conversion process -- I don't know enough about the MARC leader and control fields to know if they are captured in the MODS and/or RDF in some way. But there are quite
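To make the "stuff the MARC into a triple" versus "point at an alternative representation" choice from this thread concrete, a small sketch using rdflib; all URIs and the record string are invented for illustration:

from rdflib import Graph, URIRef, Literal, Namespace
from rdflib.namespace import DCTERMS

EX = Namespace("http://example.org/")                         # invented local namespace
book = URIRef("http://example.org/id/book/b4827884")
marc_blob = "00714cam a2200205 a 4500..."                     # stands in for the raw MARC record string

g = Graph()
# option (a): keep the whole record string in a triple, as an "insurance policy"
g.add((book, EX.sourceMarc, Literal(marc_blob)))
# option (b): point at an alternative representation instead, to be fetched when needed
g.add((book, DCTERMS.hasFormat, URIRef("http://example.org/marcxml/b4827884.xml")))

print(g.serialize(format="turtle"))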
Re: [CODE4LIB] Models of MARC in RDF
When I did a project converting records from UKMARC to MARC21 we kept the UKMARC records for a period (about 5 years I think) while we assured ourselves that we hadn't missed anything vital. We did occasionally refer back to the older record to check things, but having not found any major issues with the conversion after that period we felt confident disposing of the record. This is the type of usage I was imagining for a copy of the MARC record in this scenario. Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 7 Dec 2011, at 01:52, Montoya, Gabriela wrote: One critical thing to consider with MARC records (or any metadata, for that matter) is that they are not stagnant, so what is the value of storing entire record strings into one triple if we know that metadata is volatile? As an example, UCSD has over 200,000 art images that had their metadata records ingested into our local DAMS over five years ago. Since then, many of these records have been edited/massaged in our OPAC (and ARTstor), but these updated records have not been refreshed in our DAMS. Now we find ourselves needing to desperately have the "What is our database of record?" conversation. I'd much rather see resources invested in data synching than spent on saving text dumps that will most likely not be referred to again. Dream Team for Building a MARC RDF Model: Karen Coyle, Alistair Miles, Diane Hillmann, Ed Summers, Bradley Westbrook. Gabriela -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Karen Coyle Sent: Tuesday, December 06, 2011 7:44 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Models of MARC in RDF Quoting Fleming, Declan dflem...@ucsd.edu: Hi - I'll note that the mapping decisions were made by our metadata services (then Cataloging) group, not by the tech folks making it all work, though we were all involved in the discussions. One idea that came up was to do a, perhaps, lossy translation, but also stuff one triple with a text dump of the whole MARC record just in case we needed to grab some other element out we might need. We didn't do that, but I still like the idea. Ok, it was my idea. ;) I like that idea! Now that disk space is no longer an issue, it makes good sense to keep around the original state of any data that you transform, just in case you change your mind. I hadn't thought about incorporating the entire MARC record string in the transformation, but as I recall the average size of a MARC record is somewhere around 1K, which really isn't all that much by today's standards. (As an old-timer, I remember running the entire Univ. of California union catalog on 35 megabytes, something that would now be considered a smallish email attachment.) kc D -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Esme Cowles Sent: Monday, December 05, 2011 11:22 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Models of MARC in RDF I looked into this a little more closely, and it turns out it's a little more complicated than I remembered. We built support for transforming to MODS using the MARC21slim2MODS.xsl stylesheet, but don't use that. Instead, we use custom Java code to do the mapping.
I don't have a lot of public examples, but there's at least one public object which you can view the MARC from our OPAC: http://roger.ucsd.edu/search/.b4827884/.b4827884/1,1,1,B/detlmarc~1234567FF=1,0, The public display in our digital collections site: http://libraries.ucsd.edu/ark:/20775/bb0648473d The RDF for the MODS looks like:
<mods:classification rdf:parseType="Resource"> <mods:authority>local</mods:authority> <rdf:value>FVLP 222-1</rdf:value> </mods:classification>
<mods:identifier rdf:parseType="Resource"> <mods:type>ARK</mods:type> <rdf:value>http://libraries.ucsd.edu/ark:/20775/bb0648473d</rdf:value> </mods:identifier>
<mods:name rdf:parseType="Resource"> <mods:namePart>Brown, Victor W</mods:namePart> <mods:type>personal</mods:type> </mods:name>
<mods:name rdf:parseType="Resource"> <mods:namePart>Amateur Film Club of San Diego</mods:namePart> <mods:type>corporate</mods:type> </mods:name>
<mods:originInfo rdf:parseType="Resource"> <mods:dateCreated>[196-]</mods:dateCreated> </mods:originInfo>
<mods:originInfo rdf:parseType="Resource"> <mods:dateIssued>2005</mods:dateIssued> <mods:publisher>Film and Video Library, University of California, San Diego, La Jolla, CA 92093-0175 http://orpheus.ucsd.edu/fvl/FVLPAGE.HTM</mods:publisher> </mods:originInfo>
<mods:physicalDescription rdf:parseType="Resource"> <mods:digitalOrigin>reformatted digital
Re: [CODE4LIB] Namespace management, was Models of MARC in RDF
On 7 Dec 2011, at 00:38, Alexander Johannesen wrote: Hiya, Karen Coyle li...@kcoyle.net wrote: I wonder how easy it will be to manage a metadata scheme that has cherry-picked from existing ones, so something like: dc:title bibo:chapter foaf:depiction Yes, you're right in pointing out this as a problem. And my answer is: it's complicated. My previous rant on this list was about data models*, and dangnabbit if this isn't related as well. What your example is doing is pointing out a new model based on bits of other models. This works fine, for the most part, when the concepts are simple; simple to understand, simple to extend. Often you'll find that what used to be unclear has grown clear over time (as more and more have used FOAF, you'll find some things are more used and better understood, while other parts of it fade into 'we don't really use that anymore'). But when things get complicated, it *can* render your model unusable. Mixed data models can be good, but can also lead directly to meta data hell. For example: dc:title vs foaf:title. Ouch. Although not a biggie, I see this kind of discrepancy all the time, so the argument against mixed models is of course that the power of definition lies with you rather than some third-party that might change their mind (albeit rare) or have similar terms that differ (more often). I personally would say that the library world should define RDA as you need it to be, and worry less about reuse at this stage unless you know for sure that the external models do bibliographic meta data well. I agree this is a risk, and I suspect there is a further risk around simply the feeling of 'ownership' by the community - perhaps it is easier to feel ownership over an entire ontology than an 'application profile' of some kind. It may be that mapping is the solution to this, but if this is really going to work I suspect it needs to be done from the very start - otherwise it is just another crosswalk, and we'll get varying views on how much one thing maps to another (but perhaps that's OK - I'm not looking for perfection) That said, I believe we need absolutely to be aiming for a world in which we work with mixed ontologies - no matter what we do other, relevant, data sources will use FOAF, Bibo etc. I'm convinced that this gives us the opportunity to stop treating what are very mixed materials in a single way, while still exploiting common properties. For example, musical materials are really not well catered for in MARC, and we know there are real issues with applying FRBR to them - and I see the implementation of RDF/Linked Data as an opportunity to tackle this issue by adopting alternative ontologies where it makes sense, while still assigning common properties (dc:title) where this makes sense. HOWEVER! When we're done talking about ontologies and vocabularies, we need to talk about identifiers, and there I would swing the other way and let reuse govern, because it is when you reuse an identifier you start thinking about what that identifier means to *both* parties. Or, put differently: it's remarkably easier to get this right if the identifier is a number, rather than some word. And for that reason I'd say reuse identifiers (subject proxies) as they are easier to get right and bring a lot of benefits, but not ontologies (model proxies) as they can be very difficult to get right and don't necessarily give you what you want. Agreed :)
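As an illustration of the kind of mixed-vocabulary description being discussed (and of the dc:title / foaf:title point - foaf:title is a person's honorific, not a resource title), a sketch in Python with rdflib; the local URI and the VIAF URI are placeholders:

from rdflib import Graph, URIRef, Literal, Namespace
from rdflib.namespace import DC, FOAF, RDF

BIBO = Namespace("http://purl.org/ontology/bibo/")
book = URIRef("http://example.org/id/book/1")                  # invented local URI
author = URIRef("http://viaf.org/viaf/0000000000")             # placeholder for a reused external (VIAF) identifier

g = Graph()
g.add((book, RDF.type, BIBO.Book))
g.add((book, DC.title, Literal("Bleak House")))                # dc:title for the resource...
g.add((book, DC.creator, author))
g.add((author, RDF.type, FOAF.Person))
g.add((author, FOAF.name, Literal("Dickens, Charles")))        # ...foaf:name (not foaf:title) for the person

print(g.serialize(format="turtle"))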
Re: [CODE4LIB] Models of MARC in RDF
I'd suggest that rather than shove it in a triple it might be better to point at alternative representations, including MARC if desirable (keep meaning to blog some thoughts about progressively enhanced metadata...) Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 6 Dec 2011, at 15:44, Karen Coyle wrote: Quoting Fleming, Declan dflem...@ucsd.edu: Hi - I'll note that the mapping decisions were made by our metadata services (then Cataloging) group, not by the tech folks making it all work, though we were all involved in the discussions. One idea that came up was to do a, perhaps, lossy translation, but also stuff one triple with a text dump of the whole MARC record just in case we needed to grab some other element out we might need. We didn't do that, but I still like the idea. Ok, it was my idea. ;) I like that idea! Now that disk space is no longer an issue, it makes good sense to keep around the original state of any data that you transform, just in case you change your mind. I hadn't thought about incorporating the entire MARC record string in the transformation, but as I recall the average size of a MARC record is somewhere around 1K, which really isn't all that much by today's standards. (As an old-timer, I remember running the entire Univ. of California union catalog on 35 megabytes, something that would now be considered a smallish email attachment.) kc D -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Esme Cowles Sent: Monday, December 05, 2011 11:22 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Models of MARC in RDF I looked into this a little more closely, and it turns out it's a little more complicated than I remembered. We built support for transforming to MODS using the MARC21slim2MODS.xsl stylesheet, but don't use that. Instead, we use custom Java code to do the mapping. I don't have a lot of public examples, but there's at least one public object which you can view the MARC from our OPAC: http://roger.ucsd.edu/search/.b4827884/.b4827884/1,1,1,B/detlmarc~1234567FF=1,0, The public display in our digital collections site: http://libraries.ucsd.edu/ark:/20775/bb0648473d The RDF for the MODS looks like:
<mods:classification rdf:parseType="Resource"> <mods:authority>local</mods:authority> <rdf:value>FVLP 222-1</rdf:value> </mods:classification>
<mods:identifier rdf:parseType="Resource"> <mods:type>ARK</mods:type> <rdf:value>http://libraries.ucsd.edu/ark:/20775/bb0648473d</rdf:value> </mods:identifier>
<mods:name rdf:parseType="Resource"> <mods:namePart>Brown, Victor W</mods:namePart> <mods:type>personal</mods:type> </mods:name>
<mods:name rdf:parseType="Resource"> <mods:namePart>Amateur Film Club of San Diego</mods:namePart> <mods:type>corporate</mods:type> </mods:name>
<mods:originInfo rdf:parseType="Resource"> <mods:dateCreated>[196-]</mods:dateCreated> </mods:originInfo>
<mods:originInfo rdf:parseType="Resource"> <mods:dateIssued>2005</mods:dateIssued> <mods:publisher>Film and Video Library, University of California, San Diego, La Jolla, CA 92093-0175 http://orpheus.ucsd.edu/fvl/FVLPAGE.HTM</mods:publisher> </mods:originInfo>
<mods:physicalDescription rdf:parseType="Resource"> <mods:digitalOrigin>reformatted digital</mods:digitalOrigin> <mods:note>16mm; 1 film reel (25 min.) :; sd., col. ;</mods:note> </mods:physicalDescription>
<mods:subject rdf:parseType="Resource"> <mods:authority>lcsh</mods:authority> <mods:topic>Ranching</mods:topic> </mods:subject>
etc.
There is definitely some loss in the conversion process -- I don't know enough about the MARC leader and control fields to know if they are captured in the MODS and/or RDF in some way. But there are quite a few local and note fields that aren't present in the RDF. Other fields (e.g. 300 and 505) are mapped to MODS, but not displayed in our access system (though they are indexed for searching). I agree it's hard to quantify lossy-ness. Counting fields or characters would be the most objective, but has obvious problems with control characters sometimes containing a lot of information, and then the relative importance of different fields to the overall description. There are other issues too -- some fields in this record weren't migrated because they duplicated collection-wide values, which are formulated slightly differently from the MARC record. Some fields weren't migrated because they concern the physical object, and therefore don't really apply to the digital object. So that really seems like a morass to me. -Esme -- Esme Cowles escow...@ucsd.edu Necessity
Re: [CODE4LIB] Models of MARC in RDF
I think the strength of adopting RDF is that it doesn't tie us to a single vocab/schema. That isn't to say it isn't desirable for us to establish common approaches, but that we need to think slightly differently about how this is done - more application profiles than 'one true schema'. This is why RDA worries me - because it (seems to?) suggest that we define a schema that stands alone from everything else and that is used by the library community. I'd prefer to see the library community adopting the best of what already exists and then enhancing where the existing ontologies are lacking. If we are going to have a (web of) linked data, then re-use of ontologies and IDs is needed. For example in the work I did at the Open University in the UK we ended up using only a single property from a specific library ontology (the draft ISBD http://metadataregistry.org/schemaprop/show/id/1957.html has place of publication, production, distribution). I think it is interesting that many of the MARC-RDF mappings so far have adopted many of the same ontologies (although no doubt partly because there is a 'follow the leader' element to this - or at least there was for me when looking at the transformation at the Open University) Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 5 Dec 2011, at 18:56, Jonathan Rochkind wrote: On 12/5/2011 1:40 PM, Karen Coyle wrote: This brings up another point that I haven't fully grokked yet: the use of MARC kept library data consistent across the many thousands of libraries that had MARC-based systems. Well, only somewhat consistent, but, yeah. What happens if we move to RDF without a standard? Can we rely on linking to provide interoperability without that rigid consistency of data models? Definitely not. I think this is a real issue. There is no magic to linking or RDF that provides interoperability for free; it's all about the vocabularies/schemata -- whether in MARC or in anything else. (Note different national/regional library communities used different schemata in MARC, which made interoperability infeasible there. Some still do, although gradually people have moved to Marc21 precisely for this reason, even when Marc21 was less powerful than the MARC variant they started with). That is to say, if we just used MARC's own implicit vocabularies, but output them as RDF, sure, we'd still have consistency, although we wouldn't really _gain_ much. On the other hand, if we switch to a new better vocabulary -- we've got to actually switch to a new better vocabulary. If it's just whatever anyone wants to use, we've made it VERY difficult to share data, which is something pretty darn important to us. Of course, the goal of the RDA process (or one of em) was to create a new schema for us to consistently use. That's the library community effort to maintain a common schema that is more powerful and flexible than MARC. If people are using other things instead, apparently that failed, or at least has not yet succeeded.
Re: [CODE4LIB] Models of MARC in RDF
Hi Esme - thanks for this. Do you have any documentation on which predicates you've used and MODS-RDF transformation? Owen On 2 Dec 2011, at 16:07, Esme Cowles escow...@ucsd.edu wrote: Owen- Another strategy for capturing MARC data in RDF is to convert it to MODS (we do this using the LoC MARC to MODS stylesheet: http://www.loc.gov/standards/marcxml/xslt/MARC21slim2MODS.xsl). From there, it's pretty easy to incorporate into RDF. There are some issues to be aware of, such as how to map the MODS XML names to predicates and how to handle elements that can appear in multiple places in the hierarchy. -Esme -- Esme Cowles escow...@ucsd.edu Necessity is the plea for every infringement of human freedom. It is the argument of tyrants; it is the creed of slaves. -- William Pitt, 1783 On 11/28/2011, at 8:25 AM, Owen Stephens wrote: It would be great to start collecting transforms together - just a quick brain dump of some I'm aware of MARC21 transformations Cambridge University Library - http://data.lib.cam.ac.uk - transformation made available (in code) from same site Open University - http://data.open.ac.uk - specific transform for materials related to teaching, code available at http://code.google.com/p/luceroproject/source/browse/trunk%20luceroproject/OULinkedData/src/uk/ac/open/kmi/lucero/rdfextractor/RDFExtractor.java (MARC transform is in libraryRDFExtraction method) COPAC - small set of records from the COPAC Union catalogue - data and transform not yet published Podes Projekt - LinkedAuthors - documentation at http://bibpode.no/linkedauthors/doc/Pode-LinkedAuthors-Documentation.pdf - 2 stage transformation firstly from MARC to FRBRized version of data, then from FRBRized data to RDF. These linked from documentation Podes Project - LinkedNonFiction - documentation at http://bibpode.no/linkednonfiction/doc/Pode-LinkedNonFiction-Documentation.pdf - MARC data transformed using xslt https://github.com/pode/LinkedNonFiction/blob/master/marcslim2n3.xsl British Library British National Bibliography - http://www.bl.uk/bibliographic/datafree.html - data model documented, but no code available Libris.se - some notes in various presentations/blogposts (e.g. http://dc2008.de/wp-content/uploads/2008/09/malmsten.pdf) but can't find explicit transformation Hungarian National library - http://thedatahub.org/dataset/hungarian-national-library-catalog and http://nektar.oszk.hu/wiki/Semantic_web#Implementation - some information on ontologies used but no code or explicit transformation (not 100% sure this is from MARC) Talis - implemented in several live catalogues including http://catalogue.library.manchester.ac.uk/ - no documentation or code afaik although some notes in MAB transformation HBZ - some of the transformation documented at https://wiki1.hbz-nrw.de/display/SEM/Converting+the+Open+Data+from+the+hbz+to+BIBO, don't think any code published? Would be really helpful if more projects published their transformations (or someone told me where to look!) Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 26 Nov 2011, at 15:58, Karen Coyle wrote: A few of the code4lib talk proposals mention projects that have or will transform MARC records into RDF. If any of you have documentation and/or examples of this, I would be very interested to see them, even if they are under construction. Thanks, kc -- Karen Coyle kco...@kcoyle.net http://kcoyle.net ph: 1-510-540-7596 m: 1-510-435-8234 skype: kcoylenet
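For anyone who wants to experiment with a transform of their own, here is a toy example using pymarc and rdflib. It maps only a couple of fields to Dublin Core and is not any of the projects listed above; the file name and URI pattern are made up:

from pymarc import MARCReader
from rdflib import Graph, URIRef, Literal
from rdflib.namespace import DC

g = Graph()
with open("records.mrc", "rb") as fh:                          # invented file name
    for i, record in enumerate(MARCReader(fh)):
        uri = URIRef("http://example.org/id/bib/%d" % i)       # crude local URI minting
        f245 = record["245"]
        if f245 is not None and f245["a"]:
            g.add((uri, DC.title, Literal(f245["a"].strip(" /:"))))
        f100 = record["100"]
        if f100 is not None and f100["a"]:
            g.add((uri, DC.creator, Literal(f100["a"].strip(" ,"))))

print(g.serialize(format="turtle"))

Real transforms (the BL, Open University, Pode and other examples above) obviously model far more than this, but the basic shape - read MARC, pick fields, emit triples - is the same.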
Re: [CODE4LIB] Models of MARC in RDF
Oh - and perhaps just/more importantly - how do you create URIs for you data and how do you reconcile against other sources? Owen On 2 Dec 2011, at 16:07, Esme Cowles escow...@ucsd.edu wrote: Owen- Another strategy for capturing MARC data in RDF is to convert it to MODS (we do this using the LoC MARC to MODS stylesheet: http://www.loc.gov/standards/marcxml/xslt/MARC21slim2MODS.xsl). From there, it's pretty easy to incorporate into RDF. There are some issues to be aware of, such as how to map the MODS XML names to predicates and how to handle elements that can appear in multiple places in the hierarchy. -Esme -- Esme Cowles escow...@ucsd.edu Necessity is the plea for every infringement of human freedom. It is the argument of tyrants; it is the creed of slaves. -- William Pitt, 1783 On 11/28/2011, at 8:25 AM, Owen Stephens wrote: It would be great to start collecting transforms together - just a quick brain dump of some I'm aware of MARC21 transformations Cambridge University Library - http://data.lib.cam.ac.uk - transformation made available (in code) from same site Open University - http://data.open.ac.uk - specific transform for materials related to teaching, code available at http://code.google.com/p/luceroproject/source/browse/trunk%20luceroproject/OULinkedData/src/uk/ac/open/kmi/lucero/rdfextractor/RDFExtractor.java (MARC transform is in libraryRDFExtraction method) COPAC - small set of records from the COPAC Union catalogue - data and transform not yet published Podes Projekt - LinkedAuthors - documentation at http://bibpode.no/linkedauthors/doc/Pode-LinkedAuthors-Documentation.pdf - 2 stage transformation firstly from MARC to FRBRized version of data, then from FRBRized data to RDF. These linked from documentation Podes Project - LinkedNonFiction - documentation at http://bibpode.no/linkednonfiction/doc/Pode-LinkedNonFiction-Documentation.pdf - MARC data transformed using xslt https://github.com/pode/LinkedNonFiction/blob/master/marcslim2n3.xsl British Library British National Bibliography - http://www.bl.uk/bibliographic/datafree.html - data model documented, but no code available Libris.se - some notes in various presentations/blogposts (e.g. http://dc2008.de/wp-content/uploads/2008/09/malmsten.pdf) but can't find explicit transformation Hungarian National library - http://thedatahub.org/dataset/hungarian-national-library-catalog and http://nektar.oszk.hu/wiki/Semantic_web#Implementation - some information on ontologies used but no code or explicit transformation (not 100% sure this is from MARC) Talis - implemented in several live catalogues including http://catalogue.library.manchester.ac.uk/ - no documentation or code afaik although some notes in MAB transformation HBZ - some of the transformation documented at https://wiki1.hbz-nrw.de/display/SEM/Converting+the+Open+Data+from+the+hbz+to+BIBO, don't think any code published? Would be really helpful if more projects published their transformations (or someone told me where to look!) Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 26 Nov 2011, at 15:58, Karen Coyle wrote: A few of the code4lib talk proposals mention projects that have or will transform MARC records into RDF. If any of you have documentation and/or examples of this, I would be very interested to see them, even if they are under construction. Thanks, kc -- Karen Coyle kco...@kcoyle.net http://kcoyle.net ph: 1-510-540-7596 m: 1-510-435-8234 skype: kcoylenet
Re: [CODE4LIB] Models of MARC in RDF
It would be great to start collecting transforms together - just a quick brain dump of some I'm aware of MARC21 transformations Cambridge University Library - http://data.lib.cam.ac.uk - transformation made available (in code) from same site Open University - http://data.open.ac.uk - specific transform for materials related to teaching, code available at http://code.google.com/p/luceroproject/source/browse/trunk%20luceroproject/OULinkedData/src/uk/ac/open/kmi/lucero/rdfextractor/RDFExtractor.java (MARC transform is in libraryRDFExtraction method) COPAC - small set of records from the COPAC Union catalogue - data and transform not yet published Podes Projekt - LinkedAuthors - documentation at http://bibpode.no/linkedauthors/doc/Pode-LinkedAuthors-Documentation.pdf - 2 stage transformation firstly from MARC to FRBRized version of data, then from FRBRized data to RDF. These linked from documentation Podes Project - LinkedNonFiction - documentation at http://bibpode.no/linkednonfiction/doc/Pode-LinkedNonFiction-Documentation.pdf - MARC data transformed using xslt https://github.com/pode/LinkedNonFiction/blob/master/marcslim2n3.xsl British Library British National Bibliography - http://www.bl.uk/bibliographic/datafree.html - data model documented, but no code available Libris.se - some notes in various presentations/blogposts (e.g. http://dc2008.de/wp-content/uploads/2008/09/malmsten.pdf) but can't find explicit transformation Hungarian National library - http://thedatahub.org/dataset/hungarian-national-library-catalog and http://nektar.oszk.hu/wiki/Semantic_web#Implementation - some information on ontologies used but no code or explicit transformation (not 100% sure this is from MARC) Talis - implemented in several live catalogues including http://catalogue.library.manchester.ac.uk/ - no documentation or code afaik although some notes in MAB transformation HBZ - some of the transformation documented at https://wiki1.hbz-nrw.de/display/SEM/Converting+the+Open+Data+from+the+hbz+to+BIBO, don't think any code published? Would be really helpful if more projects published their transformations (or someone told me where to look!) Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 26 Nov 2011, at 15:58, Karen Coyle wrote: A few of the code4lib talk proposals mention projects that have or will transform MARC records into RDF. If any of you have documentation and/or examples of this, I would be very interested to see them, even if they are under construction. Thanks, kc -- Karen Coyle kco...@kcoyle.net http://kcoyle.net ph: 1-510-540-7596 m: 1-510-435-8234 skype: kcoylenet
[CODE4LIB] Mobile technologies in libraries - fact finding survey
The m-libraries support project (http://www.m-libraries.info/) is part of JISC’s Mobile Infrastructure for Libraries programme (http://infteam.jiscinvolve.org/wp/2011/10/11/mobile-infrastructure-for-libraries-new-projects/) running from November 2011 until September 2012. The project aims to build a collection of useful resources and case studies based on current developments using mobile technologies in libraries, and to foster a community for those working in the m-library area or interested in learning more. A brief introductory survey has been devised to help inform the project - as a way of starting to gather information, to discover what information is needed to help libraries decide on a way forward, and to begin to understand what an m-libraries community could offer to help. The survey should only take 5-10 minutes and all questions are optional. This is an open survey - please pass the survey link on to anyone else you think might be interested via email or social media: http://svy.mk/mlibs1 If you’re interested in mobile technologies in libraries and would like to receive updates about the project, please visit our project blog at http://m-libraries.info and subscribe to updates (links in the right hand side for RSS or email subscriptions). Thanks and best wishes, Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936
Re: [CODE4LIB] CIS students, service learning, and the library
I was going to point to that too, and also note that the DevXS event was the brainchild of two students at the University of Lincoln, who went onto work at the University - including developing 'Jerome' a library search interface using MongoDB and the Sphinx index/search s/w http://jerome.library.lincoln.ac.uk/ Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 13 Oct 2011, at 23:04, Robert Robertson wrote: Hi Ellen, The event hasn't been held yet but it might be worth taking a look at what DevCSI are doing with their DevXS event http://devxs.org/ and seeing what comes out of it after the fact. The DevCSI initiative (http://devcsi.ukoln.ac.uk/blog/) has run quite a few hackday events (inlcuding dev8D ) as part of an effort to build a stronger community of developers in HE in the UK and some of their events and challenges have been around library data. DevXS is their first major foray into trying the same idea with CS and other students but it might offer some ideas for events that could raise interest in longer term service learning projects or tackle specific tasks. cheers, John R. John Robertson skype: rjohnrobertson Research Fellow/ Open Education Resources programme support officer (JISC CETIS), Centre for Academic Practice and Learning Enhancement University of Strathclyde Tel:+44 (0) 141 548 3072 http://blogs.cetis.ac.uk/johnr/ The University of Strathclyde is a charitable body, registered in Scotland, with registration number SC015263 From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Ellen K. Wilson [ewil...@jaguar1.usouthal.edu] Sent: Thursday, October 13, 2011 9:29 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: [CODE4LIB] CIS students, service learning, and the library I am wondering if anyone has experience working with students (particularly CIS students) in service learning projects involving the library. I am currently supervising four first-year students who are working on a brief (10 hour) project involving the usability and redesign of the homepage as part of a first year seminar course. Obviously we won't get the whole thing done, but it is providing us with some valuable student insight into what should be on the page, etc. I anticipate the CIS department's first-year experience program will want to continue this collaboration, so I'm trying to brainstorm some projects that might be useful for future semesters particularly for freshmen who are just beginning their course of study in computer science, information technology, or information systems. This semester's project was thrown together in only a few days and I would like to not do that again! Ideas would be appreciated. Best regards, Ellen -- Ellen Knowlton Wilson Instructional Services Librarian Room 250, University Library University of South Alabama 5901 USA Drive North Mobile, AL 36688 (251) 460-6045 ewil...@jaguar1.usouthal.edu
[CODE4LIB] Show reuse of library/archive/museum data and win prizes
fully the benefits of sharing it and improve our services. Please contact metad...@bl.uk if you wish to share your experiences with us and those that are using this service. Give Credit Where Credit is Due: The British Library has a responsibility to maintain its bibliographic data on the nation's behalf. Please credit all use of this data to the British Library and link back to www.bl.uk/bibliographic/datafree.html in order that this information can be shared and developed with today's Internet users as well as future generations. Duplicate of package:bluk-bnb

Tyne and Wear Museums Collections (Imagine)
Part of the Europeana Linked Open Data, this is a collection of metadata describing (and linking to digital copies where appropriate) items in the Tyne and Wear Museums Collections.

Cambridge University Library dataset #1
This data marks the first major output of the COMET project. COMET is a JISC funded collaboration between Cambridge University Library and CARET, University of Cambridge. It is funded under the JISC Infrastructure for Resource Discovery programme. It represents work over a 20+ year period which contains a number of changes in practices and cataloguing tools. No attempt has been made to screen for quality of records other than the Voyager export process. This data also includes the 180,000 'Tower Project' records published under the JISC Open Bibliography Project.

JISC MOSAIC Activity Data
The JISC MOSAIC (www.sero.co.uk/jisc-mosaic.html) project gathered together data covering user activity in a few UK Higher Education libraries. The data is available for download and via an API and contains information on books borrowed during specific time periods, and where available describes links between books, courses, and year of study.

OpenURL Router Data (EDINA)
EDINA is making the OpenURL Router Data available from April 2011. It is derived from the logs of the OpenURL Router, which directs user requests for academic papers to the appropriate institutional resolver. It enables institutions to register their resolver once only, at http://openurl.ac.uk, and service providers may then use openurl.ac.uk as the "base URL" for OpenURL links for UK HE and FE customers. This is the product of JISC-funded project activity, and provides a unique data set. The data captured varies from request to request since different users enter different information into requests. Further information on the details of the data set, sample files and the data itself is available at http://openurl.ac.uk/doc/data/data.html. The team would like to thank all the institutions involved in this initiative for their participation. The data are made available under the Open Data Commons (ODC) Public Domain Dedication and Licence and the ODC Attribution Sharealike Community Norms.

Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936
[CODE4LIB] Developer Competition using Library/Archive/Museum data
Celebrate Liberation – A worldwide competition for open software developers using open data

UK Discovery (http://discovery.ac.uk/) and the Developer Community Supporting Innovation (DevCSI) project based at UKOLN are running a global Developer Competition throughout July 2011 to build open source software applications / tools, using at least one of our 10 open data sources collected from libraries, museums and archives. Enter simply by blogging about your application and emailing the blog post URI to joy.pal...@manchester.ac.uk by the deadline of 2359 (your local time) on Monday 1 August 2011. Full details of the competition, the data sets and how to enter are at http://discovery.ac.uk/developers/competition/

There are 13 prizes:
Best entry for each dataset – there are 10 datasets so there could be 10 winners of £30 Amazon vouchers, and an aggregation could win more than one!
Data Munging – best example of Consolidating or Aggregating or De-duplicating or Entity matching or … one prize of £100 Amazon voucher.
Overall winners – an EEE Pad Transformer for the overall winner and a £200 Amazon voucher for the Runner Up.
And you can win more than once :)

The specific competition tag on Twitter is #discodev, but #devcsi and #ukdiscovery are also good to follow/use. Excited to see what people come up with - hope some of you are able to enter

Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936
[CODE4LIB] PDF-text extraction
The CORE project at The Open University in the UK is doing some work on finding similarity between papers in institutional repositories (see http://core-project.kmi.open.ac.uk/ for more info). The first step in the process is extracting text from the (mainly) PDF documents harvested from repositories. We've tried iText but had issues with quality. We moved to PDFBox but are having performance issues. Any other suggestions/experience? Thanks, Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936
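For reference, the basic PDFBox extraction call under discussion looks roughly like this - a minimal sketch against the PDFBox 2.x API, not the CORE project's actual code. Restricting the page range (commented out below) is one of the simpler knobs to turn when large files cause memory or performance trouble.

```java
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractText {
    public static void main(String[] args) throws Exception {
        // Load a harvested PDF and pull out its plain text
        try (PDDocument document = PDDocument.load(new File(args[0]))) {
            PDFTextStripper stripper = new PDFTextStripper();
            // Optionally limit the pages processed, e.g. for very large documents:
            // stripper.setStartPage(1);
            // stripper.setEndPage(10);
            System.out.println(stripper.getText(document));
        }
    }
}
```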
Re: [CODE4LIB] RDF for opening times/hours?
I'd suggest having a look at the Good Relations ontology http://wiki.goodrelations-vocabulary.org/Quickstart - it's aimed at businesses but the OpeningHoursSpecification class might do what you need: http://www.heppnetz.de/ontologies/goodrelations/v1.html#OpeningHoursSpecification While handling public holidays etc. is not immediately obvious, it is covered in this mail: http://ebusiness-unibw.org/pipermail/goodrelations/2010-October/000261.html Picking up on the previous comment, Good Relations in RDFa is one of the formats Google use for Rich Snippets and it is also picked up by Yahoo. Owen On 7 Jun 2011, at 23:05, Tom Keays tomke...@gmail.com wrote: There was a time, about 5 years ago, when I assumed that microformats were the way to go and spent a bit of time looking at hCalendar for representing iCalendar-formatted event information. http://microformats.org/wiki/hcalendar Not long after that, there was a lot of talk about RDF and RDFa for this same purpose. Now I was confused as to whether to change my strategy or not, but RDF Calendar seemed to be a good idea. The latter also was nice because it could be used to syndicate event information via RSS. http://pemberton-vandf.blogspot.com/2008/06/how-to-do-hcalendar-in-rdfa.html http://www.w3.org/TR/rdfcal/ These days it seems to be all about HTML5 microdata, especially because of Rich Snippets and Google's support for this approach. http://html5doctor.com/microdata/#microdata-action All three approaches allow you to embed iCalendar formatted event information on a web page. All three of them do it differently. I'm even more confused now than I was 5 years ago. This should not be this hard, yet there is still no definitive way to deploy this information and preserve the semantics of the event information. Part of this may be because the iCalendar format, although widely used, is itself insufficient. Tom
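To make that concrete, here is a minimal sketch of one opening-hours statement using the GoodRelations terms linked above (gr:OpeningHoursSpecification, gr:hasOpeningHoursDayOfWeek, gr:opens, gr:closes), built with Apache Jena. The library URI is made up, and the property names should be double-checked against the current vocabulary documentation before use.

```java
import org.apache.jena.datatypes.xsd.XSDDatatype;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDF;

public class OpeningHoursExample {
    static final String GR = "http://purl.org/goodrelations/v1#";

    public static void main(String[] args) {
        Model m = ModelFactory.createDefaultModel();
        m.setNsPrefix("gr", GR);

        // One specification: open 09:00-17:00 on Wednesdays
        Resource spec = m.createResource()
                .addProperty(RDF.type, m.createResource(GR + "OpeningHoursSpecification"))
                .addProperty(m.createProperty(GR + "hasOpeningHoursDayOfWeek"), m.createResource(GR + "Wednesday"))
                .addProperty(m.createProperty(GR + "opens"), m.createTypedLiteral("09:00:00", XSDDatatype.XSDtime))
                .addProperty(m.createProperty(GR + "closes"), m.createTypedLiteral("17:00:00", XSDDatatype.XSDtime));

        // Attach it to the library (placeholder URI)
        m.createResource("http://example.org/library#building")
                .addProperty(m.createProperty(GR + "hasOpeningHoursSpecification"), spec);

        m.write(System.out, "TURTLE");
    }
}
```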
Re: [CODE4LIB] [dpla-discussion] Rethinking the library part of DPLA
I guess that people may already be familiar with the Candide 2.0 project at NYPL http://candide.nypl.org/text/ - this sounds not dissimilar to the type of approach being suggested This document is built using Wordpress with the Digress.it plugin (http://digress.it/) Owen Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com Telephone: 0121 288 6936 On 10 Apr 2011, at 17:35, Nate Hill wrote: Eric, thanks for finding enough merit in my post on the DPLA listserv to repost it here. Karen and Peter, I completely agree with your feelings- But my point in throwing this idea out there was that despite all of the copyright issues, we don't really do a great job making a simple, intuitive, branded interface for the works that *are* available - the public domain stuff. Instead we seem to be content with knowing that this content is out there, and letting vendors add it to their difficult-to-use interfaces. I guess my hope, seeing this reposted here is that someone might have a suggestion as to why I would not host public domain ebooks on my own library's site. Are there technical hurdles to consider? I feel like I see a tiny little piece of the ebook access problem that we *can* solve here, while some of the larger issues will indeed be debated in forums like the DPLA for quite a while. By solving a small problem along the way, perhaps when the giant 1923-2011 problem is resolved we'll have a clearer path as to what type of access we might provide. On 4/10/11, Peter Murray peter.mur...@lyrasis.org wrote: I, too, have been struggling with this aspect of the discussion. (I'm on the DPLA list as well.) There seems to be this blind spot within the leadership of the group to ignore the copyright problem and any interaction with publishers of popular materials. One of the great hopes that I have for this group, with all of the publicity it is generating, is to serve as a voice and a focal point to bring authors, publishers and librarians together to talk about a new digital ownership and sharing model. That doesn't seem to be happening. Peter On Apr 10, 2011, at 10:05, Karen Coyle li...@kcoyle.net wrote: I appreciate the spirit of this, but despair at the idea that libraries organize their services around public domain works, thus becoming early 20th century institutions. The gap between 1923 and 2011 is huge, and it makes no sense to users that a library provide services based on publication date, much less that enhanced services stop at 1923. kc Quoting Eric Hellman e...@hellman.net: The DPLA listserv is probably too impractical for most of Code4Lib, but Nate Hill (who's on this list as well) made this contribution there, which I think deserves attention from library coders here. On Apr 5, 2011, at 11:15 AM, Nate Hill wrote: It is awesome that the project Gutenberg stuff is out there, it is a great start. But libraries aren't using it right. There's been talk on this list about the changing role of the public library in people's lives, there's been talk about the library brand, and some talk about what 'local' might mean in this context. I'd suggest that we should find ways to make reading library ebooks feel local and connected to an immediate community. Brick and mortar library facilities are public spaces, and librarians are proud of that. We have collections of materials in there, and we host programs and events to give those materials context within the community. 
There's something special about watching a child find a good book, and then show it to his or her friend and talk about how awesome it is. There's also something special about watching a senior citizens book group get together and discuss a new novel every month. For some reason, libraries really struggle with treating their digital spaces the same way. I'd love to see libraries creating online conversations around ebooks in much the same way. Take a title from project Gutenberg: The Adventures of Huckleberry Finn. Why not host that book directly on my library website so that it can be found at an intuitive URL, www.sjpl.org/the-adventures-of-huckleberry-finn and then create a forum for it? The URL itself takes care of the 'local' piece; certainly my most likely visitors will be San Jose residents- especially if other libraries do this same thing. The brand remains intact, when I launch this web page that holds the book I can promote my library's identity. The interface is no problem because I can optimize the page to load well on any device and I can link to different formats of the book. Finally, and most importantly, I've created a local digital space for this book so that people can converse about it via comments, uploaded pictures, video, whatever. I really think this community conversation and context-creation around materials is a big part of what makes public libraries special
Re: [CODE4LIB] LCSH and Linked Data
Thanks for all the information and discussion. I don't think I'm familiar enough with Authority file formats to completely comprehend - but I certainly understand the issues around the question of 'place' vs 'histo-geo-political entity'. Some of this makes me worry about the immediate applicability of the LC Authority files in the Linked Data space - someone said to me recently 'SKOS is just a way of avoiding dealing with the real semantics' :) Anyway - putting that to one side, the simplest approach for me at the moment seems to be to only look at authorised LCSH as represented on id.loc.gov. Picking up on Andy's first response:

On Thu, Apr 7, 2011 at 3:46 PM, Houghton,Andrew hough...@oclc.org wrote: After having done numerous matching and mapping projects, there are some issues that you will face with your strategy, assuming I understand it correctly. Trying to match a heading starting at the left most subfield and working forward will not necessarily produce correct results when matching against the LCSH authority file. Using your example: 650 _0 $a Education $z England $x Finance is a good example of why processing the heading starting at the left will not necessarily produce the correct results. Assuming I understand your proposal you would first search for: 150 __ $a Education and find the heading with LCCN sh85040989. Next you would look for: 181 __ $z England and you would NOT find this heading in LCSH.

OK - ignoring the question of where the best place to look for this is - I can live with not matching it for now. Later (perhaps when I understand it better, or when these headings are added to id.loc.gov) we can revisit this.

The second issue using your example is that you want to find the "longest" matching heading. While the pieces/parts are there, so is the enumerated authority heading: 150 __ $a Education $z England as LCCN sh2008102746. So your heading is actually composed of the enumerated headings:
sh2008102746  150 __ $a Education $z England
sh2002007885  180 __ $x Finance
and not the separate headings:
sh85040989    150 __ $a Education
n82068148     150 __ $a England
sh2002007885  180 __ $x Finance
Although one could argue that either analysis is correct depending upon what you are trying to accomplish.

What I'm interested in is representing the data as RDF/Linked Data in a way that opens up the best opportunities for both understanding and querying the data. Unfortunately at the moment there isn't a good way of representing LCSH directly in RDF (the MADS work may help I guess but to be honest at the moment I see that as overly complex - but that's another discussion). What I can do is make statements that an item is 'about' a subject (probably using dc:subject) and then point at an id.loc.gov URI. However, if I only express individual headings:
Education
England (natch)
Finance
then obviously I lose the context of the full heading - so I also want to look for Education--England--Finance (which I won't find on id.loc.gov as not authorised). At this point I could stop, but my feeling is that it is useful to also look for other combinations of the terms:
Education--England (not authorised)
Education--Finance (authorised! http://id.loc.gov/authorities/sh85041008)
My theory is that as long as I stick to combinations that start with a topical term I'm not going to make startlingly inaccurate statements? The matching algorithm I have used in the past contains two routines.
The first, f(a), will accept a heading as a parameter, scrub the heading, e.g., remove unnecessary subfields like $0, $3, $6, $8, etc. and do any other pre-processing necessary on the heading, then call the second function f(b). The f(b) function accepts a heading as a parameter and recursively calls itself until it builds up the list of LCCNs that comprise the heading. It first looks for the given heading; when it doesn't find it, it removes the *last* subfield and recursively calls itself, otherwise it appends the found LCCN to the returned list and exits. This strategy will find the longest match.

Unless I've misunderstood this, this strategy would not find 'Education--Finance'? Instead I need to remove each *subdivision* in turn (no matter where it appears in the heading order) and try all possible combinations, checking each for a match on id.loc.gov. Again, I can do this without worrying about possible invalid headings, as these wouldn't have been authorised anyway... I can check the number of variations around this but I guess that in my limited set of records (only 30k) there will be a relatively small number of possible patterns to check. Does that make sense?
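As a sketch of the 'try all the combinations' idea (nobody's production code): generate every ordered subset of the subdivisions behind the topical term, join the pieces with '--', and test each candidate label against id.loc.gov. The lookup below assumes a label-based lookup service on id.loc.gov (a request to /authorities/label/{heading} that redirects to the authority URI when the label is known and returns 404 otherwise) - the exact path and behaviour should be checked against the current id.loc.gov documentation.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LcshCandidates {

    // Every candidate heading: the topical term alone, then the term followed by
    // each ordered subset of the subdivisions, joined with "--".
    static List<String> candidates(String topicalTerm, List<String> subdivisions) {
        List<String> results = new ArrayList<>();
        int n = subdivisions.size();
        for (int mask = 0; mask < (1 << n); mask++) {
            StringBuilder sb = new StringBuilder(topicalTerm);
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) {
                    sb.append("--").append(subdivisions.get(i));
                }
            }
            results.add(sb.toString());
        }
        return results;
    }

    // Hypothetical check against an id.loc.gov label lookup: a redirect means the
    // label is a known authority, a 404 means it is not.
    static boolean isAuthorised(HttpClient client, String heading) throws Exception {
        String url = "http://id.loc.gov/authorities/label/"
                + URLEncoder.encode(heading, StandardCharsets.UTF_8).replace("+", "%20");
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<Void> response = client.send(request, HttpResponse.BodyHandlers.discarding());
        return response.statusCode() != 404;
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NEVER)
                .build();
        // 650 _0 $a Education $z England $x Finance -> topical term plus ordered subdivisions
        List<String> subdivisions = List.of("England", "Finance");
        for (String heading : candidates("Education", subdivisions)) {
            System.out.println(heading + " -> " + (isAuthorised(client, heading) ? "match" : "no match"));
        }
    }
}
```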
Re: [CODE4LIB] LCSH and Linked Data
Thanks Ross - I have been pushing some cataloguing folk to comment on some of this as well (and have some feedback) - but I take the point that wider consultation via autocat could be a good idea. (for some reason this makes me slightly nervous!) In terms of whether Education--England--Finance is authorised or not - I think I took from Andy's response that it wasn't, but also looking at it on authorities.loc.gov it isn't marked as 'authorised'. Anyway - the relevant thing for me at this stage is that I won't find a match via id.loc.gov - so I can't get a URI for it anyway. There are clearly quite a few issues with interacting with LCSH as Linked Data at the moment - I'm not that keen on how this currently works, and my reaction to the MADS/RDF ontology is similar to that of Bruce D'Arcus (see http://metadata.posterous.com/lcs-madsrdf-ontology-and-the-future-of-the-se), but on the other hand I want to embrace the opportunity to start joining some stuff up and seeing what happens :) Owen On Fri, Apr 8, 2011 at 3:10 PM, Ross Singer rossfsin...@gmail.com wrote: On Fri, Apr 8, 2011 at 5:02 AM, Owen Stephens o...@ostephens.com wrote: Then obviously I lose the context of the full heading - so I also want to look for Education--England--Finance (which I won't find on id.loc.gov as not authorised) At this point I could stop, but my feeling is that it is useful to also look for other combinations of the terms: Education--England (not authorised) Education--Finance (authorised! http://id.loc.gov/authorities/sh85041008 ) My theory is that as long as I stick to combinations that start with a topical term I'm not going to make startlingly inaccurate statements? I would definitely ask this question somewhere other than Code4lib (autocat, maybe?), since I think the answer is more complicated than this (although they could validate/invalidate your assumption about whether or not this approach would get you close enough). My understanding is that Education--England--Finance *is* authorized, because Education--Finance is and England is a free-floating geographic subdivision. Because it's also an authorized heading, Education--England--Finance is, in fact, an authority. The problem is that free-floating subdivisions cause an almost infinite number of permutations, so there aren't LCCNs issued for them. This is where things get super-wonky. It's also the reason I initially created lcsubjects.org, specifically to give these (and, ideally, locally controlled subject headings) a publishing platform/centralized repository, but it quickly grew to be more than just a side project. There were issues of how the data would be constructed (esp. since, at the time, I had no access to the NAF), how to reconcile changes, provenance, etc. Add to the fact that 2 years ago, there wasn't much linked library data going on, it was really hard to justify the effort. But, yeah, it would be worth running your ideas by a few catalogers to see what they think. -Ross. -- Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com
[CODE4LIB] LCSH and Linked Data
We are working on converting some MARC library records to RDF, and looking at how we handle links to LCSH (id.loc.gov) - and I'm looking for feedback on how we are proposing to do this... I'm not 100% confident about the approach, and to some extent I'm trying to work around the nature of how LCSH interacts with RDF at the moment I guess... but here goes - I would very much appreciate feedback/criticism/being told why what I'm proposing is wrong: I guess what I want to do is preserve aspects of the faceted nature of LCSH in a useful way, give useful links back to id.loc.gov where possible, and give access to a wide range of facets on which the data set could be queried. Because of this I'm proposing not just expressing the whole of the 650 field as an LCSH and checking for its existence on id.loc.gov, but also checking for various combinations of topical term and subdivisions from the 650 field. So for any 650 field I'm proposing we should check on id.loc.gov for labels matching:
check(650$$a) -- topical term
check(650$$b) -- topical term
check(650$$v) -- Form subdivision
check(650$$x) -- General subdivision
check(650$$y) -- Chronological subdivision
check(650$$z) -- Geographic subdivision
Then using whichever elements exist (all as topical terms):
Check(650$$a--650$$b)
Check(650$$a--650$$v)
Check(650$$a--650$$x)
Check(650$$a--650$$y)
Check(650$$a--650$$z)
Check(650$$a--650$$b--650$$v)
Check(650$$a--650$$b--650$$x)
Check(650$$a--650$$b--650$$y)
Check(650$$a--650$$b--650$$z)
Check(650$$a--650$$b--650$$x--650$$v)
Check(650$$a--650$$b--650$$x--650$$y)
Check(650$$a--650$$b--650$$x--650$$z)
Check(650$$a--650$$b--650$$x--650$$z--650$$v)
Check(650$$a--650$$b--650$$x--650$$z--650$$y)
Check(650$$a--650$$b--650$$x--650$$z--650$$y--650$$v)
As an example, given:
650 00 $$aPopular music$$xHistory$$y20th century
we would be checking id.loc.gov for
'Popular music' as a topical term (http://id.loc.gov/authorities/sh85088865)
'History' as a general subdivision (http://id.loc.gov/authorities/sh99005024)
'20th century' as a chronological subdivision (http://id.loc.gov/authorities/sh2002012476)
'Popular music--History and criticism' as a topical term (http://id.loc.gov/authorities/sh2008109787)
'Popular music--20th century' as a topical term (not authorised)
'Popular music--History and criticism--20th century' as a topical term (not authorised)
and expressing all matches in our RDF. My understanding of LCSH isn't what it might be - but the ordering of terms in the combined string checking is based on what I understand to be the usual order - is this correct, and should we be checking for alternative orderings? Thanks Owen -- Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com
Re: [CODE4LIB] LCSH and Linked Data
Thanks Tom - very helpful. Perhaps this suggests that rather than using a fixed order we should check combinations while preserving the order of the original 650 field (I assume this should in theory always be correct - or at least done to the best of the cataloguer's knowledge)? So for:
650 _0 $$a Education $$z England $$x Finance.
check:
Education
England (subdiv)
Finance (subdiv)
Education--England
Education--Finance
Education--England--Finance
While for
650 _0 $$a Education $$x Economic aspects $$z England
we check
Education
Economic aspects (subdiv)
England (subdiv)
Education--Economic aspects
Education--England
Education--Economic aspects--England
- It is possible for other orders in special circumstances, e.g. with language dictionaries which can go something like: 650 _0 $$a English language $$v Dictionaries $$x Albanian. This possibility would also be covered by preserving the order - check:
English Language
Dictionaries (subdiv)
Albanian (subdiv)
English Language--Dictionaries
English Language--Albanian
English Language--Dictionaries--Albanian
Creating possibly invalid headings isn't necessarily a problem - as we won't get a match on id.loc.gov anyway. (Instinctively English Language--Albanian doesn't feel right) - Some of these are repeatable, so you can have two $$vs following each other (e.g. Biography--Dictionaries); two $$zs (very common), as in Education--England--London; two $$xs (e.g. Biography--History and criticism). OK - that's fine, we can use each individually and in combination for any repeated headings I think - I'm not sure I've ever come across a lot of $$bs in 650s. Do you have a lot of them in the database? Hadn't checked until you asked! We have 1 in the dataset in question (c.30k records) :) I'm not sure how possible it would be to come up with a definitive list of (reasonable) possible combinations. You are probably right - but I'm not too bothered about aiming at 'definitive' at this stage anyway - I do want to get something relatively functional/useful though. Tom Thomas Meehan Head of Current Cataloguing University College London Library Services Owen Stephens wrote: We are working on converting some MARC library records to RDF, and looking at how we handle links to LCSH (id.loc.gov) - and I'm looking for feedback on how we are proposing to do this... I'm not 100% confident about the approach, and to some extent I'm trying to work around the nature of how LCSH interacts with RDF at the moment I guess... but here goes - I would very much appreciate feedback/criticism/being told why what I'm proposing is wrong: I guess what I want to do is preserve aspects of the faceted nature of LCSH in a useful way, give useful links back to id.loc.gov where possible, and give access to a wide range of facets on which the data set could be queried. Because of this I'm proposing not just expressing the whole of the 650 field as an LCSH and checking for its existence on id.loc.gov, but also checking for various combinations of topical term and subdivisions from the 650 field.
So for any 650 field I'm proposing we should check on id.loc.gov http://id.loc.gov for labels matching: check(650$$a) -- topical term check(650$$b) -- topical term check(650$$v) -- Form subdivision check(650$$x) -- General subdivision check(650$$y) -- Chronological subdivision check(650$$z) -- Geographic subdivision Then using whichever elements exist (all as topical terms): Check(650$$a--650$$b) Check(650$$a--650$$v) Check(650$$a--650$$x) Check(650$$a--650$$y) Check(650$$a--650$$z) Check(650$$a--650$$b--650$$v) Check(650$$a--650$$b--650$$x) Check(650$$a--650$$b--650$$y) Check(650$$a--650$$b--650$$z) Check(650$$a--650$$b--650$$x--650$$v) Check(650$$a--650$$b--650$$x--650$$y) Check(650$$a--650$$b--650$$x--650$$z) Check(650$$a--650$$b--650$$x--650$$z--650$$v) Check(650$$a--650$$b--650$$x--650$$z--650$$y) Check(650$$a--650$$b--650$$x--650$$z--650$$y--650$$v) As an example given: 650 00 $$aPopular music$$xHistory$$y20th century We would be checking id.loc.gov http://id.loc.gov for 'Popular music' as a topical term ( http://id.loc.gov/authorities/sh85088865) 'History' as a general subdivision ( http://id.loc.gov/authorities/sh99005024) '20th century' as a chronological subdivision ( http://id.loc.gov/authorities/sh2002012476) 'Popular music--History and criticism' as a topical term ( http://id.loc.gov/authorities/sh2008109787) 'Popular music--20th century' as a topical term (not authorised) 'Popular music--History and criticism--20th century' as a topical term (not authorised) And expressing all matches in our RDF. My understanding of LCSH isn't what it might be - but the ordering of terms in the combined string checking is based on what I understand to be the usual order - is this correct, and should we be checking for alternative orderings? Thanks Owen -- Owen
Re: [CODE4LIB] LCSH and Linked Data
Still digesting Andrew's response (thanks Andrew), but On Thu, Apr 7, 2011 at 4:17 PM, Ya'aqov Ziso yaaq...@gmail.com wrote: *Currently under id.loc.gov you will not find name authority records, but you can find them at viaf.org*. *[YZ]* viaf.org does not include geographic names. I just checked there England. Is this not the relevant VIAF entry http://viaf.org/viaf/142995804 -- Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com
Re: [CODE4LIB] LCSH and Linked Data
I'm out of my depth here :) But... this is what I understood Andrew to be saying. In this instance (?because 'England' is a Name Authority?), rather than create a separate LCSH authority record for 'England' (as the 151), the LCSH subdivision is recorded in the 781 of the existing Name Authority record. Searching on http://authorities.loc.gov for England, I find an Authorised heading, marked as an LCSH - but when I go to that record what I get is the name authority record n 82068148 - the name authority record as represented on VIAF by http://viaf.org/viaf/142995804/ (which links to http://errol.oclc.org/laf/n%20%2082068148.html) Just as this is getting interesting, time differences mean I'm about to head home :) Owen On Thu, Apr 7, 2011 at 4:34 PM, LeVan,Ralph le...@oclc.org wrote: If you look at the fields those names come from, I think they mean England as a corporation, not England as a place. Ralph -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Owen Stephens Sent: Thursday, April 07, 2011 11:28 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] LCSH and Linked Data Still digesting Andrew's response (thanks Andrew), but On Thu, Apr 7, 2011 at 4:17 PM, Ya'aqov Ziso yaaq...@gmail.com wrote: *Currently under id.loc.gov you will not find name authority records, but you can find them at viaf.org*. *[YZ]* viaf.org does not include geographic names. I just checked there England. Is this not the relevant VIAF entry http://viaf.org/viaf/142995804 -- Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com -- Owen Stephens Owen Stephens Consulting Web: http://www.ostephens.com Email: o...@ostephens.com