Re: [CODE4LIB] Implementing OpenURL for simple web resources
True. How, from the OpenURL, are you going to know that the rft is meant to represent a website?

I guess that was part of my question. But no one has suggested defining a new metadata profile for websites (which I probably would avoid tbh). DC doesn't seem to offer a nice way of doing this (that is, saying 'this is a website'), although there are perhaps some bits and pieces (format, type) that could be used to give some indication (but I suspect not unambiguously).

But I still think what you want is simply a purl server. What makes you think you want OpenURL in the first place? But I still don't really understand what you're trying to do: deliver consistency of approach across all our references -- so are you using OpenURL for its more conventional use too, but you want to tack on a purl-like functionality to the same software that's doing something more like a conventional link resolver? I don't completely understand your use case. I wouldn't use OpenURL just to get a persistent URL - I'd almost certainly look at PURL for this.

But I want something slightly different. I want our course authors to be able to use whatever URL they know for a resource, but still try to ensure that the link works persistently over time. I don't think it is reasonable for a user to have to know a 'special' URL for a resource - and the PURL approach means establishing a PURL for every resource used in our teaching material, whether or not it moves in the future, which is an overhead it would be nice to avoid.

You can hit delete now if you aren't interested, but perhaps if I just say a little more about the project I'm working on it may clarify things.

The project I'm working on is concerned with referencing and citation. We are looking at how references appear in teaching material (especially online) and how they can be reused by students in their personal environment (in essays, later study, or something else). The references can be to anything - books, chapters, journals, articles, etc. - and increasingly, of course, there are references to web-based materials.

For print material, references generally describe the resource and nothing more, but for digital material references are expected not only to describe the resource but also to state a route of access to it. This tends to be a bad idea when (for example) referencing e-journals, as we know the problems that surround this - many different routes of access to the same item. OpenURLs work well in this situation and seem to me like a sensible (and perhaps the only viable) solution. So we can say that for journals/articles it is sensible to ignore any URL supplied as part of the reference and to form an OpenURL instead. If there is a DOI in the reference (which is increasingly common) then that can be used to form a URL via DOI resolution, but it makes more sense to me to hand this off to another application rather than bake it into the reference - and OpenURL resolvers are reasonably well placed to do this.

If we look at a website, it is pretty difficult to reference it without including the URL - it seems to be the only good way of describing what you are actually talking about (how many people think of websites by 'title', 'author' and 'publisher'?). For me, this leads to an immediate confusion between the description of the resource and the route of access to it. So, to differentiate, I'm starting to think of the http URI in a reference like this as a URI, but not necessarily a URL. We then need some mechanism to check, given a URI, what the URL is.
Now I could do this with a script - just pass the URI to a script that checks what URL to use against a list and redirects the user if necessary. On this point Jonathan said that if the usefulness of this technique does not count on being inter-operable with existing link resolver infrastructure then, personally, he didn't think it was worth using OpenURL - but it struck me that if we were passing a URI to a script, why not pass it in an OpenURL? I could see a number of advantages to this in the local context:

- Consistency - references to websites get treated the same as references to journal articles, which means a single approach on the course side, with flexibility

- Usage stats - we could collect these whatever we do, but if we do it via OpenURL we get them in the same place as the stats about usage of other scholarly material, and could consider driving personalisation services off the data (like the bX product from Ex Libris)

- Appropriate copy problem - for resources we subscribe to with authentication mechanisms there is (I think) an equivalent to the 'appropriate copy' issue as with journal articles - we can push a URI to 'Web of Science' to the correct version of Web of Science via a local authentication method (using EZproxy for us)

The problem with the approach (as Nate and Eric mention) is that any approach that relies on the URI as an identifier (whether using
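As a rough illustration of the "why not pass it in an OpenURL" idea above, a minimal OpenURL 1.0 KEV string carrying the cited website URI as rft_id might be formed like this; the resolver base URL and the rfr_id are invented placeholders, not anything from the TELSTAR project:

    # Sketch only: form an OpenURL that passes the cited web page's URI as the
    # referent identifier (rft_id). The resolver base URL and the rfr_id are
    # invented placeholders.
    from urllib.parse import urlencode

    RESOLVER = "http://openurl.example.ac.uk/resolver"  # placeholder

    def website_openurl(cited_uri):
        params = [
            ("url_ver", "Z39.88-2004"),
            ("rfr_id", "info:sid/example.ac.uk:courselinks"),  # placeholder referrer
            ("rft_id", cited_uri),  # the URI taken straight from the reference
        ]
        return RESOLVER + "?" + urlencode(params)

    print(website_openurl("http://www.bbc.co.uk"))

The point is only that the author-supplied URL travels unchanged as the referent identifier; everything about where the user actually ends up is left to the resolver.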
Re: [CODE4LIB] Implementing OpenURL for simple web resources
I agree with this Rosalyn. The issue that Nate brought up was that the content at http://www.bbc.co.uk could change over time, and old content might be moved to another URI - http://archive.bbc.co.uk or something. So if course A references http://www.bbc.co.uk on 24/08/09, and the content that was on http://www.bbc.co.uk on 24/08/09 moves to http://archive.bbc.co.uk, we can use the mechanism I propose to trap the links to http://www.bbc.co.uk and redirect to http://archive.bbc.co.uk. However, if at a later date course B references http://www.bbc.co.uk, we have no way of knowing whether they mean the stuff that is currently on http://www.bbc.co.uk or the stuff that used to be on http://www.bbc.co.uk and is now on http://archive.bbc.co.uk - and we have a redirect that is being applied across the board.

Thinking about it, references are required to include a date of access when citing websites, so this is probably the best piece of information to use to know where to resolve to (and we can put this in the DC metadata). Whether this will just get too confusing is a good question - I'll have a think about this.

Owen

PS Using the date we could even consider resolving to the Internet Archive copy of a website if it was available - this might be useful...

Owen Stephens TELSTAR Project Manager Library and Learning Resources Centre The Open University Walton Hall Milton Keynes, MK7 6AA T: +44 (0) 1908 858701 F: +44 (0) 1908 653571 E: o.steph...@open.ac.uk

-Original Message- From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Rosalyn Metz Sent: 14 September 2009 21:52 To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Implementing OpenURL for simple web resources

oops...just re-read original post s/professor/article

also your link resolver should be creating a context object with each request. this context object is what makes the openurl unique. so if you want uniqueness for stats purposes i would imagine the link resolver is already doing that (and just another reason to use an rfr_id that you create).

On Mon, Sep 14, 2009 at 4:45 PM, Rosalyn Metz rosalynm...@gmail.com wrote: Owen, rft_id isn't really meant to be a unique identifier (although it can be in situations like a pmid or doi). are you looking for it to be? if so why? if professor A is pointing to http://www.bbc.co.uk and professor B is pointing to http://www.bbc.co.uk why do they have to have unique OpenURLs. Rosalyn

On Mon, Sep 14, 2009 at 4:41 PM, Eric Hellman e...@hellman.net wrote: Nate's point is what I was thinking about in this comment in my original reply: "If you don't add DC metadata, which seems like a good idea, you'll definitely want to include something that will help you to persist your replacement record. For example, a label or description for the link." I should also point out a solution that could work for some people but not you - put rewrite rules in the gateways serving your network. A bit dangerous and kludgy, but we've seen kludgier things.

On Sep 14, 2009, at 4:24 PM, O.Stephens wrote: Nate has a point here - what if we end up with a commonly used URI pointing at a variety of different things over time, and so is used to indicate different content each time? However the problem with a 'short URL' solution (tr.im, purl etc), or indeed any locally assigned identifier that acts as a key, is that (as described in the blog post) you need prior knowledge of the short URL/identifier to use it.
The only 'identifier' our authors know for a website is its URL - and it seems contrary for us to ask them to use something else. I'll need to think about Nate's point - is this common or an edge case? Is there any other approach we could take?

Eric Hellman President, Gluejar, Inc. 41 Watchung Plaza, #132 Montclair, NJ 07042 USA e...@hellman.net http://go-to-hellman.blogspot.com/

The Open University is incorporated by Royal Charter (RC 000391), an exempt charity in England & Wales and a charity registered in Scotland (SC 038302).
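For what it's worth, the lookup Owen describes (the cited URI plus the date of access deciding where the user ends up, with the Internet Archive as a possible fallback) might be sketched roughly as follows; the redirect table and its move date are invented, and only the Wayback Machine URL pattern is real:

    # Sketch only: pick a resolution target from the cited URI plus the date
    # of access. The REDIRECTS table (and its move date) is invented.
    REDIRECTS = {
        # cited URI -> (date the content moved away, where it lives now)
        "http://www.bbc.co.uk": ("2009-12-31", "http://archive.bbc.co.uk"),
    }

    def resolve(cited_uri, accessed):
        """accessed is the ISO date of access from the reference, e.g. '2009-08-24'."""
        moved_on, new_home = REDIRECTS.get(cited_uri, (None, None))
        if moved_on and accessed <= moved_on:
            return new_home   # the citation predates the move, so it means the old content
        return cited_uri      # no known move: the cited URI is still the URL

    def wayback(cited_uri, accessed):
        # Fallback: the Internet Archive copy nearest the date of access
        # (the Wayback Machine accepts a partial yyyymmdd timestamp).
        return "http://web.archive.org/web/%s/%s" % (accessed.replace("-", ""), cited_uri)

Course A's 2009 citation would be sent to the archived location, while a later citation of the same URI, with a later access date, would pass straight through.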
[CODE4LIB] Results from Institutional Identifiers in Repositories Survey
Greetings, The NISO I2 Working Group surveyed repository managers and developers about current practices and needs of the repository community around institutional identifiers. Results from the survey will inform a set of use cases that are expected to drive the development of a draft standard for institutional identifiers. A report on the results of the survey is now available to the public: http://bit.ly/14hWly Feedback from the repository community is most welcome. It may be sent to our public i2info mailing list -- http://www.niso.org/lists/i2info/ -- or directly to me. Thanks, -Mike Co-chair, Repositories scenario, NISO I2 Working Group
[CODE4LIB] indexing pdf files
I have been having fun recently indexing PDF files.

For the past six months or so I have been keeping the articles I've read in a pile, and I was rather amazed at the size of the pile. It was about a foot tall. When I read these articles I actively read them -- meaning, I write, scribble, highlight, and annotate the text with my own special notation denoting names, keywords, definitions, citations, quotations, list items, examples, etc. This active reading process: 1) makes for better comprehension on my part, and 2) makes the articles easier to review and pick out the ideas I thought were salient. Being the librarian I am, I thought it might be cool (kewl) to make the articles into a collection. Thus, the beginnings of Highlights & Annotations: A Value-Added Reading List.

The techno-weenie process for creating and maintaining the content is something this community might find interesting:

1. Print article and read it actively.
2. Convert the printed article into a PDF file -- complete with embedded OCR -- with my handy-dandy ScanSnap scanner. [1]
3. Use MyLibrary to create metadata (author, title, date published, date read, note, keywords, facet/term combinations, local and remote URLs, etc.) describing the article. [2]
4. Save the PDF to my file system.
5. Use pdftotext to extract the OCRed text from the PDF and index it along with the MyLibrary metadata using Solr. [3, 4]
6. Provide a searchable/browsable user interface to the collection through a mod_perl module. [5, 6]

Software is never done, and if it were then it would be called hardware. Accordingly, I know there are some things I need to do before I can truly deem the system version 1.0. At the same time my excitement is overflowing and I thought I'd share some geekdom with my fellow hackers. Fun with PDF files and open source software.

[1] ScanSnap - http://tinyurl.com/oafgwe
[2] MyLibrary screen dump - http://infomotions.com/tmp/mylibrary.png
[3] pdftotext - http://www.foolabs.com/xpdf/
[4] Solr - http://lucene.apache.org/solr/
[5] module source code - http://infomotions.com/highlights/Highlights.pl
[6] user interface - http://infomotions.com/highlights/highlights.cgi

-- Eric Lease Morgan Head, Digital Access and Information Architecture Department Hesburgh Libraries, University of Notre Dame (574) 631-8604
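In case anyone wants a feel for step 5, here is a minimal sketch of extracting the OCRed text with pdftotext and posting it to Solr; the Solr URL, core name and field names are my own guesses rather than the actual MyLibrary/Highlights schema, and it assumes a Solr recent enough to accept JSON updates:

    # Sketch only, approximating step 5: extract the OCRed text with pdftotext
    # and post it to Solr. URL, core and field names are assumptions.
    import json
    import subprocess
    import urllib.request

    SOLR_UPDATE = "http://localhost:8983/solr/mycore/update?commit=true"  # assumed

    def index_pdf(pdf_path, doc_id, metadata):
        # "pdftotext file.pdf -" writes the extracted text to stdout
        text = subprocess.run(
            ["pdftotext", pdf_path, "-"],
            check=True, capture_output=True, text=True,
        ).stdout
        doc = dict(metadata, id=doc_id, fulltext=text)
        req = urllib.request.Request(
            SOLR_UPDATE,
            data=json.dumps([doc]).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req).read()

    index_pdf("article.pdf", "highlights-001",
              {"title": "An actively read article", "creator": "Somebody"})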
Re: [CODE4LIB] Implementing OpenURL for simple web resources
you could force a timestamp if people don't include a date. and I like the idea of going to the Internet Archive copy of a website, because then you're not having to get into the business of handling www.bbc.co.uk differently than cnn.com and someblog.org. i also like the idea of using a redirect.

you could theoretically write a source parser (i'm assuming you're using SFX based on what you said about bX) that says: if my rfr_id = mylocalid and the item is a website, bypass everything and just direct to the internet archive copy of the website. (however you choose to identify the website... if you're writing your own source parser you could put "website" in the rft_genre - it's not technically in the metadata format, but you just want your source parser to forward the url on anyway, so the link resolver isn't actually going to do anything with it.)

all of this is of course kind of hackish... but really, isn't the whole thing hackish? there were a few source parsers that would be good models for writing something like this, but i have no idea if they still exist because i haven't looked at the back end of sfx in about a year.

On Tue, Sep 15, 2009 at 5:12 AM, O.Stephens o.steph...@open.ac.uk wrote: I agree with this Rosalyn. [...]
Re: [CODE4LIB] indexing pdf files
Eric, I have librarians that would kill for this. In fact I was talking to one about it the other day. She felt there must be a way to handle active reading and make it portable. This would be great in conjunction with RefWorks or Zotero or something along those lines. Rosalyn

On Tue, Sep 15, 2009 at 9:31 AM, Eric Lease Morgan emor...@nd.edu wrote: I have been having fun recently indexing PDF files. [...]
Re: [CODE4LIB] Implementing OpenURL for simple web resources
Thanks Rosalyn,

As you say we could push a custom value into rft_genre. I'm a bit torn on this, as I guess I'm trying to do something that isn't 'hacky' - or at least not from the OpenURL end of it. It might be that this is just wishful thinking, and that I'm just trying to fool myself into thinking I'm 'sticking to the standard' when the likelihood of what I'm doing being transferable to other scenarios is zero (although Eric's comments make me hope not).

Yes, we are using SFX. What I'm proposing on the SFX end, as the path of least resistance, is writing a source parser for our learning environment which can do a 'fetch' for an alternative URL, or use the primary URL, and put it in an SFX internal field, rft_856. We can then use the existing Target Parser 856_URL, which displays the contents of rft_856 in the menu. Combined with some logic which forces this as the only option under certain circumstances, we can then push the user directly to the resulting URL.

Owen

Owen Stephens TELSTAR Project Manager Library and Learning Resources Centre The Open University Walton Hall Milton Keynes, MK7 6AA T: +44 (0) 1908 858701 F: +44 (0) 1908 653571 E: o.steph...@open.ac.uk

-Original Message- From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Rosalyn Metz Sent: 15 September 2009 14:42 To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Implementing OpenURL for simple web resources

you could force a timestamp if people don't include a date. [...]
Re: [CODE4LIB] indexing pdf files
Eric,

5. Use pdftotext to extract the OCRed text from the PDF and index it along with the MyLibrary metadata using Solr. [3, 4]

Have you considered using Solr's ExtractingRequestHandler [1] for the PDFs? We're using it at NYPL with pretty great success.

[1] http://wiki.apache.org/solr/ExtractingRequestHandler

Mark A. Matienzo Applications Developer, Digital Experience Group The New York Public Library
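For comparison with the pdftotext route, a bare-bones call to the extract handler might look like the following, assuming Solr Cell is enabled at the default /update/extract path in solrconfig.xml; the core name and document id are made up:

    # Sketch only: push a PDF through Solr's ExtractingRequestHandler
    # (Solr Cell). Core name and literal.id are assumptions.
    import urllib.request
    from urllib.parse import urlencode

    params = urlencode({"literal.id": "highlights-001", "commit": "true"})
    url = "http://localhost:8983/solr/mycore/update/extract?" + params

    with open("article.pdf", "rb") as f:
        req = urllib.request.Request(
            url,
            data=f.read(),
            headers={"Content-Type": "application/pdf"},
        )
        urllib.request.urlopen(req).read()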
Re: [CODE4LIB] indexing pdf files
Hi all,

I would like to suggest an API for extracting text (including highlighted or annotated text) from PDF: iText (http://www.lowagie.com/iText/). This is a Java API (with a C# port), and it helped me a lot when we worked with extraordinary PDF files.

Solr uses Tika (http://lucene.apache.org/tika) for extracting text from documents, and Tika uses PDFBox (http://incubator.apache.org/pdfbox/) to extract from PDF files. It is a great tool for normal PDF files, but it has (or at least had) some features I wasn't satisfied with:

- it consumed more memory than iText, and couldn't read files above a given size (this was large, about 1 GB, but we had even larger files)
- it couldn't correctly handle conditional hyphens at the end of a line
- it had poorer documentation than iText, and its API was also poorer (by that time Manning had published the iText in Action book)

Our PDF files were double layered (original hi-res image + OCRed text), documents several thousand pages long (Hungarian scientific journals, the diary of the Houses of Parliament from the 19th century, etc.). We indexed the content with Lucene, and in the UI we showed one page per screen, so the user didn't need to download the full PDF. We extracted the table of contents from the PDF as well and implemented it in the web UI, so the user can browse pages according to the full file's TOC.

This project happened two years ago, so it is possible that a lot has changed since then.

Király Péter http://eXtensibleCatalog.org

- Original Message - From: Mark A. Matienzo m...@matienzo.org To: CODE4LIB@LISTSERV.ND.EDU Sent: Tuesday, September 15, 2009 3:56 PM Subject: Re: [CODE4LIB] indexing pdf files

Eric, have you considered using Solr's ExtractingRequestHandler [1] for the PDFs? [...]
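Not the iText/Java setup Péter describes, but for anyone following along, the page-per-document idea can be illustrated cheaply with pdftotext's -f/-l (first/last page) options, one extraction call per page:

    # Illustration only (pdftotext rather than the iText/Java route above):
    # pull one page at a time with -f/-l so each page can be indexed and
    # displayed as its own document.
    import subprocess

    def page_text(pdf_path, page):
        return subprocess.run(
            ["pdftotext", "-f", str(page), "-l", str(page), pdf_path, "-"],
            check=True, capture_output=True, text=True,
        ).stdout

    for page in range(1, 11):
        text = page_text("journal-volume.pdf", page)
        # hand `text` to the indexer of your choice, one document per page
        print(page, len(text))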
Re: [CODE4LIB] indexing pdf files
My (much more primitive) version of the same thing involves reading and annotating articles using my Tablet PC. Although I do get a variety of print publications, I find I don't tend to annotate them as much anymore. I used to use EndNote to do the metadata, then I switched to Zotero. I hadn't thought to try to create a full-text search of the articles -- hmm.

-- Danielle Cunniff Plumer, Coordinator Texas Heritage Digitization Initiative Texas State Library and Archives Commission 512.463.5852 (phone) / 512.936.2306 (fax) dplu...@tsl.state.tx.us dcplu...@gmail.com

On Tue, Sep 15, 2009 at 8:31 AM, Eric Lease Morgan emor...@nd.edu wrote: I have been having fun recently indexing PDF files. [...]
Re: [CODE4LIB] Implementing OpenURL for simple web resources
Owen,

I might have missed it in this message -- my eyes are starting to glaze over at this point in the thread -- but can you describe how the input of these resources would work? What I'm basically asking is: what would the professor need to do to add a new citation for a 70-year-old book; a journal on PubMed; a URL to CiteSeer? How does their input make it into your database?

-Ross.

On Tue, Sep 15, 2009 at 5:04 AM, O.Stephens o.steph...@open.ac.uk wrote: True. How, from the OpenURL, are you going to know that the rft is meant to represent a website? [...]
Re: [CODE4LIB] Implementing OpenURL for simple web resources
Ross - no you didn't miss it. There are 3 ways that references might be added to the learning environment:

1. An author (or realistically a proxy on behalf of the author) can insert a reference into a structured Word document from an RIS file. This structured document (XML) then goes through a 'publication' process which pushes the content to the learning environment (Moodle), including rendering the references from RIS format into a specified style, with links.
2. An author/librarian/other can import references to a 'resources' area in our learning environment (Moodle) from an RIS file.
3. An author/librarian/other can subscribe to an RSS feed from a RefWorks 'RefShare' folder within the 'resources' area of the learning environment.

In general the project is focussing on the use of RefWorks - so although the RIS files could be created by any suitable s/w, we are looking specifically at RefWorks. How you get the reference into RefWorks is something we are looking at currently. The best approach varies depending on the type of material:

- For websites it looks like the 'RefGrab-it' bookmarklet/browser plugin (depending on your browser) is the easiest way of capturing website details.
- For books, probably a union catalogue search from within RefWorks.
- For journal articles, probably a federated search engine (SS 360 is what we've got).

Any of these could be entered by hand of course, as could several other kinds of reference. Entering the references into RefWorks could be done by an author, but it is more likely to be done by a member of clerical staff or a librarian/library assistant.

Owen

Owen Stephens TELSTAR Project Manager Library and Learning Resources Centre The Open University Walton Hall Milton Keynes, MK7 6AA T: +44 (0) 1908 858701 F: +44 (0) 1908 653571 E: o.steph...@open.ac.uk

-Original Message- From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Ross Singer Sent: 15 September 2009 15:56 To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Implementing OpenURL for simple web resources

Owen, I might have missed it in this message -- my eyes are starting to glaze over at this point in the thread -- but can you describe how the input of these resources would work? [...]
Re: [CODE4LIB] Implementing OpenURL for simple web resources
A suggestion on how to get a prof to enter a URL. I use this bookmarklet to add a URL to Hacker News:

javascript:window.location=%22http://news.ycombinator.com/submitlink?u=%22+encodeURIComponent(document.location)+%22&t=%22+encodeURIComponent(document.title)

I'm tempted to suggest an API based on OpenURL, but I fear the 10 emails it would provoke.

On Sep 15, 2009, at 10:56 AM, Ross Singer wrote: Owen, I might have missed it in this message -- my eyes are starting to glaze over at this point in the thread -- but can you describe how the input of these resources would work? [...]
Re: [CODE4LIB] Implementing OpenURL for simple web resources
O.Stephens wrote: True. How, from the OpenURL, are you going to know that the rft is meant to represent a website? I guess that was part of my question. [...]

Yeah, I don't think there IS any good way to do this. Well, wait, okay, you could use a DC metadata package and try to convey "web site" in dc.type. The OpenURL dc.type is _recommended_ to use a term from the DCMI Type vocabulary, but that only lets you say something like it's an InteractiveResource or Text or Software. Unless InteractiveResource is sufficient to convey what you need, you could disregard the suggestion (not requirement) that the OpenURL DC metadata schema's type element contain a DCMI Type vocabulary term, and just put something else there: "Website". If you want to go this route, probably mint a URI (perhaps using purl.org) so you can put an actual URI there, instead of a string literal, to represent "Website".

Now, you've still wound up with something that is somewhat local/custom, that other resolvers are not going to understand. But frankly, I think anything you're going to wind up with is something that you aren't going to be able to trust arbitrary resolvers in the wild to do anything in particular with. Which may not be a requirement for you anyway.

(Which is why I personally find a new OpenURL metadata format to be a complete non-starter. I don't think OpenURL's abstract core actually provides much practical benefit; a new metadata format might as well be an entirely new standard, for all the practical benefit you get from it. Other link resolvers that aren't yours are unlikely to ever do anything with your new format, and if they do, whoever implements it is going to have almost as much work to do as if it hadn't been OpenURL at all. If I wanted a really abstract metadata framework to create a new profile/schema on top of, I'd choose DCMI, not OpenURL. DCMI is also so abstract that it doesn't make sense to just say "my app can take DCMI" (just like it doesn't make any sense to say "my app can take OpenURL" -- it's all about the profiles/schemas). But at least DCMI is a lot more flexible, and still has an active body of people working on maintaining, developing and adopting it.)

Jonathan
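Concretely, the kind of context object Jonathan is describing might look something like this; the referrer id, the resolver address and the purl standing in for "website" are all invented for the example:

    # Sketch only: an OpenURL carrying a Dublin Core metadata package whose
    # type element holds a locally minted 'website' value. The referrer id,
    # resolver address and the purl are all invented.
    from urllib.parse import urlencode

    kev = urlencode([
        ("url_ver", "Z39.88-2004"),
        ("rfr_id", "info:sid/example.ac.uk:courselinks"),       # placeholder
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:dc"),             # the OpenURL DC format
        ("rft.identifier", "http://www.bbc.co.uk"),
        ("rft.title", "BBC Homepage"),
        ("rft.type", "http://purl.example.org/types/website"),  # local term, not a DCMI Type
    ])
    print("http://openurl.example.ac.uk/resolver?" + kev)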
Re: [CODE4LIB] Implementing OpenURL for simple web resources
Wait, are you really going to try to do this with _SFX_ too? I missed that part. Oh boy. Seriously, I think you are in for a world of painful hacky kludge.

Rosalyn Metz wrote: Owen,

The reason I suggest a source parser rather than a target parser is that handling the OpenURL based on the source, rather than the target, shaves a bit of time off. Attached is a slide i created (back in the day when it was my job to create such slides... no, i don't sit around in my hole creating slides because i'm bored... although.) that shows the process an OpenURL goes through. So the source parser in this example would come into play before the OpenURL metadata hits the SFX KB. It would bypass the bottom half of the slide completely and reduce any weird formatting that SFX might try to do to the metadata with a value like "website" (if you tell SFX you're looking for an article but you're really looking for a book it sometimes ignores metadata unrelated to an article even though you might actually need it). if you never let it get to that point, then you don't need to worry about that feature coming into play.

Source parsers aren't used as frequently as they once were, but they used to be a way to retrieve more metadata from databases that didn't create useful OpenURLs (not that many vendors create useful OpenURLs now...). but if you go a hackish route you could use a source parser like a redirect rather than using it to fetch more metadata.

If none of this makes sense let me know and i can try to describe it better off list so as not to bore people into oblivion.

Rosalyn

On Tue, Sep 15, 2009 at 9:52 AM, O.Stephens o.steph...@open.ac.uk wrote: Thanks Rosalyn, As you say we could push a custom value into rft_genre. [...]
Re: [CODE4LIB] Implementing OpenURL for simple web resources
O.Stephens wrote: Thanks Rosalyn, As you say we could push a custom value into rft_genre. I'm a bit torn on this, as I guess I'm trying to do something that isn't 'hacky' [...]

Heh, that is my opinion. Everything I've ever tried to do with OpenURL that isn't part of the original 0.1 use case has ended up very hacky, despite my best efforts.
Re: [CODE4LIB] Implementing OpenURL for simple web resources
Do you think? I reckon it is just a few lines of code in a custom source parser... It only needs to:

1. Check rft_id contains an http URI (regexp)
2. Define a fetch ID based on this URI (possibly + date/other metadata)
3. Get a URL (or null) from a lookup service
4. Insert the URL, or the rft_id value, into rft_856

Simple! (Rough sketch below.)

Owen

Owen Stephens TELSTAR Project Manager Library and Learning Resources Centre The Open University Walton Hall Milton Keynes, MK7 6AA T: +44 (0) 1908 858701 F: +44 (0) 1908 653571 E: o.steph...@open.ac.uk

-Original Message- From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Jonathan Rochkind Sent: 15 September 2009 16:30 To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Implementing OpenURL for simple web resources

Wait, are you really going to try to do this with _SFX_ too? I missed that part. Oh boy. Seriously, I think you are in for a world of painful hacky kludge.

Rosalyn Metz wrote: Owen, The reason I suggest a source parser rather than a target parser is that handling the OpenURL based on the source, rather than the target, shaves a bit of time off. [...]
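As a plain illustration of the four steps Owen lists above (not actual SFX source-parser code, and with an entirely hypothetical lookup service):

    # Plain illustration of the four steps; the lookup service URL is invented.
    import re
    import urllib.request
    from urllib.parse import quote

    LOOKUP = "http://resolver.example.ac.uk/lookup?uri=%s&date=%s"  # hypothetical

    def resolve_website_referent(rft_id, access_date):
        # 1. Check rft_id contains an http URI
        if not re.match(r"^https?://", rft_id or ""):
            return None
        # 2. + 3. The fetch key is the URI plus the date of access; ask the
        #    lookup service for a replacement URL (empty response = none known)
        url = LOOKUP % (quote(rft_id, safe=""), access_date)
        with urllib.request.urlopen(url) as resp:
            replacement = resp.read().decode("utf-8").strip()
        # 4. Use the replacement if there is one, otherwise fall back to rft_id
        return replacement or rft_id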
Re: [CODE4LIB] Implementing OpenURL for simple web resources
Given that the burden of creating these links is entirely on RefWorks and Telstar, OpenURL seems as good a choice as anything (since anything would require some other service, anyway). As long as the profs aren't expected to mess with it, I'm not sure that *how* you do the indirection matters all that much and, as you say, there are added bonuses to keeping it within SFX.

It seems to me, though, that your rft_id should be a URI to the db you're using to store their references, so your CTX would look something like:

http://res.open.ac.uk/?rfr_id=info:/telstar.open.ac.uk&rft_id=http://telstar.open.ac.uk/1234&dc.identifier=http://bbc.uk.co/ # not url encoded because I have, you know, a life.

I can't remember if you can include both metadata-by-reference keys and metadata-by-value, but you could have by-reference (rft_ref=http://telstar.open.ac.uk/1234&rft_ref_fmt=RIS or something) point at your citation db to return a formatted citation. This way your citations are unique -- somebody pointing at today's London Times frontpage isn't the same as somebody else's on a different day.

While I'm shocked that I agree with using OpenURL for this, it seems as reasonable as any other solution. That being said, unless you can definitely offer some other service besides linking to the resource, I'd avoid the resolver menu completely. -Ross.

On Tue, Sep 15, 2009 at 11:17 AM, O.Stephens o.steph...@open.ac.uk wrote:

Ross - no you didn't miss it. There are 3 ways that references might be added to the learning environment:

An author (or realistically a proxy on behalf of the author) can insert a reference into a structured Word document from an RIS file. This structured document (XML) then goes through a 'publication' process which pushes the content to the learning environment (Moodle), including rendering the references from RIS format into a specified style, with links.

An author/librarian/other can import references to a 'resources' area in our learning environment (Moodle) from a RIS file.

An author/librarian/other can subscribe to an RSS feed from a RefWorks 'RefShare' folder within the 'resources' area of the learning environment.

In general the project is focussing on the use of RefWorks - so although the RIS files could be created by any suitable s/w, we are looking specifically at RefWorks. How you get the reference into RefWorks is something we are looking at currently. The best approach varies depending on the type of material you are looking at: For websites it looks like the 'RefGrab-it' bookmarklet/browser plugin (depending on your browser) is the easiest way of capturing website details.
For books, probably a Union catalogue search from within RefWorks. For journal articles, probably a Federated search engine (SS 360 is what we've got). Any of these could be entered by hand of course, as could several other kinds of reference. Entering the references into RefWorks could be done by an author, but it is more likely to be done by a member of clerical staff or a librarian/library assistant.

Owen

Owen Stephens TELSTAR Project Manager Library and Learning Resources Centre The Open University Walton Hall Milton Keynes, MK7 6AA T: +44 (0) 1908 858701 F: +44 (0) 1908 653571 E: o.steph...@open.ac.uk

-Original Message- From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Ross Singer Sent: 15 September 2009 15:56 To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Implementing OpenURL for simple web resources

Owen, I might have missed it in this message -- my eyes are starting to glaze over at this point in the thread, but can you describe how the input of these resources would work? What I'm basically asking is -- what would the professor need to do to add a new: citation for a 70 year old book; journal on PubMed; URL to CiteSeer? How does their input make it into your database? -Ross.

On Tue, Sep 15, 2009 at 5:04 AM, O.Stephens o.steph...@open.ac.uk wrote: […]
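For what it's worth, the ContextObject Ross sketches above (local citation-db URI in rft_id, the URL as cited in dc.identifier) is straightforward to build with proper encoding. This is an illustration only: the rfr_id value is copied verbatim from his example, and the helper name and example values are made up.

// Building the ContextObject Ross describes: rft_id points at the local
// citation database record, dc.identifier carries the URL as cited.
// The rfr_id value is copied from his example; buildOpenUrl itself is made up.
function buildOpenUrl(resolverBase: string, citationRecord: string, citedUrl: string): string {
  const params = new URLSearchParams({
    rfr_id: "info:/telstar.open.ac.uk",
    rft_id: citationRecord,     // e.g. http://telstar.open.ac.uk/1234
    "dc.identifier": citedUrl,  // e.g. http://bbc.co.uk/
  });
  return `${resolverBase}?${params.toString()}`;
}

// Properly URL-encoded version of the hand-written example above.
console.log(buildOpenUrl("http://res.open.ac.uk/", "http://telstar.open.ac.uk/1234", "http://bbc.co.uk/"));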
Re: [CODE4LIB] Implementing OpenURL for simple web resources
Oh yeah, one thing I left off -- In Moodle, it would probably make sense to link to the URL in the a tag:

<a href="http://bbc.co.uk/">The Beeb!</a>

but use a javascript onMouseDown action to rewrite the link to route through your funky link resolver path, a la Google. That way, the page works like any normal webpage, right mouse click -> Copy Link Location gives the user the real URL to copy and paste, but normal behavior funnels through the link resolver. -Ross.

On Tue, Sep 15, 2009 at 11:41 AM, Ross Singer rossfsin...@gmail.com wrote: […]
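Ross's onMouseDown idea might look something like the following in the rendered Moodle page. This is a sketch only: the resolver base URL and the data-citation-id attribute are assumptions, and a real implementation would need to decide how the citation identifier gets into the markup.

// Sketch of the mousedown rewrite: the anchor keeps the real URL, but just
// before the browser follows a click the href is swapped for the resolver
// route. Resolver base URL and data-citation-id attribute are assumptions.
const RESOLVER_BASE = "http://res.open.ac.uk/";

document.querySelectorAll<HTMLAnchorElement>("a[data-citation-id]").forEach((link) => {
  const citedUrl = link.href; // the URL as cited, kept so copy-and-paste stays honest
  link.addEventListener("mousedown", () => {
    const params = new URLSearchParams({
      rft_id: link.dataset.citationId ?? "", // local citation record URI
      "dc.identifier": citedUrl,             // the URL as cited
    });
    link.href = `${RESOLVER_BASE}?${params.toString()}`;
  });
});

The design point is the one Ross makes: the href stays the real address until the user actually clicks, so copy-link and hover still show the cited URL, while a normal click is silently routed through the resolver.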
[CODE4LIB] Fall Internships at WGBH Media Library Archives
Greetings colleagues! We have two opportunities for 2-3 interns at the WGBH Media Library Archives! Please forgive the cross postings and do not respond to me, but send a resume and a statement of interest by email to: human_resour...@wgbh.org or by mail to: WGBH Educational Foundation Human Resources Department One Guest Street Boston, MA 02135. Please forward to any interested parties! Thank you! Courtney.

Digital Library Projects Internship http://careers.wgbh.org/internships/internships/mla_digital_library.html

The WGBH Media Library Archives has opportunities for undergraduate and graduate students to work in a film and media production archives. Come and learn what happens to all the materials that went into that FRONTLINE you saw after it aired on TV. Digital library interns will work with the Project Manager and Production Assistant to make archival media materials accessible online for two ongoing pilot projects, the CPB American Archive project and the Mellon Digital Library project. The CPB American Archive project will focus on Civil Rights Movement content. Funded by CPB, the American Archive will eventually be a national archive of PBS media materials. The Mellon Digital Library project uses foreign policy and the history of science content, and focuses on scholarly use of archival media material online. Interns will get hands-on experience preparing archival media for web access by digitizing materials, applying metadata, and encoding transcripts. This is an opportunity to learn moving image digitization for preservation and access, the PBCore metadata schema (pbcore.org) and the TEI XML schema (tei-c.org/).

Electronic Records Internship http://careers.wgbh.org/internships/internships/mla_records.html

The WGBH Media Library Archives has opportunities for undergraduate and graduate students to work in a film and media production archives. Come and use your electronic records management knowledge in a real world setting. The Electronic Records Management interns will work with both the Program Shutdown Manager and the Digital Archives Manager. They will review electronic original interview transcripts that have been delivered to the Media Library and Archives by productions (such as Frontline or Nova) to standardize names and correct any inconsistencies. This may require some research skills to identify exactly who a particular interviewee is and, where applicable, the position held at the time of the interview. This will require embedding the interviewee information within the document header and linking the transcript back to the physical tape holdings. The position will work to standardize naming conventions for interview transcripts, and create a suitable electronic workflow, prior to upload to the WGBH digital asset management system. Training will be given in this Artesia-based application. The position requires excellent skills in reviewing and correcting information metadata. Familiarity with online search engines, Library of Congress Authorities and other online resources is recommended, as is an attention to detail.
Re: [CODE4LIB] Implementing OpenURL for simple web resources
I'm thinking about it :) Logically I think we can avoid this as we have the context based on the rfr_id, for which we are proposing rfr_id=info:sid/learn.open.ac.uk:[course code] (at the risk of more comment!), which seems to me equivalent. I guess it is just a matter of where you do the work, since in SFX we'll end up constructing a 'fetch' to the same location anyway. The amount of work involved to change it one way or the other is probably trivial though.

I'm not sure I agree that what I'm proposing puts 'random' URLs in the rft_id, although I do accept that this is a moot point if other resolvers don't do something useful with them (or worse, make incorrect assumptions about them) - perhaps this is something I could survey as part of the project... (although it's all moot if we are only doing this within an internal environment and no-one else ever does it!)

Owen

Owen Stephens TELSTAR Project Manager Library and Learning Resources Centre The Open University Walton Hall Milton Keynes, MK7 6AA T: +44 (0) 1908 858701 F: +44 (0) 1908 653571 E: o.steph...@open.ac.uk

-Original Message- From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Jonathan Rochkind Sent: 15 September 2009 16:52 To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Implementing OpenURL for simple web resources

I do like Ross's solution, if you really wanna use OpenURL. I'm much more comfortable with the idea of including a URI based on your own local service in rft_id than including any old public URL in rft_id. Then at least your link resolver can say if what's in rft_id begins with (eg) http://telstar.open.ac.uk/, THEN I know this is one of these purl type things, and I know that sending the user to it will result in a redirect to an end-user-appropriate access URL. Cause that's my concern with putting random URLs in rft_id, that there's no way to know if they are intended as end-user-appropriate access URLs or not, and in putting things in rft_id that aren't really good identifiers for the referent at all. But using your own local service ID, now you really DO have something that's appropriately considered a persistent identifier for the referent, AND you have a straightforward way to tell when the rft_id of this context is intended as an access URL. Jonathan

Ross Singer wrote: […]
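Jonathan's test is also easy to sketch on the resolver side: if rft_id starts with the local persistent-identifier prefix, treat the request as a PURL-style redirect; anything else falls through to normal resolution. The prefix, the redirect-following behaviour and the fallback path below are assumptions for illustration.

// Sketch of the resolver-side test: rft_id values that start with the local
// persistent-identifier prefix are treated as PURL-style redirects; everything
// else falls through to normal resolution. Prefix and fallback path are assumptions.
const LOCAL_PREFIX = "http://telstar.open.ac.uk/";

async function routeReferent(rftId: string): Promise<string> {
  if (rftId.startsWith(LOCAL_PREFIX)) {
    // Dereference the local identifier; it is expected to redirect to an
    // end-user-appropriate access URL.
    const res = await fetch(rftId, { redirect: "follow" });
    return res.url;
  }
  // Placeholder for the conventional link-resolver path.
  return `/resolve?rft_id=${encodeURIComponent(rftId)}`;
}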
Re: [CODE4LIB] Implementing OpenURL for simple web resources
I think using locally meaningful ids in rft_id is a misuse and a mistake. locally meaningful data should go in rft_dat, accompanied by rfr_id. just sayin'

On Sep 15, 2009, at 11:52 AM, Jonathan Rochkind wrote:

I do like Ross's solution, if you really wanna use OpenURL. I'm much more comfortable with the idea of including a URI based on your own local service in rft_id than including any old public URL in rft_id. Then at least your link resolver can say if what's in rft_id begins with (eg) http://telstar.open.ac.uk/, THEN I know this is one of these purl type things, and I know that sending the user to it will result in a redirect to an end-user-appropriate access URL. Cause that's my concern with putting random URLs in rft_id, that there's no way to know if they are intended as end-user-appropriate access URLs or not, and in putting things in rft_id that aren't really good identifiers for the referent at all. But using your own local service ID, now you really DO have something that's appropriately considered a persistent identifier for the referent, AND you have a straightforward way to tell when the rft_id of this context is intended as an access URL. Jonathan

Eric Hellman President, Gluejar, Inc. 41 Watchung Plaza, #132 Montclair, NJ 07042 USA e...@hellman.net http://go-to-hellman.blogspot.com/
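Eric's distinction can be shown as the two ways the same local record could travel in a KEV ContextObject. The values below are illustrative only; the course code in particular is a placeholder.

// The same local record carried two ways. Values are illustrative;
// the course code is a placeholder.
const asIdentifier = new URLSearchParams({
  rft_id: "http://telstar.open.ac.uk/1234",        // claims to be a generally resolvable identifier
});
const asPrivateData = new URLSearchParams({
  rfr_id: "info:sid/learn.open.ac.uk:COURSE-CODE", // who is asking
  rft_dat: "http://telstar.open.ac.uk/1234",       // private data, meaningful only to that referrer
});
console.log(asIdentifier.toString());
console.log(asPrivateData.toString());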
Re: [CODE4LIB] Implementing OpenURL for simple web resources
Yes, you can.

On Sep 15, 2009, at 11:41 AM, Ross Singer wrote:

I can't remember if you can include both metadata-by-reference keys and metadata-by-value, but you could have by-reference (rft_ref=http://telstar.open.ac.uk/1234&rft_ref_fmt=RIS or something) point at your citation db to return a formatted citation.

Eric Hellman President, Gluejar, Inc. 41 Watchung Plaza, #132 Montclair, NJ 07042 USA e...@hellman.net http://go-to-hellman.blogspot.com/
Re: [CODE4LIB] Implementing OpenURL for simple web resources
On Tue, Sep 15, 2009 at 12:06 PM, Eric Hellman e...@hellman.net wrote: Yes, you can.

In this case, I say punt on dc.identifier, throw the URL in rft_id (since, Eric, you had some concern regarding using the local id for this?) and let the real URL persistence/resolution work happen with the by-ref negotiation. -Ross.

On Sep 15, 2009, at 11:41 AM, Ross Singer wrote:

I can't remember if you can include both metadata-by-reference keys and metadata-by-value, but you could have by-reference (rft_ref=http://telstar.open.ac.uk/1234&rft_ref_fmt=RIS or something) point at your citation db to return a formatted citation.

Eric Hellman President, Gluejar, Inc. 41 Watchung Plaza, #132 Montclair, NJ 07042 USA e...@hellman.net http://go-to-hellman.blogspot.com/
Re: [CODE4LIB] Implementing OpenURL for simple web resources
Hi Owen, all: This is a very interesting problem.

At Tue, 15 Sep 2009 10:04:09 +0100, O.Stephens wrote:

[…]

The problem with the approach (as Nate and Eric mention) is that any approach that relies on the URI as an identifier (whether using OpenURL or a script) is going to have problems as the same URI could be used to identify different resources over time. I think Eric's suggestion of using additional information to help differentiate is worth looking at, but I suspect that this is going to cause us problems - although I'd say that it is likely to cause us much less work than the alternative, which is allocating every single reference to a web resource used in our course material its own persistent URL.

[…]

I might be misunderstanding you, but I think that you are leaving out the implicit dimension of time here - when was the URL referenced? What can we use to represent the tuple (URL, date), and how do we retrieve an appropriate representation of this tuple? Is the most appropriate representation the most recent version of the page, wherever it may have moved? Or is the most appropriate representation the page as it existed in the past? I would argue that the most appropriate representation would be the page as it existed in the past, not what the page looks like now - but I am biased, because I work in web archiving.

Unfortunately this is a problem that has not been very well addressed by the web architecture people, or the web archiving people. The web architecture people start from the assumption that http://example.org/ is the same resource which only varies in its representation as a function of time, not in its identity as a resource. The web archives people create closed systems and do not think about how to store and resolve the tuple (URL, date).

I know this doesn't help with your immediate problem, but I think these are important issues.

best, Erik Hetzner

;; Erik Hetzner, California Digital Library ;; gnupg key id: 1024D/01DB07E3
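One concrete way to act on Erik's (URL, date) point is to resolve the tuple against a web archive rather than the live web. The sketch below uses the Internet Archive's Wayback "availability" lookup; that endpoint and its response shape are my assumption about the service, not something discussed in the thread.

// Resolving the (URL, date) tuple against a web archive instead of the live
// web. The Wayback Machine "availability" endpoint and response shape used
// here are assumptions about that service.
async function archivedVersion(url: string, date: string /* YYYYMMDD */): Promise<string | null> {
  const res = await fetch(
    `https://archive.org/wayback/available?url=${encodeURIComponent(url)}&timestamp=${date}`
  );
  if (!res.ok) return null;
  const data = await res.json();
  return data?.archived_snapshots?.closest?.url ?? null;
}

// Example: the page as it stood around the day the reference was made.
archivedVersion("http://bbc.co.uk/", "20090915").then((u) => console.log(u ?? "no snapshot found"));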
Re: [CODE4LIB] indexing pdf files
Here's a post on how easy it is to send PDF documents to Solr from Java: http://www.lucidimagination.com/blog/2009/09/14/posting-rich-documents-to-apache-solr-using-solrj-and-solr-cell-apache-tika/

Not only can you post PDF (and other rich content) files to Solr for indexing, you can also, as shown in that blog entry, extract the text from such files and have it returned to the client. This Solr capability makes the tool chain a bit simpler. Erik

On Sep 15, 2009, at 10:31 AM, Peter Kiraly wrote:

Hi all, I would like to suggest an API for extracting text (including highlighted or annotated text) from PDF: iText (http://www.lowagie.com/iText/). This is a Java API (it has a C# port), and it helped me a lot when we worked with extraordinary PDF files. Solr uses Tika (http://lucene.apache.org/tika) for extracting text from documents, and Tika uses PDFBox (http://incubator.apache.org/pdfbox/) to extract from PDF files. It is a great tool for normal PDF files, but it has (or at least had) some features which I wasn't satisfied with:

- it consumed more memory compared with iText, and couldn't read files above a given size (this was large, about 1 GB, but we had even larger files)
- it couldn't correctly handle the conditional hyphens at the end of lines
- it had poorer documentation than iText, and its API was also poorer (by that time Manning had published the iText in Action book)

Our PDF files were double layered (original hi-res image + OCR-ed text), documents several thousand pages long (Hungarian scientific journals, the diary of the Houses of Parliament from the 19th century etc.). We indexed the content with Lucene, and in the UI we showed one page per screen, so the user didn't need to download the full PDF. We extracted the Table of Contents from the PDF as well, and we implemented it in the web UI, so the user can browse pages according to the full file's TOC. This project happened two years ago, so it is possible that lots of things have changed since then.

Király Péter http://eXtensibleCatalog.org

- Original Message - From: Mark A. Matienzo m...@matienzo.org To: CODE4LIB@LISTSERV.ND.EDU Sent: Tuesday, September 15, 2009 3:56 PM Subject: Re: [CODE4LIB] indexing pdf files

Eric,

5. Use pdftotext to extract the OCRed text from the PDF and index it along with the MyLibrary metadata using Solr. [3, 4]

Have you considered using Solr's ExtractingRequestHandler [1] for the PDFs? We're using it at NYPL with pretty great success. [1] http://wiki.apache.org/solr/ExtractingRequestHandler

Mark A. Matienzo Applications Developer, Digital Experience Group The New York Public Library
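For comparison with the SolrJ/Solr Cell post above, here is roughly what the ExtractingRequestHandler route Mark mentions looks like from any HTTP client. The core URL, field mapping and file path below are assumptions; the wiki page he links has the authoritative parameter list.

// Posting a PDF straight to Solr's ExtractingRequestHandler so Tika does the
// text extraction server-side. Core URL, field mapping and file path are
// assumptions; see the wiki page linked above for the real parameter list.
import { readFile } from "node:fs/promises";

async function indexPdf(solrBase: string, id: string, path: string): Promise<void> {
  const pdf = await readFile(path);
  const params = new URLSearchParams({
    "literal.id": id,        // stored identifier for this document
    "fmap.content": "text",  // map the extracted body text into the "text" field
    commit: "true",
  });
  const res = await fetch(`${solrBase}/update/extract?${params.toString()}`, {
    method: "POST",
    headers: { "Content-Type": "application/pdf" },
    body: pdf,
  });
  if (!res.ok) throw new Error(`Solr returned HTTP ${res.status}`);
}

indexPdf("http://localhost:8983/solr", "ocr-journal-0001", "./ocr-journal.pdf").catch(console.error);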
Re: [CODE4LIB] Implementing OpenURL for simple web resources
The process by which a URI comes to identify something other than the stuff you get by resolving it can be mysterious - I've blogged about it a bit: http://go-to-hellman.blogspot.com/2009/07/illusion-of-internet-identity.html

In the case of worldcat or google, it's fame. If you think a URI can be usable outside your institution for identification purposes, and your institution can maintain some sort of identification machinery as long as the OpenURL is expected to be useful, then it's fine to use it in rft_id. If you intend the URI to connote identity only in the context that you're building URLs for, then use rft_dat, which is there for exactly that purpose.

On Sep 15, 2009, at 12:17 PM, Jonathan Rochkind wrote:

If it's a URI that is indeed an identifier that unambiguously identifies the referent, as the standard says... I don't see how that's inappropriate in rft_id. Isn't that what it's for? I mentioned before that I put things like http://catalog.library.jhu.edu/bib/1234 in my rft_ids. Putting http://somewhere.edu/our-purl-server/1234 in rft_id seems very analogous to me. Both seem appropriate. I'm not sure what makes a URI locally meaningful or not. What makes http://www.worldcat.org/bibID or http://books.google.com/book?id=foo globally meaningful but http://catalog.library.jhu.edu/bib/1234 or http://somewhere.edu/our-purl-server/1234 locally meaningful? If it's a URI that is reasonably persistent and unambiguously identifies the referent, then it's an identifier and is appropriate for rft_id, says me. Jonathan

Eric Hellman wrote: […]

Eric Hellman President, Gluejar, Inc. 41 Watchung Plaza, #132 Montclair, NJ 07042 USA e...@hellman.net http://go-to-hellman.blogspot.com/