Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread O.Stephens
True. How, from the OpenURL, are you going to know that the rft is meant
to represent a website?
I guess that was part of my question. But no one has suggested defining a new 
metadata profile for websites (which I probably would avoid tbh). DC doesn't 
seem to offer a nice way of doing this (that is saying 'this is a website'), 
although there are perhaps some bits and pieces (format, type) that could be 
used to give some indication (but I suspect not unambiguously)

But I still think what you want is simply a purl server. What makes you
think you want OpenURL in the first place?  But I still don't really
understand what you're trying to do: deliver consistency of approach
across all our references -- so are you using OpenURL for it's more
conventional use too, but you want to tack on a purl-like
functionality to the same software that's doing something more like a
conventional link resolver?  I don't completely understand your use case.

I wouldn't use OpenURL just to get a persistent URL - I'd almost certainly look 
at PURL for this. But, I want something slightly different. I want our course 
authors to be able to use whatever URL they know for a resource, but still try 
to ensure that the link works persistently over time. I don't think it is 
reasonable for a user to have to know a 'special' URL for a resource - and this 
approach means establishing a PURL for all resources used in our teaching 
material whether or not it moves in the future - which is an overhead it would 
be nice to avoid.

You can hit delete now if you aren't interested, but ...

... perhaps if I just say a little more about the project I'm working on it may 
clarify...

The project I'm working on is concerned with referencing and citation. We are 
looking at how references appear in teaching material (esp. online) and how 
they can be reused by students in their personal environment (in essays, later 
study, or something else). The references that appear can be to anything - 
books, chapters, journals, articles, etc. Increasingly of course there are 
references to web-based materials.

For print material, references generally describe the resource and nothing 
more, but for digital material references are expected not only to describe the 
resource, but also state a route of access to the resource. This tends to be a 
bad idea when (for example) referencing e-journals, as we know the problems 
that surround this - many different routes of access to the same item. OpenURLs 
work well in this situation and seem to me like a sensible (and perhaps the 
only viable) solution. So we can say that for journals/articles it is sensible 
to ignore any URL supplied as part of the reference, and to form an OpenURL 
instead. If there is a DOI in the reference (which is increasingly common) then 
that can be used to form a URL using DOI resolution, but it makes more sense to 
me to hand this off to another application rather than bake this into the 
reference - and OpenURL resolvers are reasonably set to do this.
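
To make the DOI point concrete, the same identifier can either be baked into the 
reference as a link or handed off to the resolver; a rough sketch in Python (the 
DOI and the idea of passing it as rft_id are illustrative, not project code):

doi = "10.1000/182"                      # made-up example DOI
direct_url = "http://dx.doi.org/" + doi  # route of access baked into the reference
resolver_rft_id = "info:doi/" + doi      # or: hand it to the OpenURL resolver as rft_id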

If we look at a website it is pretty difficult to reference it without 
including the URL - it seems to be the only good way of describing what you are 
actually talking about (how many people think of websites by 'title', 'author' 
and 'publisher'?). For me, this leads to an immediate confusion between the 
description of the resource and the route of access to it. So, to differentiate 
I'm starting to think of the http URI in a reference like this as a URI, but 
not necessarily a URL. We then need some mechanism to check, given a URI, what 
is the URL.

Now I could do this with a script - just pass the URI to a script that checks 
what URL to use against a list and redirects the user if necessary. On this 
point Jonathan said "if the usefulness of your technique does NOT count on 
being inter-operable with existing link resolver infrastructure... PERSONALLY I 
would not be using OpenURL, I don't think it's worth it" - but it struck me that if 
we were passing a URI to a script, why not pass it in an OpenURL? I could see a 
number of advantages to this in the local context (a sketch of such an OpenURL 
follows the list):

Consistency - references to websites get treated the same as references to 
journal articles - this means a single approach on the course side, with 
flexibility
Usage stats - we could collect these whatever, but if we do it via OpenURL we 
get this in the same place as the stats about usage of other scholarly material 
and could consider driving personalisation services off the data (like the bX 
product from Ex Libris)
Appropriate copy problem - for resources we subscribe to with authentication 
mechanisms there is (I think) an equivalent to the 'appropriate copy' issue as 
with journal articles - we can push a URI to 'Web of Science' to the correct 
version of Web of Science via a local authentication method (using ezproxy for 
us)
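
To make the proposal concrete, the sketch below (Python, purely illustrative - the 
referrer id, resolver address and metadata values are made up, not project code) 
shows the kind of OpenURL the course material might generate for a website 
reference, carrying the cited http URI as rft_id alongside a little DC metadata:

import urllib.parse

# Illustrative values only: the cited web address travels as rft_id,
# and a few Dublin Core fields describe the resource.
params = [
    ("url_ver", "Z39.88-2004"),
    ("url_ctx_fmt", "info:ofi/fmt:kev:mtx:ctx"),
    ("rfr_id", "info:sid/open.ac.uk:moodle"),    # assumed referrer id
    ("rft_val_fmt", "info:ofi/fmt:kev:mtx:dc"),
    ("rft_id", "http://www.bbc.co.uk/"),         # the URI exactly as cited
    ("rft.title", "BBC Homepage"),
    ("rft.type", "InteractiveResource"),         # a DCMI Type term
    ("rft.date", "2009-08-24"),                  # assumption: carry the date of access here
]
openurl = "http://resolver.example.ac.uk/resolve?" + urllib.parse.urlencode(params)
print(openurl)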

The problem with this (as Nate and Eric mention) is that any approach 
that relies on the URI as an identifier (whether using 

Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread O.Stephens
I agree with this Rosalyn. The issue that Nate brought up was that the content 
at http://www.bbc.co.uk could change over time, and old content might be moved 
to another URI - http://archive.bbc.co.uk or something. So if course A 
references http://www.bbc.co.uk on 24/08/09, and the content that was on 
http://www.bbc.co.uk on 24/08/09 later moves to http://archive.bbc.co.uk, we can use 
the mechanism I propose to trap the links to http://www.bbc.co.uk and redirect 
to http://archive.bbc.co.uk. However, if at a later date course B references 
http://www.bbc.co.uk we have no way of knowing whether they mean the stuff that 
is currently on http://www.bbc.co.uk or the stuff that used to be on 
http://www.bbc.co.uk and is now on http://archive.bbc.co.uk - and we have a 
redirect that is being applied across the board.

Thinking about it, references are required to include a date of access when 
citing websites, so this is probably the best piece of information to use to 
know where to resolve to (and we can put this in the DC metadata). Whether this 
will just get too confusing is a good question - I'll have a think about this.
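
A rough sketch of that date-aware lookup (plain Python, purely illustrative - the 
rule table and dates are invented): each redirect rule only applies to references 
whose date of access falls on or before the date the content moved.

from datetime import date

# (cited URI, content moved on this date, where the old content now lives)
RULES = [
    ("http://www.bbc.co.uk/", date(2009, 8, 24), "http://archive.bbc.co.uk/"),
]

def resolve(uri, accessed):
    """Return the URL to redirect to for a cited URI and its date of access."""
    for rule_uri, moved_on, new_home in RULES:
        if uri == rule_uri and accessed <= moved_on:
            return new_home   # the reference predates the move
    return uri                # no rule applies: the cited URI is still right

print(resolve("http://www.bbc.co.uk/", date(2009, 8, 24)))  # -> archive copy
print(resolve("http://www.bbc.co.uk/", date(2010, 1, 1)))   # -> current site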

Owen

PS Using the date we could even consider resolving to the Internet Archive copy 
of a website if it was available - this might be useful I guess...

Owen Stephens
TELSTAR Project Manager
Library and Learning Resources Centre
The Open University
Walton Hall
Milton Keynes, MK7 6AA

T: +44 (0) 1908 858701
F: +44 (0) 1908 653571
E: o.steph...@open.ac.uk


 -Original Message-
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On
 Behalf Of Rosalyn Metz
 Sent: 14 September 2009 21:52
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Implementing OpenURL for simple web resources

 oops...just re-read original post s/professor/article

 also your link resolver should be creating a context object
 with each request.  this context object is what makes the
 openurl unique.  so if you want uniqueness for stats purposes
 i would imagine the link resolver is already doing that (and
 just another reason to use an rfr_id that you create).




 On Mon, Sep 14, 2009 at 4:45 PM, Rosalyn Metz
 rosalynm...@gmail.com wrote:
  Owen,
 
  rft_id isn't really meant to be a unique identifier
 (although it can
  be in situations like a pmid or doi).  are you looking for it to be?
  if so why?
 
  if professor A is pointing to http://www.bbc.co.uk and
 professor B is
  pointing to http://www.bbc.co.uk why do they have to have unique
  OpenURLs.
 
  Rosalyn
 
 
 
 
  On Mon, Sep 14, 2009 at 4:41 PM, Eric Hellman
 e...@hellman.net wrote:
  Nate's point is what I was thinking about in this comment in my
  original
  reply:
  If you don't add DC metadata, which seems like a good idea, you'll
  definitely want to include something that will help you to persist
  your replacement record. For example, a label or
 description for the link.
 
  I should also point out a solution that could work for some people
  but not
  you- put rewrite rules in the gateways serving your network. A bit
  dangerous and kludgy, but we've seen kludgier things.
 
  On Sep 14, 2009, at 4:24 PM, O.Stephens wrote:
 
  Nate has a point here - what if we end up with a commonly
 used URI
  pointing at a variety of different things over time, and
 so is used
  to indicate different content each time. However the
 problem with a 'short URL'
  solution (tr.im, purl etc), or indeed any locally assigned
  identifier that acts as a key, is that as described in
 the blog post
  you need prior knowledge of the short URL/identifier to
 use it. The
  only 'identifier' our authors know for a website is it's
 URL - and
  it seems contrary for us to ask them to use something else. I'll
  need to think about Nate's point - is this common or an
 edge case? Is there any other approach we could take?
 
 
  Eric Hellman
  President, Gluejar, Inc.
  41 Watchung Plaza, #132
  Montclair, NJ 07042
  USA
 
  e...@hellman.net
  http://go-to-hellman.blogspot.com/
 
 



The Open University is incorporated by Royal Charter (RC 000391), an exempt 
charity in England & Wales and a charity registered in Scotland (SC 038302).


[CODE4LIB] Results from Institutional Identifiers in Repositories Survey

2009-09-15 Thread Michael J. Giarlo
Greetings,

The NISO I2 Working Group surveyed repository managers and developers
about current practices and needs of the repository community around
institutional identifiers.  Results from the survey will inform a set
of use cases that are expected to drive the development of a draft
standard for institutional identifiers.

A report on the results of the survey is now available to the public:

http://bit.ly/14hWly

Feedback from the repository community is most welcome.  It may be
sent to our public i2info mailing list --
http://www.niso.org/lists/i2info/ -- or directly to me.

Thanks,

-Mike
 Co-chair, Repositories scenario, NISO I2 Working Group


[CODE4LIB] indexing pdf files

2009-09-15 Thread Eric Lease Morgan

I have been having fun recently indexing PDF files.

For the past six months or so I have been keeping the articles I've  
read in a pile, and I was rather amazed at the size of the pile. It  
was about a foot tall. When I read these articles I actively read  
them -- meaning, I write, scribble, highlight, and annotate the text  
with my own special notation denoting names, keywords, definitions,  
citations, quotations, list items, examples, etc. This active reading  
process: 1) makes for better comprehension on my part, and 2) makes  
the articles easier to review and pick out the ideas I thought were  
salient. Being the librarian I am, I thought it might be cool (kewl)  
to make the articles into a collection. Thus, the beginnings of  
Highlights & Annotations: A Value-Added Reading List.


The techno-weenie process for creating and maintaining the content is  
something this community might find interesting:


 1. Print article and read it actively.

 2. Convert the printed article into a PDF
file -- complete with embedded OCR --
with my handy-dandy ScanSnap scanner. [1]

 3. Use MyLibrary to create metadata (author,
title, date published, date read, note,
keywords, facet/term combinations, local
and remote URLs, etc.) describing the
article. [2]

 4. Save the PDF to my file system.

 5. Use pdftotext to extract the OCRed text
from the PDF and index it along with
the MyLibrary metadata using Solr. [3, 4]

 6. Provide a searchable/browsable user
interface to the collection through a
mod_perl module. [5, 6]

Software is never done, and if it were then it would be called  
hardware. Accordingly, I know there are some things I need to do  
before I can truly deem the system version 1.0. At the same time my  
excitement is overflowing and I thought I'd share some geekdom with my  
fellow hackers. Fun with PDF files and open source software.



[1] ScanSnap - http://tinyurl.com/oafgwe
[2] MyLibrary screen dump - http://infomotions.com/tmp/mylibrary.png
[3] pdftotext - http://www.foolabs.com/xpdf/
[4] Solr - http://lucene.apache.org/solr/
[5] module source code - http://infomotions.com/highlights/Highlights.pl
[6] user interface - http://infomotions.com/highlights/highlights.cgi
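
For anyone curious, step 5 above boils down to something like the following 
(a sketch only, not Eric's actual code - the field names and Solr URL are 
assumptions):

import subprocess
import urllib.request
from xml.sax.saxutils import escape

def index_pdf(pdf_path, doc_id, title, solr="http://localhost:8983/solr"):
    # Pull the OCRed text layer out of the PDF with pdftotext ("-" = stdout).
    text = subprocess.run(["pdftotext", pdf_path, "-"],
                          capture_output=True, text=True, check=True).stdout

    # Build a minimal Solr XML update message; field names are illustrative.
    doc = ("<add><doc>"
           "<field name='id'>" + escape(doc_id) + "</field>"
           "<field name='title'>" + escape(title) + "</field>"
           "<field name='fulltext'>" + escape(text) + "</field>"
           "</doc></add>")

    req = urllib.request.Request(solr + "/update?commit=true",
                                 data=doc.encode("utf-8"),
                                 headers={"Content-Type": "text/xml"})
    urllib.request.urlopen(req)

index_pdf("article.pdf", "article-001", "An actively read article")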

--
Eric Lease Morgan
University of Notre Dame




--
Eric Lease Morgan
Head, Digital Access and Information Architecture Department
Hesburgh Libraries, University of Notre Dame

(574) 631-8604


Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread Rosalyn Metz
you could force a timestamp if people don't include a date.

and I like the idea of going to the Internet Archive of a website,
because then you're not having to get into the business of handling
www.bbc.co.uk differently than cnn.com and someblog.org.

i also like the idea of using a redirect.  you could theoretically
write a source parser (i'm assuming you're using SFX based on what you
said about bX) that says if my rfr_id = mylocalid and the item is a
website (however you choose to identify the website...which if you're
writing your own source parser you could put website in the rft_genre
even though it's not technically a metadata format but you just want
your source parser to forward the url on anyway, so the link resolver
isn't actually going to do anything with it) bypass everything and
just direct to the internet archive of the website.

all of this is of course kind of hackish...but really isn't the whole
thing hackish?  there were a few source parsers that would be good
models for writing something like this.  but i have no idea if they
still exist because i haven't looked at the back end of sfx in about a
year.
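
As an aside on the Internet Archive idea: the Wayback Machine has an 
"availability" lookup that takes a URL plus a timestamp and returns the closest 
snapshot, so the redirect could be automated. A sketch in Python - everything 
except the archive.org endpoint and its JSON shape is made up:

import json
import urllib.parse
import urllib.request

def closest_snapshot(url, timestamp):
    """Ask the Wayback Machine for the capture closest to a YYYYMMDD timestamp."""
    query = urllib.parse.urlencode({"url": url, "timestamp": timestamp})
    with urllib.request.urlopen("http://archive.org/wayback/available?" + query) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None

print(closest_snapshot("http://www.bbc.co.uk/", "20090824"))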




On Tue, Sep 15, 2009 at 5:12 AM, O.Stephens o.steph...@open.ac.uk wrote:
 I agree with this Rosalyn. The issue that Nate brought up was that the 
 content at http://www.bbc.co.uk could change over time, and old content might 
 be moved to another URI - http://archive.bbc.co.uk or something. So if course 
 A references http://www.bbc.co.uk on 24/08/09, if the content that was on 
 http://www.bbc.co.uk on 24/08/09 moves to http://archive.bbc.co.uk we can use 
 the mechanism I propose to trap the links to http://www.bbc.co.uk and 
 redirect to http://archive.bbc.co.uk. However, if at a later date course B 
 references http://www.bbc.co.uk we have no way of knowing whether they mean 
 the stuff that is currently on http://www.bbc.co.uk or the stuff that used to 
 be on http://www.bbc.co.uk and is now on http://archive.bbc.co.uk - and we 
 have a redirect that is being applied across the board.

 Thinking about it, references are required to include a date of access when 
 citing websites, so this is probably the best piece of information to use to 
 know where to resolve to (and we can put this in the DC metadata). Whether 
 this will just get too confusing is a good question - I'll have at think 
 about this.

 Owen

 PS using the date we could even consider resolving to the Internet Archive 
 copy of a website if it was available I guess - this might be useful I 
 guess...

 Owen Stephens
 TELSTAR Project Manager
 Library and Learning Resources Centre
 The Open University
 Walton Hall
 Milton Keynes, MK7 6AA

 T: +44 (0) 1908 858701
 F: +44 (0) 1908 653571
 E: o.steph...@open.ac.uk


 -Original Message-
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On
 Behalf Of Rosalyn Metz
 Sent: 14 September 2009 21:52
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Implementing OpenURL for simple web resources

 oops...just re-read original post s/professor/article

 also your link resolver should be creating a context object
 with each request.  this context object is what makes the
 openurl unique.  so if you want uniqueness for stats purposes
 i would image the link resolver is already doing that (and
 just another reason to use an rfr_id that you create).




 On Mon, Sep 14, 2009 at 4:45 PM, Rosalyn Metz
 rosalynm...@gmail.com wrote:
  Owen,
 
  rft_id isn't really meant to be a unique identifier
 (although it can
  be in situations like a pmid or doi).  are you looking for it to be?
  if so why?
 
  if professor A is pointing to http://www.bbc.co.uk and
 professor B is
  pointing to http://www.bbc.co.uk why do they have to have unique
  OpenURLs.
 
  Rosalyn
 
 
 
 
  On Mon, Sep 14, 2009 at 4:41 PM, Eric Hellman
 e...@hellman.net wrote:
  Nate's point is what I was thinking about in this comment in my
  original
  reply:
  If you don't add DC metadata, which seems like a good idea, you'll
  definitely want to include something that will help you to persist
  your replacement record. For example, a label or
 description for the link.
 
  I should also point out a solution that could work for some people
  but not
  you- put rewrite rules in the gateways serving your network. A bit
  dangerous and kludgy, but we've seen kludgier things.
 
  On Sep 14, 2009, at 4:24 PM, O.Stephens wrote:
 
  Nate has a point here - what if we end up with a commonly
 used URI
  pointing at a variety of different things over time, and
 so is used
  to indicate different content each time. However the
 problem with a 'short URL'
  solution (tr.im, purl etc), or indeed any locally assigned
  identifier that acts as a key, is that as described in
 the blog post
  you need prior knowledge of the short URL/identifier to
 use it. The
  only 'identifier' our authors know for a website is it's
 URL - and
  it seems contrary for us to ask them to use something else. I'll
  need to think about 

Re: [CODE4LIB] indexing pdf files

2009-09-15 Thread Rosalyn Metz
Eric,

I have librarians that would kill for this.  In fact I was talking to
one about it the other day.  She felt there must be a way to handle
active reading and make it portable.  This would be great in
conjunction with RefWorks or Zotero or something along those lines.

Rosalyn



On Tue, Sep 15, 2009 at 9:31 AM, Eric Lease Morgan emor...@nd.edu wrote:
 I have been having fun recently indexing PDF files.

 For the pasts six months or so I have been keeping the articles I've read in
 a pile, and I was rather amazed at the size of the pile. It was about a foot
 tall. When I read these articles I actively read them -- meaning, I write,
 scribble, highlight, and annotate the text with my own special notation
 denoting names, keywords, definitions, citations, quotations, list items,
 examples, etc. This active reading process: 1) makes for better
 comprehension on my part, and 2) makes the articles easier to review and
 pick out the ideas I thought were salient. Being the librarian I am, I
 thought it might be cool (kewl) to make the articles into a collection.
 Thus, the beginnings of Highlights & Annotations: A Value-Added Reading
 List.

 The techno-weenie process for creating and maintaining the content is
 something this community might find interesting:

  1. Print article and read it actively.

  2. Convert the printed article into a PDF
    file -- complete with embedded OCR --
    with my handy-dandy ScanSnap scanner. [1]

  3. Use MyLibrary to create metadata (author,
    title, date published, date read, note,
    keywords, facet/term combinations, local
    and remote URLs, etc.) describing the
    article. [2]

  4. Save the PDF to my file system.

  5. Use pdttotext to extract the OCRed text
    from the PDF and index it along with
    the MyLibrary metadata using Solr. [3, 4]

  6. Provide a searchable/browsable user
    interface to the collection through a
    mod_perl module. [5, 6]

 Software is never done, and if it were then it would be called hardware.
 Accordingly, I know there are some things I need to do before I can truely
 deem the system version 1.0. At the same time my excitment is overflowing
 and I thought I'd share some geekdom with my fellow hackers. Fun with PDF
 files and open source software.


 [1] ScanSnap - http://tinyurl.com/oafgwe
 [2] MyLibrary screen dump - http://infomotions.com/tmp/mylibrary.png
 [3] pdftotext - http://www.foolabs.com/xpdf/
 [4] Solr - http://lucene.apache.org/solr/
 [5] module source code - http://infomotions.com/highlights/Highlights.pl
 [6] user interface - http://infomotions.com/highlights/highlights.cgi

 --
 Eric Lease Morgan
 University of Notre Dame




 --
 Eric Lease Morgan
 Head, Digital Access and Information Architecture Department
 Hesburgh Libraries, University of Notre Dame

 (574) 631-8604



Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread O.Stephens
Thanks Rosalyn,

As you say we could push a custom value into rft_genre. I'm a bit torn on this, 
as I guess I'm trying to do something that isn't 'hacky' - or at least not from 
the OpenURL end of it. It might be that this is just wishful thinking, and that 
I'm just trying to fool myself into thinking I'm 'sticking to the standard' 
when the likelihood of what I'm doing being transferrable to other scenarios is 
zero (although Eric's comments make me hope not)

Yes, we are using SFX. What I'm proposing on the SFX end as the path of least 
resistance is writing a source parser for our learning environment which can 
do a 'fetch' for an alternative URL, or use the primary URL, and put it in an 
SFX internal field rft_856. We can then use the existing Target Parser 856_URL 
which displays the contents of rft_856 in the menu. Combined with some logic 
which forces this as the only option under certain circumstances we can then 
push the user directly to the resulting URL.

Owen

Owen Stephens
TELSTAR Project Manager
Library and Learning Resources Centre
The Open University
Walton Hall
Milton Keynes, MK7 6AA

T: +44 (0) 1908 858701
F: +44 (0) 1908 653571
E: o.steph...@open.ac.uk


 -Original Message-
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On
 Behalf Of Rosalyn Metz
 Sent: 15 September 2009 14:42
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Implementing OpenURL for simple web resources

 you could force a timestamp if people don't include a date.

 and I like the idea of going to the Internet Archive of a
 website, because then you're not having to get into the
 business of handling www.bbc.co.uk differently than cnn.com
 and someblog.org.

 i also like the idea of using a redirect.  you could
 theoretically write a source parser (i'm assuming youre using
 SFX based on what you said about bX) that says if my rfr_id =
 mylocalid and the item is a website (however you choose to
 identify the website...which if you're writing your own
 source parser you could put website in the rft_genre even
 though its not technically a metadata format but you just
 want your source parser to forward the url on anyway, so the
 link resolver isn't actually going to do anything with it)
 bypass everything and just direct to the internet archive of
 the website.

 all of this is of course kind of hackish...but really isn't
 the whole thing hackish?  there were a few source parsers
 that would be good models for writing something like this.
 but i have no idea if they still exist because i haven't
 looked at the back end of sfx in about a year.




 On Tue, Sep 15, 2009 at 5:12 AM, O.Stephens
 o.steph...@open.ac.uk wrote:
  I agree with this Rosalyn. The issue that Nate brought up
 was that the content at http://www.bbc.co.uk could change
 over time, and old content might be moved to another URI -
 http://archive.bbc.co.uk or something. So if course A
 references http://www.bbc.co.uk on 24/08/09, if the content
 that was on http://www.bbc.co.uk on 24/08/09 moves to
 http://archive.bbc.co.uk we can use the mechanism I propose
 to trap the links to http://www.bbc.co.uk and redirect to
 http://archive.bbc.co.uk. However, if at a later date course
 B references http://www.bbc.co.uk we have no way of knowing
 whether they mean the stuff that is currently on
 http://www.bbc.co.uk or the stuff that used to be on
 http://www.bbc.co.uk and is now on http://archive.bbc.co.uk -
 and we have a redirect that is being applied across the board.
 
  Thinking about it, references are required to include a
 date of access when citing websites, so this is probably the
 best piece of information to use to know where to resolve to
 (and we can put this in the DC metadata). Whether this will
 just get too confusing is a good question - I'll have at
 think about this.
 
  Owen
 
  PS using the date we could even consider resolving to the
 Internet Archive copy of a website if it was available I
 guess - this might be useful I guess...
 
  Owen Stephens
  TELSTAR Project Manager
  Library and Learning Resources Centre
  The Open University
  Walton Hall
  Milton Keynes, MK7 6AA
 
  T: +44 (0) 1908 858701
  F: +44 (0) 1908 653571
  E: o.steph...@open.ac.uk
 
 
  -Original Message-
  From: Code for Libraries [mailto:code4...@listserv.nd.edu]
 On Behalf
  Of Rosalyn Metz
  Sent: 14 September 2009 21:52
  To: CODE4LIB@LISTSERV.ND.EDU
  Subject: Re: [CODE4LIB] Implementing OpenURL for simple
 web resources
 
  oops...just re-read original post s/professor/article
 
  also your link resolver should be creating a context
 object with each
  request.  this context object is what makes the openurl
 unique.  so
  if you want uniqueness for stats purposes i would image the link
  resolver is already doing that (and just another reason to use an
  rfr_id that you create).
 
 
 
 
  On Mon, Sep 14, 2009 at 4:45 PM, Rosalyn Metz
 rosalynm...@gmail.com
  wrote:
   Owen,
  
   rft_id isn't really meant to be a unique 

Re: [CODE4LIB] indexing pdf files

2009-09-15 Thread Mark A. Matienzo
Eric,

  5. Use pdttotext to extract the OCRed text
from the PDF and index it along with
the MyLibrary metadata using Solr. [3, 4]


Have you considered using Solr's ExtractingRequestHandler [1] for the
PDFs? We're using it at NYPL with pretty great success.

[1] http://wiki.apache.org/solr/ExtractingRequestHandler
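
For reference, posting a PDF to the extract handler is a single HTTP call; a 
sketch using Python and the requests library, assuming the handler is enabled in 
solrconfig.xml (the document id is made up):

import requests

# Solr Cell / ExtractingRequestHandler runs Tika server-side,
# so there is no local pdftotext step.
with open("article.pdf", "rb") as f:
    requests.post(
        "http://localhost:8983/solr/update/extract",
        params={"literal.id": "article-001", "commit": "true"},
        files={"file": ("article.pdf", f, "application/pdf")},
    )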

Mark A. Matienzo
Applications Developer, Digital Experience Group
The New York Public Library


Re: [CODE4LIB] indexing pdf files

2009-09-15 Thread Peter Kiraly

Hi all,

I would like to suggest an API for extracting text (including highlighted or
annotated text) from PDF: iText (http://www.lowagie.com/iText/).
It is a Java API (with a C# port), and it helped me a lot when we worked
with some out-of-the-ordinary PDF files.

Solr uses Tika (http://lucene.apache.org/tika) to extract text from
documents, and Tika uses PDFBox (http://incubator.apache.org/pdfbox/)
for PDF files. PDFBox is a great tool for normal PDF files,
but it has (or at least had) some limitations I wasn't satisfied with:

- it consumed more memory than iText, and couldn't
read files above a given size (the limit was large, about 1 GB, but we
had even larger files)

- it couldn't correctly handle conditional hyphens at the end of
the line
- it had poorer documentation than iText, and its API was also
poorer (by that time Manning had published the iText in Action book).

Our PDF files were double layered (original hi-res image + OCRed text),
documents several thousand pages long (Hungarian scientific journals,
the diary of the Houses of Parliament from the 19th century etc.). We indexed
the content with Lucene, and in the UI we showed one page per screen,
so the user didn't need to download the full PDF. We extracted the
table of contents from the PDF as well and exposed it in the web UI,
so the user can browse pages according to the full file's TOC.

This project was two years ago, so it is possible that a lot of things
have changed since then.

Király Péter
http://eXtensibleCatalog.org

- Original Message - 
From: Mark A. Matienzo m...@matienzo.org

To: CODE4LIB@LISTSERV.ND.EDU
Sent: Tuesday, September 15, 2009 3:56 PM
Subject: Re: [CODE4LIB] indexing pdf files



Eric,


 5. Use pdttotext to extract the OCRed text
   from the PDF and index it along with
   the MyLibrary metadata using Solr. [3, 4]



Have you considered using Solr's ExtractingRequestHandler [1] for the
PDFs? We're using it at NYPL with pretty great success.

[1] http://wiki.apache.org/solr/ExtractingRequestHandler

Mark A. Matienzo
Applications Developer, Digital Experience Group
The New York Public Library



Re: [CODE4LIB] indexing pdf files

2009-09-15 Thread danielle plumer
My (much more primitive) version of the same thing involves reading and
annotating articles using my Tablet PC. Although I do get a variety of print
publications, I find I don't tend to annotate them as much anymore. I used
to use EndNote to do the metadata, then I switched to Zotero. I hadn't
thought to try to create a full-text search of the articles -- hmm.

-- 
Danielle Cunniff Plumer, Coordinator
Texas Heritage Digitization Initiative
Texas State Library and Archives Commission
512.463.5852 (phone) / 512.936.2306 (fax)
dplu...@tsl.state.tx.us
dcplu...@gmail.com


On Tue, Sep 15, 2009 at 8:31 AM, Eric Lease Morgan emor...@nd.edu wrote:

 I have been having fun recently indexing PDF files.

 For the pasts six months or so I have been keeping the articles I've read
 in a pile, and I was rather amazed at the size of the pile. It was about a
 foot tall. When I read these articles I actively read them -- meaning, I
 write, scribble, highlight, and annotate the text with my own special
 notation denoting names, keywords, definitions, citations, quotations, list
 items, examples, etc. This active reading process: 1) makes for better
 comprehension on my part, and 2) makes the articles easier to review and
 pick out the ideas I thought were salient. Being the librarian I am, I
 thought it might be cool (kewl) to make the articles into a collection.
 Thus, the beginnings of Highlights & Annotations: A Value-Added Reading
 List.

 The techno-weenie process for creating and maintaining the content is
 something this community might find interesting:

  1. Print article and read it actively.

  2. Convert the printed article into a PDF
file -- complete with embedded OCR --
with my handy-dandy ScanSnap scanner. [1]

  3. Use MyLibrary to create metadata (author,
title, date published, date read, note,
keywords, facet/term combinations, local
and remote URLs, etc.) describing the
article. [2]

  4. Save the PDF to my file system.

  5. Use pdttotext to extract the OCRed text
from the PDF and index it along with
the MyLibrary metadata using Solr. [3, 4]

  6. Provide a searchable/browsable user
interface to the collection through a
mod_perl module. [5, 6]

 Software is never done, and if it were then it would be called hardware.
 Accordingly, I know there are some things I need to do before I can truely
 deem the system version 1.0. At the same time my excitment is overflowing
 and I thought I'd share some geekdom with my fellow hackers. Fun with PDF
 files and open source software.


 [1] ScanSnap - http://tinyurl.com/oafgwe
 [2] MyLibrary screen dump - http://infomotions.com/tmp/mylibrary.png
 [3] pdftotext - http://www.foolabs.com/xpdf/
 [4] Solr - http://lucene.apache.org/solr/
 [5] module source code - http://infomotions.com/highlights/Highlights.pl
 [6] user interface - http://infomotions.com/highlights/highlights.cgi

 --
 Eric Lease Morgan
 University of Notre Dame




 --
 Eric Lease Morgan
 Head, Digital Access and Information Architecture Department
 Hesburgh Libraries, University of Notre Dame

 (574) 631-8604



Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread Ross Singer
Owen, I might have missed it in this message -- my eyes are starting to
glaze over at this point in the thread, but can you describe how the
input of these resources would work?

What I'm basically asking is -- what would the professor need to do to
add a new:  citation for a 70 year old book; journal on PubMed; URL to
CiteSeer?

How does their input make it into your database?

-Ross.

On Tue, Sep 15, 2009 at 5:04 AM, O.Stephens o.steph...@open.ac.uk wrote:
True. How, from the OpenURL, are you going to know that the rft is meant
to represent a website?
 I guess that was part of my question. But no one has suggested defining a new 
 metadata profile for websites (which I probably would avoid tbh). DC doesn't 
 seem to offer a nice way of doing this (that is saying 'this is a website'), 
 although there are perhaps some bits and pieces (format, type) that could be 
 used to give some indication (but I suspect not unambiguously)

But I still think what you want is simply a purl server. What makes you
think you want OpenURL in the first place?  But I still don't really
understand what you're trying to do: deliver consistency of approach
across all our references -- so are you using OpenURL for it's more
conventional use too, but you want to tack on a purl-like
functionality to the same software that's doing something more like a
conventional link resolver?  I don't completely understand your use case.

 I wouldn't use OpenURL just to get a persistent URL - I'd almost certainly 
 look at PURL for this. But, I want something slightly different. I want our 
 course authors to be able to use whatever URL they know for a resource, but 
 still try to ensure that the link works persistently over time. I don't think 
 it is reasonable for a user to have to know a 'special' URL for a resource - 
 and this approach means establishing a PURL for all resources used in our 
 teaching material whether or not it moves in the future - which is an 
 overhead it would be nice to avoid.

 You can hit delete now if you aren't interested, but ...

 ... perhaps if I just say a little more about the project I'm working on it 
 may clarify...

 The project I'm working on is concerned with referencing and citation. We are 
 looking at how references appear in teaching material (esp. online) and how 
 they can be reused by students in their personal environment (in essays, 
 later study, or something else). The references that appear can be to 
 anything - books, chapters, journals, articles, etc. Increasingly of course 
 there are references to web-based materials.

 For print material, references generally describe the resource and nothing 
 more, but for digital material references are expected not only to describe 
 the resource, but also state a route of access to the resource. This tends to 
 be a bad idea when (for example) referencing e-journals, as we know the 
 problems that surround this - many different routes of access to the same 
 item. OpenURLs work well in this situation and seem to me like a sensible 
 (and perhaps the only viable) solution. So we can say that for 
 journals/articles it is sensible to ignore any URL supplied as part of the 
 reference, and to form an OpenURL instead. If there is a DOI in the reference 
 (which is increasingly common) then that can be used to form a URL using DOI 
 resolution, but it makes more sense to me to hand this off to another 
 application rather than bake this into the reference - and OpenURL resolvers 
 are reasonably set to do this.

 If we look at a website it is pretty difficult to reference it without 
 including the URL - it seems to be the only good way of describing what you 
 are actually talking about (how many people think of websites by 'title', 
 'author' and 'publisher'?). For me, this leads to an immediate confusion 
 between the description of the resource and the route of access to it. So, to 
 differentiate I'm starting to think of the http URI in a reference like this 
 as a URI, but not necessarily a URL. We then need some mechanism to check, 
 given a URI, what is the URL.

 Now I could do this with a script - just pass the URI to a script that checks 
 what URL to use against a list and redirects the user if necessary. On this 
 point Jonathan said if the usefulness of your technique does NOT count on 
 being inter-operable with existing link resolver infrastructure... PERSONALLY 
 I would be using OpenURL, I don't think it's worth it - but it struck me 
 that if we were passing a URI to a script, why not pass it in an OpenURL? I 
 could see a number of advantages to this in the local context:

 Consistency - references to websites get treated the same as references to 
 journal articles - this means a single approach on the course side, with 
 flexibility
 Usage stats - we could collect these whatever, but if we do it via OpenURL we 
 get this in the same place as the stats about usage of other scholarly 
 material and could consider driving 

Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread O.Stephens
Ross - no you didn't miss it,

There are 3 ways that references might be added to the learning environment:

1. An author (or realistically a proxy on behalf of the author) can insert a 
reference into a structured Word document from an RIS file. This structured 
document (XML) then goes through a 'publication' process which pushes the 
content to the learning environment (Moodle), including rendering the 
references from RIS format into a specified style, with links.

2. An author/librarian/other can import references to a 'resources' area in our 
learning environment (Moodle) from an RIS file.

3. An author/librarian/other can subscribe to an RSS feed from a RefWorks 
'RefShare' folder within the 'resources' area of the learning environment.

In general the project is focussing on the use of RefWorks - so although the 
RIS files could be created by any suitable s/w, we are looking specifically at 
RefWorks.

How you get the reference into RefWorks is something we are looking at 
currently. The best approach varies depending on the type of material you are 
looking at:

For websites it looks like the 'RefGrab-it' bookmarklet/browser plugin 
(depending on your browser) is the easiest way of capturing website details.
For books, probably a Union catalogue search from within RefWorks
For journal articles, probably a Federated search engine (SS 360 is what we've 
got)
Any of these could be entered by hand of course, as could several other kinds 
of reference

Entering the references into RefWorks could be done by an author, but it is more 
likely to be done by a member of clerical staff or a librarian/library assistant.
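
For anyone who hasn't met RIS, the records being passed around are just tagged 
plain text; an illustrative website record looks roughly like this (the type 
code and date tag vary a little between tools, so treat it as a sketch rather 
than exactly what RefWorks emits):

TY  - ELEC
TI  - BBC Homepage
UR  - http://www.bbc.co.uk/
Y2  - 2009/08/24
ER  - 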

Owen

Owen Stephens
TELSTAR Project Manager
Library and Learning Resources Centre
The Open University
Walton Hall
Milton Keynes, MK7 6AA

T: +44 (0) 1908 858701
F: +44 (0) 1908 653571
E: o.steph...@open.ac.uk


 -Original Message-
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On
 Behalf Of Ross Singer
 Sent: 15 September 2009 15:56
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Implementing OpenURL for simple web resources

 Owen, I might have missed it in this message -- my eyes are
 starting glaze over at this point in the thread, but can you
 describe how the input of these resources would work?

 What I'm basically asking is -- what would the professor need
 to do to add a new:  citation for a 70 year old book; journal
 on PubMed; URL to CiteSeer?

 How does their input make it into your database?

 -Ross.

 On Tue, Sep 15, 2009 at 5:04 AM, O.Stephens
 o.steph...@open.ac.uk wrote:
 True. How, from the OpenURL, are you going to know that the rft is
 meant to represent a website?
  I guess that was part of my question. But no one has suggested
  defining a new metadata profile for websites (which I
 probably would
  avoid tbh). DC doesn't seem to offer a nice way of doing
 this (that is
  saying 'this is a website'), although there are perhaps
 some bits and
  pieces (format, type) that could be used to give some
 indication (but
  I suspect not unambiguously)
 
 But I still think what you want is simply a purl server. What makes
 you think you want OpenURL in the first place?  But I still don't
 really understand what you're trying to do: deliver consistency of
 approach across all our references -- so are you using OpenURL for
 it's more conventional use too, but you want to tack on a
 purl-like
 functionality to the same software that's doing something
 more like a
 conventional link resolver?  I don't completely understand
 your use case.
 
  I wouldn't use OpenURL just to get a persistent URL - I'd
 almost certainly look at PURL for this. But, I want something
 slightly different. I want our course authors to be able to
 use whatever URL they know for a resource, but still try to
 ensure that the link works persistently over time. I don't
 think it is reasonable for a user to have to know a 'special'
 URL for a resource - and this approach means establishing a
 PURL for all resources used in our teaching material whether
 or not it moves in the future - which is an overhead it would
 be nice to avoid.
 
  You can hit delete now if you aren't interested, but ...
 
  ... perhaps if I just say a little more about the project
 I'm working on it may clarify...
 
  The project I'm working on is concerned with referencing
 and citation. We are looking at how references appear in
 teaching material (esp. online) and how they can be reused by
 students in their personal environment (in essays, later
 study, or something else). The references that appear can be
 to anything - books, chapters, journals, articles, etc.
 Increasingly of course there are references to web-based materials.
 
  For print material, references generally describe the
 resource and nothing more, but for digital material
 references are expected not only to describe the resource,
 but also state a route of access to the resource. This tends
 to be a bad idea when (for example) referencing e-journals,
 as we 

Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread Eric Hellman

A suggestion on how  to get a prof to enter a url.

I use this bookmarklet to add a URL to Hacker News:

javascript:window.location=%22http://news.ycombinator.com/submitlink?u=%22+encodeURIComponent(document.location)+%22&t=%22+encodeURIComponent(document.title)

I'm tempted to suggest an api based on OpenURL, but I fear the 10  
emails it would provoke.


On Sep 15, 2009, at 10:56 AM, Ross Singer wrote:


Owen, I might have missed it in this message -- my eyes are starting
glaze over at this point in the thread, but can you describe how the
input of these resources would work?

What I'm basically asking is -- what would the professor need to do to
add a new:  citation for a 70 year old book; journal on PubMed; URL to
CiteSeer?

How does their input make it into your database?

-Ross.

On Tue, Sep 15, 2009 at 5:04 AM, O.Stephens o.steph...@open.ac.uk  
wrote:
True. How, from the OpenURL, are you going to know that the rft is  
meant

to represent a website?
I guess that was part of my question. But no one has suggested  
defining a new metadata profile for websites (which I probably  
would avoid tbh). DC doesn't seem to offer a nice way of doing this  
(that is saying 'this is a website'), although there are perhaps  
some bits and pieces (format, type) that could be used to give some  
indication (but I suspect not unambiguously)


But I still think what you want is simply a purl server. What  
makes you

think you want OpenURL in the first place?  But I still don't really
understand what you're trying to do: deliver consistency of  
approach

across all our references -- so are you using OpenURL for it's more
conventional use too, but you want to tack on a purl-like
functionality to the same software that's doing something more  
like a
conventional link resolver?  I don't completely understand your  
use case.


I wouldn't use OpenURL just to get a persistent URL - I'd almost  
certainly look at PURL for this. But, I want something slightly  
different. I want our course authors to be able to use whatever URL  
they know for a resource, but still try to ensure that the link  
works persistently over time. I don't think it is reasonable for a  
user to have to know a 'special' URL for a resource - and this  
approach means establishing a PURL for all resources used in our  
teaching material whether or not it moves in the future - which is  
an overhead it would be nice to avoid.


You can hit delete now if you aren't interested, but ...

... perhaps if I just say a little more about the project I'm  
working on it may clarify...


The project I'm working on is concerned with referencing and  
citation. We are looking at how references appear in teaching  
material (esp. online) and how they can be reused by students in  
their personal environment (in essays, later study, or something  
else). The references that appear can be to anything - books,  
chapters, journals, articles, etc. Increasingly of course there are  
references to web-based materials.


For print material, references generally describe the resource and  
nothing more, but for digital material references are expected not  
only to describe the resource, but also state a route of access to  
the resource. This tends to be a bad idea when (for example)  
referencing e-journals, as we know the problems that surround this  
- many different routes of access to the same item. OpenURLs work  
well in this situation and seem to me like a sensible (and perhaps  
the only viable) solution. So we can say that for journals/articles  
it is sensible to ignore any URL supplied as part of the reference,  
and to form an OpenURL instead. If there is a DOI in the reference  
(which is increasingly common) then that can be used to form a URL  
using DOI resolution, but it makes more sense to me to hand this  
off to another application rather than bake this into the reference  
- and OpenURL resolvers are reasonably set to do this.


If we look at a website it is pretty difficult to reference it  
without including the URL - it seems to be the only good way of  
describing what you are actually talking about (how many people  
think of websites by 'title', 'author' and 'publisher'?). For me,  
this leads to an immediate confusion between the description of the  
resource and the route of access to it. So, to differentiate I'm  
starting to think of the http URI in a reference like this as a  
URI, but not necessarily a URL. We then need some mechanism to  
check, given a URI, what is the URL.


Now I could do this with a script - just pass the URI to a script  
that checks what URL to use against a list and redirects the user  
if necessary. On this point Jonathan said if the usefulness of  
your technique does NOT count on being inter-operable with existing  
link resolver infrastructure... PERSONALLY I would be using  
OpenURL, I don't think it's worth it - but it struck me that if we  
were passing a URI to a script, why not pass it in an OpenURL? 

Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread Jonathan Rochkind

O.Stephens wrote:

True. How, from the OpenURL, are you going to know that the rft is meant
to represent a website?


I guess that was part of my question. But no one has suggested defining a new 
metadata profile for websites (which I probably would avoid tbh). DC doesn't 
seem to offer a nice way of doing this (that is saying 'this is a website'), 
although there are perhaps some bits and pieces (format, type) that could be 
used to give some indication (but I suspect not unambiguously)

  


Yeah, I don't think there IS any good way to do this.  Well, wait, okay, 
you could use a DC metadata package, and try to convey web site in 
dc.type.   For the OpenURL dc.type it is _recommended_ that you use a term from 
the DCTerms Type vocabulary, but that only lets you say something like 
it's an InteractiveResource or Text or Software.   Unless 
InteractiveResource is sufficient to convey what you need, you could 
disregard the suggestion (not requirement) that the openurl dc metadata 
schema type element contain a DCMI Type vocabulary term, and just put 
something else there: Website.  If you want to go this route, it would 
probably be better to mint a URI (perhaps using purl.org) and put that 
actual URI there, instead of a string literal, to represent Website.
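
In KEV terms that amounts to something like the following pairs in the 
ContextObject (values are illustrative, and the minted type URI is hypothetical):

rft_val_fmt = info:ofi/fmt:kev:mtx:dc
rft.type = http://purl.example.org/types/website   (a minted URI, or just the string "Website")
rft.identifier = http://www.bbc.co.uk/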


Now, you've still wound up with something that is somewhat local/custom, 
that other resolvers are not going to understand. But frankly, I think 
anything you're going to wind up with is something that you aren't going 
to be able to trust arbitrary resolvers in the wild to do anything in 
particular with.  Which may not be a requirement for you anyway.


(Which is why I personally find a new OpenURL metadata format to be a 
complete non-starter.  I don't think OpenURL's abstract core actually 
provides much practical benefit; a new metadata format might as 
well be an entirely new standard for the practical benefit you get 
from it.  Other link resolvers that aren't yours are unlikely to ever do 
anything with your new format, and if they do, whoever implements that 
is going to have almost as much work to do as if it hadn't been OpenURL 
at all. If I wanted a really abstract metadata framework to create a new 
profile/schema on top of, I'd choose DCMI, not OpenURL. DCMI is also so 
abstract that it doesn't make sense to just say My app can take DCMI 
(just like it doesn't make any sense to say my app can take 
OpenURL -- it's all about the profiles/schemas).  But at least DCMI is a 
lot more flexible, and still has an active body of people working on 
maintaining and developing and adopting it.)


Jonathan


Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread Jonathan Rochkind
Wait, are you really going to try to do this with _SFX_ too?   I missed 
that part. Oh boy. Seriously, I think you are in for a world of painful 
hacky kludge.


Rosalyn Metz wrote:

Owen,

The reason I suggest a source parser rather than a target parser is
that handling the openurl based on the source rather than the target
shaves a bit of time off.  Attached is a slide i created (back in the day when it
was my job to create such slides...no i don't sit around in my hole
creating slides because i'm bored...although.) that shows the
process an OpenURL goes through.

So the source parser in this example would come into play before the
OpenURL metadata hits the SFX KB.  It would bypass the bottom half of
the slide completely and reduce any weird formatting that SFX might
try to do to the metadata with a value like website (if you tell sfx
you're looking for an article but you're really looking for a book it
sometimes ignores metadata unrelated to an article even though you
might actually need it).  if you never let it get to that point, then
you don't need to worry about that feature coming into play.

Source parsers aren't used as frequently as they once were, but they
used to be a way to retrieve more metadata from databases that didn't
create useful openurls (not that many vendors create useful openurls
now...).  but if you go a hackish route you could use a source parser
like a redirect rather than using it to fetch more metadata.

If none of this makes sense let me know and i can try to describe it
better off list so as not to bore people into oblivion.

Rosalyn




On Tue, Sep 15, 2009 at 9:52 AM, O.Stephens o.steph...@open.ac.uk wrote:
  

Thanks Rosalyn,

As you say we could push a custom value into rfr_genre. I'm a bit torn on this, 
as I guess I'm trying to do something that isn't 'hacky' - or at least not from 
the OpenURL end of it. It might be that this is just wishful thinking, and that 
I'm just trying to fool myself into thinking I'm 'sticking to the standard' 
when the likelihood of what I'm doing being transferrable to other scenarios is 
zero (although Eric's comments make me hope not)

Yes, we are using SFX. What I'm proposing on the SFX end as the path of least 
resisitance is writing a source parser for our learning environment which can 
do a 'fetch' for an alternative URL, or use the primary URL, and put it in an 
SFX internal field rft_856. We can then use the existing Target Parser 856_URL 
which displays the contents of rft_856 in the menu. Combined with some logic 
which forces this as the only option under certain circumstances we can then 
push the user directly to the resulting URL.

Owen

Owen Stephens
TELSTAR Project Manager
Library and Learning Resources Centre
The Open University
Walton Hall
Milton Keynes, MK7 6AA

T: +44 (0) 1908 858701
F: +44 (0) 1908 653571
E: o.steph...@open.ac.uk




-Original Message-
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On
Behalf Of Rosalyn Metz
Sent: 15 September 2009 14:42
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Implementing OpenURL for simple web resources

you could force a timestamp if people don't include a date.

and I like the idea of going to the Internet Archive of a
website, because then you're not having to get into the
business of handling www.bbc.co.uk differently than cnn.com
and someblog.org.

i also like the idea of using a redirect.  you could
theoretically write a source parser (i'm assuming youre using
SFX based on what you said about bX) that says if my rfr_id =
mylocalid and the item is a website (however you choose to
identify the website...which if you're writing your own
source parser you could put website in the rft_genre even
though its not technically a metadata format but you just
want your source parser to forward the url on anyway, so the
link resolver isn't actually going to do anything with it)
bypass everything and just direct to the internet archive of
the website.

all of this is of course kind of hackish...but really isn't
the whole thing hackish?  there were a few source parsers
that would be good models for writing something like this.
but i have no idea if they still exist because i haven't
looked at the back end of sfx in about a year.




On Tue, Sep 15, 2009 at 5:12 AM, O.Stephens
o.steph...@open.ac.uk wrote:
  

I agree with this Rosalyn. The issue that Nate brought up


was that the content at http://www.bbc.co.uk could change
over time, and old content might be moved to another URI -
http://archive.bbc.co.uk or something. So if course A
references http://www.bbc.co.uk on 24/08/09, if the content
that was on http://www.bbc.co.uk on 24/08/09 moves to
http://archive.bbc.co.uk we can use the mechanism I propose
to trap the links to http://www.bbc.co.uk and redirect to
http://archive.bbc.co.uk. However, if at a later date course
B references http://www.bbc.co.uk we have no way of knowing
whether they mean the stuff that is currently on
http://www.bbc.co.uk or 

Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread Jonathan Rochkind

O.Stephens wrote:

Thanks Rosalyn,

As you say we could push a custom value into rfr_genre. I'm a bit torn on this, 
as I guess I'm trying to do something that isn't 'hacky' - or at least not from 
the OpenURL end of it. It might be that this is just wishful thinking, and that 
I'm just trying to fool myself into thinking I'm 'sticking to the standard' 
when the likelihood of what I'm doing being transferrable to other scenarios is 
zero (although Eric's comments make me hope not)

  


Heh, that is my opinion. Everything I've ever tried to do with OpenURL 
that isn't part of the original 0.1 use case has ended up very hacky, 
despite my best efforts.


Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread O.Stephens
Do you think? I reckon it is just a few lines of code in a custom source 
parser... Only need to:

Check rft.id contains an http uri (regexp)
Define a fetchID based on this URI (possibly + date/other metadata)
Get a URL or null from a lookup service
Insert URL or rft_id value into rft.856

Simple!
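
Stripped of the SFX-specific plumbing, those four steps are indeed compact; a 
rough sketch in Python (the context-object dict, the lookup service and the 
rft.856 handling are all assumptions, not the actual SFX source parser API):

import re

HTTP_URI = re.compile(r"^https?://", re.IGNORECASE)

def enhance(ctx, lookup_service):
    """ctx: dict of OpenURL keys; lookup_service(uri, date) returns a URL or None."""
    rft_id = ctx.get("rft_id", "")
    if not HTTP_URI.match(rft_id):             # 1. only act on http(s) URIs
        return ctx
    fetch_key = (rft_id, ctx.get("rft.date"))  # 2. key on the URI (+ access date if present)
    url = lookup_service(*fetch_key)           # 3. ask the lookup service
    ctx["rft.856"] = url or rft_id             # 4. fall back to the cited URI itself
    return ctx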

Owen

Owen Stephens
TELSTAR Project Manager
Library and Learning Resources Centre
The Open University
Walton Hall
Milton Keynes, MK7 6AA

T: +44 (0) 1908 858701
F: +44 (0) 1908 653571
E: o.steph...@open.ac.uk


 -Original Message-
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On
 Behalf Of Jonathan Rochkind
 Sent: 15 September 2009 16:30
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Implementing OpenURL for simple web resources

 Wait, are you really going to try to do this with _SFX_ too?
  I missed
 that part. Oh boy. Seriously, I think you are in for a world
 of painful hacky kludge.

 Rosalyn Metz wrote:
  Owen,
 
  The reason I suggest a source parser rather than a target parser is
  that handling the openurl based on the source rather than
 shave a bit
  of time off.  Attached is a slide i created (back in the
 day when it
  was my job to create such slides...no i don't sit around in my hole
  creating slides because i'm bored...although.) that shows the
  process an OpenURL goes through.
 
  So the source parser in this example would come into play
 before the
  OpenURL metadata hits the SFX KB.  It would bypass the
 bottom half of
  the slide completely and reduce any weird formatting that SFX might
  try to do to the metadata with a value like website (if you
 tell sfx
  you're looking for an article but you're really looking for
 a book it
  sometimes ignores metadata unrelated to an article even though you
  might actually need it).  if you never let it get to that
 point, then
  you don't need to worry about that feature coming into play.
 
  Source parsers aren't used as frequently as they once were,
 but they
  used to be a way to retrieve more metadata from databases
 that didn't
  create useful openurls (not that many vendors create useful
 openurls
  now...).  but if you go a hackish route you could use a
 source parser
  like a redirect rather than using it to fetch more metadata.
 
  If none of this makes sense let me know and i can try to
 describe it
  better off list so as not to bore people into oblivion.
 
  Rosalyn
 
 
 
 
  On Tue, Sep 15, 2009 at 9:52 AM, O.Stephens
 o.steph...@open.ac.uk wrote:
 
  Thanks Rosalyn,
 
  As you say we could push a custom value into rfr_genre. I'm a bit
  torn on this, as I guess I'm trying to do something that isn't
  'hacky' - or at least not from the OpenURL end of it. It might be
  that this is just wishful thinking, and that I'm just
 trying to fool
  myself into thinking I'm 'sticking to the standard' when the
  likelihood of what I'm doing being transferable to other
 scenarios
  is zero (although Eric's comments make me hope not)
 
  Yes, we are using SFX. What I'm proposing on the SFX end
 as the path of least resistance is writing a source parser
 for our learning environment which can do a 'fetch' for an
 alternative URL, or use the primary URL, and put it in an SFX
 internal field rft_856. We can then use the existing Target
 Parser 856_URL which displays the contents of rft_856 in the
 menu. Combined with some logic which forces this as the only
 option under certain circumstances we can then push the user
 directly to the resulting URL.
 
  Owen
 
  Owen Stephens
  TELSTAR Project Manager
  Library and Learning Resources Centre The Open University
 Walton Hall
  Milton Keynes, MK7 6AA
 
  T: +44 (0) 1908 858701
  F: +44 (0) 1908 653571
  E: o.steph...@open.ac.uk
 
 
 
  -Original Message-
  From: Code for Libraries
 [mailto:code4...@listserv.nd.edu] On Behalf
  Of Rosalyn Metz
  Sent: 15 September 2009 14:42
  To: CODE4LIB@LISTSERV.ND.EDU
  Subject: Re: [CODE4LIB] Implementing OpenURL for simple web
  resources
 
  you could force a timestamp if people don't include a date.
 
  and I like the idea of going to the Internet Archive of a
 website,
  because then you're not having to get into the business
 of handling
  www.bbc.co.uk differently than cnn.com and someblog.org.
 
  i also like the idea of using a redirect.  you could
 theoretically
  write a source parser (i'm assuming youre using SFX based on what
  you said about bX) that says if my rfr_id = mylocalid and
 the item
  is a website (however you choose to identify the
 website...which if
  you're writing your own source parser you could put
 website in the
  rft_genre even though its not technically a metadata
 format but you
  just want your source parser to forward the url on anyway, so the
  link resolver isn't actually going to do anything with it) bypass
  everything and just direct to the internet archive of the website.
 
  all of this is of course kind of hackish...but really isn't the
  whole thing hackish?  there were a few 

Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread Ross Singer
Given that the burden of creating these links is entirely on RefWorks
& Telstar, OpenURL seems as good a choice as anything (since anything
would require some other service, anyway).  As long as the profs
aren't expected to mess with it, I'm not sure that *how* you do the
indirection matters all that much and, as you say, there are added
bonuses to keeping it within SFX.

It seems to me, though, that your rft_id should be a URI to the db
you're using to store their references, so your CTX would look
something like:

http://res.open.ac.uk/?rfr_id=info:/telstar.open.ac.uk&rft_id=http://telstar.open.ac.uk/1234&dc.identifier=http://bbc.uk.co/
# not url encoded because I have, you know, a life.
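
Properly encoded, that context object could be put together like so (a sketch;
I've guessed an info:sid/ form for the referrer id and used the BBC URL from
the thread - adjust to taste):

const params = new URLSearchParams({
  rfr_id: "info:sid/telstar.open.ac.uk",     // assumed sid form of the referrer id
  rft_id: "http://telstar.open.ac.uk/1234",  // local identifier for the stored reference
  "dc.identifier": "http://www.bbc.co.uk/",  // the URL the author actually cited
});
const contextObject = `http://res.open.ac.uk/?${params.toString()}`;
// => http://res.open.ac.uk/?rfr_id=info%3Asid%2Ftelstar.open.ac.uk&rft_id=http%3A%2F%2Ftelstar.open.ac.uk%2F1234&dc.identifier=http%3A%2F%2Fwww.bbc.co.uk%2F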

I can't remember if you can include both metadata-by-reference keys
and metadata-by-value, but you could have by-reference
(rft_ref=http://telstar.open.ac.uk/1234&rft_ref_fmt=RIS or something)
point at your citation db to return a formatted citation.

This way your citations are unique -- somebody pointing at today's
London Times frontpage isn't the same as somebody else's on a
different day.

While I'm shocked that I agree with using OpenURL for this, it seems
as reasonable as any other solution.  That being said, unless you can
definitely offer some other service besides linking to the resource,
I'd avoid the resolver menu completely.

-Ross.

On Tue, Sep 15, 2009 at 11:17 AM, O.Stephens o.steph...@open.ac.uk wrote:
 Ross - no you didn't miss it,

 There are 3 ways that references might be added to the learning environment:

 An author (or realistically a proxy on behalf of the author) can insert a 
 reference into a structured Word document from an RIS file. This structured 
 document (XML) then goes through a 'publication' process which pushes the 
 content to the learning environment (Moodle), including rendering the 
 references from RIS format into a specified style, with links.
 An author/librarian/other can import references to a 'resources' area in our 
 learning environment (Moodle) from a RIS file
 An author/librarian/other can subscribe to an RSS feed from a RefWorks 
 'RefShare' folder within the 'resources' area of the learning environment

 In general the project is focussing on the use of RefWorks - so although the 
 RIS files could be created by any suitable s/w, we are looking specifically 
 at RefWorks.

 How you get the reference into RefWorks is something we are looking at 
 currently. The best approach varies depending on the type of material you are 
 looking at:

 For websites it looks like the 'RefGrab-it' bookmarklet/browser plugin 
 (depending on your browser) is the easiest way of capturing website details.
 For books, probably a Union catalogue search from within RefWorks
 For journal articles, probably a Federated search engine (SS 360 is what 
 we've got)
 Any of these could be entered by hand of course, as could several other kinds 
 of reference

 Entering the references into RefWorks could be done by an author, but it is more 
 likely to be done by a member of clerical staff or a librarian/library 
 assistant

 Owen

 Owen Stephens
 TELSTAR Project Manager
 Library and Learning Resources Centre
 The Open University
 Walton Hall
 Milton Keynes, MK7 6AA

 T: +44 (0) 1908 858701
 F: +44 (0) 1908 653571
 E: o.steph...@open.ac.uk


 -Original Message-
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On
 Behalf Of Ross Singer
 Sent: 15 September 2009 15:56
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Implementing OpenURL for simple web resources

 Owen, I might have missed it in this message -- my eyes are
 starting to glaze over at this point in the thread, but can you
 describe how the input of these resources would work?

 What I'm basically asking is -- what would the professor need
 to do to add a new:  citation for a 70 year old book; journal
 on PubMed; URL to CiteSeer?

 How does their input make it into your database?

 -Ross.

 On Tue, Sep 15, 2009 at 5:04 AM, O.Stephens
 o.steph...@open.ac.uk wrote:
 True. How, from the OpenURL, are you going to know that the rft is
 meant to represent a website?
  I guess that was part of my question. But no one has suggested
  defining a new metadata profile for websites (which I
 probably would
  avoid tbh). DC doesn't seem to offer a nice way of doing
 this (that is
  saying 'this is a website'), although there are perhaps
 some bits and
  pieces (format, type) that could be used to give some
 indication (but
  I suspect not unambiguously)
 
 But I still think what you want is simply a purl server. What makes
 you think you want OpenURL in the first place?  But I still don't
 really understand what you're trying to do: deliver consistency of
 approach across all our references -- so are you using OpenURL for
 it's more conventional use too, but you want to tack on a
 purl-like
 functionality to the same software that's doing something
 more like a
 conventional link resolver?  I don't completely understand
 your use case.
 

Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread Ross Singer
Oh yeah, one thing I left off --

In Moodle, it would probably make sense to link to the URL in the <a> tag:
<a href="http://bbc.co.uk/">The Beeb!</a>
but use a javascript onMouseDown action to rewrite the link to route
through your funky link resolver path, a la Google.

That way, the page works like any normal webpage, right mouse
click -> Copy Link Location gives the user the real URL to copy and
paste, but normal behavior funnels through the link resolver.
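
A rough sketch of that rewrite (the resolver base URL, the CSS selector and
the idea of passing the original URL as rft_id are all just illustrative
assumptions):

const RESOLVER_BASE = "http://res.open.ac.uk/";

document.querySelectorAll<HTMLAnchorElement>("a.reference").forEach((link) => {
  link.addEventListener("mousedown", (event) => {
    if (event.button !== 0) return; // leave right-click "Copy Link Location" on the real URL
    const target = link.getAttribute("href") ?? "";
    // Rewrite just before navigation, so the page still reads as a normal link
    link.href = `${RESOLVER_BASE}?rft_id=${encodeURIComponent(target)}`;
  });
});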

-Ross.

On Tue, Sep 15, 2009 at 11:41 AM, Ross Singer rossfsin...@gmail.com wrote:
 Given that the burden of creating these links is entirely on RefWorks
 & Telstar, OpenURL seems as good a choice as anything (since anything
 would require some other service, anyway).  As long as the profs
 aren't expected to mess with it, I'm not sure that *how* you do the
 indirection matters all that much and, as you say, there are added
 bonuses to keeping it within SFX.

 It seems to me, though, that your rft_id should be a URI to the db
 you're using to store their references, so your CTX would look
 something like:

 http://res.open.ac.uk/?rfr_id=info:/telstar.open.ac.uk&rft_id=http://telstar.open.ac.uk/1234&dc.identifier=http://bbc.uk.co/
 # not url encoded because I have, you know, a life.

 I can't remember if you can include both metadata-by-reference keys
 and metadata-by-value, but you could have by-reference
 (rft_ref=http://telstar.open.ac.uk/1234&rft_ref_fmt=RIS or something)
 point at your citation db to return a formatted citation.

 This way your citations are unique -- somebody pointing at today's
 London Times frontpage isn't the same as somebody else's on a
 different day.

 While I'm shocked that I agree with using OpenURL for this, it seems
 as reasonable as any other solution.  That being said, unless you can
 definitely offer some other service besides linking to the resource,
 I'd avoid the resolver menu completely.

 -Ross.

 On Tue, Sep 15, 2009 at 11:17 AM, O.Stephens o.steph...@open.ac.uk wrote:
 Ross - no you didn't miss it,

 There are 3 ways that references might be added to the learning environment:

 An author (or realistically a proxy on behalf of the author) can insert a 
 reference into a structured Word document from an RIS file. This structured 
 document (XML) then goes through a 'publication' process which pushes the 
 content to the learning environment (Moodle), including rendering the 
 references from RIS format into a specified style, with links.
 An author/librarian/other can import references to a 'resources' area in our 
 learning environment (Moodle) from a RIS file
 An author/librarian/other can subscribe to an RSS feed from a RefWorks 
 'RefShare' folder within the 'resources' area of the learning environment

 In general the project is focussing on the use of RefWorks - so although the 
 RIS files could be created by any suitable s/w, we are looking specifically 
 at RefWorks.

 How you get the reference into RefWorks is something we are looking at 
 currently. The best approach varies depending on the type of material you 
 are looking at:

 For websites it looks like the 'RefGrab-it' bookmarklet/browser plugin 
 (depending on your browser) is the easiest way of capturing website details.
 For books, probably a Union catalogue search from within RefWorks
 For journal articles, probably a Federated search engine (SS 360 is what 
 we've got)
 Any of these could be entered by hand of course, as could several other 
 kinds of reference

 Entering the references into RefWorks could be done by an author, but it is 
 more likely to be done by a member of clerical staff or a librarian/library 
 assistant

 Owen

 Owen Stephens
 TELSTAR Project Manager
 Library and Learning Resources Centre
 The Open University
 Walton Hall
 Milton Keynes, MK7 6AA

 T: +44 (0) 1908 858701
 F: +44 (0) 1908 653571
 E: o.steph...@open.ac.uk


 -Original Message-
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On
 Behalf Of Ross Singer
 Sent: 15 September 2009 15:56
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Implementing OpenURL for simple web resources

 Owen, I might have missed it in this message -- my eyes are
 starting to glaze over at this point in the thread, but can you
 describe how the input of these resources would work?

 What I'm basically asking is -- what would the professor need
 to do to add a new:  citation for a 70 year old book; journal
 on PubMed; URL to CiteSeer?

 How does their input make it into your database?

 -Ross.

 On Tue, Sep 15, 2009 at 5:04 AM, O.Stephens
 o.steph...@open.ac.uk wrote:
 True. How, from the OpenURL, are you going to know that the rft is
 meant to represent a website?
  I guess that was part of my question. But no one has suggested
  defining a new metadata profile for websites (which I
 probably would
  avoid tbh). DC doesn't seem to offer a nice way of doing
 this (that is
  saying 'this is a website'), although there are perhaps
 some bits and
  pieces (format, type) that could 

[CODE4LIB] Fall Internships at WGBH Media Library Archives

2009-09-15 Thread Courtney Michael
Greetings colleagues! We have two opportunities for 2-3 interns at the WGBH 
Media Library & Archives! Please forgive the cross postings and do not respond 
to me, but send a resume and a statement of interest by email to: 
human_resour...@wgbh.org or by mail to:

WGBH Educational Foundation
Human Resources Department
One Guest Street
Boston, MA 02135

Please forward to any interested parties!
Thank you!
Courtney.

Digital Library Projects Internship
http://careers.wgbh.org/internships/internships/mla_digital_library.html
The WGBH Media Library & Archives has opportunities for undergraduate and 
graduate students to work in a film and media production archives. Come and 
learn what happens to all the materials that went into that FRONTLINE you saw 
after it aired on TV. Digital library interns will work with the Project 
Manager and Production Assistant to make archival media materials accessible 
online for two ongoing pilot projects, the CPB American Archive project and the 
Mellon Digital Library project. The CPB American Archive project will focus on 
Civil Rights Movement content. Funded by CPB, the American Archive will 
eventually be a national archive of PBS media materials. The Mellon Digital 
Library project uses foreign policy and the history of science content, and 
focuses on scholarly use of archival media material online. Interns will get 
hands-on experience preparing archival media for web access by digitizing 
materials, applying metadata, and encoding transcripts. This is an opportunity 
to learn moving image digitization for preservation and access, the PBCore 
metadata schema (pbcore.org) and the TEI XML schema (tei-c.org/).

Electronic Records Internship
http://careers.wgbh.org/internships/internships/mla_records.html
The WGBH Media Library & Archives has opportunities for undergraduate and 
graduate students to work in a film and media production archives. Come and use 
your electronic records management knowledge in a real-world setting. The Electronic 
Records Management interns will work with both the Program Shutdown Manager and 
the Digital Archives Manager. They will review electronic original interview 
transcripts that have been delivered to the Media Library and Archives by 
productions (such as Frontline or Nova) to standardize names, and correct any 
inconsistencies. This may require some research skills to identify exactly who 
a particular interviewee is and, where applicable, the position held at the 
time of the interview. This will require embedding the interviewee information 
within the document header and linking the transcript back to the physical tape 
holdings. The position will work to standardize naming conventions for 
interview transcripts, and create a suitable electronic workflow, prior to 
upload into the WGBH digital asset management system. Training will be given in 
this Artesia-based application. The position requires excellent skills in 
reviewing and correcting metadata. Familiarity with online search 
engines, Library of Congress Authorities and other online resources is 
recommended, as is attention to detail.


Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread O.Stephens
I'm thinking about it :)

Logically I think we can avoid this as we have the context based on the rfr_id 
(for which we are proposing)

rfr_id=info:sid/learn.open.ac.uk:[course code] (at the risk of more comment!)

Which seems to me equivalent. I guess it is just a matter of where you do the 
work, since in SFX we'll end up constructing a 'fetch' to the same location 
anyway. The amount of work involved to change it one way or the other is 
probably trivial though.

I'm not sure I agree that what I'm proposing puts 'random' URLs in the rft_id, 
although I do accept that this is a moot point if other resolvers don't do 
something useful with them (or worse, make incorrect assumptions about them) - 
perhaps this is something I could survey as part of the project... (although 
it's all moot if we are only doing this within an internal environment and 
no-one else ever does it!)

Owen

Owen Stephens
TELSTAR Project Manager
Library and Learning Resources Centre
The Open University
Walton Hall
Milton Keynes, MK7 6AA

T: +44 (0) 1908 858701
F: +44 (0) 1908 653571
E: o.steph...@open.ac.uk


 -Original Message-
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On
 Behalf Of Jonathan Rochkind
 Sent: 15 September 2009 16:52
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Implementing OpenURL for simple web resources

 I do like Ross's solution, if you really wanna use OpenURL.
 I'm much more comfortable with the idea of including a URI
 based on your own local service in rft_id, than including any
 old public URL in rft_id.

 Then at least your link resolver can say if what's in rft_id
 begins with (eg)  http://telstar.open.ac.uk/, THEN I know
 this is one of these purl type things, and I know that
 sending the user to it will result in a redirect to an
 end-user-appropriate access URL.

 Cause that's my concern with putting random URLs in rft_id,
 that there's no way to know if they are intended as
 end-user-appropriate access URLs or not, and in putting
 things in rft_id that aren't really good
 identifiers for the referent at all.   But using your own local
 service ID, now you really DO have something that's
 appropriately considered a persistent identifier for the
 referent, AND you have a straightforward way to tell when the
 rft_id of this context is intended as an access URL.

 Jonathan

 Ross Singer wrote:
  Oh yeah, one thing I left off --
 
  In Moodle, it would probably make sense to link to the URL
 in the <a> tag:
  <a href="http://bbc.co.uk/">The Beeb!</a> but use a javascript
  onMouseDown action to rewrite the link to route through your funky
  link resolver path, a la Google.
 
  That way, the page works like any normal webpage, right mouse
  click -> Copy Link Location gives the user the real URL to copy and
  paste, but normal behavior funnels through the link resolver.
 
  -Ross.
 
  On Tue, Sep 15, 2009 at 11:41 AM, Ross Singer
 rossfsin...@gmail.com wrote:
 
  Given that the burden of creating these links is entirely
 on RefWorks
  & Telstar, OpenURL seems as good a choice as anything
 (since anything
  would require some other service, anyway).  As long as the profs
  aren't expected to mess with it, I'm not sure that *how*
 you do the
  indirection matters all that much and, as you say, there are added
  bonuses to keeping it within SFX.
 
  It seems to me, though, that your rft_id should be a URI to the db
  you're using to store their references, so your CTX would look
  something like:
 
 
 http://res.open.ac.uk/?rfr_id=info:/telstar.open.ac.uk&rft_id=http://
  telstar.open.ac.uk/1234&dc.identifier=http://bbc.uk.co/
  # not url encoded because I have, you know, a life.
 
  I can't remember if you can include both
 metadata-by-reference keys
  and metadata-by-value, but you could have by-reference
  (rft_ref=http://telstar.open.ac.uk/1234&rft_ref_fmt=RIS or
  something) point at your citation db to return a formatted
 citation.
 
  This way your citations are unique -- somebody pointing at today's
  London Times frontpage isn't the same as somebody else's on a
  different day.
 
  While I'm shocked that I agree with using OpenURL for
 this, it seems
  as reasonable as any other solution.  That being said,
 unless you can
  definitely offer some other service besides linking to the
 resource,
  I'd avoid the resolver menu completely.
 
  -Ross.
 
  On Tue, Sep 15, 2009 at 11:17 AM, O.Stephens
 o.steph...@open.ac.uk wrote:
 
  Ross - no you didn't miss it,
 
  There are 3 ways that references might be added to the
 learning environment:
 
  An author (or realistically a proxy on behalf of the
 author) can insert a reference into a structured Word
 document from an RIS file. This structured document (XML)
 then goes through a 'publication' process which pushes the
 content to the learning environment (Moodle), including
 rendering the references from RIS format into a specified
 style, with links.
  An author/librarian/other can import references to a 'resources'
  area in our 

Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread Eric Hellman
I think using locally meaningful ids in rft_id is a misuse and a  
mistake. Locally meaningful data should go in rft_dat, accompanied by  
rfr_id


just sayin'

On Sep 15, 2009, at 11:52 AM, Jonathan Rochkind wrote:

I do like Ross's solution, if you really wanna use OpenURL. I'm much  
more comfortable with the idea of including a URI based on your own  
local service in rft_id, than including any old public URL in rft_id.


Then at least your link resolver can say if what's in rft_id begins  
with (eg)  http://telstar.open.ac.uk/, THEN I know this is one of  
these purl type things, and I know that sending the user to it will  
result in a redirect to an end-user-appropriate access URL.
Cause that's my concern with putting random URLs in rft_id, that  
there's no way to know if they are intended as end-user-appropriate  
access URLs or not, and in putting things in rft_id that aren't  
really good identifiers for the referent at all.   But using your  
own local service ID, now you really DO have something that's  
appropriately considered a persistent identifier for the referent,  
AND you have a straightforward way to tell when the rft_id of this  
context is intended as an access URL.


Jonathan



Eric Hellman
President, Gluejar, Inc.
41 Watchung Plaza, #132
Montclair, NJ 07042
USA

e...@hellman.net
http://go-to-hellman.blogspot.com/


Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread Eric Hellman

Yes, you can.

On Sep 15, 2009, at 11:41 AM, Ross Singer wrote:

I can't remember if you can include both metadata-by-reference keys
and metadata-by-value, but you could have by-reference
(rft_ref=http://telstar.open.ac.uk/1234&rft_ref_fmt=RIS or something)
point at your citation db to return a formatted citation.


Eric Hellman
President, Gluejar, Inc.
41 Watchung Plaza, #132
Montclair, NJ 07042
USA

e...@hellman.net
http://go-to-hellman.blogspot.com/


Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread Ross Singer
On Tue, Sep 15, 2009 at 12:06 PM, Eric Hellman e...@hellman.net wrote:
 Yes, you can.


In this case, I say punt on dc.identifier, throw the URL in rft_id
(since, Eric, you had some concern regarding using the local id for
this?) and let the real URL persistence/resolution work happen with
the by-ref negotiation.

-Ross.

 On Sep 15, 2009, at 11:41 AM, Ross Singer wrote:

 I can't remember if you can include both metadata-by-reference keys
 and metadata-by-value, but you could have by-reference
 (rft_ref=http://telstar.open.ac.uk/1234&rft_ref_fmt=RIS or something)
 point at your citation db to return a formatted citation.

 Eric Hellman
 President, Gluejar, Inc.
 41 Watchung Plaza, #132
 Montclair, NJ 07042
 USA

 e...@hellman.net
 http://go-to-hellman.blogspot.com/



Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread Erik Hetzner
Hi Owen, all:

This is a very interesting problem.

At Tue, 15 Sep 2009 10:04:09 +0100,
O.Stephens wrote:
 […]

 If we look at a website it is pretty difficult to reference it
 without including the URL - it seems to be the only good way of
 describing what you are actually talking about (how many people
 think of websites by 'title', 'author' and 'publisher'?). For me,
 this leads to an immediate confusion between the description of the
 resource and the route of access to it. So, to differentiate I'm
 starting to think of the http URI in a reference like this as a URI,
 but not necessarily a URL. We then need some mechanism to check,
 given a URI, what is the URL.

 […]

 The problem with the approach (as Nate and Eric mention) is that any
 approach that relies on the URI as a identifier (whether using
 OpenURL or a script) is going to have problems as the same URI could
 be used to identify different resources over time. I think Eric's
 suggestion of using additional information to help differentiate is
 worth looking at, but I suspect that this is going to cause us
 problems - although I'd say that it is likely to cause us much less
 work than the alternative, which is allocating every single
 reference to a web resource used in our course material it's own
 persistent URL.

 […]

I might be misunderstanding you, but, I think that you are leaving out
the implicit dimension of time here - when was the URL referenced?
What can we use to represent the tuple (URL, date), and how do we
retrieve an appropriate representation of this tuple? Is the most
appropriate representation the most recent version of the page,
wherever it may have moved? Or is the most appropriate representation
the page as it existed in the past? I would argue that the most
appropriate representation would be the page as it existed in the
past, not what the page looks like now - but I am biased, because I
work in web archiving.

Unfortunately this is a problem that has not been very well addressed
by the web architecture people, or the web archiving people. The web
architecture people start from the assumption that
http://example.org/ is the same resource which only varies in its
representation as a function of time, not in its identity as a
resource. The web archives people create closed systems and do not
think about how to store and resolve the tuple (URL, date).
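
One concrete way to express the (URL, date) tuple today is the Wayback
Machine's URL scheme - roughly the sketch below; the Internet Archive will
redirect to the capture nearest the requested timestamp:

// Sketch: http://web.archive.org/web/{yyyyMMddhhmmss}/{url}
function waybackUrl(url: string, referencedOn: Date): string {
  const pad = (n: number) => String(n).padStart(2, "0");
  const ts =
    referencedOn.getUTCFullYear().toString() +
    pad(referencedOn.getUTCMonth() + 1) +
    pad(referencedOn.getUTCDate()) +
    pad(referencedOn.getUTCHours()) +
    pad(referencedOn.getUTCMinutes()) +
    pad(referencedOn.getUTCSeconds());
  return `http://web.archive.org/web/${ts}/${url}`;
}

// e.g. the BBC front page as referenced on 24/08/09:
// waybackUrl("http://www.bbc.co.uk/", new Date(Date.UTC(2009, 7, 24)))
// => http://web.archive.org/web/20090824000000/http://www.bbc.co.uk/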

I know this doesn’t help with your immediate problem, but I think
these are important issues.

best,
Erik Hetzner
;; Erik Hetzner, California Digital Library
;; gnupg key id: 1024D/01DB07E3




Re: [CODE4LIB] indexing pdf files

2009-09-15 Thread Erik Hatcher

Here's a post on how easy it is to send PDF documents to Solr from Java:

  http://www.lucidimagination.com/blog/2009/09/14/posting-rich-documents-to-apache-solr-using-solrj-and-solr-cell-apache-tika/ 



Not only can you post PDF (and other rich content) files to Solr for  
indexing, you can also, as shown in that blog entry, extract the text  
from such files and have it returned to the client.  This Solr  
capability makes the tool chain a bit simpler.
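
By way of illustration, a minimal sketch of pushing a PDF at the
ExtractingRequestHandler (the /update/extract endpoint; the host, id and file
name below are placeholders):

import { readFile } from "node:fs/promises";

// Post a PDF to Solr Cell; literal.id supplies the document id, commit=true commits.
async function indexPdf(path: string, id: string): Promise<void> {
  const pdf = await readFile(path);
  const url =
    "http://localhost:8983/solr/update/extract" +
    `?literal.id=${encodeURIComponent(id)}&commit=true`;
  const resp = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/pdf" },
    body: pdf,
  });
  if (!resp.ok) throw new Error(`Solr returned ${resp.status}`);
}

// Adding extractOnly=true to the query string returns the extracted text to
// the client instead of indexing it.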


Erik


On Sep 15, 2009, at 10:31 AM, Peter Kiraly wrote:


Hi all,

I would like to suggest an API for extracting text (including
highlighted or annotated text) from PDF: iText (http://www.lowagie.com/iText/).
This is a Java API (it has a C# port), and it helped me a lot when we
worked with unusual PDF files.

Solr uses Tika (http://lucene.apache.org/tika) for extracting text from
documents, and Tika uses PDFBox (http://incubator.apache.org/pdfbox/)
to extract from PDF files. It is a great tool for normal PDF files,
but it has (or at least had) some features which I wasn't satisfied with:


- it consumed more memory compared with iText, and couldn't
read files above a given size (the limit was large, about 1 GB, but we
had even larger files)

- it couldn't correctly handle the conditional hyphens at the end of
a line
- it had poorer documentation than iText, and its API was also
poorer (by that time Manning had published the iText in Action book).

Our PDF files were double layered (original hi-res image + OCR-ed text),
documents several thousand pages long (Hungarian scientific journals,
the diary of the Houses of Parliament from the 19th century, etc.).
We indexed

the content with Lucene, and in the UI we showed one page per screen,
so the user didn't need to download the full PDF. We extracted the
Table of contents from the PDF as well, and we implemented it in the  
web UI,

so the user can browse pages according to the full file's TOC.

This project happened two years ago, so it is possible that a lot of things
have changed since then.

Király Péter
http://eXtensibleCatalog.org

- Original Message - From: Mark A. Matienzo m...@matienzo.org 


To: CODE4LIB@LISTSERV.ND.EDU
Sent: Tuesday, September 15, 2009 3:56 PM
Subject: Re: [CODE4LIB] indexing pdf files



Eric,


5. Use pdftotext to extract the OCRed text
  from the PDF and index it along with
  the MyLibrary metadata using Solr. [3, 4]



Have you considered using Solr's ExtractingRequestHandler [1] for the
PDFs? We're using it at NYPL with pretty great success.

[1] http://wiki.apache.org/solr/ExtractingRequestHandler

Mark A. Matienzo
Applications Developer, Digital Experience Group
The New York Public Library


Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread Eric Hellman
The process by which a URI comes to identify something other than the
stuff you get by resolving it can be mysterious - I've blogged about it a
bit: http://go-to-hellman.blogspot.com/2009/07/illusion-of-internet-identity.html
In the case of WorldCat or Google, it's fame. If you think a URI
can be usable outside your institution for identification purposes,
and your institution can maintain some sort of identification
machinery as long as the OpenURL is expected to be useful, then it's
fine to use it in rft_id. If you intend the URI to connote identity
only in the context that you're building URLs for, then use rft_dat,
which is there for exactly that purpose.
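
To make the distinction concrete, the two options look roughly like this (not
URL-encoded, and borrowing the proposed sid and the BBC example from earlier
in the thread):

rft_id (an identifier meant to be meaningful beyond the local context):
http://res.open.ac.uk/?rfr_id=info:sid/learn.open.ac.uk:[course code]&rft_id=http://telstar.open.ac.uk/1234

rft_dat (private data, only meaningful to the referrer named in rfr_id):
http://res.open.ac.uk/?rfr_id=info:sid/learn.open.ac.uk:[course code]&rft_dat=http://www.bbc.co.uk/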


On Sep 15, 2009, at 12:17 PM, Jonathan Rochkind wrote:

If it's a URI that is indeed an identifier that unambiguously  
identifies the referent, as the standard says...   I don't see how  
that's inappropriate in rft_id. Isn't that what it's for?


I mentioned before that I put things like http://catalog.library.jhu.edu/bib/1234 
 in my rft_ids.  Putting http://somewhere.edu/our-purl-server/1234  
in rft_id seems very analogous to me.  Both seem appropriate.


I'm not sure what makes a URI locally meaningful or not.  What  
makes http://www.worldcat.org/bibID or http://books.google.com/book?id=foo 
 globally meaningful but http://catalog.library.jhu.edu/bib/1234  
or http://somewhere.edu/our-purl-server/1234 locally meaningful?   
If it's a URI that is reasonably persistent and unambiguously  
identifies the referent, then it's an identifier and is appropriate  
for rft_id, says me.


Jonathan

Eric Hellman wrote:
I think using locally meaningful ids in rft_id is a misuse and a   
mistake. Locally meaningful data should go in rft_dat, accompanied  
by  rfr_id


just sayin'

On Sep 15, 2009, at 11:52 AM, Jonathan Rochkind wrote:


I do like Ross's solution, if you really wanna use OpenURL. I'm  
much  more comfortable with the idea of including a URI based on  
your own  local service in rft_id, than including any old public  
URL in rft_id.


Then at least your link resolver can say if what's in rft_id  
begins  with (eg)  http://telstar.open.ac.uk/, THEN I know this is  
one of  these purl type things, and I know that sending the user  
to it will  result in a redirect to an end-user-appropriate access  
URL.
Cause that's my concern with putting random URLs in rft_id, that   
there's no way to know if they are intended as end-user- 
appropriate  access URLs or not, and in putting things in rft_id  
that aren't  really good identifiers for the referent at all.
But using your  own local service ID, now you really DO have  
something that's  appropriately considered a persistent  
identifier for the referent,  AND you have a straightforward way  
to tell when the rft_id of this  context is intended as an access  
URL.


Jonathan



Eric Hellman
President, Gluejar, Inc.
41 Watchung Plaza, #132
Montclair, NJ 07042
USA

e...@hellman.net
http://go-to-hellman.blogspot.com/