Re: [CODE4LIB] Vote for NE code4lib meetup location

2008-10-20 Thread Klein, Michael
Looking into the space/time issue this week, folks. I promise.
 
-- 
Michael B. Klein
Digital Initiatives Technology Librarian
Boston Public Library
(617) 859-2391
[EMAIL PROTECTED]


 From: Jay Luker [EMAIL PROTECTED]
 Reply-To: Code for Libraries <CODE4LIB@LISTSERV.ND.EDU>
 Date: Wed, 15 Oct 2008 16:48:12 -0400
 To: Code for Libraries <CODE4LIB@LISTSERV.ND.EDU>
 Subject: Re: [CODE4LIB] Vote for NE code4lib meetup location
 
 Sorry to leave you all in suspense all day. The results are in:
 
 23 Boston, MA
 18 Northampton, MA
 14 Concord, NH
 11 Portland, ME
 
 Michael Klein has said he will now check when a suitable space will be
 available at BPL. Then we'll update the WhenIsGood page and hope for
 some availability intersection goodness.
 
 --jay


Re: [CODE4LIB] OCR PDFs

2008-10-20 Thread Michael Beccaria
It's not exactly what you're looking for, but Microsoft Office comes
with a scriptable OCR engine (Microsoft Office Document Imaging) that
works on TIFFs. I use it to get text from yearbooks we are scanning so
people can search for names and such. While I wouldn't put it on par
with ABBYY, it does a pretty decent job.

I wrote a simple VBScript that scans all the TIFF files in a folder
and, for each image, exports a .txt file with the same name containing
all of the text it finds. If you want it, let me know and I'll send it
your way.
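For anyone who wants the same folder-sweep pattern on Linux, here is a rough Python sketch; the `tesseract` command-line OCR engine stands in for the Office one and is an assumption, not part of the script described above:

```python
import os
import subprocess

def txt_path_for(tiff_path):
    """Output .txt path sharing the TIFF's base name, as in the workflow above."""
    base, _ext = os.path.splitext(tiff_path)
    return base + ".txt"

def ocr_folder(folder):
    """OCR every .tif/.tiff in `folder`, writing <name>.txt beside each image."""
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith((".tif", ".tiff")):
            continue
        src = os.path.join(folder, name)
        out_base, _ = os.path.splitext(src)
        # tesseract appends ".txt" to the output base on its own
        subprocess.run(["tesseract", src, out_base], check=True)
```

Called as `ocr_folder("/scans/yearbook1950")`, it leaves a searchable text file next to every page image.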

Mike Beccaria 
Systems Librarian 
Head of Digital Initiatives 
Paul Smith's College 
518.327.6376 
[EMAIL PROTECTED] 
 
---
This message may contain confidential information and is intended only
for the individual named. If you are not the named addressee you should
not disseminate, distribute or copy this e-mail. Please notify the
sender immediately by e-mail if you have received this e-mail by mistake
and delete this e-mail from your system.
-Original Message-
From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
James Tuttle
Sent: Friday, October 17, 2008 7:57 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] OCR PDFs

I wonder if any of you might have experience with creating text PDFs
from TIFFs. I've been using tiffcp to stitch TIFFs together into a
single image and then using tiff2pdf to generate PDFs from the single
TIFF. I've had to pass this image-based PDF to someone with Acrobat to
use its batch processing facility to OCR the text and save a text-based
PDF. I wonder if anyone has suggestions for software I can integrate
into the script (Python on Linux) I'm using.
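The stitch-and-convert steps described above can be driven from Python via subprocess along these lines; this is only a sketch of the existing pipeline (paths are hypothetical), and the OCR step remains the open question:

```python
import subprocess

def stitch_command(pages, combined_tiff):
    """Build the tiffcp invocation: many page TIFFs -> one multi-page TIFF."""
    return ["tiffcp"] + list(pages) + [combined_tiff]

def pdf_command(combined_tiff, pdf_path):
    """Build the tiff2pdf invocation: multi-page TIFF -> image-only PDF."""
    return ["tiff2pdf", "-o", pdf_path, combined_tiff]

def build_pdf(pages, combined_tiff, pdf_path):
    """Run the two-step pipeline; an OCR pass would have to slot in after this."""
    subprocess.run(stitch_command(pages, combined_tiff), check=True)
    subprocess.run(pdf_command(combined_tiff, pdf_path), check=True)
```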

Thanks,
James

--
---
James Tuttle
Digital Repository Librarian

NCSU Libraries, Box 7111
North Carolina State University
Raleigh, NC 27695-7111
[EMAIL PROTECTED]

(919)513-0651 Phone
(919)515-3031  Fax



[CODE4LIB] marc4j 2.4 released

2008-10-20 Thread Bess Sadler

Dear Code4Libbers,

I'm very pleased to announce that for the first time in almost two  
years there has been a new release of marc4j. Release 2.4 is a minor  
release in the sense that it shouldn't break any existing code, but  
it's a major release in the sense that it represents an influx of new  
people into the development of this project, and a significant  
improvement in marc4j's ability to handle malformed or mis-encoded  
marc records.


Release notes are here:
http://marc4j.tigris.org/files/documents/220/44060/changes.txt


And the project website, including download links, is here:
http://marc4j.tigris.org/


We've been using this new marc4j code in solrmarc since solrmarc  
started, so if you're using Blacklight or VuFind, you're probably  
using it already, just in an unreleased form.


Bravo to Bob Haschart, Wayne Graham, and Bas Peters for making these  
improvements to marc4j and getting this release out the door.


Bess

Elizabeth (Bess) Sadler
Research and Development Librarian
Digital Scholarship Services
Box 400129
Alderman Library
University of Virginia
Charlottesville, VA 22904

[EMAIL PROTECTED]
(434) 243-2305


[CODE4LIB] Call for Book Reviewers & Announcing a New Reviews Editor: Journal of Web Librarianship

2008-10-20 Thread Jody Condit Fagan
Please excuse cross postings!

The Journal of Web Librarianship is pleased to announce Lisa Ennis and Nicole 
Mitchell as new co-editors of the reviews section beginning with volume 3, 
issue 1.

Lisa is the Systems Librarian at UAB’s Lister Hill Library of the Health 
Sciences. She received her M.A. in History from Georgia College & State 
University and her M.S. in Information Sciences from the University of 
Tennessee. Nicole is a reference librarian at UAB’s Lister Hill Library of the 
Health Sciences. She received her M.A. in History from Georgia College & State 
University and her M.L.I.S. from the University of Alabama.

We are currently seeking contributors to the reviews section!

The Journal of Web Librarianship's scope, which extends to the Reviews section, 
includes:

* web page design
* usability testing of library or library-related sites
* cataloging or classification of Web information
* international issues in web librarianship
* scholars' use of the web
* information architecture
* RSS feeds, podcasting, blogs, and other 2.0 technologies
* search engines
* the history of libraries and the web
* emerging and future aspects of web librarianship.

New reviewers, including MLIS students and recent graduates, are welcome to 
apply. If you are interested, please email Lisa and Nicole at 
[EMAIL PROTECTED] with a copy of your resume, statement of interest, and a 
writing sample (preferably of a review).

Jody Fagan
Editor, The Journal of Web Librarianship
http://www.lib.jmu.edu/org/jwl
James Madison University
Preferred email: [EMAIL PROTECTED]



Re: [CODE4LIB] 2009 Conference Registration Rates?

2008-10-20 Thread jean rainwater
We're still working to line up sponsors but we hope to be able to keep
the registration fee the same as last year - $125.  Room rate at the
conference hotel is $135 plus tax (free internet in guest rooms).

Jean Rainwater
Brown University Library
Providence, RI 02912


On Mon, Oct 20, 2008 at 11:45 AM, John Nowlin [EMAIL PROTECTED] wrote:
 Where is the information for the 2009 conference registration fee? I need
 this to get the travel request completed.

 John Nowlin
 College Center for Library Automation (cclaflorida.org)
 Tallahassee, FL 32310 US - 850.922.6044

 Always vote for principle, though you may vote alone, and you may
 cherish the sweetest reflection that your vote is never lost. -- John
 Quincy Adams




[CODE4LIB] Mashed Library UK 2008 - registration is open

2008-10-20 Thread Stephens, Owen
I posted a little while ago that I was organising a 'Mashed Libraries' event. 
Well, registration for the event is now open at 
http://www.ukoln.ac.uk/events/mashed-library-2008/

There is no charge for the day, thanks to my employer (Imperial College 
London), sponsorship from UKOLN (http://www.ukoln.ac.uk), and the donation of 
time and space from Birkbeck College London (esp. thanks to David Flanders for 
this). 

Although the day is intended to be reasonably informal, there is a loose 
schedule for the day that looks like this:

10am Start
10-11 Dummies guide to ... (some short presentations on some of the
tech/tools that might be of use during the day - I've got some topics, but post 
requests/suggestions at 
http://mashedlibrary.ning.com/forum/topic/show?id=2186716%3ATopic%3A5)
11-4 Mashup - work in teams or individually to do interesting stuff
4-5 Round up of mashups and close

Food and drink will be supplied throughout the day as necessary (again, no 
charge)

Registration closes on 14th November.

I've had some interest shown in remote participation, and I'm happy to see what 
we can do to support this, although I'm not quite sure what form this 
participation should take - if you are interested in this, please post at 
http://mashedlibrary.ning.com/forum/topic/show?id=2186716%3ATopic%3A127 and 
I'll see what I can do (not promising anything at this stage though!)

Hope to see some of you there.

Best wishes,

Owen

Owen Stephens
Assistant Director: eStrategy and Information Resources
Imperial College London
[EMAIL PROTECTED]


Re: [CODE4LIB] marc4j 2.4 released

2008-10-20 Thread Michael Beccaria
Very cool! I noticed that a feature, MarcDirStreamReader, is capable of
iterating over all marc record files in a given directory. Does anyone
know of any de-duplicating efforts done with marc4j? For example,
libraries that have similar holdings would have their records merged
into one record with a location tag somewhere. I know places do it
(consortia etc.) but I haven't been able to find a good open program
that handles stuff like that.
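The merge described here might look something like this in miniature, using plain dicts rather than marc4j objects; the field names, and the idea of matching purely on a shared control number, are simplifying assumptions:

```python
def merge_by_control_number(records):
    """Collapse records sharing 'control_number'; union their holdings lists
    so one merged record carries a location per contributing library."""
    merged = {}
    for rec in records:
        key = rec["control_number"]
        if key not in merged:
            merged[key] = {"control_number": key,
                           "title": rec["title"],
                           "holdings": []}
        merged[key]["holdings"].extend(rec.get("holdings", []))
    return list(merged.values())
```

Real de-duplication is much messier, as the rest of this thread discusses, but the data-flow is the same: bucket by a match key, then fold holdings into one record per bucket.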

Mike Beccaria 
Systems Librarian 
Head of Digital Initiatives 
Paul Smith's College 
518.327.6376 
[EMAIL PROTECTED] 
 



[CODE4LIB] JOB ADVERTISEMENT- Web Applications Developer, VCU Libraries

2008-10-20 Thread Jimmy Ghaphery
Web Applications Developer. Virginia Commonwealth University Libraries 
seeks faculty candidates for advancing the state of the art in the 
library’s Web environment, making it a rich, functional, and highly 
engaging experience for the VCU community of users. Position reports to 
the Web Applications Manager. ALA-accredited graduate degree or 
accredited graduate degree in an appropriate discipline required. Salary 
commensurate with experience, not less than $48,000. Review of 
applications will begin November 24, 2008, and will continue until the 
position is filled. Preferred qualifications, application procedures and 
other information are available in the complete position description at 
http://www.library.vcu.edu/admin/webappdev.html. VCU is Virginia’s 
largest university and one of the nation’s leading research 
institutions. It is located in historic and dynamic Richmond, Virginia, 
convenient to the beauty of the Blue Ridge Mountains, the recreation 
destinations of the Atlantic Ocean and the Chesapeake Bay, and the 
cultural resources of Washington, D.C. Virginia Commonwealth University 
is an Equal Opportunity/Affirmative Action Employer. Women, minorities, 
and persons with disabilities are encouraged to apply.


--
Jimmy Ghaphery
Head, Library Information Systems
VCU Libraries
http://www.library.vcu.edu
--


Re: [CODE4LIB] marc4j 2.4 released

2008-10-20 Thread Bess Sadler

Hi, Mike.

I don't know of any off-the-shelf software that does de-duplication
of the kind you're describing, but it would be pretty useful. It
would be awesome if someone wanted to build something like that into
marc4j. Has anyone published any good algorithms for de-duping? As I
understand it, if you have two records that are 100% identical except
for holdings information, that's pretty easy. It gets harder when one
record is more complete than the other, and it's very hard, when one
record has even slightly different information than the other, to tell
whether they are the same record and to decide whose information to
privilege. Are there any good de-duping guidelines out there? When a
library contracts out the de-duping of their catalog, what kind of
specific guidelines are they expected to provide? Anyone know?


I remember the Open Library folks were very interested in this
question. Any Open Library folks on this list? Did that effort to
de-dupe all those contributed marc records ever go anywhere?


Bess

On Oct 20, 2008, at 1:12 PM, Michael Beccaria wrote:

Very cool! I noticed that a feature, MarcDirStreamReader, is capable of
iterating over all marc record files in a given directory. Does anyone
know of any de-duplicating efforts done with marc4j? For example,
libraries that have similar holdings would have their records merged
into one record with a location tag somewhere. I know places do it
(consortia etc.) but I haven't been able to find a good open program
that handles stuff like that.

Mike Beccaria
Systems Librarian
Head of Digital Initiatives
Paul Smith's College
518.327.6376
[EMAIL PROTECTED]



Re: [CODE4LIB] marc4j 2.4 released

2008-10-20 Thread Jonathan Rochkind
To me, de-duplication means throwing out some records as duplicates.
Are we talking about that, or are we talking about what I call work-set
grouping and others (erroneously, in my opinion) call FRBRization?


If the latter, I don't think there is any mature open source software
that addresses that yet. Or, for that matter, any proprietary
for-purchase software that you could use as a component in your own
tools. Various proprietary software includes a work-set grouping feature
in its black box (AquaBrowser, Primo, and I believe the VTLS ILS). But I
don't know of anything available to do it for you in your own tool.


I've been just starting to give some thought to how to accomplish this,
and it's a bit of a tricky problem on several grounds, including
computationally (doing it in a way that performs efficiently). One
choice is whether you group records at the indexing stage, or on-demand
at the retrieval stage. Both have performance implications--we really
don't want to slow down retrieval OR indexing. Usually, if you have the
choice, you put the slowdown at indexing, since it only happens once--in
abstract theory. But in fact, indexing that's already been optimized and
does not have this feature can take hours or even days with some of our
corpuses, and we do re-index from time to time (including 'incremental'
addition of new and changed records to the index)--so we really don't
want to slow down indexing either.
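The index-time option can be sketched as computing a cheap grouping key once per record, so retrieval can group on an already-stored field. The normalization below (lowercased title plus author, punctuation collapsed to spaces) is a naive illustration, not a recommended algorithm:

```python
import re

def work_set_key(title, author):
    """Compute a work-set grouping key at indexing time from title + author.
    Punctuation and case differences collapse to the same key."""
    def norm(s):
        # lowercase, turn runs of non-alphanumerics into single spaces
        return " ".join(re.sub(r"[^a-z0-9]+", " ", s.lower()).split())
    return norm(title) + "/" + norm(author)
```

Storing this key as an indexed field pushes the cost to indexing, which is exactly the trade-off discussed above.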


Jonathan

Bess Sadler wrote:

Hi, Mike.

I don't know of any off-the-shelf software that does de-duplication of 
the kind you're describing, but it would be pretty useful. That would 
be awesome if someone wanted to build something like that into marc4j. 
Has anyone published any good algorithms for de-duping? As I 
understand it, if you have two records that are 100% identical except 
for holdings information, that's pretty easy. It gets harder when one 
record is more complete than the other, and very hard when one record 
has even slightly different information than the other, to tell 
whether they are the same record and decide whose information to 
privilege. Are there any good de-duping guidelines out there? When a 
library contracts out the de-duping of their catalog, what kind of 
specific guidelines are they expected to provide? Anyone know?


I remember the open library folks were very interested in this 
question. Any open library folks on this list? Did that effort to 
de-dupe all those contributed marc records ever go anywhere?


Bess

On Oct 20, 2008, at 1:12 PM, Michael Beccaria wrote:


Very cool! I noticed that a feature, MarcDirStreamReader, is capable of
iterating over all marc record files in a given directory. Does anyone
know of any de-duplicating efforts done with marc4j? For example,
libraries that have similar holdings would have their records merged
into one record with a location tag somewhere. I know places do it
(consortia etc.) but I haven't been able to find a good open program
that handles stuff like that.

Mike Beccaria
Systems Librarian
Head of Digital Initiatives
Paul Smith's College
518.327.6376
[EMAIL PROTECTED]





--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886 
rochkind (at) jhu.edu


Re: [CODE4LIB] marc4j 2.4 released

2008-10-20 Thread Kyle Banerjee
Terry Reese wrote a program called RobertCompare a few years back
(http://oregonstate.edu/~reeset/marcedit/html/robertcompare.html) that
could compare MARC records and tell you about differences. Perhaps
that would be useful.

kyle

On Mon, Oct 20, 2008 at 11:55 AM, Bess Sadler [EMAIL PROTECTED] wrote:
 Hi, Mike.

 I don't know of any off-the-shelf software that does de-duplication of the
 kind you're describing, but it would be pretty useful. That would be awesome
 if someone wanted to build something like that into marc4j. Has anyone
 published any good algorithms for de-duping? As I understand it, if you have
 two records that are 100% identical except for holdings information, that's
 pretty easy. It gets harder when one record is more complete than the other,
 and very hard when one record has even slightly different information than
 the other, to tell whether they are the same record and decide whose
 information to privilege. Are there any good de-duping guidelines out there?
 When a library contracts out the de-duping of their catalog, what kind of
 specific guidelines are they expected to provide? Anyone know?

 I remember the open library folks were very interested in this question. Any
 open library folks on this list? Did that effort to de-dupe all those
 contributed marc records ever go anywhere?

 Bess

 On Oct 20, 2008, at 1:12 PM, Michael Beccaria wrote:

 Very cool! I noticed that a feature, MarcDirStreamReader, is capable of
 iterating over all marc record files in a given directory. Does anyone
 know of any de-duplicating efforts done with marc4j? For example,
 libraries that have similar holdings would have their records merged
 into one record with a location tag somewhere. I know places do it
 (consortia etc.) but I haven't been able to find a good open program
 that handles stuff like that.

 Mike Beccaria
 Systems Librarian
 Head of Digital Initiatives
 Paul Smith's College
 518.327.6376
 [EMAIL PROTECTED]





-- 
--
Kyle Banerjee
Digital Services Program Manager
Orbis Cascade Alliance
[EMAIL PROTECTED] / 541.359.9599


[CODE4LIB] de-dupping (was: marc4j 2.4 released)

2008-10-20 Thread Naomi Dushay
I've wondered if standard number matching (ISBN, LCCN, OCLC,
ISSN ...) would be a big piece. Isn't there such a service from OCLC,
and another flavor of something-or-other from LibraryThing?


- Naomi
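The standard-number idea can be sketched as ISBN normalization, so that ISBN-10 and ISBN-13 forms of the same number compare equal; the 978-prefix conversion is the standard rule, while treating a shared normalized ISBN as "same manifestation" is the simplifying assumption:

```python
def normalize_isbn(raw):
    """Return a 13-digit ISBN string for matching, or None if `raw` isn't
    recognizably an ISBN. ISBN-10s are converted via the 978 prefix plus a
    recomputed ISBN-13 check digit."""
    digits = "".join(c for c in raw if c.isdigit() or c in "xX")
    if len(digits) == 13 and digits.isdigit():
        return digits
    if len(digits) == 10 and digits[:9].isdigit():
        core = "978" + digits[:9]          # drop the old check digit
        check = (10 - sum(int(d) * (3 if i % 2 else 1)
                          for i, d in enumerate(core)) % 10) % 10
        return core + str(check)
    return None
```

Two records whose normalized ISBNs match become candidates for a merge; LCCN and OCLC numbers would need their own (simpler) normalizers.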



Naomi Dushay
[EMAIL PROTECTED]


Re: [CODE4LIB] de-dupping (was: marc4j 2.4 released)

2008-10-20 Thread Min-Yen Kan
Hi all:

My student, Yee Fan Tan, and I published a short technical column on
record linkage tasks (very similar to the de-dup task discussed here)
in the February issue of the Communications of the ACM.

Min-Yen Kan and Yee Fan Tan (2008) Record matching in digital library
metadata.  In Communications of the ACM, Technical opinion column,
pp. 91-94, February.

http://doi.acm.org/10.1145/1314215.1314231

We're in the process of releasing a tool/demo for de-dup tasks as a
Java library (jar). If there's sufficient interest, we might try to
tailor some of our string similarity metrics to MARC or other catalog
data.
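As a toy example of the kind of string-similarity metric record-linkage work layers together, here is a token-set (Jaccard) comparison for titles; the 0.5 threshold is an arbitrary illustration, not a tuned value:

```python
def jaccard(a, b):
    """Jaccard similarity between the word sets of two strings (0.0 to 1.0)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def likely_same_title(a, b, threshold=0.5):
    """Crude match decision: enough word overlap between two title strings."""
    return jaccard(a, b) >= threshold
```

Real systems combine several such metrics (edit distance, field-weighted scores) rather than relying on any single one.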

Cheers,

Min

--
Min-Yen KAN (Dr) :: Assistant Professor :: National University of
Singapore :: School of Computing, AS6 05-12, Law Link, Singapore
117590 :: 65-6516 1885(DID) :: 65-6779 4580 (Fax) ::
[EMAIL PROTECTED] (E) :: www.comp.nus.edu.sg/~kanmy (W)

Important: This email is confidential and may be privileged. If you
are not the intended recipient, please delete it and notify us
immediately; you should not copy or use it for any purpose, nor
disclose its contents to any other person. Thank you.



On Tue, Oct 21, 2008 at 8:03 AM, Naomi Dushay [EMAIL PROTECTED] wrote:
 I've wondered if standard number matching  (ISBN, LCCN, OCLC, ISSN ...)
 would be a big piece.  Isn't there such a service from OCLC, and another
 flavor of something-or-other from LibraryThing?

 - Naomi

 On Oct 20, 2008, at 12:21 PM, Jonathan Rochkind wrote:

 To me, de-duplication means throwing out some records as duplicates. Are
 we talking about that, or are we talking about what I call work set
 grouping and others (erroneously in my opinion) call FRBRization?

 If the latter, I don't think there is any mature open source software that
 addresses that yet. Or for that matter, any proprietary for-purchase
 software that you could use as a component in your own tools. Various
 proprietary software includes a work set grouping feature in it's black
 box (AquaBrowser, Primo, I believe the VTLS ILS).  But I don't know of
 anything available to do it for you in your own tool.

 I've been just starting to give some thought to how to accomplish this,
 and it's a bit of a tricky problem on several grounds, including
 computationally (doing it in a way that performs efficiently). One choice is
 whether you group records at the indexing stage, or on-demand at the
 retrieval stage. Both have performance implications--we really don't want to
 slow down retrieval OR indexing.  Usually if you have the choice, you put
 the slow down at indexing since it only happens once in abstract theory.
 But in fact, with what we do, when indexing that's already been optmized and
 does not have this feature can take hours or even days with some of our
 corpuses, and when in fact we do re-index from time to time (including
 'incremental' addition to the index of new and changed records)---we really
 don't want to slow down indexing either.

 Jonathan

 Bess Sadler wrote:

 Hi, Mike.

 I don't know of any off-the-shelf software that does de-duplication of
 the kind you're describing, but it would be pretty useful. That would be
 awesome if someone wanted to build something like that into marc4j. Has
 anyone published any good algorithms for de-duping? As I understand it, if
 you have two records that are 100% identical except for holdings
 information, that's pretty easy. It gets harder when one record is more
 complete than the other, and when the two records contain even slightly
 different information it becomes very hard to tell whether they describe
 the same item, let alone to decide whose information to privilege. Are there any good
 de-duping guidelines out there? When a library contracts out the de-duping
 of their catalog, what kind of specific guidelines are they expected to
 provide? Anyone know?
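
In the absence of published guidelines, matching logic in practice tends to be tiered: trust a shared control number first, then exact normalized fields, then fuzzy comparison. A hedged sketch of that idea (plain dicts stand in for parsed MARC records; the field names and the 0.9 threshold are illustrative assumptions):

```python
from difflib import SequenceMatcher

def same_record(a, b, threshold=0.9):
    """Hypothetical three-tier duplicate test on two record dicts."""
    # Tier 1: a shared control number (e.g. OCLC) settles it outright.
    if a.get("oclc") and a.get("oclc") == b.get("oclc"):
        return True
    # Tier 2: normalized title + author + date all agree.
    key = lambda r: tuple(r.get(f, "").lower().strip()
                          for f in ("title", "author", "date"))
    if key(a) == key(b):
        return True
    # Tier 3: fuzzy title similarity, only when author and date agree.
    if a.get("author") == b.get("author") and a.get("date") == b.get("date"):
        ratio = SequenceMatcher(None, a.get("title", ""),
                                b.get("title", "")).ratio()
        return ratio >= threshold
    return False

def merge(a, b):
    """Privilege the more complete record; fill its gaps from the other."""
    primary, secondary = (a, b) if len(a) >= len(b) else (b, a)
    merged = dict(secondary)
    merged.update(primary)  # the more complete record wins on conflicts
    return merged
```

The hard cases Bess describes live in tier 3: the threshold is a policy decision, and "more complete" (here, simply more fields) is a crude proxy for record quality.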

 I remember the open library folks were very interested in this question.
 Any open library folks on this list? Did that effort to de-dupe all those
 contributed marc records ever go anywhere?

 Bess

 On Oct 20, 2008, at 1:12 PM, Michael Beccaria wrote:

 Very cool! I noticed that a feature, MarcDirStreamReader, is capable of
 iterating over all MARC record files in a given directory. Does anyone
 know of any de-duplicating efforts done with marc4j? For example,
 libraries that have similar holdings would have their records merged
 into one record with a location tag somewhere. I know places do it
 (consortia, etc.), but I haven't been able to find a good open source
 program that handles that kind of thing.
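
The consortial use case described above could be sketched roughly like this (plain dicts stand in for parsed MARC, and the title/author match key and field names are illustrative assumptions, not marc4j API):

```python
def consolidate(records):
    """Fold duplicate bib records into one, tagging each holding with
    its owning library's location code.

    Hypothetical sketch: records are plain dicts, and duplicates are
    detected by a naive lowercased title/author key.
    """
    merged = {}
    for rec in records:
        key = (rec["title"].lower(), rec["author"].lower())
        if key not in merged:
            merged[key] = {"title": rec["title"],
                           "author": rec["author"],
                           "holdings": []}
        for h in rec["holdings"]:
            # Carry the owning library along as a location tag.
            merged[key]["holdings"].append({"location": rec["library"], **h})
    return list(merged.values())
```

A real implementation would read records via marc4j and use a much more careful match key, but the shape of the merge (one bib record, many location-tagged holdings) is the same.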

 Mike Beccaria
 Systems Librarian
 Head of Digital Initiatives
 Paul Smith's College
 518.327.6376
 [EMAIL PROTECTED]


 --
 Jonathan Rochkind
 Digital Services Software Engineer
 The Sheridan Libraries
 Johns Hopkins University
 410.516.8886 rochkind (at) jhu.edu

 Naomi Dushay
 [EMAIL PROTECTED]



[CODE4LIB] XML Workshop

2008-10-20 Thread Patrick Yott
This is being shamelessly cross-posted -- all apologies for full mailboxes!

WEB DEVELOPMENT WITH XML: DESIGN AND APPLICATIONS, JAN. 5-9, 2009,
CHAPEL HILL, NC

Washington, DC -- The Association of Research Libraries (ARL) is pleased to
offer once again an in-depth workshop focused on Web development with XML.

Taught by experienced XML developers from the libraries of Brown
University, the University of Virginia, and the Virginia Foundation for
the Humanities, this five-day workshop will explore XML with a specific
focus on fundamentals of design, markup, and use. Participants will use
XML and related technologies in the creation of a prototype digital
publication. In addition, the University of North Carolina at Chapel
Hill Libraries will host a reception and tour of their new Carolina
Digital Library and Archive.

Topics to be covered include:

   1. XML: What is it? How does it differ from SGML and HTML?
   2. Working with content models (primarily XML Schema) and methods of
  using them when constructing and validating XML
   3. Implementing methods of content transformation and delivery (using
  XSL and XPath) so the XML we build can be delivered, read, and
  used in a variety of formats
   4. Using XML applications such as XQuery and eXist to further utilize
  XML capabilities and technologies in a Web environment


DATE  LOCATION
January 5-9, 2009
University of North Carolina at Chapel Hill
247 Davis Library
Chapel Hill NC

PRESENTERS
Matthew Gibson, Managing Editor, Encyclopedia Virginia
Christine Ruotolo, Digital Service Manager, University of Virginia Library
Patrick Yott, Director, Center for Digital Initiatives, Brown University

Matthew, Christine, and Patrick have taught XML courses in collaboration
with the ARL Statistics and Measurement program since 2002. This will be
their seventh collaborative event.

REGISTRATION
Register by December 1, 2008, at
http://www.arl.org/stats/statsevents/index.shtml.

Members of ARL and TRLN libraries pay a registration fee of $850;
non-members pay $1,275. These prices do not include meals or housing for
the event.

ARL has reserved a block of rooms at the Carolina Inn, a nearby hotel,
until November 20, 2008. The rooms cannot be guaranteed after this date.
For reservations, call 800-962-8519 and identify yourself as part of the
Association of Research Libraries group.

AUDIENCE
There are no prerequisites for this workshop.

QUESTIONS?
For more information, please contact Kristina Justh, [EMAIL PROTECTED]
mailto:[EMAIL PROTECTED].

--

The Association of Research Libraries (ARL) is a nonprofit organization
of 123 research libraries in North America. Its mission is to influence
the changing environment of scholarly communication and the public
policies that affect research libraries and the diverse communities they
serve. ARL pursues this mission by advancing the goals of its member
research libraries, providing leadership in public and information
policy to the scholarly and higher education communities, fostering the
exchange of ideas and expertise, and shaping a future environment that
leverages its interests with those of allied organizations. ARL is on
the Web at http://www.arl.org/.

Triangle Research Libraries Network (TRLN) is a collaborative
organization of Duke University, North Carolina Central University,
North Carolina State University, and the University of North Carolina at
Chapel Hill, the purpose of which is to marshal the financial, human,
and information resources of their research libraries through
cooperative efforts in order to create a rich and unparalleled knowledge
environment that furthers the universities' teaching, research, and
service missions. TRLN is on the Web at http://www.trln.org/.