I've wondered if standard number matching (ISBN, LCCN, OCLC, ISSN ...) would be a big piece. Isn't there such a service from OCLC, and another flavor of something-or-other from LibraryThing?
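For what it's worth, a minimal sketch of the normalization that standard-number matching needs: reduce every ISBN to canonical ISBN-13 form before comparing, so an ISBN-10 on one record matches the ISBN-13 on another. The function name is just illustrative, not from any actual service:

```python
def normalize_isbn(raw):
    """Strip separators and convert ISBN-10 to ISBN-13 for matching."""
    digits = "".join(c for c in raw.upper() if c.isdigit() or c == "X")
    if len(digits) == 13:
        return digits
    if len(digits) == 10:
        core = "978" + digits[:9]
        # Recompute the ISBN-13 check digit (alternating 1/3 weights).
        total = sum(int(d) * (1 if i % 2 == 0 else 3)
                    for i, d in enumerate(core))
        return core + str((10 - total % 10) % 10)
    return None  # not a recognizable ISBN

# Two catalogings of the same book, one with ISBN-10, one with ISBN-13:
print(normalize_isbn("0-306-40615-2"))      # -> 9780306406157
print(normalize_isbn("978-0-306-40615-7"))  # -> 9780306406157
```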

- Naomi

On Oct 20, 2008, at 12:21 PM, Jonathan Rochkind wrote:

To me, "de-duplication" means throwing out some records as duplicates. Are we talking about that, or are we talking about what I call "work set grouping" and others (erroneously in my opinion) call "FRBRization"?

If the latter, I don't think there is any mature open source software that addresses that yet. Or, for that matter, any proprietary for-purchase software that you could use as a component in your own tools. Various proprietary software includes a work set grouping feature in its "black box" (AquaBrowser, Primo, and I believe the VTLS ILS). But I don't know of anything available to do it for you in your own tool.

I've just started giving some thought to how to accomplish this, and it's a tricky problem on several grounds, including computationally (doing it in a way that performs efficiently). One choice is whether you group records at the indexing stage, or on demand at the retrieval stage. Both have performance implications--we really don't want to slow down retrieval OR indexing. Usually, given the choice, you put the slowdown at indexing, since in theory it only happens "once". But in practice, indexing that's already been optimized and does not have this feature can take hours or even days with some of our corpuses, and we do re-index from time to time (including 'incremental' addition of new and changed records)--so we really don't want to slow down indexing either.
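To make the index-time option concrete, here's a toy sketch (not from any shipping product) of computing a cheap "work key" once at indexing, so grouping at retrieval is just a hash lookup. The normalization and field names are assumptions for illustration:

```python
import re
import unicodedata
from collections import defaultdict

def work_key(author, title):
    """Fold author + title into a normalized grouping key."""
    def norm(s):
        # Strip accents, then drop everything but letters and digits.
        s = unicodedata.normalize("NFKD", s)
        s = "".join(c for c in s if not unicodedata.combining(c))
        return re.sub(r"[^a-z0-9]", "", s.lower())
    return norm(author) + "/" + norm(title)

# Hypothetical records whose titles differ only in punctuation:
records = [
    {"id": 1, "author": "Melville, Herman", "title": "Moby Dick"},
    {"id": 2, "author": "Melville, Herman", "title": "Moby-Dick"},
]
groups = defaultdict(list)
for rec in records:
    groups[work_key(rec["author"], rec["title"])].append(rec["id"])
print(dict(groups))  # both ids land under one key
```

Whether a key this naive holds up on real MARC data is exactly the hard question--but it shows why the cost lands at index time.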

Jonathan

Bess Sadler wrote:
Hi, Mike.

I don't know of any off-the-shelf software that does de-duplication of the kind you're describing, but it would be pretty useful. It would be awesome if someone wanted to build something like that into marc4j. Has anyone published any good algorithms for de-duping? As I understand it, if you have two records that are 100% identical except for holdings information, that's pretty easy. It gets harder when one record is more complete than the other, and very hard, when one record has even slightly different information than the other, to tell whether they describe the same thing and to decide whose information to privilege. Are there any good de-duping guidelines out there? When a library contracts out the de-duping of its catalog, what kind of specific guidelines is it expected to provide? Anyone know?
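Just to pin down the "easy case" above: once two records are judged to match, keeping the more complete one and unioning the holdings is straightforward. This sketch uses invented field names and a crude completeness score, purely for illustration:

```python
def completeness(rec):
    """Crude score: count of non-empty fields."""
    return sum(1 for v in rec.values() if v)

def merge_duplicates(a, b):
    """Keep the more complete record; union the two holdings lists."""
    fuller, other = (a, b) if completeness(a) >= completeness(b) else (b, a)
    merged = dict(fuller)
    merged["holdings"] = sorted(
        set(fuller.get("holdings", [])) | set(other.get("holdings", []))
    )
    return merged

# One record has an ISBN the other lacks; holdings come from both.
a = {"title": "Walden", "isbn": "9780140390445", "holdings": ["lib1"]}
b = {"title": "Walden", "isbn": "", "holdings": ["lib2"]}
print(merge_duplicates(a, b))
```

The genuinely hard part--deciding whether two non-identical records are the same work, and whose field values win on conflict--is exactly what published guidelines would need to cover.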

I remember the open library folks were very interested in this question. Any open library folks on this list? Did that effort to de-dupe all those contributed marc records ever go anywhere?

Bess

On Oct 20, 2008, at 1:12 PM, Michael Beccaria wrote:

Very cool! I noticed that a feature, MarcDirStreamReader, is capable of iterating over all marc record files in a given directory. Does anyone
know of any de-duplicating efforts done with marc4j? For example,
libraries that have similar holdings would have their records merged
into one record with a location tag somewhere. I know places do it
(consortia etc.) but I haven't been able to find a good open program
that handles stuff like that.

Mike Beccaria
Systems Librarian
Head of Digital Initiatives
Paul Smith's College
518.327.6376
[EMAIL PROTECTED]


--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886 rochkind (at) jhu.edu

Naomi Dushay
[EMAIL PROTECTED]
