I've wondered if standard number matching (ISBN, LCCN, OCLC,
ISSN ...) would be a big piece. Isn't there such a service from OCLC,
and another flavor of something-or-other from LibraryThing?
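The standard-number-matching idea can be sketched roughly as bucketing records that share a normalized identifier. This is just an illustration, not anyone's actual service: the dict record shape and the "isbns" field name are made up, and real MARC records would first need field extraction (e.g. from the 020 field for ISBN):

```python
# Illustrative sketch only: group records that share a normalized ISBN.
# The record structure here is a plain dict, not a real MARC record.
from collections import defaultdict


def normalize_isbn(raw):
    """Strip hyphens and spaces, uppercase a trailing check digit 'X'."""
    return raw.replace("-", "").replace(" ", "").upper()


def group_by_isbn(records):
    """Bucket records under each normalized ISBN they carry."""
    groups = defaultdict(list)
    for rec in records:
        for isbn in rec.get("isbns", []):
            groups[normalize_isbn(isbn)].append(rec)
    return groups


records = [
    {"id": "a", "isbns": ["0-13-110362-8"]},
    {"id": "b", "isbns": ["0131103628"]},
    {"id": "c", "isbns": ["978-0-596-52068-7"]},
]
groups = group_by_isbn(records)
```

Records "a" and "b" land in the same bucket because their ISBNs normalize identically; a production matcher would of course also handle LCCN, OCLC number, and ISSN, each with its own normalization quirks.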
- Naomi
On Oct 20, 2008, at 12:21 PM, Jonathan Rochkind wrote:
To me, "de-duplication" means throwing out some records as
duplicates. Are we talking about that, or are we talking about what
I call "work set grouping" and others (erroneously in my opinion)
call "FRBRization"?
If the latter, I don't think there is any mature open source
software that addresses that yet. Or for that matter, any
proprietary for-purchase software that you could use as a component
in your own tools. Various proprietary software includes a work set
grouping feature in its "black box" (AquaBrowser, Primo, and I believe
the VTLS ILS). But I don't know of anything available to do it for
you in your own tool.
I've been just starting to give some thought to how to accomplish
this, and it's a bit of a tricky problem on several grounds,
including computationally (doing it in a way that performs
efficiently). One choice is whether you group records at the
indexing stage, or on-demand at the retrieval stage. Both have
performance implications--we really don't want to slow down
retrieval OR indexing. Usually if you have the choice, you put the
slow down at indexing since it only happens "once" in abstract
theory. But in practice, indexing that has already been optimized and
does not yet have this feature can take hours or even days for some of
our corpuses, and we do re-index from time to time (including
'incremental' addition of new and changed records to the index)---so we
really don't want to slow down indexing either.
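One hedged sketch of the indexing-time option: compute a coarse work-set key per record at index time and store it as a field, so grouping at retrieval becomes a cheap lookup on that field. The normalization below is deliberately naive (real work-set grouping needs far more care with uniform titles, editions, and authority forms), and the field names are hypothetical:

```python
# Illustrative sketch: a rough work-set key computed at indexing time.
# Normalization here is intentionally simplistic, for demonstration only.
import hashlib
import re


def work_key(title, author):
    """Hash a normalized title + author into a short grouping key.

    Records that normalize to the same basis string share a key, so a
    search index can group or facet on this one stored field.
    """
    def norm(s):
        return re.sub(r"[^a-z0-9 ]", "", s.lower()).strip()

    basis = norm(title) + "|" + norm(author)
    return hashlib.sha1(basis.encode("utf-8")).hexdigest()[:12]


# Minor punctuation differences collapse to the same key:
k1 = work_key("Moby Dick: The Whale", "Melville, Herman")
k2 = work_key("Moby Dick the Whale!", "Melville, Herman.")
```

The cost of computing the key is paid once per record at indexing time, which is exactly the trade-off described above: slower indexing in exchange for retrieval that only has to group on a precomputed value.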
Jonathan
Bess Sadler wrote:
Hi, Mike.
I don't know of any off-the-shelf software that does de-duplication
of the kind you're describing, but it would be pretty useful. That
would be awesome if someone wanted to build something like that
into marc4j. Has anyone published any good algorithms for de-
duping? As I understand it, if you have two records that are 100%
identical except for holdings information, that's pretty easy. It
gets harder when one record is more complete than the other, and
very hard when the records contain even slightly different
information: then you have to decide whether they describe the same
item at all, and whose information to privilege. Are there any good
de-duping
guidelines out there? When a library contracts out the de-duping of
their catalog, what kind of specific guidelines are they expected
to provide? Anyone know?
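The "easy case" described above (records identical except for holdings) can be sketched as a straight comparison plus a union of holdings. The dict record shape is invented for illustration; in real MARC the holdings would live in fields like 852, and the hard cases would fall through to human or rule-based judgment:

```python
# Illustrative sketch of the easy de-dup case: records that agree on
# everything except holdings. Record shape is a made-up dict, not MARC.
def merge_if_duplicates(rec_a, rec_b):
    """Merge two records if their bibliographic data is identical.

    Returns the merged record with unioned holdings, or None when the
    bibliographic data differs (the hard case, needing real judgment).
    """
    bib_a = {k: v for k, v in rec_a.items() if k != "holdings"}
    bib_b = {k: v for k, v in rec_b.items() if k != "holdings"}
    if bib_a != bib_b:
        return None  # not the easy case
    merged = dict(bib_a)
    merged["holdings"] = sorted(set(rec_a["holdings"]) | set(rec_b["holdings"]))
    return merged


rec_a = {"title": "Dune", "author": "Herbert, Frank", "holdings": ["Library A"]}
rec_b = {"title": "Dune", "author": "Herbert, Frank", "holdings": ["Library B"]}
merged = merge_if_duplicates(rec_a, rec_b)
```

Everything beyond this exact-match case (one record more complete, or slightly different values in the same field) is where published guidelines would really earn their keep.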
I remember the Open Library folks were very interested in this
question. Any Open Library folks on this list? Did that effort to
de-dupe all those contributed MARC records ever go anywhere?
Bess
On Oct 20, 2008, at 1:12 PM, Michael Beccaria wrote:
Very cool! I noticed that a feature, MarcDirStreamReader, is capable
of iterating over all MARC record files in a given directory. Does
anyone know of any de-duplicating efforts done with marc4j? For
example, libraries that have similar holdings would have their records
merged into one record with a location tag somewhere. I know places do
it (consortia, etc.), but I haven't been able to find a good open
program that handles stuff like that.
Mike Beccaria
Systems Librarian
Head of Digital Initiatives
Paul Smith's College
518.327.6376
[EMAIL PROTECTED]
--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886 rochkind (at) jhu.edu
Naomi Dushay
[EMAIL PROTECTED]