Re: [CODE4LIB] De-dup MARC Ebook records

2013-08-25 Thread Ere Maijala

Mike,

Our RecordManager (https://github.com/KDK-Alli/RecordManager), used in 
conjunction with VuFind, does deduplication with our own algorithm. This 
might be of some interest to you. RecordManager works standalone, so no 
VuFind installation is needed. Some parts are still in active 
development, but the deduplication has been working pretty well so far. 
A short description of the algorithm is available at 
<https://github.com/KDK-Alli/RecordManager/wiki/Deduplication>, and the 
actual PHP code is in 
<https://github.com/KDK-Alli/RecordManager/blob/master/classes/RecordManager.php>, 
starting at the dedupRecord function.


--Ere

On 22.8.2013 18.07, Michael Beccaria wrote:

> Steve,
> I don't think it's so much a matter of finding a control field (the closest
> match I can use is ISBN or eISBN, which has its issues) as of normalizing
> the data in the fields so that matches are produced. It will no doubt take
> some time to figure out.
>
> Mike Beccaria
> Systems Librarian
> Head of Digital Initiative
> Paul Smith's College
> 518.327.6376
> mbecca...@paulsmiths.edu
> Become a friend of Paul Smith's Library on Facebook today!


--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: [CODE4LIB] De-dup MARC Ebook records

2013-08-22 Thread Karen Coyle
Yes, Open Library implemented it, and, of course, where it doesn't work 
is where the data is pretty bad. If you prefer to err on the side of 
merging, you can loosen the algorithm's weights. It's based on the 
algorithm used for the University of California union catalog, which was 
exercised over about 20 years. Last I was able to ascertain, the data 
elements are very similar to the ones used (at least at the time) by WorldCat.
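
To illustrate what loosening the weights can mean in practice, here is a 
minimal Python sketch of weighted matching. It is not the Open Library code: 
the field names, the weights, and the threshold are all hypothetical, and 
records are plain dicts standing in for parsed MARC records.

```python
# Hypothetical weights; these are NOT the values used by Open Library.
WEIGHTS = {"isbn": 50, "title": 30, "author": 15, "year": 5}


def match_score(a, b, weights=WEIGHTS):
    """Sum the weights of fields that are present and equal in both records."""
    return sum(
        weight
        for field, weight in weights.items()
        if a.get(field) and a.get(field) == b.get(field)
    )


def is_duplicate(a, b, threshold=80, weights=WEIGHTS):
    """Treat two records as duplicates when their score reaches the threshold."""
    return match_score(a, b, weights) >= threshold


rec1 = {"isbn": "111", "title": "moby dick", "author": "melville", "year": "1851"}
rec2 = {"title": "moby dick", "author": "melville", "year": "1851"}

print(match_score(rec1, rec2))                 # title + author + year
print(is_duplicate(rec1, rec2))                # strict threshold
print(is_duplicate(rec1, rec2, threshold=50))  # looser: errs toward merging
```

Lowering the threshold, or raising the weights of the descriptive fields, 
makes the sketch merge records that lack a shared identifier, at the cost of 
more false merges.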


kc

On 8/22/13 1:21 PM, Michael Beccaria wrote:

> Karen,
> Do you have a sense of how well it actually works? Is Open Library
> implementing it?
>
> Mike Beccaria
> Systems Librarian
> Head of Digital Initiative
> Paul Smith's College
> 518.327.6376
> mbecca...@paulsmiths.edu
> Become a friend of Paul Smith's Library on Facebook today!


--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet


Re: [CODE4LIB] De-dup MARC Ebook records

2013-08-22 Thread Michael Beccaria
Karen,
Do you have a sense of how well it actually works? Is Open Library implementing 
it?

Mike Beccaria
Systems Librarian
Head of Digital Initiative
Paul Smith's College
518.327.6376
mbecca...@paulsmiths.edu
Become a friend of Paul Smith's Library on Facebook today!


-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Karen 
Coyle
Sent: Thursday, August 22, 2013 11:53 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] De-dup MARC Ebook records

The record matching algorithm used by the Open Library is available here:
https://github.com/openlibrary/openlibrary/tree/master/openlibrary/catalog/merge

The original spec, which may have changed in the implementation, is here:

http://kcoyle.net/merge.html

kc



Re: [CODE4LIB] De-dup MARC Ebook records

2013-08-22 Thread Karen Coyle

The record matching algorithm used by the Open Library is available here:
https://github.com/openlibrary/openlibrary/tree/master/openlibrary/catalog/merge

The original spec, which may have changed in the implementation, is here:

http://kcoyle.net/merge.html

kc


On 8/22/13 8:07 AM, Michael Beccaria wrote:

> Steve,
> I don't think it's so much a matter of finding a control field (the closest
> match I can use is ISBN or eISBN, which has its issues) as of normalizing
> the data in the fields so that matches are produced. It will no doubt take
> some time to figure out.
>
> Mike Beccaria
> Systems Librarian
> Head of Digital Initiative
> Paul Smith's College
> 518.327.6376
> mbecca...@paulsmiths.edu
> Become a friend of Paul Smith's Library on Facebook today!


--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet


Re: [CODE4LIB] De-dup MARC Ebook records

2013-08-22 Thread Michael Beccaria
Steve,
I don't think it's so much a matter of finding a control field (the closest 
match I can use is ISBN or eISBN, which has its issues) as of normalizing the 
data in the fields so that matches are produced. It will no doubt take some 
time to figure out.
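
As an example of the kind of normalization meant here, the following Python 
sketch reduces ISBN strings to a bare 13-digit form so that hyphenated, 
qualified, and 10-digit variants compare equal. The function name is 
hypothetical, and the sketch does not validate check digits or address the 
separate problem of print and e-book editions carrying different ISBNs.

```python
import re


def normalize_isbn(raw):
    """Return a 13-digit ISBN string, or None if no ISBN-like run is found."""
    # Drop hyphens and whitespace, then take the first 13- or 10-character
    # run, which ignores trailing qualifiers such as "(ebook)".
    compact = re.sub(r"[\s-]", "", raw).upper()
    match = re.search(r"\d{13}|\d{9}[\dX]", compact)
    if not match:
        return None
    isbn = match.group(0)
    if len(isbn) == 13:
        return isbn
    # ISBN-10 to ISBN-13: prefix 978, drop the old check digit, and
    # recompute the check digit with alternating weights 1 and 3.
    core = "978" + isbn[:9]
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    return core + str((10 - total % 10) % 10)


print(normalize_isbn("0-306-40615-2 (ebook)"))  # 9780306406157
print(normalize_isbn("978-0-306-40615-7"))      # 9780306406157
```

Both inputs normalize to the same key, so records carrying either form 
would be matched on it.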

Mike Beccaria
Systems Librarian
Head of Digital Initiative
Paul Smith's College
518.327.6376
mbecca...@paulsmiths.edu
Become a friend of Paul Smith's Library on Facebook today!


-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of 
McDonald, Stephen
Sent: Friday, August 16, 2013 8:16 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] De-dup MARC Ebook records

Michael Beccaria said:
> Thanks for the replies. To clarify, I am working with 2 (or more in 
> the future) marc records outside of the ILS. I've tried using Marcedit 
> but my usage did vary...not much overlap with the control fields that 
> were available to me. I have a feeling they are a bit varied. I'm also 
> messing around with marcXimiL a little but I'm having trouble getting 
> it to output any records at all. I also was looking at the XC 
> aggregation module but I was having trouble getting that to work 
> properly as well and the listserv was unresponsive. It seemed like 
> good software but it required me to set up an OAI harvest source to 
> allow it to ingest the records and that...well...enough is enough... I 
> think I will probably need to write something, and at least that way I 
> know what it will be doing rather than plowing through software that 
> has little to no support. Please feel free to let me know of a particular 
> strategy you think might work best in this regard...

If you couldn't get adequate deduping from the control fields available in 
MarcEdit deduping, what control fields do you think you need to dedup on?  You 
can actually specify any arbitrary field and subfield for deduping in MarcEdit.

Steve McDonald
steve.mcdon...@tufts.edu


Re: [CODE4LIB] De-dup MARC Ebook records

2013-08-16 Thread McDonald, Stephen
Michael Beccaria said:
> Thanks for the replies. To clarify, I am working with 2 (or more in the
> future) marc records outside of the ILS. I've tried using Marcedit but my
> usage did vary...not much overlap with the control fields that were
> available to me. I have a feeling they are a bit varied. I'm also messing
> around with marcXimiL a little but I'm having trouble getting it to output
> any records at all. I also was looking at the XC aggregation module but I
> was having trouble getting that to work properly as well and the listserv
> was unresponsive. It seemed like good software but it required me to set up
> an OAI harvest source to allow it to ingest the records and
> that...well...enough is enough... I think I will probably need to write
> something, and at least that way I know what it will be doing rather than
> plowing through software that has little to no support. Please feel free to
> let me know of a particular strategy you think might work best in this
> regard...

If you couldn't get adequate deduping from the control fields available in 
MarcEdit's dedup tool, which control fields do you think you need to dedup on? 
You can actually specify any arbitrary field and subfield for deduping in 
MarcEdit.

Steve McDonald
steve.mcdon...@tufts.edu


Re: [CODE4LIB] De-dup MARC Ebook records

2013-08-15 Thread Alain Borel

Michael Beccaria wrote:
> I'm also messing around with marcXimiL a little but I'm having
> trouble getting it to output any records at all.


Glad to hear someone's looking at our little toy :-)
If I can be of any help with it, just let me know!

Best regards,
Alain Borel
MarcXimiL co-author


Re: [CODE4LIB] De-dup MARC Ebook records

2013-08-15 Thread Michael Beccaria
Thanks for the replies. To clarify, I am working with 2 (or more in the future) 
MARC records outside of the ILS. I've tried using MarcEdit but my usage did 
vary...not much overlap with the control fields that were available to me. I 
have a feeling they are a bit varied. I'm also messing around with MarcXimiL a 
little, but I'm having trouble getting it to output any records at all. I was 
also looking at the XC aggregation module, but I was having trouble getting 
that to work properly as well, and the listserv was unresponsive. It seemed 
like good software, but it required me to set up an OAI harvest source to allow 
it to ingest the records and that...well...enough is enough... I think I will 
probably need to write something, and at least that way I know what it will be 
doing rather than plowing through software that has little to no support. 
Please feel free to let me know of a particular strategy you think might work 
best in this regard...

Mike Beccaria
Systems Librarian
Head of Digital Initiative
Paul Smith's College
518.327.6376
mbecca...@paulsmiths.edu
Become a friend of Paul Smith's Library on Facebook today!


-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Andy 
Kohler
Sent: Thursday, August 15, 2013 2:29 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] De-dup MARC Ebook records

Are you expecting to work with two files of records, outside of your ILS?
If so, for a project like that I'd probably write Perl script(s) using 
MARC::Record (there are similar code libraries for Ruby, Python and Java at 
least).

For each record in each file, use the ISBN (and/or OCLC number and/or LCCN) as 
a key.  Compare all sets, and keep one record per key.

This assumes that the vendors are supplying records with standard identifiers, 
and not just their own record numbers.

If you're comparing each file with what's already in your ILS, then it'll 
depend on the tools the ILS offers for matching incoming records to the 
database.  Or, export the database and compare it with the files, as above.

Andy Kohler / UCLA Library Info Tech
akoh...@library.ucla.edu / 310 206-8312

On Thu, Aug 15, 2013 at 10:11 AM, Michael Beccaria  wrote:

> Has anyone had any luck finding a good way to de-duplicate MARC 
> records from ebook vendors. We're looking to integrate Ebrary and 
> Ebsco Academic Ebook collections and they estimate an overlap into the 10's 
> of thousands.
>
>


Re: [CODE4LIB] De-dup MARC Ebook records

2013-08-15 Thread Harper, Cynthia
Michael -  I'm just about to load ebook records into our Innovative catalog, 
and I'm going to keep the e-books separate from the print book records.  For 
ebooks, I'm going to copy the OCLC number to the 901 with a prestamp, and 
overlay on that. So only records loaded with our ebook load table will have 
this 901 to overlay on.  Then I'm going to protect the 856s and the 710s for 
the ebook collection statement.  That'll take care of adds.  For deletes... I 
haven't got that worked out yet.  I think there's a way to delete a field based 
on the incoming field.

Cindy Harper
Virginia Theological Seminary
char...@vts.edu

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Andy 
Kohler
Sent: Thursday, August 15, 2013 2:29 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] De-dup MARC Ebook records

Are you expecting to work with two files of records, outside of your ILS?
If so, for a project like that I'd probably write Perl script(s) using 
MARC::Record (there are similar code libraries for Ruby, Python and Java at 
least).

For each record in each file, use the ISBN (and/or OCLC number and/or LCCN) as 
a key.  Compare all sets, and keep one record per key.

This assumes that the vendors are supplying records with standard identifiers, 
and not just their own record numbers.

If you're comparing each file with what's already in your ILS, then it'll 
depend on the tools the ILS offers for matching incoming records to the 
database.  Or, export the database and compare it with the files, as above.

Andy Kohler / UCLA Library Info Tech
akoh...@library.ucla.edu / 310 206-8312

On Thu, Aug 15, 2013 at 10:11 AM, Michael Beccaria  wrote:

> Has anyone had any luck finding a good way to de-duplicate MARC 
> records from ebook vendors. We're looking to integrate Ebrary and 
> Ebsco Academic Ebook collections and they estimate an overlap into the 10's 
> of thousands.
>
>


Re: [CODE4LIB] De-dup MARC Ebook records

2013-08-15 Thread Terry Reese
Your mileage may vary, but MarcEdit has a dedup tool that will allow you to
take two files and find duplicates.  It also has a merge tool that will
allow you to take two files and merge specific fields from one into the other
(for example, if you want the 856 fields from two packages in the same
record).  Some assumptions are made when matching: dedup can use any
field/subfield pair, though control numbers are obviously better, while
merging uses a heuristic analysis of 20-25 different field points, with
significance weighted to create a match score.  Some folks find it useful
if you don't want to code something up yourself.

Otherwise, if I were coding this, I'd stay away from needing exact matches.
When doing matches, I like to include a wide range of elements and then use
a fuzzy match for titles, because the title isn't as fixed as you might
like -- especially if the cataloging level varies.
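
A rough Python sketch of fuzzy title matching along these lines follows. It
uses the standard library's difflib rather than anything from MarcEdit, and
the function names and the 0.9 threshold are assumptions chosen for
illustration only.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = re.sub(r"[^a-z0-9 ]", " ", no_accents.lower())
    return " ".join(cleaned.split())


def title_similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means identical after normalization."""
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()


def titles_match(a, b, threshold=0.9):
    """Hypothetical threshold; tune it against a sample of known duplicates."""
    return title_similarity(a, b) >= threshold


print(titles_match("The Origin of Species /", "the origin of species."))
print(titles_match("Introduction to Algorithms", "Introduction to algorithm"))
print(titles_match("Moby Dick", "Moby-Dick; or, The Whale"))
```

The last pair falls below the threshold because one record carries a
subtitle, which is why a title score is better combined with other elements
than used alone.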

--tr


On Thu, Aug 15, 2013 at 2:29 PM, Andy Kohler  wrote:

> Are you expecting to work with two files of records, outside of your ILS?
> If so, for a project like that I'd probably write Perl script(s) using
> MARC::Record (there are similar code libraries for Ruby, Python and Java at
> least).
>
> For each record in each file, use the ISBN (and/or OCLC number and/or LCCN)
> as a key.  Compare all sets, and keep one record per key.
>
> This assumes that the vendors are supplying records with standard
> identifiers, and not just their own record numbers.
>
> If you're comparing each file with what's already in your ILS, then it'll
> depend on the tools the ILS offers for matching incoming records to the
> database.  Or, export the database and compare it with the files, as above.
>
> Andy Kohler / UCLA Library Info Tech
> akoh...@library.ucla.edu / 310 206-8312
>
> On Thu, Aug 15, 2013 at 10:11 AM, Michael Beccaria
> <mbecca...@paulsmiths.edu> wrote:
>
> > Has anyone had any luck finding a good way to de-duplicate MARC records
> > from ebook vendors. We're looking to integrate Ebrary and Ebsco Academic
> > Ebook collections and they estimate an overlap into the 10's of
> > thousands.
> >
> >
>


Re: [CODE4LIB] De-dup MARC Ebook records

2013-08-15 Thread Andy Kohler
Are you expecting to work with two files of records, outside of your ILS?
If so, for a project like that I'd probably write Perl script(s) using
MARC::Record (there are similar code libraries for Ruby, Python and Java at
least).

For each record in each file, use the ISBN (and/or OCLC number and/or LCCN)
as a key.  Compare all sets, and keep one record per key.
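
The idea can be sketched in Python as below (a Perl version with
MARC::Record would follow the same shape). This is only an illustration:
records are plain dicts standing in for parsed MARC records, and the field
names "isbn", "oclc", and "lccn" are hypothetical keys rather than any
library's API.

```python
def dedup_by_key(record_sets, key_fields=("isbn", "oclc", "lccn")):
    """Keep the first record seen for each identifier across all sets."""
    seen = set()
    kept = []
    for records in record_sets:
        for rec in records:
            # Collect every identifier this record carries.
            keys = {(f, rec[f]) for f in key_fields if rec.get(f)}
            if keys & seen:
                continue  # an identifier was already seen: treat as duplicate
            seen |= keys
            kept.append(rec)  # records with no identifiers are always kept
    return kept


ebrary = [{"isbn": "9780306406157", "title": "Example A"}]
ebsco = [
    {"isbn": "9780306406157", "title": "Example A (EBSCO)"},
    {"oclc": "12345", "title": "Example B"},
]

unique = dedup_by_key([ebrary, ebsco])
print([r["title"] for r in unique])  # ['Example A', 'Example B']
```

Whichever file is listed first wins when two records share an identifier.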

This assumes that the vendors are supplying records with standard
identifiers, and not just their own record numbers.

If you're comparing each file with what's already in your ILS, then it'll
depend on the tools the ILS offers for matching incoming records to the
database.  Or, export the database and compare it with the files, as above.

Andy Kohler / UCLA Library Info Tech
akoh...@library.ucla.edu / 310 206-8312

On Thu, Aug 15, 2013 at 10:11 AM, Michael Beccaria  wrote:

> Has anyone had any luck finding a good way to de-duplicate MARC records
> from ebook vendors. We're looking to integrate Ebrary and Ebsco Academic
> Ebook collections and they estimate an overlap into the 10's of thousands.
>
>


[CODE4LIB] De-dup MARC Ebook records

2013-08-15 Thread Michael Beccaria
Has anyone had any luck finding a good way to de-duplicate MARC records from 
ebook vendors? We're looking to integrate Ebrary and Ebsco Academic Ebook 
collections, and they estimate an overlap in the tens of thousands.

Strategies, tools, software?

Mike Beccaria
Systems Librarian
Head of Digital Initiative
Paul Smith's College
518.327.6376
mbecca...@paulsmiths.edu
Become a friend of Paul Smith's Library on Facebook today!