Re: [CODE4LIB] De-dup MARC Ebook records
Mike,

Our RecordManager (https://github.com/KDK-Alli/RecordManager), used in conjunction with VuFind, does deduplication with our own algorithm. This might be of some interest to you. RecordManager works standalone, so no VuFind installation is needed. Some parts are still in active development, but the deduplication has been working pretty well so far. A short description of the algorithm is available at <https://github.com/KDK-Alli/RecordManager/wiki/Deduplication>, and the actual PHP code is in <https://github.com/KDK-Alli/RecordManager/blob/master/classes/RecordManager.php>, starting at the dedupRecord function.

--Ere

On 22.8.2013 18.07, Michael Beccaria wrote:
> Steve,
> I don't think it's so much find a control field (however, the closest match I can use is ISBN or eISBN which has its issues) but also normalizing the data in the fields so that matches are produced. It will no doubt take some time to figure out.

--
Ere Maijala
Kansalliskirjasto / The National Library of Finland
Re: [CODE4LIB] De-dup MARC Ebook records
Yes, Open Library implemented it, and, of course, where it doesn't work is where the data is pretty bad. If you prefer to err on the side of merging, you can loosen the algorithm's weights. It's based on the algorithm used for the U Cal union catalog, which was exercised over about 20 years. Last I was able to ascertain, the data elements are very similar to the ones used (at least at the time) by WorldCat.

kc

On 8/22/13 1:21 PM, Michael Beccaria wrote:
> Karen,
> Do you have a sense of how well it actually works? Is Open Library implementing it?

--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet
Re: [CODE4LIB] De-dup MARC Ebook records
Karen,
Do you have a sense of how well it actually works? Is Open Library implementing it?

Mike Beccaria
Systems Librarian
Head of Digital Initiative
Paul Smith's College
518.327.6376
mbecca...@paulsmiths.edu
Become a friend of Paul Smith's Library on Facebook today!

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Karen Coyle
Sent: Thursday, August 22, 2013 11:53 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] De-dup MARC Ebook records

The record matching algorithm used by the Open Library is available here:
https://github.com/openlibrary/openlibrary/tree/master/openlibrary/catalog/merge

The original spec, which may have changed in the implementation, is here:
http://kcoyle.net/merge.html

kc
Re: [CODE4LIB] De-dup MARC Ebook records
The record matching algorithm used by the Open Library is available here:
https://github.com/openlibrary/openlibrary/tree/master/openlibrary/catalog/merge

The original spec, which may have changed in the implementation, is here:
http://kcoyle.net/merge.html

kc

On 8/22/13 8:07 AM, Michael Beccaria wrote:
> Steve,
> I don't think it's so much find a control field (however, the closest match I can use is ISBN or eISBN which has its issues) but also normalizing the data in the fields so that matches are produced. It will no doubt take some time to figure out.

--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet
Re: [CODE4LIB] De-dup MARC Ebook records
Steve,
I don't think it's so much finding a control field (the closest match I can use is ISBN or eISBN, which has its issues) as normalizing the data in the fields so that matches are produced. It will no doubt take some time to figure out.

Mike Beccaria
Systems Librarian
Head of Digital Initiative
Paul Smith's College
518.327.6376
mbecca...@paulsmiths.edu
Become a friend of Paul Smith's Library on Facebook today!

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of McDonald, Stephen
Sent: Friday, August 16, 2013 8:16 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] De-dup MARC Ebook records

If you couldn't get adequate deduping from the control fields available in MarcEdit deduping, what control fields do you think you need to dedup on? You can actually specify any arbitrary field and subfield for deduping in MarcEdit.

Steve McDonald
steve.mcdon...@tufts.edu
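One way to picture the normalization step Mike describes is to reduce every ISBN to a single canonical form before comparing. The sketch below is only an illustration, not anything posted in the thread: the function name `normalize_isbn` is hypothetical, it assumes the raw value looks like a typical 020 $a string (digits, optional hyphens, optional trailing qualifier), and it does not validate check digits. It also cannot reconcile a print ISBN with a different eISBN for the same title, which may be one of the "issues" Mike alludes to.

```python
import re
from typing import Optional


def normalize_isbn(raw: str) -> Optional[str]:
    """Return a 13-digit ISBN string, or None if no ISBN-like value is found."""
    # Drop hyphens and whitespace, then take the leading run of ISBN characters
    # so that trailing qualifiers such as "(pbk.)" are ignored.
    compact = re.sub(r"[\s-]", "", raw).upper()
    match = re.match(r"\d{13}|\d{9}[\dX]", compact)
    if match is None:
        return None
    isbn = match.group(0)
    if len(isbn) == 13:
        return isbn
    # ISBN-10 to ISBN-13: prefix 978, drop the old check digit, and recompute
    # the check digit with alternating weights 1 and 3.
    core = "978" + isbn[:9]
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)
```

With both vendors' ISBNs passed through a function like this, a 10-digit and a 13-digit form of the same ISBN compare equal as plain strings.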
Re: [CODE4LIB] De-dup MARC Ebook records
Michael Beccaria said:
> Thanks for the replies. To clarify, I am working with 2 (or more in the future) marc records outside of the ILS. I've tried using Marcedit but my usage did vary...not much overlap with the control fields that were available to me. I have a feeling they are a bit varied.
> I'm also messing around with marcXimiL a little but I'm having trouble getting it to output any records at all. I also was looking at the XC aggregation module but I was having trouble getting that to work properly as well and the listserv was unresponsive. It seemed like good software but it required me to set up an OAI harvest source to allow it to ingest the records and that...well...enough is enough...
> I think I will probably need to write something, and at least that way I know what it will be doing rather than plowing through software that has little to no support. Please feel free to let me know of a particular strategy you think might work best in this regard...

If you couldn't get adequate deduping from the control fields available in MarcEdit deduping, what control fields do you think you need to dedup on? You can actually specify any arbitrary field and subfield for deduping in MarcEdit.

Steve McDonald
steve.mcdon...@tufts.edu
Re: [CODE4LIB] De-dup MARC Ebook records
Michael Beccaria wrote:
> I'm also messing around with marcXimiL a little but I'm having trouble getting it to output any records at all.

Glad to hear someone's looking at our little toy :-) If I can be of any help with it, just let me know!

Best regards,
Alain Borel
MarcXimiL co-author
Re: [CODE4LIB] De-dup MARC Ebook records
Thanks for the replies. To clarify, I am working with 2 (or more in the future) sets of MARC records outside of the ILS. I've tried using MarcEdit but my usage did vary...not much overlap with the control fields that were available to me. I have a feeling they are a bit varied.

I'm also messing around with MarcXimiL a little, but I'm having trouble getting it to output any records at all. I was also looking at the XC aggregation module, but I was having trouble getting that to work properly as well, and the listserv was unresponsive. It seemed like good software, but it required me to set up an OAI harvest source to allow it to ingest the records and that...well...enough is enough...

I think I will probably need to write something, and at least that way I know what it will be doing rather than plowing through software that has little to no support. Please feel free to let me know of a particular strategy you think might work best in this regard...

Mike Beccaria
Systems Librarian
Head of Digital Initiative
Paul Smith's College
518.327.6376
mbecca...@paulsmiths.edu
Become a friend of Paul Smith's Library on Facebook today!

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Andy Kohler
Sent: Thursday, August 15, 2013 2:29 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] De-dup MARC Ebook records

Are you expecting to work with two files of records, outside of your ILS? If so, for a project like that I'd probably write Perl script(s) using MARC::Record (there are similar code libraries for Ruby, Python and Java at least).

For each record in each file, use the ISBN (and/or OCLC number and/or LCCN) as a key. Compare all sets, and keep one record per key.

This assumes that the vendors are supplying records with standard identifiers, and not just their own record numbers.

If you're comparing each file with what's already in your ILS, then it'll depend on the tools the ILS offers for matching incoming records to the database. Or, export the database and compare it with the files, as above.

Andy Kohler / UCLA Library Info Tech
akoh...@library.ucla.edu / 310 206-8312
Re: [CODE4LIB] De-dup MARC Ebook records
Michael -

I'm just about to load ebook records into our Innovative catalog, and I'm going to keep the ebooks separate from the print book records. For ebooks, I'm going to copy the OCLC number to the 901 with a prestamp, and overlay on that. So only records loaded with our ebook load table will have this 901 to overlay on. Then I'm going to protect the 856s and the 710s for the ebook collection statement. That'll take care of adds.

For deletes... I haven't got that worked out yet. I think there's a way to delete a field based on the incoming field.

Cindy Harper
Virginia Theological Seminary
char...@vts.edu
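As a rough illustration of the "copy the OCLC number to the 901 with a prestamp" step, here is a sketch of what a pre-load script could do. It is not Cindy's actual load-table setup, which is configured inside the ILS. Everything named here is an assumption: the simplified `(tag, subfield code, value)` field shape, the `ebk` prefix, the choice of subfield $a for the 901, and the expectation that the OCLC number sits in an 035 $a value beginning with `(OCoLC)`.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical, simplified field shape: (tag, subfield code, value).
Field = Tuple[str, str, str]

PRESTAMP = "ebk"  # hypothetical prefix marking records from the ebook load


def oclc_number(fields: List[Field]) -> Optional[str]:
    """Return the digits of the first 035 $a value that starts with (OCoLC)."""
    for tag, code, value in fields:
        if tag == "035" and code == "a" and value.startswith("(OCoLC)"):
            # Strip prefixes such as "ocm"/"ocn" and any leading zeros.
            digits = re.sub(r"\D", "", value).lstrip("0")
            if digits:
                return digits
    return None


def add_901(fields: List[Field]) -> List[Field]:
    """Return a copy of the fields with a prestamped 901 $a appended."""
    number = oclc_number(fields)
    if number is None:
        # No OCLC number found: leave the record unchanged.
        return list(fields)
    return list(fields) + [("901", "a", PRESTAMP + number)]
```

Because only records that passed through this step carry the prefixed 901, an overlay rule keyed on that field would not touch print records that share the same OCLC number.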
Re: [CODE4LIB] De-dup MARC Ebook records
Your mileage may vary, but MarcEdit has a dedup tool that will allow you to take two files and find duplications. It also has a merge tool that will allow you to take two files and merge specific fields into one or another (useful if you want fields like the 856 from two packages in the same record). There are some assumptions made when matching (dedup can use any field/subfield pair, but obviously control numbers are better -- merging is done using a heuristic analysis of 20-25 different field points, with significance weighted to create a match score for merge), but some folks find it useful if you don't want to code something up yourself.

Otherwise, if I were coding this, I'd stay away from needing exact matches. I've found that when doing matches, I like to include a wide range of elements, and then use a fuzzy match when working with titles because the title isn't as fixed as you might like -- especially if the cataloging level varies.

--tr

On Thu, Aug 15, 2013 at 2:29 PM, Andy Kohler wrote:
> Are you expecting to work with two files of records, outside of your ILS? If so, for a project like that I'd probably write Perl script(s) using MARC::Record (there are similar code libraries for Ruby, Python and Java at least).
>
> For each record in each file, use the ISBN (and/or OCLC number and/or LCCN) as a key. Compare all sets, and keep one record per key.
>
> This assumes that the vendors are supplying records with standard identifiers, and not just their own record numbers.
>
> If you're comparing each file with what's already in your ILS, then it'll depend on the tools the ILS offers for matching incoming records to the database. Or, export the database and compare it with the files, as above.
>
> Andy Kohler / UCLA Library Info Tech
> akoh...@library.ucla.edu / 310 206-8312
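The idea of combining several elements into a weighted match score, with a fuzzy comparison for titles, can be sketched roughly as follows. This is not MarcEdit's algorithm, which the message says uses 20-25 field points. The three fields, the weights, the 0.85 threshold, and the plain-dict record shape are all assumptions chosen for illustration, and the fuzzy comparison uses `difflib.SequenceMatcher` from the Python standard library.

```python
import re
from difflib import SequenceMatcher
from typing import Dict

# Hypothetical weights and threshold, chosen only for illustration.
WEIGHTS = {"title": 0.6, "author": 0.25, "year": 0.15}
THRESHOLD = 0.85


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_score(rec_a: Dict[str, str], rec_b: Dict[str, str]) -> float:
    """Weighted sum of field comparisons; missing fields contribute 0."""
    score = 0.0
    # Title and author are compared fuzzily.
    for field in ("title", "author"):
        a, b = rec_a.get(field, ""), rec_b.get(field, "")
        if a and b:
            score += WEIGHTS[field] * similarity(a, b)
    # Publication year is compared exactly rather than fuzzily.
    if rec_a.get("year") and rec_a.get("year") == rec_b.get("year"):
        score += WEIGHTS["year"]
    return score


def is_probable_duplicate(rec_a: Dict[str, str], rec_b: Dict[str, str]) -> bool:
    """True when the weighted score reaches the (hypothetical) threshold."""
    return match_score(rec_a, rec_b) >= THRESHOLD
```

Lowering `THRESHOLD` or shifting weight toward the title makes the comparison err on the side of merging; raising it makes it more conservative. Any real values would need tuning against a sample of known duplicates.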
Re: [CODE4LIB] De-dup MARC Ebook records
Are you expecting to work with two files of records, outside of your ILS? If so, for a project like that I'd probably write Perl script(s) using MARC::Record (there are similar code libraries for Ruby, Python and Java at least).

For each record in each file, use the ISBN (and/or OCLC number and/or LCCN) as a key. Compare all sets, and keep one record per key.

This assumes that the vendors are supplying records with standard identifiers, and not just their own record numbers.

If you're comparing each file with what's already in your ILS, then it'll depend on the tools the ILS offers for matching incoming records to the database. Or, export the database and compare it with the files, as above.

Andy Kohler / UCLA Library Info Tech
akoh...@library.ucla.edu / 310 206-8312

On Thu, Aug 15, 2013 at 10:11 AM, Michael Beccaria wrote:
> Has anyone had any luck finding a good way to de-duplicate MARC records from ebook vendors. We're looking to integrate Ebrary and Ebsco Academic Ebook collections and they estimate an overlap into the 10's of thousands.
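The "one record per key" strategy described above can be sketched in a few lines. The message suggests Perl with MARC::Record; the sketch below uses Python instead and deliberately leaves out MARC parsing. It assumes each record has already been reduced to a plain dict with optional `isbn`, `oclc`, and `lccn` strings. That record shape and the function names `match_key` and `dedupe` are hypothetical.

```python
import re
from typing import Dict, Iterable, List, Optional

# Hypothetical record shape: a plain dict of already-extracted identifier
# strings. Reading them out of real MARC fields is left to a MARC library.
Record = Dict[str, str]


def match_key(record: Record) -> Optional[str]:
    """Return the first available standard identifier, prefixed by its type."""
    for id_type in ("isbn", "oclc", "lccn"):
        value = record.get(id_type, "")
        # Keep only digits and a possible ISBN check character 'X'.
        cleaned = re.sub(r"[^0-9Xx]", "", value).upper()
        if id_type == "oclc":
            # OCLC numbers are often zero-padded (e.g. after "ocm").
            cleaned = cleaned.lstrip("0")
        if cleaned:
            return f"{id_type}:{cleaned}"
    return None


def dedupe(record_sets: Iterable[Iterable[Record]]) -> List[Record]:
    """Keep the first record seen for each key; keep all keyless records."""
    seen = set()
    kept = []
    for records in record_sets:
        for record in records:
            key = match_key(record)
            if key is None:
                # No standard identifier: cannot be matched, so keep it.
                kept.append(record)
            elif key not in seen:
                seen.add(key)
                kept.append(record)
    return kept
```

The order of the input sets decides which vendor's record survives, so the preferred source should be listed first. Note that a record carrying only an ISBN will never match one carrying only an OCLC number. This sketch shares the limitation the message points out: it depends on vendors supplying the same standard identifiers.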
[CODE4LIB] De-dup MARC Ebook records
Has anyone had any luck finding a good way to de-duplicate MARC records from ebook vendors? We're looking to integrate Ebrary and Ebsco Academic Ebook collections, and they estimate an overlap in the tens of thousands. Strategies, tools, software?

Mike Beccaria
Systems Librarian
Head of Digital Initiative
Paul Smith's College
518.327.6376
mbecca...@paulsmiths.edu
Become a friend of Paul Smith's Library on Facebook today!