http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=7419
Jared Camins-Esakov <jcam...@cpbibliography.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #18714|0                           |1
        is obsolete|                            |

--- Comment #35 from Jared Camins-Esakov <jcam...@cpbibliography.com> ---
Created attachment 20783
  --> http://bugs.koha-community.org/bugzilla3/attachment.cgi?id=20783&action=edit
Bug 7419: General-purpose record deduplicator

This patch adds a script for deduplicating records. It is most useful
for authority records, but by design it could easily be extended for use
with bibliographic records, if someone had a good use case.

See the follow-up for an updated test plan.

Complete POD documentation:

SYNOPSIS
    dedup_records.pl --match=1 -a

    dedup_records.pl --match="LC-card-number/010a" --select="date" \
      --limit="authid > 367123592" -a

    dedup_records.pl --match="Match/100abcdefghijklmnopqrstuvwxyz" \
      --select="source=DLC" --select="date" \
      --limit="authtypecode='PERSO_NAME'" -a

DESCRIPTION
    This script will identify duplicate records, and either suggest that
    you merge them (in the case of bibliographic records) or automatically
    merge them for you (in the case of authority records).

OPTIONS
    --help
        Prints this help

    -v|--verbose
        Print verbose log information (warning: very verbose!).

    -t|--test
        Do not actually make any changes to the database, just report what
        changes would be made.

    -r|--report
        Print a report of what happened during the run.

    -l|--limit=s
        Only process those records that match the user-specified WHERE
        clause (the WHERE is implied and should not be included on the
        command line).

    -a|--authorities
        Check for duplicate authorities rather than duplicate bibliographic
        records.

    -s|--select=s
        Repeatable. Specify how to identify which record to prefer. See the
        section on SELECTORS below.

    -m|--match=s
        Specifies the matching rule to use.
        This can be the numeric ID of a matching rule that you have already
        configured (preferred), or you can specify a matching rule on the
        command line in the following format:

          <index1>/<tag1><subfield1>[##<index2>/<tag2><subfield2>[##...]]

        Examples:

          at/152b##he-main/2..a##he/2..bxyzt##ident/009@
          authtype/152b##he-main,ext/2..a##he,ext/2..bxyz
          sn,ne,st-numeric/001##authtype/152b##he-main,ext/2..a##he,ext/2..bxyz

    -c|--check=s
        Only relevant when you are using a matching rule specified on the
        command line. Specifies sanity checks to use to ensure that the
        records are really duplicates. The format is

          <tag1><subfields1>[,<tag2><subfields2>[,...]]

        Examples:

          200abxyz will check subfields a,b,x,y,z of 200 fields
          009@,152b will check 009 data and 152$b subfields

SELECTORS
    This script supports a number of selectors for choosing which record is
    "better."

    score
        Prefer the record which is the best match based on the specified
        matching rule. This will probably only be useful in cases where the
        matching rule will not match the source record, since the source
        record will automatically be given a score of 2 * the matching rule
        threshold if it wasn't picked up by the matcher.

    date
        Prefer the record which is newer based on the 005 field.

    source=ABC
        MARC21 only. Prefer records which come from ABC based on the 003
        field.

    usage
        Authorities only. Prefer the record used in the most bibliographic
        records.

    ppn
        UNIMARC only. Prefer records which have a PPN in the 009 field.
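
To make the command-line formats for --match and --check concrete, here is a minimal Python sketch of how such strings could be split into their parts. It is not the parsing code from dedup_records.pl (which is Perl and is not shown in this comment). The names MatchPoint, parse_match_rule and parse_checks are hypothetical, and the sketch assumes that a tag is always exactly three characters and that everything after it is a list of subfield codes. The index part (for example "he-main,ext") is kept verbatim rather than interpreted.

```python
from typing import List, NamedTuple, Tuple


class MatchPoint(NamedTuple):
    """One <index>/<tag><subfields> element of a command-line matching rule."""
    index: str      # search index, kept verbatim (e.g. "he-main,ext")
    tag: str        # assumed to be three characters; may contain "." wildcards
    subfields: str  # subfield codes, possibly empty (e.g. "" for "001")


def parse_match_rule(rule: str) -> List[MatchPoint]:
    """Split '<index1>/<tag1><subfield1>[##<index2>/...]' into match points."""
    points = []
    for part in rule.split("##"):
        index, sep, field = part.partition("/")
        if not sep or not index or len(field) < 3:
            raise ValueError(f"malformed match point: {part!r}")
        points.append(MatchPoint(index, field[:3], field[3:]))
    return points


def parse_checks(spec: str) -> List[Tuple[str, str]]:
    """Split '<tag1><subfields1>[,<tag2><subfields2>[,...]]' into pairs."""
    checks = []
    for part in spec.split(","):
        if len(part) < 3:
            raise ValueError(f"malformed check: {part!r}")
        checks.append((part[:3], part[3:]))
    return checks


if __name__ == "__main__":
    # Examples taken from the help text above.
    for point in parse_match_rule("at/152b##he-main/2..a##ident/009@"):
        print(point)
    print(parse_checks("009@,152b"))
```

Under these assumptions, "sn,ne,st-numeric/001" yields a match point with an empty subfield list, and "009@" is treated as tag 009 with the single code "@", which the help text describes as the 009 data.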