So, the technical discussion about how to do the merging is probably better suited to ol-tech, but some of the general discussion about the magnitude and variety of the problem is probably of interest to folks here. Merging people with the same birth & death dates and the same or very similar names is certainly a high-probability strategy for picking the low-hanging fruit, but we need to do a lot more than that.
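A minimal sketch of that first pass, to make the idea concrete. It assumes each author record is a dict with "key", "name", "birth_date" and "death_date" fields (hypothetical field names chosen for illustration, as are the helper names fold_name and candidate_groups). It only proposes candidate groups for review and is not how Open Library's own merge tool works.

```python
import unicodedata
from collections import defaultdict


def fold_name(name):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop combining marks left behind by the decomposition (accents, breves, ...).
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in no_marks.lower())
    return " ".join(cleaned.split())


def candidate_groups(authors):
    """Group record keys that share folded name, birth date and death date."""
    groups = defaultdict(list)
    for author in authors:
        birth = author.get("birth_date")
        death = author.get("death_date")
        if not (birth and death):
            continue  # this pass only covers records that carry both dates
        groups[(fold_name(author["name"]), birth, death)].append(author["key"])
    return [keys for keys in groups.values() if len(keys) > 1]


# Hypothetical sample records; the keys A1..A4 are made up for illustration.
sample = [
    {"key": "A1", "name": "Georgiĭ Valentinovich Plekhanov",
     "birth_date": "1856", "death_date": "1918"},
    {"key": "A2", "name": "Georgii Valentinovich Plekhanov",
     "birth_date": "1856", "death_date": "1918"},
    {"key": "A3", "name": "Georgi Valentinovich Plekhanov",
     "birth_date": "1856", "death_date": "1918"},
    {"key": "A4", "name": "G. V. Plekhanov"},
]
print(candidate_groups(sample))  # [['A1', 'A2']]
```

Folding accents catches the "Georgiĭ" / "Georgii" pair, but the "Georgi" spelling and the undated "G. V. Plekhanov" record are both missed, which is why exact matching on name and dates can only be a first step.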
Here's the breakdown of counts from a recent dump:

  Author records with no dates - 5,525,614
  Author records with both birth and death dates - 328,755
  Author records with just birth date - 934,489
  Author records with just death date - 20,972

Below are some examples of sets of duplicates that I merged today. As you can see there is *great* variety in the spellings of author names and often only a small number of the records have dates, so we'll need some more powerful strategies (looking at common books authored, etc.) to make headway with this problem. All encoding problems, bad dates, etc. are in the original data.

One hint for anyone who's attempting to merge things by hand using search - don't include any words which sometimes have accented forms, because search will be restricted to just those with (or without) the accent. In the first example below, "Georgiĭ Valentinovich Plekhanov" won't find "Georgi Valentinovich Plekhanov" (or "Georgy Valentinovich Plekhanov", but for a different reason). Try to choose a search phrase like "Valentinovich Plekhanov" or "Plekhanov" which doesn't have accent or spelling variations (not always possible, of course).

Tom

18 records:

  Georgiĭ Valentinovich Plekhanov 1856 - 1918 - 49 books
  Georgiĭ Valentinovich Plekhanov 1856 - 1918 - 39 books
  G. V. Plekhanov - 30 books
  Georgii Valentinovich Plekhanov 1856 - 1918 - 8 books
  Georgi Plekhanov - 5 books
  G.V. Plekhanov - 3 books
  Georgi Valentinovich Plekhanov 1856 - 1918 - 3 books
  George V. Plekhanov - 2 books
  Georgi V. Plekhanov - 2 books
  Georgii Valentinovich Plekhanov 1856 - 1918 - 2 books
  G.V Plekhanov - 2 books
  Georgiĭ Valentinovich Plekhanov - 2 books
  Georgii Plekhanov - 1 book
  Georgii Valentinovich Plekhanov - 1 book
  Georges Plekhanov - 1 book
  Georgi Valentinovich Plekhanov - 1 book about , including Art and social life.
  Georgy Valentinovich Plekhanov - 1 book about , including Art and social life
  Georgii Valen Plekhanov - 0 books

15 records

  José Joaquín Fernández de Lizardi 1776 - 1827
  Jose Joaquin Fernandez de Lizardi 1776 - 1827
  Jose J. Fernandez de Lizardi
  José Joaquín Fernández de Lizardi
  José Joaquin Fernández de Lizardi 1776 - 1827
  Fernandez de Lizardi
  Fernandez De Lizardi
  Jose . Fernandez de Lizardi
  Jose Joaqui n Ferna ndez de Lizardi 1776 - 1827
  José Joaguin Fernández de Lizardi
  José Joaquín Fernández de Lizardi
  José Joaqúin Fernández de Lizardi
  Jos´e Joaqu´in Fern´andez de Lizardi
  Jose Joaqui n. Ferna ndez de Lizardi
  Jose Joaqui n. Ferna ndez de Lizardi

10 records

  Frédéric Guillaume de Vaudoncourt 1772 - 1845
  Vaudoncourt, Frédéric François Guillaume baron de 1772 - 1845
  Guillaume de Vaudoncourt 1772 - 1845
  Frédéric François Guillaume de Vaudoncourt
  Vaudoncourt, Fr©Øed©Øeric Fran©ʻcois Guillaume, baron de 1772 - 1
  Fréderic François [Guillaume de Vaudoncourt 1772 - 1845]
  Frédéric François Guillaume de Vaudoncourt 1772 - 1845
  Vaudoncourt, Frédéric François Guillaume de baron de 1772 - 1845
  Frédéric Guillaume de Vaudoncourt

  Frederick Henry Ambrose Scrivener 1813 - 1891
  Henry Ambrose Scrivener
  Frederick H. Scrivener
  Frederick Scrivener
  Frederick Henry A . Scrivener
  Frederick Henry A mbrose Scrivener
  Frederick Henry Ambrose Scrivener 1813 - 1891
  Frederick Henry Ambrose Scrivener 1813 - 1891

  Klementyna Hoffmanowa-Tańska 1798 - 1845
  Klementyna Tańska-Hoffmanowa 1798 - 1845
  Klementyna Tanska-Hoffmanowa 1798 - 1845
  Klementyna Tan?ska-Hoffmanowa 1798 - 1845
  Klementyna Hoffmanowa
  Klementyna Taska-Hoffmanowa 1798 - 1845
  Klementyna Hoffmanowa

On Thu, Aug 29, 2013 at 6:44 AM, Richard Light <[email protected]> wrote:

> Hi,
>
> In a general spirit of exploration, I took the OL author dump, extracted
> authors with dates, converted them to XML and fed them into a Modes [1]
> database. I have spent some time tidying up said dates so that they are,
> as far as possible, meaningful and indexable. I have limited my attention
> to authors with a death date and/or a birth date of 1950 or earlier.
>
> One potential use of this work, I thought, might be to find duplicate OL
> author records which represent the same person. I have discovered the
> de-duplication magic wand, and have done a few by hand. However, I am
> rather puzzled. For example, the last person I looked at was A. Hamon
> (1860-1939). In my Modes data I have two records for him, both with dates:
>
> http://openlibrary.org/authors/OL5218117A
> and
> http://openlibrary.org/authors/OL5358432A
>
> Both of these URLs dereference to an actual page, with associated works.
> However, in the de-duplication listing only the first of these identifiers
> is present (though I did find another A. Hamon entry to merge). So, two
> questions:
>
> 1. Is there a format in which I can express a set of instructions to merge
> authors programmatically, to avoid having to do this by hand? The
> excitement of doing this manually has already worn off, but Modes could
> easily tell me where authors have the same name and same DoB/DoD and help
> me to generate a list of identifiers to merge.
>
> 2. Why don't all the potential mergees appear in the merge listing,
> despite the fact that loads of clearly irrelevant entries do appear there?
>
> Thanks,
>
> Richard
>
> [1] http://modes.org.uk
> --
> *Richard Light*
>
> _______________________________________________
> Ol-discuss mailing list - [email protected]
> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-discuss
> Archives: http://www.mail-archive.com/[email protected]/
> To unsubscribe from this mailing list, send email to
> [email protected]
