The technical discussion about how to do the merging is probably better
suited to ol-tech, but the general discussion about the magnitude and
variety of the problem is probably of interest to folks here.  Merging
people with the same birth and death dates and the same or very similar
names is certainly a high-probability strategy for picking the low-hanging
fruit, but we need to do a lot more than that.
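
As a rough illustration of that low-hanging-fruit strategy, here is a
minimal sketch that buckets records by (birth, death) and then compares
names within each bucket.  The tuple layout, the author keys and the 0.85
threshold are assumptions made for the example, not anything Open Library
defines, and difflib is just one convenient similarity measure.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def candidate_pairs(authors, threshold=0.85):
    """Yield pairs of author keys that share both dates and a similar name.

    `authors` is a hypothetical list of (key, name, birth, death) tuples.
    Records missing either date are skipped entirely.
    """
    buckets = defaultdict(list)
    for key, name, birth, death in authors:
        if birth and death:
            buckets[(birth, death)].append((key, name))

    # Only compare names inside a bucket, which keeps the pairwise work small.
    for group in buckets.values():
        for (k1, n1), (k2, n2) in combinations(group, 2):
            ratio = SequenceMatcher(None, n1.lower(), n2.lower()).ratio()
            if ratio >= threshold:
                yield k1, k2
```

Anything this yields is only a candidate for review; two different people
can share dates and a similar name.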

Here's the breakdown of counts from a recent dump:

Author records with no dates - 5,525,614
Author records with both birth and death dates - 328,755
Author records with just birth date - 934,489
Author records with just death date - 20,972
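
To put those counts in proportion, a quick calculation, assuming the four
categories above are exhaustive so that their sum is the total number of
author records: roughly 81% of records have no dates at all, and under 5%
have both.

```python
# Counts copied from the breakdown above.
counts = {
    "no dates": 5_525_614,
    "birth and death": 328_755,
    "birth only": 934_489,
    "death only": 20_972,
}

# Assumption: the four categories cover every author record in the dump.
total = sum(counts.values())


def share(label):
    """Percentage of all author records that fall in the given category."""
    return 100 * counts[label] / total


for label, n in counts.items():
    print(f"{label:>16}: {n:>9,} ({share(label):.1f}%)")
```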

Below are some examples of sets of duplicates that I merged today.  As you
can see, there is *great* variety in the spellings of author names, and
often only a small number of the records have dates, so we'll need some
more powerful strategies (looking at common books authored, etc.) to make
headway with this problem.  All encoding problems, bad dates, etc. are in
the original data.
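
One way the "common books authored" idea might be sketched: index authors
by title, then count how many titles each pair of author records shares.
The (author_key, title) input and the min_shared cutoff are hypothetical,
and matching on case-folded titles is a deliberately crude stand-in for
real work matching.

```python
from collections import defaultdict
from itertools import combinations


def shared_title_pairs(works, min_shared=2):
    """Return {(author_a, author_b): shared_titles} for overlapping pairs.

    `works` is a hypothetical iterable of (author_key, title) tuples.
    """
    # Map each case-folded title to the set of author records credited with it.
    authors_by_title = defaultdict(set)
    for author_key, title in works:
        authors_by_title[title.casefold().strip()].add(author_key)

    # Collect the titles each pair of author records has in common.
    shared = defaultdict(set)
    for title, keys in authors_by_title.items():
        for a, b in combinations(sorted(keys), 2):
            shared[(a, b)].add(title)

    return {pair: t for pair, t in shared.items() if len(t) >= min_shared}
```

This needs no dates at all, which matters given how few records have them,
but common titles would produce false positives, so the output is again
only a list of candidates.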

One hint for anyone who's attempting to merge things by hand using search -
don't include any words which sometimes have accented forms because search
will be restricted to just those with (or without) the accent.  In the
first example below, "Georgiĭ Valentinovich Plekhanov" won't find "Georgi
Valentinovich Plekhanov" (or "Georgy Valentinovich Plekhanov" but for a
different reason).  Try to choose a search phrase like "Valentinovich
Plekhanov" or "Plekhanov" which doesn't have accent or spelling variations
(not always possible, of course).
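
If you are working from a dump rather than the search box, accents can be
folded away before comparing names.  This is a minimal sketch using
Python's standard unicodedata module; fold_accents is a name made up for
the example, and it says nothing about how the site's search works.

```python
import unicodedata


def fold_accents(text):
    """Return `text` case-folded with combining accent marks removed."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# Accent variants collapse to the same key...
same = fold_accents("Georgiĭ Plekhanov") == fold_accents("Georgii Plekhanov")
# ...but genuine spelling variants still differ.
different = fold_accents("Georgiĭ Plekhanov") != fold_accents("Georgy Plekhanov")
```

Folding handles the accent problem only; "Georgi" and "Georgy" still need
some kind of fuzzy matching or transliteration rules.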

Tom

18 records:
Georgiĭ Valentinovich Plekhanov 1856 - 1918 - 49 books
Georgiĭ Valentinovich Plekhanov 1856 - 1918 - 39 books
G. V. Plekhanov - 30 books
Georgii Valentinovich Plekhanov 1856 - 1918 - 8 books
Georgi Plekhanov - 5 books
G.V. Plekhanov - 3 books
Georgi Valentinovich Plekhanov 1856 - 1918 - 3 books
George V. Plekhanov  - 2 books
Georgi V. Plekhanov - 2 books
Georgii Valentinovich Plekhanov 1856 - 1918 - 2 books
G.V Plekhanov - 2 books
Georgiĭ Valentinovich Plekhanov  - 2 books
Georgii Plekhanov - 1 book
Georgii Valentinovich Plekhanov - 1 book
Georges Plekhanov - 1 book
Georgi Valentinovich Plekhanov - 1 book (Art and social life)
Georgy Valentinovich Plekhanov - 1 book (Art and social life)
Georgii Valen Plekhanov - 0 books

15 records:
José Joaquín Fernández de Lizardi 1776 - 1827
Jose Joaquin Fernandez de Lizardi 1776 - 1827
Jose J. Fernandez de Lizardi
José Joaquín Fernández de Lizardi
José Joaquin Fernández de Lizardi 1776 - 1827
Fernandez de Lizardi
Fernandez De Lizardi
Jose . Fernandez de Lizardi
Jose Joaqui n Ferna ndez de Lizardi 1776 - 1827
José Joaguin Fernández de Lizardi
José Joaquín Fernández de Lizardi
José Joaqúin Fernández de Lizardi
Jos´e Joaqu´in Fern´andez de Lizardi
Jose Joaqui n. Ferna ndez de Lizardi
Jose Joaqui n. Ferna ndez de Lizardi

10 records
Frédéric Guillaume de Vaudoncourt 1772 - 1845
Vaudoncourt, Frédéric François Guillaume baron de 1772 - 1845
Guillaume de Vaudoncourt 1772 - 1845
Frédéric François Guillaume de Vaudoncourt
Vaudoncourt, Fr©Øed©Øeric Fran©ʻcois Guillaume, baron de 1772 - 1
Fréderic François [Guillaume de Vaudoncourt 1772 - 1845]
Frédéric François Guillaume de Vaudoncourt 1772 - 1845
Vaudoncourt, Frédéric François Guillaume de baron de 1772 - 1845
Frédéric Guillaume de Vaudoncourt


Frederick Henry Ambrose Scrivener 1813 - 1891
Henry Ambrose Scrivener
Frederick H. Scrivener
Frederick Scrivener
Frederick Henry A . Scrivener
Frederick Henry A mbrose Scrivener
Frederick Henry Ambrose Scrivener 1813 - 1891
Frederick Henry Ambrose Scrivener 1813 - 1891


Klementyna Hoffmanowa-Tańska 1798 - 1845
Klementyna Tańska-Hoffmanowa 1798 - 1845
Klementyna Tanska-Hoffmanowa 1798 - 1845
Klementyna Tan?ska-Hoffmanowa 1798 - 1845
Klementyna Hoffmanowa
Klementyna Taska-Hoffmanowa 1798 - 1845
Klementyna Hoffmanowa


On Thu, Aug 29, 2013 at 6:44 AM, Richard Light <[email protected]> wrote:

>  Hi,
>
> In a general spirit of exploration, I took the OL author dump, extracted
> authors with dates, converted them to XML and fed them into a Modes [1]
> database.  I have spent some time tidying up said dates so that they are,
> as far as possible, meaningful and indexable.  I have limited my attention
> to authors with a death date and/or a birth date of 1950 or earlier.
>
> One potential use of this work, I thought, might be to find duplicate OL
> author records which represent the same person.  I have discovered the
> de-duplication magic wand, and have done a few by hand.  However, I am
> rather puzzled.  For example, the last person I looked at was A. Hamon
> (1860-1939).  In my Modes data I have two records for him, both with dates:
>
> http://openlibrary.org/authors/OL5218117A
> and
> http://openlibrary.org/authors/OL5358432A
>
> Both of these URLs dereference to an actual page, with associated works.
> However, in the de-duplication listing only the first of these identifiers
> is present (though I did find another A. Hamon entry to merge).  So, two
> questions:
>
> 1. Is there a format in which I can express a set of instructions to merge
> authors programmatically, to avoid having to do this by hand?  The
> excitement of doing this manually has already worn off, but Modes could
> easily tell me where authors have the same name and same DoB/DoD and help
> me to generate a list of identifiers to merge.
>
> 2. Why don't all the potential mergees appear in the merge listing,
> despite the fact that loads of clearly irrelevant entries do appear there?
>
> Thanks,
>
> Richard
>
> [1] http://modes.org.uk
> --
> *Richard Light*
>
> _______________________________________________
> Ol-discuss mailing list - [email protected]
> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-discuss
> Archives: http://www.mail-archive.com/[email protected]/
> To unsubscribe from this mailing list, send email to
> [email protected]
>