Thanks, Tom. It seems to me that we have a general problem with accents 
that needs solving -- I have no idea how accents are handled in 
searching or merging, but I know that this has come up before:

https://github.com/internetarchive/openlibrary/issues/11

Presumably Solr is able to deal with this?
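
To make the accent question concrete, here is a minimal sketch of accent folding in Python, using only the standard library. It is not how Open Library or its Solr configuration actually handles names (I don't know that); the function names `fold_accents` and `search_key` are made up for illustration. Solr does ship folding filters such as ASCIIFoldingFilterFactory that do something along these lines at index and query time, but whether OL's schema uses one is an open question.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Remove combining marks so that 'José' compares equal to 'Jose'."""
    # NFKD splits a precomposed letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep everything except the combining marks themselves.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def search_key(name: str) -> str:
    """Hypothetical comparison key: accent-folded, case-folded, single-spaced."""
    return " ".join(fold_accents(name).casefold().split())


if __name__ == "__main__":
    print(fold_accents("José Joaquín Fernández de Lizardi"))
    # -> Jose Joaquin Fernandez de Lizardi
    print(search_key("Fernandez De Lizardi") == search_key("Fernandez de Lizardi"))
    # -> True
```

Folding only removes diacritics. It would not reconcile genuine spelling variants such as "Georgi" and "Georgy", and it does nothing for records whose encoding was already damaged in the source data.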

kc

On 9/1/13 12:00 AM, Tom Morris wrote:
> So, the technical discussion about how to do the merging is probably
> more suited to ol-tech, but some of the general discussion about the
> magnitude and variety of the problem is probably of interest to folks
> here.  Merging people with the same birth & death date and the same or
> very similar names is certainly a high probability strategy to get the
> low hanging fruit, but we need to do a lot more than that.
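
The "same dates, same or very similar name" strategy described above could be sketched roughly as follows. This is an illustration under assumptions, not an existing Open Library tool: the record layout (`key`, `name`, `birth`, `death`) is hypothetical rather than the dump's real schema, the identifiers are placeholders, and "very similar" is reduced here to "identical after accent and case folding".

```python
import unicodedata
from collections import defaultdict


def name_key(name):
    """Accent-fold, case-fold and collapse whitespace in a name."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(stripped.casefold().split())


def candidate_groups(records):
    """Return lists of record keys that share a folded name and both dates.

    Records missing either date are skipped: a name match alone is
    too weak to count as a candidate in this sketch.
    """
    groups = defaultdict(list)
    for rec in records:
        birth, death = rec.get("birth"), rec.get("death")
        if birth is None or death is None:
            continue
        groups[(name_key(rec["name"]), birth, death)].append(rec["key"])
    # Only groups with more than one member are merge candidates.
    return [sorted(keys) for keys in groups.values() if len(keys) > 1]


if __name__ == "__main__":
    # Placeholder identifiers; the names are taken from the examples in this thread.
    records = [
        {"key": "OL1A", "name": "José Joaquín Fernández de Lizardi", "birth": 1776, "death": 1827},
        {"key": "OL2A", "name": "Jose Joaquin Fernandez de Lizardi", "birth": 1776, "death": 1827},
        {"key": "OL3A", "name": "José Joaquin Fernández de Lizardi", "birth": 1776, "death": 1827},
        {"key": "OL4A", "name": "Fernandez de Lizardi", "birth": None, "death": None},
    ]
    print(candidate_groups(records))
```

The undated "Fernandez de Lizardi" record is left out, which is exactly the limitation raised in this thread: most variants have no dates, so a key like this only reaches the easy cases and the output should be treated as candidates for review, not as merge instructions.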
>
> Here's the breakdown of counts from a recent dump:
>
> Author records with no dates - 5,525,614
> Author records with both birth and death dates - 328,755
> Author records with just birth date - 934,489
> Author records with just death date - 20,972
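
For a sense of proportion, the counts above can be turned into shares with a few lines of Python. This assumes the four categories are exhaustive and do not overlap, which the message does not state explicitly.

```python
# Counts as quoted in the message; the category labels are shortened here.
counts = {
    "no dates": 5_525_614,
    "both dates": 328_755,
    "birth only": 934_489,
    "death only": 20_972,
}

# Assumption: the four categories cover every author record exactly once.
total = sum(counts.values())

for label, n in counts.items():
    print(f"{label:>10}: {n:>9,} ({n / total:.1%})")
print(f"{'total':>10}: {total:>9,}")
```

Under that assumption, fewer than 5% of author records carry both dates and roughly four in five carry none, which is why date matching alone cannot get very far.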
>
> Below are some examples of sets of duplicates that I merged today.  As
> you can see there is *great* variety in the spellings of author names
> and often only a small number of the records have dates, so we'll need
> some more powerful strategies (looking at common books authored, etc.)
> to make headway with this problem.  All encoding problems, bad dates,
> etc. are in the original data.
>
> One hint for anyone who's attempting to merge things by hand using
> search - don't include any words which sometimes have accented forms
> because search will be restricted to just those with (or without) the
> accent.  In the first example below, "Georgiĭ Valentinovich Plekhanov"
> won't find "Georgi Valentinovich Plekhanov" (or "Georgy Valentinovich
> Plekhanov" but for a different reason).  Try to choose a search phrase
> like "Valentinovich Plekhanov" or "Plekhanov" which doesn't have accent
> or spelling variations (not always possible, of course).
>
> Tom
>
> 18 records:
> Georgiĭ Valentinovich Plekhanov 1856 - 1918 - 49 books
> Georgiĭ Valentinovich Plekhanov 1856 - 1918 - 39 books
> G. V. Plekhanov - 30 books
> Georgii Valentinovich Plekhanov 1856 - 1918 - 8 books
> Georgi Plekhanov - 5 books
> G.V. Plekhanov - 3 books
> Georgi Valentinovich Plekhanov 1856 - 1918 - 3 books
> George V. Plekhanov  - 2 books
> Georgi V. Plekhanov - 2 books
> Georgii Valentinovich Plekhanov 1856 - 1918 - 2 books
> G.V Plekhanov - 2 books
> Georgiĭ Valentinovich Plekhanov  - 2 books
> Georgii Plekhanov - 1 book
> Georgii Valentinovich Plekhanov - 1 book
> Georges Plekhanov - 1 book
> Georgi Valentinovich Plekhanov - 1 book about , including Art and social
> life.
> Georgy Valentinovich Plekhanov - 1 book about , including Art and social
> life
> Georgii Valen Plekhanov - 0 books
>
>   15 records
> José Joaquín Fernández de Lizardi 1776 - 1827
> Jose Joaquin Fernandez de Lizardi 1776 - 1827
> Jose J. Fernandez de Lizardi
> José Joaquín Fernández de Lizardi
> José Joaquin Fernández de Lizardi 1776 - 1827
> Fernandez de Lizardi
> Fernandez De Lizardi
> Jose . Fernandez de Lizardi
> Jose Joaqui n Ferna ndez de Lizardi 1776 - 1827
> José Joaguin Fernández de Lizardi
> José Joaquín Fernández de Lizardi
> José Joaqúin Fernández de Lizardi
> Jos´e Joaqu´in Fern´andez de Lizardi
> Jose Joaqui n. Ferna ndez de Lizardi
> Jose Joaqui n. Ferna ndez de Lizardi
>
> 10 records
> Frédéric Guillaume de Vaudoncourt 1772 - 1845
> Vaudoncourt, Frédéric François Guillaume baron de 1772 - 1845
> Guillaume de Vaudoncourt 1772 - 1845
> Frédéric François Guillaume de Vaudoncourt
> Vaudoncourt, Fr©Øed©Øeric Fran©ʻcois Guillaume, baron de 1772 - 1
> Fréderic François [Guillaume de Vaudoncourt 1772 - 1845]
> Frédéric François Guillaume de Vaudoncourt 1772 - 1845
> Vaudoncourt, Frédéric François Guillaume de baron de 1772 - 1845
> Frédéric Guillaume de Vaudoncourt
>
> Frederick Henry Ambrose Scrivener 1813 - 1891
> Henry Ambrose Scrivener
> Frederick H. Scrivener
> Frederick Scrivener
> Frederick Henry A . Scrivener
> Frederick Henry A mbrose Scrivener
> Frederick Henry Ambrose Scrivener 1813 - 1891
> Frederick Henry Ambrose Scrivener 1813 - 1891
>
>
> Klementyna Hoffmanowa-Tańska 1798 - 1845
> Klementyna Tańska-Hoffmanowa 1798 - 1845
> Klementyna Tanska-Hoffmanowa 1798 - 1845
> Klementyna Tan?ska-Hoffmanowa 1798 - 1845
> Klementyna Hoffmanowa
> Klementyna Taska-Hoffmanowa 1798 - 1845
> Klementyna Hoffmanowa
>
>
> On Thu, Aug 29, 2013 at 6:44 AM, Richard Light
> <rich...@light.demon.co.uk> wrote:
>
>     Hi,
>
>     In a general spirit of exploration, I took the OL author dump,
>     extracted authors with dates, converted them to XML and fed them
>     into a Modes [1] database.  I have spent some time tidying up said
>     dates so that they are, as far as possible, meaningful and
>     indexable.  I have limited my attention to authors with a death date
>     and/or a birth date of 1950 or earlier.
>
>     One potential use of this work, I thought, might be to find
>     duplicate OL author records which represent the same person.  I have
>     discovered the de-duplication magic wand, and have done a few by
>     hand.  However, I am rather puzzled.  For example, the last person I
>     looked at was A. Hamon (1860-1939).  In my Modes data I have two
>     records for him, both with dates:
>
>     http://openlibrary.org/authors/OL5218117A
>     and
>     http://openlibrary.org/authors/OL5358432A
>
>     Both of these URLs dereference to an actual page, with associated
>     works.  However, in the de-duplication listing only the first of
>     these identifiers is present (though I did find another A. Hamon
>     entry to merge).  So, two questions:
>
>     1. Is there a format in which I can express a set of instructions to
>     merge authors programmatically, to avoid having to do this by hand?
>     The excitement of doing this manually has already worn off, but
>     Modes could easily tell me where authors have the same name and same
>     DoB/DoD and help me to generate a list of identifiers to merge.
>
>     2. Why don't all the potential mergees appear in the merge listing,
>     despite the fact that loads of clearly irrelevant entries do appear
>     there?
>
>     Thanks,
>
>     Richard
>
>     [1] http://modes.org.uk
>     --
>     *Richard Light*
>
>     _______________________________________________
>     Ol-discuss mailing list - Ol-discuss@archive.org
>     <mailto:Ol-discuss@archive.org>
>     http://mail.archive.org/cgi-bin/mailman/listinfo/ol-discuss
>     Archives: http://www.mail-archive.com/ol-discuss@archive.org/
>     To unsubscribe from this mailing list, send email to
>     ol-discuss-unsubscr...@archive.org
>     <mailto:ol-discuss-unsubscr...@archive.org>
>

-- 
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet