Thanks, Tom. It seems to me that we have a general problem with accents that needs solving -- I have no idea how accents are handled in searching or merging, but I know that this has come up before:
https://github.com/internetarchive/openlibrary/issues/11

Presumably SOLR is able to deal with this?

kc

On 9/1/13 12:00 AM, Tom Morris wrote:
> So, the technical discussion about how to do the merging is probably
> more suited to ol-tech, but some of the general discussion about the
> magnitude and variety of the problem is probably of interest to folks
> here. Merging people with the same birth & death date and the same or
> very similar names is certainly a high probability strategy to get the
> low hanging fruit, but we need to do a lot more than that.
>
> Here's the breakdown of counts from a recent dump:
>
> Author records with no dates - 5,525,614
> Author records with both birth and death dates - 328,755
> Authors records with just birth date - 934,489
> Authors records with just death date - 20,972
>
> Below are some examples of sets of duplicates that I merged today. As
> you can see there is *great* variety in the spellings of author names
> and often only a small number of the records have dates, so we'll need
> some more powerful strategies (looking at common books authored, etc),
> to make headway with this problem. All encoding problems, bad dates,
> etc are in the original data.
>
> One hint for anyone who's attempting to merge things by hand using
> search - don't include any words which sometimes have accented forms
> because search will be restricted to just those with (or without) the
> accent. In the first example below, "Georgiĭ Valentinovich Plekhanov"
> won't find "Georgi Valentinovich Plekhanov" (or "Georgy Valentinovich
> Plekhanov" but for a different reason). Try to choose a search phrase
> like "Valentinovich Plekhanov" or "Plekhanov" which doesn't have accent
> or spelling variations (not always possible, of course).
>
> Tom
>
> 18 records:
> Georgiĭ Valentinovich Plekhanov 1856 - 1918 - 49 books
> Georgiĭ Valentinovich Plekhanov 1856 - 1918 - 39 books
> G. V. Plekhanov - 30 books
> Georgii Valentinovich Plekhanov 1856 - 1918 - 8 books
> Georgi Plekhanov - 5 books
> G.V. Plekhanov - 3 books
> Georgi Valentinovich Plekhanov 1856 - 1918 - 3 books
> George V. Plekhanov - 2 books
> Georgi V. Plekhanov - 2 books
> Georgii Valentinovich Plekhanov 1856 - 1918 - 2 books
> G.V Plekhanov - 2 books
> Georgiĭ Valentinovich Plekhanov - 2 books
> Georgii Plekhanov - 1 book
> Georgii Valentinovich Plekhanov - 1 book
> Georges Plekhanov - 1 book
> Georgi Valentinovich Plekhanov - 1 book about , including Art and social
> life.
> Georgy Valentinovich Plekhanov - 1 book about , including Art and social
> life
> Georgii Valen Plekhanov - 0 books
>
> 15 records
> José Joaquín Fernández de Lizardi 1776 - 1827
> Jose Joaquin Fernandez de Lizardi 1776 - 1827
> Jose J. Fernandez de Lizardi
> José Joaquín Fernández de Lizardi
> José Joaquin Fernández de Lizardi 1776 - 1827
> Fernandez de Lizardi
> Fernandez De Lizardi
> Jose . Fernandez de Lizardi
> Jose Joaqui n Ferna ndez de Lizardi 1776 - 1827
> José Joaguin Fernández de Lizardi
> José Joaquín Fernández de Lizardi
> José Joaqúin Fernández de Lizardi
> Jos´e Joaqu´in Fern´andez de Lizardi
> Jose Joaqui n. Ferna ndez de Lizardi
> Jose Joaqui n. Ferna ndez de Lizardi
>
> 10 records
> Frédéric Guillaume de Vaudoncourt 1772 - 1845
> Vaudoncourt, Frédéric François Guillaume baron de 1772 - 1845
> Guillaume de Vaudoncourt 1772 - 1845
> Frédéric François Guillaume de Vaudoncourt
> Vaudoncourt, Fr©Øed©Øeric Fran©ʻcois Guillaume, baron de 1772 - 1
> Fréderic François [Guillaume de Vaudoncourt 1772 - 1845]
> Frédéric François Guillaume de Vaudoncourt 1772 - 1845
> Vaudoncourt, Frédéric François Guillaume de baron de 1772 - 1845
> Frédéric Guillaume de Vaudoncourt
>
> Frederick Henry Ambrose Scrivener 1813 - 1891
> Henry Ambrose Scrivener
> Frederick H. Scrivener
> Frederick Scrivener
> Frederick Henry A . Scrivener
> Frederick Henry A mbrose Scrivener
> Frederick Henry Ambrose Scrivener 1813 - 1891
> Frederick Henry Ambrose Scrivener 1813 - 1891
>
>
> Klementyna Hoffmanowa-Tańska 1798 - 1845
> Klementyna Tańska-Hoffmanowa 1798 - 1845
> Klementyna Tanska-Hoffmanowa 1798 - 1845
> Klementyna Tan?ska-Hoffmanowa 1798 - 1845
> Klementyna Hoffmanowa
> Klementyna Taska-Hoffmanowa 1798 - 1845
> Klementyna Hoffmanowa
>
>
> On Thu, Aug 29, 2013 at 6:44 AM, Richard Light
> <rich...@light.demon.co.uk <mailto:rich...@light.demon.co.uk>> wrote:
>
>     Hi,
>
>     In a general spirit of exploration, I took the OL author dump,
>     extracted authors with dates, converted them to XML and fed them
>     into a Modes [1] database. I have spent some time tidying up said
>     dates so that they are, as far as possible, meaningful and
>     indexable. I have limited my attention to authors with a death date
>     and/or a birth date of 1950 or earlier.
>
>     One potential use of this work, I thought, might be to find
>     duplicate OL author records which represent the same person. I have
>     discovered the de-duplication magic wand, and have done a few by
>     hand. However, I am rather puzzled. For example, the last person I
>     looked at was A. Hamon (1860-1939). In my Modes data I have two
>     records for him, both with dates:
>
>     http://openlibrary.org/authors/OL5218117A
>     and
>     http://openlibrary.org/authors/OL5358432A
>
>     Both of these URLs dereference to an actual page, with associated
>     works. However, in the de-duplication listing only the first of
>     these identifiers is present (though I did find another A. Hamon
>     entry to merge). So, two questions:
>
>     1. Is there a format in which I can express a set of instructions to
>     merge authors programmatically, to avoid having to do this by hand?
>     The excitement of doing this manually has already worn off, but
>     Modes could easily tell me where authors have the same name and same
>     DoB/DoD and help me to generate a list of identifiers to merge.
>
>     2. Why don't all the potential mergees appear in the merge listing,
>     despite the fact that loads of clearly irrelevant entries do appear
>     there?
>
>     Thanks,
>
>     Richard
>
>     [1] http://modes.org.uk
>     --
>     *Richard Light*
>
>     _______________________________________________
>     Ol-discuss mailing list - Ol-discuss@archive.org
>     <mailto:Ol-discuss@archive.org>
>     http://mail.archive.org/cgi-bin/mailman/listinfo/ol-discuss
>     Archives: http://www.mail-archive.com/ol-discuss@archive.org/
>     To unsubscribe from this mailing list, send email to
>     ol-discuss-unsubscr...@archive.org
>     <mailto:ol-discuss-unsubscr...@archive.org>
>

--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet
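
To illustrate the accent question and the "same name, same birth and death date" strategy discussed above, here is a minimal Python sketch. It is not how Open Library search or merging is implemented. The record layout (dicts with "id", "name", "birth", "death") and the function names are hypothetical, chosen only for this example. Solr does ship an ASCIIFoldingFilterFactory that performs comparable folding at index time, but the thread does not say whether the Open Library schema uses it.

```python
import unicodedata
from collections import defaultdict


def fold_accents(name):
    """Return a lowercase, accent-free form of name, for comparison only."""
    # NFKD splits a letter such as "é" into "e" plus a combining accent.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks and keep the base characters.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-fold and collapse runs of whitespace.
    return " ".join(base.casefold().split())


def merge_candidates(authors):
    """Group record ids that share a folded name and both dates.

    `authors` is an iterable of dicts with the hypothetical keys
    "id", "name", "birth" and "death" (not the real dump schema).
    """
    groups = defaultdict(list)
    for author in authors:
        # Only records carrying both dates qualify for this strict match.
        if author.get("birth") and author.get("death"):
            key = (fold_accents(author["name"]), author["birth"], author["death"])
            groups[key].append(author["id"])
    # A group with a single record is not a duplicate set.
    return [ids for ids in groups.values() if len(ids) > 1]


if __name__ == "__main__":
    # Names are taken from the examples in the thread; the ids are placeholders.
    sample = [
        {"id": "r1", "name": "José Joaquín Fernández de Lizardi", "birth": "1776", "death": "1827"},
        {"id": "r2", "name": "Jose Joaquin Fernandez de Lizardi", "birth": "1776", "death": "1827"},
        {"id": "r3", "name": "José Joaquín Fernández de Lizardi"},
        {"id": "r4", "name": "Klementyna Tańska-Hoffmanowa", "birth": "1798", "death": "1845"},
        {"id": "r5", "name": "Klementyna Tanska-Hoffmanowa", "birth": "1798", "death": "1845"},
        {"id": "r6", "name": "Klementyna Hoffmanowa-Tańska", "birth": "1798", "death": "1845"},
    ]
    for ids in merge_candidates(sample):
        print(ids)
```

Folding only catches the accent half of the problem. Records without dates (r3), reordered name parts (r6), spelling variants such as "Georgi" versus "Georgii", and debris such as "Jos´e" still fall outside these groups, which is why the thread calls for stronger signals like books in common.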