Hi,
In a general spirit of exploration, I took the OL author dump, extracted
authors with dates, converted them to XML and fed them into a Modes [1]
database. I have spent some time tidying up said dates so that they
are, as far as possible, meaningful and indexable. I have limited my
attention to authors with a death date and/or a birth date of 1950 or
earlier.
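
For anyone who wants to try a similar extraction without Modes, here is a minimal sketch of the filtering step. It is not the process described above (which went via XML). It assumes the dump is tab-separated with the record type in the first column, the key in the second and the JSON record in the fifth, and that author records carry free-text `birth_date` and `death_date` fields. The function names, the "first four-digit number is the year" heuristic and the reading of the 1950 cut-off (any death date, or a birth year of 1950 or earlier) are illustrative assumptions.

```python
import json
import re

# Assumption: a usable year is the first standalone four-digit number in the text.
YEAR = re.compile(r"\b(\d{4})\b")


def first_year(text):
    """Return the first four-digit year found in a free-text date, or None."""
    match = YEAR.search(text or "")
    return int(match.group(1)) if match else None


def dated_authors(lines, cutoff=1950):
    """Yield (key, name, birth, death) for author rows that pass the date filter.

    Assumed filter: keep a record if it has a death year, or if it has a
    birth year no later than `cutoff`.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Assumed column layout: type, key, revision, last_modified, JSON.
        if len(fields) < 5 or fields[0] != "/type/author":
            continue
        record = json.loads(fields[4])
        birth = first_year(record.get("birth_date"))
        death = first_year(record.get("death_date"))
        if death is not None or (birth is not None and birth <= cutoff):
            yield fields[1], record.get("name"), birth, death
```

The generator can be fed an open file object for the dump, so the whole file never has to sit in memory.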
One potential use of this work, I thought, might be to find duplicate OL
author records which represent the same person. I have discovered the
de-duplication magic wand, and have done a few by hand. However, I am
rather puzzled. For example, the last person I looked at was A. Hamon
(1860-1939). In my Modes data I have two records for him, both with dates:
http://openlibrary.org/authors/OL5218117A
and
http://openlibrary.org/authors/OL5358432A
Both of these URLs dereference to an actual page, with associated
works. However, in the de-duplication listing only the first of these
identifiers is present (though I did find another A. Hamon entry to
merge). So, two questions:
1. Is there a format in which I can express a set of instructions to
merge authors programmatically, to avoid having to do this by hand? The
excitement of doing this manually has already worn off, but Modes could
easily tell me where authors have the same name and same DoB/DoD and
help me to generate a list of identifiers to merge.
2. Why don't all the potential mergees appear in the merge listing,
when plenty of clearly irrelevant entries do appear there?
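
To make question 1 concrete, here is a rough sketch of the kind of grouping I have in mind: collect author identifiers that share a name and both dates, and list the groups with more than one member. It assumes records have already been reduced to (key, name, birth year, death year) tuples. The function name, the name normalisation (lower-casing and collapsing whitespace) and the rule that both dates must be present are illustrative assumptions. The output is just a list of candidate groups, since I don't know what format (if any) a merge instruction should take.

```python
from collections import defaultdict


def merge_candidates(authors):
    """Group author keys that share a normalised name, birth year and death year.

    `authors` is an iterable of (key, name, birth, death) tuples.
    Returns {(name, birth, death): [keys...]} for groups of two or more.
    """
    groups = defaultdict(list)
    for key, name, birth, death in authors:
        # Assumption: require both dates, to keep false matches down.
        if name is None or birth is None or death is None:
            continue
        normalised = " ".join(name.lower().split())
        groups[(normalised, birth, death)].append(key)
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}


if __name__ == "__main__":
    # Sample rows based on the A. Hamon example; the third row is made up.
    sample = [
        ("OL5218117A", "A. Hamon", 1860, 1939),
        ("OL5358432A", "A. Hamon", 1860, 1939),
        ("OL0000000A", "Hypothetical Author", 1900, None),
    ]
    for (name, birth, death), keys in merge_candidates(sample).items():
        print(name, birth, death, " ".join(keys))
```

Each output line would be one proposed merge, which could then be checked by eye before anything is changed.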
Thanks,
Richard
[1] http://modes.org.uk
--
*Richard Light*
_______________________________________________
Ol-discuss mailing list - [email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-discuss
Archives: http://www.mail-archive.com/[email protected]/