Re: [Ldsoss] Automating Genealogy Research

Jay Askren Fri, 03 Nov 2006 06:14:48 -0800

I would have to disagree about this not being a computer science problem. Computers can't solve the problem for sure, but they can take us farther than we are now. Our company focuses on Artificial Intelligence research, and my background is in Computational Linguistics, or in other words, Natural Language Processing.

First, there are several tools that I know about that are at least heading in the right direction.

http://www.ancestry.com - Ancestry does have a little data mining group which is working on extracting information from the web automatically. I applied for a job there a couple of years ago, but they were looking for more of a web interface person, which isn't my strongest skill. An example of where data mining/text processing is used is their obituary search. They have a crawler which extracts information from obituaries so it can be added to their search engine. It certainly isn't perfect, but it does a decent job. I believe they also have a group which tries to combine information from various databases and build family trees with it.

http://www.myheritage.com/FP/Company/myheritage-research.php - My Heritage has a meta search engine for genealogy. It searches quite a few other genealogy web sites for the name you give it and gives you all of the results.

http://www.werelate.org/ - We Relate searches the web for genealogy information. I think this is becoming a great resource and will only get better.

These are all search engines, which isn't quite what you were asking for, but I think it's as close as we can get for now. It may even be the best we will ever do, but if we can make better search engines, that will make genealogy much easier. It would really be great if we could digitize genealogy books and make those searchable.
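
To give a feel for the kind of text processing behind something like the obituary search mentioned above, here is a minimal Python sketch. I don't know how Ancestry's crawler actually works, so everything here is an assumption for illustration: the function name, the patterns, and the sample sentence are all made up, and real obituaries vary far more than a few regular expressions can handle.

```python
import re

# Hypothetical patterns for one common obituary phrasing; not Ancestry's method.
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
DATE = r"(?:" + "|".join(MONTHS) + r") \d{1,2}, \d{4}"

# "John A. Smith, 84, ..." at the very start of the text.
NAME_AGE = re.compile(r"(?P<name>[A-Z][\w.'-]*(?: [A-Z][\w.'-]*)+), (?P<age>\d{1,3}),")
DIED = re.compile(r"(?:died|passed away)(?: on)? (?P<date>" + DATE + r")")
BORN = re.compile(r"born(?: on)? (?P<date>" + DATE + r")")


def extract_obituary_facts(text):
    """Return whichever of name, age, died, born can be found in the text."""
    text = text.strip()
    facts = {}
    match = NAME_AGE.match(text)  # anchored at the start of the obituary
    if match:
        facts["name"] = match.group("name")
        facts["age"] = int(match.group("age"))
    match = DIED.search(text)
    if match:
        facts["died"] = match.group("date")
    match = BORN.search(text)
    if match:
        facts["born"] = match.group("date")
    return facts


if __name__ == "__main__":
    sample = ("John A. Smith, 84, of Provo, Utah, died November 1, 2006. "
              "He was born March 3, 1922 in Leeds, England.")
    print(extract_obituary_facts(sample))
```

Anything the patterns miss is simply left out of the result, which is one reason this sort of extraction is never perfect: a human still has to check what the crawler found.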

One paradigm which has been around for a while but hasn't taken off yet is the semantic web. In theory it could enable more of what you are talking about. The idea is to codify information so it can be understood by a machine. So, in the context of genealogy, each person would have a unique identifier called a URI. Then we can make assertions about different people. For instance, we can assert that http://www.familysearch.org/person1 is the same as http://www.ancestry.com/person2 and that http://www.familysearch.org/person1 is the child of http://www.ancestry.com/person3. The semantic web technologies can also do inference, so a software agent could infer that person 2 is also the child of person 3. Here's some more reading on it:

http://jay.askren.net/Projects/SemWeb/

http://polaris.gseis.ucla.edu/mleahey/genealogyAndSemanticWebXHTML.htm
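
The "same as" and "child of" inference described above can be sketched in a few lines of Python. This is only a toy under my own assumptions: it uses plain tuples rather than any real semantic web library or language, and the predicate names sameAs and childOf and the function infer are made up for illustration. The URIs are the example ones from above.

```python
from itertools import product

# Hypothetical predicate names, chosen for this sketch only.
SAME_AS = "sameAs"
CHILD_OF = "childOf"


def infer(triples):
    """Return the input triples plus every non-sameAs statement rewritten
    for all URIs that are asserted (directly or transitively) to be the
    same person."""
    parent = {}

    def find(uri):
        # Union-find lookup: returns the representative URI of a group.
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]
            uri = parent[uri]
        return uri

    # Register every URI and merge the groups linked by sameAs.
    for subject, predicate, obj in triples:
        find(subject)
        find(obj)
        if predicate == SAME_AS:
            parent[find(subject)] = find(obj)

    # Collect the members of each group of equivalent URIs.
    groups = {}
    for uri in list(parent):
        groups.setdefault(find(uri), set()).add(uri)

    # Copy each ordinary statement across all equivalent subjects and objects.
    inferred = set(triples)
    for subject, predicate, obj in triples:
        if predicate == SAME_AS:
            continue
        for s, o in product(groups[find(subject)], groups[find(obj)]):
            inferred.add((s, predicate, o))
    return inferred


if __name__ == "__main__":
    person1 = "http://www.familysearch.org/person1"
    person2 = "http://www.ancestry.com/person2"
    person3 = "http://www.ancestry.com/person3"
    facts = {(person1, SAME_AS, person2), (person1, CHILD_OF, person3)}
    for triple in sorted(infer(facts) - facts):
        print(triple)  # the statement about person2 that was never asserted
```

The hard part is not this bookkeeping but deciding when a "same as" assertion is justified in the first place.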

In theory, if all genealogy databases were coded up using the semantic web languages, computers could combine them to make family trees with some help from humans. In practice I don't know if this will ever happen. I don't know if the semantic web will ever take off. It's still very much a research topic and has been for quite some time. I haven't really seen a real application come out of the research that couldn't be done just as easily with plain XML.

Now, along those same lines, it sounds like the church's new genealogy web site uses the same principles, in that each person has a unique identifier and at least some limited inference can be done as far as asserting that people are the same people. I'm very interested in being able to use the application, and can't wait until it's finished. I haven't heard anything about it for a while.

A huge problem with having computers do the research, in addition to the Campbell's Soup problem, is the problem of ambiguity. If I search for John Smith in Family Search, it will come back with a lot of names, and it's quite difficult to figure out which John Smiths are the same person. Humans would have to mark up which John Smiths are the same, which it sounds like is the focus of the new church genealogy website.
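
As a rough illustration of where software could still help with the ambiguity, here is a small Python sketch that ranks pairs of records as candidates for a human to review. It is not how Family Search or the new church site works; the record fields, the two-year tolerance, the scoring rule, and the names PersonRecord, match_score, and candidate_pairs are all assumptions I made up for the example.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Optional


@dataclass(frozen=True)
class PersonRecord:
    """A hypothetical, heavily simplified genealogy record."""
    name: str
    birth_year: Optional[int] = None
    birth_place: Optional[str] = None
    spouse: Optional[str] = None


def match_score(a, b, year_tolerance=2):
    """Fraction of comparable fields that agree, from 0.0 to 1.0.

    Returns 0.0 when the names differ or when nothing besides the name
    can be compared, since a shared name alone proves very little.
    """
    if a.name.casefold() != b.name.casefold():
        return 0.0
    compared = 0
    agreeing = 0
    if a.birth_year is not None and b.birth_year is not None:
        compared += 1
        if abs(a.birth_year - b.birth_year) <= year_tolerance:
            agreeing += 1
    for field in ("birth_place", "spouse"):
        x, y = getattr(a, field), getattr(b, field)
        if x and y:
            compared += 1
            if x.casefold() == y.casefold():
                agreeing += 1
    return agreeing / compared if compared else 0.0


def candidate_pairs(records, threshold=0.5):
    """Return (score, a, b) for pairs worth showing to a person, best first."""
    scored = [(match_score(a, b), a, b) for a, b in combinations(records, 2)]
    kept = [item for item in scored if item[0] >= threshold]
    return sorted(kept, key=lambda item: item[0], reverse=True)


if __name__ == "__main__":
    records = [
        PersonRecord("John Smith", 1820, "Leeds", "Mary Brown"),
        PersonRecord("john smith", 1821, "Leeds"),
        PersonRecord("John Smith", 1875, "Boston"),
    ]
    for score, a, b in candidate_pairs(records):
        print(score, a, b)
```

The score only narrows down which John Smiths are worth comparing; the decision that two records really are the same person still belongs to the researcher.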

Jay

On 11/2/06, Paul Penrod <[EMAIL PROTECTED]> wrote:

What you're describing is the Campbell's Soup problem, which was part of
the AI research and deployment
back in the 1980's when Lisp and Business Intelligence systems were in
vogue.

However, before we can dive into the "done" part of your request, you
need to narrow it down to something
more specific. The implied assumption is that databases and information
heaps are similar in nature with respect
to data, arrangement, relationships, etc. They are not. Geopolitically,
there are in excess of 190 Countries, Kingdoms,
provinces, protectorates, etc. There are over 100 languages spoken,
plus there are great dissimilarities in
record keeping in terms of important information.

If you want to discuss a more concrete solution to "flailing about
looking for diamonds", (research), then let's
narrow the discussion to something smaller in scope with a known
terminus. From your surname, I would
take an educated guess that many of your records start or lie within the
US/UK/Ireland/Scotland venue, and
branch out to other points within Europe due to intermarriage within
lesser and greater royal lines (typical
for many people).

The Church has already placed a great deal of effort in this
data set, as well as many other organizations;
partly due to US immigrant heritage at the time, and partly due to the
adoption of English record keeping,
laws, and practices. We used to be a collection of English colonies, so
that is a natural process.

Data mining in and of itself in this environment will yield a plethora
of false positives, unless you know more
specifically what you are looking for, AND you know your HISTORY in the
area and time you are researching.
For example, it was common during the Middle Ages through the Industrial
Revolution for women who had lost
husbands to marry a relative (sometimes a brother of the deceased). This
could be for economic reasons,
family reasons, politics, survival or any other reason that made sense
to them at the time. On the genealogy
charts, you will see the same names and sometimes information show for
multiple marriages. This is not a
mistake, but people who do not educate themselves and trust in the
computer only will see it as an error
in the reporting. Data mining does not help here. This is not a computer
science problem. It sits in the history
and genealogy domain and information management is merely the tool to
help us see things more clearly
as long as we understand the CONTEXT of the data within those domains as
presented. Noodling out an
algorithm to apply these kinds of tenuous possible data relationships is
noble, but not needed, given we
have been blessed with sufficient intelligence to work out the
relationships in our head, along with the
gift of the Holy Ghost (for those who will bother to use it).

Applications like PAF, and tools like GEDCOM and its derivatives, are
valuable in that they help to organize
existing data for ANALYSIS. They do not produce the end result.

So, let's talk about a narrower, more concrete scope for your problem.

Steven H. McCown wrote:
> Has anyone ever noticed that this list tends to concentrate on hashing and
> re-hashing which OSS tools are best?  Then, the discussion moves to whether
> client-server, webapps, or standalone apps are best.  Next, we always jump
> on to (my favorite) legal issues.  Goto line 1 and repeat...
>
> I'd like to take a sideline from that and discuss problem solving issues --
> just for a minute.
>
> I did some research for my family and came to a dead end.  At that point, I
> sat in several libraries and read book after book.  Eventually, place names
> and dates started to sound familiar.  I started reading genealogies for
> unrelated people that lived in the same place/time as my family.  Finally, I
> found families that had intermarried and surprisingly had clues for my own
> family.  I've since been able to tie into some very old family lines.
>
> That will sound very familiar to most researchers as that is the way
> genealogy is often done.
>
> With all that we know about computers, algorithms, searching, data mining,
> etc., is there anything that we can do to affect the research process?  To
> me, as a researcher, whether PAF is AJAX, C++, Python, is mainly a
> distraction.  The only real requirement is that gen apps be available to
> everyone -- whether on the net or not.
>
> So, the discussion that I'd like to hear is not an Info Tech discussion, but
> a hardcore Computer Science one.
>
> Given the research paradigm that I described above, have you done anything
> that might allow researchers to data mine across databases and make
> inferences or suggestions to where to look when we get stumped?
>
> Thanks,
>
> Steve
>
> _______________________________________________
> Ldsoss mailing list
> Ldsoss@lists.ldsoss.org
> http://lists.ldsoss.org/mailman/listinfo/ldsoss
>
>
>

_______________________________________________
Ldsoss mailing list
Ldsoss@lists.ldsoss.org
http://lists.ldsoss.org/mailman/listinfo/ldsoss