Greetings all,

Some further reflections on the creation of name conflated data, as
illustrated by the Spock Challenge. Dealing with the multiple forms of
names when creating ambiguities in names is challenging but quite
important. Suppose you are substituting BillClintonTonyBlair for all
occurrences of Bill Clinton and Tony Blair. Perfect. Almost. What
about forms like "President Clinton", "Prime Minister Blair", or
"William Jefferson Clinton"? You really need to consider these
alternate forms when doing your name conflation substitutions, and in
fact we generally work pretty hard to make sure we catch most of those
alternate forms when creating this kind of data.
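
As a rough illustration of the idea, here is a minimal Python sketch of
conflation driven by an alias list. The ALIASES table and the conflate
function are hypothetical names made up for this example, and the list
of forms is only what is mentioned above; a real alias list would need
to be built by hand or by some other means.

```python
import re

# Hypothetical alias table: the forms listed are illustrative assumptions,
# not a list taken from any real data set.
ALIASES = {
    "BillClintonTonyBlair": [
        "William Jefferson Clinton",
        "President Clinton",
        "Bill Clinton",
        "Prime Minister Blair",
        "Tony Blair",
    ],
}

def conflate(text, aliases=ALIASES):
    """Replace every listed surface form with its conflated pseudo-name."""
    for pseudo, forms in aliases.items():
        # Try longer forms first so a short form never pre-empts a longer one.
        ordered = sorted(forms, key=len, reverse=True)
        pattern = re.compile(
            r"\b(?:" + "|".join(re.escape(form) for form in ordered) + r")\b"
        )
        text = pattern.sub(pseudo, text)
    return text
```

Any form missing from the table (a bare "Clinton", say) is left
untouched, which is exactly the kind of gap that has to be checked for.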

In general I continue to believe that using name conflated data for
experimental evaluations can be very effective. However, you must make
sure that the names you are disguising or conflating are relatively
unambiguous, and you must make sure that you account for at least
some of the alternate forms of unambiguous names.

A few relevant posts of mine from the Spock Challenge discussion list,
available at:  http://challenge.spock.com

-------------------------------------------------------------------------------------------------------------------------------

Yesterday 09:00:06

In reviewing the Antoine Destin cluster, I noticed that not only was
Antoine Destin inserted as a name (in place of Benjamin Norden), all
occurrences of Norden were replaced with Destin. See the following
example:

------------------------------------------------------------------------------------

Before (from Google) :

A year earlier he was advertising for tenders for building six cottages
in Lawrence Street. In 1834, J. D. Norden was advertising a public sale
of substantial landed property belonging to Benjamin Norden.

After :

SCI.6.945516281.html

A year earlier he was advertising for tenders for building six cottages
in Lawrence Street. In 1834, J. D. Destin was advertising a public sale
of substantial landed property belonging to Antoine Destin.

------------------------------------------------------------------------------------

I think it may be important to understand this, especially if
this wasn't intended by Spock. It's not clear to me that it was, since
this doesn't seem to be specified when they discuss their data
creation process.

So, let me ask, what exactly has been changed in the test data?

It's clear that Antoine Destin was substituted for Benjamin
Norden....this is what we'd expect based on the FAQ....

It's clear that Destin was substituted for Norden (in the case of
J.D.) ...this is not expected (at least not based on the description
in the FAQ).

There are quite a few questions that can arise if we move beyond
simply substituting Antoine Destin for Benjamin Norden....

What about occurrences of B. Norden? Does that become A. Destin or B.
Destin? This seems very unclear, since B. Norden clearly refers to
Benjamin Norden, so A. Destin is probably the "right" thing to do, but
if they are simply replacing the last name you could end up with B.
Destin, which now appears to be a different entity than A. Destin.

What about Benny Norden? :) Does that become Benny Destin or Antoine Destin?

What happens to occurrences of Mr. Norden? Does that become Mr. Destin
or stay as Mr. Norden? I suppose this one is easier, since Mr. is
generic and does not refer to any particular Norden/Destin. But then
what about Mr. B. Norden? :)

I'm using Norden and Destin as examples here, and referring to some
cases that don't occur for that name, but clearly this is a general
issue that could and will affect all the data. I'm asking for the
general case, and for the pattern or "rule" that was used to
substitute disguised target names for the real names.  Could a more
detailed description of how the substitutions were done be provided?

Cordially,
Ted

----------------------------------------------------------------------------------------------------------------------------
Yesterday 22:15:48

A little bit of bad news.

It appears that Destin was substituted fairly blindly for Norden.

From the same document as in the post above...

Original (from Google):

FOR SALE, a Female Slave. - For particulars apply to Mr. B. NORDEN.
This Slave is to be sold owing to the severity of the existing Law on
Slave Owners. Benjamin Norden was clearly, even at this point in his
long life, a businessman involved in many types of enterprise.

It seems quite clear here that Mr. B. Norden refers to Benjamin Norden.

Modified version (SCI.6.945516281.html)

FOR SALE, a Female Slave. - For particulars apply to Mr. B. DESTIN.
This Slave is to be sold owing to the severity of the existing Law on
Slave Owners. Antoine Destin was clearly, even at this point in his
long life, a businessman involved in many types of enterprise.

The problem here is that Mr. B. Destin and Antoine Destin really now
look like different entities, but they are not.

So, there are two issues to resolve: what was done to create this
data, and what should have been done. :)

What was done, I believe, is to first replace the complete form of the
original name (Benjamin Norden) with the new name (Antoine Destin).
Then, I believe all occurrences of Norden were changed to Destin. It
does not appear that B. Norden was "recognized" as being another way
of saying Benjamin Norden. Is this a fatal problem? Not really,
although if one has an approach that keys on names that appear with
the target name, it could be confusing or problematic. It would be
best if Spock could describe exactly how this test set data was
created. I'd actually be interested in knowing how they selected the
"unambiguous name", and then how they went through and did the
replacement of the strings.
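
To make that guess concrete, here is a minimal Python sketch of the
two-step replacement I suspect was used. It is only my reconstruction
from the examples above, not Spock's documented procedure, and
blind_replace is a made-up name.

```python
import re

def blind_replace(text, real="Benjamin Norden", fake="Antoine Destin"):
    """Guessed two-step substitution; an assumption, not a known procedure."""
    real_last, fake_last = real.split()[-1], fake.split()[-1]
    # Step 1: swap the complete form of the name.
    text = text.replace(real, fake)

    # Step 2: swap every remaining occurrence of the surname,
    # keeping an all-caps surname in all caps.
    def swap(match):
        return fake_last.upper() if match.group(0).isupper() else fake_last

    return re.sub(r"\b" + re.escape(real_last) + r"\b", swap, text,
                  flags=re.IGNORECASE)
```

Run on the advertisement above, this turns "Mr. B. NORDEN" into
"Mr. B. DESTIN" while "Benjamin Norden" becomes "Antoine Destin", which
matches what appears in SCI.6.945516281.html.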

Now, what should they have done? Well, it's a challenging problem, to
recognize every possible form of Benjamin Norden and then replace it
with Antoine Destin. I wouldn't expect that every form be accounted
for. However, I would have hoped that the simplest of forms, like B.
Norden, would be accounted for.
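
A minimal Python sketch of what handling that simplest form might look
like follows. The function name and the choice of variants (the full
name plus the "initial. surname" form) are my own assumptions for
illustration; other surnames-only mentions, such as J. D. Norden, are
deliberately left alone.

```python
import re

def variant_aware_replace(text, real="Benjamin Norden", fake="Antoine Destin"):
    """Replace the full name and its 'initial. surname' form only."""
    real_first, real_last = real.split()
    fake_first, fake_last = fake.split()
    # Each recognized variant of the real name maps to the matching
    # form of the disguised name.
    variants = [
        (re.escape(real), fake),
        (re.escape(real_first[0]) + r"\.\s*" + re.escape(real_last),
         f"{fake_first[0]}. {fake_last}"),
    ]
    for pattern, replacement in variants:
        # Keep an all-caps mention in all caps.
        def swap(match, replacement=replacement):
            return replacement.upper() if match.group(0).isupper() else replacement
        text = re.sub(r"\b" + pattern + r"\b", swap, text, flags=re.IGNORECASE)
    return text
```

With this, "Mr. B. NORDEN" becomes "Mr. A. DESTIN", so it still looks
like the same entity as Antoine Destin. Note that a form like
"J. B. Norden" would also be caught by the initial pattern, so even
this small step needs some care.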

In general, I do think names tend to be repeated quite often in text,
so doing the replacement as accurately as possible (so that new
entities are not erroneously introduced) could really make a
difference in the quality of the data and the results.

In any case, I do think it would help participants to know exactly
what replacements were done, especially if their approach somehow keys
on other names that occur with the target name.

Cordially,
Ted
-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse

_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
