Re: Spam package redesign

Murray Altheim Fri, 25 Sep 2009 19:04:17 -0700

Andrew Jaquith wrote:
> You could certainly do that with the package as I've described it --
> the pseudo-subject or facet is what I called a Score "category," if I
> catch your meaning.


Hi Andrew,

I'm not sure what the correct library term would be, except that if
you look into Faceted Classification (FC) there's the concept that a
subject, rather than being enumerated (à la Dewey), is composed
of facets, with the individual facets themselves composite subjects,
drilling down recursively until one reaches some idea of
a 'core' facet (though it's easy to argue that no such core exists
or even could, that facets are themselves subjects and recurse as
well).

  http://en.wikipedia.org/wiki/Faceted_classification

You may know that I have been involved in the Topic Map standards
work and by some minor coincidence Steve Pepper's last message into
the TopicMapMail mailing list is very much in line with this subject-
centric thinking, and I've attached it [1] as it is likely informative.

So 'category' is fine. I was just trying to think of some term that
might describe the components that together build a given subject,
with FC being a very useful framework.

> I'll take this as a vote of confidence -- when I check it in (in the
> next few weeks probably), you'll be able to see the code for yourself.
> :)

Yes it sounds very promising, thanks!

Murray

[1] Re: [topicmapmail] Weekly binge : Do we care about subjects that much?
    Steve Pepper message of 24 September 2009 into the TopicMapMail
    <[email protected]> mailing list
...........................................................................
Murray Altheim <murray09 at altheim dot com>                       ===  = =
http://www.altheim.com/murray/                                     = =  ===
SGML Grease Monkey, Banjo Player, Wantanabe Zen Monk               = =  = =

      Boundless wind and moon - the eye within eyes,
      Inexhaustible heaven and earth - the light beyond light,
      The willow dark, the flower bright - ten thousand houses,
      Knock at any door - there's one who will respond.
                                      -- The Blue Cliff Record

--- Begin Message ---

* Patrick Durusau
|
| Err, actually the TMRM actually says one has to declare how subjects are
| identified. No pre-defined mechanisms.
| 
| It might be helpful to think of the basis for TMDM merging as a starter
| set of *interchangeable* merging rules. That is you can rely on those to
| operate for any software that implements the TMDM.

From a "philosophical" point of view (maybe "cognitive linguistic" point of view
is better), I think the TMDM and TMRM complement each other rather nicely, each
providing a mechanism that corresponds to how we as humans solve the subject
identity problem.

We all walk around with a vast number of concepts up in our heads, each more or
less clearly defined. Those concepts are all connected through relationships in
an enormous network, and a concept is really nothing more than the sum of its
connections. In theory we can refer to concepts via their connections. For
example, I could say "the capital of the country in which I was born" and those
of you who know me would know what I meant. But this kind of thing is
long-winded and awkward, so we go to the trouble of giving *names* to the most
salient concepts. It then becomes much easier to convey the concept to others:
clearly it's much more economical to say "London" (not to mention less dependent
on rather specific encyclopedic knowledge, such as who I am and where I was
born).

The only problem with names, of course, is that they are not unique. (In the
context of American literature, "London" could refer to the author of "The Iron
Heel".) But they work pretty well for humans, because we are able to use context
to disambiguate. When context-based disambiguation fails, as it sometimes does,
we usually discover the miscommunication at some point and are able to
back-track and fix the error - using one or more properties to disambiguate.

(I remember being with Bernard Vatant in a bar in Austin, Texas when he told one
of the locals that he was from Paris. We didn't realize that he had been
misunderstood until some time later when the guy said, "Oh you mean Paris,
France..." Computers, of course, are not that smart yet, and won't be for a long
time.)

Subject identifiers are like names: They are simply conventional symbols that
are used to stand in for the concepts (subjects) we wish to refer to. The big
difference is that they are globally unique and therefore much more suitable for
computers than the names we humans use.
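
As a rough illustration of why globally unique identifiers suit computers
better than names, here is a minimal Python sketch that merges topics sharing
at least one subject identifier. The dictionary layout, the name "Londres",
and the example.org identifier are assumptions made for illustration; this is
not the TMDM merging procedure itself, only the idea behind it.

```python
def merge_by_subject_identifier(topics):
    """Merge topics that share at least one subject identifier."""
    merged = []
    for topic in topics:
        ids = set(topic["identifiers"])
        names = set(topic["names"])
        remaining = []
        for other in merged:
            if other["identifiers"] & ids:
                # Same subject: absorb the earlier topic's identifiers and names.
                ids |= other["identifiers"]
                names |= other["names"]
            else:
                remaining.append(other)
        remaining.append({"identifiers": ids, "names": names})
        merged = remaining
    return merged


topics = [
    {"names": ["London"], "identifiers": ["http://psi.ontopedia.net/London"]},
    # Hypothetical topic from a second map: different name, same identifier.
    {"names": ["Londres"], "identifiers": ["http://psi.ontopedia.net/London"]},
    # Hypothetical identifier for the author: same name, different subject.
    {"names": ["London"], "identifiers": ["http://example.org/psi/jack-london"]},
]

result = merge_by_subject_identifier(topics)
```

Under these assumptions the two city topics collapse into one even though
their names differ, while the author stays separate even though his name is
identical to the city's.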

Subject identifiers are the (primary) mechanism offered by the TMDM for subject
identification (subject locators and item identifiers are secondary and only
used for special purposes). And just as humans could (in theory) do without
names, we could (in theory) do without subject identifiers, and instead base all
our subject identification on properties like "born in" and "located in". This
is essentially the TMRM approach. Conceptually it works; in practice, it usually
doesn't, or at least only in very limited ways.

In one sense the TMRM approach underlies what we usually do anyway when we
create a subject identifier: we conceptualize some subject in our head (on the
basis of all its connections) and then capture just enough of the most salient
relationships in the subject descriptor.[1]

The TMRM approach is also what we fall back on when we merge topic maps that
don't share subject identifiers - we compare properties: social security
numbers, email addresses, data codes (in combination with the topic type -
another property - so as not to merge, say, Norway with Nordfjörður Airport,
both of which have the code "NOR").
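
The fallback described above can be sketched in a few lines of Python. The
topic records, the field names, and the "Norge" topic are hypothetical; the
only point is that the comparison key pairs the property value with the topic
type, so that two subjects sharing a code are not merged by accident.

```python
from collections import defaultdict


def merge_candidates(topics, prop):
    """Group topic names by (topic type, property value)."""
    groups = defaultdict(list)
    for topic in topics:
        value = topic.get(prop)
        if value is None:
            # Without the property there is nothing to compare on.
            continue
        groups[(topic["type"], value)].append(topic["name"])
    return dict(groups)


topics = [
    {"name": "Norway", "type": "country", "code": "NOR"},
    # Hypothetical topic from a second topic map describing the same country.
    {"name": "Norge", "type": "country", "code": "NOR"},
    {"name": "Nordfjörður Airport", "type": "airport", "code": "NOR"},
]

groups = merge_candidates(topics, "code")
```

Here the two country topics end up in the same group, while the airport,
despite carrying the same code, lands in a group of its own.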

But this really is very much a fallback, for the simple reason that the set of
properties necessary to identify a subject is usually not present in both topic
maps. For example, we could use the following tolog-NG query:

   MERGE $T1, $T2 FROM
     instance-of($T1, city),
     instance-of($T2, city),
     capital-of($T1 : city, England : country),
     capital-of($T2 : city, England : country)?

But this *only works* if the capital-of assertion is made about both T1 and T2.
Only exceptionally will that be the case. Here are the assertions made about
London (T1) in one topic map:[2]

  Birthplace of 
        Bulwer-Lytton, Edward 
        Lord Byron 
  Contains 
        Covent Garden Theatre 
        Hippodrome 
        His/Her Majesty's 
        Savoy Theatre 
  Died here 
        Leoni, Franco 
  Located in 
        England

All of these associations (except the located-in association) are what the OWL
folks call inverse functional properties* and could therefore form the basis
for merging (which, after all, is what subject identity is primarily about). But
what are the chances of T2 having one of these associations? Probably very
slight.

And even if, by some miraculous chance, it did have one of them, you wouldn't
know unless you could establish the identity of the associated topic, which
might well be referred to by a subtly different name (e.g. "Edward
Bulwer-Lytton", "George Gordon Byron, 6th Baron Byron", or "Royal Opera House").
So you have a recursion problem on your hands...
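
A small Python sketch may make the difficulty concrete. It assumes each topic
is represented as a set of (association type, name of the other topic) pairs,
and it compares those names as plain strings; the second and third topics are
invented for illustration and do not come from any real topic map.

```python
# Association types treated as identifying for the place, following the text:
# everything in the listing above except "Located in".
IDENTIFYING = {"Birthplace of", "Contains", "Died here"}


def shared_identifying_assertions(t1, t2):
    """Return the identifying (association, other topic) pairs both topics assert."""
    return {(assoc, other) for assoc, other in t1 & t2 if assoc in IDENTIFYING}


london_t1 = {
    ("Birthplace of", "Bulwer-Lytton, Edward"),
    ("Birthplace of", "Lord Byron"),
    ("Contains", "Covent Garden Theatre"),
    ("Contains", "Hippodrome"),
    ("Contains", "His/Her Majesty's"),
    ("Contains", "Savoy Theatre"),
    ("Died here", "Leoni, Franco"),
    ("Located in", "England"),
}

# Hypothetical T2: the same birthplace fact, but the person is named differently.
london_t2 = {
    ("Birthplace of", "Edward Bulwer-Lytton"),
    ("Located in", "England"),
}

# Hypothetical T3: happens to use exactly the same name string as T1.
london_t3 = {("Birthplace of", "Lord Byron")}

no_match = shared_identifying_assertions(london_t1, london_t2)
match = shared_identifying_assertions(london_t1, london_t3)
```

With plain string comparison, T1 and T2 share no identifying assertion even
though both say that Bulwer-Lytton was born there; only T3, which uses the
identical name string, yields a match. Resolving "Edward Bulwer-Lytton" to
"Bulwer-Lytton, Edward" would itself require establishing the identity of
that topic first.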

And it doesn't stop there, because you would also have to establish the identity
of "Birthplace of", "Contains", and "Died here" - as well as the corresponding
role types - otherwise you wouldn't know if the associations involving T1 and T2
really were identical.

In summary, while the TMRM approach might work in limited ways within the
confines of a single application, it is doomed to fail in the general case.

So, to return to Alex's original posting:

| We plug away at our Topic Maps, and I for one claim to think in terms
| of subjects, being all subject-centric and all. But am I? I like to
| think I am, but there's that ever-nagging feeling that a subject proxy
| will never quite be right, and that the compromise of subject locators
| / identifiers / indicators is as good as it gets, but not quite
| subject centric now, is it?
| 
| Can I ask us all a philosophical question? Apart from the TMDM / TMRM
| mechanisms for identity of subjects, what are my alternatives?

As far as I'm concerned, subject identifiers - and published subjects -
constitute the only really viable alternative. If they sometimes engender a
feeling of compromise, it's in the nature of the problem. We are not always
aware of it in real life, because things generally "just work", but the truth is
that every one of us has a slightly different (and continually evolving) concept
of, say, London. The true "subject" is just a fuzzy compromise consisting of the
most salient shared properties of all those gloriously varied concepts.

But, hey, it works in real life, so why not in Topic Maps as well? I betcha
http://psi.ontopedia.net/London would do the trick for 99% of our needs, and for
the other 1% we just create additional, more specific PSIs.

Steve

* "If a property is declared to be inverse-functional, then the object of a
property statement uniquely determines the subject (some individual)."
http://www.w3.org/TR/owl-ref/#InverseFunctionalProperty-def

People can only be born in (or die in) one place, and theatres can only be
located in one place, so born in, died in, and located in are sufficient to
identify the places concerned. But it doesn't work the other way around: You
can't identify a person through the place s/he was born (or died), for obvious
reasons.

[1] The information at http://psi.ontopedia.net/London is admittedly a bit on
the sparse side. The reason is that it was autogenerated from a topic map that
did not contain the capital-of information - nor indeed any indication that
city would be a more appropriate type. A better example is
http://psi.ontopedia.net/born_in.

[2] http://tinyurl.com/nsyjwo

--
PSI: http://psi.ontopedia.net/Steve_Pepper
Blog: http://topicmaps.wordpress.com

_______________________________________________
topicmapmail mailing list
[email protected]
http://www.infoloom.com/mailman/listinfo/topicmapmail

--- End Message ---
