Paul, please keep these thoughts coming. I have a couple of
follow-ups, inline below.
On Jul 2, 2010, at 10:07 AM, Paul Houle wrote:
Here are some of my thoughts
(1) The global namespace in RDF plus the concept that "most
knowledge can be efficiently represented with triples" are
brilliant; in the long term we're going to see these two concepts
diffuse into non-RDF systems because they are so powerful.
FWIW, the reduction-to-triples idea has been known in the logic
community since about 1880, so it is indeed powerful. It has its
own issues, though (as some of your later comments suggest).
I appreciate the way multiple languages are implemented in RDF --
although imperfect, it's a big improvement over what I've had to do
to implement multi-lingual "digital libraries" on relational systems.
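For concreteness, here's a minimal sketch of what that buys you. I'm
using rdflib purely as an illustration (nothing above commits to a
library): one globally named subject carries labels in several
languages, where a relational design would need extra tables or
columns to imitate this.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDFS

    EX = Namespace("http://example.org/")  # hypothetical namespace
    g = Graph()

    # one global name, language-tagged labels attached directly to it
    g.add((EX.Berlin, RDFS.label, Literal("Berlin", lang="de")))
    g.add((EX.Berlin, RDFS.label, Literal("Berlin", lang="en")))
    g.add((EX.Berlin, RDFS.label, Literal("Berlín", lang="es")))

    for label in g.objects(EX.Berlin, RDFS.label):
        print(label, label.language)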
(2) Yet, the "big graph" and triple paradigms run into big problems
when we try to build real systems. There are two paradigms I work
in: (i) storing 'facts' in a database, and (ii) processing 'facts'
through pipelines that effectively do one or more "full scans" of
data; type (ii) processes can be highly scalable, but only when
they can be parallelized.
Now, if hardware cost were no object, I suppose I could keep
triples in a huge distributed main-memory database. Right now, I
can't afford that. (If I get richer and if hardware gets cheaper,
I'll probably want to handle more data, putting me back where I
started...)
Well, hardware will get cheaper. Especially fast memory. Care to
extrapolate, say, five years to guess which will win, data bloat or
RAM capacity?
Today I can get 100x performance increases by physically
partitioning data in ways that reflect the way I'm going to use it.
Relational databases are highly mature at this, but RDF systems
barely recognize that there's an issue. Named graphs are a step
forward in this direction, but to make something that's really
useful we'd need both (a) the ability to do graph algebra, and (b)
the ability to automatically partition 'facts' into graphs. That
'automatic' could be something similar to RDBMS practice ("put this
kind of predicate in that graph", "put triples with this sort of
subject in that graph") or it could be something really
'intelligent' that can infer likely use patterns by reasoning over
the schema and/or by adaptive profiling of actual use (as
Salesforce.com does to build a pretty awesome OLTP system on top of
what's a triple store at the core).
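To make that 'automatic' idea concrete, here is a toy sketch of
rule-driven partitioning using rdflib's named graphs (the routing
table, namespace, and graph names are all invented for illustration):

    from rdflib import Dataset, Literal, Namespace, URIRef

    EX = Namespace("http://example.org/")

    # "put this kind of predicate in that graph"
    RULES = {
        EX.label:      URIRef("urn:graph:labels"),
        EX.population: URIRef("urn:graph:statistics"),
    }
    MISC = URIRef("urn:graph:misc")

    ds = Dataset()

    def add(s, p, o):
        # route each triple to a named graph chosen by its predicate
        ds.graph(RULES.get(p, MISC)).add((s, p, o))

    add(EX.Berlin, EX.label, Literal("Berlin", lang="de"))
    add(EX.Berlin, EX.population, Literal(3450000))

A query that only needs statistics can then be scoped to
urn:graph:statistics instead of scanning everything.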
Practically, I deal with this by building hybrid systems that
combine both relational and RDF ideas. If you're really trying to
get things done in this space, however, it's amazing how
precarious the tools are. For instance, I looked at a large number
of data stores and wound up choosing MySQL based on two fairly
accidental facts: (i) I couldn't get VARCHAR() or TEXT() fields in
other RDBMSs to handle the full length of Freebase text fields
in an indexable way, and (ii) MongoDB crashes and corrupts data.
All great observations :-)
As for the linear pipelines, the big issue I have is that I want to
process "facts" as complete chunks; everything needed for one
particular bit of processing needs to get routed to the right
pipeline. If it takes four triples involving a bnode to represent a
'fact', these all need to go to the same physical node.
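Here's a sketch of the routing I have in mind, in plain Python (the
'_:' prefix convention for bnodes and the hash-by-root scheme are
just assumptions for illustration): collect every triple reachable
from a named subject through blank nodes, then ship that whole
molecule to a single partition.

    from collections import defaultdict
    from hashlib import sha1

    def route(triples, n_partitions):
        """triples: (s, p, o) strings; bnode ids start with '_:'."""
        by_subject = defaultdict(list)
        for s, p, o in triples:
            by_subject[s].append((s, p, o))

        def molecule(root):
            # walk outward from the root, following blank-node objects
            out, todo, seen = [], [root], set()
            while todo:
                s = todo.pop()
                if s in seen:
                    continue
                seen.add(s)
                for s2, p, o in by_subject.get(s, []):
                    out.append((s2, p, o))
                    if o.startswith("_:"):
                        todo.append(o)  # keep the bnode closure together
            return out

        for root in by_subject:
            if not root.startswith("_:"):  # anchor only at named subjects
                part = int(sha1(root.encode()).hexdigest(), 16) % n_partitions
                yield part, molecule(root)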
Showing the problems with the triple model. Suppose we allowed
arbitrary-length tuples à la JSON, so each 'fact' is a single tuple.
Would this make things easier? BTW, you might find the idea of an RDF
molecule useful.
As in the database case, partitioning of data becomes a critical
issue, but it's even more acute here: the partition a particular
triple falls into might be determined by the graph that surrounds
it, which kind of points to a representation where
we (a) develop some mechanism for efficiently representing subgraphs
of related triples, or (b) just give up on the whole triple thing
and use something like a relational or JSON model to represent facts.
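For illustration, the same hypothetical 'fact' both ways; the names
are invented, but the contrast is the point: the four bnode triples
have to be reassembled, while the single record is trivially
routable as a unit.

    # four triples that must travel together
    bnode_fact = [
        ("ex:Alice", "ex:heldPosition", "_:b1"),
        ("_:b1",     "ex:title",       "Mayor"),
        ("_:b1",     "ex:city",        "ex:Ithaca"),
        ("_:b1",     "ex:start",       "2010"),
    ]

    # the same fact as one JSON-style record: no bnode, nothing to rejoin
    json_fact = {
        "subject":   "ex:Alice",
        "predicate": "heldPosition",
        "title":     "Mayor",
        "city":      "ex:Ithaca",
        "start":     "2010",
    }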
Ah, I see you are thinking on the same lines. This is the more
traditional model in any case, actually.
(3) I've spoken to entrepreneurs and potential customers of semantic
technology and found that, right now, people want things that are
beyond the state of the art. Often when I consult with people, I
come to the conclusion that they haven't found the boundaries of
what they could accomplish through plain old "bag of words" and that
it's not so clear they'll do better with NLP/semantic tech.
Commonly, these people have fundamental flaws in their business
model (thinking that they can pay some Turk $0.20 to do $2,000
worth of marketing work). The most common theme in semantic
"product" companies is that they build complex systems out of
components that just barely work.
I'll single out Zemanta for this, although this is true of many
other companies. Let's just estimate that Zemanta's service has 5
components and each of these is 85% accurate; put those together,
and you've got a system that's just an embarrassment (five stages
at 85% compose to 0.85^5, or about 44%, so the end-to-end system is
fully right on fewer than half of its inputs). There are multiple
routes to solving this problem (either a "widening of the scope" or
a "narrowing of the scope" could help a lot), but the fact is that
a lot of companies are aiming for that "sour spot", which has the
paradoxical dual effects that (i) some others imitate them, and
(ii) others write off the whole semantic space. Success in semantic
technology is going to come from companies that find fortuitous
matches between "what's possible" and "what can be sold".
Brilliant!
Another spectre that haunts the space is legacy "information
services" companies. I've talked with many people who think they're
going to make big money selling a crappy product to undiscriminating
customers with deep pockets (U.S. Government, Finance,
Pharma, ...). I think the actual breakthroughs in semantic tech are
going to come from the disruptive direction: people who find ways
to make things that are drastically cheaper than the old way, but
that can accept the limitations of today's semantic tech.
(4) I'm one of the people who got interested in semantic tech
because of DBpedia, and yet I've also largely given up on
DBpedia. One day I realized that I could, with Freebase, do
things in 20 minutes that would take 2 weeks of data cleanup with
DBpedia. DBpedia 3.5/3.5.1 seems to be a large step backwards,
with major key integrity problems that are completely invisible to
'open world' and OWL-paradigm systems. I've wound up writing my own
framework for extracting 'facts' from Wikipedia because DBpedia
isn't interested in extracting the things I want. Every time I try
to do something with DBpedia, I make shocking discoveries (for
instance, "New York City", "Berlin", "Tokyo", "Washington,
D.C." and "Manchester, N.H." are not of rdf:type "City"). The fact
that I see so little complaining about this on the mailing list
seems to indicate that not a lot of people are trying to do real
work with it.
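For what it's worth, the kind of spot check that turns up these
surprises is a few lines against the public endpoint; here's a
sketch using SPARQLWrapper (the endpoint URL and the ontology class
IRI are from memory and may have changed since):

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("""
        ASK {
          <http://dbpedia.org/resource/New_York_City>
              a <http://dbpedia.org/ontology/City> .
        }
    """)
    sparql.setReturnFormat(JSON)

    # at the time of writing this came back False for New York City,
    # Berlin, Tokyo, and the others listed above
    print(sparql.query().convert()["boolean"])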
(5) It might make me a heretic, but I've found that the closed
world assumption can, properly used, (i) do miracles, and (ii)
directly confront many of the practical problems that show up in RDF
systems.
Indeed. What we need, it's been clear for some time, is a globally
open world with many smaller closed worlds inside it. But this needs
a whole scheme/mechanism for saying what the boundaries of these
smaller closed worlds are, and what it is that they enclose, which
has not been done and isn't likely to get done in the very
conservative climate that we seem to be in right now.
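To gesture at what one small closed world could look like in
practice, here is a toy sketch (plain rdflib; the "required
properties" schema is invented): treat one local graph as the whole
world and complain about what's absent, which an open-world OWL
reasoner, by design, will never do.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF

    EX = Namespace("http://example.org/")

    # a toy closed-world schema: a City must have these properties
    REQUIRED = {EX.City: [EX.population, EX.country]}

    g = Graph()  # the boundary: this graph is taken as complete
    g.add((EX.Ithaca, RDF.type, EX.City))
    g.add((EX.Ithaca, EX.population, Literal(30000)))

    for cls, preds in REQUIRED.items():
        for inst in g.subjects(RDF.type, cls):
            for p in preds:
                if (inst, p, None) not in g:
                    print(inst, "is missing", p)  # flags ex:country here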
OWL has greatly changed my thinking about schemas... I'm less
concerned, however, about the official semantics of OWL than I
am about the general prospect of "reasoning about schemas." I think
the "inference-based" model of OWL is awesome, but pretty
frequently I find I need forms of reasoning that aren't quite
supported by OWL...
Can you give me any examples? I am trying to collect real-world but
currently unsupported inference patterns, with the long-term goal of
reengineering the semantics to make it fit what people think it ought
to mean. So this is gold for me.
Alternatively, data partitioning and data validation are really
important for me, so I need something that has some of the nature
of an RDBMS schema. Of course, I can get some of this by "applying
my own hermeneutics" to OWL and adding some features...
Again, details would be wonderful.
Pat Hayes
------------------------------------------------------------
IHMC (850)434 8903 or (650)494 3973
40 South Alcaniz St. (850)202 4416 office
Pensacola (850)202 4440 fax
FL 32502 (850)291 0667 mobile
phayesAT-SIGNihmc.us http://www.ihmc.us/users/phayes