Paul, please keep these thoughts coming. I have a couple of
follow-ups, inline below.
On Jul 2, 2010, at 10:07 AM, Paul Houle wrote:
Here are some of my thoughts
(1) The global namespace in RDF plus the concept that "most
knowledge can be efficiently represented with triples" are
brilliant; in the long term we're going to see these two concepts
diffuse into non-RDF systems because they are so powerful.
FWIW, the reduction-to-triples idea has been known in the logic
community since about 1880, so it is indeed powerful. It has its
own issues, though (as some of your later comments suggest).
I appreciate the way multiple languages are implemented in RDF --
although imperfect, it's a big improvement over what I've had to do
to implement multi-lingual "digital libraries" on relational systems.
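For concreteness, here's a minimal sketch of what that buys you. I'm
using rdflib purely as an illustration (nothing above commits to a
library): one globally named subject carries labels in several
languages, where a relational design would need extra tables or
columns to imitate this.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDFS

    EX = Namespace("http://example.org/")  # hypothetical namespace
    g = Graph()

    # one global name, language-tagged labels attached directly to it
    g.add((EX.Berlin, RDFS.label, Literal("Berlin", lang="de")))
    g.add((EX.Berlin, RDFS.label, Literal("Berlin", lang="en")))
    g.add((EX.Berlin, RDFS.label, Literal("Berlín", lang="es")))

    for label in g.objects(EX.Berlin, RDFS.label):
        print(label, label.language)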
(2) Yet, the "big graph" and triple paradigms run into big problems
when we try to build real systems. There are two paradigms I work
in: (i) storing 'facts' in a database, and (ii) processing 'facts'
through pipelines that effectively do one or more "full scans" of
data; type (ii) processes can be highly scalable, but only when
they can be parallelized.
Now, if hardware cost were no object, I suppose I could keep
triples in a huge distributed main-memory database. Right now, I
can't afford that. (If I get richer and if hardware gets cheaper,
I'll probably want to handle more data, putting me back where I
started...)
Well, hardware will get cheaper. Especially fast memory. Care to
extrapolate, say, five years to guess which will win, data bloat or
RAM capacity?
Today I can get 100x performance increases by physically
partitioning data in ways that reflect the way I'm going to use it.
Relational databases are highly mature at this, but RDF systems
barely recognize that there's an issue. Named graphs are a step
forward in this direction, but to make something that's really
useful we'd need both (a) the ability to do graph algebra, and (b)
the ability to automatically partition 'facts' into graphs. That
'automatic' could be something similar to RDBMS practice ("put this
kind of predicate in that graph", "put triples with this sort of
subject in that graph") or it could be something really
'intelligent' that can infer likely use patterns by reasoning over
the schema and/or by adaptive profiling of actual use (as
Salesforce.com does to build a pretty awesome OLTP system on top of
what's a triple store at the core).
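To make that 'automatic' idea concrete, here is a toy sketch of
rule-driven partitioning using rdflib's named graphs (the routing
table, namespace, and graph names are all invented for illustration):

    from rdflib import Dataset, Literal, Namespace, URIRef

    EX = Namespace("http://example.org/")

    # "put this kind of predicate in that graph"
    RULES = {
        EX.label:      URIRef("urn:graph:labels"),
        EX.population: URIRef("urn:graph:statistics"),
    }
    MISC = URIRef("urn:graph:misc")

    ds = Dataset()

    def add(s, p, o):
        # route each triple to a named graph chosen by its predicate
        ds.graph(RULES.get(p, MISC)).add((s, p, o))

    add(EX.Berlin, EX.label, Literal("Berlin", lang="de"))
    add(EX.Berlin, EX.population, Literal(3450000))

A query that only needs statistics can then be scoped to
urn:graph:statistics instead of scanning everything.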
Practically, I deal with this by building hybrid systems that
combine both relational and RDF ideas. If you're really trying to
get things done in this space, however, it's amazing how
precarious the tools are. For instance, I looked at a large number
of data stores and wound up choosing MySQL based on two fairly
accidental facts: (i) I couldn't get VARCHAR() or TEXT() fields in
other RDBMSs to handle the full length of Freebase text fields
in an indexable way, and (ii) MongoDB crashes and corrupts data.
All great observations :-)
As for the linear pipelines, the big issue I have is that I want to
process "facts" as complete chunks; everything needed for one
particular bit of processing needs to get routed to the right
pipeline. If it takes four triples involving a bnode to represent a
'fact', these all need to go to the same physical node.
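Here's a sketch of the routing I have in mind, in plain Python (the
'_:' prefix convention for bnodes and the hash-by-root scheme are
just assumptions for illustration): collect every triple reachable
from a named subject through blank nodes, then ship that whole
molecule to a single partition.

    from collections import defaultdict
    from hashlib import sha1

    def route(triples, n_partitions):
        """triples: (s, p, o) strings; bnode ids start with '_:'."""
        by_subject = defaultdict(list)
        for s, p, o in triples:
            by_subject[s].append((s, p, o))

        def molecule(root):
            # walk outward from the root, following blank-node objects
            out, todo, seen = [], [root], set()
            while todo:
                s = todo.pop()
                if s in seen:
                    continue
                seen.add(s)
                for s2, p, o in by_subject.get(s, []):
                    out.append((s2, p, o))
                    if o.startswith("_:"):
                        todo.append(o)  # keep the bnode closure together
            return out

        for root in by_subject:
            if not root.startswith("_:"):  # anchor only at named subjects
                part = int(sha1(root.encode()).hexdigest(), 16) % n_partitions
                yield part, molecule(root)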
Showing the problems with the triple model. Suppose we allowed
arbitrary-length tuples à la JSON, so each 'fact' is a single tuple.
Would this make things easier? BTW, you might find the idea of an RDF
molecule useful.
As in the database case, partitioning of data becomes a critical
issue, but it's even more acute here: the partition a particular
triple falls into might be determined by the graph that surrounds
it, which kind of points to a representation where
we (a) develop some mechanism for efficiently representing subgraphs
of related triples, or (b) just give up on the whole triple thing
and use something like a relational or JSON model to represent facts.
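For illustration, the same hypothetical 'fact' both ways; the names
are invented, but the contrast is the point: the four bnode triples
have to be reassembled, while the single record is trivially
routable as a unit.

    # four triples that must travel together
    bnode_fact = [
        ("ex:Alice", "ex:heldPosition", "_:b1"),
        ("_:b1",     "ex:title",       "Mayor"),
        ("_:b1",     "ex:city",        "ex:Ithaca"),
        ("_:b1",     "ex:start",       "2010"),
    ]

    # the same fact as one JSON-style record: no bnode, nothing to rejoin
    json_fact = {
        "subject":   "ex:Alice",
        "predicate": "heldPosition",
        "title":     "Mayor",
        "city":      "ex:Ithaca",
        "start":     "2010",
    }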
Ah, I see you are thinking on the same lines. This is the more
traditional model in any case, actually.
(3) I've spoken to entrepreneurs and potential customers of semantic
technology and found that, right now, people want things that are
beyond the state of the art. Often when I consult with people, I
come to the conclusion that they haven't found the boundaries of
what they could accomplish through plain old "bag of words" and that
it's not so clear they'll do better with NLP/semantic tech.
Commonly, these people have fundamental flaws in their business
model (thinking that they can pay some Turk $0.20 to do $2,000
worth of marketing work). The most common theme in semantic
"product" companies is that they build complex systems out of
components that just barely work.
I'll single out Zemanta for this, although this is true of many
other companies. Let's just estimate that Zemanta's service has 5
components and each of these is 85% accurate; put those together,
and you've got a system that's just an embarrassment (five stages
at 85% compose to 0.85^5, or about 44%, so the end-to-end system is
fully right on fewer than half of its inputs). There are multiple
routes to solving this problem (either a "widening of the scope" or
a "narrowing of the scope" could help a lot), but the fact is that
a lot of companies are aiming for that "sour spot", which has the
paradoxical dual effects that (i) some others imitate them, and
(ii) others write off the whole semantic space. Success in semantic
technology is going to come from companies that find fortuitous
matches between "what's possible" and "what can be sold".
Brilliant!
Another spectre that haunts the space is legacy "information
services" companies. I've talked with many people who think they're
going to make big money selling a crappy product to undiscriminating
customers with deep pockets (U.S. Government, Finance,
Pharma, ...). I think the actual breakthroughs in semantic tech are
going to come from the disruptive direction: people who find ways
to make things that are drastically cheaper than the old way, but
that can accept the limitations of today's semantic tech.
(4) I'm one of the people who got interested in semantic tech
because of DBpedia, and yet I've also largely given up on
DBpedia. One day I realized that I could, with Freebase, do
things in 20 minutes that would take 2 weeks of data cleanup with
DBpedia. DBpedia 3.5/3.5.1 seems to be a large step backwards,
with major key integrity problems that are completely invisible to
'open world' and OWL-paradigm systems. I've wound up writing my own
framework for extracting 'facts' from Wikipedia because DBpedia
isn't interested in extracting the things I want. Every time I try
to do something with DBpedia, I make shocking discoveries (for
instance, "New York City", "Berlin", "Tokyo", "Washington,
D.C." and "Manchester, N.H." are not of rdf:type "City"). The fact
that I see so little complaining about this on the mailing list
seems to indicate that not a lot of people are trying to do real
work with it.
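For what it's worth, the kind of spot check that turns up these
surprises is a few lines against the public endpoint; here's a
sketch using SPARQLWrapper (the endpoint URL and the ontology class
IRI are from memory and may have changed since):

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("""
        ASK {
          <http://dbpedia.org/resource/New_York_City>
              a <http://dbpedia.org/ontology/City> .
        }
    """)
    sparql.setReturnFormat(JSON)

    # at the time of writing this came back False for New York City,
    # Berlin, Tokyo, and the others listed above
    print(sparql.query().convert()["boolean"])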
(5) It might make me a heretic, but I've found that the closed
world assumption can, properly used, (i) do miracles, and (ii)
directly confront many of the practical problems that show up in RDF
systems.
Indeed. What we need, it's been clear for some time, is a globally
open world with many smaller closed worlds inside it. But this needs
a whole scheme/mechanism for saying what the boundaries of these
smaller closed worlds are, and what it is that they enclose, which
has not been done and isn't likely to get done in the very
conservative climate that we seem to be in right now.
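To gesture at what one small closed world could look like in
practice, here is a toy sketch (plain rdflib; the "required
properties" schema is invented): treat one local graph as the whole
world and complain about what's absent, which an open-world OWL
reasoner, by design, will never do.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF

    EX = Namespace("http://example.org/")

    # a toy closed-world schema: a City must have these properties
    REQUIRED = {EX.City: [EX.population, EX.country]}

    g = Graph()  # the boundary: this graph is taken as complete
    g.add((EX.Ithaca, RDF.type, EX.City))
    g.add((EX.Ithaca, EX.population, Literal(30000)))

    for cls, preds in REQUIRED.items():
        for inst in g.subjects(RDF.type, cls):
            for p in preds:
                if (inst, p, None) not in g:
                    print(inst, "is missing", p)  # flags ex:country here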
OWL has greatly changed my thinking about schemas... I'm less
concerned, however, about the official semantics of OWL than I
am about the general prospect of "reasoning about schemas." I think
the "inference-based" model of OWL is awesome, but pretty
frequently I find I need forms of reasoning that aren't quite
supported by OWL...
Can you give me any examples? I am trying to collect real-world but
currently unsupported inference patterns, with the long-term goal of
reengineering the semantics to make it fit what people think it ought
to mean. So this is gold for me.
Alternatively, data partitioning and data validation are really
important for me, so I need something that has some of the nature
of an RDBMS schema. Of course, I can get some of this by "applying
my own hermeneutics" to OWL and adding some features...
Again, details would be wonderful.
Pat Hayes
------------------------------------------------------------
IHMC (850)434 8903 or (650)494 3973
40 South Alcaniz St. (850)202 4416 office
Pensacola (850)202 4440 fax
FL 32502 (850)291 0667 mobile
phayesAT-SIGNihmc.us http://www.ihmc.us/users/phayes