folkschemas and semistructured data

Kragen Sitaker Thu, 24 Nov 2005 00:40:01 -0800

(1200 words, sorry)

I was talking with Tantek Celik tonight (2005-11-23) for an hour or two
about folksonomies, why they work, and user-defined data structures.
The ideas in this post came almost entirely from our discussion.  I
don't know which ones he deserves credit for and which ones I deserve
blame for, but he definitely deserves credit for the term "folkschemas".


A couple of new services (Google Base[insert ref], Ning/24HL[insert
ref], JotSpot <http://www.jot.com/>) allow people to put arbitrary sets
or bags of key-value pairs into a big shared store, and then run
arbitrary queries on that store, queries more or less of the type that
Partial Match Indexing [insert ref] optimizes: "field foo has value bar,
and field baz has value quux," etc.  This is sort of the thing RDF was
intended to support [insert ref], but without a single centralized
database --- just a bunch of XML files on the web.  So if a group of
people happen to use the same field name (predicate name, edge label,
relationship, etc.), you can search across all of their data records
with a single query, at least if that data is indexed in some index that
you have access to.
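
As a rough illustration of the query shape described above ("field foo
has value bar, and field baz has value quux"), here is a minimal Python
sketch.  It assumes each record is a plain dict with one value per
field, which is simpler than a true bag of key-value pairs, and the
names match_all, store, and owner are made up for the example; none of
the services mentioned is known to work this way internally.

```python
from typing import Dict, Iterable, List

Record = Dict[str, str]


def match_all(records: Iterable[Record], **conditions: str) -> List[Record]:
    """Return the records whose fields equal every given condition."""
    return [
        r for r in records
        if all(r.get(field) == value for field, value in conditions.items())
    ]


# Hypothetical records published by three different people who happen
# to have used the same field names.
store = [
    {"foo": "bar", "baz": "quux", "owner": "alice"},
    {"foo": "bar", "baz": "other", "owner": "bob"},
    {"foo": "bar", "baz": "quux", "owner": "carol"},
]

hits = match_all(store, foo="bar", baz="quux")
print([r["owner"] for r in hits])
```

A linear scan like this is only meant to show the semantics; a real
shared store would need an index to answer such queries at scale.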

Obviously, this can result in the Tower of Babel problem: it's not at
all certain that different people will in fact use the same field names
when they mean the same thing, and, worse, it's not at all certain that
they will use different field names when they mean different things.

This is also a potential problem with folksonomy systems like del.icio.us:
I may choose 'sf' as my tag for a semantic category that you call
'sanfrancisco' or 'san-francisco', and so searches by tag won't have
very good recall, and searches that exclude by tag won't have very good
precision.  RDF's solution to this, I think, is OWL [insert ref], where
you specify mappings between relationships and other objects you think
are equivalent for the purpose of your query.
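
To make the recall problem concrete, here is a small Python sketch of a
search that consults a hand-written equivalence table before matching.
This is a plain lookup table, not OWL, and the table contents, function
names, and sample items are all assumptions made for the example.

```python
# Hypothetical equivalence table: each variant maps to one canonical tag.
CANONICAL = {"sf": "sanfrancisco", "san-francisco": "sanfrancisco"}


def canonical(tag):
    """Map a tag to its canonical form, or leave it unchanged."""
    return CANONICAL.get(tag, tag)


def search_by_tag(items, tag):
    """Return names of items carrying the tag or any declared equivalent."""
    wanted = canonical(tag)
    return [
        name for name, tags in items
        if wanted in {canonical(t) for t in tags}
    ]


items = [
    ("photo1", ["sf", "bridge"]),
    ("photo2", ["sanfrancisco"]),
    ("photo3", ["san-francisco", "fog"]),
    ("photo4", ["scifi"]),
]

print(search_by_tag(items, "sf"))
```

Note that a table like this only helps with the first half of the
problem.  If someone else uses 'sf' to mean science fiction, the mapping
makes precision worse rather than better.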

The Microformats effort <http://www.microformats.org/> is working to
establish broadly interoperable schemas for the most widely used data
types, such as personal contact information and event announcements,
with defined concrete representations in HTML, semantics, and pragmatic
intent for each format.  Microformats solves the Tower of Babel problem
within the domains where the largest benefit attaches to doing so, using
a lightweight version of the normal standards process: document existing
needs, research existing solutions, propose a solution that spans most
of the existing uses, and seek consensus on the solution.

But what about domains with only a few tens of thousands of currently
published structured data items?  These include horserace odds, used-car
listings, electronic part listings, personal ads, classical music album
information, satellite photos, software bug reports, word definitions
from dictionaries, software source code patches, expense reports,
college course schedules, prayers, and so forth.

In domains like these, even the lightweight Microformats specifications
process may be enough effort to discourage implementors from
participating, much as most people who want to categorize their
snapshots or web bookmarks aren't interested in participating in the
discussions of the Library of Congress or Open Directory topic headings.
It's easy enough for people working in these domains to come up with
some set of field names and stick their data into Google Base or Ning.
Is there some kind of even-lighter-weight standardization process that
might make their schemas more likely to interoperate?

The cases of Flickr and del.icio.us --- the canonical folksonomies ---
suggest that there might be.  The systems are perfectly usable without
looking at anyone else's stuff, but people get network-effect benefits
from using tags that others use: people can find their items more
easily, they get better suggestions for tags to apply to their items,
their items are classed with other items that are more closely related,
and so forth.  Gradually, over time, people tend to adopt tags that are
more similar to other people's tags.

Tantek coined the term "folkschemas" to describe potential sets of field
names that would converge in the same gradual way.  So far, we don't
have a lot of examples of this happening in the real world; geotagging
of photos on Flickr, with 94000 current instances, is perhaps the most
prominent.  If there are only a few examples with wide
applicability, a Microformats-style lightweight standards process might
not be so bad.  It's only if there is a "long tail", with an almost
infinite number of formats popular enough that interoperability matters
but unpopular enough that nobody wants to bother to spend a week or two
on a standard, that "folkschemas" are really a win.

Before del.icio.us and Flickr, there had been a number of attempts at
folksonomy-like systems that had never achieved any noticeable level of
common categorization without a formal standards process: the meta
keywords HTML tag, the Keywords fields of various scientific article
formats, the old McBee and Indecks card systems.  We hypothesized that,
with del.icio.us and Flickr, the software had reached the point where the gentle
incentives to converge on common categorizations were strong enough that
people began to do it, and brainstormed a bit about things that could
enable the same sort of effect for folkschemas.

The "heat map" or "tag cloud", which displays some set of tags with
sizes indicating their relative popularity, is one innovation ---
<http://flickr.com/photos/tags/> and <http://del.icio.us/tag/> [verify
ref] are two examples, but they're found all over the folksonomy web.  A
recent innovation is to look just at the subset of tags that co-occur
with a single other tag [insert ref].  Displaying a "heat map" of field
names, perhaps from the subset of field names that co-occur with the other
fields you've already put into the current record, could facilitate
schema convergence.
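
The counts behind such a field-name "heat map" could be computed along
these lines.  This is only a sketch: it assumes records are dicts, it
counts only records that contain every field already entered (counting
records that share any one field would be an equally reasonable
choice), and the used-car field names are invented for the example.

```python
from collections import Counter


def cooccurring_fields(records, current_fields):
    """Count other field names in records that have all current fields."""
    current = set(current_fields)
    counts = Counter()
    for record in records:
        names = set(record)
        if current <= names:
            counts.update(names - current)
    return counts.most_common()


# Hypothetical used-car listings plus one unrelated record.
records = [
    {"make": "Honda", "model": "Civic", "year": "1999", "price": "4000"},
    {"make": "Ford", "model": "Focus", "price": "3500"},
    {"make": "Toyota", "model": "Corolla", "mileage": "90000", "year": "2001"},
    {"title": "Dune", "author": "Herbert"},
]

suggestions = cooccurring_fields(records, ["make", "model"])
print(suggestions)
```

The counts would then drive the font sizes in the display, or simply the
order of a suggestion list.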

Likewise, if you've typed a data value such as "SWM", it could be
helpful to have a "heat map", or just a sorted list, of the field names
that have previously been attached to that value in the records produced
by other people, and if you've typed a field name such as "Current
Mood", it could be helpful to have a "heat map" or sorted list of the
other data values that occurred in that field in existing records.
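
Both lookups can be served from a pair of inverted indexes, one from
value to field names and one from field name to values.  The Python
below is a sketch under the assumption that records are dicts of
strings; build_indexes and the sample personal-ad records are
hypothetical.

```python
from collections import Counter, defaultdict


def build_indexes(records):
    """Build value -> field-name counts and field-name -> value counts."""
    fields_for_value = defaultdict(Counter)
    values_for_field = defaultdict(Counter)
    for record in records:
        for field, value in record.items():
            fields_for_value[value][field] += 1
            values_for_field[field][value] += 1
    return fields_for_value, values_for_field


# Hypothetical records from three different people.
records = [
    {"category": "SWM", "Current Mood": "hopeful"},
    {"category": "SWM", "Current Mood": "tired"},
    {"seeker": "SWM", "Current Mood": "hopeful"},
]

fields_for_value, values_for_field = build_indexes(records)

# Field names other people have attached to the value "SWM".
print(fields_for_value["SWM"].most_common())
# Values other people have put in the "Current Mood" field.
print(values_for_field["Current Mood"].most_common())
```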

In a system such as Google Base that supports fielded searches, it would
be helpful to know which fields were commonly specified in searches, and
what values were being searched for --- not just what fields and values
others had specified.
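
If the service kept a log of fielded searches, summarizing it could be
as simple as the following sketch.  The log format here (one dict of
field-value conditions per search) is an assumption; I don't know what
Google Base or any other service actually records.

```python
from collections import Counter


def summarize_queries(query_log):
    """Count how often each field, and each field-value pair, is searched."""
    field_counts = Counter()
    pair_counts = Counter()
    for query in query_log:
        field_counts.update(query.keys())
        pair_counts.update(query.items())
    return field_counts, pair_counts


# Hypothetical search log: each entry is one fielded search.
query_log = [
    {"make": "Honda", "year": "1999"},
    {"make": "Honda"},
    {"make": "Toyota", "price": "5000"},
]

field_counts, pair_counts = summarize_queries(query_log)
print(field_counts.most_common(1))
print(pair_counts.most_common(1))
```

Comparing these counts with the counts of fields in published records
would show where publishers and searchers disagree about names.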

Once you have some data entered with some set of field names, it would
be helpful to find out which other fields were similar in content to the
fields you chose, and what their records looked like; this would
enable you to decide whether or not to establish a rule, for example,
that your "lastname" field should also be published as "surname" --- or
possibly whether or not to simply rename the field to "surname" if
that's what's more commonly used.  "Heat maps" or sorted lists of
alternatives might prove useful in this case as well.
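
One simple way to measure "similar in content" is the overlap between
the sets of values seen in two fields.  The sketch below uses the
Jaccard index (size of the intersection divided by size of the union);
that choice of measure is mine, not something we settled on, and the
function names and sample records are hypothetical.

```python
from collections import defaultdict


def values_by_field(records):
    """Collect the set of distinct values seen under each field name."""
    values = defaultdict(set)
    for record in records:
        for field, value in record.items():
            values[field].add(value)
    return values


def similar_fields(records, field):
    """Rank other fields by Jaccard overlap of their values with `field`."""
    values = values_by_field(records)
    mine = values.get(field, set())
    scores = []
    for other, theirs in values.items():
        if other == field:
            continue
        union = mine | theirs
        score = len(mine & theirs) / len(union) if union else 0.0
        if score > 0:
            scores.append((other, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


records = [
    {"lastname": "Sitaker", "city": "San Francisco"},
    {"lastname": "Celik"},
    {"surname": "Celik", "city": "San Francisco"},
    {"surname": "Sitaker"},
    {"surname": "Smith"},
]

ranked = similar_fields(records, "lastname")
print(ranked)
```

Exact-match overlap will miss fields whose values differ only in case or
spelling, so a real system would probably want some normalization first.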

Finally, when you perform a query that selects some set of data records,
it would be nice to be able to get "heat maps" and sorted lists of the
field names and field values used within the set of query results.
