(1200 words, sorry) I was talking with Tantek Celik tonight (2005-11-23) for an hour or two about folksonomies, why they work, and user-defined data structures. The ideas in this post came almost entirely from our discussion. I don't know which ones he deserves credit for and which ones I deserve blame for, but he definitely deserves credit for the term "folkschemas".
A few new services (Google Base[insert ref], Ning/24HL[insert ref], JotSpot <http://www.jot.com/>) allow people to put arbitrary sets or bags of key-value pairs into a big shared store, and then run arbitrary queries on that store, queries more or less of the type that Partial Match Indexing [insert ref] optimizes: "field foo has value bar, and field baz has value quux," etc. This is roughly what RDF was intended to support [insert ref], except that RDF has no single centralized database, just a bunch of XML files on the web. So if a group of people happen to use the same field name (predicate name, edge label, relationship, etc.), you can search across all of their data records with a single query, at least if that data is indexed in some index that you have access to.
Obviously, this can result in the Tower of Babel problem: it's not at all certain that different people will in fact use the same field names when they mean the same thing, and, worse, it's not at all certain that they will use different field names when they mean different things. This is also a potential problem with folksonomy systems like del.icio.us: I may choose 'sf' as my tag for a semantic category that you call 'sanfrancisco' or 'san-francisco', so searches by tag won't have very good recall, and searches that exclude by tag won't have very good precision. RDF's solution to this, I think, is OWL [insert ref], where you specify mappings between relationships and other objects you think are equivalent for the purpose of your query.
The Microformats effort <http://www.microformats.org/> is working to establish broadly interoperable schemas for the most widely used data types, such as personal contact information and event announcements, with defined concrete representations in HTML, semantics, and pragmatic intent for each format.
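As a rough illustration of the fielded queries and the Tower of Babel problem described above, here is a minimal Python sketch. The in-memory list, the `query` helper, and all of the records and field names are invented for illustration; none of this is the actual interface of Google Base, Ning, or JotSpot.

```python
# Hypothetical in-memory stand-in for a big shared store of key-value records.
# Every record and field name here is made up for the example.
records = [
    {"type": "personal-ad", "seeking": "SWF", "city": "sf"},
    {"type": "personal-ad", "seeking": "SWF", "location": "san-francisco"},
    {"type": "used-car", "make": "Honda", "city": "sf"},
]


def query(store, **conditions):
    """Return the records in which every named field has exactly the given value."""
    return [
        record
        for record in store
        if all(record.get(field) == value for field, value in conditions.items())
    ]


# "field type has value personal-ad, and field city has value sf"
matches = query(records, type="personal-ad", city="sf")
print(len(matches))
```

The second personal ad means the same thing as the first but files its city under "location" with a different spelling, so the query misses it. That is the recall problem in miniature.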
Microformats solves the Tower of Babel problem within the domains where the largest benefit attaches to doing so, using a lightweight version of the normal standards process: document existing needs, research existing solutions, propose a solution that spans most of the existing uses, and seek consensus on the solution. But what about domains with only a few tens of thousands of currently published structured data items? These include horserace odds, used-car listings, electronic part listings, personal ads, classical music album information, satellite photos, software bug reports, word definitions from dictionaries, software source code patches, expense reports, college course schedules, prayers, and so forth. In domains like these, even the lightweight Microformats specifications process may be enough effort to discourage implementors from participating, much as most people who want to categorize their snapshots or web bookmarks aren't interested in participating in the discussions of the Library of Congress or Open Directory topic headings.
It's easy enough for people working in these domains to come up with some set of field names and stick their data into Google Base or Ning. Is there some kind of even-lighter-weight standardization process that might make their schemas more likely to interoperate?
The cases of Flickr and del.icio.us --- the canonical folksonomies --- suggest that there might be. The systems are perfectly usable without looking at anyone else's stuff, but people get network-effect benefits from using tags that others use: people can find their items more easily, they get better suggestions for tags to apply to their items, their items are classed with other items that are more closely related, and so forth. Gradually, over time, people tend to adopt tags that are more similar to other people's tags. Tantek coined the term "folkschemas" to describe potential sets of field names that would converge in the same gradual way.
So far, we don't have a lot of examples of this happening in the real world; geotagging of photos on Flickr, with 94,000 current examples, is perhaps the most prominent one. If there are only a few examples with wide applicability, a Microformats-style lightweight standards process might not be so bad. It's only if there is a "long tail", with an almost infinite number of formats popular enough that interoperability matters but unpopular enough that nobody wants to bother to spend a week or two on a standard, that "folkschemas" are really a win.
Before del.icio.us and Flickr, there had been a number of attempts at folksonomy-like systems that never achieved any noticeable level of common categorization without a formal standards process: the meta keywords HTML tag, the Keywords fields of various scientific article formats, the old McBee and Indecks card systems. We hypothesized that in the cases of del.icio.us and Flickr, the software had reached the point where the gentle incentives to converge on common categorizations were strong enough that people began to do it, and we brainstormed a bit about things that could enable the same sort of effect for folkschemas.
The "heat map" or "tag cloud", which displays some set of tags with sizes indicating their relative popularity, is one innovation --- <http://flickr.com/photos/tags/> and <http://del.icio.us/tag/> [verify ref] are two examples, but they're found all over the folksonomy web. A recent innovation is to look just at the subset of tags that co-occur with a single other tag [insert ref]. Displaying a "heat map" of field names, perhaps from the subset of field names that co-occur with the other fields you've already put into the current record, could facilitate schema convergence.
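Here is one way the field-name "heat map" idea above might be sketched in Python. The record list, the `cooccurring_fields` and `font_sizes` helpers, the rule of counting any record that shares at least one field with yours, and the linear scaling to point sizes are all my own assumptions for illustration, not a description of how any existing service works.

```python
from collections import Counter

# Invented sample data standing in for other people's records.
records = [
    {"make": "Honda", "model": "Civic", "year": "1998", "price": "3500"},
    {"make": "Toyota", "model": "Corolla", "year": "2001", "mileage": "80000"},
    {"make": "Ford", "model": "Focus", "price": "4200"},
    {"lastname": "Smith", "seeking": "SWF"},
]


def cooccurring_fields(store, current_fields):
    """Count field names appearing in records that share a field with ours."""
    current = set(current_fields)
    counts = Counter()
    for record in store:
        fields = set(record)
        if fields & current:
            # Only suggest names we have not already used.
            counts.update(fields - current)
    return counts


def font_sizes(counts, smallest=10, largest=30):
    """Scale counts linearly to point sizes, the way a tag cloud might."""
    if not counts:
        return {}
    low, top = min(counts.values()), max(counts.values())
    span = (top - low) or 1  # avoid dividing by zero when all counts are equal
    return {
        name: smallest + (largest - smallest) * (n - low) // span
        for name, n in counts.items()
    }


# Suppose the record being entered already has "make" and "model" fields.
counts = cooccurring_fields(records, ["make", "model"])
sizes = font_sizes(counts)
for name, size in sorted(sizes.items(), key=lambda item: -item[1]):
    print(f"{name}: {size}pt")
```

With this sample data, "year" and "price" come out largest and "mileage" smallest, while the personal-ad record contributes nothing because it shares no fields with a car listing.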
Likewise, if you've typed a data value such as "SWM", it could be helpful to have a "heat map", or just a sorted list, of the field names that have previously been attached to that value in the records produced by other people. And if you've typed a field name such as "Current Mood", it could be helpful to have a "heat map" or sorted list of the data values that have occurred in that field in existing records. In a system such as Google Base that supports fielded searches, it would be helpful to know which fields were commonly specified in searches, and what values were being searched for --- not just what fields and values others had specified. Once you have some data entered with some set of field names, it would be helpful to find out which other fields were similar in content to the fields you chose, and what their records looked like; this would enable you to decide whether or not to establish a rule, for example, that your "lastname" field should also be published as "surname" --- or possibly whether or not to simply rename the field to "surname" if that's what's more commonly used. "Heat maps" or sorted lists of alternatives might prove useful in this case as well. Finally, when you perform a query that selects some set of data records, it would be nice to be able to get "heat maps" and sorted lists of the field names and field values used within the set of query results.
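The value-to-field and field-to-value lists described above amount to a pair of inverted indexes, which might be sketched in Python as follows. The `build_indexes` helper and the sample records are hypothetical, and a real shared store would presumably need something far more scalable than a pass over an in-memory list.

```python
from collections import Counter, defaultdict

# Invented sample records; "SWM" and "Current Mood" are borrowed from the text.
records = [
    {"seeking": "SWF", "self": "SWM", "Current Mood": "hopeful"},
    {"seeking": "SWM", "self": "SWF"},
    {"orientation": "SWM", "Current Mood": "bored"},
    {"self": "SWM", "Current Mood": "hopeful"},
]


def build_indexes(store):
    """Count which fields each value appears under, and which values each field holds."""
    fields_for_value = defaultdict(Counter)
    values_for_field = defaultdict(Counter)
    for record in store:
        for field, value in record.items():
            fields_for_value[value][field] += 1
            values_for_field[field][value] += 1
    return fields_for_value, values_for_field


fields_for_value, values_for_field = build_indexes(records)

# Sorted list of field names other people have attached to the value "SWM".
print(fields_for_value["SWM"].most_common())

# Sorted list of values other people have put in the "Current Mood" field.
print(values_for_field["Current Mood"].most_common())

# The same counting applied to a set of query results: field names used
# within the records whose "self" field is "SWM".
results = [record for record in records if record.get("self") == "SWM"]
result_fields = Counter(field for record in results for field in record)
print(result_fields.most_common())
```

Any of these counters could be fed to a tag-cloud renderer instead of being printed as a sorted list.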