+1 on this, although I don't know if it's feasible to hold up 0.4 for it. I'll echo the difficulties in getting familiar with Cassandra terminology. My only issue is that "Attribute Collection" is a mouthful. Something like AttributeSet might be more concise and still convey roughly the same meaning.
---Mark

On Tue, Aug 11, 2009 at 10:37 AM, Evan Weaver<[email protected]> wrote:
> Dear Cassandra Developers,
>
> In my experience, the naming of the data model has been a huge barrier
> to entry for users of Cassandra. This goes both for people familiar
> with SQL, and for people familiar with BigTable. I would like to
> change this before 0.4, since the 0.3 to 0.4 transition is the Great
> API Breakening.
>
> I (that is, all of us at Twitter) are willing to write all the patches
> and update the wiki, if I get the necessary community buy-in. I hoped
> that I could do one patch per each external interface change, and then
> after those are complete, a patch for each internal interface change
> as a phase 2.
>
> So technically this is not a bikeshed, because I'm happy to do all the
> work. I'll even submit a patch for Digg's Python client. Since there
> are no production deployments of ASF, and only a couple
> well-maintained clients, now is the time to break the world. A few
> hours of work now will pay off richly in terms of community
> involvement and reduced noob-explanation-time.
>
> In general, I think the data model names should have the following goals:
>
> * Use existing, widely understood terms.
> * Do not use terms that have conflicting meanings.
> * Express analogies in the data model, where useful.
> * Be unambiguous.
>
> Are these goals valid? Clearly I think they are, because I wrote you a
> very long email about it. Also, I don't think the current names meet
> these goals. Currently, we have:
>
> Cluster, contains keyspaces:
>
> This is fine.
>
> Keyspace: contains column families.
>
> There was some discussion of this change on the list a while back.
> Keyspace beats Table by a mile, due to the "conflicting existing
> usage" rule, but I think we can do better.
>
> Column family: containing a name, keys, column type, column sort,
> and sub column sort.
>
> This name is from BigTable, and not in wide usage. It does not
> express the hierarchy of storage, rather referring to a side effect of
> the storage hierarchy by talking about the most granular data objects.
> Confusing.
>
> Key: associated with columns.
>
> Since there's no word for the entire
> key-and-columns-in-a-column-family thing ("row"), it's hard to talk
> about this level of the data model clearly.
>
> Column: containing a name, value, and timestamp.
>
> This is from BigTable. In most cases, except when contained within a
> super column, the data is row-oriented. There is nothing inherently
> columnar about the storage. Furthermore, column is widely understood
> from SQL to mean a table-enforced, strongly typed slot. Since
> Cassandra does not have a tabular model, this is straight-up wrong.
> Timestamps are an additional unexpected innovation in the normal use
> of "column".
>
> Super column, containing a name and columns.
>
> This is a container of columns. However, the name expresses some
> kind of priority order, but nothing about the container nature, even
> though that's the most important property. This is not in any other
> usage anywhere, and will always require explanation. Despite being a
> type of column, it cannot be updated or overwritten like a standard
> column, and does not have a timestamp.
>
> Try to approach the naming with the mind of a beginner. For what it's
> worth, it took me at least 6 weeks to become comfortable with the
> current Cassandra terminology, and I had many false assumptions based
> on the names. I remember it took far less than that when starting out
> with SQL. At least there you can defer the confusing parts until
> later; Cassandra hits you with the confusion all up front. Just
> because we are comfortable now, doesn't mean that the current names
> are a good thing.
>
> So, on to the new proposed naming. In Cassandra's implementation, each
> level of the data model contains the totality of the lower levels.
> I've tried to express that in the new names.
>
> Cluster.
>
> No change.
>
> Database (formerly keyspace formerly table).
>
> Since this is quite literally the same as a database in an RDBMS,
> there's no reason to change the term. It's a namespace with a specific
> set of storage flags flipped. Its usage is analogous to the same usage
> in an RDBMS.
>
> Record collection (formerly column family).
>
> This expresses the container nature--an ordered set. The word
> "collection" is used in document databases to mean the same thing.
>
> Record (formerly a-thing-without-a-name)
>
> This is the row itself. It has a key, and attributes, but the thing
> itself is not a key. It is not a "document" because it does not
> arbitrarily nest, and it's not "row" because that might imply the
> tabular nature of an RDBMS. Record has a history in databases which is
> reasonable in this context. It does not imply that a record
> necessarily corresponds to a complete object in the application, but
> it doesn't rule it out. Since this is the only thing that has a key,
> it's still valid to refer to a "key" in isolation, when convenient.
>
> Attribute (formerly column).
>
> It has a name, value, and a timestamp. It does not imply anything
> about the storage. It does not imply a tabular model. It's more
> specific than "tuple", but easier to talk about than "timestamped
> key/value pair". It's the same as attributes in any object system.
>
> Attribute collection (formerly super column).
>
> This is clearly a container of attributes. That is all it implies,
> and that is what it is. It is analogous to record collection.
>
> In short:
>
> Cluster
> Database
> Record collection
> Record
> Attribute collection
> Attribute
>
> We could call the cluster "database collection", but even I'm not
> going to go that far. I realize that each level is merely a collection
> of the collections under it, but an "attribute collection collection
> collection collection" is no help to day-to-day usage. ;-)
>
> As a heuristic, do the current names help, or get in the way? I'm not
> married to the new proposal, but I want us to move in the right
> direction, and not act like the current unusual naming is a badge of
> honor, or forget our own difficulties in getting started.
>
> Keep in mind that BigTable, as an internal Google project, did not
> have API clarity as a primary goal; witness the colon-string-API that
> got copied by Cassandra originally.
>
> Comments please!
>
> Thanks,
>
> Evan
>
> --
> Evan Weaver
>
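
To make the proposed hierarchy a bit more concrete, here is a rough Python sketch of how the six levels nest. Everything in it is illustrative: the class names follow the proposal quoted above, but the field names and the lookup helper are hypothetical and do not correspond to any Cassandra API. Plain dicts stand in for Cassandra's sorted storage, so ordering here is insertion order only.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class Attribute:
    """Formerly "column": a name, a value, and a timestamp."""
    name: str
    value: bytes
    timestamp: int


@dataclass
class AttributeCollection:
    """Formerly "super column": a named container of attributes, with no timestamp."""
    name: str
    attributes: Dict[str, Attribute] = field(default_factory=dict)


@dataclass
class Record:
    """Formerly unnamed: a key plus whatever is stored under that key."""
    key: str
    entries: Dict[str, Union[Attribute, AttributeCollection]] = field(default_factory=dict)


@dataclass
class RecordCollection:
    """Formerly "column family": a named container of records."""
    name: str
    records: Dict[str, Record] = field(default_factory=dict)


@dataclass
class Database:
    """Formerly "keyspace" (and before that "table")."""
    name: str
    record_collections: Dict[str, RecordCollection] = field(default_factory=dict)


@dataclass
class Cluster:
    """Unchanged in the proposal: a container of databases."""
    name: str
    databases: Dict[str, Database] = field(default_factory=dict)


def lookup(cluster: Cluster, database: str, collection: str, key: str, name: str):
    """Hypothetical helper: walk one level of the hierarchy per argument."""
    record = cluster.databases[database].record_collections[collection].records[key]
    return record.entries[name]
```

Reading the arguments of `lookup` left to right gives the same order as the "In short" list: cluster, database, record collection, record, then an attribute or attribute collection.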
