Re: Fixing the data model names

Mark McBride Tue, 11 Aug 2009 16:38:04 -0700

+1 on this, although I don't know if it's feasible to hold up 0.4 for
it.  I'll echo the difficulties in getting familiar with Cassandra
terminology.  My only issue is "Attribute Collection" is a mouthful.
Something like AttributeSet might be more concise and still convey
roughly the same meaning.


   ---Mark

On Tue, Aug 11, 2009 at 10:37 AM, Evan Weaver<[email protected]> wrote:
> Dear Cassandra Developers,
>
> In my experience, the naming of the data model has been a huge barrier
> to entry for users of Cassandra. This goes both for people familiar
> with SQL, and for people familiar with BigTable. I would like to
> change this before 0.4, since the 0.3 to 0.4 transition is the Great
> API Breakening.
>
> I (that is, all of us at Twitter) are willing to write all the patches
> and update the wiki, if I get the necessary community buy-in. I hoped
> that I could do one patch per external interface change, and then
> after those are complete, a patch for each internal interface change
> as a phase 2.
>
> So technically this is not a bikeshed, because I'm happy to do all the
> work. I'll even submit a patch for Digg's Python client. Since there
> are no production deployments of ASF, and only a couple
> well-maintained clients, now is the time to break the world. A few
> hours of work now will pay off richly in terms of community
> involvement and reduced noob-explanation-time.
>
> In general, I think the data model names should have the following goals:
>
>  * Use existing, widely understood terms.
>  * Do not use terms that have conflicting meanings.
>  * Express analogies in the data model, where useful.
>  * Be unambiguous.
>
> Are these goals valid? Clearly I think they are, because I wrote you a
> very long email about it. Also, I don't think the current names meet
> these goals. Currently, we have:
>
>  Cluster: contains keyspaces.
>
>  This is fine.
>
>  Keyspace: contains column families.
>
> There was some discussion of this change on the list a while back.
> Keyspace beats Table by a mile, due to the "conflicting existing
> usage" rule, but I think we can do better.
>
>  Column family: containing a name, keys, column type, column sort,
> and sub column sort.
>
>  This name is from BigTable, and not in wide usage. It does not
> express the hierarchy of storage, rather referring to a side effect of
> the storage hierarchy by talking about the most granular data objects.
> Confusing.
>
>  Key: associated with columns.
>
>  Since there's no word for the entire
> key-and-columns-in-a-column-family thing ("row"), it's hard to talk
> about this level of the data model clearly.
>
>  Column: containing a name, value, and timestamp.
>
>  This is from BigTable. In most cases, except when contained within a
> super column, the data is row-oriented. There is nothing inherently
> columnar about the storage. Furthermore, column is widely understood
> from SQL to mean a table-enforced, strongly typed slot. Since
> Cassandra does not have a tabular model, this is straight-up wrong.
> Timestamps are an additional unexpected innovation in the normal use
> of "column".
>
>  Super column: containing a name and columns.
>
>  This is a container of columns. However, the name expresses some
> kind of priority order, but nothing about the container nature, even
> though that's the most important property. This is not in any other
> usage anywhere, and will always require explanation. Despite being a
> type of column, it cannot be updated or overwritten like a standard
> column, and does not have a timestamp.
>
> Try to approach the naming with the mind of a beginner. For what it's
> worth, it took me at least 6 weeks to become comfortable with the
> current Cassandra terminology, and I had many false assumptions based
> on the names. I remember it took far less than that when starting out
> with SQL. At least there you can defer the confusing parts until
> later; Cassandra hits you with the confusion all up front. Just
> because we are comfortable now, doesn't mean that the current names
> are a good thing.
>
> So, on to the new proposed naming. In Cassandra's implementation, each
> level of the data model contains the totality of the lower levels.
> I've tried to express that in the new names.
>
>  Cluster.
>
>  No change.
>
>  Database (formerly keyspace formerly table).
>
>  Since this is quite literally the same as a database in an RDBMS,
> there's no reason to change the term. It's a namespace with a specific
> set of storage flags flipped. Its usage is analogous to the same usage
> in an RDBMS.
>
>  Record collection (formerly column family).
>
>  This expresses the container nature--an ordered set. The word
> "collection" is used in document databases to mean the same thing.
>
>  Record (formerly a-thing-without-a-name)
>
>  This is the row itself. It has a key, and attributes, but the thing
> itself is not a key. It is not a "document" because it does not
> arbitrarily nest, and it's not "row" because that might imply the
> tabular nature of an RDBMS. Record has a history in databases which is
> reasonable in this context. It does not imply that a record
> necessarily corresponds to a complete object in the application, but
> it doesn't rule it out. Since this is the only thing that has a key,
> it's still valid to refer to a "key" in isolation, when convenient.
>
>  Attribute (formerly column).
>
>  It has a name, value, and a timestamp. It does not imply anything
> about the storage. It does not imply a tabular model. It's more
> specific than "tuple", but easier to talk about than "timestamped
> key/value pair". It's the same as attributes in any object system.
>
>  Attribute collection (formerly super column).
>
>  This is clearly a container of attributes. That is all it implies,
> and that is what it is. It is analogous to record collection.
>
> In short:
>
>  Cluster
>  Database
>  Record collection
>  Record
>  Attribute collection
>  Attribute
>
> We could call the cluster "database collection", but even I'm not
> going to go that far. I realize that each level is merely a collection
> of the collections under it, but an "attribute collection collection
> collection collection" is no help to day-to-day usage. ;-)
>
> As a heuristic, do the current names help, or get in the way? I'm not
> married to the new proposal, but I want us to move in the right
> direction, and not act like the current unusual naming is a badge of
> honor, or forget our own difficulties in getting started.
>
> Keep in mind that BigTable, as an internal Google project, did not
> have API clarity as a primary goal; witness the colon-string-API that
> got copied by Cassandra originally.
>
> Comments please!
>
> Thanks,
>
> Evan
>
> --
> Evan Weaver
>
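
For readers trying to picture the proposed hierarchy, the nesting described in the quoted message (Cluster, Database, Record collection, Record, Attribute collection, Attribute) can be sketched as plain Python containers. This is only an illustration of the naming, not Cassandra's API or storage layout: every class name, field name, and sample value below (including the timestamp and the "Users" collection) is hypothetical, and allowing a record to hold either attributes or attribute collections in a single mapping is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class Attribute:
    """Formerly "column": a name, a value, and a timestamp."""
    name: str
    value: str
    timestamp: int


@dataclass
class AttributeCollection:
    """Formerly "super column": a named container of attributes."""
    name: str
    attributes: Dict[str, Attribute] = field(default_factory=dict)


@dataclass
class Record:
    """Formerly unnamed: a key plus everything stored under it.

    Assumption for this sketch: one mapping holds either plain
    attributes or attribute collections, looked up by name.
    """
    key: str
    contents: Dict[str, Union[Attribute, AttributeCollection]] = field(
        default_factory=dict
    )


@dataclass
class RecordCollection:
    """Formerly "column family": a named container of records."""
    name: str
    records: Dict[str, Record] = field(default_factory=dict)


@dataclass
class Database:
    """Formerly "keyspace" (and before that "table")."""
    name: str
    record_collections: Dict[str, RecordCollection] = field(default_factory=dict)


@dataclass
class Cluster:
    """Unchanged in the proposal: contains databases."""
    name: str
    databases: Dict[str, Database] = field(default_factory=dict)


def build_example() -> Cluster:
    """Build a tiny cluster with one record; all values are made up."""
    attr = Attribute(name="screen_name", value="evan", timestamp=1250012284)
    record = Record(key="user:1", contents={attr.name: attr})
    users = RecordCollection(name="Users", records={record.key: record})
    db = Database(name="Example", record_collections={users.name: users})
    return Cluster(name="test-cluster", databases={db.name: db})
```

Reading a value then walks the levels in the same order as the list in the proposal: `build_example().databases["Example"].record_collections["Users"].records["user:1"].contents["screen_name"].value`. Under the AttributeSet suggestion from the reply above, only the `AttributeCollection` class name would change.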
