Re: Cassandra data model misconceptions, and their sources

Evan Weaver Tue, 18 Aug 2009 07:37:05 -0700

Did you read the previous thread about this?

http://markmail.org/thread/qbocotgkan4mg73w


I don't think your proposals are too good...I have a new proposal
based on feedback in the previous thread, that I will send soon. But I
wanted some comments on the misconceptions themselves.

Evan

On Tue, Aug 18, 2009 at 1:33 AM, Curt Micol<[email protected]> wrote:
> I've been thinking about this for a number of days, and again, while I am not a
> developer I thought I might toss in a proposal if that's okay.
>
> Since putting together a schema diagram and having a number of people review
> it, I think a change is warranted. Too many people are coming from the RDBMS
> world and the terms used by Cassandra are conflicting with those terms they
> are already familiar with.
>
> The TLDR version is as follows:
>
> Object (Column)
> ObjectFamily (ColumnFamily)
> Directory (Row)
> ObjectContainer (SuperColumn)
> Namespace (Keyspace)
>
> The long version...
>
> Object (Column)
> As Evan has stated repeatedly, column is a bit misleading especially when
> compared to other types of database systems.  I think this is probably the
> most important change to the data model names, and exactly where I started
> since this is the 'core' of Cassandra.  Object gives the impression that this
> is a piece of data, it's relatively structured but the name gives no
> impression how strict that structure is. 'Objects' have names that have values
> and timestamps. Simple and to the point. 'Object' doesn't come with the
> preconceived notions that 'column' comes with and leaves room for Cassandra to
> define what an 'object' is without any conflict to preexisting data
> structures.
>
> By changing this, we can move up the ladder to other data types and
> easily rename them to something that 'contains objects' or 'accesses objects'.
> This allows us to describe the data model in the name structure without
> having to get too deep into the definition.
>
> Directory (Row)
> 'row' is currently unnamed, but still a structure that exists in the model.
> It's not specifically data itself, but more of a mapping of how to get to
> objects (using keys). 'Directory' fills this void quite well. It is easily
> explained as a path to get to data and not data itself.
>
> ObjectFamily (ColumnFamily)
> There's no argument that the one direct link to the BigTable paper is 'column
> families'. It's perhaps the only structure that is virtually the same in both
> pieces of software.  Considering this, I think we need to avoid too drastic a
> change.  With that said, I think a change is necessary due to the differences
> in columns between the two databases. 'object family' is descriptive of the
> relation between objects and removes any reference to tabular structures while
> keeping a loose relationship to 'column family' in the BigTable paper.
>
> ObjectContainer (SuperColumn)
> I could see this being shortened to 'container' in everyday conversation.
> However, 'objectcontainer' fits nicely with the rest of the data model names
> and is descriptive of its purpose and use. Ultimately a 'supercolumn' is
> nothing more than a named container of columns (and I've seen on at least 3
> different occasions the word container used to describe supercolumns).
> 'supercolumn' had no real connection to what exactly it was defining, but with
> 'object container' we have a clear understanding that we are naming the
> structure that holds objects. Or as I explained it to a friend, we are naming
> the 'jar' and not the 'honey'. :)
>
> Namespace (Keyspace)
> This one I go back and forth on. I know it's been changed from 'Table' to
> 'keyspace' and Evan proposed 'database', but I think that 'namespace' is
> really what it is we are talking about. Wikipedia has this as the first line
> to describe 'namespace':
>
> A namespace is an abstract container or environment created to hold a
> logical grouping of unique identifiers or symbols (i.e., names).
>
> Originally I thought 'objectspace' would fit better, but I think 'namespace'
> comes with a better history and is clearer to what this structure really is.
> Especially when you relate the name 'namespace' to how it is used in Ruby,
> Python and Java. Ultimately though, I think I prefer 'keyspace' over 'table'
> or 'database'.
>
> The only issue I see with all of these names is the potential conflict with
> programming languages and their objects. I know next to nothing about Java so
> I don't know if there would be a conflict here. I've run the following Google
> search 'reserved words in *' where '*' is Ruby, Python, Java and C++ and
> received no mention of 'object' being a reserved word in any of those
> languages.
>
> I also grep'd through current source code and there doesn't seem to be any
> real conflicts that couldn't be named something else so as not to conflict
> with this naming structure.
>
> In the end, I think it's a good idea to look at this and work out a solution.
> Documentation and tutorials are going to help, but I think people are so
> entrenched in the RDBMS world that there is somewhat of a barrier to
> understanding Cassandra's data model.
>
> Thanks for your time,
>
> --
> # Curt Micol
>
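
The structures named in the quoted proposal can be pictured as nested maps.
The sketch below is only an illustration of that nesting, under the
assumption that each level is a plain dictionary; it is not Cassandra's API.
The type aliases, the get_column helper, and the sample data are all
hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Column:
    """Smallest unit in the model: a name, a value, and a timestamp."""
    name: str
    value: str
    timestamp: int


@dataclass
class SuperColumn:
    """A named container of columns (the 'jar', not the 'honey')."""
    name: str
    columns: Dict[str, Column] = field(default_factory=dict)


# Hypothetical aliases showing how the levels nest.
Row = Dict[str, Column]              # column name -> Column
ColumnFamily = Dict[str, Row]        # row key -> Row
Keyspace = Dict[str, ColumnFamily]   # column family name -> ColumnFamily


def get_column(keyspace: Keyspace, family: str, key: str, name: str) -> Column:
    """Follow the path keyspace -> column family -> row key -> column name."""
    return keyspace[family][key][name]


# Sample data, invented for illustration only.
keyspace: Keyspace = {
    "Users": {
        "row-1": {"color": Column("color", "blue", 1)},
    },
}

# A super column groups several columns under one name.
post = SuperColumn("post-1")
post.columns["title"] = Column("title", "hello", 2)

found = get_column(keyspace, "Users", "row-1", "color")
```

Seen this way, the row key is a path to the data rather than data itself,
which is the distinction the 'Directory' suggestion is trying to capture.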



-- 
Evan Weaver
