[ https://issues.apache.org/jira/browse/CASSANDRA-6561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908124#comment-13908124 ]
DOAN DuyHai commented on CASSANDRA-6561:
----------------------------------------

It's me again. While talking with my colleagues about static columns this morning, an interesting question came up.

{code:sql}
CREATE TABLE blogpost (
    id bigint,
    content text static,
    creation_date timestamp static,
    author_id text static,
    comment_date timeuuid,
    comment_author_id text,
    comment text,
    PRIMARY KEY (id, comment_date)
) WITH CLUSTERING ORDER BY (comment_date DESC);
{code}

If we do something like:

{code:sql}
SELECT * FROM blogpost WHERE id=xxx LIMIT 10;
{code}

We'll have the blog post content plus the last 10 comments, pretty clear. The answer is something like:

||id||content||creation_date||author_id||comment_date||comment_author_id||comment||
|10|big blog text|2014/02/20 10:00:00|author1|2014/02/20 22:00:00|user1|last comment|
|10|big blog text|2014/02/20 10:00:00|author1|2014/02/20 21:50:00|user2|before last comment|
|10|big blog text|2014/02/20 10:00:00|author1|2014/02/20 21:40:00|user3|another comment|
|10|big blog text|2014/02/20 10:00:00|author1|2014/02/20 21:30:00|user4|and again|
|10|big blog text|2014/02/20 10:00:00|author1|2014/02/20 21:00:00|user5|and again|

As you can see, the first 4 columns are duplicated and only the last 3 columns vary, as per the CQL3 semantics.

The question is: how is the data sent over the wire? *Are the duplicated values of the first 4 columns sent over the network, or does the Java driver "reformat" the data to show them duplicated?*

> Static columns in CQL3
> ----------------------
>
>                 Key: CASSANDRA-6561
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6561
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Sylvain Lebresne
>            Assignee: Sylvain Lebresne
>             Fix For: 2.0.6
>
>
> I'd like to suggest the following idea for adding "static" columns to CQL3.
> I'll note that the basic idea has been suggested by jhalliday on IRC but the
> rest of the details are mine and I should be blamed for anything stupid in
> what follows.
> Let me start with a rationale: there are 2 main families of CF that have
> been historically used in Thrift: static ones and dynamic ones. CQL3 handles
> both families through the presence or absence of clustering columns. There
> are, however, some cases where mixing both behaviors has its use. I like to
> think of those use cases as 3 broad categories:
> # to denormalize small amounts of not-entirely-static data in otherwise
> static entities. Say, "tags" for a product or "custom properties" in a
> user profile. This is why we've added CQL3 collections. Importantly, this is
> the *only* use case for which collections are meant (which doesn't diminish
> their usefulness imo, and I wouldn't disagree that we've maybe not
> communicated this too well).
> # to optimize fetching both a static entity and related dynamic ones. Say you
> have blog posts, and each post has associated comments (chronologically
> ordered). *And* say that a very common query is "fetch a post and its 50 last
> comments". In that case, it *might* be beneficial to store a blog post
> (static entity) in the same underlying CF as its comments for performance
> reasons, so that "fetch a post and its 50 last comments" is just one slice
> internally.
> # you want to CAS rows of a dynamic partition based on some partition
> condition. This is the same use case that CASSANDRA-5633 exists for.
>
> As said above, 1) is already covered by collections, but 2) and 3) are not
> (and I strongly believe collections are not the right fit, API wise, for
> those).
>
> Also, note that I don't want to overestimate the usefulness of 2). In most
> cases, using a separate table for the blog posts and the comments is The
> Right Solution, and trying to do 2) is premature optimisation. Yet, when used
> properly, that kind of optimisation can make a difference, so I think having
> a relatively native solution for it in CQL3 could make sense.
> Regarding 3), though CASSANDRA-5633 would provide one solution for it, I have
> the feeling that static columns actually are a more natural approach (in
> terms of API). That's arguably more of a personal opinion/feeling though.
>
> So long story short, CQL3 lacks a way to mix both some "static" and "dynamic"
> rows in the same partition of the same CQL3 table, and I think such a tool
> could have its use.
>
> The proposal is thus to allow "static" columns. Static columns would only
> make sense in tables with clustering columns (the "dynamic" ones). A static
> column value would be static to the partition (all rows of the partition
> would share the value for such a column). The syntax would just be:
> {noformat}
> CREATE TABLE t (
>   k text,
>   s text static,
>   i int,
>   v text,
>   PRIMARY KEY (k, i)
> )
> {noformat}
> then you'd get:
> {noformat}
> INSERT INTO t(k, s, i, v) VALUES ('k0', 'I''m shared', 0, 'bar');
> INSERT INTO t(k, s, i, v) VALUES ('k0', 'I''m still shared', 1, 'foo');
> SELECT * FROM t;
>  k  | s                  | i | v
> ------------------------------------
>  k0 | "I'm still shared" | 0 | "bar"
>  k0 | "I'm still shared" | 1 | "foo"
> {noformat}
> There would be a few semantic details to decide on regarding deletions, ttl,
> etc. but let's see if we agree it's a good idea first before ironing those
> out.
>
> One last point is the implementation. Though I do think this idea has merits,
> it's definitely not useful enough to justify rewriting the storage engine
> for it. But I think we can support this relatively easily (emphasis on
> "relatively" :)), which is probably the main reason why I like the approach.
> Namely, internally, we can store static columns as cells whose clustering
> column values are empty.
> So in terms of cells, the partition of my example would look like:
> {noformat}
> "k0" : [
>   (:"s" -> "I'm still shared"), // the static column
>   (0:"" -> "")                  // row marker
>   (0:"v" -> "bar")
>   (1:"" -> "")                  // row marker
>   (1:"v" -> "foo")
> ]
> {noformat}
> Of course, using empty values for the clustering columns doesn't quite work
> because it could conflict with the user using empty clustering columns. But
> in the CompositeType encoding we have the end-of-component byte that we could
> reuse by using a specific value (say 0xFF, currently we never set that byte
> to anything other than -1, 0 and 1) to indicate it's a static column.
>
> With that, we'd need to update the CQL3 statements to support the new syntax
> and rules, but that's probably not horribly hard.
>
> So anyway, this may or may not be a good idea, but I think it has enough meat
> to warrant some consideration.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
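
For readers who want to play with the cell layout from the quoted description, here is a rough Python model of it. This is only an illustrative sketch, not Cassandra's storage engine code: the `STATIC`/`REGULAR` markers stand in for the special end-of-component byte the description proposes, and `cell` and `materialize_rows` are hypothetical helper names. It shows static cells sorting ahead of the regular cells of a partition, and one way the shared value could be copied into every CQL row when the rows are assembled.

```python
# Illustrative model only; none of these names exist in Cassandra.
STATIC = 0   # hypothetical marker: sorts before every regular cell
REGULAR = 1


def cell(kind, clustering, column, value):
    """Build one cell as (sort_key, value); clustering is a tuple."""
    return ((kind, clustering, column), value)


# Cells of partition "k0", deliberately listed out of order.
partition = [
    cell(REGULAR, (1,), "v", "foo"),
    cell(REGULAR, (0,), "", ""),                 # row marker
    cell(STATIC, (), "s", "I'm still shared"),   # the static column
    cell(REGULAR, (0,), "v", "bar"),
    cell(REGULAR, (1,), "", ""),                 # row marker
]


def materialize_rows(partition_key, cells):
    """Turn sorted cells into CQL-style rows, repeating static values."""
    static_values = {}
    rows = {}
    for (kind, clustering, column), value in sorted(cells, key=lambda c: c[0]):
        if kind == STATIC:
            static_values[column] = value
        else:
            row = rows.setdefault(clustering, {})
            if column:  # an empty column name is the row marker
                row[column] = value
    return [
        {"k": partition_key, **static_values, "i": clustering[0], **row}
        for clustering, row in rows.items()
    ]


for row in materialize_rows("k0", partition):
    print(row)
```

In this model the static value is stored once per partition and only duplicated when rows are built. Whether that duplication happens before or after the data crosses the wire is exactly the question asked in the comment above, and the sketch does not answer it.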