[jira] [Issue Comment Edited] (CASSANDRA-2474) CQL support for compound columns

Sylvain Lebresne (JIRA) Tue, 06 Sep 2011 10:09:37 -0700

    [ https://issues.apache.org/jira/browse/CASSANDRA-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098165#comment-13098165 ]


Sylvain Lebresne edited comment on CASSANDRA-2474 at 9/6/11 5:07 PM:
---------------------------------------------------------------------

bq. A more Cassandra-ish way to model this would be to encode this as a series 
of columns: (<timestamp>, 'category', <category>), (<timestamp>, 'subcategory', 
<subcategory>), (<timestamp>, 'event', <eventId>). This is better in the 
general case for the same reason that a sparse top-level set of columns is 
better: I can easily add more data to events (e.g., "source") without rewriting 
existing events.

But my point is: I disagree with that claim.

Maybe your proposal is sometimes better, but not always. What if you know that 
you won't add more data to events? Or more precisely, you know that what 
identifies an event won't change? What if you decided to model it with a 
(timestamp, category, sub-category, eventId) composite not as a way to feed 
data into the column key, but because this corresponds to how you want to query 
the data (which I would say is a very Cassandra-ish way to model)?

Let's take an example. The data for the (timestamp, category, sub-category, 
eventId) composite (for some key) could look like (on disk):
{noformat}
  ts1:catA:subcatA:id1 -> <value>
  ts1:catA:subcatA:id2 -> <value>
  ts1:catA:subcatA:id3 -> <value>
  ts1:catA:subcatA:id4 -> <value>
  ts1:catA:subcatB:id5 -> <value>
  ts1:catA:subcatB:id6 -> <value>
  ts1:catB:subcatA:id7 -> <value>
  ts1:catB:subcatA:id8 -> <value>
  ....
{noformat}
And say that value is some opaque bytes representing some event data.

Now, I'm not even sure how you would model the same thing with your proposal, 
but I'm pretty sure it would involve indirection (or duplication). I doubt it 
would be more user friendly, and you would need more than one query (I would 
have said 3 queries at first, but after trying to see what it would look like, 
I'm not even sure where you would put the value in your proposal) to do 
queries like:
  * give me all the events (eventid and value) for (ts1, catA, subcatA)
  * give me all the events (eventid and value) for (ts1, catA)
  * give me all the events (eventid and value) for ts1
because the events would not be ordered correctly.
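
To make the ordering argument concrete, here is a minimal Python sketch. It is 
only an in-memory illustration, not Cassandra code: the {{row}} dict, the sample 
values and the {{slice_by_prefix}} helper are hypothetical, and tuple ordering 
stands in for the component-by-component comparison of composite names. It 
shows that each of the three queries above is a single contiguous slice.

```python
from bisect import bisect_left

# Hypothetical stand-in for one row: composite column names
# (timestamp, category, sub-category, eventId) mapped to opaque values.
row = {
    ("ts1", "catA", "subcatA", "id1"): b"v1",
    ("ts1", "catA", "subcatA", "id2"): b"v2",
    ("ts1", "catA", "subcatA", "id3"): b"v3",
    ("ts1", "catA", "subcatA", "id4"): b"v4",
    ("ts1", "catA", "subcatB", "id5"): b"v5",
    ("ts1", "catA", "subcatB", "id6"): b"v6",
    ("ts1", "catB", "subcatA", "id7"): b"v7",
    ("ts1", "catB", "subcatA", "id8"): b"v8",
    ("ts2", "catA", "subcatA", "id9"): b"v9",
}

# Keep the columns sorted by name, comparing component by component
# (Python tuples compare exactly that way).
columns = sorted(row.items())
names = [name for name, _ in columns]


def slice_by_prefix(prefix):
    """Return (eventId, value) pairs for every column whose name starts with prefix."""
    # A prefix sorts just before every longer name that extends it,
    # so bisect_left lands on the first matching column.
    start = bisect_left(names, prefix)
    result = []
    for name, value in columns[start:]:
        if name[:len(prefix)] != prefix:
            break  # matching columns are contiguous, so we can stop here
        result.append((name[-1], value))
    return result


by_subcategory = slice_by_prefix(("ts1", "catA", "subcatA"))
by_category = slice_by_prefix(("ts1", "catA"))
by_timestamp = slice_by_prefix(("ts1",))
```

Each call reads one contiguous range of the sorted columns, which is the 
property the composite ordering is meant to give.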

The kind of modeling you propose would make sense if the <value> for an event 
above was not opaque but composed of a number of properties. Then yes, I might 
want to model things as:
{noformat}
  ts1:catA:subcatA:id1:prop1 -> <value_prop1>
  ts1:catA:subcatA:id1:prop2 -> <value_prop2>
  ts1:catA:subcatA:id1:prop3 -> <value_prop3>
  ts1:catA:subcatA:id2:prop1 -> <value_prop1>
  ts1:catA:subcatA:id2:prop2 -> <value_prop2>
  ts1:catA:subcatA:id2:prop3 -> <value_prop3>
  ts1:catA:subcatA:id3:prop1 -> <value_prop1>
  ts1:catA:subcatA:id3:prop2 -> <value_prop2>
  ts1:catA:subcatA:id3:prop3 -> <value_prop3>
  ...
{noformat}
because that doesn't screw up the sorting I'm trying to impose (and that 
corresponds to my queries). And btw, prop1 could be 'category' (though that 
would be redundant in that case). But there are two different things:
  # the first part of the key (ts1:catA:subcatA:id1) is the key to my object. 
It is what makes the ordering correspond to my queries.
  # the last component (prop1, ...) is just the way to express the different 
properties of my object (and just a way to emulate super columns after all).
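
The split between the two parts can be sketched in Python as follows. This is 
again only an illustration under assumptions: the sample column values and the 
{{to_objects}} helper are hypothetical, and a sorted list of (name, value) pairs 
stands in for the columns of a row.

```python
from itertools import groupby

# Hypothetical columns: the first four components identify the object,
# the last component names one of its properties.
columns = [
    (("ts1", "catA", "subcatA", "id1", "prop1"), "a1"),
    (("ts1", "catA", "subcatA", "id1", "prop2"), "a2"),
    (("ts1", "catA", "subcatA", "id2", "prop1"), "b1"),
    (("ts1", "catA", "subcatA", "id2", "prop2"), "b2"),
]


def to_objects(sorted_columns):
    """Group consecutive columns sharing the same object key into one dict each."""
    objects = []
    # groupby only merges adjacent items, which is enough because all
    # properties of one object are contiguous in the sorted order.
    for key, group in groupby(sorted_columns, key=lambda col: col[0][:-1]):
        props = {name[-1]: value for name, value in group}
        objects.append((key, props))
    return objects


events = to_objects(sorted(columns))
```

The object key still drives the ordering, and the trailing component only 
fans each object out into its properties.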

So I guess what I'm arguing here is just to not forget the case where you use 
CompositeType because your column key really is intrinsically composed of 
multiple parts, because it *is* very useful.

  
> CQL support for compound columns
> --------------------------------
>
>                 Key: CASSANDRA-2474
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2474
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: API, Core
>            Reporter: Eric Evans
>            Assignee: Pavel Yaskevich
>              Labels: cql
>             Fix For: 1.0
>
>         Attachments: screenshot-1.jpg, screenshot-2.jpg
>
>
> For the most part, this boils down to supporting the specification of 
> compound column names (the CQL syntax is colon-delimited terms), and then 
> teaching the decoders (drivers) to create structures from the results.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
