Re: Dynamic Columns in Cassandra 2.X

graham sanderson Fri, 13 Jun 2014 15:22:59 -0700

My 2 cents…

A motivation for CQL3 AFAIK was to make Cassandra more familiar to SQL users. 
This is a valid goal, and works well in many cases.
Equally there are use cases (that some might find ugly) where Cassandra is 
chosen explicitly because of the sorts of things you can do at the Thrift 
level, which aren’t (currently) exposed via CQL3.


To Robert’s point earlier - “Rational people should presume that Thrift support 
must eventually disappear”… he is probably right (though frankly I’d rather the 
non-blocking Thrift version was added instead). However, if we do get rid of the 
Thrift interface, then it needs to be at a time when CQLn is capable of 
expressing all the things you could do via the Thrift API. Note, I need to go 
look and see if the non-blocking Thrift version also requires materializing the 
entire Thrift object in memory.

On Jun 13, 2014, at 4:55 PM, DuyHai Doan <doanduy...@gmail.com> wrote:

> There are always pros and cons with a query language.
> 
> As far as I can see, the advantages of Thrift over CQL3 are:
> 
>  1) Thrift requires a little bit less decoding server-side (a difference of 
> around 10% in CPU usage).
> 
>  2) Thrift uses more "compact" storage because CQL3 needs to add extra "marker" 
> columns to guarantee the existence of the primary key. It gets worse when you 
> use clustering columns, because each distinct clustering group has its own 
> related "marker" column.
> 
>  That being said, point 1) is not really an issue since most of the time 
> nodes are more I/O bound than CPU bound. Only in extreme cases, where you have 
> a very high read rate with data that fits entirely in memory, may you 
> notice the difference.
> 
>  For point 2) this is a small trade-off for having access to a query language 
> and being able to do slice queries using the WHERE clause. Some like it, 
> others hate it; it's just a question of taste.  Please note that the "waste" 
> in disk space is somewhat mitigated by compression.
> 
>  Long story short, I think Thrift may have appropriate uses, but only in very 
> few use cases. Recently a lot of improvements and features have been added to 
> CQL3, so it should be considered the first choice for most users; if 
> they fall into one of those few use cases, they can switch back to Thrift.
> 
> My 2 cents
> 
> On Fri, Jun 13, 2014 at 11:43 PM, Peter Lin <wool...@gmail.com> wrote:
> 
> With a text-based query approach like CQL, you lose the type with dynamic 
> columns. Yes, we're storing it as bytes, but it is simpler and easier with 
> Thrift to do these types of things.
> 
> I like CQL3 and what it does, but text-based query languages make certain 
> dynamic schema use cases painful. Having used and built ORMs, I find they are 
> poorly suited to dynamic schemas. If you've never had to write an ORM to handle 
> dynamic user-defined schemas at runtime, it's tough to see where the problems 
> arise and how that makes life painful.
> 
> Just to be clear, I'm not saying "don't use CQL3" or "CQL3 is bad". I'm 
> saying CQL3 is good for certain kinds of use cases and Thrift is good at 
> certain use cases. People need to look at what and how they're storing data 
> and do what makes the most sense to them. Slavishly following CQL3 doesn't 
> make any sense to me.
>  
> 
> 
> On Fri, Jun 13, 2014 at 5:30 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
> "the validation type is set to bytes, and my code is type safe, so it knows 
> which serializers to use. Those dynamic columns are driven off the types in 
> Java."  --> Correct. However, you are still bound by the column comparator 
> type, which should be fixed (unless, again, you set it to bytes, in which case 
> you lose the ordering and sorting feature).
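
To make the point about losing ordering with a bytes comparator concrete, here is a minimal sketch. It is only an illustration under stated assumptions: it assumes a bytes comparator orders column names by unsigned, byte-wise comparison (modelled here by Python's own bytes ordering), and the 8-byte big-endian encoding in the hypothetical helper `encode_long` is an illustrative choice, not a statement about what any particular client serializer emits.

```python
import struct


def encode_long(value: int) -> bytes:
    """Hypothetical serializer: signed 64-bit integer as 8 big-endian bytes."""
    return struct.pack(">q", value)


column_names = [3, -1, 10, 2]

# Typed comparator: columns sort by their numeric value.
typed_order = sorted(column_names)

# Bytes comparator (assumed byte-wise, unsigned): columns sort by raw encoding.
sorted_raw = sorted(encode_long(n) for n in column_names)
bytes_order = [struct.unpack(">q", raw)[0] for raw in sorted_raw]

print(typed_order)  # [-1, 2, 3, 10]
print(bytes_order)  # [2, 3, 10, -1]
```

The negative value ends up last under byte-wise comparison because its two's complement encoding starts with 0xff, so a slice over a numeric range would no longer return what the client expects.
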
> 
>  Basically what you are doing is telling Cassandra to save data in the cells 
> as raw bytes; the serialization is taken care of client-side using the 
> appropriate serializer. This is a perfectly valid strategy.
> 
>  But how is it different from using CQL3, setting the value to "blob" 
> (equivalent to bytes) and taking care of the serialization client-side as well? 
> You can even imagine saving the value in JSON format and setting the type to 
> "text".
> 
>  Really, I don't see why CQL3 cannot achieve the scenario you describe.
> 
>  For the record, when you create a table in CQL3 as follows:
> 
>  CREATE TABLE user (
>      id bigint PRIMARY KEY,
>      firstname text,
>      lastname text,
>      last_connection timestamp,
>      ....);
> 
>  C* will create a column family with validation type = bytes to accommodate 
> the timestamp and text types for the firstname, lastname and last_connection 
> columns. Basically the CQL3 engine is doing the serialization server-side for 
> you.
> 
> On Fri, Jun 13, 2014 at 11:19 PM, Peter Lin <wool...@gmail.com> wrote:
> 
> the validation type is set to bytes, and my code is type safe, so it knows 
> which serializers to use. Those dynamic columns are driven off the types in 
> Java.
> 
> Having said that, CQL3 does have a new custom type feature, but the 
> documentation is basically non-existent on how that actually works. One could 
> also modify CQL such that insert statements give Cassandra hints about what 
> type it is, but I'm not aware of anyone enhancing CQL3 to do that.
> 
> I realize my kind of use case is a bit unique, but I do know of others that 
> are doing similar kinds of things.
> 
> 
> 
> 
> On Fri, Jun 13, 2014 at 5:11 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
> In thrift, when creating a column family, you need to define
> 
> 1) the row/partition key type
> 2) the column comparator type
> 3) the validation type for the actual value (cell in CQL3 terminology)
> 
> Unless you use the "dynamic composites" feature, which does not exist (and 
> probably won't) in CQL3, I don't see how you can have columns with "different 
> types" on the same row/partition.
> 
> 
> On Fri, Jun 13, 2014 at 11:06 PM, Peter Lin <wool...@gmail.com> wrote:
> 
> when I say dynamic column, I mean non-static columns of different types 
> within the same row. Some could be an object or one of the defined datatypes.
> 
> with thrift I use the appropriate serializer to handle these dynamic columns.
> 
> 
> On Fri, Jun 13, 2014 at 4:55 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
> Well, before discussing "dynamic columns", we should first 
> define the term clearly. What do people mean by "dynamic columns" exactly? Is 
> it the ability to add many columns "of the same type" to an existing physical 
> row? If yes, then CQL3 does support it with clustering columns.
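
As a rough mental model of "many columns of the same type in one physical row", the toy sketch below keeps the cells of a partition sorted by a clustering value and answers range (slice) lookups over it. This is an illustrative in-memory analogy only, not how Cassandra is implemented; the class `Partition`, its methods, and the sample property names are all hypothetical.

```python
from bisect import bisect_left, bisect_right


class Partition:
    """Toy model of one partition: cells kept sorted by clustering value."""

    def __init__(self):
        self._names = []   # sorted clustering values (the "dynamic column" names)
        self._values = {}  # clustering value -> cell value

    def put(self, name, value):
        # Insert new names at their sorted position; overwrite existing cells.
        if name not in self._values:
            self._names.insert(bisect_left(self._names, name), name)
        self._values[name] = value

    def slice(self, start, end):
        """Return (name, value) pairs with start <= name <= end, in order."""
        lo = bisect_left(self._names, start)
        hi = bisect_right(self._names, end)
        return [(n, self._values[n]) for n in self._names[lo:hi]]


table = {}  # partition key -> Partition
p = table.setdefault(42, Partition())
p.put("favorite_color", "blue")
p.put("account_tier", "gold")
p.put("region", "emea")

result = p.slice("a", "g")
print(result)  # [('account_tier', 'gold'), ('favorite_color', 'blue')]
```

New "columns" can be added to partition 42 at any time without touching a schema, and because the names are kept in order, a range over them is cheap, which is the property the clustering-column approach relies on.
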
> 
> 
> On Fri, Jun 13, 2014 at 10:36 PM, Mark Greene <green...@gmail.com> wrote:
> Yeah I don't anticipate more than 1000 properties, well under in fact. I 
> guess the trade off of using the clustered columns is that I'd have a table 
> that would be tall and skinny which also has its challenges w/r/t memory. 
> 
> I'll look into your suggestion a bit more and consider some others around a 
> hybrid of CQL and Thrift (where necessary). But from a newb's perspective, I 
> sense the community is unsettled around this concept of truly dynamic 
> columns. Coming from an HBase background, it's a consideration I didn't 
> anticipate having to evaluate.
> 
> 
> --
> about.me
> 
> 
> On Fri, Jun 13, 2014 at 4:19 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
> Hi Mark
> 
>  I believe that in your table you want to have some "common" fields that will 
> be there whatever the customer is, and other fields that are entirely 
> customer-dependent, is that right?
> 
>  In this case, creating a table with static columns for the common fields and 
> a clustering column representing all custom fields defined by a customer 
> could be a solution (see here for static column: 
> https://issues.apache.org/jira/browse/CASSANDRA-6561 )
> 
> CREATE TABLE user_data (
>    user_id bigint,
>    user_firstname text static,
>    user_lastname text static,
>    ...
>    custom_property_name text,
>    custom_property_value text,
>    PRIMARY KEY(user_id, custom_property_name, custom_property_value));
> 
>  Please note that with this solution you need to have "at least one" custom 
> property per customer to make it work.
> 
>  The only thing to take care of is the type of custom_property_value. You 
> need to define it once and for all. To accommodate dynamic types, you can 
> either save the value as blob or text (as JSON) and take care of the 
> serialization/deserialization yourself on the client side.
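
One possible shape for that client-side serialization, sketched under assumptions: the value is stored as JSON text with an explicit type tag so the client can restore the original type on read. The function names `serialize_property` and `deserialize_property`, the tag vocabulary, and the choice of ISO 8601 for timestamps are all hypothetical choices made for illustration; nothing here touches a Cassandra driver.

```python
import json
from datetime import datetime, timezone


def serialize_property(value) -> str:
    """Encode a custom property value as JSON text with a type tag."""
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return json.dumps({"type": "bool", "value": value})
    if isinstance(value, int):
        return json.dumps({"type": "int", "value": value})
    if isinstance(value, float):
        return json.dumps({"type": "float", "value": value})
    if isinstance(value, str):
        return json.dumps({"type": "text", "value": value})
    if isinstance(value, datetime):
        return json.dumps({"type": "timestamp", "value": value.isoformat()})
    raise TypeError(f"unsupported property type: {type(value).__name__}")


def deserialize_property(text: str):
    """Decode JSON text produced by serialize_property back to a Python value."""
    doc = json.loads(text)
    if doc["type"] == "timestamp":
        return datetime.fromisoformat(doc["value"])
    return doc["value"]


last_seen = datetime(2014, 6, 13, 15, 22, 59, tzinfo=timezone.utc)
stored = serialize_property(last_seen)  # what would go into the text column
restored = deserialize_property(stored)
print(restored == last_seen)
```

The same idea works with a blob column and a binary encoding; JSON in a text column just keeps the stored values readable when queried by hand.
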
> 
>  As an alternative you can save custom properties in a map, provided that 
> their number is not too large. But considering the business case of a CRM, I 
> believe that it's quite rare for a user to have more than 1000 custom 
> properties, isn't it?
> 
> 
> 
> On Fri, Jun 13, 2014 at 10:03 PM, Mark Greene <green...@gmail.com> wrote:
> My use case requires the support of arbitrary columns much like a CRM. My 
> users can define 'custom' fields within the application. Ideally I wouldn't 
> have to change the schema at all, which is why I like the old thrift approach 
> rather than the CQL approach. 
> 
> Having said all that, I'd be willing to adapt my API to make explicit schema 
> changes to Cassandra whenever my user makes a change to their custom fields 
> if that's an accepted practice. 
> 
> Ultimately, I'm trying to figure out if the Cassandra community intends to 
> support true schemaless use cases in the future.
> 
> --
> about.me
> 
> 
> On Fri, Jun 13, 2014 at 3:47 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
> This strikes me as bad practice in the world of multi tenant systems. I don't 
> want to create a table per customer. So I'm wondering if dynamically 
> modifying the table is an accepted practice?  --> Can you give some details 
> about your use case ? How would you "alter" a table structure to adapt it to 
> a new customer ?
> 
> Wouldn't it be better to model your table so that it supports 
> addition/removal of customers?
> 
> 
> 
> On Fri, Jun 13, 2014 at 9:00 PM, Mark Greene <green...@gmail.com> wrote:
> Thanks DuyHai,
> 
> I have a follow up question to #2. You mentioned ideally I would create a new 
> table instead of mutating an existing one. 
> 
> This strikes me as bad practice in the world of multi tenant systems. I don't 
> want to create a table per customer. So I'm wondering if dynamically 
> modifying the table is an accepted practice?
> 
> --
> about.me
> 
> 
> On Fri, Jun 13, 2014 at 2:54 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
> Hello Mark
> 
>  Dynamic columns, as you said, are perfectly supported by CQL3 via clustering 
> columns. And no, using collections for storing dynamic data is a very bad 
> idea if the cardinality is very high (>> 1000 elements).
> 
> 1)  Is using Thrift a valid approach in the era of CQL?  --> Less and less. 
> Unless you are looking for extreme performance, you'd be better off choosing 
> CQL3. The ease of programming and querying with CQL3 is worth the small 
> overhead in CPU.
> 
> 2) If CQL is the best practice,  should I alter the schema at runtime when I 
> detect I need to do a schema mutation?  --> Ideally you should not alter 
> the schema but create a new table to adapt to your changing requirements. 
> 
> 3) If I utilize CQL collections, will Cassandra page the entire thing into 
> the heap?  --> Of course. All collections and maps in Cassandra are eagerly 
> loaded entirely in memory on the server side. That's why it is recommended to 
> limit their cardinality to ~ 1000 elements.
> 
> 
> 
> 
> On Fri, Jun 13, 2014 at 8:33 PM, Mark Greene <green...@gmail.com> wrote:
> I'm looking for some best practices w/r/t supporting arbitrary columns. It 
> seems from the docs I've read around CQL that they are supported in some 
> capacity via collections but you can't exceed 64K in size. For my 
> requirements that would cause problems. 
> 
> So my questions are:
> 
> 1)  Is using Thrift a valid approach in the era of CQL? 
> 
> 2) If CQL is the best practice,  should I alter the schema at runtime when I 
> detect I need to do a schema mutation?
> 
> 3) If I utilize CQL collections, will Cassandra page the entire thing into 
> the heap?
> 
> My data model is akin to a CRM, arbitrary column definitions per customer.
> 
> 
> Cheers,
> Mark
> 
