Re: Dynamic Columns in Cassandra 2.X

Laing, Michael Fri, 13 Jun 2014 15:40:24 -0700

Just to add 2 more cents... :)

The CQL3 protocol is asynchronous. This can provide a substantial
throughput increase, according to my benchmarking, when one uses
non-blocking techniques.
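
A minimal sketch of the non-blocking idea, not a benchmark. It assumes nothing about any real driver: `fake_query` is a hypothetical stand-in that sleeps instead of talking to Cassandra, and `run_sequentially` / `run_concurrently` are illustrative names. It only shows that many requests can be in flight at once when the client does not block on each reply.

```python
import asyncio


async def fake_query(i, state):
    """Hypothetical stand-in for one request/response round trip."""
    state["in_flight"] += 1
    state["peak"] = max(state["peak"], state["in_flight"])
    await asyncio.sleep(0.01)  # simulated network latency, no real I/O
    state["in_flight"] -= 1
    return i


async def run_sequentially(n):
    """Blocking style: wait for each reply before sending the next request."""
    state = {"in_flight": 0, "peak": 0}
    results = [await fake_query(i, state) for i in range(n)]
    return results, state["peak"]


async def run_concurrently(n):
    """Non-blocking style: issue all requests, then collect the replies."""
    state = {"in_flight": 0, "peak": 0}
    results = await asyncio.gather(*(fake_query(i, state) for i in range(n)))
    return list(results), state["peak"]
```

With `n` requests, the sequential version never has more than one request outstanding, while the concurrent version has all `n` outstanding together, which is where the throughput gain would come from.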


It is also peer-to-peer. Hence the server can generate events to send to
the client, e.g. schema changes - in general, 'triggers' become possible.

ml


On Fri, Jun 13, 2014 at 6:21 PM, graham sanderson <gra...@vast.com> wrote:

> My 2 cents…
>
> A motivation for CQL3 AFAIK was to make Cassandra more familiar to SQL
> users. This is a valid goal, and works well in many cases.
> Equally there are use cases (that some might find ugly) where Cassandra is
> chosen explicitly because of the sorts of things you can do at the thrift
> level, which aren’t (currently) exposed via CQL3
>
> To Robert’s point earlier - "Rational people should presume that Thrift
> support must eventually disappear"… he is probably right (though frankly
> I’d rather the non-blocking thrift version was added instead). However if
> we do get rid of the thrift interface, then it needs to be at a time that
> CQLn is capable of expressing all the things you could do via the thrift
> API. Note, I need to go look and see if the non-blocking thrift version
> also requires materializing the entire thrift object in memory.
>
> On Jun 13, 2014, at 4:55 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
>
> Every query language has its pros and cons.
>
> As far as I can see, the advantages of Thrift over CQL3 are:
>
>  1) Thrift requires a little less decoding server-side (a difference of
> around 10% in CPU usage).
>
>  2) Thrift uses more "compact" storage because CQL3 needs to add extra
> "marker" columns to guarantee the existence of the primary key. It gets
> worse when you use clustering columns, because each distinct clustering
> group has its own "marker" column.
>
>  That being said, point 1) is not really an issue since most of the time
> nodes are more I/O bound than CPU bound. Only in extreme cases, where you
> have a very high read rate with data that fits entirely in memory, might
> you notice the difference.
>
>  For point 2) this is a small trade-off for having access to a query
> language and being able to do slice queries using the WHERE clause. Some
> like it, others hate it; it's just a question of taste.  Please note that
> the "waste" in disk space is somewhat mitigated by compression.
>
>  Long story short, I think Thrift may have appropriate uses, but only in
> very few use cases. Recently a lot of improvements and features have been
> added to CQL3, so it should be considered the first choice for most
> users; if they fall into those few use cases, they can switch back to Thrift.
>
> My 2 cents
>
>
>
>
>
>
> On Fri, Jun 13, 2014 at 11:43 PM, Peter Lin <wool...@gmail.com> wrote:
>
>>
>> With a text-based query approach like CQL, you lose the type with dynamic
>> columns. Yes, we're storing it as bytes, but it is simpler and easier with
>> Thrift to do these types of things.
>>
>> I like CQL3 and what it does, but text-based query languages make certain
>> dynamic schema use cases painful. Having used and built ORMs, I find they
>> are poorly suited to dynamic schemas. If you've never had to write an ORM to
>> handle dynamic user defined schemas at runtime, it's tough to see where the
>> problems arise and how that makes life painful.
>>
>> Just to be clear, I'm not saying "don't use CQL3" or "CQL3 is bad". I'm
>> saying CQL3 is good for certain kinds of use cases and Thrift is good at
>> certain use cases. People need to look at what and how they're storing data
>> and do what makes the most sense to them. Slavishly following CQL3 doesn't
>> make any sense to me.
>>
>>
>>
>> On Fri, Jun 13, 2014 at 5:30 PM, DuyHai Doan <doanduy...@gmail.com>
>> wrote:
>>
>>> "the validation type is set to bytes, and my code is type safe, so it
>>> knows which serializers to use. Those dynamic columns are driven off the
>>> types in Java."  --> Correct. However, you are still bound by the column
>>> comparator type, which should be fixed (unless again you set it to bytes, in
>>> which case you lose the ordering and sorting feature)
>>>
>>>  Basically what you are doing is telling Cassandra to save data in the
>>> cells as raw bytes; the serialization is taken care of client-side using the
>>> appropriate serializer. This is a perfectly valid strategy.
>>>
>>>  But how is it different from using CQL3, setting the value to "blob"
>>> (equivalent to bytes) and taking care of the serialization client-side as well?
>>> You can even imagine saving the value in JSON format and setting the type to "text".
>>>
>>>  Really, I don't see why CQL3 cannot achieve the scenario you describe.
>>>
>>>  For the record, when you create a table in CQL3 as follows:
>>>
>>>  CREATE TABLE user (
>>>      id bigint PRIMARY KEY,
>>>      firstname text,
>>>      lastname text,
>>>      last_connection timestamp,
>>>      ....);
>>>
>>>  C* will create a column family with validation type = bytes to
>>> accommodate the timestamp and text types for the firstname, lastname and
>>> last_connection columns. Basically the CQL3 engine is doing the
>>> serialization server-side for you
>>>
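
The "blob plus client-side serialization" idea above can be sketched as follows. This is only an illustration under stated assumptions: the one-byte type tags and the `encode_cell` / `decode_cell` helpers are hypothetical, not part of Cassandra or of any driver, and only three value types are handled.

```python
import struct

# Hypothetical one-byte type tags chosen for this sketch only.
TAG_TEXT, TAG_BIGINT, TAG_DOUBLE = 0x01, 0x02, 0x03


def encode_cell(value):
    """Serialize a Python value to bytes suitable for a blob column."""
    if isinstance(value, bool):
        raise TypeError("bool is not handled in this sketch")
    if isinstance(value, str):
        return bytes([TAG_TEXT]) + value.encode("utf-8")
    if isinstance(value, int):
        return bytes([TAG_BIGINT]) + struct.pack(">q", value)  # 8-byte big-endian
    if isinstance(value, float):
        return bytes([TAG_DOUBLE]) + struct.pack(">d", value)
    raise TypeError("unsupported type: %r" % type(value))


def decode_cell(blob):
    """Inverse of encode_cell: read the tag, then decode the payload."""
    tag, payload = blob[0], blob[1:]
    if tag == TAG_TEXT:
        return payload.decode("utf-8")
    if tag == TAG_BIGINT:
        return struct.unpack(">q", payload)[0]
    if tag == TAG_DOUBLE:
        return struct.unpack(">d", payload)[0]
    raise ValueError("unknown type tag: %d" % tag)
```

The server only ever sees opaque bytes, so the type knowledge lives entirely in the client, which is the same situation as a Thrift column family whose validation type is bytes.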
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Jun 13, 2014 at 11:19 PM, Peter Lin <wool...@gmail.com> wrote:
>>>
>>>>
>>>> the validation type is set to bytes, and my code is type safe, so it
>>>> knows which serializers to use. Those dynamic columns are driven off the
>>>> types in Java.
>>>>
>>>> Having said that, CQL3 does have a new custom type feature, but the
>>>> documentation is basically non-existent on how that actually works. One
>>>> could also modify CQL such that insert statements give Cassandra hints
>>>> about what type it is, but I'm not aware of anyone enhancing CQL3 to do
>>>> that.
>>>>
>>>> I realize my kind of use case is a bit unique, but I do know of others
>>>> that are doing similar kinds of things.
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Jun 13, 2014 at 5:11 PM, DuyHai Doan <doanduy...@gmail.com>
>>>> wrote:
>>>>
>>>>> In thrift, when creating a column family, you need to define
>>>>>
>>>>> 1) the row/partition key type
>>>>> 2) the column comparator type
>>>>> 3) the validation type for the actual value (cell in CQL3 terminology)
>>>>>
>>>>> Unless you use the "dynamic composites" feature, which does not exist (and
>>>>> probably won't) in CQL3, I don't see how you can have columns with
>>>>> "different types" on the same row/partition
>>>>>
>>>>>
>>>>> On Fri, Jun 13, 2014 at 11:06 PM, Peter Lin <wool...@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>> when I say dynamic column, I mean non-static columns of different
>>>>>> types within the same row. Some could be an object or one of the defined
>>>>>> datatypes.
>>>>>>
>>>>>> with thrift I use the appropriate serializer to handle these dynamic
>>>>>> columns.
>>>>>>
>>>>>>
>>>>>> On Fri, Jun 13, 2014 at 4:55 PM, DuyHai Doan <doanduy...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Well, before discussing "dynamic columns", we should first define
>>>>>>> the term clearly. What do people mean by "dynamic columns"
>>>>>>> exactly ? Is it the ability to add many columns "of same type" to an
>>>>>>> existing physical row?  If yes then CQL3 does support it with clustering
>>>>>>> columns.
>>>>>>>
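
As a rough mental model of "many columns of the same type in one physical row", here is a toy in-memory sketch. It is not how Cassandra stores data; the `Partition` class and its methods are hypothetical names used only to show cells kept sorted by a clustering key and read back with a range (slice) query.

```python
import bisect


class Partition:
    """Toy model of one wide row: cells sorted by clustering key."""

    def __init__(self):
        self._names = []   # clustering keys, kept in sorted order
        self._values = {}  # clustering key -> cell value

    def put(self, name, value):
        """Add or overwrite a cell; new keys are inserted in sorted position."""
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def slice(self, start, end):
        """Return the cells whose key lies in [start, end], in key order."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, end)
        return [(n, self._values[n]) for n in self._names[lo:hi]]
```

Adding a new "column" is just another `put` with a new key, and no schema change is involved.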
>>>>>>>
>>>>>>> On Fri, Jun 13, 2014 at 10:36 PM, Mark Greene <green...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Yeah I don't anticipate more than 1000 properties, well under in
>>>>>>>> fact. I guess the trade off of using the clustered columns is that I'd
>>>>>>>> have a table that would be tall and skinny which also has its challenges
>>>>>>>> w/r/t memory.
>>>>>>>>
>>>>>>>> I'll look into your suggestion a bit more and consider some others
>>>>>>>> around a hybrid of CQL and Thrift (where necessary). But from a newb's
>>>>>>>> perspective, I sense the community is unsettled around this concept of
>>>>>>>> truly dynamic columns. Coming from an HBase background, it's a
>>>>>>>> consideration I didn't anticipate having to evaluate.
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> about.me <http://about.me/markgreene>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jun 13, 2014 at 4:19 PM, DuyHai Doan <doanduy...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Mark
>>>>>>>>>
>>>>>>>>>  I believe that in your table you want to have some "common"
>>>>>>>>> fields that will be there whoever the customer is, and other fields
>>>>>>>>> that are entirely customer-dependent, is that right?
>>>>>>>>>
>>>>>>>>>  In this case, creating a table with static columns for the common
>>>>>>>>> fields and a clustering column representing all custom fields defined
>>>>>>>>> by a customer could be a solution (see here for static columns:
>>>>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-6561 )
>>>>>>>>>
>>>>>>>>> CREATE TABLE user_data (
>>>>>>>>>    user_id bigint,
>>>>>>>>>    user_firstname text static,
>>>>>>>>>    user_lastname text static,
>>>>>>>>>    ...
>>>>>>>>>    custom_property_name text,
>>>>>>>>>    custom_property_value text,
>>>>>>>>>    PRIMARY KEY(user_id, custom_property_name, custom_property_value));
>>>>>>>>>
>>>>>>>>>  Please note that with this solution you need to have "at least
>>>>>>>>> one" custom property per customer to make it work
>>>>>>>>>
>>>>>>>>>  The only thing to take care of is the type of
>>>>>>>>> custom_property_value. You need to define it once and for all. To
>>>>>>>>> accommodate dynamic types, you can either save the value as blob or
>>>>>>>>> text (as JSON) and take care of the serialization/deserialization
>>>>>>>>> yourself on the client side
>>>>>>>>>
>>>>>>>>>  As an alternative you can save custom properties in a map,
>>>>>>>>> provided that their number is not too large. But considering the
>>>>>>>>> business case of CRM, I believe that it's quite rare that a user has
>>>>>>>>> more than 1000 custom properties, isn't it?
>>>>>>>>>
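
A small sketch of the "text (as JSON)" option for custom_property_value, assuming the user_data layout from this message. The helpers `to_rows` and `from_rows` are hypothetical; they only prepare and read back (user_id, custom_property_name, custom_property_value) tuples and do not talk to Cassandra.

```python
import json


def to_rows(user_id, custom_properties):
    """Turn a dict of custom properties into one row per property.

    Each value is stored as JSON text, so its type survives the round trip.
    """
    return [
        (user_id, name, json.dumps(value))
        for name, value in sorted(custom_properties.items())
    ]


def from_rows(rows):
    """Rebuild the dict of custom properties from the stored rows."""
    return {name: json.loads(value) for _user_id, name, value in rows}
```

For example, `to_rows(1, {"vip": True, "age": 30})` gives one row for `age` and one for `vip`, each holding a JSON string that `from_rows` decodes back to the original Python value.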
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Jun 13, 2014 at 10:03 PM, Mark Greene <green...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> My use case requires the support of arbitrary columns much like a
>>>>>>>>>> CRM. My users can define 'custom' fields within the application.
>>>>>>>>>> Ideally I wouldn't have to change the schema at all, which is why I
>>>>>>>>>> like the old thrift approach rather than the CQL approach.
>>>>>>>>>>
>>>>>>>>>> Having said all that, I'd be willing to adapt my API to make
>>>>>>>>>> explicit schema changes to Cassandra whenever my user makes a change
>>>>>>>>>> to their custom fields if that's an accepted practice.
>>>>>>>>>>
>>>>>>>>>> Ultimately, I'm trying to figure out if the Cassandra community
>>>>>>>>>> intends to support true schemaless use cases in the future.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> about.me <http://about.me/markgreene>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Jun 13, 2014 at 3:47 PM, DuyHai Doan <
>>>>>>>>>> doanduy...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> This strikes me as bad practice in the world of multi tenant
>>>>>>>>>>> systems. I don't want to create a table per customer. So I'm
>>>>>>>>>>> wondering if dynamically modifying the table is an accepted
>>>>>>>>>>> practice?  --> Can you give some details about your use case ? How
>>>>>>>>>>> would you "alter" a table structure to adapt it to a new customer ?
>>>>>>>>>>>
>>>>>>>>>>> Wouldn't it be better to model your table so that it supports
>>>>>>>>>>> addition/removal of customers ?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jun 13, 2014 at 9:00 PM, Mark Greene <green...@gmail.com
>>>>>>>>>>> > wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks DuyHai,
>>>>>>>>>>>>
>>>>>>>>>>>> I have a follow up question to #2. You mentioned ideally I
>>>>>>>>>>>> would create a new table instead of mutating an existing one.
>>>>>>>>>>>>
>>>>>>>>>>>> This strikes me as bad practice in the world of multi tenant
>>>>>>>>>>>> systems. I don't want to create a table per customer. So I'm
>>>>>>>>>>>> wondering if dynamically modifying the table is an accepted practice?
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> about.me <http://about.me/markgreene>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Jun 13, 2014 at 2:54 PM, DuyHai Doan <
>>>>>>>>>>>> doanduy...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello Mark
>>>>>>>>>>>>>
>>>>>>>>>>>>>  Dynamic columns, as you said, are perfectly supported by CQL3
>>>>>>>>>>>>> via clustering columns. And no, using collections for storing
>>>>>>>>>>>>> dynamic data is a very bad idea if the cardinality is very high
>>>>>>>>>>>>> (>> 1000 elements)
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1)  Is using Thrift a valid approach in the era of CQL?  -->
>>>>>>>>>>>>> Less and less. Unless you are looking for extreme performance,
>>>>>>>>>>>>> you'd be better off choosing CQL3. The ease of programming and
>>>>>>>>>>>>> querying with CQL3 is worth the small overhead in CPU
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2) If CQL is the best practice,  should I alter the schema at
>>>>>>>>>>>>> runtime when I detect I need to do a schema mutation?  -->
>>>>>>>>>>>>> Ideally you should not alter the schema but create a new table to
>>>>>>>>>>>>> adapt to your changing requirements.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 3) If I utilize CQL collections, will Cassandra page the
>>>>>>>>>>>>> entire thing into the heap?  --> Of course. All collections and
>>>>>>>>>>>>> maps in Cassandra are eagerly loaded entirely in memory on the
>>>>>>>>>>>>> server side. That's why it is recommended to limit their
>>>>>>>>>>>>> cardinality to ~ 1000 elements
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Jun 13, 2014 at 8:33 PM, Mark Greene <
>>>>>>>>>>>>> green...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm looking for some best practices w/r/t supporting
>>>>>>>>>>>>>> arbitrary columns. It seems from the docs I've read around CQL
>>>>>>>>>>>>>> that they are supported in some capacity via collections but you
>>>>>>>>>>>>>> can't exceed 64K in size. For my requirements that would cause
>>>>>>>>>>>>>> problems.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So my questions are:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1)  Is using Thrift a valid approach in the era of CQL?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2) If CQL is the best practice,  should I alter the schema at
>>>>>>>>>>>>>> runtime when I detect I need to do a schema mutation?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  3) If I utilize CQL collections, will Cassandra page the
>>>>>>>>>>>>>> entire thing into the heap?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> My data model is akin to a CRM, arbitrary column definitions
>>>>>>>>>>>>>> per customer.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Mark
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
