cassandra database design

2016-08-31 Thread Stone Fang
Hi all,
have some questions on how to define clustering key.

have a table like this

CREATE TABLE datacenter{

datacentername varchar,

publish timestamp,

value varchar,

PRIMARY KEY(datacentername,publish)

}


*issues:*
there are only two datacenter,so the data would only have two
partitions.and store
in two nodes.want to spread the data evenly around the cluster.

take this post for reference
http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling

CREATE TABLE datacenter{

datacentername varchar,

publish_pre text,

publish timestamp,

value varchar,

PRIMARY KEY((datacentername,publish_pre),publish)

}

publish_pre is from 1~12 hours.*but the workload is high.i dont want to all
workload inserted into one node in a hour.*

have no idea on how to define the partition key to spread data evenly
around the cluster,and the partition not split by time.which means that
data should not be inserted one node at a certain time window.

thanks
stone


Re: cassandra database design

2016-08-31 Thread Carlos Alonso
Maybe a good question could be:

Which is your access pattern to this data?

Carlos Alonso | Software Engineer | @calonso 

On 31 August 2016 at 11:47, Stone Fang  wrote:

> Hi all,
> have some questions on how to define clustering key.
>
> have a table like this
>
> CREATE TABLE datacenter{
>
> datacentername varchar,
>
> publish timestamp,
>
> value varchar,
>
> PRIMARY KEY(datacentername,publish)
>
> }
>
>
> *issues:*
> there are only two datacenter,so the data would only have two
> partitions.and store
> in two nodes.want to spread the data evenly around the cluster.
>
> take this post for reference
> http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling
>
> CREATE TABLE datacenter{
>
> datacentername varchar,
>
> publish_pre text,
>
> publish timestamp,
>
> value varchar,
>
> PRIMARY KEY((datacentername,publish_pre),publish)
>
> }
>
> publish_pre is from 1~12 hours.*but the workload is high.i dont want to
> all workload inserted into one node in a hour.*
>
> have no idea on how to define the partition key to spread data evenly
> around the cluster,and the partition not split by time.which means that
> data should not be inserted one node at a certain time window.
>
> thanks
> stone
>


Re: cassandra database design

2016-08-31 Thread Stone Fang
access pattern is

select *from datacenter where datacentername = '' and publish>$time and
publish<$time

On Wed, Aug 31, 2016 at 8:37 PM, Carlos Alonso  wrote:

> Maybe a good question could be:
>
> Which is your access pattern to this data?
>
> Carlos Alonso | Software Engineer | @calonso 
>
> On 31 August 2016 at 11:47, Stone Fang  wrote:
>
>> Hi all,
>> have some questions on how to define clustering key.
>>
>> have a table like this
>>
>> CREATE TABLE datacenter{
>>
>> datacentername varchar,
>>
>> publish timestamp,
>>
>> value varchar,
>>
>> PRIMARY KEY(datacentername,publish)
>>
>> }
>>
>>
>> *issues:*
>> there are only two datacenter,so the data would only have two
>> partitions.and store
>> in two nodes.want to spread the data evenly around the cluster.
>>
>> take this post for reference
>> http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling
>>
>> CREATE TABLE datacenter{
>>
>> datacentername varchar,
>>
>> publish_pre text,
>>
>> publish timestamp,
>>
>> value varchar,
>>
>> PRIMARY KEY((datacentername,publish_pre),publish)
>>
>> }
>>
>> publish_pre is from 1~12 hours.*but the workload is high.i dont want to
>> all workload inserted into one node in a hour.*
>>
>> have no idea on how to define the partition key to spread data evenly
>> around the cluster,and the partition not split by time.which means that
>> data should not be inserted one node at a certain time window.
>>
>> thanks
>> stone
>>
>
>


Re: cassandra database design

2016-09-01 Thread Carlos Alonso
I guess there's no easy solution for this. The bucketing technique you were
applying with the publish_pre extra field making a composite partition key
is probably your best bet but you're right being concerned that all your
workload will hit the same node during an hour.

I'd then suggest adding a higher-cardinality extra field, as the second
number for example. That will spread the load across 60 partitions per
datacenter and, of course, when querying, you'll have to try all
partitions. It won't scale forever as the partitions will grow big at some
point, but if your data is small enough it may work.

Another suggestion would be to use the number of seconds since epoch as the
extra partitioning key. Instead of querying a range you'll have to issue
single partition queries for each second within your range. This solution
may suppose heavier read workloads but will definitely scale as the size of
the partitions shouldn't be an issue.

I don't think this suggestions will match your requirements off the shelf,
but hopefully will give you an idea on how to trade off between partition
size, number of partitions and read strategy to find the sweet spot for
your use case.

Regards

Carlos Alonso | Software Engineer | @calonso 

On 1 September 2016 at 02:58, Stone Fang  wrote:

> access pattern is
>
> select *from datacenter where datacentername = '' and publish>$time and
> publish<$time
>
> On Wed, Aug 31, 2016 at 8:37 PM, Carlos Alonso  wrote:
>
>> Maybe a good question could be:
>>
>> Which is your access pattern to this data?
>>
>> Carlos Alonso | Software Engineer | @calonso
>> 
>>
>> On 31 August 2016 at 11:47, Stone Fang  wrote:
>>
>>> Hi all,
>>> have some questions on how to define clustering key.
>>>
>>> have a table like this
>>>
>>> CREATE TABLE datacenter{
>>>
>>> datacentername varchar,
>>>
>>> publish timestamp,
>>>
>>> value varchar,
>>>
>>> PRIMARY KEY(datacentername,publish)
>>>
>>> }
>>>
>>>
>>> *issues:*
>>> there are only two datacenter,so the data would only have two
>>> partitions.and store
>>> in two nodes.want to spread the data evenly around the cluster.
>>>
>>> take this post for reference
>>> http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling
>>>
>>> CREATE TABLE datacenter{
>>>
>>> datacentername varchar,
>>>
>>> publish_pre text,
>>>
>>> publish timestamp,
>>>
>>> value varchar,
>>>
>>> PRIMARY KEY((datacentername,publish_pre),publish)
>>>
>>> }
>>>
>>> publish_pre is from 1~12 hours.*but the workload is high.i dont want to
>>> all workload inserted into one node in a hour.*
>>>
>>> have no idea on how to define the partition key to spread data evenly
>>> around the cluster,and the partition not split by time.which means that
>>> data should not be inserted one node at a certain time window.
>>>
>>> thanks
>>> stone
>>>
>>
>>
>


Re: cassandra database design

2016-09-01 Thread Stone Fang
Thanks,Carlos.
the key point is how to balance the data spread around the cluster and the
partition number of query.
it is hard to determine which is best.anyway,thanks for your suggestion.it
help me a lot.

stone

On Thu, Sep 1, 2016 at 4:54 PM, Carlos Alonso  wrote:

> I guess there's no easy solution for this. The bucketing technique you
> were applying with the publish_pre extra field making a composite partition
> key is probably your best bet but you're right being concerned that all
> your workload will hit the same node during an hour.
>
> I'd then suggest adding a higher-cardinality extra field, as the second
> number for example. That will spread the load across 60 partitions per
> datacenter and, of course, when querying, you'll have to try all
> partitions. It won't scale forever as the partitions will grow big at some
> point, but if your data is small enough it may work.
>
> Another suggestion would be to use the number of seconds since epoch as
> the extra partitioning key. Instead of querying a range you'll have to
> issue single partition queries for each second within your range. This
> solution may suppose heavier read workloads but will definitely scale as
> the size of the partitions shouldn't be an issue.
>
> I don't think this suggestions will match your requirements off the shelf,
> but hopefully will give you an idea on how to trade off between partition
> size, number of partitions and read strategy to find the sweet spot for
> your use case.
>
> Regards
>
> Carlos Alonso | Software Engineer | @calonso 
>
> On 1 September 2016 at 02:58, Stone Fang  wrote:
>
>> access pattern is
>>
>> select *from datacenter where datacentername = '' and publish>$time and
>> publish<$time
>>
>> On Wed, Aug 31, 2016 at 8:37 PM, Carlos Alonso 
>> wrote:
>>
>>> Maybe a good question could be:
>>>
>>> Which is your access pattern to this data?
>>>
>>> Carlos Alonso | Software Engineer | @calonso
>>> 
>>>
>>> On 31 August 2016 at 11:47, Stone Fang  wrote:
>>>
 Hi all,
 have some questions on how to define clustering key.

 have a table like this

 CREATE TABLE datacenter{

 datacentername varchar,

 publish timestamp,

 value varchar,

 PRIMARY KEY(datacentername,publish)

 }


 *issues:*
 there are only two datacenter,so the data would only have two
 partitions.and store
 in two nodes.want to spread the data evenly around the cluster.

 take this post for reference
 http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling

 CREATE TABLE datacenter{

 datacentername varchar,

 publish_pre text,

 publish timestamp,

 value varchar,

 PRIMARY KEY((datacentername,publish_pre),publish)

 }

 publish_pre is from 1~12 hours.*but the workload is high.i dont want
 to all workload inserted into one node in a hour.*

 have no idea on how to define the partition key to spread data evenly
 around the cluster,and the partition not split by time.which means that
 data should not be inserted one node at a certain time window.

 thanks
 stone

>>>
>>>
>>
>