Re: Time Series schema performance

2018-05-30 Thread Haris Altaf
Thanks Affan Syed! :)

--
regards,
Haris


Re: Time Series schema performance

2018-05-30 Thread sujeet jog
Thanks Jeff & Jonathan,




Re: Time Series schema performance

2018-05-29 Thread Affan Syed
Haris,

Like all things in Cassandra, you will need to create a down-sampled, normalized
table: either run a cron job over the raw table, or, if you are using a streaming
solution like Flink/Storm/Spark, extract the aggregate values there and write
them into your down-sample table.
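
As a minimal sketch of that approach (the table name, column names, and hourly
granularity below are assumptions for illustration, not something from this
thread), the rollup target might look roughly like this, with the cron or
streaming job computing the aggregates and writing them back with plain INSERTs:

-- Hypothetical hourly rollup table; all names are illustrative.
CREATE TABLE metrics_rollup_1h (
    id          timeuuid,
    date        date,       -- one partition per id per day
    hour_start  timestamp,  -- start of the hour being summarised
    metric1_avg bigint,
    metric1_max bigint,
    PRIMARY KEY ((id, date), hour_start)
) WITH CLUSTERING ORDER BY (hour_start DESC);

-- The down-sampling job writes one row per id per hour:
INSERT INTO metrics_rollup_1h (id, date, hour_start, metric1_avg, metric1_max)
VALUES (50554d6e-29bb-11e5-b345-feff819cdc9f, '2018-05-29',
        '2018-05-29 10:00:00+0000', 42, 97);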

HTH

- Affan

On Tue, May 29, 2018 at 10:24 PM, Haris Altaf  wrote:

> Hi All,
> I have a related question. How do you down-sample your timeseries data?


Re: Time Series schema performance

2018-05-29 Thread Haris Altaf
Hi All,
I have a related question. How do you down-sample your timeseries data?


regards,
Haris



Re: Time Series schema performance

2018-05-29 Thread Jonathan Haddad
I wrote a post on this topic a while ago; it might be worth reading over:
http://thelastpickle.com/blog/2017/08/02/time-series-data-modeling-massive-scale.html

-- 
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Time Series schema performance

2018-05-29 Thread Jeff Jirsa
There’s a third option, which is bucketing by time instead of by hash. This
tends to perform quite well if you’re using TWCS, as it makes it quite likely
that a read can be served by a single SSTable.
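
For illustration (the table name, daily bucket, and window settings below are
assumptions, not something prescribed here), a time-bucketed variant of the
schema with TWCS might look roughly like:

-- Hypothetical layout: one partition per id per day, compacted with TWCS.
CREATE TABLE metrics_by_day (
    id          timeuuid,
    date        date,        -- the time bucket
    timestamp   timestamp,
    metricName1 bigint,
    -- ... remaining metric columns ...
    PRIMARY KEY ((id, date), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC)
  AND compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': 1
  };

-- A read for one id and time range touches a single partition, and with TWCS
-- the data for that window usually lives in one SSTable:
SELECT * FROM metrics_by_day
WHERE id = 50554d6e-29bb-11e5-b345-feff819cdc9f
  AND date = '2018-05-29'
  AND timestamp >= '2018-05-29 00:00:00+0000'
  AND timestamp <  '2018-05-29 01:00:00+0000';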

-- 
Jeff Jirsa


> On May 29, 2018, at 6:49 AM, sujeet jog  wrote:
> 
> Folks,
> I have two alternatives for my time series schema, and wanted to weigh in
> on one of them.
> 
> The query is: given an id and a timestamp, read the metrics associated with that id.
> 
> The records are inserted every 5 mins, and the number of ids = 2 million,
> so every 5 mins there will be 2 million records written.
> 
> Bucket range: 0 - 5K.
> 
> Schema 1)
> 
> create table (
> id timeuuid,
> bucketid Int,
> date date,
> timestamp timestamp,
> metricName1 BigInt,
> metricName2 BigInt,
> ...
> metricName300 BigInt,
> 
> -- partition key uses the date column declared above
> Primary Key ((date, bucketid), id, timestamp)
> )
> 
> BucketId is just a murmur3 hash of the id, which acts as a splitter to
> group ids into partitions.
> 
> 
> Pros:
> 
> Efficient write performance, since data is written to a minimal number of
> partitions.
> 
> Cons:
> 
> The first schema works best when queried programmatically, but it is a bit
> inflexible if it has to be integrated with 3rd-party BI tools like Tableau;
> the bucket id cannot be generated from Tableau as it's not part of the
> view, etc.
> 
> 
> Schema 2)
> Same as above, without bucketid & date.
> 
> Primary Key (id, timestamp)
> 
> Pros:
> 
> BI tools don't need to generate bucket id lookups.
> 
> Cons:
> Too many partitions are written every 5 mins: 2 million records are written
> to 2 million distinct partitions.
> 
> 
> 
> I believe writing this data to the commit log is the same for Schema 1 and
> Schema 2, but the actual performance bottleneck could be compaction, since
> data from the memtable is flushed to SSTables regularly based on the memory
> settings, and the header of every SSTable maintains a partition index with
> byte offsets.
> 
> I wanted to gauge how bad the performance of Schema 2 can get with respect
> to writes/compaction having to do many disk seeks.
> 
> Can compacting many SSTables, each with a large number of partition index
> entries because of the high number of partitions, become a bottleneck?
> 
> Any in-depth performance explanation of Schema 2 would be very much helpful.
> 
> 
> Thanks, 
> 
>
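
For comparison (the table names below are hypothetical, since the original
mail does not name the table), the read path for the two schemas quoted above
would look roughly like this: Schema 1 needs the client to supply the date and
the derived bucket id, while Schema 2 needs only the id and a timestamp range,
which is what makes it friendlier for BI tools.

-- Schema 1: the client must compute bucketid (e.g. murmur3(id) mod 5000)
-- and pass the date explicitly. Table name is illustrative.
SELECT metricName1, metricName2
FROM metrics_schema1
WHERE date = '2018-05-29'
  AND bucketid = 1234
  AND id = 50554d6e-29bb-11e5-b345-feff819cdc9f
  AND timestamp >= '2018-05-29 00:00:00+0000'
  AND timestamp <  '2018-05-29 00:05:00+0000';

-- Schema 2: only the id and time range are needed. Table name is illustrative.
SELECT metricName1, metricName2
FROM metrics_schema2
WHERE id = 50554d6e-29bb-11e5-b345-feff819cdc9f
  AND timestamp >= '2018-05-29 00:00:00+0000'
  AND timestamp <  '2018-05-29 00:05:00+0000';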