Re: Time Series schema performance
Thanks Affan Syed! :)

--
regards,
Haris
Re: Time Series schema performance
Thanks Jeff & Jonathan.
Re: Time Series schema performance
Haris,

Like most things in Cassandra, you will need to create the down-sampled, normalized table yourself: either run a cron job over the raw table, or use a streaming solution like Flink/Storm/Spark to extract aggregate values and write them into your down-sample table.

HTH
- Affan

On Tue, May 29, 2018 at 10:24 PM, Haris Altaf wrote:
> Hi All,
> I have a related question. How do you down-sample your timeseries data?
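A minimal CQL sketch of the rollup approach Affan describes (the table and column names here are hypothetical assumptions, not from the thread). The aggregation itself has to run outside Cassandra, in a cron job or a Spark/Flink job, since CQL has no INSERT ... SELECT; the job reads each hour of raw 5-minute rows per id and writes one summary row:

    -- Hypothetical hourly rollup of the 5-minute raw metrics, keyed the
    -- same way as the raw table so reads stay partition-local.
    CREATE TABLE metrics_rollup_1h (
        id          timeuuid,
        bucketid    int,
        date        date,
        hour        timestamp,  -- sample timestamp truncated to the hour
        metric1_avg bigint,
        metric1_max bigint,
        PRIMARY KEY ((date, bucketid), id, hour)
    );

    -- Example row the external job would write after aggregating the
    -- twelve 5-minute samples of one hour for one id:
    INSERT INTO metrics_rollup_1h
        (id, bucketid, date, hour, metric1_avg, metric1_max)
    VALUES
        (13814000-1dd2-11b2-8080-808080808080, 42, '2018-05-29',
         '2018-05-29 06:00:00+0000', 100, 250);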
Re: Time Series schema performance
Hi All,
I have a related question. How do you down-sample your timeseries data?

--
regards,
Haris
Re: Time Series schema performance
I wrote a post on this topic a while ago, might be worth reading over:
http://thelastpickle.com/blog/2017/08/02/time-series-data-modeling-massive-scale.html

--
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade
Re: Time Series schema performance
There's a third option, which is bucketing by time instead of by hash; this tends to perform quite well if you're using TWCS, as it makes it quite likely that a read can be served by a single sstable.

--
Jeff Jirsa

> On May 29, 2018, at 6:49 AM, sujeet jog wrote:
>
> Folks,
> I have two alternatives for the time series schema I have, and wanted to
> weigh in on one of the schemas.
>
> The query is: given an id and a timestamp, read the metrics associated
> with that id.
>
> The records are inserted every 5 mins, and the number of ids = 2 million,
> so every 5 mins 2 million records will be written.
>
> Bucket range: 0 - 5K.
>
> Schema 1)
>
> create table metrics (
>     id timeuuid,
>     bucketid int,
>     date date,
>     timestamp timestamp,
>     metricName1 BigInt,
>     metricName2 BigInt,
>     ...
>     metricName300 BigInt,
>     Primary Key ((date, bucketid), id, timestamp)
> )
>
> bucketid is just a murmur3 hash of the id, which acts as a splitter to
> group ids into a partition.
>
> Pros:
> Efficient write performance, since data is written to a minimal number
> of partitions.
>
> Cons:
> While the first schema works best when queried programmatically, it is a
> bit inflexible if it has to be integrated with 3rd-party BI tools like
> Tableau: the bucketid cannot be generated from Tableau, as it's not part
> of the view, etc.
>
> Schema 2)
> Same as above, without bucketid & date.
>
> Primary Key (id, timestamp)
>
> Pros:
> BI tools don't need to generate bucketid lookups.
>
> Cons:
> Too many partitions are written every 5 mins: 2 million records written
> to 2 million distinct partitions.
>
> I believe writing this data to the commit log is the same for Schema 1
> and Schema 2, but the actual performance bottleneck could be compaction,
> since data from the memtable is flushed to sstables often, depending on
> the memory settings, and the header of every sstable maintains a
> partition index with byte offsets.
>
> I wanted to gauge how bad the performance of Schema 2 can get with
> respect to writes/compaction having to do many disk seeks.
>
> Compacting many sstables, but with too many partition-index entries
> because of the high number of partitions: can this be a bottleneck?
>
> Any in-depth performance explanation of Schema 2 would be very much
> helpful.
>
> Thanks,
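A minimal CQL sketch of the time-bucketed option Jeff describes (the table name, the one-day window, and the metric columns are illustrative assumptions, not from the thread): the partition key carries a time component, so each partition covers a bounded window, and TWCS keeps each window's data in its own sstables:

    -- Hypothetical time-bucketed layout: partition by (id, day), so each
    -- partition holds one day of one series; no murmur3 bucket needed.
    CREATE TABLE metrics_by_day (
        id      timeuuid,
        day     date,        -- time bucket: the day the sample falls in
        ts      timestamp,
        metric1 bigint,
        metric2 bigint,
        PRIMARY KEY ((id, day), ts)
    ) WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': '1'
    };

    -- A read for one id and time range then touches a single partition,
    -- and with TWCS it will usually be served from a single sstable:
    SELECT * FROM metrics_by_day
    WHERE id  = 13814000-1dd2-11b2-8080-808080808080
      AND day = '2018-05-29'
      AND ts >= '2018-05-29 06:00:00+0000'
      AND ts <  '2018-05-29 07:00:00+0000';

Unlike the murmur3 bucketid of Schema 1, the day bucket can be derived directly from the query timestamp, so BI tools like Tableau can compute it without a helper function.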