Thanks Affan Syed! :)

On Wed, 30 May 2018 at 11:07 sujeet jog <sujeet....@gmail.com> wrote:
> Thanks Jeff & Jonathan,
>
> On Tue, May 29, 2018 at 10:41 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>
>> I wrote a post on this topic a while ago, might be worth reading over:
>>
>> http://thelastpickle.com/blog/2017/08/02/time-series-data-modeling-massive-scale.html
>>
>> On Tue, May 29, 2018 at 8:02 AM Jeff Jirsa <jji...@gmail.com> wrote:
>>
>> > There's a third option, which is bucketing by time instead of by hash.
>> > It tends to perform quite well if you're using TWCS, as it makes it
>> > quite likely that a read can be served by a single sstable.
>> >
>> > --
>> > Jeff Jirsa
>> >
>> > On May 29, 2018, at 6:49 AM, sujeet jog <sujeet....@gmail.com> wrote:
>> >
>> > Folks,
>> > I have two alternatives for the time series schema I have, and wanted
>> > to weigh in on one of them.
>> >
>> > The query is: given an id and a timestamp, read the metrics associated
>> > with that id.
>> >
>> > Records are inserted every 5 mins, and the number of ids = 2 million,
>> > so every 5 mins 2 million records will be written.
>> >
>> > Bucket range: 0 - 5K.
>> >
>> > Schema 1)
>> >
>> > create table (
>> >     id timeuuid,
>> >     bucketid int,
>> >     date date,
>> >     timestamp timestamp,
>> >     metricName1 bigint,
>> >     metricName2 bigint,
>> >     ...
>> >     metricName300 bigint,
>> >     Primary Key ((date, bucketid), id, timestamp)
>> > )
>> >
>> > bucketid is just a murmur3 hash of the id, which acts as a splitter to
>> > group ids into a partition.
>> >
>> > Pros:
>> > Efficient write performance, since data is written to minimal partitions.
>> >
>> > Cons:
>> > While the first schema works best when queried programmatically, it is
>> > a bit inflexible if it has to be integrated with 3rd-party BI tools like
>> > Tableau: the bucketid cannot be generated from Tableau, as it's not part
>> > of the view, etc.
>> >
>> > Schema 2)
>> > Same as above, without bucketid & date.
>> >
>> > Primary Key (id, timestamp)
>> >
>> > Pros:
>> > BI tools don't need to generate bucketid lookups.
>> >
>> > Cons:
>> > Too many partitions are written every 5 mins; say 2 million records
>> > written across 2 million distinct partitions.
>> >
>> > I believe writing this data to the commit log is the same for Schema 1
>> > and Schema 2, but the actual performance bottleneck could be compaction,
>> > since data from the memtable is flushed to SSTables often, depending on
>> > the memory settings, and the header of every SSTable maintains a
>> > partition index with byte offsets.
>> >
>> > I wanted to gauge how bad the performance of Schema 2 can get with
>> > respect to writes/compaction having to do many disk seeks.
>> >
>> > Compacting many SSTables, each with a large number of partition-index
>> > entries because of the high number of partitions: can this be a
>> > bottleneck?
>> >
>> > Any in-depth performance explanation of Schema 2 would be very much
>> > helpful.
>> >
>> > Thanks,
>>
>> --
>> Jon Haddad
>> http://www.rustyrazorblade.com
>> twitter: rustyrazorblade

--
regards,
Haris
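
P.S. For anyone who finds this thread in the archives, here is a minimal CQL
sketch of Jeff's third option (bucketing by time rather than by hash, paired
with TWCS). The table name metrics_by_day, the daily bucket, and the one-day
compaction window are assumptions for illustration; the thread does not
specify them.

    -- Sketch only: names and window size are assumptions, not from the thread.
    CREATE TABLE metrics_by_day (
        id            timeuuid,
        day           date,       -- time bucket: one partition per id per day
        ts            timestamp,
        metricname1   bigint,
        -- ... metricname2 through metricname299 elided ...
        metricname300 bigint,
        PRIMARY KEY ((id, day), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
      AND compaction = {
          'class': 'TimeWindowCompactionStrategy',
          'compaction_window_unit': 'DAYS',
          'compaction_window_size': 1
      };

    -- A read for one id and one day hits a single partition, and once that
    -- day's window has compacted it typically lives in a single SSTable.
    SELECT * FROM metrics_by_day
     WHERE id = ? AND day = '2018-05-29'
       AND ts >= '2018-05-29 00:00:00' AND ts < '2018-05-29 01:00:00';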
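
And to make the trade-off between the two schemas in the original question
concrete, the read paths would look roughly like this. The table names
metrics_bucketed and metrics_flat are hypothetical stand-ins for Schema 1 and
Schema 2; neither appears in the thread.

    -- Schema 1: the client must first compute bucketid (a murmur3 hash of the
    -- id reduced to the 0-5K range), which a BI tool such as Tableau cannot
    -- derive on its own from id and timestamp alone.
    SELECT metricname1, metricname2
      FROM metrics_bucketed
     WHERE date = '2018-05-29' AND bucketid = 1234
       AND id = ? AND timestamp >= '2018-05-29 00:00:00';

    -- Schema 2: only id and timestamp are needed, so BI tools can query it
    -- directly, at the cost of spreading every 5-minute write cycle across
    -- roughly 2 million tiny partitions.
    SELECT metricname1, metricname2
      FROM metrics_flat
     WHERE id = ? AND timestamp >= '2018-05-29 00:00:00';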