Folks,
I have two alternatives for my time series schema, and wanted to get your
input on which one to pick.

The query is: given an id and a timestamp, read the metrics associated with that id.

Records are inserted every 5 minutes, and the number of ids is 2 million,
so every 5 minutes 2 million records are written.

Bucket range: 0 - 5K.

Schema 1 )

create table metrics (   -- table name was missing; "metrics" is a placeholder
  id timeuuid,
  bucketid int,
  date date,
  timestamp timestamp,
  metricName1 bigint,
  metricName2 bigint,
  ...
  metricName300 bigint,
  primary key ((date, bucketid), id, timestamp)
);

BucketId is just a murmur3 hash of the id, which acts as a splitter to
group ids into partitions.
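As a sketch of that split (assuming "0 - 5K" means 5,000 buckets; md5 stands in for murmur3 here only to keep the snippet dependency-free -- in production you'd use a real murmur3, e.g. the mmh3 package or the driver's token function):

```python
import hashlib
import uuid

NUM_BUCKETS = 5000  # "Bucket range: 0 - 5K" from above

def bucket_id(series_id: uuid.UUID) -> int:
    # Any stable hash of the id works as a splitter; md5 is used here
    # only so the sketch is self-contained (Cassandra itself uses murmur3).
    digest = hashlib.md5(series_id.bytes).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS

some_id = uuid.uuid1()        # a timeuuid
print(bucket_id(some_id))     # a stable value in [0, 5000)
```

The same computation has to happen on every read path, which is exactly what a BI tool can't do on its own.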


Pros : -

Efficient write performance, since data is written to a minimal number of partitions.

Cons : -

The first schema works well when queried programmatically, but it is a
bit inflexible if it has to be integrated with third-party BI tools like
Tableau: the bucketid cannot be generated from Tableau, as it's not part
of the view, etc.


Schema 2 )
Same as above, but without bucketid and date.

Primary Key (id, timestamp)

Pros : -

BI tools don't need to generate bucket-id lookups.

Cons :-
Too many partitions are written every 5 mins: 2 million records are
written into 2 million distinct partitions.
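Back-of-envelope arithmetic on the two layouts, using the figures above:

```python
IDS = 2_000_000       # distinct ids
INTERVAL_MIN = 5      # one record per id every 5 minutes
BUCKETS = 5000        # bucket range 0 - 5K

rows_per_day = IDS * (24 * 60 // INTERVAL_MIN)   # 288 writes/id/day
partitions_schema1 = BUCKETS                     # (date, bucketid): 5,000 per day
partitions_schema2 = IDS                         # id alone: 2,000,000 total

print(rows_per_day)          # 576,000,000 rows per day
print(partitions_schema1)    # 5,000
print(partitions_schema2)    # 2,000,000
```

Note also that with no date in Schema 2's partition key, each of those 2 million partitions grows without bound over time, rather than rolling over daily.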



I believe writing this data to the commit log is the same for Schema 1
and Schema 2, but the actual performance bottleneck could be compaction,
since data from the memtable is flushed to SSTables frequently, depending
on the memory settings, and every SSTable maintains a partition index
with byte offsets.

I wanted to gauge how bad the performance of Schema 2 can get with
respect to writes/compaction having to do many disk seeks.

Can compacting many SSTables, each carrying a huge number of
partition-index entries due to the high partition count, become a
bottleneck?
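To put a rough number on the index overhead: assuming (my assumption, not a measured figure) on the order of ~48 bytes per partition-index entry (16-byte timeuuid key plus offset and bookkeeping), and a flush that touches every id:

```python
BYTES_PER_INDEX_ENTRY = 48   # assumption: 16-byte timeuuid key + offset + overhead

# One flush covering all ids writes one index entry per partition touched:
entries_schema1 = 5000        # one per (date, bucketid)
entries_schema2 = 2_000_000   # one per id

index_bytes_schema1 = entries_schema1 * BYTES_PER_INDEX_ENTRY   # 240 KB
index_bytes_schema2 = entries_schema2 * BYTES_PER_INDEX_ENTRY   # 96 MB

print(index_bytes_schema2 // index_bytes_schema1)   # 400x more index to merge
```

Whatever the exact per-entry size, the ratio is fixed by the partition counts: every compaction under Schema 2 has roughly 400x more index entries to read, merge, and rewrite.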

Any in-depth performance explanation of Schema 2 would be very much appreciated.


Thanks,
