[ https://issues.apache.org/jira/browse/KUDU-2258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124355#comment-17124355 ]
Andrew Wong commented on KUDU-2258:
-----------------------------------

It's worth noting [Todd's blogpost|https://blog.cloudera.com/benchmarking-time-series-workloads-on-apache-kudu-using-tsbs/] links to his benchmarking scripts: https://github.com/toddlipcon/kudu-tsbs

It's probably worth understanding the underlying Kudu queries without the ts daemons and profiling them further.

> Create timeseries workload integration test
> -------------------------------------------
>
>                 Key: KUDU-2258
>                 URL: https://issues.apache.org/jira/browse/KUDU-2258
>             Project: Kudu
>          Issue Type: Test
>          Components: test
>            Reporter: Dan Burkert
>            Priority: Major
>
> A common use case for Kudu is storing timeseries data sets. Right now we
> don't have a good integration test simulating these workloads. Ideally such
> an integration test would serve as a good starting point for investigating
> and reproducing performance issues with timeseries workloads.
> The timeseries workloads we've seen usually have these characteristics:
> - Hash partitioning over 1 or 2 series id columns, which are often a UUID or
> similar pseudo-random ID.
> - Very high cardinality over the ID column(s), in the ballpark of tens or
> hundreds of millions.
> - Range partitioning over a timestamp column, although it may be sufficient
> to only simulate a single time range for an integration test.
> - The test should probably be flexible with the data column types and count;
> there is no 'common' case here.
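As a rough illustration of the workload shape described in the issue, here is a minimal Python sketch of a synthetic row generator. It does not talk to Kudu at all. The names `generate_rows`, `hash_bucket`, and `in_time_range`, the column names `series_id` and `ts`, and the `"double"`/`"int64"` type labels are all hypothetical. CRC32 is used only as a stand-in for hash partitioning and is not the hash function Kudu uses.

```python
import random
import uuid
import zlib
from datetime import datetime, timedelta, timezone


def generate_rows(num_series, points_per_series, start, interval, data_columns, seed=0):
    """Yield dict rows with a pseudo-random series id, a timestamp, and data columns.

    data_columns maps a (hypothetical) column name to "double" or "int64", so the
    number and types of data columns stay configurable.
    """
    rng = random.Random(seed)
    # UUID-shaped series ids drawn from the seeded RNG so runs are reproducible.
    series_ids = [
        str(uuid.UUID(int=rng.getrandbits(128), version=4)) for _ in range(num_series)
    ]
    for i in range(points_per_series):
        ts = start + i * interval
        for sid in series_ids:
            row = {"series_id": sid, "ts": ts}
            for name, kind in data_columns.items():
                row[name] = rng.random() if kind == "double" else rng.randint(0, 1000)
            yield row


def hash_bucket(series_id, num_buckets):
    """Map a series id to a bucket. CRC32 is a stand-in, not Kudu's hash."""
    return zlib.crc32(series_id.encode("utf-8")) % num_buckets


def in_time_range(ts, lower, upper):
    """Return True if ts falls in the half-open time range [lower, upper)."""
    return lower <= ts < upper
```

A test could call `generate_rows` with a small `num_series` for quick runs and a much larger one to approximate the high-cardinality case, and restrict all timestamps to a single range as the issue suggests.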