[ 
https://issues.apache.org/jira/browse/KUDU-2258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124355#comment-17124355
 ] 

Andrew Wong commented on KUDU-2258:
-----------------------------------

It's worth noting that 
[Todd's blog post|https://blog.cloudera.com/benchmarking-time-series-workloads-on-apache-kudu-using-tsbs/]
links to his benchmarking scripts:

https://github.com/toddlipcon/kudu-tsbs

It's probably worth isolating the underlying Kudu queries, taking the ts 
daemons out of the path, and profiling those queries further.

> Create timeseries workload integration test
> -------------------------------------------
>
>                 Key: KUDU-2258
>                 URL: https://issues.apache.org/jira/browse/KUDU-2258
>             Project: Kudu
>          Issue Type: Test
>          Components: test
>            Reporter: Dan Burkert
>            Priority: Major
>
> A common use case for Kudu is storing timeseries data sets.  Right now we 
> don't have a good integration test simulating these workloads.  Ideally such 
> an integration test would serve as a good starting point for investigating 
> and reproducing performance issues with timeseries workloads.
> The timeseries workloads we've seen usually have these characteristics:
> - Hash partitioning over 1 or 2 series id columns, which are often a UUID or 
> similar pseudo-random ID.
> - Very high cardinality over the ID column(s), in the ballpark of tens or 
> hundreds of millions.
> - Range partitioning over a timestamp column, although it may be sufficient 
> to only simulate a single time range for an integration test.
> - The test should probably be flexible with the data column types and count; 
> there is no 'common' case here.
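
The characteristics listed above could be sketched as a small synthetic row generator. This is only an illustration, not part of the issue or of Kudu: the function names (`make_series_ids`, `hash_bucket`, `generate_rows`), the column names, and the supported type strings are all hypothetical, and the CRC32-based bucket function is a stand-in rather than Kudu's actual hash partitioning. It does not talk to a Kudu cluster; it only produces rows that a test could insert.

```python
import random
import uuid
import zlib
from datetime import datetime, timedelta, timezone


def make_series_ids(num_series, seed=0):
    """Return pseudo-random UUID strings to use as series IDs (deterministic per seed)."""
    rng = random.Random(seed)
    return [str(uuid.UUID(int=rng.getrandbits(128), version=4))
            for _ in range(num_series)]


def hash_bucket(series_id, num_buckets):
    """Map a series ID to a bucket. Stand-in only: NOT Kudu's real hash function."""
    return zlib.crc32(series_id.encode("utf-8")) % num_buckets


def generate_rows(series_ids, start, points_per_series, interval,
                  data_columns, seed=0):
    """Yield one row per (timestamp, series) pair within a single time range.

    data_columns maps a column name to a hypothetical type string
    ("double", "int64", or "string"), so the column types and count
    stay configurable.
    """
    rng = random.Random(seed)
    for i in range(points_per_series):
        ts = start + i * interval
        for sid in series_ids:
            row = {"series_id": sid, "ts": ts}
            for name, col_type in data_columns.items():
                if col_type == "double":
                    row[name] = rng.random()
                elif col_type == "int64":
                    row[name] = rng.randrange(0, 2 ** 31)
                elif col_type == "string":
                    row[name] = "%08x" % rng.getrandbits(32)
                else:
                    raise ValueError("unsupported column type: %s" % col_type)
            yield row


if __name__ == "__main__":
    ids = make_series_ids(1000)
    start = datetime(2020, 1, 1, tzinfo=timezone.utc)
    columns = {"usage_user": "double", "hostname": "string"}
    for row in generate_rows(ids[:2], start, 2, timedelta(seconds=10), columns):
        print(hash_bucket(row["series_id"], 8), row)
```

A real integration test would need far more series than this to reach the cardinality described above, and would write the rows through a Kudu client into a table that is hash partitioned on the ID column(s) and range partitioned on the timestamp.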



