Re: Performance problems of using spark SQL to read kudu data!

2020-11-29 Thread Grant Henke
By default Kudu can only parallelize up to the number of partitions being scanned. In this case, that is the 6 hash partitions in the "2020-11-01" <= VALUES < "2020-12-01" range. We do have a feature to split the partitions scanned into smaller tokens. You can enable this by setting `kudu.splitSizeBytes`.
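
[Editor's note: a minimal sketch of passing that option through the Spark DataFrame reader. The master address, table name, and split size are placeholders; this assumes the standard kudu-spark data source registered under the "kudu" format name.]

```scala
import org.apache.spark.sql.SparkSession

object KuduSplitSizeExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kudu-split-size").getOrCreate()

    // Ask the Kudu client to split each partition's scan into tokens of
    // roughly 128 MiB, so Spark gets more input splits than the 6 hash
    // partitions alone would provide.
    val df = spark.read
      .format("kudu")
      .option("kudu.master", "kudu-master-1:7051")  // placeholder masters
      .option("kudu.table", "impala::db.events")    // placeholder table
      .option("kudu.splitSizeBytes", (128L * 1024 * 1024).toString)
      .load()

    df.createOrReplaceTempView("events")
    spark.sql("SELECT COUNT(*) FROM events").show()
  }
}
```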

Re: Performance problems of using spark SQL to read kudu data!

2020-11-29 Thread Andrew Wong
Hello! Starting in Kudu 1.10, you should be able to supply 'splitSizeBytes' as a KuduReadOption in Spark, allowing you to generate Kudu scan tokens that operate on smaller chunks of data.
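
[Editor's note: a hedged sketch of what supplying that read option through the RDD API might look like. The master address, table name, and column projection are placeholders, and the `KuduReadOptions`/`kuduRDD` signatures are assumed from the kudu-spark integration this message refers to.]

```scala
import org.apache.spark.sql.SparkSession
import org.apache.kudu.spark.kudu.{KuduContext, KuduReadOptions}

object KuduReadOptionsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kudu-read-options").getOrCreate()
    val kuduContext = new KuduContext("kudu-master-1:7051", spark.sparkContext)

    // Request scan tokens of roughly 64 MiB each. The client splits each
    // hash/range partition into several tokens, and each token becomes one
    // Spark partition, raising the achievable read parallelism.
    val readOptions = KuduReadOptions(splitSizeBytes = Some(64L * 1024 * 1024))

    val rdd = kuduContext.kuduRDD(
      spark.sparkContext,
      "impala::db.events",            // placeholder table name
      Seq("event_id", "event_time"),  // placeholder column projection
      readOptions)

    println(s"Input partitions: ${rdd.getNumPartitions}")
  }
}
```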