quenlang commented on issue #8061: Native parallel batch indexing with shuffle
URL: https://github.com/apache/incubator-druid/issues/8061#issuecomment-544903870

@jihoonson Thanks for the quick reply!

> I'm pretty surprised by this and wondering how big the performance gain was in your case. Sadly, Druid doesn't support segment pruning in brokers for hash-based partitioning for now (this is supported only for single-dimension range partitioning). That means, even though your segments are partitioned based on the hash value of `tenant_id`, the broker will send queries to all historicals having any segments overlapping with the query interval no matter what their hash value is. I guess, perhaps you could see some performance improvement when you filter on `tenant_id` maybe because of less branch misprediction. Can you share your performance benchmark result if you can?

I did not get as big a performance gain as I expected. For a small tenant, the query latency is reduced by only 50ms-100ms, but for the big tenant, the latency increases by 10s-30s. I think this is caused by data skew from hash partitioning on `tenant_id`: the biggest tenant is in an 18GB segment.

> That means, even though your segments are partitioned based on the hash value of tenant_id, the broker will send queries to all historicals having any segments overlapping with the query interval no matter what their hash value is.

Even though the broker sends queries to all historicals, only one historical node has the tenant's data. So I think the data skew is the root cause of the high latency.

> Ah, the partitionsSpec you used is the hash-based partitioning. To use the range partitioning, the type of the partitionsSpec should be single_dim instead of hashed. This single-dimension range partitioning is supported only by the hadoop task for now and I believe the native parallel indexing task will support it in the next release.

Do you mean that Druid 0.17.0 will support single-dimension range partitioning in native parallel indexing tasks? Also, if a range of `tenant_id` values contains a big tenant, how can the future native parallel indexing task avoid segment size skew with single-dimension range partitioning?

Thanks a lot!
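To make the skew concern concrete, here is a minimal sketch of why hashing on `tenant_id` alone can produce one oversized partition. It is not Druid code: the hash function (`zlib.crc32`), the shard count, the tenant names, and the row counts are all assumptions chosen for illustration, and Druid's actual hashing differs. The point is only that every row of a tenant lands in the same bucket, so one very large tenant dominates whichever bucket it falls into.

```python
import zlib


def hash_partition(tenant_id, num_shards):
    """Map a tenant to a shard. crc32 is a stand-in, not Druid's hash."""
    return zlib.crc32(tenant_id.encode("utf-8")) % num_shards


def rows_per_shard(tenant_rows, num_shards):
    """Sum row counts per shard when partitioning only on tenant_id."""
    totals = [0] * num_shards
    for tenant_id, rows in tenant_rows.items():
        totals[hash_partition(tenant_id, num_shards)] += rows
    return totals


# Hypothetical workload: 99 small tenants and one very large tenant.
NUM_SHARDS = 8
tenant_rows = {f"tenant_{i}": 1_000 for i in range(99)}
tenant_rows["big_tenant"] = 1_000_000

totals = rows_per_shard(tenant_rows, NUM_SHARDS)
big_shard = hash_partition("big_tenant", NUM_SHARDS)

for shard, rows in enumerate(totals):
    marker = "  <- holds big_tenant" if shard == big_shard else ""
    print(f"shard {shard}: {rows} rows{marker}")
```

In this toy setup, the shard that receives `big_tenant` holds at least 1,000,000 rows while all the small tenants combined contribute only 99,000, so increasing the shard count does not shrink the largest shard.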
