quenlang commented on issue #8061: Native parallel batch indexing with shuffle
URL: 
https://github.com/apache/incubator-druid/issues/8061#issuecomment-544903870
 
 
   @jihoonson  Thanks for the quick reply!
   > I'm pretty surprised by this and wondering how big the performance gain 
was in your case. Sadly, Druid doesn't support segment pruning in brokers for 
hash-based partitioning for now (this is supported only for single-dimension 
range partitioning). That means, even though your segments are partitioned 
based on the hash value of `tenant_id`, the broker will send queries to all 
historicals having any segments overlapping with the query interval no matter 
what their hash value is. I guess, perhaps you could see some performance 
improvement when you filter on `tenant_id` maybe because of less branch 
misprediction. Can you share your performance benchmark result if you can?
   
   I did not get as big a performance gain as expected. For small tenants, 
query latency drops by only 50-100 ms, but for the big tenant, latency 
increases by 10-30 s. I think this is caused by data skew under hash 
partitioning on `tenant_id`: the biggest tenant ends up in an 18 GB segment.
   
   > That means, even though your segments are partitioned based on the hash 
value of tenant_id, the broker will send queries to all historicals having any 
segments overlapping with the query interval no matter what their hash value is.
   
   Even though the broker sends queries to all historicals, only one 
historical node actually holds that tenant's data. So I think data skew is the 
root cause of the large latency.
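   To make the skew concern concrete, here is a minimal sketch (not Druid's 
actual hashing or segment logic; the tenant names and row counts are made up): 
with hash partitioning on a single dimension, every row of a given tenant 
hashes to the same partition, so one dominant tenant inflates one partition 
and the segment built from it.

```python
# Illustrative only: simulate hash partitioning on tenant_id and show
# that a dominant tenant lands entirely in one partition.
import hashlib
from collections import Counter

def partition_for(tenant_id: str, num_partitions: int) -> int:
    # Stable hash of the partition dimension (Druid uses its own hash
    # function internally; MD5 is just for a reproducible demo).
    h = int(hashlib.md5(tenant_id.encode("utf-8")).hexdigest(), 16)
    return h % num_partitions

# Hypothetical row counts: one tenant dominates the datasource.
rows_per_tenant = {"tenant_big": 1_000_000, "tenant_a": 1_000, "tenant_b": 1_000}

partition_rows = Counter()
for tenant, rows in rows_per_tenant.items():
    partition_rows[partition_for(tenant, num_partitions=4)] += rows

# The partition holding tenant_big carries ~1M rows; the rest are tiny.
print(dict(partition_rows))
```

The point is that hashing balances *tenants* across partitions, not *rows*, so 
per-tenant skew passes straight through to segment sizes.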
   
   > Ah, the partitionsSpec you used is the hash-based partitioning. To use the 
range partitioning, the type of the partitionsSpec should be single_dim instead 
of hashed. This single-dimension range partitioning is supported only by the 
hadoop task for now and I believe the native parallel indexing task will 
support it in the next release.
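   If I understand correctly, the change would look roughly like the 
following in the Hadoop task's `tuningConfig` (a sketch only; exact field names 
such as `targetRowsPerSegment` may differ by Druid version, so please check the 
docs for your release):

```json
{
  "partitionsSpec": {
    "type": "single_dim",
    "partitionDimension": "tenant_id",
    "targetRowsPerSegment": 5000000
  }
}
```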
   
   Do you mean that Druid 0.17.0 will support single-dimension range 
partitioning in native parallel indexing tasks? 
   Also, if one big tenant dominates a range of `tenant_id` values, how can 
segment-size skew be avoided with single-dimension range partitioning in the 
future native parallel indexing task?
   Thanks a lot!
   
   
