[GitHub] [spark] c21 commented on pull request #29079: [SPARK-32286][SQL] Coalesce bucketed table for shuffled hash join if applicable

2020-07-21 Thread GitBox


c21 commented on pull request #29079:
URL: https://github.com/apache/spark/pull/29079#issuecomment-662165418


   Thank you all @maropu, @cloud-fan, @viirya and @imback82 for review!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] c21 commented on pull request #29079: [SPARK-32286][SQL] Coalesce bucketed table for shuffled hash join if applicable

2020-07-20 Thread GitBox


c21 commented on pull request #29079:
URL: https://github.com/apache/spark/pull/29079#issuecomment-661099329


   @cloud-fan - all comments are addressed, wondering is there any other things 
needed for this PR? Thanks.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] c21 commented on pull request #29079: [SPARK-32286][SQL] Coalesce bucketed table for shuffled hash join if applicable

2020-07-19 Thread GitBox


c21 commented on pull request #29079:
URL: https://github.com/apache/spark/pull/29079#issuecomment-660599800


   retest this please



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] c21 commented on pull request #29079: [SPARK-32286][SQL] Coalesce bucketed table for shuffled hash join if applicable

2020-07-18 Thread GitBox


c21 commented on pull request #29079:
URL: https://github.com/apache/spark/pull/29079#issuecomment-660590187


   Addressed all comments and rebased to latest master. Thanks. cc @maropu, 
@cloud-fan and @viirya.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] c21 commented on pull request #29079: [SPARK-32286][SQL] Coalesce bucketed table for shuffled hash join if applicable

2020-07-17 Thread GitBox


c21 commented on pull request #29079:
URL: https://github.com/apache/spark/pull/29079#issuecomment-660428967


   retest this please



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] c21 commented on pull request #29079: [SPARK-32286][SQL] Coalesce bucketed table for shuffled hash join if applicable

2020-07-16 Thread GitBox


c21 commented on pull request #29079:
URL: https://github.com/apache/spark/pull/29079#issuecomment-659875108


   Addressed all comments besides the only one that - I am still keeping two 
ratio configs separately (SMJ and SHJ). Let me know if I need to change this. 
cc @maropu and @viirya, thanks.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] c21 commented on pull request #29079: [SPARK-32286][SQL] Coalesce bucketed table for shuffled hash join if applicable

2020-07-12 Thread GitBox


c21 commented on pull request #29079:
URL: https://github.com/apache/spark/pull/29079#issuecomment-657353621


   > Could you show us performance numbers in the PR description, first? I 
think we need to check the trade-off between #parallelism and shuffle I/O.
   
   @maropu, update PR description for one test query in TPCDS (with 
modification). Thanks.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] c21 commented on pull request #29079: [SPARK-32286][SQL] Coalesce bucketed table for shuffled hash join if applicable

2020-07-12 Thread GitBox


c21 commented on pull request #29079:
URL: https://github.com/apache/spark/pull/29079#issuecomment-657304351


   @viirya, I see your point for coalescing reduces parallelism to cause more 
OOM on build side. I agree this can happen. All in all, this is a 
disable-by-default feature, and user can selectively enable it depending on 
their table size. But I think it's worth to have as it indeed helped our users 
in production for using shuffled hash join on bucketed tables.
   
   Re OOM issue in shuffled hash join - I think we can add a fallback mechanism 
when building hash map and fall back to sort merge join if the size of hash map 
being too big to OOM (i.e. rethink 
https://issues.apache.org/jira/browse/SPARK-21505), we have been running this 
feature in production for years, and it works well.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] c21 commented on pull request #29079: [SPARK-32286][SQL] Coalesce bucketed table for shuffled hash join if applicable

2020-07-12 Thread GitBox


c21 commented on pull request #29079:
URL: https://github.com/apache/spark/pull/29079#issuecomment-657300530


   > We build hash map for each bucket on other side and it also sounds to OOM 
easily. This feature is disabled by a config by default, so it may be okay. But 
we should be careful not to enable it by default later.
   
   @viirya, thanks for comment. I agree this feature should be selectively 
enabled, but sorry I don't see OOM has anything to do with this feature.
   
   You are saying OOM is an issue for shuffled hash join on bucketed table, 
which I agree. This feature is coalescing on stream side (not touch build side 
at all), so I don't think it's adding any more risk for OOM on build side. As 
[sort merge join is by default preferred over shuffled hash 
join](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L335),
 so when users enable shuffled hash join by config explicitly, they should 
already pay attention to OOM problem.
   
   Am I miss anything? Thanks.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] c21 commented on pull request #29079: [SPARK-32286][SQL] Coalesce bucketed table for shuffled hash join if applicable

2020-07-12 Thread GitBox


c21 commented on pull request #29079:
URL: https://github.com/apache/spark/pull/29079#issuecomment-657288265


   @maropu, @cloud-fan @gatorsmile @sameeragarwal Could you help check this PR? 
Thanks.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org