c21 commented on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-672640787


   @cloud-fan, @agrawaldevesh, @maropu and @viirya - updated the PR with the 
latest proposed change. I still need to add unit tests for `BytesToBytesMap` and 
`HashedRelation`, but the added unit test in `JoinSuite` should give us enough 
confidence that it works end-to-end for now. I would like to get feedback first 
before spending more time crafting more unit tests, thanks.
   
   Tested with the same small example benchmark query from the PR description, 
and I am still seeing a 30% wall clock time improvement compared to sort merge 
join (I agree this is very much a toy benchmark query, but it should give us some 
confidence that we are not doing anything badly wrong here in terms of performance):
   
   ```
   Running benchmark: shuffle hash join
     Running case: shuffle hash join off
     Stopped after 2 iterations, 16602 ms
     Running case: shuffle hash join on
     Stopped after 5 iterations, 31911 ms
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.4
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   shuffle hash join:                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   shuffle hash join off                              7900           8301         567          2.1         470.9       1.0X
   shuffle hash join on                               6250           6382          95          2.7         372.5       1.3X
   ```
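
   As a sanity check on the numbers above, here is a minimal sketch that recomputes the speedup from the reported times. The figures are copied from the table; the variable names and the `speedup` helper are hypothetical, and I am assuming the Relative column is derived from Best Time:

```python
# Timings copied from the benchmark table above, in milliseconds.
best_off_ms = 7900  # "shuffle hash join off" (sort merge join), Best Time
best_on_ms = 6250   # "shuffle hash join on", Best Time
avg_off_ms = 8301   # "shuffle hash join off", Avg Time
avg_on_ms = 6382    # "shuffle hash join on", Avg Time


def speedup(baseline_ms, candidate_ms):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


best_speedup = speedup(best_off_ms, best_on_ms)
avg_speedup = speedup(avg_off_ms, avg_on_ms)
# Fraction of wall clock time saved, based on Best Time.
time_saved = 1 - best_on_ms / best_off_ms

print(f"best-time speedup: {best_speedup:.2f}x")
print(f"avg-time speedup:  {avg_speedup:.2f}x")
print(f"wall clock time saved (best time): {time_saved:.1%}")
```

   Under that assumption the best-time ratio rounds to the 1.3X shown in the table, and the average-time ratio is also about 1.3x.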
   
   Also ran the added unit test in `JoinSuite` and verified that all newly added 
logic inside `ShuffledHashJoin` has 100% code coverage (the uncovered lines are 
related to code-gen, which is irrelevant here):
   
   <img width="1664" alt="Screen Shot 2020-08-11 at 11 10 21 PM" src="https://user-images.githubusercontent.com/4629931/89983056-bebd4180-dc2b-11ea-9fe3-cdf06143a002.png">
   
   




