GitHub user ueshin commented on the pull request:

    https://github.com/apache/spark/pull/283#issuecomment-39417683
  
    @rxin Thank you for your reply.
    
    There are some cases where merge join is useful as an optimization:
    
    1. If the data to be joined is already sorted by the join keys, a merge join 
can be more efficient than a hash join. In my test case, both algorithms ran at 
almost the same speed, but merge join was more scalable.
    2. A merge join over data sorted by the same keys can be pipelined (each 
output tuple can be produced as soon as the matching input tuples arrive), even 
for an N-way join (N>2). A hash join blocks, because it must build a hash table 
before any output is produced.
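    
    To illustrate the pipelining point, here is a minimal sketch of a streaming 
merge join in plain Python. It is not the code from this pull request and does 
not use any Spark API; the function name `merge_join` and the sample rows are 
made up for illustration. It assumes both inputs are iterables of (key, value) 
pairs that are already sorted by key.

```python
from itertools import groupby
from operator import itemgetter

_DONE = object()  # sentinel marking an exhausted input


def merge_join(left, right):
    """Inner-join two iterables of (key, value) pairs sorted by key.

    Illustrative sketch only. Rows are yielded lazily as (key, (lv, rv)),
    still in key order, so the result can feed another merge_join.
    """
    lgroups = groupby(left, key=itemgetter(0))
    rgroups = groupby(right, key=itemgetter(0))
    l = next(lgroups, _DONE)
    r = next(rgroups, _DONE)
    while l is not _DONE and r is not _DONE:
        lkey, lrows = l
        rkey, rrows = r
        if lkey < rkey:
            l = next(lgroups, _DONE)   # left key has no match, skip it
        elif lkey > rkey:
            r = next(rgroups, _DONE)   # right key has no match, skip it
        else:
            # Buffer only the right-side rows for the current key.
            rvals = [v for _, v in rrows]
            for _, lv in lrows:
                for rv in rvals:
                    yield (lkey, (lv, rv))
            l = next(lgroups, _DONE)
            r = next(rgroups, _DONE)


# Hypothetical sample data, each list sorted by key.
a = [(1, "a1"), (2, "a2"), (4, "a4")]
b = [(2, "b2"), (2, "b2x"), (3, "b3"), (4, "b4")]
c = [(2, "c2"), (4, "c4")]

# 3-way join built by chaining two 2-way merge joins.
three_way = merge_join(merge_join(a, b), c)
first = next(three_way)  # produced before the inputs are fully read
print(first)             # (2, (('a2', 'b2'), 'c2'))
```

    Because the function is a generator, the first joined row comes out after 
reading only the rows up to the first matching key, and the only buffering is 
the right-side rows for the current key. A hash join would instead have to read 
one whole input to build its table before emitting anything.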
    
    I think it is useful for users to be able to choose how their processing is optimized.

