[GitHub] spark pull request: SPARK-1380: Add sort-merge based cogroup/joins...

2014-09-23 Thread nchammas
Github user nchammas commented on the pull request:

https://github.com/apache/spark/pull/283#issuecomment-56483665
  
@pwendell @rxin @mateiz What is the status of this PR? It looks pretty 
substantial, but it hasn't been updated in a while.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-1380: Add sort-merge based cogroup/joins...

2014-11-09 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/283#issuecomment-62332367
  
I'd suggest we close this issue for now and go to the JIRA to discuss 
whether the feature is needed and how high of a priority it is.





[GitHub] spark pull request: SPARK-1380: Add sort-merge based cogroup/joins...

2014-11-09 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/283





[GitHub] spark pull request: SPARK-1380: Add sort-merge based cogroup/joins...

2014-04-01 Thread ueshin
GitHub user ueshin opened a pull request:

https://github.com/apache/spark/pull/283

SPARK-1380: Add sort-merge based cogroup/joins.

I've implemented cogroup and joins based on the sort-merge algorithm.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ueshin/apache-spark issues/SPARK-1380

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/283.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #283


commit 1c8ba5a0d480f816a0c217618b40bb615474963d
Author: Takuya UESHIN 
Date:   2014-03-19T10:28:26Z

Add sort-merge cogroup/joins.

commit 99751661fcc7632a0f82816bbaca07bf822d3663
Author: Takuya UESHIN 
Date:   2014-03-25T10:15:09Z

Add Java APIs for sort-merge cogroup/joins.






[GitHub] spark pull request: SPARK-1380: Add sort-merge based cogroup/joins...

2014-04-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/283#issuecomment-39182558
  
 Merged build triggered. 




[GitHub] spark pull request: SPARK-1380: Add sort-merge based cogroup/joins...

2014-04-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/283#issuecomment-39182575
  
Merged build started. 




[GitHub] spark pull request: SPARK-1380: Add sort-merge based cogroup/joins...

2014-04-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/283#issuecomment-39187101
  
Merged build finished. All automated tests passed.




[GitHub] spark pull request: SPARK-1380: Add sort-merge based cogroup/joins...

2014-04-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/283#issuecomment-39187102
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13626/




[GitHub] spark pull request: SPARK-1380: Add sort-merge based cogroup/joins...

2014-04-01 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/283#issuecomment-39245668
  
Is there a specific use case you are trying to address that cannot be 
handled by the hash join?




[GitHub] spark pull request: SPARK-1380: Add sort-merge based cogroup/joins...

2014-04-01 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/283#issuecomment-39286182
  
I have not done a detailed review, but this looks pretty expensive in terms 
of memory.
Is it assuming a lack of skew with respect to a key, and that the data in 
each partition can be held entirely in memory?
It would be good to document the constraints of the solution.




[GitHub] spark pull request: SPARK-1380: Add sort-merge based cogroup/joins...

2014-04-02 Thread ueshin
Github user ueshin commented on the pull request:

https://github.com/apache/spark/pull/283#issuecomment-39417683
  
@rxin Thank you for your reply.

There are some cases where merge join can be used as an optimization:

1. If the data to be joined are already sorted by the join keys, merge join 
can be done more efficiently than hash join. In my test case, both algorithms 
ran at almost the same speed, but merge join was scalable.
2. Merge join on data sorted by the same keys can be pipelined (each 
output can be produced immediately as input tuples arrive), even for an N-way 
join (N>2). Hash join blocks because it must build a hash table before any 
output is produced.

I think it is useful for users to be able to choose how to optimize their 
processing.
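
The pipelining point can be illustrated with a minimal sketch in plain Python. This is not the code from the pull request and does not use any Spark API; the function name `merge_join` and the sample inputs are hypothetical, and both inputs are assumed to be already sorted by key. The generator emits each joined pair as soon as the matching keys arrive, buffering only the right-hand values for the current key.

```python
from itertools import count, groupby, islice
from operator import itemgetter


def merge_join(left, right):
    """Inner-join two iterables of (key, value) pairs, each sorted by key.

    Hypothetical helper for illustration. Yields (key, (left_value,
    right_value)) lazily; only the right-hand values of the current key
    are held in memory.
    """
    left_groups = groupby(left, key=itemgetter(0))
    right_groups = groupby(right, key=itemgetter(0))
    lkey, lgroup = next(left_groups, (None, None))
    rkey, rgroup = next(right_groups, (None, None))
    while lgroup is not None and rgroup is not None:
        if lkey < rkey:
            # Left key has no partner yet: skip ahead on the left side.
            lkey, lgroup = next(left_groups, (None, None))
        elif lkey > rkey:
            # Right key has no partner yet: skip ahead on the right side.
            rkey, rgroup = next(right_groups, (None, None))
        else:
            # Keys match: buffer one side's group and emit the cross product.
            rvalues = [value for _, value in rgroup]
            for _, lvalue in lgroup:
                for rvalue in rvalues:
                    yield lkey, (lvalue, rvalue)
            lkey, lgroup = next(left_groups, (None, None))
            rkey, rgroup = next(right_groups, (None, None))


# Usage: because output is produced incrementally, the join even works on
# unbounded sorted inputs, as long as only a prefix of the result is taken.
evens = ((k, "even") for k in count(0, 2))
threes = ((k, "three") for k in count(0, 3))
first_matches = list(islice(merge_join(evens, threes), 3))
```

A hash join could not be applied to the unbounded inputs above, since it would have to consume one side completely before producing its first output.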




[GitHub] spark pull request: SPARK-1380: Add sort-merge based cogroup/joins...

2014-04-03 Thread ueshin
Github user ueshin commented on the pull request:

https://github.com/apache/spark/pull/283#issuecomment-39421176
  
@mridulm Thank you for your reply.

There are two points I should mention about memory:

- Before shuffle  
If the data are already sorted, no additional memory is needed because no 
sort operation is needed. If they are not sorted, merge join needs some amount 
of memory to sort the data in each partition.
- After shuffle  
Merge join needs at most the same amount of memory as hash join while 
fetching data, but it does not need more than that because it can produce 
output immediately from its input. Hash join needs some more memory to build a 
hash table.
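
The after-shuffle difference, and the skew question raised earlier in the thread, can be made concrete with a toy model in plain Python. It is only a sketch under stated assumptions: it counts buffered values for one already-sorted input side, ignores fetch buffers and any sorting cost, and the function names are hypothetical rather than taken from the pull request or from Spark.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def hash_join_buffered(build_side):
    """Values held once a hash join has built its table: the whole side."""
    table = defaultdict(list)
    for key, value in build_side:
        table[key].append(value)
    return sum(len(values) for values in table.values())


def merge_join_peak_buffered(sorted_side):
    """Peak values a streaming merge join buffers: the largest key group."""
    group_sizes = (
        sum(1 for _ in group)
        for _, group in groupby(sorted_side, key=itemgetter(0))
    )
    return max(group_sizes, default=0)


# Evenly distributed keys: the merge join buffers far less than the hash join.
uniform = [(1, "a"), (2, "b"), (2, "c"), (3, "d")]
uniform_hash = hash_join_buffered(uniform)          # 4 values
uniform_merge = merge_join_peak_buffered(uniform)   # 2 values

# Fully skewed keys: both approaches end up buffering every value.
skewed = [(7, i) for i in range(5)]
skewed_hash = hash_join_buffered(skewed)            # 5 values
skewed_merge = merge_join_peak_buffered(skewed)     # 5 values
```

In this model the merge join's buffer is bounded by the largest group of equal keys, so its advantage shrinks as a single key comes to dominate a partition.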


