[GitHub] spark pull request #21418: Branch 2.2

2018-05-24 Thread gentlewangyu
GitHub user gentlewangyu closed the pull request at:

https://github.com/apache/spark/pull/21418


---




[GitHub] spark pull request #21418: Branch 2.2

2018-05-23 Thread gentlewangyu
GitHub user gentlewangyu opened a pull request:

https://github.com/apache/spark/pull/21418

Branch 2.2

## What changes were proposed in this pull request?

Compiling Spark with Scala 2.10 should use the `-P` parameter instead of `-D`.

## How was this patch tested?


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apache/spark branch-2.2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21418.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21418


commit 9949fed1c45865b6e5e8ebe610789c5fb9546052
Author: Corey Woodfield 
Date:   2017-07-19T22:21:38Z

[SPARK-21333][DOCS] Removed invalid joinTypes from javadoc of 
Dataset#joinWith

## What changes were proposed in this pull request?

Two invalid join types were mistakenly listed in the javadoc for joinWith in the
Dataset class. I presume these were copied from the javadoc of join, but since
joinWith returns a `Dataset<Tuple2<T, U>>`, left_semi and left_anti are invalid,
as they only return values from one of the datasets instead of from both.
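
For readers less familiar with the API, the difference shows up in the return
types alone; a minimal sketch of that contrast (the wrapper class and method name
are hypothetical, and `one`/`two` are assumed to be the same datasets built in the
test code below):

```
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import scala.Tuple2;

public class JoinTypeContrast {
    static void contrast(Dataset<Row> one, Dataset<Row> two) {
        // join flattens both sides into a single Row, so a semi join that keeps
        // only the left side's columns still has a well-defined result type:
        Dataset<Row> semi =
            one.join(two, one.col("x").equalTo(two.col("x")), "left_semi");

        // joinWith pairs one record from each side; a semi/anti join has no
        // right-hand record to put into the pair, so those types cannot apply:
        Dataset<Tuple2<Row, Row>> pairs =
            one.joinWith(two, one.col("x").equalTo(two.col("x")), "inner");

        semi.show();
        pairs.show();
    }
}
```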

## How was this patch tested?

I ran the following code:
```
// Runs alongside the Bean class described below; the usual org.apache.spark.sql
// imports (SparkSession, Dataset, Row) and java.util.Arrays are assumed.
public static void main(String[] args) {
    SparkSession spark = new SparkSession(new SparkContext("local[*]", "Test"));
    Dataset<Row> one = spark.createDataFrame(Arrays.asList(
        new Bean(1), new Bean(2), new Bean(3), new Bean(4), new Bean(5)), Bean.class);
    Dataset<Row> two = spark.createDataFrame(Arrays.asList(
        new Bean(4), new Bean(5), new Bean(6), new Bean(7), new Bean(8), new Bean(9)), Bean.class);

    // Try every documented join type with joinWith and print any failure.
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "inner").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "cross").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "outer").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "full").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "full_outer").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "left").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_outer").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "right").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "right_outer").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_semi").show(); } catch (Exception e) { e.printStackTrace(); }
    try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_anti").show(); } catch (Exception e) { e.printStackTrace(); }
}
```
which tests all the different join types; the last two (left_semi and left_anti)
threw exceptions, while the same code using join instead of joinWith ran fine.
The Bean class was just a Java bean with a single int field, x.
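
For completeness, a minimal sketch of what that bean might look like (the class
itself is not included in the commit message, so the exact shape is assumed from
the description above):

```
import java.io.Serializable;

// Hypothetical reconstruction of the Bean class used in the test:
// a plain Java bean with a single int field `x`.
public class Bean implements Serializable {
    private int x;

    public Bean() {}                    // no-arg constructor for bean encoding
    public Bean(int x) { this.x = x; }  // convenience constructor used above

    public int getX() { return x; }
    public void setX(int x) { this.x = x; }
}
```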

Author: Corey Woodfield 

Closes #18462 from coreywoodfield/master.

(cherry picked from commit 8cd9cdf17a7a4ad6f2eecd7c4b388ca363c20982)
Signed-off-by: gatorsmile 

commit 88dccda393bc79dc6032f71b6acf8eb2b4b152be
Author: Dhruve Ashar 
Date:   2017-07-21T19:03:46Z

[SPARK-21243][CORE] Limit no. of map outputs in a shuffle fetch

For configurations with external shuffle enabled, we have observed that if a
very large number of blocks is fetched from a single remote host, it puts the
NodeManager (NM) under extra pressure and can crash it. This change introduces a
configuration, `spark.reducer.maxBlocksInFlightPerAddress`, to limit the number
of map outputs being fetched from a given remote address. The change applies to
both scenarios: when external shuffle is enabled and when it is disabled.

Ran the job with the default configuration, which does not change the existing
behavior, and with a few lower values: 10, 20, 50, 100. The job ran fine and
there is no change in the output. (I will update the metrics related to the NM
in some time.)
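
As a usage note (not part of the original patch description), the new key would
be supplied like any other Spark configuration; a minimal, hypothetical Java
sketch, using one of the lower values mentioned in the test run above:

```
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public class MaxBlocksInFlightExample {
    public static void main(String[] args) {
        // Cap the number of map outputs fetched in flight from any single
        // remote address; the key name comes from the commit description,
        // and 50 is simply one of the values exercised in the test run.
        SparkConf conf = new SparkConf()
                .setAppName("shuffle-fetch-limit")
                .set("spark.reducer.maxBlocksInFlightPerAddress", "50");

        SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
        // ... run a shuffle-heavy job here ...
        spark.stop();
    }
}
```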

Author: Dhruve Ashar 

Closes #18487 from dhruve/impr/SPARK-21243.

Author: Dhruve Ashar 

Closes #18691 from dhruve/branch-2.2.

commit da403b95353f064c24da25236fa7f905fa8ddca1
Author: Holden Karau 
Date:   2017-07-21T23:50:47Z