[GitHub] spark issue #20704: [SPARK-23551][BUILD] Exclude `hadoop-mapreduce-client-co...

2018-04-13 Thread megaserg
Github user megaserg commented on the issue:

https://github.com/apache/spark/pull/20704
  
Thank you @dongjoon-hyun! This was also affecting our Spark job performance!

We're using `mapreduce.fileoutputcommitter.algorithm.version=2` in our 
Spark job config, as recommended e.g. here: 
http://spark.apache.org/docs/latest/cloud-integration.html. We're using 
user-provided Hadoop 2.9.0.

However, since this 2.6.5 JAR was in spark/jars, it took priority on 
the classpath over the 2.9.0 JAR from our Hadoop distribution. The 2.6.5 
version silently ignored the `mapreduce.fileoutputcommitter.algorithm.version` 
setting and used the default, slow algorithm (I believe 
hadoop-mapreduce-client-core only had one, slow, algorithm until 2.7.0).
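
One way to see which JAR actually wins on the classpath is to ask the JVM where a class was loaded from. The following is a minimal sketch using only the JDK; the class name `ClassOrigin` and the method `originOf` are made up for illustration, and it assumes `org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter` is the class of interest.

```java
import java.security.CodeSource;
import java.util.Optional;

// Hypothetical helper, not part of Spark or Hadoop.
class ClassOrigin {
    /** Returns the JAR or directory a class was loaded from, or empty if it is not on the classpath. */
    public static Optional<String> originOf(String className) {
        try {
            // Load without initializing, so no static initializers run.
            Class<?> cls = Class.forName(className, false, ClassOrigin.class.getClassLoader());
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            // Classes from the JDK itself usually have no code source.
            if (src == null || src.getLocation() == null) {
                return Optional.of("(bootstrap or unknown)");
            }
            return Optional.of(src.getLocation().toString());
        } catch (ClassNotFoundException | LinkageError e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Assumption: this is the class whose origin matters for the committer setting.
        String name = args.length > 0
                ? args[0]
                : "org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter";
        System.out.println(name + " -> " + originOf(name).orElse("(not on classpath)"));
    }
}
```

Run with the same classpath as the Spark driver or executor, this should print a location ending in the JAR that supplied the class, which makes a stale 2.6.5 copy easy to spot.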

I believe this affects everyone who uses any mapreduce settings with Spark 
2.3.0. Great job!

Can we double-check that this JAR is not present in the "without-hadoop" 
Spark distribution anymore?
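
A quick way to double-check could be to scan the distribution's JAR directory for the artifact name. This is only a sketch: `BundledJarCheck` and `findJars` are hypothetical names, and the default directory `jars` assumes the program is run from the root of an unpacked Spark distribution.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Spark.
class BundledJarCheck {
    /** Lists file names in jarsDir that start with prefix and end in ".jar", sorted. */
    public static List<String> findJars(Path jarsDir, String prefix) {
        try (Stream<Path> entries = Files.list(jarsDir)) {
            return entries
                    .map(p -> p.getFileName().toString())
                    .filter(n -> n.startsWith(prefix) && n.endsWith(".jar"))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumption: the distribution keeps its JARs in a "jars" directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : "jars");
        List<String> hits = findJars(dir, "hadoop-mapreduce-client-core");
        System.out.println(hits.isEmpty() ? "no matching JAR found" : "found: " + hits);
    }
}
```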


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18990: [SPARK-21782][Core] Repartition creates skews when numPa...

2017-08-18 Thread megaserg
Github user megaserg commented on the issue:

https://github.com/apache/spark/pull/18990
  
Sorry, I edited the pull request body. @srowen's comment above was 
referring to the initial version, where I proposed using the default, 
non-deterministic `Random()` constructor.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---




[GitHub] spark pull request #18990: [SPARK-21782][Core] Repartition creates skews whe...

2017-08-17 Thread megaserg
GitHub user megaserg opened a pull request:

https://github.com/apache/spark/pull/18990

[SPARK-21782][Core] Repartition creates skews when numPartitions is a power 
of 2

## Problem
When an RDD (particularly one with a low items-per-partition ratio) is 
repartitioned to numPartitions = power of 2, the resulting partitions are very 
unevenly sized, because a fixed seed is used to initialize the PRNG and the 
PRNG is used only once. See details in https://issues.apache.org/jira/browse/SPARK-21782
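
The effect can be reproduced outside Spark with plain `java.util.Random`. The sketch below is a simplified model, not the Spark code itself: it assumes each source partition seeds the PRNG with its own index, draws a single start position, and then hands out items round-robin. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

// Simplified, hypothetical model of the distribution step.
class RepartitionSkewSketch {
    /** Start position for one source partition: one draw from a PRNG seeded by the partition index. */
    public static int startPosition(int partitionIndex, int numPartitions) {
        return new Random(partitionIndex).nextInt(numPartitions);
    }

    /** Counts how many items land in each target partition. */
    public static int[] targetSizes(int sourcePartitions, int itemsPerPartition, int numPartitions) {
        int[] sizes = new int[numPartitions];
        for (int index = 0; index < sourcePartitions; index++) {
            int position = startPosition(index, numPartitions);
            for (int item = 0; item < itemsPerPartition; item++) {
                position += 1; // round-robin from the start position
                sizes[Math.floorMod(position, numPartitions)]++;
            }
        }
        return sizes;
    }

    /** Number of target partitions that received at least one item. */
    public static long nonEmpty(int[] sizes) {
        return Arrays.stream(sizes).filter(s -> s > 0).count();
    }

    public static void main(String[] args) {
        // 64 source partitions with 4 items each: a low items-per-partition ratio.
        for (int numPartitions : new int[] {16, 15}) {
            int[] sizes = targetSizes(64, 4, numPartitions);
            System.out.println(numPartitions + " targets, " + nonEmpty(sizes)
                    + " non-empty: " + Arrays.toString(sizes));
        }
    }
}
```

For a power-of-two bound, `nextInt` keeps only the high bits of the first PRNG output, and those bits barely differ between small neighbouring seeds, so nearly all source partitions start at the same position. A bound that is not a power of two goes through a modulo of the full 31-bit output instead.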

## What changes were proposed in this pull request?
Instead of using a fixed seed, use the default constructor for `Random`.

## How was this patch tested?
`build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.rdd.RDDSuite test`


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/megaserg/spark repartition-skew

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18990.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18990


commit 2cb7550b8ecada3c504621a75c4f82d13880496b
Author: Sergey Serebryakov <sserebrya...@tesla.com>
Date:   2017-08-18T05:47:55Z

[SPARK-21782][Core] Repartition creates skews when numPartitions is a power 
of 2



