[GitHub] spark issue #20704: [SPARK-23551][BUILD] Exclude `hadoop-mapreduce-client-co...
Github user megaserg commented on the issue: https://github.com/apache/spark/pull/20704

Thank you @dongjoon-hyun! This was also affecting our Spark job performance. We set `mapreduce.fileoutputcommitter.algorithm.version=2` in our Spark job config, as recommended e.g. here: http://spark.apache.org/docs/latest/cloud-integration.html, and we run with user-provided Hadoop 2.9.0. However, since this 2.6.5 JAR was in spark/jars, it took priority on the classpath over the Hadoop-distributed 2.9.0 JAR. The 2.6.5 JAR silently ignored the `mapreduce.fileoutputcommitter.algorithm.version` setting and used the default, slow algorithm (I believe hadoop-mapreduce-client-core had only the one, slow algorithm until 2.7.0). I believe this affects everyone who uses any mapreduce settings with Spark 2.3.0. Great job! Can we double-check that this JAR is no longer present in the "without-hadoop" Spark distribution?
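For reference, a minimal sketch of how the committer setting mentioned above is typically supplied to a Spark job. The property name is taken from the comment; the `spark.hadoop.` prefix is the standard Spark mechanism for forwarding arbitrary Hadoop configuration, not something specific to this thread:

```
# spark-defaults.conf (illustrative fragment)
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version  2
```

Note that, per the issue described here, a shadowing 2.6.5 `hadoop-mapreduce-client-core` JAR on the classpath would silently ignore this value, so setting it had no effect until the JAR exclusion in this PR.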
[GitHub] spark issue #18990: [SPARK-21782][Core] Repartition creates skews when numPa...
Github user megaserg commented on the issue: https://github.com/apache/spark/pull/18990

Sorry, I edited the pull request body. @srowen's comment above was referring to the initial version, where I proposed using the default, non-deterministic constructor for `Random()`.
[GitHub] spark pull request #18990: [SPARK-21782][Core] Repartition creates skews whe...
GitHub user megaserg opened a pull request: https://github.com/apache/spark/pull/18990

[SPARK-21782][Core] Repartition creates skews when numPartitions is a power of 2

## Problem

When an RDD (particularly one with a low item-per-partition ratio) is repartitioned to a numPartitions that is a power of 2, the resulting partitions are very uneven in size, because a fixed seed is used to initialize the PRNG and the PRNG is used only once. See details in https://issues.apache.org/jira/browse/SPARK-21782

## What changes were proposed in this pull request?

Instead of using a fixed seed, use the default constructor for `Random`.

## How was this patch tested?

`build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.rdd.RDDSuite test`

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/megaserg/spark repartition-skew

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18990.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18990

commit 2cb7550b8ecada3c504621a75c4f82d13880496b
Author: Sergey Serebryakov <sserebrya...@tesla.com>
Date: 2017-08-18T05:47:55Z
[SPARK-21782][Core] Repartition creates skews when numPartitions is a power of 2
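To illustrate the mechanism the problem statement describes (a PRNG seeded with a fixed per-partition value and drawn from only once): the PRNG behind Spark's shuffle here is `java.util.Random`, and a minimal Java sketch with a hypothetical partition count shows both that a fixed seed makes the single draw identical on every run, and that for a power-of-two bound the first draws from consecutive seeds collapse onto very few distinct values:

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class RepartitionSkewDemo {
    public static void main(String[] args) {
        int numPartitions = 64; // a power of 2; hypothetical value

        // With a fixed seed, the single nextInt draw is the same on
        // every run of the job, so the skew is fully reproducible.
        System.out.println("seed 42 draw: " + new Random(42).nextInt(numPartitions));

        // First draws for consecutive seeds 0..63, one per source
        // partition, each PRNG used exactly once.
        Set<Integer> startingPositions = new HashSet<>();
        for (int partitionIndex = 0; partitionIndex < numPartitions; partitionIndex++) {
            startingPositions.add(new Random(partitionIndex).nextInt(numPartitions));
        }
        // java.util.Random's single LCG step maps nearby seeds to nearby
        // states, and nextInt of a power-of-two bound keeps only the top
        // bits of that state, so the draws barely vary across seeds.
        System.out.println("distinct starting positions out of "
                + numPartitions + ": " + startingPositions.size());
    }
}
```

Records that should round-robin from many different starting positions instead pile up around a handful of target partitions; a non-fixed seed avoids pinning every partition to the same few starting values on every run.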