[GitHub] spark issue #22995: [SPARK-25998] [CORE] Change TorrentBroadcast to hold wea...

2018-11-16 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22995 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h

[GitHub] spark issue #22609: [SPARK-25594] [Core] Avoid maintaining task information ...

2018-10-02 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22609 I will close the PR and document in our tests to use spark.ui.retainedTasks = 1 (or some very low value) : which will, in effect, be equivalent to this PR - and loose information anyway

[GitHub] spark pull request #22609: [SPARK-25594] [Core] Avoid maintaining task infor...

2018-10-02 Thread mridulm
Github user mridulm closed the pull request at: https://github.com/apache/spark/pull/22609 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22609: [SPARK-25594] [Core] Avoid maintaining task information ...

2018-10-02 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22609 > I tried that a lot while writing this code, but the disk store is just too slow That is unfortunate ... I was actually hoping that would be a good compromise if expanded

[GitHub] spark issue #22609: [SPARK-25594] [Core] Avoid maintaining task information ...

2018-10-02 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22609 @vanzin any reason why liveStore is hardcoded to be in-memory ? Any implications of making it disk backed ? That might be another option to unblock - requires a config change to existing

[GitHub] spark issue #22609: [SPARK-25594] [Core] Avoid maintaining task information ...

2018-10-02 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22609 @vanzin , @gengliangwang To clarify earlier behavior - if UI is disabled, any task related metrics were reported as zero: granularity of update was at the stage level. UI is typically disabled

[GitHub] spark issue #22609: [SPARK-25594] [Core] Avoid maintaining task information ...

2018-10-02 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22609 This is a regression introduced in 2.3; unfortunately we did not notice it until now. On Tue, Oct 2, 2018 at 4:56 AM Shahid wrote: > Hi @mridulm <https://gith

[GitHub] spark issue #22609: [SPARK-25594] [Core] Avoid maintaining task information ...

2018-10-02 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22609 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h

[GitHub] spark issue #22609: [SPARK-25594] [Core] Avoid maintaining task information ...

2018-10-02 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22609 +CC @vanzin, @tgravescs --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail

[GitHub] spark pull request #22609: [SPARK-25594] [Core] Avoid maintaining task infor...

2018-10-02 Thread mridulm
GitHub user mridulm opened a pull request: https://github.com/apache/spark/pull/22609 [SPARK-25594] [Core] Avoid maintaining task information when UI is disabled ## What changes were proposed in this pull request? Avoid maintaining task information in live spark application

[GitHub] spark pull request #22010: [SPARK-21436][CORE] Take advantage of known parti...

2018-09-26 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22010#discussion_r22067 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -396,7 +397,20 @@ abstract class RDD[T: ClassTag]( * Return a new RDD containing

[GitHub] spark pull request #22010: [SPARK-21436][CORE] Take advantage of known parti...

2018-09-26 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22010#discussion_r220672896 --- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala --- @@ -95,6 +95,18 @@ class RDDSuite extends SparkFunSuite with SharedSparkContext

[GitHub] spark pull request #22010: [SPARK-21436][CORE] Take advantage of known parti...

2018-09-26 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22010#discussion_r220672579 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -396,7 +397,20 @@ abstract class RDD[T: ClassTag]( * Return a new RDD containing

[GitHub] spark pull request #22010: [SPARK-21436][CORE] Take advantage of known parti...

2018-09-19 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22010#discussion_r218946996 --- Diff: core/src/main/scala/org/apache/spark/rdd/MapPartitionsRDD.scala --- @@ -35,16 +35,24 @@ import org.apache.spark.{Partition, TaskContext

[GitHub] spark pull request #22010: [SPARK-21436][CORE] Take advantage of known parti...

2018-09-19 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22010#discussion_r218946895 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -396,7 +396,26 @@ abstract class RDD[T: ClassTag]( * Return a new RDD containing

[GitHub] spark issue #22433: [SPARK-25442][SQL][K8S] Support STS to run in k8s deploy...

2018-09-16 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22433 > As this script is common start point for all the resource managers(k8s/yarn/mesos/standalone/local), i guess changing this to fit for all the cases has a value add, instead of doing at e

[GitHub] spark issue #22433: [SPARK-25442][SQL][K8S] Support STS to run in k8s deploy...

2018-09-16 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22433 It is an implementation detail of k8s integration that application name is expected to be DNS compliant ... spark does not have that requirement; and yarn/mesos/standalone/local work without

[GitHub] spark issue #22433: [SPARK-25442][SQL][K8S] Support STS to run in k8s deploy...

2018-09-16 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22433 Does it fail in k8s or does spark k8s code error out ? If former, why not fix “name” handling in k8s to replace unsupported characters

[GitHub] spark pull request #22010: [SPARK-21436][CORE] Take advantage of known parti...

2018-09-15 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22010#discussion_r217901185 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -396,7 +396,26 @@ abstract class RDD[T: ClassTag]( * Return a new RDD containing

[GitHub] spark pull request #22010: [SPARK-21436][CORE] Take advantage of known parti...

2018-09-15 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22010#discussion_r217901179 --- Diff: core/src/main/scala/org/apache/spark/rdd/MapPartitionsRDD.scala --- @@ -35,16 +35,24 @@ import org.apache.spark.{Partition, TaskContext

[GitHub] spark pull request #22010: [SPARK-21436][CORE] Take advantage of known parti...

2018-09-14 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22010#discussion_r217876215 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -396,7 +396,26 @@ abstract class RDD[T: ClassTag]( * Return a new RDD containing

[GitHub] spark pull request #22010: [SPARK-21436][CORE] Take advantage of known parti...

2018-09-14 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22010#discussion_r217876184 --- Diff: core/src/main/scala/org/apache/spark/rdd/MapPartitionsRDD.scala --- @@ -35,16 +35,24 @@ import org.apache.spark.{Partition, TaskContext

[GitHub] spark pull request #22192: [SPARK-24918][Core] Executor Plugin API

2018-09-07 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22192#discussion_r216118263 --- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala --- @@ -136,6 +136,26 @@ private[spark] class Executor( // for fetching remote

[GitHub] spark pull request #22192: [SPARK-24918][Core] Executor Plugin API

2018-09-07 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22192#discussion_r216116857 --- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala --- @@ -136,6 +136,26 @@ private[spark] class Executor( // for fetching remote

[GitHub] spark pull request #22192: [SPARK-24918][Core] Executor Plugin API

2018-09-07 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22192#discussion_r216116706 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -240,6 +240,19 @@ private[spark] object Utils extends Logging

[GitHub] spark pull request #22192: [SPARK-24918][Core] Executor Plugin API

2018-09-06 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22192#discussion_r215750952 --- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala --- @@ -218,6 +244,8 @@ private[spark] class Executor

[GitHub] spark pull request #22192: [SPARK-24918][Core] Executor Plugin API

2018-09-06 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22192#discussion_r215749556 --- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala --- @@ -136,6 +136,32 @@ private[spark] class Executor( // for fetching remote

[GitHub] spark pull request #22192: [SPARK-24918][Core] Executor Plugin API

2018-09-06 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22192#discussion_r215749405 --- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala --- @@ -136,6 +136,32 @@ private[spark] class Executor( // for fetching remote

[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API

2018-09-05 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22192 @vanzin @NiharS I am uncomfortable with that change as well - which is why I wanted the initialization to be pushed into a separate thread (and then join) - if we really need to set the context

[GitHub] spark pull request #22192: [SPARK-24918][Core] Executor Plugin API

2018-08-29 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22192#discussion_r213819091 --- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala --- @@ -130,6 +130,14 @@ private[spark] class Executor( private val

[GitHub] spark pull request #22192: [SPARK-24918][Core] Executor Plugin API

2018-08-29 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22192#discussion_r213818362 --- Diff: core/src/main/java/org/apache/spark/ExecutorPlugin.java --- @@ -0,0 +1,38 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #22192: [SPARK-24918][Core] Executor Plugin API

2018-08-29 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22192#discussion_r213821565 --- Diff: core/src/test/java/org/apache/spark/ExecutorPluginSuite.java --- @@ -0,0 +1,114 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #22209: [SPARK-24415][Core] Fixed the aggregated stage me...

2018-08-24 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22209#discussion_r212772176 --- Diff: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala --- @@ -350,11 +350,16 @@ private[spark] class AppStatusListener

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-24 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212705976 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -1865,6 +1871,57 @@ abstract class RDD[T: ClassTag]( // RDD chain

[GitHub] spark pull request #22209: [SPARK-24415][Core] Fixed the issue where stage p...

2018-08-24 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22209#discussion_r212705634 --- Diff: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala --- @@ -350,11 +350,16 @@ private[spark] class AppStatusListener

[GitHub] spark pull request #22209: [SPARK-24415][Core] Fixed the issue where stage p...

2018-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22209#discussion_r212491921 --- Diff: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala --- @@ -350,11 +350,16 @@ private[spark] class AppStatusListener

[GitHub] spark pull request #22209: [SPARK-24415][Core] Fixed the issue where stage p...

2018-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22209#discussion_r212490964 --- Diff: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala --- @@ -350,11 +350,16 @@ private[spark] class AppStatusListener

[GitHub] spark pull request #22209: [SPARK-24415][Core] Fixed the issue where stage p...

2018-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22209#discussion_r212468772 --- Diff: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala --- @@ -350,11 +350,16 @@ private[spark] class AppStatusListener

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212462874 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -1502,6 +1502,53 @@ private[spark] class DAGScheduler

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212452014 --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala --- @@ -2627,6 +2632,81 @@ class DAGSchedulerSuite extends SparkFunSuite

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212451081 --- Diff: core/src/main/scala/org/apache/spark/rdd/ZippedPartitionsRDD.scala --- @@ -95,6 +99,18 @@ private[spark] class ZippedPartitionsRDD2[A: ClassTag

[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-23 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21698 @jiangxb1987 I am guessing we should close this PR ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212386645 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -1865,6 +1876,39 @@ abstract class RDD[T: ClassTag]( // RDD chain

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212385688 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -1865,6 +1876,39 @@ abstract class RDD[T: ClassTag]( // RDD chain

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212192600 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -855,16 +858,17 @@ abstract class RDD[T: ClassTag]( * a map on the other

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212199604 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -1502,6 +1502,53 @@ private[spark] class DAGScheduler

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212195284 --- Diff: core/src/main/scala/org/apache/spark/rdd/ZippedPartitionsRDD.scala --- @@ -95,6 +99,18 @@ private[spark] class ZippedPartitionsRDD2[A: ClassTag

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212199007 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -1502,6 +1502,53 @@ private[spark] class DAGScheduler

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212193814 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -1876,6 +1920,22 @@ abstract class RDD[T: ClassTag]( */ object RDD

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212198632 --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala --- @@ -70,7 +70,8 @@ class MyRDD( numPartitions: Int

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212197939 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -1502,6 +1502,53 @@ private[spark] class DAGScheduler

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212192065 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -1876,6 +1920,22 @@ abstract class RDD[T: ClassTag]( */ object RDD

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212192772 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -1865,6 +1876,39 @@ abstract class RDD[T: ClassTag]( // RDD chain

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212193206 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -1865,6 +1876,39 @@ abstract class RDD[T: ClassTag]( // RDD chain

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212192261 --- Diff: core/src/main/scala/org/apache/spark/rdd/MapPartitionsRDD.scala --- @@ -54,4 +58,12 @@ private[spark] class MapPartitionsRDD[U: ClassTag, T

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212190267 --- Diff: core/src/main/scala/org/apache/spark/rdd/MapPartitionsRDD.scala --- @@ -32,12 +32,16 @@ import org.apache.spark.{Partition, TaskContext

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212196598 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -1502,6 +1502,53 @@ private[spark] class DAGScheduler

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22112 @tgravescs: > The shuffle simply transfers the bytes its supposed to. Sparks shuffle of those bytes is not consistent in that the order it fetches from can change and without the s

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22112 Catching up on discussion ... @cloud-fan > shuffled RDD will never be deterministic unless the shuffle key is the entire record and key ordering is specified. Let me rephr

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-17 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22112 @tgravescs Please see https://github.com/apache/spark/pull/22112#discussion_r210788359 for a further elaboration. We actually cannot support random order (except for small subset of cases like map

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-17 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r210963665 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -1864,6 +1877,22 @@ abstract class RDD[T: ClassTag]( // From performance concern

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-17 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r210963213 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -853,6 +861,11 @@ abstract class RDD[T: ClassTag]( * second element in each RDD

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-17 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r210967814 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -1441,6 +1441,44 @@ class DAGScheduler( failedStages

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-17 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r210964794 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -1864,6 +1877,22 @@ abstract class RDD[T: ClassTag]( // From performance concern

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-17 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r210788359 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -112,6 +112,11 @@ abstract class RDD[T: ClassTag]( /** * :: DeveloperApi

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-17 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r210756079 --- Diff: core/src/main/scala/org/apache/spark/Dependency.scala --- @@ -94,6 +94,16 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-16 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22112 @tgravescs To understand better, are you suggesting that we do not support any api and/or user closure which depends on input order ? If yes, that would break not just repartition + shuffle

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-16 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22112 @tgravescs I was specifically in agreement with > Personally I don't want to talk about implementation until we decide what we want our semantics to be around the unordered operations beca

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-16 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22112 I agree @tgravescs, I was looking at the implementation to understand what the expectations are wrt newly introduced methods/fields and whether they make sense : I did not see any details furnished

[GitHub] spark issue #22112: [WIP][SPARK-23243][Core] Fix RDD.repartition() data corr...

2018-08-16 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22112 You are perfectly correct @jiangxb1987, that was a silly mistake on my part - and not trivial at all ! It should be shuffle dependency we should rely on when traversing the dependency tree

[GitHub] spark issue #22112: [WIP][SPARK-23243][Core] Fix RDD.repartition() data corr...

2018-08-16 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22112 I am not sure what the definition of `isIdempotent` here is. For example, from MapPartitionsRDD : ``` override private[spark] def isIdempotent = { if (inputOrderSensitive

[GitHub] spark issue #22101: [SPARK-25114][Core] Fix RecordBinaryComparator when subt...

2018-08-14 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22101 LGTM pending Xiao Li's excellent suggestion :-) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-14 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21698 > I guess on the RDD side its not called RoundRobinPartitioner Thanks for clarifying @tgravescs ! I was looking at `RangePartitioner` and variants and was wondering what I was missing -

[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-14 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21698 @cloud-fan I think we have to be clear on the boundaries of the solution we can provide in spark. > RDD#mapPartitions and its friends can take arbitrary user functions, which may prod

[GitHub] spark pull request #22101: [SPARK-23207][Core][FOLLOWUP] Fix RecordBinaryCom...

2018-08-14 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22101#discussion_r209862610 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/execution/RecordBinaryComparator.java --- @@ -42,16 +42,16 @@ public int compare

[GitHub] spark pull request #22101: [SPARK-23207][Core][FOLLOWUP] Fix RecordBinaryCom...

2018-08-14 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22101#discussion_r209860397 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/execution/RecordBinaryComparator.java --- @@ -42,16 +42,16 @@ public int compare

[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-13 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21698 @squito @tgravescs I am probably missing something about why hash partitioner helps, can you please clarify ? IIRC the partitioner for CoalescedRDD when shuffle is enabled is HashPartitioner

[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-13 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21698 @tgravescs I vaguely remember someone at y! labs telling me (more than a decade back) about MR always doing a sort as part of its shuffle to avoid a variant of this problem by design

[GitHub] spark pull request #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartit...

2018-08-13 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22079#discussion_r209705612 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/execution/RecordBinaryComparator.java --- @@ -0,0 +1,70 @@ +/* + * Licensed

[GitHub] spark issue #21895: [SPARK-24948][SHS] Delegate check access permissions to ...

2018-08-06 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21895 I merged to master, thanks for the work @mgaido91 ! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

[GitHub] spark pull request #21895: [SPARK-24948][SHS] Delegate check access permissi...

2018-08-03 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21895#discussion_r207641341 --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala --- @@ -779,6 +808,8 @@ private[history] class FsHistoryProvider(conf

[GitHub] spark pull request #21895: [SPARK-24948][SHS] Delegate check access permissi...

2018-08-03 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21895#discussion_r207493280 --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala --- @@ -161,6 +162,29 @@ private[history] class FsHistoryProvider

[GitHub] spark pull request #21895: [SPARK-24948][SHS] Delegate check access permissi...

2018-08-03 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21895#discussion_r207493081 --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala --- @@ -779,6 +808,8 @@ private[history] class FsHistoryProvider(conf

[GitHub] spark issue #21895: [SPARK-24948][SHS] Delegate check access permissions to ...

2018-07-28 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21895 +CC @jerryshao --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h

[GitHub] spark pull request #21895: [SPARK-24948][SHS] Delegate check access permissi...

2018-07-28 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21895#discussion_r205948923 --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala --- @@ -973,6 +973,38 @@ private[history] object FsHistoryProvider

[GitHub] spark pull request #21653: [SPARK-13343] speculative tasks that didn't commi...

2018-07-20 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21653#discussion_r204199580 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala --- @@ -723,6 +723,21 @@ private[spark] class TaskSetManager( def

[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...

2018-07-18 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21589 @MaxGekk We are going in circles. I dont think this is a good api to expose currently - the data is available through multiple other means as I detailed and while not a succinct oneliner

[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...

2018-07-18 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21589 @MaxGekk The example you cites is literally one of a handful of usages which is not easily overridden - and is prefixed with a 'HACK ALERT' ! A few others are in mllib, typically for reading schema

[GitHub] spark pull request #21221: [SPARK-23429][CORE] Add executor memory metrics t...

2018-07-18 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21221#discussion_r203489913 --- Diff: core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala --- @@ -160,11 +160,29 @@ case class SparkListenerBlockUpdated

[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...

2018-07-18 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21589 +CC @markhamstra since you were looking at API stability. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...

2018-07-18 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21589 I am not convinced by the rationale given for adding the new api's in the jira. The examples given there can be easily modeled using `defaultParallelism` (to get current state

[GitHub] spark pull request #21102: [SPARK-23913][SQL] Add array_intersect function

2018-07-18 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21102#discussion_r203322643 --- Diff: core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala --- @@ -163,7 +187,7 @@ class OpenHashSet[@specialized(Long, Int) T

[GitHub] spark pull request #21102: [SPARK-23913][SQL] Add array_intersect function

2018-07-18 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21102#discussion_r203322056 --- Diff: core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala --- @@ -163,7 +187,7 @@ class OpenHashSet[@specialized(Long, Int) T

[GitHub] spark issue #21758: [SPARK-24795][CORE] Implement barrier execution mode

2018-07-18 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21758 I had left a few comments on SPARK-24375 @jiangxb1987 ... unfortunately the jira's have moved around a bit. If this is active PR for introducing the feature, would be great to get clarity

[GitHub] spark pull request #21221: [SPARK-23429][CORE] Add executor memory metrics t...

2018-07-18 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21221#discussion_r203319952 --- Diff: core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala --- @@ -160,11 +160,29 @@ case class SparkListenerBlockUpdated

[GitHub] spark issue #21729: [SPARK-24755][Core] Executor loss can cause task to not ...

2018-07-18 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21729 Looks good to me, thanks for fixing this @hthuynh2 ! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...

2018-07-18 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21589 I am not seeing the utility of these two methods. `defaultParallelism` already captures the current number of cores. For monitoring usecases, existing events fired via listener can

[GitHub] spark pull request #21102: [SPARK-23913][SQL] Add array_intersect function

2018-07-18 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21102#discussion_r203311755 --- Diff: core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala --- @@ -85,9 +85,13 @@ class OpenHashSet[@specialized(Long, Int) T

[GitHub] spark pull request #21102: [SPARK-23913][SQL] Add array_intersect function

2018-07-18 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21102#discussion_r203314045 --- Diff: core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala --- @@ -163,7 +187,7 @@ class OpenHashSet[@specialized(Long, Int) T

[GitHub] spark pull request #21102: [SPARK-23913][SQL] Add array_intersect function

2018-07-18 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21102#discussion_r203311109 --- Diff: core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala --- @@ -85,9 +85,13 @@ class OpenHashSet[@specialized(Long, Int) T

  1   2   3   4   5   6   7   8   9   10   >