[GitHub] spark pull request #21895: [SPARK-24948][SHS] Delegate check access permissi...

2018-08-03 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21895#discussion_r207493081 --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala --- @@ -779,6 +808,8 @@ private[history] class FsHistoryProvider(conf

[GitHub] spark pull request #21895: [SPARK-24948][SHS] Delegate check access permissi...

2018-08-03 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21895#discussion_r207493280 --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala --- @@ -161,6 +162,29 @@ private[history] class FsHistoryProvider

[GitHub] spark pull request #21895: [SPARK-24948][SHS] Delegate check access permissi...

2018-08-03 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21895#discussion_r207641341 --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala --- @@ -779,6 +808,8 @@ private[history] class FsHistoryProvider(conf

[GitHub] spark issue #21895: [SPARK-24948][SHS] Delegate check access permissions to ...

2018-08-06 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21895 I merged to master, thanks for the work @mgaido91 ! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For

[GitHub] spark pull request #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartit...

2018-08-13 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22079#discussion_r209705612 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/execution/RecordBinaryComparator.java --- @@ -0,0 +1,70 @@ +/* + * Licensed to the

[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-13 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21698 @tgravescs I vaguely remember someone at y! labs telling me (more than a decade back) about MR always doing a sort as part of its shuffle to avoid a variant of this problem by design

[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-13 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21698 @squito @tgravescs I am probably missing something about why hash partitioner helps, can you please clarify ? IIRC the partitioner for CoalescedRDD when shuffle is enabled is HashPartitioner

[GitHub] spark pull request #22101: [SPARK-23207][Core][FOLLOWUP] Fix RecordBinaryCom...

2018-08-14 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22101#discussion_r209860397 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/execution/RecordBinaryComparator.java --- @@ -42,16 +42,16 @@ public int compare

[GitHub] spark pull request #22101: [SPARK-23207][Core][FOLLOWUP] Fix RecordBinaryCom...

2018-08-14 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22101#discussion_r209862610 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/execution/RecordBinaryComparator.java --- @@ -42,16 +42,16 @@ public int compare

[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-14 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21698 @cloud-fan I think we have to be clear on the boundaries of the solution we can provide in spark. > RDD#mapPartitions and its friends can take arbitrary user functions, which may prod

[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-14 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21698 > I guess on the RDD side its not called RoundRobinPartitioner Thanks for clarifying @tgravescs ! I was looking at `RangePartitioner` and variants and was wondering what I was missing -

[GitHub] spark issue #22101: [SPARK-25114][Core] Fix RecordBinaryComparator when subt...

2018-08-14 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22101 LGTM pending Xiao Li's excellent suggestion :-) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.or

[GitHub] spark issue #22112: [WIP][SPARK-23243][Core] Fix RDD.repartition() data corr...

2018-08-15 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22112 I am not sure what the definition of `isIdempotent` here is. For example, from MapPartitionsRDD : ``` override private[spark] def isIdempotent = { if (inputOrderSensitive

[GitHub] spark issue #22112: [WIP][SPARK-23243][Core] Fix RDD.repartition() data corr...

2018-08-15 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22112 You are perfectly correct @jiangxb1987, that was a silly mistake on my part - and not trivial at all ! It should be shuffle dependency we should rely on when traversing the dependency tree, not

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-16 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22112 I agree @tgravescs, I was looking at the implementation to understand what the expectations are wrt newly introduced methods/fields and whether they make sense : I did not see any details furnished

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-16 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22112 @tgravescs I was specifically in agreement with > Personally I don't want to talk about implementation until we decide what we want our semantics to be around the unordered operations

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-16 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22112 @tgravescs To understand better, are you suggesting that we do not support any api and/or user closure which depends on input order ? If yes, that would break not just repartition + shuffle, but

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-17 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r210756079 --- Diff: core/src/main/scala/org/apache/spark/Dependency.scala --- @@ -94,6 +94,16 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-17 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r210967814 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -1441,6 +1441,44 @@ class DAGScheduler( failedStages

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-17 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r210964794 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -1864,6 +1877,22 @@ abstract class RDD[T: ClassTag]( // From performance concern

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-17 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r210788359 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -112,6 +112,11 @@ abstract class RDD[T: ClassTag]( /** * :: DeveloperApi

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-17 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r210963665 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -1864,6 +1877,22 @@ abstract class RDD[T: ClassTag]( // From performance concern

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-17 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r210963213 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -853,6 +861,11 @@ abstract class RDD[T: ClassTag]( * second element in each RDD

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-17 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22112 @tgravescs Please see https://github.com/apache/spark/pull/22112#discussion_r210788359 for a further elaboration. We actually cannot support random order (except for small subset of cases like map

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22112 Catching up on discussion ... @cloud-fan > shuffled RDD will never be deterministic unless the shuffle key is the entire record and key ordering is specified. Let me rephr

[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22112 @tgravescs: > The shuffle simply transfers the bytes its supposed to. Sparks shuffle of those bytes is not consistent in that the order it fetches from can change and without the s

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-22 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212192065 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -1876,6 +1920,22 @@ abstract class RDD[T: ClassTag]( */ object RDD

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-22 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212192772 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -1865,6 +1876,39 @@ abstract class RDD[T: ClassTag]( // RDD chain

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-22 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212193206 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -1865,6 +1876,39 @@ abstract class RDD[T: ClassTag]( // RDD chain

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-22 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212192261 --- Diff: core/src/main/scala/org/apache/spark/rdd/MapPartitionsRDD.scala --- @@ -54,4 +58,12 @@ private[spark] class MapPartitionsRDD[U: ClassTag, T

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-22 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212190267 --- Diff: core/src/main/scala/org/apache/spark/rdd/MapPartitionsRDD.scala --- @@ -32,12 +32,16 @@ import org.apache.spark.{Partition, TaskContext

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-22 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212196598 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -1502,6 +1502,53 @@ private[spark] class DAGScheduler

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-22 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212199007 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -1502,6 +1502,53 @@ private[spark] class DAGScheduler

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-22 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212193814 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -1876,6 +1920,22 @@ abstract class RDD[T: ClassTag]( */ object RDD

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-22 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212198632 --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala --- @@ -70,7 +70,8 @@ class MyRDD( numPartitions: Int

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-22 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212197939 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -1502,6 +1502,53 @@ private[spark] class DAGScheduler

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-22 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212192600 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -855,16 +858,17 @@ abstract class RDD[T: ClassTag]( * a map on the other

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-22 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212199604 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -1502,6 +1502,53 @@ private[spark] class DAGScheduler

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-22 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212195284 --- Diff: core/src/main/scala/org/apache/spark/rdd/ZippedPartitionsRDD.scala --- @@ -95,6 +99,18 @@ private[spark] class ZippedPartitionsRDD2[A: ClassTag

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212385688 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -1865,6 +1876,39 @@ abstract class RDD[T: ClassTag]( // RDD chain

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212386645 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -1865,6 +1876,39 @@ abstract class RDD[T: ClassTag]( // RDD chain

[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-23 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21698 @jiangxb1987 I am guessing we should close this PR ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212451081 --- Diff: core/src/main/scala/org/apache/spark/rdd/ZippedPartitionsRDD.scala --- @@ -95,6 +99,18 @@ private[spark] class ZippedPartitionsRDD2[A: ClassTag

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212452014 --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala --- @@ -2627,6 +2632,81 @@ class DAGSchedulerSuite extends SparkFunSuite

[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22112#discussion_r212462874 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -1502,6 +1502,53 @@ private[spark] class DAGScheduler

[GitHub] spark pull request #22209: [SPARK-24415][Core] Fixed the issue where stage p...

2018-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22209#discussion_r212468772 --- Diff: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala --- @@ -350,11 +350,16 @@ private[spark] class AppStatusListener

[GitHub] spark pull request #22209: [SPARK-24415][Core] Fixed the issue where stage p...

2018-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22209#discussion_r212490964 --- Diff: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala --- @@ -350,11 +350,16 @@ private[spark] class AppStatusListener

[GitHub] spark pull request #22209: [SPARK-24415][Core] Fixed the issue where stage p...

2018-08-23 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/22209#discussion_r212491921 --- Diff: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala --- @@ -350,11 +350,16 @@ private[spark] class AppStatusListener

[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-07-11 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21698 @jiangxb1987 Different number of output rows is due to data loss - it is not another valid run. A complete re-execution of the job in this case could result in a different ordering, but

[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-07-12 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21698 @cloud-fan There is no ambiguity in output of map - one record in, one record out. In case of zip, as you said, number of output records is min of both. Given this, there is no ambiguity in

[GitHub] spark pull request #21440: [SPARK-24307][CORE] Support reading remote cached...

2018-07-12 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21440#discussion_r202149408 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -723,7 +728,9 @@ private[spark] class BlockManager

[GitHub] spark pull request #21440: [SPARK-24307][CORE] Support reading remote cached...

2018-07-12 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21440#discussion_r202150730 --- Diff: core/src/main/scala/org/apache/spark/util/io/ChunkedByteBuffer.scala --- @@ -166,6 +170,34 @@ private[spark] class ChunkedByteBuffer(var chunks

[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-07-13 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21698 @cloud-fan That depends on what the computeKey is doing - which is user defined. It can have different values, or it need not (again, depends on user data and closure being applied

[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-07-13 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21698 @jiangxb1987 data loss comes because a re-execution of zip might generate a key for which corresponding reducer has already finished. Hence re-execution of stage will not result in subsequent

[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-07-14 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21698 Taking a step back and analyzing the solution for the problem at hand. There are three main issues with the proposal: * It does not solve the problem in a general manner. * I

[GitHub] spark pull request #21440: [SPARK-24307][CORE] Support reading remote cached...

2018-07-17 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21440#discussion_r203182507 --- Diff: core/src/main/scala/org/apache/spark/util/io/ChunkedByteBuffer.scala --- @@ -166,6 +170,34 @@ private[spark] class ChunkedByteBuffer(var chunks

[GitHub] spark pull request #21440: [SPARK-24307][CORE] Support reading remote cached...

2018-07-18 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21440#discussion_r203273344 --- Diff: core/src/main/scala/org/apache/spark/util/io/ChunkedByteBuffer.scala --- @@ -166,6 +170,34 @@ private[spark] class ChunkedByteBuffer(var chunks

[GitHub] spark issue #21440: [SPARK-24307][CORE] Support reading remote cached partit...

2018-07-18 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21440 LGTM, thanks for working on this @squito ! Since there are other ongoing reviews, I will defer to them to merge. --- - To

[GitHub] spark pull request #21102: [SPARK-23913][SQL] Add array_intersect function

2018-07-18 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21102#discussion_r203314045 --- Diff: core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala --- @@ -163,7 +187,7 @@ class OpenHashSet[@specialized(Long, Int) T

[GitHub] spark pull request #21102: [SPARK-23913][SQL] Add array_intersect function

2018-07-18 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21102#discussion_r203311109 --- Diff: core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala --- @@ -85,9 +85,13 @@ class OpenHashSet[@specialized(Long, Int) T

[GitHub] spark pull request #21102: [SPARK-23913][SQL] Add array_intersect function

2018-07-18 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21102#discussion_r203311755 --- Diff: core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala --- @@ -85,9 +85,13 @@ class OpenHashSet[@specialized(Long, Int) T

[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...

2018-07-18 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21589 I am not seeing the utility of these two methods. `defaultParallelism` already captures the current number of cores. For monitoring usecases, existing events fired via listener can be

[GitHub] spark issue #21729: [SPARK-24755][Core] Executor loss can cause task to not ...

2018-07-18 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21729 Looks good to me, thanks for fixing this @hthuynh2 ! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For

[GitHub] spark pull request #21221: [SPARK-23429][CORE] Add executor memory metrics t...

2018-07-18 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21221#discussion_r203319952 --- Diff: core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala --- @@ -160,11 +160,29 @@ case class SparkListenerBlockUpdated

[GitHub] spark issue #21758: [SPARK-24795][CORE] Implement barrier execution mode

2018-07-18 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21758 I had left a few comments on SPARK-24375 @jiangxb1987 ... unfortunately the jira's have moved around a bit. If this is active PR for introducing the feature, would be great to get clari

[GitHub] spark pull request #21102: [SPARK-23913][SQL] Add array_intersect function

2018-07-18 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21102#discussion_r203322056 --- Diff: core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala --- @@ -163,7 +187,7 @@ class OpenHashSet[@specialized(Long, Int) T

[GitHub] spark pull request #21102: [SPARK-23913][SQL] Add array_intersect function

2018-07-18 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21102#discussion_r203322643 --- Diff: core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala --- @@ -163,7 +187,7 @@ class OpenHashSet[@specialized(Long, Int) T

[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...

2018-07-18 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21589 I am not convinced by the rationale given for adding the new api's in the jira. The examples given there can be easily modeled using `defaultParallelism` (to get current state

[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...

2018-07-18 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21589 +CC @markhamstra since you were looking at API stability. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For

[GitHub] spark pull request #21221: [SPARK-23429][CORE] Add executor memory metrics t...

2018-07-18 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21221#discussion_r203489913 --- Diff: core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala --- @@ -160,11 +160,29 @@ case class SparkListenerBlockUpdated

[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...

2018-07-18 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21589 @MaxGekk The example you cites is literally one of a handful of usages which is not easily overridden - and is prefixed with a 'HACK ALERT' ! A few others are in mllib, typically for read

[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...

2018-07-18 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21589 @MaxGekk We are going in circles. I dont think this is a good api to expose currently - the data is available through multiple other means as I detailed and while not a succinct oneliner, it is

[GitHub] spark pull request #21653: [SPARK-13343] speculative tasks that didn't commi...

2018-07-20 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21653#discussion_r204199580 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala --- @@ -723,6 +723,21 @@ private[spark] class TaskSetManager( def

[GitHub] spark pull request #21895: [SPARK-24948][SHS] Delegate check access permissi...

2018-07-28 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21895#discussion_r205948923 --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala --- @@ -973,6 +973,38 @@ private[history] object FsHistoryProvider

[GitHub] spark issue #21895: [SPARK-24948][SHS] Delegate check access permissions to ...

2018-07-28 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21895 +CC @jerryshao --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h

[GitHub] spark pull request #21577: [SPARK-24552][core] Correctly identify tasks in o...

2018-06-18 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21577#discussion_r196235040 --- Diff: core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala --- @@ -109,20 +116,21 @@ private[spark] class

[GitHub] spark issue #21527: [SPARK-24519] MapStatus has 2000 hardcoded

2018-06-18 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21527 +1 on making this configurable. Like @tgravescs, I dont like hardcoded constants - all for making it a private config not necessarily exposed to users. Will allow developers to tune it as

[GitHub] spark pull request #21577: [SPARK-24589][core] Correctly identify tasks in o...

2018-06-18 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21577#discussion_r196282241 --- Diff: core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala --- @@ -109,20 +116,21 @@ private[spark] class

[GitHub] spark pull request #21577: [SPARK-24589][core] Correctly identify tasks in o...

2018-06-18 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21577#discussion_r196288823 --- Diff: core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala --- @@ -109,20 +116,21 @@ private[spark] class

[GitHub] spark pull request #21577: [SPARK-24589][core] Correctly identify tasks in o...

2018-06-18 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21577#discussion_r196291657 --- Diff: core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala --- @@ -109,20 +116,21 @@ private[spark] class

[GitHub] spark pull request #21577: [SPARK-24589][core] Correctly identify tasks in o...

2018-06-19 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21577#discussion_r196514785 --- Diff: core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala --- @@ -109,20 +116,21 @@ private[spark] class

[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

2018-06-19 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21577 Thanks for the changes @vanzin, looks good to me ! Ideally would have been great to test the speculative execution part as well; but that would be fairly nasty to reliably reproduce I guess

[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

2018-06-20 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21577 > * t2 finishes before that kill message arrives, is allowed to commit. > If that can happen it would generate a duplicate map output; but my guess (hope?) is that the map output tracker

[GitHub] spark pull request #16677: [SPARK-19355][SQL] Use map output statistics to i...

2018-06-23 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/16677#discussion_r197606877 --- Diff: core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java --- @@ -145,10 +145,12 @@ public void write(Iterator> reco

[GitHub] spark pull request #16677: [SPARK-19355][SQL] Use map output statistics to i...

2018-06-23 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/16677#discussion_r197607227 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala --- @@ -93,25 +96,94 @@ trait BaseLimitExec extends UnaryExecNode with

[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2018-06-23 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/16677 I am not sure if I am missing something - the count's obtained are at map side output per (map-side) partition; while limit is being computed at reduce side (after some arbitrary partiti

[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2018-06-23 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/16677 If there is some codepath not updating shuffle write metrics (introduced for sql), that would be a bug. On Sat, Jun 23, 2018 at 7:27 AM Liang-Chi Hsieh wrote: > *@vii

[GitHub] spark pull request #16677: [SPARK-19355][SQL] Use map output statistics to i...

2018-06-26 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/16677#discussion_r198360170 --- Diff: core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java --- @@ -145,10 +145,12 @@ public void write(Iterator> reco

[GitHub] spark pull request #21656: [SPARK-24677][Core]MedianHeap is empty when specu...

2018-07-07 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21656#discussion_r200804756 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala --- @@ -772,6 +772,12 @@ private[spark] class TaskSetManager( private

[GitHub] spark pull request #21653: [SPARK-13343] speculative tasks that didn't commi...

2018-07-07 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21653#discussion_r200805005 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala --- @@ -723,6 +723,13 @@ private[spark] class TaskSetManager( def

[GitHub] spark pull request #21221: [SPARK-23429][CORE] Add executor memory metrics t...

2018-07-07 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21221#discussion_r200805499 --- Diff: core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala --- @@ -160,11 +160,29 @@ case class SparkListenerBlockUpdated

[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-07-07 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21698 I did not go over the PR itself in detail, but the proposal sounds very expensive - particularly given the cascading costs involved. Also, I am not sure why we are special case'ing

[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-07-07 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21698 @jiangxb1987 Any closure sensitive to iteration order [1] is effected by this - under the set of circumstances. If we cannot solve it in a principled manner (make shuffle repeatable which I

[GitHub] spark pull request #21729: [SPARK-24755][Core] Executor loss can cause task ...

2018-07-10 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/21729#discussion_r201415032 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala --- @@ -87,7 +87,7 @@ private[spark] class TaskSetManager( // Set the

[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-07-10 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21698 @cloud-fan The difference would be between a (user) defined record order (global sort or local sort) and expectation of repeatable record order on recomputation. It might also be a good idea to

[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-07-11 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21698 @cloud-fan We should not look at a particular stage in isolation, but rather what happens when there are failures in the middle of a job with multiple shuffle stages - and zip is one of the

[GitHub] spark pull request: [SPARK-3875] Add TEMP DIRECTORY configuration

2014-10-09 Thread mridulm
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/2729#issuecomment-58473957 At least for yarn, this will create issues if overridden from default. Not sure about mesos. Why not use std java property and define it for local and

[GitHub] spark pull request: [SPARK-3875] Add TEMP DIRECTORY configuration

2014-10-09 Thread mridulm
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/2729#issuecomment-58479810 There is a java property which controls this ... java.io.tmpdir On 09-Oct-2014 1:22 pm, "刘钰帆" wrote: > @mridulm <https://github.com/mr

[GitHub] spark pull request: [SPARK-3889] Attempt to avoid SIGBUS by not mm...

2014-10-10 Thread mridulm
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/2742#issuecomment-58724312 This needs to be configurable ... IIRC 1.1 had this customizable. Different limits exist for vm vs heap memory in yarn (for example). --- If your project is set up

[GitHub] spark pull request: [SPARK-3889] Attempt to avoid SIGBUS by not mm...

2014-10-10 Thread mridulm
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/2742#issuecomment-58728241 With 1.1, in expts, we have done both : depending on whether our user code is mmap'ing too much data (and so we pull things into heap .. using libraries not i

<    3   4   5   6   7   8   9   10   11   12   >