Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/21895#discussion_r207493081
--- Diff:
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
@@ -779,6 +808,8 @@ private[history] class FsHistoryProvider(conf
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/21895#discussion_r207493280
--- Diff:
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
@@ -161,6 +162,29 @@ private[history] class FsHistoryProvider
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/21895#discussion_r207641341
--- Diff:
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
@@ -779,6 +808,8 @@ private[history] class FsHistoryProvider(conf
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/21895
I merged to master, thanks for the work @mgaido91 !
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/22079#discussion_r209705612
--- Diff:
sql/catalyst/src/main/java/org/apache/spark/sql/execution/RecordBinaryComparator.java
---
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/21698
@tgravescs I vaguely remember someone at y! labs telling me (more than a
decade back) about MR always doing a sort as part of its shuffle to avoid a
variant of this problem by design
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/21698
@squito @tgravescs I am probably missing something about why hash
partitioner helps, can you please clarify ?
IIRC the partitioner for CoalescedRDD when shuffle is enabled is
HashPartitioner
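For context, he is right: `coalesce(numPartitions, shuffle = true)` keys records round-robin and shuffles them with a `HashPartitioner`. A simplified sketch of that path (abridged from `RDD.scala`; the real code seeds the starting position with `byteswap32(index)` and strips the Int keys afterwards):
```
import scala.reflect.ClassTag
import scala.util.Random

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.{RDD, ShuffledRDD}

// Each input partition round-robins its records across the target
// partitions, starting at a pseudo-random offset, and the keyed records
// are then shuffled with a plain HashPartitioner over the Int key.
def coalesceWithShuffle[T: ClassTag](prev: RDD[T], numPartitions: Int): RDD[(Int, T)] = {
  val distribute = (index: Int, items: Iterator[T]) => {
    var position = new Random(index).nextInt(numPartitions)
    items.map { t =>
      position += 1
      (position, t) // position % numPartitions selects the target partition
    }
  }
  new ShuffledRDD[Int, T, T](
    prev.mapPartitionsWithIndex(distribute),
    new HashPartitioner(numPartitions))
}
```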
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/22101#discussion_r209860397
--- Diff:
sql/catalyst/src/main/java/org/apache/spark/sql/execution/RecordBinaryComparator.java
---
@@ -42,16 +42,16 @@ public int compare
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/22101#discussion_r209862610
--- Diff:
sql/catalyst/src/main/java/org/apache/spark/sql/execution/RecordBinaryComparator.java
---
@@ -42,16 +42,16 @@ public int compare
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/21698
@cloud-fan I think we have to be clear on the boundaries of the solution we
can provide in spark.
> RDD#mapPartitions and its friends can take arbitrary user functions,
which may prod
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/21698
> I guess on the RDD side its not called RoundRobinPartitioner
Thanks for clarifying @tgravescs ! I was looking at `RangePartitioner` and
variants and was wondering what I was missing -
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/22101
LGTM pending Xiao Li's excellent suggestion :-)
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/22112
I am not sure what the definition of `isIdempotent` here is.
For example, from MapPartitionsRDD :
```
override private[spark] def isIdempotent = {
  if (inputOrderSensitive
```
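The snippet is cut off in the archive; a purely hypothetical completion, just to show the shape being discussed (the actual body is in the PR under review and may differ):
```
// Hypothetical completion - not the PR's actual code. The idea: an
// order-sensitive closure only yields repeatable output if its parent
// produces records in a repeatable order.
override private[spark] def isIdempotent: Boolean = {
  if (inputOrderSensitive) {
    prev.isIdempotent
  } else {
    true
  }
}
```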
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/22112
You are perfectly correct @jiangxb1987, that was a silly mistake on my part
- and not trivial at all !
It is the shuffle dependency that we should rely on when traversing the
dependency tree, not
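A sketch of the kind of traversal being referred to, using only standard RDD APIs (not the PR's code): walk the dependency tree and treat shuffle dependencies as stage boundaries.
```
import org.apache.spark.ShuffleDependency
import org.apache.spark.rdd.RDD

// Collect ancestors reachable through narrow dependencies only; a
// shuffle dependency marks a stage boundary (and hence the scope of
// any recomputation), so the walk stops there.
def narrowAncestors(rdd: RDD[_]): Seq[RDD[_]] =
  rdd.dependencies.flatMap {
    case _: ShuffleDependency[_, _, _] => Nil // stage boundary: stop here
    case dep => dep.rdd +: narrowAncestors(dep.rdd) // narrow: keep walking
  }
```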
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/22112
I agree @tgravescs, I was looking at the implementation to understand what
the expectations are wrt newly introduced methods/fields and whether they make
sense : I did not see any details furnished
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/22112
@tgravescs I was specifically in agreement with
> Personally I don't want to talk about implementation until we decide what
we want our semantics to be around the unordered operations
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/22112
@tgravescs To understand better, are you suggesting that we do not support
any API and/or user closure that depends on input order ?
If yes, that would break not just repartition + shuffle, but
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/22112#discussion_r210756079
--- Diff: core/src/main/scala/org/apache/spark/Dependency.scala ---
@@ -94,6 +94,16 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C:
ClassTag
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/22112#discussion_r210967814
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
---
@@ -1441,6 +1441,44 @@ class DAGScheduler(
failedStages
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/22112#discussion_r210964794
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -1864,6 +1877,22 @@ abstract class RDD[T: ClassTag](
// From performance concern
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/22112#discussion_r210788359
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -112,6 +112,11 @@ abstract class RDD[T: ClassTag](
/**
* :: DeveloperApi
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/22112#discussion_r210963665
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -1864,6 +1877,22 @@ abstract class RDD[T: ClassTag](
// From performance concern
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/22112#discussion_r210963213
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -853,6 +861,11 @@ abstract class RDD[T: ClassTag](
* second element in each RDD
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/22112
@tgravescs Please see
https://github.com/apache/spark/pull/22112#discussion_r210788359 for a further
elaboration. We actually cannot support random order (except for a small subset
of cases like map
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/22112
Catching up on discussion ...
@cloud-fan
> shuffled RDD will never be deterministic unless the shuffle key is the
entire record and key ordering is specified.
Let me rephr
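An illustrative contrast of the two cases being debated, assuming `sc` is a `SparkContext` (not code from the thread):
```
val data = sc.parallelize(1 to 100000, 10)

// Reproducible: the shuffle key is the entire record and the output is
// ordered by it, so a recomputed partition holds the same records.
val stable = data.sortBy(identity)

// Not reproducible in general: records are assigned round-robin over an
// input order that may itself change on recomputation (e.g. when the
// parent is the output of another shuffle).
val unstable = data.repartition(5)
```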
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/22112
@tgravescs:
> The shuffle simply transfers the bytes its supposed to. Sparks shuffle of
those bytes is not consistent in that the order it fetches from can change and
without the s
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/22112#discussion_r212192065
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -1876,6 +1920,22 @@ abstract class RDD[T: ClassTag](
*/
object RDD
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/22112#discussion_r212192772
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -1865,6 +1876,39 @@ abstract class RDD[T: ClassTag](
// RDD chain
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/22112#discussion_r212193206
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -1865,6 +1876,39 @@ abstract class RDD[T: ClassTag](
// RDD chain
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/22112#discussion_r212192261
--- Diff: core/src/main/scala/org/apache/spark/rdd/MapPartitionsRDD.scala
---
@@ -54,4 +58,12 @@ private[spark] class MapPartitionsRDD[U: ClassTag, T
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/22112#discussion_r212190267
--- Diff: core/src/main/scala/org/apache/spark/rdd/MapPartitionsRDD.scala
---
@@ -32,12 +32,16 @@ import org.apache.spark.{Partition, TaskContext
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/22112#discussion_r212196598
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
---
@@ -1502,6 +1502,53 @@ private[spark] class DAGScheduler
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/22112#discussion_r212199007
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
---
@@ -1502,6 +1502,53 @@ private[spark] class DAGScheduler
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/22112#discussion_r212193814
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -1876,6 +1920,22 @@ abstract class RDD[T: ClassTag](
*/
object RDD
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/22112#discussion_r212198632
--- Diff:
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -70,7 +70,8 @@ class MyRDD(
numPartitions: Int
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/22112#discussion_r212197939
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
---
@@ -1502,6 +1502,53 @@ private[spark] class DAGScheduler
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/22112#discussion_r212192600
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -855,16 +858,17 @@ abstract class RDD[T: ClassTag](
* a map on the other
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/22112#discussion_r212199604
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
---
@@ -1502,6 +1502,53 @@ private[spark] class DAGScheduler
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/22112#discussion_r212195284
--- Diff:
core/src/main/scala/org/apache/spark/rdd/ZippedPartitionsRDD.scala ---
@@ -95,6 +99,18 @@ private[spark] class ZippedPartitionsRDD2[A: ClassTag
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/22112#discussion_r212385688
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -1865,6 +1876,39 @@ abstract class RDD[T: ClassTag](
// RDD chain
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/22112#discussion_r212386645
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -1865,6 +1876,39 @@ abstract class RDD[T: ClassTag](
// RDD chain
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/21698
@jiangxb1987 I am guessing we should close this PR ?
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/22112#discussion_r212451081
--- Diff:
core/src/main/scala/org/apache/spark/rdd/ZippedPartitionsRDD.scala ---
@@ -95,6 +99,18 @@ private[spark] class ZippedPartitionsRDD2[A: ClassTag
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/22112#discussion_r212452014
--- Diff:
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -2627,6 +2632,81 @@ class DAGSchedulerSuite extends SparkFunSuite
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/22112#discussion_r212462874
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
---
@@ -1502,6 +1502,53 @@ private[spark] class DAGScheduler
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/22209#discussion_r212468772
--- Diff:
core/src/main/scala/org/apache/spark/status/AppStatusListener.scala ---
@@ -350,11 +350,16 @@ private[spark] class AppStatusListener
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/22209#discussion_r212490964
--- Diff:
core/src/main/scala/org/apache/spark/status/AppStatusListener.scala ---
@@ -350,11 +350,16 @@ private[spark] class AppStatusListener
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/22209#discussion_r212491921
--- Diff:
core/src/main/scala/org/apache/spark/status/AppStatusListener.scala ---
@@ -350,11 +350,16 @@ private[spark] class AppStatusListener
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/21698
@jiangxb1987 A different number of output rows is due to data loss - it is
not another valid run.
A complete re-execution of the job in this case could result in a different
ordering, but
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/21698
@cloud-fan There is no ambiguity in output of map - one record in, one
record out.
In case of zip, as you said, number of output records is min of both.
Given this, there is no ambiguity in
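The min-of-both behavior referred to is plain iterator zip semantics; in standard Scala:
```
// zip stops as soon as either side is exhausted: three pairs, the 4 is dropped.
val left = Iterator(1, 2, 3, 4)
val right = Iterator("a", "b", "c")
val pairs = left.zip(right).toList // List((1,"a"), (2,"b"), (3,"c"))
```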
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/21440#discussion_r202149408
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala
---
@@ -723,7 +728,9 @@ private[spark] class BlockManager
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/21440#discussion_r202150730
--- Diff:
core/src/main/scala/org/apache/spark/util/io/ChunkedByteBuffer.scala ---
@@ -166,6 +170,34 @@ private[spark] class ChunkedByteBuffer(var chunks
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/21698
@cloud-fan That depends on what the computeKey is doing - which is user
defined. It can have different values, or it need not (again, depends on user
data and closure being applied
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/21698
@jiangxb1987 data loss comes because a re-execution of zip might generate a
key for which the corresponding reducer has already finished.
Hence re-execution of stage will not result in subsequent
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/21698
Taking a step back to analyze the solution for the problem at hand:
There are three main issues with the proposal:
* It does not solve the problem in a general manner.
* I
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/21440#discussion_r203182507
--- Diff:
core/src/main/scala/org/apache/spark/util/io/ChunkedByteBuffer.scala ---
@@ -166,6 +170,34 @@ private[spark] class ChunkedByteBuffer(var chunks
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/21440#discussion_r203273344
--- Diff:
core/src/main/scala/org/apache/spark/util/io/ChunkedByteBuffer.scala ---
@@ -166,6 +170,34 @@ private[spark] class ChunkedByteBuffer(var chunks
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/21440
LGTM, thanks for working on this @squito !
Since there are other ongoing reviews, I will defer to them to merge.
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/21102#discussion_r203314045
--- Diff:
core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala ---
@@ -163,7 +187,7 @@ class OpenHashSet[@specialized(Long, Int) T
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/21102#discussion_r203311109
--- Diff:
core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala ---
@@ -85,9 +85,13 @@ class OpenHashSet[@specialized(Long, Int) T
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/21102#discussion_r203311755
--- Diff:
core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala ---
@@ -85,9 +85,13 @@ class OpenHashSet[@specialized(Long, Int) T
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/21589
I am not seeing the utility of these two methods.
`defaultParallelism` already captures the current number of cores.
For monitoring usecases, existing events fired via listener can be
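A minimal example of the existing alternative he points to, assuming `sc` is a `SparkContext`:
```
// defaultParallelism already reflects the scheduler backend's current
// view of total available slots, so it can be polled directly instead
// of adding new methods.
val slots = sc.defaultParallelism
val rdd = sc.parallelize(1 to 10000, numSlices = slots)
```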
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/21729
Looks good to me, thanks for fixing this @hthuynh2 !
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/21221#discussion_r203319952
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala ---
@@ -160,11 +160,29 @@ case class
SparkListenerBlockUpdated
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/21758
I had left a few comments on SPARK-24375 @jiangxb1987 ... unfortunately the
jiras have moved around a bit.
If this is active PR for introducing the feature, would be great to get
clari
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/21102#discussion_r203322056
--- Diff:
core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala ---
@@ -163,7 +187,7 @@ class OpenHashSet[@specialized(Long, Int) T
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/21102#discussion_r203322643
--- Diff:
core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala ---
@@ -163,7 +187,7 @@ class OpenHashSet[@specialized(Long, Int) T
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/21589
I am not convinced by the rationale given for adding the new APIs in the
jira.
The examples given there can be easily modeled using `defaultParallelism`
(to get current state
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/21589
+CC @markhamstra since you were looking at API stability.
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/21221#discussion_r203489913
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala ---
@@ -160,11 +160,29 @@ case class
SparkListenerBlockUpdated
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/21589
@MaxGekk The example you cite is literally one of a handful of usages
that are not easily overridden - and it is prefixed with a 'HACK ALERT' ! A few
others are in mllib, typically for read
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/21589
@MaxGekk We are going in circles.
I don't think this is a good API to expose currently - the data is available
through multiple other means as I detailed and while not a succinct oneliner,
it is
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/21653#discussion_r204199580
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -723,6 +723,21 @@ private[spark] class TaskSetManager(
def
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/21895#discussion_r205948923
--- Diff:
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
@@ -973,6 +973,38 @@ private[history] object FsHistoryProvider
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/21895
+CC @jerryshao
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/21577#discussion_r196235040
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
@@ -109,20 +116,21 @@ private[spark] class
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/21527
+1 on making this configurable.
Like @tgravescs, I don't like hardcoded constants - all for making it a
private config not necessarily exposed to users.
Will allow developers to tune it as
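The shape of such a private config in Spark's internal config framework, with a hypothetical key name (`spark.scheduler.someLimit` is made up for illustration):
```
import org.apache.spark.internal.config.ConfigBuilder

// Hypothetical key, for illustration only: .internal() marks the entry
// as a private config, undocumented and not part of the user-facing API.
private[spark] val SOME_LIMIT = ConfigBuilder("spark.scheduler.someLimit")
  .internal()
  .intConf
  .createWithDefault(16)
```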
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/21577#discussion_r196282241
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
@@ -109,20 +116,21 @@ private[spark] class
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/21577#discussion_r196288823
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
@@ -109,20 +116,21 @@ private[spark] class
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/21577#discussion_r196291657
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
@@ -109,20 +116,21 @@ private[spark] class
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/21577#discussion_r196514785
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
@@ -109,20 +116,21 @@ private[spark] class
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/21577
Thanks for the changes @vanzin, looks good to me !
Ideally would have been great to test the speculative execution part as
well; but that would be fairly nasty to reliably reproduce I guess
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/21577
> * t2 finishes before that kill message arrives, is allowed to commit.
> If that can happen it would generate a duplicate map output; but my guess
(hope?) is that the map output tracker
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/16677#discussion_r197606877
--- Diff:
core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java
---
@@ -145,10 +145,12 @@ public void write(Iterator> reco
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/16677#discussion_r197607227
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala ---
@@ -93,25 +96,94 @@ trait BaseLimitExec extends UnaryExecNode with
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/16677
I am not sure if I am missing something - the counts obtained are at the map-
side output, per (map-side) partition; while the limit is being computed at the reduce
side (after some arbitrary partiti
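A toy illustration of the mismatch (hypothetical numbers): per-map-task counts pin down the global total, but say nothing about the contents of any reduce-side partition.
```
// Rows emitted by each of three map tasks (hypothetical).
val mapSideCounts = Seq(100L, 250L, 75L)
val globalTotal = mapSideCounts.sum // 425 - this much *is* known map-side
// How many of those rows land in the reduce partition evaluating the
// limit depends entirely on the partitioner, so per-map counts alone
// cannot answer a reduce-side limit.
```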
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/16677
If there is some codepath not updating shuffle write metrics (introduced
for sql), that would be a bug.
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/16677#discussion_r198360170
--- Diff:
core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java
---
@@ -145,10 +145,12 @@ public void write(Iterator> reco
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/21656#discussion_r200804756
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -772,6 +772,12 @@ private[spark] class TaskSetManager(
private
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/21653#discussion_r200805005
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -723,6 +723,13 @@ private[spark] class TaskSetManager(
def
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/21221#discussion_r200805499
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala ---
@@ -160,11 +160,29 @@ case class
SparkListenerBlockUpdated
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/21698
I did not go over the PR itself in detail, but the proposal sounds very
expensive - particularly given the cascading costs involved.
Also, I am not sure why we are special-casing
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/21698
@jiangxb1987 Any closure sensitive to iteration order [1] is affected by
this - under the set of circumstances.
If we cannot solve it in a principled manner (make shuffle repeatable which
I
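An example of such an order-sensitive closure, with `rdd` standing in for any RDD whose iteration order can change on recomputation:
```
// Order-sensitive closure: the indices it assigns change whenever the
// iteration order of the input partition changes, so recomputation
// after a fetch failure can silently produce different pairs.
val tagged = rdd.mapPartitions { iter =>
  iter.zipWithIndex.map { case (value, index) => (index.toLong, value) }
}
```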
Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/21729#discussion_r201415032
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -87,7 +87,7 @@ private[spark] class TaskSetManager(
// Set the
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/21698
@cloud-fan The difference would be between a (user) defined record order
(global sort or local sort) and expectation of repeatable record order on
recomputation.
It might also be a good idea to
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/21698
@cloud-fan We should not look at a particular stage in isolation, but
rather what happens when there are failures in the middle of a job with
multiple shuffle stages - and zip is one of the
Github user mridulm commented on the pull request:
https://github.com/apache/spark/pull/2729#issuecomment-58473957
At least for yarn, this will create issues if overridden from default.
Not sure about mesos.
Why not use the standard Java property and define it for local and
Github user mridulm commented on the pull request:
https://github.com/apache/spark/pull/2729#issuecomment-58479810
There is a java property which controls this ... java.io.tmpdir
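For reference, `java.io.tmpdir` is the standard JVM property he means:
```
// Standard JVM property: set per process, e.g. with
//   -Djava.io.tmpdir=/mnt/scratch
// and readable at runtime.
val tmpDir = System.getProperty("java.io.tmpdir")
println(s"JVM temp dir: $tmpDir")
```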
Github user mridulm commented on the pull request:
https://github.com/apache/spark/pull/2742#issuecomment-58724312
This needs to be configurable ... IIRC this was customizable in 1.1.
Different limits exist for vm vs heap memory in yarn (for example).
Github user mridulm commented on the pull request:
https://github.com/apache/spark/pull/2742#issuecomment-58728241
With 1.1, in experiments, we have done both: depending on whether our user code
is mmap'ing too much data (and so we pull things into heap .. using
libraries not i