[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)

2014-12-05 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14236131#comment-14236131
 ] 

Sandy Ryza commented on SPARK-3655:
---

foldLeft only conceptually makes sense when applied to an ordered collection.  
Requiring an Ordering for the values seems reasonable to me, though maybe 
explicitly calling it foldLeftSortedValuesByKey would be clearer?
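To make the semantics concrete, here is a minimal pure-Python sketch of what such a hypothetical foldLeftSortedValuesByKey would compute (the name and signature come only from the proposal above, not from any existing Spark API; in Spark the grouping would happen per-partition during the shuffle rather than in a dict):

```python
from collections import defaultdict
from functools import reduce

def fold_left_sorted_values_by_key(pairs, zero, fold, ordering=None):
    # Group values by key, sort each group with the required Ordering,
    # then fold left over the sorted values starting from `zero`.
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return {k: reduce(fold, sorted(vs, key=ordering), zero)
            for k, vs in grouped.items()}

# Concatenating each key's values gives a deterministic result only
# because the values are folded in sorted order.
pairs = [("a", 3), ("a", 1), ("b", 2), ("a", 2)]
fold_left_sorted_values_by_key(pairs, "", lambda acc, v: acc + str(v))
# {'a': '123', 'b': '2'}
```

The key point is the same as in the comment: the fold is only well-defined once the values have an order.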

> Support sorting of values in addition to keys (i.e. secondary sort)
> ---
>
> Key: SPARK-3655
> URL: https://issues.apache.org/jira/browse/SPARK-3655
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: koert kuipers
>Assignee: Koert Kuipers
>Priority: Minor
>
> Now that Spark has a sort-based shuffle, can we expect a secondary sort soon? 
> There are some use cases where getting a sorted iterator of values per key is 
> helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4770) spark.scheduler.minRegisteredResourcesRatio documented default is incorrect for YARN

2014-12-05 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-4770:
-

 Summary: spark.scheduler.minRegisteredResourcesRatio documented 
default is incorrect for YARN
 Key: SPARK-4770
 URL: https://issues.apache.org/jira/browse/SPARK-4770
 Project: Spark
  Issue Type: Bug
  Components: Documentation, YARN
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza









[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)

2014-12-05 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14235795#comment-14235795
 ] 

Sandy Ryza commented on SPARK-3655:
---

The repartitionAndSortWithinPartitions approach seems preferable to me.
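For reference, the recipe can be sketched outside Spark (plain Python standing in for the shuffle; the partitioning and sort mirror what repartitionAndSortWithinPartitions does when given a key-only Partitioner and a composite-key ordering):

```python
def secondary_sort(pairs, num_partitions=2):
    # Partition on the key alone, so all records for a key land in the
    # same partition (the job of a custom Partitioner in Spark).
    partitions = [[] for _ in range(num_partitions)]
    for k, v in pairs:
        partitions[hash(k) % num_partitions].append((k, v))
    # Sort each partition by the composite (key, value): within a
    # partition, each key's values now appear contiguously and in order.
    return [sorted(p) for p in partitions]

data = [(2, "y"), (1, "c"), (1, "a"), (2, "x"), (1, "b")]
secondary_sort(data)
# [[(2, 'x'), (2, 'y')], [(1, 'a'), (1, 'b'), (1, 'c')]]
```

A downstream iterator can then stream over each partition and detect key boundaries without ever materializing all of a key's values at once.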







[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)

2014-12-05 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14235758#comment-14235758
 ] 

Sandy Ryza commented on SPARK-3655:
---

Hey [~koert], I think the transform that would most closely mimic MR-style 
secondary sort would be a groupByKeyAndSortValues.  Not that a foldBy 
transformation wouldn't be useful as well.  Do you have any interest in 
implementing the former?  If not, I'll have a go at it.







[jira] [Commented] (SPARK-4687) SparkContext#addFile doesn't keep file folder information

2014-12-03 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14233385#comment-14233385
 ] 

Sandy Ryza commented on SPARK-4687:
---

[~pwendell], do you think this is a reasonable API addition?  If so, I'll try 
and add it.

> SparkContext#addFile doesn't keep file folder information
> -
>
> Key: SPARK-4687
> URL: https://issues.apache.org/jira/browse/SPARK-4687
> Project: Spark
>  Issue Type: Bug
>Reporter: Jimmy Xiang
>
> Files added with SparkContext#addFile are loaded with Utils#fetchFile before 
> a task starts. However, Utils#fetchFile puts all files under the Spark root 
> on the worker node. We should have an option to keep the folder information. 






[jira] [Created] (SPARK-4716) Avoid shuffle when all-to-all operation has single input and output partition

2014-12-03 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-4716:
-

 Summary: Avoid shuffle when all-to-all operation has single input 
and output partition
 Key: SPARK-4716
 URL: https://issues.apache.org/jira/browse/SPARK-4716
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Sandy Ryza


I encountered an application that performs joins on a bunch of small RDDs, 
unions the results, and then performs larger aggregations across them.  Many of 
these small RDDs fit in a single partition.  For these operations with only a 
single partition, there's no reason to write data to disk and then fetch it 
over a socket.
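The observation reduces to this: when an all-to-all operation has exactly one input and one output partition, the "shuffle" is just a local pass. A plain-Python sketch of a single-partition hash join (illustrative only, not Spark's code):

```python
from collections import defaultdict

def single_partition_join(left, right):
    # With one partition on each side, no repartitioning is needed:
    # the join can run as a local hash join, with no disk writes and
    # no fetching of map output over a socket.
    table = defaultdict(list)
    for k, v in left:
        table[k].append(v)
    return [(k, (v, w)) for k, w in right for v in table[k]]

single_partition_join([("a", 1), ("b", 2)], [("a", 10)])
# [('a', (1, 10))]
```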







[jira] [Commented] (SPARK-4630) Dynamically determine optimal number of partitions

2014-11-30 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229371#comment-14229371
 ] 

Sandy Ryza commented on SPARK-4630:
---

Hey [~pwendell], Spark deals much better with large numbers of partitions than 
MR, but, in our deployments, we've found the number of partitions to be by far 
the most important configuration knob in determining an app's performance.  My 
recommendation to novice users is often to just keep doubling the number of 
partitions until problems go away.  I don't have good numbers, but I'd say this 
has been the solution to maybe half of the performance issues I've been asked 
to assist with.

Most issues we hit come from not having enough partitions, which causes high GC 
time and unnecessary spilling.  As for why we don't just recommend a parallelism 
of 100,000 all the time: the eventual overheads that would kick in are more 
state to hold on the driver, more random I/O in the shuffle as tasks fetch 
smaller map output blocks, and more task startup overhead.

So why not just make Spark's existing heuristic use a higher number of 
partitions? 

First, it's unclear how we would actually do this.  The heuristic we have now 
is to rely on the MR input format for the number of partitions in input stages 
and then, for child stages, to use the number of partitions in the parent 
stage.  A heuristic that, for example, doubled the number of partitions in the 
parent stage wouldn't make sense in the common case of a reduceByKey.

Second, a heuristic that doesn't take the actual size of the data into account 
can't handle situations where a stage's output is drastically larger or smaller 
than its input.  Smaller output is typical for reduceByKey-style operations.  
Larger output is typical when flattening data or when reading highly compressed 
files.  A typical 1GB Parquet split will decompress to 2-5GB in memory - 
running a sortByKey, join, or groupByKey forces each reducer to sort/aggregate 
that much data.
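As a toy illustration of why a size-aware heuristic helps (the function and its 128MB target are illustrative assumptions, not anything Spark implements): sizing a stage by its estimated output, rather than by the parent's partition count, adapts to stages whose output is much larger or smaller than their input.

```python
import math

def suggest_num_partitions(estimated_output_bytes,
                           target_partition_bytes=128 * 1024 * 1024):
    # Aim each partition at a fixed target size, regardless of how many
    # partitions the parent stage happened to have.
    return max(1, math.ceil(estimated_output_bytes / target_partition_bytes))

# A 1GB compressed split that decompresses to 4GB would get ~32
# partitions, not the 1 partition the input stage would suggest.
suggest_num_partitions(4 * 1024**3)
# 32
```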
 
Ok, that was a long comment.  I obviously haven't been able to exhaust the 
whole heuristic space, so any simpler ideas that I'm not thinking of are of 
course appreciated.

> Dynamically determine optimal number of partitions
> --
>
> Key: SPARK-4630
> URL: https://issues.apache.org/jira/browse/SPARK-4630
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Kostas Sakellis
>Assignee: Kostas Sakellis
>
> Partition sizes play a big part in how fast stages execute during a Spark 
> job. There is a direct relationship between the size of partitions and the 
> number of tasks - larger partitions, fewer tasks. For better performance, 
> Spark has a sweet spot for how large the partitions executed by a task 
> should be. If partitions are too small, the user pays a disproportionate 
> cost in scheduling overhead. If partitions are too large, task execution 
> slows down due to GC pressure and spilling to disk.
> To increase the performance of jobs, users often hand-optimize the number 
> (size) of partitions that the next stage gets. Factors that come into play are:
> * incoming partition sizes from the previous stage
> * number of available executors
> * available memory per executor (taking into account 
> spark.shuffle.memoryFraction)
> Spark has access to this data and so should be able to do the partition 
> sizing for the user automatically. This feature could be turned on/off with a 
> configuration option. 
> To make this happen, we propose modifying the DAGScheduler to take into 
> account partition sizes upon stage completion. Before scheduling the next 
> stage, the scheduler can examine the sizes of the partitions and determine 
> the appropriate number of tasks to create. Since this change requires 
> non-trivial modifications to the DAGScheduler, a detailed design doc will be 
> attached before proceeding with the work.






[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory

2014-11-26 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226969#comment-14226969
 ] 

Sandy Ryza commented on SPARK-4452:
---

Thinking about the current change a little more, an issue is that it will spill 
all the in-memory data to disk in situations where this is probably overkill.  
E.g. consider the typical situation of shuffle data slightly exceeding memory.  
We end up spilling the entire data structure if a downstream data structure 
needs even a small amount of memory.

I think that your proposed change 2 is probably worthwhile.

> Shuffle data structures can starve others on the same thread for memory 
> 
>
> Key: SPARK-4452
> URL: https://issues.apache.org/jira/browse/SPARK-4452
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Tianshuo Deng
>Assignee: Tianshuo Deng
>Priority: Critical
>
> When an Aggregator is used with an ExternalSorter in a task, Spark will create 
> many small files and can cause a "too many open files" error during merging.
> Currently, ShuffleMemoryManager does not work well when there are 2 spillable 
> objects in a thread, which in this case are ExternalSorter and 
> ExternalAppendOnlyMap (used by Aggregator). Here is an example: due to the use 
> of map-side aggregation, an ExternalAppendOnlyMap is created first to read the 
> RDD. It may ask for as much memory as it can get, which is 
> totalMem/numberOfThreads. Then later on, when an ExternalSorter is created in 
> the same thread, the ShuffleMemoryManager can refuse to allocate more memory 
> to it, since the memory is already given to the previously requesting object 
> (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small 
> files (due to the lack of memory).
> I'm currently working on a PR to address these two issues. It will include the 
> following changes:
> 1. The ShuffleMemoryManager should track not only the memory usage for each 
> thread, but also the object that holds the memory.
> 2. The ShuffleMemoryManager should be able to trigger the spilling of a 
> spillable object. In this way, if a new object in a thread requests memory, 
> the old occupant can be evicted/spilled. Previously, spillable objects 
> triggered spilling by themselves, so one might not trigger spilling even if 
> another object in the same thread needed more memory. After this change, the 
> ShuffleMemoryManager can trigger the spilling of an object if it needs to.
> 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously, 
> ExternalAppendOnlyMap returned a destructive iterator and could not be spilled 
> after the iterator was returned. This should be changed so that even after the 
> iterator is returned, the ShuffleMemoryManager can still spill it.
> Currently, I have a working branch in progress: 
> https://github.com/tsdeng/spark/tree/enhance_memory_manager. I've already made 
> change 3 and have a prototype of changes 1 and 2 to evict spillables from the 
> memory manager; still in progress. I will send a PR when it's done.
> Any feedback or thoughts on this change are highly appreciated!






[jira] [Updated] (SPARK-4630) Dynamically determine optimal number of partitions

2014-11-26 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-4630:
--
Assignee: Kostas Sakellis







[jira] [Commented] (SPARK-4628) Put all external projects behind a build flag

2014-11-26 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226722#comment-14226722
 ] 

Sandy Ryza commented on SPARK-4628:
---

This looks like a duplicate of SPARK-4376.  Resolving that one as a duplicate 
because it has no PR.

> Put all external projects behind a build flag
> -
>
> Key: SPARK-4628
> URL: https://issues.apache.org/jira/browse/SPARK-4628
> Project: Spark
>  Issue Type: Improvement
>Reporter: Patrick Wendell
>Priority: Blocker
>
> This is something we talked about doing for convenience, but I'm escalating 
> this based on realizing today that some of our external projects depend on 
> code that is not in maven central. I.e. if one of these dependencies is taken 
> down (as happened recently with mqtt), all Spark builds will fail.
> The proposal here is simple: add a profile -Pexternal-projects that enables 
> these. This can follow the exact pattern of -Pkinesis-asl, which was disabled 
> by default due to a license issue.






[jira] [Resolved] (SPARK-4376) Put external modules behind build profiles

2014-11-26 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza resolved SPARK-4376.
---
Resolution: Duplicate

> Put external modules behind build profiles
> --
>
> Key: SPARK-4376
> URL: https://issues.apache.org/jira/browse/SPARK-4376
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Patrick Wendell
>Assignee: Sandy Ryza
>Priority: Blocker
>
> Several people have asked me whether, to speed up the build, we can put the 
> external projects behind build flags similar to the kinesis-asl module. Since 
> these aren't in the assembly there isn't a great reason to build them by 
> default. We can just modify our release script to build them, and build them 
> when we run tests.
> This doesn't technically block Spark 1.2, but it is going to be looped into a 
> separate fix that does block Spark 1.2, so I'm upgrading it to a blocker.






[jira] [Created] (SPARK-4617) Fix spark.yarn.applicationMaster.waitTries doc

2014-11-25 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-4617:
-

 Summary: Fix spark.yarn.applicationMaster.waitTries doc
 Key: SPARK-4617
 URL: https://issues.apache.org/jira/browse/SPARK-4617
 Project: Spark
  Issue Type: Bug
  Components: Documentation, YARN
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza









[jira] [Updated] (SPARK-4584) 2x Performance regression for Spark-on-YARN

2014-11-25 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-4584:
--
Assignee: Marcelo Vanzin  (was: Sandy Ryza)

> 2x Performance regression for Spark-on-YARN
> ---
>
> Key: SPARK-4584
> URL: https://issues.apache.org/jira/browse/SPARK-4584
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0
>Reporter: Nishkam Ravi
>Assignee: Marcelo Vanzin
>Priority: Blocker
>
> Significant performance regression observed for Spark-on-YARN (up to 2x) after 
> the 1.2 rebase. The offending commit is: 70e824f750aa8ed446eec104ba158b0503ba58a9 
> from Oct 7th. The problem can be reproduced with JavaWordCount against a large 
> enough input dataset in YARN cluster mode.






[jira] [Assigned] (SPARK-4584) 2x Performance regression for Spark-on-YARN

2014-11-25 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza reassigned SPARK-4584:
-

Assignee: Sandy Ryza

> 2x Performance regression for Spark-on-YARN
> ---
>
> Key: SPARK-4584
> URL: https://issues.apache.org/jira/browse/SPARK-4584
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0
>Reporter: Nishkam Ravi
>Assignee: Sandy Ryza
>Priority: Blocker
>
> Significant performance regression observed for Spark-on-YARN (up to 2x) after 
> the 1.2 rebase. The offending commit is: 70e824f750aa8ed446eec104ba158b0503ba58a9 
> from Oct 7th. The problem can be reproduced with JavaWordCount against a large 
> enough input dataset in YARN cluster mode.






[jira] [Commented] (SPARK-4585) Spark dynamic scaling executors use upper limit value as default.

2014-11-25 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224257#comment-14224257
 ] 

Sandy Ryza commented on SPARK-4585:
---

I was discussing this with [~brocknoland].  The issue is that picking a number 
of executors at app start time is difficult.  In the common Hive case, we don't 
know what queries the user will run when the app starts, so even if there were 
a good heuristic, we wouldn't really be able to apply it.  If the number of 
starting executors and max executors are linked, there's no good way to set a 
reasonable default for the starting number of executors that also allows apps 
to scale up to big queries.
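Decoupled settings might look something like this in spark-defaults.conf (a sketch: spark.dynamicAllocation.minExecutors and maxExecutors come from SPARK-3174, while a separate initial-executors knob is hypothetical here):

```
spark.dynamicAllocation.enabled           true
spark.dynamicAllocation.minExecutors      2
spark.dynamicAllocation.maxExecutors      500
# Hypothetical knob: start near the minimum, scale up to big queries.
spark.dynamicAllocation.initialExecutors  2
```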

> Spark dynamic scaling executors use upper limit value as default.
> -
>
> Key: SPARK-4585
> URL: https://issues.apache.org/jira/browse/SPARK-4585
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.1.0
>Reporter: Chengxiang Li
>
> With SPARK-3174, one can configure a minimum and maximum number of executors 
> for a Spark application on Yarn. However, the application always starts with 
> the maximum. It seems more reasonable, at least for Hive on Spark, to start 
> from the minimum and scale up as needed up to the maximum.






[jira] [Updated] (SPARK-4585) Spark dynamic scaling executors use upper limit value as default.

2014-11-25 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-4585:
--
Issue Type: Improvement  (was: Bug)







[jira] [Commented] (SPARK-4584) 2x Performance regression for Spark-on-YARN

2014-11-25 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224185#comment-14224185
 ] 

Sandy Ryza commented on SPARK-4584:
---

I took a look at the jobs Nishkam ran before and after that commit.  The second 
stage in the "before" job takes 69 seconds and the second stage in the "after" 
job takes 158 seconds.  This seems to be caused by the individual tasks taking 
longer.

I'm totally confused about how that commit could have caused the regression.

> 2x Performance regression for Spark-on-YARN
> ---
>
> Key: SPARK-4584
> URL: https://issues.apache.org/jira/browse/SPARK-4584
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0
>Reporter: Nishkam Ravi
>Priority: Blocker
>
> Significant performance regression observed for Spark-on-YARN (up to 2x) after 
> the 1.2 rebase. The offending commit is: 70e824f750aa8ed446eec104ba158b0503ba58a9 
> from Oct 7th. The problem can be reproduced with JavaWordCount against a large 
> enough input dataset in YARN cluster mode.






[jira] [Assigned] (SPARK-4447) Remove layers of abstraction in YARN code no longer needed after dropping yarn-alpha

2014-11-24 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza reassigned SPARK-4447:
-

Assignee: Sandy Ryza  (was: Patrick Wendell)

> Remove layers of abstraction in YARN code no longer needed after dropping 
> yarn-alpha
> 
>
> Key: SPARK-4447
> URL: https://issues.apache.org/jira/browse/SPARK-4447
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.3.0
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>
> For example, YarnRMClient and YarnRMClientImpl can be merged
> YarnAllocator and YarnAllocationHandler can be merged






[jira] [Assigned] (SPARK-4447) Remove layers of abstraction in YARN code no longer needed after dropping yarn-alpha

2014-11-24 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza reassigned SPARK-4447:
-

Assignee: Patrick Wendell

> Remove layers of abstraction in YARN code no longer needed after dropping 
> yarn-alpha
> 
>
> Key: SPARK-4447
> URL: https://issues.apache.org/jira/browse/SPARK-4447
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.3.0
>Reporter: Sandy Ryza
>Assignee: Patrick Wendell
>
> For example, YarnRMClient and YarnRMClientImpl can be merged
> YarnAllocator and YarnAllocationHandler can be merged






[jira] [Updated] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests

2014-11-24 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-4352:
--
Description: 
Currently, achieving data locality in Spark is difficult unless an application 
takes resources on every node in the cluster.  preferredNodeLocalityData 
provides a sort of hacky workaround that has been broken since 1.0.

With dynamic executor allocation, Spark requests executors in response to 
demand from the application.  When this occurs, it would be useful to look at 
the pending tasks and communicate their location preferences to the cluster 
resource manager. 

  was:
Currently, achieving data locality in Spark is difficult u

preferredNodeLocalityData provides a sort of hacky workaround that has been 
broken since 1.0.


> Incorporate locality preferences in dynamic allocation requests
> ---
>
> Key: SPARK-4352
> URL: https://issues.apache.org/jira/browse/SPARK-4352
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.2.0
>Reporter: Sandy Ryza
>
> Currently, achieving data locality in Spark is difficult unless an 
> application takes resources on every node in the cluster.  
> preferredNodeLocalityData provides a sort of hacky workaround that has been 
> broken since 1.0.
> With dynamic executor allocation, Spark requests executors in response to 
> demand from the application.  When this occurs, it would be useful to look at 
> the pending tasks and communicate their location preferences to the cluster 
> resource manager. 






[jira] [Updated] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests

2014-11-24 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-4352:
--
Description: 
Currently, achieving data locality in Spark is difficult u

preferredNodeLocalityData provides a sort of hacky workaround that has been 
broken since 1.0.

> Incorporate locality preferences in dynamic allocation requests
> ---
>
> Key: SPARK-4352
> URL: https://issues.apache.org/jira/browse/SPARK-4352
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.2.0
>Reporter: Sandy Ryza
>
> Currently, achieving data locality in Spark is difficult u
> preferredNodeLocalityData provides a sort of hacky workaround that has been 
> broken since 1.0.






[jira] [Created] (SPARK-4569) Rename "externalSorting" in Aggregator

2014-11-23 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-4569:
-

 Summary: Rename "externalSorting" in Aggregator
 Key: SPARK-4569
 URL: https://issues.apache.org/jira/browse/SPARK-4569
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Priority: Trivial


While technically all spilling in Spark does result in sorting, calling this 
variable externalSorting makes it seem like ExternalSorter will be used, when 
in fact it just indicates whether spilling is enabled.






[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory

2014-11-23 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222572#comment-14222572
 ] 

Sandy Ryza commented on SPARK-4452:
---

[~tianshuo], I took a look at the patch, and the general approach looks 
reasonable to me.

A couple additional thoughts that apply both to the current approach and 
Tianshuo's patch:
* When we chain an ExternalAppendOnlyMap to an ExternalSorter for processing 
combined map outputs in sort based shuffle, we end up double counting, no? Both 
data structures will be holding references to the same objects and estimating 
their size based on these objects.
* We could make the in-memory iterators destructive as well, right?  I.e. if the 
data structures can release references to objects as they yield them, then we 
can give memory back to the shuffle memory manager and make it available to 
other data structures in the same thread.

If we can avoid double counting and holding on to unneeded objects, it would 
obviate some of the need for intra-thread limits / forced spilling.
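The destructive in-memory iterator idea can be pictured with a small sketch (illustrative Python, not Spark's actual classes; the `on_release` callback stands in for a hypothetical memory-manager hook): each element's reference is dropped from the backing buffer as it is yielded, so its memory could be handed back and reused by other data structures in the same thread.

```python
class DestructiveIterator:
    """Yields elements while releasing the buffer's references to them,
    so a (hypothetical) memory manager could reclaim their estimated size."""

    def __init__(self, buffer, on_release):
        self.buffer = buffer
        self.on_release = on_release  # callback to the memory manager
        self.pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.pos >= len(self.buffer):
            raise StopIteration
        elem = self.buffer[self.pos]
        self.buffer[self.pos] = None  # drop the reference: now collectable
        self.on_release(1)            # report one element released
        self.pos += 1
        return elem

released = []
buf = ["a", "b", "c"]
out = list(DestructiveIterator(buf, released.append))
# out == ["a", "b", "c"], buf == [None, None, None]
```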

> Shuffle data structures can starve others on the same thread for memory 
> 
>
> Key: SPARK-4452
> URL: https://issues.apache.org/jira/browse/SPARK-4452
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Tianshuo Deng
>Assignee: Tianshuo Deng
>Priority: Critical
>
> When an Aggregator is used with ExternalSorter in a task, Spark will create 
> many small files, which can cause a "too many open files" error during merging.
> Currently, ShuffleMemoryManager does not work well when there are two 
> spillable objects in a thread, which in this case are ExternalSorter and 
> ExternalAppendOnlyMap (used by Aggregator). Here is an example: due to the use 
> of map-side aggregation, ExternalAppendOnlyMap is created first to read the 
> RDD. It may ask for as much memory as it can get, which is 
> totalMem/numberOfThreads. Then later on, when ExternalSorter is created in the 
> same thread, the ShuffleMemoryManager could refuse to allocate more memory to 
> it, since the memory is already given to the previously requesting object 
> (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small 
> files (due to the lack of memory).
> I'm currently working on a PR to address these two issues. It will include the 
> following changes:
> 1. The ShuffleMemoryManager should not only track the memory usage of each 
> thread, but also the object that holds the memory.
> 2. The ShuffleMemoryManager should be able to trigger the spilling of a 
> spillable object. In this way, if a new object in a thread is requesting 
> memory, the old occupant can be evicted/spilled. Previously, spillable objects 
> triggered spilling by themselves, so one would not trigger spilling even if 
> another object in the same thread needed more memory. After this change, the 
> ShuffleMemoryManager can trigger the spilling of an object if it needs to.
> 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously, 
> ExternalAppendOnlyMap returned a destructive iterator and could not be spilled 
> after the iterator was returned. This should be changed so that even after the 
> iterator is returned, the ShuffleMemoryManager can still spill it.
> Currently, I have a working branch in progress: 
> https://github.com/tsdeng/spark/tree/enhance_memory_manager. I have already 
> made change 3 and have a prototype of changes 1 and 2 to evict spillables from 
> the memory manager; still in progress. I will send a PR when it's done.
> Any feedback or thoughts on this change are highly appreciated!






[jira] [Commented] (SPARK-4550) In sort-based shuffle, store map outputs in serialized form

2014-11-21 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221710#comment-14221710
 ] 

Sandy Ryza commented on SPARK-4550:
---

We don't, though it would allow us to be much more efficient in certain 
situations.

The way sort-based shuffle works right now, the map side only sorts by the 
partition, so we can store this number alongside the serialized record and not 
need to compare keys at all.

SPARK-2926 proposes sorting by keys on the map side.  For that, we'd need to 
deserialize keys before comparing them.  There might be situations where this 
is slower than not serializing them in the first place.  But even in those 
situations, we'd get more reliability by stressing GC less.  It would probably 
be good to define raw comparators for common raw-comparable key types like ints 
and strings.
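A raw comparator for a fixed-width key type can order serialized bytes directly, without deserializing. Here is a small sketch (an illustration, not Spark code) for int keys: a plain unsigned byte comparison only agrees with signed int order if the sign bit is flipped when the key is serialized.

```python
import struct

def serialize_key(i):
    """Write a 32-bit int big-endian with its sign bit flipped, so that
    unsigned lexicographic byte order agrees with signed int order."""
    return struct.pack(">I", (i & 0xFFFFFFFF) ^ (1 << 31))

def deserialize_key(b):
    """Invert serialize_key (only needed here to verify the ordering)."""
    v = struct.unpack(">I", b)[0] ^ (1 << 31)
    return v - (1 << 32) if v >= (1 << 31) else v

keys = [42, -7, 0, 2**31 - 1, -2**31]
# Python bytes compare lexicographically and unsigned, so the serialized
# keys sort correctly without being deserialized first:
sorted_keys = [deserialize_key(b) for b in sorted(serialize_key(k) for k in keys)]
# sorted_keys == [-2**31, -7, 0, 42, 2**31 - 1]
```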

> In sort-based shuffle, store map outputs in serialized form
> ---
>
> Key: SPARK-4550
> URL: https://issues.apache.org/jira/browse/SPARK-4550
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0
>Reporter: Sandy Ryza
>
> One drawback of sort-based shuffle compared to hash-based shuffle is that 
> it ends up storing many more Java objects in memory.  If Spark could store 
> map outputs in serialized form, it could:
> * spill less often because the serialized form is more compact
> * reduce GC pressure
> This will only work when the serialized representations of objects are 
> independent from each other and occupy contiguous segments of memory.  E.g. 
> when Kryo reference tracking is left on, objects may contain pointers to 
> objects farther back in the stream, which means that the sort can't relocate 
> objects without corrupting them.
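The idea can be pictured with a simplified sketch (an illustration, not Spark's implementation): records are serialized into one growing byte buffer, a compact index of (partitionId, offset, length) entries is kept on the side, and the sort permutes only the index. Because each record's bytes are self-contained and contiguous, the sorted output is produced by copying byte segments, never by deserializing.

```python
import io

# Records go into one byte stream; only a compact index is sorted.
data = io.BytesIO()
index = []  # (partition_id, offset, length)

def insert(partition_id, record):
    payload = record.encode("utf-8")  # stand-in for a real serializer
    index.append((partition_id, data.tell(), len(payload)))
    data.write(payload)

insert(1, "banana"); insert(0, "apple"); insert(1, "cherry"); insert(0, "fig")

# Sorting moves only the small index entries; the record bytes never move
# and are never deserialized during the sort.
buf = data.getvalue()
sorted_records = [buf[off:off + ln].decode("utf-8")
                  for _, off, ln in sorted(index, key=lambda e: e[0])]
# sorted_records == ["apple", "fig", "banana", "cherry"]
```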






[jira] [Updated] (SPARK-4550) In sort-based shuffle, store map outputs in serialized form

2014-11-21 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-4550:
--
Summary: In sort-based shuffle, store map outputs in serialized form  (was: 
In sort-based shuffle, store map outputs as serialized)

> In sort-based shuffle, store map outputs in serialized form
> ---
>
> Key: SPARK-4550
> URL: https://issues.apache.org/jira/browse/SPARK-4550
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0
>Reporter: Sandy Ryza
>






[jira] [Created] (SPARK-4550) In sort-based shuffle, store map outputs as serialized

2014-11-21 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-4550:
-

 Summary: In sort-based shuffle, store map outputs as serialized
 Key: SPARK-4550
 URL: https://issues.apache.org/jira/browse/SPARK-4550
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Affects Versions: 1.2.0
Reporter: Sandy Ryza








[jira] [Commented] (SPARK-1956) Enable shuffle consolidation by default

2014-11-21 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221453#comment-14221453
 ] 

Sandy Ryza commented on SPARK-1956:
---

This is of smaller importance now that sort-based shuffle is the default, but 
how do we feel about turning this on by default now?  [~mridulm80], have the 
issues you're aware of been resolved?

> Enable shuffle consolidation by default
> ---
>
> Key: SPARK-1956
> URL: https://issues.apache.org/jira/browse/SPARK-1956
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.0.0
>Reporter: Sandy Ryza
>
> The only drawbacks are on ext3, and most everyone has ext4 at this point.  I 
> think it's better to aim the default at the common case.






[jira] [Commented] (SPARK-2089) With YARN, preferredNodeLocalityData isn't honored

2014-11-20 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220620#comment-14220620
 ] 

Sandy Ryza commented on SPARK-2089:
---

Another possible solution here is SPARK-4352.  Requesting executors at 
locations after jobs have started would mean that users don't need to 
specify locations at all.

> With YARN, preferredNodeLocalityData isn't honored 
> ---
>
> Key: SPARK-2089
> URL: https://issues.apache.org/jira/browse/SPARK-2089
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>Priority: Critical
>
> When running in YARN cluster mode, apps can pass preferred locality data when 
> constructing a Spark context that will dictate where to request executor 
> containers.
> This is currently broken because of a race condition.  The Spark-YARN code 
> runs the user class and waits for it to start up a SparkContext.  During its 
> initialization, the SparkContext will create a YarnClusterScheduler, which 
> notifies a monitor in the Spark-YARN code that the SparkContext is ready.  The 
> Spark-YARN code then immediately fetches the preferredNodeLocationData from 
> the SparkContext and uses it to start requesting containers.
> But in the SparkContext constructor that takes the preferredNodeLocationData, 
> setting preferredNodeLocationData comes after the rest of the initialization, 
> so, if the Spark-YARN code comes around quickly enough after being notified, 
> the data that's fetched is the empty, unset version.  This occurred during all 
> of my runs.
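The race described above boils down to an initialization-ordering bug, which can be reduced to a small sketch (illustrative names, not Spark's actual classes; the listener stands in for the Spark-YARN monitor): the constructor notifies its observer before the field in question is assigned, so an observer that reads the field immediately sees the unset value.

```python
class Context:
    """Sketch of the initialization race: notify first, assign afterwards."""

    def __init__(self, listener):
        # Like YarnClusterScheduler notifying the monitor mid-initialization:
        listener(self)
        # ...but the locality data is only assigned afterwards.
        self.preferred_node_locality_data = {"host1": ["rack1"]}

observed = []

def monitor(ctx):
    # The "monitor" immediately fetches the field, which is not set yet.
    observed.append(getattr(ctx, "preferred_node_locality_data", None))

Context(monitor)
# observed == [None]: the data fetched is the empty, unset version
```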






[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory

2014-11-18 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217340#comment-14217340
 ] 

Sandy Ryza commented on SPARK-4452:
---

[~matei] my point is not that forced spilling allows us to avoid setting 
limits, but that it allows those limits to be soft: if an entity (thread or 
object) is not requesting the 1/N memory reserved for it, that memory can be 
given to other entities that need it.  Then, if the entity later requests the 
memory reserved to it, the other entities above their fair allocation can be 
forced to spill.

(I don't necessarily mean to argue that this advantage is worth the added 
complexity.)
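One way to picture these soft limits (a toy sketch under invented names, not a proposal for the actual ShuffleMemoryManager API): an entity may be granted memory beyond its 1/N reservation while the pool has room, but when an under-quota entity asks for its share, over-quota entities are forced to spill back down to their reservation.

```python
class SoftLimitManager:
    """Toy model of per-entity soft limits with forced spilling."""

    def __init__(self, total, num_entities):
        self.total = total
        self.fair_share = total // num_entities  # the 1/N reservation
        self.granted = {}

    def request(self, entity, amount, spill):
        free = self.total - sum(self.granted.values())
        # An under-quota requester may force over-quota entities back down
        # to their reservation (the preemption-like step).
        if free < amount and self.granted.get(entity, 0) < self.fair_share:
            for other, held in list(self.granted.items()):
                if other != entity and held > self.fair_share:
                    spill(other)  # ask the over-quota entity to spill
                    self.granted[other] = self.fair_share
            free = self.total - sum(self.granted.values())
        give = min(amount, free)
        self.granted[entity] = self.granted.get(entity, 0) + give
        return give

spilled = []
mgr = SoftLimitManager(total=100, num_entities=2)
a = mgr.request("A", 80, spilled.append)  # pool is free: A gets all 80
b = mgr.request("B", 50, spilled.append)  # forces A back to 50; B gets 50
```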

> Shuffle data structures can starve others on the same thread for memory 
> 
>
> Key: SPARK-4452
> URL: https://issues.apache.org/jira/browse/SPARK-4452
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Tianshuo Deng
>Assignee: Tianshuo Deng
>Priority: Critical
>






[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory

2014-11-18 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216933#comment-14216933
 ] 

Sandy Ryza commented on SPARK-4452:
---

One issue with a limits-by-object approach is that it could result in more 
wasted memory than the current approach for tasks that produce less shuffle 
data than they read.  E.g. consider an 
rdd.reduceByKey(...).map(...).reduceByKey(...)... chain.

The object aggregating inputs used to have access to the full memory allotted 
to the task, but now it only gets half the memory.  In situations where the 
object aggregating outputs doesn't need as much memory (because there is less 
output data), some of the memory that previously would have been used is unused.

A forced spilling approach seems like it could give some of the advantages that 
preemption provides in cluster scheduling - better utilization through enabling 
objects to use more than their "fair" amount until it turns out other objects 
need those resources.
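A concrete version of the concern, with invented numbers purely for illustration: under a per-thread limit the input-side aggregator can borrow the output side's idle share, while a hard per-object cap leaves that share unused.

```python
MB = 1 << 20
task_memory = 512 * MB       # memory reserved for the whole task
input_agg_needs = 400 * MB   # object aggregating the (larger) input side
output_agg_needs = 50 * MB   # object aggregating the (smaller) output side

# Per-thread limit: both objects draw from one shared pool.
per_thread_usable = min(input_agg_needs + output_agg_needs, task_memory)

# Hard per-object limit: each object capped at half the task's memory, so
# the input-side aggregator cannot use the output side's idle share.
per_object_cap = task_memory // 2
per_object_usable = (min(input_agg_needs, per_object_cap)
                     + min(output_agg_needs, per_object_cap))
# per_thread_usable == 450 MB, per_object_usable == 306 MB
```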


> Shuffle data structures can starve others on the same thread for memory 
> 
>
> Key: SPARK-4452
> URL: https://issues.apache.org/jira/browse/SPARK-4452
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: Tianshuo Deng
>Assignee: Tianshuo Deng
>Priority: Blocker
>






[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory

2014-11-17 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215448#comment-14215448
 ] 

Sandy Ryza commented on SPARK-4452:
---

Ah, true.

> Shuffle data structures can starve others on the same thread for memory 
> 
>
> Key: SPARK-4452
> URL: https://issues.apache.org/jira/browse/SPARK-4452
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: Tianshuo Deng
>Priority: Blocker
>
> When an Aggregator is used with ExternalSorter in a task, Spark will create 
> many small files, which can cause a "too many open files" error during merging.
> This happens when using the sort-based shuffle. The issue is caused by 
> multiple factors:
> 1. There seems to be a bug in setting the elementsRead variable in 
> ExternalSorter, which renders trackMemoryThreshold (defined in Spillable) 
> useless for triggering spilling; the PR to fix it is 
> https://github.com/apache/spark/pull/3302
> 2. The current ShuffleMemoryManager does not work well when there are two 
> spillable objects in a thread, which in this case are ExternalSorter and 
> ExternalAppendOnlyMap (used by Aggregator). Here is an example: due to the use 
> of map-side aggregation, ExternalAppendOnlyMap is created first to read the 
> RDD. It may ask for as much memory as it can get, which is 
> totalMem/numberOfThreads. Then later on, when ExternalSorter is created in the 
> same thread, the ShuffleMemoryManager could refuse to allocate more memory to 
> it, since the memory is already given to the previously requesting object 
> (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small 
> files (due to the lack of memory).
> I'm currently working on a PR to address these two issues. It will include the 
> following changes:
> 1. The ShuffleMemoryManager should not only track the memory usage of each 
> thread, but also the object that holds the memory.
> 2. The ShuffleMemoryManager should be able to trigger the spilling of a 
> spillable object. In this way, if a new object in a thread is requesting 
> memory, the old occupant can be evicted/spilled. This avoids problem 2 from 
> happening. Previously, spillable objects triggered spilling by themselves, so 
> one would not trigger spilling even if another object in the same thread 
> needed more memory. After this change, the ShuffleMemoryManager can trigger 
> the spilling of an object if it needs to.
> 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously, 
> ExternalAppendOnlyMap returned a destructive iterator and could not be spilled 
> after the iterator was returned. This should be changed so that even after the 
> iterator is returned, the ShuffleMemoryManager can still spill it.
> Currently, I have a working branch in progress: 
> https://github.com/tsdeng/spark/tree/enhance_memory_manager
> I have already made change 3 and have a prototype of changes 1 and 2 to evict 
> spillables from the memory manager; still in progress.
> I will send a PR when it's done.
> Any feedback or thoughts on this change are highly appreciated!






[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory

2014-11-17 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215436#comment-14215436
 ] 

Sandy Ryza commented on SPARK-4452:
---

[~andrewor14], IIUC, (2) shouldn't happen in hash-based shuffle at all, because 
hash-based shuffle doesn't use multiple spillable data structures in each task.

> Shuffle data structures can starve others on the same thread for memory 
> 
>
> Key: SPARK-4452
> URL: https://issues.apache.org/jira/browse/SPARK-4452
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: Tianshuo Deng
>






[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory

2014-11-17 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215269#comment-14215269
 ] 

Sandy Ryza commented on SPARK-4452:
---

Updated the title to reflect the specific problem.

> Shuffle data structures can starve others on the same thread for memory 
> 
>
> Key: SPARK-4452
> URL: https://issues.apache.org/jira/browse/SPARK-4452
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: tianshuo
>






[jira] [Updated] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory

2014-11-17 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-4452:
--
Summary: Shuffle data structures can starve others on the same thread for 
memory   (was: Enhance Sort-based Shuffle to avoid spilling small files)

> Shuffle data structures can starve others on the same thread for memory 
> 
>
> Key: SPARK-4452
> URL: https://issues.apache.org/jira/browse/SPARK-4452
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: tianshuo
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4457) Document how to build for Hadoop versions greater than 2.4

2014-11-17 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-4457:
-

 Summary: Document how to build for Hadoop versions greater than 2.4
 Key: SPARK-4457
 URL: https://issues.apache.org/jira/browse/SPARK-4457
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Priority: Minor









[jira] [Created] (SPARK-4456) Document why spilling depends on both elements read and memory used

2014-11-17 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-4456:
-

 Summary: Document why spilling depends on both elements read and 
memory used
 Key: SPARK-4456
 URL: https://issues.apache.org/jira/browse/SPARK-4456
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Priority: Minor


It's not clear to me why the number of records should have an effect on when a 
data structure spills.  It would probably be useful to document this.
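For context, the check in question couples the two conditions; a plausible rationale is that re-estimating a collection's in-memory size on every insert would be too expensive, so the element count is used to sample the (comparatively costly) memory check only periodically. A simplified sketch of that pattern (modeled loosely on the `maybeSpill` logic in Spark's Spillable trait; the real method also tries to acquire more memory before deciding, and the constant 32 is an assumption taken from that code):

```scala
// Spilling is only considered every 32 elements read, and only once the
// tracked size estimate has crossed the current memory threshold. Both
// conditions must hold, which is why elements read and memory used both
// appear in the condition.
def shouldSpill(elementsRead: Long, currentMemory: Long, memoryThreshold: Long): Boolean =
  elementsRead % 32 == 0 && currentMemory >= memoryThreshold
```

Under this sketch, a collection that crosses the memory threshold between sampling points keeps accumulating until the next multiple of 32 records, which is exactly the interaction worth documenting.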






[jira] [Commented] (SPARK-4452) Enhance Sort-based Shuffle to avoid spilling small files

2014-11-17 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215047#comment-14215047
 ] 

Sandy Ryza commented on SPARK-4452:
---

A third possible fix would be to have the shuffle memory manager consider 
fairness in terms of spillable data structures instead of threads.
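A minimal sketch of that idea (all names hypothetical; this is not Spark's actual ShuffleMemoryManager): grants are tracked per spillable object rather than per thread, and an occupant holding more than its fair share can be told to spill when another consumer asks for memory.

```scala
import scala.collection.mutable

// Each consumer (e.g. an ExternalSorter or ExternalAppendOnlyMap) can be
// asked by the manager to spill, freeing bytes for newer consumers.
trait Spillable {
  def spill(): Long // spill to disk, return the number of bytes freed
}

class ObjectMemoryManager(maxMemory: Long) {
  // Memory granted to each registered spillable object, in bytes.
  private val granted = mutable.LinkedHashMap[Spillable, Long]()

  def tryAcquire(consumer: Spillable, numBytes: Long): Long = synchronized {
    granted.getOrElseUpdate(consumer, 0L)
    val fairShare = maxMemory / granted.size
    // A consumer may hold at most its fair share, so a latecomer is never
    // starved permanently by an earlier consumer that grabbed everything.
    val want = math.min(numBytes, math.max(0L, fairShare - granted(consumer)))
    var free = maxMemory - granted.values.sum
    // Evict over-quota occupants until the request fits or none remain.
    for ((other, held) <- granted.toList
         if free < want && (other ne consumer) && held > fairShare) {
      val freed = math.min(other.spill(), held)
      granted(other) = held - freed
      free += freed
    }
    val grant = math.min(want, free)
    granted(consumer) += grant
    grant
  }
}
```

The point of the sketch is the eviction loop: fairness is enforced across data structures, so an ExternalSorter created after an ExternalAppendOnlyMap still obtains its share instead of spilling tiny files.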

> Enhance Sort-based Shuffle to avoid spilling small files
> 
>
> Key: SPARK-4452
> URL: https://issues.apache.org/jira/browse/SPARK-4452
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: tianshuo
>
> When an Aggregator is used with ExternalSorter in a task, Spark will create 
> many small files, which can cause a "too many open files" error during 
> merging. This happens when using the sort-based shuffle. The issue is caused 
> by multiple factors:
> 1. There seems to be a bug in setting the elementsRead variable in 
> ExternalSorter, which renders trackMemoryThreshold (defined in Spillable) 
> useless for triggering spilling; the PR to fix it is 
> https://github.com/apache/spark/pull/3302
> 2. The current ShuffleMemoryManager does not work well when there are two 
> spillable objects in a thread, which in this case are ExternalSorter and 
> ExternalAppendOnlyMap (used by Aggregator). Here is an example: due to the 
> use of map-side aggregation, ExternalAppendOnlyMap is created first to read 
> the RDD. It may ask for as much memory as it can get, which is 
> totalMem/numberOfThreads. Later, when ExternalSorter is created in the same 
> thread, the ShuffleMemoryManager can refuse to allocate more memory to it, 
> since the memory has already been given to the previously requesting object 
> (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling 
> small files (due to the lack of memory).
> I'm currently working on a PR to address these two issues. It will include 
> the following changes:
> 1. The ShuffleMemoryManager should track not only the memory usage of each 
> thread, but also the object that holds the memory.
> 2. The ShuffleMemoryManager should be able to trigger the spilling of a 
> spillable object. That way, if a new object in a thread requests memory, the 
> old occupant can be evicted/spilled, which prevents problem 2 from 
> happening. Previously, spillable objects triggered spilling by themselves, 
> so one might not trigger spilling even if another object in the same thread 
> needed more memory. After this change, the ShuffleMemoryManager can trigger 
> the spilling of an object whenever it needs to.
> 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously, 
> ExternalAppendOnlyMap returned a destructive iterator and could not be 
> spilled after the iterator was returned. This should be changed so that even 
> after the iterator is returned, the ShuffleMemoryManager can still spill it.
> Currently, I have a working branch in progress: 
> https://github.com/tsdeng/spark/tree/enhance_memory_manager
> I have already made change 3 and have a prototype of changes 1 and 2 to 
> evict spillables from the memory manager; still in progress.
> I will send a PR when it's done.
> Any feedback or thoughts on this change are highly appreciated!






[jira] [Updated] (SPARK-4452) Enhance Sort-based Shuffle to avoid spilling small files

2014-11-17 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-4452:
--
Affects Version/s: 1.1.0

> Enhance Sort-based Shuffle to avoid spilling small files
> 
>
> Key: SPARK-4452
> URL: https://issues.apache.org/jira/browse/SPARK-4452
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: tianshuo
>






[jira] [Commented] (SPARK-4452) Enhance Sort-based Shuffle to avoid spilling small files

2014-11-17 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215018#comment-14215018
 ] 

Sandy Ryza commented on SPARK-4452:
---

I haven't thought the implications out fully, but it worries me that data 
structures wouldn't be in charge of their own spilling.  It seems like for 
different data structures, different spilling patterns and sizes could be more 
efficient, and placing control in the hands of the shuffle memory manager 
negates this.

Another fix (simpler but maybe less flexible) worth considering would be to 
define a static fraction of memory that goes to each of the aggregator and the 
external sorter, right?  One way that could be implemented is that each time 
the aggregator gets memory, it allocates some for the sorter.

Also, for SPARK-2926, we are working on a tiered merger that would avoid the 
"Too many open files" problem by only merging a limited number of files at once.
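As a rough illustration of the tiered-merge idea (a toy sketch with in-memory sorted lists standing in for spill files; this is not the SPARK-2926 implementation): no more than `maxOpen` runs are ever merged at once, with intermediate merges producing new runs until a single sorted run remains.

```scala
// Merge two sorted runs (standing in for two open spill files).
def mergeTwo(a: List[Int], b: List[Int]): List[Int] = (a, b) match {
  case (Nil, ys) => ys
  case (xs, Nil) => xs
  case (x :: xs, y :: ys) =>
    if (x <= y) x :: mergeTwo(xs, b) else y :: mergeTwo(a, ys)
}

// Tiered merge: each pass merges groups of at most maxOpen runs, so the
// number of simultaneously "open" runs is bounded regardless of how many
// spill files a task produced.
def tieredMerge(runs: List[List[Int]], maxOpen: Int): List[Int] = {
  require(maxOpen >= 2)
  runs match {
    case Nil           => Nil
    case single :: Nil => single
    case _ =>
      val merged = runs.grouped(maxOpen).map(_.reduce(mergeTwo)).toList
      tieredMerge(merged, maxOpen)
  }
}
```

The trade-off is extra passes over the data (each tier rewrites every element once) in exchange for a bounded file-descriptor count.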

> Enhance Sort-based Shuffle to avoid spilling small files
> 
>
> Key: SPARK-4452
> URL: https://issues.apache.org/jira/browse/SPARK-4452
> Project: Spark
>  Issue Type: Bug
>Reporter: tianshuo
>






[jira] [Commented] (SPARK-4447) Remove layers of abstraction in YARN code no longer needed after dropping yarn-alpha

2014-11-17 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214411#comment-14214411
 ] 

Sandy Ryza commented on SPARK-4447:
---

Planning to work on this.

> Remove layers of abstraction in YARN code no longer needed after dropping 
> yarn-alpha
> 
>
> Key: SPARK-4447
> URL: https://issues.apache.org/jira/browse/SPARK-4447
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.3.0
>Reporter: Sandy Ryza
>







[jira] [Updated] (SPARK-4447) Remove layers of abstraction in YARN code no longer needed after dropping yarn-alpha

2014-11-17 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-4447:
--
Description: 
For example, YarnRMClient and YarnRMClientImpl can be merged
YarnAllocator and YarnAllocationHandler can be merged

> Remove layers of abstraction in YARN code no longer needed after dropping 
> yarn-alpha
> 
>
> Key: SPARK-4447
> URL: https://issues.apache.org/jira/browse/SPARK-4447
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.3.0
>Reporter: Sandy Ryza
>
> For example, YarnRMClient and YarnRMClientImpl can be merged
> YarnAllocator and YarnAllocationHandler can be merged






[jira] [Created] (SPARK-4447) Remove layers of abstraction in YARN code no longer needed after dropping yarn-alpha

2014-11-17 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-4447:
-

 Summary: Remove layers of abstraction in YARN code no longer 
needed after dropping yarn-alpha
 Key: SPARK-4447
 URL: https://issues.apache.org/jira/browse/SPARK-4447
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.3.0
Reporter: Sandy Ryza









[jira] [Commented] (SPARK-2819) Difficult to turn on intercept with linear models

2014-11-14 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212854#comment-14212854
 ] 

Sandy Ryza commented on SPARK-2819:
---

Yes, as far as I can tell there are still no public APIs that allow both 
turning on the intercept and specifying a number of runs.

> Difficult to turn on intercept with linear models
> -
>
> Key: SPARK-2819
> URL: https://issues.apache.org/jira/browse/SPARK-2819
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Sandy Ryza
>
> If I want to train a logistic regression model with default parameters and 
> include an intercept, I can run:
> val alg = new LogisticRegressionWithSGD()
> alg.setIntercept(true)
> alg.run(data)
> but if I want to set a parameter like numIterations, I need to use
> LogisticRegressionWithSGD.train(data, 50)
> and have no opportunity to turn on the intercept.
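A toy sketch of why chained instance setters avoid this gap (hypothetical class, not MLlib's API): when every option is a setter returning `this`, intercept and iteration count compose freely, whereas a fixed-arity static helper like `train(data, 50)` can only expose the options baked into its signature.

```scala
// Hypothetical builder-style trainer: any combination of options can be
// configured before run(), so no option is unreachable.
class ToyLR {
  private var intercept = false
  private var numIterations = 100
  def setIntercept(b: Boolean): this.type = { intercept = b; this }
  def setNumIterations(n: Int): this.type = { numIterations = n; this }
  def describe: String = s"intercept=$intercept, numIterations=$numIterations"
}
```

Usage: `new ToyLR().setIntercept(true).setNumIterations(50)` sets both options, which is exactly the combination the static helpers in the report cannot express.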






[jira] [Commented] (SPARK-4375) Assembly built with Maven is missing most of repl classes

2014-11-12 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209388#comment-14209388
 ] 

Sandy Ryza commented on SPARK-4375:
---

This all makes sense to me.  Will put up a patch.

> Assembly built with Maven is missing most of repl classes
> -
>
> Key: SPARK-4375
> URL: https://issues.apache.org/jira/browse/SPARK-4375
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Sandy Ryza
>Priority: Blocker
>
> In particular, the ones in the split scala-2.10/scala-2.11 directories aren't 
> being added






[jira] [Commented] (SPARK-4375) Assembly built with Maven is missing most of repl classes

2014-11-12 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209217#comment-14209217
 ] 

Sandy Ryza commented on SPARK-4375:
---

The issue here is that Maven's activeByDefault option doesn't work as 
expected: it turns off all default profiles if any profile is given on the 
command line.  Circumventing this by removing the scala-2.10 profile and 
setting properties that get overridden when the scala-2.11 profile is turned on 
is pretty straightforward, except in a couple of cases:

* When using Scala 2.11, we don't want to build the external/kafka module
* When using Scala 2.11, we don't want the examples module to include the kafka 
module as a dependency
* When using Scala 2.11, we don't want the examples module to include algebird 
as a dependency

As far as I can tell, Maven doesn't let us do these things in any way that 
isn't deeply hacky.

For dealing with the first issue, I had talked with Patrick about adding 
profiles for all of the external modules (including kafka).  An issue with this 
is that the examples depend on all these modules, so we would need to create 
separate modules or directories for the examples tied to each of these modules. 

Another thing we could do is add a "scala-2.10-only" profile that turns on 
building the kafka module and turns on the examples and examples dependencies 
that only work with scala-2.10.
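For concreteness, the property-override workaround described above might look roughly like this in the pom (a sketch of the pattern, not Spark's actual build files; the property name is illustrative):

```xml
<!-- No activeByDefault profile: properties default to the 2.10 values, and
     the scala-2.11 profile simply overrides them when activated, so passing
     other profiles on the command line can no longer disable the default. -->
<properties>
  <scala.binary.version>2.10</scala.binary.version>
</properties>
<profiles>
  <profile>
    <id>scala-2.11</id>
    <properties>
      <scala.binary.version>2.11</scala.binary.version>
    </properties>
  </profile>
</profiles>
```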


> Assembly built with Maven is missing most of repl classes
> -
>
> Key: SPARK-4375
> URL: https://issues.apache.org/jira/browse/SPARK-4375
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Sandy Ryza
>Priority: Blocker
>
> In particular, the ones in the split scala-2.10/scala-2.11 directories aren't 
> being added






[jira] [Updated] (SPARK-4375) Assembly built with Maven is missing most of repl classes

2014-11-12 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-4375:
--
Summary: Assembly built with Maven is missing most of repl classes  (was: 
assembly built with Maven is missing most of repl classes)

> Assembly built with Maven is missing most of repl classes
> -
>
> Key: SPARK-4375
> URL: https://issues.apache.org/jira/browse/SPARK-4375
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Sandy Ryza
>Priority: Blocker
>
> In particular, the ones in the split scala-2.10/scala-2.11 directories aren't 
> being added






[jira] [Updated] (SPARK-4375) assembly built with Maven is missing most of repl classes

2014-11-12 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-4375:
--
Description: In particular, the ones in the split scala-2.10/scala-2.11 
directories aren't being added

> assembly built with Maven is missing most of repl classes
> -
>
> Key: SPARK-4375
> URL: https://issues.apache.org/jira/browse/SPARK-4375
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Sandy Ryza
>Priority: Blocker
>
> In particular, the ones in the split scala-2.10/scala-2.11 directories aren't 
> being added






[jira] [Created] (SPARK-4375) assembly built with Maven is missing most of repl classes

2014-11-12 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-4375:
-

 Summary: assembly built with Maven is missing most of repl classes
 Key: SPARK-4375
 URL: https://issues.apache.org/jira/browse/SPARK-4375
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Priority: Blocker









[jira] [Created] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests

2014-11-11 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-4352:
-

 Summary: Incorporate locality preferences in dynamic allocation 
requests
 Key: SPARK-4352
 URL: https://issues.apache.org/jira/browse/SPARK-4352
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Sandy Ryza









[jira] [Created] (SPARK-4338) Remove yarn-alpha support

2014-11-11 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-4338:
-

 Summary: Remove yarn-alpha support
 Key: SPARK-4338
 URL: https://issues.apache.org/jira/browse/SPARK-4338
 Project: Spark
  Issue Type: Sub-task
  Components: YARN
Reporter: Sandy Ryza









[jira] [Commented] (SPARK-4338) Remove yarn-alpha support

2014-11-11 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206104#comment-14206104
 ] 

Sandy Ryza commented on SPARK-4338:
---

Planning to take a stab at this

> Remove yarn-alpha support
> -
>
> Key: SPARK-4338
> URL: https://issues.apache.org/jira/browse/SPARK-4338
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Reporter: Sandy Ryza
>







[jira] [Created] (SPARK-4337) Add ability to cancel pending requests to YARN

2014-11-10 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-4337:
-

 Summary: Add ability to cancel pending requests to YARN
 Key: SPARK-4337
 URL: https://issues.apache.org/jira/browse/SPARK-4337
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.2.0
Reporter: Sandy Ryza


This will be useful for things like SPARK-4136






[jira] [Commented] (SPARK-4290) Provide an equivalent functionality of distributed cache as MR does

2014-11-10 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205346#comment-14205346
 ] 

Sandy Ryza commented on SPARK-4290:
---

SparkFiles.get needs to be called, but it will only download them once per 
executor.  The distributed cache is lazy in this way too.
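A toy model of the once-per-executor semantics (hypothetical names; this is not Spark's SparkFiles implementation): the first lookup for a file name "downloads" it to a local path, and later lookups on the same executor reuse that path.

```scala
// Per-executor file cache: get() localizes a file at most once, so many
// tasks on the same executor share a single download -- mirroring the lazy
// behavior of the MapReduce distributed cache described above.
object LocalFileCache {
  private val localPaths = scala.collection.mutable.Map[String, String]()
  private var downloads = 0
  def get(name: String): String = synchronized {
    localPaths.getOrElseUpdate(name, { downloads += 1; s"/tmp/executor-files/$name" })
  }
  def downloadCount: Int = downloads
}
```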

> Provide an equivalent functionality of distributed cache as MR does
> ---
>
> Key: SPARK-4290
> URL: https://issues.apache.org/jira/browse/SPARK-4290
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Xuefu Zhang
>
> MapReduce allows clients to specify files to be put in the distributed cache 
> for a job, and the framework guarantees that each file will be available in 
> the local file system of a node where a task of the job runs, before the 
> task actually starts. While this might be achieved with YARN via hacks, it's 
> not available in other clusters. It would be nice to have equivalent 
> functionality in Spark.
> It would also complement Spark's broadcast variable, which may not be 
> suitable in certain scenarios.






[jira] [Commented] (SPARK-4280) In dynamic allocation, add option to never kill executors with cached blocks

2014-11-07 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202382#comment-14202382
 ] 

Sandy Ryza commented on SPARK-4280:
---

So it looks like the block IDs of broadcast variables on each node are the same 
broadcast IDs used on the driver.  Which means it wouldn't be too hard to do 
this filtering.  Even without it, this would still be useful.  What do you 
think? 
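A sketch of the proposed filtering (block id strings modeled on Spark's "rdd_&lt;rddId&gt;_&lt;partition&gt;" and "broadcast_&lt;id&gt;" naming; the predicate itself is hypothetical): an executor only needs protecting from removal if it holds non-broadcast cached blocks, since broadcast blocks can be re-fetched by any node.

```scala
// True iff the executor stores at least one cached block that is unique to
// it (i.e. not a broadcast block, which every node can recover on demand).
def holdsUniqueCachedBlocks(blockIds: Seq[String]): Boolean =
  blockIds.exists(id => !id.startsWith("broadcast_"))
```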

> In dynamic allocation, add option to never kill executors with cached blocks
> 
>
> Key: SPARK-4280
> URL: https://issues.apache.org/jira/browse/SPARK-4280
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Sandy Ryza
>
> Even with the external shuffle service, this is useful in situations like 
> Hive on Spark where a query might require caching some data. We want to be 
> able to give back executors after the job ends, but not during the job if it 
> would delete intermediate results.






[jira] [Commented] (SPARK-4267) Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later

2014-11-07 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202319#comment-14202319
 ] 

Sandy Ryza commented on SPARK-4267:
---

Strange.  I checked the code, and it seems this must mean the taskScheduler is 
null.  Did you see any errors farther up in the shell before this happened?  
Does it work in local mode?

> Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later
> --
>
> Key: SPARK-4267
> URL: https://issues.apache.org/jira/browse/SPARK-4267
> Project: Spark
>  Issue Type: Bug
>Reporter: Tsuyoshi OZAWA
>
> Currently we're trying Spark on YARN included in Hadoop 2.5.1. Hadoop 2.5 
> uses protobuf 2.5.0, so I compiled with protobuf 2.5.0 like this:
> {code}
>  ./make-distribution.sh --name spark-1.1.1 --tgz -Pyarn 
> -Dhadoop.version=2.5.1 -Dprotobuf.version=2.5.0
> {code}
> Then Spark on YARN fails to launch jobs with NPE.
> {code}
> $ bin/spark-shell --master yarn-client
> scala> sc.textFile("hdfs:///user/ozawa/wordcountInput20G").flatMap(line => line.split(" ")).map(word => (word, 1)).persist().reduceByKey((a, b) => a + b, 16).saveAsTextFile("hdfs:///user/ozawa/sparkWordcountOutNew2");
> java.lang.NullPointerException
> at org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1284)
> at org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1291)
> at org.apache.spark.SparkContext.textFile$default$2(SparkContext.scala:480)
> at $iwC$$iwC$$iwC$$iwC.(:13)
> at $iwC$$iwC$$iwC.(:18)
> at $iwC$$iwC.(:20)
> at $iwC.(:22)
> at (:24)
> at .(:28)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
> at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
> at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)

[jira] [Commented] (SPARK-4290) Provide an equivalent functionality of distributed cache as MR does

2014-11-06 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201572#comment-14201572
 ] 

Sandy Ryza commented on SPARK-4290:
---

If you call SparkContext#addFile, the file will be pulled onto the local disks 
of the executors a single time.

It's true that for large clusters, writing to HDFS with high replication could 
be more efficient than sending from the driver to every executor.  It might be 
worth implementing / using a more sophisticated broadcast mechanism rather than 
adding a new API.


> Provide an equivalent functionality of distributed cache as MR does
> ---
>
> Key: SPARK-4290
> URL: https://issues.apache.org/jira/browse/SPARK-4290
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Xuefu Zhang
>






[jira] [Commented] (SPARK-4280) In dynamic allocation, add option to never kill executors with cached blocks

2014-11-06 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200801#comment-14200801
 ] 

Sandy Ryza commented on SPARK-4280:
---

My thinking was that it would just be based on whether the node reports it's 
storing blocks.  So, e.g., applications wouldn't need to explicitly uncache the 
RDDs in situations where Spark garbage collects them because their references 
go away.

Ideally we would make a special case for broadcast variables, because they're 
not unique to any node.  Will look into whether there's a good way to do this.


> In dynamic allocation, add option to never kill executors with cached blocks
> 
>
> Key: SPARK-4280
> URL: https://issues.apache.org/jira/browse/SPARK-4280
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Sandy Ryza
>
> Even with the external shuffle service, this is useful in situations like 
> Hive on Spark where a query might require caching some data. We want to be 
> able to give back executors after the job ends, but not during the job if it 
> would delete intermediate results.






[jira] [Created] (SPARK-4280) In dynamic allocation, add option to never kill executors with cached blocks

2014-11-06 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-4280:
-

 Summary: In dynamic allocation, add option to never kill executors 
with cached blocks
 Key: SPARK-4280
 URL: https://issues.apache.org/jira/browse/SPARK-4280
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Sandy Ryza


Even with the external shuffle service, this is useful in situations like Hive 
on Spark where a query might require caching some data. We want to be able to 
give back executors after the job ends, but not during the job if it would 
delete intermediate results.






[jira] [Updated] (SPARK-4230) Doc for spark.default.parallelism is incorrect

2014-11-04 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-4230:
--
Description: 
The default default parallelism for shuffle transformations is actually the 
maximum number of partitions in dependent RDDs.

Should also probably be clear about what SparkContext.defaultParallelism is 
used for

  was:The default default parallelism for shuffle transformations is actually 
the maximum number of partitions in dependent RDDs.


> Doc for spark.default.parallelism is incorrect
> --
>
> Key: SPARK-4230
> URL: https://issues.apache.org/jira/browse/SPARK-4230
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Sandy Ryza
>
> The default default parallelism for shuffle transformations is actually the 
> maximum number of partitions in dependent RDDs.
> Should also probably be clear about what SparkContext.defaultParallelism is 
> used for






[jira] [Created] (SPARK-4230) Doc for spark.default.parallelism is incorrect

2014-11-04 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-4230:
-

 Summary: Doc for spark.default.parallelism is incorrect
 Key: SPARK-4230
 URL: https://issues.apache.org/jira/browse/SPARK-4230
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Sandy Ryza


The default default parallelism for shuffle transformations is actually the 
maximum number of partitions in dependent RDDs.






[jira] [Created] (SPARK-4227) Document external shuffle service

2014-11-04 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-4227:
-

 Summary: Document external shuffle service
 Key: SPARK-4227
 URL: https://issues.apache.org/jira/browse/SPARK-4227
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Sandy Ryza


We should add spark.shuffle.service.enabled to the Configuration page and give 
instructions for launching the shuffle service as an auxiliary service on YARN.






[jira] [Comment Edited] (SPARK-4214) With dynamic allocation, avoid outstanding requests for more executors than pending tasks need

2014-11-04 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196464#comment-14196464
 ] 

Sandy Ryza edited comment on SPARK-4214 at 11/4/14 6:00 PM:


We can implement this in either a "weak" way or a "strong" way.
* Weak: whenever we would request additional executors, avoid requesting more 
than pending tasks need.
* Strong: In addition to the above, when pending tasks go below the capacity of 
outstanding requests, cancel outstanding requests.

My opinion is that the strong way is worthwhile.  It allows us to behave well 
in scenarios like:
* Due to contention on the cluster, an app can get no more than X executors 
from YARN
* As a long job runs, we make lots of executor requests to YARN.  At some 
point, YARN stops fulfilling additional requests.
* The long job completes.

With cancellation, we can avoid
* Surges of unneeded executors when contention on the cluster goes away
* New executors popping back up as we kill our own for being idle
 

[~pwendell] [~andrewor] do you have any thoughts?
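The difference between the two policies can be sketched in a few lines of plain Scala (a toy model, not Spark's actual allocation code; `needed` stands for the executors that pending tasks still require beyond those already granted):

```scala
// Weak: only ever grow outstanding requests toward the need; never cancel.
def weak(outstanding: Int, needed: Int): Int = math.max(outstanding, needed)

// Strong: also cancel outstanding requests above the need.
def strong(outstanding: Int, needed: Int): Int = needed

// Pending work shrinks: 6 requests outstanding, but only 2 still needed.
assert(weak(6, 2) == 6)    // weak keeps the now-unneeded requests
assert(strong(6, 2) == 2)  // strong cancels four of them
```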


was (Author: sandyr):
We can implement this in either a "weak" way or a "strong" way.
* Weak: whenever we would request additional executors, avoid requesting more 
than pending tasks need.
* Strong: In addition to the above, when pending tasks go below the capacity of 
outstanding requests, cancel outstanding requests.

My opinion is that the strong way is worthwhile.  It allows us to behave well 
in scenarios like:
* Due to contention on the cluster, an app can get no more than X executors 
from YARN
* As a long job runs, we make lots of executor requests to YARN.  At some 
point, YARN stops fulfilling additional requests.
* The long job completes.
With cancellation, we can avoid
* Surges of unneeded executors when contention on the cluster goes away
* New executors popping back up as we kill our own for being idle
 

[~pwendell] [~andrewor] do you have any thoughts?

> With dynamic allocation, avoid outstanding requests for more executors than 
> pending tasks need
> --
>
> Key: SPARK-4214
> URL: https://issues.apache.org/jira/browse/SPARK-4214
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.2.0
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>
> Dynamic allocation tries to allocate more executors while we have pending 
> tasks remaining.  Our current policy can end up with more outstanding 
> executor requests than needed to fulfill all the pending tasks.  Capping the 
> executor requests to the number of cores needed to fulfill all pending tasks 
> would make dynamic allocation behavior less sensitive to settings for 
> maxExecutors.






[jira] [Comment Edited] (SPARK-4214) With dynamic allocation, avoid outstanding requests for more executors than pending tasks need

2014-11-04 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196464#comment-14196464
 ] 

Sandy Ryza edited comment on SPARK-4214 at 11/4/14 6:00 PM:


We can implement this in either a "weak" way or a "strong" way.
* Weak: whenever we would request additional executors, avoid requesting more 
than pending tasks need.
* Strong: In addition to the above, when pending tasks go below the capacity of 
outstanding requests, cancel outstanding requests.

My opinion is that the strong way is worthwhile.  It allows us to behave well 
in scenarios like:
* Due to contention on the cluster, an app can get no more than X executors 
from YARN
* As a long job runs, we make lots of executor requests to YARN.  At some 
point, YARN stops fulfilling additional requests.
* The long job completes.
With cancellation, we can avoid
* Surges of unneeded executors when contention on the cluster goes away
* New executors popping back up as we kill our own for being idle
 

[~pwendell] [~andrewor] do you have any thoughts?


was (Author: sandyr):
We can implement this in either a "weak" way or a "strong" way.
* Weak: whenever we would request additional executors, avoid requesting more 
than pending tasks need.
* Strong: In addition to the above, when pending tasks go below the capacity of 
outstanding requests, cancel outstanding requests.

My opinion is that the strong way is worthwhile.  It allows us to behave well 
in scenarios like:
1. Due to contention on the cluster, an app can get no more than X executors 
from YARN
2. As a long job runs, we make lots of executor requests to YARN.  At some 
point, YARN stops fulfilling additional requests.
3. The long job completes.
With cancellation, we can avoid
* Surges of unneeded executors when contention on the cluster goes away
* New executors popping back up as we kill our own for being idle
 

[~pwendell] [~andrewor] do you have any thoughts?

> With dynamic allocation, avoid outstanding requests for more executors than 
> pending tasks need
> --
>
> Key: SPARK-4214
> URL: https://issues.apache.org/jira/browse/SPARK-4214
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.2.0
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>
> Dynamic allocation tries to allocate more executors while we have pending 
> tasks remaining.  Our current policy can end up with more outstanding 
> executor requests than needed to fulfill all the pending tasks.  Capping the 
> executor requests to the number of cores needed to fulfill all pending tasks 
> would make dynamic allocation behavior less sensitive to settings for 
> maxExecutors.






[jira] [Commented] (SPARK-4214) With dynamic allocation, avoid outstanding requests for more executors than pending tasks need

2014-11-04 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196464#comment-14196464
 ] 

Sandy Ryza commented on SPARK-4214:
---

We can implement this in either a "weak" way or a "strong" way.
* Weak: whenever we would request additional executors, avoid requesting more 
than pending tasks need.
* Strong: In addition to the above, when pending tasks go below the capacity of 
outstanding requests, cancel outstanding requests.

My opinion is that the strong way is worthwhile.  It allows us to behave well 
in scenarios like:
1. Due to contention on the cluster, an app can get no more than X executors 
from YARN
2. As a long job runs, we make lots of executor requests to YARN.  At some 
point, YARN stops fulfilling additional requests.
3. The long job completes.
With cancellation, we can avoid
* Surges of unneeded executors when contention on the cluster goes away
* New executors popping back up as we kill our own for being idle
 

[~pwendell] [~andrewor] do you have any thoughts?

> With dynamic allocation, avoid outstanding requests for more executors than 
> pending tasks need
> --
>
> Key: SPARK-4214
> URL: https://issues.apache.org/jira/browse/SPARK-4214
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.2.0
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>
> Dynamic allocation tries to allocate more executors while we have pending 
> tasks remaining.  Our current policy can end up with more outstanding 
> executor requests than needed to fulfill all the pending tasks.  Capping the 
> executor requests to the number of cores needed to fulfill all pending tasks 
> would make dynamic allocation behavior less sensitive to settings for 
> maxExecutors.






[jira] [Created] (SPARK-4214) With dynamic allocation, avoid outstanding requests for more executors than pending tasks need

2014-11-03 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-4214:
-

 Summary: With dynamic allocation, avoid outstanding requests for 
more executors than pending tasks need
 Key: SPARK-4214
 URL: https://issues.apache.org/jira/browse/SPARK-4214
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Sandy Ryza


Dynamic allocation tries to allocate more executors while we have pending tasks 
remaining.  Our current policy can end up with more outstanding executor 
requests than needed to fulfill all the pending tasks.  Capping the executor 
requests to the number of cores needed to fulfill all pending tasks would make 
dynamic allocation behavior less sensitive to settings for maxExecutors.






[jira] [Commented] (SPARK-4178) Hadoop input metrics ignore bytes read in RecordReader instantiation

2014-10-31 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192773#comment-14192773
 ] 

Sandy Ryza commented on SPARK-4178:
---

Thanks [~kostas] for noticing this.

> Hadoop input metrics ignore bytes read in RecordReader instantiation
> 
>
> Key: SPARK-4178
> URL: https://issues.apache.org/jira/browse/SPARK-4178
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Sandy Ryza
>







[jira] [Created] (SPARK-4178) Hadoop input metrics ignore bytes read in RecordReader instantiation

2014-10-31 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-4178:
-

 Summary: Hadoop input metrics ignore bytes read in RecordReader 
instantiation
 Key: SPARK-4178
 URL: https://issues.apache.org/jira/browse/SPARK-4178
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Sandy Ryza









[jira] [Commented] (SPARK-4016) Allow user to optionally show additional, advanced metrics in the UI

2014-10-31 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192609#comment-14192609
 ] 

Sandy Ryza commented on SPARK-4016:
---

Also, it looks like this can cause an exception: SPARK-4175

> Allow user to optionally show additional, advanced metrics in the UI
> 
>
> Key: SPARK-4016
> URL: https://issues.apache.org/jira/browse/SPARK-4016
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
> Fix For: 1.2.0
>
>
> Allowing the user to show/hide additional metrics will allow us to both (1) 
> add more advanced metrics without cluttering the UI for the average user and 
> (2) hide, by default, some of the metrics currently shown that are not widely 
> used.






[jira] [Commented] (SPARK-4016) Allow user to optionally show additional, advanced metrics in the UI

2014-10-31 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192604#comment-14192604
 ] 

Sandy Ryza commented on SPARK-4016:
---

It looks like after this change, stage-level summary metrics no longer include 
in-progress tasks.  Is this on purpose?

> Allow user to optionally show additional, advanced metrics in the UI
> 
>
> Key: SPARK-4016
> URL: https://issues.apache.org/jira/browse/SPARK-4016
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
> Fix For: 1.2.0
>
>
> Allowing the user to show/hide additional metrics will allow us to both (1) 
> add more advanced metrics without cluttering the UI for the average user and 
> (2) hide, by default, some of the metrics currently shown that are not widely 
> used.






[jira] [Created] (SPARK-4175) Exception on stage page

2014-10-31 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-4175:
-

 Summary: Exception on stage page
 Key: SPARK-4175
 URL: https://issues.apache.org/jira/browse/SPARK-4175
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Priority: Critical


{code}
14/10/31 14:52:58 WARN servlet.ServletHandler: /stages/stage/
java.util.NoSuchElementException: None.get
        at scala.None$.get(Option.scala:313)
        at scala.None$.get(Option.scala:311)
        at org.apache.spark.ui.jobs.StagePage.taskRow(StagePage.scala:331)
        at org.apache.spark.ui.jobs.StagePage$$anonfun$8.apply(StagePage.scala:173)
        at org.apache.spark.ui.jobs.StagePage$$anonfun$8.apply(StagePage.scala:173)
        at org.apache.spark.ui.UIUtils$$anonfun$listingTable$2.apply(UIUtils.scala:282)
        at org.apache.spark.ui.UIUtils$$anonfun$listingTable$2.apply(UIUtils.scala:282)
        at scala.collection.immutable.Stream.map(Stream.scala:376)
        at org.apache.spark.ui.UIUtils$.listingTable(UIUtils.scala:282)
        at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:171)
        at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68)
        at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68)
        at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:68)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
        at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1496)
        at org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:164)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:499)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
        at org.eclipse.jetty.server.Server.handle(Server.java:370)
        at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
        at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
        at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
        at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
        at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
        at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
        at java.lang.Thread.run(Thread.java:722)
{code}

I'm guessing this was caused by SPARK-4016?
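The None.get failure at the top of the trace is easy to reproduce in plain Scala; a defensive alternative is to supply a fallback instead of calling .get (a general pattern, not a claim about how StagePage should be fixed):

```scala
val missing: Option[Int] = None

// Calling .get on None throws java.util.NoSuchElementException,
// matching the first line of the stack trace above.
val threw =
  try { missing.get; false }
  catch { case _: NoSuchElementException => true }
assert(threw)

// Defensive pattern: provide a default rather than calling .get.
assert(missing.getOrElse(0) == 0)
```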






[jira] [Created] (SPARK-4136) Under dynamic allocation, cancel outstanding executor requests when pending task queue is empty

2014-10-29 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-4136:
-

 Summary: Under dynamic allocation, cancel outstanding executor 
requests when pending task queue is empty
 Key: SPARK-4136
 URL: https://issues.apache.org/jira/browse/SPARK-4136
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.1.0
Reporter: Sandy Ryza









[jira] [Commented] (SPARK-3461) Support external groupByKey using repartitionAndSortWithinPartitions

2014-10-28 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186562#comment-14186562
 ] 

Sandy Ryza commented on SPARK-3461:
---

SPARK-2926 could help with this as well.

> Support external groupByKey using repartitionAndSortWithinPartitions
> 
>
> Key: SPARK-3461
> URL: https://issues.apache.org/jira/browse/SPARK-3461
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Davies Liu
>Priority: Critical
>
> Given that we have SPARK-2978, it seems like we could support an external 
> group by operator pretty easily. We'd just have to wrap the existing iterator 
> exposed by SPARK-2978 with a lookahead iterator that detects the group 
> boundaries. Also, we'd have to override the cache() operator to cache the 
> parent RDD so that if this object is cached it doesn't wind through the 
> iterator.
> I haven't totally followed all the sort-shuffle internals, but just given the 
> stated semantics of SPARK-2978 it seems like this would be possible.
> It would be really nice to externalize this because many beginner users write 
> jobs in terms of groupByKey.
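The lookahead-iterator idea above can be sketched in plain Scala (a toy model over an already key-sorted iterator, not Spark's shuffle internals):

```scala
// Given an iterator of (key, value) pairs already sorted by key (as the
// sort-based shuffle guarantees within a partition), group consecutive
// runs of equal keys without materializing the whole input.
def groupSorted[K, V](it: Iterator[(K, V)]): Iterator[(K, Seq[V])] =
  new Iterator[(K, Seq[V])] {
    private val buffered = it.buffered
    def hasNext: Boolean = buffered.hasNext
    def next(): (K, Seq[V]) = {
      val key = buffered.head._1
      val values = scala.collection.mutable.ArrayBuffer[V]()
      // Look ahead: consume pairs while the key stays the same.
      while (buffered.hasNext && buffered.head._1 == key)
        values += buffered.next()._2
      (key, values.toSeq)
    }
  }

val grouped = groupSorted(Iterator(("a", 1), ("a", 2), ("b", 3))).toList
assert(grouped == List(("a", Seq(1, 2)), ("b", Seq(3))))
```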






[jira] [Commented] (SPARK-1856) Standardize MLlib interfaces

2014-10-24 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183174#comment-14183174
 ] 

Sandy Ryza commented on SPARK-1856:
---

Is this work still targeted for 1.2?

> Standardize MLlib interfaces
> 
>
> Key: SPARK-1856
> URL: https://issues.apache.org/jira/browse/SPARK-1856
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Blocker
>
> Instead of expanding MLlib based on the current class naming scheme 
> (ProblemWithAlgorithm),  we should standardize MLlib's interfaces that 
> clearly separate datasets, formulations, algorithms, parameter sets, and 
> models.






[jira] [Commented] (SPARK-3573) Dataset

2014-10-24 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183173#comment-14183173
 ] 

Sandy Ryza commented on SPARK-3573:
---

Is this still targeted for 1.2?

> Dataset
> ---
>
> Key: SPARK-3573
> URL: https://issues.apache.org/jira/browse/SPARK-3573
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> This JIRA is for discussion of ML dataset, essentially a SchemaRDD with extra 
> ML-specific metadata embedded in its schema.
> Sample code:
> Suppose we have training events stored on HDFS and user/ad features in Hive, 
> we want to assemble features for training and then apply decision tree.
> The proposed pipeline with dataset looks like the following (need more 
> refinements):
> {code}
> sqlContext.jsonFile("/path/to/training/events", 
> 0.01).registerTempTable("event")
> val training = sqlContext.sql("""
>   SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, 
> event.action AS label,
>  user.gender AS userGender, user.country AS userCountry, 
> user.features AS userFeatures,
>  ad.targetGender AS targetGender
> FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = 
> ad.id;""").cache()
> val indexer = new Indexer()
> val interactor = new Interactor()
> val fvAssembler = new FeatureVectorAssembler()
> val treeClassifer = new DecisionTreeClassifer()
> val paramMap = new ParamMap()
>   .put(indexer.features, Map("userCountryIndex" -> "userCountry"))
>   .put(indexer.sortByFrequency, true)
>   .put(interactor.features, Map("genderMatch" -> Array("userGender", 
> "targetGender")))
>   .put(fvAssembler.features, Map("features" -> Array("genderMatch", 
> "userCountryIndex", "userFeatures")))
>   .put(fvAssembler.dense, true)
>   .put(treeClassifer.maxDepth, 4) // By default, classifier recognizes 
> "features" and "label" columns.
> val pipeline = Pipeline.create(indexer, interactor, fvAssembler, 
> treeClassifier)
> val model = pipeline.fit(training, paramMap)
> sqlContext.jsonFile("/path/to/events", 0.01).registerTempTable("event")
> val test = sqlContext.sql("""
>   SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId,
>  user.gender AS userGender, user.country AS userCountry, 
> user.features AS userFeatures,
>  ad.targetGender AS targetGender
> FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = 
> ad.id;""")
> val prediction = model.transform(test).select('eventId, 'prediction)
> {code}






[jira] [Commented] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle

2014-10-21 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179631#comment-14179631
 ] 

Sandy Ryza commented on SPARK-2926:
---

[~rxin] did you ever get a chance to try this out?

> Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
> --
>
> Key: SPARK-2926
> URL: https://issues.apache.org/jira/browse/SPARK-2926
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 1.1.0
>Reporter: Saisai Shao
> Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test 
> Report(contd).pdf, Spark Shuffle Test Report.pdf
>
>
> Currently Spark has already integrated sort-based shuffle write, which 
> greatly improves IO performance and reduces memory consumption when the 
> reducer number is very large. But the reducer side still adopts the 
> hash-based shuffle reader implementation, which neglects the ordering 
> attributes of map output data in some situations.
> Here we propose an MR-style sort-merge shuffle reader for sort-based 
> shuffle to further improve the performance of sort-based shuffle.
> Working in progress code and performance test report will be posted later 
> when some unit test bugs are fixed.
> Any comments would be greatly appreciated. 
> Thanks a lot.






[jira] [Commented] (SPARK-3360) Add RowMatrix.multiply(Vector)

2014-10-14 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14171238#comment-14171238
 ] 

Sandy Ryza commented on SPARK-3360:
---

bq. You don't need Vector.multiply(RowMatrix) really since it is either the 
same as RowMatrix.multiply(Vector) or the same with the RowMatrix transposed.

Any reason not to offer it for convenience though?
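For reference, RowMatrix.multiply(Vector) amounts to a dot product of each row with the vector; in plain Scala (a sketch, not the MLlib API) the operation is:

```scala
// A row matrix as a sequence of rows; multiplying by a vector is a
// dot product of each row with that vector.
def multiply(rows: Seq[Array[Double]], v: Array[Double]): Array[Double] =
  rows.map(row => row.zip(v).map { case (a, b) => a * b }.sum).toArray

val m = Seq(Array(1.0, 2.0), Array(3.0, 4.0))
val result = multiply(m, Array(1.0, 1.0))
assert(result.sameElements(Array(3.0, 7.0)))
```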

> Add RowMatrix.multiply(Vector)
> --
>
> Key: SPARK-3360
> URL: https://issues.apache.org/jira/browse/SPARK-3360
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Sandy Ryza
>
> RowMatrix currently has multiply(Matrix), but multiply(Vector) would be 
> useful as well.






[jira] [Comment Edited] (SPARK-3174) Provide elastic scaling within a Spark application

2014-10-14 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14171024#comment-14171024
 ] 

Sandy Ryza edited comment on SPARK-3174 at 10/14/14 3:03 PM:
-

bq. If I understand correctly, your concern with requesting executors in rounds 
is that we will end up making many requests if we have many executors, when we 
could instead just batch them?
My original claim was that a policy that could be aware of the number of tasks 
needed by a stage would be necessary to get enough executors quickly when going 
from idle to running a job.  Your claim was that the exponential policy can be 
tuned to give similar enough behavior to the type of policy I'm describing.  My 
concern was that, even if that is true, tuning the exponential policy in this 
way would be detrimental at other points of the application lifecycle: e.g., 
starting the next stage within a job could lead to over-allocation.
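The over-allocation concern can be illustrated with a toy model (not Spark's actual code): an exponential policy that doubles the outstanding request each round until it covers the pending tasks can overshoot, while a task-aware cap stops exactly at the need:

```scala
// Double outstanding requests each round until they cover pending tasks
// (assuming, for simplicity, one task per executor).
def exponentialRequest(pending: Int): Int = {
  var outstanding = 1
  while (outstanding < pending) outstanding *= 2
  outstanding
}

// A task-aware policy requests exactly what the pending tasks need.
def cappedRequest(pending: Int): Int = pending

assert(exponentialRequest(9) == 16)  // over-allocates by 7 executors
assert(cappedRequest(9) == 9)
```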



was (Author: sandyr):
bq. If I understand correctly, your concern with requesting executors in rounds 
is that we will end up making many requests if we have many executors, when we 
could instead just batch them?
My original claim was that a policy that could be aware of the number of tasks 
needed by a stage would be necessary to get enough executors quickly when going 
from idle to running a job.  You claim was that the exponential policy can be 
tuned to give similar enough behavior to the type of policy I'm describing.  My 
concern was that, even if that is true, tuning the exponential policy in this 
way would be detrimental at other points of the application lifecycle: e.g., 
starting the next stage within a job could lead to over-allocation.


> Provide elastic scaling within a Spark application
> --
>
> Key: SPARK-3174
> URL: https://issues.apache.org/jira/browse/SPARK-3174
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.0.2
>Reporter: Sandy Ryza
>Assignee: Andrew Or
> Attachments: SPARK-3174design.pdf, SparkElasticScalingDesignB.pdf, 
> dynamic-scaling-executors-10-6-14.pdf
>
>
> A common complaint with Spark in a multi-tenant environment is that 
> applications have a fixed allocation that doesn't grow and shrink with their 
> resource needs.  We're blocked on YARN-1197 for dynamically changing the 
> resources within executors, but we can still allocate and discard whole 
> executors.
> It would be useful to have some heuristics that
> * Request more executors when many pending tasks are building up
> * Discard executors when they are idle
> See the latest design doc for more information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3174) Provide elastic scaling within a Spark application

2014-10-14 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14171024#comment-14171024
 ] 

Sandy Ryza commented on SPARK-3174:
---

bq. If I understand correctly, your concern with requesting executors in rounds 
is that we will end up making many requests if we have many executors, when we 
could instead just batch them?
My original claim was that a policy that could be aware of the number of tasks 
needed by a stage would be necessary to get enough executors quickly when going 
from idle to running a job.  Your claim was that the exponential policy can be 
tuned to give similar enough behavior to the type of policy I'm describing.  My 
concern was that, even if that is true, tuning the exponential policy in this 
way would be detrimental at other points of the application lifecycle: e.g., 
starting the next stage within a job could lead to over-allocation.
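The trade-off in this comment can be made concrete with a toy model (all numbers and names here are illustrative, not Spark's actual allocation code): doubling the executor count each interval takes many rounds to cover a large stage, while a policy sized by the stage's task count gets there in one request.

```python
import math

def exponential_rounds(pending_tasks, tasks_per_executor=4):
    """Intervals needed when the executor count doubles each round,
    starting from one executor (toy model of the exponential policy)."""
    executors, rounds = 1, 0
    while executors * tasks_per_executor < pending_tasks:
        executors *= 2
        rounds += 1
    return rounds

def task_aware_request(pending_tasks, tasks_per_executor=4):
    """A single up-front request sized to the stage's task count
    (toy model of the task-aware policy argued for above)."""
    return math.ceil(pending_tasks / tasks_per_executor)

# Going from idle to a 1000-task stage: doubling takes 8 intervals to
# reach 256 executors, while the task-aware policy asks for 250 at once.
```

Tuning the exponential policy aggressively enough to close that gap is exactly what risks over-allocating when a small follow-on stage starts mid-application.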


> Provide elastic scaling within a Spark application
> --
>
> Key: SPARK-3174
> URL: https://issues.apache.org/jira/browse/SPARK-3174
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.0.2
>Reporter: Sandy Ryza
>Assignee: Andrew Or
> Attachments: SPARK-3174design.pdf, SparkElasticScalingDesignB.pdf, 
> dynamic-scaling-executors-10-6-14.pdf
>
>
> A common complaint with Spark in a multi-tenant environment is that 
> applications have a fixed allocation that doesn't grow and shrink with their 
> resource needs.  We're blocked on YARN-1197 for dynamically changing the 
> resources within executors, but we can still allocate and discard whole 
> executors.
> It would be useful to have some heuristics that
> * Request more executors when many pending tasks are building up
> * Discard executors when they are idle
> See the latest design doc for more information.






[jira] [Commented] (SPARK-3174) Provide elastic scaling within a Spark application

2014-10-13 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169931#comment-14169931
 ] 

Sandy Ryza commented on SPARK-3174:
---

bq. Slow-start is actually not slow at all if we lower the interval between 
which we add executors.

For the quick-start case, a fast enough start essentially requires making a 
ton of requests to YARN before we can assess the effect of these allocations on 
task pressure.  My worry is that if an exponential policy configured to provide 
this behavior applies later on while the app is running, more "mundane" events 
(like starting the next stage within a job) could lead to huge over-allocations.

Also, IIUC, the most recent patch has us waiting to hear back from YARN at each 
step - i.e. we don't move on to the next level until the current one is 
satisfied, which adds additional latency.  But this specific issue might be 
resolvable within the current approach.

> Provide elastic scaling within a Spark application
> --
>
> Key: SPARK-3174
> URL: https://issues.apache.org/jira/browse/SPARK-3174
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.0.2
>Reporter: Sandy Ryza
>Assignee: Andrew Or
> Attachments: SPARK-3174design.pdf, 
> dynamic-scaling-executors-10-6-14.pdf
>
>
> A common complaint with Spark in a multi-tenant environment is that 
> applications have a fixed allocation that doesn't grow and shrink with their 
> resource needs.  We're blocked on YARN-1197 for dynamically changing the 
> resources within executors, but we can still allocate and discard whole 
> executors.
> It would be useful to have some heuristics that
> * Request more executors when many pending tasks are building up
> * Discard executors when they are idle
> See the latest design doc for more information.






[jira] [Commented] (SPARK-1209) SparkHadoopUtil should not use package org.apache.hadoop

2014-10-13 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169846#comment-14169846
 ] 

Sandy Ryza commented on SPARK-1209:
---

Definitely worth changing, in my opinion.  This has been bothering me for a 
while.

> SparkHadoopUtil should not use package org.apache.hadoop
> 
>
> Key: SPARK-1209
> URL: https://issues.apache.org/jira/browse/SPARK-1209
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 0.9.0
>Reporter: Sandy Pérez González
>Assignee: Mark Grover
>
> It's private, so the change won't break compatibility






[jira] [Commented] (SPARK-3884) If deploy mode is cluster, --driver-memory shouldn't apply to client JVM

2014-10-09 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165776#comment-14165776
 ] 

Sandy Ryza commented on SPARK-3884:
---

Accidentally assigned this to myself, but others should feel free to pick it up

> If deploy mode is cluster, --driver-memory shouldn't apply to client JVM
> 
>
> Key: SPARK-3884
> URL: https://issues.apache.org/jira/browse/SPARK-3884
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>







[jira] [Updated] (SPARK-3884) If deploy mode is cluster, --driver-memory shouldn't apply to client JVM

2014-10-09 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-3884:
--
Summary: If deploy mode is cluster, --driver-memory shouldn't apply to 
client JVM  (was: Don't set SPARK_SUBMIT_DRIVER_MEMORY if deploy mode is 
cluster)

> If deploy mode is cluster, --driver-memory shouldn't apply to client JVM
> 
>
> Key: SPARK-3884
> URL: https://issues.apache.org/jira/browse/SPARK-3884
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>







[jira] [Created] (SPARK-3884) Don't set SPARK_SUBMIT_DRIVER_MEMORY if deploy mode is cluster

2014-10-09 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-3884:
-

 Summary: Don't set SPARK_SUBMIT_DRIVER_MEMORY if deploy mode is 
cluster
 Key: SPARK-3884
 URL: https://issues.apache.org/jira/browse/SPARK-3884
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Sandy Ryza









[jira] [Commented] (SPARK-3174) Provide elastic scaling within a Spark application

2014-10-07 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162685#comment-14162685
 ] 

Sandy Ryza commented on SPARK-3174:
---

bq.  Maybe it makes sense to just call it `spark.dynamicAllocation.*`
That sounds good to me.

bq. I think in general we should limit the number of things that will affect 
adding/removing executors. Otherwise an application might get/lose many 
executors all of a sudden without a good understanding of why. Also 
anticipating what's needed in a future stage is usually fairly difficult, 
because you don't know a priori how long each stage is running. I don't see a 
good metric to decide how far in the future to anticipate for.

Consider the (common) case of a user keeping a Hive session open and setting a 
low number of minimum executors in order to not sit on cluster resources when 
idle.  Goal number 1 should be making queries return as fast as possible.  A 
policy that, upon receiving a job, simply requested executors with enough slots 
to handle all the tasks required by the first stage would be a vast latency and 
user experience improvement over the exponential increase policy.  Given that 
resource managers like YARN will mediate fairness between users and that Spark 
will be able to give executors back, there's not much advantage to being 
conservative or ramping up slowly in this case.  Accurately anticipating 
resource needs is difficult, but not necessary.
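As a sketch of the "request enough slots for the first stage" policy described here (the helper and its parameters are hypothetical, not an existing Spark API):

```python
def executors_for_stage(num_tasks, cores_per_executor,
                        min_executors, max_executors):
    """Size one request so every task in the stage gets a slot at once,
    clamped to the configured floor and ceiling (illustrative helper)."""
    wanted = -(-num_tasks // cores_per_executor)  # ceiling division
    return max(min_executors, min(wanted, max_executors))

# A 400-task query with 4-core executors: one request for 100 executors,
# rather than ramping up while the user waits.
```

Because YARN mediates fairness between tenants and Spark can give executors back, an aggressive one-shot request like this costs little in the idle-session case.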

> Provide elastic scaling within a Spark application
> --
>
> Key: SPARK-3174
> URL: https://issues.apache.org/jira/browse/SPARK-3174
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.0.2
>Reporter: Sandy Ryza
>Assignee: Andrew Or
> Attachments: SPARK-3174design.pdf, 
> dynamic-scaling-executors-10-6-14.pdf
>
>
> A common complaint with Spark in a multi-tenant environment is that 
> applications have a fixed allocation that doesn't grow and shrink with their 
> resource needs.  We're blocked on YARN-1197 for dynamically changing the 
> resources within executors, but we can still allocate and discard whole 
> executors.
> It would be useful to have some heuristics that
> * Request more executors when many pending tasks are building up
> * Discard executors when they are idle
> See the latest design doc for more information.






[jira] [Updated] (SPARK-3682) Add helpful warnings to the UI

2014-10-07 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-3682:
--
Attachment: SPARK-3682Design.pdf

Posting an initial design

> Add helpful warnings to the UI
> --
>
> Key: SPARK-3682
> URL: https://issues.apache.org/jira/browse/SPARK-3682
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 1.1.0
>Reporter: Sandy Ryza
> Attachments: SPARK-3682Design.pdf
>
>
> Spark has a zillion configuration options and a zillion different things that 
> can go wrong with a job.  Improvements like incremental and better metrics 
> and the proposed spark replay debugger provide more insight into what's going 
> on under the covers.  However, it's difficult for non-advanced users to 
> synthesize this information and understand where to direct their attention. 
> It would be helpful to have some sort of central location on the UI users 
> could go to that would provide indications about why an app/job is failing or 
> performing poorly.
> Some helpful messages that we could provide:
> * Warn that the tasks in a particular stage are spending a long time in GC.
> * Warn that spark.shuffle.memoryFraction does not fit inside the young 
> generation.
> * Warn that tasks in a particular stage are very short, and that the number 
> of partitions should probably be decreased.
> * Warn that tasks in a particular stage are spilling a lot, and that the 
> number of partitions should probably be increased.
> * Warn that a cached RDD that gets a lot of use does not fit in memory, and a 
> lot of time is being spent recomputing it.
> To start, probably two kinds of warnings would be most helpful.
> * Warnings at the app level that report on misconfigurations, issues with the 
> general health of executors.
> * Warnings at the job level that indicate why a job might be performing 
> slowly.






[jira] [Created] (SPARK-3837) Warn when YARN is killing containers for exceeding memory limits

2014-10-07 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-3837:
-

 Summary: Warn when YARN is killing containers for exceeding memory 
limits
 Key: SPARK-3837
 URL: https://issues.apache.org/jira/browse/SPARK-3837
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.1.0
Reporter: Sandy Ryza


YARN now lets application masters know when it kills their containers for 
exceeding memory limits.  Spark should log something when this happens.






[jira] [Commented] (SPARK-3797) Run the shuffle service inside the YARN NodeManager as an AuxiliaryService

2014-10-07 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161636#comment-14161636
 ] 

Sandy Ryza commented on SPARK-3797:
---

Not necessarily opposed to this, but wanted to bring up some of the drawbacks 
of running a Spark shuffle service inside YARN NodeManagers.  And the 
alternative.

* *Dependencies* .  We will need to avoid dependency conflicts between Spark's 
shuffle service and the rest of the NodeManager.  It's worth keeping in mind 
that the NodeManager may include Hadoop server-side dependencies we haven't 
dealt with in the past through depending on hadoop-client, and that its 
dependencies will also need to jive with other auxiliary services like MR's and 
Tez's. Unlike in Spark, where we place Spark jars in front of Hadoop jars and 
thus allow Spark versions to take precedence, NodeManagers presumably run with 
Hadoop jars in front.
* *Resource management* .  YARN will soon have support for some disk I/O 
isolation and scheduling (YARN-2139).  Running inside the NodeManager means 
that we won't be able to account for serving shuffle data inside of this. 
* *Deployment* . Where currently "installing" Spark on YARN at most means 
placing a Spark assembly jar on HDFS, this would require deploying Spark bits 
on every node in the cluster.
* *Rolling Upgrades* . Some proposed YARN work will allow containers to 
continue running while NodeManagers restart.  With Spark depending on the 
NodeManager to serve data, these upgrades would interfere with running Spark 
applications in situations where they otherwise might not.

The other option worth considering is to run the shuffle service in containers 
that sit beside the executor(s) on each node. This avoids all the problems 
above, but brings a couple of its own:
* Under many cluster configurations, YARN expects each container to take up at 
least a minimum amount of memory and CPU.  The shuffle service, which would use 
little of either of these, would sit on these resources unnecessarily.
* Scheduling becomes more difficult.  Spark would require two different 
containers to be scheduled on any node it wants to be functional on.  Once YARN 
has container resizing (YARN-1197), this could be mitigated by running two 
processes inside a single container.  If Spark wanted to kill an executor, it 
could order the executor process to kill itself and then shrink the container 
to the size of the shuffle service.

> Run the shuffle service inside the YARN NodeManager as an AuxiliaryService
> --
>
> Key: SPARK-3797
> URL: https://issues.apache.org/jira/browse/SPARK-3797
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Reporter: Patrick Wendell
>Assignee: Andrew Or
>
> It's also worth considering running the shuffle service in a YARN container 
> beside the executor(s) on each node.






[jira] [Updated] (SPARK-3797) Run the shuffle service inside the YARN NodeManager as an AuxiliaryService

2014-10-06 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-3797:
--
Summary: Run the shuffle service inside the YARN NodeManager as an 
AuxiliaryService  (was: Enable running shuffle service in separate process from 
executor)

> Run the shuffle service inside the YARN NodeManager as an AuxiliaryService
> --
>
> Key: SPARK-3797
> URL: https://issues.apache.org/jira/browse/SPARK-3797
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Reporter: Patrick Wendell
>Assignee: Andrew Or
>
> This could either mean
> * Running the shuffle service inside the YARN NodeManager as an 
> AuxiliaryService
> * Running the shuffle service in a YARN container on each node






[jira] [Updated] (SPARK-3797) Run the shuffle service inside the YARN NodeManager as an AuxiliaryService

2014-10-06 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-3797:
--
Description: It's also worth considering running the shuffle service in a 
YARN container beside the executor(s) on each node.  (was: This could either 
mean
* Running the shuffle service inside the YARN NodeManager as an AuxiliaryService
* Running the shuffle service in a YARN container on each node)

> Run the shuffle service inside the YARN NodeManager as an AuxiliaryService
> --
>
> Key: SPARK-3797
> URL: https://issues.apache.org/jira/browse/SPARK-3797
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Reporter: Patrick Wendell
>Assignee: Andrew Or
>
> It's also worth considering running the shuffle service in a YARN container 
> beside the executor(s) on each node.






[jira] [Commented] (SPARK-3174) Provide elastic scaling within a Spark application

2014-10-06 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161048#comment-14161048
 ] 

Sandy Ryza commented on SPARK-3174:
---

Ah, misread.

My opinion is that, for a first cut without the graceful decommission, we 
should go with option 2 and only remove executors that have no cached blocks.

> Provide elastic scaling within a Spark application
> --
>
> Key: SPARK-3174
> URL: https://issues.apache.org/jira/browse/SPARK-3174
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.0.2
>Reporter: Sandy Ryza
>Assignee: Andrew Or
> Attachments: SPARK-3174design.pdf, 
> dynamic-scaling-executors-10-6-14.pdf
>
>
> A common complaint with Spark in a multi-tenant environment is that 
> applications have a fixed allocation that doesn't grow and shrink with their 
> resource needs.  We're blocked on YARN-1197 for dynamically changing the 
> resources within executors, but we can still allocate and discard whole 
> executors.
> It would be useful to have some heuristics that
> * Request more executors when many pending tasks are building up
> * Discard executors when they are idle
> See the latest design doc for more information.






[jira] [Commented] (SPARK-3174) Provide elastic scaling within a Spark application

2014-10-06 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14160960#comment-14160960
 ] 

Sandy Ryza commented on SPARK-3174:
---

bq. for instance, lets say I do some ETL stuff where I want it to do dynamic, 
but then I need to run an ML algorithm or do some heavy caching where I want to 
shut it off.

IIUC, the proposal only gets rid of executors once they are idle / empty of 
cached data.  If that's the case, do you still see issues with leaving dynamic 
allocation on during the "ML algorithm phase"?

> Provide elastic scaling within a Spark application
> --
>
> Key: SPARK-3174
> URL: https://issues.apache.org/jira/browse/SPARK-3174
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.0.2
>Reporter: Sandy Ryza
>Assignee: Andrew Or
> Attachments: SPARK-3174design.pdf, 
> dynamic-scaling-executors-10-6-14.pdf
>
>
> A common complaint with Spark in a multi-tenant environment is that 
> applications have a fixed allocation that doesn't grow and shrink with their 
> resource needs.  We're blocked on YARN-1197 for dynamically changing the 
> resources within executors, but we can still allocate and discard whole 
> executors.
> It would be useful to have some heuristics that
> * Request more executors when many pending tasks are building up
> * Discard executors when they are idle
> See the latest design doc for more information.






[jira] [Updated] (SPARK-3797) Enable running shuffle service in separate process from executor

2014-10-06 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-3797:
--
Description: 
This could either mean
* Running the shuffle service inside the YARN NodeManager as an AuxiliaryService
* Running the shuffle service in a YARN container on each node

  was:
This could either mean
* Running the shuffle service inside the YARN NodeManager as an auxiliary 
service
* Running the shuffle service in a YARN container on each node


> Enable running shuffle service in separate process from executor
> 
>
> Key: SPARK-3797
> URL: https://issues.apache.org/jira/browse/SPARK-3797
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Reporter: Patrick Wendell
>Assignee: Andrew Or
>
> This could either mean
> * Running the shuffle service inside the YARN NodeManager as an 
> AuxiliaryService
> * Running the shuffle service in a YARN container on each node






[jira] [Updated] (SPARK-3797) Enable running shuffle service in separate process from executor

2014-10-06 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-3797:
--
Description: 
This could either mean
* Running the shuffle service inside the YARN NodeManager as an auxiliary 
service
* Running the shuffle service in a YARN container on each node

> Enable running shuffle service in separate process from executor
> 
>
> Key: SPARK-3797
> URL: https://issues.apache.org/jira/browse/SPARK-3797
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Reporter: Patrick Wendell
>Assignee: Andrew Or
>
> This could either mean
> * Running the shuffle service inside the YARN NodeManager as an auxiliary 
> service
> * Running the shuffle service in a YARN container on each node






[jira] [Updated] (SPARK-3797) Enable running shuffle service in separate process from executor

2014-10-06 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-3797:
--
Summary: Enable running shuffle service in separate process from executor  
(was: Integrate shuffle service in YARN's pluggable shuffle)

> Enable running shuffle service in separate process from executor
> 
>
> Key: SPARK-3797
> URL: https://issues.apache.org/jira/browse/SPARK-3797
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Reporter: Patrick Wendell
>Assignee: Andrew Or
>







[jira] [Commented] (SPARK-3174) Provide elastic scaling within a Spark application

2014-10-06 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14160937#comment-14160937
 ] 

Sandy Ryza commented on SPARK-3174:
---

Thanks for posting the detailed design, Andrew.  A few comments.

I would expect properties underneath spark.executor.* to pertain to what goes 
on inside of executors.  This is really more of a driver/scheduler feature, so 
a different prefix might make more sense.

Because it's a large task and there's still significant value without it, I 
assume we'll hold off on implementing the graceful decommission until we're 
done with the first parts?

This is probably out of scope for the first cut, but in the future it might be 
useful to include addition/removal policies that use what Spark knows about 
upcoming stages to anticipate the number of executors needed.  Can we structure 
the config property names in a way that will make sense if we choose to add 
more advanced functionality like this?

When cluster resources are constrained, we may find ourselves in situations 
where YARN is unable to allocate the additional resources we requested before 
the next time interval.  I haven't thought about it extremely deeply, but it 
seems like there may be some pathological situations in which we request an 
enormous number of additional executors while waiting.  It might make sense to 
do something like avoid increasing the number requested until we've actually 
received some?
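One simple guard against that pathology, sketched under the assumption stated above (this is a toy model, not Spark's implementation): hold the request size steady while earlier requests are still outstanding.

```python
class RequestThrottle:
    """Toy policy: don't issue a new (larger) request until YARN has
    delivered at least part of the previous one."""
    def __init__(self):
        self.outstanding = 0  # executors requested but not yet granted

    def next_request(self, desired_add):
        if self.outstanding > 0:
            return 0  # hold steady until something is received
        self.outstanding = desired_add
        return desired_add

    def on_granted(self, count):
        self.outstanding = max(0, self.outstanding - count)

# While a request for 4 executors is pending, later rounds request 0;
# once YARN grants them, the next round can grow the total again.
```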

Last, any thoughts on what reasonable intervals would be?  For the add 
interval, I imagine that we want it to be at least the amount of time required 
between invoking requestExecutors and being able to schedule tasks on the 
executors requested.

> Provide elastic scaling within a Spark application
> --
>
> Key: SPARK-3174
> URL: https://issues.apache.org/jira/browse/SPARK-3174
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.0.2
>Reporter: Sandy Ryza
>Assignee: Andrew Or
> Attachments: SPARK-3174design.pdf, 
> dynamic-scaling-executors-10-6-14.pdf
>
>
> A common complaint with Spark in a multi-tenant environment is that 
> applications have a fixed allocation that doesn't grow and shrink with their 
> resource needs.  We're blocked on YARN-1197 for dynamically changing the 
> resources within executors, but we can still allocate and discard whole 
> executors.
> It would be useful to have some heuristics that
> * Request more executors when many pending tasks are building up
> * Discard executors when they are idle
> See the latest design doc for more information.






[jira] [Comment Edited] (SPARK-3464) Graceful decommission of executors

2014-10-04 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159252#comment-14159252
 ] 

Sandy Ryza edited comment on SPARK-3464 at 10/4/14 7:27 PM:


[~pwendell] did you mean to resolve this as "Fixed"?


was (Author: sandyr):
Did you mean to resolve this as "Fixed"?

> Graceful decommission of executors
> --
>
> Key: SPARK-3464
> URL: https://issues.apache.org/jira/browse/SPARK-3464
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Sandy Ryza
>Assignee: Andrew Or
>
> In most cases, even when an application is utilizing only a small fraction of 
> its available resources, executors will still have tasks running or blocks 
> cached.  It would be useful to have a mechanism for waiting for running tasks 
> on an executor to finish and migrating its cached blocks elsewhere before 
> discarding it.






[jira] [Commented] (SPARK-3464) Graceful decommission of executors

2014-10-04 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159252#comment-14159252
 ] 

Sandy Ryza commented on SPARK-3464:
---

Did you mean to resolve this as "Fixed"?

> Graceful decommission of executors
> --
>
> Key: SPARK-3464
> URL: https://issues.apache.org/jira/browse/SPARK-3464
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Sandy Ryza
>Assignee: Andrew Or
>
> In most cases, even when an application is utilizing only a small fraction of 
> its available resources, executors will still have tasks running or blocks 
> cached.  It would be useful to have a mechanism for waiting for running tasks 
> on an executor to finish and migrating its cached blocks elsewhere before 
> discarding it.






[jira] [Comment Edited] (SPARK-3561) Decouple Spark's API from its execution engine

2014-10-03 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158571#comment-14158571
 ] 

Sandy Ryza edited comment on SPARK-3561 at 10/3/14 11:00 PM:
-

I think there may be somewhat of a misunderstanding about the relationship 
between Spark and YARN.  YARN is not an "execution environment", but a cluster 
resource manager that has the ability to start processes on behalf of execution 
engines like Spark.  Spark already supports YARN as a cluster resource manager, 
but YARN doesn't provide its own execution engine.  YARN doesn't provide a 
stateless shuffle (although execution engines built atop it like MR and Tez 
do). 

If I understand correctly, the broader intent is to decouple the Spark API from 
the execution engine it runs on top of.  Changing the title to reflect this.  
That said, the Spark API is currently very tightly integrated with its 
execution engine, and frankly, decoupling the two so that Spark would be able 
to run on top of execution engines with similar properties seems more trouble 
than it's worth.


was (Author: sandyr):
I think there may be somewhat of a misunderstanding about the relationship 
between Spark and YARN.  YARN is not an "execution environment", but a cluster 
resource manager that has the ability to start processes on behalf of execution 
engines like Spark.  Spark already supports YARN as a cluster resource manager, 
but YARN doesn't provide its own execution engine.  YARN doesn't provide a 
stateless shuffle (although execution engines built atop it like MR and Tez 
do). 

If I understand, the broader intent is to decouple the Spark API from the 
execution engine it runs on top of.  Changing the title to reflect this.  That, 
the Spark API is currently very tightly integrated with its execution engine, 
and frankly, decoupling the two so that Spark would be able to run on top of 
execution engines with similar properties seems more trouble than its worth.

> Decouple Spark's API from its execution engine
> --
>
> Key: SPARK-3561
> URL: https://issues.apache.org/jira/browse/SPARK-3561
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Oleg Zhurakousky
>  Labels: features
> Fix For: 1.2.0
>
> Attachments: SPARK-3561.pdf
>
>
> Currently Spark's user-facing API is tightly coupled with its backend 
> execution engine.   It could be useful to provide a point of pluggability 
> between the two to allow Spark to run on other DAG execution engines with 
> similar distributed memory abstractions.
> Proposal:
> The proposed approach would introduce a pluggable JobExecutionContext (trait) 
> - as a non-public api (@DeveloperAPI) not exposed to end users of Spark.
> The trait will define only 4 operations:
> * hadoopFile
> * newAPIHadoopFile
> * broadcast
> * runJob
> Each method directly maps to the corresponding methods in current version of 
> SparkContext. JobExecutionContext implementation will be accessed by 
> SparkContext via master URL as 
> "execution-context:foo.bar.MyJobExecutionContext" with default implementation 
> containing the existing code from SparkContext, thus allowing current 
> (corresponding) methods of SparkContext to delegate to such implementation. 
> An integrator will now have an option to provide custom implementation of 
> DefaultExecutionContext by either implementing it from scratch or extending 
> from DefaultExecutionContext.
> Please see the attached design doc for more details.
> Pull Request will be posted shortly as well
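For illustration, the proposed pluggable trait and the master-URL dispatch it describes could look roughly like the sketch below. Only the four operation names come from the proposal; every signature, the `ExecutionContextLoader` helper, and the fallback behavior are assumptions made for this sketch, not taken from the design doc or pull request.

```scala
// Hypothetical sketch of the proposed JobExecutionContext trait.
// Signatures are illustrative assumptions, not the actual design.
trait JobExecutionContext {
  def hadoopFile(path: String): Seq[String]
  def newAPIHadoopFile(path: String): Seq[String]
  def broadcast[T](value: T): T
  def runJob[T, U](data: Seq[T])(func: T => U): Seq[U]
}

// Default implementation standing in for the existing SparkContext code path.
class DefaultExecutionContext extends JobExecutionContext {
  def hadoopFile(path: String): Seq[String] = Seq.empty
  def newAPIHadoopFile(path: String): Seq[String] = Seq.empty
  def broadcast[T](value: T): T = value
  def runJob[T, U](data: Seq[T])(func: T => U): Seq[U] = data.map(func)
}

// How a master URL such as "execution-context:foo.bar.MyJobExecutionContext"
// might be resolved to an implementation via reflection, with any other
// master URL falling back to the default (existing) behavior.
object ExecutionContextLoader {
  private val Prefix = "execution-context:"
  def load(masterUrl: String): JobExecutionContext =
    if (masterUrl.startsWith(Prefix))
      Class.forName(masterUrl.stripPrefix(Prefix))
        .getDeclaredConstructor().newInstance()
        .asInstanceOf[JobExecutionContext]
    else
      new DefaultExecutionContext
}
```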






[jira] [Updated] (SPARK-3561) Decouple Spark's API from its execution engine

2014-10-03 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-3561:
--
Description: 
Currently Spark's user-facing API is tightly coupled with its backend execution 
engine.   It could be useful to provide a point of pluggability between the two 
to allow Spark to run on other DAG execution engines with similar distributed 
memory abstractions.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - 
as a non-public api (@DeveloperAPI) not exposed to end users of Spark.
The trait will define only 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method directly maps to the corresponding methods in current version of 
SparkContext. JobExecutionContext implementation will be accessed by 
SparkContext via master URL as 
"execution-context:foo.bar.MyJobExecutionContext" with default implementation 
containing the existing code from SparkContext, thus allowing current 
(corresponding) methods of SparkContext to delegate to such implementation. An 
integrator will now have an option to provide custom implementation of 
DefaultExecutionContext by either implementing it from scratch or extending 
from DefaultExecutionContext.

Please see the attached design doc for more details.
Pull Request will be posted shortly as well

  was:
Currently Spark's API is tightly coupled with its backend execution engine.   
It could be useful to provide a point of pluggability between the two to allow 
Spark to run on other DAG execution engines with similar distributed memory 
abstractions.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - 
as a non-public api (@DeveloperAPI) not exposed to end users of Spark.
The trait will define only 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method directly maps to the corresponding methods in current version of 
SparkContext. JobExecutionContext implementation will be accessed by 
SparkContext via master URL as 
"execution-context:foo.bar.MyJobExecutionContext" with default implementation 
containing the existing code from SparkContext, thus allowing current 
(corresponding) methods of SparkContext to delegate to such implementation. An 
integrator will now have an option to provide custom implementation of 
DefaultExecutionContext by either implementing it from scratch or extending 
from DefaultExecutionContext.

Please see the attached design doc for more details.
Pull Request will be posted shortly as well


> Decouple Spark's API from its execution engine
> --
>
> Key: SPARK-3561
> URL: https://issues.apache.org/jira/browse/SPARK-3561
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Oleg Zhurakousky
>  Labels: features
> Fix For: 1.2.0
>
> Attachments: SPARK-3561.pdf
>
>
> Currently Spark's user-facing API is tightly coupled with its backend 
> execution engine.   It could be useful to provide a point of pluggability 
> between the two to allow Spark to run on other DAG execution engines with 
> similar distributed memory abstractions.
> Proposal:
> The proposed approach would introduce a pluggable JobExecutionContext (trait) 
> - as a non-public api (@DeveloperAPI) not exposed to end users of Spark.
> The trait will define only 4 operations:
> * hadoopFile
> * newAPIHadoopFile
> * broadcast
> * runJob
> Each method directly maps to the corresponding methods in current version of 
> SparkContext. JobExecutionContext implementation will be accessed by 
> SparkContext via master URL as 
> "execution-context:foo.bar.MyJobExecutionContext" with default implementation 
> containing the existing code from SparkContext, thus allowing current 
> (corresponding) methods of SparkContext to delegate to such implementation. 
> An integrator will now have an option to provide custom implementation of 
> DefaultExecutionContext by either implementing it from scratch or extending 
> from DefaultExecutionContext.
> Please see the attached design doc for more details.
> Pull Request will be posted shortly as well






[jira] [Updated] (SPARK-3561) Decouple Spark's API from its execution engine

2014-10-03 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-3561:
--
Description: 
Currently Spark's API is tightly coupled with its backend execution engine.   
It could be useful to provide a point of pluggability between the two to allow 
Spark to run on other DAG execution engines with similar distributed memory 
abstractions.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - 
as a non-public api (@DeveloperAPI) not exposed to end users of Spark.
The trait will define only 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method directly maps to the corresponding methods in current version of 
SparkContext. JobExecutionContext implementation will be accessed by 
SparkContext via master URL as 
"execution-context:foo.bar.MyJobExecutionContext" with default implementation 
containing the existing code from SparkContext, thus allowing current 
(corresponding) methods of SparkContext to delegate to such implementation. An 
integrator will now have an option to provide custom implementation of 
DefaultExecutionContext by either implementing it from scratch or extending 
from DefaultExecutionContext.

Please see the attached design doc for more details.
Pull Request will be posted shortly as well

  was:
Currently Spark's API is tightly coupled with its backend execution engine.   
It could be useful to provide a point of pluggability between the two to allow 
Spark to run on other DAG execution engines with similar distributed memory 
abstractions.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - 
a gateway and a delegate to Hadoop execution environment - as a non-public api 
(@DeveloperAPI) not exposed to end users of Spark.
The trait will define only 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method directly maps to the corresponding methods in current version of 
SparkContext. JobExecutionContext implementation will be accessed by 
SparkContext via master URL as 
"execution-context:foo.bar.MyJobExecutionContext" with default implementation 
containing the existing code from SparkContext, thus allowing current 
(corresponding) methods of SparkContext to delegate to such implementation. An 
integrator will now have an option to provide custom implementation of 
DefaultExecutionContext by either implementing it from scratch or extending 
from DefaultExecutionContext.

Please see the attached design doc for more details.
Pull Request will be posted shortly as well


> Decouple Spark's API from its execution engine
> --
>
> Key: SPARK-3561
> URL: https://issues.apache.org/jira/browse/SPARK-3561
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Oleg Zhurakousky
>  Labels: features
> Fix For: 1.2.0
>
> Attachments: SPARK-3561.pdf
>
>
> Currently Spark's API is tightly coupled with its backend execution engine.   
> It could be useful to provide a point of pluggability between the two to 
> allow Spark to run on other DAG execution engines with similar distributed 
> memory abstractions.
> Proposal:
> The proposed approach would introduce a pluggable JobExecutionContext (trait) 
> - as a non-public api (@DeveloperAPI) not exposed to end users of Spark.
> The trait will define only 4 operations:
> * hadoopFile
> * newAPIHadoopFile
> * broadcast
> * runJob
> Each method directly maps to the corresponding methods in current version of 
> SparkContext. JobExecutionContext implementation will be accessed by 
> SparkContext via master URL as 
> "execution-context:foo.bar.MyJobExecutionContext" with default implementation 
> containing the existing code from SparkContext, thus allowing current 
> (corresponding) methods of SparkContext to delegate to such implementation. 
> An integrator will now have an option to provide custom implementation of 
> DefaultExecutionContext by either implementing it from scratch or extending 
> from DefaultExecutionContext.
> Please see the attached design doc for more details.
> Pull Request will be posted shortly as well





