[jira] [Commented] (SPARK-2630) Input data size of CoalescedRDD is incorrect

2014-09-07 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124815#comment-14124815
 ] 

Andrew Ash commented on SPARK-2630:
---

Hi [~tsudukim] does the fix I proposed here look like it would address the 
issue you observed?

https://github.com/apache/spark/pull/2310

> Input data size of CoalescedRDD is incorrect
> 
>
> Key: SPARK-2630
> URL: https://issues.apache.org/jira/browse/SPARK-2630
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Davies Liu
>Priority: Critical
> Attachments: overflow.tiff
>
>
> Given one big file, such as text.4.3G, put it in one task:
> sc.textFile("text.4.3G").coalesce(1).count()
> In the Spark Web UI, you will see that the input size is reported as 5.4M. 
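The reported size is consistent with a 32-bit integer overflow (the attachment is even named overflow.tiff). A minimal sketch, assuming the UI accumulates the input size in a 32-bit counter:

```python
# Hedged illustration (assumption: the byte count wraps in a 32-bit field).
# A ~4.3 GB input then wraps around to only a few megabytes.
file_size = 4_300_000_000      # ~4.3 GB in bytes
wrapped = file_size % 2**32    # value after 32-bit wraparound
print(wrapped)                 # 5032704 bytes, i.e. roughly 5 MB
```

That wrapped value is in the same ballpark as the 5.4M shown in the UI, which supports the overflow hypothesis.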



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2630) Input data size of CoalescedRDD is incorrect

2014-09-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124818#comment-14124818
 ] 

Apache Spark commented on SPARK-2630:
-

User 'ash211' has created a pull request for this issue:
https://github.com/apache/spark/pull/2310

> Input data size of CoalescedRDD is incorrect
> 
>
> Key: SPARK-2630
> URL: https://issues.apache.org/jira/browse/SPARK-2630
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Davies Liu
>Priority: Critical
> Attachments: overflow.tiff
>
>
> Given one big file, such as text.4.3G, put it in one task:
> sc.textFile("text.4.3G").coalesce(1).count()
> In the Spark Web UI, you will see that the input size is reported as 5.4M. 






[jira] [Updated] (SPARK-2004) QA Automation

2014-09-07 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-2004:
--
Component/s: Project Infra

> QA Automation
> -
>
> Key: SPARK-2004
> URL: https://issues.apache.org/jira/browse/SPARK-2004
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Project Infra
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> This is an umbrella JIRA to track QA automation tasks. Spark supports
> * several deploy modes
> ** local
> ** standalone
> ** yarn
> ** mesos
> * three languages
> ** scala
> ** java
> ** python
> * several hadoop versions
> ** 0.x
> ** 1.x
> ** 2.x
> * job submission from different systems
> ** linux
> ** mac os x
> ** windows
> Their cross product creates a big deployment matrix, so QA automation is 
> really necessary to avoid regressions.






[jira] [Commented] (SPARK-1338) Create Additional Style Rules

2014-09-07 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124822#comment-14124822
 ] 

Andrew Ash commented on SPARK-1338:
---

Blocking unicode operators could be a good idea too -- Eclipse apparently 
doesn't handle them well (see SPARK-2182).

> Create Additional Style Rules
> -
>
> Key: SPARK-1338
> URL: https://issues.apache.org/jira/browse/SPARK-1338
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Patrick Wendell
>Assignee: Prashant Sharma
> Fix For: 1.1.0
>
>
> There are a few other rules that would be helpful to have. Also we should add 
> tests for these rules because it's easy to get them wrong. I gave some 
> example comparisons from a javascript style checker.
> Require spaces in type declarations:
> def foo:String = X // no
> def foo: String = XXX
> def x:Int = 100 // no
> val x: Int = 100
> Require spaces after keywords:
> if(x - 3) // no
> if (x + 10)
> See: requireSpaceAfterKeywords from
> https://github.com/mdevils/node-jscs
> Disallow spaces inside of parentheses:
> val x = ( 3 + 5 ) // no
> val x = (3 + 5)
> See: disallowSpacesInsideParentheses from
> https://github.com/mdevils/node-jscs
> Require spaces before and after binary operators:
> See: requireSpaceBeforeBinaryOperators
> See: disallowSpaceAfterBinaryOperators
> from https://github.com/mdevils/node-jscs
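The rules above can be sketched as simple line-based regex checks. This is a hypothetical illustration only; the real Scalastyle rules operate on parsed tokens, and these patterns are my own, not the project's:

```python
import re

# Hypothetical regex sketches of the style rules described above.
RULES = [
    (re.compile(r"\w:[A-Za-z]"), "missing space after ':' in type declaration"),
    (re.compile(r"\b(?:if|while|for)\("), "missing space after keyword"),
    (re.compile(r"\(\s|\s\)"), "space inside parentheses"),
]

def check(line):
    """Return the messages of all rules a source line violates."""
    return [msg for pattern, msg in RULES if pattern.search(line)]

print(check("def foo:String = X"))  # flags the type-declaration rule
print(check("if(x - 3)"))           # flags the keyword rule
print(check("val x = ( 3 + 5 )"))   # flags the parentheses rule
```

This also shows why the ticket asks for tests alongside the rules: naive patterns like these are easy to get subtly wrong (e.g. the `:` rule must not fire on `val x: Int`).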






[jira] [Commented] (SPARK-3249) Fix links in ScalaDoc that cause warning messages in `sbt/sbt unidoc`

2014-09-07 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124828#comment-14124828
 ] 

Andrew Ash commented on SPARK-3249:
---

[~mengxr] would you have the links point to the simplest of the several 
ambiguous methods?  So the one with the fewest parameters?

> Fix links in ScalaDoc that cause warning messages in `sbt/sbt unidoc`
> -
>
> Key: SPARK-3249
> URL: https://issues.apache.org/jira/browse/SPARK-3249
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Affects Versions: 1.1.0
>Reporter: Xiangrui Meng
>Priority: Minor
>
> If there are multiple overloaded versions of a method, we should make the 
> links more specific. Otherwise, `sbt/sbt unidoc` generates warning messages 
> like the following:
> {code}
> [warn] 
> mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala:305: The 
> link target "org.apache.spark.mllib.tree.DecisionTree$#trainClassifier" is 
> ambiguous. Several members fit the target:
> [warn] (input: 
> org.apache.spark.api.java.JavaRDD[org.apache.spark.mllib.regression.LabeledPoint],numClassesForClassification:
>  Int,categoricalFeaturesInfo: java.util.Map[Integer,Integer],impurity: 
> String,maxDepth: Int,maxBins: Int): 
> org.apache.spark.mllib.tree.model.DecisionTreeModel in object DecisionTree 
> [chosen]
> [warn] (input: 
> org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint],numClassesForClassification:
>  Int,categoricalFeaturesInfo: Map[Int,Int],impurity: String,maxDepth: 
> Int,maxBins: Int): org.apache.spark.mllib.tree.model.DecisionTreeModel in 
> object DecisionTree
> {code}






[jira] [Commented] (SPARK-1667) Jobs never finish successfully once bucket file missing occurred

2014-09-07 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124833#comment-14124833
 ] 

Andrew Ash commented on SPARK-1667:
---

Hi [~sarutak] it looks like you sent in a better fix for this problem in 
SPARK-2670.  Are we good to close this ticket now?

> Jobs never finish successfully once bucket file missing occurred
> 
>
> Key: SPARK-1667
> URL: https://issues.apache.org/jira/browse/SPARK-1667
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.0.0
>Reporter: Kousuke Saruta
>
> If a job executes a shuffle, bucket files are created in a temporary 
> directory (named like spark-local-*).
> When those bucket files go missing, whether from disk failure or any other 
> cause, jobs can no longer execute a shuffle that has the same shuffle id as 
> the missing bucket files.






[jira] [Updated] (SPARK-3193) output error info when Process exitcode not zero

2014-09-07 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-3193:
--
Summary: output error info when Process exitcode not zero  (was: output 
errer info when Process exitcode not zero)

> output error info when Process exitcode not zero
> 
>
> Key: SPARK-3193
> URL: https://issues.apache.org/jira/browse/SPARK-3193
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.2
>Reporter: wangfei
>
> I noticed that sometimes PR tests fail because the Process exit code != 0:
> DriverSuite: 
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath 
> - driver should exit after finishing *** FAILED *** 
>SparkException was thrown during property evaluation. 
> (DriverSuite.scala:40) 
>  Message: Process List(./bin/spark-class, 
> org.apache.spark.DriverWithoutCleanup, local) exited with code 1 
>  Occurred at table row 0 (zero based, not counting headings), which had 
> values ( 
>master = local 
>  ) 
>  
> [info] SparkSubmitSuite:
> [info] - prints usage on empty input
> [info] - prints usage with only --help
> [info] - prints error with unrecognized options
> [info] - handle binary specified but not class
> [info] - handles arguments with --key=val
> [info] - handles arguments to user program
> [info] - handles arguments to user program with name collision
> [info] - handles YARN cluster mode
> [info] - handles YARN client mode
> [info] - handles standalone cluster mode
> [info] - handles standalone client mode
> [info] - handles mesos client mode
> [info] - handles confs with flag equivalents
> [info] - launch simple application with spark-submit *** FAILED ***
> [info]   org.apache.spark.SparkException: Process List(./bin/spark-submit, 
> --class, org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, 
> --master, local, file:/tmp/1408854098404-0/testJar-1408854098404.jar) exited 
> with code 1
> [info]   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:872)
> [info]   at 
> org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
> [info]   at 
> org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291)
> [info]   at 
> org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
> [info]   at org.apac
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> refer to 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18688/consoleFull
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19118/consoleFull
> We should output the process's error info when it fails; this would be 
> helpful for diagnosis.






[jira] [Commented] (SPARK-1667) Jobs never finish successfully once bucket file missing occurred

2014-09-07 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124834#comment-14124834
 ] 

Kousuke Saruta commented on SPARK-1667:
---

[~Andrew Ash] Oh yeah, I'll close this ticket.

> Jobs never finish successfully once bucket file missing occurred
> 
>
> Key: SPARK-1667
> URL: https://issues.apache.org/jira/browse/SPARK-1667
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.0.0
>Reporter: Kousuke Saruta
>
> If a job executes a shuffle, bucket files are created in a temporary 
> directory (named like spark-local-*).
> When those bucket files go missing, whether from disk failure or any other 
> cause, jobs can no longer execute a shuffle that has the same shuffle id as 
> the missing bucket files.






[jira] [Closed] (SPARK-1667) Jobs never finish successfully once bucket file missing occurred

2014-09-07 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta closed SPARK-1667.
-
Resolution: Fixed

This ticket is resolved by SPARK-2670.

> Jobs never finish successfully once bucket file missing occurred
> 
>
> Key: SPARK-1667
> URL: https://issues.apache.org/jira/browse/SPARK-1667
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.0.0
>Reporter: Kousuke Saruta
>
> If a job executes a shuffle, bucket files are created in a temporary 
> directory (named like spark-local-*).
> When those bucket files go missing, whether from disk failure or any other 
> cause, jobs can no longer execute a shuffle that has the same shuffle id as 
> the missing bucket files.






[jira] [Commented] (SPARK-2858) Default log4j configuration no longer seems to work

2014-09-07 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124838#comment-14124838
 ] 

Andrew Ash commented on SPARK-2858:
---

Josh mentions in that ticket that the Spark EC2 AMI might put the HDFS log4j 
ahead of Spark's and steal the priority.  Were you running via the EC2 scripts?

https://issues.apache.org/jira/browse/SPARK-2913?focusedCommentId=14104366&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14104366

> Default log4j configuration no longer seems to work
> ---
>
> Key: SPARK-2858
> URL: https://issues.apache.org/jira/browse/SPARK-2858
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>
> For reasons unknown this doesn't seem to be working anymore. I deleted my 
> log4j.properties file, did a fresh build, and noticed it still gave me a 
> verbose stack trace when port 4040 was contended (which is a log we silence 
> in the conf). I actually think this was an issue even before [~sowen]'s 
> changes, so I'm not sure what's up.






[jira] [Updated] (SPARK-2553) CoGroupedRDD unnecessarily allocates a Tuple2 per dep per key

2014-09-07 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-2553:
--
Fix Version/s: 1.1.0

> CoGroupedRDD unnecessarily allocates a Tuple2 per dep per key
> -
>
> Key: SPARK-2553
> URL: https://issues.apache.org/jira/browse/SPARK-2553
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>Priority: Minor
> Fix For: 1.1.0
>
>







[jira] [Updated] (SPARK-2574) Avoid allocating new ArrayBuffer in groupByKey's mergeCombiner

2014-09-07 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-2574:
--
Fix Version/s: 1.1.0

> Avoid allocating new ArrayBuffer in groupByKey's mergeCombiner
> --
>
> Key: SPARK-2574
> URL: https://issues.apache.org/jira/browse/SPARK-2574
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Sandy Ryza
>Assignee: Matei Zaharia
>Priority: Trivial
> Fix For: 1.1.0
>
>







[jira] [Commented] (SPARK-2048) Optimizations to CPU usage of external spilling code

2014-09-07 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124841#comment-14124841
 ] 

Andrew Ash commented on SPARK-2048:
---

All subtasks of this umbrella task have been completed and will be included in 
1.1.0 -- are we good to close this ticket?

> Optimizations to CPU usage of external spilling code
> 
>
> Key: SPARK-2048
> URL: https://issues.apache.org/jira/browse/SPARK-2048
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Matei Zaharia
> Fix For: 1.1.0
>
>
> In the external spilling code in ExternalAppendOnlyMap and CoGroupedRDD, 
> there are a few opportunities for optimization:
> - There are lots of uses of pattern-matching on Tuple2 (e.g. val (k, v) = 
> pair), which we found to be much slower than accessing fields directly
> - Hash codes for each element are computed many times in 
> StreamBuffer.minKeyHash, which will be expensive for some data types
> - Uses of buffer.remove() may be expensive if there are lots of hash 
> collisions (better to swap in the last element into that position)
> - More objects are allocated than is probably necessary, e.g. ArrayBuffers 
> and pairs
> - Because ExternalAppendOnlyMap is only given one key-value pair at a time, 
> it allocates a new update function on each one, unlike the way we pass a 
> single update function to AppendOnlyMap in Aggregator
> These should help because situations where we're spilling are also ones where 
> there is presumably a lot of GC pressure in the new generation.






[jira] [Updated] (SPARK-2048) Optimizations to CPU usage of external spilling code

2014-09-07 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-2048:
--
Fix Version/s: 1.1.0

> Optimizations to CPU usage of external spilling code
> 
>
> Key: SPARK-2048
> URL: https://issues.apache.org/jira/browse/SPARK-2048
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Matei Zaharia
> Fix For: 1.1.0
>
>
> In the external spilling code in ExternalAppendOnlyMap and CoGroupedRDD, 
> there are a few opportunities for optimization:
> - There are lots of uses of pattern-matching on Tuple2 (e.g. val (k, v) = 
> pair), which we found to be much slower than accessing fields directly
> - Hash codes for each element are computed many times in 
> StreamBuffer.minKeyHash, which will be expensive for some data types
> - Uses of buffer.remove() may be expensive if there are lots of hash 
> collisions (better to swap in the last element into that position)
> - More objects are allocated than is probably necessary, e.g. ArrayBuffers 
> and pairs
> - Because ExternalAppendOnlyMap is only given one key-value pair at a time, 
> it allocates a new update function on each one, unlike the way we pass a 
> single update function to AppendOnlyMap in Aggregator
> These should help because situations where we're spilling are also ones where 
> there is presumably a lot of GC pressure in the new generation.






[jira] [Commented] (SPARK-2122) Move aggregation into shuffle implementation

2014-09-07 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124842#comment-14124842
 ] 

Andrew Ash commented on SPARK-2122:
---

[~jerryshao] is this a dupe of SPARK-2124?  It looks like that ticket already 
covered moving aggregation into ShuffleManager implementations.

> Move aggregation into shuffle implementation
> 
>
> Key: SPARK-2122
> URL: https://issues.apache.org/jira/browse/SPARK-2122
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Reporter: Saisai Shao
>
> This is a follow-up work of [SPARK-2044 pluggable shuffle 
> interface|https://issues.apache.org/jira/browse/SPARK-2044] to move the 
> execution of aggregator into shuffle implementation. This will bring 
> flexibility for other different implementation of shuffle reader and writer. 
> PR will be submitted after [PR1009|https://github.com/apache/spark/pull/1009] 
> is merged.






[jira] [Commented] (SPARK-1956) Enable shuffle consolidation by default

2014-09-07 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124844#comment-14124844
 ] 

Andrew Ash commented on SPARK-1956:
---

[~mridulm80] there has been a significant amount of work done on the shuffle 
codepath for 1.1 -- do you have more issues in mind that would block enabling 
shuffle consolidation by default?

{{spark.shuffle.consolidateFiles}} still defaults to false in v1.1.0-rc4

> Enable shuffle consolidation by default
> ---
>
> Key: SPARK-1956
> URL: https://issues.apache.org/jira/browse/SPARK-1956
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.0.0
>Reporter: Sandy Ryza
>
> The only drawbacks are on ext3, and almost everyone is on ext4 at this point.  
> I think it's better to aim the default at the common case.






[jira] [Updated] (SPARK-3321) Defining a class within python main script

2014-09-07 Thread Shawn Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shawn Guo updated SPARK-3321:
-
Priority: Minor  (was: Critical)

> Defining a class within python main script
> --
>
> Key: SPARK-3321
> URL: https://issues.apache.org/jira/browse/SPARK-3321
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.0.1
> Environment: Python version 2.6.6
> Spark version version 1.0.1
> jdk1.6.0_43
>Reporter: Shawn Guo
>Priority: Minor
>
> *leftOuterJoin(self, other, numPartitions=None)*
> Perform a left outer join of self and other.
> For each element (k, v) in self, the resulting RDD will either contain all 
> pairs (k, (v, w)) for w in other, or the pair (k, (v, None)) if no elements 
> in other have key k.
> *Background*: leftOuterJoin produces None elements in the result dataset.
> I defined a new class 'Null' in the main script to replace every native 
> Python None with a new 'Null' object. The 'Null' object overloads the [] 
> operator.
> {code:title=Class Null|borderStyle=solid}
> class Null(object):
> def __getitem__(self,key): return None;
> def __getstate__(self): pass;
> def __setstate__(self, dict): pass;
> def convert_to_null(x):
> return Null() if x is None else x
> X = A.leftOuterJoin(B)
> X.mapValues(lambda line: (line[0], convert_to_null(line[1])))
> {code}
> The code seems to run fine in the pyspark console; however, spark-submit 
> failed with the error messages below:
> /spark-1.0.1-bin-hadoop1/bin/spark-submit --master local[2] 
> /tmp/python_test.py
> {noformat}
>   File "/data/work/spark-1.0.1-bin-hadoop1/python/pyspark/worker.py", line 
> 77, in main
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py", 
> line 191, in dump_stream
> self.serializer.dump_stream(self._batched(iterator), stream)
>   File "/data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py", 
> line 124, in dump_stream
> self._write_with_length(obj, stream)
>   File "/data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py", 
> line 134, in _write_with_length
> serialized = self.dumps(obj)
>   File "/data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py", 
> line 279, in dumps
> def dumps(self, obj): return cPickle.dumps(obj, 2)
> PicklingError: Can't pickle <class '__main__.Null'>: attribute lookup 
> __main__.Null failed
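The failure above is consistent with how pickle handles class instances: the stream stores the class by module and name rather than by value, so the receiving worker process must be able to look up __main__.Null, which only exists in the driver script. A small Spark-free illustration of this (standard pickle behavior, hypothetical class name reused from the report):

```python
import pickle

# Pickle stores an instance's class by reference (module + name), not by
# value, so the deserializing process must be able to import that class.
class Null(object):
    def __getitem__(self, key):
        return None

payload = pickle.dumps(Null(), protocol=2)
# Only the class *name* is embedded in the stream, not its code:
print(b"Null" in payload)   # True
```

On a Spark worker there is no module containing Null to import, hence the PicklingError. Defining the class in a separate module shipped to workers (e.g. via --py-files) is the usual way around this.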
> 
> org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:115)
> 
> org.apache.spark.api.python.PythonRDD$$anon$1.(PythonRDD.scala:145)
> org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:78)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:33)
> org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:74)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200)
> 
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
> 
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
> 
> org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174)
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
> at scala.Option.foreach(Option.scala:236)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala

[jira] [Commented] (SPARK-2972) APPLICATION_COMPLETE not created in Python unless context explicitly stopped

2014-09-07 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124872#comment-14124872
 ] 

Matthew Farrellee commented on SPARK-2972:
--

[~roji] this was addressed for a pyspark shell in 
https://issues.apache.org/jira/browse/SPARK-2435. As for applications, it is 
the programmer's responsibility to stop the context before exit; this can be 
seen in all the example code provided with Spark. Are you looking for the 
SparkContext to stop itself?

> APPLICATION_COMPLETE not created in Python unless context explicitly stopped
> 
>
> Key: SPARK-2972
> URL: https://issues.apache.org/jira/browse/SPARK-2972
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.0.2
> Environment: Cloudera 5.1, yarn master on ubuntu precise
>Reporter: Shay Rojansky
>
> If you don't explicitly stop a SparkContext at the end of a Python 
> application with sc.stop(), an APPLICATION_COMPLETE file isn't created and 
> the job doesn't get picked up by the history server.
> This can be easily reproduced with pyspark (but affects scripts as well).
> The current workaround is to wrap the entire script with a try/finally and 
> stop manually.
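The try/finally workaround described above can be sketched as follows. StubContext is a hypothetical stand-in for SparkContext so the snippet runs without Spark:

```python
# Sketch of the try/finally workaround, with a stub standing in for
# SparkContext so the example is self-contained.
class StubContext:
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True

sc = StubContext()
try:
    raise RuntimeError("application code failed")  # simulate a crash
except RuntimeError:
    pass
finally:
    sc.stop()  # runs regardless, so the completion marker gets written

print(sc.stopped)   # True
```

Because the finally block runs on both the success and failure paths, the context is stopped even when the application raises, which is exactly the boilerplate the reporter objects to repeating in every script.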






[jira] [Commented] (SPARK-2972) APPLICATION_COMPLETE not created in Python unless context explicitly stopped

2014-09-07 Thread Shay Rojansky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124873#comment-14124873
 ] 

Shay Rojansky commented on SPARK-2972:
--

Thanks for answering. I guess it's a debatable question. I admit I expected the 
context to shut itself down at application exit, a bit in the way that files 
and other resources get closed.

Note that the way the examples are currently written (pi.py), an exception 
anywhere in the code would bypass sc.stop() and the Spark application 
disappears without leaving a trace in the history server. For this reason, my 
scripts all contain try/finally blocks around the application code, which seems 
like needless boilerplate that complicates life and can easily be forgotten.

Is there any specific reason not to use the application shutdown hooks 
available in python/java to close the context(s)?

> APPLICATION_COMPLETE not created in Python unless context explicitly stopped
> 
>
> Key: SPARK-2972
> URL: https://issues.apache.org/jira/browse/SPARK-2972
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.0.2
> Environment: Cloudera 5.1, yarn master on ubuntu precise
>Reporter: Shay Rojansky
>
> If you don't explicitly stop a SparkContext at the end of a Python 
> application with sc.stop(), an APPLICATION_COMPLETE file isn't created and 
> the job doesn't get picked up by the history server.
> This can be easily reproduced with pyspark (but affects scripts as well).
> The current workaround is to wrap the entire script with a try/finally and 
> stop manually.






[jira] [Updated] (SPARK-2256) pyspark: .take doesn't work ... sometimes ...

2014-09-07 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee updated SPARK-2256:
-
Labels: RDD pyspark take windows  (was: RDD pyspark take)

> pyspark: .take doesn't work ... sometimes ...
> --
>
> Key: SPARK-2256
> URL: https://issues.apache.org/jira/browse/SPARK-2256
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.0.0
> Environment: local file/remote HDFS
>Reporter: Ángel Álvarez
>  Labels: RDD, pyspark, take, windows
> Attachments: A_test.zip
>
>
> If I try to "take" some lines from a file, sometimes it doesn't work
> Code: 
> myfile = sc.textFile("A_ko")
> print myfile.take(10)
> Stacktrace:
> 14/06/24 09:29:27 INFO DAGScheduler: Failed to run take at mytest.py:19
> Traceback (most recent call last):
>   File "mytest.py", line 19, in 
> print myfile.take(10)
>   File "spark-1.0.0-bin-hadoop2\python\pyspark\rdd.py", line 868, in take
> iterator = mapped._jrdd.collectPartitions(partitionsToTake)[0].iterator()
>   File 
> "spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\java_gateway.py", 
> line 537, in __call__
>   File 
> "spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\protocol.py", 
> line 300, in get_return_value
> Test data:
> 
> A
> A
> A
> 
> 
> 
> 
> 
> 
> 
> 
> 
> A

[jira] [Commented] (SPARK-1087) Separate file for traceback and callsite related functions

2014-09-07 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124877#comment-14124877
 ] 

Matthew Farrellee commented on SPARK-1087:
--

[~jyotiska] PR 581 was merged (though it looked fairly trivial). is this still 
relevant?

> Separate file for traceback and callsite related functions
> --
>
> Key: SPARK-1087
> URL: https://issues.apache.org/jira/browse/SPARK-1087
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jyotiska NK
>
> Right now, _extract_concise_traceback() is written inside rdd.py which 
> provides the callsite information. But for 
> [SPARK-972](https://spark-project.atlassian.net/browse/SPARK-972) in PR #581, 
> we used the function from context.py. Also some issues were faced regarding 
> the return string format. 
> It would be a good idea to move the traceback function out of rdd.py and 
> create a separate file for future developments. 






[jira] [Commented] (SPARK-2023) PySpark reduce does a map side reduce and then sends the results to the driver for final reduce, instead do this more like Scala Spark.

2014-09-07 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124879#comment-14124879
 ] 

Matthew Farrellee commented on SPARK-2023:
--

[~holdenk] are you still concerned about this? if not, will you close it? if 
so, how can we identify the bottleneck to fix it?

> PySpark reduce does a map side reduce and then sends the results to the 
> driver for final reduce, instead do this more like Scala Spark.
> ---
>
> Key: SPARK-2023
> URL: https://issues.apache.org/jira/browse/SPARK-2023
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>
> PySpark reduce does a map-side reduce and then sends the results to the 
> driver for the final reduce; instead, do this more like Scala Spark. The 
> current implementation could be a bottleneck. 
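
The behavior described above can be modeled without Spark: each partition is
reduced locally (the map side), then only the per-partition results travel to
the driver, which performs the final reduce. The partition layout below is
made up for illustration.

```python
from functools import reduce
from operator import add

# Toy model: three "partitions" of an RDD of integers.
partitions = [[1, 2, 3], [4, 5], [6]]

# Map-side reduce: each partition collapses to a single value in parallel.
partial = [reduce(add, p) for p in partitions]

# Final reduce on the driver over the (small) list of partial results.
result = reduce(add, partial)

assert result == 21
```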






[jira] [Updated] (SPARK-3293) YARN web UI shows "SUCCEEDED" when the driver throws an exception in yarn-client

2014-09-07 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-3293:
---
Affects Version/s: 1.1.0

> YARN web UI shows "SUCCEEDED" when the driver throws an exception in yarn-client
> 
>
> Key: SPARK-3293
> URL: https://issues.apache.org/jira/browse/SPARK-3293
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.2, 1.1.0
>Reporter: wangfei
> Fix For: 1.2.0
>
>
> If an exception occurs, the YARN web UI's Applications -> FinalStatus will 
> still show "SUCCEEDED" where "FAILED" is expected.
> In the spark-1.0.2 release, only yarn-client mode showed this, but recently 
> yarn-cluster mode has the same problem.
> To reproduce:
> create a SparkContext, then throw an exception,
> and watch the applications page of the YARN web UI.






[jira] [Updated] (SPARK-3293) YARN web UI shows "SUCCEEDED" when the driver throws an exception in yarn-client

2014-09-07 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-3293:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

> YARN web UI shows "SUCCEEDED" when the driver throws an exception in yarn-client
> 
>
> Key: SPARK-3293
> URL: https://issues.apache.org/jira/browse/SPARK-3293
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.2, 1.1.0
>Reporter: wangfei
> Fix For: 1.2.0
>
>
> If an exception occurs, the YARN web UI's Applications -> FinalStatus will 
> still show "SUCCEEDED" where "FAILED" is expected.
> In the spark-1.0.2 release, only yarn-client mode showed this, but recently 
> yarn-cluster mode has the same problem.
> To reproduce:
> create a SparkContext, then throw an exception,
> and watch the applications page of the YARN web UI.






[jira] [Issue Comment Deleted] (SPARK-3293) YARN web UI shows "SUCCEEDED" when the driver throws an exception in yarn-client

2014-09-07 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-3293:
---
Comment: was deleted

(was: Here is a related PR:
https://github.com/apache/spark/pull/1788/files)

> YARN web UI shows "SUCCEEDED" when the driver throws an exception in yarn-client
> 
>
> Key: SPARK-3293
> URL: https://issues.apache.org/jira/browse/SPARK-3293
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.2, 1.1.0
>Reporter: wangfei
>Assignee: Guoqiang Li
> Fix For: 1.2.0
>
>
> If an exception occurs, the YARN web UI's Applications -> FinalStatus will 
> still show "SUCCEEDED" where "FAILED" is expected.
> In the spark-1.0.2 release, only yarn-client mode showed this, but recently 
> yarn-cluster mode has the same problem.
> To reproduce:
> create a SparkContext, then throw an exception,
> and watch the applications page of the YARN web UI.






[jira] [Commented] (SPARK-1191) Convert configs to use SparkConf

2014-09-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124898#comment-14124898
 ] 

Apache Spark commented on SPARK-1191:
-

User 'CodingCat' has created a pull request for this issue:
https://github.com/apache/spark/pull/2312

> Convert configs to use SparkConf
> 
>
> Key: SPARK-1191
> URL: https://issues.apache.org/jira/browse/SPARK-1191
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>
> There are many places in the YARN code that still use System.setProperty. We 
> should convert those to use SparkConf.
> One specific example is SPARK_YARN_MODE. There are others in classes like 
> ApplicationMaster and Client.
> Note that currently some configs can't be set in SparkConf and properly 
> picked up by the SparkContext, since the SparkConf isn't really shared with 
> the SparkContext. The only time we can get the SparkContext is after it has 
> been instantiated, which is too late.
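
The conversion described above, sketched in plain Python. The class and key
names are illustrative stand-ins, not Spark's actual API: the point is moving
from process-wide mutable state (the analogue of System.setProperty) to an
explicit conf object that is passed to whatever consumes it.

```python
import os

# Before: a global, order-dependent setting (analogue of System.setProperty).
os.environ["SPARK_YARN_MODE"] = "true"

# After: an explicit conf object threaded through the code.
class SparkConfSketch:
    """Toy stand-in for a SparkConf-style key/value configuration."""

    def __init__(self):
        self._settings = {}

    def set(self, key, value):
        self._settings[key] = value
        return self  # allow chaining, SparkConf-style

    def get(self, key, default=None):
        return self._settings.get(key, default)

conf = SparkConfSketch().set("spark.yarn.mode", "true")
assert conf.get("spark.yarn.mode") == "true"
```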






[jira] [Commented] (SPARK-3293) YARN web UI shows "SUCCEEDED" when the driver throws an exception in yarn-client

2014-09-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124896#comment-14124896
 ] 

Apache Spark commented on SPARK-3293:
-

User 'witgo' has created a pull request for this issue:
https://github.com/apache/spark/pull/2311

> YARN web UI shows "SUCCEEDED" when the driver throws an exception in yarn-client
> 
>
> Key: SPARK-3293
> URL: https://issues.apache.org/jira/browse/SPARK-3293
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.2, 1.1.0
>Reporter: wangfei
>Assignee: Guoqiang Li
> Fix For: 1.2.0
>
>
> If an exception occurs, the YARN web UI's Applications -> FinalStatus will 
> still show "SUCCEEDED" where "FAILED" is expected.
> In the spark-1.0.2 release, only yarn-client mode showed this, but recently 
> yarn-cluster mode has the same problem.
> To reproduce:
> create a SparkContext, then throw an exception,
> and watch the applications page of the YARN web UI.






[jira] [Commented] (SPARK-675) Gateway JVM should ask for less than SPARK_MEM memory

2014-09-07 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124918#comment-14124918
 ] 

Matthew Farrellee commented on SPARK-675:
-

[~joshrosen] it looks like SPARK-674 was resolved, do you think this is still 
an issue or can it be closed?

> Gateway JVM should ask for less than SPARK_MEM memory
> -
>
> Key: SPARK-675
> URL: https://issues.apache.org/jira/browse/SPARK-675
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Patrick Cogan
>Assignee: Josh Rosen
>Priority: Minor
> Fix For: 0.7.1
>
>
> This is not so big of a deal assuming that we fix SPARK-674, but it would be 
> nice if the gateway JVM asked for less than SPARK_MEM amount of memory. This 
> might require decoupling the class-path component of "run.sh" so it can be 
> used independently. 






[jira] [Commented] (SPARK-927) PySpark sample() doesn't work if numpy is installed on master but not on workers

2014-09-07 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124924#comment-14124924
 ] 

Matthew Farrellee commented on SPARK-927:
-

it looks like the issue is that RDDSampler checks for numpy in its constructor 
instead of when initializing the random number generator

> PySpark sample() doesn't work if numpy is installed on master but not on 
> workers
> 
>
> Key: SPARK-927
> URL: https://issues.apache.org/jira/browse/SPARK-927
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 0.8.0
>Reporter: Josh Rosen
>Assignee: Matthew Farrellee
>Priority: Minor
>
> PySpark's sample() method crashes with ImportErrors on the workers if numpy 
> is installed on the driver machine but not on the workers.  I'm not sure 
> what's the best way to fix this.  A general mechanism for automatically 
> shipping libraries from the master to the workers would address this, but 
> that could be complicated to implement.
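
The comment above suggests deferring the numpy probe from the constructor
(which runs on the driver) to the point where the random number generator is
first initialized (which runs on the worker). A hedged, pyspark-free sketch of
that idea, with a fallback to the stdlib random module when numpy is absent;
class and method names here are illustrative, not PySpark's actual RDDSampler:

```python
import random

class Sampler:
    def __init__(self, seed):
        # No numpy check here: the constructor may run on a machine
        # (the driver) whose libraries differ from the workers'.
        self.seed = seed
        self._rng = None

    def rng(self):
        # Lazy initialization: the probe happens where sampling happens.
        if self._rng is None:
            try:
                import numpy.random
                self._rng = numpy.random.RandomState(self.seed)
            except ImportError:
                self._rng = random.Random(self.seed)
        return self._rng

s = Sampler(42)
value = s.rng().uniform(0, 1)  # both RNG types provide uniform(low, high)
assert 0.0 <= value <= 1.0
```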






[jira] [Created] (SPARK-3430) Introduce ValueIncrementableHashMapAccumulator to compute Histogram

2014-09-07 Thread Suraj Satishkumar Sheth (JIRA)
Suraj Satishkumar Sheth created SPARK-3430:
--

 Summary: Introduce ValueIncrementableHashMapAccumulator to compute 
Histogram
 Key: SPARK-3430
 URL: https://issues.apache.org/jira/browse/SPARK-3430
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Suraj Satishkumar Sheth
Priority: Minor


Currently, we don't have a hash map that can be used as an accumulator to 
produce a histogram or distribution. This class will provide a customized 
HashMap implementation whose values can be incremented, 
e.g. map += (a,1) followed by map += (a,6) yields (a,7).
This has various applications, such as computing histograms, generating 
sampling strategies, and computing statistical metrics in MLlib.
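
The proposed merge semantics (adding (key, n) increments the stored value, so
per-partition maps can be summed on the driver) already exist in Python's
collections.Counter, which makes a convenient stand-in for a quick check of
the behavior described above:

```python
from collections import Counter

# map += (a,1) then map += (a,6) should yield (a,7).
m = Counter()
m.update({"a": 1})
m.update({"a": 6})
assert m["a"] == 7

# Merging two per-partition maps sums values per key, which is exactly
# what an accumulator needs when combining partial results.
merged = Counter({"a": 2, "b": 1}) + Counter({"a": 3})
assert merged == Counter({"a": 5, "b": 1})
```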






[jira] [Updated] (SPARK-3430) Introduce ValueIncrementableHashMapAccumulator to compute Histogram

2014-09-07 Thread Suraj Satishkumar Sheth (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suraj Satishkumar Sheth updated SPARK-3430:
---
Description: 
Pull request : https://github.com/apache/spark/pull/2314

Currently, we don't have a hash map that can be used as an accumulator to 
produce a histogram or distribution. This class will provide a customized 
HashMap implementation whose values can be incremented, 
e.g. map += (a,1) followed by map += (a,6) yields (a,7).
This has various applications, such as computing histograms, generating 
sampling strategies, and computing statistical metrics in MLlib.

  was:
Currently, we don't have a hash map that can be used as an accumulator to 
produce a histogram or distribution. This class will provide a customized 
HashMap implementation whose values can be incremented, 
e.g. map += (a,1) followed by map += (a,6) yields (a,7).
This has various applications, such as computing histograms, generating 
sampling strategies, and computing statistical metrics in MLlib.


> Introduce ValueIncrementableHashMapAccumulator to compute Histogram
> ---
>
> Key: SPARK-3430
> URL: https://issues.apache.org/jira/browse/SPARK-3430
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Suraj Satishkumar Sheth
>Priority: Minor
>
> Pull request : https://github.com/apache/spark/pull/2314
> Currently, we don't have a hash map that can be used as an accumulator to 
> produce a histogram or distribution. This class will provide a customized 
> HashMap implementation whose values can be incremented, 
> e.g. map += (a,1) followed by map += (a,6) yields (a,7).
> This has various applications, such as computing histograms, generating 
> sampling strategies, and computing statistical metrics in MLlib.






[jira] [Updated] (SPARK-3430) Introduce ValueIncrementableHashMapAccumulator to compute Histogram

2014-09-07 Thread Suraj Satishkumar Sheth (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suraj Satishkumar Sheth updated SPARK-3430:
---
Description: 
Pull request : https://github.com/apache/spark/pull/2314

Currently, we don't have a hash map that can be used as an accumulator to 
produce a histogram or distribution. This class will provide a customized 
HashMap implementation whose values can be incremented, 
e.g. map += (a,1) followed by map += (a,6) yields (a,7).
This has various applications, such as computing histograms, generating 
sampling strategies, and computing statistical metrics in MLlib.

Example usage:
val countMap = sc.accumulableCollection(new 
ValueIncrementableHashMapAccumulator[Int]())

data.foreach { record =>
  val fields = record.split("\t")
  countMap += ((0, 1L))                // count every record under key 0
  fields.zipWithIndex.foreach { case (field, idx) =>
    try {
      field.toDouble                   // keep only fields that parse as numbers
      countMap += ((idx + 1, 1L))
    } catch {
      case _: NumberFormatException => // skip non-numeric fields
    }
  }
}

  was:
Pull request : https://github.com/apache/spark/pull/2314

Currently, we don't have a hash map that can be used as an accumulator to 
produce a histogram or distribution. This class will provide a customized 
HashMap implementation whose values can be incremented, 
e.g. map += (a,1) followed by map += (a,6) yields (a,7).
This has various applications, such as computing histograms, generating 
sampling strategies, and computing statistical metrics in MLlib.


> Introduce ValueIncrementableHashMapAccumulator to compute Histogram
> ---
>
> Key: SPARK-3430
> URL: https://issues.apache.org/jira/browse/SPARK-3430
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Suraj Satishkumar Sheth
>Priority: Minor
>
> Pull request : https://github.com/apache/spark/pull/2314
> Currently, we don't have a hash map that can be used as an accumulator to 
> produce a histogram or distribution. This class will provide a customized 
> HashMap implementation whose values can be incremented, 
> e.g. map += (a,1) followed by map += (a,6) yields (a,7).
> This has various applications, such as computing histograms, generating 
> sampling strategies, and computing statistical metrics in MLlib.
> Example usage:
> val countMap = sc.accumulableCollection(new 
> ValueIncrementableHashMapAccumulator[Int]())
>
> data.foreach { record =>
>   val fields = record.split("\t")
>   countMap += ((0, 1L))                // count every record under key 0
>   fields.zipWithIndex.foreach { case (field, idx) =>
>     try {
>       field.toDouble                   // keep only fields that parse as numbers
>       countMap += ((idx + 1, 1L))
>     } catch {
>       case _: NumberFormatException => // skip non-numeric fields
>     }
>   }
> }






[jira] [Updated] (SPARK-3430) Introduce ValueIncrementableHashMapAccumulator to compute Histogram

2014-09-07 Thread Suraj Satishkumar Sheth (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suraj Satishkumar Sheth updated SPARK-3430:
---
Priority: Major  (was: Minor)

> Introduce ValueIncrementableHashMapAccumulator to compute Histogram
> ---
>
> Key: SPARK-3430
> URL: https://issues.apache.org/jira/browse/SPARK-3430
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Suraj Satishkumar Sheth
>
> Pull request : https://github.com/apache/spark/pull/2314
> Currently, we don't have a hash map that can be used as an accumulator to 
> produce a histogram or distribution. This class will provide a customized 
> HashMap implementation whose values can be incremented, 
> e.g. map += (a,1) followed by map += (a,6) yields (a,7).
> This has various applications, such as computing histograms, generating 
> sampling strategies, and computing statistical metrics in MLlib.
> Example usage:
> val countMap = sc.accumulableCollection(new 
> ValueIncrementableHashMapAccumulator[Int]())
>
> data.foreach { record =>
>   val fields = record.split("\t")
>   countMap += ((0, 1L))                // count every record under key 0
>   fields.zipWithIndex.foreach { case (field, idx) =>
>     try {
>       field.toDouble                   // keep only fields that parse as numbers
>       countMap += ((idx + 1, 1L))
>     } catch {
>       case _: NumberFormatException => // skip non-numeric fields
>     }
>   }
> }






[jira] [Updated] (SPARK-3430) Introduce ValueIncrementableHashMapAccumulator to compute Histogram and other statistical metrics

2014-09-07 Thread Suraj Satishkumar Sheth (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suraj Satishkumar Sheth updated SPARK-3430:
---
Summary: Introduce ValueIncrementableHashMapAccumulator to compute 
Histogram and other statistical metrics  (was: Introduce 
ValueIncrementableHashMapAccumulator to compute Histogram)

> Introduce ValueIncrementableHashMapAccumulator to compute Histogram and other 
> statistical metrics
> -
>
> Key: SPARK-3430
> URL: https://issues.apache.org/jira/browse/SPARK-3430
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Suraj Satishkumar Sheth
>
> Pull request : https://github.com/apache/spark/pull/2314
> Currently, we don't have a hash map that can be used as an accumulator to 
> produce a histogram or distribution. This class will provide a customized 
> HashMap implementation whose values can be incremented, 
> e.g. map += (a,1) followed by map += (a,6) yields (a,7).
> This has various applications, such as computing histograms, generating 
> sampling strategies, and computing statistical metrics in MLlib.
> Example usage:
> val countMap = sc.accumulableCollection(new 
> ValueIncrementableHashMapAccumulator[Int]())
>
> data.foreach { record =>
>   val fields = record.split("\t")
>   countMap += ((0, 1L))                // count every record under key 0
>   fields.zipWithIndex.foreach { case (field, idx) =>
>     try {
>       field.toDouble                   // keep only fields that parse as numbers
>       countMap += ((idx + 1, 1L))
>     } catch {
>       case _: NumberFormatException => // skip non-numeric fields
>     }
>   }
> }






[jira] [Commented] (SPARK-1087) Separate file for traceback and callsite related functions

2014-09-07 Thread Jyotiska NK (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124999#comment-14124999
 ] 

Jyotiska NK commented on SPARK-1087:


We initially thought this would be a good feature to add going forward. But 
after the PR was merged, it was abandoned. Also, PR 581 was created in the 
incubator github repo. In the new one, it was PR #34. If it is a relevant 
feature, I can submit a PR for this.

> Separate file for traceback and callsite related functions
> --
>
> Key: SPARK-1087
> URL: https://issues.apache.org/jira/browse/SPARK-1087
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jyotiska NK
>
> Right now, _extract_concise_traceback() is written inside rdd.py which 
> provides the callsite information. But for 
> [SPARK-972](https://spark-project.atlassian.net/browse/SPARK-972) in PR #581, 
> we used the function from context.py. Also some issues were faced regarding 
> the return string format. 
> It would be a good idea to move the traceback function out of rdd.py and 
> create a separate file for future developments. 






[jira] [Comment Edited] (SPARK-1087) Separate file for traceback and callsite related functions

2014-09-07 Thread Jyotiska NK (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124999#comment-14124999
 ] 

Jyotiska NK edited comment on SPARK-1087 at 9/7/14 6:32 PM:


We initially thought this would be a good feature to add going forward. But 
after the PR was merged, it was abandoned. Also, PR 581 was created in the 
incubator github repo. In the new one, it was [PR 
#34](https://github.com/apache/spark/pull/34). If it is a relevant feature, I 
can submit a PR for this.


was (Author: jyotiska):
We initially thought this would be a good feature to add going forward. But 
after the PR was merged, it was abandoned. Also, PR 581 was created in the 
incubator github repo. In the new one, it was PR #34. If it is a relevant 
feature, I can submit a PR for this.

> Separate file for traceback and callsite related functions
> --
>
> Key: SPARK-1087
> URL: https://issues.apache.org/jira/browse/SPARK-1087
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jyotiska NK
>
> Right now, _extract_concise_traceback() is written inside rdd.py which 
> provides the callsite information. But for 
> [SPARK-972](https://spark-project.atlassian.net/browse/SPARK-972) in PR #581, 
> we used the function from context.py. Also some issues were faced regarding 
> the return string format. 
> It would be a good idea to move the traceback function out of rdd.py and 
> create a separate file for future developments. 






[jira] [Comment Edited] (SPARK-1087) Separate file for traceback and callsite related functions

2014-09-07 Thread Jyotiska NK (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124999#comment-14124999
 ] 

Jyotiska NK edited comment on SPARK-1087 at 9/7/14 6:33 PM:


We initially thought this would be a good feature to add going forward. But 
after the PR was merged, it was abandoned. Also, PR 581 was created in the 
incubator github repo. In the new one, it was [PR 
#34|https://github.com/apache/spark/pull/34]. If it is a relevant feature, I 
can submit a PR for this.


was (Author: jyotiska):
We initially thought this would be a good feature to add going forward. But 
after the PR was merged, it was abandoned. Also, PR 581 was created in the 
incubator github repo. In the new one, it was [PR 
#34](https://github.com/apache/spark/pull/34). If it is a relevant feature, I 
can submit a PR for this.

> Separate file for traceback and callsite related functions
> --
>
> Key: SPARK-1087
> URL: https://issues.apache.org/jira/browse/SPARK-1087
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jyotiska NK
>
> Right now, _extract_concise_traceback() is written inside rdd.py which 
> provides the callsite information. But for 
> [SPARK-972](https://spark-project.atlassian.net/browse/SPARK-972) in PR #581, 
> we used the function from context.py. Also some issues were faced regarding 
> the return string format. 
> It would be a good idea to move the traceback function out of rdd.py and 
> create a separate file for future developments. 






[jira] [Commented] (SPARK-3174) Under YARN, add and remove executors based on load

2014-09-07 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125008#comment-14125008
 ] 

Brock Noland commented on SPARK-3174:
-

Thank you all for your work on this issue! I am no expert here, but I believe 
that under "Policy For Adding Executors" the following sentence requires some 
clarification: "If the ratio of executable tasks to task capacity is greater 
than a configurable threshold, request executors until this would no longer be 
the case."
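
One way to read the quoted sentence is the following toy loop (illustrative
only; the function name, parameters, and numbers are made up and this is not
Spark's actual implementation): keep requesting executors while the ratio of
pending tasks to total task capacity stays above the threshold.

```python
def executors_to_request(pending_tasks, executors, cores_per_executor, threshold):
    """Request executors until pending_tasks / task_capacity <= threshold."""
    requested = 0
    while pending_tasks / ((executors + requested) * cores_per_executor) > threshold:
        requested += 1
    return requested

# 100 pending tasks, 4 executors x 4 cores = 16 slots, threshold 2.0:
# capacity must reach 50 slots (13 executors), so 9 more are requested.
assert executors_to_request(100, 4, 4, 2.0) == 9
```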



> Under YARN, add and remove executors based on load
> --
>
> Key: SPARK-3174
> URL: https://issues.apache.org/jira/browse/SPARK-3174
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.0.2
>Reporter: Sandy Ryza
>Assignee: Andrew Or
> Attachments: SPARK-3174design.pdf
>
>
> A common complaint with Spark in a multi-tenant environment is that 
> applications have a fixed allocation that doesn't grow and shrink with their 
> resource needs.  We're blocked on YARN-1197 for dynamically changing the 
> resources within executors, but we can still allocate and discard whole 
> executors.
> I think it would be useful to have some heuristics that
> * Request more executors when many pending tasks are building up
> * Request more executors when RDDs can't fit in memory
> * Discard executors when few tasks are running / pending and there's not much 
> in memory
> Bonus points: migrate blocks from executors we're about to discard to 
> executors with free space.






[jira] [Commented] (SPARK-1956) Enable shuffle consolidation by default

2014-09-07 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125027#comment-14125027
 ] 

Mridul Muralidharan commented on SPARK-1956:


The recent changes to BlockObjectWriter have introduced bugs again ... I don't 
know how badly they affect the codebase, but it would not be prudent to enable 
this by default until they are fixed and the changes are properly analyzed.

> Enable shuffle consolidation by default
> ---
>
> Key: SPARK-1956
> URL: https://issues.apache.org/jira/browse/SPARK-1956
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.0.0
>Reporter: Sandy Ryza
>
> The only drawbacks are on ext3, and most everyone has ext4 at this point.  I 
> think it's better to aim the default at the common case.






[jira] [Updated] (SPARK-2232) Fix Jenkins tests in Maven

2014-09-07 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-2232:
--
Priority: Critical  (was: Major)

> Fix Jenkins tests in Maven
> --
>
> Key: SPARK-2232
> URL: https://issues.apache.org/jira/browse/SPARK-2232
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Patrick Wendell
>Priority: Critical
>
> It appears Maven tests are failing under the newer Hadoop configurations. We 
> need to go through and make sure all the Spark master build configurations 
> are passing.
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Master%20Matrix/






[jira] [Updated] (SPARK-2232) Fix Jenkins tests in Maven

2014-09-07 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-2232:
--
Priority: Blocker  (was: Critical)

> Fix Jenkins tests in Maven
> --
>
> Key: SPARK-2232
> URL: https://issues.apache.org/jira/browse/SPARK-2232
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Patrick Wendell
>Priority: Blocker
>
> It appears Maven tests are failing under the newer Hadoop configurations. We 
> need to go through and make sure all the Spark master build configurations 
> are passing.
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Master%20Matrix/






[jira] [Updated] (SPARK-2232) Fix Jenkins tests in Maven

2014-09-07 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-2232:
--
Description: 
It appears Maven tests are failing under the newer Hadoop configurations. We 
need to go through and make sure all the Spark master build configurations are 
passing.

https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/

  was:
It appears Maven tests are failing under the newer Hadoop configurations. We 
need to go through and make sure all the Spark master build configurations are 
passing.

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Master%20Matrix/


> Fix Jenkins tests in Maven
> --
>
> Key: SPARK-2232
> URL: https://issues.apache.org/jira/browse/SPARK-2232
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Patrick Wendell
>Priority: Blocker
>
> It appears Maven tests are failing under the newer Hadoop configurations. We 
> need to go through and make sure all the Spark master build configurations 
> are passing.
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/






[jira] [Created] (SPARK-3431) Parallelize execution of tests

2014-09-07 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-3431:
---

 Summary: Parallelize execution of tests
 Key: SPARK-3431
 URL: https://issues.apache.org/jira/browse/SPARK-3431
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Nicholas Chammas


Running all the tests in {{dev/run-tests}} takes up to 2 hours. A common 
strategy to cut test time down is to parallelize the execution of the tests. 
Doing that may in turn require some prerequisite changes to be made to how 
certain tests run.






[jira] [Created] (SPARK-3432) Fix logging of unit test execution time

2014-09-07 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-3432:
---

 Summary: Fix logging of unit test execution time
 Key: SPARK-3432
 URL: https://issues.apache.org/jira/browse/SPARK-3432
 Project: Spark
  Issue Type: Sub-task
Reporter: Nicholas Chammas
Priority: Minor


[Per 
Reynold|http://mail-archives.apache.org/mod_mbox/spark-dev/201408.mbox/%3CCAPh_B=bDCGAJXPP_CgiU0NJS1+KmhmX31But57WDqUeJ=bu...@mail.gmail.com%3E]:
{quote}
I think the first baby step is to log the amount of time each test case takes. 
This is supposed to happen already (see the flag), but somehow the times are not 
showing. If you have some time to figure that out, that'd be great. 

https://github.com/apache/spark/blob/master/project/SparkBuild.scala#L350
{quote}
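The flag mentioned above can be sketched as follows; this is a hedged guess at the relevant sbt setting (the actual line in SparkBuild.scala may differ), using ScalaTest's standard-output reporter with the `D` (durations) option:

```scala
// Hypothetical sbt fragment (sbt 0.13-era syntax): pass "-oD" to ScalaTest so
// the standard-output reporter ("o") also prints each test's duration ("D").
testOptions in Test += Tests.Argument(TestFrameworks.ScalaTest, "-oD")
```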







[jira] [Resolved] (SPARK-3394) TakeOrdered crashes when limit is 0

2014-09-07 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-3394.
--
   Resolution: Fixed
Fix Version/s: 1.0.3
   1.2.0
   1.1.1

> TakeOrdered crashes when limit is 0
> ---
>
> Key: SPARK-3394
> URL: https://issues.apache.org/jira/browse/SPARK-3394
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Reporter: Eric Liang
>Assignee: Eric Liang
>Priority: Minor
> Fix For: 1.1.1, 1.2.0, 1.0.3
>
>







[jira] [Updated] (SPARK-3394) TakeOrdered crashes when limit is 0

2014-09-07 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-3394:
-
Component/s: Spark Core

> TakeOrdered crashes when limit is 0
> ---
>
> Key: SPARK-3394
> URL: https://issues.apache.org/jira/browse/SPARK-3394
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Reporter: Eric Liang
>Assignee: Eric Liang
>Priority: Minor
> Fix For: 1.1.1, 1.2.0, 1.0.3
>
>







[jira] [Updated] (SPARK-3263) PR #720 broke GraphGenerator.logNormal

2014-09-07 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3263:
---
Fix Version/s: (was: 1.3.0)
   1.2.0

> PR #720 broke GraphGenerator.logNormal
> --
>
> Key: SPARK-3263
> URL: https://issues.apache.org/jira/browse/SPARK-3263
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Reporter: RJ Nowling
> Fix For: 1.2.0
>
>
> PR #720 made multiple changes to GraphGenerator.logNormalGraph, including:
> * Replacing the calls to the functions for generating random vertices and 
> edges with in-line implementations that use different equations
> * Hard-coding the RNG seeds, so the method now generates the same graph for a 
> given number of vertices, edges, mu, and sigma -- the user can no longer 
> override the seed or request a randomly generated one
> * A backwards-incompatible change to the logNormalGraph signature, introducing 
> a new required parameter
> * Not updating the Scala docs and programming guide for the API changes
> I also see that PR #720 added a synthetic benchmark to the examples.
> Based on my reading of the Pregel paper, I believe the in-line functions are 
> incorrect. I propose:
> * Removing the in-line calls
> * Adding a seed for deterministic behavior (when desired)
> * Keeping the number-of-partitions parameter
> * Updating the synthetic benchmark example






[jira] [Commented] (SPARK-3360) Add RowMatrix.multiply(Vector)

2014-09-07 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125129#comment-14125129
 ] 

Yu Ishikawa commented on SPARK-3360:


Hi Sandy,

I'm interested in this issue.
It doesn't seem difficult to implement the operation that multiplies a 
RowMatrix by a Vector. What do you think about the reverse operation, 
multiplying a Vector by a RowMatrix? I mean, `Vector.multiply(RowMatrix)`.

By the way, for your information, we are discussing matrix manipulation for 
large data sets in this issue:
https://issues.apache.org/jira/browse/SPARK-3416

Thanks,

> Add RowMatrix.multiply(Vector)
> --
>
> Key: SPARK-3360
> URL: https://issues.apache.org/jira/browse/SPARK-3360
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Sandy Ryza
>
> RowMatrix currently has multiply(Matrix), but multiply(Vector) would be 
> useful as well.
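As a shape-only sketch of the requested operation (plain Scala collections rather than MLlib's actual RowMatrix API, with invented names), multiplying a row-oriented matrix by a vector is one dot product per row:

```scala
// Illustrative only: a row-oriented matrix as Seq[Array[Double]]; the product
// with a local vector is the per-row dot product, which is the per-row work a
// distributed RowMatrix.multiply(Vector) would do.
object RowMatVec {
  def multiply(rows: Seq[Array[Double]], v: Array[Double]): Seq[Double] =
    rows.map(row => row.zip(v).map { case (a, b) => a * b }.sum)

  def main(args: Array[String]): Unit = {
    // [[1, 2], [3, 4]] * [1, 1] = [3, 7]
    println(multiply(Seq(Array(1.0, 2.0), Array(3.0, 4.0)), Array(1.0, 1.0)))
  }
}
```

The reverse direction Yu asks about (a Vector times a RowMatrix) would need a reduction across rows rather than a per-row map, which is why it is a different, and more expensive, distributed operation.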






[jira] [Resolved] (SPARK-3408) Limit operator doesn't work with sort based shuffle

2014-09-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-3408.

   Resolution: Fixed
Fix Version/s: 1.2.0
   1.1.1

> Limit operator doesn't work with sort based shuffle
> ---
>
> Key: SPARK-3408
> URL: https://issues.apache.org/jira/browse/SPARK-3408
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.1.1, 1.2.0
>
>







[jira] [Created] (SPARK-3433) Mima false-positives with @DeveloperAPI and @Experimental annotations

2014-09-07 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-3433:
-

 Summary: Mima false-positives with @DeveloperAPI and @Experimental 
annotations
 Key: SPARK-3433
 URL: https://issues.apache.org/jira/browse/SPARK-3433
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.2.0
Reporter: Josh Rosen
Assignee: Prashant Sharma
Priority: Minor


In https://github.com/apache/spark/pull/2315, I found two cases where 
{{@DeveloperAPI}} and {{@Experimental}} annotations didn't prevent 
false-positive warnings from Mima.  To reproduce this problem, run dev/mima as 
of 
https://github.com/JoshRosen/spark/commit/ec90e21947b615d4ef94a3a54cfd646924ccaf7c.
  The spurious warnings are listed at the top of 
https://gist.github.com/JoshRosen/5d8df835516dc367389d.






[jira] [Resolved] (SPARK-3415) Using sys.stderr in pyspark results in error

2014-09-07 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3415.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2287
[https://github.com/apache/spark/pull/2287]

> Using sys.stderr in pyspark results in error
> 
>
> Key: SPARK-3415
> URL: https://issues.apache.org/jira/browse/SPARK-3415
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Ward Viaene
>  Labels: python
> Fix For: 1.2.0
>
>
> Using sys.stderr in pyspark results in: 
>   File "/home/spark-1.1/dist/python/pyspark/cloudpickle.py", line 660, in 
> save_file
> from ..transport.adapter import SerializingAdapter
> ValueError: Attempted relative import beyond toplevel package
> Code to reproduce (copy paste the code in pyspark):
> import sys
>   
> class TestClass(object):
> def __init__(self, out = sys.stderr):
> self.out = out
> def getOne(self):
> return 'one'
>   
> 
> def f():
> print type(t)
> return 'ok'
> 
>   
> t = TestClass()
> a = [ 1 , 2, 3, 4, 5 ]
> b = sc.parallelize(a)
> b.map(lambda x: f()).first()






[jira] [Commented] (SPARK-3174) Under YARN, add and remove executors based on load

2014-09-07 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125172#comment-14125172
 ] 

Patrick Wendell commented on SPARK-3174:


Hey Sandy - thanks for posting the design here. This proposes moving blocks off 
of executors before they are decommissioned. That might cause issues for 
long-running or ETL-style workloads, since the accumulated state on the 
machines could be very large (i.e. gigabytes of data). Another approach would 
be to use the YARN shuffle service directly, decoupling the shuffle data from 
Spark executors. Just wanted to mention it as a possibility. I think Andrew is 
looking at this in parallel as well.

> Under YARN, add and remove executors based on load
> --
>
> Key: SPARK-3174
> URL: https://issues.apache.org/jira/browse/SPARK-3174
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.0.2
>Reporter: Sandy Ryza
>Assignee: Andrew Or
> Attachments: SPARK-3174design.pdf
>
>
> A common complaint with Spark in a multi-tenant environment is that 
> applications have a fixed allocation that doesn't grow and shrink with their 
> resource needs.  We're blocked on YARN-1197 for dynamically changing the 
> resources within executors, but we can still allocate and discard whole 
> executors.
> I think it would be useful to have some heuristics that
> * Request more executors when many pending tasks are building up
> * Request more executors when RDDs can't fit in memory
> * Discard executors when few tasks are running / pending and there's not much 
> in memory
> Bonus points: migrate blocks from executors we're about to discard to 
> executors with free space.
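The heuristics above can be sketched as a pure decision function; the thresholds and names here are invented for illustration and are not Spark's actual policy:

```scala
// Toy sketch of the scale-up/scale-down heuristics listed above;
// all thresholds and names are invented for this example.
object ExecutorHeuristic {
  sealed trait Action
  case object RequestExecutor extends Action
  case object ReleaseExecutor extends Action
  case object NoOp extends Action

  def decide(pendingTasks: Int, runningTasks: Int, rddsDontFitInMemory: Boolean): Action =
    if (pendingTasks > 10 || rddsDontFitInMemory) RequestExecutor // backlog or memory pressure
    else if (pendingTasks == 0 && runningTasks == 0) ReleaseExecutor // idle: give resources back
    else NoOp
}
```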






[jira] [Resolved] (SPARK-3280) Made sort-based shuffle the default implementation

2014-09-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-3280.

   Resolution: Fixed
Fix Version/s: 1.2.0

> Made sort-based shuffle the default implementation
> --
>
> Key: SPARK-3280
> URL: https://issues.apache.org/jira/browse/SPARK-3280
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.2.0
>
> Attachments: hash-sort-comp.png
>
>
> sort-based shuffle has lower memory usage and seems to outperform hash-based 
> in almost all of our testing.






[jira] [Resolved] (SPARK-938) OpenStack Swift Storage Support

2014-09-07 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-938.
---
   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 2298
[https://github.com/apache/spark/pull/2298]

> OpenStack Swift Storage Support
> ---
>
> Key: SPARK-938
> URL: https://issues.apache.org/jira/browse/SPARK-938
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, Examples, Input/Output, Spark Core
>Affects Versions: 0.8.1
>Reporter: Murali Raju
>Priority: Minor
> Fix For: 1.1.0
>
>
> This issue is to track OpenStack Swift Storage support (development in 
> progress) in addition to S3 for Spark.






[jira] [Updated] (SPARK-938) OpenStack Swift Storage Support

2014-09-07 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-938:
--
Assignee: Gil Vernik

> OpenStack Swift Storage Support
> ---
>
> Key: SPARK-938
> URL: https://issues.apache.org/jira/browse/SPARK-938
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, Examples, Input/Output, Spark Core
>Affects Versions: 0.8.1
>Reporter: Murali Raju
>Assignee: Gil Vernik
>Priority: Minor
> Fix For: 1.1.0
>
>
> This issue is to track OpenStack Swift Storage support (development in 
> progress) in addition to S3 for Spark.






[jira] [Commented] (SPARK-938) OpenStack Swift Storage Support

2014-09-07 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125195#comment-14125195
 ] 

Patrick Wendell commented on SPARK-938:
---

This was fixed by [~gvernik] with [~rxin] authoring a slightly revised version 
of the patch.

> OpenStack Swift Storage Support
> ---
>
> Key: SPARK-938
> URL: https://issues.apache.org/jira/browse/SPARK-938
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, Examples, Input/Output, Spark Core
>Affects Versions: 0.8.1
>Reporter: Murali Raju
>Assignee: Gil Vernik
>Priority: Minor
> Fix For: 1.1.0
>
>
> This issue is to track OpenStack Swift Storage support (development in 
> progress) in addition to S3 for Spark.






[jira] [Commented] (SPARK-3174) Under YARN, add and remove executors based on load

2014-09-07 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125235#comment-14125235
 ] 

Sandy Ryza commented on SPARK-3174:
---

To be clear, by YARN shuffle you mean the MR2 approach, where shuffle data is 
served by an auxiliary service living in the NodeManager? I think that could 
definitely be beneficial, though it does have drawbacks down the line, like 
more difficulty in accounting for and throttling disk and network IO.

For Hive-on-Spark's needs, the main motivation is that someone who leaves their 
Hive session open but idle shouldn't hold on to a bunch of cluster resources. 
So, for that purpose, it might be sufficient to discard executors only when no 
jobs are running. In that case, we wouldn't need to worry about shuffle data at 
all.

Also, do you know when shuffle data gets deleted? After the stage that's 
fetching it completes, or after the job completes?

> Under YARN, add and remove executors based on load
> --
>
> Key: SPARK-3174
> URL: https://issues.apache.org/jira/browse/SPARK-3174
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.0.2
>Reporter: Sandy Ryza
>Assignee: Andrew Or
> Attachments: SPARK-3174design.pdf
>
>
> A common complaint with Spark in a multi-tenant environment is that 
> applications have a fixed allocation that doesn't grow and shrink with their 
> resource needs.  We're blocked on YARN-1197 for dynamically changing the 
> resources within executors, but we can still allocate and discard whole 
> executors.
> I think it would be useful to have some heuristics that
> * Request more executors when many pending tasks are building up
> * Request more executors when RDDs can't fit in memory
> * Discard executors when few tasks are running / pending and there's not much 
> in memory
> Bonus points: migrate blocks from executors we're about to discard to 
> executors with free space.






[jira] [Updated] (SPARK-2425) Standalone Master is too aggressive in removing Applications

2014-09-07 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2425:
---
Priority: Critical  (was: Major)

> Standalone Master is too aggressive in removing Applications
> 
>
> Key: SPARK-2425
> URL: https://issues.apache.org/jira/browse/SPARK-2425
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Mark Hamstra
>Assignee: Mark Hamstra
>Priority: Critical
>
> When standalone Executors trying to run a particular Application fail a 
> cumulative ApplicationState.MAX_NUM_RETRY times, the Master will remove the 
> Application.  This will be true even if there actually are a number of 
> Executors that are successfully running the Application.  This makes 
> long-running standalone-mode Applications in particular unnecessarily 
> vulnerable to limited failures in the cluster -- e.g., a single bad node on 
> which Executors repeatedly fail for any reason can prevent an Application 
> from starting or can result in a running Application being removed even 
> though it could continue to run successfully (just not making use of all 
> potential Workers and Executors.) 






[jira] [Updated] (SPARK-2425) Standalone Master is too aggressive in removing Applications

2014-09-07 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2425:
---
Target Version/s: 1.2.0  (was: 1.0.3)

> Standalone Master is too aggressive in removing Applications
> 
>
> Key: SPARK-2425
> URL: https://issues.apache.org/jira/browse/SPARK-2425
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Mark Hamstra
>Assignee: Mark Hamstra
>
> When standalone Executors trying to run a particular Application fail a 
> cumulative ApplicationState.MAX_NUM_RETRY times, the Master will remove the 
> Application.  This will be true even if there actually are a number of 
> Executors that are successfully running the Application.  This makes 
> long-running standalone-mode Applications in particular unnecessarily 
> vulnerable to limited failures in the cluster -- e.g., a single bad node on 
> which Executors repeatedly fail for any reason can prevent an Application 
> from starting or can result in a running Application being removed even 
> though it could continue to run successfully (just not making use of all 
> potential Workers and Executors.) 






[jira] [Assigned] (SPARK-2048) Optimizations to CPU usage of external spilling code

2014-09-07 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia reassigned SPARK-2048:


Assignee: Matei Zaharia

> Optimizations to CPU usage of external spilling code
> 
>
> Key: SPARK-2048
> URL: https://issues.apache.org/jira/browse/SPARK-2048
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
> Fix For: 1.1.0
>
>
> In the external spilling code in ExternalAppendOnlyMap and CoGroupedRDD, 
> there are a few opportunities for optimization:
> - There are lots of uses of pattern-matching on Tuple2 (e.g. val (k, v) = 
> pair), which we found to be much slower than accessing fields directly
> - Hash codes for each element are computed many times in 
> StreamBuffer.minKeyHash, which will be expensive for some data types
> - Uses of buffer.remove() may be expensive if there are lots of hash 
> collisions (better to swap in the last element into that position)
> - More objects are allocated than is probably necessary, e.g. ArrayBuffers 
> and pairs
> - Because ExternalAppendOnlyMap is only given one key-value pair at a time, 
> it allocates a new update function on each one, unlike the way we pass a 
> single update function to AppendOnlyMap in Aggregator
> These should help because situations where we're spilling are also ones where 
> there is presumably a lot of GC pressure in the new generation.
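To make the first bullet concrete (illustrative code, not Spark's actual spilling loop), the two Tuple2 access patterns look like this; the destructuring form goes through an extractor per element, while `_1`/`_2` read the fields directly:

```scala
// Illustrative comparison of the two Tuple2 access styles from the list above.
object PairAccess {
  // Pattern match: `val (k, v) = pair` desugars to an extractor call per element.
  def sumDestructured(pairs: Array[(Int, Int)]): Long = {
    var total = 0L
    for (pair <- pairs) {
      val (k, v) = pair
      total += k + v
    }
    total
  }

  // Direct field access: no extractor, just `_1` and `_2` reads.
  def sumDirect(pairs: Array[(Int, Int)]): Long = {
    var total = 0L
    var i = 0
    while (i < pairs.length) {
      total += pairs(i)._1 + pairs(i)._2
      i += 1
    }
    total
  }
}
```

Both compute the same sum; the direct form avoids the per-element destructuring that the issue identifies as slow in hot spill loops.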






[jira] [Resolved] (SPARK-2048) Optimizations to CPU usage of external spilling code

2014-09-07 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-2048.
--
Resolution: Fixed

> Optimizations to CPU usage of external spilling code
> 
>
> Key: SPARK-2048
> URL: https://issues.apache.org/jira/browse/SPARK-2048
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
> Fix For: 1.1.0
>
>
> In the external spilling code in ExternalAppendOnlyMap and CoGroupedRDD, 
> there are a few opportunities for optimization:
> - There are lots of uses of pattern-matching on Tuple2 (e.g. val (k, v) = 
> pair), which we found to be much slower than accessing fields directly
> - Hash codes for each element are computed many times in 
> StreamBuffer.minKeyHash, which will be expensive for some data types
> - Uses of buffer.remove() may be expensive if there are lots of hash 
> collisions (better to swap in the last element into that position)
> - More objects are allocated than is probably necessary, e.g. ArrayBuffers 
> and pairs
> - Because ExternalAppendOnlyMap is only given one key-value pair at a time, 
> it allocates a new update function on each one, unlike the way we pass a 
> single update function to AppendOnlyMap in Aggregator
> These should help because situations where we're spilling are also ones where 
> there is presumably a lot of GC pressure in the new generation.






[jira] [Commented] (SPARK-2048) Optimizations to CPU usage of external spilling code

2014-09-07 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125247#comment-14125247
 ] 

Matei Zaharia commented on SPARK-2048:
--

Yeah, sounds good, thanks for pointing that out.

> Optimizations to CPU usage of external spilling code
> 
>
> Key: SPARK-2048
> URL: https://issues.apache.org/jira/browse/SPARK-2048
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Matei Zaharia
> Fix For: 1.1.0
>
>
> In the external spilling code in ExternalAppendOnlyMap and CoGroupedRDD, 
> there are a few opportunities for optimization:
> - There are lots of uses of pattern-matching on Tuple2 (e.g. val (k, v) = 
> pair), which we found to be much slower than accessing fields directly
> - Hash codes for each element are computed many times in 
> StreamBuffer.minKeyHash, which will be expensive for some data types
> - Uses of buffer.remove() may be expensive if there are lots of hash 
> collisions (better to swap in the last element into that position)
> - More objects are allocated than is probably necessary, e.g. ArrayBuffers 
> and pairs
> - Because ExternalAppendOnlyMap is only given one key-value pair at a time, 
> it allocates a new update function on each one, unlike the way we pass a 
> single update function to AppendOnlyMap in Aggregator
> These should help because situations where we're spilling are also ones where 
> there is presumably a lot of GC pressure in the new generation.


