[jira] [Updated] (SPARK-1242) Add aggregate to python API

2014-04-25 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-1242:
-

Assignee: Holden Karau

 Add aggregate to python API
 ---

 Key: SPARK-1242
 URL: https://issues.apache.org/jira/browse/SPARK-1242
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 0.9.0
Reporter: Holden Karau
Assignee: Holden Karau
Priority: Trivial
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1626) Update Spark YARN docs to use spark-submit

2014-04-25 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-1626:
--

 Summary: Update Spark YARN docs to use spark-submit
 Key: SPARK-1626
 URL: https://issues.apache.org/jira/browse/SPARK-1626
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Blocker
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1625) Ensure all legacy YARN options are supported with spark-submit

2014-04-25 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1625:
---

Component/s: YARN

 Ensure all legacy YARN options are supported with spark-submit
 --

 Key: SPARK-1625
 URL: https://issues.apache.org/jira/browse/SPARK-1625
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Blocker
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1621) Update Chill to 0.3.6

2014-04-25 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1621:
---

Issue Type: Dependency upgrade  (was: Improvement)

 Update Chill to 0.3.6
 -

 Key: SPARK-1621
 URL: https://issues.apache.org/jira/browse/SPARK-1621
 Project: Spark
  Issue Type: Dependency upgrade
Reporter: Matei Zaharia
Assignee: Matei Zaharia
Priority: Minor
 Fix For: 1.0.0


 It registers more Scala classes, including things like Ranges that we had to 
 register manually before. See https://github.com/twitter/chill/releases for 
 Chill's change log.
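 For context, a hedged sketch of the kind of manual registration this removes 
 (class choice assumed; not Spark's actual KryoRegistrator code):
 {code}
 import com.esotericsoftware.kryo.Kryo

 val kryo = new Kryo()
 // Before Chill 0.3.6, Scala classes like Range had to be registered by hand:
 kryo.register(classOf[scala.collection.immutable.Range])
 {code}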



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1582) Job cancellation does not interrupt threads

2014-04-25 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1582.


   Resolution: Fixed
Fix Version/s: 1.0.0

 Job cancellation does not interrupt threads
 ---

 Key: SPARK-1582
 URL: https://issues.apache.org/jira/browse/SPARK-1582
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 0.9.1
Reporter: Aaron Davidson
Assignee: Aaron Davidson
 Fix For: 1.0.0


 Cancelling Spark jobs is limited because executors that are blocked are not 
 interrupted. In effect, the cancellation will succeed and the job will no 
 longer be running, but executor threads may still be tied up with the 
 cancelled job and unable to do further work until it completes. This is 
 particularly problematic in the case of deadlock or unlimited/long timeouts.
 It would be useful if cancelling a job would call Thread.interrupt() in order 
 to interrupt blocking in most situations, such as Object monitors or IO. The 
 one caveat is [HDFS-1208|https://issues.apache.org/jira/browse/HDFS-1208], 
 where HDFS's DFSClient will not only swallow InterruptedException but may 
 reinterpret them as IOException, causing HDFS to mark a node as permanently 
 failed. Thus, this feature must be optional and probably off by default.
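 To illustrate the mechanism being proposed, here is a minimal, hedged sketch in 
 plain Scala (not Spark code) of how Thread.interrupt() breaks a thread out of a 
 blocking call:
 {code}
 val worker = new Thread(new Runnable {
   def run(): Unit = {
     try {
       Thread.sleep(60000) // stands in for a long-blocking task
     } catch {
       case _: InterruptedException =>
         println("task interrupted, exiting cleanly")
     }
   }
 })
 worker.start()
 worker.interrupt() // what an optional interrupt-on-cancel behavior would trigger
 {code}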



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1590) Recommend to use FindBugs

2014-04-25 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1590.


Resolution: Not a Problem
  Assignee: Shixiong Zhu

[~zsxwing] I'm marking this as not an issue for now, but if you feel we should 
reconsider adding this to the build, please feel free to reopen it.

 Recommend to use FindBugs
 -

 Key: SPARK-1590
 URL: https://issues.apache.org/jira/browse/SPARK-1590
 Project: Spark
  Issue Type: Question
  Components: Build
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu
Priority: Minor
 Attachments: findbugs.png


 FindBugs is an open source program created by Bill Pugh and David Hovemeyer 
 which looks for bugs in Java code. It uses static analysis to identify 
 hundreds of different potential types of errors in Java programs.
 Although Spark is a Scala project, FindBugs is still helpful. For example, I 
 used it to find SPARK-1583 and SPARK-1589. However, the disadvantage is that 
 the report generated by FindBugs usually contains many false alarms for a 
 Scala project.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1590) Recommend to use FindBugs

2014-04-25 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13980735#comment-13980735
 ] 

Patrick Wendell commented on SPARK-1590:


I agree with this course of action. [~zsxwing], the issues you've been reporting 
are super helpful. I think the way to go here is to periodically look through 
these and contribute fixes.

[~srowen] are you doing any custom configuration in IntelliJ to enable these, or 
do you just go with the defaults?

 Recommend to use FindBugs
 -

 Key: SPARK-1590
 URL: https://issues.apache.org/jira/browse/SPARK-1590
 Project: Spark
  Issue Type: Question
  Components: Build
Reporter: Shixiong Zhu
Priority: Minor
 Attachments: findbugs.png


 FindBugs is an open source program created by Bill Pugh and David Hovemeyer 
 which looks for bugs in Java code. It uses static analysis to identify 
 hundreds of different potential types of errors in Java programs.
 Although Spark is a Scala project, FindBugs is still helpful. For example, I 
 used it to find SPARK-1583 and SPARK-1589. However, the disadvantage is that 
 the report generated by FindBugs usually contains many false alarms for a 
 Scala project.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1604) Couldn't run spark-submit with yarn cluster mode when using deps jar

2014-04-25 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13980740#comment-13980740
 ] 

Patrick Wendell commented on SPARK-1604:


I'm not sure why you're including both assemble-deps and the examples jar. The 
examples jar includes all of Spark and its dependencies.

I've noted here that we should probably mark Spark as provided in the examples 
jar so it doesn't embed Spark: https://issues.apache.org/jira/browse/SPARK-1565

 Couldn't run spark-submit with yarn cluster mode when using deps jar
 

 Key: SPARK-1604
 URL: https://issues.apache.org/jira/browse/SPARK-1604
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.0
Reporter: Kan Zhang

 SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.3.0-deps.jar
  ./bin/spark-submit 
 ./examples/target/scala-2.10/spark-examples_2.10-1.0.0-SNAPSHOT.jar --master 
 yarn --deploy-mode cluster --class org.apache.spark.examples.sql.JavaSparkSQL 
 Exception in thread "main" java.lang.ClassNotFoundException: 
 org.apache.spark.deploy.yarn.Client
   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
   at java.lang.Class.forName0(Native Method)
   at java.lang.Class.forName(Class.java:270)
   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:234)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:47)
   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1607) Remove use of octal literals, deprecated in Scala 2.10 / removed in 2.11

2014-04-25 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-1607.
--

   Resolution: Fixed
Fix Version/s: 1.0.0

 Remove use of octal literals, deprecated in Scala 2.10 / removed in 2.11
 

 Key: SPARK-1607
 URL: https://issues.apache.org/jira/browse/SPARK-1607
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 0.9.1
Reporter: Sean Owen
Priority: Minor
  Labels: literal, octal, scala, yarn
 Fix For: 1.0.0


 Octal literals like 0700 are deprecated in Scala 2.10, generating a 
 warning. They have been removed entirely in 2.11. See 
 https://issues.scala-lang.org/browse/SI-7618
 This change simply replaces two uses of octal literals with hex literals, which 
 seemed the next-best representation since they express a bit mask (file 
 permissions in particular).
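 A hedged illustration of the substitution (the constant name here is made up, 
 not the actual YARN code):
 {code}
 // Deprecated in Scala 2.10 and removed in 2.11:
 //   val stagingDirPermission = 0700
 // The same permission bit mask written as a hex literal (octal 0700 == 0x1C0):
 val stagingDirPermission = 0x1c0
 {code}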



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1611) Incorrect initialization order in AppendOnlyMap

2014-04-25 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-1611.
-

   Resolution: Fixed
Fix Version/s: 1.0.0

 Incorrect initialization order in AppendOnlyMap
 ---

 Key: SPARK-1611
 URL: https://issues.apache.org/jira/browse/SPARK-1611
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu
Priority: Minor
  Labels: easyfix
 Fix For: 1.0.0


 The initialization order of growThreshold and LOAD_FACTOR is incorrect. 
 growThreshold will be initialized to 0.
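 A hedged, simplified illustration of the initialization-order pitfall (field 
 names trimmed; not the actual AppendOnlyMap source):
 {code}
 class Example(initialCapacity: Int = 64) {
   // This initializer runs before LOAD_FACTOR below is assigned, so it still
   // sees the default value 0.0 and growThreshold becomes 0.
   private var growThreshold = (LOAD_FACTOR * initialCapacity).toInt
   private val LOAD_FACTOR = 0.7
 }
 {code}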



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-993) Don't reuse Writable objects in SequenceFile by default

2014-04-25 Thread Arun Ramakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13980747#comment-13980747
 ] 

Arun Ramakrishnan commented on SPARK-993:
-

How does one reproduce this issue?

I tried a few things in the Spark shell locally:

{noformat}
import java.io.File
import com.google.common.io.Files
import org.apache.hadoop.io._
val tempDir = Files.createTempDir()
val outputDir = new File(tempDir, "output").getAbsolutePath
val num = 100
val nums = sc.makeRDD(1 to num).map(x => ("a" * x, x))
nums.saveAsSequenceFile(outputDir)

val output = sc.sequenceFile[String,Int](outputDir)
assert(output.collect().toSet.size == num)

val t = sc.sequenceFile(outputDir, classOf[Text], classOf[IntWritable])
assert( t.map { case (k,v) => (k.toString, v.get) }.collect().toSet.size == num )
{noformat}

But the asserts seem to be fine.
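For what it's worth, a hedged sketch of where the reuse would be expected to show 
up (the asserts above convert to immutable values first, which hides it):

{code}
// Collecting the raw Writable objects without copying them can expose the reuse,
// because Hadoop's record reader may hand back the same Text/IntWritable instances:
val raw = sc.sequenceFile(outputDir, classOf[Text], classOf[IntWritable]).collect()
// Converting to immutable values before collect() avoids the problem:
val safe = sc.sequenceFile(outputDir, classOf[Text], classOf[IntWritable])
  .map { case (k, v) => (k.toString, v.get) }
  .collect()
{code}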

 Don't reuse Writable objects in SequenceFile by default
 ---

 Key: SPARK-993
 URL: https://issues.apache.org/jira/browse/SPARK-993
 Project: Spark
  Issue Type: Improvement
Reporter: Matei Zaharia
  Labels: Starter

 Right now we reuse them as an optimization, which leads to weird results when 
 you call collect() on a file with distinct items. We should instead make that 
 behavior optional through a flag.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1199) Type mismatch in Spark shell when using case class defined in shell

2014-04-25 Thread Piotr Kołaczkowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13980756#comment-13980756
 ] 

Piotr Kołaczkowski edited comment on SPARK-1199 at 4/25/14 7:26 AM:


+1 to fixing this. We're affected as well. Classes defined in the shell are inner 
classes, and therefore cannot be easily instantiated by reflection. They need an 
additional reference to the outer object, which is non-trivial to obtain (is it 
obtainable at all without modifying Spark?). 

{noformat}
scala> class Test
defined class Test

scala> new Test
res5: Test = $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$Test@4f755864

// good, so there is a default constructor and we can call it through reflection?
// not so fast...
scala> classOf[Test].getConstructor()
java.lang.NoSuchMethodException: 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$Test.<init>()
...

scala> classOf[Test].getConstructors()(0)
res7: java.lang.reflect.Constructor[_] = public 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$Test($iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
   
{noformat}

The workaround does not work for us.



was (Author: pkolaczk):
+1 to fixing this. We're affected as well. Classes defined in the shell are inner 
classes, and therefore cannot be easily instantiated by reflection. They need an 
additional reference to the outer object, which is non-trivial to obtain (is it 
obtainable at all without modifying Spark?). 

{noformat}
scala> class Test
defined class Test

scala> new Test
res5: Test = $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$Test@4f755864

// good, so there is a default constructor and we can call it through reflection?
// not so fast...
scala> classOf[Test].getConstructor()
java.lang.NoSuchMethodException: 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$Test.<init>()
...

scala> classOf[Test].getConstructors()(0)
res7: java.lang.reflect.Constructor[_] = public 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$Test($iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
   
{noformat}


 Type mismatch in Spark shell when using case class defined in shell
 ---

 Key: SPARK-1199
 URL: https://issues.apache.org/jira/browse/SPARK-1199
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Andrew Kerr
Priority: Critical
 Fix For: 1.1.0


 Define a class in the shell:
 {code}
 case class TestClass(a:String)
 {code}
 and an RDD
 {code}
 val data = sc.parallelize(Seq("a")).map(TestClass(_))
 {code}
 define a function on it and map over the RDD
 {code}
 def itemFunc(a:TestClass):TestClass = a
 data.map(itemFunc)
 {code}
 Error:
 {code}
 <console>:19: error: type mismatch;
  found   : TestClass => TestClass
  required: TestClass => ?
   data.map(itemFunc)
 {code}
 Similarly with a mapPartitions:
 {code}
 def partitionFunc(a:Iterator[TestClass]):Iterator[TestClass] = a
 data.mapPartitions(partitionFunc)
 {code}
 {code}
 <console>:19: error: type mismatch;
  found   : Iterator[TestClass] => Iterator[TestClass]
  required: Iterator[TestClass] => Iterator[?]
 Error occurred in an application involving default arguments.
   data.mapPartitions(partitionFunc)
 {code}
 The behavior is the same whether in local mode or on a cluster.
 This isn't specific to RDDs. A Scala collection in the Spark shell has the 
 same problem.
 {code}
 scala> Seq(TestClass("foo")).map(itemFunc)
 <console>:15: error: type mismatch;
  found   : TestClass => TestClass
  required: TestClass => ?
   Seq(TestClass("foo")).map(itemFunc)
 ^
 {code}
 When run in the Scala console (not the Spark shell) there are no type 
 mismatch errors.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1628) Missing hashCode methods in Partitioner subclasses

2014-04-25 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-1628:
---

 Summary: Missing hashCode methods in Partitioner subclasses
 Key: SPARK-1628
 URL: https://issues.apache.org/jira/browse/SPARK-1628
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu
Priority: Minor


`hashCode` is not overridden in HashPartitioner, RangePartitioner, 
PythonPartitioner, or PageRankUtils.CustomPartitioner. We should override 
hashCode() whenever equals() is overridden.
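A hedged sketch of the contract at issue (a made-up partitioner, not one of the 
Spark classes named above): if equals() is overridden, hashCode() should be 
overridden consistently.

{code}
import org.apache.spark.Partitioner

class ModPartitioner(val partitions: Int) extends Partitioner {
  def numPartitions: Int = partitions
  def getPartition(key: Any): Int = {
    val mod = key.hashCode % partitions
    if (mod < 0) mod + partitions else mod // keep the partition index non-negative
  }
  override def equals(other: Any): Boolean = other match {
    case p: ModPartitioner => p.partitions == partitions
    case _ => false
  }
  // Without this, two "equal" partitioners could report different hash codes.
  override def hashCode: Int = partitions
}
{code}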



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-1597) Add a version of reduceByKey that takes the Partitioner as a second argument

2014-04-25 Thread Sandeep Singh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh reassigned SPARK-1597:


Assignee: Sandeep Singh

 Add a version of reduceByKey that takes the Partitioner as a second argument
 

 Key: SPARK-1597
 URL: https://issues.apache.org/jira/browse/SPARK-1597
 Project: Spark
  Issue Type: Bug
Reporter: Matei Zaharia
Assignee: Sandeep Singh
Priority: Blocker

 Most of our shuffle methods can take a Partitioner or a number of partitions 
 as a second argument, but for some reason reduceByKey takes the Partitioner 
 as a *first* argument: 
 http://spark.apache.org/docs/0.9.1/api/core/#org.apache.spark.rdd.PairRDDFunctions.
  We should deprecate that version and add one where the Partitioner is the 
 second argument.
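 A hedged sketch of what the added overload might look like (a hypothetical 
 enrichment for illustration only, not the proposed patch itself):
 {code}
 import scala.reflect.ClassTag
 import org.apache.spark.Partitioner
 import org.apache.spark.rdd.RDD
 import org.apache.spark.SparkContext._

 // Adds a Partitioner-as-second-argument variant by delegating to the
 // existing Partitioner-first overload.
 implicit class ReduceByKeyOps[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]) {
   def reduceByKeyWith(func: (V, V) => V, partitioner: Partitioner): RDD[(K, V)] =
     rdd.reduceByKey(partitioner, func)
 }
 {code}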



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1604) Couldn't run spark-submit with yarn cluster mode when built with assemble-deps

2014-04-25 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-1604:
-

Summary: Couldn't run spark-submit with yarn cluster mode when built with 
assemble-deps  (was: Couldn't run spark-submit with yarn cluster mode when 
built with assemble-ceps)

 Couldn't run spark-submit with yarn cluster mode when built with assemble-deps
 --

 Key: SPARK-1604
 URL: https://issues.apache.org/jira/browse/SPARK-1604
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.0
Reporter: Kan Zhang

 SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.3.0-deps.jar
  ./bin/spark-submit 
 ./examples/target/scala-2.10/spark-examples_2.10-1.0.0-SNAPSHOT.jar --master 
 yarn --deploy-mode cluster --class org.apache.spark.examples.sql.JavaSparkSQL 
 Exception in thread "main" java.lang.ClassNotFoundException: 
 org.apache.spark.deploy.yarn.Client
   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
   at java.lang.Class.forName0(Native Method)
   at java.lang.Class.forName(Class.java:270)
   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:234)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:47)
   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1604) Couldn't run spark-submit with yarn cluster mode when built with assemble-deps

2014-04-25 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-1604:
-

Description: 
{code}
SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.3.0-deps.jar
 ./bin/spark-submit 
./examples/target/scala-2.10/spark-examples_2.10-1.0.0-SNAPSHOT.jar --master 
yarn --deploy-mode cluster --class org.apache.spark.examples.sql.JavaSparkSQL 
Exception in thread "main" java.lang.ClassNotFoundException: 
org.apache.spark.deploy.yarn.Client
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:234)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:47)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}

  was:
SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.3.0-deps.jar
 ./bin/spark-submit 
./examples/target/scala-2.10/spark-examples_2.10-1.0.0-SNAPSHOT.jar --master 
yarn --deploy-mode cluster --class org.apache.spark.examples.sql.JavaSparkSQL 
Exception in thread "main" java.lang.ClassNotFoundException: 
org.apache.spark.deploy.yarn.Client
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:234)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:47)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


 Couldn't run spark-submit with yarn cluster mode when built with assemble-deps
 --

 Key: SPARK-1604
 URL: https://issues.apache.org/jira/browse/SPARK-1604
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.0
Reporter: Kan Zhang

 {code}
 SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.3.0-deps.jar
  ./bin/spark-submit 
 ./examples/target/scala-2.10/spark-examples_2.10-1.0.0-SNAPSHOT.jar --master 
 yarn --deploy-mode cluster --class org.apache.spark.examples.sql.JavaSparkSQL 
 Exception in thread "main" java.lang.ClassNotFoundException: 
 org.apache.spark.deploy.yarn.Client
   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
   at java.lang.Class.forName0(Native Method)
   at java.lang.Class.forName(Class.java:270)
   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:234)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:47)
   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1629) Spark Core missing commons-lang dependence

2014-04-25 Thread witgo (JIRA)
witgo created SPARK-1629:


 Summary:  Spark Core missing commons-lang dependence
 Key: SPARK-1629
 URL: https://issues.apache.org/jira/browse/SPARK-1629
 Project: Spark
  Issue Type: Bug
Reporter: witgo






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1629) Spark Core missing commons-lang dependence

2014-04-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13980878#comment-13980878
 ] 

Sean Owen commented on SPARK-1629:
--

I don't see any usage of Commons Lang in the whole project?
Tachyon uses commons-lang3 but it also brings it in as a dependency.

  Spark Core missing commons-lang dependence
 ---

 Key: SPARK-1629
 URL: https://issues.apache.org/jira/browse/SPARK-1629
 Project: Spark
  Issue Type: Bug
Reporter: witgo





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1629) Spark Core missing commons-lang dependence

2014-04-25 Thread witgo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13980886#comment-13980886
 ] 

witgo edited comment on SPARK-1629 at 4/25/14 11:06 AM:


Hi Sean Owen, see 
[Utils.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L33]


was (Author: witgo):
Hi Sean Owen see 
[Utils.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L33]

  Spark Core missing commons-lang dependence
 ---

 Key: SPARK-1629
 URL: https://issues.apache.org/jira/browse/SPARK-1629
 Project: Spark
  Issue Type: Bug
Reporter: witgo





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1478) Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915

2014-04-25 Thread Ted Malaska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13980962#comment-13980962
 ] 

Ted Malaska commented on SPARK-1478:


SPARK-1584 is done and so is PR #300. So finally we are ready for this JIRA. I 
will start development today.

 Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915
 ---

 Key: SPARK-1478
 URL: https://issues.apache.org/jira/browse/SPARK-1478
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Ted Malaska
Assignee: Ted Malaska
Priority: Minor

 FLUME-1915 added support for compression over the wire from Avro sink to Avro 
 source. I would like to add this functionality to the FlumeReceiver.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1467) Make StorageLevel.apply() factory methods experimental

2014-04-25 Thread Sandeep Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13981031#comment-13981031
 ] 

Sandeep Singh commented on SPARK-1467:
--

https://github.com/apache/spark/pull/551

 Make StorageLevel.apply() factory methods experimental
 --

 Key: SPARK-1467
 URL: https://issues.apache.org/jira/browse/SPARK-1467
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Reporter: Matei Zaharia
Assignee: Sandeep Singh
 Fix For: 1.0.0


 We may want to evolve these in the future to add things like SSDs, so let's 
 mark them as experimental for now. Long-term the right solution might be some 
 kind of builder. The stable API should be the existing StorageLevel constants.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1199) Type mismatch in Spark shell when using case class defined in shell

2014-04-25 Thread Andrew Kerr (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13981187#comment-13981187
 ] 

Andrew Kerr commented on SPARK-1199:


I have something of a workaround:

{code}
object MyTypes {
  case class TestClass(a:Int)
}

object MyLogic {
  import MyTypes._
  def fn(b:TestClass) = TestClass(b.a * 2)
  val result = Seq(TestClass(1)).map(fn)
}

MyLogic.result
// Seq{MyTypes.TestClass] = List(TestClass(2))
{code}

Still can't access TestClass outside an object.

 Type mismatch in Spark shell when using case class defined in shell
 ---

 Key: SPARK-1199
 URL: https://issues.apache.org/jira/browse/SPARK-1199
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Andrew Kerr
Priority: Critical
 Fix For: 1.1.0


 Define a class in the shell:
 {code}
 case class TestClass(a:String)
 {code}
 and an RDD
 {code}
 val data = sc.parallelize(Seq("a")).map(TestClass(_))
 {code}
 define a function on it and map over the RDD
 {code}
 def itemFunc(a:TestClass):TestClass = a
 data.map(itemFunc)
 {code}
 Error:
 {code}
 <console>:19: error: type mismatch;
  found   : TestClass => TestClass
  required: TestClass => ?
   data.map(itemFunc)
 {code}
 Similarly with a mapPartitions:
 {code}
 def partitionFunc(a:Iterator[TestClass]):Iterator[TestClass] = a
 data.mapPartitions(partitionFunc)
 {code}
 {code}
 <console>:19: error: type mismatch;
  found   : Iterator[TestClass] => Iterator[TestClass]
  required: Iterator[TestClass] => Iterator[?]
 Error occurred in an application involving default arguments.
   data.mapPartitions(partitionFunc)
 {code}
 The behavior is the same whether in local mode or on a cluster.
 This isn't specific to RDDs. A Scala collection in the Spark shell has the 
 same problem.
 {code}
 scala> Seq(TestClass("foo")).map(itemFunc)
 <console>:15: error: type mismatch;
  found   : TestClass => TestClass
  required: TestClass => ?
   Seq(TestClass("foo")).map(itemFunc)
 ^
 {code}
 When run in the Scala console (not the Spark shell) there are no type 
 mismatch errors.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1199) Type mismatch in Spark shell when using case class defined in shell

2014-04-25 Thread Andrew Kerr (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13981187#comment-13981187
 ] 

Andrew Kerr edited comment on SPARK-1199 at 4/25/14 4:44 PM:
-

I have something of a workaround:

{code}
object MyTypes {
  case class TestClass(a:Int)
}

object MyLogic {
  import MyTypes._
  def fn(b:TestClass) = TestClass(b.a * 2)
  val result = Seq(TestClass(1)).map(fn)
}

MyLogic.result
// Seq[MyTypes.TestClass] = List(TestClass(2))
{code}

Still can't access TestClass outside an object.


was (Author: andrewkerr):
I have something of a workaround:

{code}
object MyTypes {
  case class TestClass(a:Int)
}

object MyLogic {
  import MyTypes._
  def fn(b:TestClass) = TestClass(b.a * 2)
  val result = Seq(TestClass(1)).map(fn)
}

MyLogic.result
// Seq{MyTypes.TestClass] = List(TestClass(2))
{code}

Still can't access TestClass outside an object.

 Type mismatch in Spark shell when using case class defined in shell
 ---

 Key: SPARK-1199
 URL: https://issues.apache.org/jira/browse/SPARK-1199
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Andrew Kerr
Priority: Critical
 Fix For: 1.1.0


 Define a class in the shell:
 {code}
 case class TestClass(a:String)
 {code}
 and an RDD
 {code}
 val data = sc.parallelize(Seq("a")).map(TestClass(_))
 {code}
 define a function on it and map over the RDD
 {code}
 def itemFunc(a:TestClass):TestClass = a
 data.map(itemFunc)
 {code}
 Error:
 {code}
 <console>:19: error: type mismatch;
  found   : TestClass => TestClass
  required: TestClass => ?
   data.map(itemFunc)
 {code}
 Similarly with a mapPartitions:
 {code}
 def partitionFunc(a:Iterator[TestClass]):Iterator[TestClass] = a
 data.mapPartitions(partitionFunc)
 {code}
 {code}
 <console>:19: error: type mismatch;
  found   : Iterator[TestClass] => Iterator[TestClass]
  required: Iterator[TestClass] => Iterator[?]
 Error occurred in an application involving default arguments.
   data.mapPartitions(partitionFunc)
 {code}
 The behavior is the same whether in local mode or on a cluster.
 This isn't specific to RDDs. A Scala collection in the Spark shell has the 
 same problem.
 {code}
 scala> Seq(TestClass("foo")).map(itemFunc)
 <console>:15: error: type mismatch;
  found   : TestClass => TestClass
  required: TestClass => ?
   Seq(TestClass("foo")).map(itemFunc)
 ^
 {code}
 When run in the Scala console (not the Spark shell) there are no type 
 mismatch errors.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1630) PythonRDDs don't handle nulls gracefully

2014-04-25 Thread Kalpit Shah (JIRA)
Kalpit Shah created SPARK-1630:
--

 Summary: PythonRDDs don't handle nulls gracefully
 Key: SPARK-1630
 URL: https://issues.apache.org/jira/browse/SPARK-1630
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 0.9.0, 0.9.1
Reporter: Kalpit Shah
 Fix For: 1.0.0


If PythonRDDs receive a null element in an iterator, they currently throw a 
NullPointerException. It would be better to log a DEBUG message and skip the 
write of null elements.

Here are the two stack traces:

14/04/22 03:44:19 ERROR executor.Executor: Uncaught exception in thread 
Thread[stdin writer for python,5,main]
java.lang.NullPointerException
  at 
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:267)
  at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:88)

-

Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.writeToFile.
: java.lang.NullPointerException
  at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:273)
  at 
org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:247)
  at 
org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:246)
  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
  at 
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:246)
  at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:285)
  at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:280)
  at org.apache.spark.api.python.PythonRDD.writeToFile(PythonRDD.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
  at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
  at py4j.Gateway.invoke(Gateway.java:259)
  at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
  at py4j.commands.CallCommand.execute(CallCommand.java:79)
  at py4j.GatewayConnection.run(GatewayConnection.java:207)
  at java.lang.Thread.run(Thread.java:744)  
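
A minimal, self-contained sketch of the proposed handling (simplified types and a 
hypothetical helper; not the actual PythonRDD code):

{code}
import java.io.DataOutputStream

def writeIteratorToStream(iter: Iterator[Array[Byte]], dataOut: DataOutputStream): Unit = {
  iter.foreach {
    case null =>
      // proposed behavior: log at DEBUG and skip, instead of throwing an NPE
      println("DEBUG: skipping null element")
    case bytes =>
      dataOut.writeInt(bytes.length)
      dataOut.write(bytes)
  }
}
{code}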







--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1576) Passing of JAVA_OPTS to YARN on command line

2014-04-25 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13981313#comment-13981313
 ] 

Mridul Muralidharan commented on SPARK-1576:


There is a misunderstanding here - it is about passing SPARK_JAVA_OPTS, not 
JAVA_OPTS.
Directly passing JAVA_OPTS has been removed.

 Passing of JAVA_OPTS to YARN on command line
 

 Key: SPARK-1576
 URL: https://issues.apache.org/jira/browse/SPARK-1576
 Project: Spark
  Issue Type: Improvement
Affects Versions: 0.9.0, 1.0.0, 0.9.1
Reporter: Nishkam Ravi
 Fix For: 0.9.0, 1.0.0, 0.9.1

 Attachments: SPARK-1576.patch


 JAVA_OPTS can be passed either via env variables (i.e., SPARK_JAVA_OPTS) or as 
 config vars (after Patrick's recent change). It would be good to also let the 
 user pass them on the command line, to restrict their scope to a single 
 application invocation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1586) Fix issues with spark development under windows

2014-04-25 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13981321#comment-13981321
 ] 

Mridul Muralidharan commented on SPARK-1586:


Immediate issues are fixed, though there are more Hive tests failing due to 
path-related issues. PR: https://github.com/apache/spark/pull/505

 Fix issues with spark development under windows
 ---

 Key: SPARK-1586
 URL: https://issues.apache.org/jira/browse/SPARK-1586
 Project: Spark
  Issue Type: Bug
Reporter: Mridul Muralidharan
Assignee: Mridul Muralidharan
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1587) Fix thread leak in spark

2014-04-25 Thread Mridul Muralidharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-1587.


Resolution: Fixed

Fixed, https://github.com/apache/spark/pull/504

 Fix thread leak in spark
 

 Key: SPARK-1587
 URL: https://issues.apache.org/jira/browse/SPARK-1587
 Project: Spark
  Issue Type: Bug
Reporter: Mridul Muralidharan
Assignee: Mridul Muralidharan

 SparkContext.stop does not cause all threads to exit.
 When running tests via scalatest (which keeps reusing the same vm), over 
 time, this causes too many threads to be created causing tests to fail due to 
 inability to create more threads.
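 A hedged illustration of how the leak surfaces when the same VM is reused (not 
 an actual Spark test):
 {code}
 import org.apache.spark.SparkContext

 for (i <- 1 to 50) {
   val sc = new SparkContext("local", s"leak-check-$i")
   sc.stop()
 }
 // If stop() left threads behind, this count keeps climbing across iterations.
 println(s"live threads: ${Thread.activeCount()}")
 {code}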



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1626) Update Spark YARN docs to use spark-submit

2014-04-25 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13981325#comment-13981325
 ] 

Thomas Graves commented on SPARK-1626:
--

This is a dup of SPARK-1492.

 Update Spark YARN docs to use spark-submit
 --

 Key: SPARK-1626
 URL: https://issues.apache.org/jira/browse/SPARK-1626
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Blocker
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1631) App name set in SparkConf (not in JVM properties) not respected by Yarn backend

2014-04-25 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13981337#comment-13981337
 ] 

Marcelo Vanzin commented on SPARK-1631:
---

PR: https://github.com/apache/spark/pull/539

 App name set in SparkConf (not in JVM properties) not respected by Yarn 
 backend
 ---

 Key: SPARK-1631
 URL: https://issues.apache.org/jira/browse/SPARK-1631
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin

 When you submit an application that sets its name using a SparkContext 
 constructor or SparkConf.setAppName(), the Yarn app name is not set and the 
 app shows up as "Spark" in the RM UI.
 That's because YarnClientSchedulerBackend only looks at the system properties 
 for the app name, instead of looking at the app's config.
 e.g., app initializes like this:
 {code}
 val sc = new SparkContext(new SparkConf().setAppName("Blah"));
 {code}
 Start app like this:
 {noformat}
   ./bin/spark-submit --master yarn --deploy-mode client blah blah blah
 {noformat}
 And the app name in the RM UI does not reflect the code.
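 A hedged sketch of the direction of a fix (hypothetical lookup, not the actual 
 patch): consult the application's SparkConf before falling back to the JVM 
 system properties.
 {code}
 // Inside the YARN client backend (sketch); "conf" is assumed to be the app's SparkConf.
 val appName = conf.getOption("spark.app.name")
   .orElse(sys.props.get("spark.app.name"))
   .getOrElse("Spark")
 {code}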



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1621) Update Chill to 0.3.6

2014-04-25 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-1621.
--

Resolution: Fixed

 Update Chill to 0.3.6
 -

 Key: SPARK-1621
 URL: https://issues.apache.org/jira/browse/SPARK-1621
 Project: Spark
  Issue Type: Dependency upgrade
Reporter: Matei Zaharia
Assignee: Matei Zaharia
Priority: Minor
 Fix For: 1.0.0


 It registers more Scala classes, including things like Ranges that we had to 
 register manually before. See https://github.com/twitter/chill/releases for 
 Chill's change log.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1630) PythonRDDs don't handle nulls gracefully

2014-04-25 Thread Kalpit Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13981458#comment-13981458
 ] 

Kalpit Shah commented on SPARK-1630:


https://github.com/apache/spark/pull/554

 PythonRDDs don't handle nulls gracefully
 

 Key: SPARK-1630
 URL: https://issues.apache.org/jira/browse/SPARK-1630
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 0.9.0, 0.9.1
Reporter: Kalpit Shah
 Fix For: 1.0.0

   Original Estimate: 2h
  Remaining Estimate: 2h

 If PythonRDDs receive a null element in an iterator, they currently throw a 
 NullPointerException. It would be better to log a DEBUG message and skip the 
 write of null elements.
 Here are the two stack traces:
 14/04/22 03:44:19 ERROR executor.Executor: Uncaught exception in thread 
 Thread[stdin writer for python,5,main]
 java.lang.NullPointerException
   at 
 org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:267)
   at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:88)
 -
 Py4JJavaError: An error occurred while calling 
 z:org.apache.spark.api.python.PythonRDD.writeToFile.
 : java.lang.NullPointerException
   at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:273)
   at 
 org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:247)
   at 
 org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:246)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:246)
   at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:285)
   at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:280)
   at org.apache.spark.api.python.PythonRDD.writeToFile(PythonRDD.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
   at py4j.Gateway.invoke(Gateway.java:259)
   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   at py4j.commands.CallCommand.execute(CallCommand.java:79)
   at py4j.GatewayConnection.run(GatewayConnection.java:207)
   at java.lang.Thread.run(Thread.java:744)  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1576) Passing of JAVA_OPTS to YARN on command line

2014-04-25 Thread Nishkam Ravi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13981473#comment-13981473
 ] 

Nishkam Ravi commented on SPARK-1576:
-

We have been using SPARK_JAVA_OPTS and JAVA_OPTS interchangeably. 
SPARK_JAVA_OPTS are JAVA_OPTS :)

 Passing of JAVA_OPTS to YARN on command line
 

 Key: SPARK-1576
 URL: https://issues.apache.org/jira/browse/SPARK-1576
 Project: Spark
  Issue Type: Improvement
Affects Versions: 0.9.0, 1.0.0, 0.9.1
Reporter: Nishkam Ravi
 Fix For: 0.9.0, 1.0.0, 0.9.1

 Attachments: SPARK-1576.patch


 JAVA_OPTS can be passed either via env variables (i.e., SPARK_JAVA_OPTS) or as 
 config vars (after Patrick's recent change). It would be good to also let the 
 user pass them on the command line, to restrict their scope to a single 
 application invocation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Reopened] (SPARK-1395) Cannot launch jobs on Yarn cluster with local: scheme in SPARK_JAR

2014-04-25 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reopened SPARK-1395:
---


This is broken again.

 Cannot launch jobs on Yarn cluster with local: scheme in SPARK_JAR
 

 Key: SPARK-1395
 URL: https://issues.apache.org/jira/browse/SPARK-1395
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
 Fix For: 1.0.0


 If you define SPARK_JAR and friends to use local: URIs, you cannot submit a 
 job on a Yarn cluster. e.g., I have:
 SPARK_JAR=local:/tmp/spark-assembly-1.0.0-SNAPSHOT-hadoop2.3.0-cdh5.0.0.jar
 SPARK_YARN_APP_JAR=local:/tmp/spark-examples-assembly-1.0.0-SNAPSHOT.jar
 And running SparkPi using bin/run-example yields this:
 14/04/02 13:23:33 INFO yarn.Client: Preparing Local resources
 Exception in thread "main" java.io.IOException: No FileSystem for scheme: local
 at 
 org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2385)
 at 
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
 at 
 org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
 at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
 at 
 org.apache.spark.deploy.yarn.ClientBase$class.org$apache$spark$deploy$yarn$ClientBase$$copyRemoteFile(ClientBase.scala:156)
 at 
 org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$3.apply(ClientBase.scala:217)
 at 
 org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$3.apply(ClientBase.scala:212)
 at 
 scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
 at 
 scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
 at 
 scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
 at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
 at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
 at 
 org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:212)
 at 
 org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:41)
 at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:76)
 at 
 org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:81)
 at 
 org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:129)
 at org.apache.spark.SparkContext.<init>(SparkContext.scala:226)
 at org.apache.spark.SparkContext.<init>(SparkContext.scala:96)
 at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
 at org.apache.spark.examples.SparkPi.main(SparkPi.scala)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1633) Various examples for Scala and Java custom receiver, etc.

2014-04-25 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-1633:


 Summary: Various examples for Scala and Java custom receiver, etc. 
 Key: SPARK-1633
 URL: https://issues.apache.org/jira/browse/SPARK-1633
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1600) flaky test case in streaming.CheckpointSuite

2014-04-25 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13981552#comment-13981552
 ] 

Tathagata Das commented on SPARK-1600:
--

Will try to address this post 1.0 release

 flaky test case in streaming.CheckpointSuite
 

 Key: SPARK-1600
 URL: https://issues.apache.org/jira/browse/SPARK-1600
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.9.0, 1.0.0, 0.9.1
Reporter: Nan Zhu

 The test case "recovery with file input stream" sometimes fails when Jenkins is 
 very busy, even when the change under test is unrelated. 
 I have hit it 3 times, and I have also seen it in other places; 
 the latest example is in 
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14397/
 where the modification is just in YARN-related files.
 I once reported it on the dev mailing list: 
 http://apache-spark-developers-list.1001551.n3.nabble.com/a-weird-test-case-in-Streaming-td6116.html



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1409) Flaky Test: actor input stream test in org.apache.spark.streaming.InputStreamsSuite

2014-04-25 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13981555#comment-13981555
 ] 

Tathagata Das commented on SPARK-1409:
--

The test seems to run fine when launched from IntelliJ IDEA but not from sbt. I 
am not sure why yet. It has something to do with Akka's Props (actually the 
Typesafe Config it uses) not being serializable under certain conditions. I am 
afraid there may be two versions of Props/Config on the classpath (when running 
from sbt), though I haven't figured out how. The serialization error causes the 
test to fail every time.

Some change in the last couple of months resulted in this side effect. Comparing 
with the Spark 0.9 branch (where the tests run fine), there is no difference in 
the Akka / Typesafe Config versions. The only difference I saw is in the version 
of sbt - 0.12.1 for branch 0.9, 0.13.2 for master.

So, still no solution; bumping this to post-1.0.

 Flaky Test: actor input stream test in 
 org.apache.spark.streaming.InputStreamsSuite
 -

 Key: SPARK-1409
 URL: https://issues.apache.org/jira/browse/SPARK-1409
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Michael Armbrust
Assignee: Tathagata Das

 Here are just a few cases:
 https://travis-ci.org/apache/spark/jobs/22151827
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13709/



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Closed] (SPARK-1400) Spark Streaming's received data is not cleaned up from BlockManagers when not needed any more

2014-04-25 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das closed SPARK-1400.


Resolution: Duplicate

Duplicate https://issues.apache.org/jira/browse/SPARK-1592

 Spark Streaming's received data is not cleaned up from BlockManagers when not 
 needed any more
 -

 Key: SPARK-1400
 URL: https://issues.apache.org/jira/browse/SPARK-1400
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.9.0
Reporter: Tathagata Das

 Spark Streaming generates BlockRDDs with the data received over the network. 
 These data blocks are not automatically cleared; they are only evicted from 
 memory by LRU, which slows down processing.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1632) Avoid boxing in ExternalAppendOnlyMap compares

2014-04-25 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-1632:
--

Summary: Avoid boxing in ExternalAppendOnlyMap compares  (was: Avoid boxing 
in ExternalAppendOnlyMap.KCComparator)

 Avoid boxing in ExternalAppendOnlyMap compares
 --

 Key: SPARK-1632
 URL: https://issues.apache.org/jira/browse/SPARK-1632
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza

 Hitting an OOME in ExternalAppendOnlyMap.KCComparator while boxing an int.  I 
 don't know if this is the root cause, but the boxing is also avoidable.
 Code:
 {code}
 def compare(kc1: (K, C), kc2: (K, C)): Int = {
   kc1._1.hashCode().compareTo(kc2._1.hashCode())
 }
 {code}
 Error:
 {code}
 java.lang.OutOfMemoryError: GC overhead limit exceeded
  at java.lang.Integer.valueOf(Integer.java:642)
  at scala.Predef$.int2Integer(Predef.scala:370)
  at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap$KCComparator.compare(ExternalAppendOnlyMap.scala:432)
  at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap$KCComparator.compare(ExternalAppendOnlyMap.scala:430)
  at 
 org.apache.spark.util.collection.AppendOnlyMap$$anon$3.compare(AppendOnlyMap.scala:271)
  at java.util.TimSort.mergeLo(TimSort.java:687)
  at java.util.TimSort.mergeAt(TimSort.java:483)
  at java.util.TimSort.mergeCollapse(TimSort.java:410)
  at java.util.TimSort.sort(TimSort.java:214)
  at java.util.Arrays.sort(Arrays.java:727)
  at 
 org.apache.spark.util.collection.AppendOnlyMap.destructiveSortedIterator(AppendOnlyMap.scala:274)
  at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:188)
  at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insert(ExternalAppendOnlyMap.scala:141)
  at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:59)
  at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
  at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
  at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
  at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
  at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
  at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
  at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
  at org.apache.spark.scheduler.Task.run(Task.scala:53)
  at 
 org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
  at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
  at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
  at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
 {code}
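 A hedged sketch of a boxing-free version of that compare (not necessarily the 
 merged fix):
 {code}
 def compare(kc1: (K, C), kc2: (K, C)): Int = {
   val hash1 = kc1._1.hashCode()
   val hash2 = kc2._1.hashCode()
   // Compare the primitive ints directly; hashCode().compareTo(...) boxes both sides.
   if (hash1 < hash2) -1 else if (hash1 == hash2) 0 else 1
 }
 {code}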



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1634) Java API docs contain test cases

2014-04-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1634:
-

Summary: Java API docs contain test cases  (was: JavaDoc contains test 
cases)

 Java API docs contain test cases
 

 Key: SPARK-1634
 URL: https://issues.apache.org/jira/browse/SPARK-1634
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Priority: Blocker

 The generated Java API docs contains all test cases.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1634) Java API docs contain test cases

2014-04-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1634:
-

Description: The generated Java API docs contain all test cases.  (was: The 
generated Java API docs contains all test cases.)

 Java API docs contain test cases
 

 Key: SPARK-1634
 URL: https://issues.apache.org/jira/browse/SPARK-1634
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Priority: Blocker

 The generated Java API docs contain all test cases.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1635) Java API docs do not show annotation.

2014-04-25 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-1635:


 Summary: Java API docs do not show annotation.
 Key: SPARK-1635
 URL: https://issues.apache.org/jira/browse/SPARK-1635
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Xiangrui Meng


The generated Java API docs do not show the Developer/Experimental annotations, 
although the raw ":: Developer/Experimental ::" tag does appear in the generated doc.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1478) Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915

2014-04-25 Thread Ted Malaska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13981583#comment-13981583
 ] 

Ted Malaska commented on SPARK-1478:


I re-cloned the code and ran a test. There is a bug with the current GitHub 
branch. I'm looking into it now.

 Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915
 ---

 Key: SPARK-1478
 URL: https://issues.apache.org/jira/browse/SPARK-1478
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Ted Malaska
Assignee: Ted Malaska
Priority: Minor

 Flume-1915 added support for compression over the wire from avro sink to avro 
 source.  I would like to add this functionality to the FlumeReceiver.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Closed] (SPARK-703) KafkaWordCount example crashes with java.lang.ArrayIndexOutOfBoundsException in CheckpointRDD.scala

2014-04-25 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das closed SPARK-703.
---

Resolution: Not a Problem

 KafkaWordCount example crashes with java.lang.ArrayIndexOutOfBoundsException 
 in CheckpointRDD.scala
 ---

 Key: SPARK-703
 URL: https://issues.apache.org/jira/browse/SPARK-703
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.7.0
Reporter: Craig A. Vanderborgh

 This is a bad Spark Streaming bug.  The KafkaWordCount example can be used to 
 demonstrate the problem.  After a few iterations (batches), the test crashes 
 with this stack trace during the checkpointing attempt:
 3/02/22 15:26:54 INFO streaming.JobManager: Total delay: 0.02100 s for job 12 
 (execution: 0.01300 s)
 13/02/22 15:26:54 INFO rdd.CoGroupedRDD: Adding one-to-one dependency with 
 MappedValuesRDD[87] at apply at TraversableLike.scala:239
 13/02/22 15:26:54 INFO rdd.CoGroupedRDD: Adding one-to-one dependency with 
 MapPartitionsRDD[56] at apply at TraversableLike.scala:239
 13/02/22 15:26:54 INFO rdd.CoGroupedRDD: Adding one-to-one dependency with 
 MapPartitionsRDD[99] at apply at TraversableLike.scala:239
 13/02/22 15:26:54 ERROR streaming.JobManager: Running streaming job 13 @ 
 1361572014000 ms failed
 java.lang.ArrayIndexOutOfBoundsException: 0
 at spark.rdd.CheckpointRDD.getPartitions(CheckpointRDD.scala:27)
 at spark.RDD.partitions(RDD.scala:166)
 at spark.RDD.partitions(RDD.scala:166)
 at 
 spark.rdd.CoGroupedRDD$$anonfun$getPartitions$1$$anonfun$apply$mcVI$sp$1.apply(CoGroupedRDD.scala:71)
 at 
 spark.rdd.CoGroupedRDD$$anonfun$getPartitions$1$$anonfun$apply$mcVI$sp$1.apply(CoGroupedRDD.scala:65)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
 at scala.collection.mutable.ArrayBuffer.map(ArrayBuffer.scala:47)
 at spark.rdd.CoGroupedRDD.getPartitions(CoGroupedRDD.scala:63)
 at spark.RDD.partitions(RDD.scala:166)
 at spark.MappedValuesRDD.getPartitions(PairRDDFunctions.scala:655)
 at spark.RDD.partitions(RDD.scala:166)
 at 
 spark.rdd.CoGroupedRDD$$anonfun$getPartitions$1$$anonfun$apply$mcVI$sp$1.apply(CoGroupedRDD.scala:71)
 at 
 spark.rdd.CoGroupedRDD$$anonfun$getPartitions$1$$anonfun$apply$mcVI$sp$1.apply(CoGroupedRDD.scala:65)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
 at scala.collection.mutable.ArrayBuffer.map(ArrayBuffer.scala:47)
 at spark.rdd.CoGroupedRDD.getPartitions(CoGroupedRDD.scala:63)
 at spark.RDD.partitions(RDD.scala:166)
 at spark.MappedValuesRDD.getPartitions(PairRDDFunctions.scala:655)
 at spark.RDD.partitions(RDD.scala:166)
 at spark.RDD.take(RDD.scala:550)
 at 
 spark.streaming.DStream$$anonfun$foreachFunc$2$1.apply(DStream.scala:522)
 at 
 spark.streaming.DStream$$anonfun$foreachFunc$2$1.apply(DStream.scala:521)
 at 
 spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:22)
 at 
 spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:21)
 at 
 spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:21)
 at spark.streaming.Job.run(Job.scala:10)
 at spark.streaming.JobManager$JobHandler.run(JobManager.scala:15)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 The only way I can get this test to work on a cluster is to disable 
 checkpointing and to use reduceByKey() instead of reduceByKeyAndWindow().  
 Also the test works when run using local as the master.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-703) KafkaWordCount example crashes with java.lang.ArrayIndexOutOfBoundsException in CheckpointRDD.scala

2014-04-25 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13981581#comment-13981581
 ] 

Tathagata Das commented on SPARK-703:
-

Yes, that observation is correct. This is something we have to make clearer 
in the documentation. Closing this JIRA for now.
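
For future readers, a minimal sketch of the setup the documentation could call out: window 
operations such as reduceByKeyAndWindow() require a checkpoint directory, and on a cluster 
that directory must be visible to every worker (for example on HDFS), not a path on the 
driver's local disk. This is illustrative only; the application name, host, port, and HDFS 
path below are placeholders, and it assumes the crash stems from a checkpoint directory 
that the workers cannot read.

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._  // pair DStream operations

val conf = new SparkConf().setAppName("WindowedWordCount")
val ssc = new StreamingContext(conf, Seconds(2))

// Window operations checkpoint intermediate RDDs; on a cluster this path must
// be readable by all workers (placeholder HDFS location).
ssc.checkpoint("hdfs://namenode:8020/user/spark/checkpoints")

val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
val counts = words.map(word => (word, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
counts.print()

ssc.start()
ssc.awaitTermination()
{code}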

 KafkaWordCount example crashes with java.lang.ArrayIndexOutOfBoundsException 
 in CheckpointRDD.scala
 ---

 Key: SPARK-703
 URL: https://issues.apache.org/jira/browse/SPARK-703
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.7.0
Reporter: Craig A. Vanderborgh

 This is a bad Spark Streaming bug.  The KafkaWordCount example can be used to 
 demonstrate the problem.  After a few iterations (batches), the test crashes 
 with this stack trace during the checkpointing attempt:
 3/02/22 15:26:54 INFO streaming.JobManager: Total delay: 0.02100 s for job 12 
 (execution: 0.01300 s)
 13/02/22 15:26:54 INFO rdd.CoGroupedRDD: Adding one-to-one dependency with 
 MappedValuesRDD[87] at apply at TraversableLike.scala:239
 13/02/22 15:26:54 INFO rdd.CoGroupedRDD: Adding one-to-one dependency with 
 MapPartitionsRDD[56] at apply at TraversableLike.scala:239
 13/02/22 15:26:54 INFO rdd.CoGroupedRDD: Adding one-to-one dependency with 
 MapPartitionsRDD[99] at apply at TraversableLike.scala:239
 13/02/22 15:26:54 ERROR streaming.JobManager: Running streaming job 13 @ 
 1361572014000 ms failed
 java.lang.ArrayIndexOutOfBoundsException: 0
 at spark.rdd.CheckpointRDD.getPartitions(CheckpointRDD.scala:27)
 at spark.RDD.partitions(RDD.scala:166)
 at spark.RDD.partitions(RDD.scala:166)
 at 
 spark.rdd.CoGroupedRDD$$anonfun$getPartitions$1$$anonfun$apply$mcVI$sp$1.apply(CoGroupedRDD.scala:71)
 at 
 spark.rdd.CoGroupedRDD$$anonfun$getPartitions$1$$anonfun$apply$mcVI$sp$1.apply(CoGroupedRDD.scala:65)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
 at scala.collection.mutable.ArrayBuffer.map(ArrayBuffer.scala:47)
 at spark.rdd.CoGroupedRDD.getPartitions(CoGroupedRDD.scala:63)
 at spark.RDD.partitions(RDD.scala:166)
 at spark.MappedValuesRDD.getPartitions(PairRDDFunctions.scala:655)
 at spark.RDD.partitions(RDD.scala:166)
 at 
 spark.rdd.CoGroupedRDD$$anonfun$getPartitions$1$$anonfun$apply$mcVI$sp$1.apply(CoGroupedRDD.scala:71)
 at 
 spark.rdd.CoGroupedRDD$$anonfun$getPartitions$1$$anonfun$apply$mcVI$sp$1.apply(CoGroupedRDD.scala:65)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
 at scala.collection.mutable.ArrayBuffer.map(ArrayBuffer.scala:47)
 at spark.rdd.CoGroupedRDD.getPartitions(CoGroupedRDD.scala:63)
 at spark.RDD.partitions(RDD.scala:166)
 at spark.MappedValuesRDD.getPartitions(PairRDDFunctions.scala:655)
 at spark.RDD.partitions(RDD.scala:166)
 at spark.RDD.take(RDD.scala:550)
 at 
 spark.streaming.DStream$$anonfun$foreachFunc$2$1.apply(DStream.scala:522)
 at 
 spark.streaming.DStream$$anonfun$foreachFunc$2$1.apply(DStream.scala:521)
 at 
 spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:22)
 at 
 spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:21)
 at 
 spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:21)
 at spark.streaming.Job.run(Job.scala:10)
 at spark.streaming.JobManager$JobHandler.run(JobManager.scala:15)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 The only way I can get this test to work on a cluster is to disable 
 checkpointing and to use reduceByKey() instead of reduceByKeyAndWindow().  
 Also the test works when run using local as the master.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-1478) Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915

2014-04-25 Thread Ted Malaska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13981601#comment-13981601
 ] 

Ted Malaska commented on SPARK-1478:


Never mind, it is working now.

 Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915
 ---

 Key: SPARK-1478
 URL: https://issues.apache.org/jira/browse/SPARK-1478
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Ted Malaska
Assignee: Ted Malaska
Priority: Minor

 Flume-1915 added support for compression over the wire from avro sink to avro 
 source.  I would like to add this functionality to the FlumeReceiver.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1636) Move main methods to examples

2014-04-25 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-1636:


 Summary: Move main methods to examples
 Key: SPARK-1636
 URL: https://issues.apache.org/jira/browse/SPARK-1636
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


Move the main methods to examples and make them compatible with spark-submit.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1637) Clean up examples for 1.0

2014-04-25 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1637:


 Summary: Clean up examples for 1.0
 Key: SPARK-1637
 URL: https://issues.apache.org/jira/browse/SPARK-1637
 Project: Spark
  Issue Type: Improvement
  Components: Examples
Reporter: Matei Zaharia
Priority: Critical
 Fix For: 1.0.0


- Move all of them into subpackages of org.apache.spark.examples (right now 
some are in org.apache.spark.streaming.examples, for instance, and others are 
in org.apache.spark.examples.mllib)
- Move Python examples into examples/src/main/python
- Update docs to reflect these changes



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1598) Mark main methods experimental

2014-04-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-1598.
--

Resolution: Duplicate

We will move main methods to examples instead.

 Mark main methods experimental
 --

 Key: SPARK-1598
 URL: https://issues.apache.org/jira/browse/SPARK-1598
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 We should treat the parameters in the main methods as part of our APIs. They 
 are not quite consistent at this time, so we should mark them experimental 
 and look for a unified solution in the next sprint.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1638) Executors fail to come up if spark.executor.extraJavaOptions is set

2014-04-25 Thread Kalpit Shah (JIRA)
Kalpit Shah created SPARK-1638:
--

 Summary: Executors fail to come up if 
spark.executor.extraJavaOptions is set 
 Key: SPARK-1638
 URL: https://issues.apache.org/jira/browse/SPARK-1638
 Project: Spark
  Issue Type: Bug
  Components: Deploy, EC2
 Environment: Bring up a cluster in EC2 using spark-ec2 scripts
Reporter: Kalpit Shah
 Fix For: 1.0.0


If you try to launch a PySpark shell with spark.executor.extraJavaOptions set 
to -XX:+UseCompressedOops -XX:+UseCompressedStrings -verbose:gc 
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps, the executors never come up on any 
of the workers.

I see the following error in the log file:

Spark Executor Command: /usr/lib/jvm/java/bin/java -cp 
/root/c3/lib/*::/root/ephemeral-hdfs/conf:/root/spark/conf:/root/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar:
 -XX:+UseCompressedOops -XX:+UseCompressedStrings -verbose:gc 
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xms13312M -Xmx13312M 
org.apache.spark.executor.CoarseGrainedExecutorBackend 
akka.tcp://spark@HOSTNAME:45429/user/CoarseGrainedScheduler 7 HOSTNAME 
4 akka.tcp://sparkWorker@HOSTNAME:39727/user/Worker 
app-20140423224526-


Unrecognized VM option 'UseCompressedOops -XX:+UseCompressedStrings -verbose:gc 
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps'
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
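
The "Unrecognized VM option" text suggests the whole property value reaches the JVM as a 
single argument. Below is a minimal sketch of the required behavior; it is illustrative only 
(not the Spark launcher code), and the classpath and main-class values are placeholders 
abbreviated from the log above.

{code}
// Each JVM flag must be passed to the executor command as its own token.
val extraJavaOpts = "-XX:+UseCompressedOops -XX:+UseCompressedStrings -verbose:gc " +
  "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps"

// Naive whitespace split for illustration; a real launcher would also need to honor quoting.
val javaOptTokens: Seq[String] = extraJavaOpts.split("\\s+").filter(_.nonEmpty).toSeq

val classpath = "/root/spark/conf:/root/spark/assembly/target/spark-assembly.jar"  // placeholder
val executorMain = "org.apache.spark.executor.CoarseGrainedExecutorBackend"

// Broken shape:  Seq("java", "-cp", classpath, extraJavaOpts, executorMain)
// Correct shape: every option is a separate element of the command.
val command: Seq[String] = Seq("java", "-cp", classpath) ++ javaOptTokens ++ Seq(executorMain)
{code}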







 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1640) In yarn-client mode, pass preferred node locations to AM

2014-04-25 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-1640:
-

 Summary: In yarn-client mode, pass preferred node locations to AM
 Key: SPARK-1640
 URL: https://issues.apache.org/jira/browse/SPARK-1640
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 0.9.0
Reporter: Sandy Ryza


In yarn-cluster mode, if the user passes preferred node location data to the 
SparkContext, the AM requests containers based on that data.

In yarn-client mode, it would be good to do this as well. This requires some 
way of passing this data from the client process to the AM.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1641) Spark submit warning tells the user to use spark-submit

2014-04-25 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-1641:
-

Description: 
$ bin/spark-submit ...

Spark assembly has been built with Hive, including Datanucleus jars on classpath
WARNING: This client is deprecated and will be removed in a future version of 
Spark.
Use ./bin/spark-submit with --master yarn

This is printed in org.apache.spark.deploy.yarn.Client.

  was:
$ bin/spark-submit ...

Spark assembly has been built with Hive, including Datanucleus jars on classpath
WARNING: This client is deprecated and will be removed in a future version of 
Spark.
Use ./bin/spark-submit with --master yarn



 Spark submit warning tells the user to use spark-submit
 ---

 Key: SPARK-1641
 URL: https://issues.apache.org/jira/browse/SPARK-1641
 Project: Spark
  Issue Type: Improvement
Reporter: Andrew Or
Priority: Minor

 $ bin/spark-submit ...
 Spark assembly has been built with Hive, including Datanucleus jars on 
 classpath
 WARNING: This client is deprecated and will be removed in a future version of 
 Spark.
 Use ./bin/spark-submit with --master yarn
 This is printed in org.apache.spark.deploy.yarn.Client.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Closed] (SPARK-711) Spark Streaming 0.7.0: ArrayIndexOutOfBoundsException in KafkaWordCount Example

2014-04-25 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das closed SPARK-711.
---

Resolution: Duplicate

Duplicate issue: https://issues.apache.org/jira/browse/SPARK-703

 Spark Streaming 0.7.0: ArrayIndexOutOfBoundsException in KafkaWordCount 
 Example
 ---

 Key: SPARK-711
 URL: https://issues.apache.org/jira/browse/SPARK-711
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.7.0
Reporter: Craig A. Vanderborgh

 The unmodified KafkaWordCount example is crashing when run under Mesos.  It 
 works fine when the master is local.  The KafkaWordCount job works for 
 about 5 iterations, then the exceptions start.  This problem is related to 
 windowing.  Here is the stack trace:
 13/03/07 15:43:46 ERROR streaming.JobManager: Running streaming job 5 @ 
 1362696226000 ms failed
 java.lang.ArrayIndexOutOfBoundsException: 0
 at 
 spark.rdd.CoGroupedRDD$$anonfun$getPartitions$1$$anonfun$apply$mcVI$sp$1.apply(CoGroupedRDD.scala:71)
 at 
 spark.rdd.CoGroupedRDD$$anonfun$getPartitions$1$$anonfun$apply$mcVI$sp$1.apply(CoGroupedRDD.scala:65)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
 at scala.collection.mutable.ArrayBuffer.map(ArrayBuffer.scala:47)
 at spark.rdd.CoGroupedRDD.getPartitions(CoGroupedRDD.scala:63)
 at spark.RDD.partitions(RDD.scala:168)
 at spark.MappedValuesRDD.getPartitions(PairRDDFunctions.scala:646)
 at spark.RDD.partitions(RDD.scala:168)
 at spark.RDD.take(RDD.scala:579)
 at 
 spark.streaming.DStream$$anonfun$foreachFunc$2$1.apply(DStream.scala:495)
 at 
 spark.streaming.DStream$$anonfun$foreachFunc$2$1.apply(DStream.scala:494)
 at 
 spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:22)
 at 
 spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:21)
 at 
 spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:21)
 at spark.streaming.Job.run(Job.scala:10)
 at spark.streaming.JobManager$JobHandler.run(JobManager.scala:17)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 Please let me know if I can help or provide additional information.
 Craig



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-944) Give example of writing to HBase from Spark Streaming

2014-04-25 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13981693#comment-13981693
 ] 

Tathagata Das commented on SPARK-944:
-

Do you have a working example now that you would like to contribute?

 Give example of writing to HBase from Spark Streaming
 -

 Key: SPARK-944
 URL: https://issues.apache.org/jira/browse/SPARK-944
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Patrick Wendell
Assignee: Patrick Cogan
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-944) Give example of writing to HBase from Spark Streaming

2014-04-25 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13981693#comment-13981693
 ] 

Tathagata Das edited comment on SPARK-944 at 4/25/14 10:10 PM:
---

[~kanwal] Do you have a working example now that you would like to contribute?


was (Author: tdas):
Do you have a working example now, that you would like to contribute?

 Give example of writing to HBase from Spark Streaming
 -

 Key: SPARK-944
 URL: https://issues.apache.org/jira/browse/SPARK-944
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Patrick Wendell
Assignee: Patrick Cogan
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1639) Some tidying of Spark on YARN ApplicationMaster and ExecutorLauncher

2014-04-25 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-1639:
--

Summary: Some tidying of Spark on YARN ApplicationMaster and 
ExecutorLauncher  (was: Some tidying of Spark on YARN code)

 Some tidying of Spark on YARN ApplicationMaster and ExecutorLauncher
 

 Key: SPARK-1639
 URL: https://issues.apache.org/jira/browse/SPARK-1639
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 0.9.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza

 I found a few places where we can consolidate duplicate methods, fix typos, 
 add comments, and make what's going on more clear.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1637) Clean up examples for 1.0

2014-04-25 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-1637:
-

Description: 
- Move all of them into subpackages of org.apache.spark.examples (right now 
some are in org.apache.spark.streaming.examples, for instance, and others are 
in org.apache.spark.examples.mllib)
- Move Python examples into examples/src/main/python
- Update docs to reflect these changes
- Clarify that the hand-written K-means and logistic regression examples are 
for demo purposes, but in reality you might want to use MLlib (we will add 
examples for these using MLlib too)

  was:
- Move all of them into subpackages of org.apache.spark.examples (right now 
some are in org.apache.spark.streaming.examples, for instance, and others are 
in org.apache.spark.examples.mllib)
- Move Python examples into examples/src/main/python
- Update docs to reflect these changes


 Clean up examples for 1.0
 -

 Key: SPARK-1637
 URL: https://issues.apache.org/jira/browse/SPARK-1637
 Project: Spark
  Issue Type: Improvement
  Components: Examples
Reporter: Matei Zaharia
Priority: Critical
 Fix For: 1.0.0


 - Move all of them into subpackages of org.apache.spark.examples (right now 
 some are in org.apache.spark.streaming.examples, for instance, and others are 
 in org.apache.spark.examples.mllib)
 - Move Python examples into examples/src/main/python
 - Update docs to reflect these changes
 - Clarify that the hand-written K-means and logistic regression examples are 
 for demo purposes, but in reality you might want to use MLlib (we will add 
 examples for these using MLlib too)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1235) DAGScheduler ignores exceptions thrown in handleTaskCompletion

2014-04-25 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-1235:
-

Affects Version/s: (was: 1.0.0)

 DAGScheduler ignores exceptions thrown in handleTaskCompletion
 --

 Key: SPARK-1235
 URL: https://issues.apache.org/jira/browse/SPARK-1235
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.0, 0.9.1
Reporter: Kay Ousterhout
Assignee: Nan Zhu
Priority: Blocker
 Fix For: 1.0.0


 If an exception gets thrown in the handleTaskCompletion method, the method 
 exits, but the exception is caught somewhere (not clear where) and the 
 DAGScheduler keeps running.  Jobs hang as a result -- because not all of the 
 task completion code gets run.
 This was first reported by Brad Miller on the mailing list: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Fwd-pyspark-crash-on-mesos-td2256.html
  and this behavior seems to have changed since 0.8 (when, based on Brad's 
 description, it sounds like an exception in handleTaskCompletion would cause 
 the DAGScheduler to crash), suggesting that this may be related to Scala 
 2.10.3.
 To reproduce this problem, add throw new Exception("foo") anywhere in 
 handleTaskCompletion and run any job locally. The job will hang and you can 
 see the exception get printed in the logs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1235) DAGScheduler ignores exceptions thrown in handleTaskCompletion

2014-04-25 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-1235.
--

   Resolution: Fixed
Fix Version/s: 1.0.0

Resolved in https://github.com/apache/spark/pull/186.

 DAGScheduler ignores exceptions thrown in handleTaskCompletion
 --

 Key: SPARK-1235
 URL: https://issues.apache.org/jira/browse/SPARK-1235
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.0, 0.9.1
Reporter: Kay Ousterhout
Assignee: Nan Zhu
Priority: Blocker
 Fix For: 1.0.0


 If an exception gets thrown in the handleTaskCompletion method, the method 
 exits, but the exception is caught somewhere (not clear where) and the 
 DAGScheduler keeps running.  Jobs hang as a result -- because not all of the 
 task completion code gets run.
 This was first reported by Brad Miller on the mailing list: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Fwd-pyspark-crash-on-mesos-td2256.html
  and this behavior seems to have changed since 0.8 (when, based on Brad's 
 description, it sounds like an exception in handleTaskCompletion would cause 
 the DAGScheduler to crash), suggesting that this may be related to Scala 
 2.10.3.
 To reproduce this problem, add throw new Exception("foo") anywhere in 
 handleTaskCompletion and run any job locally. The job will hang and you can 
 see the exception get printed in the logs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Closed] (SPARK-1620) Uncaught exception from Akka scheduler

2014-04-25 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra closed SPARK-1620.
---

Resolution: Invalid

 Uncaught exception from Akka scheduler
 --

 Key: SPARK-1620
 URL: https://issues.apache.org/jira/browse/SPARK-1620
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0, 1.0.0
Reporter: Mark Hamstra
Priority: Blocker

 I've been looking at this one in the context of a BlockManagerMaster that 
 OOMs and doesn't respond to heartBeat(), but I suspect that there may be 
 problems elsewhere where we use Akka's scheduler.
 The basic nature of the problem is that we are expecting exceptions thrown 
 from a scheduled function to be caught in the thread where 
 _ActorSystem_.scheduler.schedule() or scheduleOnce() has been called.  In 
 fact, the scheduled function runs on its own thread, so any exceptions that 
 it throws are not caught in the thread that called schedule() -- e.g., 
 unanswered BlockManager heartBeats (scheduled in BlockManager#initialize) 
 that end up throwing exceptions in BlockManagerMaster#askDriverWithReply do 
 not cause those exceptions to be handled by the Executor thread's 
 UncaughtExceptionHandler. 
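 A minimal standalone sketch of the behavior described above (not Spark code; the actor 
 system name and delays are arbitrary, and it assumes an Akka version where 
 ActorSystem#shutdown is still available, as in the 2.2.x line):
 {code}
 import scala.concurrent.duration._
 import akka.actor.ActorSystem

 object SchedulerExceptionDemo extends App {
   val system = ActorSystem("demo")
   import system.dispatcher  // ExecutionContext used to run the scheduled block

   try {
     // The block below runs later on a dispatcher thread, so this try/catch
     // around the schedule call never observes the exception it throws.
     system.scheduler.scheduleOnce(1.second) {
       throw new RuntimeException("thrown on the scheduler's thread, not the caller's")
     }
   } catch {
     case e: Exception => println("never reached: " + e)
   }

   Thread.sleep(2000)
   system.shutdown()
 }
 {code}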



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1632) Avoid boxing in ExternalAppendOnlyMap compares

2014-04-25 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1632.


   Resolution: Fixed
Fix Version/s: 1.0.0

 Avoid boxing in ExternalAppendOnlyMap compares
 --

 Key: SPARK-1632
 URL: https://issues.apache.org/jira/browse/SPARK-1632
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Fix For: 1.0.0


 Hitting an OOME in ExternalAppendOnlyMap.KCComparator while boxing an int.  I 
 don't know if this is the root cause, but the boxing is also avoidable.
 Code:
 {code}
 def compare(kc1: (K, C), kc2: (K, C)): Int = {
   kc1._1.hashCode().compareTo(kc2._1.hashCode())
 }
 {code}
 Error:
 {code}
 java.lang.OutOfMemoryError: GC overhead limit exceeded
  at java.lang.Integer.valueOf(Integer.java:642)
  at scala.Predef$.int2Integer(Predef.scala:370)
  at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap$KCComparator.compare(ExternalAppendOnlyMap.scala:432)
  at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap$KCComparator.compare(ExternalAppendOnlyMap.scala:430)
  at 
 org.apache.spark.util.collection.AppendOnlyMap$$anon$3.compare(AppendOnlyMap.scala:271)
  at java.util.TimSort.mergeLo(TimSort.java:687)
  at java.util.TimSort.mergeAt(TimSort.java:483)
  at java.util.TimSort.mergeCollapse(TimSort.java:410)
  at java.util.TimSort.sort(TimSort.java:214)
  at java.util.Arrays.sort(Arrays.java:727)
  at 
 org.apache.spark.util.collection.AppendOnlyMap.destructiveSortedIterator(AppendOnlyMap.scala:274)
  at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:188)
  at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insert(ExternalAppendOnlyMap.scala:141)
  at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:59)
  at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
  at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
  at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
  at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
  at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
  at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
  at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
  at org.apache.spark.scheduler.Task.run(Task.scala:53)
  at 
 org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
  at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
  at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
  at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
 {code}
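 For reference, a comparison along the following lines avoids the Integer boxing entirely 
 (a sketch of the idea, not necessarily the committed fix):
 {code}
 // Compare the primitive hash codes directly instead of going through
 // Integer.compareTo, which boxes via Predef.int2Integer.
 def compare[K, C](kc1: (K, C), kc2: (K, C)): Int = {
   val hash1 = kc1._1.hashCode()
   val hash2 = kc2._1.hashCode()
   if (hash1 < hash2) -1 else if (hash1 == hash2) 0 else 1
 }
 {code}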



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1299) making comments of RDD.doCheckpoint consistent with its usage

2014-04-25 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13981819#comment-13981819
 ] 

Nan Zhu commented on SPARK-1299:


Addressed in https://github.com/apache/spark/pull/186.

 making comments of RDD.doCheckpoint consistent with its usage
 -

 Key: SPARK-1299
 URL: https://issues.apache.org/jira/browse/SPARK-1299
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Nan Zhu
Assignee: Nan Zhu
Priority: Trivial
 Fix For: 1.0.0


 Another trivial thing I found: the comment on this function says
 /**
  * Performs the checkpointing of this RDD by saving this. It is called by the DAGScheduler
  * after a job using this RDD has completed (therefore the RDD has been materialized and
  * potentially stored in memory). doCheckpoint() is called recursively on the parent RDDs.
  */
 but the function is actually called in SparkContext.runJob.
 We can either change the comment or call it in DAGScheduler. I personally 
 prefer the latter, since this call acts like an auto-checkpoint and is better 
 placed in a non-user-facing component.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1299) making comments of RDD.doCheckpoint consistent with its usage

2014-04-25 Thread Nan Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nan Zhu resolved SPARK-1299.


   Resolution: Fixed
Fix Version/s: 1.0.0

 making comments of RDD.doCheckpoint consistent with its usage
 -

 Key: SPARK-1299
 URL: https://issues.apache.org/jira/browse/SPARK-1299
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Nan Zhu
Assignee: Nan Zhu
Priority: Trivial
 Fix For: 1.0.0


 Another trivial thing I found: the comment on this function says
 /**
  * Performs the checkpointing of this RDD by saving this. It is called by the DAGScheduler
  * after a job using this RDD has completed (therefore the RDD has been materialized and
  * potentially stored in memory). doCheckpoint() is called recursively on the parent RDDs.
  */
 but the function is actually called in SparkContext.runJob.
 We can either change the comment or call it in DAGScheduler. I personally 
 prefer the latter, since this call acts like an auto-checkpoint and is better 
 placed in a non-user-facing component.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1298) Remove duplicate partition id checking

2014-04-25 Thread Nan Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nan Zhu resolved SPARK-1298.


   Resolution: Fixed
Fix Version/s: 1.0.0

 Remove duplicate partition id checking
 --

 Key: SPARK-1298
 URL: https://issues.apache.org/jira/browse/SPARK-1298
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Nan Zhu
Assignee: Nan Zhu
Priority: Minor
 Fix For: 1.0.0


 In the current implementation, we check whether partition IDs make sense in 
 SparkContext.runJob():
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L896
 However, immediately afterwards, in DAGScheduler (calling path 
 SparkContext.runJob -> DAGScheduler.runJob -> DAGScheduler.submitJob), we 
 check it again (just missing a < 0 condition):
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L432
 I propose to remove the SparkContext check and do it in DAGScheduler (which 
 makes more sense, in my view).
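 A sketch of the single check being proposed (illustrative only, with placeholder names; 
 not the exact Spark code):
 {code}
 // Reject any partition id outside the valid range [0, numPartitions).
 def validatePartitions(partitionIds: Seq[Int], numPartitions: Int): Unit = {
   partitionIds.foreach { p =>
     require(p >= 0 && p < numPartitions,
       "Invalid partition id " + p + " (numPartitions = " + numPartitions + ")")
   }
 }
 {code}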



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1356) [STREAMING] Annotate developer and experimental API's

2014-04-25 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-1356.
--

Resolution: Fixed
  Assignee: Tathagata Das

 [STREAMING] Annotate developer and experimental API's
 -

 Key: SPARK-1356
 URL: https://issues.apache.org/jira/browse/SPARK-1356
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Reporter: Patrick Wendell
Assignee: Tathagata Das
Priority: Blocker
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Closed] (SPARK-1634) Java API docs contain test cases

2014-04-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-1634.


   Resolution: Not a Problem
Fix Version/s: 1.0.0
 Assignee: Xiangrui Meng

Re-tried with `sbt/sbt clean`. The generated docs for the test cases were gone.

 Java API docs contain test cases
 

 Key: SPARK-1634
 URL: https://issues.apache.org/jira/browse/SPARK-1634
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Blocker
 Fix For: 1.0.0


 The generated Java API docs contain all test cases.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Closed] (SPARK-1558) [streaming] Update receiver information to match it with code

2014-04-25 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das closed SPARK-1558.


Resolution: Invalid

Irrelevant after underlying code changes. 

 [streaming] Update receiver information to match it with code
 -

 Key: SPARK-1558
 URL: https://issues.apache.org/jira/browse/SPARK-1558
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Blocker
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-1606) spark-submit needs `--arg` for every application parameter

2014-04-25 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell reassigned SPARK-1606:
--

Assignee: Patrick Wendell

 spark-submit needs `--arg` for every application parameter
 --

 Key: SPARK-1606
 URL: https://issues.apache.org/jira/browse/SPARK-1606
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Assignee: Patrick Wendell

 If the application has a few parameters, the spark-submit command looks like 
 the following:
 {code}
 spark-submit --master yarn-cluster --class main.Class --arg --numPartitions 
 --arg 8 --arg --kryo --arg true
 {code}
 It is a little bit hard to read and modify. Maybe it is okay to treat all 
 arguments after `main.Class` as application parameters.
 {code}
 spark-submit --master yarn-cluster --class main.Class --numPartitions 8 
 --kryo true
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1606) spark-submit needs `--arg` for every application parameter

2014-04-25 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13981851#comment-13981851
 ] 

Patrick Wendell commented on SPARK-1606:


I submitted a PR here that adds the following syntax.

{code}
./bin/spark-submit [options] user.jar [user options]
{code}

 spark-submit needs `--arg` for every application parameter
 --

 Key: SPARK-1606
 URL: https://issues.apache.org/jira/browse/SPARK-1606
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Assignee: Patrick Wendell
Priority: Blocker

 If the application has a few parameters, the spark-submit command looks like 
 the following:
 {code}
 spark-submit --master yarn-cluster --class main.Class --arg --numPartitions 
 --arg 8 --arg --kryo --arg true
 {code}
 It is a little bit hard to read and modify. Maybe it is okay to treat all 
 arguments after `main.Class` as application parameters.
 {code}
 spark-submit --master yarn-cluster --class main.Class --numPartitions 8 
 --kryo true
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1606) spark-submit needs `--arg` for every application parameter

2014-04-25 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13981851#comment-13981851
 ] 

Patrick Wendell edited comment on SPARK-1606 at 4/26/14 2:26 AM:
-

I submitted a PR here that adds the following syntax.

{code}
./bin/spark-submit [options] user.jar [user options]
{code}

https://github.com/apache/spark/pull/563


was (Author: pwendell):
I submitted a PR here that adds the following syntax.

{code}
./bin/spark-submit [options] user.jar [user options]
{code}

 spark-submit needs `--arg` for every application parameter
 --

 Key: SPARK-1606
 URL: https://issues.apache.org/jira/browse/SPARK-1606
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Assignee: Patrick Wendell
Priority: Blocker

 If the application has a few parameters, the spark-submit command looks like 
 the following:
 {code}
 spark-submit --master yarn-cluster --class main.Class --arg --numPartitions 
 --arg 8 --arg --kryo --arg true
 {code}
 It is a little bit hard to read and modify. Maybe it is okay to treat all 
 arguments after `main.Class` as application parameters.
 {code}
 spark-submit --master yarn-cluster --class main.Class --numPartitions 8 
 --kryo true
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1478) Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915

2014-04-25 Thread Ted Malaska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13981855#comment-13981855
 ] 

Ted Malaska commented on SPARK-1478:


No worries, that error was caused by me; I'm still learning Scala. It was the 
difference between using a lazy val and a var.
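
As a generic illustration of the kind of initialization-order trap involved (not the code 
from the PR; the names below are made up):

{code}
class Demo {
  // Read during construction, before `port` below has been assigned,
  // so this sees the default value 0.
  val eagerServer = "server:" + port

  // Evaluated lazily on first access, after construction has finished,
  // so this sees 9999.
  lazy val lazyServer = "server:" + port

  var port = 9999
}

object DemoApp extends App {
  val d = new Demo
  println(d.eagerServer)  // server:0
  println(d.lazyServer)   // server:9999
}
{code}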

I have all three test cases working now and I will do one last review before 
submitting it tomorrow.

There is also one more odd thing going on that I haven't figured out yet. 
Sometimes (seemingly at random) my tests fail with the following exception:

[info] - flume input stream *** FAILED *** (10 seconds, 332 milliseconds)
[info]   0 did not equal 1 (FlumeStreamSuite.scala:104)
[info]   org.scalatest.exceptions.TestFailedException:

Then I rerun the test with no code changes and it succeeds. It feels very much 
like a race condition. I found this odd enough that I did a fresh git clone and 
tested the latest branch, and I was still able to reproduce the exception.

I will look into this tomorrow. For now I will assume something is odd in my 
environment until I find evidence otherwise.

Thank you again for the help.

 Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915
 ---

 Key: SPARK-1478
 URL: https://issues.apache.org/jira/browse/SPARK-1478
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Ted Malaska
Assignee: Ted Malaska
Priority: Minor

 Flume-1915 added support for compression over the wire from avro sink to avro 
 source.  I would like to add this functionality to the FlumeReceiver.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1478) Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915

2014-04-25 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13981856#comment-13981856
 ] 

Tathagata Das commented on SPARK-1478:
--

Haha, yeah, lazy vals are super useful in difficult situations but can lead to 
difficult situations themselves if one is not careful. :)

I am not sure where the flakiness is coming from, but it really needs to be 
solved. Flakiness can be a major headache in our automated Jenkins tests; I am 
suffering from it myself in two PRs. :(

Let me know how I can help with this.

 Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915
 ---

 Key: SPARK-1478
 URL: https://issues.apache.org/jira/browse/SPARK-1478
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Ted Malaska
Assignee: Ted Malaska
Priority: Minor

 Flume-1915 added support for compression over the wire from avro sink to avro 
 source.  I would like to add this functionality to the FlumeReceiver.



--
This message was sent by Atlassian JIRA
(v6.2#6252)