[jira] [Commented] (SPARK-1764) EOF reached before Python server acknowledged

2014-05-13 Thread Bernardo Gomez Palacio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995484#comment-13995484
 ] 

Bernardo Gomez Palacio commented on SPARK-1764:
---

We just ran `sc.parallelize(range(100)).map(lambda n: n * 2).collect()` on a 
Mesos 0.18.1 cluster with the latest Spark and it worked. Could you confirm the 
Spark and Mesos versions you are using (if using master, please include the 
SHA/commit hash)?

 EOF reached before Python server acknowledged
 -

 Key: SPARK-1764
 URL: https://issues.apache.org/jira/browse/SPARK-1764
 Project: Spark
  Issue Type: Bug
  Components: Mesos, PySpark
Affects Versions: 1.0.0
Reporter: Bouke van der Bijl
Priority: Blocker
  Labels: mesos, pyspark

 I'm getting "EOF reached before Python server acknowledged" while using 
 PySpark on Mesos. The error manifests itself in multiple ways. One is:
 14/05/08 18:10:40 ERROR DAGSchedulerActorSupervisor: eventProcesserActor 
 failed due to the error EOF reached before Python server acknowledged; 
 shutting down SparkContext
 And the other has a full stacktrace:
 14/05/08 18:03:06 ERROR OneForOneStrategy: EOF reached before Python server 
 acknowledged
 org.apache.spark.SparkException: EOF reached before Python server acknowledged
   at 
 org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:416)
   at 
 org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:387)
   at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:71)
   at 
 org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:279)
   at 
 org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:277)
   at 
 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
   at 
 scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
   at 
 scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
   at 
 scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
   at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
   at 
 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
   at org.apache.spark.Accumulators$.add(Accumulators.scala:277)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:818)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1204)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 This error causes the SparkContext to shut down. I have not been able to 
 reliably reproduce this bug; it seems to happen randomly, but if you run 
 enough tasks on a SparkContext it'll happen eventually.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1780) Non-existent SPARK_DAEMON_OPTS is referred to in a few places

2014-05-13 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995902#comment-13995902
 ] 

Andrew Or commented on SPARK-1780:
--

https://github.com/apache/spark/pull/751

 Non-existent SPARK_DAEMON_OPTS is referred to in a few places
 -

 Key: SPARK-1780
 URL: https://issues.apache.org/jira/browse/SPARK-1780
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.1
Reporter: Andrew Or
 Fix For: 1.0.0


 SparkConf.scala and spark-env.sh refer to a non-existent SPARK_DAEMON_OPTS. 
 What they really mean is SPARK_DAEMON_JAVA_OPTS.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1161) Add saveAsObjectFile and SparkContext.objectFile in Python

2014-05-13 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996081#comment-13996081
 ] 

Kan Zhang commented on SPARK-1161:
--

PR: https://github.com/apache/spark/pull/755

 Add saveAsObjectFile and SparkContext.objectFile in Python
 --

 Key: SPARK-1161
 URL: https://issues.apache.org/jira/browse/SPARK-1161
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Reporter: Matei Zaharia
Assignee: Kan Zhang

 It can use pickling for serialization and a SequenceFile on disk similar to 
 the JVM versions of these.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1802) Audit dependency graph when Spark is built with -Phive

2014-05-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995815#comment-13995815
 ] 

Sean Owen commented on SPARK-1802:
--

I looked further into just what might go wrong by including hive-exec into the 
assembly, since it includes its dependencies directly (i.e. Maven can't manage 
around it.)

Attached is a full dump of the conflicts.

The ones that are potential issues appear to be the following, and one looks 
like it could be a deal-breaker -- protobuf -- since it's neither forwards nor 
backwards compatible. That is, I recommend testing this assembly with an older 
Hadoop that needs 2.4.1 and see if it croaks.

The rest might be worked around but need some additional mojo to make sure the 
right version wins in the packaging.

Certainly having hive-exec in the build is making me queasy!


[WARNING] hive-exec-0.12.0.jar, libthrift-0.9.0.jar define 153 overlappping 
classes: 

HBase includes libthrift-0.8.0, but it's in examples, and so I figure this is 
ignorable.


[WARNING] hive-exec-0.12.0.jar, commons-lang-2.4.jar define 2 overlappping 
classes: 

Probably ignorable, but we have to make sure commons-lang-3.3.2 'wins' in the 
build.


[WARNING] hive-exec-0.12.0.jar, jackson-core-asl-1.9.11.jar define 117 
overlappping classes: 
[WARNING] hive-exec-0.12.0.jar, jackson-mapper-asl-1.8.8.jar define 432 
overlappping classes: 

I believe these are ignorable. (Not sure why the Jackson versions are mismatched; 
another todo.)


[WARNING] hive-exec-0.12.0.jar, guava-14.0.1.jar define 1087 overlappping 
classes: 

Should be OK. Hive uses 11.0.2 like Hadoop; the build is already taking that 
particular risk. We need 14.0.1 to win.


[WARNING] hive-exec-0.12.0.jar, protobuf-java-2.4.1.jar define 204 overlappping 
classes: 

Oof. Hive has protobuf 2.5.0. This has got to be a problem for older Hadoop 
builds?



 Audit dependency graph when Spark is built with -Phive
 --

 Key: SPARK-1802
 URL: https://issues.apache.org/jira/browse/SPARK-1802
 Project: Spark
  Issue Type: Bug
Reporter: Patrick Wendell
Assignee: Sean Owen
Priority: Blocker
 Fix For: 1.0.0


 I'd like to have the binary release for 1.0 include Hive support. Since this 
 isn't enabled by default in the build, I don't think it's as well tested, so 
 we should dig around a bit and decide if we need to e.g. add any excludes.
 {code}
 $ mvn install -Phive -DskipTests && mvn dependency:build-classpath -pl 
 assembly | grep -v INFO | tr : "\n" | awk '{ FS="/"; print ( $(NF) ); }' 
 | sort > without_hive.txt
 $ mvn install -Phive -DskipTests && mvn dependency:build-classpath -Phive -pl 
 assembly | grep -v INFO | tr : "\n" | awk '{ FS="/"; print ( $(NF) ); }' 
 | sort > with_hive.txt
 $ diff without_hive.txt with_hive.txt
  antlr-2.7.7.jar
  antlr-3.4.jar
  antlr-runtime-3.4.jar
 10,14d6
  avro-1.7.4.jar
  avro-ipc-1.7.4.jar
  avro-ipc-1.7.4-tests.jar
  avro-mapred-1.7.4.jar
  bonecp-0.7.1.RELEASE.jar
 22d13
  commons-cli-1.2.jar
 25d15
  commons-compress-1.4.1.jar
 33,34d22
  commons-logging-1.1.1.jar
  commons-logging-api-1.0.4.jar
 38d25
  commons-pool-1.5.4.jar
 46,49d32
  datanucleus-api-jdo-3.2.1.jar
  datanucleus-core-3.2.2.jar
  datanucleus-rdbms-3.2.1.jar
  derby-10.4.2.0.jar
 53,57d35
  hive-common-0.12.0.jar
  hive-exec-0.12.0.jar
  hive-metastore-0.12.0.jar
  hive-serde-0.12.0.jar
  hive-shims-0.12.0.jar
 60,61d37
  httpclient-4.1.3.jar
  httpcore-4.1.3.jar
 68d43
  JavaEWAH-0.3.2.jar
 73d47
  javolution-5.5.1.jar
 76d49
  jdo-api-3.0.1.jar
 78d50
  jetty-6.1.26.jar
 87d58
  jetty-util-6.1.26.jar
 93d63
  json-20090211.jar
 98d67
  jta-1.1.jar
 103,104d71
  libfb303-0.9.0.jar
  libthrift-0.9.0.jar
 112d78
  mockito-all-1.8.5.jar
 136d101
  servlet-api-2.5-20081211.jar
 139d103
  snappy-0.2.jar
 144d107
  spark-hive_2.10-1.0.0.jar
 151d113
  ST4-4.0.4.jar
 153d114
  stringtemplate-3.2.1.jar
 156d116
  velocity-1.7.jar
 158d117
  xz-1.0.jar
 {code}
 Some initial investigation suggests we may need to take some precaution 
 surrounding (a) jetty and (b) servlet-api.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1753) PySpark on YARN does not work on assembly jar built on Red Hat based OS

2014-05-13 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1753.


   Resolution: Fixed
Fix Version/s: (was: 1.0.1)
   1.0.0

Issue resolved by pull request 701
[https://github.com/apache/spark/pull/701]

 PySpark on YARN does not work on assembly jar built on Red Hat based OS
 ---

 Key: SPARK-1753
 URL: https://issues.apache.org/jira/browse/SPARK-1753
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 1.0.0
Reporter: Andrew Or
 Fix For: 1.0.0


 If the jar is built on a Red Hat based OS, the additional python files 
 included in the jar cannot be accessed. This means PySpark doesn't work on 
 YARN because in this mode it relies on the python files within this jar.
 I have confirmed that my Java, Scala, and maven versions are all exactly the 
 same on my CentOS environment and on my local OSX environment, and the former 
 does not work. Thomas Graves also struggled with the same problem.
 Until a fix is found, we should at the very least document this peculiarity.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1756) Add missing description to spark-env.sh.template

2014-05-13 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13992535#comment-13992535
 ] 

Guoqiang Li commented on SPARK-1756:


[PR 646| https://github.com/apache/spark/pull/646]

 Add missing description to spark-env.sh.template
 

 Key: SPARK-1756
 URL: https://issues.apache.org/jira/browse/SPARK-1756
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, YARN
Reporter: Guoqiang Li
Assignee: Guoqiang Li
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1817) RDD zip erroneous when partitions do not divide RDD count

2014-05-13 Thread Michael Malak (JIRA)
Michael Malak created SPARK-1817:


 Summary: RDD zip erroneous when partitions do not divide RDD count
 Key: SPARK-1817
 URL: https://issues.apache.org/jira/browse/SPARK-1817
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Michael Malak


Example:

scala> sc.parallelize(1L to 2L,4).zip(sc.parallelize(11 to 12,4)).collect
res1: Array[(Long, Int)] = Array((2,11))

But more generally, it's whenever the number of partitions does not evenly 
divide the total number of elements in the RDD.

See https://groups.google.com/forum/#!msg/spark-users/demrmjHFnoc/Ek3ijiXHr2MJ
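
For reference, zip is defined as an element-wise pairing of the two RDDs, so once 
this is fixed the example above should return the expected (not the current) output:

{code}
scala> sc.parallelize(1L to 2L, 4).zip(sc.parallelize(11 to 12, 4)).collect
res1: Array[(Long, Int)] = Array((1,11), (2,12))
{code}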




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1616) input file not found issue

2014-05-13 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994044#comment-13994044
 ] 

Marcelo Vanzin commented on SPARK-1616:
---

Hi Prasad,

This doesn't really sound like a bug, but a mismatch between your expectations 
and Spark's.

When you tell a Spark job to read data from a file, Spark expects the file to 
be available to all the workers. This can be achieved in several ways:

* Using a distributed file system such as HDFS
* Using a networked file system such as NFS
* Using Spark's file distribution mechanism, which will copy the file to the 
workers for you (e.g. spark-submit's --files argument if you run 1.0)
* Manually copying the file like you did

But Spark will not automatically copy data to worker nodes on your behalf.
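
As a concrete sketch of the third option (Spark's file distribution mechanism), the 
same thing can also be done programmatically with SparkContext.addFile and 
SparkFiles.get; the file path below is just a placeholder:

{code}
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("file-distribution-sketch"))

// Ship a driver-local file to every worker; "/tmp/lookup.txt" is a placeholder path.
sc.addFile("/tmp/lookup.txt")

// On each worker, SparkFiles.get resolves the worker-local copy of the file.
val sizes = sc.parallelize(1 to 4).map { _ =>
  scala.io.Source.fromFile(SparkFiles.get("lookup.txt")).getLines().size
}.collect()
{code}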

 input file not found issue 
 ---

 Key: SPARK-1616
 URL: https://issues.apache.org/jira/browse/SPARK-1616
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 0.9.0
 Environment: Linux 2.6.18-348.3.1.el5 
Reporter: prasad potipireddi





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1818) Freshen Mesos docs

2014-05-13 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996146#comment-13996146
 ] 

Andrew Ash commented on SPARK-1818:
---

https://github.com/apache/spark/pull/756

 Freshen Mesos docs
 --

 Key: SPARK-1818
 URL: https://issues.apache.org/jira/browse/SPARK-1818
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, Mesos
Affects Versions: 1.0.0
Reporter: Andrew Ash

 They haven't been updated since 0.6.0 and encourage compiling both Mesos and 
 Spark from scratch.  Include mention of the precompiled binary versions of 
 both projects available and otherwise generally freshen the documentation for 
 Mesos newcomers.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1780) Non-existent SPARK_DAEMON_OPTS is referred to in a few places

2014-05-13 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1780.


Resolution: Fixed
  Assignee: Andrew Or  (was: Patrick Wendell)

 Non-existent SPARK_DAEMON_OPTS is referred to in a few places
 -

 Key: SPARK-1780
 URL: https://issues.apache.org/jira/browse/SPARK-1780
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.1
Reporter: Andrew Or
Assignee: Andrew Or
 Fix For: 1.0.0


 SparkConf.scala and spark-env.sh refer to a non-existent SPARK_DAEMON_OPTS. 
 What they really mean is SPARK_DAEMON_JAVA_OPTS.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1802) Audit dependency graph when Spark is built with -Phive

2014-05-13 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996178#comment-13996178
 ] 

Patrick Wendell commented on SPARK-1802:


This protobuf thing is very troubling. The options here are pretty limited 
since they publish this assembly jar. I see a few:

1. Publish a Hive 0.12 that uses our shaded protobuf 2.4.1 (we already 
published a shaded version of protobuf 2.4.1). I actually have this working in 
a local build of Hive 0.12, but I haven't tried to push it to sonatype yet:
https://github.com/pwendell/hive/commits/branch-0.12-shaded-protobuf

2. Upgrade our use of hive to 0.13 (which bumps to protobuf 2.5.0) and only 
support Spark SQL with Hadoop 2+ - that is, versions of Hadoop that have also 
bumped to protobuf 2.5.0. I'm not sure how big of an effort that would be in 
terms of the code changes between 0.12 and 0.13. Spark didn't recompile 
trivially. I can talk to Michael Armbrust tomorrow morning about this.

One thing I don't totally understand is how Hive itself deals with this 
conflict. For instance, if someone wants to run Hive 0.12 with Hadoop 2, 
presumably both the Hive protobuf 2.4.1 and the HDFS client protobuf 2.5.0 will 
be in the JVM at the same time... I'm not sure how they are isolated from 
each other. HDP 2.1, for instance, seems to have both 
(http://hortonworks.com/hdp/whats-new/).

 Audit dependency graph when Spark is built with -Phive
 --

 Key: SPARK-1802
 URL: https://issues.apache.org/jira/browse/SPARK-1802
 Project: Spark
  Issue Type: Bug
Reporter: Patrick Wendell
Assignee: Sean Owen
Priority: Blocker
 Fix For: 1.0.0

 Attachments: hive-exec-jar-problems.txt


 I'd like to have the binary release for 1.0 include Hive support. Since this 
 isn't enabled by default in the build, I don't think it's as well tested, so 
 we should dig around a bit and decide if we need to e.g. add any excludes.
 {code}
 $ mvn install -Phive -DskipTests && mvn dependency:build-classpath -pl 
 assembly | grep -v INFO | tr : "\n" | awk '{ FS="/"; print ( $(NF) ); }' 
 | sort > without_hive.txt
 $ mvn install -Phive -DskipTests && mvn dependency:build-classpath -Phive -pl 
 assembly | grep -v INFO | tr : "\n" | awk '{ FS="/"; print ( $(NF) ); }' 
 | sort > with_hive.txt
 $ diff without_hive.txt with_hive.txt
  antlr-2.7.7.jar
  antlr-3.4.jar
  antlr-runtime-3.4.jar
 10,14d6
  avro-1.7.4.jar
  avro-ipc-1.7.4.jar
  avro-ipc-1.7.4-tests.jar
  avro-mapred-1.7.4.jar
  bonecp-0.7.1.RELEASE.jar
 22d13
  commons-cli-1.2.jar
 25d15
  commons-compress-1.4.1.jar
 33,34d22
  commons-logging-1.1.1.jar
  commons-logging-api-1.0.4.jar
 38d25
  commons-pool-1.5.4.jar
 46,49d32
  datanucleus-api-jdo-3.2.1.jar
  datanucleus-core-3.2.2.jar
  datanucleus-rdbms-3.2.1.jar
  derby-10.4.2.0.jar
 53,57d35
  hive-common-0.12.0.jar
  hive-exec-0.12.0.jar
  hive-metastore-0.12.0.jar
  hive-serde-0.12.0.jar
  hive-shims-0.12.0.jar
 60,61d37
  httpclient-4.1.3.jar
  httpcore-4.1.3.jar
 68d43
  JavaEWAH-0.3.2.jar
 73d47
  javolution-5.5.1.jar
 76d49
  jdo-api-3.0.1.jar
 78d50
  jetty-6.1.26.jar
 87d58
  jetty-util-6.1.26.jar
 93d63
  json-20090211.jar
 98d67
  jta-1.1.jar
 103,104d71
  libfb303-0.9.0.jar
  libthrift-0.9.0.jar
 112d78
  mockito-all-1.8.5.jar
 136d101
  servlet-api-2.5-20081211.jar
 139d103
  snappy-0.2.jar
 144d107
  spark-hive_2.10-1.0.0.jar
 151d113
  ST4-4.0.4.jar
 153d114
  stringtemplate-3.2.1.jar
 156d116
  velocity-1.7.jar
 158d117
  xz-1.0.jar
 {code}
 Some initial investigation suggests we may need to take some precaution 
 surrounding (a) jetty and (b) servlet-api.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1802) Audit dependency graph when Spark is built with -Phive

2014-05-13 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996178#comment-13996178
 ] 

Patrick Wendell edited comment on SPARK-1802 at 5/13/14 8:18 AM:
-

This protobuf thing is very troubling. The options here are pretty limited 
since they publish this assembly jar. I see a few:

1. Publish a Hive 0.12 that uses our shaded protobuf 2.4.1 (we already 
published a shaded version of protobuf 2.4.1). I actually have this working in 
a local build of Hive 0.12, but I haven't tried to push it to sonatype yet:
https://github.com/pwendell/hive/commits/branch-0.12-shaded-protobuf

2. Upgrade our use of hive to 0.13 (which bumps to protobuf 2.5.0) and only 
support Spark SQL with Hadoop 2+ - that is, versions of Hadoop that have also 
bumped to protobuf 2.5.0. I'm not sure how big of an effort that would be in 
terms of the code changes between 0.12 and 0.13. Spark didn't recompile 
trivially. I can talk to Michael Armbrust tomorrow morning about this.

One thing I don't totally understand is how Hive itself deals with this 
conflict. For instance, if someone wants to run Hive 0.12 with Hadoop 2, 
presumably both the Hive protobuf 2.4.1 and the HDFS client protobuf 2.5.0 will 
be in the JVM at the same time... I'm not sure how they are isolated from 
each other. HDP 2.1, for instance, seems to have both 
(http://hortonworks.com/hdp/whats-new/).


was (Author: pwendell):
This protobuf thing is very troubling. The options here are pretty limited 
since they publish this assembly jar. I see a few:

1. Publish a Hive 0.12 that users our shaded protobuf 2.4.1 (we already 
published a shaded version of protobuf 2.4.1). I actually have this working in 
a local build of Hive 0.12, but I haven't tried to push it to sonatype yet:
https://github.com/pwendell/hive/commits/branch-0.12-shaded-protobuf

2. Upgrade our use of hive to 0.13 (which bumps to protobuf 2.5.0) and only 
support Spark SQL with Hadoop 2+ - that is, versions of Hadoop that have also 
bumped to protobuf 2.5.0. I'm not sure how big of an effort that would be in 
terms of the code changes between 0.12 and 0.13. Spark didn't recompile 
trivially. I can talk to Michael Armbrust tomorrow morning about this.

One thing I don't totally understand is how Hive itself deals with this 
conflict. For instance, if someone wants to run Hive 0.12 with Hadoop 2, 
presumably both the Hive protobuf 2.4.1 and the HDFS client protobuf 2.5.0 will 
be in the JVM at the same time... I'm not sure how they are isolated from 
each other. HDP 2.1, for instance, seems to have both 
(http://hortonworks.com/hdp/whats-new/).

 Audit dependency graph when Spark is built with -Phive
 --

 Key: SPARK-1802
 URL: https://issues.apache.org/jira/browse/SPARK-1802
 Project: Spark
  Issue Type: Bug
Reporter: Patrick Wendell
Assignee: Sean Owen
Priority: Blocker
 Fix For: 1.0.0

 Attachments: hive-exec-jar-problems.txt


 I'd like to have the binary release for 1.0 include Hive support. Since this 
 isn't enabled by default in the build, I don't think it's as well tested, so 
 we should dig around a bit and decide if we need to e.g. add any excludes.
 {code}
 $ mvn install -Phive -DskipTests && mvn dependency:build-classpath -pl 
 assembly | grep -v INFO | tr : "\n" | awk '{ FS="/"; print ( $(NF) ); }' 
 | sort > without_hive.txt
 $ mvn install -Phive -DskipTests && mvn dependency:build-classpath -Phive -pl 
 assembly | grep -v INFO | tr : "\n" | awk '{ FS="/"; print ( $(NF) ); }' 
 | sort > with_hive.txt
 $ diff without_hive.txt with_hive.txt
  antlr-2.7.7.jar
  antlr-3.4.jar
  antlr-runtime-3.4.jar
 10,14d6
  avro-1.7.4.jar
  avro-ipc-1.7.4.jar
  avro-ipc-1.7.4-tests.jar
  avro-mapred-1.7.4.jar
  bonecp-0.7.1.RELEASE.jar
 22d13
  commons-cli-1.2.jar
 25d15
  commons-compress-1.4.1.jar
 33,34d22
  commons-logging-1.1.1.jar
  commons-logging-api-1.0.4.jar
 38d25
  commons-pool-1.5.4.jar
 46,49d32
  datanucleus-api-jdo-3.2.1.jar
  datanucleus-core-3.2.2.jar
  datanucleus-rdbms-3.2.1.jar
  derby-10.4.2.0.jar
 53,57d35
  hive-common-0.12.0.jar
  hive-exec-0.12.0.jar
  hive-metastore-0.12.0.jar
  hive-serde-0.12.0.jar
  hive-shims-0.12.0.jar
 60,61d37
  httpclient-4.1.3.jar
  httpcore-4.1.3.jar
 68d43
  JavaEWAH-0.3.2.jar
 73d47
  javolution-5.5.1.jar
 76d49
  jdo-api-3.0.1.jar
 78d50
  jetty-6.1.26.jar
 87d58
  jetty-util-6.1.26.jar
 93d63
  json-20090211.jar
 98d67
  jta-1.1.jar
 103,104d71
  libfb303-0.9.0.jar
  libthrift-0.9.0.jar
 112d78
  mockito-all-1.8.5.jar
 136d101
  servlet-api-2.5-20081211.jar
 139d103
  snappy-0.2.jar
 144d107
  spark-hive_2.10-1.0.0.jar
 151d113
  ST4-4.0.4.jar
 153d114
  stringtemplate-3.2.1.jar
 156d116
  velocity-1.7.jar
 158d117
  xz-1.0.jar
 {code}
 Some initial investigation 

[jira] [Created] (SPARK-1819) Fix GetField.nullable.

2014-05-13 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-1819:


 Summary: Fix GetField.nullable.
 Key: SPARK-1819
 URL: https://issues.apache.org/jira/browse/SPARK-1819
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takuya Ueshin


{{GetField.nullable}} should be {{true}} not only when {{field.nullable}} is 
{{true}} but also when {{child.nullable}} is {{true}}.
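
A minimal, self-contained Scala sketch of the condition being described (the names 
mirror Catalyst's GetField, but this is illustrative only, not the actual patch):

{code}
// Illustrative stand-ins, not Spark's real classes.
trait Expression { def nullable: Boolean }
case class StructField(name: String, nullable: Boolean)

case class GetField(child: Expression, field: StructField) extends Expression {
  // Nullable when either the struct-typed child or the selected field is nullable.
  override def nullable: Boolean = child.nullable || field.nullable
}
{code}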



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1802) Audit dependency graph when Spark is built with -Phive

2014-05-13 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1802.


Resolution: Fixed

Issue resolved by pull request 744
[https://github.com/apache/spark/pull/744]

 Audit dependency graph when Spark is built with -Phive
 --

 Key: SPARK-1802
 URL: https://issues.apache.org/jira/browse/SPARK-1802
 Project: Spark
  Issue Type: Bug
Reporter: Patrick Wendell
Priority: Blocker
 Fix For: 1.0.0


 I'd like to have the binary release for 1.0 include Hive support. Since this 
 isn't enabled by default in the build, I don't think it's as well tested, so 
 we should dig around a bit and decide if we need to e.g. add any excludes.
 {code}
 $ mvn install -Phive -DskipTests && mvn dependency:build-classpath -pl 
 assembly | grep -v INFO | tr : "\n" | awk '{ FS="/"; print ( $(NF) ); }' 
 | sort > without_hive.txt
 $ mvn install -Phive -DskipTests && mvn dependency:build-classpath -Phive -pl 
 assembly | grep -v INFO | tr : "\n" | awk '{ FS="/"; print ( $(NF) ); }' 
 | sort > with_hive.txt
 $ diff without_hive.txt with_hive.txt
  antlr-2.7.7.jar
  antlr-3.4.jar
  antlr-runtime-3.4.jar
 10,14d6
  avro-1.7.4.jar
  avro-ipc-1.7.4.jar
  avro-ipc-1.7.4-tests.jar
  avro-mapred-1.7.4.jar
  bonecp-0.7.1.RELEASE.jar
 22d13
  commons-cli-1.2.jar
 25d15
  commons-compress-1.4.1.jar
 33,34d22
  commons-logging-1.1.1.jar
  commons-logging-api-1.0.4.jar
 38d25
  commons-pool-1.5.4.jar
 46,49d32
  datanucleus-api-jdo-3.2.1.jar
  datanucleus-core-3.2.2.jar
  datanucleus-rdbms-3.2.1.jar
  derby-10.4.2.0.jar
 53,57d35
  hive-common-0.12.0.jar
  hive-exec-0.12.0.jar
  hive-metastore-0.12.0.jar
  hive-serde-0.12.0.jar
  hive-shims-0.12.0.jar
 60,61d37
  httpclient-4.1.3.jar
  httpcore-4.1.3.jar
 68d43
  JavaEWAH-0.3.2.jar
 73d47
  javolution-5.5.1.jar
 76d49
  jdo-api-3.0.1.jar
 78d50
  jetty-6.1.26.jar
 87d58
  jetty-util-6.1.26.jar
 93d63
  json-20090211.jar
 98d67
  jta-1.1.jar
 103,104d71
  libfb303-0.9.0.jar
  libthrift-0.9.0.jar
 112d78
  mockito-all-1.8.5.jar
 136d101
  servlet-api-2.5-20081211.jar
 139d103
  snappy-0.2.jar
 144d107
  spark-hive_2.10-1.0.0.jar
 151d113
  ST4-4.0.4.jar
 153d114
  stringtemplate-3.2.1.jar
 156d116
  velocity-1.7.jar
 158d117
  xz-1.0.jar
 {code}
 Some initial investigation suggests we may need to take some precaution 
 surrounding (a) jetty and (b) servlet-api.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1760) mvn -Dsuites=* test throws a ClassNotFoundException

2014-05-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993493#comment-13993493
 ] 

Sean Owen commented on SPARK-1760:
--

If `wildcardSuites` lets you invoke specific suites across the whole project, 
then that sounds like an ideal solution. If it works then I'd propose that as a 
small doc change?

  mvn -Dsuites=* test throws a ClassNotFoundException
 --

 Key: SPARK-1760
 URL: https://issues.apache.org/jira/browse/SPARK-1760
 Project: Spark
  Issue Type: Bug
Reporter: Guoqiang Li
Assignee: Guoqiang Li
 Fix For: 1.0.0


 {{mvn -Dhadoop.version=0.23.9 -Phadoop-0.23 
 -Dsuites=org.apache.spark.repl.ReplSuite test}} =>
 {code}
 *** RUN ABORTED ***
   java.lang.ClassNotFoundException: org.apache.spark.repl.ReplSuite
   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
   at org.scalatest.tools.Runner$$anonfun$21.apply(Runner.scala:1470)
   at org.scalatest.tools.Runner$$anonfun$21.apply(Runner.scala:1469)
   at 
 scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
   at scala.collection.immutable.List.foreach(List.scala:318)
   ...
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1802) Audit dependency graph when Spark is built with -Phive

2014-05-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995815#comment-13995815
 ] 

Sean Owen edited comment on SPARK-1802 at 5/13/14 11:27 AM:


(Edited to fix comment about protobuf versions)

I looked further into just what might go wrong by including hive-exec into the 
assembly, since it includes its dependencies directly (i.e. Maven can't manage 
around it.)

Attached is a full dump of the conflicts.

The ones that are potential issues appear to be the following, and one looks 
like it could be a deal-breaker -- protobuf -- since it's neither forwards nor 
backwards compatible. That is, I recommend testing this assembly with a 
*newer* Hadoop that needs 2.5 and see if it croaks.

The rest might be worked around but need some additional mojo to make sure the 
right version wins in the packaging.

Certainly having hive-exec in the build is making me queasy!


[WARNING] hive-exec-0.12.0.jar, libthrift-0.9.0.jar define 153 overlappping 
classes: 

HBase includes libthrift-0.8.0, but it's in examples, and so I figure this is 
ignorable.


[WARNING] hive-exec-0.12.0.jar, commons-lang-2.4.jar define 2 overlappping 
classes: 

Probably ignorable, but we have to make sure commons-lang-3.3.2 'wins' in the 
build.


[WARNING] hive-exec-0.12.0.jar, jackson-core-asl-1.9.11.jar define 117 
overlappping classes: 
[WARNING] hive-exec-0.12.0.jar, jackson-mapper-asl-1.8.8.jar define 432 
overlappping classes: 

I believe these are ignorable. (Not sure why the Jackson versions are mismatched; 
another todo.)


[WARNING] hive-exec-0.12.0.jar, guava-14.0.1.jar define 1087 overlappping 
classes: 

Should be OK. Hive uses 11.0.2 like Hadoop; the build is already taking that 
particular risk. We need 14.0.1 to win.


[WARNING] hive-exec-0.12.0.jar, protobuf-java-2.4.1.jar define 204 overlappping 
classes: 

Oof. Hive has protobuf *2.4.1*. This has got to be a problem for newer Hadoop 
builds?

(Edited to fix comment about protobuf versions)


was (Author: srowen):
I looked further into just what might go wrong by including hive-exec into the 
assembly, since it includes its dependencies directly (i.e. Maven can't manage 
around it.)

Attached is a full dump of the conflicts.

The ones that are potential issues appear to be the following, and one looks 
like it could be a deal-breaker -- protobuf -- since it's neither forwards nor 
backwards compatible. That is, I recommend testing this assembly with an older 
Hadoop that needs 2.4.1 and see if it croaks.

The rest might be worked around but need some additional mojo to make sure the 
right version wins in the packaging.

Certainly having hive-exec in the build is making me queasy!


[WARNING] hive-exec-0.12.0.jar, libthrift-0.9.0.jar define 153 overlappping 
classes: 

HBase includes libthrift-0.8.0, but it's in examples, and so I figure this is 
ignorable.


[WARNING] hive-exec-0.12.0.jar, commons-lang-2.4.jar define 2 overlappping 
classes: 

Probably ignorable, but we have to make sure commons-lang-3.3.2 'wins' in the 
build.


[WARNING] hive-exec-0.12.0.jar, jackson-core-asl-1.9.11.jar define 117 
overlappping classes: 
[WARNING] hive-exec-0.12.0.jar, jackson-mapper-asl-1.8.8.jar define 432 
overlappping classes: 

I believe these are ignorable. (Not sure why the Jackson versions are mismatched; 
another todo.)


[WARNING] hive-exec-0.12.0.jar, guava-14.0.1.jar define 1087 overlappping 
classes: 

Should be OK. Hive uses 11.0.2 like Hadoop; the build is already taking that 
particular risk. We need 14.0.1 to win.


[WARNING] hive-exec-0.12.0.jar, protobuf-java-2.4.1.jar define 204 overlappping 
classes: 

Oof. Hive has protobuf 2.5.0. This has got to be a problem for older Hadoop 
builds?



 Audit dependency graph when Spark is built with -Phive
 --

 Key: SPARK-1802
 URL: https://issues.apache.org/jira/browse/SPARK-1802
 Project: Spark
  Issue Type: Bug
Reporter: Patrick Wendell
Assignee: Sean Owen
Priority: Blocker
 Fix For: 1.0.0

 Attachments: hive-exec-jar-problems.txt


 I'd like to have the binary release for 1.0 include Hive support. Since this 
 isn't enabled by default in the build, I don't think it's as well tested, so 
 we should dig around a bit and decide if we need to e.g. add any excludes.
 {code}
 $ mvn install -Phive -DskipTests && mvn dependency:build-classpath -pl 
 assembly | grep -v INFO | tr : "\n" | awk '{ FS="/"; print ( $(NF) ); }' 
 | sort > without_hive.txt
 $ mvn install -Phive -DskipTests && mvn dependency:build-classpath -Phive -pl 
 assembly | grep -v INFO | tr : "\n" | awk '{ FS="/"; print ( $(NF) ); }' 
 | sort > with_hive.txt
 $ diff without_hive.txt with_hive.txt
  antlr-2.7.7.jar
  antlr-3.4.jar
  antlr-runtime-3.4.jar
 10,14d6
  avro-1.7.4.jar
  

[jira] [Commented] (SPARK-1819) Fix GetField.nullable.

2014-05-13 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996201#comment-13996201
 ] 

Takuya Ueshin commented on SPARK-1819:
--

Pull-requested: https://github.com/apache/spark/pull/757

 Fix GetField.nullable.
 --

 Key: SPARK-1819
 URL: https://issues.apache.org/jira/browse/SPARK-1819
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takuya Ueshin

 {{GetField.nullable}} should be {{true}} not only when {{field.nullable}} is 
 {{true}} but also when {{child.nullable}} is {{true}}.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1680) Clean up use of setExecutorEnvs in SparkConf

2014-05-13 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1680:
---

Fix Version/s: (was: 1.0.0)
   1.1.0

 Clean up use of setExecutorEnvs in SparkConf 
 -

 Key: SPARK-1680
 URL: https://issues.apache.org/jira/browse/SPARK-1680
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Patrick Wendell
Priority: Blocker
 Fix For: 1.1.0


 We should make this consistent between YARN and Standalone. Basically, YARN 
 mode should just use the executorEnvs from the Spark conf and not need 
 SPARK_YARN_USER_ENV.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1809) Mesos backend doesn't respect HADOOP_CONF_DIR

2014-05-13 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-1809:
-

 Summary: Mesos backend doesn't respect HADOOP_CONF_DIR
 Key: SPARK-1809
 URL: https://issues.apache.org/jira/browse/SPARK-1809
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.0.0
Reporter: Andrew Ash


In order to use HDFS paths without the server component, standalone mode reads 
spark-env.sh and scans the HADOOP_CONF_DIR to open core-site.xml and get the 
fs.default.name parameter.

This lets you use HDFS paths like:
- hdfs:///tmp/myfile.txt
instead of
- hdfs://myserver.mydomain.com:8020/tmp/myfile.txt

However, as of a recent 1.0.0 pre-release (hash 756c96), I had to specify HDFS 
paths with the full server even though I still have HADOOP_CONF_DIR set in 
spark-env.sh.  The HDFS, Spark, and Mesos nodes are all co-located and 
non-domain HDFS paths work fine when using the standalone mode.
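
Until that is addressed, one possible stopgap (a sketch only, not a verified fix; 
the namenode address below is a placeholder mirroring the example above) is to set 
fs.default.name explicitly on the SparkContext's Hadoop configuration so that 
authority-less HDFS paths still resolve:

{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("mesos-hdfs-sketch"))

// Mirrors what core-site.xml under HADOOP_CONF_DIR would normally provide.
sc.hadoopConfiguration.set("fs.default.name", "hdfs://myserver.mydomain.com:8020")

val count = sc.textFile("hdfs:///tmp/myfile.txt").count()
{code}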



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1813) Add a utility to SparkConf that makes using Kryo really easy

2014-05-13 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-1813:
-

 Summary: Add a utility to SparkConf that makes using Kryo really 
easy
 Key: SPARK-1813
 URL: https://issues.apache.org/jira/browse/SPARK-1813
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Sandy Ryza


It would be nice to have a method in SparkConf that makes it really easy to use 
Kryo and register a set of classes. without defining you

Using Kryo currently requires all this:
{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
kryo.register(classOf[MyClass1])
kryo.register(classOf[MyClass2])
  }
}

val conf = new SparkConf().setMaster(...).setAppName(...)
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrator", "mypackage.MyRegistrator")
val sc = new SparkContext(conf)
{code}

It would be nice if it just required this:
{code}
SparkConf.setKryo(Array(classOf[MyFirstClass, classOf[MySecond]))
{code}
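
For comparison, a hypothetical helper (not an existing SparkConf method) could wrap 
today's boilerplate using only the two configuration keys shown above; a registrator 
class is still required:

{code}
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical convenience wrapper, shown only to illustrate the boilerplate.
object KryoConf {
  def useKryo(conf: SparkConf, registrator: Class[_ <: KryoRegistrator]): SparkConf = {
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    conf.set("spark.kryo.registrator", registrator.getName)
  }
}

// Usage: KryoConf.useKryo(new SparkConf().setAppName("app"), classOf[MyRegistrator])
{code}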



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1813) Add a utility to SparkConf that makes using Kryo really easy

2014-05-13 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996390#comment-13996390
 ] 

Mridul Muralidharan commented on SPARK-1813:


Writing a KryoRegistrator is the only requirement - the rest is done as part of 
initialization anyway.
Registering classes with Kryo is non-trivial except for degenerate cases: for 
example, we have classes that have to use Java read/write Object serialization, 
classes that support Kryo serialization, classes that support Java's external 
serialization, generated classes, etc.
And we would still need a registrator... Of course, it could be argued this is a 
corner case, though I don't think so.

 Add a utility to SparkConf that makes using Kryo really easy
 

 Key: SPARK-1813
 URL: https://issues.apache.org/jira/browse/SPARK-1813
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Sandy Ryza

 It would be nice to have a method in SparkConf that makes it really easy to 
 use Kryo and register a set of classes. without defining you
 Using Kryo currently requires all this:
 {code}
 import com.esotericsoftware.kryo.Kryo
 import org.apache.spark.serializer.KryoRegistrator
 class MyRegistrator extends KryoRegistrator {
   override def registerClasses(kryo: Kryo) {
 kryo.register(classOf[MyClass1])
 kryo.register(classOf[MyClass2])
   }
 }
 val conf = new SparkConf().setMaster(...).setAppName(...)
 conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
 conf.set("spark.kryo.registrator", "mypackage.MyRegistrator")
 val sc = new SparkContext(conf)
 {code}
 It would be nice if it just required this:
 {code}
 SparkConf.setKryo(Array(classOf[MyFirstClass, classOf[MySecond]))
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1808) bin/pyspark does not load default configuration properties

2014-05-13 Thread Andrew Or (JIRA)
Andrew Or created SPARK-1808:


 Summary: bin/pyspark does not load default configuration properties
 Key: SPARK-1808
 URL: https://issues.apache.org/jira/browse/SPARK-1808
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Andrew Or
 Fix For: 1.0.1


... because it doesn't go through spark-submit. Either we make it go through 
spark-submit (hard), or we extract the logic that loads the default configuration 
properties and set them for the JVM that launches the py4j GatewayServer (easier).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1810) The spark tar ball does not unzip into a separate folder when un-tarred.

2014-05-13 Thread Manikandan Narayanaswamy (JIRA)
Manikandan Narayanaswamy created SPARK-1810:
---

 Summary: The spark tar ball does not unzip into a separate folder 
when un-tarred.
 Key: SPARK-1810
 URL: https://issues.apache.org/jira/browse/SPARK-1810
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 0.9.0
 Environment: All environments
Reporter: Manikandan Narayanaswamy
Priority: Minor


All other Hadoop components, when extracted, are contained within a newly created 
folder, but this is not the case for Spark. The Spark tarball decompresses all 
files into the current working directory.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Closed] (SPARK-1821) Document History Server

2014-05-13 Thread Nan Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nan Zhu closed SPARK-1821.
--

Resolution: Implemented

sorry, missed some documents

 Document History Server
 ---

 Key: SPARK-1821
 URL: https://issues.apache.org/jira/browse/SPARK-1821
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 1.0.0
Reporter: Nan Zhu

 In 1.0, there is a new component, history server, which is not mentioned in 
 http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/
 I think we'd better add the missing document



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1813) Add a utility to SparkConf that makes using Kryo really easy

2014-05-13 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-1813:
--

Description: 
It would be nice to have a method in SparkConf that makes it really easy to 
turn on Kryo serialization and register a set of classes.

Using Kryo currently requires all this:
{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
kryo.register(classOf[MyClass1])
kryo.register(classOf[MyClass2])
  }
}

val conf = new SparkConf().setMaster(...).setAppName(...)
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrator", "mypackage.MyRegistrator")
val sc = new SparkContext(conf)
{code}

It would be nice if it just required this:
{code}
SparkConf.setKryo(Array(classOf[MyFirstClass, classOf[MySecond]))
{code}

  was:
It would be nice to have a method in SparkConf that makes it really easy to use 
Kryo and register a set of classes. without defining you

Using Kryo currently requires all this:
{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
kryo.register(classOf[MyClass1])
kryo.register(classOf[MyClass2])
  }
}

val conf = new SparkConf().setMaster(...).setAppName(...)
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrator", "mypackage.MyRegistrator")
val sc = new SparkContext(conf)
{code}

It would be nice if it just required this:
{code}
SparkConf.setKryo(Array(classOf[MyFirstClass, classOf[MySecond]))
{code}


 Add a utility to SparkConf that makes using Kryo really easy
 

 Key: SPARK-1813
 URL: https://issues.apache.org/jira/browse/SPARK-1813
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Sandy Ryza

 It would be nice to have a method in SparkConf that makes it really easy to 
 turn on Kryo serialization and register a set of classes.
 Using Kryo currently requires all this:
 {code}
 import com.esotericsoftware.kryo.Kryo
 import org.apache.spark.serializer.KryoRegistrator
 class MyRegistrator extends KryoRegistrator {
   override def registerClasses(kryo: Kryo) {
 kryo.register(classOf[MyClass1])
 kryo.register(classOf[MyClass2])
   }
 }
 val conf = new SparkConf().setMaster(...).setAppName(...)
 conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
 conf.set("spark.kryo.registrator", "mypackage.MyRegistrator")
 val sc = new SparkContext(conf)
 {code}
 It would be nice if it just required this:
 {code}
 SparkConf.setKryo(Array(classOf[MyFirstClass, classOf[MySecond]))
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1769) Executor loss can cause race condition in Pool

2014-05-13 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson updated SPARK-1769:
--

Assignee: Andrew Or  (was: Aaron Davidson)

 Executor loss can cause race condition in Pool
 --

 Key: SPARK-1769
 URL: https://issues.apache.org/jira/browse/SPARK-1769
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Aaron Davidson
Assignee: Andrew Or

 Loss of executors (in this case due to OOMs) exposes a race condition in 
 Pool.scala, evident from this stack trace:
 {code}
 14/05/08 22:41:48 ERROR OneForOneStrategy:
 java.lang.NullPointerException
 at 
 org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
 at 
 org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at org.apache.spark.scheduler.Pool.executorLost(Pool.scala:87)
 at 
 org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
 at 
 org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at org.apache.spark.scheduler.Pool.executorLost(Pool.scala:87)
 at 
 org.apache.spark.scheduler.TaskSchedulerImpl.removeExecutor(TaskSchedulerImpl.scala:412)
 at 
 org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:385)
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.removeExecutor(CoarseGrainedSchedulerBackend.scala:160)
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1$$anonfun$applyOrElse$5.apply(CoarseGrainedSchedulerBackend.scala:123)
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1$$anonfun$applyOrElse$5.apply(CoarseGrainedSchedulerBackend.scala:123)
 at scala.Option.foreach(Option.scala:236)
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:123)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at 
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 {code}
 Note that the line of code that throws this exception is here:
 {code}
 schedulableQueue.foreach(_.executorLost(executorId, host))
 {code}
 By the stack trace, it's not schedulableQueue that is null, but an element 
 therein. As far as I could tell, we never add a null element to this queue. 
 Rather, I could see that removeSchedulable() and executorLost() were called 
 at about the same time (via log messages), and suspect that since this 
 ArrayBuffer is in no way synchronized, that we iterate through the list while 
 it's in an incomplete state.
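 One way to avoid iterating the buffer while it is being modified (a sketch under 
 the assumption that coarse locking is acceptable here; the actual fix may differ) 
 is to guard every access to the queue with the same lock:
 {code}
 import scala.collection.mutable.ArrayBuffer

 // Minimal stand-ins for the scheduler types involved; not Spark's actual classes.
 trait Schedulable { def executorLost(executorId: String, host: String): Unit }

 class PoolSketch {
   private val schedulableQueue = new ArrayBuffer[Schedulable]

   def addSchedulable(s: Schedulable): Unit = schedulableQueue.synchronized {
     schedulableQueue += s
   }

   def removeSchedulable(s: Schedulable): Unit = schedulableQueue.synchronized {
     schedulableQueue -= s
   }

   def executorLost(executorId: String, host: String): Unit = schedulableQueue.synchronized {
     // Iteration can no longer interleave with a concurrent add or remove.
     schedulableQueue.foreach(_.executorLost(executorId, host))
   }
 }
 {code}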



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-1769) Executor loss can cause race condition in Pool

2014-05-13 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson reassigned SPARK-1769:
-

Assignee: Aaron Davidson

 Executor loss can cause race condition in Pool
 --

 Key: SPARK-1769
 URL: https://issues.apache.org/jira/browse/SPARK-1769
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Aaron Davidson
Assignee: Aaron Davidson

 Loss of executors (in this case due to OOMs) exposes a race condition in 
 Pool.scala, evident from this stack trace:
 {code}
 14/05/08 22:41:48 ERROR OneForOneStrategy:
 java.lang.NullPointerException
 at 
 org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
 at 
 org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at org.apache.spark.scheduler.Pool.executorLost(Pool.scala:87)
 at 
 org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
 at 
 org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at org.apache.spark.scheduler.Pool.executorLost(Pool.scala:87)
 at 
 org.apache.spark.scheduler.TaskSchedulerImpl.removeExecutor(TaskSchedulerImpl.scala:412)
 at 
 org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:385)
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.removeExecutor(CoarseGrainedSchedulerBackend.scala:160)
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1$$anonfun$applyOrElse$5.apply(CoarseGrainedSchedulerBackend.scala:123)
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1$$anonfun$applyOrElse$5.apply(CoarseGrainedSchedulerBackend.scala:123)
 at scala.Option.foreach(Option.scala:236)
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:123)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at 
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 {code}
 Note that the line of code that throws this exception is here:
 {code}
 schedulableQueue.foreach(_.executorLost(executorId, host))
 {code}
 By the stack trace, it's not schedulableQueue that is null, but an element 
 therein. As far as I could tell, we never add a null element to this queue. 
 Rather, I could see that removeSchedulable() and executorLost() were called 
 at about the same time (via log messages), and suspect that since this 
 ArrayBuffer is in no way synchronized, that we iterate through the list while 
 it's in an incomplete state.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1817) RDD zip erroneous when partitions do not divide RDD count

2014-05-13 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996818#comment-13996818
 ] 

Kan Zhang commented on SPARK-1817:
--

PR: https://github.com/apache/spark/pull/760

 RDD zip erroneous when partitions do not divide RDD count
 -

 Key: SPARK-1817
 URL: https://issues.apache.org/jira/browse/SPARK-1817
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Michael Malak
Assignee: Kan Zhang
Priority: Blocker

 Example:
 scala> sc.parallelize(1L to 2L,4).zip(sc.parallelize(11 to 12,4)).collect
 res1: Array[(Long, Int)] = Array((2,11))
 But more generally, it's whenever the number of partitions does not evenly 
 divide the total number of elements in the RDD.
 See https://groups.google.com/forum/#!msg/spark-users/demrmjHFnoc/Ek3ijiXHr2MJ



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1821) Document History Server

2014-05-13 Thread Nan Zhu (JIRA)
Nan Zhu created SPARK-1821:
--

 Summary: Document History Server
 Key: SPARK-1821
 URL: https://issues.apache.org/jira/browse/SPARK-1821
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.0.0
Reporter: Nan Zhu


In 1.0, there is a new component in the standalone mode, history server, which 
is not mentioned in 
http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/spark-standalone.html

I think we'd better add the missing document



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1821) Document History Server

2014-05-13 Thread Nan Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nan Zhu updated SPARK-1821:
---

Issue Type: Improvement  (was: Bug)

 Document History Server
 ---

 Key: SPARK-1821
 URL: https://issues.apache.org/jira/browse/SPARK-1821
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 1.0.0
Reporter: Nan Zhu

 In 1.0, there is a new component in the standalone mode, history server, 
 which is not mentioned in 
 http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/spark-standalone.html
 I think we'd better add the missing document



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1813) Add a utility to SparkConf that makes using Kryo really easy

2014-05-13 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-1813:
--

Description: 
It would be nice to have a method in SparkConf that makes it really easy to 
turn on Kryo serialization and register a set of classes.

Using Kryo currently requires all this:
{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
kryo.register(classOf[MyClass1])
kryo.register(classOf[MyClass2])
  }
}

val conf = new SparkConf().setMaster(...).setAppName(...)
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrator", "mypackage.MyRegistrator")
val sc = new SparkContext(conf)
{code}

It would be nice if it just required this:
{code}
SparkConf.setKryo(Array(classOf[MyClass1], classOf[MyClass2]))
{code}

  was:
It would be nice to have a method in SparkConf that makes it really easy to 
turn on Kryo serialization and register a set of classes.

Using Kryo currently requires all this:
{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
kryo.register(classOf[MyClass1])
kryo.register(classOf[MyClass2])
  }
}

val conf = new SparkConf().setMaster(...).setAppName(...)
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrator", "mypackage.MyRegistrator")
val sc = new SparkContext(conf)
{code}

It would be nice if it just required this:
{code}
SparkConf.setKryo(Array(classOf[MyFirstClass, classOf[MySecond]))
{code}


 Add a utility to SparkConf that makes using Kryo really easy
 

 Key: SPARK-1813
 URL: https://issues.apache.org/jira/browse/SPARK-1813
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Sandy Ryza

 It would be nice to have a method in SparkConf that makes it really easy to 
 turn on Kryo serialization and register a set of classes.
 Using Kryo currently requires all this:
 {code}
 import com.esotericsoftware.kryo.Kryo
 import org.apache.spark.serializer.KryoRegistrator
 class MyRegistrator extends KryoRegistrator {
   override def registerClasses(kryo: Kryo) {
 kryo.register(classOf[MyClass1])
 kryo.register(classOf[MyClass2])
   }
 }
 val conf = new SparkConf().setMaster(...).setAppName(...)
 conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
 conf.set("spark.kryo.registrator", "mypackage.MyRegistrator")
 val sc = new SparkContext(conf)
 {code}
 It would be nice if it just required this:
 {code}
 SparkConf.setKryo(Array(classOf[MyClass1], classOf[MyClass2]))
 {code}
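 A possible shape for such a helper, sketched here with hypothetical names (the 
 spark.kryo.classesToRegister key is an assumption for illustration, not an 
 existing 1.0 setting, and a registrator would still have to read it):
 {code}
 import org.apache.spark.SparkConf

 // Hypothetical helper, for illustration only.
 object KryoConf {
   def setKryo(conf: SparkConf, classes: Seq[Class[_]]): SparkConf = {
     conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
     conf.set("spark.kryo.classesToRegister", classes.map(_.getName).mkString(","))
   }
 }
 {code}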



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1789) Multiple versions of Netty dependencies cause FlumeStreamSuite failure

2014-05-13 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995560#comment-13995560
 ] 

William Benton commented on SPARK-1789:
---

Sean, we're currently building against Akka 2.3.0 in Fedora (it's a trivial 
source patch against 0.9.1; I haven't investigated the delta against 1.0 yet).  
Are there reasons why Akka 2.3.0 is a bad idea for Spark in general?  If not, 
I'm happy to file a JIRA for updating the dependency and contribute my patch 
upstream.

 Multiple versions of Netty dependencies cause FlumeStreamSuite failure
 --

 Key: SPARK-1789
 URL: https://issues.apache.org/jira/browse/SPARK-1789
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 0.9.1
Reporter: Sean Owen
Assignee: Sean Owen
  Labels: flume, netty, test
 Fix For: 1.0.0


 TL;DR: there is a bit of JAR hell trouble with Netty that can mostly be 
 resolved, and resolving it fixes a test failure.
 I hit the error described at 
 http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-startup-time-out-td1753.html
  while running FlumeStreamSuite, and have been for a short while (is it just 
 me?)
 velvia notes:
 I have found a workaround. If you add akka 2.2.4 to your dependencies, then 
 everything works, probably because akka 2.2.4 brings in a newer version of 
 Netty.
 There are at least 3 versions of Netty in play in the build:
 - the new Flume 1.4.0 dependency brings in io.netty:netty:3.4.0.Final, and 
 that is the immediate problem
 - the custom version of akka 2.2.3 depends on io.netty:netty:3.6.6.
 - but, Spark Core directly uses io.netty:netty-all:4.0.17.Final
 The POMs try to exclude other versions of netty, but are excluding 
 org.jboss.netty:netty, when in fact older versions of io.netty:netty (not 
 netty-all) are also an issue.
 The org.jboss.netty:netty excludes are largely unnecessary. I replaced many 
 of them with io.netty:netty exclusions until everything agreed on 
 io.netty:netty-all:4.0.17.Final.
 But this didn't work, since Akka 2.2.3 doesn't work with Netty 4.x. 
 Down-grading to 3.6.6.Final across the board made some Spark code not compile.
 If the build *keeps* io.netty:netty:3.6.6.Final as well, everything seems to 
 work. Part of the reason seems to be that Netty 3.x used the old 
 `org.jboss.netty` packages. This is less than ideal, but is no worse than the 
 current situation. 
 So this PR resolves the issue and improves the JAR hell, even if it leaves 
 the existing theoretical Netty 3-vs-4 conflict:
 - Remove org.jboss.netty excludes where possible, for clarity; they're not 
 needed except with Hadoop artifacts
 - Add io.netty:netty excludes where needed -- except, let akka keep its 
 io.netty:netty
 - Change a bit of test code that actually depended on Netty 3.x, to use 4.x 
 equivalent
 - Update SBT build accordingly
 A better change would be to update Akka far enough such that it agrees on 
 Netty 4.x, but I don't know if that's feasible.
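 For illustration, the kind of exclusion described above might look roughly 
 like this in build.sbt, assuming flume-ng-sdk is the artifact pulling in the 
 old io.netty:netty (a sketch, not the actual build change):
 {code}
 // Sketch only: drop the old io.netty:netty that Flume 1.4.0 drags in,
 // while leaving akka's own netty dependency alone.
 libraryDependencies += ("org.apache.flume" % "flume-ng-sdk" % "1.4.0")
   .exclude("io.netty", "netty")
 {code}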



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1817) RDD zip erroneous when partitions do not divide RDD count

2014-05-13 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-1817:
-

Affects Version/s: 1.0.0

 RDD zip erroneous when partitions do not divide RDD count
 -

 Key: SPARK-1817
 URL: https://issues.apache.org/jira/browse/SPARK-1817
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0, 1.0.0
Reporter: Michael Malak
Assignee: Kan Zhang
Priority: Blocker

 Example:
 scala> sc.parallelize(1L to 2L, 4).zip(sc.parallelize(11 to 12, 4)).collect
 res1: Array[(Long, Int)] = Array((2,11))
 More generally, the problem occurs whenever the number of partitions does not 
 evenly divide the total number of elements in the RDD.
 See https://groups.google.com/forum/#!msg/spark-users/demrmjHFnoc/Ek3ijiXHr2MJ
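 Until this is fixed, one workaround sketch is to pair elements by an explicit 
 index instead of relying on partition-wise zip (assumes both RDDs have the 
 same length and a spark-shell session where sc is defined):
 {code}
 // Key both RDDs by element index, then join; partitioning no longer matters.
 val left  = sc.parallelize(1L to 2L, 4).zipWithIndex().map(_.swap)
 val right = sc.parallelize(11 to 12, 4).zipWithIndex().map(_.swap)
 val zipped = left.join(right).values.collect()
 // zipped should contain (1,11) and (2,12)
 {code}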



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1708) Add ClassTag parameter on accumulator and broadcast methods

2014-05-13 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1708.


   Resolution: Fixed
Fix Version/s: (was: 1.0.0)
   1.1.0

Issue resolved by pull request 700
[https://github.com/apache/spark/pull/700]

 Add ClassTag parameter on accumulator and broadcast methods
 ---

 Key: SPARK-1708
 URL: https://issues.apache.org/jira/browse/SPARK-1708
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Matei Zaharia
Assignee: Matei Zaharia
Priority: Blocker
 Fix For: 1.1.0


 ClassTags will be needed by some serializers, such as a Scala Pickling-based 
 one, to come up with efficient serialization. We need to add them on 
 Broadcast and probably also on Accumulator and Accumulable. Since we're 
 freezing the public API in 1.0, we have to do this before the release.
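 For context, a rough sketch of what a ClassTag bound buys (a generic 
 illustration, not the actual Spark signatures):
 {code}
 import scala.reflect.ClassTag

 // With a ClassTag bound, the wrapper can recover T's runtime class, which a
 // serializer such as a Scala Pickling-based one needs.
 class Holder[T: ClassTag](val value: T) {
   def runtimeClass: Class[_] = implicitly[ClassTag[T]].runtimeClass
 }

 val h = new Holder(Array(1, 2, 3))
 println(h.runtimeClass)   // prints: class [I
 {code}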



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1283) Create spark-contrib repo for 1.0

2014-05-13 Thread Evan Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13992974#comment-13992974
 ] 

Evan Chan commented on SPARK-1283:
--

Ping. Any more comments? Any objections to creating a landing page for contrib 
projects in the Spark docs?

 Create spark-contrib repo for 1.0
 -

 Key: SPARK-1283
 URL: https://issues.apache.org/jira/browse/SPARK-1283
 Project: Spark
  Issue Type: Task
  Components: Project Infra
Affects Versions: 1.0.0
Reporter: Evan Chan
 Fix For: 1.0.0


 Let's create a spark-contrib repo to host community projects for the Spark 
 ecosystem that don't quite belong in core, but are very important 
 nevertheless.
 It would be linked to from the official Spark documentation and web site, and 
 would help provide visibility for community projects.
 Some questions:
 - Who should host this repo, and where should it be hosted?
 - GitHub would be a strong preference from a usability standpoint
 - There is talk that Apache might have some facility for this
 - Contents.  Should it simply be links?  Git submodules?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-571) Forbid return statements when cleaning closures

2014-05-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-571.
---

Resolution: Fixed

 Forbid return statements when cleaning closures
 ---

 Key: SPARK-571
 URL: https://issues.apache.org/jira/browse/SPARK-571
 Project: Spark
  Issue Type: Improvement
Reporter: tjhunter
Assignee: William Benton
 Fix For: 1.1.0


 By mistake, I wrote some code like this:
 {code}
 object Foo {
  def main() {
val sc = new SparkContext(...)
sc.parallelize(0 to 10,10).map({  ... return 1 ... }).collect
  }
 }
 {code}
 This compiles fine and actually runs using the local scheduler. However, 
 using the Mesos scheduler throws a NotSerializableException in the 
 CollectTask. I agree the result of the program above should be undefined, or 
 it should be an error. Would it be possible to have more explicit messages?
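 For comparison, the same logic without a return statement inside the closure 
 behaves identically on the local and cluster schedulers (a minimal sketch, 
 assuming a spark-shell session):
 {code}
 // Make the mapped value the last expression of the closure instead of
 // using return.
 val result = sc.parallelize(0 to 10, 10).map { _ => 1 }.collect()
 {code}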



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1736) spark-submit on Windows

2014-05-13 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1736:
---

Assignee: Andrew Or

 spark-submit on Windows
 ---

 Key: SPARK-1736
 URL: https://issues.apache.org/jira/browse/SPARK-1736
 Project: Spark
  Issue Type: Improvement
  Components: Windows
Reporter: Matei Zaharia
Assignee: Andrew Or
Priority: Blocker
 Fix For: 1.0.0


 - spark-submit needs a Windows version (shouldn't be too hard, it's just 
 launching a Java process)
 - spark-shell.cmd needs to run through spark-submit like it does on Unix



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1753) PySpark on YARN does not work on assembly jar built on Red Hat based OS

2014-05-13 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1753:
---

Assignee: Andrew Or

 PySpark on YARN does not work on assembly jar built on Red Hat based OS
 ---

 Key: SPARK-1753
 URL: https://issues.apache.org/jira/browse/SPARK-1753
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 1.0.0
Reporter: Andrew Or
Assignee: Andrew Or
 Fix For: 1.0.0


 If the jar is built on a Red Hat based OS, the additional python files 
 included in the jar cannot be accessed. This means PySpark doesn't work on 
 YARN because in this mode it relies on the python files within this jar.
 I have confirmed that my Java, Scala, and Maven versions are all exactly the 
 same in my CentOS environment and in my local OS X environment, yet the former 
 does not work. Thomas Graves also struggled with the same problem.
 Until a fix is found, we should at the very least document this peculiarity.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1793) Heavily duplicated test setup code in SVMSuite

2014-05-13 Thread Andrew Tulloch (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Tulloch resolved SPARK-1793.
---

   Resolution: Fixed
Fix Version/s: 1.0.0

 Heavily duplicated test setup code in SVMSuite
 --

 Key: SPARK-1793
 URL: https://issues.apache.org/jira/browse/SPARK-1793
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Andrew Tulloch
Priority: Minor
 Fix For: 1.0.0


 Refactor the code to remove the repeated initialization of test/validation 
 RDDs in every test.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1823) ExternalAppendOnlyMap can still OOM if one key is very large

2014-05-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-1823:
-

Affects Version/s: 1.0.0

 ExternalAppendOnlyMap can still OOM if one key is very large
 

 Key: SPARK-1823
 URL: https://issues.apache.org/jira/browse/SPARK-1823
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Andrew Or
 Fix For: 1.1.0


 If the values for one key do not collectively fit into memory, then the map 
 will still OOM when you merge the spilled contents back in.
 This is a problem especially for PySpark, since we hash the keys (Python 
 objects) before a shuffle, and there are only so many integers out there in 
 the world, so there could potentially be many collisions.
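 A minimal sketch of the skew scenario described above (illustrative sizes, 
 assuming a spark-shell session):
 {code}
 // Every value maps to the same key, so all of them have to be merged back
 // into memory for that single key after spilling.
 val skewed = sc.parallelize(1 to 10000000, 100).map(i => ("hot-key", i))
 skewed.groupByKey().mapValues(_.size).collect()
 {code}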



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1816) LiveListenerBus dies if a listener throws an exception

2014-05-13 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1816.


   Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 759
[https://github.com/apache/spark/pull/759]

 LiveListenerBus dies if a listener throws an exception
 --

 Key: SPARK-1816
 URL: https://issues.apache.org/jira/browse/SPARK-1816
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Aaron Davidson
Assignee: Andrew Or
Priority: Critical
 Fix For: 1.0.0


 The exception isn't even printed.
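 A hedged sketch of the kind of guard this implies (hypothetical names, not 
 the actual LiveListenerBus code): a listener that throws should be logged and 
 skipped rather than allowed to kill the bus thread.
 {code}
 // Dispatch to each listener inside a try/catch so one bad listener cannot
 // take down the whole bus, and its exception is at least printed.
 def postToAll[L](listeners: Seq[L], event: Any)(dispatch: (L, Any) => Unit): Unit = {
   listeners.foreach { listener =>
     try {
       dispatch(listener, event)
     } catch {
       case e: Exception =>
         System.err.println(s"Listener ${listener.getClass.getName} threw an exception: $e")
     }
   }
 }
 {code}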



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1769) Executor loss can cause race condition in Pool

2014-05-13 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997079#comment-13997079
 ] 

Andrew Or commented on SPARK-1769:
--

https://github.com/apache/spark/pull/762

 Executor loss can cause race condition in Pool
 --

 Key: SPARK-1769
 URL: https://issues.apache.org/jira/browse/SPARK-1769
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Aaron Davidson
Assignee: Andrew Or

 Loss of executors (in this case due to OOMs) exposes a race condition in 
 Pool.scala, evident from this stack trace:
 {code}
 14/05/08 22:41:48 ERROR OneForOneStrategy:
 java.lang.NullPointerException
 at 
 org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
 at 
 org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at org.apache.spark.scheduler.Pool.executorLost(Pool.scala:87)
 at 
 org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
 at 
 org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at org.apache.spark.scheduler.Pool.executorLost(Pool.scala:87)
 at 
 org.apache.spark.scheduler.TaskSchedulerImpl.removeExecutor(TaskSchedulerImpl.scala:412)
 at 
 org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:385)
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.removeExecutor(CoarseGrainedSchedulerBackend.scala:160)
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1$$anonfun$applyOrElse$5.apply(CoarseGrainedSchedulerBackend.scala:123)
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1$$anonfun$applyOrElse$5.apply(CoarseGrainedSchedulerBackend.scala:123)
 at scala.Option.foreach(Option.scala:236)
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:123)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at 
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 {code}
 Note that the line of code that throws this exception is here:
 {code}
 schedulableQueue.foreach(_.executorLost(executorId, host))
 {code}
 By the stack trace, it's not schedulableQueue that is null, but an element 
 therein. As far as I could tell, we never add a null element to this queue. 
 Rather, I could see that removeSchedulable() and executorLost() were called 
 at about the same time (via log messages), and suspect that, since this 
 ArrayBuffer is in no way synchronized, we iterate through the list while 
 it's in an incomplete state.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1824) Python examples still take in master

2014-05-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-1824:
-

Affects Version/s: 1.0.0

 Python examples still take in master
 --

 Key: SPARK-1824
 URL: https://issues.apache.org/jira/browse/SPARK-1824
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Andrew Or

 A recent commit 
 https://github.com/apache/spark/commit/44dd57fb66bb676d753ad8d9757f9f4c03364113
 changed the existing Scala and Java examples so that they no longer take the 
 master as an argument. We forgot to do the same for Python.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1769) Executor loss can cause race condition in Pool

2014-05-13 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-1769:
-

 Summary: Executor loss can cause race condition in Pool
 Key: SPARK-1769
 URL: https://issues.apache.org/jira/browse/SPARK-1769
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Aaron Davidson


Loss of executors (in this case due to OOMs) exposes a race condition in 
Pool.scala, evident from this stack trace:

{code}
14/05/08 22:41:48 ERROR OneForOneStrategy:
java.lang.NullPointerException
at 
org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
at 
org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.Pool.executorLost(Pool.scala:87)
at 
org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
at 
org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.Pool.executorLost(Pool.scala:87)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.removeExecutor(TaskSchedulerImpl.scala:412)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:385)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.removeExecutor(CoarseGrainedSchedulerBackend.scala:160)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1$$anonfun$applyOrElse$5.apply(CoarseGrainedSchedulerBackend.scala:123)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1$$anonfun$applyOrElse$5.apply(CoarseGrainedSchedulerBackend.scala:123)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:123)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{code}

Note that the line of code that throws this exception is here:
{code}
schedulableQueue.foreach(_.executorLost(executorId, host))
{code}

By the stack trace, it's not schedulableQueue that is null, but an element 
therein. As far as I could tell, we never add a null element to this queue. 
Rather, I could see that removeSchedulable() and executorLost() were called at 
about the same time (via log messages), and suspect that, since this 
ArrayBuffer is in no way synchronized, we iterate through the list while 
it's in an incomplete state.
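One possible direction, sketched under that assumption (illustrative only, not 
the actual Pool.scala change): keep the schedulables in a concurrent collection 
so removeSchedulable() and executorLost() can run at the same time without 
exposing a partially updated list.

{code}
import java.util.concurrent.ConcurrentLinkedQueue
import scala.collection.JavaConverters._

// Sketch: a thread-safe queue of schedulables (names are illustrative).
class SafeSchedulableQueue[S] {
  private val queue = new ConcurrentLinkedQueue[S]()
  def addSchedulable(s: S): Unit = queue.add(s)
  def removeSchedulable(s: S): Unit = queue.remove(s)
  def executorLost(executorId: String, host: String)(onLost: (S, String, String) => Unit): Unit =
    queue.asScala.foreach(onLost(_, executorId, host))
}
{code}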



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1283) Create spark-contrib repo for 1.0

2014-05-13 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993076#comment-13993076
 ] 

Patrick Wendell commented on SPARK-1283:


[~velvia] Yeah - do you want to submit a PR? This seems like a good idea to me.

 Create spark-contrib repo for 1.0
 -

 Key: SPARK-1283
 URL: https://issues.apache.org/jira/browse/SPARK-1283
 Project: Spark
  Issue Type: Task
  Components: Project Infra
Affects Versions: 1.0.0
Reporter: Evan Chan
 Fix For: 1.0.0


 Let's create a spark-contrib repo to host community projects for the Spark 
 ecosystem that don't quite belong in core, but are very important 
 nevertheless.
 It would be linked to from the official Spark documentation and web site, and 
 would help provide visibility for community projects.
 Some questions:
 - Who should host this repo, and where should it be hosted?
 - GitHub would be a strong preference from a usability standpoint
 - There is talk that Apache might have some facility for this
 - Contents.  Should it simply be links?  Git submodules?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1825) Windows Spark fails to work with Linux YARN

2014-05-13 Thread Taeyun Kim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Taeyun Kim updated SPARK-1825:
--

Fix Version/s: 1.0.0

 Windows Spark fails to work with Linux YARN
 ---

 Key: SPARK-1825
 URL: https://issues.apache.org/jira/browse/SPARK-1825
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Taeyun Kim
 Fix For: 1.0.0


 Windows Spark fails to work with Linux YARN.
 This is a cross-platform problem.
 On the YARN side, Hadoop 2.4.0 resolved the issue as follows:
 https://issues.apache.org/jira/browse/YARN-1824
 But the Spark YARN module does not incorporate the new YARN API yet, so the 
 problem persists for Spark.
 First, the following source files should be changed:
 - /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala
 - 
 /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnableUtil.scala
 The changes are as follows (a sketch follows this description):
 - Replace .$() with .$$()
 - Replace File.pathSeparator for Environment.CLASSPATH.name with 
 ApplicationConstants.CLASS_PATH_SEPARATOR (importing 
 org.apache.hadoop.yarn.api.ApplicationConstants is required for this)
 Unless the above changes are applied, launch_container.sh will contain invalid 
 shell script statements (since they will contain Windows-specific separators), 
 and the job will fail.
 Also, the following symptoms should be fixed (I could not find the relevant 
 source code):
 - The SPARK_HOME environment variable is copied straight into 
 launch_container.sh. It should be changed to the path format of the server OS, 
 or, better yet, a separate environment variable or configuration variable 
 should be created.
 - The '%HADOOP_MAPRED_HOME%' string still exists in launch_container.sh after 
 the above change is applied; maybe I missed a few lines.
 I'm not sure whether this is all, since I'm new to both Spark and YARN.
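 A small sketch of the replacement described in the list above, assuming the 
 Hadoop 2.4.0 APIs introduced by YARN-1824 (not the actual Spark patch):
 {code}
 import org.apache.hadoop.yarn.api.ApplicationConstants
 import org.apache.hadoop.yarn.api.ApplicationConstants.Environment

 // Build the classpath entry with the cross-platform variable expansion ($$())
 // and separator constant instead of the client OS's File.pathSeparator, so a
 // Windows client can submit to a Linux YARN cluster.
 val classPathEntry: String =
   Environment.CLASSPATH.$$() + ApplicationConstants.CLASS_PATH_SEPARATOR + "./*"
 {code}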



--
This message was sent by Atlassian JIRA
(v6.2#6252)