date:20140718


 [ 
https://issues.apache.org/jira/browse/SPARK-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2568:
---

Assignee: Xiangrui Meng

 RangePartitioner should go through the data only once
 -

 Key: SPARK-2568
 URL: https://issues.apache.org/jira/browse/SPARK-2568
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Reynold Xin
Assignee: Xiangrui Meng

 As of Spark 1.0, RangePartitioner goes through data twice: once to compute 
 the count and once to do sampling. As a result, to do sortByKey, Spark goes 
 through data 3 times (once to count, once to sample, and once to sort).
 RangePartitioner should go through data only once (remove the count step).
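
 One way to drop a separate count step is to sample and count in the same pass. The following is only a minimal sketch of that idea using reservoir sampling; the function name and parameters are hypothetical, and it is not taken from Spark's RangePartitioner code.

```python
import random


def sample_and_count(items, k, seed=0):
    """Reservoir-sample up to k items and count the input in a single pass."""
    rng = random.Random(seed)
    reservoir = []
    n = 0
    for item in items:
        n += 1
        if len(reservoir) < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / n by overwriting a random slot.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir, n


# The input is consumed once; both the sample and the count come back together.
sample, count = sample_and_count(iter(range(1000)), k=10)
```

 Because the count falls out of the sampling loop, a sort would then need two passes (sample, then sort) rather than three.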



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Created] (SPARK-2572) Can't delete local dir on executor automatically when running spark over Mesos.

2014-07-18 Thread Yadong Qi (JIRA)

Yadong Qi created SPARK-2572:


 Summary: Can't delete local dir on executor automatically when 
running spark over Mesos.
 Key: SPARK-2572
 URL: https://issues.apache.org/jira/browse/SPARK-2572
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Yadong Qi


When running Spark over Mesos in either “fine-grained” or “coarse-grained” mode, 
the local dir (/tmp/spark-local-20140718114058-834c) on the executor is not 
deleted automatically after the application has finished.
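
For illustration only, the general pattern of tying a scratch directory's lifetime to the process that created it can be sketched as below. This is not how the Spark executor or Mesos manages its local dirs; `create_local_dir` and the file name are hypothetical, and an exit hook like this does not run if the process is killed with SIGKILL.

```python
import atexit
import shutil
import tempfile
from pathlib import Path


def create_local_dir(prefix="spark-local-"):
    """Create a scratch directory and register its removal at interpreter exit."""
    path = Path(tempfile.mkdtemp(prefix=prefix))
    # ignore_errors=True keeps the hook quiet if the directory is already gone.
    atexit.register(shutil.rmtree, path, ignore_errors=True)
    return path


local_dir = create_local_dir()
(local_dir / "shuffle_0_0_0").write_text("scratch data")
```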




[jira] [Created] (SPARK-2573) This file make-distribution.sh has an error, please fix it

2014-07-18 Thread JIRA

王金子 created SPARK-2573:
--

 Summary: This file make-distribution.sh has an error, please fix 
it
 Key: SPARK-2573
 URL: https://issues.apache.org/jira/browse/SPARK-2573
 Project: Spark
  Issue Type: Bug
Reporter: 王金子







[jira] [Updated] (SPARK-2573) This file make-distribution.sh has an error, please fix it

2014-07-18 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SPARK-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

王金子 updated SPARK-2573:
---

  Component/s: Build
  Description: 
Line 61: 
echo Error: '--with-hive' is no longer supported, use Maven option -Pyarn.

The message should say to use -Phive instead. Apart from that, I preferred the 
old options; why were they changed?
 Target Version/s: 1.0.0
Affects Version/s: 1.0.0
   Labels: build  (was: )
  Summary: This file make-distribution.sh has an error, please 
fix it  (was: This file make-distribution.sh has an error, please fix it)

 This file make-distribution.sh has an error, please fix it
 

 Key: SPARK-2573
 URL: https://issues.apache.org/jira/browse/SPARK-2573
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.0
Reporter: 王金子
  Labels: build

 Line 61: 
 echo Error: '--with-hive' is no longer supported, use Maven option -Pyarn.
 The message should say to use -Phive instead. Apart from that, I preferred the 
 old options; why were they changed?




[jira] [Created] (SPARK-2574) Avoid allocating new ArrayBuffer in groupByKey's mergeCombiner

2014-07-18 Thread Sandy Ryza (JIRA)

Sandy Ryza created SPARK-2574:
-

 Summary: Avoid allocating new ArrayBuffer in groupByKey's 
mergeCombiner
 Key: SPARK-2574
 URL: https://issues.apache.org/jira/browse/SPARK-2574
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Sandy Ryza







[jira] [Commented] (SPARK-2491) When an OOM is thrown,the executor does not stop properly.

2014-07-18 Thread Guoqiang Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066176#comment-14066176
 ] 

Guoqiang Li commented on SPARK-2491:


[~sarutak]  I submitted a patch.   https://github.com/apache/spark/pull/1482

 When an OOM is thrown,the executor does not stop properly.
 --

 Key: SPARK-2491
 URL: https://issues.apache.org/jira/browse/SPARK-2491
 Project: Spark
  Issue Type: Bug
Reporter: Guoqiang Li

 The executor log:
 {code}
 #
 # java.lang.OutOfMemoryError: Java heap space
 # -XX:OnOutOfMemoryError=kill %p
 #   Executing /bin/sh -c kill 44942...
 14/07/15 10:38:29 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: 
 SIGTERM
 14/07/15 10:38:29 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception 
 in thread Thread[Connection manager future execution context-6,5,main]
 java.lang.OutOfMemoryError: Java heap space
 at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
 at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
 at org.apache.spark.storage.BlockMessage.set(BlockMessage.scala:94)
 at 
 org.apache.spark.storage.BlockMessage$.fromByteBuffer(BlockMessage.scala:176)
 at 
 org.apache.spark.storage.BlockMessageArray.set(BlockMessageArray.scala:63)
 at 
 org.apache.spark.storage.BlockMessageArray$.fromBufferMessage(BlockMessageArray.scala:109)
 at 
 org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:125)
 at 
 org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:122)
 at 
 scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117)
 at 
 scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115)
 at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 14/07/15 10:38:29 WARN HadoopRDD: Exception in RecordReader.close()
 java.io.IOException: Filesystem closed
 at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
 at 
 org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619)
 at java.io.FilterInputStream.close(FilterInputStream.java:181)
 at org.apache.hadoop.util.LineReader.close(LineReader.java:150)
 at 
 org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:243)
 at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226)
 at 
 org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
 at 
 org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197)
 at 
 org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
 at 
 org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
 at org.apache.spark.scheduler.Task.run(Task.scala:51)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 -
 14/07/15 10:38:30 INFO Executor: Running task ID 969
 14/07/15 10:38:30 INFO BlockManager: Found block broadcast_0 locally
 14/07/15 10:38:30 INFO HadoopRDD: Input split: 
 hdfs://10dian72.domain.test:8020/input/lbs/recommend/toona/rating/20140712/part-7:0+68016537
 14/07/15 10:38:30 ERROR Executor: Exception in task ID 969
 java.io.FileNotFoundException: 
 /yarn/nm/usercache/spark/appcache/application_1404728465401_0070/spark-local-20140715103235-ffda/2e/merged_shuffle_4_85_0
  (No such file or directory)
 at java.io.FileOutputStream.open(Native Method)
 at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
 at 
 org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:116)
 at

[jira] [Commented] (SPARK-1662) PySpark fails if python class is used as a data container

2014-07-18 Thread Chandan Kumar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066201#comment-14066201
 ] 

Chandan Kumar commented on SPARK-1662:
--

Sounds good. Closing the issue.

 PySpark fails if python class is used as a data container
 -

 Key: SPARK-1662
 URL: https://issues.apache.org/jira/browse/SPARK-1662
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.0
 Environment: Ubuntu 14, Python 2.7.6
Reporter: Chandan Kumar
Priority: Minor

 PySpark fails if RDD operations are performed on data encapsulated in Python 
 objects (rare use case where plain python objects are used as data containers 
 instead of regular dict or tuples).
 I have written a small piece of code to reproduce the bug:
 https://gist.github.com/nrchandan/11394440




[jira] [Created] (SPARK-2576) slave node throw NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file

Svend created SPARK-2576:


 Summary: slave node throw NoClassDefFoundError $line11.$read$ when 
executing a Spark QL query on HDFS CSV file
 Key: SPARK-2576
 URL: https://issues.apache.org/jira/browse/SPARK-2576
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.0.1
 Environment: 
One Mesos 0.19 master without zookeeper and 4 mesos slaves. 

JDK 1.7.51 and Scala 2.10.4 on all nodes. 

HDFS from CDH5.0.3

Spark version: I tried both with the pre-built CDH5 spark package available 
from http://spark.apache.org/downloads.html and by packaging spark with sbt 
0.13.2, JDK 1.7.51 and scala 2.10.4 as explained here 
http://mesosphere.io/learn/run-spark-on-mesos/

All nodes are running Debian 3.2.51-1 x86_64 GNU/Linux and have 

Reporter: Svend



Execution of SQL query against HDFS systematically throws a class not found 
exception on slave nodes when executing .

(this was originally reported on the user list: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-1-spark-sql-error-java-lang-NoClassDefFoundError-Could-not-initialize-class-line11-read-tc10135.html)

Sample code (run from spark-shell): 

{code}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

import sqlContext.createSchemaRDD

case class Car(timestamp: Long, objectid: String, isGreen: Boolean)

// I get the same error when pointing to the folder hdfs://vm28:8020/test/cardata
val data = sc.textFile("hdfs://vm28:8020/test/cardata/part-0")

// Split each CSV line and map the fields onto the case class
val cars = data.map(_.split(",")).map(ar => Car(ar(0).toLong, ar(1), ar(2).toBoolean))
cars.registerAsTable("mcars")
val allgreens = sqlContext.sql("SELECT objectid from mcars where isGreen = true")
allgreens.collect.take(10).foreach(println)
{code}

Stack trace on the slave nodes: 

{code}

I0716 13:01:16.215158 13631 exec.cpp:131] Version: 0.19.0
I0716 13:01:16.219285 13656 exec.cpp:205] Executor registered on slave 
20140714-142853-485682442-5050-25487-2
14/07/16 13:01:16 INFO MesosExecutorBackend: Registered with Mesos as executor 
ID 20140714-142853-485682442-5050-25487-2
14/07/16 13:01:16 INFO SecurityManager: Changing view acls to: mesos,mnubohadoop
14/07/16 13:01:16 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(mesos, mnubohadoop)
14/07/16 13:01:17 INFO Slf4jLogger: Slf4jLogger started
14/07/16 13:01:17 INFO Remoting: Starting remoting
14/07/16 13:01:17 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sp...@vm23-hulk-priv.mtl.mnubo.com:38230]
14/07/16 13:01:17 INFO Remoting: Remoting now listens on addresses: 
[akka.tcp://sp...@vm23-hulk-priv.mtl.mnubo.com:38230]
14/07/16 13:01:17 INFO SparkEnv: Connecting to MapOutputTracker: 
akka.tcp://spark@vm28-hulk-pub:41632/user/MapOutputTracker
14/07/16 13:01:17 INFO SparkEnv: Connecting to BlockManagerMaster: 
akka.tcp://spark@vm28-hulk-pub:41632/user/BlockManagerMaster
14/07/16 13:01:17 INFO DiskBlockManager: Created local directory at 
/tmp/spark-local-20140716130117-8ea0
14/07/16 13:01:17 INFO MemoryStore: MemoryStore started with capacity 294.9 MB.
14/07/16 13:01:17 INFO ConnectionManager: Bound socket to port 44501 with id = 
ConnectionManagerId(vm23-hulk-priv.mtl.mnubo.com,44501)
14/07/16 13:01:17 INFO BlockManagerMaster: Trying to register BlockManager
14/07/16 13:01:17 INFO BlockManagerMaster: Registered BlockManager
14/07/16 13:01:17 INFO HttpFileServer: HTTP File server directory is 
/tmp/spark-ccf6f36c-2541-4a25-8fe4-bb4ba00ee633
14/07/16 13:01:17 INFO HttpServer: Starting HTTP Server
14/07/16 13:01:18 INFO Executor: Using REPL class URI: 
http://vm28-hulk-pub:33973
14/07/16 13:01:18 INFO Executor: Running task ID 2
14/07/16 13:01:18 INFO HttpBroadcast: Started reading broadcast variable 0
14/07/16 13:01:18 INFO MemoryStore: ensureFreeSpace(125590) called with 
curMem=0, maxMem=309225062
14/07/16 13:01:18 INFO MemoryStore: Block broadcast_0 stored as values to 
memory (estimated size 122.6 KB, free 294.8 MB)
14/07/16 13:01:18 INFO HttpBroadcast: Reading broadcast variable 0 took 
0.294602722 s
14/07/16 13:01:19 INFO HadoopRDD: Input split: 
hdfs://vm28-hulk-priv:8020/svend/testimport/com_mnubo_agriculture60_1396310400/part-0:23960450+23960451
I0716 13:01:19.905113 13657 exec.cpp:378] Executor asked to shutdown
14/07/16 13:01:20 ERROR Executor: Exception in task ID 2
java.lang.NoClassDefFoundError: $line11/$read$
at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:19)
at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:19)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$1.next(Iterator.scala:853)
at scala.collection.Iterator$$anon$1.head(Iterator.scala:840)
at 
org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:181)
at

[jira] [Closed] (SPARK-1662) PySpark fails if python class is used as a data container

2014-07-18 Thread Chandan Kumar (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandan Kumar closed SPARK-1662.


Resolution: Not a Problem

The issue is due to a limitation with Python's pickle mechanism. Probably not 
worth the effort to use something other than pickle. There are workarounds 
anyway.
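
As a hedged illustration of the kind of pickle limitation involved: pickle stores a class instance as a reference to its module and class name, not as the class's code, so the receiving process must be able to import that class. The sketch below shows this with a standard-library class and one possible workaround (shipping plain dicts); `Record` and its fields are hypothetical, and whether this is the exact failure in the linked gist is an assumption.

```python
import pickle
from fractions import Fraction

# A pickled instance carries only a reference to "fractions.Fraction";
# the unpickling side has to import that module to rebuild the object.
payload = pickle.dumps(Fraction(1, 2))
has_module_ref = b"fractions" in payload
has_class_ref = b"Fraction" in payload


class Record(object):
    """Hypothetical data container, standing in for a user-defined class."""

    def __init__(self, name, score):
        self.name = name
        self.score = score


# Workaround sketch: convert the object to a plain dict before it is
# serialized, so no user-defined class has to be importable remotely.
rec = Record("a", 1)
as_dict = dict(vars(rec))
restored = pickle.loads(pickle.dumps(as_dict))
```

A class defined in the driver script itself lives in that script's `__main__` module, which a separate worker process does not share; plain dicts and tuples avoid the lookup entirely.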

 PySpark fails if python class is used as a data container
 -

 Key: SPARK-1662
 URL: https://issues.apache.org/jira/browse/SPARK-1662
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.0
 Environment: Ubuntu 14, Python 2.7.6
Reporter: Chandan Kumar
Priority: Minor

 PySpark fails if RDD operations are performed on data encapsulated in Python 
 objects (rare use case where plain python objects are used as data containers 
 instead of regular dict or tuples).
 I have written a small piece of code to reproduce the bug:
 https://gist.github.com/nrchandan/11394440




[jira] [Commented] (SPARK-2576) slave node throw NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file


[ 
https://issues.apache.org/jira/browse/SPARK-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066205#comment-14066205
 ] 

Svend commented on SPARK-2576:
--

I see another person is reporting a similar issue on the mailing list with a 
similar stack (Spark 1.0.1 and CDH 5.0.3): 
http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-1-spark-sql-error-java-lang-NoClassDefFoundError-Could-not-initialize-class-line11-read-tc10135.html

 slave node throw NoClassDefFoundError $line11.$read$ when executing a Spark 
 QL query on HDFS CSV file
 -

 Key: SPARK-2576
 URL: https://issues.apache.org/jira/browse/SPARK-2576
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.0.1
 Environment: One Mesos 0.19 master without zookeeper and 4 mesos 
 slaves. 
 JDK 1.7.51 and Scala 2.10.4 on all nodes. 
 HDFS from CDH5.0.3
 Spark version: I tried both with the pre-built CDH5 spark package available 
 from http://spark.apache.org/downloads.html and by packaging spark with sbt 
 0.13.2, JDK 1.7.51 and scala 2.10.4 as explained here 
 http://mesosphere.io/learn/run-spark-on-mesos/
 All nodes are running Debian 3.2.51-1 x86_64 GNU/Linux and have 
Reporter: Svend

 Execution of SQL query against HDFS systematically throws a class not found 
 exception on slave nodes when executing .
 (this was originally reported on the user list: 
 http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-1-spark-sql-error-java-lang-NoClassDefFoundError-Could-not-initialize-class-line11-read-tc10135.html)
 Sample code (run from spark-shell): 
 {code}
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext.createSchemaRDD
 case class Car(timestamp: Long, objectid: String, isGreen: Boolean)
 // I get the same error when pointing to the folder hdfs://vm28:8020/test/cardata
 val data = sc.textFile("hdfs://vm28:8020/test/cardata/part-0")
 // Split each CSV line and map the fields onto the case class
 val cars = data.map(_.split(",")).map(ar => Car(ar(0).toLong, ar(1), ar(2).toBoolean))
 cars.registerAsTable("mcars")
 val allgreens = sqlContext.sql("SELECT objectid from mcars where isGreen = true")
 allgreens.collect.take(10).foreach(println)
 {code}
 Stack trace on the slave nodes: 
 {code}
 I0716 13:01:16.215158 13631 exec.cpp:131] Version: 0.19.0
 I0716 13:01:16.219285 13656 exec.cpp:205] Executor registered on slave 
 20140714-142853-485682442-5050-25487-2
 14/07/16 13:01:16 INFO MesosExecutorBackend: Registered with Mesos as 
 executor ID 20140714-142853-485682442-5050-25487-2
 14/07/16 13:01:16 INFO SecurityManager: Changing view acls to: 
 mesos,mnubohadoop
 14/07/16 13:01:16 INFO SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(mesos, 
 mnubohadoop)
 14/07/16 13:01:17 INFO Slf4jLogger: Slf4jLogger started
 14/07/16 13:01:17 INFO Remoting: Starting remoting
 14/07/16 13:01:17 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://sp...@vm23-hulk-priv.mtl.mnubo.com:38230]
 14/07/16 13:01:17 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://sp...@vm23-hulk-priv.mtl.mnubo.com:38230]
 14/07/16 13:01:17 INFO SparkEnv: Connecting to MapOutputTracker: 
 akka.tcp://spark@vm28-hulk-pub:41632/user/MapOutputTracker
 14/07/16 13:01:17 INFO SparkEnv: Connecting to BlockManagerMaster: 
 akka.tcp://spark@vm28-hulk-pub:41632/user/BlockManagerMaster
 14/07/16 13:01:17 INFO DiskBlockManager: Created local directory at 
 /tmp/spark-local-20140716130117-8ea0
 14/07/16 13:01:17 INFO MemoryStore: MemoryStore started with capacity 294.9 
 MB.
 14/07/16 13:01:17 INFO ConnectionManager: Bound socket to port 44501 with id 
 = ConnectionManagerId(vm23-hulk-priv.mtl.mnubo.com,44501)
 14/07/16 13:01:17 INFO BlockManagerMaster: Trying to register BlockManager
 14/07/16 13:01:17 INFO BlockManagerMaster: Registered BlockManager
 14/07/16 13:01:17 INFO HttpFileServer: HTTP File server directory is 
 /tmp/spark-ccf6f36c-2541-4a25-8fe4-bb4ba00ee633
 14/07/16 13:01:17 INFO HttpServer: Starting HTTP Server
 14/07/16 13:01:18 INFO Executor: Using REPL class URI: 
 http://vm28-hulk-pub:33973
 14/07/16 13:01:18 INFO Executor: Running task ID 2
 14/07/16 13:01:18 INFO HttpBroadcast: Started reading broadcast variable 0
 14/07/16 13:01:18 INFO MemoryStore: ensureFreeSpace(125590) called with 
 curMem=0, maxMem=309225062
 14/07/16 13:01:18 INFO MemoryStore: Block broadcast_0 stored as values to 
 memory (estimated size 122.6 KB, free 294.8 MB)
 14/07/16 13:01:18 INFO HttpBroadcast: Reading broadcast variable 0 took 
 0.294602722 s
 14/07/16 13:01:19 INFO HadoopRDD: Input split: 
 hdfs://vm28-hulk-priv:8020/svend/testimport/com_mnubo_agriculture60_1396310400/part-0:23960450+23960451
 I0716 13:01:19.905113 13657 exec.cpp:378] Executor

[jira] [Created] (SPARK-2577) File upload to viewfs is broken due to mount point resolution

2014-07-18 Thread Gera Shegalov (JIRA)

Gera Shegalov created SPARK-2577:


 Summary: File upload to viewfs is broken due to mount point 
resolution
 Key: SPARK-2577
 URL: https://issues.apache.org/jira/browse/SPARK-2577
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Gera Shegalov
Priority: Blocker


YARN client resolves paths of uploaded artifacts. When a viewfs path is 
resolved, the filesystem changes to the target file system. However, the 
original fs is passed to {{ClientDistributedCacheManager#addResource}}. 

{code}
14/07/18 01:30:31 INFO yarn.Client: Uploading 
file:/Users/gshegalov/workspace/spark-tw/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop3.0.0-SNAPSHOT.jar
 to 
viewfs:/user/gshegalov/.sparkStaging/application_1405479201490_0049/spark-assembly-1.1.0-SNAPSHOT-hadoop3.0.0-SNAPSHOT.jar
Exception in thread main java.lang.IllegalArgumentException: Wrong FS: 
hdfs://ns1:8020/user/gshegalov/.sparkStaging/application_1405479201490_0049/spark-assembly-1.1.0-SNAPSHOT-hadoop3.0.0-SNAPSHOT.jar,
 expected: viewfs:/
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643)
at 
org.apache.hadoop.fs.viewfs.ViewFileSystem.getUriPath(ViewFileSystem.java:116)
at 
org.apache.hadoop.fs.viewfs.ViewFileSystem.getFileStatus(ViewFileSystem.java:345)
at 
org.apache.spark.deploy.yarn.ClientDistributedCacheManager.addResource(ClientDistributedCacheManager.scala:72)
at 
org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:236)
at 
org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:229)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:229)
at 
org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:37)
at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:74)
at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:81)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:136)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:320)
at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:28)
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}

There are two options:
# do not resolve path because symlinks are currently disabled in Hadoop
# pass the correct filesystem object
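
The mismatch and the second option can be sketched with a toy model. Everything below is hypothetical (`ToyFileSystem`, the mount table, and the shortened `app.jar` path); it only mimics the scheme check behind the "Wrong FS" error and is not the Hadoop or Spark API.

```python
from urllib.parse import urlparse


class ToyFileSystem:
    """Toy stand-in for a filesystem object that only accepts its own scheme."""

    def __init__(self, scheme):
        self.scheme = scheme

    def check_path(self, path):
        if urlparse(path).scheme != self.scheme:
            raise ValueError("Wrong FS: %s, expected: %s:/" % (path, self.scheme))
        return path


# Hypothetical mount table: a viewfs prefix mapped to its target filesystem.
MOUNTS = {"viewfs:/user": "hdfs://ns1:8020/user"}
FILESYSTEMS = {s: ToyFileSystem(s) for s in ("viewfs", "hdfs")}


def resolve(path):
    """Rewrite a viewfs path to the path on its target filesystem."""
    for prefix, target in MOUNTS.items():
        if path.startswith(prefix):
            return target + path[len(prefix):]
    return path


def fs_for(path):
    """Look up the filesystem object that matches the path's scheme."""
    return FILESYSTEMS[urlparse(path).scheme]


dest = "viewfs:/user/gshegalov/.sparkStaging/app.jar"
original_fs = fs_for(dest)   # viewfs
resolved = resolve(dest)     # now an hdfs:// path

try:
    original_fs.check_path(resolved)   # resolved path + original fs: rejected
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)

# Option 2: use the filesystem that belongs to the resolved path.
checked = fs_for(resolved).check_path(resolved)
```

Option 1 corresponds to skipping `resolve` altogether, so the path and the original filesystem keep the same scheme.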




[jira] [Created] (SPARK-2578) RIGHT OUTER JOIN causes ClassCastException

Christian Wuertz created SPARK-2578:
---

 Summary: RIGHT OUTER JOIN causes ClassCastException
 Key: SPARK-2578
 URL: https://issues.apache.org/jira/browse/SPARK-2578
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Christian Wuertz


When I run an HQL query that contains a right outer join, I always get this 
exception:

{code}
org.apache.spark.SparkDriverExecutionException: Execution error
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:849)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1231)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.ClassCastException: scala.collection.mutable.HashSet 
cannot be cast to scala.collection.mutable.BitSet
at 
org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$7.apply(joins.scala:338)
at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:811)
at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:808)
at 
org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:845)
... 10 more
{code} 

This can easily be reproduced using the queries and data from this 
[tutorial|http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/].
Only the last query was modified to use a right outer join.

{code}
hiveContext.hql("SELECT a.year, a.player_id, a.runs from batting a JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b ON (a.year = b.year AND a.runs = b.runs)").collect.foreach(println)
{code}
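
For readers unfamiliar with the semantics at stake, a right outer join keeps every row of the right-hand side and pads the left-hand columns with nulls when no left row matches. The sketch below is a plain nested-loop version over hypothetical sample rows; it tracks matched right rows in an ordinary set and does not reproduce Spark's BroadcastNestedLoopJoin.

```python
def right_outer_join(left, right, left_width, condition):
    """Nested-loop right outer join over lists of tuples."""
    matched_right = set()  # indices of right rows that found a partner
    out = []
    for l in left:
        for i, r in enumerate(right):
            if condition(l, r):
                matched_right.add(i)
                out.append(l + r)
    # Right rows without a partner are kept, with nulls for the left columns.
    for i, r in enumerate(right):
        if i not in matched_right:
            out.append((None,) * left_width + r)
    return out


# Hypothetical rows: batting is (year, player_id, runs), best is (year, max_runs).
batting = [(2000, "a", 10), (2000, "b", 7), (2001, "c", 5)]
best = [(2000, 10), (2001, 5), (2002, 9)]

rows = right_outer_join(
    batting, best, left_width=3,
    condition=lambda l, r: l[0] == r[0] and l[2] == r[1],
)
```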




[jira] [Updated] (SPARK-2578) RIGHT OUTER JOIN causes ClassCastException


 [ 
https://issues.apache.org/jira/browse/SPARK-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Wuertz updated SPARK-2578:


Description: 
When I run an HQL query that contains a right outer join, I always get this 
exception:

{code}
org.apache.spark.SparkDriverExecutionException: Execution error
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:849)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1231)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.ClassCastException: scala.collection.mutable.HashSet 
cannot be cast to scala.collection.mutable.BitSet
at 
org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$7.apply(joins.scala:338)
at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:811)
at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:808)
at 
org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:845)
... 10 more
{code} 

This can easily be reproduced using the queries and data from this 
[tutorial|http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/].
Only the last query was modified to use a right outer join.

{code}
hiveContext.hql("SELECT a.year, a.player_id, a.runs from batting a RIGHT OUTER JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b ON (a.year = b.year AND a.runs = b.runs)").collect.foreach(println)
{code}

  was:
When I run a hql query that contains a right outer join I always get this 
exception:

{code}
org.apache.spark.SparkDriverExecutionException: Execution error
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:849)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1231)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.ClassCastException: scala.collection.mutable.HashSet 
cannot be cast to scala.collection.mutable.BitSet
at 
org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$7.apply(joins.scala:338)
at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:811)
at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:808)
at 
org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:845)
... 10 more
{code} 

This can easily reproduced using the queries and data from this 
[tutorial|http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/].
 Only the last query was modified to use a right outer join.

{code}
hiveContext.hql(SELECT a.year, a.player_id, a.runs from batting a JOIN (SELECT 
year, max(runs) runs FROM batting GROUP BY year ) b ON (a.year = b.year AND 
a.runs = b.runs)).collect.foreach(println)
{code}


 RIGHT OUTER JOIN causes ClassCastException
 --

 Key: SPARK-2578
 URL: https://issues.apache.org/jira/browse/SPARK-2578
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Christian Wuertz

 When I run an HQL query that contains a right outer join, I always get this 
 exception:
 {code}
 org.apache.spark.SparkDriverExecutionException: Execution error
   at

[jira] [Updated] (SPARK-2578) RIGHT OUTER JOIN causes ClassCastException


 [ 
https://issues.apache.org/jira/browse/SPARK-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Wuertz updated SPARK-2578:


Description: 
When I run an HQL query that contains a right outer join, I always get this 
exception:

{code}
org.apache.spark.SparkDriverExecutionException: Execution error
    at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:849)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1231)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.ClassCastException: scala.collection.mutable.HashSet cannot be cast to scala.collection.mutable.BitSet
    at org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$7.apply(joins.scala:338)
    at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:811)
    at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:808)
    at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:845)
    ... 10 more
{code} 

This can easily be reproduced using the queries and data from this 
[tutorial|http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/].
 Only the last query was modified to use a right outer join.

{code}
hiveContext.hql("SELECT a.year, a.player_id, a.runs from batting a RIGHT OUTER JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b ON (a.year = b.year AND a.runs = b.runs)").collect.foreach(println)
{code}

I compiled the 1.0.1 release myself with the following command: 
./make-distribution.sh --hadoop=2.2.0 --with-yarn --with-hive --tgz

  was:
When I run a hql query that contains a right outer join I always get this 
exception:

{code}
org.apache.spark.SparkDriverExecutionException: Execution error
    at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:849)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1231)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.ClassCastException: scala.collection.mutable.HashSet cannot be cast to scala.collection.mutable.BitSet
    at org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$7.apply(joins.scala:338)
    at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:811)
    at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:808)
    at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:845)
    ... 10 more
{code} 

This can easily be reproduced using the queries and data from this 
[tutorial|http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/].
 Only the last query was modified to use a right outer join.

{code}
hiveContext.hql("SELECT a.year, a.player_id, a.runs from batting a RIGHT OUTER JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b ON (a.year = b.year AND a.runs = b.runs)").collect.foreach(println)
{code}


 RIGHT OUTER JOIN causes ClassCastException
 --

 Key: SPARK-2578
 URL: https://issues.apache.org/jira/browse/SPARK-2578
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Christian Wuertz

 When I run a hql query that contains a right outer join I always get this

[jira] [Commented] (SPARK-2577) File upload to viewfs is broken due to mount point resolution

2014-07-18 Thread Gera Shegalov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066243#comment-14066243
 ] 

Gera Shegalov commented on SPARK-2577:
--

https://github.com/apache/spark/pull/1483

 File upload to viewfs is broken due to mount point resolution
 -

 Key: SPARK-2577
 URL: https://issues.apache.org/jira/browse/SPARK-2577
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Gera Shegalov
Priority: Blocker

 YARN client resolves paths of uploaded artifacts. When a viewfs path is 
 resolved, the filesystem changes to the target file system. However, the 
 original fs is passed to {{ClientDistributedCacheManager#addResource}}. 
 {code}
 14/07/18 01:30:31 INFO yarn.Client: Uploading file:/Users/gshegalov/workspace/spark-tw/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop3.0.0-SNAPSHOT.jar to viewfs:/user/gshegalov/.sparkStaging/application_1405479201490_0049/spark-assembly-1.1.0-SNAPSHOT-hadoop3.0.0-SNAPSHOT.jar
 Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: hdfs://ns1:8020/user/gshegalov/.sparkStaging/application_1405479201490_0049/spark-assembly-1.1.0-SNAPSHOT-hadoop3.0.0-SNAPSHOT.jar, expected: viewfs:/
   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643)
   at org.apache.hadoop.fs.viewfs.ViewFileSystem.getUriPath(ViewFileSystem.java:116)
   at org.apache.hadoop.fs.viewfs.ViewFileSystem.getFileStatus(ViewFileSystem.java:345)
   at org.apache.spark.deploy.yarn.ClientDistributedCacheManager.addResource(ClientDistributedCacheManager.scala:72)
   at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:236)
   at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:229)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:229)
   at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:37)
   at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:74)
   at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:81)
   at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:136)
   at org.apache.spark.SparkContext.<init>(SparkContext.scala:320)
   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:28)
   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 {code}
 There are two options:
 # do not resolve path because symlinks are currently disabled in Hadoop
 # pass the correct filesystem object
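
 The mismatch described above can be illustrated with a toy model. This is only
 a sketch: ToyFileSystem and ToyViewFs are hypothetical stand-ins, not Hadoop
 or Spark APIs, and the mount table is invented for illustration. The model
 assumes that a filesystem object rejects paths of another scheme and that
 resolving a viewfs path yields a path on the target filesystem. The last line
 corresponds to the second option (pass the correct filesystem object).

```python
from urllib.parse import urlparse


class ToyFileSystem:
    """Hypothetical stand-in for a filesystem bound to one URI scheme."""

    def __init__(self, scheme):
        self.scheme = scheme

    def check_path(self, path):
        # Imitates a "Wrong FS" check: reject paths owned by another scheme.
        actual = urlparse(path).scheme
        if actual != self.scheme:
            raise ValueError(f"Wrong FS: {path}, expected: {self.scheme}:/")
        return path

    def get_file_status(self, path):
        return {"path": self.check_path(path)}


class ToyViewFs(ToyFileSystem):
    """Hypothetical viewfs: a mount table mapping prefixes to target URIs."""

    def __init__(self, mounts):
        super().__init__("viewfs")
        self.mounts = mounts

    def resolve_path(self, path):
        # Resolution crosses the mount point, so the result lives on the target fs.
        self.check_path(path)
        local = urlparse(path).path
        for prefix, target in self.mounts.items():
            if local.startswith(prefix):
                return target + local[len(prefix):]
        raise FileNotFoundError(path)


def filesystem_for(path, registry):
    # Option 2: choose the filesystem object from the path that is actually used.
    return registry[urlparse(path).scheme]


viewfs = ToyViewFs({"/user": "hdfs://ns1:8020/user"})
hdfs = ToyFileSystem("hdfs")
registry = {"viewfs": viewfs, "hdfs": hdfs}

resolved = viewfs.resolve_path("viewfs:/user/someone/app.jar")

try:
    viewfs.get_file_status(resolved)  # original fs combined with the resolved path
    mismatch = None
except ValueError as err:
    mismatch = str(err)

status = filesystem_for(resolved, registry).get_file_status(resolved)
```

 In this model the first option (do not resolve the path) would amount to
 calling get_file_status on the original viewfs path with the viewfs object.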



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-2578) OUTER JOINs cause ClassCastException


 [ 
https://issues.apache.org/jira/browse/SPARK-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Wuertz updated SPARK-2578:


Summary: OUTER JOINs cause ClassCastException  (was: RIGHT OUTER JOIN 
causes ClassCastException)

 OUTER JOINs cause ClassCastException
 

 Key: SPARK-2578
 URL: https://issues.apache.org/jira/browse/SPARK-2578
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Christian Wuertz

 When I run a hql query that contains a right outer join I always get this 
 exception:
 {code}
 org.apache.spark.SparkDriverExecutionException: Execution error
     at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:849)
     at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1231)
     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
     at akka.actor.ActorCell.invoke(ActorCell.scala:456)
     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
     at akka.dispatch.Mailbox.run(Mailbox.scala:219)
     at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 Caused by: java.lang.ClassCastException: scala.collection.mutable.HashSet cannot be cast to scala.collection.mutable.BitSet
     at org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$7.apply(joins.scala:338)
     at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:811)
     at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:808)
     at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56)
     at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:845)
     ... 10 more
 {code} 
 This can easily be reproduced using the queries and data from this 
 [tutorial|http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/].
  Only the last query was modified to use a right outer join.
 {code}
 hiveContext.hql("SELECT a.year, a.player_id, a.runs from batting a RIGHT OUTER JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b ON (a.year = b.year AND a.runs = b.runs)").collect.foreach(println)
 {code}
 I compiled the 1.0.1 release myself with the following command: 
 ./make-distribution.sh --hadoop=2.2.0 --with-yarn --with-hive --tgz




[jira] [Updated] (SPARK-2578) OUTER JOINs cause ClassCastException


 [ 
https://issues.apache.org/jira/browse/SPARK-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Wuertz updated SPARK-2578:


Description: 
When I run a hql query that contains a right or a left outer join I always get 
this exception:

{code}
org.apache.spark.SparkDriverExecutionException: Execution error
    at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:849)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1231)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.ClassCastException: scala.collection.mutable.HashSet cannot be cast to scala.collection.mutable.BitSet
    at org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$7.apply(joins.scala:338)
    at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:811)
    at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:808)
    at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:845)
    ... 10 more
{code} 

This can easily be reproduced using the queries and data from this 
[tutorial|http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/].
 Only the last query was modified to use a right outer join.

{code}
hiveContext.hql("SELECT a.year, a.player_id, a.runs from batting a RIGHT OUTER JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b ON (a.year = b.year AND a.runs = b.runs)").collect.foreach(println)
{code}

I compiled the 1.0.1 release myself with the following command: 
./make-distribution.sh --hadoop=2.2.0 --with-yarn --with-hive --tgz

  was:
When I run a hql query that contains a right outer join I always get this 
exception:

{code}
org.apache.spark.SparkDriverExecutionException: Execution error
    at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:849)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1231)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.ClassCastException: scala.collection.mutable.HashSet cannot be cast to scala.collection.mutable.BitSet
    at org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$7.apply(joins.scala:338)
    at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:811)
    at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:808)
    at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:845)
    ... 10 more
{code} 

This can easily be reproduced using the queries and data from this 
[tutorial|http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/].
 Only the last query was modified to use a right outer join.

{code}
hiveContext.hql("SELECT a.year, a.player_id, a.runs from batting a RIGHT OUTER JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b ON (a.year = b.year AND a.runs = b.runs)").collect.foreach(println)
{code}

I compiled the 1.0.1 release myself with the following command: 
./make-distribution.sh --hadoop=2.2.0 --with-yarn --with-hive --tgz


 OUTER JOINs cause ClassCastException
 

 Key: SPARK-2578
 URL: https://issues.apache.org/jira/browse/SPARK-2578
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects

[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-18 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066283#comment-14066283
 ] 

Sean Owen commented on SPARK-2341:
--

[~mengxr] Here is an example of changing the argument:
https://github.com/srowen/spark/commit/4a584ff9c0ada3d035d4668ecf22ec0e65ed16b6

I won't open a PR yet. I think this is a better API at this point, but the 
question is whether the weight of the deprecated methods is worth it. 
Another data point to keep in mind regarding how APIs can evolve.

 loadLibSVMFile doesn't handle regression datasets
 -

 Key: SPARK-2341
 URL: https://issues.apache.org/jira/browse/SPARK-2341
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Eustache
Priority: Minor
  Labels: easyfix

 Many datasets exist in LibSVM format for regression tasks [1], but currently 
 the loadLibSVMFile primitive doesn't handle regression datasets.
 More precisely, the LabelParser is either a MulticlassLabelParser or a 
 BinaryLabelParser. As a result, the file is loaded in multiclass mode: each 
 target value is interpreted as a class name!
 The fix would be to write a RegressionLabelParser which converts target 
 values to Double and plug it into the loadLibSVMFile routine.
 [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html 
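
 The proposed fix can be sketched outside of MLlib in a few lines of plain
 Python. This is only an illustration, assuming the usual LibSVM line layout
 (a label followed by index:value pairs); parse_libsvm_line and its
 parse_label hook are hypothetical names, not part of the Spark API. Passing
 float as the label parser keeps regression targets as real numbers instead of
 treating them as class names.

```python
def parse_libsvm_line(line, parse_label=float):
    """Parse one LibSVM-format line into (label, {index: value}).

    parse_label is a hypothetical hook: float keeps regression targets
    as real numbers rather than mapping them to class ids.
    """
    tokens = line.split()
    if not tokens:
        raise ValueError("empty line")
    label = parse_label(tokens[0])
    features = {}
    for item in tokens[1:]:
        index, value = item.split(":", 1)
        features[int(index)] = float(value)
    return label, features


# Made-up sample line with a real-valued target.
sample = "24.5 1:0.5 3:-1.25 7:2"
label, features = parse_libsvm_line(sample)
```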




[jira] [Updated] (SPARK-2491) When an OOM is thrown,the executor does not stop properly.

2014-07-18 Thread Guoqiang Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-2491:
---

Component/s: YARN

 When an OOM is thrown,the executor does not stop properly.
 --

 Key: SPARK-2491
 URL: https://issues.apache.org/jira/browse/SPARK-2491
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Guoqiang Li

 The executor log:
 {code}
 #
 # java.lang.OutOfMemoryError: Java heap space
 # -XX:OnOutOfMemoryError=kill %p
 #   Executing /bin/sh -c kill 44942...
 14/07/15 10:38:29 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: 
 SIGTERM
 14/07/15 10:38:29 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception 
 in thread Thread[Connection manager future execution context-6,5,main]
 java.lang.OutOfMemoryError: Java heap space
 at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
 at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
 at org.apache.spark.storage.BlockMessage.set(BlockMessage.scala:94)
 at 
 org.apache.spark.storage.BlockMessage$.fromByteBuffer(BlockMessage.scala:176)
 at 
 org.apache.spark.storage.BlockMessageArray.set(BlockMessageArray.scala:63)
 at 
 org.apache.spark.storage.BlockMessageArray$.fromBufferMessage(BlockMessageArray.scala:109)
 at 
 org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:125)
 at 
 org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:122)
 at 
 scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117)
 at 
 scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115)
 at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 14/07/15 10:38:29 WARN HadoopRDD: Exception in RecordReader.close()
 java.io.IOException: Filesystem closed
 at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
 at 
 org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619)
 at java.io.FilterInputStream.close(FilterInputStream.java:181)
 at org.apache.hadoop.util.LineReader.close(LineReader.java:150)
 at 
 org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:243)
 at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226)
 at 
 org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
 at 
 org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197)
 at 
 org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
 at 
 org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
 at org.apache.spark.scheduler.Task.run(Task.scala:51)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 -
 14/07/15 10:38:30 INFO Executor: Running task ID 969
 14/07/15 10:38:30 INFO BlockManager: Found block broadcast_0 locally
 14/07/15 10:38:30 INFO HadoopRDD: Input split: 
 hdfs://10dian72.domain.test:8020/input/lbs/recommend/toona/rating/20140712/part-7:0+68016537
 14/07/15 10:38:30 ERROR Executor: Exception in task ID 969
 java.io.FileNotFoundException: 
 /yarn/nm/usercache/spark/appcache/application_1404728465401_0070/spark-local-20140715103235-ffda/2e/merged_shuffle_4_85_0
  (No such file or directory)
 at java.io.FileOutputStream.open(Native Method)
 at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
 at 
 org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:116)
 at 
 org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:177)
 at

[jira] [Commented] (SPARK-2256) pyspark: RDD.take doesn't work ... sometimes ...


[ 
https://issues.apache.org/jira/browse/SPARK-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066390#comment-14066390
 ] 

Matthew Farrellee commented on SPARK-2256:
--

ok. i don't have a windows machine to try. good luck!

 pyspark: RDD.take doesn't work ... sometimes ...
 --

 Key: SPARK-2256
 URL: https://issues.apache.org/jira/browse/SPARK-2256
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.0
 Environment: local file/remote HDFS
Reporter: Ángel Álvarez
  Labels: RDD, pyspark, take
 Attachments: A_test.zip


 If I try to take some lines from a file, sometimes it doesn't work
 Code: 
 myfile = sc.textFile("A_ko")
 print myfile.take(10)
 Stacktrace:
 14/06/24 09:29:27 INFO DAGScheduler: Failed to run take at mytest.py:19
 Traceback (most recent call last):
   File "mytest.py", line 19, in <module>
     print myfile.take(10)
   File "spark-1.0.0-bin-hadoop2\python\pyspark\rdd.py", line 868, in take
     iterator = mapped._jrdd.collectPartitions(partitionsToTake)[0].iterator()
   File "spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\java_gateway.py", line 537, in __call__
   File "spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\protocol.py", line 300, in get_return_value
 Test data:
 START TEST DATA
 A
 A
 A

[jira] [Commented] (SPARK-677) PySpark should not collect results through local filesystem


[ 
https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066458#comment-14066458
 ] 

Matthew Farrellee commented on SPARK-677:
-

this can also be used to address the fragile nature of py4j connection 
construction. the parent can create the fifo.

 PySpark should not collect results through local filesystem
 ---

 Key: SPARK-677
 URL: https://issues.apache.org/jira/browse/SPARK-677
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 0.7.0
Reporter: Josh Rosen

 Py4J is slow when transferring large arrays, so PySpark currently dumps data 
 to the disk and reads it back in order to collect() RDDs.  On large enough 
 datasets, this data will spill from the buffer cache and write to the 
 physical disk, resulting in terrible performance.
 Instead, we should stream the data from Java to Python over a local socket or 
 a FIFO.
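
 The idea of streaming over a local socket can be sketched in plain Python.
 This is not PySpark's implementation; it is a minimal illustration that
 assumes length-prefixed pickled records and uses socket.socketpair() to stand
 in for the Java-to-Python connection. The names send_records and
 receive_records are hypothetical.

```python
import pickle
import socket
import struct
import threading


def send_records(sock, records):
    # Frame each record as a 4-byte big-endian length followed by its payload.
    with sock, sock.makefile("wb") as out:
        for record in records:
            payload = pickle.dumps(record)
            out.write(struct.pack(">I", len(payload)))
            out.write(payload)


def receive_records(sock):
    # Read frames until the writer closes its end of the socket.
    with sock, sock.makefile("rb") as src:
        while True:
            header = src.read(4)
            if len(header) < 4:
                return
            (length,) = struct.unpack(">I", header)
            yield pickle.loads(src.read(length))


producer, consumer = socket.socketpair()
records = [{"id": i, "value": i * i} for i in range(1000)]

# The producer runs in its own thread so that writer and reader overlap
# and nothing has to be staged on disk.
writer = threading.Thread(target=send_records, args=(producer, records))
writer.start()
collected = list(receive_records(consumer))
writer.join()
```

 Because the reader consumes frames as they arrive, memory use on the
 receiving side is bounded by the size of a single record rather than by the
 size of the whole result.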




[jira] [Commented] (SPARK-1305) Support persisting RDD's directly to Tachyon

2014-07-18 Thread Denis Serduik (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066465#comment-14066465
 ] 

Denis Serduik commented on SPARK-1305:
--

I'm interested in using this feature, especially with SchemaRDD, to be able to 
cache intermediate results.
Where can I find some code examples showing how to use it? 
From SparkContext's code I see the following: 

  private[spark] def persistRDD(rdd: RDD[_]) {
    persistentRdds(rdd.id) = rdd
  }

  /**
   * Unpersist an RDD from memory and/or disk storage
   */
  private[spark] def unpersistRDD(rddId: Int, blocking: Boolean = true) {
    env.blockManager.master.removeRdd(rddId, blocking)
    persistentRdds.remove(rddId)
    listenerBus.post(SparkListenerUnpersistRDD(rddId))
  }

So persist doesn't put the RDD into the tachyonStore of the BlockManager? How 
does this feature work? 


 Support persisting RDD's directly to Tachyon
 

 Key: SPARK-1305
 URL: https://issues.apache.org/jira/browse/SPARK-1305
 Project: Spark
  Issue Type: New Feature
  Components: Block Manager
Reporter: Patrick Wendell
Assignee: Haoyuan Li
Priority: Blocker
 Fix For: 1.0.0


 This is already an ongoing pull request - in a nutshell we want to support 
 Tachyon as a storage level in Spark.




[jira] [Comment Edited] (SPARK-1305) Support persisting RDD's directly to Tachyon

2014-07-18 Thread Denis Serduik (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066465#comment-14066465
 ] 

Denis Serduik edited comment on SPARK-1305 at 7/18/14 4:00 PM:
---

I'm interested in using this feature, especially with SchemaRDD, to be able to 
cache intermediate results.
Where can I find some code examples showing how to use it? 
From SparkContext's code I see the following: 

  private[spark] def persistRDD(rdd: RDD[_]) {
    persistentRdds(rdd.id) = rdd
  }

  /**
   * Unpersist an RDD from memory and/or disk storage
   */
  private[spark] def unpersistRDD(rddId: Int, blocking: Boolean = true) {
    env.blockManager.master.removeRdd(rddId, blocking)
    persistentRdds.remove(rddId)
    listenerBus.post(SparkListenerUnpersistRDD(rddId))
  }

So persist doesn't put the RDD into the tachyonStore of the BlockManager? How 
does this feature work? 



was (Author: dmaverick):
I'm interesting in using this feature especially with SchemeRDD to be able 
cache intermediate  results.
Where I can find some code examples how to use it ? 
From SparkContext's code I see the following: 

  private[spark] def persistRDD(rdd: RDD[_]) {
    persistentRdds(rdd.id) = rdd
  }

  /**
   * Unpersist an RDD from memory and/or disk storage
   */
  private[spark] def unpersistRDD(rddId: Int, blocking: Boolean = true) {
    env.blockManager.master.removeRdd(rddId, blocking)
    persistentRdds.remove(rddId)
    listenerBus.post(SparkListenerUnpersistRDD(rddId))
  }

So  persist don't put RDD into tachyonStore of blockmanager ? How does this 
feature work ? 


 Support persisting RDD's directly to Tachyon
 

 Key: SPARK-1305
 URL: https://issues.apache.org/jira/browse/SPARK-1305
 Project: Spark
  Issue Type: New Feature
  Components: Block Manager
Reporter: Patrick Wendell
Assignee: Haoyuan Li
Priority: Blocker
 Fix For: 1.0.0


 This is already an ongoing pull request - in a nutshell we want to support 
 Tachyon as a storage level in Spark.




[jira] [Created] (SPARK-2580) broken pipe collecting schemardd results

Matthew Farrellee created SPARK-2580:


 Summary: broken pipe collecting schemardd results
 Key: SPARK-2580
 URL: https://issues.apache.org/jira/browse/SPARK-2580
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.0.0
 Environment: fedora 21 local and rhel 7 clustered (standalone)
Reporter: Matthew Farrellee


{code}
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)
# size of cluster impacts where this breaks (i.e. 2**15 vs 2**2)
data = sc.parallelize([{'name': 'index', 'value': 0}] * 2**20)
sdata = sqlCtx.inferSchema(data)
sdata.first()
{code}

result: note - result returned as well as error
{code}
 sdata.first()
14/07/18 12:10:25 INFO SparkContext: Starting job: runJob at PythonRDD.scala:290
14/07/18 12:10:25 INFO DAGScheduler: Got job 43 (runJob at PythonRDD.scala:290) 
with 1 output partitions (allowLocal=true)
14/07/18 12:10:25 INFO DAGScheduler: Final stage: Stage 52(runJob at 
PythonRDD.scala:290)
14/07/18 12:10:25 INFO DAGScheduler: Parents of final stage: List()
14/07/18 12:10:25 INFO DAGScheduler: Missing parents: List()
14/07/18 12:10:25 INFO DAGScheduler: Computing the requested partition locally
14/07/18 12:10:25 INFO PythonRDD: Times: total = 45, boot = 3, init = 40, 
finish = 2
14/07/18 12:10:25 INFO SparkContext: Job finished: runJob at 
PythonRDD.scala:290, took 0.048348426 s
{u'name': u'index', u'value': 0}
 PySpark worker failed with exception:
Traceback (most recent call last):
  File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/worker.py", line 77, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py", line 191, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py", line 124, in dump_stream
    self._write_with_length(obj, stream)
  File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py", line 139, in _write_with_length
    stream.write(serialized)
IOError: [Errno 32] Broken pipe

Traceback (most recent call last):
  File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/daemon.py", line 130, in launch_worker
    worker(listen_sock)
  File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/daemon.py", line 119, in worker
    outfile.flush()
IOError: [Errno 32] Broken pipe
{code}
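
Errno 32 (EPIPE) is what a writer receives when the reading side of a pipe or 
socket has already been closed. The sketch below reproduces only that 
mechanism with os.pipe() on a POSIX system; it does not involve Spark. Reading 
the log above as "the consumer stops after the first row while the worker is 
still writing" is an assumption, not a confirmed diagnosis.

```python
import errno
import os

read_fd, write_fd = os.pipe()

# The consumer takes the one row it needs and closes its end early
# (assumed here to mirror first() needing only a single row).
os.write(write_fd, b"row-0\n")
first_row = os.read(read_fd, 6)
os.close(read_fd)

# The producer keeps writing the remaining rows and hits EPIPE.
try:
    os.write(write_fd, b"row-1\n")
    error_code = None
except BrokenPipeError as err:
    error_code = err.errno
finally:
    os.close(write_fd)
```

This matches the observation that the result is returned and the error is 
logged afterwards: the reader already has its data when the writer fails.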




[jira] [Commented] (SPARK-2580) broken pipe collecting schemardd results


[ 
https://issues.apache.org/jira/browse/SPARK-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066480#comment-14066480
 ] 

Matthew Farrellee commented on SPARK-2580:
--

fyi, this was discovered during spark summit 2014 using the wiki_parquet 
example.

 broken pipe collecting schemardd results
 

 Key: SPARK-2580
 URL: https://issues.apache.org/jira/browse/SPARK-2580
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.0.0
 Environment: fedora 21 local and rhel 7 clustered (standalone)
Reporter: Matthew Farrellee
  Labels: py4j, pyspark

 {code}
 from pyspark.sql import SQLContext
 sqlCtx = SQLContext(sc)
 # size of cluster impacts where this breaks (i.e. 2**15 vs 2**2)
 data = sc.parallelize([{'name': 'index', 'value': 0}] * 2**20)
 sdata = sqlCtx.inferSchema(data)
 sdata.first()
 {code}
 result: note - result returned as well as error
 {code}
  sdata.first()
 14/07/18 12:10:25 INFO SparkContext: Starting job: runJob at 
 PythonRDD.scala:290
 14/07/18 12:10:25 INFO DAGScheduler: Got job 43 (runJob at 
 PythonRDD.scala:290) with 1 output partitions (allowLocal=true)
 14/07/18 12:10:25 INFO DAGScheduler: Final stage: Stage 52(runJob at 
 PythonRDD.scala:290)
 14/07/18 12:10:25 INFO DAGScheduler: Parents of final stage: List()
 14/07/18 12:10:25 INFO DAGScheduler: Missing parents: List()
 14/07/18 12:10:25 INFO DAGScheduler: Computing the requested partition locally
 14/07/18 12:10:25 INFO PythonRDD: Times: total = 45, boot = 3, init = 40, 
 finish = 2
 14/07/18 12:10:25 INFO SparkContext: Job finished: runJob at 
 PythonRDD.scala:290, took 0.048348426 s
 {u'name': u'index', u'value': 0}
  PySpark worker failed with exception:
 Traceback (most recent call last):
   File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/worker.py", line 77, in main
     serializer.dump_stream(func(split_index, iterator), outfile)
   File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py", line 191, in dump_stream
     self.serializer.dump_stream(self._batched(iterator), stream)
   File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py", line 124, in dump_stream
     self._write_with_length(obj, stream)
   File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py", line 139, in _write_with_length
     stream.write(serialized)
 IOError: [Errno 32] Broken pipe
 Traceback (most recent call last):
   File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/daemon.py", line 130, in launch_worker
     worker(listen_sock)
   File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/daemon.py", line 119, in worker
     outfile.flush()
 IOError: [Errno 32] Broken pipe
 {code}




[jira] [Commented] (SPARK-2576) slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file

2014-07-18 Thread Yin Huai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066492#comment-14066492
 ] 

Yin Huai commented on SPARK-2576:
-

Regarding
{code}
$line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:19)
$line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:19)
{code}
what is your input at console line 19?

 slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark 
 QL query on HDFS CSV file
 --

 Key: SPARK-2576
 URL: https://issues.apache.org/jira/browse/SPARK-2576
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.0.1
 Environment: One Mesos 0.19 master without zookeeper and 4 mesos 
 slaves. 
 JDK 1.7.51 and Scala 2.10.4 on all nodes. 
 HDFS from CDH5.0.3
 Spark version: I tried both with the pre-built CDH5 spark package available 
 from http://spark.apache.org/downloads.html and by packaging spark with sbt 
 0.13.2, JDK 1.7.51 and scala 2.10.4 as explained here 
 http://mesosphere.io/learn/run-spark-on-mesos/
 All nodes are running Debian 3.2.51-1 x86_64 GNU/Linux and have 
Reporter: Svend

 Execution of SQL query against HDFS systematically throws a class not found 
 exception on slave nodes when executing .
 (this was originally reported on the user list: 
 http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-1-spark-sql-error-java-lang-NoClassDefFoundError-Could-not-initialize-class-line11-read-tc10135.html)
 Sample code (ran from spark-shell): 
 {code}
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext.createSchemaRDD
 case class Car(timestamp: Long, objectid: String, isGreen: Boolean)
 // I get the same error when pointing to the folder hdfs://vm28:8020/test/cardata
 val data = sc.textFile("hdfs://vm28:8020/test/cardata/part-0")
 val cars = data.map(_.split(",")).map(ar => Car(ar(0).toLong, ar(1), ar(2).toBoolean))
 cars.registerAsTable("mcars")
 val allgreens = sqlContext.sql("SELECT objectid from mcars where isGreen = true")
 allgreens.collect.take(10).foreach(println)
 {code}
 Stack trace on the slave nodes: 
 {code}
 I0716 13:01:16.215158 13631 exec.cpp:131] Version: 0.19.0
 I0716 13:01:16.219285 13656 exec.cpp:205] Executor registered on slave 
 20140714-142853-485682442-5050-25487-2
 14/07/16 13:01:16 INFO MesosExecutorBackend: Registered with Mesos as 
 executor ID 20140714-142853-485682442-5050-25487-2
 14/07/16 13:01:16 INFO SecurityManager: Changing view acls to: 
 mesos,mnubohadoop
 14/07/16 13:01:16 INFO SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(mesos, 
 mnubohadoop)
 14/07/16 13:01:17 INFO Slf4jLogger: Slf4jLogger started
 14/07/16 13:01:17 INFO Remoting: Starting remoting
 14/07/16 13:01:17 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://sp...@vm23-hulk-priv.mtl.mnubo.com:38230]
 14/07/16 13:01:17 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://sp...@vm23-hulk-priv.mtl.mnubo.com:38230]
 14/07/16 13:01:17 INFO SparkEnv: Connecting to MapOutputTracker: 
 akka.tcp://spark@vm28-hulk-pub:41632/user/MapOutputTracker
 14/07/16 13:01:17 INFO SparkEnv: Connecting to BlockManagerMaster: 
 akka.tcp://spark@vm28-hulk-pub:41632/user/BlockManagerMaster
 14/07/16 13:01:17 INFO DiskBlockManager: Created local directory at 
 /tmp/spark-local-20140716130117-8ea0
 14/07/16 13:01:17 INFO MemoryStore: MemoryStore started with capacity 294.9 
 MB.
 14/07/16 13:01:17 INFO ConnectionManager: Bound socket to port 44501 with id 
 = ConnectionManagerId(vm23-hulk-priv.mtl.mnubo.com,44501)
 14/07/16 13:01:17 INFO BlockManagerMaster: Trying to register BlockManager
 14/07/16 13:01:17 INFO BlockManagerMaster: Registered BlockManager
 14/07/16 13:01:17 INFO HttpFileServer: HTTP File server directory is 
 /tmp/spark-ccf6f36c-2541-4a25-8fe4-bb4ba00ee633
 14/07/16 13:01:17 INFO HttpServer: Starting HTTP Server
 14/07/16 13:01:18 INFO Executor: Using REPL class URI: 
 http://vm28-hulk-pub:33973
 14/07/16 13:01:18 INFO Executor: Running task ID 2
 14/07/16 13:01:18 INFO HttpBroadcast: Started reading broadcast variable 0
 14/07/16 13:01:18 INFO MemoryStore: ensureFreeSpace(125590) called with 
 curMem=0, maxMem=309225062
 14/07/16 13:01:18 INFO MemoryStore: Block broadcast_0 stored as values to 
 memory (estimated size 122.6 KB, free 294.8 MB)
 14/07/16 13:01:18 INFO HttpBroadcast: Reading broadcast variable 0 took 
 0.294602722 s
 14/07/16 13:01:19 INFO HadoopRDD: Input split: 
 hdfs://vm28-hulk-priv:8020/svend/testimport/com_mnubo_agriculture60_1396310400/part-0:23960450+23960451
 I0716 13:01:19.905113 13657 exec.cpp:378] Executor asked to shutdown
 14/07/16 13:01:20 ERROR Executor: Exception in task ID 2

[jira] [Created] (SPARK-2581) complete or withdraw visitedStages optimization in DAGScheduler’s stageDependsOn

2014-07-18 Thread Aaron Staple (JIRA)

Aaron Staple created SPARK-2581:
---

 Summary: complete or withdraw visitedStages optimization in 
DAGScheduler’s stageDependsOn
 Key: SPARK-2581
 URL: https://issues.apache.org/jira/browse/SPARK-2581
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Aaron Staple
Priority: Minor


Right now the visitedStages HashSet is populated with stages, but never queried 
to limit examination of previously visited stages.  It may make sense to check 
whether a mapStage has been visited previously before visiting it again, as in 
the nearby visitedRdds check.  Or it may be that the existing visitedRdds check 
sufficiently optimizes this function, and visitedStages can simply be removed.
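
A rough sketch of the first option, written in Python purely for illustration. This is not the DAGScheduler code: `depends_on` and the `parents` mapping are hypothetical names. It shows a traversal that both populates and queries its visited set, so that each stage is expanded at most once.

```python
def depends_on(stage, target, parents):
    """Return True if `target` is reachable from `stage` via `parents`.

    `parents` maps a stage name to the stages it directly depends on.
    """
    visited = set()
    stack = [stage]
    while stack:
        current = stack.pop()
        if current == target:
            return True
        if current in visited:
            # This lookup is the step the issue says is missing: the set is
            # filled in, but never consulted before revisiting a stage.
            continue
        visited.add(current)
        stack.extend(parents.get(current, ()))
    return False
```

In a diamond-shaped dependency graph the shared ancestor is expanded once instead of once per path. Whether this matters in practice depends on how much the existing visitedRdds check already prunes.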

See discussion here: 
https://github.com/apache/spark/pull/1362#discussion-diff-15018046L1107




[jira] [Commented] (SPARK-2568) RangePartitioner should go through the data only once


[ 
https://issues.apache.org/jira/browse/SPARK-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066541#comment-14066541
 ] 

Reynold Xin commented on SPARK-2568:


Yes. Let's solve the problems one at a time though.

 RangePartitioner should go through the data only once
 -

 Key: SPARK-2568
 URL: https://issues.apache.org/jira/browse/SPARK-2568
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Reynold Xin
Assignee: Xiangrui Meng

 As of Spark 1.0, RangePartitioner goes through data twice: once to compute 
 the count and once to do sampling. As a result, to do sortByKey, Spark goes 
 through data 3 times (once to count, once to sample, and once to sort).
 RangePartitioner should go through data only once (remove the count step).
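
One way the count step might be folded into sampling is reservoir sampling, which yields a fixed-size uniform sample and the element count from a single pass. The Python sketch below is only an illustration of that idea, not necessarily the approach the assignee will take; `reservoir_sample_and_count` is a hypothetical name.

```python
import random


def reservoir_sample_and_count(items, k, seed=None):
    """Return (sample, count) after a single pass over `items`.

    `sample` holds at most `k` items chosen uniformly at random
    (Algorithm R); `count` is the number of items seen.
    """
    rng = random.Random(seed)
    reservoir = []
    count = 0
    for item in items:
        count += 1
        if len(reservoir) < k:
            reservoir.append(item)
        else:
            # Keep the new item with probability k / count.
            j = rng.randrange(count)
            if j < k:
                reservoir[j] = item
    return reservoir, count
```

Run once per partition, this would give per-partition counts and samples together, which could then be combined on the driver to pick range boundaries.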




[jira] [Commented] (SPARK-2568) RangePartitioner should go through the data only once

2014-07-18 Thread Mark Hamstra (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066553#comment-14066553
 ] 

Mark Hamstra commented on SPARK-2568:
-

Sure, if they can be cleanly separated -- but there's also interaction with the 
ShuffleManager refactoring.

Do you have some strategy in mind for addressing just SPARK-2568 in isolation?

 RangePartitioner should go through the data only once
 -

 Key: SPARK-2568
 URL: https://issues.apache.org/jira/browse/SPARK-2568
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Reynold Xin
Assignee: Xiangrui Meng

 As of Spark 1.0, RangePartitioner goes through data twice: once to compute 
 the count and once to do sampling. As a result, to do sortByKey, Spark goes 
 through data 3 times (once to count, once to sample, and once to sort).
 RangePartitioner should go through data only once (remove the count step).




[jira] [Updated] (SPARK-2582) Make Block Manager Master pluggable

2014-07-18 Thread Hari Shreedharan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Shreedharan updated SPARK-2582:


  Component/s: Block Manager
Affects Version/s: 1.0.0
   Issue Type: Improvement  (was: Bug)

 Make Block Manager Master pluggable
 ---

 Key: SPARK-2582
 URL: https://issues.apache.org/jira/browse/SPARK-2582
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager
Affects Versions: 1.0.0
Reporter: Hari Shreedharan

 Today, the BMM is not pluggable, so an HA BMM would have to replace the 
 current one outright. Making the BMM pluggable and selecting it through a 
 config setting would let each application choose an HA or non-HA 
 implementation. Streaming applications would be better off with an HA one, 
 while a normal application would not care (since the RDDs can be 
 regenerated).
 Since communication from the Block Managers to the BMM is via Akka, we can 
 keep that the same and have each BMM implementation provide the methods that 
 do the real work; this would not affect the Block Managers either.




[jira] [Created] (SPARK-2582) Make Block Manager Master pluggable

2014-07-18 Thread Hari Shreedharan (JIRA)

Hari Shreedharan created SPARK-2582:
---

 Summary: Make Block Manager Master pluggable
 Key: SPARK-2582
 URL: https://issues.apache.org/jira/browse/SPARK-2582
 Project: Spark
  Issue Type: Bug
Reporter: Hari Shreedharan


Today, the BMM is not pluggable, so an HA BMM would have to replace the 
current one outright. Making the BMM pluggable and selecting it through a 
config setting would let each application choose an HA or non-HA 
implementation. Streaming applications would be better off with an HA one, 
while a normal application would not care (since the RDDs can be regenerated).

Since communication from the Block Managers to the BMM is via Akka, we can keep 
that the same and have each BMM implementation provide the methods that do the 
real work; this would not affect the Block Managers either.
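
As a loose illustration of config-based selection (Python, not Spark code), the sketch below picks an implementation by a config value. Everything in it is an assumption made for the example: the class names, the `example.blockManagerMaster.impl` key, and the "journal" that stands in for whatever an HA variant would actually persist.

```python
class InMemoryBlockManagerMaster:
    """Non-HA stand-in: block locations live only in this process."""

    def __init__(self):
        self._locations = {}

    def register_block(self, block_id, executor_id):
        self._locations.setdefault(block_id, set()).add(executor_id)

    def get_locations(self, block_id):
        return sorted(self._locations.get(block_id, ()))


class JournalingBlockManagerMaster(InMemoryBlockManagerMaster):
    """HA stand-in: also records each registration in a journal that a
    standby could replay (here the journal is just a list)."""

    def __init__(self):
        super().__init__()
        self.journal = []

    def register_block(self, block_id, executor_id):
        self.journal.append((block_id, executor_id))
        super().register_block(block_id, executor_id)


# Hypothetical mapping from config value to implementation.
IMPLEMENTATIONS = {
    "default": InMemoryBlockManagerMaster,
    "ha": JournalingBlockManagerMaster,
}


def create_block_manager_master(conf):
    """Instantiate the implementation named in `conf` (a plain dict)."""
    kind = conf.get("example.blockManagerMaster.impl", "default")
    if kind not in IMPLEMENTATIONS:
        raise ValueError("unknown block manager master implementation: %r" % kind)
    return IMPLEMENTATIONS[kind]()
```

Callers only use `register_block` and `get_locations`, so swapping the implementation does not change the calling side, which is the property the description asks for.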





[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts

2014-07-18 Thread Xuefu Zhang (JIRA)

[
https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066600#comment-14066600
]

Xuefu Zhang commented on SPARK-2420:

{quote}
2. For jetty, it was a problem with Hive on Spark POC, possibly because we
shipped all libraries from Hive process's classpath to the spark cluster. We
have a task (HIVE-7371) to identify a minimum set of jars to be shipped. With
that, the story might change. We will confirm if Jetty is a problem once we
have a better idea on HIVE-7371.
{quote}

Even better news: with the latest work in HIVE-7292, we found that
servlet-api/jetty no longer seems to be a problem. Thus, the only remaining
conflict is guava, for which HIVE-7387 has all the details.

Change Spark build to minimize library conflicts

Key: SPARK-2420
URL: https://issues.apache.org/jira/browse/SPARK-2420
Project: Spark
Issue Type: Wish
Components: Build
Affects Versions: 1.0.0
Reporter: Xuefu Zhang
Attachments: spark_1.0.0.patch

During the prototyping of HIVE-7292, many library conflicts showed up because
the Spark build contains library versions that are vastly different from those
in the current major Hadoop version. It would be nice if we could choose versions
that are in line with Hadoop, or shade them in the assembly. Here is the wish
list:
1. Upgrade protobuf to 2.5.0 from the current 2.4.1.
2. Shade Spark's jetty and servlet dependencies in the assembly.
3. Resolve the guava version difference. Spark is using a higher version. I'm
not sure what the best solution for this is.
The list may grow as HIVE-7292 proceeds.
For information only, the attached patch is one that we applied on Spark in
order to make Spark work with Hive. It gives an idea of the scope of changes.


[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts

2014-07-18 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066634#comment-14066634
 ] 

Sean Owen commented on SPARK-2420:
--

I'm going to make a PR that would show what downgrading Guava looks like, so at 
least that can be assessed by everyone.

 Change Spark build to minimize library conflicts
 

 Key: SPARK-2420
 URL: https://issues.apache.org/jira/browse/SPARK-2420
 Project: Spark
  Issue Type: Wish
  Components: Build
Affects Versions: 1.0.0
Reporter: Xuefu Zhang
 Attachments: spark_1.0.0.patch


 During the prototyping of HIVE-7292, many library conflicts showed up because 
 the Spark build contains library versions that are vastly different from those 
 in the current major Hadoop version. It would be nice if we could choose versions 
 that are in line with Hadoop, or shade them in the assembly. Here is the wish 
 list:
 1. Upgrade protobuf to 2.5.0 from the current 2.4.1.
 2. Shade Spark's jetty and servlet dependencies in the assembly.
 3. Resolve the guava version difference. Spark is using a higher version. I'm 
 not sure what the best solution for this is.
 The list may grow as HIVE-7292 proceeds.
 For information only, the attached patch is one that we applied on Spark in 
 order to make Spark work with Hive. It gives an idea of the scope of changes.




[jira] [Commented] (SPARK-2568) RangePartitioner should go through the data only once


[ 
https://issues.apache.org/jira/browse/SPARK-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066689#comment-14066689
 ] 

Reynold Xin commented on SPARK-2568:


Our PhD in Math is working on that :)

 RangePartitioner should go through the data only once
 -

 Key: SPARK-2568
 URL: https://issues.apache.org/jira/browse/SPARK-2568
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Reynold Xin
Assignee: Xiangrui Meng

 As of Spark 1.0, RangePartitioner goes through data twice: once to compute 
 the count and once to do sampling. As a result, to do sortByKey, Spark goes 
 through data 3 times (once to count, once to sample, and once to sort).
 RangePartitioner should go through data only once (remove the count step).




[jira] [Created] (SPARK-2584) Do not mutate block storage level on the UI

2014-07-18 Thread Andrew Or (JIRA)

Andrew Or created SPARK-2584:


 Summary: Do not mutate block storage level on the UI
 Key: SPARK-2584
 URL: https://issues.apache.org/jira/browse/SPARK-2584
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Web UI
Affects Versions: 1.1.0
Reporter: Andrew Or
 Fix For: 1.1.0


If a block is stored as MEMORY_AND_DISK and we drop it from memory, it becomes 
DISK_ONLY on the UI. We should preserve the original storage level requested by 
the user, in addition to showing the change in actual storage level.
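
A minimal Python sketch of the idea, assuming the UI tracks two fields per block instead of one. The `BlockUiRow` type, its field names, and the small transition table are hypothetical and exist only for this illustration; they are not Spark's actual UI data model.

```python
from dataclasses import dataclass

# Level a block effectively has after its in-memory copy is dropped.
# Only the levels needed for the illustration are listed.
AFTER_MEMORY_DROP = {
    "MEMORY_AND_DISK": "DISK_ONLY",
    "MEMORY_ONLY": "NONE",
}


@dataclass(frozen=True)
class BlockUiRow:
    block_id: str
    requested_level: str  # what the user asked for; never mutated
    actual_level: str     # where the block currently lives


def drop_from_memory(row):
    """Return a new row reflecting the drop, leaving requested_level intact."""
    new_actual = AFTER_MEMORY_DROP.get(row.actual_level, row.actual_level)
    return BlockUiRow(row.block_id, row.requested_level, new_actual)
```

With both fields available, the UI could show something like "MEMORY_AND_DISK (currently DISK_ONLY)" rather than replacing one level with the other.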




[jira] [Created] (SPARK-2585) Remove special handling of Hadoop JobConf

2014-07-18 Thread Patrick Wendell (JIRA)

Patrick Wendell created SPARK-2585:
--

 Summary: Remove special handling of Hadoop JobConf
 Key: SPARK-2585
 URL: https://issues.apache.org/jira/browse/SPARK-2585
 Project: Spark
  Issue Type: Improvement
Reporter: Patrick Wendell
Assignee: Reynold Xin


This is a follow up to SPARK-2521 and should close SPARK-2546 (provided the 
implementation does not use shared conf objects). We no longer need to 
specially broadcast the Hadoop configuration since we are broadcasting RDD data 
anyways.




[jira] [Updated] (SPARK-2585) Remove special handling of Hadoop JobConf

2014-07-18 Thread Patrick Wendell (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2585:
---

Priority: Critical  (was: Major)

 Remove special handling of Hadoop JobConf
 -

 Key: SPARK-2585
 URL: https://issues.apache.org/jira/browse/SPARK-2585
 Project: Spark
  Issue Type: Improvement
Reporter: Patrick Wendell
Assignee: Reynold Xin
Priority: Critical

 This is a follow up to SPARK-2521 and should close SPARK-2546 (provided the 
 implementation does not use shared conf objects). We no longer need to 
 specially broadcast the Hadoop configuration since we are broadcasting RDD 
 data anyways.




[jira] [Commented] (SPARK-1767) Prefer HDFS-cached replicas when scheduling data-local tasks

2014-07-18 Thread Colin Patrick McCabe (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-1767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066788#comment-14066788
 ] 

Colin Patrick McCabe commented on SPARK-1767:
-

I posted a pull request here: https://github.com/apache/spark/pull/1486

This illustrates how to do it with Hadoop 2.5.  I'm still not sure about 
whether we should change the type of getPreferredLocations (see the pull 
request)

 Prefer HDFS-cached replicas when scheduling data-local tasks
 

 Key: SPARK-1767
 URL: https://issues.apache.org/jira/browse/SPARK-1767
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Sandy Ryza






[jira] [Updated] (SPARK-2521) Broadcast RDD object once per TaskSet (instead of sending it for every task)


 [ 
https://issues.apache.org/jira/browse/SPARK-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2521:
---

Description: 
Currently (as of Spark 1.0.1), Spark sends the RDD object (which contains closures) 
using Akka along with the task itself to the executors. This is inefficient 
because all tasks in the same stage use the same RDD object, yet we have to 
send the RDD object to the executors multiple times. This is especially bad when a 
closure references some variable that is very large. The current design led to 
users having to explicitly broadcast large variables.

The patch uses broadcast to send RDD objects and closures to executors, and 
uses Akka to send only a reference to the broadcast RDD/closure along with the 
partition-specific information for the task. For those of you who know more 
about the internals, Spark already relies on broadcast to send the Hadoop 
JobConf every time it uses the Hadoop input, because the JobConf is large.

The user-facing impact of the change includes:

- Users won't need to decide what to broadcast anymore, unless they want to 
use a large object multiple times in different operations.
- Task size will get smaller, resulting in faster scheduling and higher task 
dispatch throughput.

In addition, the change will simplify some internals of Spark, eliminating the 
need to maintain task caches and the complex logic to broadcast JobConf (which 
also led to a deadlock recently).

A simple way to test this:

{code}
val a = new Array[Byte](1000*1000); scala.util.Random.nextBytes(a);
sc.parallelize(1 to 1000, 1000).map { x => a; x }.groupBy { x => a; x }.count
Numbers on 3 r3.8xlarge instances on EC2

master branch: 5.648436068 s, 4.715361895 s, 5.360161877 s
with this change: 3.416348793 s, 1.477846558 s, 1.553432156 s
{code}

  was:
This can substantially reduce task size, as well as being able to support very 
large closures (e.g. closures that reference large variables).

Once this is in, we can also remove broadcasting the Hadoop JobConf.


 Broadcast RDD object once per TaskSet (instead of sending it for every task)
 

 Key: SPARK-2521
 URL: https://issues.apache.org/jira/browse/SPARK-2521
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin

 Currently (as of Spark 1.0.1), Spark sends RDD object (which contains 
 closures) using Akka along with the task itself to the executors. This is 
 inefficient because all tasks in the same stage use the same RDD object, but 
 we have to send RDD object multiple times to the executors. This is 
 especially bad when a closure references some variable that is very large. 
 The current design led to users having to explicitly broadcast large 
 variables.
 The patch uses broadcast to send RDD objects and closures to executors, 
 and uses Akka to send only a reference to the broadcast RDD/closure along with 
 the partition-specific information for the task. For those of you who know 
 more about the internals, Spark already relies on broadcast to send the 
 Hadoop JobConf every time it uses the Hadoop input, because the JobConf is 
 large.
 The user-facing impact of the change includes:
 - Users won't need to decide what to broadcast anymore, unless they want 
 to use a large object multiple times in different operations.
 - Task size will get smaller, resulting in faster scheduling and higher task 
 dispatch throughput.
 In addition, the change will simplify some internals of Spark, eliminating 
 the need to maintain task caches and the complex logic to broadcast JobConf 
 (which also led to a deadlock recently).
 A simple way to test this:
 {code}
 val a = new Array[Byte](1000*1000); scala.util.Random.nextBytes(a);
 sc.parallelize(1 to 1000, 1000).map { x => a; x }.groupBy { x => a; x }.count
 Numbers on 3 r3.8xlarge instances on EC2
 master branch: 5.648436068 s, 4.715361895 s, 5.360161877 s
 with this change: 3.416348793 s, 1.477846558 s, 1.553432156 s
 {code}
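
To make the size argument concrete, here is a small Python model of the two shipping strategies. It is an illustration only: it uses `pickle` as a stand-in serializer, the function names are hypothetical, and it counts one broadcast copy in total, whereas a real cluster would fetch the broadcast once per executor.

```python
import pickle


def bytes_sent_per_task(closure_data, num_tasks):
    """Every task message carries its own serialized copy of the closure data."""
    return sum(
        len(pickle.dumps((closure_data, partition)))
        for partition in range(num_tasks)
    )


def bytes_sent_with_broadcast(closure_data, num_tasks, broadcast_id=0):
    """The closure data is serialized once; each task message carries only
    a small (broadcast_id, partition) reference."""
    broadcast_bytes = len(pickle.dumps(closure_data))
    task_bytes = sum(
        len(pickle.dumps((broadcast_id, partition)))
        for partition in range(num_tasks)
    )
    return broadcast_bytes + task_bytes
```

For a payload the size of the 1000*1000-byte array in the test above, the per-task total grows linearly with the number of tasks, while the broadcast total stays close to a single copy of the payload.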




[jira] [Commented] (SPARK-1680) Clean up use of setExecutorEnvs in SparkConf

2014-07-18 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066872#comment-14066872
 ] 

Thomas Graves commented on SPARK-1680:
--

So what do we want to do here? Do we want to expose spark.executorEnv. to the 
user to allow them to set environment variables?  I assume that since it isn't 
documented it wasn't meant for end users, but perhaps we just missed documenting 
it.  I would like to add a config to replace the env variable 
SPARK_YARN_USER_ENV. 

Either way I will probably need another config to set env variables on the 
application master.  I could make that one yarn specific like 
spark.yarn.appMasterEnv.
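
For illustration, prefix-based configs like these could be turned into environment variables as sketched below in Python. `env_from_conf` is a hypothetical helper and the config is modelled as a plain dict; `spark.executorEnv.` is the prefix discussed above, and `spark.yarn.appMasterEnv.` is only the name proposed in this comment.

```python
def env_from_conf(conf, prefix):
    """Collect environment variables from config keys that start with `prefix`.

    The part of the key after the prefix becomes the variable name.
    Keys that consist of the bare prefix are ignored.
    """
    return {
        key[len(prefix):]: value
        for key, value in conf.items()
        if key.startswith(prefix) and len(key) > len(prefix)
    }
```

The same helper would serve both the executor and the application master; only the prefix differs.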

 Clean up use of setExecutorEnvs in SparkConf 
 -

 Key: SPARK-1680
 URL: https://issues.apache.org/jira/browse/SPARK-1680
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Patrick Wendell
Priority: Blocker
 Fix For: 1.1.0


 We should make this consistent between YARN and Standalone. Basically, YARN 
 mode should just use the executorEnvs from the Spark conf and not need 
 SPARK_YARN_USER_ENV.




[jira] [Resolved] (SPARK-1536) Add multiclass classification tree support to MLlib

2014-07-18 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-1536.
--

   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 886
[https://github.com/apache/spark/pull/886]

 Add multiclass classification tree support to MLlib
 ---

 Key: SPARK-1536
 URL: https://issues.apache.org/jira/browse/SPARK-1536
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Manish Amde
Assignee: Manish Amde
Priority: Critical
 Fix For: 1.1.0


 The current decision tree implementation in MLlib only supports binary 
 classification. This task involves adding multiclass classification support 
 to the decision tree implementation.
 The task involves:
 - Choosing a good strategy for multiclass classification among multiple 
 options:
   -- add multiclass support to impurity, but it won't work well with the 
 categorical features since the centroid-based ordering assumptions won't hold 
 true
   -- error-correcting output codes
   -- one-vs-all
 - Code implementation
 - Unit tests
 - Functional tests
 - Performance tests
 - Documentation




[jira] [Commented] (SPARK-2269) Clean up and add unit tests for resourceOffers in MesosSchedulerBackend

2014-07-18 Thread Timothy Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066879#comment-14066879
 ] 

Timothy Chen commented on SPARK-2269:
-

Created a PR for this: https://github.com/apache/spark/pull/1487

 Clean up and add unit tests for resourceOffers in MesosSchedulerBackend
 ---

 Key: SPARK-2269
 URL: https://issues.apache.org/jira/browse/SPARK-2269
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Reporter: Patrick Wendell
Assignee: Tim Chen

 This function could be simplified a bit. We could re-write it without 
 offerableIndices or creating the mesosTasks array as large as the offer list. 
 There is a lot of logic around making sure you get the correct index into 
 mesosTasks and offers, really we should just build mesosTasks directly from 
 the offers we get back. To associate the tasks we are launching with the 
 offers we can just create a hashMap from the slaveId to the original offer.
 The basic logic of the function is that you take the mesos offers, convert 
 them to spark offers, then convert the results back.
 One reason I think it might be designed as it is now is to deal with the case 
 where Mesos gives multiple offers for a single slave. I checked directly with 
 the Mesos team and they said this won't ever happen, you'll get at most one 
 offer per mesos slave within a set of offers.
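
The simplification described above might look roughly like the following Python sketch. It is not the MesosSchedulerBackend code: offers are modelled as plain dicts, and `build_launches`, `schedule`, and the field names are hypothetical. It relies on the stated assumption of at most one offer per slave within a set of offers.

```python
def build_launches(mesos_offers, schedule):
    """Group scheduled tasks by the Mesos offer they should be launched on.

    `mesos_offers` is a list of dicts with "offer_id", "slave_id" and "cpus".
    `schedule` takes (slave_id, cpus) pairs and returns (slave_id, task) pairs.
    """
    # Lookup from slave id back to the original offer, replacing index
    # bookkeeping into parallel arrays.
    offer_by_slave = {offer["slave_id"]: offer for offer in mesos_offers}

    # Convert Mesos offers into scheduler-side offers.
    worker_offers = [(offer["slave_id"], offer["cpus"]) for offer in mesos_offers]

    # Convert the scheduling result back, keyed by offer id.
    launches = {}
    for slave_id, task in schedule(worker_offers):
        offer_id = offer_by_slave[slave_id]["offer_id"]
        launches.setdefault(offer_id, []).append(task)
    return launches
```

Offers that receive no tasks simply never appear in the result, so there is no need to size any structure to the offer list up front.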




[jira] [Resolved] (SPARK-2535) Add StringComparison case to NullPropagation.


 [ 
https://issues.apache.org/jira/browse/SPARK-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2535.
-

   Resolution: Fixed
Fix Version/s: 1.1.0
 Assignee: Takuya Ueshin

 Add StringComparison case to NullPropagation.
 -

 Key: SPARK-2535
 URL: https://issues.apache.org/jira/browse/SPARK-2535
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Takuya Ueshin
Assignee: Takuya Ueshin
 Fix For: 1.1.0


 {{StringComparison}} expressions including {{null}} literal cases could be 
 added to {{NullPropagation}}.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Resolved] (SPARK-2540) Add More Types Support for unwarpData of HiveUDF


 [ 
https://issues.apache.org/jira/browse/SPARK-2540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2540.
-

   Resolution: Fixed
Fix Version/s: 1.0.2
   1.1.0

 Add More Types Support for unwarpData of HiveUDF
 

 Key: SPARK-2540
 URL: https://issues.apache.org/jira/browse/SPARK-2540
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao
Assignee: Cheng Hao
Priority: Minor
 Fix For: 1.1.0, 1.0.2


 In the function HiveInspectors.unwrapData, the 
 HiveVarcharObjectInspector and HiveDecimalObjectInspector are currently not supported.




[jira] [Updated] (SPARK-2571) Shuffle read bytes are reported incorrectly for stages with multiple shuffle dependencies


 [ 
https://issues.apache.org/jira/browse/SPARK-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-2571:
--

Fix Version/s: 1.0.2

 Shuffle read bytes are reported incorrectly for stages with multiple shuffle 
 dependencies
 -

 Key: SPARK-2571
 URL: https://issues.apache.org/jira/browse/SPARK-2571
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.0.1, 0.9.3
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout
 Fix For: 1.0.2


 In BlockStoreShuffleFetcher, we set the shuffle metrics for a task to include 
 information about data fetched from one BlockFetcherIterator.  When tasks 
 have multiple shuffle dependencies (e.g., a stage that joins two datasets 
 together), the metrics will get set based on data fetched from the last 
 BlockFetcherIterator to complete, rather than the sum of all data fetched 
 from all BlockFetcherIterators.  This can lead to dramatically underreporting 
 the shuffle read bytes.
 Thanks [~andrewor14] and [~rxin] for helping to diagnose this issue.
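
The difference between the reported and the intended value can be shown with a tiny Python model. The function names are hypothetical, and the list stands in for the bytes fetched by each iterator in completion order; this is not the BlockStoreShuffleFetcher code.

```python
def reported_last_wins(bytes_per_fetch):
    """Model of the bug: each completed fetch overwrites the task metric."""
    reported = 0
    for fetched in bytes_per_fetch:
        reported = fetched
    return reported


def reported_summed(bytes_per_fetch):
    """Model of the intended behaviour: accumulate across all fetches."""
    reported = 0
    for fetched in bytes_per_fetch:
        reported += fetched
    return reported
```

For a join that reads 1 MiB from one dependency and then 4 KiB from another, the first function reports only the 4 KiB, which is the kind of underreporting the description mentions.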




[jira] [Resolved] (SPARK-2571) Shuffle read bytes are reported incorrectly for stages with multiple shuffle dependencies


 [ 
https://issues.apache.org/jira/browse/SPARK-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-2571.
---

Resolution: Fixed

 Shuffle read bytes are reported incorrectly for stages with multiple shuffle 
 dependencies
 -

 Key: SPARK-2571
 URL: https://issues.apache.org/jira/browse/SPARK-2571
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.0.1, 0.9.3
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout

 In BlockStoreShuffleFetcher, we set the shuffle metrics for a task to include 
 information about data fetched from one BlockFetcherIterator.  When tasks 
 have multiple shuffle dependencies (e.g., a stage that joins two datasets 
 together), the metrics will get set based on data fetched from the last 
 BlockFetcherIterator to complete, rather than the sum of all data fetched 
 from all BlockFetcherIterators.  This can lead to dramatically underreporting 
 the shuffle read bytes.
 Thanks [~andrewor14] and [~rxin] for helping to diagnose this issue.




[jira] [Updated] (SPARK-2571) Shuffle read bytes are reported incorrectly for stages with multiple shuffle dependencies


 [ 
https://issues.apache.org/jira/browse/SPARK-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-2571:
--


https://github.com/apache/spark/commit/7b971b91caeebda57f1506ffc4fd266a1b379290

 Shuffle read bytes are reported incorrectly for stages with multiple shuffle 
 dependencies
 -

 Key: SPARK-2571
 URL: https://issues.apache.org/jira/browse/SPARK-2571
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.0.1, 0.9.3
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout
 Fix For: 1.0.2


 In BlockStoreShuffleFetcher, we set the shuffle metrics for a task to include 
 information about data fetched from one BlockFetcherIterator.  When tasks 
 have multiple shuffle dependencies (e.g., a stage that joins two datasets 
 together), the metrics will get set based on data fetched from the last 
 BlockFetcherIterator to complete, rather than the sum of all data fetched 
 from all BlockFetcherIterators.  This can lead to dramatically underreporting 
 the shuffle read bytes.
 Thanks [~andrewor14] and [~rxin] for helping to diagnose this issue.




[jira] [Resolved] (SPARK-2518) Fix foldability of Substring expression.


 [ 
https://issues.apache.org/jira/browse/SPARK-2518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2518.
-

   Resolution: Fixed
Fix Version/s: 1.0.2
   1.1.0

 Fix foldability of Substring expression.
 

 Key: SPARK-2518
 URL: https://issues.apache.org/jira/browse/SPARK-2518
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takuya Ueshin
Assignee: Takuya Ueshin
 Fix For: 1.1.0, 1.0.2


 This is a follow-up of [#1428|https://github.com/apache/spark/pull/1428].




[jira] [Updated] (SPARK-2571) Shuffle read bytes are reported incorrectly for stages with multiple shuffle dependencies


 [ 
https://issues.apache.org/jira/browse/SPARK-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-2571:
--

Target Version/s: 1.1.0  (was: 1.0.2)
   Fix Version/s: (was: 1.0.2)
  1.1.0

 Shuffle read bytes are reported incorrectly for stages with multiple shuffle 
 dependencies
 -

 Key: SPARK-2571
 URL: https://issues.apache.org/jira/browse/SPARK-2571
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.0.1, 0.9.3
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout
 Fix For: 1.1.0


 In BlockStoreShuffleFetcher, we set the shuffle metrics for a task to include 
 information about data fetched from one BlockFetcherIterator.  When tasks 
 have multiple shuffle dependencies (e.g., a stage that joins two datasets 
 together), the metrics will get set based on data fetched from the last 
 BlockFetcherIterator to complete, rather than the sum of all data fetched 
 from all BlockFetcherIterators.  This can lead to dramatically underreporting 
 the shuffle read bytes.
 Thanks [~andrewor14] and [~rxin] for helping to diagnose this issue.
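
The gap between the reported and the expected value can be sketched as follows. This is a minimal illustration, not Spark's actual code: `FetchMetrics`, `lastWins` and `summed` are hypothetical names, and each element of the sequence stands for the bytes fetched through one BlockFetcherIterator, in completion order.

```scala
// Hypothetical stand-in for the metrics collected by one BlockFetcherIterator.
final case class FetchMetrics(remoteBytesRead: Long)

object ShuffleReadSketch {
  // Behaviour described in the issue: each iterator overwrites the task's
  // metrics, so only the last one to complete is reported.
  def lastWins(perDependency: Seq[FetchMetrics]): Option[FetchMetrics] =
    perDependency.lastOption

  // Expected behaviour: the task reports the sum over all shuffle dependencies.
  def summed(perDependency: Seq[FetchMetrics]): Option[FetchMetrics] =
    perDependency.reduceOption((a, b) => FetchMetrics(a.remoteBytesRead + b.remoteBytesRead))
}
```

For a join whose two inputs contribute 1000 and 24 bytes, `lastWins` reports 24 while `summed` reports 1024.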




[jira] [Updated] (SPARK-2562) Add Date datatype support to Spark SQL


 [ 
https://issues.apache.org/jira/browse/SPARK-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2562:


Target Version/s: 1.2.0  (was: 1.1.0, 1.2.0)

 Add Date datatype support to Spark SQL
 --

 Key: SPARK-2562
 URL: https://issues.apache.org/jira/browse/SPARK-2562
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.1
Reporter: Zongheng Yang
Priority: Minor

 Spark SQL currently supports Timestamp, but not Date. Hive introduced support 
 for Date in [HIVE-4055|https://issues.apache.org/jira/browse/HIVE-4055], 
 where the underlying representation is {{java.sql.Date}}.
 (Thanks to user Rindra Ramamonjison for reporting this.)




[jira] [Updated] (SPARK-2551) Cleanup FilteringParquetRowInputFormat


 [ 
https://issues.apache.org/jira/browse/SPARK-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2551:


Target Version/s: 1.2.0  (was: 1.1.0)

 Cleanup FilteringParquetRowInputFormat
 --

 Key: SPARK-2551
 URL: https://issues.apache.org/jira/browse/SPARK-2551
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.1, 1.0.2
Reporter: Cheng Lian
Priority: Minor

 To work around [PARQUET-16|https://issues.apache.org/jira/browse/PARQUET-16] 
 and fix [SPARK-2119|https://issues.apache.org/jira/browse/SPARK-2119], we did 
 some reflection hacks in {{FilteringParquetRowInputFormat}}. This should be 
 cleaned up once PARQUET-16 is fixed.
 A PR for PARQUET-16 is 
 [here|https://github.com/apache/incubator-parquet-mr/pull/17].




[jira] [Updated] (SPARK-2205) Unnecessary exchange operators in a join on multiple tables with the same join key.


 [ 
https://issues.apache.org/jira/browse/SPARK-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2205:


Target Version/s: 1.2.0  (was: 1.1.0)

 Unnecessary exchange operators in a join on multiple tables with the same 
 join key.
 ---

 Key: SPARK-2205
 URL: https://issues.apache.org/jira/browse/SPARK-2205
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Minor

 {code}
 hql("select * from src x join src y on (x.key=y.key) join src z on (y.key=z.key)")
 SchemaRDD[1] at RDD at SchemaRDD.scala:100
 == Query Plan ==
 Project [key#4:0,value#5:1,key#6:2,value#7:3,key#8:4,value#9:5]
  HashJoin [key#6], [key#8], BuildRight
   Exchange (HashPartitioning [key#6], 200)
    HashJoin [key#4], [key#6], BuildRight
     Exchange (HashPartitioning [key#4], 200)
      HiveTableScan [key#4,value#5], (MetastoreRelation default, src, Some(x)), None
     Exchange (HashPartitioning [key#6], 200)
      HiveTableScan [key#6,value#7], (MetastoreRelation default, src, Some(y)), None
   Exchange (HashPartitioning [key#8], 200)
    HiveTableScan [key#8,value#9], (MetastoreRelation default, src, Some(z)), None
 {code}
 However, this is fine...
 {code}
 hql("select * from src x join src y on (x.key=y.key) join src z on (x.key=z.key)")
 res5: org.apache.spark.sql.SchemaRDD = 
 SchemaRDD[5] at RDD at SchemaRDD.scala:100
 == Query Plan ==
 Project [key#26:0,value#27:1,key#28:2,value#29:3,key#30:4,value#31:5]
  HashJoin [key#26], [key#30], BuildRight
   HashJoin [key#26], [key#28], BuildRight
    Exchange (HashPartitioning [key#26], 200)
     HiveTableScan [key#26,value#27], (MetastoreRelation default, src, Some(x)), None
    Exchange (HashPartitioning [key#28], 200)
     HiveTableScan [key#28,value#29], (MetastoreRelation default, src, Some(y)), None
   Exchange (HashPartitioning [key#30], 200)
    HiveTableScan [key#30,value#31], (MetastoreRelation default, src, Some(z)), None
 {code}
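
One way to read the two plans: after an inner join on x.key = y.key, every output row carries the same value in both columns, so data hash-partitioned by x.key is also hash-partitioned by y.key (assuming the same hash function and the same number of partitions, 200 here). The sketch below models that reasoning only; `Clustering`, `afterEquiJoin` and `needsExchange` are hypothetical names and not part of the Spark SQL planner.

```scala
object ExchangeSketch {
  // Hypothetical model: the set of keys by which a plan's output is already
  // hash-clustered. Keys in the same set hold equal values in every row.
  final case class Clustering(equivalentKeys: Set[String]) {
    def satisfies(requiredKey: String): Boolean = equivalentKeys.contains(requiredKey)
  }

  // After an inner equi-join on left = right, clustering by one key implies
  // clustering by the other.
  def afterEquiJoin(left: String, right: String): Clustering =
    Clustering(Set(left, right))

  // An Exchange is only needed when the child is not already clustered as required.
  def needsExchange(child: Clustering, requiredKey: String): Boolean =
    !child.satisfies(requiredKey)
}
```

Under this model the second join in the first query, which requires clustering by y.key, would not need the extra Exchange above the first HashJoin.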




[jira] [Updated] (SPARK-2087) Support SQLConf per session


 [ 
https://issues.apache.org/jira/browse/SPARK-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2087:


Target Version/s: 1.2.0  (was: 1.1.0)

 Support SQLConf per session
 ---

 Key: SPARK-2087
 URL: https://issues.apache.org/jira/browse/SPARK-2087
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Zongheng Yang
Priority: Minor

 For things like the SharkServer we should support configuration per thread 
 instead of globally for a context.




[jira] [Commented] (SPARK-2087) Support SQLConf per session


[ 
https://issues.apache.org/jira/browse/SPARK-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066964#comment-14066964
 ] 

Michael Armbrust commented on SPARK-2087:
-

I'm thinking about this more... and really I think we need to do more than just 
have a SQLConf per session.  Ideally we would have a full SQLContext per 
session.  This way users could have their own temporary tables.  However, doing 
that in the naive way would also mean that cached tables would not be shared 
across contexts... So really I think this JIRA needs to be about clean 
multi-user semantics for the thrift server.

 Support SQLConf per session
 ---

 Key: SPARK-2087
 URL: https://issues.apache.org/jira/browse/SPARK-2087
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Zongheng Yang
Priority: Minor

 For things like the SharkServer we should support configuration per thread 
 instead of globally for a context.




[jira] [Updated] (SPARK-2087) Clean Multi-user semantics for thrift JDBC/ODBC server.


 [ 
https://issues.apache.org/jira/browse/SPARK-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2087:


Description: Configuration and temporary tables should exist per-user.  
Cached tables should be shared across users.  (was: For things like the 
SharkServer we should support configuration per thread instead of globally for 
a context.)

 Clean Multi-user semantics for thrift JDBC/ODBC server.
 ---

 Key: SPARK-2087
 URL: https://issues.apache.org/jira/browse/SPARK-2087
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Zongheng Yang
Priority: Minor

 Configuration and temporary tables should exist per-user.  Cached tables 
 should be shared across users.
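
The intended semantics could be sketched roughly as below. This is only an illustration under stated assumptions: `SharedCache` and `UserSession` are hypothetical classes, tables are modelled as plain sequences of strings, and letting a temporary table shadow a cached table of the same name is a design choice of this sketch, not something the issue specifies.

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical stand-in for state shared by every user of the server.
final class SharedCache {
  private val cached = TrieMap.empty[String, Seq[String]]
  def cacheTable(name: String, rows: Seq[String]): Unit = cached.update(name, rows)
  def lookup(name: String): Option[Seq[String]] = cached.get(name)
}

// Hypothetical per-user session: private configuration and temporary tables,
// with cached tables resolved through the shared cache.
final class UserSession(shared: SharedCache) {
  private val conf = TrieMap.empty[String, String]
  private val tempTables = TrieMap.empty[String, Seq[String]]

  def setConf(key: String, value: String): Unit = conf.update(key, value)
  def getConf(key: String): Option[String] = conf.get(key)
  def registerTempTable(name: String, rows: Seq[String]): Unit =
    tempTables.update(name, rows)

  // Assumption of this sketch: temporary tables shadow shared cached tables.
  def table(name: String): Option[Seq[String]] =
    tempTables.get(name).orElse(shared.lookup(name))
}
```

Two sessions built on the same `SharedCache` see the same cached tables, while a setting or temporary table registered in one session stays invisible to the other.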




[jira] [Updated] (SPARK-2087) Clean Multi-user semantics for thrift JDBC/ODBC server.


 [ 
https://issues.apache.org/jira/browse/SPARK-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2087:


Summary: Clean Multi-user semantics for thrift JDBC/ODBC server.  (was: 
Support SQLConf per session)

 Clean Multi-user semantics for thrift JDBC/ODBC server.
 ---

 Key: SPARK-2087
 URL: https://issues.apache.org/jira/browse/SPARK-2087
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Zongheng Yang
Priority: Minor

 For things like the SharkServer we should support configuration per thread 
 instead of globally for a context.




[jira] [Updated] (SPARK-2066) Better error message for non-aggregated attributes with aggregates


 [ 
https://issues.apache.org/jira/browse/SPARK-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2066:


Target Version/s: 1.2.0  (was: 1.1.0)

 Better error message for non-aggregated attributes with aggregates
 --

 Key: SPARK-2066
 URL: https://issues.apache.org/jira/browse/SPARK-2066
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Reynold Xin
Assignee: Cheng Lian

 [~marmbrus]
 Run the following query
 {code}
 scala> c.hql("select key, count(*) from src").collect()
 {code}
 Got the following exception at runtime
 {code}
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: No function to evaluate expression. type: AttributeReference, tree: key#61
   at org.apache.spark.sql.catalyst.expressions.AttributeReference.eval(namedExpressions.scala:157)
   at org.apache.spark.sql.catalyst.expressions.Projection.apply(Projection.scala:35)
   at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$1.apply(Aggregate.scala:154)
   at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$1.apply(Aggregate.scala:134)
   at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:558)
   at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:558)
   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
   at org.apache.spark.scheduler.Task.run(Task.scala:51)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:744)
 {code}
 This should either fail at analysis time or succeed at runtime. It definitely 
 shouldn't fail at runtime.
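
An analysis-time check of this kind might look like the sketch below. It is a deliberately simplified illustration: `Expr`, `Column`, `AggregateCall` and `check` are hypothetical and are not Catalyst classes or rules.

```scala
object AggregateCheckSketch {
  // Hypothetical, heavily simplified expression model (not Catalyst's classes).
  sealed trait Expr
  final case class Column(name: String) extends Expr
  final case class AggregateCall(function: String) extends Expr

  // Reject the query during analysis if a plain column in the SELECT list
  // does not appear in the GROUP BY list.
  def check(select: Seq[Expr], groupBy: Seq[Column]): Either[String, Unit] = {
    val offending = select.collect { case c: Column if !groupBy.contains(c) => c.name }
    if (offending.isEmpty) Right(())
    else Left("Expression not in GROUP BY: " + offending.mkString(", "))
  }
}
```

With this check, the query above (`key` selected next to `count(*)` with no GROUP BY) would be rejected with a message naming `key` instead of failing inside a task.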




[jira] [Created] (SPARK-2586) Lack of information to figure out connection to Tachyon master is inactive/ down

2014-07-18 Thread Henry Saputra (JIRA)

Henry Saputra created SPARK-2586:


 Summary: Lack of information to figure out connection to Tachyon 
master is inactive/ down
 Key: SPARK-2586
 URL: https://issues.apache.org/jira/browse/SPARK-2586
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Henry Saputra


When running Spark with Tachyon, if the connection to the Tachyon master is 
down (due to a network problem or because the master node is down), there is no 
clear log or error message to indicate it.

Here is a sample log from running the SparkTachyonPi example while it connects to Tachyon:

14/07/15 16:43:10 INFO Utils: Using Spark's default log4j profile: 
org/apache/spark/log4j-defaults.properties
14/07/15 16:43:10 WARN Utils: Your hostname, henry-pivotal.local resolves to a 
loopback address: 127.0.0.1; using 10.64.5.148 instead (on interface en5)
14/07/15 16:43:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another 
address
14/07/15 16:43:11 INFO SecurityManager: Changing view acls to: hsaputra
14/07/15 16:43:11 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(hsaputra)
14/07/15 16:43:11 INFO Slf4jLogger: Slf4jLogger started
14/07/15 16:43:11 INFO Remoting: Starting remoting
14/07/15 16:43:11 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sp...@office-5-148.pa.gopivotal.com:53203]
14/07/15 16:43:11 INFO Remoting: Remoting now listens on addresses: 
[akka.tcp://sp...@office-5-148.pa.gopivotal.com:53203]
14/07/15 16:43:11 INFO SparkEnv: Registering MapOutputTracker
14/07/15 16:43:11 INFO SparkEnv: Registering BlockManagerMaster
14/07/15 16:43:11 INFO DiskBlockManager: Created local directory at 
/var/folders/nv/nsr_3ysj0wgfq93fqp0rdt3wgp/T/spark-local-20140715164311-e63c
14/07/15 16:43:11 INFO ConnectionManager: Bound socket to port 53204 with id = 
ConnectionManagerId(office-5-148.pa.gopivotal.com,53204)
14/07/15 16:43:11 INFO MemoryStore: MemoryStore started with capacity 2.1 GB
14/07/15 16:43:11 INFO BlockManagerMaster: Trying to register BlockManager
14/07/15 16:43:11 INFO BlockManagerMasterActor: Registering block manager 
office-5-148.pa.gopivotal.com:53204 with 2.1 GB RAM
14/07/15 16:43:11 INFO BlockManagerMaster: Registered BlockManager
14/07/15 16:43:11 INFO HttpServer: Starting HTTP Server
14/07/15 16:43:11 INFO HttpBroadcast: Broadcast server started at 
http://10.64.5.148:53205
14/07/15 16:43:11 INFO HttpFileServer: HTTP File server directory is 
/var/folders/nv/nsr_3ysj0wgfq93fqp0rdt3wgp/T/spark-b2fb12ae-4608-4833-87b6-b335da00738e
14/07/15 16:43:11 INFO HttpServer: Starting HTTP Server
14/07/15 16:43:12 INFO SparkUI: Started SparkUI at 
http://office-5-148.pa.gopivotal.com:4040
2014-07-15 16:43:12.210 java[39068:1903] Unable to load realm info from 
SCDynamicStore
14/07/15 16:43:12 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
14/07/15 16:43:12 INFO SparkContext: Added JAR 
examples/target/scala-2.10/spark-examples-1.1.0-SNAPSHOT-hadoop2.4.0.jar at 
http://10.64.5.148:53206/jars/spark-examples-1.1.0-SNAPSHOT-hadoop2.4.0.jar 
with timestamp 1405467792813
14/07/15 16:43:12 INFO AppClient$ClientActor: Connecting to master 
spark://henry-pivotal.local:7077...
14/07/15 16:43:12 INFO SparkContext: Starting job: reduce at 
SparkTachyonPi.scala:43
14/07/15 16:43:12 INFO DAGScheduler: Got job 0 (reduce at 
SparkTachyonPi.scala:43) with 2 output partitions (allowLocal=false)
14/07/15 16:43:12 INFO DAGScheduler: Final stage: Stage 0(reduce at 
SparkTachyonPi.scala:43)
14/07/15 16:43:12 INFO DAGScheduler: Parents of final stage: List()
14/07/15 16:43:12 INFO DAGScheduler: Missing parents: List()
14/07/15 16:43:12 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[1] at map at 
SparkTachyonPi.scala:39), which has no missing parents
14/07/15 16:43:13 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 
(MappedRDD[1] at map at SparkTachyonPi.scala:39)
14/07/15 16:43:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/07/15 16:43:13 INFO SparkDeploySchedulerBackend: Connected to Spark cluster 
with app ID app-20140715164313-
14/07/15 16:43:13 INFO AppClient$ClientActor: Executor added: 
app-20140715164313-/0 on 
worker-20140715164009-office-5-148.pa.gopivotal.com-52519 
(office-5-148.pa.gopivotal.com:52519) with 8 cores
14/07/15 16:43:13 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20140715164313-/0 on hostPort office-5-148.pa.gopivotal.com:52519 with 
8 cores, 512.0 MB RAM
14/07/15 16:43:13 INFO AppClient$ClientActor: Executor updated: 
app-20140715164313-/0 is now RUNNING
14/07/15 16:43:15 INFO SparkDeploySchedulerBackend: Registered executor: 
Actor[akka.tcp://sparkexecu...@office-5-148.pa.gopivotal.com:53213/user/Executor#-423405256]
 with ID 0
14/07/15 16:43:15 INFO TaskSetManager:

[jira] [Updated] (SPARK-2586) Lack of information to figure out connection to Tachyon master is inactive/ down

2014-07-18 Thread Henry Saputra (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Henry Saputra updated SPARK-2586:
-

Labels: tachyon  (was: )

 Lack of information to figure out connection to Tachyon master is inactive/ 
 down
 

 Key: SPARK-2586
 URL: https://issues.apache.org/jira/browse/SPARK-2586
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Henry Saputra
  Labels: tachyon

 When running Spark with Tachyon, if the connection to the Tachyon master is 
 down (due to a network problem or because the master node is down), there is no 
 clear log or error message to indicate it.
 Here is a sample log from running the SparkTachyonPi example while it connects to Tachyon:
 14/07/15 16:43:10 INFO Utils: Using Spark's default log4j profile: 
 org/apache/spark/log4j-defaults.properties
 14/07/15 16:43:10 WARN Utils: Your hostname, henry-pivotal.local resolves to 
 a loopback address: 127.0.0.1; using 10.64.5.148 instead (on interface en5)
 14/07/15 16:43:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
 another address
 14/07/15 16:43:11 INFO SecurityManager: Changing view acls to: hsaputra
 14/07/15 16:43:11 INFO SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(hsaputra)
 14/07/15 16:43:11 INFO Slf4jLogger: Slf4jLogger started
 14/07/15 16:43:11 INFO Remoting: Starting remoting
 14/07/15 16:43:11 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://sp...@office-5-148.pa.gopivotal.com:53203]
 14/07/15 16:43:11 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://sp...@office-5-148.pa.gopivotal.com:53203]
 14/07/15 16:43:11 INFO SparkEnv: Registering MapOutputTracker
 14/07/15 16:43:11 INFO SparkEnv: Registering BlockManagerMaster
 14/07/15 16:43:11 INFO DiskBlockManager: Created local directory at 
 /var/folders/nv/nsr_3ysj0wgfq93fqp0rdt3wgp/T/spark-local-20140715164311-e63c
 14/07/15 16:43:11 INFO ConnectionManager: Bound socket to port 53204 with id 
 = ConnectionManagerId(office-5-148.pa.gopivotal.com,53204)
 14/07/15 16:43:11 INFO MemoryStore: MemoryStore started with capacity 2.1 GB
 14/07/15 16:43:11 INFO BlockManagerMaster: Trying to register BlockManager
 14/07/15 16:43:11 INFO BlockManagerMasterActor: Registering block manager 
 office-5-148.pa.gopivotal.com:53204 with 2.1 GB RAM
 14/07/15 16:43:11 INFO BlockManagerMaster: Registered BlockManager
 14/07/15 16:43:11 INFO HttpServer: Starting HTTP Server
 14/07/15 16:43:11 INFO HttpBroadcast: Broadcast server started at 
 http://10.64.5.148:53205
 14/07/15 16:43:11 INFO HttpFileServer: HTTP File server directory is 
 /var/folders/nv/nsr_3ysj0wgfq93fqp0rdt3wgp/T/spark-b2fb12ae-4608-4833-87b6-b335da00738e
 14/07/15 16:43:11 INFO HttpServer: Starting HTTP Server
 14/07/15 16:43:12 INFO SparkUI: Started SparkUI at 
 http://office-5-148.pa.gopivotal.com:4040
 2014-07-15 16:43:12.210 java[39068:1903] Unable to load realm info from 
 SCDynamicStore
 14/07/15 16:43:12 WARN NativeCodeLoader: Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 14/07/15 16:43:12 INFO SparkContext: Added JAR 
 examples/target/scala-2.10/spark-examples-1.1.0-SNAPSHOT-hadoop2.4.0.jar at 
 http://10.64.5.148:53206/jars/spark-examples-1.1.0-SNAPSHOT-hadoop2.4.0.jar 
 with timestamp 1405467792813
 14/07/15 16:43:12 INFO AppClient$ClientActor: Connecting to master 
 spark://henry-pivotal.local:7077...
 14/07/15 16:43:12 INFO SparkContext: Starting job: reduce at 
 SparkTachyonPi.scala:43
 14/07/15 16:43:12 INFO DAGScheduler: Got job 0 (reduce at 
 SparkTachyonPi.scala:43) with 2 output partitions (allowLocal=false)
 14/07/15 16:43:12 INFO DAGScheduler: Final stage: Stage 0(reduce at 
 SparkTachyonPi.scala:43)
 14/07/15 16:43:12 INFO DAGScheduler: Parents of final stage: List()
 14/07/15 16:43:12 INFO DAGScheduler: Missing parents: List()
 14/07/15 16:43:12 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[1] at map 
 at SparkTachyonPi.scala:39), which has no missing parents
 14/07/15 16:43:13 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 
 (MappedRDD[1] at map at SparkTachyonPi.scala:39)
 14/07/15 16:43:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
 14/07/15 16:43:13 INFO SparkDeploySchedulerBackend: Connected to Spark 
 cluster with app ID app-20140715164313-
 14/07/15 16:43:13 INFO AppClient$ClientActor: Executor added: 
 app-20140715164313-/0 on 
 worker-20140715164009-office-5-148.pa.gopivotal.com-52519 
 (office-5-148.pa.gopivotal.com:52519) with 8 cores
 14/07/15 16:43:13 INFO SparkDeploySchedulerBackend: Granted executor ID 
 app-20140715164313-/0 on hostPort office-5-148.pa.gopivotal.com:52519 
 with 8 cores, 512.0 MB RAM
 14/07/15

[jira] [Commented] (SPARK-2586) Lack of information to figure out connection to Tachyon master is inactive/ down

2014-07-18 Thread Henry Saputra (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067224#comment-14067224
 ] 

Henry Saputra commented on SPARK-2586:
--

Using standalone mode, I do not see any log entry about Tachyon being 
unavailable in the Master or Worker node log files.
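
As an illustration of the kind of message that is missing, the sketch below probes a host and port with a plain TCP connect and returns an explicit error description on failure. It uses only `java.net.Socket`; it does not use the Tachyon client API, and `MasterProbeSketch` and `probe` are hypothetical names.

```scala
import java.net.{InetSocketAddress, Socket}
import scala.util.{Failure, Success, Try}

object MasterProbeSketch {
  // Tries to open a TCP connection and reports a readable error if that fails.
  def probe(host: String, port: Int, timeoutMs: Int): Either[String, Unit] = {
    val socket = new Socket()
    val attempt = Try(socket.connect(new InetSocketAddress(host, port), timeoutMs))
    Try(socket.close()) // always release the socket, ignoring close errors
    attempt match {
      case Success(_) => Right(())
      case Failure(e) =>
        Left(s"Cannot reach master at $host:$port: ${e.getClass.getSimpleName}: ${e.getMessage}")
    }
  }
}
```

A caller could log the `Left` value at WARN or ERROR level so that an unreachable master shows up clearly in the driver or worker logs.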

 Lack of information to figure out connection to Tachyon master is inactive/ 
 down
 

 Key: SPARK-2586
 URL: https://issues.apache.org/jira/browse/SPARK-2586
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Henry Saputra
  Labels: tachyon

 When running Spark with Tachyon, if the connection to the Tachyon master is 
 down (due to a network problem or because the master node is down), there is no 
 clear log or error message to indicate it.
 Here is a sample log from running the SparkTachyonPi example while it connects to Tachyon:
 14/07/15 16:43:10 INFO Utils: Using Spark's default log4j profile: 
 org/apache/spark/log4j-defaults.properties
 14/07/15 16:43:10 WARN Utils: Your hostname, henry-pivotal.local resolves to 
 a loopback address: 127.0.0.1; using 10.64.5.148 instead (on interface en5)
 14/07/15 16:43:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
 another address
 14/07/15 16:43:11 INFO SecurityManager: Changing view acls to: hsaputra
 14/07/15 16:43:11 INFO SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(hsaputra)
 14/07/15 16:43:11 INFO Slf4jLogger: Slf4jLogger started
 14/07/15 16:43:11 INFO Remoting: Starting remoting
 14/07/15 16:43:11 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://sp...@office-5-148.pa.gopivotal.com:53203]
 14/07/15 16:43:11 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://sp...@office-5-148.pa.gopivotal.com:53203]
 14/07/15 16:43:11 INFO SparkEnv: Registering MapOutputTracker
 14/07/15 16:43:11 INFO SparkEnv: Registering BlockManagerMaster
 14/07/15 16:43:11 INFO DiskBlockManager: Created local directory at 
 /var/folders/nv/nsr_3ysj0wgfq93fqp0rdt3wgp/T/spark-local-20140715164311-e63c
 14/07/15 16:43:11 INFO ConnectionManager: Bound socket to port 53204 with id 
 = ConnectionManagerId(office-5-148.pa.gopivotal.com,53204)
 14/07/15 16:43:11 INFO MemoryStore: MemoryStore started with capacity 2.1 GB
 14/07/15 16:43:11 INFO BlockManagerMaster: Trying to register BlockManager
 14/07/15 16:43:11 INFO BlockManagerMasterActor: Registering block manager 
 office-5-148.pa.gopivotal.com:53204 with 2.1 GB RAM
 14/07/15 16:43:11 INFO BlockManagerMaster: Registered BlockManager
 14/07/15 16:43:11 INFO HttpServer: Starting HTTP Server
 14/07/15 16:43:11 INFO HttpBroadcast: Broadcast server started at 
 http://10.64.5.148:53205
 14/07/15 16:43:11 INFO HttpFileServer: HTTP File server directory is 
 /var/folders/nv/nsr_3ysj0wgfq93fqp0rdt3wgp/T/spark-b2fb12ae-4608-4833-87b6-b335da00738e
 14/07/15 16:43:11 INFO HttpServer: Starting HTTP Server
 14/07/15 16:43:12 INFO SparkUI: Started SparkUI at 
 http://office-5-148.pa.gopivotal.com:4040
 2014-07-15 16:43:12.210 java[39068:1903] Unable to load realm info from 
 SCDynamicStore
 14/07/15 16:43:12 WARN NativeCodeLoader: Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 14/07/15 16:43:12 INFO SparkContext: Added JAR 
 examples/target/scala-2.10/spark-examples-1.1.0-SNAPSHOT-hadoop2.4.0.jar at 
 http://10.64.5.148:53206/jars/spark-examples-1.1.0-SNAPSHOT-hadoop2.4.0.jar 
 with timestamp 1405467792813
 14/07/15 16:43:12 INFO AppClient$ClientActor: Connecting to master 
 spark://henry-pivotal.local:7077...
 14/07/15 16:43:12 INFO SparkContext: Starting job: reduce at 
 SparkTachyonPi.scala:43
 14/07/15 16:43:12 INFO DAGScheduler: Got job 0 (reduce at 
 SparkTachyonPi.scala:43) with 2 output partitions (allowLocal=false)
 14/07/15 16:43:12 INFO DAGScheduler: Final stage: Stage 0(reduce at 
 SparkTachyonPi.scala:43)
 14/07/15 16:43:12 INFO DAGScheduler: Parents of final stage: List()
 14/07/15 16:43:12 INFO DAGScheduler: Missing parents: List()
 14/07/15 16:43:12 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[1] at map 
 at SparkTachyonPi.scala:39), which has no missing parents
 14/07/15 16:43:13 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 
 (MappedRDD[1] at map at SparkTachyonPi.scala:39)
 14/07/15 16:43:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
 14/07/15 16:43:13 INFO SparkDeploySchedulerBackend: Connected to Spark 
 cluster with app ID app-20140715164313-
 14/07/15 16:43:13 INFO AppClient$ClientActor: Executor added: 
 app-20140715164313-/0 on 
 worker-20140715164009-office-5-148.pa.gopivotal.com-52519 
 (office-5-148.pa.gopivotal.com:52519) with 8 cores
 14/07/15 16:43:13 INFO SparkDeploySchedulerBackend: Granted

[jira] [Resolved] (SPARK-2513) Correlations

2014-07-18 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-2513.
--

   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1367
[https://github.com/apache/spark/pull/1367]

 Correlations
 

 Key: SPARK-2513
 URL: https://issues.apache.org/jira/browse/SPARK-2513
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Doris Xin
 Fix For: 1.1.0


 PR: https://github.com/apache/spark/pull/1367




[jira] [Created] (SPARK-2587) Error message is incorrect in make-distribution.sh

2014-07-18 Thread Mark Wagner (JIRA)

Mark Wagner created SPARK-2587:
--

 Summary: Error message is incorrect in make-distribution.sh
 Key: SPARK-2587
 URL: https://issues.apache.org/jira/browse/SPARK-2587
 Project: Spark
  Issue Type: Bug
Reporter: Mark Wagner
Priority: Minor


SPARK-2526 removed some options in favor of using Maven profiles, but the script 
now gives incorrect guidance to those who try to use the old --with-hive flag: 
"'--with-hive' is no longer supported, use Maven option -Pyarn" (the message 
names the YARN profile rather than a Hive one).




[jira] [Commented] (SPARK-2567) Resubmitted stage sometimes remains as active stage in the web UI

2014-07-18 Thread Masayoshi TSUZUKI (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067306#comment-14067306
 ] 

Masayoshi TSUZUKI commented on SPARK-2567:
--

I found this line in the log file:
{noformat}
14/07/17 18:07:36 DEBUG DAGScheduler: Stage Stage 1 is actually done; true 3 3
{noformat}
So I think this happens because an extra stage with no tasks to execute is 
submitted: SparkListenerStageSubmitted is then called, but 
SparkListenerStageCompleted is not.
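
The effect of an unmatched "submitted" event on a listener that tracks active stages can be sketched like this. The event classes here are hypothetical simplifications, not Spark's SparkListener events.

```scala
object StageTrackingSketch {
  // Hypothetical, simplified listener events.
  sealed trait Event
  final case class StageSubmitted(stageId: Int) extends Event
  final case class StageCompleted(stageId: Int) extends Event

  // A stage is active from its "submitted" event until a matching "completed" event.
  def activeStages(events: Seq[Event]): Set[Int] =
    events.foldLeft(Set.empty[Int]) {
      case (active, StageSubmitted(id)) => active + id
      case (active, StageCompleted(id)) => active - id
    }
}
```

If stage 1 is submitted, completed, and then submitted again without a second completion event, it stays in the active set, which matches the stage that remains listed as active in the web UI.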

 Resubmitted stage sometimes remains as active stage in the web UI
 -

 Key: SPARK-2567
 URL: https://issues.apache.org/jira/browse/SPARK-2567
 Project: Spark
  Issue Type: Bug
Reporter: Masayoshi TSUZUKI
 Attachments: SPARK-2567.png


 When a stage is resubmitted, for example because an executor was lost, 
 sometimes more than one resubmitted task appears in the web UI, and one stage 
 remains active even after the job has finished.




[jira] [Updated] (SPARK-2587) Error message is incorrect in make-distribution.sh

2014-07-18 Thread Mark Wagner (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Wagner updated SPARK-2587:
---

Component/s: Build

 Error message is incorrect in make-distribution.sh
 --

 Key: SPARK-2587
 URL: https://issues.apache.org/jira/browse/SPARK-2587
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Mark Wagner
Priority: Minor

 SPARK-2526 removed some options in favor of using Maven profiles, but the script 
 now gives incorrect guidance to those who try to use the old --with-hive flag: 
 "'--with-hive' is no longer supported, use Maven option -Pyarn" (the message 
 names the YARN profile rather than a Hive one).




[jira] [Commented] (SPARK-2583) ConnectionManager cannot distinguish whether error occurred or not

2014-07-18 Thread Kousuke Saruta (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067309#comment-14067309
 ] 

Kousuke Saruta commented on SPARK-2583:
---

PR: https://github.com/apache/spark/pull/1490

 ConnectionManager cannot distinguish whether error occurred or not
 --

 Key: SPARK-2583
 URL: https://issues.apache.org/jira/browse/SPARK-2583
 Project: Spark
  Issue Type: Bug
Reporter: Kousuke Saruta

 ConnectionManager#handleMessage sends an empty message to the peer whether or 
 not an error occurred in onReceiveCallback, so the peer cannot tell the two 
 cases apart.
 {code}
   val ackMessage = if (onReceiveCallback != null) {
     logDebug("Calling back")
     onReceiveCallback(bufferMessage, connectionManagerId)
   } else {
     logDebug("Not calling back as callback is null")
     None
   }
   if (ackMessage.isDefined) {
     if (!ackMessage.get.isInstanceOf[BufferMessage]) {
       logDebug("Response to " + bufferMessage + " is not a buffer message, it is of type " +
         ackMessage.get.getClass)
     } else if (!ackMessage.get.asInstanceOf[BufferMessage].hasAckId) {
       logDebug("Response to " + bufferMessage + " does not have ack id set")
       ackMessage.get.asInstanceOf[BufferMessage].ackId = bufferMessage.id
     }
   }
   // We have no way to tell peer whether error occurred or not
   sendMessage(connectionManagerId, ackMessage.getOrElse {
     Message.createBufferMessage(bufferMessage.id)
   })
 }
 {code}
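
One possible direction, shown only as a sketch, is to make the acknowledgement carry an explicit error flag. `Ack` and `handle` below are hypothetical and do not reflect Spark's BufferMessage or the change made in the linked pull request.

```scala
import scala.util.{Failure, Success, Try}

object AckSketch {
  // Hypothetical ack carrying an explicit error flag (not Spark's BufferMessage).
  final case class Ack(ackId: Int, hasError: Boolean)

  // Runs the receive callback and records whether it threw, so the sender can
  // tell a successful empty reply apart from a failed one.
  def handle(messageId: Int, onReceive: Int => Unit): Ack =
    Try(onReceive(messageId)) match {
      case Success(_) => Ack(messageId, hasError = false)
      case Failure(_) => Ack(messageId, hasError = true)
    }
}
```

The sender can then inspect `hasError` on the ack rather than treating every reply as success.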




[jira] [Updated] (SPARK-2576) slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file


 [ 
https://issues.apache.org/jira/browse/SPARK-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Svend updated SPARK-2576:
-

Description: 
Executing a SQL query against a file on HDFS systematically throws a 
class-not-found exception (NoClassDefFoundError) on the slave nodes.

(this was originally reported on the user list: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-1-spark-sql-error-java-lang-NoClassDefFoundError-Could-not-initialize-class-line11-read-tc10135.html)

Sample code (run from spark-shell): 

{code}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

import sqlContext.createSchemaRDD

case class Car(timestamp: Long, objectid: String, isGreen: Boolean)

// I get the same error when pointing to the folder hdfs://vm28:8020/test/cardata
val data = sc.textFile("hdfs://vm28:8020/test/cardata/part-0")

val cars = data.map(_.split(",")).map(ar => Car(ar(0).toLong, ar(1), ar(2).toBoolean))
cars.registerAsTable("mcars")
val allgreens = sqlContext.sql("SELECT objectid from mcars where isGreen = true")
allgreens.collect.take(10).foreach(println)
{code}

Stack trace on the slave nodes: 

{code}

I0716 13:01:16.215158 13631 exec.cpp:131] Version: 0.19.0
I0716 13:01:16.219285 13656 exec.cpp:205] Executor registered on slave 
20140714-142853-485682442-5050-25487-2
14/07/16 13:01:16 INFO MesosExecutorBackend: Registered with Mesos as executor 
ID 20140714-142853-485682442-5050-25487-2
14/07/16 13:01:16 INFO SecurityManager: Changing view acls to: mesos,mnubohadoop
14/07/16 13:01:16 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(mesos, mnubohadoop)
14/07/16 13:01:17 INFO Slf4jLogger: Slf4jLogger started
14/07/16 13:01:17 INFO Remoting: Starting remoting
14/07/16 13:01:17 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sp...@vm23-hulk-priv.mtl.mnubo.com:38230]
14/07/16 13:01:17 INFO Remoting: Remoting now listens on addresses: 
[akka.tcp://sp...@vm23-hulk-priv.mtl.mnubo.com:38230]
14/07/16 13:01:17 INFO SparkEnv: Connecting to MapOutputTracker: 
akka.tcp://spark@vm28-hulk-pub:41632/user/MapOutputTracker
14/07/16 13:01:17 INFO SparkEnv: Connecting to BlockManagerMaster: 
akka.tcp://spark@vm28-hulk-pub:41632/user/BlockManagerMaster
14/07/16 13:01:17 INFO DiskBlockManager: Created local directory at 
/tmp/spark-local-20140716130117-8ea0
14/07/16 13:01:17 INFO MemoryStore: MemoryStore started with capacity 294.9 MB.
14/07/16 13:01:17 INFO ConnectionManager: Bound socket to port 44501 with id = 
ConnectionManagerId(vm23-hulk-priv.mtl.mnubo.com,44501)
14/07/16 13:01:17 INFO BlockManagerMaster: Trying to register BlockManager
14/07/16 13:01:17 INFO BlockManagerMaster: Registered BlockManager
14/07/16 13:01:17 INFO HttpFileServer: HTTP File server directory is 
/tmp/spark-ccf6f36c-2541-4a25-8fe4-bb4ba00ee633
14/07/16 13:01:17 INFO HttpServer: Starting HTTP Server
14/07/16 13:01:18 INFO Executor: Using REPL class URI: 
http://vm28-hulk-pub:33973
14/07/16 13:01:18 INFO Executor: Running task ID 2
14/07/16 13:01:18 INFO HttpBroadcast: Started reading broadcast variable 0
14/07/16 13:01:18 INFO MemoryStore: ensureFreeSpace(125590) called with 
curMem=0, maxMem=309225062
14/07/16 13:01:18 INFO MemoryStore: Block broadcast_0 stored as values to 
memory (estimated size 122.6 KB, free 294.8 MB)
14/07/16 13:01:18 INFO HttpBroadcast: Reading broadcast variable 0 took 
0.294602722 s
14/07/16 13:01:19 INFO HadoopRDD: Input split: 
hdfs://vm28:8020/test/cardata/part-0:23960450+23960451
I0716 13:01:19.905113 13657 exec.cpp:378] Executor asked to shutdown
14/07/16 13:01:20 ERROR Executor: Exception in task ID 2
java.lang.NoClassDefFoundError: $line11/$read$
at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:19)
at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:19)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$1.next(Iterator.scala:853)
at scala.collection.Iterator$$anon$1.head(Iterator.scala:840)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:181)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:176)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at

[jira] [Updated] (SPARK-2576) slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file


 [ 
https://issues.apache.org/jira/browse/SPARK-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Svend updated SPARK-2576:
-

Description: 
Executing a SQL query against a file on HDFS systematically throws a 
NoClassDefFoundError on the slave nodes.

(This was originally reported on the user list: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-1-spark-sql-error-java-lang-NoClassDefFoundError-Could-not-initialize-class-line11-read-tc10135.html)

Sample code (run from spark-shell): 

{code}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

import sqlContext.createSchemaRDD

case class Car(timestamp: Long, objectid: String, isGreen: Boolean)

// I get the same error when pointing to the folder hdfs://vm28:8020/test/cardata
val data = sc.textFile("hdfs://vm28:8020/test/cardata/part-0")

val cars = data.map(_.split(",")).map(ar => Car(ar(0).toLong, ar(1), ar(2).toBoolean))
cars.registerAsTable("mcars")
val allgreens = sqlContext.sql("SELECT objectid from mcars where isGreen = true")
allgreens.collect.take(10).foreach(println)
{code}

Stack trace on the slave nodes: 

{code}

I0716 13:01:16.215158 13631 exec.cpp:131] Version: 0.19.0
I0716 13:01:16.219285 13656 exec.cpp:205] Executor registered on slave 
20140714-142853-485682442-5050-25487-2
14/07/16 13:01:16 INFO MesosExecutorBackend: Registered with Mesos as executor 
ID 20140714-142853-485682442-5050-25487-2
14/07/16 13:01:16 INFO SecurityManager: Changing view acls to: mesos,mnubohadoop
14/07/16 13:01:16 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(mesos, mnubohadoop)
14/07/16 13:01:17 INFO Slf4jLogger: Slf4jLogger started
14/07/16 13:01:17 INFO Remoting: Starting remoting
14/07/16 13:01:17 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://spark@vm23:38230]
14/07/16 13:01:17 INFO Remoting: Remoting now listens on addresses: 
[akka.tcp://spark@vm23:38230]
14/07/16 13:01:17 INFO SparkEnv: Connecting to MapOutputTracker: 
akka.tcp://spark@vm28:41632/user/MapOutputTracker
14/07/16 13:01:17 INFO SparkEnv: Connecting to BlockManagerMaster: 
akka.tcp://spark@vm28:41632/user/BlockManagerMaster
14/07/16 13:01:17 INFO DiskBlockManager: Created local directory at 
/tmp/spark-local-20140716130117-8ea0
14/07/16 13:01:17 INFO MemoryStore: MemoryStore started with capacity 294.9 MB.
14/07/16 13:01:17 INFO ConnectionManager: Bound socket to port 44501 with id = 
ConnectionManagerId(vm23-hulk-priv.mtl.mnubo.com,44501)
14/07/16 13:01:17 INFO BlockManagerMaster: Trying to register BlockManager
14/07/16 13:01:17 INFO BlockManagerMaster: Registered BlockManager
14/07/16 13:01:17 INFO HttpFileServer: HTTP File server directory is 
/tmp/spark-ccf6f36c-2541-4a25-8fe4-bb4ba00ee633
14/07/16 13:01:17 INFO HttpServer: Starting HTTP Server
14/07/16 13:01:18 INFO Executor: Using REPL class URI: http://vm28:33973
14/07/16 13:01:18 INFO Executor: Running task ID 2
14/07/16 13:01:18 INFO HttpBroadcast: Started reading broadcast variable 0
14/07/16 13:01:18 INFO MemoryStore: ensureFreeSpace(125590) called with 
curMem=0, maxMem=309225062
14/07/16 13:01:18 INFO MemoryStore: Block broadcast_0 stored as values to 
memory (estimated size 122.6 KB, free 294.8 MB)
14/07/16 13:01:18 INFO HttpBroadcast: Reading broadcast variable 0 took 
0.294602722 s
14/07/16 13:01:19 INFO HadoopRDD: Input split: 
hdfs://vm28:8020/test/cardata/part-0:23960450+23960451
I0716 13:01:19.905113 13657 exec.cpp:378] Executor asked to shutdown
14/07/16 13:01:20 ERROR Executor: Exception in task ID 2
java.lang.NoClassDefFoundError: $line11/$read$
at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:19)
at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:19)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$1.next(Iterator.scala:853)
at scala.collection.Iterator$$anon$1.head(Iterator.scala:840)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:181)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:176)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at

[jira] [Commented] (SPARK-2552) Stabilize the computation of logistic function in pyspark

2014-07-18 Thread Michael Yannakopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067342#comment-14067342
 ] 

Michael Yannakopoulos commented on SPARK-2552:
--

Hi Xiangrui,

From what I have seen so far, this error affects the prediction made by the
_predict_ method of _LogisticRegressionModel_, defined in 
_spark/python/pyspark/mllib/classification.py_. Does this issue occur in any 
other file as well?

I can see two ways to solve this issue:
a) Check whether the dot product between the coefficients and the data 
attributes falls in the desired range [ -745, 709 ] and, if not, set it to the 
nearer of these two bounds.

b) Create specific math functions in Java, such as the logistic function, 
softmax, etc., call them via py4j from Python 2.7, and store the result in a 
'decimal.Decimal' variable.
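
A minimal sketch of option (a) in plain Python. It assumes that the range [ -745, 709 ] is meant as the bounds on the argument passed to exp, and that the prediction is computed as 1 / (1 + exp(-margin)). The names `clamped_logistic`, `EXP_ARG_MIN` and `EXP_ARG_MAX` are hypothetical and are not part of pyspark:

```python
import math

# Assumed bounds on the argument of math.exp, taken from the range quoted above.
EXP_ARG_MIN = -745.0
EXP_ARG_MAX = 709.0


def clamped_logistic(margin):
    """Logistic function with the exp argument clamped to a safe range.

    `margin` stands for the dot product between coefficients and features.
    """
    # Clamp -margin so that math.exp never receives a value above EXP_ARG_MAX.
    z = min(max(-margin, EXP_ARG_MIN), EXP_ARG_MAX)
    return 1.0 / (1.0 + math.exp(z))
```

With the clamp in place, a call such as `clamped_logistic(-1000.0)` returns a very small positive number instead of raising OverflowError.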

Thanks,
Michael

 Stabilize the computation of logistic function in pyspark
 -

 Key: SPARK-2552
 URL: https://issues.apache.org/jira/browse/SPARK-2552
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Reporter: Xiangrui Meng
  Labels: Starter

 exp(1000) throws an error in Python. For the logistic function, we can use 
 either 1 / ( 1 + exp( -x ) ) or 1 - 1 / (1 + exp( x ) ) to compute its value, 
 choosing whichever form ensures that exp always takes a negative argument.
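
 The two forms above can be combined by branching on the sign of x. A minimal sketch in plain Python follows; `stable_logistic` is a hypothetical helper name and not the function that pyspark ships:

```python
import math


def stable_logistic(x):
    """Logistic function that never passes a positive argument to math.exp."""
    if x >= 0:
        # exp(-x) is at most 1 here, so it cannot overflow.
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, exp(x) is below 1, so the second form is safe.
    return 1.0 - 1.0 / (1.0 + math.exp(x))
```

 For large negative x the second form loses relative precision, because it subtracts two numbers that are both close to 1; it does avoid the overflow described in this issue.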




[jira] [Commented] (SPARK-2576) slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file


[ 
https://issues.apache.org/jira/browse/SPARK-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067348#comment-14067348
 ] 

Svend commented on SPARK-2576:
--

Actually, I do not have a line 19 in my code.

I just re-ran this code, with no additional line breaks: 

{code}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
case class Car(timestamp: Long, objectid: String, isGreen: Boolean)
val data = sc.textFile("hdfs://vm28:8020/test/cardata")
val cars = data.map(_.split(",")).map(ar => Car(ar(0).toLong, ar(1), ar(2).toBoolean))
cars.registerAsTable("mcars")
val allgreens = sqlContext.sql("SELECT objectid from mcars where isGreen = true")
allgreens.collect.take(10).foreach(println)
{code}


I first get this error: 

{code}
java.lang.ExceptionInInitializerError
at $line10.$read$$iwC.<init>(<console>:6)
at $line10.$read.<init>(<console>:26)
at $line10.$read$.<init>(<console>:30)
at $line10.$read$.<clinit>(<console>)
at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:19)
at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:19)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$1.next(Iterator.scala:853)
at scala.collection.Iterator$$anon$1.head(Iterator.scala:840)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:181)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:176)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.NullPointerException
at $line3.$read$$iwC$$iwC.<init>(<console>:8)
at $line3.$read$$iwC.<init>(<console>:14)
at $line3.$read.<init>(<console>:16)
at $line3.$read$.<init>(<console>:20)
at $line3.$read$.<clinit>(<console>)
... 31 more
{code}

(I do not know where lines 26, 30 or 12 come from.)

Then right after this one: 

{code}
java.lang.NoClassDefFoundError: Could not initialize class $line10.$read$
at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:19)
at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:19)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$1.next(Iterator.scala:853)
at scala.collection.Iterator$$anon$1.head(Iterator.scala:840)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:181)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:176)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at

[jira] [Comment Edited] (SPARK-2576) slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file


[ 
https://issues.apache.org/jira/browse/SPARK-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067348#comment-14067348
 ] 

Svend edited comment on SPARK-2576 at 7/19/14 2:24 AM:
---

Actually, I do not have a line 19 in my code.

I just re-ran this code, with no additional line breaks: 

{code}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
case class Car(timestamp: Long, objectid: String, isGreen: Boolean)
val data = sc.textFile("hdfs://vm28:8020/test/cardata")
val cars = data.map(_.split(",")).map(ar => Car(ar(0).toLong, ar(1), ar(2).toBoolean))
cars.registerAsTable("mcars")
val allgreens = sqlContext.sql("SELECT objectid from mcars where isGreen = true")
allgreens.collect.take(10).foreach(println)
{code}


I first get this error: 

{code}
java.lang.ExceptionInInitializerError
at $line10.$read$$iwC.<init>(<console>:6)
at $line10.$read.<init>(<console>:26)
at $line10.$read$.<init>(<console>:30)
at $line10.$read$.<clinit>(<console>)
at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:19)
at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:19)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$1.next(Iterator.scala:853)
at scala.collection.Iterator$$anon$1.head(Iterator.scala:840)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:181)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:176)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.NullPointerException
at $line3.$read$$iwC$$iwC.<init>(<console>:8)
at $line3.$read$$iwC.<init>(<console>:14)
at $line3.$read.<init>(<console>:16)
at $line3.$read$.<init>(<console>:20)
at $line3.$read$.<clinit>(<console>)
... 31 more
{code}

(I do not know where lines 26, 30 or 12 come from.)

Then right after this one: 

{code}
java.lang.NoClassDefFoundError: Could not initialize class $line10.$read$
at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:19)
at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:19)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$1.next(Iterator.scala:853)
at scala.collection.Iterator$$anon$1.head(Iterator.scala:840)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:181)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:176)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at

[jira] [Commented] (SPARK-2576) slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file


[ 
https://issues.apache.org/jira/browse/SPARK-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067349#comment-14067349
 ] 

Svend commented on SPARK-2576:
--





I just noticed that running the same spark-shell code with the session launched 
like this: 

{code}
./bin/spark-shell
{code}

(without connecting to Mesos) computes the result successfully. 

Could it be that the 

{code}
case class Car(timestamp: Long, objectid: String, isGreen: Boolean)
{code}

defined in the shell is not made available to the remote slaves on mesos?


 slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark 
 QL query on HDFS CSV file
 --

 Key: SPARK-2576
 URL: https://issues.apache.org/jira/browse/SPARK-2576
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.0.1
 Environment: One Mesos 0.19 master without zookeeper and 4 mesos 
 slaves. 
 JDK 1.7.51 and Scala 2.10.4 on all nodes. 
 HDFS from CDH5.0.3
 Spark version: I tried both with the pre-built CDH5 spark package available 
 from http://spark.apache.org/downloads.html and by packaging spark with sbt 
 0.13.2, JDK 1.7.51 and scala 2.10.4 as explained here 
 http://mesosphere.io/learn/run-spark-on-mesos/
 All nodes are running Debian 3.2.51-1 x86_64 GNU/Linux and have 
Reporter: Svend

 Executing a SQL query against a file on HDFS systematically throws a 
 NoClassDefFoundError on the slave nodes.
 (This was originally reported on the user list: 
 http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-1-spark-sql-error-java-lang-NoClassDefFoundError-Could-not-initialize-class-line11-read-tc10135.html)
 Sample code (run from spark-shell): 
 {code}
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext.createSchemaRDD
 case class Car(timestamp: Long, objectid: String, isGreen: Boolean)
 // I get the same error when pointing to the folder hdfs://vm28:8020/test/cardata
 val data = sc.textFile("hdfs://vm28:8020/test/cardata/part-0")
 val cars = data.map(_.split(",")).map(ar => Car(ar(0).toLong, ar(1), ar(2).toBoolean))
 cars.registerAsTable("mcars")
 val allgreens = sqlContext.sql("SELECT objectid from mcars where isGreen = true")
 allgreens.collect.take(10).foreach(println)
 {code}
 Stack trace on the slave nodes: 
 {code}
 I0716 13:01:16.215158 13631 exec.cpp:131] Version: 0.19.0
 I0716 13:01:16.219285 13656 exec.cpp:205] Executor registered on slave 
 20140714-142853-485682442-5050-25487-2
 14/07/16 13:01:16 INFO MesosExecutorBackend: Registered with Mesos as 
 executor ID 20140714-142853-485682442-5050-25487-2
 14/07/16 13:01:16 INFO SecurityManager: Changing view acls to: 
 mesos,mnubohadoop
 14/07/16 13:01:16 INFO SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(mesos, 
 mnubohadoop)
 14/07/16 13:01:17 INFO Slf4jLogger: Slf4jLogger started
 14/07/16 13:01:17 INFO Remoting: Starting remoting
 14/07/16 13:01:17 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://spark@vm23:38230]
 14/07/16 13:01:17 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://spark@vm23:38230]
 14/07/16 13:01:17 INFO SparkEnv: Connecting to MapOutputTracker: 
 akka.tcp://spark@vm28:41632/user/MapOutputTracker
 14/07/16 13:01:17 INFO SparkEnv: Connecting to BlockManagerMaster: 
 akka.tcp://spark@vm28:41632/user/BlockManagerMaster
 14/07/16 13:01:17 INFO DiskBlockManager: Created local directory at 
 /tmp/spark-local-20140716130117-8ea0
 14/07/16 13:01:17 INFO MemoryStore: MemoryStore started with capacity 294.9 
 MB.
 14/07/16 13:01:17 INFO ConnectionManager: Bound socket to port 44501 with id 
 = ConnectionManagerId(vm23-hulk-priv.mtl.mnubo.com,44501)
 14/07/16 13:01:17 INFO BlockManagerMaster: Trying to register BlockManager
 14/07/16 13:01:17 INFO BlockManagerMaster: Registered BlockManager
 14/07/16 13:01:17 INFO HttpFileServer: HTTP File server directory is 
 /tmp/spark-ccf6f36c-2541-4a25-8fe4-bb4ba00ee633
 14/07/16 13:01:17 INFO HttpServer: Starting HTTP Server
 14/07/16 13:01:18 INFO Executor: Using REPL class URI: http://vm28:33973
 14/07/16 13:01:18 INFO Executor: Running task ID 2
 14/07/16 13:01:18 INFO HttpBroadcast: Started reading broadcast variable 0
 14/07/16 13:01:18 INFO MemoryStore: ensureFreeSpace(125590) called with 
 curMem=0, maxMem=309225062
 14/07/16 13:01:18 INFO MemoryStore: Block broadcast_0 stored as values to 
 memory (estimated size 122.6 KB, free 294.8 MB)
 14/07/16 13:01:18 INFO HttpBroadcast: Reading broadcast variable 0 took 
 0.294602722 s
 14/07/16 13:01:19 INFO HadoopRDD: Input split: 
 hdfs://vm28:8020/test/cardata/part-0:23960450+23960451
 I0716 13:01:19.905113 13657 exec.cpp:378] Executor asked to shutdown
 14/07/16

[jira] [Comment Edited] (SPARK-2576) slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file