[jira] [Commented] (SPARK-1764) EOF reached before Python server acknowledged
[ https://issues.apache.org/jira/browse/SPARK-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066099#comment-14066099 ] nigel commented on SPARK-1764: -- Hi; Never used yarn. Doesn't happen on standalone. EOF reached before Python server acknowledged - Key: SPARK-1764 URL: https://issues.apache.org/jira/browse/SPARK-1764 Project: Spark Issue Type: Bug Components: Mesos, PySpark Affects Versions: 1.0.0 Reporter: Bouke van der Bijl Priority: Blocker Labels: mesos, pyspark I'm getting EOF reached before Python server acknowledged while using PySpark on Mesos. The error manifests itself in multiple ways. One is: 14/05/08 18:10:40 ERROR DAGSchedulerActorSupervisor: eventProcesserActor failed due to the error EOF reached before Python server acknowledged; shutting down SparkContext And the other has a full stacktrace: 14/05/08 18:03:06 ERROR OneForOneStrategy: EOF reached before Python server acknowledged org.apache.spark.SparkException: EOF reached before Python server acknowledged at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:416) at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:387) at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:71) at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:279) at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:277) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.Accumulators$.add(Accumulators.scala:277) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:818) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1204) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) This error causes the SparkContext to shutdown. I have not been able to reliably reproduce this bug, it seems to happen randomly, but if you run enough tasks on a SparkContext it'll hapen eventually -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2553) CoGroupedRDD unnecessarily allocates a Tuple2 per dep per key
[ https://issues.apache.org/jira/browse/SPARK-2553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2553: - Assignee: Sandy Ryza CoGroupedRDD unnecessarily allocates a Tuple2 per dep per key - Key: SPARK-2553 URL: https://issues.apache.org/jira/browse/SPARK-2553 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 1.0.0 Reporter: Sandy Ryza Assignee: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2553) CoGroupedRDD unnecessarily allocates a Tuple2 per dep per key
[ https://issues.apache.org/jira/browse/SPARK-2553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-2553. -- Resolution: Fixed Target Version/s: 1.1.0 CoGroupedRDD unnecessarily allocates a Tuple2 per dep per key - Key: SPARK-2553 URL: https://issues.apache.org/jira/browse/SPARK-2553 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 1.0.0 Reporter: Sandy Ryza Assignee: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2568) RangePartitioner should go through the data only once
[ https://issues.apache.org/jira/browse/SPARK-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2568: --- Assignee: Xiangrui Meng RangePartitioner should go through the data only once - Key: SPARK-2568 URL: https://issues.apache.org/jira/browse/SPARK-2568 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Xiangrui Meng As of Spark 1.0, RangePartitioner goes through data twice: once to compute the count and once to do sampling. As a result, to do sortByKey, Spark goes through data 3 times (once to count, once to sample, and once to sort). RangePartitioner should go through data only once (remove the count step). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2572) Can't delete local dir on executor automatically when running spark over Mesos.
Yadong Qi created SPARK-2572: Summary: Can't delete local dir on executor automatically when running spark over Mesos. Key: SPARK-2572 URL: https://issues.apache.org/jira/browse/SPARK-2572 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Yadong Qi When running spark over Mesos in “fine-grained” modes or “coarse-grained” mode. After the application finished.The local dir(/tmp/spark-local-20140718114058-834c) on executor can't not delete automatically. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2573) This file make-distribution.sh has an error, please fix it
王金子 created SPARK-2573: -- Summary: This file make-distribution.sh has an error, please fix it Key: SPARK-2573 URL: https://issues.apache.org/jira/browse/SPARK-2573 Project: Spark Issue Type: Bug Reporter: 王金子 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2573) This file make-distribution.sh has an error, please fix it
[ https://issues.apache.org/jira/browse/SPARK-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 王金子 updated SPARK-2573: --- Component/s: Build Description: line 61: echo Error: '--with-hive' is no longer supported, use Maven option -Pyarn. It should be use -Phive.Otherwise, i like the oldest more,why update it Target Version/s: 1.0.0 Affects Version/s: 1.0.0 Labels: build (was: ) Summary: This file make-distribution.sh has an error, please fix it (was: This file make-distribution.sh has an error, please fix it) This file make-distribution.sh has an error, please fix it Key: SPARK-2573 URL: https://issues.apache.org/jira/browse/SPARK-2573 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0 Reporter: 王金子 Labels: build line 61: echo Error: '--with-hive' is no longer supported, use Maven option -Pyarn. It should be use -Phive.Otherwise, i like the oldest more,why update it -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2574) Avoid allocating new ArrayBuffer in groupByKey's mergeCombiner
Sandy Ryza created SPARK-2574: - Summary: Avoid allocating new ArrayBuffer in groupByKey's mergeCombiner Key: SPARK-2574 URL: https://issues.apache.org/jira/browse/SPARK-2574 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 1.0.0 Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2491) When an OOM is thrown,the executor does not stop properly.
[ https://issues.apache.org/jira/browse/SPARK-2491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066176#comment-14066176 ] Guoqiang Li commented on SPARK-2491: [~sarutak] I submitted a patch. https://github.com/apache/spark/pull/1482 When an OOM is thrown,the executor does not stop properly. -- Key: SPARK-2491 URL: https://issues.apache.org/jira/browse/SPARK-2491 Project: Spark Issue Type: Bug Reporter: Guoqiang Li The executor log: {code} # # java.lang.OutOfMemoryError: Java heap space # -XX:OnOutOfMemoryError=kill %p # Executing /bin/sh -c kill 44942... 14/07/15 10:38:29 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM 14/07/15 10:38:29 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Connection manager future execution context-6,5,main] java.lang.OutOfMemoryError: Java heap space at java.nio.HeapByteBuffer.init(HeapByteBuffer.java:57) at java.nio.ByteBuffer.allocate(ByteBuffer.java:331) at org.apache.spark.storage.BlockMessage.set(BlockMessage.scala:94) at org.apache.spark.storage.BlockMessage$.fromByteBuffer(BlockMessage.scala:176) at org.apache.spark.storage.BlockMessageArray.set(BlockMessageArray.scala:63) at org.apache.spark.storage.BlockMessageArray$.fromBufferMessage(BlockMessageArray.scala:109) at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:125) at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:122) at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117) at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) 14/07/15 10:38:29 WARN HadoopRDD: Exception in RecordReader.close() java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703) at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619) at java.io.FilterInputStream.close(FilterInputStream.java:181) at org.apache.hadoop.util.LineReader.close(LineReader.java:150) at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:243) at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226) at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63) at org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197) at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63) at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) - 14/07/15 10:38:30 INFO Executor: Running task ID 969 14/07/15 10:38:30 INFO BlockManager: Found block broadcast_0 locally 14/07/15 10:38:30 INFO HadoopRDD: Input split: hdfs://10dian72.domain.test:8020/input/lbs/recommend/toona/rating/20140712/part-7:0+68016537 14/07/15 10:38:30 ERROR Executor: Exception in task ID 969 java.io.FileNotFoundException: /yarn/nm/usercache/spark/appcache/application_1404728465401_0070/spark-local-20140715103235-ffda/2e/merged_shuffle_4_85_0 (No such file or directory) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.init(FileOutputStream.java:221) at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:116) at
[jira] [Commented] (SPARK-1662) PySpark fails if python class is used as a data container
[ https://issues.apache.org/jira/browse/SPARK-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066201#comment-14066201 ] Chandan Kumar commented on SPARK-1662: -- Sounds good. Closing the issue. PySpark fails if python class is used as a data container - Key: SPARK-1662 URL: https://issues.apache.org/jira/browse/SPARK-1662 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Environment: Ubuntu 14, Python 2.7.6 Reporter: Chandan Kumar Priority: Minor PySpark fails if RDD operations are performed on data encapsulated in Python objects (rare use case where plain python objects are used as data containers instead of regular dict or tuples). I have written a small piece of code to reproduce the bug: https://gist.github.com/nrchandan/11394440 script src=https://gist.github.com/nrchandan/11394440.js;/script -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2576) slave node throw NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file
Svend created SPARK-2576: Summary: slave node throw NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file Key: SPARK-2576 URL: https://issues.apache.org/jira/browse/SPARK-2576 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.0.1 Environment: One Mesos 0.19 master without zookeeper and 4 mesos slaves. JDK 1.7.51 and Scala 2.10.4 on all nodes. HDFS from CDH5.0.3 Spark version: I tried both with the pre-built CDH5 spark package available from http://spark.apache.org/downloads.html and by packaging spark with sbt 0.13.2, JDK 1.7.51 and scala 2.10.4 as explained here http://mesosphere.io/learn/run-spark-on-mesos/ All nodes are running Debian 3.2.51-1 x86_64 GNU/Linux and have Reporter: Svend Execution of SQL query against HDFS systematically throws a class not found exception on slave nodes when executing . (this was originally reported on the user list: http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-1-spark-sql-error-java-lang-NoClassDefFoundError-Could-not-initialize-class-line11-read-tc10135.html) Sample code (ran from spark-shell): {code} val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.createSchemaRDD case class Car(timestamp: Long, objectid: String, isGreen: Boolean) # I get the same error when pointing to the folder hdfs://vm28:8020/test/cardata val data = sc.textFile(hdfs://vm28:8020/test/cardata/part-0) val cars = data.map(_.split(,)).map ( ar = Car(ar(0).toLong, ar(1), ar(2).toBoolean)) cars.registerAsTable(mcars) val allgreens = sqlContext.sql(SELECT objectid from mcars where isGreen = true) allgreens.collect.take(10).foreach(println) {code} Stack trace on the slave nodes: {code} I0716 13:01:16.215158 13631 exec.cpp:131] Version: 0.19.0 I0716 13:01:16.219285 13656 exec.cpp:205] Executor registered on slave 20140714-142853-485682442-5050-25487-2 14/07/16 13:01:16 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20140714-142853-485682442-5050-25487-2 14/07/16 13:01:16 INFO SecurityManager: Changing view acls to: mesos,mnubohadoop 14/07/16 13:01:16 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mesos, mnubohadoop) 14/07/16 13:01:17 INFO Slf4jLogger: Slf4jLogger started 14/07/16 13:01:17 INFO Remoting: Starting remoting 14/07/16 13:01:17 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sp...@vm23-hulk-priv.mtl.mnubo.com:38230] 14/07/16 13:01:17 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sp...@vm23-hulk-priv.mtl.mnubo.com:38230] 14/07/16 13:01:17 INFO SparkEnv: Connecting to MapOutputTracker: akka.tcp://spark@vm28-hulk-pub:41632/user/MapOutputTracker 14/07/16 13:01:17 INFO SparkEnv: Connecting to BlockManagerMaster: akka.tcp://spark@vm28-hulk-pub:41632/user/BlockManagerMaster 14/07/16 13:01:17 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140716130117-8ea0 14/07/16 13:01:17 INFO MemoryStore: MemoryStore started with capacity 294.9 MB. 14/07/16 13:01:17 INFO ConnectionManager: Bound socket to port 44501 with id = ConnectionManagerId(vm23-hulk-priv.mtl.mnubo.com,44501) 14/07/16 13:01:17 INFO BlockManagerMaster: Trying to register BlockManager 14/07/16 13:01:17 INFO BlockManagerMaster: Registered BlockManager 14/07/16 13:01:17 INFO HttpFileServer: HTTP File server directory is /tmp/spark-ccf6f36c-2541-4a25-8fe4-bb4ba00ee633 14/07/16 13:01:17 INFO HttpServer: Starting HTTP Server 14/07/16 13:01:18 INFO Executor: Using REPL class URI: http://vm28-hulk-pub:33973 14/07/16 13:01:18 INFO Executor: Running task ID 2 14/07/16 13:01:18 INFO HttpBroadcast: Started reading broadcast variable 0 14/07/16 13:01:18 INFO MemoryStore: ensureFreeSpace(125590) called with curMem=0, maxMem=309225062 14/07/16 13:01:18 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 122.6 KB, free 294.8 MB) 14/07/16 13:01:18 INFO HttpBroadcast: Reading broadcast variable 0 took 0.294602722 s 14/07/16 13:01:19 INFO HadoopRDD: Input split: hdfs://vm28-hulk-priv:8020/svend/testimport/com_mnubo_agriculture60_1396310400/part-0:23960450+23960451 I0716 13:01:19.905113 13657 exec.cpp:378] Executor asked to shutdown 14/07/16 13:01:20 ERROR Executor: Exception in task ID 2 java.lang.NoClassDefFoundError: $line11/$read$ at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:19) at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:19) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$1.next(Iterator.scala:853) at scala.collection.Iterator$$anon$1.head(Iterator.scala:840) at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:181) at
[jira] [Closed] (SPARK-1662) PySpark fails if python class is used as a data container
[ https://issues.apache.org/jira/browse/SPARK-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandan Kumar closed SPARK-1662. Resolution: Not a Problem The issue is due to a limitation with Python's pickle mechanism. Probably not worth the effort to use something other than pickle. There are workarounds anyway. PySpark fails if python class is used as a data container - Key: SPARK-1662 URL: https://issues.apache.org/jira/browse/SPARK-1662 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Environment: Ubuntu 14, Python 2.7.6 Reporter: Chandan Kumar Priority: Minor PySpark fails if RDD operations are performed on data encapsulated in Python objects (rare use case where plain python objects are used as data containers instead of regular dict or tuples). I have written a small piece of code to reproduce the bug: https://gist.github.com/nrchandan/11394440 script src=https://gist.github.com/nrchandan/11394440.js;/script -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2576) slave node throw NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file
[ https://issues.apache.org/jira/browse/SPARK-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066205#comment-14066205 ] Svend commented on SPARK-2576: -- I see another person is reporting a similar issue on the mailing list with a similar stack (Spark 1.0.1 and CDH 5.0.3): http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-1-spark-sql-error-java-lang-NoClassDefFoundError-Could-not-initialize-class-line11-read-tc10135.html slave node throw NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file - Key: SPARK-2576 URL: https://issues.apache.org/jira/browse/SPARK-2576 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.0.1 Environment: One Mesos 0.19 master without zookeeper and 4 mesos slaves. JDK 1.7.51 and Scala 2.10.4 on all nodes. HDFS from CDH5.0.3 Spark version: I tried both with the pre-built CDH5 spark package available from http://spark.apache.org/downloads.html and by packaging spark with sbt 0.13.2, JDK 1.7.51 and scala 2.10.4 as explained here http://mesosphere.io/learn/run-spark-on-mesos/ All nodes are running Debian 3.2.51-1 x86_64 GNU/Linux and have Reporter: Svend Execution of SQL query against HDFS systematically throws a class not found exception on slave nodes when executing . (this was originally reported on the user list: http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-1-spark-sql-error-java-lang-NoClassDefFoundError-Could-not-initialize-class-line11-read-tc10135.html) Sample code (ran from spark-shell): {code} val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.createSchemaRDD case class Car(timestamp: Long, objectid: String, isGreen: Boolean) # I get the same error when pointing to the folder hdfs://vm28:8020/test/cardata val data = sc.textFile(hdfs://vm28:8020/test/cardata/part-0) val cars = data.map(_.split(,)).map ( ar = Car(ar(0).toLong, ar(1), ar(2).toBoolean)) cars.registerAsTable(mcars) val allgreens = sqlContext.sql(SELECT objectid from mcars where isGreen = true) allgreens.collect.take(10).foreach(println) {code} Stack trace on the slave nodes: {code} I0716 13:01:16.215158 13631 exec.cpp:131] Version: 0.19.0 I0716 13:01:16.219285 13656 exec.cpp:205] Executor registered on slave 20140714-142853-485682442-5050-25487-2 14/07/16 13:01:16 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20140714-142853-485682442-5050-25487-2 14/07/16 13:01:16 INFO SecurityManager: Changing view acls to: mesos,mnubohadoop 14/07/16 13:01:16 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mesos, mnubohadoop) 14/07/16 13:01:17 INFO Slf4jLogger: Slf4jLogger started 14/07/16 13:01:17 INFO Remoting: Starting remoting 14/07/16 13:01:17 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sp...@vm23-hulk-priv.mtl.mnubo.com:38230] 14/07/16 13:01:17 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sp...@vm23-hulk-priv.mtl.mnubo.com:38230] 14/07/16 13:01:17 INFO SparkEnv: Connecting to MapOutputTracker: akka.tcp://spark@vm28-hulk-pub:41632/user/MapOutputTracker 14/07/16 13:01:17 INFO SparkEnv: Connecting to BlockManagerMaster: akka.tcp://spark@vm28-hulk-pub:41632/user/BlockManagerMaster 14/07/16 13:01:17 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140716130117-8ea0 14/07/16 13:01:17 INFO MemoryStore: MemoryStore started with capacity 294.9 MB. 14/07/16 13:01:17 INFO ConnectionManager: Bound socket to port 44501 with id = ConnectionManagerId(vm23-hulk-priv.mtl.mnubo.com,44501) 14/07/16 13:01:17 INFO BlockManagerMaster: Trying to register BlockManager 14/07/16 13:01:17 INFO BlockManagerMaster: Registered BlockManager 14/07/16 13:01:17 INFO HttpFileServer: HTTP File server directory is /tmp/spark-ccf6f36c-2541-4a25-8fe4-bb4ba00ee633 14/07/16 13:01:17 INFO HttpServer: Starting HTTP Server 14/07/16 13:01:18 INFO Executor: Using REPL class URI: http://vm28-hulk-pub:33973 14/07/16 13:01:18 INFO Executor: Running task ID 2 14/07/16 13:01:18 INFO HttpBroadcast: Started reading broadcast variable 0 14/07/16 13:01:18 INFO MemoryStore: ensureFreeSpace(125590) called with curMem=0, maxMem=309225062 14/07/16 13:01:18 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 122.6 KB, free 294.8 MB) 14/07/16 13:01:18 INFO HttpBroadcast: Reading broadcast variable 0 took 0.294602722 s 14/07/16 13:01:19 INFO HadoopRDD: Input split: hdfs://vm28-hulk-priv:8020/svend/testimport/com_mnubo_agriculture60_1396310400/part-0:23960450+23960451 I0716 13:01:19.905113 13657 exec.cpp:378] Executor
[jira] [Created] (SPARK-2577) File upload to viewfs is broken due to mount point resolution
Gera Shegalov created SPARK-2577: Summary: File upload to viewfs is broken due to mount point resolution Key: SPARK-2577 URL: https://issues.apache.org/jira/browse/SPARK-2577 Project: Spark Issue Type: Bug Components: YARN Reporter: Gera Shegalov Priority: Blocker YARN client resolves paths of uploaded artifacts. When a viewfs path is resolved, the filesystem changes to the target file system. However, the original fs is passed to {{ClientDistributedCacheManager#addResource}}. {code} 14/07/18 01:30:31 INFO yarn.Client: Uploading file:/Users/gshegalov/workspace/spark-tw/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop3.0.0-SNAPSHOT.jar to viewfs:/user/gshegalov/.sparkStaging/application_1405479201490_0049/spark-assembly-1.1.0-SNAPSHOT-hadoop3.0.0-SNAPSHOT.jar Exception in thread main java.lang.IllegalArgumentException: Wrong FS: hdfs://ns1:8020/user/gshegalov/.sparkStaging/application_1405479201490_0049/spark-assembly-1.1.0-SNAPSHOT-hadoop3.0.0-SNAPSHOT.jar, expected: viewfs:/ at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643) at org.apache.hadoop.fs.viewfs.ViewFileSystem.getUriPath(ViewFileSystem.java:116) at org.apache.hadoop.fs.viewfs.ViewFileSystem.getFileStatus(ViewFileSystem.java:345) at org.apache.spark.deploy.yarn.ClientDistributedCacheManager.addResource(ClientDistributedCacheManager.scala:72) at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:236) at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:229) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:229) at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:37) at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:74) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:81) at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:136) at org.apache.spark.SparkContext.init(SparkContext.scala:320) at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:28) at org.apache.spark.examples.SparkPi.main(SparkPi.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} There are two options: # do not resolve path because symlinks are currently disabled in Hadoop # pass the correct filesystem object -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2578) RIGHT OUTER JOIN causes ClassCastException
Christian Wuertz created SPARK-2578: --- Summary: RIGHT OUTER JOIN causes ClassCastException Key: SPARK-2578 URL: https://issues.apache.org/jira/browse/SPARK-2578 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Christian Wuertz When I run a hql query that contains a right outer join I always get this exception: {code} org.apache.spark.SparkDriverExecutionException: Execution error at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:849) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1231) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: java.lang.ClassCastException: scala.collection.mutable.HashSet cannot be cast to scala.collection.mutable.BitSet at org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$7.apply(joins.scala:338) at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:811) at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:808) at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:845) ... 10 more {code} This can easily reproduced using the queries and data from this [tutorial|http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/]. Only the last query was modified to use a right outer join. {code} hiveContext.hql(SELECT a.year, a.player_id, a.runs from batting a JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b ON (a.year = b.year AND a.runs = b.runs)).collect.foreach(println) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2578) RIGHT OUTER JOIN causes ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Wuertz updated SPARK-2578: Description: When I run a hql query that contains a right outer join I always get this exception: {code} org.apache.spark.SparkDriverExecutionException: Execution error at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:849) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1231) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: java.lang.ClassCastException: scala.collection.mutable.HashSet cannot be cast to scala.collection.mutable.BitSet at org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$7.apply(joins.scala:338) at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:811) at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:808) at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:845) ... 10 more {code} This can easily reproduced using the queries and data from this [tutorial|http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/]. Only the last query was modified to use a right outer join. {code} hiveContext.hql(SELECT a.year, a.player_id, a.runs from batting a RIGHT OUTER JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b ON (a.year = b.year AND a.runs = b.runs)).collect.foreach(println) {code} was: When I run a hql query that contains a right outer join I always get this exception: {code} org.apache.spark.SparkDriverExecutionException: Execution error at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:849) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1231) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: java.lang.ClassCastException: scala.collection.mutable.HashSet cannot be cast to scala.collection.mutable.BitSet at org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$7.apply(joins.scala:338) at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:811) at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:808) at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:845) ... 10 more {code} This can easily reproduced using the queries and data from this [tutorial|http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/]. Only the last query was modified to use a right outer join. {code} hiveContext.hql(SELECT a.year, a.player_id, a.runs from batting a JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b ON (a.year = b.year AND a.runs = b.runs)).collect.foreach(println) {code} RIGHT OUTER JOIN causes ClassCastException -- Key: SPARK-2578 URL: https://issues.apache.org/jira/browse/SPARK-2578 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Christian Wuertz When I run a hql query that contains a right outer join I always get this exception: {code} org.apache.spark.SparkDriverExecutionException: Execution error at
[jira] [Updated] (SPARK-2578) RIGHT OUTER JOIN causes ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Wuertz updated SPARK-2578: Description: When I run a hql query that contains a right outer join I always get this exception: {code} org.apache.spark.SparkDriverExecutionException: Execution error at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:849) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1231) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: java.lang.ClassCastException: scala.collection.mutable.HashSet cannot be cast to scala.collection.mutable.BitSet at org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$7.apply(joins.scala:338) at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:811) at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:808) at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:845) ... 10 more {code} This can easily reproduced using the queries and data from this [tutorial|http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/]. Only the last query was modified to use a right outer join. {code} hiveContext.hql(SELECT a.year, a.player_id, a.runs from batting a RIGHT OUTER JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b ON (a.year = b.year AND a.runs = b.runs)).collect.foreach(println) {code} I compiled the 1.0.1 release myself with the following command: ./make-distribution.sh --hadoop=2.2.0 --with-yarn --with-hive --tgz was: When I run a hql query that contains a right outer join I always get this exception: {code} org.apache.spark.SparkDriverExecutionException: Execution error at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:849) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1231) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: java.lang.ClassCastException: scala.collection.mutable.HashSet cannot be cast to scala.collection.mutable.BitSet at org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$7.apply(joins.scala:338) at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:811) at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:808) at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:845) ... 10 more {code} This can easily reproduced using the queries and data from this [tutorial|http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/]. Only the last query was modified to use a right outer join. {code} hiveContext.hql(SELECT a.year, a.player_id, a.runs from batting a RIGHT OUTER JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b ON (a.year = b.year AND a.runs = b.runs)).collect.foreach(println) {code} RIGHT OUTER JOIN causes ClassCastException -- Key: SPARK-2578 URL: https://issues.apache.org/jira/browse/SPARK-2578 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Christian Wuertz When I run a hql query that contains a right outer join I always get this
[jira] [Commented] (SPARK-2577) File upload to viewfs is broken due to mount point resolution
[ https://issues.apache.org/jira/browse/SPARK-2577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066243#comment-14066243 ] Gera Shegalov commented on SPARK-2577: -- https://github.com/apache/spark/pull/1483 File upload to viewfs is broken due to mount point resolution - Key: SPARK-2577 URL: https://issues.apache.org/jira/browse/SPARK-2577 Project: Spark Issue Type: Bug Components: YARN Reporter: Gera Shegalov Priority: Blocker YARN client resolves paths of uploaded artifacts. When a viewfs path is resolved, the filesystem changes to the target file system. However, the original fs is passed to {{ClientDistributedCacheManager#addResource}}. {code} 14/07/18 01:30:31 INFO yarn.Client: Uploading file:/Users/gshegalov/workspace/spark-tw/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop3.0.0-SNAPSHOT.jar to viewfs:/user/gshegalov/.sparkStaging/application_1405479201490_0049/spark-assembly-1.1.0-SNAPSHOT-hadoop3.0.0-SNAPSHOT.jar Exception in thread main java.lang.IllegalArgumentException: Wrong FS: hdfs://ns1:8020/user/gshegalov/.sparkStaging/application_1405479201490_0049/spark-assembly-1.1.0-SNAPSHOT-hadoop3.0.0-SNAPSHOT.jar, expected: viewfs:/ at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643) at org.apache.hadoop.fs.viewfs.ViewFileSystem.getUriPath(ViewFileSystem.java:116) at org.apache.hadoop.fs.viewfs.ViewFileSystem.getFileStatus(ViewFileSystem.java:345) at org.apache.spark.deploy.yarn.ClientDistributedCacheManager.addResource(ClientDistributedCacheManager.scala:72) at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:236) at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:229) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:229) at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:37) at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:74) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:81) at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:136) at org.apache.spark.SparkContext.init(SparkContext.scala:320) at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:28) at org.apache.spark.examples.SparkPi.main(SparkPi.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} There are two options: # do not resolve path because symlinks are currently disabled in Hadoop # pass the correct filesystem object -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2578) OUTER JOINs cause ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Wuertz updated SPARK-2578: Summary: OUTER JOINs cause ClassCastException (was: RIGHT OUTER JOIN causes ClassCastException) OUTER JOINs cause ClassCastException Key: SPARK-2578 URL: https://issues.apache.org/jira/browse/SPARK-2578 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Christian Wuertz When I run a hql query that contains a right outer join I always get this exception: {code} org.apache.spark.SparkDriverExecutionException: Execution error at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:849) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1231) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: java.lang.ClassCastException: scala.collection.mutable.HashSet cannot be cast to scala.collection.mutable.BitSet at org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$7.apply(joins.scala:338) at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:811) at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:808) at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:845) ... 10 more {code} This can easily reproduced using the queries and data from this [tutorial|http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/]. Only the last query was modified to use a right outer join. {code} hiveContext.hql(SELECT a.year, a.player_id, a.runs from batting a RIGHT OUTER JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b ON (a.year = b.year AND a.runs = b.runs)).collect.foreach(println) {code} I compiled the 1.0.1 release myself with the following command: ./make-distribution.sh --hadoop=2.2.0 --with-yarn --with-hive --tgz -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2578) OUTER JOINs cause ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Wuertz updated SPARK-2578: Description: When I run a hql query that contains a right or a left outer join I always get this exception: {code} org.apache.spark.SparkDriverExecutionException: Execution error at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:849) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1231) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: java.lang.ClassCastException: scala.collection.mutable.HashSet cannot be cast to scala.collection.mutable.BitSet at org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$7.apply(joins.scala:338) at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:811) at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:808) at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:845) ... 10 more {code} This can easily reproduced using the queries and data from this [tutorial|http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/]. Only the last query was modified to use a right outer join. {code} hiveContext.hql(SELECT a.year, a.player_id, a.runs from batting a RIGHT OUTER JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b ON (a.year = b.year AND a.runs = b.runs)).collect.foreach(println) {code} I compiled the 1.0.1 release myself with the following command: ./make-distribution.sh --hadoop=2.2.0 --with-yarn --with-hive --tgz was: When I run a hql query that contains a right outer join I always get this exception: {code} org.apache.spark.SparkDriverExecutionException: Execution error at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:849) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1231) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: java.lang.ClassCastException: scala.collection.mutable.HashSet cannot be cast to scala.collection.mutable.BitSet at org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$7.apply(joins.scala:338) at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:811) at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:808) at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:845) ... 10 more {code} This can easily reproduced using the queries and data from this [tutorial|http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/]. Only the last query was modified to use a right outer join. {code} hiveContext.hql(SELECT a.year, a.player_id, a.runs from batting a RIGHT OUTER JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b ON (a.year = b.year AND a.runs = b.runs)).collect.foreach(println) {code} I compiled the 1.0.1 release myself with the following command: ./make-distribution.sh --hadoop=2.2.0 --with-yarn --with-hive --tgz OUTER JOINs cause ClassCastException Key: SPARK-2578 URL: https://issues.apache.org/jira/browse/SPARK-2578 Project: Spark Issue Type: Bug Components: SQL Affects
[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets
[ https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066283#comment-14066283 ] Sean Owen commented on SPARK-2341: -- [~mengxr] Here is an example of changing the argument: https://github.com/srowen/spark/commit/4a584ff9c0ada3d035d4668ecf22ec0e65ed16b6 I won't open a PR yet. I think this is a better API at this point, but the question is more whether the weight of deprecated methods are worth it or not. Another data point to keep in mind regarding how APIs can evolve. loadLibSVMFile doesn't handle regression datasets - Key: SPARK-2341 URL: https://issues.apache.org/jira/browse/SPARK-2341 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: Eustache Priority: Minor Labels: easyfix Many datasets exist in LibSVM format for regression tasks [1] but currently the loadLibSVMFile primitive doesn't handle regression datasets. More precisely, the LabelParser is either a MulticlassLabelParser or a BinaryLabelParser. What happens then is that the file is loaded but in multiclass mode : each target value is interpreted as a class name ! The fix would be to write a RegressionLabelParser which converts target values to Double and plug it into the loadLibSVMFile routine. [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2491) When an OOM is thrown,the executor does not stop properly.
[ https://issues.apache.org/jira/browse/SPARK-2491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-2491: --- Component/s: YARN When an OOM is thrown,the executor does not stop properly. -- Key: SPARK-2491 URL: https://issues.apache.org/jira/browse/SPARK-2491 Project: Spark Issue Type: Bug Components: YARN Reporter: Guoqiang Li The executor log: {code} # # java.lang.OutOfMemoryError: Java heap space # -XX:OnOutOfMemoryError=kill %p # Executing /bin/sh -c kill 44942... 14/07/15 10:38:29 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM 14/07/15 10:38:29 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Connection manager future execution context-6,5,main] java.lang.OutOfMemoryError: Java heap space at java.nio.HeapByteBuffer.init(HeapByteBuffer.java:57) at java.nio.ByteBuffer.allocate(ByteBuffer.java:331) at org.apache.spark.storage.BlockMessage.set(BlockMessage.scala:94) at org.apache.spark.storage.BlockMessage$.fromByteBuffer(BlockMessage.scala:176) at org.apache.spark.storage.BlockMessageArray.set(BlockMessageArray.scala:63) at org.apache.spark.storage.BlockMessageArray$.fromBufferMessage(BlockMessageArray.scala:109) at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:125) at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:122) at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117) at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) 14/07/15 10:38:29 WARN HadoopRDD: Exception in RecordReader.close() java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703) at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619) at java.io.FilterInputStream.close(FilterInputStream.java:181) at org.apache.hadoop.util.LineReader.close(LineReader.java:150) at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:243) at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226) at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63) at org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197) at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63) at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) - 14/07/15 10:38:30 INFO Executor: Running task ID 969 14/07/15 10:38:30 INFO BlockManager: Found block broadcast_0 locally 14/07/15 10:38:30 INFO HadoopRDD: Input split: hdfs://10dian72.domain.test:8020/input/lbs/recommend/toona/rating/20140712/part-7:0+68016537 14/07/15 10:38:30 ERROR Executor: Exception in task ID 969 java.io.FileNotFoundException: /yarn/nm/usercache/spark/appcache/application_1404728465401_0070/spark-local-20140715103235-ffda/2e/merged_shuffle_4_85_0 (No such file or directory) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.init(FileOutputStream.java:221) at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:116) at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:177) at
[jira] [Commented] (SPARK-2256) pyspark: RDD.take doesn't work ... sometimes ...
[ https://issues.apache.org/jira/browse/SPARK-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066390#comment-14066390 ] Matthew Farrellee commented on SPARK-2256: -- ok. i don't have a windows machine to try. good luck! pyspark: RDD.take doesn't work ... sometimes ... -- Key: SPARK-2256 URL: https://issues.apache.org/jira/browse/SPARK-2256 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Environment: local file/remote HDFS Reporter: Ángel Álvarez Labels: RDD, pyspark, take Attachments: A_test.zip If I try to take some lines from a file, sometimes it doesn't work Code: myfile = sc.textFile(A_ko) print myfile.take(10) Stacktrace: 14/06/24 09:29:27 INFO DAGScheduler: Failed to run take at mytest.py:19 Traceback (most recent call last): File mytest.py, line 19, in module print myfile.take(10) File spark-1.0.0-bin-hadoop2\python\pyspark\rdd.py, line 868, in take iterator = mapped._jrdd.collectPartitions(partitionsToTake)[0].iterator() File spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\java_gateway.py, line 537, in __call__ File spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\protocol.py, line 300, in get_return_value Test data: START TEST DATA A A A
[jira] [Commented] (SPARK-677) PySpark should not collect results through local filesystem
[ https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066458#comment-14066458 ] Matthew Farrellee commented on SPARK-677: - this can also be used to address the fragile nature of py4j connection construction. the parent can create the fifo. PySpark should not collect results through local filesystem --- Key: SPARK-677 URL: https://issues.apache.org/jira/browse/SPARK-677 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 0.7.0 Reporter: Josh Rosen Py4J is slow when transferring large arrays, so PySpark currently dumps data to the disk and reads it back in order to collect() RDDs. On large enough datasets, this data will spill from the buffer cache and write to the physical disk, resulting in terrible performance. Instead, we should stream the data from Java to Python over a local socket or a FIFO. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1305) Support persisting RDD's directly to Tachyon
[ https://issues.apache.org/jira/browse/SPARK-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066465#comment-14066465 ] Denis Serduik commented on SPARK-1305: -- I'm interesting in using this feature especially with SchemeRDD to be able cache intermediate results. Where I can find some code examples how to use it ? From SparkContext's code I see the following: private[spark] def persistRDD(rdd: RDD[_]) { persistentRdds(rdd.id) = rdd } /** * Unpersist an RDD from memory and/or disk storage */ private[spark] def unpersistRDD(rddId: Int, blocking: Boolean = true) { env.blockManager.master.removeRdd(rddId, blocking) persistentRdds.remove(rddId) listenerBus.post(SparkListenerUnpersistRDD(rddId)) } So persist don't put RDD into tachyonStore of blockmanager ? How does this feature work ? Support persisting RDD's directly to Tachyon Key: SPARK-1305 URL: https://issues.apache.org/jira/browse/SPARK-1305 Project: Spark Issue Type: New Feature Components: Block Manager Reporter: Patrick Wendell Assignee: Haoyuan Li Priority: Blocker Fix For: 1.0.0 This is already an ongoing pull request - in a nutshell we want to support Tachyon as a storage level in Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-1305) Support persisting RDD's directly to Tachyon
[ https://issues.apache.org/jira/browse/SPARK-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066465#comment-14066465 ] Denis Serduik edited comment on SPARK-1305 at 7/18/14 4:00 PM: --- I'm interested in using this feature especially with SchemeRDD to be able cache intermediate results. Where I can find some code examples how to use it ? From SparkContext's code I see the following: private[spark] def persistRDD(rdd: RDD[_]) { persistentRdds(rdd.id) = rdd } /** * Unpersist an RDD from memory and/or disk storage */ private[spark] def unpersistRDD(rddId: Int, blocking: Boolean = true) { env.blockManager.master.removeRdd(rddId, blocking) persistentRdds.remove(rddId) listenerBus.post(SparkListenerUnpersistRDD(rddId)) } So persist don't put RDD into tachyonStore of blockmanager ? How does this feature work ? was (Author: dmaverick): I'm interesting in using this feature especially with SchemeRDD to be able cache intermediate results. Where I can find some code examples how to use it ? From SparkContext's code I see the following: private[spark] def persistRDD(rdd: RDD[_]) { persistentRdds(rdd.id) = rdd } /** * Unpersist an RDD from memory and/or disk storage */ private[spark] def unpersistRDD(rddId: Int, blocking: Boolean = true) { env.blockManager.master.removeRdd(rddId, blocking) persistentRdds.remove(rddId) listenerBus.post(SparkListenerUnpersistRDD(rddId)) } So persist don't put RDD into tachyonStore of blockmanager ? How does this feature work ? Support persisting RDD's directly to Tachyon Key: SPARK-1305 URL: https://issues.apache.org/jira/browse/SPARK-1305 Project: Spark Issue Type: New Feature Components: Block Manager Reporter: Patrick Wendell Assignee: Haoyuan Li Priority: Blocker Fix For: 1.0.0 This is already an ongoing pull request - in a nutshell we want to support Tachyon as a storage level in Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2580) broken pipe collecting schemardd results
Matthew Farrellee created SPARK-2580: Summary: broken pipe collecting schemardd results Key: SPARK-2580 URL: https://issues.apache.org/jira/browse/SPARK-2580 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.0.0 Environment: fedora 21 local and rhel 7 clustered (standalone) Reporter: Matthew Farrellee {code} from pyspark.sql import SQLContext sqlCtx = SQLContext(sc) # size of cluster impacts where this breaks (i.e. 2**15 vs 2**2) data = sc.parallelize([{'name': 'index', 'value': 0}] * 2**20) sdata = sqlCtx.inferSchema(data) sdata.first() {code} result: note - result returned as well as error {code} sdata.first() 14/07/18 12:10:25 INFO SparkContext: Starting job: runJob at PythonRDD.scala:290 14/07/18 12:10:25 INFO DAGScheduler: Got job 43 (runJob at PythonRDD.scala:290) with 1 output partitions (allowLocal=true) 14/07/18 12:10:25 INFO DAGScheduler: Final stage: Stage 52(runJob at PythonRDD.scala:290) 14/07/18 12:10:25 INFO DAGScheduler: Parents of final stage: List() 14/07/18 12:10:25 INFO DAGScheduler: Missing parents: List() 14/07/18 12:10:25 INFO DAGScheduler: Computing the requested partition locally 14/07/18 12:10:25 INFO PythonRDD: Times: total = 45, boot = 3, init = 40, finish = 2 14/07/18 12:10:25 INFO SparkContext: Job finished: runJob at PythonRDD.scala:290, took 0.048348426 s {u'name': u'index', u'value': 0} PySpark worker failed with exception: Traceback (most recent call last): File /home/matt/Documents/Repositories/spark/dist/python/pyspark/worker.py, line 77, in main serializer.dump_stream(func(split_index, iterator), outfile) File /home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py, line 191, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File /home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py, line 124, in dump_stream self._write_with_length(obj, stream) File /home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py, line 139, in _write_with_length stream.write(serialized) IOError: [Errno 32] Broken pipe Traceback (most recent call last): File /home/matt/Documents/Repositories/spark/dist/python/pyspark/daemon.py, line 130, in launch_worker worker(listen_sock) File /home/matt/Documents/Repositories/spark/dist/python/pyspark/daemon.py, line 119, in worker outfile.flush() IOError: [Errno 32] Broken pipe {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2580) broken pipe collecting schemardd results
[ https://issues.apache.org/jira/browse/SPARK-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066480#comment-14066480 ] Matthew Farrellee commented on SPARK-2580: -- fyi, this was discovered during spark summit 2014 using the wiki_parquet example. broken pipe collecting schemardd results Key: SPARK-2580 URL: https://issues.apache.org/jira/browse/SPARK-2580 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.0.0 Environment: fedora 21 local and rhel 7 clustered (standalone) Reporter: Matthew Farrellee Labels: py4j, pyspark {code} from pyspark.sql import SQLContext sqlCtx = SQLContext(sc) # size of cluster impacts where this breaks (i.e. 2**15 vs 2**2) data = sc.parallelize([{'name': 'index', 'value': 0}] * 2**20) sdata = sqlCtx.inferSchema(data) sdata.first() {code} result: note - result returned as well as error {code} sdata.first() 14/07/18 12:10:25 INFO SparkContext: Starting job: runJob at PythonRDD.scala:290 14/07/18 12:10:25 INFO DAGScheduler: Got job 43 (runJob at PythonRDD.scala:290) with 1 output partitions (allowLocal=true) 14/07/18 12:10:25 INFO DAGScheduler: Final stage: Stage 52(runJob at PythonRDD.scala:290) 14/07/18 12:10:25 INFO DAGScheduler: Parents of final stage: List() 14/07/18 12:10:25 INFO DAGScheduler: Missing parents: List() 14/07/18 12:10:25 INFO DAGScheduler: Computing the requested partition locally 14/07/18 12:10:25 INFO PythonRDD: Times: total = 45, boot = 3, init = 40, finish = 2 14/07/18 12:10:25 INFO SparkContext: Job finished: runJob at PythonRDD.scala:290, took 0.048348426 s {u'name': u'index', u'value': 0} PySpark worker failed with exception: Traceback (most recent call last): File /home/matt/Documents/Repositories/spark/dist/python/pyspark/worker.py, line 77, in main serializer.dump_stream(func(split_index, iterator), outfile) File /home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py, line 191, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File /home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py, line 124, in dump_stream self._write_with_length(obj, stream) File /home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py, line 139, in _write_with_length stream.write(serialized) IOError: [Errno 32] Broken pipe Traceback (most recent call last): File /home/matt/Documents/Repositories/spark/dist/python/pyspark/daemon.py, line 130, in launch_worker worker(listen_sock) File /home/matt/Documents/Repositories/spark/dist/python/pyspark/daemon.py, line 119, in worker outfile.flush() IOError: [Errno 32] Broken pipe {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2576) slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file
[ https://issues.apache.org/jira/browse/SPARK-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066492#comment-14066492 ] Yin Huai commented on SPARK-2576: - Regarding {code} $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:19) $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:19) {code} what is your input at console line 19? slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file -- Key: SPARK-2576 URL: https://issues.apache.org/jira/browse/SPARK-2576 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.0.1 Environment: One Mesos 0.19 master without zookeeper and 4 mesos slaves. JDK 1.7.51 and Scala 2.10.4 on all nodes. HDFS from CDH5.0.3 Spark version: I tried both with the pre-built CDH5 spark package available from http://spark.apache.org/downloads.html and by packaging spark with sbt 0.13.2, JDK 1.7.51 and scala 2.10.4 as explained here http://mesosphere.io/learn/run-spark-on-mesos/ All nodes are running Debian 3.2.51-1 x86_64 GNU/Linux and have Reporter: Svend Execution of SQL query against HDFS systematically throws a class not found exception on slave nodes when executing . (this was originally reported on the user list: http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-1-spark-sql-error-java-lang-NoClassDefFoundError-Could-not-initialize-class-line11-read-tc10135.html) Sample code (ran from spark-shell): {code} val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.createSchemaRDD case class Car(timestamp: Long, objectid: String, isGreen: Boolean) // I get the same error when pointing to the folder hdfs://vm28:8020/test/cardata val data = sc.textFile(hdfs://vm28:8020/test/cardata/part-0) val cars = data.map(_.split(,)).map ( ar = Car(ar(0).toLong, ar(1), ar(2).toBoolean)) cars.registerAsTable(mcars) val allgreens = sqlContext.sql(SELECT objectid from mcars where isGreen = true) allgreens.collect.take(10).foreach(println) {code} Stack trace on the slave nodes: {code} I0716 13:01:16.215158 13631 exec.cpp:131] Version: 0.19.0 I0716 13:01:16.219285 13656 exec.cpp:205] Executor registered on slave 20140714-142853-485682442-5050-25487-2 14/07/16 13:01:16 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20140714-142853-485682442-5050-25487-2 14/07/16 13:01:16 INFO SecurityManager: Changing view acls to: mesos,mnubohadoop 14/07/16 13:01:16 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mesos, mnubohadoop) 14/07/16 13:01:17 INFO Slf4jLogger: Slf4jLogger started 14/07/16 13:01:17 INFO Remoting: Starting remoting 14/07/16 13:01:17 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sp...@vm23-hulk-priv.mtl.mnubo.com:38230] 14/07/16 13:01:17 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sp...@vm23-hulk-priv.mtl.mnubo.com:38230] 14/07/16 13:01:17 INFO SparkEnv: Connecting to MapOutputTracker: akka.tcp://spark@vm28-hulk-pub:41632/user/MapOutputTracker 14/07/16 13:01:17 INFO SparkEnv: Connecting to BlockManagerMaster: akka.tcp://spark@vm28-hulk-pub:41632/user/BlockManagerMaster 14/07/16 13:01:17 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140716130117-8ea0 14/07/16 13:01:17 INFO MemoryStore: MemoryStore started with capacity 294.9 MB. 14/07/16 13:01:17 INFO ConnectionManager: Bound socket to port 44501 with id = ConnectionManagerId(vm23-hulk-priv.mtl.mnubo.com,44501) 14/07/16 13:01:17 INFO BlockManagerMaster: Trying to register BlockManager 14/07/16 13:01:17 INFO BlockManagerMaster: Registered BlockManager 14/07/16 13:01:17 INFO HttpFileServer: HTTP File server directory is /tmp/spark-ccf6f36c-2541-4a25-8fe4-bb4ba00ee633 14/07/16 13:01:17 INFO HttpServer: Starting HTTP Server 14/07/16 13:01:18 INFO Executor: Using REPL class URI: http://vm28-hulk-pub:33973 14/07/16 13:01:18 INFO Executor: Running task ID 2 14/07/16 13:01:18 INFO HttpBroadcast: Started reading broadcast variable 0 14/07/16 13:01:18 INFO MemoryStore: ensureFreeSpace(125590) called with curMem=0, maxMem=309225062 14/07/16 13:01:18 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 122.6 KB, free 294.8 MB) 14/07/16 13:01:18 INFO HttpBroadcast: Reading broadcast variable 0 took 0.294602722 s 14/07/16 13:01:19 INFO HadoopRDD: Input split: hdfs://vm28-hulk-priv:8020/svend/testimport/com_mnubo_agriculture60_1396310400/part-0:23960450+23960451 I0716 13:01:19.905113 13657 exec.cpp:378] Executor asked to shutdown 14/07/16 13:01:20 ERROR Executor: Exception in task ID 2
[jira] [Created] (SPARK-2581) complete or withdraw visitedStages optimization in DAGScheduler’s stageDependsOn
Aaron Staple created SPARK-2581: --- Summary: complete or withdraw visitedStages optimization in DAGScheduler’s stageDependsOn Key: SPARK-2581 URL: https://issues.apache.org/jira/browse/SPARK-2581 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Aaron Staple Priority: Minor Right now the visitedStages HashSet is populated with stages, but never queried to limit examination of previously visited stages. It may make sense to check whether a mapStage has been visited previously before visiting it again, as in the nearby visitedRdds check. Or it may be that the existing visitedRdds check sufficiently optimizes this function, and visitedStages can simply be removed. See discussion here: https://github.com/apache/spark/pull/1362#discussion-diff-15018046L1107 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2568) RangePartitioner should go through the data only once
[ https://issues.apache.org/jira/browse/SPARK-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066541#comment-14066541 ] Reynold Xin commented on SPARK-2568: Yes. Let's solve the problem one by one though. RangePartitioner should go through the data only once - Key: SPARK-2568 URL: https://issues.apache.org/jira/browse/SPARK-2568 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Xiangrui Meng As of Spark 1.0, RangePartitioner goes through data twice: once to compute the count and once to do sampling. As a result, to do sortByKey, Spark goes through data 3 times (once to count, once to sample, and once to sort). RangePartitioner should go through data only once (remove the count step). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2568) RangePartitioner should go through the data only once
[ https://issues.apache.org/jira/browse/SPARK-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066553#comment-14066553 ] Mark Hamstra commented on SPARK-2568: - Sure, if they can be cleanly separated -- but there's also interaction with the ShuffleManager refactoring. Do you have some strategy in mind for addressing just SPARK-2568 in isolation? RangePartitioner should go through the data only once - Key: SPARK-2568 URL: https://issues.apache.org/jira/browse/SPARK-2568 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Xiangrui Meng As of Spark 1.0, RangePartitioner goes through data twice: once to compute the count and once to do sampling. As a result, to do sortByKey, Spark goes through data 3 times (once to count, once to sample, and once to sort). RangePartitioner should go through data only once (remove the count step). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2582) Make Block Manager Master pluggable
[ https://issues.apache.org/jira/browse/SPARK-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Shreedharan updated SPARK-2582: Component/s: Block Manager Affects Version/s: 1.0.0 Issue Type: Improvement (was: Bug) Make Block Manager Master pluggable --- Key: SPARK-2582 URL: https://issues.apache.org/jira/browse/SPARK-2582 Project: Spark Issue Type: Improvement Components: Block Manager Affects Versions: 1.0.0 Reporter: Hari Shreedharan Today, there is no way to make the BMM pluggable. So if we want an HA BMM, that needs to replace the current one. Making this pluggable and selected based on a config makes it easy to select HA or non-HA one based on the application's preference. Streaming applications would be better off with an HA one, while a normal application would not care (since the RDDs can be regenerated). Since communication from the Block Managers to the BMM is via akka, we can keep that the same and just have the implementation of the BMM implement the actual methods which do the real work - this would not affect the Block Managers too. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2582) Make Block Manager Master pluggable
Hari Shreedharan created SPARK-2582: --- Summary: Make Block Manager Master pluggable Key: SPARK-2582 URL: https://issues.apache.org/jira/browse/SPARK-2582 Project: Spark Issue Type: Bug Reporter: Hari Shreedharan Today, there is no way to make the BMM pluggable. So if we want an HA BMM, that needs to replace the current one. Making this pluggable and selected based on a config makes it easy to select HA or non-HA one based on the application's preference. Streaming applications would be better off with an HA one, while a normal application would not care (since the RDDs can be regenerated). Since communication from the Block Managers to the BMM is via akka, we can keep that the same and just have the implementation of the BMM implement the actual methods which do the real work - this would not affect the Block Managers too. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts
[ https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066600#comment-14066600 ] Xuefu Zhang commented on SPARK-2420: {quote} 2. For jetty, it was a problem with Hive on Spark POC, possibly because we shipped all libraries from Hive process's classpath to the spark cluster. We have a task (HIVE-7371) to identify a minimum set of jars to be shipped. With that, the story might change. We will confirm if Jetty is a problem once we have a better idea on HIVE-7371. {quote} A better good news. With the latest work in HIVE-7292, we found that servlet-api/jetty didn't seem to be a problem any more. Thus, the only conflict remaining is guava, for which HIVE-7387 has all the details. Change Spark build to minimize library conflicts Key: SPARK-2420 URL: https://issues.apache.org/jira/browse/SPARK-2420 Project: Spark Issue Type: Wish Components: Build Affects Versions: 1.0.0 Reporter: Xuefu Zhang Attachments: spark_1.0.0.patch During the prototyping of HIVE-7292, many library conflicts showed up because Spark build contains versions of libraries that's vastly different from current major Hadoop version. It would be nice if we can choose versions that's in line with Hadoop or shading them in the assembly. Here are the wish list: 1. Upgrade protobuf version to 2.5.0 from current 2.4.1 2. Shading Spark's jetty and servlet dependency in the assembly. 3. guava version difference. Spark is using a higher version. I'm not sure what's the best solution for this. The list may grow as HIVE-7292 proceeds. For information only, the attached is a patch that we applied on Spark in order to make Spark work with Hive. It gives an idea of the scope of changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts
[ https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066634#comment-14066634 ] Sean Owen commented on SPARK-2420: -- I'm going to make a PR that would show what downgrading Guava looks like, so at least that can be assessed by everyone. Change Spark build to minimize library conflicts Key: SPARK-2420 URL: https://issues.apache.org/jira/browse/SPARK-2420 Project: Spark Issue Type: Wish Components: Build Affects Versions: 1.0.0 Reporter: Xuefu Zhang Attachments: spark_1.0.0.patch During the prototyping of HIVE-7292, many library conflicts showed up because Spark build contains versions of libraries that's vastly different from current major Hadoop version. It would be nice if we can choose versions that's in line with Hadoop or shading them in the assembly. Here are the wish list: 1. Upgrade protobuf version to 2.5.0 from current 2.4.1 2. Shading Spark's jetty and servlet dependency in the assembly. 3. guava version difference. Spark is using a higher version. I'm not sure what's the best solution for this. The list may grow as HIVE-7292 proceeds. For information only, the attached is a patch that we applied on Spark in order to make Spark work with Hive. It gives an idea of the scope of changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2568) RangePartitioner should go through the data only once
[ https://issues.apache.org/jira/browse/SPARK-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066689#comment-14066689 ] Reynold Xin commented on SPARK-2568: Our PhD in Math is working on that :) RangePartitioner should go through the data only once - Key: SPARK-2568 URL: https://issues.apache.org/jira/browse/SPARK-2568 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Xiangrui Meng As of Spark 1.0, RangePartitioner goes through data twice: once to compute the count and once to do sampling. As a result, to do sortByKey, Spark goes through data 3 times (once to count, once to sample, and once to sort). RangePartitioner should go through data only once (remove the count step). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2584) Do not mutate block storage level on the UI
Andrew Or created SPARK-2584: Summary: Do not mutate block storage level on the UI Key: SPARK-2584 URL: https://issues.apache.org/jira/browse/SPARK-2584 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.1.0 Reporter: Andrew Or Fix For: 1.1.0 If a block is stored MEMORY_AND_DISK and we drop it from memory, it becomes DISK_ONLY on the UI. We should preserve the original storage level proposed by the user, in addition to the change in actual storage level. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2585) Remove special handling of Hadoop JobConf
Patrick Wendell created SPARK-2585: -- Summary: Remove special handling of Hadoop JobConf Key: SPARK-2585 URL: https://issues.apache.org/jira/browse/SPARK-2585 Project: Spark Issue Type: Improvement Reporter: Patrick Wendell Assignee: Reynold Xin This is a follow up to SPARK-2521 and should close SPARK-2546 (provided the implementation does not use shared conf objects). We no longer need to specially broadcast the Hadoop configuration since we are broadcasting RDD data anyways. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2585) Remove special handling of Hadoop JobConf
[ https://issues.apache.org/jira/browse/SPARK-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2585: --- Priority: Critical (was: Major) Remove special handling of Hadoop JobConf - Key: SPARK-2585 URL: https://issues.apache.org/jira/browse/SPARK-2585 Project: Spark Issue Type: Improvement Reporter: Patrick Wendell Assignee: Reynold Xin Priority: Critical This is a follow up to SPARK-2521 and should close SPARK-2546 (provided the implementation does not use shared conf objects). We no longer need to specially broadcast the Hadoop configuration since we are broadcasting RDD data anyways. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1767) Prefer HDFS-cached replicas when scheduling data-local tasks
[ https://issues.apache.org/jira/browse/SPARK-1767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066788#comment-14066788 ] Colin Patrick McCabe commented on SPARK-1767: - I posted a pull request here: https://github.com/apache/spark/pull/1486 This illustrates how to do it with Hadoop 2.5. I'm still not sure about whether we should change the type of getPreferredLocations (see the pull request) Prefer HDFS-cached replicas when scheduling data-local tasks Key: SPARK-1767 URL: https://issues.apache.org/jira/browse/SPARK-1767 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2521) Broadcast RDD object once per TaskSet (instead of sending it for every task)
[ https://issues.apache.org/jira/browse/SPARK-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2521: --- Description: Currently (as of Spark 1.0.1), Spark sends RDD object (which contains closures) using Akka along with the task itself to the executors. This is inefficient because all tasks in the same stage use the same RDD object, but we have to send RDD object multiple times to the executors. This is especially bad when a closure references some variable that is very large. The current design led to users having to explicitly broadcast large variables. The patch uses broadcast to send RDD objects and the closures to executors, and use Akka to only send a reference to the broadcast RDD/closure along with the partition specific information for the task. For those of you who know more about the internals, Spark already relies on broadcast to send the Hadoop JobConf every time it uses the Hadoop input, because the JobConf is large. The user-facing impact of the change include: Users won't need to decide what to broadcast anymore, unless they would want to use a large object multiple times in different operations Task size will get smaller, resulting in faster scheduling and higher task dispatch throughput. In addition, the change will simplify some internals of Spark, eliminating the need to maintain task caches and the complex logic to broadcast JobConf (which also led to a deadlock recently). A simple way to test this: {code} val a = new Array[Byte](1000*1000); scala.util.Random.nextBytes(a); sc.parallelize(1 to 1000, 1000).map { x = a; x }.groupBy { x = a; x }.count Numbers on 3 r3.8xlarge instances on EC2 master branch: 5.648436068 s, 4.715361895 s, 5.360161877 s with this change: 3.416348793 s, 1.477846558 s, 1.553432156 s {code} was: This can substantially reduce task size, as well as being able to support very large closures (e.g. closures that reference large variables). Once this is in, we can also remove broadcasting the Hadoop JobConf. Broadcast RDD object once per TaskSet (instead of sending it for every task) Key: SPARK-2521 URL: https://issues.apache.org/jira/browse/SPARK-2521 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Currently (as of Spark 1.0.1), Spark sends RDD object (which contains closures) using Akka along with the task itself to the executors. This is inefficient because all tasks in the same stage use the same RDD object, but we have to send RDD object multiple times to the executors. This is especially bad when a closure references some variable that is very large. The current design led to users having to explicitly broadcast large variables. The patch uses broadcast to send RDD objects and the closures to executors, and use Akka to only send a reference to the broadcast RDD/closure along with the partition specific information for the task. For those of you who know more about the internals, Spark already relies on broadcast to send the Hadoop JobConf every time it uses the Hadoop input, because the JobConf is large. The user-facing impact of the change include: Users won't need to decide what to broadcast anymore, unless they would want to use a large object multiple times in different operations Task size will get smaller, resulting in faster scheduling and higher task dispatch throughput. In addition, the change will simplify some internals of Spark, eliminating the need to maintain task caches and the complex logic to broadcast JobConf (which also led to a deadlock recently). A simple way to test this: {code} val a = new Array[Byte](1000*1000); scala.util.Random.nextBytes(a); sc.parallelize(1 to 1000, 1000).map { x = a; x }.groupBy { x = a; x }.count Numbers on 3 r3.8xlarge instances on EC2 master branch: 5.648436068 s, 4.715361895 s, 5.360161877 s with this change: 3.416348793 s, 1.477846558 s, 1.553432156 s {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1680) Clean up use of setExecutorEnvs in SparkConf
[ https://issues.apache.org/jira/browse/SPARK-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066872#comment-14066872 ] Thomas Graves commented on SPARK-1680: -- so what do we want to do here. Do we want to expose spark.executorEnv. to the user to allow them to set environment variables? I assume since it isn't documented it wasn't meant for end users but perhaps we just missed documenting it. I would like to add a config to replace the env variable SPARK_YARN_USER_ENV. Either way I will probably need another config to set env variables on the application master. I could make that one yarn specific like spark.yarn.appMasterEnv. Clean up use of setExecutorEnvs in SparkConf - Key: SPARK-1680 URL: https://issues.apache.org/jira/browse/SPARK-1680 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Patrick Wendell Priority: Blocker Fix For: 1.1.0 We should make this consistent between YARN and Standalone. Basically, YARN mode should just use the executorEnvs from the Spark conf and not need SPARK_YARN_USER_ENV. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1536) Add multiclass classification tree support to MLlib
[ https://issues.apache.org/jira/browse/SPARK-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1536. -- Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 886 [https://github.com/apache/spark/pull/886] Add multiclass classification tree support to MLlib --- Key: SPARK-1536 URL: https://issues.apache.org/jira/browse/SPARK-1536 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Manish Amde Assignee: Manish Amde Priority: Critical Fix For: 1.1.0 The current decision tree implementation in MLlib only supports binary classification. This task involves adding multiclass classification support to the decision tree implementation. The tasks involves: - Choosing a good strategy for multiclass classification among multiple options: -- add multi class support to impurity but it won't work well with the categorical features since the centriod-based ordering assumptions won't hold true -- error-correcting output codes -- one-vs-all - Code implementation - Unit tests - Functional tests - Performance tests - Documentation -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2269) Clean up and add unit tests for resourceOffers in MesosSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066879#comment-14066879 ] Timothy Chen commented on SPARK-2269: - Created a PR for this: https://github.com/apache/spark/pull/1487 Clean up and add unit tests for resourceOffers in MesosSchedulerBackend --- Key: SPARK-2269 URL: https://issues.apache.org/jira/browse/SPARK-2269 Project: Spark Issue Type: Bug Components: Mesos Reporter: Patrick Wendell Assignee: Tim Chen This function could be simplified a bit. We could re-write it without offerableIndices or creating the mesosTasks array as large as the offer list. There is a lot of logic around making sure you get the correct index into mesosTasks and offers, really we should just build mesosTasks directly from the offers we get back. To associate the tasks we are launching with the offers we can just create a hashMap from the slaveId to the original offer. The basic logic of the function is that you take the mesos offers, convert them to spark offers, then convert the results back. One reason I think it might be designed as it is now is to deal with the case where Mesos gives multiple offers for a single slave. I checked directly with the Mesos team and they said this won't ever happen, you'll get at most one offer per mesos slave within a set of offers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2535) Add StringComparison case to NullPropagation.
[ https://issues.apache.org/jira/browse/SPARK-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2535. - Resolution: Fixed Fix Version/s: 1.1.0 Assignee: Takuya Ueshin Add StringComparison case to NullPropagation. - Key: SPARK-2535 URL: https://issues.apache.org/jira/browse/SPARK-2535 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin Assignee: Takuya Ueshin Fix For: 1.1.0 {{StringComparison}} expressions including {{null}} literal cases could be added to {{NullPropagation}}. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2540) Add More Types Support for unwarpData of HiveUDF
[ https://issues.apache.org/jira/browse/SPARK-2540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2540. - Resolution: Fixed Fix Version/s: 1.0.2 1.1.0 Add More Types Support for unwarpData of HiveUDF Key: SPARK-2540 URL: https://issues.apache.org/jira/browse/SPARK-2540 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Priority: Minor Fix For: 1.1.0, 1.0.2 In the function HiveInspectors.unwrapData, currently the HiveVarcharObjectInspector HiveDecimalObjectInspector are not supported. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2571) Shuffle read bytes are reported incorrectly for stages with multiple shuffle dependencies
[ https://issues.apache.org/jira/browse/SPARK-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-2571: -- Fix Version/s: 1.0.2 Shuffle read bytes are reported incorrectly for stages with multiple shuffle dependencies - Key: SPARK-2571 URL: https://issues.apache.org/jira/browse/SPARK-2571 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.1, 0.9.3 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.0.2 In BlockStoreShuffleFetcher, we set the shuffle metrics for a task to include information about data fetched from one BlockFetcherIterator. When tasks have multiple shuffle dependencies (e.g., a stage that joins two datasets together), the metrics will get set based on data fetched from the last BlockFetcherIterator to complete, rather than the sum of all data fetched from all BlockFetcherIterators. This can lead to dramatically underreporting the shuffle read bytes. Thanks [~andrewor14] and [~rxin] for helping to diagnose this issue. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2571) Shuffle read bytes are reported incorrectly for stages with multiple shuffle dependencies
[ https://issues.apache.org/jira/browse/SPARK-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-2571. --- Resolution: Fixed Shuffle read bytes are reported incorrectly for stages with multiple shuffle dependencies - Key: SPARK-2571 URL: https://issues.apache.org/jira/browse/SPARK-2571 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.1, 0.9.3 Reporter: Kay Ousterhout Assignee: Kay Ousterhout In BlockStoreShuffleFetcher, we set the shuffle metrics for a task to include information about data fetched from one BlockFetcherIterator. When tasks have multiple shuffle dependencies (e.g., a stage that joins two datasets together), the metrics will get set based on data fetched from the last BlockFetcherIterator to complete, rather than the sum of all data fetched from all BlockFetcherIterators. This can lead to dramatically underreporting the shuffle read bytes. Thanks [~andrewor14] and [~rxin] for helping to diagnose this issue. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2571) Shuffle read bytes are reported incorrectly for stages with multiple shuffle dependencies
[ https://issues.apache.org/jira/browse/SPARK-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-2571: -- https://github.com/apache/spark/commit/7b971b91caeebda57f1506ffc4fd266a1b379290 Shuffle read bytes are reported incorrectly for stages with multiple shuffle dependencies - Key: SPARK-2571 URL: https://issues.apache.org/jira/browse/SPARK-2571 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.1, 0.9.3 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.0.2 In BlockStoreShuffleFetcher, we set the shuffle metrics for a task to include information about data fetched from one BlockFetcherIterator. When tasks have multiple shuffle dependencies (e.g., a stage that joins two datasets together), the metrics will get set based on data fetched from the last BlockFetcherIterator to complete, rather than the sum of all data fetched from all BlockFetcherIterators. This can lead to dramatically underreporting the shuffle read bytes. Thanks [~andrewor14] and [~rxin] for helping to diagnose this issue. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2518) Fix foldability of Substring expression.
[ https://issues.apache.org/jira/browse/SPARK-2518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2518. - Resolution: Fixed Fix Version/s: 1.0.2 1.1.0 Fix foldability of Substring expression. Key: SPARK-2518 URL: https://issues.apache.org/jira/browse/SPARK-2518 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin Assignee: Takuya Ueshin Fix For: 1.1.0, 1.0.2 This is a follow-up of [#1428|https://github.com/apache/spark/pull/1428]. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2571) Shuffle read bytes are reported incorrectly for stages with multiple shuffle dependencies
[ https://issues.apache.org/jira/browse/SPARK-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-2571: -- Target Version/s: 1.1.0 (was: 1.0.2) Fix Version/s: (was: 1.0.2) 1.1.0 Shuffle read bytes are reported incorrectly for stages with multiple shuffle dependencies - Key: SPARK-2571 URL: https://issues.apache.org/jira/browse/SPARK-2571 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.1, 0.9.3 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.1.0 In BlockStoreShuffleFetcher, we set the shuffle metrics for a task to include information about data fetched from one BlockFetcherIterator. When tasks have multiple shuffle dependencies (e.g., a stage that joins two datasets together), the metrics will get set based on data fetched from the last BlockFetcherIterator to complete, rather than the sum of all data fetched from all BlockFetcherIterators. This can lead to dramatically underreporting the shuffle read bytes. Thanks [~andrewor14] and [~rxin] for helping to diagnose this issue. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2562) Add Date datatype support to Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2562: Target Version/s: 1.2.0 (was: 1.1.0, 1.2.0) Add Date datatype support to Spark SQL -- Key: SPARK-2562 URL: https://issues.apache.org/jira/browse/SPARK-2562 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.1 Reporter: Zongheng Yang Priority: Minor Spark SQL currently supports Timestamp, but not Date. Hive introduced support for Date in [HIVE-4055|https://issues.apache.org/jira/browse/HIVE-4055], where the underlying representation is {{java.sql.Date}}. (Thanks to user Rindra Ramamonjison for reporting this.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2551) Cleanup FilteringParquetRowInputFormat
[ https://issues.apache.org/jira/browse/SPARK-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2551: Target Version/s: 1.2.0 (was: 1.1.0) Cleanup FilteringParquetRowInputFormat -- Key: SPARK-2551 URL: https://issues.apache.org/jira/browse/SPARK-2551 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.1, 1.0.2 Reporter: Cheng Lian Priority: Minor To workaround [PARQUET-16|https://issues.apache.org/jira/browse/PARQUET-16] and fix [SPARK-2119|https://issues.apache.org/jira/browse/SPARK-2119], we did some reflection hacks in {{FilteringParquetRowInputFormat}}. This should be cleaned up once PARQUET-16 is fixed. A PR for PARQUET-16 is [here|https://github.com/apache/incubator-parquet-mr/pull/17]. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2205) Unnecessary exchange operators in a join on multiple tables with the same join key.
[ https://issues.apache.org/jira/browse/SPARK-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2205: Target Version/s: 1.2.0 (was: 1.1.0) Unnecessary exchange operators in a join on multiple tables with the same join key. --- Key: SPARK-2205 URL: https://issues.apache.org/jira/browse/SPARK-2205 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Minor {code} hql(select * from src x join src y on (x.key=y.key) join src z on (y.key=z.key)) SchemaRDD[1] at RDD at SchemaRDD.scala:100 == Query Plan == Project [key#4:0,value#5:1,key#6:2,value#7:3,key#8:4,value#9:5] HashJoin [key#6], [key#8], BuildRight Exchange (HashPartitioning [key#6], 200) HashJoin [key#4], [key#6], BuildRight Exchange (HashPartitioning [key#4], 200) HiveTableScan [key#4,value#5], (MetastoreRelation default, src, Some(x)), None Exchange (HashPartitioning [key#6], 200) HiveTableScan [key#6,value#7], (MetastoreRelation default, src, Some(y)), None Exchange (HashPartitioning [key#8], 200) HiveTableScan [key#8,value#9], (MetastoreRelation default, src, Some(z)), None {code} However, this is fine... {code} hql(select * from src x join src y on (x.key=y.key) join src z on (x.key=z.key)) res5: org.apache.spark.sql.SchemaRDD = SchemaRDD[5] at RDD at SchemaRDD.scala:100 == Query Plan == Project [key#26:0,value#27:1,key#28:2,value#29:3,key#30:4,value#31:5] HashJoin [key#26], [key#30], BuildRight HashJoin [key#26], [key#28], BuildRight Exchange (HashPartitioning [key#26], 200) HiveTableScan [key#26,value#27], (MetastoreRelation default, src, Some(x)), None Exchange (HashPartitioning [key#28], 200) HiveTableScan [key#28,value#29], (MetastoreRelation default, src, Some(y)), None Exchange (HashPartitioning [key#30], 200) HiveTableScan [key#30,value#31], (MetastoreRelation default, src, Some(z)), None {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2087) Support SQLConf per session
[ https://issues.apache.org/jira/browse/SPARK-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2087: Target Version/s: 1.2.0 (was: 1.1.0) Support SQLConf per session --- Key: SPARK-2087 URL: https://issues.apache.org/jira/browse/SPARK-2087 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Zongheng Yang Priority: Minor For things like the SharkServer we should support configuration per thread instead of globally for a context. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2087) Support SQLConf per session
[ https://issues.apache.org/jira/browse/SPARK-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066964#comment-14066964 ] Michael Armbrust commented on SPARK-2087: - I'm thinking about this more... and really I think we need to do more than just have a SQLConf per session. Ideally we would have a full SQLContext per session. This way users could have their own temporary tables. However, doing that in the naive way would also mean that cached tables would not be shared across contexts... So really I think this JIRA needs to be about clean multi-user semantics for the thrift server. Support SQLConf per session --- Key: SPARK-2087 URL: https://issues.apache.org/jira/browse/SPARK-2087 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Zongheng Yang Priority: Minor For things like the SharkServer we should support configuration per thread instead of globally for a context. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2087) Clean Multi-user semantics for thrift JDBC/ODBC server.
[ https://issues.apache.org/jira/browse/SPARK-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2087: Description: Configuration and temporary tables should exist per-user. Cached tables should be shared across users. (was: For things like the SharkServer we should support configuration per thread instead of globally for a context.) Clean Multi-user semantics for thrift JDBC/ODBC server. --- Key: SPARK-2087 URL: https://issues.apache.org/jira/browse/SPARK-2087 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Zongheng Yang Priority: Minor Configuration and temporary tables should exist per-user. Cached tables should be shared across users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2087) Clean Multi-user semantics for thrift JDBC/ODBC server.
[ https://issues.apache.org/jira/browse/SPARK-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2087: Summary: Clean Multi-user semantics for thrift JDBC/ODBC server. (was: Support SQLConf per session) Clean Multi-user semantics for thrift JDBC/ODBC server. --- Key: SPARK-2087 URL: https://issues.apache.org/jira/browse/SPARK-2087 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Zongheng Yang Priority: Minor For things like the SharkServer we should support configuration per thread instead of globally for a context. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2066) Better error message for non-aggregated attributes with aggregates
[ https://issues.apache.org/jira/browse/SPARK-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2066: Target Version/s: 1.2.0 (was: 1.1.0) Better error message for non-aggregated attributes with aggregates -- Key: SPARK-2066 URL: https://issues.apache.org/jira/browse/SPARK-2066 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Cheng Lian [~marmbrus] Run the following query {code} scala c.hql(select key, count(*) from src).collect() {code} Got the following exception at runtime {code} org.apache.spark.sql.catalyst.errors.package$TreeNodeException: No function to evaluate expression. type: AttributeReference, tree: key#61 at org.apache.spark.sql.catalyst.expressions.AttributeReference.eval(namedExpressions.scala:157) at org.apache.spark.sql.catalyst.expressions.Projection.apply(Projection.scala:35) at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$1.apply(Aggregate.scala:154) at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$1.apply(Aggregate.scala:134) at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:558) at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:558) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261) at org.apache.spark.rdd.RDD.iterator(RDD.scala:228) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261) at org.apache.spark.rdd.RDD.iterator(RDD.scala:228) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {code} This should either fail in analysis time, or pass at runtime. Definitely shouldn't fail at runtime. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2586) Lack of information to figure out connection to Tachyon master is inactive/ down
Henry Saputra created SPARK-2586: Summary: Lack of information to figure out connection to Tachyon master is inactive/ down Key: SPARK-2586 URL: https://issues.apache.org/jira/browse/SPARK-2586 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Henry Saputra When you running Spark with Tachyon, when the connection to Tachyon master is down (due to problem in network or the Master node is down) there is no clear log or error message to indicate it. Here is sample stack running SparkTachyonPi example with Tachyon connecting: 14/07/15 16:43:10 INFO Utils: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 14/07/15 16:43:10 WARN Utils: Your hostname, henry-pivotal.local resolves to a loopback address: 127.0.0.1; using 10.64.5.148 instead (on interface en5) 14/07/15 16:43:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address 14/07/15 16:43:11 INFO SecurityManager: Changing view acls to: hsaputra 14/07/15 16:43:11 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hsaputra) 14/07/15 16:43:11 INFO Slf4jLogger: Slf4jLogger started 14/07/15 16:43:11 INFO Remoting: Starting remoting 14/07/15 16:43:11 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sp...@office-5-148.pa.gopivotal.com:53203] 14/07/15 16:43:11 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sp...@office-5-148.pa.gopivotal.com:53203] 14/07/15 16:43:11 INFO SparkEnv: Registering MapOutputTracker 14/07/15 16:43:11 INFO SparkEnv: Registering BlockManagerMaster 14/07/15 16:43:11 INFO DiskBlockManager: Created local directory at /var/folders/nv/nsr_3ysj0wgfq93fqp0rdt3wgp/T/spark-local-20140715164311-e63c 14/07/15 16:43:11 INFO ConnectionManager: Bound socket to port 53204 with id = ConnectionManagerId(office-5-148.pa.gopivotal.com,53204) 14/07/15 16:43:11 INFO MemoryStore: MemoryStore started with capacity 2.1 GB 14/07/15 16:43:11 INFO BlockManagerMaster: Trying to register BlockManager 14/07/15 16:43:11 INFO BlockManagerMasterActor: Registering block manager office-5-148.pa.gopivotal.com:53204 with 2.1 GB RAM 14/07/15 16:43:11 INFO BlockManagerMaster: Registered BlockManager 14/07/15 16:43:11 INFO HttpServer: Starting HTTP Server 14/07/15 16:43:11 INFO HttpBroadcast: Broadcast server started at http://10.64.5.148:53205 14/07/15 16:43:11 INFO HttpFileServer: HTTP File server directory is /var/folders/nv/nsr_3ysj0wgfq93fqp0rdt3wgp/T/spark-b2fb12ae-4608-4833-87b6-b335da00738e 14/07/15 16:43:11 INFO HttpServer: Starting HTTP Server 14/07/15 16:43:12 INFO SparkUI: Started SparkUI at http://office-5-148.pa.gopivotal.com:4040 2014-07-15 16:43:12.210 java[39068:1903] Unable to load realm info from SCDynamicStore 14/07/15 16:43:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/07/15 16:43:12 INFO SparkContext: Added JAR examples/target/scala-2.10/spark-examples-1.1.0-SNAPSHOT-hadoop2.4.0.jar at http://10.64.5.148:53206/jars/spark-examples-1.1.0-SNAPSHOT-hadoop2.4.0.jar with timestamp 1405467792813 14/07/15 16:43:12 INFO AppClient$ClientActor: Connecting to master spark://henry-pivotal.local:7077... 14/07/15 16:43:12 INFO SparkContext: Starting job: reduce at SparkTachyonPi.scala:43 14/07/15 16:43:12 INFO DAGScheduler: Got job 0 (reduce at SparkTachyonPi.scala:43) with 2 output partitions (allowLocal=false) 14/07/15 16:43:12 INFO DAGScheduler: Final stage: Stage 0(reduce at SparkTachyonPi.scala:43) 14/07/15 16:43:12 INFO DAGScheduler: Parents of final stage: List() 14/07/15 16:43:12 INFO DAGScheduler: Missing parents: List() 14/07/15 16:43:12 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[1] at map at SparkTachyonPi.scala:39), which has no missing parents 14/07/15 16:43:13 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MappedRDD[1] at map at SparkTachyonPi.scala:39) 14/07/15 16:43:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks 14/07/15 16:43:13 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20140715164313- 14/07/15 16:43:13 INFO AppClient$ClientActor: Executor added: app-20140715164313-/0 on worker-20140715164009-office-5-148.pa.gopivotal.com-52519 (office-5-148.pa.gopivotal.com:52519) with 8 cores 14/07/15 16:43:13 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140715164313-/0 on hostPort office-5-148.pa.gopivotal.com:52519 with 8 cores, 512.0 MB RAM 14/07/15 16:43:13 INFO AppClient$ClientActor: Executor updated: app-20140715164313-/0 is now RUNNING 14/07/15 16:43:15 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkexecu...@office-5-148.pa.gopivotal.com:53213/user/Executor#-423405256] with ID 0 14/07/15 16:43:15 INFO TaskSetManager:
[jira] [Updated] (SPARK-2586) Lack of information to figure out connection to Tachyon master is inactive/ down
[ https://issues.apache.org/jira/browse/SPARK-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Henry Saputra updated SPARK-2586: - Labels: tachyon (was: ) Lack of information to figure out connection to Tachyon master is inactive/ down Key: SPARK-2586 URL: https://issues.apache.org/jira/browse/SPARK-2586 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Henry Saputra Labels: tachyon When you running Spark with Tachyon, when the connection to Tachyon master is down (due to problem in network or the Master node is down) there is no clear log or error message to indicate it. Here is sample stack running SparkTachyonPi example with Tachyon connecting: 14/07/15 16:43:10 INFO Utils: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 14/07/15 16:43:10 WARN Utils: Your hostname, henry-pivotal.local resolves to a loopback address: 127.0.0.1; using 10.64.5.148 instead (on interface en5) 14/07/15 16:43:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address 14/07/15 16:43:11 INFO SecurityManager: Changing view acls to: hsaputra 14/07/15 16:43:11 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hsaputra) 14/07/15 16:43:11 INFO Slf4jLogger: Slf4jLogger started 14/07/15 16:43:11 INFO Remoting: Starting remoting 14/07/15 16:43:11 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sp...@office-5-148.pa.gopivotal.com:53203] 14/07/15 16:43:11 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sp...@office-5-148.pa.gopivotal.com:53203] 14/07/15 16:43:11 INFO SparkEnv: Registering MapOutputTracker 14/07/15 16:43:11 INFO SparkEnv: Registering BlockManagerMaster 14/07/15 16:43:11 INFO DiskBlockManager: Created local directory at /var/folders/nv/nsr_3ysj0wgfq93fqp0rdt3wgp/T/spark-local-20140715164311-e63c 14/07/15 16:43:11 INFO ConnectionManager: Bound socket to port 53204 with id = ConnectionManagerId(office-5-148.pa.gopivotal.com,53204) 14/07/15 16:43:11 INFO MemoryStore: MemoryStore started with capacity 2.1 GB 14/07/15 16:43:11 INFO BlockManagerMaster: Trying to register BlockManager 14/07/15 16:43:11 INFO BlockManagerMasterActor: Registering block manager office-5-148.pa.gopivotal.com:53204 with 2.1 GB RAM 14/07/15 16:43:11 INFO BlockManagerMaster: Registered BlockManager 14/07/15 16:43:11 INFO HttpServer: Starting HTTP Server 14/07/15 16:43:11 INFO HttpBroadcast: Broadcast server started at http://10.64.5.148:53205 14/07/15 16:43:11 INFO HttpFileServer: HTTP File server directory is /var/folders/nv/nsr_3ysj0wgfq93fqp0rdt3wgp/T/spark-b2fb12ae-4608-4833-87b6-b335da00738e 14/07/15 16:43:11 INFO HttpServer: Starting HTTP Server 14/07/15 16:43:12 INFO SparkUI: Started SparkUI at http://office-5-148.pa.gopivotal.com:4040 2014-07-15 16:43:12.210 java[39068:1903] Unable to load realm info from SCDynamicStore 14/07/15 16:43:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/07/15 16:43:12 INFO SparkContext: Added JAR examples/target/scala-2.10/spark-examples-1.1.0-SNAPSHOT-hadoop2.4.0.jar at http://10.64.5.148:53206/jars/spark-examples-1.1.0-SNAPSHOT-hadoop2.4.0.jar with timestamp 1405467792813 14/07/15 16:43:12 INFO AppClient$ClientActor: Connecting to master spark://henry-pivotal.local:7077... 14/07/15 16:43:12 INFO SparkContext: Starting job: reduce at SparkTachyonPi.scala:43 14/07/15 16:43:12 INFO DAGScheduler: Got job 0 (reduce at SparkTachyonPi.scala:43) with 2 output partitions (allowLocal=false) 14/07/15 16:43:12 INFO DAGScheduler: Final stage: Stage 0(reduce at SparkTachyonPi.scala:43) 14/07/15 16:43:12 INFO DAGScheduler: Parents of final stage: List() 14/07/15 16:43:12 INFO DAGScheduler: Missing parents: List() 14/07/15 16:43:12 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[1] at map at SparkTachyonPi.scala:39), which has no missing parents 14/07/15 16:43:13 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MappedRDD[1] at map at SparkTachyonPi.scala:39) 14/07/15 16:43:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks 14/07/15 16:43:13 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20140715164313- 14/07/15 16:43:13 INFO AppClient$ClientActor: Executor added: app-20140715164313-/0 on worker-20140715164009-office-5-148.pa.gopivotal.com-52519 (office-5-148.pa.gopivotal.com:52519) with 8 cores 14/07/15 16:43:13 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140715164313-/0 on hostPort office-5-148.pa.gopivotal.com:52519 with 8 cores, 512.0 MB RAM 14/07/15
[jira] [Commented] (SPARK-2586) Lack of information to figure out connection to Tachyon master is inactive/ down
[ https://issues.apache.org/jira/browse/SPARK-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067224#comment-14067224 ] Henry Saputra commented on SPARK-2586: -- Using Standalone, I do not see log about Tachyon not available in Master or Worker nodes log files Lack of information to figure out connection to Tachyon master is inactive/ down Key: SPARK-2586 URL: https://issues.apache.org/jira/browse/SPARK-2586 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Henry Saputra Labels: tachyon When you running Spark with Tachyon, when the connection to Tachyon master is down (due to problem in network or the Master node is down) there is no clear log or error message to indicate it. Here is sample stack running SparkTachyonPi example with Tachyon connecting: 14/07/15 16:43:10 INFO Utils: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 14/07/15 16:43:10 WARN Utils: Your hostname, henry-pivotal.local resolves to a loopback address: 127.0.0.1; using 10.64.5.148 instead (on interface en5) 14/07/15 16:43:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address 14/07/15 16:43:11 INFO SecurityManager: Changing view acls to: hsaputra 14/07/15 16:43:11 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hsaputra) 14/07/15 16:43:11 INFO Slf4jLogger: Slf4jLogger started 14/07/15 16:43:11 INFO Remoting: Starting remoting 14/07/15 16:43:11 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sp...@office-5-148.pa.gopivotal.com:53203] 14/07/15 16:43:11 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sp...@office-5-148.pa.gopivotal.com:53203] 14/07/15 16:43:11 INFO SparkEnv: Registering MapOutputTracker 14/07/15 16:43:11 INFO SparkEnv: Registering BlockManagerMaster 14/07/15 16:43:11 INFO DiskBlockManager: Created local directory at /var/folders/nv/nsr_3ysj0wgfq93fqp0rdt3wgp/T/spark-local-20140715164311-e63c 14/07/15 16:43:11 INFO ConnectionManager: Bound socket to port 53204 with id = ConnectionManagerId(office-5-148.pa.gopivotal.com,53204) 14/07/15 16:43:11 INFO MemoryStore: MemoryStore started with capacity 2.1 GB 14/07/15 16:43:11 INFO BlockManagerMaster: Trying to register BlockManager 14/07/15 16:43:11 INFO BlockManagerMasterActor: Registering block manager office-5-148.pa.gopivotal.com:53204 with 2.1 GB RAM 14/07/15 16:43:11 INFO BlockManagerMaster: Registered BlockManager 14/07/15 16:43:11 INFO HttpServer: Starting HTTP Server 14/07/15 16:43:11 INFO HttpBroadcast: Broadcast server started at http://10.64.5.148:53205 14/07/15 16:43:11 INFO HttpFileServer: HTTP File server directory is /var/folders/nv/nsr_3ysj0wgfq93fqp0rdt3wgp/T/spark-b2fb12ae-4608-4833-87b6-b335da00738e 14/07/15 16:43:11 INFO HttpServer: Starting HTTP Server 14/07/15 16:43:12 INFO SparkUI: Started SparkUI at http://office-5-148.pa.gopivotal.com:4040 2014-07-15 16:43:12.210 java[39068:1903] Unable to load realm info from SCDynamicStore 14/07/15 16:43:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/07/15 16:43:12 INFO SparkContext: Added JAR examples/target/scala-2.10/spark-examples-1.1.0-SNAPSHOT-hadoop2.4.0.jar at http://10.64.5.148:53206/jars/spark-examples-1.1.0-SNAPSHOT-hadoop2.4.0.jar with timestamp 1405467792813 14/07/15 16:43:12 INFO AppClient$ClientActor: Connecting to master spark://henry-pivotal.local:7077... 14/07/15 16:43:12 INFO SparkContext: Starting job: reduce at SparkTachyonPi.scala:43 14/07/15 16:43:12 INFO DAGScheduler: Got job 0 (reduce at SparkTachyonPi.scala:43) with 2 output partitions (allowLocal=false) 14/07/15 16:43:12 INFO DAGScheduler: Final stage: Stage 0(reduce at SparkTachyonPi.scala:43) 14/07/15 16:43:12 INFO DAGScheduler: Parents of final stage: List() 14/07/15 16:43:12 INFO DAGScheduler: Missing parents: List() 14/07/15 16:43:12 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[1] at map at SparkTachyonPi.scala:39), which has no missing parents 14/07/15 16:43:13 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MappedRDD[1] at map at SparkTachyonPi.scala:39) 14/07/15 16:43:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks 14/07/15 16:43:13 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20140715164313- 14/07/15 16:43:13 INFO AppClient$ClientActor: Executor added: app-20140715164313-/0 on worker-20140715164009-office-5-148.pa.gopivotal.com-52519 (office-5-148.pa.gopivotal.com:52519) with 8 cores 14/07/15 16:43:13 INFO SparkDeploySchedulerBackend: Granted
[jira] [Resolved] (SPARK-2513) Correlations
[ https://issues.apache.org/jira/browse/SPARK-2513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-2513. -- Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1367 [https://github.com/apache/spark/pull/1367] Correlations Key: SPARK-2513 URL: https://issues.apache.org/jira/browse/SPARK-2513 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Doris Xin Fix For: 1.1.0 PR: https://github.com/apache/spark/pull/1367 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2587) Error message is incorrect in make-distribution.sh
Mark Wagner created SPARK-2587: -- Summary: Error message is incorrect in make-distribution.sh Key: SPARK-2587 URL: https://issues.apache.org/jira/browse/SPARK-2587 Project: Spark Issue Type: Bug Reporter: Mark Wagner Priority: Minor SPARK-2526 removed some options in favor of using Maven profiles, but it now gives incorrect guidance for those that try to use the old --with-hive flag: --with-hive' is no longer supported, use Maven option -Pyarn -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2567) Resubmitted stage sometimes remains as active stage in the web UI
[ https://issues.apache.org/jira/browse/SPARK-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067306#comment-14067306 ] Masayoshi TSUZUKI commented on SPARK-2567: -- I found this lines in log file. {noformat} 14/07/17 18:07:36 DEBUG DAGScheduler: Stage Stage 1 is actually done; true 3 3 {noformat} so I think this is because extra stage which has no tasks to be executed is submitted, and then SparkListenerStageSubmitted is called but SparkListenerStageCompleted is not called. Resubmitted stage sometimes remains as active stage in the web UI - Key: SPARK-2567 URL: https://issues.apache.org/jira/browse/SPARK-2567 Project: Spark Issue Type: Bug Reporter: Masayoshi TSUZUKI Attachments: SPARK-2567.png When a stage is resubmitted because of executor lost for example, sometimes more than one resubmitted task appears in the web UI and one stage remains as active even after the job has finished. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2587) Error message is incorrect in make-distribution.sh
[ https://issues.apache.org/jira/browse/SPARK-2587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Wagner updated SPARK-2587: --- Component/s: Build Error message is incorrect in make-distribution.sh -- Key: SPARK-2587 URL: https://issues.apache.org/jira/browse/SPARK-2587 Project: Spark Issue Type: Bug Components: Build Reporter: Mark Wagner Priority: Minor SPARK-2526 removed some options in favor of using Maven profiles, but it now gives incorrect guidance for those that try to use the old --with-hive flag: --with-hive' is no longer supported, use Maven option -Pyarn -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2583) ConnectionManager cannot distinguish whether error occurred or not
[ https://issues.apache.org/jira/browse/SPARK-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067309#comment-14067309 ] Kousuke Saruta commented on SPARK-2583: --- PR: https://github.com/apache/spark/pull/1490 ConnectionManager cannot distinguish whether error occurred or not -- Key: SPARK-2583 URL: https://issues.apache.org/jira/browse/SPARK-2583 Project: Spark Issue Type: Bug Reporter: Kousuke Saruta ConnectionManager#handleMessage sent empty messages to another peer if some error occurred or not in onReceiveCalback. {code} val ackMessage = if (onReceiveCallback != null) { logDebug(Calling back) onReceiveCallback(bufferMessage, connectionManagerId) } else { logDebug(Not calling back as callback is null) None } if (ackMessage.isDefined) { if (!ackMessage.get.isInstanceOf[BufferMessage]) { logDebug(Response to + bufferMessage + is not a buffer message, it is of type + ackMessage.get.getClass) } else if (!ackMessage.get.asInstanceOf[BufferMessage].hasAckId) { logDebug(Response to + bufferMessage + does not have ack id set) ackMessage.get.asInstanceOf[BufferMessage].ackId = bufferMessage.id } } // We have no way to tell peer whether error occurred or not sendMessage(connectionManagerId, ackMessage.getOrElse { Message.createBufferMessage(bufferMessage.id) }) } {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2576) slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file
[ https://issues.apache.org/jira/browse/SPARK-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Svend updated SPARK-2576: - Description: Execution of SQL query against HDFS systematically throws a class not found exception on slave nodes when executing . (this was originally reported on the user list: http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-1-spark-sql-error-java-lang-NoClassDefFoundError-Could-not-initialize-class-line11-read-tc10135.html) Sample code (ran from spark-shell): {code} val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.createSchemaRDD case class Car(timestamp: Long, objectid: String, isGreen: Boolean) // I get the same error when pointing to the folder hdfs://vm28:8020/test/cardata val data = sc.textFile(hdfs://vm28:8020/test/cardata/part-0) val cars = data.map(_.split(,)).map ( ar = Car(ar(0).toLong, ar(1), ar(2).toBoolean)) cars.registerAsTable(mcars) val allgreens = sqlContext.sql(SELECT objectid from mcars where isGreen = true) allgreens.collect.take(10).foreach(println) {code} Stack trace on the slave nodes: {code} I0716 13:01:16.215158 13631 exec.cpp:131] Version: 0.19.0 I0716 13:01:16.219285 13656 exec.cpp:205] Executor registered on slave 20140714-142853-485682442-5050-25487-2 14/07/16 13:01:16 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20140714-142853-485682442-5050-25487-2 14/07/16 13:01:16 INFO SecurityManager: Changing view acls to: mesos,mnubohadoop 14/07/16 13:01:16 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mesos, mnubohadoop) 14/07/16 13:01:17 INFO Slf4jLogger: Slf4jLogger started 14/07/16 13:01:17 INFO Remoting: Starting remoting 14/07/16 13:01:17 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sp...@vm23-hulk-priv.mtl.mnubo.com:38230] 14/07/16 13:01:17 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sp...@vm23-hulk-priv.mtl.mnubo.com:38230] 14/07/16 13:01:17 INFO SparkEnv: Connecting to MapOutputTracker: akka.tcp://spark@vm28-hulk-pub:41632/user/MapOutputTracker 14/07/16 13:01:17 INFO SparkEnv: Connecting to BlockManagerMaster: akka.tcp://spark@vm28-hulk-pub:41632/user/BlockManagerMaster 14/07/16 13:01:17 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140716130117-8ea0 14/07/16 13:01:17 INFO MemoryStore: MemoryStore started with capacity 294.9 MB. 14/07/16 13:01:17 INFO ConnectionManager: Bound socket to port 44501 with id = ConnectionManagerId(vm23-hulk-priv.mtl.mnubo.com,44501) 14/07/16 13:01:17 INFO BlockManagerMaster: Trying to register BlockManager 14/07/16 13:01:17 INFO BlockManagerMaster: Registered BlockManager 14/07/16 13:01:17 INFO HttpFileServer: HTTP File server directory is /tmp/spark-ccf6f36c-2541-4a25-8fe4-bb4ba00ee633 14/07/16 13:01:17 INFO HttpServer: Starting HTTP Server 14/07/16 13:01:18 INFO Executor: Using REPL class URI: http://vm28-hulk-pub:33973 14/07/16 13:01:18 INFO Executor: Running task ID 2 14/07/16 13:01:18 INFO HttpBroadcast: Started reading broadcast variable 0 14/07/16 13:01:18 INFO MemoryStore: ensureFreeSpace(125590) called with curMem=0, maxMem=309225062 14/07/16 13:01:18 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 122.6 KB, free 294.8 MB) 14/07/16 13:01:18 INFO HttpBroadcast: Reading broadcast variable 0 took 0.294602722 s 14/07/16 13:01:19 INFO HadoopRDD: Input split: hdfs://vm28:8020/test/cardata/part-0:23960450+23960451 I0716 13:01:19.905113 13657 exec.cpp:378] Executor asked to shutdown 14/07/16 13:01:20 ERROR Executor: Exception in task ID 2 java.lang.NoClassDefFoundError: $line11/$read$ at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:19) at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:19) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$1.next(Iterator.scala:853) at scala.collection.Iterator$$anon$1.head(Iterator.scala:840) at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:181) at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:176) at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559) at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at
[jira] [Updated] (SPARK-2576) slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file
[ https://issues.apache.org/jira/browse/SPARK-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Svend updated SPARK-2576: - Description: Execution of SQL query against HDFS systematically throws a class not found exception on slave nodes when executing . (this was originally reported on the user list: http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-1-spark-sql-error-java-lang-NoClassDefFoundError-Could-not-initialize-class-line11-read-tc10135.html) Sample code (ran from spark-shell): {code} val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.createSchemaRDD case class Car(timestamp: Long, objectid: String, isGreen: Boolean) // I get the same error when pointing to the folder hdfs://vm28:8020/test/cardata val data = sc.textFile(hdfs://vm28:8020/test/cardata/part-0) val cars = data.map(_.split(,)).map ( ar = Car(ar(0).toLong, ar(1), ar(2).toBoolean)) cars.registerAsTable(mcars) val allgreens = sqlContext.sql(SELECT objectid from mcars where isGreen = true) allgreens.collect.take(10).foreach(println) {code} Stack trace on the slave nodes: {code} I0716 13:01:16.215158 13631 exec.cpp:131] Version: 0.19.0 I0716 13:01:16.219285 13656 exec.cpp:205] Executor registered on slave 20140714-142853-485682442-5050-25487-2 14/07/16 13:01:16 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20140714-142853-485682442-5050-25487-2 14/07/16 13:01:16 INFO SecurityManager: Changing view acls to: mesos,mnubohadoop 14/07/16 13:01:16 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mesos, mnubohadoop) 14/07/16 13:01:17 INFO Slf4jLogger: Slf4jLogger started 14/07/16 13:01:17 INFO Remoting: Starting remoting 14/07/16 13:01:17 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@vm23:38230] 14/07/16 13:01:17 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@vm23:38230] 14/07/16 13:01:17 INFO SparkEnv: Connecting to MapOutputTracker: akka.tcp://spark@vm28:41632/user/MapOutputTracker 14/07/16 13:01:17 INFO SparkEnv: Connecting to BlockManagerMaster: akka.tcp://spark@vm28:41632/user/BlockManagerMaster 14/07/16 13:01:17 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140716130117-8ea0 14/07/16 13:01:17 INFO MemoryStore: MemoryStore started with capacity 294.9 MB. 14/07/16 13:01:17 INFO ConnectionManager: Bound socket to port 44501 with id = ConnectionManagerId(vm23-hulk-priv.mtl.mnubo.com,44501) 14/07/16 13:01:17 INFO BlockManagerMaster: Trying to register BlockManager 14/07/16 13:01:17 INFO BlockManagerMaster: Registered BlockManager 14/07/16 13:01:17 INFO HttpFileServer: HTTP File server directory is /tmp/spark-ccf6f36c-2541-4a25-8fe4-bb4ba00ee633 14/07/16 13:01:17 INFO HttpServer: Starting HTTP Server 14/07/16 13:01:18 INFO Executor: Using REPL class URI: http://vm28:33973 14/07/16 13:01:18 INFO Executor: Running task ID 2 14/07/16 13:01:18 INFO HttpBroadcast: Started reading broadcast variable 0 14/07/16 13:01:18 INFO MemoryStore: ensureFreeSpace(125590) called with curMem=0, maxMem=309225062 14/07/16 13:01:18 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 122.6 KB, free 294.8 MB) 14/07/16 13:01:18 INFO HttpBroadcast: Reading broadcast variable 0 took 0.294602722 s 14/07/16 13:01:19 INFO HadoopRDD: Input split: hdfs://vm28:8020/test/cardata/part-0:23960450+23960451 I0716 13:01:19.905113 13657 exec.cpp:378] Executor asked to shutdown 14/07/16 13:01:20 ERROR Executor: Exception in task ID 2 java.lang.NoClassDefFoundError: $line11/$read$ at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:19) at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:19) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$1.next(Iterator.scala:853) at scala.collection.Iterator$$anon$1.head(Iterator.scala:840) at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:181) at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:176) at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559) at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at
[jira] [Commented] (SPARK-2552) Stabilize the computation of logistic function in pyspark
[ https://issues.apache.org/jira/browse/SPARK-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067342#comment-14067342 ] Michael Yannakopoulos commented on SPARK-2552: -- Hi Xiangrui, From what I have seen so far, this error affects the prediction made using the _predict_ method of _LogisticRegressionModel_ defined in _spark/python/pyspark/mllib/classification.py_ file. Is there any other occurence of this issue in another file as well??? I can see two solutions in order to solve this issue: a) Either check if the dot product between coeffs and data attributes gives a value in the desired range [ -745, 709 ] and if not to just set it to the one of these two values. b) To create specific math functions in Java such as Logistic Function, SoftMax, etc.. and call them via py4j in python2.7 and store the result in a 'decimal.Decimal' variable. Thanks, Michael Stabilize the computation of logistic function in pyspark - Key: SPARK-2552 URL: https://issues.apache.org/jira/browse/SPARK-2552 Project: Spark Issue Type: Bug Components: MLlib, PySpark Reporter: Xiangrui Meng Labels: Starter exp(1000) throws an error in python. For logistic function, we can use either 1 / ( 1 + exp( -x ) ) or 1 - 1 / (1 + exp( x ) ) to compute its value which ensuring exp always takes a negative value. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2576) slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file
[ https://issues.apache.org/jira/browse/SPARK-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067348#comment-14067348 ] Svend commented on SPARK-2576: -- ctually I do not have a line 19 ??? I just re-ran this code, with no additional line breaks: {code} val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.createSchemaRDD case class Car(timestamp: Long, objectid: String, isGreen: Boolean) val data = sc.textFile(hdfs://vm28:8020/test/cardata) val cars = data.map(_.split(,)).map ( ar = Car(ar(0).toLong, ar(1), ar(2).toBoolean)) cars.registerAsTable(mcars) val allgreens = sqlContext.sql(SELECT objectid from mcars where isGreen = true) allgreens.collect.take(10).foreach(println) {code} I first get this error: {code} java.lang.ExceptionInInitializerError at $line10.$read$$iwC.init(console:6) at $line10.$read.init(console:26) at $line10.$read$.init(console:30) at $line10.$read$.clinit(console) at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:19) at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:19) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$1.next(Iterator.scala:853) at scala.collection.Iterator$$anon$1.head(Iterator.scala:840) at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:181) at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:176) at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559) at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) Caused by: java.lang.NullPointerException at $line3.$read$$iwC$$iwC.init(console:8) at $line3.$read$$iwC.init(console:14) at $line3.$read.init(console:16) at $line3.$read$.init(console:20) at $line3.$read$.clinit(console) ... 31 more {code} (I do not know where do line 26, 30 or 12 come from) Then right after this one: {code} java.lang.NoClassDefFoundError: Could not initialize class $line10.$read$ at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:19) at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:19) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$1.next(Iterator.scala:853) at scala.collection.Iterator$$anon$1.head(Iterator.scala:840) at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:181) at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:176) at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559) at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) at
[jira] [Comment Edited] (SPARK-2576) slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file
[ https://issues.apache.org/jira/browse/SPARK-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067348#comment-14067348 ] Svend edited comment on SPARK-2576 at 7/19/14 2:24 AM: --- actually I do not have a line 19 ??? I just re-ran this code, with no additional line breaks: {code} val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.createSchemaRDD case class Car(timestamp: Long, objectid: String, isGreen: Boolean) val data = sc.textFile(hdfs://vm28:8020/test/cardata) val cars = data.map(_.split(,)).map ( ar = Car(ar(0).toLong, ar(1), ar(2).toBoolean)) cars.registerAsTable(mcars) val allgreens = sqlContext.sql(SELECT objectid from mcars where isGreen = true) allgreens.collect.take(10).foreach(println) {code} I first get this error: {code} java.lang.ExceptionInInitializerError at $line10.$read$$iwC.init(console:6) at $line10.$read.init(console:26) at $line10.$read$.init(console:30) at $line10.$read$.clinit(console) at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:19) at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:19) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$1.next(Iterator.scala:853) at scala.collection.Iterator$$anon$1.head(Iterator.scala:840) at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:181) at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:176) at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559) at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) Caused by: java.lang.NullPointerException at $line3.$read$$iwC$$iwC.init(console:8) at $line3.$read$$iwC.init(console:14) at $line3.$read.init(console:16) at $line3.$read$.init(console:20) at $line3.$read$.clinit(console) ... 31 more {code} (I do not know where do line 26, 30 or 12 come from) Then right after this one: {code} java.lang.NoClassDefFoundError: Could not initialize class $line10.$read$ at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:19) at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:19) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$1.next(Iterator.scala:853) at scala.collection.Iterator$$anon$1.head(Iterator.scala:840) at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:181) at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:176) at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559) at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) at
[jira] [Commented] (SPARK-2576) slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file
[ https://issues.apache.org/jira/browse/SPARK-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067349#comment-14067349 ] Svend commented on SPARK-2576: -- I just noted that running the same spark-shell code with the session launched like this: {code} ./bin/spark-shell {code} (without connecting to mesos) computes the result successfully. Could it be that the {code} case class Car(timestamp: Long, objectid: String, isGreen: Boolean) {code} defined in the shell is not made available to the remote slaves on mesos? slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file -- Key: SPARK-2576 URL: https://issues.apache.org/jira/browse/SPARK-2576 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.0.1 Environment: One Mesos 0.19 master without zookeeper and 4 mesos slaves. JDK 1.7.51 and Scala 2.10.4 on all nodes. HDFS from CDH5.0.3 Spark version: I tried both with the pre-built CDH5 spark package available from http://spark.apache.org/downloads.html and by packaging spark with sbt 0.13.2, JDK 1.7.51 and scala 2.10.4 as explained here http://mesosphere.io/learn/run-spark-on-mesos/ All nodes are running Debian 3.2.51-1 x86_64 GNU/Linux and have Reporter: Svend Execution of SQL query against HDFS systematically throws a class not found exception on slave nodes when executing . (this was originally reported on the user list: http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-1-spark-sql-error-java-lang-NoClassDefFoundError-Could-not-initialize-class-line11-read-tc10135.html) Sample code (ran from spark-shell): {code} val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.createSchemaRDD case class Car(timestamp: Long, objectid: String, isGreen: Boolean) // I get the same error when pointing to the folder hdfs://vm28:8020/test/cardata val data = sc.textFile(hdfs://vm28:8020/test/cardata/part-0) val cars = data.map(_.split(,)).map ( ar = Car(ar(0).toLong, ar(1), ar(2).toBoolean)) cars.registerAsTable(mcars) val allgreens = sqlContext.sql(SELECT objectid from mcars where isGreen = true) allgreens.collect.take(10).foreach(println) {code} Stack trace on the slave nodes: {code} I0716 13:01:16.215158 13631 exec.cpp:131] Version: 0.19.0 I0716 13:01:16.219285 13656 exec.cpp:205] Executor registered on slave 20140714-142853-485682442-5050-25487-2 14/07/16 13:01:16 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20140714-142853-485682442-5050-25487-2 14/07/16 13:01:16 INFO SecurityManager: Changing view acls to: mesos,mnubohadoop 14/07/16 13:01:16 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mesos, mnubohadoop) 14/07/16 13:01:17 INFO Slf4jLogger: Slf4jLogger started 14/07/16 13:01:17 INFO Remoting: Starting remoting 14/07/16 13:01:17 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@vm23:38230] 14/07/16 13:01:17 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@vm23:38230] 14/07/16 13:01:17 INFO SparkEnv: Connecting to MapOutputTracker: akka.tcp://spark@vm28:41632/user/MapOutputTracker 14/07/16 13:01:17 INFO SparkEnv: Connecting to BlockManagerMaster: akka.tcp://spark@vm28:41632/user/BlockManagerMaster 14/07/16 13:01:17 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140716130117-8ea0 14/07/16 13:01:17 INFO MemoryStore: MemoryStore started with capacity 294.9 MB. 14/07/16 13:01:17 INFO ConnectionManager: Bound socket to port 44501 with id = ConnectionManagerId(vm23-hulk-priv.mtl.mnubo.com,44501) 14/07/16 13:01:17 INFO BlockManagerMaster: Trying to register BlockManager 14/07/16 13:01:17 INFO BlockManagerMaster: Registered BlockManager 14/07/16 13:01:17 INFO HttpFileServer: HTTP File server directory is /tmp/spark-ccf6f36c-2541-4a25-8fe4-bb4ba00ee633 14/07/16 13:01:17 INFO HttpServer: Starting HTTP Server 14/07/16 13:01:18 INFO Executor: Using REPL class URI: http://vm28:33973 14/07/16 13:01:18 INFO Executor: Running task ID 2 14/07/16 13:01:18 INFO HttpBroadcast: Started reading broadcast variable 0 14/07/16 13:01:18 INFO MemoryStore: ensureFreeSpace(125590) called with curMem=0, maxMem=309225062 14/07/16 13:01:18 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 122.6 KB, free 294.8 MB) 14/07/16 13:01:18 INFO HttpBroadcast: Reading broadcast variable 0 took 0.294602722 s 14/07/16 13:01:19 INFO HadoopRDD: Input split: hdfs://vm28:8020/test/cardata/part-0:23960450+23960451 I0716 13:01:19.905113 13657 exec.cpp:378] Executor asked to shutdown 14/07/16
[jira] [Comment Edited] (SPARK-2576) slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file
[ https://issues.apache.org/jira/browse/SPARK-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067348#comment-14067348 ] Svend edited comment on SPARK-2576 at 7/19/14 2:25 AM: --- actually I do not have a line 19 ??? I just re-ran these 8 lines of code, with no additional line breaks, on a freshly started session: {code} val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.createSchemaRDD case class Car(timestamp: Long, objectid: String, isGreen: Boolean) val data = sc.textFile(hdfs://vm28:8020/test/cardata) val cars = data.map(_.split(,)).map ( ar = Car(ar(0).toLong, ar(1), ar(2).toBoolean)) cars.registerAsTable(mcars) val allgreens = sqlContext.sql(SELECT objectid from mcars where isGreen = true) allgreens.collect.take(10).foreach(println) {code} I first get this error: {code} java.lang.ExceptionInInitializerError at $line10.$read$$iwC.init(console:6) at $line10.$read.init(console:26) at $line10.$read$.init(console:30) at $line10.$read$.clinit(console) at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:19) at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:19) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$1.next(Iterator.scala:853) at scala.collection.Iterator$$anon$1.head(Iterator.scala:840) at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:181) at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:176) at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559) at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) Caused by: java.lang.NullPointerException at $line3.$read$$iwC$$iwC.init(console:8) at $line3.$read$$iwC.init(console:14) at $line3.$read.init(console:16) at $line3.$read$.init(console:20) at $line3.$read$.clinit(console) ... 31 more {code} (I do not know where do line 26, 30 or 12 come from) Then right after this one: {code} java.lang.NoClassDefFoundError: Could not initialize class $line10.$read$ at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:19) at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:19) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$1.next(Iterator.scala:853) at scala.collection.Iterator$$anon$1.head(Iterator.scala:840) at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:181) at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:176) at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559) at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at
[jira] [Created] (SPARK-2588) Add some more DSLs.
Takuya Ueshin created SPARK-2588: Summary: Add some more DSLs. Key: SPARK-2588 URL: https://issues.apache.org/jira/browse/SPARK-2588 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (SPARK-2045) Sort-based shuffle implementation
[ https://issues.apache.org/jira/browse/SPARK-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia reassigned SPARK-2045: Assignee: Matei Zaharia Sort-based shuffle implementation - Key: SPARK-2045 URL: https://issues.apache.org/jira/browse/SPARK-2045 Project: Spark Issue Type: New Feature Reporter: Matei Zaharia Assignee: Matei Zaharia Attachments: Sort-basedshuffledesign.pdf Building on the pluggability in SPARK-2044, a sort-based shuffle implementation that takes advantage of an Ordering for keys (or just sorts by hashcode for keys that don't have it) would likely improve performance and memory usage in very large shuffles. Our current hash-based shuffle needs an open file for each reduce task, which can fill up a lot of memory for compression buffers and cause inefficient IO. This would avoid both of those issues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2589) Support HAVING clause generated by Tableau
Cheng Lian created SPARK-2589: - Summary: Support HAVING clause generated by Tableau Key: SPARK-2589 URL: https://issues.apache.org/jira/browse/SPARK-2589 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.1, 1.0.2 Reporter: Cheng Lian The query on how Tableau generates # of records: {code} SELECT SUM(1) AS `sum_number_of_records_ok` FROM `some_db`.`some_table` `some_table` GROUP BY 1 HAVING (COUNT(1) 0) {code} The following query can be used to reproduce this issue under {{sbt hive/console}} {code} select sum(1) from src group by 1 having (count(1) 0); {code} Exception stacktrace: {code} org.apache.spark.sql.catalyst.errors.package$TreeNodeException (No function to evaluate expression. type: Count, tree: COUNT(1)) org.apache.spark.sql.catalyst.expressions.AggregateExpression.eval(aggregates.scala:40) org.apache.spark.sql.catalyst.expressions.Expression.c2(Expression.scala:207) org.apache.spark.sql.catalyst.expressions.GreaterThan.eval(predicates.scala:168) org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:54) org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:54) scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) scala.collection.AbstractIterator.to(Iterator.scala:1157) scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) scala.collection.AbstractIterator.toArray(Iterator.scala:1157) org.apache.spark.rdd.RDD$$anonfun$org$apache$spark$rdd$RDD$$collectPartition$1$1.apply(RDD.scala:773) org.apache.spark.rdd.RDD$$anonfun$org$apache$spark$rdd$RDD$$collectPartition$1$1.apply(RDD.scala:773) org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1096) org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1096) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:112) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:189) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)