[jira] [Updated] (SPARK-6566) Update Spark to use the latest version of Parquet libraries
[ https://issues.apache.org/jira/browse/SPARK-6566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6566: - Assignee: Yash Datta Update Spark to use the latest version of Parquet libraries --- Key: SPARK-6566 URL: https://issues.apache.org/jira/browse/SPARK-6566 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Konstantin Shaposhnikov Assignee: Yash Datta Fix For: 1.5.0 There are a lot of bug fixes in the latest version of parquet (1.6.0rc7), e.g. PARQUET-136. It would be good to update Spark to use the latest parquet version. The following changes are required:
{code}
diff --git a/pom.xml b/pom.xml
index 5ad39a9..095b519 100644
--- a/pom.xml
+++ b/pom.xml
@@ -132,7 +132,7 @@
     <!-- Version used for internal directory structure -->
     <hive.version.short>0.13.1</hive.version.short>
     <derby.version>10.10.1.1</derby.version>
-    <parquet.version>1.6.0rc3</parquet.version>
+    <parquet.version>1.6.0rc7</parquet.version>
     <jblas.version>1.2.3</jblas.version>
     <jetty.version>8.1.14.v20131031</jetty.version>
     <orbit.version>3.0.0.v201112011016</orbit.version>
{code}
and
{code}
--- a/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
@@ -480,7 +480,7 @@ private[parquet] class FilteringParquetRowInputFormat
     globalMetaData = new GlobalMetaData(globalMetaData.getSchema,
       mergedMetadata, globalMetaData.getCreatedBy)

-    val readContext = getReadSupport(configuration).init(
+    val readContext = ParquetInputFormat.getReadSupportInstance(configuration).init(
       new InitContext(configuration,
         globalMetaData.getKeyValueMetaData,
         globalMetaData.getSchema))
{code}
I am happy to prepare a pull request if necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8337) KafkaUtils.createDirectStream for python is lacking API/feature parity with the Scala/Java version
[ https://issues.apache.org/jira/browse/SPARK-8337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588064#comment-14588064 ] Juan Rodríguez Hortalá commented on SPARK-8337: --- Hi, I've made some advances. Due to the limited support for data types in pyspark and org.apache.spark.api.python.PythonRDD, I think adding a function to createDirectStream from MessageAndMetadata to arbitrary values is not such a good idea. In fact, pyspark currently communicates with the Scala API by using JavaPairInputDStream[Array[Byte], Array[Byte]] and then decoding those arrays of bytes in Python. So what I propose is adding an argument to choose between returning a dstream of (key, value) as is done so far, and a dstream of dictionaries with entries for the key, the value (the message), and also the topic, partition and offset. An approximation to that is implemented in https://github.com/juanrh/spark/commit/7a824a814f56f839d2f3fbeda7e9f7467e683c6e as a python static method KafkaUtils.createDirectStreamJ, which uses KafkaUtilsPythonHelper.createDirectStreamJ. The following Python code can be used to try it:
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
ssc = StreamingContext(sc, 1)
topics = ["test"]
kafkaParams = {"metadata.broker.list": "localhost:9092"}
kafkaStream = KafkaUtils.createDirectStreamJ(ssc, topics, kafkaParams)
kafkaStream.pprint()
ssc.start()
ssc.awaitTermination(timeout=5)
which gets the following output:
15/06/16 15:31:00 INFO TaskSchedulerImpl: Removed TaskSet 8.0, whose tasks have all completed, from pool
15/06/16 15:31:00 INFO DAGScheduler: ResultStage 8 (start at NativeMethodAccessorImpl.java:-2) finished
15/06/16 15:31:00 INFO DAGScheduler: Job 8 finished: start at NativeMethodAccessorImpl.java:-2, took 0,0
--- Time: 2015-06-16 15:31:00 ---
{'topic': u'test', 'partition': 0, 'value': u'q tal?', 'key': None, 'offset': 87L}
()
15/06/16 15:31:00
I have encoded the dictionary with the following Scala type alias, which uses types that PythonRDD can understand:
/** Using this weird type due to the limited set of types
 * supported by PythonRDD. This corresponds to
 *
 * ((key, message), (topic, (partition, offset)))
 *
 * where the key and the message are encoded as Array[Byte],
 * and topic, partition and offset are encoded as String.
 * Note we cannot even use triples because only pairs are supported
 * (we get an exception "Unexpected element type class scala.Tuple3") */
type PyKafkaMsgWrapper = ((Array[Byte], Array[Byte]), (String, (String, String)))
If this is enough for you, I can refactor things to join KafkaUtils.createDirectStreamJ and KafkaUtils.createDirectStream into a single method, with an additional argument to specify whether the meta info is required, defaulting to False so the behaviour is the same as before. Looking forward to hearing your opinions on this. Greetings, Juan Rodriguez Hortala KafkaUtils.createDirectStream for python is lacking API/feature parity with the Scala/Java version -- Key: SPARK-8337 URL: https://issues.apache.org/jira/browse/SPARK-8337 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Reporter: Amit Ramesh Priority: Critical See the following thread for context.
http://apache-spark-developers-list.1001551.n3.nabble.com/Re-Spark-1-4-Python-API-for-getting-Kafka-offsets-in-direct-mode-tt12714.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
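For context, a minimal sketch of how one consumed Kafka record could be packed into the pair-only shape described in the comment above (the {{wrap}} helper is hypothetical; the accessors are from the Kafka 0.8 Scala client):
{code}
import kafka.message.MessageAndMetadata

// ((key, message), (topic, (partition, offset))) -- pairs only, because
// PythonRDD cannot serialize tuples of arity greater than 2.
type PyKafkaMsgWrapper = ((Array[Byte], Array[Byte]), (String, (String, String)))

// Hypothetical helper: pack one record into the wrapper shape, encoding
// topic, partition and offset as Strings as the proposal describes.
def wrap(mmd: MessageAndMetadata[Array[Byte], Array[Byte]]): PyKafkaMsgWrapper =
  ((mmd.key(), mmd.message()), (mmd.topic, (mmd.partition.toString, mmd.offset.toString)))
{code}
On the Python side, each wrapper can then be flattened into the dictionary shown in the sample output above.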
[jira] [Created] (SPARK-8395) spark-submit documentation is incorrect
Dev Lakhani created SPARK-8395: -- Summary: spark-submit documentation is incorrect Key: SPARK-8395 URL: https://issues.apache.org/jira/browse/SPARK-8395 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.4.0 Reporter: Dev Lakhani Priority: Minor Using a fresh checkout of 1.4.0-bin-hadoop2.6, if you run ./start-slave.sh 1 spark://localhost:7077 you get "failed to launch org.apache.spark.deploy.worker.Worker: Default is conf/spark-defaults.conf. 15/06/16 13:11:08 INFO Utils: Shutdown hook called". It seems the worker number is not being accepted as described here: https://spark.apache.org/docs/latest/spark-standalone.html The documentation says: ./sbin/start-slave.sh <worker#> <master-spark-URL> but the start-slave.sh script states: usage="Usage: start-slave.sh <spark-master-URL>" where <spark-master-URL> is like spark://localhost:7077 I have checked for similar issues using https://issues.apache.org/jira/browse/SPARK-6552?jql=text%20~%20%22start-slave%22 and found nothing similar, so I am raising this as an issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
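For clarity, the two forms side by side, restated from the report above (the bracketed placeholders mark the arguments):
{code}
# What the standalone documentation describes:
./sbin/start-slave.sh <worker#> <master-spark-URL>

# What the 1.4.0 script actually accepts:
./sbin/start-slave.sh <spark-master-URL>    # e.g. spark://localhost:7077
{code}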
[jira] [Assigned] (SPARK-8333) Spark failed to delete temp directory created by HiveContext
[ https://issues.apache.org/jira/browse/SPARK-8333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8333: --- Assignee: Apache Spark Spark failed to delete temp directory created by HiveContext Key: SPARK-8333 URL: https://issues.apache.org/jira/browse/SPARK-8333 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Environment: Windows7 64bit Reporter: sheng Assignee: Apache Spark Priority: Minor Labels: Hive, metastore, sparksql Spark 1.4.0 failed to stop SparkContext.
{code:title=LocalHiveTest.scala|borderStyle=solid}
val sc = new SparkContext("local", "local-hive-test", new SparkConf())
val hc = Utils.createHiveContext(sc)
... // execute some HiveQL statements
sc.stop()
{code}
sc.stop() failed to execute; it threw the following exception: {quote} 15/06/13 03:19:06 INFO Utils: Shutdown hook called 15/06/13 03:19:06 INFO Utils: Deleting directory C:\Users\moshangcheng\AppData\Local\Temp\spark-d6d3c30e-512e-4693-a436-485e2af4baea 15/06/13 03:19:06 ERROR Utils: Exception while deleting Spark temp dir: C:\Users\moshangcheng\AppData\Local\Temp\spark-d6d3c30e-512e-4693-a436-485e2af4baea java.io.IOException: Failed to delete: C:\Users\moshangcheng\AppData\Local\Temp\spark-d6d3c30e-512e-4693-a436-485e2af4baea at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:963) at org.apache.spark.util.Utils$$anonfun$1$$anonfun$apply$mcV$sp$5.apply(Utils.scala:204) at org.apache.spark.util.Utils$$anonfun$1$$anonfun$apply$mcV$sp$5.apply(Utils.scala:201) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.util.Utils$$anonfun$1.apply$mcV$sp(Utils.scala:201) at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2292) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(Utils.scala:2262) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2262) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2262) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(Utils.scala:2262) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2262) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2262) at scala.util.Try$.apply(Try.scala:161) at org.apache.spark.util.SparkShutdownHookManager.runAll(Utils.scala:2262) at org.apache.spark.util.SparkShutdownHookManager$$anon$6.run(Utils.scala:2244) at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) {quote} It seems this bug was introduced by SPARK-6907, where a local hive metastore is created in a temp directory. The problem is that the local hive metastore is not shut down correctly. At the end of the application, when SparkContext.stop() is called, it tries to delete the temp directory, which is still in use by the local hive metastore, and throws an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
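A minimal sketch of the shutdown-ordering idea (illustrative only, not the actual Spark fix): shut the embedded Derby metastore down before the temp-dir cleanup hook runs, so no metastore files are still locked on Windows when deleteRecursively is called.
{code}
import java.sql.DriverManager

sys.addShutdownHook {
  // Derby signals a successful engine shutdown by *throwing* an
  // SQLException (SQLState XJ015), so the catch below is the normal path.
  try DriverManager.getConnection("jdbc:derby:;shutdown=true")
  catch { case _: java.sql.SQLException => () }
}
{code}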
[jira] [Assigned] (SPARK-8333) Spark failed to delete temp directory created by HiveContext
[ https://issues.apache.org/jira/browse/SPARK-8333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8333: --- Assignee: (was: Apache Spark) Spark failed to delete temp directory created by HiveContext Key: SPARK-8333 URL: https://issues.apache.org/jira/browse/SPARK-8333 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Environment: Windows7 64bit Reporter: sheng Priority: Minor Labels: Hive, metastore, sparksql Spark 1.4.0 failed to stop SparkContext.
{code:title=LocalHiveTest.scala|borderStyle=solid}
val sc = new SparkContext("local", "local-hive-test", new SparkConf())
val hc = Utils.createHiveContext(sc)
... // execute some HiveQL statements
sc.stop()
{code}
sc.stop() failed to execute; it threw the following exception: {quote} 15/06/13 03:19:06 INFO Utils: Shutdown hook called 15/06/13 03:19:06 INFO Utils: Deleting directory C:\Users\moshangcheng\AppData\Local\Temp\spark-d6d3c30e-512e-4693-a436-485e2af4baea 15/06/13 03:19:06 ERROR Utils: Exception while deleting Spark temp dir: C:\Users\moshangcheng\AppData\Local\Temp\spark-d6d3c30e-512e-4693-a436-485e2af4baea java.io.IOException: Failed to delete: C:\Users\moshangcheng\AppData\Local\Temp\spark-d6d3c30e-512e-4693-a436-485e2af4baea at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:963) at org.apache.spark.util.Utils$$anonfun$1$$anonfun$apply$mcV$sp$5.apply(Utils.scala:204) at org.apache.spark.util.Utils$$anonfun$1$$anonfun$apply$mcV$sp$5.apply(Utils.scala:201) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.util.Utils$$anonfun$1.apply$mcV$sp(Utils.scala:201) at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2292) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(Utils.scala:2262) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2262) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2262) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(Utils.scala:2262) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2262) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2262) at scala.util.Try$.apply(Try.scala:161) at org.apache.spark.util.SparkShutdownHookManager.runAll(Utils.scala:2262) at org.apache.spark.util.SparkShutdownHookManager$$anon$6.run(Utils.scala:2244) at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) {quote} It seems this bug was introduced by SPARK-6907, where a local hive metastore is created in a temp directory. The problem is that the local hive metastore is not shut down correctly. At the end of the application, when SparkContext.stop() is called, it tries to delete the temp directory, which is still in use by the local hive metastore, and throws an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8333) Spark failed to delete temp directory created by HiveContext
[ https://issues.apache.org/jira/browse/SPARK-8333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587937#comment-14587937 ] Apache Spark commented on SPARK-8333: - User 'navis' has created a pull request for this issue: https://github.com/apache/spark/pull/6840 Spark failed to delete temp directory created by HiveContext Key: SPARK-8333 URL: https://issues.apache.org/jira/browse/SPARK-8333 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Environment: Windows7 64bit Reporter: sheng Priority: Minor Labels: Hive, metastore, sparksql Spark 1.4.0 failed to stop SparkContext.
{code:title=LocalHiveTest.scala|borderStyle=solid}
val sc = new SparkContext("local", "local-hive-test", new SparkConf())
val hc = Utils.createHiveContext(sc)
... // execute some HiveQL statements
sc.stop()
{code}
sc.stop() failed to execute; it threw the following exception: {quote} 15/06/13 03:19:06 INFO Utils: Shutdown hook called 15/06/13 03:19:06 INFO Utils: Deleting directory C:\Users\moshangcheng\AppData\Local\Temp\spark-d6d3c30e-512e-4693-a436-485e2af4baea 15/06/13 03:19:06 ERROR Utils: Exception while deleting Spark temp dir: C:\Users\moshangcheng\AppData\Local\Temp\spark-d6d3c30e-512e-4693-a436-485e2af4baea java.io.IOException: Failed to delete: C:\Users\moshangcheng\AppData\Local\Temp\spark-d6d3c30e-512e-4693-a436-485e2af4baea at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:963) at org.apache.spark.util.Utils$$anonfun$1$$anonfun$apply$mcV$sp$5.apply(Utils.scala:204) at org.apache.spark.util.Utils$$anonfun$1$$anonfun$apply$mcV$sp$5.apply(Utils.scala:201) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.util.Utils$$anonfun$1.apply$mcV$sp(Utils.scala:201) at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2292) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(Utils.scala:2262) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2262) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2262) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(Utils.scala:2262) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2262) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2262) at scala.util.Try$.apply(Try.scala:161) at org.apache.spark.util.SparkShutdownHookManager.runAll(Utils.scala:2262) at org.apache.spark.util.SparkShutdownHookManager$$anon$6.run(Utils.scala:2244) at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) {quote} It seems this bug was introduced by SPARK-6907, where a local hive metastore is created in a temp directory. The problem is that the local hive metastore is not shut down correctly. At the end of the application, when SparkContext.stop() is called, it tries to delete the temp directory, which is still in use by the local hive metastore, and throws an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8396) GraphLoader.edgeListFile does not populate Graph.vertices.
[ https://issues.apache.org/jira/browse/SPARK-8396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Barrett updated SPARK-8396: --- Summary: GraphLoader.edgeListFile does not populate Graph.vertices. (was: GraphLoader.edgeListFile does not population Graph.vertices.) GraphLoader.edgeListFile does not populate Graph.vertices. -- Key: SPARK-8396 URL: https://issues.apache.org/jira/browse/SPARK-8396 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.4.0 Environment: Mac OS X. Spark-1.4.0 pre-compiled binary for Hadoop-2.4.0-bin. Reporter: Matthew Barrett Priority: Minor Labels: easyfix, newbie Original Estimate: 24h Remaining Estimate: 24h With input data like this
18090 31237
31237 31225
31225 31285
31285 31200
31200 31197
31197 31195
31195 31346
31346 54013
54013 31256
31256 23121
the code
val graph : Graph[Int, Int] = GraphLoader.edgeListFile(sc, hdfsNode + "/data/misc/Sample_DirectedGraphData.ssv")
graph.vertices.foreach{println}
graph.vertices.foreach{vertex: (VertexId, Int) => println(vertex._1.toString + " *** " + vertex._2.toString)}
prints nothing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
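One thing worth ruling out (an assumption about the cause, not a confirmed diagnosis): foreach{println} runs inside executor JVMs, so its stdout may never reach the driver console even when the vertices exist. Collecting first prints them locally:
{code}
import org.apache.spark.graphx.{Graph, GraphLoader, VertexId}

val graph: Graph[Int, Int] =
  GraphLoader.edgeListFile(sc, hdfsNode + "/data/misc/Sample_DirectedGraphData.ssv")

// Materialize the vertices on the driver before printing.
graph.vertices.collect().foreach { case (id: VertexId, attr: Int) =>
  println(id.toString + " *** " + attr.toString)
}
{code}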
[jira] [Resolved] (SPARK-8143) Spark application history cannot be found even for finished jobs
[ https://issues.apache.org/jira/browse/SPARK-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8143. -- Resolution: Duplicate Fix Version/s: (was: 1.4.0) Spark application history cannot be found even for finished jobs Key: SPARK-8143 URL: https://issues.apache.org/jira/browse/SPARK-8143 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0, 1.3.1 Reporter: Dev Lakhani Whenever a job is killed or finished, because of an application error or otherwise, and I then click on Application Detail UI, even though the job state is FINISHED, I get no log results and the message states: "Application history not found for (app-xyz-abc). Application ABC is still in progress." And no logs are presented. I'm using spark.eventLog.enabled=true and spark.eventLog.dir=/tmp/spark, under which I see lots of files app-2015xyz-abc.inprogress even though the job has failed or finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
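For reference, a sketch of the configuration the report describes (the property names are the standard 1.3.x event-log settings; the directory value is from the report):
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "/tmp/spark")
{code}
The .inprogress suffix is normally removed only when SparkContext.stop() completes, so a killed or crashed driver leaves the file in that state, which matches the "still in progress" message above.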
[jira] [Reopened] (SPARK-8143) Spark application history cannot be found even for finished jobs
[ https://issues.apache.org/jira/browse/SPARK-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-8143: -- Spark application history cannot be found even for finished jobs Key: SPARK-8143 URL: https://issues.apache.org/jira/browse/SPARK-8143 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0, 1.3.1 Reporter: Dev Lakhani Whenever a job is killed or finished, because of an application error or otherwise, and I then click on Application Detail UI, even though the job state is FINISHED, I get no log results and the message states: "Application history not found for (app-xyz-abc). Application ABC is still in progress." And no logs are presented. I'm using spark.eventLog.enabled=true and spark.eventLog.dir=/tmp/spark, under which I see lots of files app-2015xyz-abc.inprogress even though the job has failed or finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7799) Move StreamingContext.actorStream to a separate project and deprecate it in StreamingContext
[ https://issues.apache.org/jira/browse/SPARK-7799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7799: --- Assignee: Apache Spark Move StreamingContext.actorStream to a separate project and deprecate it in StreamingContext -- Key: SPARK-7799 URL: https://issues.apache.org/jira/browse/SPARK-7799 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Shixiong Zhu Assignee: Apache Spark Move {{StreamingContext.actorStream}} to a separate project and deprecate it in {{StreamingContext}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7799) Move StreamingContext.actorStream to a separate project and deprecate it in StreamingContext
[ https://issues.apache.org/jira/browse/SPARK-7799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588036#comment-14588036 ] Apache Spark commented on SPARK-7799: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/6841 Move StreamingContext.actorStream to a separate project and deprecate it in StreamingContext -- Key: SPARK-7799 URL: https://issues.apache.org/jira/browse/SPARK-7799 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Shixiong Zhu Move {{StreamingContext.actorStream}} to a separate project and deprecate it in {{StreamingContext}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7799) Move StreamingContext.actorStream to a separate project and deprecate it in StreamingContext
[ https://issues.apache.org/jira/browse/SPARK-7799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7799: --- Assignee: (was: Apache Spark) Move StreamingContext.actorStream to a separate project and deprecate it in StreamingContext -- Key: SPARK-7799 URL: https://issues.apache.org/jira/browse/SPARK-7799 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Shixiong Zhu Move {{StreamingContext.actorStream}} to a separate project and deprecate it in {{StreamingContext}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8393) JavaStreamingContext#awaitTermination() throws non-declared InterruptedException
[ https://issues.apache.org/jira/browse/SPARK-8393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jaromir Vanek updated SPARK-8393: - Description: Call to {{JavaStreamingContext#awaitTermination()}} can throw InterruptedException which cannot be caught easily in Java because it's not declared in {{@throws(classOf[InterruptedException])}} annotation. This InterruptedException comes originally from ContextWaiter where Java ReentrantLock is used. was: Call to JavaStreamingContext#awaitTermination() can throw InterruptedException which cannot be caught easily in Java because it's not declared in @throws(classOf[InterruptedException]) annotation. This InterruptedException comes originally from ContextWaiter where Java ReentrantLock is used. JavaStreamingContext#awaitTermination() throws non-declared InterruptedException Key: SPARK-8393 URL: https://issues.apache.org/jira/browse/SPARK-8393 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.1 Reporter: Jaromir Vanek Priority: Trivial Call to {{JavaStreamingContext#awaitTermination()}} can throw InterruptedException which cannot be caught easily in Java because it's not declared in {{@throws(classOf[InterruptedException])}} annotation. This InterruptedException comes originally from ContextWaiter where Java ReentrantLock is used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8393) JavaStreamingContext#awaitTermination() throws non-declared InterruptedException
[ https://issues.apache.org/jira/browse/SPARK-8393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588013#comment-14588013 ] Jaromir Vanek commented on SPARK-8393: -- It's not a big problem in Java, but it took me quite a bit of time to realize where exactly this {{InterruptedException}} comes from. In Java it can be caught as a general {{Exception}}:
{code}
try {
  streamingContext.awaitTermination();
} catch (Exception e) {
  if (e instanceof InterruptedException) {
    // handle exception
  }
}
{code}
As far as I know, {{awaitTerminationOrTimeout}} may throw the same exception as well. JavaStreamingContext#awaitTermination() throws non-declared InterruptedException Key: SPARK-8393 URL: https://issues.apache.org/jira/browse/SPARK-8393 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.1 Reporter: Jaromir Vanek Priority: Trivial Call to JavaStreamingContext#awaitTermination() can throw InterruptedException which cannot be caught easily in Java because it's not declared in @throws(classOf[InterruptedException]) annotation. This InterruptedException comes originally from ContextWaiter where Java ReentrantLock is used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
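A sketch of what declaring the exception could look like on the Scala side (simplified; {{ssc}} stands in for the wrapped StreamingContext, and this is an assumption about the shape of a fix, not the actual patch):
{code}
// With the annotation, javac sees a checked InterruptedException, so Java
// callers can catch it directly instead of matching on a generic Exception.
@throws(classOf[InterruptedException])
def awaitTermination(): Unit = ssc.awaitTermination()
{code}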
[jira] [Updated] (SPARK-8393) JavaStreamingContext#awaitTermination() throws non-declared InterruptedException
[ https://issues.apache.org/jira/browse/SPARK-8393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jaromir Vanek updated SPARK-8393: - Description: Call to {{JavaStreamingContext#awaitTermination()}} can throw {{InterruptedException}} which cannot be caught easily in Java because it's not declared in {{@throws(classOf[InterruptedException])}} annotation. This {{InterruptedException}} comes originally from {{ContextWaiter}} where Java {{ReentrantLock}} is used. was: Call to {{JavaStreamingContext#awaitTermination()}} can throw InterruptedException which cannot be caught easily in Java because it's not declared in {{@throws(classOf[InterruptedException])}} annotation. This {{InterruptedException}} comes originally from {{ContextWaiter}} where Java {{ReentrantLock}} is used. JavaStreamingContext#awaitTermination() throws non-declared InterruptedException Key: SPARK-8393 URL: https://issues.apache.org/jira/browse/SPARK-8393 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.1 Reporter: Jaromir Vanek Priority: Trivial Call to {{JavaStreamingContext#awaitTermination()}} can throw {{InterruptedException}} which cannot be caught easily in Java because it's not declared in {{@throws(classOf[InterruptedException])}} annotation. This {{InterruptedException}} comes originally from {{ContextWaiter}} where Java {{ReentrantLock}} is used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8393) JavaStreamingContext#awaitTermination() throws non-declared InterruptedException
[ https://issues.apache.org/jira/browse/SPARK-8393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jaromir Vanek updated SPARK-8393: - Description: Call to {{JavaStreamingContext#awaitTermination()}} can throw InterruptedException which cannot be caught easily in Java because it's not declared in {{@throws(classOf[InterruptedException])}} annotation. This {{InterruptedException}} comes originally from {{ContextWaiter}} where Java {{ReentrantLock}} is used. was: Call to {{JavaStreamingContext#awaitTermination()}} can throw InterruptedException which cannot be caught easily in Java because it's not declared in {{@throws(classOf[InterruptedException])}} annotation. This InterruptedException comes originally from ContextWaiter where Java ReentrantLock is used. JavaStreamingContext#awaitTermination() throws non-declared InterruptedException Key: SPARK-8393 URL: https://issues.apache.org/jira/browse/SPARK-8393 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.1 Reporter: Jaromir Vanek Priority: Trivial Call to {{JavaStreamingContext#awaitTermination()}} can throw InterruptedException which cannot be caught easily in Java because it's not declared in {{@throws(classOf[InterruptedException])}} annotation. This {{InterruptedException}} comes originally from {{ContextWaiter}} where Java {{ReentrantLock}} is used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7580) Driver out of memory
[ https://issues.apache.org/jira/browse/SPARK-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588086#comment-14588086 ] Yuance Li commented on SPARK-7580: -- Hey, how did you solve the problem? I also met this problem. Driver out of memory Key: SPARK-7580 URL: https://issues.apache.org/jira/browse/SPARK-7580 Project: Spark Issue Type: Bug Affects Versions: 1.3.0 Environment: YARN, HDP 2.1, RedHat 6.4 200 x HP DL185 Reporter: Andrew Rothstein My 200-node cluster has an 8k executor capacity. When I submitted a job with 2k executors, 2g per executor, and 4g for the driver, the ApplicationMaster/driver quickly became unresponsive. It was making progress, then threw a couple of these exceptions: 2015-05-12 16:46:41,598 ERROR [Spark Context Cleaner] spark.ContextCleaner: Error cleaning broadcast 4 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:137) at org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:227) at org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45) at org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:66) at org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:185) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:147) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:138) at scala.Option.foreach(Option.scala:236) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:138) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:134) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:134) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617) at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:133) at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:65) Then the job crashed with OOM.
2015-05-12 16:47:53,566 ERROR [sparkDriver-akka.actor.default-dispatcher-4] actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher-8] shutting down ActorSystem [sparkDriver] java.lang.OutOfMemoryError: Java heap space at org.spark_project.protobuf.ByteString.copyFrom(ByteString.java:216) at org.spark_project.protobuf.ByteString.copyFrom(ByteString.java:229) at akka.remote.transport.AkkaPduProtobufCodec$.constructPayload(AkkaPduCodec.scala:145) at akka.remote.transport.AkkaProtocolHandle.write(AkkaProtocolTransport.scala:182) at akka.remote.EndpointWriter.writeSend(Endpoint.scala:760) at akka.remote.EndpointWriter$$anonfun$2.applyOrElse(Endpoint.scala:722) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:415) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) When I reran the job with 3g of memory per executor and 1k executors, it ran to completion more quickly than the 2k executor run took to crash. I didn't think I was pushing the envelope by using 2k executors and the stock driver heap size. Is this a scale limitation of the driver? Any suggestions beyond increasing the heap size of the driver and/or using fewer executors? Thanks, Andrew -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe,
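Some knobs commonly raised at this scale (the values are illustrative guesses, not a verified fix for this job):
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.akka.frameSize", "128")                       // larger control-plane messages (Spark 1.x Akka RPC)
  .set("spark.cleaner.referenceTracking.blocking", "false") // don't block the ContextCleaner on RPC timeouts
// spark.driver.memory (e.g. 8g) must be passed at submit time (spark-submit /
// the AM launch); the driver JVM is already running when application code executes.
{code}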
[jira] [Commented] (SPARK-7515) Update documentation for PySpark on YARN with cluster mode
[ https://issues.apache.org/jira/browse/SPARK-7515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588098#comment-14588098 ] Apache Spark commented on SPARK-7515: - User 'punya' has created a pull request for this issue: https://github.com/apache/spark/pull/6842 Update documentation for PySpark on YARN with cluster mode -- Key: SPARK-7515 URL: https://issues.apache.org/jira/browse/SPARK-7515 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta Priority: Minor Fix For: 1.5.0 Now PySpark on YARN with cluster mode is supported so let's update doc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8396) GraphLoader.edgeListFile does not population Graph.vertices.
Matthew Barrett created SPARK-8396: -- Summary: GraphLoader.edgeListFile does not population Graph.vertices. Key: SPARK-8396 URL: https://issues.apache.org/jira/browse/SPARK-8396 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.4.0 Environment: Mac OS X. Spark-1.4.0 pre-compiled binary for Hadoop-2.4.0-bin. Reporter: Matthew Barrett Priority: Minor With input data like this
18090 31237
31237 31225
31225 31285
31285 31200
31200 31197
31197 31195
31195 31346
31346 54013
54013 31256
31256 23121
The code
val graph : Graph[Int, Int] = GraphLoader.edgeListFile(sc, hdfsNode + "/data/misc/Sample_DirectedGraphData.ssv")
graph.vertices.foreach{println}
graph.vertices.foreach{vertex: (VertexId, Int) => println(vertex._1.toString + " *** " + vertex._2.toString)}
prints nothing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7443) MLlib 1.4 QA plan
[ https://issues.apache.org/jira/browse/SPARK-7443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587941#comment-14587941 ] Sean Owen commented on SPARK-7443: -- [~mengxr] This contains 6 subtasks that aren't resolved, but this is a ticket for 1.4. Should we close them all? I'm asking because there are still 76 issues tagged for 1.4.0 that were not resolved. MLlib 1.4 QA plan - Key: SPARK-7443 URL: https://issues.apache.org/jira/browse/SPARK-7443 Project: Spark Issue Type: Umbrella Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Joseph K. Bradley Priority: Critical TODO: create JIRAs for each task and assign them accordingly.
h2. API
* Check API compliance using java-compliance-checker (SPARK-7458)
* Audit new public APIs (from the generated html doc)
** Scala (do not forget to check the object doc) (SPARK-7537)
** Java compatibility (SPARK-7529)
** Python API coverage (SPARK-7536)
* audit Pipeline APIs (SPARK-7535)
* graduate spark.ml from alpha (SPARK-7748)
** remove AlphaComponent annotations
** remove mima excludes for spark.ml
** mark concrete classes final wherever reasonable
h2. Algorithms and performance
*Performance*
* _List any other missing performance tests from spark-perf here_
* LDA online/EM (SPARK-7455)
* ElasticNet for linear regression and logistic regression (SPARK-7456)
* Bernoulli naive Bayes (SPARK-7453)
* PIC (SPARK-7454)
* ALS.recommendAll (SPARK-7457)
* perf-tests in Python (SPARK-7539)
*Correctness*
* PMML
** scoring using PMML evaluator vs. MLlib models (SPARK-7540)
* model save/load (SPARK-7541)
h2. Documentation and example code
* Create JIRAs for the user guide for each new algorithm and assign them to the corresponding author. Link here as required.
** Now that we have algorithms in spark.ml which are not in spark.mllib, we should start making subsections for the spark.ml API as needed. We can follow the structure of the spark.mllib user guide.
*** The spark.ml user guide can provide: (a) code examples and (b) info on algorithms which do not exist in spark.mllib.
*** We should not duplicate info in the spark.ml guides. Since spark.mllib is still the primary API, we should provide links to the corresponding algorithms in the spark.mllib user guide for more info.
* Create example code for major components. Link here as required.
** cross validation in python (SPARK-7387)
** pipeline with complex feature transformations (scala/java/python) (SPARK-7546)
** elastic-net (possibly with cross validation) (SPARK-7547)
** kernel density (SPARK-7707)
* Update Programming Guide for 1.4 (towards end of QA) (SPARK-7715)
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8392) the process hangs when getting cachedNodes
[ https://issues.apache.org/jira/browse/SPARK-8392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8392: - Priority: Minor (was: Major) This is not major. the process hangs when getting cachedNodes --- Key: SPARK-8392 URL: https://issues.apache.org/jira/browse/SPARK-8392 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: meiyoula Priority: Minor
def getAllNodes: Seq[RDDOperationNode] = {
  _childNodes ++ _childClusters.flatMap(_.childNodes)
}
When _childClusters has many nodes, the process hangs. I think we can improve the efficiency here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
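A sketch of the caching idea the summary points at (simplified from the real class; the field name is illustrative):
{code}
// Compute the flattened node list once and reuse it, instead of rebuilding
// the Seq on every call while the UI renders a large operator graph.
private var _cachedNodes: Seq[RDDOperationNode] = _

def getAllNodes: Seq[RDDOperationNode] = {
  if (_cachedNodes == null) {
    _cachedNodes = _childNodes ++ _childClusters.flatMap(_.childNodes)
  }
  _cachedNodes
}
{code}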
[jira] [Resolved] (SPARK-7715) Update MLlib Programming Guide for 1.4
[ https://issues.apache.org/jira/browse/SPARK-7715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7715. -- Resolution: Fixed Fix Version/s: 1.4.0 Assignee: Joseph K. Bradley Assuming the umbrella can be closed. Update MLlib Programming Guide for 1.4 -- Key: SPARK-7715 URL: https://issues.apache.org/jira/browse/SPARK-7715 Project: Spark Issue Type: Sub-task Components: Documentation, ML, MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Fix For: 1.4.0 Before the release, we need to update the MLlib Programming Guide. Updates will include:
* Add migration guide subsection.
** Use the results of the QA audit JIRAs.
* Check phrasing, especially in main sections (for outdated items such as "In this release, ...").
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7515) Update documentation for PySpark on YARN with cluster mode
[ https://issues.apache.org/jira/browse/SPARK-7515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Punya Biswal updated SPARK-7515: Fix Version/s: 1.4.1 Update documentation for PySpark on YARN with cluster mode -- Key: SPARK-7515 URL: https://issues.apache.org/jira/browse/SPARK-7515 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta Priority: Minor Fix For: 1.4.1, 1.5.0 Now PySpark on YARN with cluster mode is supported so let's update doc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5680) Sum function on all null values, should return zero
[ https://issues.apache.org/jira/browse/SPARK-5680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588163#comment-14588163 ] Holman Lan commented on SPARK-5680: --- Hello Venkata. Thanks very much for looking into this. Could you kindly let us know the JIRA for the patch when you have one created? Thanks. Sum function on all null values, should return zero --- Key: SPARK-5680 URL: https://issues.apache.org/jira/browse/SPARK-5680 Project: Spark Issue Type: Bug Components: SQL Reporter: Venkata Ramana G Assignee: Venkata Ramana G Priority: Minor Fix For: 1.3.1, 1.4.0 SELECT sum('a'), avg('a'), variance('a'), std('a') FROM src;
Current output: NULL NULL NULL NULL
Expected output: 0.0 NULL NULL NULL
This fixes hive udaf_number_format.q -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8397) Allow custom configuration for TestHive
Punya Biswal created SPARK-8397: --- Summary: Allow custom configuration for TestHive Key: SPARK-8397 URL: https://issues.apache.org/jira/browse/SPARK-8397 Project: Spark Issue Type: Improvement Affects Versions: 1.4.0 Reporter: Punya Biswal Priority: Minor We encourage people to use {{TestHive}} in unit tests, because it's impossible to create more than one {{HiveContext}} within one process. The current implementation locks people into using a {{local[2]}} {{SparkContext}} underlying their {{HiveContext}}. We should make it possible to override this using a system property so that people can test against {{local-cluster}} or remote spark clusters to make their tests more realistic. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
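A minimal sketch of the proposal (the property name is hypothetical):
{code}
// Fall back to the current hard-coded master unless a system property
// overrides it, e.g. -Dspark.sql.test.master=local-cluster[2,1,512]
val master = sys.props.getOrElse("spark.sql.test.master", "local[2]")
val sc = new SparkContext(master, "TestSQLContext", new SparkConf())
{code}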
[jira] [Commented] (SPARK-7443) MLlib 1.4 QA plan
[ https://issues.apache.org/jira/browse/SPARK-7443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588265#comment-14588265 ] Joseph K. Bradley commented on SPARK-7443: -- [~srowen] Most of the QA items are pretty much ready to be closed, but I'd like to check through them, particularly for ones which need to spawn new JIRAs for 1.5. Not all of the documentation was finished, but we can mark it for 1.4.1, 1.5 and update the website doc ASAP (before 1.4.1). I'll have some time later today to make a pass through the JIRAs. MLlib 1.4 QA plan - Key: SPARK-7443 URL: https://issues.apache.org/jira/browse/SPARK-7443 Project: Spark Issue Type: Umbrella Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Joseph K. Bradley Priority: Critical TODO: create JIRAs for each task and assign them accordingly.
h2. API
* Check API compliance using java-compliance-checker (SPARK-7458)
* Audit new public APIs (from the generated html doc)
** Scala (do not forget to check the object doc) (SPARK-7537)
** Java compatibility (SPARK-7529)
** Python API coverage (SPARK-7536)
* audit Pipeline APIs (SPARK-7535)
* graduate spark.ml from alpha (SPARK-7748)
** remove AlphaComponent annotations
** remove mima excludes for spark.ml
** mark concrete classes final wherever reasonable
h2. Algorithms and performance
*Performance*
* _List any other missing performance tests from spark-perf here_
* LDA online/EM (SPARK-7455)
* ElasticNet for linear regression and logistic regression (SPARK-7456)
* Bernoulli naive Bayes (SPARK-7453)
* PIC (SPARK-7454)
* ALS.recommendAll (SPARK-7457)
* perf-tests in Python (SPARK-7539)
*Correctness*
* PMML
** scoring using PMML evaluator vs. MLlib models (SPARK-7540)
* model save/load (SPARK-7541)
h2. Documentation and example code
* Create JIRAs for the user guide for each new algorithm and assign them to the corresponding author. Link here as required.
** Now that we have algorithms in spark.ml which are not in spark.mllib, we should start making subsections for the spark.ml API as needed. We can follow the structure of the spark.mllib user guide.
*** The spark.ml user guide can provide: (a) code examples and (b) info on algorithms which do not exist in spark.mllib.
*** We should not duplicate info in the spark.ml guides. Since spark.mllib is still the primary API, we should provide links to the corresponding algorithms in the spark.mllib user guide for more info.
* Create example code for major components. Link here as required.
** cross validation in python (SPARK-7387)
** pipeline with complex feature transformations (scala/java/python) (SPARK-7546)
** elastic-net (possibly with cross validation) (SPARK-7547)
** kernel density (SPARK-7707)
* Update Programming Guide for 1.4 (towards end of QA) (SPARK-7715)
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8268) string function: unbase64
[ https://issues.apache.org/jira/browse/SPARK-8268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588120#comment-14588120 ] Apache Spark commented on SPARK-8268: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/6843 string function: unbase64 - Key: SPARK-8268 URL: https://issues.apache.org/jira/browse/SPARK-8268 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao unbase64(string str): binary Converts the argument from a base 64 string to BINARY. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8243) string function: encode
[ https://issues.apache.org/jira/browse/SPARK-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588118#comment-14588118 ] Apache Spark commented on SPARK-8243: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/6843 string function: encode --- Key: SPARK-8243 URL: https://issues.apache.org/jira/browse/SPARK-8243 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao encode(string src, string charset): binary Encodes the first argument into a BINARY using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). If either argument is null, the result will also be null. (As of Hive 0.12.0.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
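For reference, the Hive semantics these two string-function sub-tasks port, sketched with plain commons-codec and java.nio rather than Spark's actual Catalyst expressions:
{code}
import java.nio.charset.Charset
import org.apache.commons.codec.binary.Base64

// unbase64(str): base-64 STRING -> BINARY
def unbase64(str: String): Array[Byte] =
  if (str == null) null else Base64.decodeBase64(str)

// encode(src, charset): STRING -> BINARY in the given character set;
// null if either argument is null, per the Hive description above.
def encode(src: String, charset: String): Array[Byte] =
  if (src == null || charset == null) null
  else src.getBytes(Charset.forName(charset))
{code}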
[jira] [Commented] (SPARK-8389) Expose KafkaRDDs offsetRange in Java and Python
[ https://issues.apache.org/jira/browse/SPARK-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588202#comment-14588202 ] Cody Koeninger commented on SPARK-8389: --- There's already a ticket for the Python side of things, SPARK-8337. Not sure if you want to combine them. I'll look at the java side of things to start. Expose KafkaRDDs offsetRange in Java and Python --- Key: SPARK-8389 URL: https://issues.apache.org/jira/browse/SPARK-8389 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Tathagata Das Assignee: Cody Koeninger Priority: Critical Probably requires creating a JavaKafkaPairRDD and also use that in the python APIs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7122) KafkaUtils.createDirectStream - unreasonable processing time in absence of load
[ https://issues.apache.org/jira/browse/SPARK-7122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588211#comment-14588211 ] Cody Koeninger commented on SPARK-7122: --- It's certainly your prerogative to wait for an official release. However, keep in mind that the patch in question is just a performance optimization, not necessarily a bug fix targeted at whatever your issue is. Without a minimal reproducible case of your problem, or testing patches against your workload, there's no way of knowing if the performance optimization solves your problem. If it doesn't, you're looking at waiting for yet another release after 1.4.1. KafkaUtils.createDirectStream - unreasonable processing time in absence of load --- Key: SPARK-7122 URL: https://issues.apache.org/jira/browse/SPARK-7122 Project: Spark Issue Type: Question Components: Streaming Affects Versions: 1.3.1 Environment: Spark Streaming 1.3.1, standalone mode running on just 1 box: Ubuntu 14.04.2 LTS, 4 cores, 8GB RAM, java version 1.8.0_40 Reporter: Platon Potapov Priority: Minor Attachments: 10.second.window.fast.job.txt, 5.second.window.slow.job.txt, SparkStreamingJob.scala attached is the complete source code of a test spark job. no external data generators are run - just the presence of a kafka topic named "raw" suffices. the spark job is run with no load whatsoever. http://localhost:4040/streaming is checked to obtain job processing duration. * in case the test contains the following transformation:
{code}
// dummy transformation
val temperature = bytes.filter(_._1 == "abc")
val abc = temperature.window(Seconds(40), Seconds(5))
abc.print()
{code}
the median processing time is 3 seconds 80 ms * in case the test contains the following transformation:
{code}
// dummy transformation
val temperature = bytes.filter(_._1 == "abc")
val abc = temperature.map(x => (1, x))
abc.print()
{code}
the median processing time is just 50 ms. please explain why the window transformation introduces such a growth in job duration. note: the result is the same regardless of the number of kafka topic partitions (I've tried 1 and 8) note2: the result is the same regardless of the window parameters (I've tried (20, 2) and (40, 5)) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
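One structural observation that may explain part of the gap (an assumption, not a confirmed diagnosis): each windowed batch is the union of window/slide micro-batch RDDs, so per-batch scheduling work grows with that ratio even when every batch is empty.
{code}
// For window(Seconds(40), Seconds(5)): 40 / 5 = 8 KafkaRDDs unioned per
// output batch, each still carrying its partitions' offset-range metadata.
val rddsPerBatch = 40 / 5
{code}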
[jira] [Assigned] (SPARK-8397) Allow custom configuration for TestHive
[ https://issues.apache.org/jira/browse/SPARK-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8397: --- Assignee: (was: Apache Spark) Allow custom configuration for TestHive --- Key: SPARK-8397 URL: https://issues.apache.org/jira/browse/SPARK-8397 Project: Spark Issue Type: Improvement Affects Versions: 1.4.0 Reporter: Punya Biswal Priority: Minor We encourage people to use {{TestHive}} in unit tests, because it's impossible to create more than one {{HiveContext}} within one process. The current implementation locks people into using a {{local[2]}} {{SparkContext}} underlying their {{HiveContext}}. We should make it possible to override this using a system property so that people can test against {{local-cluster}} or remote spark clusters to make their tests more realistic. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8397) Allow custom configuration for TestHive
[ https://issues.apache.org/jira/browse/SPARK-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588161#comment-14588161 ] Apache Spark commented on SPARK-8397: - User 'punya' has created a pull request for this issue: https://github.com/apache/spark/pull/6844 Allow custom configuration for TestHive --- Key: SPARK-8397 URL: https://issues.apache.org/jira/browse/SPARK-8397 Project: Spark Issue Type: Improvement Affects Versions: 1.4.0 Reporter: Punya Biswal Priority: Minor We encourage people to use {{TestHive}} in unit tests, because it's impossible to create more than one {{HiveContext}} within one process. The current implementation locks people into using a {{local[2]}} {{SparkContext}} underlying their {{HiveContext}}. We should make it possible to override this using a system property so that people can test against {{local-cluster}} or remote spark clusters to make their tests more realistic. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8397) Allow custom configuration for TestHive
[ https://issues.apache.org/jira/browse/SPARK-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8397: --- Assignee: Apache Spark Allow custom configuration for TestHive --- Key: SPARK-8397 URL: https://issues.apache.org/jira/browse/SPARK-8397 Project: Spark Issue Type: Improvement Affects Versions: 1.4.0 Reporter: Punya Biswal Assignee: Apache Spark Priority: Minor We encourage people to use {{TestHive}} in unit tests, because it's impossible to create more than one {{HiveContext}} within one process. The current implementation locks people into using a {{local[2]}} {{SparkContext}} underlying their {{HiveContext}}. We should make it possible to override this using a system property so that people can test against {{local-cluster}} or remote spark clusters to make their tests more realistic. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8239) string function: base64
[ https://issues.apache.org/jira/browse/SPARK-8239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8239: --- Assignee: Cheng Hao (was: Apache Spark) string function: base64 --- Key: SPARK-8239 URL: https://issues.apache.org/jira/browse/SPARK-8239 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao base64(binary bin): string Converts the argument from binary to a base 64 string -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8239) string function: base64
[ https://issues.apache.org/jira/browse/SPARK-8239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8239: --- Assignee: Apache Spark (was: Cheng Hao) string function: base64 --- Key: SPARK-8239 URL: https://issues.apache.org/jira/browse/SPARK-8239 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark base64(binary bin): string Converts the argument from binary to a base 64 string -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
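As a point of reference, a minimal sketch of the base64 semantics described above, using the JDK 8 codec purely for illustration (the sub-task itself implements this as a Spark SQL expression):
{code}
import java.nio.charset.StandardCharsets
import java.util.Base64

val bin: Array[Byte] = "hello".getBytes(StandardCharsets.UTF_8)
val b64: String = Base64.getEncoder.encodeToString(bin) // base64(bin) => "aGVsbG8="
{code}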
[jira] [Commented] (SPARK-8385) java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation
[ https://issues.apache.org/jira/browse/SPARK-8385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588330#comment-14588330 ] Peter Haumer commented on SPARK-8385: - Sean, I see the class in the big assembly file of the Spark for Hadoop 2.6 distributions for 1.3.1 and 1.4.0. However, it seems that with 1.4 a version was packaged that has unimplemented methods, which causes the regression. java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation - Key: SPARK-8385 URL: https://issues.apache.org/jira/browse/SPARK-8385 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 1.4.0 Environment: RHEL 7.1 Reporter: Peter Haumer I used to be able to debug my Spark apps in Eclipse. With Spark 1.3.1 I created a launch and just set the VM var -Dspark.master=local[4]. With 1.4 this stopped working when reading files from the OS filesystem. Running the same apps with spark-submit works fine. Losing the ability to debug that way has a major impact on the usability of Spark. The following exception is thrown: Exception in thread "main" java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:213) at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2401) at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2411) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:166) at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:653) at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:389) at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:362) at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:762) at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:762) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172) at scala.Option.map(Option.scala:145) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1535) at org.apache.spark.rdd.RDD.reduce(RDD.scala:900) at org.apache.spark.api.java.JavaRDDLike$class.reduce(JavaRDDLike.scala:357) at org.apache.spark.api.java.AbstractJavaRDDLike.reduce(JavaRDDLike.scala:46) at com.databricks.apps.logs.LogAnalyzer.main(LogAnalyzer.java:60) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8242) string function: decode
[ https://issues.apache.org/jira/browse/SPARK-8242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8242: --- Assignee: Apache Spark (was: Cheng Hao) string function: decode --- Key: SPARK-8242 URL: https://issues.apache.org/jira/browse/SPARK-8242 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark decode(binary bin, string charset): string Decodes the first argument into a String using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). If either argument is null, the result will also be null. (As of Hive 0.12.0.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8268) string function: unbase64
[ https://issues.apache.org/jira/browse/SPARK-8268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8268: --- Assignee: Apache Spark (was: Cheng Hao) string function: unbase64 - Key: SPARK-8268 URL: https://issues.apache.org/jira/browse/SPARK-8268 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark unbase64(string str): binary Converts the argument from a base 64 string to BINARY. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8268) string function: unbase64
[ https://issues.apache.org/jira/browse/SPARK-8268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8268: --- Assignee: Cheng Hao (was: Apache Spark) string function: unbase64 - Key: SPARK-8268 URL: https://issues.apache.org/jira/browse/SPARK-8268 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao unbase64(string str): binary Converts the argument from a base 64 string to BINARY. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8238) string function: ascii
[ https://issues.apache.org/jira/browse/SPARK-8238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8238: --- Assignee: Cheng Hao (was: Apache Spark) string function: ascii -- Key: SPARK-8238 URL: https://issues.apache.org/jira/browse/SPARK-8238 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao ascii(string str): int Returns the numeric value of the first character of str. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8243) string function: encode
[ https://issues.apache.org/jira/browse/SPARK-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8243: --- Assignee: Cheng Hao (was: Apache Spark) string function: encode --- Key: SPARK-8243 URL: https://issues.apache.org/jira/browse/SPARK-8243 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao encode(string src, string charset): binary Encodes the first argument into a BINARY using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). If either argument is null, the result will also be null. (As of Hive 0.12.0.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
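A minimal sketch of the Hive encode/decode semantics quoted in these sub-tasks, using plain JDK charsets for illustration (the actual work implements them as Spark SQL expressions); per the definition, a null argument yields a null result:
{code}
import java.nio.charset.Charset

// decode(binary bin, string charset): string
def decode(bin: Array[Byte], charset: String): String =
  if (bin == null || charset == null) null
  else new String(bin, Charset.forName(charset))

// encode(string src, string charset): binary
def encode(src: String, charset: String): Array[Byte] =
  if (src == null || charset == null) null
  else src.getBytes(Charset.forName(charset))
{code}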
[jira] [Commented] (SPARK-8238) string function: ascii
[ https://issues.apache.org/jira/browse/SPARK-8238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588115#comment-14588115 ] Apache Spark commented on SPARK-8238: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/6843 string function: ascii -- Key: SPARK-8238 URL: https://issues.apache.org/jira/browse/SPARK-8238 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao ascii(string str): int Returns the numeric value of the first character of str. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8242) string function: decode
[ https://issues.apache.org/jira/browse/SPARK-8242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588117#comment-14588117 ] Apache Spark commented on SPARK-8242: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/6843 string function: decode --- Key: SPARK-8242 URL: https://issues.apache.org/jira/browse/SPARK-8242 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao decode(binary bin, string charset): string Decodes the first argument into a String using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). If either argument is null, the result will also be null. (As of Hive 0.12.0.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8242) string function: decode
[ https://issues.apache.org/jira/browse/SPARK-8242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8242: --- Assignee: Cheng Hao (was: Apache Spark) string function: decode --- Key: SPARK-8242 URL: https://issues.apache.org/jira/browse/SPARK-8242 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao decode(binary bin, string charset): string Decodes the first argument into a String using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). If either argument is null, the result will also be null. (As of Hive 0.12.0.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8239) string function: base64
[ https://issues.apache.org/jira/browse/SPARK-8239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588116#comment-14588116 ] Apache Spark commented on SPARK-8239: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/6843 string function: base64 --- Key: SPARK-8239 URL: https://issues.apache.org/jira/browse/SPARK-8239 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao base64(binary bin): string Converts the argument from binary to a base 64 string -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8238) string function: ascii
[ https://issues.apache.org/jira/browse/SPARK-8238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8238: --- Assignee: Apache Spark (was: Cheng Hao) string function: ascii -- Key: SPARK-8238 URL: https://issues.apache.org/jira/browse/SPARK-8238 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark ascii(string str): int Returns the numeric value of the first character of str. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8243) string function: encode
[ https://issues.apache.org/jira/browse/SPARK-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8243: --- Assignee: Apache Spark (was: Cheng Hao) string function: encode --- Key: SPARK-8243 URL: https://issues.apache.org/jira/browse/SPARK-8243 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark encode(string src, string charset): binary Encodes the first argument into a BINARY using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). If either argument is null, the result will also be null. (As of Hive 0.12.0.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8380) SparkR mis-counts
[ https://issues.apache.org/jira/browse/SPARK-8380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588196#comment-14588196 ] Shivaram Venkataraman commented on SPARK-8380: -- Thanks for the update. I'm going to mark this issue as resolved. BTW if there are documentation changes that you think will be helpful feel free to create JIRAs / PRs for them SparkR mis-counts - Key: SPARK-8380 URL: https://issues.apache.org/jira/browse/SPARK-8380 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.0 Reporter: Rick Moritz On my dataset of ~9 million rows x 30 columns, queried via Hive, I can perform count operations on the entirety of the dataset and get the correct value, as double-checked against the same code in Scala. When I start to add conditions or even do a simple partial ascending histogram, I get discrepancies. In particular, there are missing values in SparkR, and massively so: a top-6 count of a certain feature in my dataset results in numbers an order of magnitude smaller than I get via Scala. The following logic, which I consider equivalent, is the basis for this report: counts <- summarize(groupBy(df, df$col_name), count = n(df$col_name)) head(arrange(counts, desc(counts$count))) versus: val table = sql("SELECT col_name, count(col_name) as value from df group by col_name order by value desc") The first, in particular, is taken directly from the SparkR programming guide. Since summarize isn't documented from what I can see, I'd hope it does what the programming guide indicates. In that case this would be a pretty serious logic bug (no errors are thrown). Otherwise, there's the possibility of a lack of documentation and a badly worded example in the guide being behind my misperception of SparkR's functionality. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
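For comparison, the reporter's aggregation expressed with the Scala DataFrame API (a sketch assuming a DataFrame df with a column col_name, Spark 1.4):
{code}
import org.apache.spark.sql.functions.desc

val counts = df.groupBy("col_name").count()  // yields a "count" column per group
counts.orderBy(desc("count")).show(6)        // top 6, analogous to head(arrange(...))
{code}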
[jira] [Created] (SPARK-8398) Consistently expose Hadoop Configuration/JobConf parameters for Hadoop input/output formats
koert kuipers created SPARK-8398: Summary: Consistently expose Hadoop Configuration/JobConf parameters for Hadoop input/output formats Key: SPARK-8398 URL: https://issues.apache.org/jira/browse/SPARK-8398 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 1.4.0 Reporter: koert kuipers Priority: Trivial Currently a custom Hadoop Configuration or JobConf can be passed into quite a few functions that use Hadoop input formats to read or Hadoop output formats to write data. The goal of this JIRA is to make this consistent and expose Configuration/JobConf for all these methods, which facilitates re-use and discourages many additional parameters (that end up changing the Configuration/JobConf internally). See also: http://apache-spark-developers-list.1001551.n3.nabble.com/Fwd-hadoop-input-output-format-advanced-control-td11168.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
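For reference, newAPIHadoopFile is one of the methods that already accepts a caller-supplied Configuration, which is the pattern this JIRA wants applied consistently; a sketch (the path and the tweaked setting are illustrative, and sc is assumed to be an existing SparkContext):
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val hadoopConf = new Configuration()
// Example of a knob that would otherwise need its own Spark parameter:
hadoopConf.set("mapreduce.input.fileinputformat.split.maxsize", "134217728")

val rdd = sc.newAPIHadoopFile(
  "hdfs:///data/input",
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  hadoopConf)
{code}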
[jira] [Commented] (SPARK-6816) Add SparkConf API to configure SparkR
[ https://issues.apache.org/jira/browse/SPARK-6816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588271#comment-14588271 ] Rick Moritz commented on SPARK-6816: Apparently this work-around is no longer needed for spark-1.4.0, which invokes a shell script instead of going directly to Java as sparkR-pkg did, and fetches the required environment parameters. With spark-defaults being respected, and SPARK_MEM available for memory options, there probably isn't a whole lot that needs to be passed by -D to the shell script. Add SparkConf API to configure SparkR - Key: SPARK-6816 URL: https://issues.apache.org/jira/browse/SPARK-6816 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman Priority: Minor Right now the only way to configure SparkR is to pass in arguments to sparkR.init. The goal is to add an API similar to SparkConf on Scala/Python to make configuration easier. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8398) Consistently expose Hadoop Configuration/JobConf parameters for Hadoop input/output formats
[ https://issues.apache.org/jira/browse/SPARK-8398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8398: --- Assignee: Apache Spark Consistently expose Hadoop Configuration/JobConf parameters for Hadoop input/output formats --- Key: SPARK-8398 URL: https://issues.apache.org/jira/browse/SPARK-8398 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 1.4.0 Reporter: koert kuipers Assignee: Apache Spark Priority: Trivial Currently a custom Hadoop Configuration or JobConf can be passed into quite a few functions that use Hadoop input formats to read or Hadoop output formats to write data. The goal of this JIRA is to make this consistent and expose Configuration/JobConf for all these methods, which facilitates re-use and discourages many additional parameters (that end up changing the Configuration/JobConf internally). See also: http://apache-spark-developers-list.1001551.n3.nabble.com/Fwd-hadoop-input-output-format-advanced-control-td11168.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8398) Consistently expose Hadoop Configuration/JobConf parameters for Hadoop input/output formats
[ https://issues.apache.org/jira/browse/SPARK-8398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588424#comment-14588424 ] Apache Spark commented on SPARK-8398: - User 'koertkuipers' has created a pull request for this issue: https://github.com/apache/spark/pull/6848 Consistently expose Hadoop Configuration/JobConf parameters for Hadoop input/output formats --- Key: SPARK-8398 URL: https://issues.apache.org/jira/browse/SPARK-8398 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 1.4.0 Reporter: koert kuipers Priority: Trivial Currently a custom Hadoop Configuration or JobConf can be passed into quite a few functions that use Hadoop input formats to read or Hadoop output formats to write data. The goal of this JIRA is to make this consistent and expose Configuration/JobConf for all these methods, which facilitates re-use and discourages many additional parameters (that end up changing the Configuration/JobConf internally). See also: http://apache-spark-developers-list.1001551.n3.nabble.com/Fwd-hadoop-input-output-format-advanced-control-td11168.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8398) Consistently expose Hadoop Configuration/JobConf parameters for Hadoop input/output formats
[ https://issues.apache.org/jira/browse/SPARK-8398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8398: --- Assignee: (was: Apache Spark) Consistently expose Hadoop Configuration/JobConf parameters for Hadoop input/output formats --- Key: SPARK-8398 URL: https://issues.apache.org/jira/browse/SPARK-8398 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 1.4.0 Reporter: koert kuipers Priority: Trivial Currently a custom Hadoop Configuration or JobConf can be passed into quite a few functions that use Hadoop input formats to read or Hadoop output formats to write data. The goal of this JIRA is to make this consistent and expose Configuration/JobConf for all these methods, which facilitates re-use and discourages many additional parameters (that end up changing the Configuration/JobConf internally). See also: http://apache-spark-developers-list.1001551.n3.nabble.com/Fwd-hadoop-input-output-format-advanced-control-td11168.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8385) java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation
[ https://issues.apache.org/jira/browse/SPARK-8385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588540#comment-14588540 ] Sean Owen commented on SPARK-8385: -- Oh, is TFS Tachyon? Not sure what the status is on that, whether it's supposed to work without extra steps and just happened to in the past, or what. java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation - Key: SPARK-8385 URL: https://issues.apache.org/jira/browse/SPARK-8385 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 1.4.0 Environment: RHEL 7.1 Reporter: Peter Haumer I used to be able to debug my Spark apps in Eclipse. With Spark 1.3.1 I created a launch and just set the VM var -Dspark.master=local[4]. With 1.4 this stopped working when reading files from the OS filesystem. Running the same apps with spark-submit works fine. Losing the ability to debug that way has a major impact on the usability of Spark. The following exception is thrown: Exception in thread "main" java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:213) at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2401) at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2411) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:166) at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:653) at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:389) at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:362) at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:762) at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:762) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172) at scala.Option.map(Option.scala:145) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1535) at org.apache.spark.rdd.RDD.reduce(RDD.scala:900) at org.apache.spark.api.java.JavaRDDLike$class.reduce(JavaRDDLike.scala:357) at org.apache.spark.api.java.AbstractJavaRDDLike.reduce(JavaRDDLike.scala:46) at com.databricks.apps.logs.LogAnalyzer.main(LogAnalyzer.java:60) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8384) Can not set checkpointDuration or Interval in spark 1.3 and later
[ https://issues.apache.org/jira/browse/SPARK-8384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588363#comment-14588363 ] Norman He commented on SPARK-8384: -- It seems that if the checkpoint interval is the same as the batch duration, you will have a lot of checkpoint saving (the old documentation talked about setting the checkpoint interval to 5-10 times the batch duration). We are pushing the batch duration down to 200ms; if the checkpoint duration is the same, I am not sure whether checkpoint saving to HDFS or disk will impact the streaming processing. Is this a limitation now, or will it be improved upon in the future? Can not set checkpointDuration or Interval in spark 1.3 and later - Key: SPARK-8384 URL: https://issues.apache.org/jira/browse/SPARK-8384 Project: Spark Issue Type: Bug Reporter: Norman He Priority: Critical StreamingContext missing setCheckpointDuration(). No way around for now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
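For what it's worth, the per-DStream knob under discussion does exist as DStream.checkpoint(interval); a sketch with a 200 ms batch and a 10x checkpoint interval (host, port, path and durations are illustrative):
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Milliseconds, Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("checkpoint-interval-demo")
val ssc = new StreamingContext(conf, Milliseconds(200)) // 200 ms batches
ssc.checkpoint("hdfs:///checkpoints")                   // checkpoint directory

val lines = ssc.socketTextStream("localhost", 9999)
lines.checkpoint(Seconds(2)) // checkpoint every 10th batch, per the old 5-10x guidance
{code}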
[jira] [Assigned] (SPARK-8399) Overlap between histograms and axis' name in Spark Streaming UI
[ https://issues.apache.org/jira/browse/SPARK-8399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8399: --- Assignee: Apache Spark Overlap between histograms and axis' name in Spark Streaming UI --- Key: SPARK-8399 URL: https://issues.apache.org/jira/browse/SPARK-8399 Project: Spark Issue Type: Bug Components: Streaming, Web UI Affects Versions: 1.4.0 Reporter: Benjamin Fradet Assignee: Apache Spark Priority: Minor If you have a histogram skewed towards the maximum of the displayed values, as is the case, for example, with the number of messages processed per batchInterval with the Kafka direct API (since it's a constant), the histogram will overlap with the name of the X axis (#batches). Unfortunately, I don't have any screenshots available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8393) JavaStreamingContext#awaitTermination() throws non-declared InterruptedException
[ https://issues.apache.org/jira/browse/SPARK-8393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588546#comment-14588546 ] Sean Owen commented on SPARK-8393: -- {{awaitTerminationOrTimeout}} will return a {{boolean}} to let you know if it timed out, if that's what you're looking for, but I suspect it's not quite what you want. Yeah, that's a good workaround for now if you really need to handle it. Hm, can you wrap it in a method that {{throws InterruptedException}} and catch it as normal around an invocation of that method? I think it's still a valid API change for later. JavaStreamingContext#awaitTermination() throws non-declared InterruptedException Key: SPARK-8393 URL: https://issues.apache.org/jira/browse/SPARK-8393 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.1 Reporter: Jaromir Vanek Priority: Trivial Call to {{JavaStreamingContext#awaitTermination()}} can throw {{InterruptedException}} which cannot be caught easily in Java because it's not declared in a {{@throws(classOf[InterruptedException])}} annotation. This {{InterruptedException}} comes originally from {{ContextWaiter}} where a Java {{ReentrantLock}} is used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
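Sean's suggested wrapper, sketched in Scala: the @throws annotation surfaces InterruptedException as a checked exception, so Java callers can catch it normally around an invocation of the wrapper:
{code}
import org.apache.spark.streaming.api.java.JavaStreamingContext

@throws(classOf[InterruptedException])
def awaitTerminationInterruptibly(jssc: JavaStreamingContext): Unit =
  jssc.awaitTermination()
{code}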
[jira] [Commented] (SPARK-8395) spark-submit documentation is incorrect
[ https://issues.apache.org/jira/browse/SPARK-8395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588551#comment-14588551 ] Sean Owen commented on SPARK-8395: -- I think that's right. This looks like a hold-over from when this might have been controlled by spark-daemon.sh. You can raise a PR for this. spark-submit documentation is incorrect --- Key: SPARK-8395 URL: https://issues.apache.org/jira/browse/SPARK-8395 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.4.0 Reporter: Dev Lakhani Priority: Minor Using a fresh checkout of 1.4.0-bin-hadoop2.6, if you run ./start-slave.sh 1 spark://localhost:7077 you get "failed to launch org.apache.spark.deploy.worker.Worker: Default is conf/spark-defaults.conf. 15/06/16 13:11:08 INFO Utils: Shutdown hook called". It seems the worker number is not being accepted as described here: https://spark.apache.org/docs/latest/spark-standalone.html The documentation says: ./sbin/start-slave.sh <worker#> <master-spark-URL> but the start-slave.sh script states: usage="Usage: start-slave.sh <spark-master-URL>" where <spark-master-URL> is like spark://localhost:7077 I have checked for similar issues using: https://issues.apache.org/jira/browse/SPARK-6552?jql=text%20~%20%22start-slave%22 and found nothing similar, so I am raising this as an issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8399) Overlap between histograms and axis' name in Spark Streaming UI
Benjamin Fradet created SPARK-8399: -- Summary: Overlap between histograms and axis' name in Spark Streaming UI Key: SPARK-8399 URL: https://issues.apache.org/jira/browse/SPARK-8399 Project: Spark Issue Type: Bug Components: Streaming, Web UI Affects Versions: 1.4.0 Reporter: Benjamin Fradet Priority: Minor If you have a histogram skewed towards the maximum of the displayed values, as is the case, for example, with the number of messages processed per batchInterval with the Kafka direct API (since it's a constant), the histogram will overlap with the name of the X axis (#batches). Unfortunately, I don't have any screenshots available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8399) Overlap between histograms and axis' name in Spark Streaming UI
[ https://issues.apache.org/jira/browse/SPARK-8399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588358#comment-14588358 ] Benjamin Fradet commented on SPARK-8399: I'll submit a patch shortly. Overlap between histograms and axis' name in Spark Streaming UI --- Key: SPARK-8399 URL: https://issues.apache.org/jira/browse/SPARK-8399 Project: Spark Issue Type: Bug Components: Streaming, Web UI Affects Versions: 1.4.0 Reporter: Benjamin Fradet Priority: Minor If you have a histogram skewed towards the maximum of the displayed values, as is the case, for example, with the number of messages processed per batchInterval with the Kafka direct API (since it's a constant), the histogram will overlap with the name of the X axis (#batches). Unfortunately, I don't have any screenshots available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8399) Overlap between histograms and axis' name in Spark Streaming UI
[ https://issues.apache.org/jira/browse/SPARK-8399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588379#comment-14588379 ] Apache Spark commented on SPARK-8399: - User 'BenFradet' has created a pull request for this issue: https://github.com/apache/spark/pull/6845 Overlap between histograms and axis' name in Spark Streaming UI --- Key: SPARK-8399 URL: https://issues.apache.org/jira/browse/SPARK-8399 Project: Spark Issue Type: Bug Components: Streaming, Web UI Affects Versions: 1.4.0 Reporter: Benjamin Fradet Priority: Minor If you have a histogram skewed towards the maximum of the displayed values, as is the case, for example, with the number of messages processed per batchInterval with the Kafka direct API (since it's a constant), the histogram will overlap with the name of the X axis (#batches). Unfortunately, I don't have any screenshots available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8399) Overlap between histograms and axis' name in Spark Streaming UI
[ https://issues.apache.org/jira/browse/SPARK-8399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8399: --- Assignee: (was: Apache Spark) Overlap between histograms and axis' name in Spark Streaming UI --- Key: SPARK-8399 URL: https://issues.apache.org/jira/browse/SPARK-8399 Project: Spark Issue Type: Bug Components: Streaming, Web UI Affects Versions: 1.4.0 Reporter: Benjamin Fradet Priority: Minor If you have a histogram skewed towards the maximum of the displayed values, as is the case, for example, with the number of messages processed per batchInterval with the Kafka direct API (since it's a constant), the histogram will overlap with the name of the X axis (#batches). Unfortunately, I don't have any screenshots available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8389) Expose KafkaRDDs offsetRange in Java and Python
[ https://issues.apache.org/jira/browse/SPARK-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8389: --- Assignee: Apache Spark (was: Cody Koeninger) Expose KafkaRDDs offsetRange in Java and Python --- Key: SPARK-8389 URL: https://issues.apache.org/jira/browse/SPARK-8389 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Tathagata Das Assignee: Apache Spark Priority: Critical Probably requires creating a JavaKafkaPairRDD and also use that in the python APIs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8389) Expose KafkaRDDs offsetRange in Java and Python
[ https://issues.apache.org/jira/browse/SPARK-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8389: --- Assignee: Cody Koeninger (was: Apache Spark) Expose KafkaRDDs offsetRange in Java and Python --- Key: SPARK-8389 URL: https://issues.apache.org/jira/browse/SPARK-8389 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Tathagata Das Assignee: Cody Koeninger Priority: Critical Probably requires creating a JavaKafkaPairRDD and also use that in the python APIs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8389) Expose KafkaRDDs offsetRange in Java and Python
[ https://issues.apache.org/jira/browse/SPARK-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588397#comment-14588397 ] Apache Spark commented on SPARK-8389: - User 'koeninger' has created a pull request for this issue: https://github.com/apache/spark/pull/6846 Expose KafkaRDDs offsetRange in Java and Python --- Key: SPARK-8389 URL: https://issues.apache.org/jira/browse/SPARK-8389 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Tathagata Das Assignee: Cody Koeninger Priority: Critical Probably requires creating a JavaKafkaPairRDD and also use that in the python APIs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8389) Expose KafkaRDDs offsetRange in Java and Python
[ https://issues.apache.org/jira/browse/SPARK-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588451#comment-14588451 ] Tathagata Das commented on SPARK-8389: -- Then let's at least add it to the examples and the programming guide. But we definitely need to do something for Python. Gotta brainstorm on that. Expose KafkaRDDs offsetRange in Java and Python --- Key: SPARK-8389 URL: https://issues.apache.org/jira/browse/SPARK-8389 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Tathagata Das Assignee: Cody Koeninger Priority: Critical Probably requires creating a JavaKafkaPairRDD and also use that in the python APIs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8337) KafkaUtils.createDirectStream for python is lacking API/feature parity with the Scala/Java version
[ https://issues.apache.org/jira/browse/SPARK-8337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588517#comment-14588517 ] Cody Koeninger commented on SPARK-8337: --- So one thing to keep in mind is that if the Kafka project ends up adding more fields to MessageAndMetadata, the Scala interface is going to continue to give users access to those fields, without changing anything other than the Kafka version. If you go with the approach of building a Python dict, someone's going to have to remember to manually change the code to give access to the new fields. I don't have enough Python knowledge to comment on whether the approach of passing a messageHandler function is feasible... I can try to get up to speed on it. It may be worth trying to get the attention of Davies Liu after the Spark conference hubbub has died down. KafkaUtils.createDirectStream for python is lacking API/feature parity with the Scala/Java version -- Key: SPARK-8337 URL: https://issues.apache.org/jira/browse/SPARK-8337 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Reporter: Amit Ramesh Priority: Critical See the following thread for context. http://apache-spark-developers-list.1001551.n3.nabble.com/Re-Spark-1-4-Python-API-for-getting-Kafka-offsets-in-direct-mode-tt12714.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
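A sketch of the Scala messageHandler approach being referred to, against the Spark 1.3+ direct API; ssc, kafkaParams and fromOffsets are assumed to already exist, and the Record case class is illustrative. Because the handler sees the whole MessageAndMetadata, new Kafka fields would flow through without further API changes:
{code}
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

case class Record(topic: String, partition: Int, offset: Long, key: String, value: String)

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, Record](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) =>
    Record(mmd.topic, mmd.partition, mmd.offset, mmd.key, mmd.message()))
{code}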
[jira] [Commented] (SPARK-8356) Reconcile callUDF and callUdf
[ https://issues.apache.org/jira/browse/SPARK-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588579#comment-14588579 ] Benjamin Fradet commented on SPARK-8356: I've started working on this issue. Reconcile callUDF and callUdf - Key: SPARK-8356 URL: https://issues.apache.org/jira/browse/SPARK-8356 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Priority: Critical Labels: starter Right now we have two functions {{callUDF}} and {{callUdf}}. I think the former is used for calling Java functions (and the documentation is wrong) and the latter is for calling functions by name. Either way this is confusing and we should unify or pick different names. Also, let's make sure the docs are right. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
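For context, a sketch of the call-by-name variant as of Spark 1.4 (df and sqlContext are assumed to exist; the UDF is illustrative):
{code}
import org.apache.spark.sql.functions.callUdf

sqlContext.udf.register("strLen", (s: String) => s.length) // register under a name
val withLen = df.select(callUdf("strLen", df("col_name"))) // callUdf: invoke by name
{code}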
[jira] [Updated] (SPARK-6910) Support for pushing predicates down to metastore for partition pruning
[ https://issues.apache.org/jira/browse/SPARK-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6910: Target Version/s: 1.5.0 Support for pushing predicates down to metastore for partition pruning -- Key: SPARK-6910 URL: https://issues.apache.org/jira/browse/SPARK-6910 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6910) Support for pushing predicates down to metastore for partition pruning
[ https://issues.apache.org/jira/browse/SPARK-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6910: Priority: Critical (was: Major) Support for pushing predicates down to metastore for partition pruning -- Key: SPARK-6910 URL: https://issues.apache.org/jira/browse/SPARK-6910 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7665) MLlib Python API breaking changes check between 1.3 & 1.4
[ https://issues.apache.org/jira/browse/SPARK-7665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588777#comment-14588777 ] Joseph K. Bradley commented on SPARK-7665: -- I'm making a final pass before I close this. MLlib Python API breaking changes check between 1.3 & 1.4 - Key: SPARK-7665 URL: https://issues.apache.org/jira/browse/SPARK-7665 Project: Spark Issue Type: Documentation Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Yanbo Liang Comparing the MLlib Python APIs between 1.3 and 1.4, so we can note breaking changes. We'll need to note those changes (if any) in the user guide's Migration Guide section. If the API change is for an Alpha/Experimental/DeveloperApi component, we need to note that as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7665) MLlib Python API breaking changes check between 1.3 & 1.4
[ https://issues.apache.org/jira/browse/SPARK-7665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-7665. -- Resolution: Done Assignee: Joseph K. Bradley MLlib Python API breaking changes check between 1.3 & 1.4 - Key: SPARK-7665 URL: https://issues.apache.org/jira/browse/SPARK-7665 Project: Spark Issue Type: Documentation Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Yanbo Liang Assignee: Joseph K. Bradley Comparing the MLlib Python APIs between 1.3 and 1.4, so we can note breaking changes. We'll need to note those changes (if any) in the user guide's Migration Guide section. If the API change is for an Alpha/Experimental/DeveloperApi component, we need to note that as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7665) MLlib Python API breaking changes check between 1.3 & 1.4
[ https://issues.apache.org/jira/browse/SPARK-7665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588804#comment-14588804 ] Joseph K. Bradley commented on SPARK-7665: -- I believe everything checks out, so I'm going to mark this as resolved. MLlib Python API breaking changes check between 1.3 & 1.4 - Key: SPARK-7665 URL: https://issues.apache.org/jira/browse/SPARK-7665 Project: Spark Issue Type: Documentation Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Yanbo Liang Comparing the MLlib Python APIs between 1.3 and 1.4, so we can note breaking changes. We'll need to note those changes (if any) in the user guide's Migration Guide section. If the API change is for an Alpha/Experimental/DeveloperApi component, we need to note that as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7916) MLlib Python doc parity check for classification and regression.
[ https://issues.apache.org/jira/browse/SPARK-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-7916. -- Resolution: Fixed Fix Version/s: 1.4.1 1.5.0 Issue resolved by pull request 6460 [https://github.com/apache/spark/pull/6460] MLlib Python doc parity check for classification and regression. Key: SPARK-7916 URL: https://issues.apache.org/jira/browse/SPARK-7916 Project: Spark Issue Type: Improvement Components: Documentation, MLlib, PySpark Affects Versions: 1.4.0 Reporter: Yanbo Liang Fix For: 1.5.0, 1.4.1 Check and update the MLlib Python classification and regression doc so that it is as complete as the Scala doc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7667) MLlib Python API consistency check
[ https://issues.apache.org/jira/browse/SPARK-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588852#comment-14588852 ] Joseph K. Bradley commented on SPARK-7667: -- [~yanboliang] What have you checked through, and what remains for this consistency check? MLlib Python API consistency check -- Key: SPARK-7667 URL: https://issues.apache.org/jira/browse/SPARK-7667 Project: Spark Issue Type: Task Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Yanbo Liang Check and ensure the MLlib Python API (class/method/parameter) is consistent with Scala. The following APIs are not consistent: * class * method * parameter ** feature.StandardScaler.fit() ** many transform() functions of the feature module -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7674) R-like stats for ML models
[ https://issues.apache.org/jira/browse/SPARK-7674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588914#comment-14588914 ] Joseph K. Bradley commented on SPARK-7674: -- Definitely. I think the next items to do are: * confirm whether there is feedback about the general (backend) design in the doc linked above * add functionality to models one-by-one (but thinking about code sharing where possible) R-like stats for ML models -- Key: SPARK-7674 URL: https://issues.apache.org/jira/browse/SPARK-7674 Project: Spark Issue Type: New Feature Components: ML Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Critical This is an umbrella JIRA for supporting ML model summaries and statistics, following the example of R's summary() and plot() functions. [Design doc|https://docs.google.com/document/d/1oswC_Neqlqn5ElPwodlDY4IkSaHAi0Bx6Guo_LvhHK8/edit?usp=sharing] From the design doc: {quote} R and its well-established packages provide extensive functionality for inspecting a model and its results. This inspection is critical to interpreting, debugging and improving models. R is arguably a gold standard for a statistics/ML library, so this doc largely attempts to imitate it. The challenge we face is supporting similar functionality, but on big (distributed) data. Data size makes both efficient computation and meaningful displays/summaries difficult. R model and result summaries generally take 2 forms: * summary(model): Display text with information about the model and results on data * plot(model): Display plots about the model and results We aim to provide both of these types of information. Visualization for the plottable results will not be supported in MLlib itself, but we can provide results in a form which can be plotted easily with other tools. {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8389) Expose KafkaRDDs offsetRange in Java and Python
[ https://issues.apache.org/jira/browse/SPARK-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588404#comment-14588404 ] Cody Koeninger commented on SPARK-8389: --- So on the Java side, just so I'm clear, are we talking about the difference between people writing OffsetRange[] offsets = ((HasOffsetRanges)rdd.rdd()).offsetRanges(); which, as far as I can tell, they can do currently (see attached PR with test change) versus OffsetRange[] offsets = ((HasOffsetRanges)rdd).offsetRanges(); I can see how the second is definitely a nicer API... but I don't know that it's a critical bugfix, and I also don't know that it's worth introducing additional JavaKafkaRDD and JavaDirectKafkaInputDStream wrappers. The typecast is kind of an ugly hack to begin with; there's only so much we can do to make it nicer... short of higher-kinded return type parameters for RDD methods in Spark 2.0 :) Expose KafkaRDDs offsetRange in Java and Python --- Key: SPARK-8389 URL: https://issues.apache.org/jira/browse/SPARK-8389 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Tathagata Das Assignee: Cody Koeninger Priority: Critical Probably requires creating a JavaKafkaPairRDD and also use that in the python APIs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
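For comparison, the equivalent cast on the Scala side, following the documented HasOffsetRanges pattern (directStream stands in for a DStream obtained from KafkaUtils.createDirectStream):
{code}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

directStream.foreachRDD { rdd =>
  // The direct-API RDDs implement HasOffsetRanges, so unlike the Java
  // snippet above there is no .rdd() unwrapping involved.
  val offsets: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsets.foreach(o => println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}"))
}
{code}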
[jira] [Commented] (SPARK-8337) KafkaUtils.createDirectStream for python is lacking API/feature parity with the Scala/Java version
[ https://issues.apache.org/jira/browse/SPARK-8337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588439#comment-14588439 ] Amit Ramesh commented on SPARK-8337: [~juanrh] this looks pretty good to me. And from what I can see it shouldn't add much overhead compared to the existing logic. It is perfect in terms of what we are in need of :). One stylistic suggestion is that you could return (key, value, kafka_offsets) where kafka_offsets is a dict of topic, partition and offset. This would keep things a little more consistent with what is returned when meta info is False. Thanks! Amit KafkaUtils.createDirectStream for python is lacking API/feature parity with the Scala/Java version -- Key: SPARK-8337 URL: https://issues.apache.org/jira/browse/SPARK-8337 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Reporter: Amit Ramesh Priority: Critical See the following thread for context. http://apache-spark-developers-list.1001551.n3.nabble.com/Re-Spark-1-4-Python-API-for-getting-Kafka-offsets-in-direct-mode-tt12714.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7944) Spark-Shell 2.11 1.4.0-RC-03 does not add jars to class path
[ https://issues.apache.org/jira/browse/SPARK-7944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588506#comment-14588506 ] Alex Baretta commented on SPARK-7944: - Bug confirmed on Spark 1.4.0 with Scala 2.11.6. The --jars option to spark-shell is properly passed on to the SparkSubmit class, and the jars seem to be loaded, but the classes are not available in the REPL. spark-shell --jars commons-csv-1.0.jar ... 15/06/16 17:57:32 INFO SparkContext: Added JAR file:/home/alex/commons-csv-1.0.jar at http://10.240.57.53:38821/jars/commons-csv-1.0.jar with timestamp 1434477452978 ... scala> org.apache.commons.csv.CSVFormat.DEFAULT <console>:21: error: object csv is not a member of package org.apache.commons org.apache.commons.csv.CSVFormat.DEFAULT ^ Spark-Shell 2.11 1.4.0-RC-03 does not add jars to class path Key: SPARK-7944 URL: https://issues.apache.org/jira/browse/SPARK-7944 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.3.1, 1.4.0 Environment: Scala 2.11 Reporter: Alexander Nakos Priority: Critical Attachments: spark_shell_output.txt, spark_shell_output_2.10.txt When I run the spark-shell with the --jars argument and supply a path to a single jar file, none of the classes in the jar are available in the REPL. I have encountered this same behaviour in both the 1.3.1 and 1.4.0_RC-03 builds for Scala 2.11. I have yet to do a 1.4.0 RC-03 build for Scala 2.10, but the contents of the jar are available in the 1.3.1_2.10 REPL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-7715) Update MLlib Programming Guide for 1.4
[ https://issues.apache.org/jira/browse/SPARK-7715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reopened SPARK-7715: -- This should not actually be closed yet. We need to update the programming guide still, mainly to provide a new migration guide (which won't have much content) but also to make it easier to find the Pipelines API docs. (This should have happened before the release, but we can at least try to get it done ASAP.) I'm going to finish closing out some other JIRAs before addressing this one, since some of those might indicate items to include in the migration guide. Update MLlib Programming Guide for 1.4 -- Key: SPARK-7715 URL: https://issues.apache.org/jira/browse/SPARK-7715 Project: Spark Issue Type: Sub-task Components: Documentation, ML, MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Fix For: 1.4.0 Before the release, we need to update the MLlib Programming Guide. Updates will include: * Add migration guide subsection. ** Use the results of the QA audit JIRAs. * Check phrasing, especially in main sections (for outdated items such as "In this release, ..."). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7666) MLlib Python doc parity check
[ https://issues.apache.org/jira/browse/SPARK-7666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-7666. -- Resolution: Fixed Fix Version/s: 1.5.0 1.4.1 Assignee: Yanbo Liang I'm resolving this. I think we can complete this parity check for other parts of MLlib during this next release cycle. MLlib Python doc parity check - Key: SPARK-7666 URL: https://issues.apache.org/jira/browse/SPARK-7666 Project: Spark Issue Type: Documentation Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Yanbo Liang Assignee: Yanbo Liang Fix For: 1.4.1, 1.5.0 Check the MLlib Python doc and make it as complete as the Scala doc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7580) Driver out of memory
[ https://issues.apache.org/jira/browse/SPARK-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588855#comment-14588855 ] Andrew Rothstein commented on SPARK-7580: - Haven't heard any solutions. We basically reduced the number of executors to 500 or 1000. Upping the memory allocated to the driver will help as well. Unfortunately my cluster is configured to limit my driver container to 4g so I suspect it's thrashing. Driver out of memory Key: SPARK-7580 URL: https://issues.apache.org/jira/browse/SPARK-7580 Project: Spark Issue Type: Bug Affects Versions: 1.3.0 Environment: YARN, HDP 2.1, RedHat 6.4 200 x HP DL185 Reporter: Andrew Rothstein My 200-node cluster has an 8k executor capacity. When I submitted a job with 2k executors, 2g per executor, and 4g for the driver, the ApplicationMaster/driver quickly became unresponsive. It was making progress, then threw a couple of these exceptions: 2015-05-12 16:46:41,598 ERROR [Spark Context Cleaner] spark.ContextCleaner: Error cleaning broadcast 4 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:137) at org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:227) at org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45) at org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:66) at org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:185) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:147) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:138) at scala.Option.foreach(Option.scala:236) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:138) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:134) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:134) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617) at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:133) at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:65) Then the job crashed with OOM.
2015-05-12 16:47:53,566 ERROR [sparkDriver-akka.actor.default-dispatcher-4] actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher-8] shutting down ActorSystem [sparkDriver] java.lang.OutOfMemoryError: Java heap space at org.spark_project.protobuf.ByteString.copyFrom(ByteString.java:216) at org.spark_project.protobuf.ByteString.copyFrom(ByteString.java:229) at akka.remote.transport.AkkaPduProtobufCodec$.constructPayload(AkkaPduCodec.scala:145) at akka.remote.transport.AkkaProtocolHandle.write(AkkaProtocolTransport.scala:182) at akka.remote.EndpointWriter.writeSend(Endpoint.scala:760) at akka.remote.EndpointWriter$$anonfun$2.applyOrElse(Endpoint.scala:722) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:415) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) When I reran the job with 3g of memory per executor and 1k executors, it ran to completion more quickly than the 2k executor run took to crash. I didn't think I was pushing the envelope by using 2k executors and the stock driver heap size. Is this a scale limitation of the driver? Any suggestions beyond increasing the driver memory?
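A minimal sketch of the workaround described in the comment above (fewer, larger executors), with illustrative values; both settings are honoured on YARN when present before the context starts:
{code}
from pyspark import SparkConf, SparkContext

# Fewer, beefier executors than the failing 2k-executor run.
conf = (SparkConf()
        .set("spark.executor.instances", "1000")
        .set("spark.executor.memory", "3g"))
sc = SparkContext(conf=conf)

# spark.driver.memory, by contrast, must be set before the driver JVM
# launches (e.g. spark-submit --driver-memory 8g), and is still capped
# by whatever the cluster allows for the driver container.
{code}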
[jira] [Updated] (SPARK-7916) MLlib Python doc parity check for classification and regression.
[ https://issues.apache.org/jira/browse/SPARK-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7916: - Assignee: Yanbo Liang MLlib Python doc parity check for classification and regression. Key: SPARK-7916 URL: https://issues.apache.org/jira/browse/SPARK-7916 Project: Spark Issue Type: Improvement Components: Documentation, MLlib, PySpark Affects Versions: 1.4.0 Reporter: Yanbo Liang Assignee: Yanbo Liang Fix For: 1.4.1, 1.5.0 Check the MLlib Python classification and regression docs and make them as complete as the Scala docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8400) ml.ALS doesn't handle -1 block size
Xiangrui Meng created SPARK-8400: Summary: ml.ALS doesn't handle -1 block size Key: SPARK-8400 URL: https://issues.apache.org/jira/browse/SPARK-8400 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.3.1 Reporter: Xiangrui Meng Under spark.mllib, if the number of blocks is set to -1, we set the block size automatically based on the input partition size. However, this behavior is not preserved in the spark.ml API. If a user sets -1 in Spark 1.3, it will not work, but no error message will show. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
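For illustration, a minimal sketch of the asymmetry as seen from the Python APIs (ratings is assumed to be an RDD of Rating objects and ratings_df a DataFrame of user/item/rating columns; parameter names per the 1.4 Python API):
{code}
from pyspark.mllib.recommendation import ALS as MLlibALS
from pyspark.ml.recommendation import ALS as MLALS

# spark.mllib: blocks=-1 means "choose the number of blocks
# automatically from the input partitioning" -- the behaviour this
# issue says is not preserved in spark.ml.
model = MLlibALS.train(ratings, rank=10, iterations=10, blocks=-1)

# spark.ml: -1 is accepted without an error message but does not
# trigger the auto-sizing, so explicit positive block counts are
# needed until this issue is fixed.
als = MLALS(rank=10, maxIter=10, numUserBlocks=10, numItemBlocks=10)
mlModel = als.fit(ratings_df)
{code}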
[jira] [Commented] (SPARK-7944) Spark-Shell 2.11 1.4.0-RC-03 does not add jars to class path
[ https://issues.apache.org/jira/browse/SPARK-7944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588740#comment-14588740 ] Iulian Dragos commented on SPARK-7944: -- I'll have a look tomorrow; I vaguely remember a bug in the Scala REPL that was fixed. Since the code is forked, the fix may not be in there... Spark-Shell 2.11 1.4.0-RC-03 does not add jars to class path Key: SPARK-7944 URL: https://issues.apache.org/jira/browse/SPARK-7944 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.3.1, 1.4.0 Environment: scala 2.11 Reporter: Alexander Nakos Priority: Critical Attachments: spark_shell_output.txt, spark_shell_output_2.10.txt When I run the spark-shell with the --jars argument and supply a path to a single jar file, none of the classes in the jar are available in the REPL. I have encountered this same behaviour in both 1.3.1 and 1.4.0_RC-03 builds for scala 2.11. I have yet to do a 1.4.0 RC-03 build for scala 2.10, but the contents of the jar are available in the 1.3.1_2.10 REPL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8379) LeaseExpiredException when using dynamic partition with speculative execution
[ https://issues.apache.org/jira/browse/SPARK-8379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-8379: Target Version/s: 1.5.0 Shepherd: Cheng Lian Assignee: jeanlyn LeaseExpiredException when using dynamic partition with speculative execution - Key: SPARK-8379 URL: https://issues.apache.org/jira/browse/SPARK-8379 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.3.1, 1.4.0 Reporter: jeanlyn Assignee: jeanlyn When inserting into a table using dynamic partitions with *spark.speculation=true*, skewed data in some partitions can trigger speculative tasks, which throw an exception like {code} org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): Lease mismatch on /tmp/hive-jeanlyn/hive_2015-06-15_15-20-44_734_8801220787219172413-1/-ext-1/ds=2015-06-15/type=2/part-00301.lzo owned by DFSClient_attempt_201506031520_0011_m_000189_0_-1513487243_53 but is accessed by DFSClient_attempt_201506031520_0011_m_42_0_-1275047721_57 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
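Until the race between the original and the speculative attempt writing the same output file is fixed, one possible workaround is to disable speculation for jobs that perform dynamic-partition inserts. A minimal sketch (table and column names hypothetical):
{code}
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

# Disable speculative execution for this job so only one task attempt
# writes each dynamic-partition output file.
conf = SparkConf().set("spark.speculation", "false")
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)

sqlContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
sqlContext.sql(
    "INSERT INTO TABLE events PARTITION (ds, type) "
    "SELECT payload, ds, type FROM staging_events")
{code}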
[jira] [Assigned] (SPARK-3258) Python API for streaming MLlib algorithms
[ https://issues.apache.org/jira/browse/SPARK-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-3258: --- Assignee: (was: Apache Spark) Python API for streaming MLlib algorithms - Key: SPARK-3258 URL: https://issues.apache.org/jira/browse/SPARK-3258 Project: Spark Issue Type: Umbrella Components: MLlib, PySpark, Streaming Reporter: Xiangrui Meng This is an umbrella JIRA to track the Python port of streaming MLlib algorithms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-3258) Python API for streaming MLlib algorithms
[ https://issues.apache.org/jira/browse/SPARK-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-3258: --- Assignee: Apache Spark Python API for streaming MLlib algorithms - Key: SPARK-3258 URL: https://issues.apache.org/jira/browse/SPARK-3258 Project: Spark Issue Type: Umbrella Components: MLlib, PySpark, Streaming Reporter: Xiangrui Meng Assignee: Apache Spark This is an umbrella JIRA to track the Python port of streaming MLlib algorithms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3258) Python API for streaming MLlib algorithms
[ https://issues.apache.org/jira/browse/SPARK-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588585#comment-14588585 ] Apache Spark commented on SPARK-3258: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/6849 Python API for streaming MLlib algorithms - Key: SPARK-3258 URL: https://issues.apache.org/jira/browse/SPARK-3258 Project: Spark Issue Type: Umbrella Components: MLlib, PySpark, Streaming Reporter: Xiangrui Meng This is an umbrella JIRA to track the Python port of streaming MLlib algorithms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7944) Spark-Shell 2.11 1.4.0-RC-03 does not add jars to class path
[ https://issues.apache.org/jira/browse/SPARK-7944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588588#comment-14588588 ] Vincent Ohprecio commented on SPARK-7944: - Just compiled version 1.5.0-SNAPSHOT (Scala version 2.10.4) from GitHub. ~/dev/spark(master) $ build/mvn -DskipTests clean package [INFO] BUILD SUCCESS ... ~/dev/spark(master) $ bin/spark-shell --jars /Users/antigen/Downloads/algebird-core_2.10-0.10.2.jar Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67) Type in expressions to have them evaluated. Type :help for more information. Spark context available as sc. SQL context available as sqlContext. scala> import com.twitter.algebird._ import com.twitter.algebird._ Spark-Shell 2.11 1.4.0-RC-03 does not add jars to class path Key: SPARK-7944 URL: https://issues.apache.org/jira/browse/SPARK-7944 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.3.1, 1.4.0 Environment: scala 2.11 Reporter: Alexander Nakos Priority: Critical Attachments: spark_shell_output.txt, spark_shell_output_2.10.txt When I run the spark-shell with the --jars argument and supply a path to a single jar file, none of the classes in the jar are available in the REPL. I have encountered this same behaviour in both 1.3.1 and 1.4.0_RC-03 builds for scala 2.11. I have yet to do a 1.4.0 RC-03 build for scala 2.10, but the contents of the jar are available in the 1.3.1_2.10 REPL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7633) Streaming Logistic Regression- Python bindings
[ https://issues.apache.org/jira/browse/SPARK-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7633: --- Assignee: (was: Apache Spark) Streaming Logistic Regression- Python bindings -- Key: SPARK-7633 URL: https://issues.apache.org/jira/browse/SPARK-7633 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Yanbo Liang Add Python API for StreamingLogisticRegressionWithSGD -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7633) Streaming Logistic Regression- Python bindings
[ https://issues.apache.org/jira/browse/SPARK-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7633: --- Assignee: Apache Spark Streaming Logistic Regression- Python bindings -- Key: SPARK-7633 URL: https://issues.apache.org/jira/browse/SPARK-7633 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Yanbo Liang Assignee: Apache Spark Add Python API for StreamingLogisticRegressionWithSGD -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7633) Streaming Logistic Regression- Python bindings
[ https://issues.apache.org/jira/browse/SPARK-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588600#comment-14588600 ] Manoj Kumar commented on SPARK-7633: I'm extremely sorry. I was halfway through when you commented (just the tests were remaining). Streaming Logistic Regression- Python bindings -- Key: SPARK-7633 URL: https://issues.apache.org/jira/browse/SPARK-7633 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Yanbo Liang Add Python API for StreamingLogisticRegressionWithSGD -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7633) Streaming Logistic Regression- Python bindings
[ https://issues.apache.org/jira/browse/SPARK-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588595#comment-14588595 ] Apache Spark commented on SPARK-7633: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/6849 Streaming Logistic Regression- Python bindings -- Key: SPARK-7633 URL: https://issues.apache.org/jira/browse/SPARK-7633 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Yanbo Liang Add Python API for StreamingLogisticRegressionWithSGD -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
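A rough sketch of how the binding being added here could look from user code, mirroring the existing streaming linear regression Python API. Constructor arguments, weight dimensions, and directory names are illustrative; the final API is whatever the pull request settles on:
{code}
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import StreamingLogisticRegressionWithSGD
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)

def parse_point(line):
    # "label,f1,f2,f3" -> LabeledPoint
    values = [float(x) for x in line.split(",")]
    return LabeledPoint(values[0], values[1:])

training = ssc.textFileStream("training_dir/").map(parse_point)
test = ssc.textFileStream("test_dir/").map(
    lambda line: [float(x) for x in line.split(",")])

model = StreamingLogisticRegressionWithSGD(stepSize=0.1, numIterations=25)
model.setInitialWeights([0.0, 0.0, 0.0])

model.trainOn(training)         # update weights on each training batch
model.predictOn(test).pprint()  # score incoming feature vectors

ssc.start()
{code}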
[jira] [Resolved] (SPARK-8387) [SPARK][Web-UI] Only show 4096 bytes content for executor log instead all
[ https://issues.apache.org/jira/browse/SPARK-8387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8387. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6834 [https://github.com/apache/spark/pull/6834] [SPARK][Web-UI] Only show 4096 bytes content for executor log instead all - Key: SPARK-8387 URL: https://issues.apache.org/jira/browse/SPARK-8387 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.4.0 Reporter: SuYan Priority: Minor Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8387) [SPARK][Web-UI] Only show 4096 bytes content for executor log instead all
[ https://issues.apache.org/jira/browse/SPARK-8387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8387: - Assignee: SuYan [SPARK][Web-UI] Only show 4096 bytes content for executor log instead all - Key: SPARK-8387 URL: https://issues.apache.org/jira/browse/SPARK-8387 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.4.0 Reporter: SuYan Assignee: SuYan Priority: Minor Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-3665) Java API for GraphX
[ https://issues.apache.org/jira/browse/SPARK-3665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-3665: --- Assignee: Ankur Dave (was: Apache Spark) Java API for GraphX --- Key: SPARK-3665 URL: https://issues.apache.org/jira/browse/SPARK-3665 Project: Spark Issue Type: Improvement Components: GraphX, Java API Affects Versions: 1.0.0 Reporter: Ankur Dave Assignee: Ankur Dave The Java API will wrap the Scala API in a similar manner to JavaRDD. Components will include: # JavaGraph #- removes optional param from persist, subgraph, mapReduceTriplets, Graph.fromEdgeTuples, Graph.fromEdges, Graph.apply #- removes implicit {{=:=}} param from mapVertices, outerJoinVertices #- merges multiple parameter lists #- incorporates GraphOps # JavaVertexRDD # JavaEdgeRDD -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-3665) Java API for GraphX
[ https://issues.apache.org/jira/browse/SPARK-3665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-3665: --- Assignee: Apache Spark (was: Ankur Dave) Java API for GraphX --- Key: SPARK-3665 URL: https://issues.apache.org/jira/browse/SPARK-3665 Project: Spark Issue Type: Improvement Components: GraphX, Java API Affects Versions: 1.0.0 Reporter: Ankur Dave Assignee: Apache Spark The Java API will wrap the Scala API in a similar manner to JavaRDD. Components will include: # JavaGraph #- removes optional param from persist, subgraph, mapReduceTriplets, Graph.fromEdgeTuples, Graph.fromEdges, Graph.apply #- removes implicit {{=:=}} param from mapVertices, outerJoinVertices #- merges multiple parameter lists #- incorporates GraphOps # JavaVertexRDD # JavaEdgeRDD -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5362) Gradient and Optimizer to support generic output (instead of label) and data batches
[ https://issues.apache.org/jira/browse/SPARK-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5362: --- Assignee: (was: Apache Spark) Gradient and Optimizer to support generic output (instead of label) and data batches Key: SPARK-5362 URL: https://issues.apache.org/jira/browse/SPARK-5362 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Alexander Ulanov Original Estimate: 24h Remaining Estimate: 24h Currently, the Gradient and Optimizer interfaces support data in the form of RDD[(Double, Vector)], i.e. label and features. This limits their application to classification problems. For example, an artificial neural network demands a Vector as output (instead of label: Double). Moreover, the current interface does not support data batches. I propose to replace label: Double with output: Vector. This enables passing a generic output instead of a label, and also passing data and output batches stored in corresponding vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5362) Gradient and Optimizer to support generic output (instead of label) and data batches
[ https://issues.apache.org/jira/browse/SPARK-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5362: --- Assignee: Apache Spark Gradient and Optimizer to support generic output (instead of label) and data batches Key: SPARK-5362 URL: https://issues.apache.org/jira/browse/SPARK-5362 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Alexander Ulanov Assignee: Apache Spark Original Estimate: 24h Remaining Estimate: 24h Currently, the Gradient and Optimizer interfaces support data in the form of RDD[(Double, Vector)], i.e. label and features. This limits their application to classification problems. For example, an artificial neural network demands a Vector as output (instead of label: Double). Moreover, the current interface does not support data batches. I propose to replace label: Double with output: Vector. This enables passing a generic output instead of a label, and also passing data and output batches stored in corresponding vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
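Since the proposal concerns a Scala interface, the following is purely conceptual Python pseudocode (no such Python classes exist in MLlib) showing the shape of the change being proposed:
{code}
# Current contract: one scalar label per feature vector.
class CurrentGradient(object):
    def compute(self, data, label, weights):
        # data: feature vector, label: float
        # returns (gradient, loss)
        raise NotImplementedError

# Proposed contract: a generic output vector per example, so e.g. an
# artificial neural network can train against vector-valued targets;
# stacking rows would likewise allow data and output batches.
class ProposedGradient(object):
    def compute(self, data, output, weights):
        # data: feature vector (or a batch of feature rows)
        # output: target vector (or a batch of target rows)
        # returns (gradient, loss)
        raise NotImplementedError
{code}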