[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390169#comment-14390169 ] Rahul Kumar commented on SPARK-6646: Love this idea. What about a private cloud in your pocket? :-) Store data on the smartphone, process it there, and run a small mobile web server that powers cool visualization reports. Much of the time our smartphones sit idle, so we could share their resources :-) 4 GB of RAM, a quad-core processor, and an LTE network: not bad for a single node in a cluster. Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. Mobile friendly version of the design doc: https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6644) [SPARK-SQL]when the partition schema does not match table schema(ADD COLUMN), new column value is NULL
[ https://issues.apache.org/jira/browse/SPARK-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6644: -- Description: In Hive, the schema of a partition may differ from the table schema. For example, we may add new columns to the table after importing existing partitions. When using {{spark-sql}} to query the data in a partition whose schema is different from the table schema, problems may arise. Part of them have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. However, after adding new column(s) to the table, when inserting data into old partitions, values of newly added columns are all {{NULL}}. The following snippet can be used to reproduce this issue:

{code}
case class TestData(key: Int, value: String)

val testData = TestHive.sparkContext.parallelize(
  (1 to 2).map(i => TestData(i, i.toString))).toDF()
testData.registerTempTable("testData")

sql("DROP TABLE IF EXISTS table_with_partition")
sql(s"CREATE TABLE IF NOT EXISTS table_with_partition(key int, value string) PARTITIONED BY (ds string) LOCATION '${tmpDir.toURI.toString}'")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value FROM testData")

// Add new columns to the table
sql("ALTER TABLE table_with_partition ADD COLUMNS(key1 string)")
sql("ALTER TABLE table_with_partition ADD COLUMNS(destlng double)")

sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value, 'test', 1.11 FROM testData")
sql("SELECT * FROM table_with_partition WHERE ds = '1'").collect().foreach(println)
{code}

Actual result:

{noformat}
[1,1,null,null,1]
[2,2,null,null,1]
{noformat}

Expected result:

{noformat}
[1,1,test,1.11,1]
[2,2,test,1.11,1]
{noformat}

was: In Hive, the schema of a partition may differ from the table schema. For example, we may add a new column. When we use spark-sql to query the data of a partition whose schema differs from the table schema, some problems have been solved in PR #4289 (https://github.com/apache/spark/pull/4289), but if we add a new column and then insert data into the old partition, the new column's value is NULL. Steps to reproduce:

{code}
case class TestData(key: Int, value: String)

val testData = TestHive.sparkContext.parallelize(
  (1 to 2).map(i => TestData(i, i.toString))).toDF()
testData.registerTempTable("testData")

sql("DROP TABLE IF EXISTS table_with_partition")
sql(s"CREATE TABLE IF NOT EXISTS table_with_partition(key int, value string) PARTITIONED BY (ds string) LOCATION '${tmpDir.toURI.toString}'")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds='1') SELECT key, value FROM testData")

// Add columns to the table
sql("ALTER TABLE table_with_partition ADD COLUMNS(key1 string)")
sql("ALTER TABLE table_with_partition ADD COLUMNS(destlng double)")

sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds='1') SELECT key, value, 'test', 1.11 FROM testData")
sql("SELECT * FROM table_with_partition WHERE ds='1'").collect().foreach(println)
{code}

Result:

{noformat}
[1,1,null,null,1]
[2,2,null,null,1]
{noformat}

Result we expect:

{noformat}
[1,1,test,1.11,1]
[2,2,test,1.11,1]
{noformat}

This bug also causes wrong query counts, e.g. when we query: select count(1) from table_with_partition where key1 is not NULL

[SPARK-SQL]when the partition schema does not match table schema(ADD COLUMN), new column value is NULL -- Key: SPARK-6644 URL: https://issues.apache.org/jira/browse/SPARK-6644 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: dongxu In Hive, the schema of a partition may differ from the table schema. For example, we may add new columns to the table after importing existing partitions. When using {{spark-sql}} to query the data in a partition whose schema is different from the table schema, problems may arise. Part of them have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. However, after adding new column(s) to the table, when inserting data into old partitions, values of newly added columns are all {{NULL}}. The repro snippet and actual/expected results are the same as in the description above. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
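To make the impact of the SPARK-6644 bug concrete, here is the wrong-count query mentioned in the description, run against the two-row testData fixture from the repro snippet:

{code}
// With the NULL values shown above, key1 is NULL in both rows, so this
// returns 0 instead of the expected 2:
sql("SELECT count(1) FROM table_with_partition WHERE key1 IS NOT NULL").collect().foreach(println)
{code}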
[jira] [Updated] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6646: --- Description: Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This post outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. See also SPARK-6646 for community discussion on the issue. was:Design doc to come ... Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This post outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. See also SPARK-6646 for community discussion on the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6646: --- Attachment: Spark on Mobile - Design Doc - v1.pdf Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Design doc to come ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6640) Executor may connect to HeartbeartReceiver before it's setup in the driver side
[ https://issues.apache.org/jira/browse/SPARK-6640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6640: --- Assignee: Apache Spark Executor may connect to HeartbeartReceiver before it's setup in the driver side --- Key: SPARK-6640 URL: https://issues.apache.org/jira/browse/SPARK-6640 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Shixiong Zhu Assignee: Apache Spark Here is the current code that starts LocalBackend and creates HeartbeatReceiver:

{code}
// Create and start the scheduler
private[spark] var (schedulerBackend, taskScheduler) =
  SparkContext.createTaskScheduler(this, master)
private val heartbeatReceiver = env.actorSystem.actorOf(
  Props(new HeartbeatReceiver(this, taskScheduler)), "HeartbeatReceiver")
{code}

When LocalBackend is created, it starts `LocalActor`. `LocalActor` creates the Executor, and the Executor's constructor looks up `HeartbeatReceiver`. So we should make sure this line:

{code}
private val heartbeatReceiver = env.actorSystem.actorOf(
  Props(new HeartbeatReceiver(this, taskScheduler)), "HeartbeatReceiver")
{code}

runs before `LocalActor` is created. However, the current code cannot guarantee that, so creating the Executor sometimes crashes. The issue was reported by sparkdi shopaddr1...@dubna.us in http://apache-spark-user-list.1001560.n3.nabble.com/Actor-not-found-td22265.html#a22324 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
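For context, a minimal sketch of one possible ordering fix for SPARK-6640, assuming HeartbeatReceiver can be constructed before the TaskScheduler exists and be told about it afterwards (the constructor change and the TaskSchedulerIsSet message are illustrative, not necessarily the actual patch):

{code}
// Create and register HeartbeatReceiver first, so an Executor started by the
// scheduler backend can always look it up
private val heartbeatReceiver = env.actorSystem.actorOf(
  Props(new HeartbeatReceiver(this)), "HeartbeatReceiver")

// Only then create the scheduler, and hand the receiver its TaskScheduler reference
private[spark] var (schedulerBackend, taskScheduler) =
  SparkContext.createTaskScheduler(this, master)
heartbeatReceiver ! TaskSchedulerIsSet(taskScheduler)  // hypothetical message
{code}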
[jira] [Assigned] (SPARK-6640) Executor may connect to HeartbeartReceiver before it's setup in the driver side
[ https://issues.apache.org/jira/browse/SPARK-6640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6640: --- Assignee: (was: Apache Spark) Executor may connect to HeartbeartReceiver before it's setup in the driver side --- Key: SPARK-6640 URL: https://issues.apache.org/jira/browse/SPARK-6640 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Shixiong Zhu Here is the current code that starts LocalBackend and creates HeartbeatReceiver:

{code}
// Create and start the scheduler
private[spark] var (schedulerBackend, taskScheduler) =
  SparkContext.createTaskScheduler(this, master)
private val heartbeatReceiver = env.actorSystem.actorOf(
  Props(new HeartbeatReceiver(this, taskScheduler)), "HeartbeatReceiver")
{code}

When LocalBackend is created, it starts `LocalActor`. `LocalActor` creates the Executor, and the Executor's constructor looks up `HeartbeatReceiver`. So we should make sure this line:

{code}
private val heartbeatReceiver = env.actorSystem.actorOf(
  Props(new HeartbeatReceiver(this, taskScheduler)), "HeartbeatReceiver")
{code}

runs before `LocalActor` is created. However, the current code cannot guarantee that, so creating the Executor sometimes crashes. The issue was reported by sparkdi shopaddr1...@dubna.us in http://apache-spark-user-list.1001560.n3.nabble.com/Actor-not-found-td22265.html#a22324 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6640) Executor may connect to HeartbeartReceiver before it's setup in the driver side
[ https://issues.apache.org/jira/browse/SPARK-6640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390090#comment-14390090 ] Apache Spark commented on SPARK-6640: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/5306 Executor may connect to HeartbeartReceiver before it's setup in the driver side --- Key: SPARK-6640 URL: https://issues.apache.org/jira/browse/SPARK-6640 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Shixiong Zhu Here is the current code that starts LocalBackend and creates HeartbeatReceiver:

{code}
// Create and start the scheduler
private[spark] var (schedulerBackend, taskScheduler) =
  SparkContext.createTaskScheduler(this, master)
private val heartbeatReceiver = env.actorSystem.actorOf(
  Props(new HeartbeatReceiver(this, taskScheduler)), "HeartbeatReceiver")
{code}

When LocalBackend is created, it starts `LocalActor`. `LocalActor` creates the Executor, and the Executor's constructor looks up `HeartbeatReceiver`. So we should make sure this line:

{code}
private val heartbeatReceiver = env.actorSystem.actorOf(
  Props(new HeartbeatReceiver(this, taskScheduler)), "HeartbeatReceiver")
{code}

runs before `LocalActor` is created. However, the current code cannot guarantee that, so creating the Executor sometimes crashes. The issue was reported by sparkdi shopaddr1...@dubna.us in http://apache-spark-user-list.1001560.n3.nabble.com/Actor-not-found-td22265.html#a22324 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390153#comment-14390153 ] Sandy Ryza commented on SPARK-6646: --- This seems like a good opportunity to finally add a DataFrame registerTempTablet API. Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390160#comment-14390160 ] Yu Ishikawa commented on SPARK-6646: That sounds very interesting! We should support deploying a trained machine learning model to smartphones. :) Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. Mobile friendly version of the design doc: https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390176#comment-14390176 ] Jeremy Freeman commented on SPARK-6646: --- Very promising, [~tdas]! We should evaluate the performance of streaming machine learning algorithms. In general I think running Spark in JavaScript via Scala.js and Node.js is extremely appealing; it will make integration with visualization very straightforward. Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. Mobile friendly version of the design doc: https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6631) I am unable to get the Maven Build file in Example 2.13 to build anything but an empty file
[ https://issues.apache.org/jira/browse/SPARK-6631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390185#comment-14390185 ] Sean Owen commented on SPARK-6631: -- The Debian packaging was removed; I don't know how much it worked before. u...@spark.apache.org is appropriate for this kind of question. Here you're tacking on to an unrelated JIRA. I am unable to get the Maven Build file in Example 2.13 to build anything but an empty file --- Key: SPARK-6631 URL: https://issues.apache.org/jira/browse/SPARK-6631 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.3.0 Environment: Ubuntu 14.04 Reporter: Frank Domoney Priority: Blocker I have downloaded and built Spark 1.3.0 under Ubuntu 14.04 but have been unable to get reduceByKey to work on what seems to be a valid RDD using the command line.

{noformat}
scala> counts.take(10)
res17: Array[(String, Int)] = Array((Vladimir,1), (Putin,1), (has,1), (said,1), (Russia,1), (will,1), (fight,1), (for,1), (an,1), (independent,1))

scala> val counts1 = counts.reduceByKey { case (x, y) => x + y }
scala> counts1.take(10)
res16: Array[(String, Int)] = Array()
{noformat}

I am attempting to build the Maven sequence in Example 2.15 but get the following results:

{noformat}
Building example 0.0.1
[INFO] --- maven-resources-plugin:2.3:resources (default-resources) @ learning-spark-mini-example ---
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] skip non existing resourceDirectory /home/panzerfrank/Downloads/spark-1.3.0/wordcount/src/main/resources
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ learning-spark-mini-example ---
[INFO] No sources to compile
[INFO] --- maven-resources-plugin:2.3:testResources (default-testResources) @ learning-spark-mini-example ---
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] skip non existing resourceDirectory /home/panzerfrank/Downloads/spark-1.3.0/wordcount/src/test/resources
[INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @ learning-spark-mini-example ---
[INFO] No sources to compile
[INFO] --- maven-surefire-plugin:2.10:test (default-test) @ learning-spark-mini-example ---
[INFO] No tests to run.
[INFO] Surefire report directory: /home/panzerfrank/Downloads/spark-1.3.0/wordcount/target/surefire-reports
[INFO] --- maven-jar-plugin:2.2:jar (default-jar) @ learning-spark-mini-example ---
[WARNING] JAR will be empty - no content was marked for inclusion!
[INFO] Building jar: /home/panzerfrank/Downloads/spark-1.3.0/wordcount/target/learning-spark-mini-example-0.0.1.jar
{noformat}

I am using the POM file from Example 2-13. Java is Java 8. Am I doing something really stupid? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
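For reference, a minimal word count that exercises {{reduceByKey}} end to end in the spark-shell (the input path is hypothetical); if this works, the problem is more likely in how {{counts}} was built than in {{reduceByKey}} itself:

{code}
// Build (word, 1) pairs from a text file and sum the counts per word
val counts = sc.textFile("input.txt")            // hypothetical input file
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey { case (x, y) => x + y }
counts.take(10)                                  // should return (word, count) pairs
{code}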
[jira] [Updated] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6646: --- Description: Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. was: Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This post outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. See also SPARK-6646 for community discussion on the issue. Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390161#comment-14390161 ] Tathagata Das commented on SPARK-6646: -- I have been working on running NetworkWordCount on our iPhone prototype, and I was pleasantly surprised with the performance I was getting. The network bandwidth is definitely lower, and there is a higher cost to shuffling data, but it's still quite good. Task launch latencies are higher, though, so streaming applications will require slightly larger batch sizes. But overall you will be surprised. I will post numbers when I can compile them into graphs. Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. Mobile friendly version of the design doc: https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
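For context, a minimal NetworkWordCount setup with a larger batch interval, per the latency observation above (the 10-second interval and the existing SparkConf named {{conf}} are illustrative assumptions, not measured recommendations):

{code}
import org.apache.spark.streaming.{Seconds, StreamingContext}

// A larger batch interval absorbs the higher task launch latencies
val ssc = new StreamingContext(conf, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
ssc.start()
ssc.awaitTermination()
{code}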
[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390206#comment-14390206 ] Petar Zecevic commented on SPARK-6646: -- Good one :) Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. Mobile friendly version of the design doc: https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390183#comment-14390183 ] Sean Owen commented on SPARK-6646: -- Concept: a smartphone app that lets you find the nearest Spark cluster to join. Swipe left/right on photos from the worker nodes to indicate which ones you want to join. The only problem is this *must* be called SparkR to be taken seriously, so I think it will have to be rolled into the R library. Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. Mobile friendly version of the design doc: https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6644) After adding new columns to a partitioned table and inserting data to an old partition, data of newly added columns are all NULL
[ https://issues.apache.org/jira/browse/SPARK-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6644: -- Summary: After adding new columns to a partitioned table and inserting data to an old partition, data of newly added columns are all NULL (was: [SPARK-SQL]when the partition schema does not match table schema(ADD COLUMN), new column value is NULL) After adding new columns to a partitioned table and inserting data to an old partition, data of newly added columns are all NULL Key: SPARK-6644 URL: https://issues.apache.org/jira/browse/SPARK-6644 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: dongxu In Hive, the schema of a partition may differ from the table schema. For example, we may add new columns to the table after importing existing partitions. When using {{spark-sql}} to query the data in a partition whose schema is different from the table schema, problems may arise. Part of them have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. However, after adding new column(s) to the table, when inserting data into old partitions, values of newly added columns are all {{NULL}}. The following snippet can be used to reproduce this issue:

{code}
case class TestData(key: Int, value: String)

val testData = TestHive.sparkContext.parallelize(
  (1 to 2).map(i => TestData(i, i.toString))).toDF()
testData.registerTempTable("testData")

sql("DROP TABLE IF EXISTS table_with_partition")
sql(s"CREATE TABLE IF NOT EXISTS table_with_partition(key int, value string) PARTITIONED BY (ds string) LOCATION '${tmpDir.toURI.toString}'")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value FROM testData")

// Add new columns to the table
sql("ALTER TABLE table_with_partition ADD COLUMNS(key1 string)")
sql("ALTER TABLE table_with_partition ADD COLUMNS(destlng double)")

sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value, 'test', 1.11 FROM testData")
sql("SELECT * FROM table_with_partition WHERE ds = '1'").collect().foreach(println)
{code}

Actual result:

{noformat}
[1,1,null,null,1]
[2,2,null,null,1]
{noformat}

Expected result:

{noformat}
[1,1,test,1.11,1]
[2,2,test,1.11,1]
{noformat}

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4346) YarnClientSchedulerBack.asyncMonitorApplication should be common with Client.monitorApplication
[ https://issues.apache.org/jira/browse/SPARK-4346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390086#comment-14390086 ] Apache Spark commented on SPARK-4346: - User 'Sephiroth-Lin' has created a pull request for this issue: https://github.com/apache/spark/pull/5305 YarnClientSchedulerBack.asyncMonitorApplication should be common with Client.monitorApplication --- Key: SPARK-4346 URL: https://issues.apache.org/jira/browse/SPARK-4346 Project: Spark Issue Type: Improvement Components: Scheduler, YARN Reporter: Thomas Graves The YarnClientSchedulerBackend.asyncMonitorApplication routine should move into ClientBase and be made common with monitorApplication. Make sure stop is handled properly. See discussion on https://github.com/apache/spark/pull/3143 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3596) Support changing the yarn client monitor interval
[ https://issues.apache.org/jira/browse/SPARK-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390087#comment-14390087 ] Apache Spark commented on SPARK-3596: - User 'Sephiroth-Lin' has created a pull request for this issue: https://github.com/apache/spark/pull/5305 Support changing the yarn client monitor interval -- Key: SPARK-3596 URL: https://issues.apache.org/jira/browse/SPARK-3596 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.2.0 Reporter: Thomas Graves Right now Spark on YARN has a monitor interval that can be configured by spark.yarn.report.interval. This is how often the client checks with the RM to get status on the running application in cluster mode. We should allow users to set this interval, as some may not need to check so often. There is another JIRA filed to make it so the client doesn't have to stay around in cluster mode. With the changes in https://github.com/apache/spark/pull/2350, this further extends to affect client mode. We may want to add specific configs for that. Since the monitorApplication function is now used in multiple different scenarios, it might actually make sense for it to take the timeout as a parameter: you could want different timeouts for different situations, for instance how quickly we poll on the client side and print information (cluster mode) vs. how quickly we recognize that the application quit and want to terminate (client mode). I want the latter to happen quickly, whereas in cluster mode I might not care as much about how often updated info is printed to the screen. I guess it's private, so we could leave it as is and change it if we add support for that later. My suggestion for the name would be something like spark.yarn.client.progress.pollinterval. If we were to add separate ones in the future, they could be something like spark.yarn.app.ready.pollinterval and spark.yarn.app.completion.pollinterval. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
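As a sketch of how the suggested config might be consumed, assuming the proposed key above (it is not an existing Spark setting) and a SparkConf instance named {{sparkConf}}:

{code}
// Proposed key from the discussion; the 1000 ms default is illustrative
val pollIntervalMs =
  sparkConf.getLong("spark.yarn.client.progress.pollinterval", 1000L)
Thread.sleep(pollIntervalMs)  // e.g. between RM status checks in monitorApplication
{code}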
[jira] [Assigned] (SPARK-4346) YarnClientSchedulerBack.asyncMonitorApplication should be common with Client.monitorApplication
[ https://issues.apache.org/jira/browse/SPARK-4346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4346: --- Assignee: Apache Spark YarnClientSchedulerBack.asyncMonitorApplication should be common with Client.monitorApplication --- Key: SPARK-4346 URL: https://issues.apache.org/jira/browse/SPARK-4346 Project: Spark Issue Type: Improvement Components: Scheduler, YARN Reporter: Thomas Graves Assignee: Apache Spark The YarnClientSchedulerBackend.asyncMonitorApplication routine should move into ClientBase and be made common with monitorApplication. Make sure stop is handled properly. See discussion on https://github.com/apache/spark/pull/3143 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4346) YarnClientSchedulerBack.asyncMonitorApplication should be common with Client.monitorApplication
[ https://issues.apache.org/jira/browse/SPARK-4346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4346: --- Assignee: (was: Apache Spark) YarnClientSchedulerBack.asyncMonitorApplication should be common with Client.monitorApplication --- Key: SPARK-4346 URL: https://issues.apache.org/jira/browse/SPARK-4346 Project: Spark Issue Type: Improvement Components: Scheduler, YARN Reporter: Thomas Graves The YarnClientSchedulerBackend.asyncMonitorApplication routine should move into ClientBase and be made common with monitorApplication. Make sure stop is handled properly. See discussion on https://github.com/apache/spark/pull/3143 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390158#comment-14390158 ] Reynold Xin commented on SPARK-6646: [~sandyryza] That's an excellent idea; I hadn't thought of that. But now that I think about it, there will be a lot of room for DataFrame optimizations on tablets. Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. Mobile friendly version of the design doc: https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5682: --- Assignee: (was: Apache Spark) Add encrypted shuffle in spark -- Key: SPARK-5682 URL: https://issues.apache.org/jira/browse/SPARK-5682 Project: Spark Issue Type: New Feature Components: Shuffle Reporter: liyunzhang_intel Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx, Design Document of Encrypted Spark Shuffle_20150318.docx Encrypted shuffle is enabled in Hadoop 2.6, which makes the process of shuffling data safer. This feature is necessary in Spark. AES is a specification for the encryption of electronic data; it has five common modes of operation, and CTR is one of them. We use two codecs, JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec, to enable Spark encrypted shuffle; they are also used in Hadoop encrypted shuffle. JceAesCtrCryptoCodec uses the encryption algorithms the JDK provides, while OpensslAesCtrCryptoCodec uses the encryption algorithms OpenSSL provides. Because UGI credential info is used in the process of encrypted shuffle, we first enable encrypted shuffle on the Spark-on-YARN framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390157#comment-14390157 ] Apache Spark commented on SPARK-5682: - User 'kellyzly' has created a pull request for this issue: https://github.com/apache/spark/pull/5307 Add encrypted shuffle in spark -- Key: SPARK-5682 URL: https://issues.apache.org/jira/browse/SPARK-5682 Project: Spark Issue Type: New Feature Components: Shuffle Reporter: liyunzhang_intel Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx, Design Document of Encrypted Spark Shuffle_20150318.docx Encrypted shuffle is enabled in Hadoop 2.6, which makes the process of shuffling data safer. This feature is necessary in Spark. AES is a specification for the encryption of electronic data; it has five common modes of operation, and CTR is one of them. We use two codecs, JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec, to enable Spark encrypted shuffle; they are also used in Hadoop encrypted shuffle. JceAesCtrCryptoCodec uses the encryption algorithms the JDK provides, while OpensslAesCtrCryptoCodec uses the encryption algorithms OpenSSL provides. Because UGI credential info is used in the process of encrypted shuffle, we first enable encrypted shuffle on the Spark-on-YARN framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5682: --- Assignee: Apache Spark Add encrypted shuffle in spark -- Key: SPARK-5682 URL: https://issues.apache.org/jira/browse/SPARK-5682 Project: Spark Issue Type: New Feature Components: Shuffle Reporter: liyunzhang_intel Assignee: Apache Spark Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx, Design Document of Encrypted Spark Shuffle_20150318.docx Encrypted shuffle is enabled in Hadoop 2.6, which makes the process of shuffling data safer. This feature is necessary in Spark. AES is a specification for the encryption of electronic data; it has five common modes of operation, and CTR is one of them. We use two codecs, JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec, to enable Spark encrypted shuffle; they are also used in Hadoop encrypted shuffle. JceAesCtrCryptoCodec uses the encryption algorithms the JDK provides, while OpensslAesCtrCryptoCodec uses the encryption algorithms OpenSSL provides. Because UGI credential info is used in the process of encrypted shuffle, we first enable encrypted shuffle on the Spark-on-YARN framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6613) Starting stream from checkpoint causes Streaming tab to throw error
[ https://issues.apache.org/jira/browse/SPARK-6613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390221#comment-14390221 ] zhichao-li commented on SPARK-6613: --- [~msoutier], have you found any solution for this, or are you just reporting the bug? Starting stream from checkpoint causes Streaming tab to throw error --- Key: SPARK-6613 URL: https://issues.apache.org/jira/browse/SPARK-6613 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Reporter: Marius Soutier When continuing my streaming job from a checkpoint, the job runs, but the Streaming tab in the standard UI initially no longer works (the browser just shows HTTP ERROR: 500). Sometimes it gets back to normal after a while, and sometimes it stays in this state permanently. Stacktrace:

{noformat}
WARN org.eclipse.jetty.servlet.ServletHandler: /streaming/
java.util.NoSuchElementException: key not found: 0
	at scala.collection.MapLike$class.default(MapLike.scala:228)
	at scala.collection.AbstractMap.default(Map.scala:58)
	at scala.collection.MapLike$class.apply(MapLike.scala:141)
	at scala.collection.AbstractMap.apply(Map.scala:58)
	at org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1$$anonfun$apply$5.apply(StreamingJobProgressListener.scala:151)
	at org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1$$anonfun$apply$5.apply(StreamingJobProgressListener.scala:150)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.immutable.Range.foreach(Range.scala:141)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
	at scala.collection.AbstractTraversable.map(Traversable.scala:105)
	at org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1.apply(StreamingJobProgressListener.scala:150)
	at org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1.apply(StreamingJobProgressListener.scala:149)
	at scala.Option.map(Option.scala:145)
	at org.apache.spark.streaming.ui.StreamingJobProgressListener.lastReceivedBatchRecords(StreamingJobProgressListener.scala:149)
	at org.apache.spark.streaming.ui.StreamingPage.generateReceiverStats(StreamingPage.scala:82)
	at org.apache.spark.streaming.ui.StreamingPage.render(StreamingPage.scala:43)
	at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68)
	at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68)
	at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:68)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
	at org.eclipse.jetty.server.Server.handle(Server.java:370)
	at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
	at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
	at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
	at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644)
	at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
	at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
	at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
	at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
	at java.lang.Thread.run(Thread.java:745)
{noformat}

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6646: --- Description: Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. Mobile friendly version of the design doc: https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html was: Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. Mobile friendly version of the design doc: https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5989) Model import/export for LDAModel
[ https://issues.apache.org/jira/browse/SPARK-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390209#comment-14390209 ] Manoj Kumar commented on SPARK-5989: Can this be assigned to me? Thanks! Model import/export for LDAModel Key: SPARK-5989 URL: https://issues.apache.org/jira/browse/SPARK-5989 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Add save/load for LDAModel and its local and distributed variants. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390208#comment-14390208 ] Kamal Banga commented on SPARK-6646: We want Spark for Apple Watch. That will be the real breakthrough! Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. Mobile friendly version of the design doc: https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390235#comment-14390235 ] liyunzhang_intel commented on SPARK-5682: - Hi all, there are now two methods to implement SPARK-5682 (add encrypted shuffle in Spark). Method 1: use [Chimera|https://github.com/intel-hadoop/chimera] (Chimera is a project that strips the code related to CryptoInputStream/CryptoOutputStream out of Hadoop to facilitate AES-NI based data encryption in other projects) to implement Spark encrypted shuffle. Pull request: https://github.com/apache/spark/pull/5307. Method 2: add a crypto package to the spark-core module, containing CryptoInputStream.scala, CryptoOutputStream.scala, and so on. Pull request: https://github.com/apache/spark/pull/4491. Which one is better? Any advice/guidance is welcome! Add encrypted shuffle in spark -- Key: SPARK-5682 URL: https://issues.apache.org/jira/browse/SPARK-5682 Project: Spark Issue Type: New Feature Components: Shuffle Reporter: liyunzhang_intel Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx, Design Document of Encrypted Spark Shuffle_20150318.docx Encrypted shuffle is enabled in Hadoop 2.6, which makes the process of shuffling data safer. This feature is necessary in Spark. AES is a specification for the encryption of electronic data; it has five common modes of operation, and CTR is one of them. We use two codecs, JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec, to enable Spark encrypted shuffle; they are also used in Hadoop encrypted shuffle. JceAesCtrCryptoCodec uses the encryption algorithms the JDK provides, while OpensslAesCtrCryptoCodec uses the encryption algorithms OpenSSL provides. Because UGI credential info is used in the process of encrypted shuffle, we first enable encrypted shuffle on the Spark-on-YARN framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
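To make Method 2's shape concrete, a hedged sketch of wrapping a shuffle output stream with CTR-mode AES via the plain JCE API (names are illustrative, not the actual PR; a real implementation must also persist the IV with the stream and derive the key from the UGI credentials mentioned in the description):

{code}
import java.io.OutputStream
import javax.crypto.{Cipher, CipherOutputStream}
import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}

// Illustrative helper: wrap a shuffle block's output stream for encryption
def wrapForEncryption(out: OutputStream, key: Array[Byte], iv: Array[Byte]): OutputStream = {
  val cipher = Cipher.getInstance("AES/CTR/NoPadding")  // JCE path, cf. JceAesCtrCryptoCodec
  cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv))
  new CipherOutputStream(out, cipher)
}
{code}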
[jira] [Resolved] (SPARK-4655) Split Stage into ShuffleMapStage and ResultStage subclasses
[ https://issues.apache.org/jira/browse/SPARK-4655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4655. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4708 [https://github.com/apache/spark/pull/4708] Split Stage into ShuffleMapStage and ResultStage subclasses --- Key: SPARK-4655 URL: https://issues.apache.org/jira/browse/SPARK-4655 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Josh Rosen Assignee: Ilya Ganelin Fix For: 1.4.0 The scheduler's {{Stage}} class has many fields which are only applicable to result stages or shuffle map stages. As a result, I think that it makes sense to make {{Stage}} into an abstract base class with two subclasses, {{ResultStage}} and {{ShuffleMapStage}}. This would improve the understandability of the DAGScheduler code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
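The shape of the refactoring, as a rough sketch (the constructor fields are illustrative, not the actual DAGScheduler code):
{code}
import org.apache.spark.ShuffleDependency
import org.apache.spark.rdd.RDD

// Common fields stay on the abstract base class...
abstract class Stage(val id: Int, val rdd: RDD[_], val numTasks: Int)

// ...while stage-type-specific state moves into the subclasses.
class ShuffleMapStage(id: Int, rdd: RDD[_], numTasks: Int,
    val shuffleDep: ShuffleDependency[_, _, _]) extends Stage(id, rdd, numTasks)

class ResultStage(id: Int, rdd: RDD[_], numTasks: Int) extends Stage(id, rdd, numTasks)
{code}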
[jira] [Updated] (SPARK-6597) Replace `input:checkbox` with `input[type=checkbox]` in additional-metrics.js
[ https://issues.apache.org/jira/browse/SPARK-6597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6597: - Priority: Trivial (was: Minor) Assignee: Kousuke Saruta Replace `input:checkbox` with `input[type=checkbox]` in additional-metrics.js -- Key: SPARK-6597 URL: https://issues.apache.org/jira/browse/SPARK-6597 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.2, 1.3.1, 1.4.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta Priority: Trivial Fix For: 1.4.0 In additional-metrics.js, there is some selector notation like `input:checkbox`, but jQuery's official documentation says `input[type=checkbox]` is better. https://api.jquery.com/checkbox-selector/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6600) Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway
[ https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6600. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5257 [https://github.com/apache/spark/pull/5257] Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway -- Key: SPARK-6600 URL: https://issues.apache.org/jira/browse/SPARK-6600 Project: Spark Issue Type: New Feature Components: EC2 Reporter: Florian Verhein Fix For: 1.4.0 Use case: a user has set up the Hadoop HDFS NFS gateway service on their spark_ec2.py-launched cluster and wants to mount it on their local machine. This requires the following ports to be opened in the incoming rule set for MASTER, for both UDP and TCP: 111, 2049, 4242. (I have tried this and it works.) Note that this issue *does not* cover the implementation of an HDFS NFS gateway module in the spark-ec2 project. See the linked issue. Reference: https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated SPARK-5682: Attachment: Design Document of Encrypted Spark Shuffle_20150401.docx Add encrypted shuffle in spark -- Key: SPARK-5682 URL: https://issues.apache.org/jira/browse/SPARK-5682 Project: Spark Issue Type: New Feature Components: Shuffle Reporter: liyunzhang_intel Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx, Design Document of Encrypted Spark Shuffle_20150318.docx, Design Document of Encrypted Spark Shuffle_20150401.docx Encrypted shuffle was enabled in Hadoop 2.6, which makes shuffled data safer; this feature is necessary in Spark as well. AES is a specification for the encryption of electronic data. There are 5 common modes in AES, and CTR is one of them. We use two codecs, JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec, to enable Spark encrypted shuffle, as is also done for Hadoop encrypted shuffle. JceAesCtrCryptoCodec uses the encryption algorithms the JDK provides, while OpensslAesCtrCryptoCodec uses the encryption algorithms OpenSSL provides. Because UGI credential info is used in the process of encrypted shuffle, we first enable encrypted shuffle on the Spark-on-YARN framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6626) TwitterUtils.createStream documentation error
[ https://issues.apache.org/jira/browse/SPARK-6626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6626: - Priority: Trivial (was: Minor) Assignee: Jayson Sunshine TwitterUtils.createStream documentation error - Key: SPARK-6626 URL: https://issues.apache.org/jira/browse/SPARK-6626 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.3.0 Reporter: Jayson Sunshine Assignee: Jayson Sunshine Priority: Trivial Labels: documentation, easyfix Fix For: 1.3.1, 1.4.0 Original Estimate: 5m Remaining Estimate: 5m At http://spark.apache.org/docs/1.3.0/streaming-programming-guide.html#input-dstreams-and-receivers, under 'Advanced Sources', the documentation provides the following call for Scala: TwitterUtils.createStream(ssc) However, with only one parameter, this method appears to require a jssc object, not an ssc object: http://spark.apache.org/docs/1.3.0/api/java/index.html?org/apache/spark/streaming/twitter/TwitterUtils.html To make the above call work, one must instead provide an Option argument, for example: TwitterUtils.createStream(ssc, None) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6626) TwitterUtils.createStream documentation error
[ https://issues.apache.org/jira/browse/SPARK-6626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6626. -- Resolution: Fixed Fix Version/s: 1.4.0 1.3.1 Issue resolved by pull request 5295 [https://github.com/apache/spark/pull/5295] TwitterUtils.createStream documentation error - Key: SPARK-6626 URL: https://issues.apache.org/jira/browse/SPARK-6626 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.3.0 Reporter: Jayson Sunshine Priority: Minor Labels: documentation, easyfix Fix For: 1.3.1, 1.4.0 Original Estimate: 5m Remaining Estimate: 5m At http://spark.apache.org/docs/1.3.0/streaming-programming-guide.html#input-dstreams-and-receivers, under 'Advanced Sources', the documentation provides the following call for Scala: TwitterUtils.createStream(ssc) However, with only one parameter, this method appears to require a jssc object, not an ssc object: http://spark.apache.org/docs/1.3.0/api/java/index.html?org/apache/spark/streaming/twitter/TwitterUtils.html To make the above call work, one must instead provide an Option argument, for example: TwitterUtils.createStream(ssc, None) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
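For reference, a complete version of the corrected call (a sketch against the 1.3 API; passing None makes twitter4j read credentials from the twitter4j.oauth.* system properties):
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

val conf = new SparkConf().setMaster("local[2]").setAppName("twitter-docs-example")
val ssc = new StreamingContext(conf, Seconds(10))
// The explicit None selects the Scala overload, avoiding the jssc ambiguity above.
val tweets = TwitterUtils.createStream(ssc, None)
tweets.map(_.getText).print()
ssc.start()
{code}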
[jira] [Updated] (SPARK-6600) Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway
[ https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6600: - Priority: Minor (was: Major) Assignee: Florian Verhein Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway -- Key: SPARK-6600 URL: https://issues.apache.org/jira/browse/SPARK-6600 Project: Spark Issue Type: New Feature Components: EC2 Reporter: Florian Verhein Assignee: Florian Verhein Priority: Minor Fix For: 1.4.0 Use case: a user has set up the Hadoop HDFS NFS gateway service on their spark_ec2.py-launched cluster and wants to mount it on their local machine. This requires the following ports to be opened in the incoming rule set for MASTER, for both UDP and TCP: 111, 2049, 4242. (I have tried this and it works.) Note that this issue *does not* cover the implementation of an HDFS NFS gateway module in the spark-ec2 project. See the linked issue. Reference: https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6597) Replace `input:checkbox` with `input[type=checkbox]` in additional-metrics.js
[ https://issues.apache.org/jira/browse/SPARK-6597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6597. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5254 [https://github.com/apache/spark/pull/5254] Replace `input:checkbox` with `input[type=checkbox]` in additional-metrics.js -- Key: SPARK-6597 URL: https://issues.apache.org/jira/browse/SPARK-6597 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.2, 1.3.1, 1.4.0 Reporter: Kousuke Saruta Priority: Minor Fix For: 1.4.0 In additional-metrics.js, there is some selector notation like `input:checkbox`, but jQuery's official documentation says `input[type=checkbox]` is better. https://api.jquery.com/checkbox-selector/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6630) SparkConf.setIfMissing should only evaluate the assigned value if indeed missing
[ https://issues.apache.org/jira/browse/SPARK-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390350#comment-14390350 ] Sean Owen commented on SPARK-6630: -- This should be as simple as {{def setIfMissing(key: String, value: => String): SparkConf = ...}} if I'm not mistaken about how this works in Scala? Would you like to make a PR and verify that it lazily evaluates? I can't think of a scenario where it would be important to always evaluate the argument. SparkConf.setIfMissing should only evaluate the assigned value if indeed missing Key: SPARK-6630 URL: https://issues.apache.org/jira/browse/SPARK-6630 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: Svend Vanderveken Priority: Minor The method setIfMissing() in SparkConf currently systematically evaluates the right-hand side of the assignment even when it is not used. This leads to unnecessary computation, as in the case of {code} conf.setIfMissing("spark.driver.host", Utils.localHostName()) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
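A minimal sketch of the by-name suggestion, using a toy class rather than the real SparkConf (the names here are illustrative only):
{code}
// With a by-name parameter (value: => String), the right-hand side is
// evaluated only if the key is actually missing.
class Conf {
  private val settings = scala.collection.mutable.Map.empty[String, String]
  def setIfMissing(key: String, value: => String): Conf = {
    if (!settings.contains(key)) settings(key) = value
    this
  }
}

val conf = new Conf
conf.setIfMissing("spark.driver.host", { println("computed"); "localhost" })  // prints "computed"
conf.setIfMissing("spark.driver.host", { println("skipped"); "other" })       // prints nothing
{code}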
[jira] [Updated] (SPARK-6644) After adding new columns to a partitioned table and inserting data to an old partition, data of newly added columns are all NULL
[ https://issues.apache.org/jira/browse/SPARK-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6644: -- Description: In Hive, the schema of a partition may differ from the table schema. For example, we may add new columns to the table after importing existing partitions. When using {{spark-sql}} to query the data in a partition whose schema is different from the table schema, problems may arise. Part of them have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. However, after adding new column(s) to the table, when inserting data into old partitions, values of newly added columns are all {{NULL}}. The following snippet can be used to reproduce this issue: {code} case class TestData(key: Int, value: String) val testData = TestHive.sparkContext.parallelize((1 to 2).map(i => TestData(i, i.toString))).toDF() testData.registerTempTable("testData") sql("DROP TABLE IF EXISTS table_with_partition") sql(s"CREATE TABLE IF NOT EXISTS table_with_partition (key INT, value STRING) PARTITIONED BY (ds STRING) LOCATION '${tmpDir.toURI.toString}'") sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value FROM testData") // Add new columns to the table sql("ALTER TABLE table_with_partition ADD COLUMNS (key1 STRING)") sql("ALTER TABLE table_with_partition ADD COLUMNS (destlng DOUBLE)") sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value, 'test', 1.11 FROM testData") sql("SELECT * FROM table_with_partition WHERE ds = '1'").collect().foreach(println) {code} Actual result: {noformat} [1,1,null,null,1] [2,2,null,null,1] {noformat} Expected result: {noformat} [1,1,test,1.11,1] [2,2,test,1.11,1] {noformat} was: In Hive, the schema of a partition may differ from the table schema. For example, we may add new columns to the table after importing existing partitions. When using {{spark-sql}} to query the data in a partition whose schema is different from the table schema, problems may arise. Part of them have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. However, after adding new column(s) to the table, when inserting data into old partitions, values of newly added columns are all {{NULL}}. 
The following snippet can be used to reproduce this issue: {code} case class TestData(key: Int, value: String) val testData = TestHive.sparkContext.parallelize((1 to 2).map(i => TestData(i, i.toString))).toDF() testData.registerTempTable("testData") sql("DROP TABLE IF EXISTS table_with_partition") sql(s"CREATE TABLE IF NOT EXISTS table_with_partition(key int, value string) PARTITIONED by (ds string) location '${tmpDir.toURI.toString}'") sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value FROM testData") // Add new columns to the table sql("ALTER TABLE table_with_partition ADD COLUMNS(key1 string)") sql("ALTER TABLE table_with_partition ADD COLUMNS(destlng double)") sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value, 'test', 1.11 FROM testData") sql("SELECT * FROM table_with_partition WHERE ds = '1'").collect().foreach(println) {code} Actual result: {noformat} [1,1,null,null,1] [2,2,null,null,1] {noformat} Expected result: {noformat} [1,1,test,1.11,1] [2,2,test,1.11,1] {noformat} After adding new columns to a partitioned table and inserting data to an old partition, data of newly added columns are all NULL Key: SPARK-6644 URL: https://issues.apache.org/jira/browse/SPARK-6644 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: dongxu In Hive, the schema of a partition may differ from the table schema. For example, we may add new columns to the table after importing existing partitions. When using {{spark-sql}} to query the data in a partition whose schema is different from the table schema, problems may arise. Part of them have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. However, after adding new column(s) to the table, when inserting data into old partitions, values of newly added columns are all {{NULL}}. The following snippet can be used to reproduce this issue: {code} case class TestData(key: Int, value: String) val testData = TestHive.sparkContext.parallelize((1 to 2).map(i => TestData(i, i.toString))).toDF() testData.registerTempTable("testData") sql("DROP TABLE IF EXISTS table_with_partition") sql(s"CREATE TABLE IF NOT EXISTS table_with_partition (key INT, value STRING) PARTITIONED BY (ds STRING) LOCATION '${tmpDir.toURI.toString}'") sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value FROM testData") // Add new columns to the table
[jira] [Commented] (SPARK-6631) I am unable to get the Maven Build file in Example 2.13 to build anything but an empty file
[ https://issues.apache.org/jira/browse/SPARK-6631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390114#comment-14390114 ] Frank Domoney commented on SPARK-6631: -- Incidentally, can you get the Debian build of Spark 1.3 to work? mvn -Pdeb -DskipTests clean package Mine fails to build. I suspect that the Debian package might be the correct one for Ubuntu 14.04 and Java 8. Caused by: org.vafer.jdeb.PackagingException: Could not create deb package at org.vafer.jdeb.Processor.createDeb(Processor.java:171) at org.vafer.jdeb.maven.DebMaker.makeDeb(DebMaker.java:244) ... 22 more Caused by: org.vafer.jdeb.PackagingException: Control file descriptor keys are invalid [Version]. The following keys are mandatory [Package, Version, Section, Priority, Architecture, Maintainer, Description]. Please check your pom.xml/build.xml and your control file. at org.vafer.jdeb.Processor.createDeb(Processor.java:142) ... 23 more [INFO I am unable to get the Maven Build file in Example 2.13 to build anything but an empty file --- Key: SPARK-6631 URL: https://issues.apache.org/jira/browse/SPARK-6631 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.3.0 Environment: Ubuntu 14.04 Reporter: Frank Domoney Priority: Blocker I have downloaded and built Spark 1.3.0 under Ubuntu 14.04 but have been unable to get reduceByKey to work on what seems to be a valid RDD using the command line. scala> counts.take(10) res17: Array[(String, Int)] = Array((Vladimir,1), (Putin,1), (has,1), (said,1), (Russia,1), (will,1), (fight,1), (for,1), (an,1), (independent,1)) scala> val counts1 = counts.reduceByKey { case (x, y) => x + y } counts1.take(10) res16: Array[(String, Int)] = Array() I am attempting to build the Maven sequence in Example 2.15 but get the following results: Building example 0.0.1 [INFO] [INFO] [INFO] --- maven-resources-plugin:2.3:resources (default-resources) @ learning-spark-mini-example --- [WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent! [INFO] skip non existing resourceDirectory /home/panzerfrank/Downloads/spark-1.3.0/wordcount/src/main/resources [INFO] [INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ learning-spark-mini-example --- [INFO] No sources to compile [INFO] [INFO] --- maven-resources-plugin:2.3:testResources (default-testResources) @ learning-spark-mini-example --- [WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent! [INFO] skip non existing resourceDirectory /home/panzerfrank/Downloads/spark-1.3.0/wordcount/src/test/resources [INFO] [INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @ learning-spark-mini-example --- [INFO] No sources to compile [INFO] [INFO] --- maven-surefire-plugin:2.10:test (default-test) @ learning-spark-mini-example --- [INFO] No tests to run. [INFO] Surefire report directory: /home/panzerfrank/Downloads/spark-1.3.0/wordcount/target/surefire-reports --- maven-jar-plugin:2.2:jar (default-jar) @ learning-spark-mini-example --- [WARNING] JAR will be empty - no content was marked for inclusion! [INFO] Building jar: /home/panzerfrank/Downloads/spark-1.3.0/wordcount/target/learning-spark-mini-example-0.0.1.jar I am using the POM file from Example 2-13. Java is Java 8. Am I doing something really stupid? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390146#comment-14390146 ] Cong Yue commented on SPARK-6646: - Very cool idea. Today's smartphones have much better performance than the servers of 5-8 years ago. But in mobile networks, the data transfer speed between nodes cannot be as stable as between servers. So parallel computing can benefit from the CPUs, but the bottleneck will be the mobile network. Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390196#comment-14390196 ] Sandy Ryza commented on SPARK-6646: --- [~srowen] I like the way you think. I know a lot of good nodes out there looking for love or at least a casual shutdown hookup. Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. Mobile friendly version of the design doc: https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390205#comment-14390205 ] Aaron Davidson commented on SPARK-6646: --- Please help, I tried putting spark on iphone but it ignited and now no phone. Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. Mobile friendly version of the design doc: https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4927) Spark does not clean up properly during long jobs.
[ https://issues.apache.org/jira/browse/SPARK-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4927. -- Resolution: Cannot Reproduce I've tried to reproduce this a few ways and wasn't able to; it may have been fixed since. It can be reopened if there is a reproduction against 1.3+. Spark does not clean up properly during long jobs. --- Key: SPARK-4927 URL: https://issues.apache.org/jira/browse/SPARK-4927 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Ilya Ganelin On a long-running Spark job, Spark will eventually run out of memory on the driver node due to metadata overhead from the shuffle operation. Spark will continue to operate, but with drastically decreased performance (since swapping now occurs with every operation). The spark.cleaner.ttl parameter allows a user to configure when cleanup happens, but the issue with doing this is that it isn’t done safely, e.g. if this clears a cached RDD or active task in the middle of processing a stage, it ultimately causes a KeyNotFoundException when the next stage attempts to reference the cleared RDD or task. There should be a sustainable mechanism for cleaning up stale metadata that allows the program to continue running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1001) Memory leak when reading sequence file and then sorting
[ https://issues.apache.org/jira/browse/SPARK-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1001. -- Resolution: Cannot Reproduce Memory leak when reading sequence file and then sorting --- Key: SPARK-1001 URL: https://issues.apache.org/jira/browse/SPARK-1001 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 0.8.0 Reporter: Matthew Cheah Labels: Hadoop, Memory Spark appears to build up a backlog of unreachable byte arrays when an RDD is constructed from a sequence file, and then that RDD is sorted. I have a class that wraps a Java ArrayList and can be serialized and written to a Hadoop SequenceFile (i.e. it implements the Writable interface). Let's call it WritableDataRow. It can take a Java List as its argument to wrap around, and also has a copy constructor. Setup: 10 slaves, launched via EC2, 65.9GB RAM each, dataset is 100GB of text, 120GB when in sequence file format (not using compression to compact the bytes). CDH4.2.0-backed hadoop cluster. First, building the RDD from a CSV and then sorting on index 1 works fine: {code} scala> import scala.collection.JavaConversions._ // Other imports here as well import scala.collection.JavaConversions._ scala> val rddAsTextFile = sc.textFile("s3n://some-bucket/events-*.csv") rddAsTextFile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:14 scala> val rddAsWritableDataRows = rddAsTextFile.map(x => new WritableDataRow(x.split("\\|").toList)) rddAsWritableDataRows: org.apache.spark.rdd.RDD[com.palantir.finance.datatable.server.spark.WritableDataRow] = MappedRDD[2] at map at <console>:19 scala> val rddAsKeyedWritableDataRows = rddAsWritableDataRows.map(x => (x.getContents().get(1).toString(), x)); rddAsKeyedWritableDataRows: org.apache.spark.rdd.RDD[(String, com.palantir.finance.datatable.server.spark.WritableDataRow)] = MappedRDD[4] at map at <console>:22 scala> val orderedFunct = new org.apache.spark.rdd.OrderedRDDFunctions[String, WritableDataRow, (String, WritableDataRow)](rddAsKeyedWritableDataRows) orderedFunct: org.apache.spark.rdd.OrderedRDDFunctions[String,com.palantir.finance.datatable.server.spark.WritableDataRow,(String, com.palantir.finance.datatable.server.spark.WritableDataRow)] = org.apache.spark.rdd.OrderedRDDFunctions@587acb54 scala> orderedFunct.sortByKey(true).count(); // Actually triggers the computation, as stated in a different e-mail thread res0: org.apache.spark.rdd.RDD[(String, com.palantir.finance.datatable.server.spark.WritableDataRow)] = MapPartitionsRDD[8] at sortByKey at <console>:27 {code} The above works without too many surprises. 
I then save it as a Sequence File (using JavaPairRDD as a way to more easily call saveAsHadoopFile(), and this is how it's done in our Java-based application): {code} scala> val pairRDD = new JavaPairRDD(rddAsWritableDataRows.map(x => (NullWritable.get(), x))); pairRDD: org.apache.spark.api.java.JavaPairRDD[org.apache.hadoop.io.NullWritable,com.palantir.finance.datatable.server.spark.WritableDataRow] = org.apache.spark.api.java.JavaPairRDD@8d2e9d9 scala> pairRDD.saveAsHadoopFile("hdfs://hdfs-master-url:9010/blah", classOf[NullWritable], classOf[WritableDataRow], classOf[org.apache.hadoop.mapred.SequenceFileOutputFormat[NullWritable, WritableDataRow]]); … 2013-12-11 20:09:14,444 [main] INFO org.apache.spark.SparkContext - Job finished: saveAsHadoopFile at <console>:26, took 1052.116712748 s {code} And now I want to get the RDD from the sequence file and sort THAT, and this is when I monitor Ganglia and ps aux and notice the memory usage climbing ridiculously: {code} scala> val rddAsSequenceFile = sc.sequenceFile("hdfs://hdfs-master-url:9010/blah", classOf[NullWritable], classOf[WritableDataRow]).map(x => new WritableDataRow(x._2)); // Invokes copy constructor to get around re-use of writable objects rddAsSequenceFile: org.apache.spark.rdd.RDD[com.palantir.finance.datatable.server.spark.WritableDataRow] = MappedRDD[19] at map at <console>:19 scala> val orderedFunct = new org.apache.spark.rdd.OrderedRDDFunctions[String, WritableDataRow, (String, WritableDataRow)](rddAsSequenceFile.map(x => (x.getContents().get(1).toString(), x))) orderedFunct: org.apache.spark.rdd.OrderedRDDFunctions[String,com.palantir.finance.datatable.server.spark.WritableDataRow,(String, com.palantir.finance.datatable.server.spark.WritableDataRow)] = org.apache.spark.rdd.OrderedRDDFunctions@6262a9a6 scala> orderedFunct.sortByKey().count(); {code} (On the necessity to copy writables from hadoop RDDs, see: https://mail-archives.apache.org/mod_mbox/spark-user/201308.mbox/%3ccaf_kkpzrq4otyqvwcoc6plaz9x9_sfo33u4ysatki5ptqoy...@mail.gmail.com%3E ) I got a
[jira] [Created] (SPARK-6647) Make trait StringComparison as BinaryPredicate and throw error when Predicate can't translate to data source Filter
Liang-Chi Hsieh created SPARK-6647: -- Summary: Make trait StringComparison as BinaryPredicate and throw error when Predicate can't translate to data source Filter Key: SPARK-6647 URL: https://issues.apache.org/jira/browse/SPARK-6647 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh Currently, trait {{StringComparison}} is a {{BinaryExpression}}. In fact, it should be a {{BinaryPredicate}}. By making {{StringComparison}} a {{BinaryPredicate}}, we can throw an error when an {{expressions.Predicate}} can't be translated to a data source {{Filter}} in the function {{selectFilters}}. Without this modification, because we wrap a {{Filter}} outside the scanned results in {{pruneFilterProjectRaw}}, we can't detect that something went wrong when translating predicates to filters in {{selectFilters}}. The unit test of SPARK-6625 demonstrates this problem: in that PR, even though {{expressions.Contains}} is not properly translated to {{sources.StringContains}}, the filtering is still performed by the {{Filter}}, so the test passes. Of course, with this modification, every {{expressions.Predicate}} class needs a corresponding data source {{Filter}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
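A toy model of the idea (these are stand-in types, not the actual Catalyst/sources classes): once every string comparison is a predicate, the translation step can fail loudly instead of silently falling back on the residual {{Filter}} node:
{code}
sealed trait Predicate
case class Contains(attribute: String, value: String) extends Predicate

sealed trait SourceFilter
case class StringContains(attribute: String, value: String) extends SourceFilter

// Translate each predicate or throw, rather than quietly dropping it.
def selectFilters(predicates: Seq[Predicate]): Seq[SourceFilter] = predicates.map {
  case Contains(a, v) => StringContains(a, v)
  case other => throw new IllegalArgumentException(s"No data source Filter for: $other")
}
{code}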
[jira] [Updated] (SPARK-3884) If deploy mode is cluster, --driver-memory shouldn't apply to client JVM
[ https://issues.apache.org/jira/browse/SPARK-3884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3884: - Component/s: (was: Spark Core) Spark Submit Affects Version/s: 1.2.0 1.3.0 If deploy mode is cluster, --driver-memory shouldn't apply to client JVM Key: SPARK-3884 URL: https://issues.apache.org/jira/browse/SPARK-3884 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 1.1.0, 1.2.0, 1.3.0 Reporter: Sandy Ryza Assignee: Marcelo Vanzin Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390373#comment-14390373 ] Steve Loughran commented on SPARK-6646: --- Obviously the barrier will be data source access; talking to remote data is going to run up bills. # CouchDB has an offline mode, so its RDD/Dataframe support would allow spark-mobile to work in embedded mode. # Hadoop 2.8 adds hardware CRC on ARM parts for HDFS (HADOOP-11660). A {{MiniHDFSCluster}} could be instantiated locally to benefit from this. # alternatively, mDNS could be used to discover and dynamically build up an HDFS cluster from nearby devices, MANET-style. The limited connectivity guarantees of moving devices mean that a block size of 1536 bytes would be appropriate; probably 1KB blocks are safest. # Those nodes on the network with limited CPU power but access to external power supplies, such as toasters and coffee machines, could have a role as the persistent co-ordinators of work and as HDFS Namenodes, as well as being used as the preferred routers of wifi packets. # It may be necessary to extend the hadoop {{s3://}} filesystem with the notion of monthly data quotas. Possibly even roaming and non-roaming quotas. The S3 client would need to query the runtime to determine whether it was at home vs roaming and use the relevant quota. Apps could then set something like {code} fs.s3.quota.home=15GB fs.s3.quota.roaming=2GB {code} Dealing with use abroad would be more complex, as if a cost value were to be included, exchange rates would have to be dynamically assessed. # It may be interesting to consider the notion of having devices publish some of their data (photos, HealthKit history, movement history) to other devices nearby. If one phone could enumerate those nearby **and submit work to them**, the bandwidth problems could be addressed. Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. Mobile friendly version of the design doc: https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4544) Spark JVM Metrics doesn't have context.
[ https://issues.apache.org/jira/browse/SPARK-4544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4544. -- Resolution: Duplicate I'd like to bundle this under SPARK-5847, which proposes more general control over the namespacing, which could include instance as a higher-level grouping than the current app ID. Spark JVM Metrics doesn't have context. --- Key: SPARK-4544 URL: https://issues.apache.org/jira/browse/SPARK-4544 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Sreepathi Prasanna If we enable JVM metrics for executor, master, worker, and driver instances, we don't have context about where they are coming from. This can be an issue if we are collecting all the metrics from different instances and storing them into a common datastore. This mainly concerns running Spark on YARN, but I believe Spark standalone also has this problem. It would be good if we attached some context to JVM metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3967) Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions
[ https://issues.apache.org/jira/browse/SPARK-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3967: - Component/s: (was: Spark Core) YARN Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions - Key: SPARK-3967 URL: https://issues.apache.org/jira/browse/SPARK-3967 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Christophe Préaud Attachments: spark-1.1.0-utils-fetch.patch, spark-1.1.0-yarn_cluster_tmpdir.patch Spark applications fail from time to time in yarn-cluster mode (but not in yarn-client mode) when yarn.nodemanager.local-dirs (Hadoop YARN config) is set to a comma-separated list of directories which are located on different disks/partitions. Steps to reproduce: 1. Set yarn.nodemanager.local-dirs (in yarn-site.xml) to a list of directories located on different partitions (the more you set, the more likely it will be to reproduce the bug): (...) <property> <name>yarn.nodemanager.local-dirs</name> <value>file:/d1/yarn/local/nm-local-dir,file:/d2/yarn/local/nm-local-dir,file:/d3/yarn/local/nm-local-dir,file:/d4/yarn/local/nm-local-dir,file:/d5/yarn/local/nm-local-dir,file:/d6/yarn/local/nm-local-dir,file:/d7/yarn/local/nm-local-dir</value> </property> (...) 2. Launch (several times) an application in yarn-cluster mode; it will fail (apparently randomly) from time to time -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6647) Make trait StringComparison as BinaryPredicate and throw error when Predicate can't translate to data source Filter
[ https://issues.apache.org/jira/browse/SPARK-6647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6647: --- Assignee: (was: Apache Spark) Make trait StringComparison as BinaryPredicate and throw error when Predicate can't translate to data source Filter --- Key: SPARK-6647 URL: https://issues.apache.org/jira/browse/SPARK-6647 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh Currently, trait {{StringComparison}} is a {{BinaryExpression}}. In fact, it should be a {{BinaryPredicate}}. By making {{StringComparison}} a {{BinaryPredicate}}, we can throw an error when an {{expressions.Predicate}} can't be translated to a data source {{Filter}} in the function {{selectFilters}}. Without this modification, because we wrap a {{Filter}} outside the scanned results in {{pruneFilterProjectRaw}}, we can't detect that something went wrong when translating predicates to filters in {{selectFilters}}. The unit test of SPARK-6625 demonstrates this problem: in that PR, even though {{expressions.Contains}} is not properly translated to {{sources.StringContains}}, the filtering is still performed by the {{Filter}}, so the test passes. Of course, with this modification, every {{expressions.Predicate}} class needs a corresponding data source {{Filter}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3884) If deploy mode is cluster, --driver-memory shouldn't apply to client JVM
[ https://issues.apache.org/jira/browse/SPARK-3884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3884. -- Resolution: Fixed Fix Version/s: 1.4.0 Assignee: Marcelo Vanzin (was: Sandy Ryza) Target Version/s: (was: 1.1.2, 1.2.1) This is fixed in 1.4 due to the new launcher implementation. I verified that in yarn-cluster mode the SparkSubmit JVM is not run with -Xms / -Xmx set, but instead passes through spark.driver.memory in --conf. In yarn-client mode, it does set -Xms / -Xmx. If deploy mode is cluster, --driver-memory shouldn't apply to client JVM Key: SPARK-3884 URL: https://issues.apache.org/jira/browse/SPARK-3884 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Sandy Ryza Assignee: Marcelo Vanzin Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390367#comment-14390367 ] Nan Zhu commented on SPARK-6646: super cool, Spark enables Bigger than Bigger Data in mobile phones Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. Mobile friendly version of the design doc: https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4799) Spark should not rely on local host being resolvable on every node
[ https://issues.apache.org/jira/browse/SPARK-4799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4799. -- Resolution: Duplicate Target Version/s: (was: 1.2.1) Looks like this was subsumed by SPARK-5078 and SPARK_LOCAL_HOSTNAME Spark should not rely on local host being resolvable on every node -- Key: SPARK-4799 URL: https://issues.apache.org/jira/browse/SPARK-4799 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: Tested a Spark+Mesos cluster on top of Docker to reproduce the issue. Reporter: Santiago M. Mola Spark fails when a node hostname is not resolvable by other nodes. See an example trace: {code} 14/12/09 17:02:41 ERROR SendingConnection: Error connecting to 27e434cf36ac:35093 java.nio.channels.UnresolvedAddressException at sun.nio.ch.Net.checkAddress(Net.java:127) at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:644) at org.apache.spark.network.SendingConnection.connect(Connection.scala:299) at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:278) at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139) {code} The relevant code is here: https://github.com/apache/spark/blob/bcb5cdad614d4fce43725dfec3ce88172d2f8c11/core/src/main/scala/org/apache/spark/network/nio/ConnectionManager.scala#L170 {code} val id = new ConnectionManagerId(Utils.localHostName, serverChannel.socket.getLocalPort) {code} This piece of code should use the host IP with Utils.localIpAddress, or a method that acknowledges user settings (e.g. SPARK_LOCAL_IP). Since I cannot think of a use case for using the hostname here, I'm creating a PR with the former solution, but if you think the latter is better, I'm willing to create a new PR with a more elaborate fix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6647) Make trait StringComparison as BinaryPredicate and throw error when Predicate can't translate to data source Filter
[ https://issues.apache.org/jira/browse/SPARK-6647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6647: --- Assignee: Apache Spark Make trait StringComparison as BinaryPredicate and throw error when Predicate can't translate to data source Filter --- Key: SPARK-6647 URL: https://issues.apache.org/jira/browse/SPARK-6647 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh Assignee: Apache Spark Currently, trait {{StringComparison}} is a {{BinaryExpression}}. In fact, it should be a {{BinaryPredicate}}. By making {{StringComparison}} a {{BinaryPredicate}}, we can throw an error when an {{expressions.Predicate}} can't be translated to a data source {{Filter}} in the function {{selectFilters}}. Without this modification, because we wrap a {{Filter}} outside the scanned results in {{pruneFilterProjectRaw}}, we can't detect that something went wrong when translating predicates to filters in {{selectFilters}}. The unit test of SPARK-6625 demonstrates this problem: in that PR, even though {{expressions.Contains}} is not properly translated to {{sources.StringContains}}, the filtering is still performed by the {{Filter}}, so the test passes. Of course, with this modification, every {{expressions.Predicate}} class needs a corresponding data source {{Filter}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6630) SparkConf.setIfMissing should only evaluate the assigned value if indeed missing
[ https://issues.apache.org/jira/browse/SPARK-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390421#comment-14390421 ] Svend Vanderveken commented on SPARK-6630: -- Thanks for your comment. I agree with the resolution; I only found the time to open the JIRA yesterday. I'll submit the corresponding PR shortly, promised :) SparkConf.setIfMissing should only evaluate the assigned value if indeed missing Key: SPARK-6630 URL: https://issues.apache.org/jira/browse/SPARK-6630 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: Svend Vanderveken Priority: Minor The method setIfMissing() in SparkConf currently systematically evaluates the right-hand side of the assignment even when it is not used. This leads to unnecessary computation, as in the case of {code} conf.setIfMissing("spark.driver.host", Utils.localHostName()) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6613) Starting stream from checkpoint causes Streaming tab to throw error
[ https://issues.apache.org/jira/browse/SPARK-6613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390423#comment-14390423 ] Marius Soutier commented on SPARK-6613: --- Bug report. Starting stream from checkpoint causes Streaming tab to throw error --- Key: SPARK-6613 URL: https://issues.apache.org/jira/browse/SPARK-6613 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Reporter: Marius Soutier When continuing my streaming job from a checkpoint, the job runs, but the Streaming tab in the standard UI initially no longer works (browser just shows HTTP ERROR: 500). Sometimes it gets back to normal after a while, and sometimes it stays in this state permanently. Stacktrace: WARN org.eclipse.jetty.servlet.ServletHandler: /streaming/ java.util.NoSuchElementException: key not found: 0 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.MapLike$class.apply(MapLike.scala:141) at scala.collection.AbstractMap.apply(Map.scala:58) at org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1$$anonfun$apply$5.apply(StreamingJobProgressListener.scala:151) at org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1$$anonfun$apply$5.apply(StreamingJobProgressListener.scala:150) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.Range.foreach(Range.scala:141) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1.apply(StreamingJobProgressListener.scala:150) at org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1.apply(StreamingJobProgressListener.scala:149) at scala.Option.map(Option.scala:145) at org.apache.spark.streaming.ui.StreamingJobProgressListener.lastReceivedBatchRecords(StreamingJobProgressListener.scala:149) at org.apache.spark.streaming.ui.StreamingPage.generateReceiverStats(StreamingPage.scala:82) at org.apache.spark.streaming.ui.StreamingPage.render(StreamingPage.scala:43) at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68) at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68) at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:68) at javax.servlet.http.HttpServlet.service(HttpServlet.java:735) at javax.servlet.http.HttpServlet.service(HttpServlet.java:848) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:370) at 
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494) at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82) at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667) at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA
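The underlying failure is Scala's {{Map.apply}} on a missing key. A minimal sketch of the defensive pattern (illustrative only, not the actual listener fix): a listener restored from a checkpoint can be asked about stream ids that have no recorded data yet, so the lookup needs a default:
{code}
val recordsPerStream: Map[Int, Long] = Map(1 -> 100L)  // stream id -> record count
recordsPerStream(0)                 // throws NoSuchElementException: key not found: 0
recordsPerStream.getOrElse(0, 0L)   // returns 0L instead
{code}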
[jira] [Assigned] (SPARK-6643) Python API for StandardScalerModel
[ https://issues.apache.org/jira/browse/SPARK-6643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6643: --- Assignee: (was: Apache Spark) Python API for StandardScalerModel -- Key: SPARK-6643 URL: https://issues.apache.org/jira/browse/SPARK-6643 Project: Spark Issue Type: Task Components: MLlib Affects Versions: 1.3.0 Reporter: Kai Sasaki Priority: Minor Labels: mllib, python Fix For: 1.4.0 This is the sub-task of SPARK-6254. Wrap missing method for {{StandardScalerModel}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6643) Python API for StandardScalerModel
[ https://issues.apache.org/jira/browse/SPARK-6643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390610#comment-14390610 ] Apache Spark commented on SPARK-6643: - User 'Lewuathe' has created a pull request for this issue: https://github.com/apache/spark/pull/5310 Python API for StandardScalerModel -- Key: SPARK-6643 URL: https://issues.apache.org/jira/browse/SPARK-6643 Project: Spark Issue Type: Task Components: MLlib Affects Versions: 1.3.0 Reporter: Kai Sasaki Priority: Minor Labels: mllib, python Fix For: 1.4.0 This is the sub-task of SPARK-6254. Wrap missing method for {{StandardScalerModel}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6648) Reading Parquet files with different sub-files doesn't work
Marius Soutier created SPARK-6648: - Summary: Reading Parquet files with different sub-files doesn't work Key: SPARK-6648 URL: https://issues.apache.org/jira/browse/SPARK-6648 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Reporter: Marius Soutier When reading from multiple parquet files (via sqlContext.parquetFile("/path/1.parquet,/path/2.parquet")), if the parquet files were created using a different coalesce, the reading fails with: ERROR c.w.r.websocket.ParquetReader default-dispatcher-63 : Failed reading parquet file java.lang.IllegalArgumentException: Could not find Parquet metadata at path path at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459) ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1] at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459) ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1] at scala.Option.getOrElse(Option.scala:120) ~[org.scala-lang.scala-library-2.10.4.jar:na] at org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.scala:458) ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1] at org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:477) ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1] at org.apache.spark.sql.parquet.ParquetRelation.<init>(ParquetRelation.scala:65) ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1] at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:165) ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1] I haven't tested with Spark 1.3 yet but will report back after upgrading to 1.3.1 (as soon as it's released). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
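A possible workaround until this is resolved, assuming all files share a schema and a SQLContext named sqlContext as in the report (a sketch against the 1.2.1 API, untested):
{code}
// Read each Parquet directory separately and union the results, sidestepping
// the combined metadata lookup that fails above.
val paths = Seq("/path/1.parquet", "/path/2.parquet")
val combined = paths.map(p => sqlContext.parquetFile(p)).reduce(_ unionAll _)
{code}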
[jira] [Updated] (SPARK-6648) Reading Parquet files with different sub-files doesn't work
[ https://issues.apache.org/jira/browse/SPARK-6648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marius Soutier updated SPARK-6648: -- Description: When reading from multiple parquet files (via sqlContext.parquetFile("/path/1.parquet", "/path/2.parquet")), and one of the parquet files is being overwritten using a different coalesce (e.g. one only contains part-r-1.parquet, the other also part-r-2.parquet, part-r-3.parquet), the reading fails with: ERROR c.w.r.websocket.ParquetReader default-dispatcher-63 : Failed reading parquet file java.lang.IllegalArgumentException: Could not find Parquet metadata at path path at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459) ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1] at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459) ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1] at scala.Option.getOrElse(Option.scala:120) ~[org.scala-lang.scala-library-2.10.4.jar:na] at org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.scala:458) ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1] at org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:477) ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1] at org.apache.spark.sql.parquet.ParquetRelation.init(ParquetRelation.scala:65) ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1] at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:165) ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1] I haven't tested with Spark 1.3 yet but will report back after upgrading to 1.3.1 (as soon as it's released). was: When reading from multiple parquet files (via sqlContext.parquetFile("/path/1.parquet", "/path/2.parquet")), if the parquet files were created using a different coalesce (e.g. one only contains part-r-1.parquet, the other also part-r-2.parquet, part-r-3.parquet), the reading fails with: ERROR c.w.r.websocket.ParquetReader default-dispatcher-63 : Failed reading parquet file java.lang.IllegalArgumentException: Could not find Parquet metadata at path path at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459) ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1] at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459) ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1] at scala.Option.getOrElse(Option.scala:120) ~[org.scala-lang.scala-library-2.10.4.jar:na] at org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.scala:458) ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1] at org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:477) ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1] at org.apache.spark.sql.parquet.ParquetRelation.init(ParquetRelation.scala:65) ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1] at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:165) ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1] I haven't tested with Spark 1.3 yet but will report back after upgrading to 1.3.1 (as soon as it's released). 
Reading Parquet files with different sub-files doesn't work --- Key: SPARK-6648 URL: https://issues.apache.org/jira/browse/SPARK-6648 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Reporter: Marius Soutier When reading from multiple parquet files (via sqlContext.parquetFile("/path/1.parquet", "/path/2.parquet")), and one of the parquet files is being overwritten using a different coalesce (e.g. one only contains part-r-1.parquet, the other also part-r-2.parquet, part-r-3.parquet), the reading fails with: ERROR c.w.r.websocket.ParquetReader default-dispatcher-63 : Failed reading parquet file java.lang.IllegalArgumentException: Could not find Parquet metadata at path path at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459) ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1] at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459) ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1] at scala.Option.getOrElse(Option.scala:120) ~[org.scala-lang.scala-library-2.10.4.jar:na] at org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.scala:458) ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1] at org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:477)
[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390782#comment-14390782 ] Evan Sparks commented on SPARK-6646: Guys - you're clearly ignoring prior work. The database community solved this problem 20 years ago with the Gubba project - a mature prototype [can be seen here|http://i.imgur.com/FJK7K9x.jpg]. Additionally, everyone knows that joins don't scale on iOS, and you'll never be able to build indexes on this platform. Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. Mobile friendly version of the design doc: https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6433) hive tests to import spark-sql test JAR for QueryTest access
[ https://issues.apache.org/jira/browse/SPARK-6433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6433: - Assignee: Steve Loughran hive tests to import spark-sql test JAR for QueryTest access Key: SPARK-6433 URL: https://issues.apache.org/jira/browse/SPARK-6433 Project: Spark Issue Type: Improvement Components: Build, SQL Affects Versions: 1.4.0 Reporter: Steve Loughran Assignee: Steve Loughran Priority: Minor Fix For: 1.4.0 Original Estimate: 0.5h Remaining Estimate: 0.5h The hive module has its own clone of {{org.apache.spark.sql.QueryPlan}} and {{org.apache.spark.sql.catalyst.plans.PlanTest}} which are copied from the spark-sql module because it's hard to have maven allow one subproject to depend on another subproject's test code. It's actually relatively straightforward: # tell maven to build and publish the test JARs # import them in your other subprojects There is one consequence: the JARs will also end up being published to mvn central. This is not really a bad thing; it does help downstream projects pick up the JARs too. It does become an issue if a test run depends on a custom file under {{src/test/resources}} containing things like EC2 authentication keys, or even just log4j.properties files which can interfere with each other. These need to be excluded - the simplest way is to exclude all of the resources from test JARs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
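For readers who build with sbt rather than maven, a rough sketch of the same idea (module names and settings vintage are hypothetical, not Spark's actual build definition):
{code}
// Hedged sbt sketch of the test-JAR approach.
lazy val sql = project.settings(
  // publish this module's test classes as a separate test-jar artifact
  publishArtifact in (Test, packageBin) := true
)

// let the hive module's tests depend on sql's test code
lazy val hive = project.dependsOn(sql % "test->test")
{code}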
[jira] [Created] (SPARK-6649) DataFrame created through SQLContext.jdbc() failed if columns table must be quoted
Frédéric Blanc created SPARK-6649: - Summary: DataFrame created through SQLContext.jdbc() failed if columns table must be quoted Key: SPARK-6649 URL: https://issues.apache.org/jira/browse/SPARK-6649 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Frédéric Blanc Priority: Minor If I want to import the content of a table from Oracle that contains a column named COMMENT (a reserved keyword), I cannot use a DataFrame that maps all the columns of this table. {code:title=ddl.sql|borderStyle=solid}
CREATE TABLE TEST_TABLE (
    COMMENT VARCHAR2(10)
);
{code} {code:title=test.java|borderStyle=solid}
SQLContext sqlContext = ...

DataFrame df = sqlContext.jdbc(databaseURL, "TEST_TABLE");
df.rdd(); // => fails if the table contains a column with a reserved keyword
{code} The same problem can be encountered if reserved keywords are used in the table name. The JDBCRDD Scala class could be improved if the columnList initializer appended double quotes around each column (line 225). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
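A minimal sketch of the suggested improvement (the quoting style is assumed to be the ANSI double quote; real databases differ, e.g. MySQL uses backticks):
{code}
// Hedged sketch: quote every column name when building the projection
// list so reserved words like COMMENT survive in the generated SQL.
private val columnList: String = {
  if (columns.isEmpty) "1"
  else columns.map(c => "\"" + c + "\"").mkString(", ")
}
{code}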
[jira] [Resolved] (SPARK-6433) hive tests to import spark-sql test JAR for QueryTest access
[ https://issues.apache.org/jira/browse/SPARK-6433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6433. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5119 [https://github.com/apache/spark/pull/5119] hive tests to import spark-sql test JAR for QueryTest access Key: SPARK-6433 URL: https://issues.apache.org/jira/browse/SPARK-6433 Project: Spark Issue Type: Improvement Components: Build, SQL Affects Versions: 1.4.0 Reporter: Steve Loughran Priority: Minor Fix For: 1.4.0 Original Estimate: 0.5h Remaining Estimate: 0.5h The hive module has its own clone of {{org.apache.spark.sql.QueryPlan}} and {{org.apache.spark.sql.catalyst.plans.PlanTest}} which are copied from the spark-sql module because it's hard to have maven allow one subproject to depend on another subproject's test code. It's actually relatively straightforward: # tell maven to build and publish the test JARs # import them in your other subprojects There is one consequence: the JARs will also end up being published to mvn central. This is not really a bad thing; it does help downstream projects pick up the JARs too. It does become an issue if a test run depends on a custom file under {{src/test/resources}} containing things like EC2 authentication keys, or even just log4j.properties files which can interfere with each other. These need to be excluded - the simplest way is to exclude all of the resources from test JARs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6334) spark-local dir not getting cleared during ALS
[ https://issues.apache.org/jira/browse/SPARK-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antony Mayi updated SPARK-6334: --- Attachment: gc.png spark-local dir not getting cleared during ALS -- Key: SPARK-6334 URL: https://issues.apache.org/jira/browse/SPARK-6334 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Reporter: Antony Mayi Attachments: als-diskusage.png, gc.png When running bigger ALS training, Spark spills loads of temp data into the local-dir (in my case yarn/local/usercache/antony.mayi/appcache/... - running on YARN from CDH 5.3.2), eventually causing all the disks of all nodes to run out of space (in my case I have 12TB of available disk capacity before kicking off the ALS, but it all gets used, and YARN kills the containers when reaching 90%). Even with all recommended options (configuring checkpointing and forcing GC when possible) it still doesn't get cleared. Here is my (pseudo)code (pyspark): {code}
sc.setCheckpointDir('/tmp')
training = sc.pickleFile('/tmp/dataset').repartition(768).persist(StorageLevel.MEMORY_AND_DISK)
model = ALS.trainImplicit(training, 50, 15, lambda_=0.1, blocks=-1, alpha=40)
sc._jvm.System.gc()
{code} The training RDD has about 3.5 billion items (~60GB on disk). After about 6 hours the ALS will consume all 12TB of disk space in local-dir data and get killed. My cluster has 192 cores and 1.5TB RAM, and for this task I am using 37 executors of 4 cores/28+4GB RAM each. This is the graph of the disk consumption pattern showing the space all being eaten from 7% to 90% during the ALS (90% is when YARN kills the container): !als-diskusage.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6334) spark-local dir not getting cleared during ALS
[ https://issues.apache.org/jira/browse/SPARK-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390771#comment-14390771 ] Antony Mayi commented on SPARK-6334: bq. btw. I see based on the source code checkpointing should be happening every 3 iterations - how come I don't see any drops in the disk usage at least once every three iterations? it just seems to be growing constantly... which worries me that even more frequent checkpointing won't help... OK, I am now sure tuning the checkpointing interval is likely not going to help, just as it is not helping now - the disk usage just grows even after 3x iterations. I just tried a dirty hack - running a parallel thread that forces GC every x minutes - and suddenly I can see the disk space gets cleared upon every three iterations when GC runs. See this pattern - the first run is without forcing GC, and in the second one there are noticeable disk usage drops every three steps (ALS iterations): !gc.png! So really what's needed to get the shuffles cleaned upon checkpointing is forcing GC. This was my dirty hack: {code}
from threading import Thread, Event

class GC(Thread):
    def __init__(self, context, period=600):
        Thread.__init__(self)
        self.context = context
        self.period = period
        self.daemon = True
        self.stopped = Event()

    def stop(self):
        self.stopped.set()

    def run(self):
        self.stopped.clear()
        while not self.stopped.is_set():
            self.stopped.wait(self.period)
            self.context._jvm.System.gc()

sc.setCheckpointDir('/tmp')
gc = GC(sc)
gc.start()
training = sc.pickleFile('/tmp/dataset').repartition(768).persist(StorageLevel.MEMORY_AND_DISK)
model = ALS.trainImplicit(training, 50, 15, lambda_=0.1, blocks=-1, alpha=40)
gc.stop()
{code} spark-local dir not getting cleared during ALS -- Key: SPARK-6334 URL: https://issues.apache.org/jira/browse/SPARK-6334 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Reporter: Antony Mayi Attachments: als-diskusage.png, gc.png When running bigger ALS training, Spark spills loads of temp data into the local-dir (in my case yarn/local/usercache/antony.mayi/appcache/... - running on YARN from CDH 5.3.2), eventually causing all the disks of all nodes to run out of space (in my case I have 12TB of available disk capacity before kicking off the ALS, but it all gets used, and YARN kills the containers when reaching 90%). Even with all recommended options (configuring checkpointing and forcing GC when possible) it still doesn't get cleared. Here is my (pseudo)code (pyspark): {code}
sc.setCheckpointDir('/tmp')
training = sc.pickleFile('/tmp/dataset').repartition(768).persist(StorageLevel.MEMORY_AND_DISK)
model = ALS.trainImplicit(training, 50, 15, lambda_=0.1, blocks=-1, alpha=40)
sc._jvm.System.gc()
{code} The training RDD has about 3.5 billion items (~60GB on disk). After about 6 hours the ALS will consume all 12TB of disk space in local-dir data and get killed. My cluster has 192 cores and 1.5TB RAM, and for this task I am using 37 executors of 4 cores/28+4GB RAM each. This is the graph of the disk consumption pattern showing the space all being eaten from 7% to 90% during the ALS (90% is when YARN kills the container): !als-diskusage.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-6334) spark-local dir not getting cleared during ALS
[ https://issues.apache.org/jira/browse/SPARK-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antony Mayi reopened SPARK-6334: spark-local dir not getting cleared during ALS -- Key: SPARK-6334 URL: https://issues.apache.org/jira/browse/SPARK-6334 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Reporter: Antony Mayi Attachments: als-diskusage.png, gc.png When running bigger ALS training, Spark spills loads of temp data into the local-dir (in my case yarn/local/usercache/antony.mayi/appcache/... - running on YARN from CDH 5.3.2), eventually causing all the disks of all nodes to run out of space (in my case I have 12TB of available disk capacity before kicking off the ALS, but it all gets used, and YARN kills the containers when reaching 90%). Even with all recommended options (configuring checkpointing and forcing GC when possible) it still doesn't get cleared. Here is my (pseudo)code (pyspark): {code}
sc.setCheckpointDir('/tmp')
training = sc.pickleFile('/tmp/dataset').repartition(768).persist(StorageLevel.MEMORY_AND_DISK)
model = ALS.trainImplicit(training, 50, 15, lambda_=0.1, blocks=-1, alpha=40)
sc._jvm.System.gc()
{code} The training RDD has about 3.5 billion items (~60GB on disk). After about 6 hours the ALS will consume all 12TB of disk space in local-dir data and get killed. My cluster has 192 cores and 1.5TB RAM, and for this task I am using 37 executors of 4 cores/28+4GB RAM each. This is the graph of the disk consumption pattern showing the space all being eaten from 7% to 90% during the ALS (90% is when YARN kills the container): !als-diskusage.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2243) Support multiple SparkContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391770#comment-14391770 ] Jason Hubbard edited comment on SPARK-2243 at 4/1/15 11:40 PM: --- Apologizing for being flippant is a bit of an oxymoron isn't it? The answer you proprose is the only one available, but it isn't a real solution, it's a workaround. Obviously running in separate JVMs causes other issues with overhead of starting multiple JVMs and the complexity of having to serialize data so they can communicate. Having multiple workloads in the same SparkContext is what I have chosen, but sometimes you would like different settings for the different workloads which this would now not allow. was (Author: jahubba): Apologizing for being flippant is a bit of an oxymoron isn't it? The answer you purpose is the only one available, but it isn't a real solution, it's a workaround. Obviously running in separate JVMs causes other issues with overhead of starting multiple JVMs and the complexity of having to serialize data so they can communicate. Having multiple workloads in the same SparkContext is what I have chosen, but sometimes you would like different settings for the different workloads which this would now not allow. Support multiple SparkContexts in the same JVM -- Key: SPARK-2243 URL: https://issues.apache.org/jira/browse/SPARK-2243 Project: Spark Issue Type: New Feature Components: Block Manager, Spark Core Affects Versions: 0.7.0, 1.0.0, 1.1.0 Reporter: Miguel Angel Fernandez Diaz We're developing a platform where we create several Spark contexts for carrying out different calculations. Is there any restriction when using several Spark contexts? We have two contexts, one for Spark calculations and another one for Spark Streaming jobs. 
The next error arises when we first execute a Spark calculation and, once the execution is finished, a Spark Streaming job is launched: {code} 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 14/06/23 16:40:08 WARN
[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391770#comment-14391770 ] Jason Hubbard commented on SPARK-2243: -- Apologizing for being flippant is a bit of an oxymoron isn't it? The answer you purpose is the only one available, but it isn't a real solution, it's a workaround. Obviously running in separate JVMs causes other issues with overhead of starting multiple JVMs and the complexity of having to serialize data so they can communicate. Having multiple workloads in the same SparkContext is what I have chosen, but sometimes you would like different settings for the different workloads which this would now not allow. Support multiple SparkContexts in the same JVM -- Key: SPARK-2243 URL: https://issues.apache.org/jira/browse/SPARK-2243 Project: Spark Issue Type: New Feature Components: Block Manager, Spark Core Affects Versions: 0.7.0, 1.0.0, 1.1.0 Reporter: Miguel Angel Fernandez Diaz We're developing a platform where we create several Spark contexts for carrying out different calculations. Is there any restriction when using several Spark contexts? We have two contexts, one for Spark calculations and another one for Spark Streaming jobs. The next error arises when we first execute a Spark calculation and, once the execution is finished, a Spark Streaming job is launched: {code} 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193) at 
org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0) 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to java.io.FileNotFoundException java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
[jira] [Resolved] (SPARK-6553) Support for functools.partial as UserDefinedFunction
[ https://issues.apache.org/jira/browse/SPARK-6553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-6553. --- Resolution: Fixed Fix Version/s: 1.4.0 1.3.1 Support for functools.partial as UserDefinedFunction Key: SPARK-6553 URL: https://issues.apache.org/jira/browse/SPARK-6553 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.3.0 Reporter: Kalle Jepsen Assignee: Kalle Jepsen Labels: features Fix For: 1.3.1, 1.4.0 Currently {{functools.partial}}s cannot be used as {{UserDefinedFunction}}s for {{DataFrame}}s, as the {{\_\_name\_\_}} attribute does not exist. Passing a {{functools.partial}} object will raise an Exception at https://github.com/apache/spark/blob/master/python/pyspark/sql/functions.py#L126. {{functools.partial}} is very widely used and should probably be supported, despite its lack of a {{\_\_name\_\_}}. My suggestion is to use {{f.\_\_repr\_\_()}} instead, or check with {{hasattr(f, '\_\_name\_\_')}} and use {{\_\_class\_\_}} if {{False}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6613) Starting stream from checkpoint causes Streaming tab to throw error
[ https://issues.apache.org/jira/browse/SPARK-6613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392102#comment-14392102 ] zhichao-li commented on SPARK-6613: --- Just trying to understand the issue, but it can't be reproduced on my side. If possible, could you elaborate on how to reproduce it? I.e. a code snippet or steps. Starting stream from checkpoint causes Streaming tab to throw error --- Key: SPARK-6613 URL: https://issues.apache.org/jira/browse/SPARK-6613 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Reporter: Marius Soutier When continuing my streaming job from a checkpoint, the job runs, but the Streaming tab in the standard UI initially no longer works (browser just shows HTTP ERROR: 500). Sometimes it gets back to normal after a while, and sometimes it stays in this state permanently. Stacktrace: WARN org.eclipse.jetty.servlet.ServletHandler: /streaming/ java.util.NoSuchElementException: key not found: 0 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.MapLike$class.apply(MapLike.scala:141) at scala.collection.AbstractMap.apply(Map.scala:58) at org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1$$anonfun$apply$5.apply(StreamingJobProgressListener.scala:151) at org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1$$anonfun$apply$5.apply(StreamingJobProgressListener.scala:150) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.Range.foreach(Range.scala:141) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1.apply(StreamingJobProgressListener.scala:150) at org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1.apply(StreamingJobProgressListener.scala:149) at scala.Option.map(Option.scala:145) at org.apache.spark.streaming.ui.StreamingJobProgressListener.lastReceivedBatchRecords(StreamingJobProgressListener.scala:149) at org.apache.spark.streaming.ui.StreamingPage.generateReceiverStats(StreamingPage.scala:82) at org.apache.spark.streaming.ui.StreamingPage.render(StreamingPage.scala:43) at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68) at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68) at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:68) at javax.servlet.http.HttpServlet.service(HttpServlet.java:735) at javax.servlet.http.HttpServlet.service(HttpServlet.java:848) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at 
org.eclipse.jetty.server.Server.handle(Server.java:370) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494) at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82) at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667) at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at
[jira] [Updated] (SPARK-6668) repeated asking to remove non-existent executor
[ https://issues.apache.org/jira/browse/SPARK-6668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-6668: -- Affects Version/s: 1.4.0 repeated asking to remove non-existent executor --- Key: SPARK-6668 URL: https://issues.apache.org/jira/browse/SPARK-6668 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Davies Liu {code} 15/04/01 21:37:15 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address 15/04/01 21:37:17 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041. NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly. NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly. 15/04/01 21:37:17 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 0 15/04/01 21:37:18 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 1 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly. [Stage 0: (0 + 0) / 2]15/04/01 21:37:18 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 2 . 15/04/01 21:37:44 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 244 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly. 15/04/01 21:37:44 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 245 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly. 15/04/01 21:37:44 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 246 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly. 15/04/01 21:37:45 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 247 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly. 15/04/01 21:37:45 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 248 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly. 15/04/01 21:37:45 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 249 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly. 15/04/01 21:37:45 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 250 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly. 15/04/01 21:37:45 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 251 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly. 15/04/01 21:37:45 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 252 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly. 15/04/01 21:37:45 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 253 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly. 15/04/01 21:37:45 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 254 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly. 15/04/01 21:37:45 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 255 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly. 
15/04/01 21:37:45 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 256 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly. 15/04/01 21:37:46 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 257 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly. 15/04/01 21:37:46 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 258 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6670) HiveContext.analyze should throw UnsupportedOperationException instead of NotImplementedError
Yin Huai created SPARK-6670: --- Summary: HiveContext.analyze should throw UnsupportedOperationException instead of NotImplementedError Key: SPARK-6670 URL: https://issues.apache.org/jira/browse/SPARK-6670 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.2.0 Reporter: Yin Huai Assignee: Michael Armbrust -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
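For context on why the distinction matters: {{scala.NotImplementedError}} (what {{???}} throws) extends {{java.lang.Error}}, which callers generally should not catch, while {{UnsupportedOperationException}} is an ordinary runtime exception. A hedged sketch of the requested change, with a simplified signature that is not the actual Spark source:
{code}
// Hedged sketch only: signal "not supported here" with a catchable
// exception instead of an Error.
def analyze(tableName: String): Unit =
  throw new UnsupportedOperationException(
    s"analyze is not supported for table $tableName")
{code}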
[jira] [Created] (SPARK-6662) Allow variable substitution in spark.yarn.historyServer.address
Cheolsoo Park created SPARK-6662: Summary: Allow variable substitution in spark.yarn.historyServer.address Key: SPARK-6662 URL: https://issues.apache.org/jira/browse/SPARK-6662 Project: Spark Issue Type: Wish Components: YARN Affects Versions: 1.3.0 Reporter: Cheolsoo Park Priority: Minor In Spark on YARN, an explicit hostname and port number need to be set for spark.yarn.historyServer.address in SparkConf to make the HISTORY link work. If the history server address is known and static, this is usually not a problem. But in the cloud, that is usually not true. Particularly in EMR, the history server always runs on the same node as the RM. So I could simply set it to {{$\{yarn.resourcemanager.hostname\}:18080}} if variable substitution were allowed. In fact, Hadoop configuration already implements variable substitution, so if this property were read via a YarnConf, this would be easily achievable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
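To illustrate the mechanism (a hedged sketch: the Hadoop classes and the yarn.resourcemanager.hostname property are real, but the wiring and the temporary key are hypothetical): Hadoop's Configuration.get() expands variable references of the form $\{...\} against the loaded configuration, so routing the raw value through a YarnConfiguration would resolve the RM hostname.
{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration

// Hedged sketch: Configuration.get() substitutes ${...} references using
// values from yarn-site/yarn-default. `sparkConf` is assumed in scope;
// the temp key is hypothetical.
val raw = sparkConf.get("spark.yarn.historyServer.address")
val yarnConf = new YarnConfiguration()
yarnConf.set("spark.tmp.history.address", raw)
val resolved = yarnConf.get("spark.tmp.history.address")
{code}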
[jira] [Updated] (SPARK-6664) Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
[ https://issues.apache.org/jira/browse/SPARK-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florian Verhein updated SPARK-6664: --- Description: I can't find this functionality (if I missed something, apologies!), but it would be very useful for evaluating ml models. *Use case example* suppose you have pre-processed web logs for a few months, and now want to split it into a training set (where you train a model to predict some aspect of site accesses, perhaps per user) and an out of time test set (where you evaluate how well your model performs in the future). This example has just a single split, but in general you could want more for cross validation. You may also want to have multiple overlapping intervals. *Specification* 1. Given an Ordered RDD and an ordered sequence of n boundaries (i.e. keys), return n+1 RDDs such that values in the ith RDD are within the (i-1)th and ith boundary. 2. More complex alternative (but similar under the hood): provide a sequence of possibly overlapping intervals (ordered by the start key of the interval), and return the RDDs containing values within those intervals. *Implementation ideas / notes for 1* - The ordered RDDs are likely RangePartitioned (or there should be a simple way to find ranges from partitions in an ordered RDD) - Find the partitions containing the boundary, and split them in two. - Construct the new RDDs from the original partitions (and any split ones) I suspect this could be done by launching only a few jobs to split the partitions containing the boundaries. Alternatively, it might be possible to decorate these partitions and use them in more than one RDD. I.e. let one of these partitions (for boundary i) be p. Apply two decorators p' and p'', where p' masks out values above the ith boundary, and p'' masks out values below the ith boundary. Any operations on these partitions apply only to values not masked out. Then assign p' to the ith output RDD and p'' to the (i+1)th output RDD. If I understand Spark correctly, this should not require any jobs. Not sure whether it's worth trying this optimisation. *Implementation ideas / notes for 2* This is very similar, except that we have to handle entire partitions (or parts of them) belonging to more than one output RDD, since they are no longer mutually exclusive. But since RDDs are immutable(??), the decorator idea should still work? Thoughts? was: I can't find this functionality (if I missed something, apologies!), but it would be very useful for evaluating ml models. Use case example: suppose you have pre-processed web logs for a few months, and now want to split it into a training set (where you train a model to predict some aspect of site accesses, perhaps per user) and an out of time test set (where you evaluate how well your model performs in the future). This example has just a single split, but in general you could want more for cross validation. You may also want to have multiple overlapping intervals. Specification: 1. Given an Ordered RDD and an ordered sequence of n boundaries (i.e. keys), return n+1 RDDs such that values in the ith RDD are within the (i-1)th and ith boundary. 2. More complex alternative (but similar under the hood): provide a sequence of possibly overlapping intervals, and return the RDDs containing values within those intervals. 
Implementation ideas / notes for 1: - The ordered RDDs are likely RangePartitioned (or there should be a simple way to find ranges from partitions in an ordered RDD) - Find the partitions containing the boundary, and split them in two. - Construct the new RDDs from the original partitions (and any split ones) I suspect this could be done by launching only a few jobs to split the partitions containing the boundaries. Alternatively, it might be possible to decorate these partitions and use them in more than one RDD. I.e. let one of these partitions (for boundary i) be p. Apply two decorators p' and p'', where p' masks out values above the ith boundary, and p'' masks out values below the ith boundary. Any operations on these partitions apply only to values not masked out. Then assign p' to the ith output RDD and p'' to the (i+1)th output RDD. If I understand Spark correctly, this should not require any jobs. Not sure whether it's worth trying this optimisation. Implementation ideas / notes for 2: This is very similar, except that we have to handle entire partitions (or parts of them) belonging to more than one output RDD, since they are no longer mutually exclusive. But since RDDs are immutable(?), the decorator idea should still work? Thoughts? Split Ordered RDD into multiple RDDs by keys (boundaries or intervals) -- Key: SPARK-6664 URL: https://issues.apache.org/jira/browse/SPARK-6664
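As a baseline to compare the partition-splitting proposal above against, a naive sketch of specification 1 (helper name hypothetical; it rescans the RDD once per slice, which is exactly the cost the proposal wants to avoid):
{code}
import org.apache.spark.rdd.RDD

// Hedged baseline sketch: n boundaries -> n+1 RDDs via repeated filters.
// Slice i holds keys k with boundary(i-1) <= k < boundary(i).
def splitByBoundaries[K, V](rdd: RDD[(K, V)], boundaries: Seq[K])
                           (implicit ord: Ordering[K]): Seq[RDD[(K, V)]] = {
  val lowers = None +: boundaries.map(Option(_))
  val uppers = boundaries.map(Option(_)) :+ None
  lowers.zip(uppers).map { case (lo, hi) =>
    rdd.filter { case (k, _) =>
      lo.forall(ord.lteq(_, k)) && hi.forall(ord.gt(_, k))
    }
  }
}
{code}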
[jira] [Commented] (SPARK-6106) Support user group mapping and groups in view, modify and admin acls
[ https://issues.apache.org/jira/browse/SPARK-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392188#comment-14392188 ] Apache Spark commented on SPARK-6106: - User 'colinmjj' has created a pull request for this issue: https://github.com/apache/spark/pull/5325 Support user group mapping and groups in view, modify and admin acls Key: SPARK-6106 URL: https://issues.apache.org/jira/browse/SPARK-6106 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Jerry Chen Labels: Rhino, Security Attachments: SPARK-6106.001.patch Original Estimate: 672h Remaining Estimate: 672h Spark supports various ACL settings for jobs to control the visibility of a job and user privileges. Currently, the ACLs (view, modify and admin) are specified as a list of users. As a convention, Hadoop Common supports a mechanism known as user group mapping, and group names can be specified in ACLs. The ability to do user group mapping and to allow groups to be specified in ACLs would greatly improve flexibility and support enterprise use cases such as AD group integration. This JIRA proposes to support user group mapping in Spark ACL control and to allow specifying group names in the various ACLs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
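To make the proposal concrete, a hedged sketch loosely modeled on Hadoop's user-group mapping idea (all names hypothetical, not Spark's actual ACL code): an ACL entry matches either the user directly or one of the user's resolved groups.
{code}
// Hedged sketch only: a pluggable mapping resolves a user's groups, and
// an ACL check passes if the user or any of those groups is listed.
trait GroupMappingService {
  def getGroups(user: String): Set[String]
}

def hasAccess(acl: Set[String], user: String, mapping: GroupMappingService): Boolean =
  acl.contains(user) || mapping.getGroups(user).exists(acl.contains)
{code}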
[jira] [Updated] (SPARK-6580) Optimize LogisticRegressionModel.predictPoint
[ https://issues.apache.org/jira/browse/SPARK-6580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6580: - Assignee: Yanbo Liang Optimize LogisticRegressionModel.predictPoint - Key: SPARK-6580 URL: https://issues.apache.org/jira/browse/SPARK-6580 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Yanbo Liang Priority: Minor LogisticRegressionModel.predictPoint could be optimized some. There are several checks which could be moved outside loops or even outside predictPoint to initialization of the model. Some include: {code}
require(numFeatures == weightMatrix.size)
val dataWithBiasSize = weightMatrix.size / (numClasses - 1)
val weightsArray = weightMatrix match { ...
if (dataMatrix.size + 1 == dataWithBiasSize) {...
{code} Also, for multiclass, the 2 loops (over numClasses and margins) could be combined into 1 loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
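To make the shape of the optimization concrete, a hedged sketch with simplified names (not the real MLlib code): the checks and derived sizes move to construction time, and the two multiclass loops fuse into a single pass.
{code}
// Hedged sketch only: validation and dataWithBiasSize are computed once
// per model; predictPoint keeps just the per-point work in one fused loop.
// Assumes numClasses >= 2 and a dense weights layout.
class ModelSketch(weights: Array[Double], numFeatures: Int, numClasses: Int) {
  require(weights.length % (numClasses - 1) == 0, "malformed weights")
  private val dataWithBiasSize = weights.length / (numClasses - 1)

  def predictPoint(data: Array[Double]): Int = {
    var best = 0
    var bestMargin = 0.0
    var i = 0
    while (i < numClasses - 1) {
      var margin = 0.0
      var j = 0
      while (j < numFeatures) {
        margin += weights(i * dataWithBiasSize + j) * data(j)
        j += 1
      }
      if (margin > bestMargin) { bestMargin = margin; best = i + 1 }
      i += 1
    }
    best
  }
}
{code}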
[jira] [Commented] (SPARK-6553) Support for functools.partial as UserDefinedFunction
[ https://issues.apache.org/jira/browse/SPARK-6553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391851#comment-14391851 ] Josh Rosen commented on SPARK-6553: --- This was fixed by https://github.com/apache/spark/pull/5206 for 1.3.1 and 1.4.0. Support for functools.partial as UserDefinedFunction Key: SPARK-6553 URL: https://issues.apache.org/jira/browse/SPARK-6553 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.3.0 Reporter: Kalle Jepsen Assignee: Kalle Jepsen Labels: features Fix For: 1.3.1, 1.4.0 Currently {{functools.partial}}s cannot be used as {{UserDefinedFunction}}s for {{DataFrame}}s, as the {{\_\_name\_\_}} attribute does not exist. Passing a {{functools.partial}} object will raise an Exception at https://github.com/apache/spark/blob/master/python/pyspark/sql/functions.py#L126. {{functools.partial}} is very widely used and should probably be supported, despite its lack of a {{\_\_name\_\_}}. My suggestion is to use {{f.\_\_repr\_\_()}} instead, or check with {{hasattr(f, '\_\_name\_\_')}} and use {{\_\_class\_\_}} if {{False}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6553) Support for functools.partial as UserDefinedFunction
[ https://issues.apache.org/jira/browse/SPARK-6553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-6553: -- Assignee: Kalle Jepsen Support for functools.partial as UserDefinedFunction Key: SPARK-6553 URL: https://issues.apache.org/jira/browse/SPARK-6553 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.3.0 Reporter: Kalle Jepsen Assignee: Kalle Jepsen Labels: features Fix For: 1.3.1, 1.4.0 Currently {{functools.partial}}s cannot be used as {{UserDefinedFunction}}s for {{DataFrame}}s, as the {{\_\_name\_\_}} attribute does not exist. Passing a {{functools.partial}} object will raise an Exception at https://github.com/apache/spark/blob/master/python/pyspark/sql/functions.py#L126. {{functools.partial}} is very widely used and should probably be supported, despite its lack of a {{\_\_name\_\_}}. My suggestion is to use {{f.\_\_repr\_\_()}} instead, or check with {{hasattr(f, '\_\_name\_\_')}} and use {{\_\_class\_\_}} if {{False}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6653) New configuration property to specify port for sparkYarnAM actor system
[ https://issues.apache.org/jira/browse/SPARK-6653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391993#comment-14391993 ] Shixiong Zhu commented on SPARK-6653: - Could you send a pull request to https://github.com/apache/spark? And because this is a YARN configuration, I recommend spark.yarn.am.port. New configuration property to specify port for sparkYarnAM actor system --- Key: SPARK-6653 URL: https://issues.apache.org/jira/browse/SPARK-6653 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.3.0 Environment: Spark On Yarn Reporter: Manoj Samel In the 1.3.0 code line, the sparkYarnAM actor system is started on a random port. See org.apache.spark.deploy.yarn ApplicationMaster.scala:282: actorSystem = AkkaUtils.createActorSystem("sparkYarnAM", Utils.localHostName, 0, conf = sparkConf, securityManager = securityMgr)._1 This may be an issue when ports between the Spark client and the YARN cluster are limited by a firewall and not all ports are open between them. The proposal is to introduce a new property spark.am.actor.port and change the code to: val port = sparkConf.getInt("spark.am.actor.port", 0) actorSystem = AkkaUtils.createActorSystem("sparkYarnAM", Utils.localHostName, port, conf = sparkConf, securityManager = securityMgr)._1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6665) Randomly Shuffle an RDD
Florian Verhein created SPARK-6665: -- Summary: Randomly Shuffle an RDD Key: SPARK-6665 URL: https://issues.apache.org/jira/browse/SPARK-6665 Project: Spark Issue Type: New Feature Components: Spark Shell Reporter: Florian Verhein Priority: Minor *Use case* An RDD is created in a way that has some ordering, but you need to shuffle it because the ordering would cause problems downstream. E.g. - it will be used to train an ML algorithm that makes stochastic assumptions (like SGD) - it will be used as input for cross validation, e.g. after the shuffle you could just grab partitions (or part files if saved to HDFS) as folds Related question in the mailing list: http://apache-spark-user-list.1001560.n3.nabble.com/random-shuffle-streaming-RDDs-td17965.html *Possible implementation* As mentioned by [~sowen] in the above thread, one could sort by a good hash of the element (or key, if it's paired) combined with a random salt; see the sketch below. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
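A minimal sketch of that suggestion (helper name hypothetical): hash each element together with a per-run random salt and sort by the result, which yields a pseudorandom but reproducible-within-a-run order.
{code}
import scala.reflect.ClassTag
import scala.util.Random
import scala.util.hashing.MurmurHash3
import org.apache.spark.rdd.RDD

// Hedged sketch: the salted hash acts as the random sort key.
def randomShuffle[T: ClassTag](rdd: RDD[T]): RDD[T] = {
  val salt = Random.nextInt()
  rdd.sortBy(x => MurmurHash3.stringHash(x.toString, salt))
}
{code}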
[jira] [Created] (SPARK-6666) org.apache.spark.sql.jdbc.JDBCRDD does not escape/quote column names
John Ferguson created SPARK-6666: Summary: org.apache.spark.sql.jdbc.JDBCRDD does not escape/quote column names Key: SPARK-6666 URL: https://issues.apache.org/jira/browse/SPARK-6666 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Environment: Reporter: John Ferguson Priority: Critical Is there a way to have JDBC DataFrames use quoted/escaped column names? Right now, it looks like it sees the names correctly in the schema created but does not escape them in the SQL it creates when they are not compliant: org.apache.spark.sql.jdbc.JDBCRDD private val columnList: String = { val sb = new StringBuilder() columns.foreach(x => sb.append(",").append(x)) if (sb.length == 0) "1" else sb.substring(1) } If you see value in this, I would take a shot at adding the quoting (escaping) of column names here. If you don't do it, some drivers, like postgresql's, will simply lower-case all names when parsing the query. As you can see in the TL;DR below, that means they won't match the schema I am given. TL;DR: I am able to connect to a Postgres database in the shell (with driver referenced): val jdbcDf = sqlContext.jdbc("jdbc:postgresql://localhost/sparkdemo?user=dbuser", "sp500") In fact when I run: jdbcDf.registerTempTable("sp500") val avgEPSNamed = sqlContext.sql("SELECT AVG(`Earnings/Share`) as AvgCPI FROM sp500") and val avgEPSProg = jsonDf.agg(avg(jsonDf.col("Earnings/Share"))) The values come back as expected. However, if I try: jdbcDf.show Or if I try val all = sqlContext.sql("SELECT * FROM sp500") all.show I get errors about column names not being found. In fact the error includes a mention of column names all lower cased. For now I will change my schema to be more restrictive. Right now it is, per a Stack Overflow poster, not ANSI compliant, by doing things that are allowed via quoted identifiers in pgsql, MySQL and SQLServer. BTW, our users are giving us tables like this... because various tools they already use support non-compliant names. In fact, this is mild compared to what we've had to support. Currently the schema in question uses mixed case, quoted names with special characters and spaces: CREATE TABLE sp500 ( Symbol text, Name text, Sector text, Price double precision, "Dividend Yield" double precision, "Price/Earnings" double precision, "Earnings/Share" double precision, "Book Value" double precision, "52 week low" double precision, "52 week high" double precision, "Market Cap" double precision, EBITDA double precision, "Price/Sales" double precision, "Price/Book" double precision, "SEC Filings" text ) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6106) Support user group mapping and groups in view, modify and admin acls
[ https://issues.apache.org/jira/browse/SPARK-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6106: --- Assignee: Apache Spark Support user group mapping and groups in view, modify and admin acls Key: SPARK-6106 URL: https://issues.apache.org/jira/browse/SPARK-6106 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Jerry Chen Assignee: Apache Spark Labels: Rhino, Security Attachments: SPARK-6106.001.patch Original Estimate: 672h Remaining Estimate: 672h Spark supports various ACL settings for jobs to control the visibility of a job and user privileges. Currently, the ACLs (view, modify and admin) are specified as a list of users. As a convention, Hadoop Common supports a mechanism known as user group mapping, and group names can be specified in ACLs. The ability to do user group mapping and to allow groups to be specified in ACLs would greatly improve flexibility and support enterprise use cases such as AD group integration. This JIRA proposes to support user group mapping in Spark ACL control and to allow specifying group names in the various ACLs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6106) Support user group mapping and groups in view, modify and admin acls
[ https://issues.apache.org/jira/browse/SPARK-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6106: --- Assignee: (was: Apache Spark) Support user group mapping and groups in view, modify and admin acls Key: SPARK-6106 URL: https://issues.apache.org/jira/browse/SPARK-6106 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Jerry Chen Labels: Rhino, Security Attachments: SPARK-6106.001.patch Original Estimate: 672h Remaining Estimate: 672h Spark supports various ACL settings for jobs to control the visibility of a job and user privileges. Currently, the ACLs (view, modify and admin) are specified as a list of users. As a convention, Hadoop Common supports a mechanism known as user group mapping, and group names can be specified in ACLs. The ability to do user group mapping and to allow groups to be specified in ACLs would greatly improve flexibility and support enterprise use cases such as AD group integration. This JIRA proposes to support user group mapping in Spark ACL control and to allow specifying group names in the various ACLs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2243) Support multiple SparkContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391770#comment-14391770 ] Jason Hubbard edited comment on SPARK-2243 at 4/1/15 11:41 PM: --- Apologizing for being flippant is a bit of an oxymoron, isn't it? The answer you propose is the only one available, but it isn't a real solution, it's a workaround. Obviously, running in separate JVMs causes other issues: the overhead of starting multiple JVMs and the complexity of having to serialize data so they can communicate. Having multiple workloads in the same SparkContext is what I have chosen, but sometimes you would like different settings for the different workloads, which this would not allow. Support multiple SparkContexts in the same JVM -- Key: SPARK-2243 URL: https://issues.apache.org/jira/browse/SPARK-2243 Project: Spark Issue Type: New Feature Components: Block Manager, Spark Core Affects Versions: 0.7.0, 1.0.0, 1.1.0 Reporter: Miguel Angel Fernandez Diaz We're developing a platform where we create several Spark contexts for carrying out different calculations. Is there any restriction when using several Spark contexts? We have two contexts, one for Spark calculations and another one for Spark Streaming jobs.
The next error arises when we first execute a Spark calculation and, once the execution is finished, a Spark Streaming job is launched: {code}
14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
	at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
	at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
	at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
	at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
	at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
	at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
	at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
14/06/23 16:40:08 WARN
{code}
[jira] [Commented] (SPARK-6664) Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
[ https://issues.apache.org/jira/browse/SPARK-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391950#comment-14391950 ] Florian Verhein commented on SPARK-6664: The closest approach I've found that should achieve the same result is calling OrderedRDDFunctions.filterByRange n+1 times. I assume this approach will be much slower, but it may not be if it's completely lazy (??). I don't know Spark well enough yet to be anywhere near sure of this. Split Ordered RDD into multiple RDDs by keys (boundaries or intervals) -- Key: SPARK-6664 URL: https://issues.apache.org/jira/browse/SPARK-6664 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Florian Verhein I can't find this functionality (if I missed something, apologies!), but it would be very useful for evaluating ML models.
*Use case example* Suppose you have pre-processed web logs for a few months, and now want to split them into a training set (where you train a model to predict some aspect of site accesses, perhaps per user) and an out-of-time test set (where you evaluate how well your model performs in the future). This example has just a single split, but in general you could want more for cross validation. You may also want multiple overlapping intervals.
*Specification*
1. Given an ordered RDD and an ordered sequence of n boundaries (i.e. keys), return n+1 RDDs such that the values in the ith RDD fall between the (i-1)th and ith boundaries.
2. More complex alternative (but similar under the hood): provide a sequence of possibly overlapping intervals (ordered by the start key of the interval), and return the RDDs containing values within those intervals.
*Implementation ideas / notes for 1*
- The ordered RDDs are likely RangePartitioned (or there should be a simple way to find ranges from partitions in an ordered RDD)
- Find the partitions containing the boundaries, and split them in two.
- Construct the new RDDs from the original partitions (and any split ones)
I suspect this could be done by launching only a few jobs to split the partitions containing the boundaries. Alternatively, it might be possible to decorate these partitions and use them in more than one RDD. I.e. let one of these partitions (for boundary i) be p. Apply two decorators p' and p'', where p' masks out values above the ith boundary, and p'' masks out values below the ith boundary. Any operations on these partitions apply only to values not masked out. Then assign p' to the ith output RDD and p'' to the (i+1)th output RDD. If I understand Spark correctly, this should not require any jobs. Not sure whether it's worth trying this optimisation.
*Implementation ideas / notes for 2*
This is very similar, except that we have to handle entire partitions (or parts of them) belonging to more than one output RDD, since they are no longer mutually exclusive. But since RDDs are immutable (??), the decorator idea should still work? Thoughts?
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
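[Editor's note] A minimal sketch of the filterByRange workaround mentioned in the comment above (my illustration, assuming integer keys and a pair RDD already sorted by key; filterByRange is inclusive on both ends, so interior upper edges are shifted by one to keep the pieces disjoint):
{code}
import org.apache.spark.rdd.RDD

// Split a sorted pair RDD at the given (sorted) boundary keys by filtering
// n+1 times. Each boundary key lands in the piece above it.
def splitByBoundaries(rdd: RDD[(Int, String)], boundaries: Seq[Int]): Seq[RDD[(Int, String)]] = {
  val edges = (Int.MinValue +: boundaries) :+ Int.MaxValue
  edges.sliding(2).map { case Seq(lo, hi) =>
    rdd.filterByRange(lo, if (hi == Int.MaxValue) hi else hi - 1)
  }.toSeq
}
{code}
As the commenter notes, each filterByRange is lazy and prunes to the relevant range partitions, so whether this is much slower than a dedicated split depends on how many of the n+1 pieces are actually evaluated.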
[jira] [Commented] (SPARK-6113) Stabilize DecisionTree and ensembles APIs
[ https://issues.apache.org/jira/browse/SPARK-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392054#comment-14392054 ] Joseph K. Bradley commented on SPARK-6113: -- I just noted that this is blocked by the 2 indexer JIRAs. (Really, it requires at least one of them.) This is because we made a decision to add this API directly to the spark.ml package, rather than creating another tree API within the spark.mllib package. In the spark.ml package, we will require some way to test categorical features and multiclass classification, which will require one of the indexer JIRAs (to add category metadata). Stabilize DecisionTree and ensembles APIs - Key: SPARK-6113 URL: https://issues.apache.org/jira/browse/SPARK-6113 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Critical *Issue*: The APIs for DecisionTree and ensembles (RandomForests and GradientBoostedTrees) have been experimental for a long time. The API has become very convoluted because trees and ensembles have many, many variants, some of which we have added incrementally without a long-term design. *Proposal*: This JIRA is for discussing changes required to finalize the APIs. After we discuss, I will make a PR to update the APIs and make them non-Experimental. This will require making many breaking changes; see the design doc for details. [Design doc | https://docs.google.com/document/d/1rJ_DZinyDG3PkYkAKSsQlY0QgCeefn4hUv7GsPkzBP4]: This outlines current issues and the proposed API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6667) hang while collect in PySpark
[ https://issues.apache.org/jira/browse/SPARK-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6667: --- Assignee: Davies Liu (was: Apache Spark) hang while collect in PySpark - Key: SPARK-6667 URL: https://issues.apache.org/jira/browse/SPARK-6667 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.1, 1.4.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Critical PySpark tests hang while collecting: -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6667) hang while collect in PySpark
[ https://issues.apache.org/jira/browse/SPARK-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6667: --- Assignee: Apache Spark (was: Davies Liu) hang while collect in PySpark - Key: SPARK-6667 URL: https://issues.apache.org/jira/browse/SPARK-6667 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.1, 1.4.0 Reporter: Davies Liu Assignee: Apache Spark Priority: Critical PySpark tests hang while collecting: -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6659) Spark SQL 1.3 cannot read a JSON file that contains only one record.
[ https://issues.apache.org/jira/browse/SPARK-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6659: --- Component/s: SQL Spark SQL 1.3 cannot read a JSON file that contains only one record. Key: SPARK-6659 URL: https://issues.apache.org/jira/browse/SPARK-6659 Project: Spark Issue Type: Bug Components: SQL Reporter: luochenghui Dear friends: Spark SQL 1.3 cannot read a JSON file that contains only one record. Here is my JSON file's content:
{noformat}
{name:milo,age,24}
{noformat}
When I run Spark SQL in local mode, it throws an exception:
{noformat}
org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns _corrupt_record;
{noformat}
What I had done:
1. ./spark-shell
2.
{code}
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@5f3be6c8
scala> val df = sqlContext.jsonFile("/home/milo/person.json")
15/03/19 22:11:45 INFO MemoryStore: ensureFreeSpace(163705) called with curMem=0, maxMem=280248975
15/03/19 22:11:45 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 159.9 KB, free 267.1 MB)
15/03/19 22:11:45 INFO MemoryStore: ensureFreeSpace(22692) called with curMem=163705, maxMem=280248975
15/03/19 22:11:45 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.2 KB, free 267.1 MB)
15/03/19 22:11:45 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:35842 (size: 22.2 KB, free: 267.2 MB)
15/03/19 22:11:45 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/03/19 22:11:45 INFO SparkContext: Created broadcast 0 from textFile at JSONRelation.scala:98
15/03/19 22:11:47 INFO FileInputFormat: Total input paths to process : 1
15/03/19 22:11:47 INFO SparkContext: Starting job: reduce at JsonRDD.scala:51
15/03/19 22:11:47 INFO DAGScheduler: Got job 0 (reduce at JsonRDD.scala:51) with 1 output partitions (allowLocal=false)
15/03/19 22:11:47 INFO DAGScheduler: Final stage: Stage 0(reduce at JsonRDD.scala:51)
15/03/19 22:11:47 INFO DAGScheduler: Parents of final stage: List()
15/03/19 22:11:47 INFO DAGScheduler: Missing parents: List()
15/03/19 22:11:47 INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[3] at map at JsonRDD.scala:51), which has no missing parents
15/03/19 22:11:47 INFO MemoryStore: ensureFreeSpace(3184) called with curMem=186397, maxMem=280248975
15/03/19 22:11:47 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 267.1 MB)
15/03/19 22:11:47 INFO MemoryStore: ensureFreeSpace(2251) called with curMem=189581, maxMem=280248975
15/03/19 22:11:47 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.2 KB, free 267.1 MB)
15/03/19 22:11:47 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:35842 (size: 2.2 KB, free: 267.2 MB)
15/03/19 22:11:47 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
15/03/19 22:11:47 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:839
15/03/19 22:11:48 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (MapPartitionsRDD[3] at map at JsonRDD.scala:51)
15/03/19 22:11:48 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/03/19 22:11:48 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1291 bytes)
15/03/19 22:11:48 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/03/19 22:11:48 INFO HadoopRDD: Input split: file:/home/milo/person.json:0+26
15/03/19 22:11:48 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/03/19 22:11:48 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/03/19 22:11:48 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/03/19 22:11:48 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/03/19 22:11:48 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/03/19 22:11:49 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2023 bytes result sent to driver
15/03/19 22:11:49 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1209 ms on localhost (1/1)
15/03/19 22:11:49 INFO DAGScheduler: Stage 0 (reduce at JsonRDD.scala:51) finished in 1.308 s
15/03/19 22:11:49 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/03/19 22:11:49 INFO DAGScheduler: Job 0 finished: reduce at JsonRDD.scala:51, took 2.002429 s
df: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
{code}
3.
{code}
scala> df.select("name").show()
15/03/19 22:12:44 INFO BlockManager:
{code}
[jira] [Closed] (SPARK-6659) Spark SQL 1.3 cannot read a JSON file that contains only one record.
[ https://issues.apache.org/jira/browse/SPARK-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell closed SPARK-6659. -- Resolution: Invalid Per the comment, I think the issue is that the JSON is not correctly formatted. Spark SQL 1.3 cannot read a JSON file that contains only one record. Key: SPARK-6659 URL: https://issues.apache.org/jira/browse/SPARK-6659 Project: Spark Issue Type: Bug Components: SQL Reporter: luochenghui
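[Editor's note] On the resolution above: Spark SQL's jsonFile expects each line of the input to be a complete, well-formed JSON object. A valid version of the reporter's single record would be:
{noformat}
{"name":"milo","age":24}
{noformat}
The original line lacks quotes around the keys and the string value, and uses a comma instead of a colon before 24, which is why the whole record lands in the _corrupt_record column.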
[jira] [Resolved] (SPARK-6642) Change the lambda weight to number of explicit ratings in implicit ALS
[ https://issues.apache.org/jira/browse/SPARK-6642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-6642. -- Resolution: Fixed Fix Version/s: 1.4.0 1.3.1 Issue resolved by pull request 5314 [https://github.com/apache/spark/pull/5314] Change the lambda weight to number of explicit ratings in implicit ALS -- Key: SPARK-6642 URL: https://issues.apache.org/jira/browse/SPARK-6642 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.3.1, 1.4.0 Until SPARK-6637 is resolved, we should switch back to the 1.2 lambda weighting strategy to be consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
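[Editor's note] As background, my paraphrase of the 1.2-style weighting referenced in this ticket (not text from the ticket itself): the weighted-lambda scheme from ALS-WR scales each factor's regularization term by its number of explicit ratings,
{noformat}
\min_{x,y} \sum_{(u,i) \in R} (r_{ui} - x_u^\top y_i)^2
         + \lambda \Big( \sum_u n_u \|x_u\|^2 + \sum_i m_i \|y_i\|^2 \Big)
{noformat}
where n_u is the number of explicit ratings by user u and m_i the number of explicit ratings of item i.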
[jira] [Updated] (SPARK-3468) WebUI Timeline-View feature
[ https://issues.apache.org/jira/browse/SPARK-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3468: -- Affects Version/s: 1.4.0 WebUI Timeline-View feature --- Key: SPARK-3468 URL: https://issues.apache.org/jira/browse/SPARK-3468 Project: Spark Issue Type: New Feature Components: Web UI Affects Versions: 1.4.0 Reporter: Kousuke Saruta Attachments: executors.png, stage-timeline.png, stages.png, taskDetails.png, tasks.png I sometimes troubleshoot and analyse the causes of long-running jobs. In those cases, I first find the stages that take a long time or fail, then find the tasks that take a long time or fail, and finally analyse the proportion of each phase within a task. In another case, I find executors that spend a long time running a task and analyse the details of that task. In such situations, I think it is helpful to visualize a timeline view of stages / tasks / executors, and to visualize the proportion of each activity within a task. I'm now developing prototypes like the captures I attached. I'll integrate these viewers into the WebUI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3468) WebUI Timeline-View feature
[ https://issues.apache.org/jira/browse/SPARK-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3468: -- Target Version/s: 1.4.0 WebUI Timeline-View feature --- Key: SPARK-3468 URL: https://issues.apache.org/jira/browse/SPARK-3468 Project: Spark Issue Type: New Feature Components: Web UI Affects Versions: 1.4.0 Reporter: Kousuke Saruta Attachments: executors.png, stage-timeline.png, stages.png, taskDetails.png, tasks.png I sometimes troubleshoot and analyse the causes of long-running jobs. In those cases, I first find the stages that take a long time or fail, then find the tasks that take a long time or fail, and finally analyse the proportion of each phase within a task. In another case, I find executors that spend a long time running a task and analyse the details of that task. In such situations, I think it is helpful to visualize a timeline view of stages / tasks / executors, and to visualize the proportion of each activity within a task. I'm now developing prototypes like the captures I attached. I'll integrate these viewers into the WebUI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6660) MLLibPythonAPI.pythonToJava doesn't recognize object arrays
[ https://issues.apache.org/jira/browse/SPARK-6660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6660: --- Assignee: Apache Spark (was: Xiangrui Meng) MLLibPythonAPI.pythonToJava doesn't recognize object arrays --- Key: SPARK-6660 URL: https://issues.apache.org/jira/browse/SPARK-6660 Project: Spark Issue Type: Bug Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Apache Spark Priority: Critical
{code}
points = MLUtils.loadLabeledPoints(sc, ...)
_to_java_object_rdd(points).count()
{code}
throws exception
{code}
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-22-5b481e99a111> in <module>()
----> 1 jrdd.count()

/home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538                 self.target_id, self.name)
    539
    540         for temp_arg in temp_args:

/home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    298             raise Py4JJavaError(
    299                 'An error occurred while calling {0}{1}{2}.\n'.
--> 300                 format(target_id, '.', name), value)
    301         else:
    302             raise Py4JError(

Py4JJavaError: An error occurred while calling o510.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 18 in stage 114.0 failed 4 times, most recent failure: Lost task 18.3 in stage 114.0 (TID 1133, ip-10-0-130-35.us-west-2.compute.internal): java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.util.ArrayList
	at org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:1090)
	at org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:1087)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
	at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1472)
	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1006)
	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1006)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1497)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1497)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
	at org.apache.spark.scheduler.Task.run(Task.scala:64)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6662) Allow variable substitution in spark.yarn.historyServer.address
[ https://issues.apache.org/jira/browse/SPARK-6662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391825#comment-14391825 ] Apache Spark commented on SPARK-6662: - User 'piaozhexiu' has created a pull request for this issue: https://github.com/apache/spark/pull/5321 Allow variable substitution in spark.yarn.historyServer.address --- Key: SPARK-6662 URL: https://issues.apache.org/jira/browse/SPARK-6662 Project: Spark Issue Type: Wish Components: YARN Affects Versions: 1.3.0 Reporter: Cheolsoo Park Priority: Minor Labels: yarn In Spark on YARN, an explicit hostname and port number need to be set for spark.yarn.historyServer.address in SparkConf to make the HISTORY link work. If the history server address is known and static, this is usually not a problem. But in the cloud, that is usually not true. Particularly in EMR, the history server always runs on the same node as the RM, so I could simply set it to {{$\{yarn.resourcemanager.hostname\}:18080}} if variable substitution were allowed. In fact, Hadoop configuration already implements variable substitution, so if this property is read via YarnConf, this is easily achievable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6662) Allow variable substitution in spark.yarn.historyServer.address
[ https://issues.apache.org/jira/browse/SPARK-6662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6662: --- Assignee: Apache Spark Allow variable substitution in spark.yarn.historyServer.address --- Key: SPARK-6662 URL: https://issues.apache.org/jira/browse/SPARK-6662 Project: Spark Issue Type: Wish Components: YARN Affects Versions: 1.3.0 Reporter: Cheolsoo Park Assignee: Apache Spark Priority: Minor Labels: yarn In Spark on YARN, an explicit hostname and port number need to be set for spark.yarn.historyServer.address in SparkConf to make the HISTORY link work. If the history server address is known and static, this is usually not a problem. But in the cloud, that is usually not true. Particularly in EMR, the history server always runs on the same node as the RM, so I could simply set it to {{$\{yarn.resourcemanager.hostname\}:18080}} if variable substitution were allowed. In fact, Hadoop configuration already implements variable substitution, so if this property is read via YarnConf, this is easily achievable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
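[Editor's note] To illustrate the mechanism the reporter points to, a sketch under the assumption that the property is routed through a Hadoop Configuration (this is not the patch itself; the hostname value is illustrative):
{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration

// Hadoop's Configuration.get() expands ${...} references against other
// configuration entries, so reading the address through a YarnConfiguration
// resolves the RM hostname automatically.
val yarnConf = new YarnConfiguration()
yarnConf.set("yarn.resourcemanager.hostname", "rm-host.example.com")
yarnConf.set("spark.yarn.historyServer.address", "${yarn.resourcemanager.hostname}:18080")
println(yarnConf.get("spark.yarn.historyServer.address"))
// prints: rm-host.example.com:18080
{code}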
[jira] [Comment Edited] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390235#comment-14390235 ] liyunzhang_intel edited comment on SPARK-5682 at 4/2/15 1:34 AM: - Hi all: There are now two methods to implement SPARK-5682 (Add encrypted shuffle in spark). Method 1: use [Chimera|https://github.com/intel-hadoop/chimera] (Chimera is a project which strips the code related to CryptoInputStream/CryptoOutputStream out of Hadoop to facilitate AES-NI based data encryption in other projects) to implement Spark encrypted shuffle. Pull request: https://github.com/apache/spark/pull/5307. Method 2: add a crypto package in the spark-core module, with CryptoInputStream.scala, CryptoOutputStream.scala and so on in this package. Pull request: https://github.com/apache/spark/pull/4491. The latest design doc, Design Document of Encrypted Spark Shuffle_20150402, has been submitted. Which one is better? Any advice/guidance is welcome! Add encrypted shuffle in spark -- Key: SPARK-5682 URL: https://issues.apache.org/jira/browse/SPARK-5682 Project: Spark Issue Type: New Feature Components: Shuffle Reporter: liyunzhang_intel Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx, Design Document of Encrypted Spark Shuffle_20150318.docx, Design Document of Encrypted Spark Shuffle_20150402.docx Encrypted shuffle is enabled in Hadoop 2.6, making the shuffle data path safer; this feature is also needed in Spark. AES is a specification for the encryption of electronic data. There are 5 common modes in AES; CTR is one of them. We use two codecs, JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec, to enable Spark encrypted shuffle; the same codecs are used in Hadoop encrypted shuffle. JceAesCtrCryptoCodec uses the encryption algorithms the JDK provides, while OpensslAesCtrCryptoCodec uses the encryption algorithms OpenSSL provides. Because UGI credential info is used in the process of encrypted shuffle, we first enable encrypted shuffle on the Spark-on-YARN framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
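[Editor's note] For readers unfamiliar with AES/CTR stream wrapping, a self-contained illustration using plain JDK APIs, the same cipher mode the proposal's codecs use. This is not code from either pull request; the key and IV are throwaway demo values:
{code}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import javax.crypto.{Cipher, CipherInputStream, CipherOutputStream}
import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}

// Demo-only 128-bit key and IV; real code must generate these securely
// and never reuse an IV with the same key in CTR mode.
val key = new SecretKeySpec(Array.fill[Byte](16)(1), "AES")
val iv  = new IvParameterSpec(Array.fill[Byte](16)(2))

// Encrypt: wrap the output stream, exactly how a shuffle writer could be wrapped.
val enc = Cipher.getInstance("AES/CTR/NoPadding")
enc.init(Cipher.ENCRYPT_MODE, key, iv)
val buf = new ByteArrayOutputStream()
val out = new CipherOutputStream(buf, enc)
out.write("shuffle bytes".getBytes("UTF-8"))
out.close()

// Decrypt: wrap the input stream with the same key and IV on the read side.
val dec = Cipher.getInstance("AES/CTR/NoPadding")
dec.init(Cipher.DECRYPT_MODE, key, iv)
val in = new CipherInputStream(new ByteArrayInputStream(buf.toByteArray), dec)
val plain = scala.io.Source.fromInputStream(in, "UTF-8").mkString
// plain == "shuffle bytes"
{code}
The Chimera/OpenSSL codec discussed above targets the same stream-wrapping shape but with AES-NI acceleration.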
[jira] [Closed] (SPARK-6658) Incorrect DataFrame Documentation Type References
[ https://issues.apache.org/jira/browse/SPARK-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chet Mancini closed SPARK-6658. --- Incorrect DataFrame Documentation Type References - Key: SPARK-6658 URL: https://issues.apache.org/jira/browse/SPARK-6658 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 1.3.0 Reporter: Chet Mancini Priority: Trivial Labels: documentation Original Estimate: 5m Remaining Estimate: 5m A few methods under DataFrame incorrectly refer to the receiver as an RDD in their documentation. * createJDBCTable * insertIntoJDBC * registerTempTable -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org