[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Rahul Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390169#comment-14390169
 ] 

Rahul Kumar commented on SPARK-6646:


Love this idea. What about a private cloud in your pocket? :-) Store data on the 
smartphone, do the processing on it, and run a small mobile web server that 
powers cool visualization reports. A lot of the time our smartphones are idle, 
so we can share resources :-) 4 GB RAM, a quad-core processor, and an LTE 
network: not bad for a single node in a cluster.

 Spark 2.0: Rearchitecting Spark for Mobile Platforms
 

 Key: SPARK-6646
 URL: https://issues.apache.org/jira/browse/SPARK-6646
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker
 Attachments: Spark on Mobile - Design Doc - v1.pdf


 Mobile computing is quickly rising to dominance, and by the end of 2017, it 
 is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
 Spark’s project goal can be accomplished only when Spark runs efficiently for 
 the growing population of mobile users.
 Designed and optimized for modern data centers and Big Data applications, 
 Spark is unfortunately not a good fit for mobile computing today. In the past 
 few months, we have been prototyping the feasibility of a mobile-first Spark 
 architecture, and today we would like to share with you our findings. This 
 ticket outlines the technical design of Spark’s mobile support, and shares 
 results from several early prototypes.
 Mobile friendly version of the design doc: 
 https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6644) [SPARK-SQL]when the partition schema does not match table schema(ADD COLUMN), new column value is NULL

2015-04-01 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6644:
--
Description: 
In Hive, the schema of a partition may differ from the table schema. For 
example, we may add new columns to the table after importing existing 
partitions. When using {{spark-sql}} to query the data in a partition whose 
schema is different from the table schema, problems may arise. Part of them 
have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. 
However, after adding new column(s) to the table, when inserting data into old 
partitions, values of newly added columns are all {{NULL}}.

The following snippet can be used to reproduce this issue:
{code}
case class TestData(key: Int, value: String)

val testData = TestHive.sparkContext.parallelize((1 to 2).map(i => TestData(i, 
i.toString))).toDF()
testData.registerTempTable("testData")

sql("DROP TABLE IF EXISTS table_with_partition")
sql(s"CREATE TABLE IF NOT EXISTS table_with_partition(key int, value string) 
PARTITIONED by (ds string) location '${tmpDir.toURI.toString}'")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT 
key, value FROM testData")

// Add new columns to the table
sql("ALTER TABLE table_with_partition ADD COLUMNS(key1 string)")
sql("ALTER TABLE table_with_partition ADD COLUMNS(destlng double)")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT 
key, value, 'test', 1.11 FROM testData")

sql("SELECT * FROM table_with_partition WHERE ds = 
'1'").collect().foreach(println)
{code}
Actual result:
{noformat}
[1,1,null,null,1]
[2,2,null,null,1]
{noformat}
Expected result:
{noformat}
[1,1,test,1.11,1]
[2,2,test,1.11,1]
{noformat}

  was:
In Hive, the schema of a partition may differ from the table schema. For 
example, we may add a new column. Problems arise when we use spark-sql to 
query the data of a partition whose schema differs from the table schema.
Some problems have been solved in PR #4289 
(https://github.com/apache/spark/pull/4289), 
but if we add a new column and then put new data into the old partition, the 
new column's values are all NULL.

[Steps to reproduce]:
--
case class TestData(key: Int, value: String)

val testData = TestHive.sparkContext.parallelize((1 to 2).map(i => TestData(i, 
i.toString))).toDF()
testData.registerTempTable("testData")

sql("DROP TABLE IF EXISTS table_with_partition")
sql(s"CREATE TABLE IF NOT EXISTS table_with_partition(key int, value string) 
PARTITIONED by (ds string) location '${tmpDir.toURI.toString}'")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT 
key, value FROM testData")

// Add new columns to the table
sql("ALTER TABLE table_with_partition ADD COLUMNS(key1 string)")
sql("ALTER TABLE table_with_partition ADD COLUMNS(destlng double)")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT 
key, value, 'test', 1.11 FROM testData")

sql("SELECT * FROM table_with_partition WHERE ds = 
'1'").collect().foreach(println)

-
Actual result: 
[1,1,null,null,1]
[2,2,null,null,1]
 
Expected result:
[1,1,test,1.11,1]
[2,2,test,1.11,1]

This bug also causes wrong counts when we run a query such as: 

select count(1) from table_with_partition where key1 is not NULL


 [SPARK-SQL]when the partition schema does not match table schema(ADD COLUMN), 
 new column value is NULL
 --

 Key: SPARK-6644
 URL: https://issues.apache.org/jira/browse/SPARK-6644
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: dongxu

 In Hive, the schema of a partition may differ from the table schema. For 
 example, we may add new columns to the table after importing existing 
 partitions. When using {{spark-sql}} to query the data in a partition whose 
 schema is different from the table schema, problems may arise. Part of them 
 have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. 
 However, after adding new column(s) to the table, when inserting data into 
 old partitions, values of newly added columns are all {{NULL}}.
 The following snippet can be used to reproduce this issue:
 {code}
 case class TestData(key: Int, value: String)
 val testData = TestHive.sparkContext.parallelize((1 to 2).map(i => 
 TestData(i, i.toString))).toDF()
 testData.registerTempTable("testData")
 sql("DROP TABLE IF EXISTS table_with_partition")
 sql(s"CREATE TABLE IF NOT EXISTS table_with_partition(key int, value string) 
 PARTITIONED by (ds string) location '${tmpDir.toURI.toString}'")
 sql(INSERT OVERWRITE TABLE table_with_partition 

[jira] [Updated] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6646:
---
Description: 
Mobile computing is quickly rising to dominance, and by the end of 2017, it is 
estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s 
project goal can be accomplished only when Spark runs efficiently for the 
growing population of mobile users.

Designed and optimized for modern data centers and Big Data applications, Spark 
is unfortunately not a good fit for mobile computing today. In the past few 
months, we have been prototyping the feasibility of a mobile-first Spark 
architecture, and today we would like to share with you our findings. This post 
outlines the technical design of Spark’s mobile support, and shares results 
from several early prototypes. See also SPARK-6646 for community discussion on 
the issue.


  was:Design doc to come ...


 Spark 2.0: Rearchitecting Spark for Mobile Platforms
 

 Key: SPARK-6646
 URL: https://issues.apache.org/jira/browse/SPARK-6646
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker
 Attachments: Spark on Mobile - Design Doc - v1.pdf


 Mobile computing is quickly rising to dominance, and by the end of 2017, it 
 is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
 Spark’s project goal can be accomplished only when Spark runs efficiently for 
 the growing population of mobile users.
 Designed and optimized for modern data centers and Big Data applications, 
 Spark is unfortunately not a good fit for mobile computing today. In the past 
 few months, we have been prototyping the feasibility of a mobile-first Spark 
 architecture, and today we would like to share with you our findings. This 
 post outlines the technical design of Spark’s mobile support, and shares 
 results from several early prototypes. See also SPARK-6646 for community 
 discussion on the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6646:
---
Attachment: Spark on Mobile - Design Doc - v1.pdf

 Spark 2.0: Rearchitecting Spark for Mobile Platforms
 

 Key: SPARK-6646
 URL: https://issues.apache.org/jira/browse/SPARK-6646
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker
 Attachments: Spark on Mobile - Design Doc - v1.pdf


 Design doc to come ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6640) Executor may connect to HeartbeartReceiver before it's setup in the driver side

2015-04-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6640:
---

Assignee: Apache Spark

 Executor may connect to HeartbeartReceiver before it's setup in the driver 
 side
 ---

 Key: SPARK-6640
 URL: https://issues.apache.org/jira/browse/SPARK-6640
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Shixiong Zhu
Assignee: Apache Spark

 Here is the current code about starting LocalBackend and creating 
 HeartbeatReceiver:
 {code}
   // Create and start the scheduler
   private[spark] var (schedulerBackend, taskScheduler) =
 SparkContext.createTaskScheduler(this, master)
   private val heartbeatReceiver = env.actorSystem.actorOf(
 Props(new HeartbeatReceiver(this, taskScheduler)), "HeartbeatReceiver")
 {code}
 When LocalBackend is created, it starts `LocalActor`, `LocalActor` creates the 
 Executor, and the Executor's constructor looks up `HeartbeatReceiver`.
 So we should make sure this line:
 {code}
 private val heartbeatReceiver = env.actorSystem.actorOf(
 Props(new HeartbeatReceiver(this, taskScheduler)), "HeartbeatReceiver")
 {code}
 runs before `LocalActor` is created.
 However, the current code cannot guarantee that, so creating the Executor 
 sometimes crashes. The issue was reported by sparkdi shopaddr1...@dubna.us in 
 http://apache-spark-user-list.1001560.n3.nabble.com/Actor-not-found-td22265.html#a22324
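A minimal, self-contained sketch of the ordering hazard described above, using toy Akka actor names (this is not Spark's code nor the fix in the linked pull request): the name lookup only succeeds if the actor is registered before anyone tries to resolve it, which is exactly the guarantee SparkContext currently lacks for HeartbeatReceiver.
{code}
import akka.actor.{Actor, ActorSystem, Props}
import scala.concurrent.Await
import scala.concurrent.duration._

// Stand-in for HeartbeatReceiver: acknowledges whatever it receives.
class Heartbeat extends Actor {
  def receive = { case msg => sender() ! s"ack: $msg" }
}

object OrderingDemo extends App {
  val system = ActorSystem("demo")

  // Correct order: register the actor first ...
  system.actorOf(Props[Heartbeat], "heartbeat")

  // ... and only then let dependents (the Executor, in Spark's case) resolve
  // it by name. Resolving before registration fails with "Actor not found",
  // matching the report linked above.
  val ref = Await.result(
    system.actorSelection("/user/heartbeat").resolveOne(3.seconds), 5.seconds)
  println(s"resolved $ref")

  system.shutdown()  // Akka 2.3-era API; newer Akka uses terminate()
}
{code}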



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6640) Executor may connect to HeartbeartReceiver before it's setup in the driver side

2015-04-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6640:
---

Assignee: (was: Apache Spark)

 Executor may connect to HeartbeartReceiver before it's setup in the driver 
 side
 ---

 Key: SPARK-6640
 URL: https://issues.apache.org/jira/browse/SPARK-6640
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Shixiong Zhu

 Here is the current code about starting LocalBackend and creating 
 HeartbeatReceiver:
 {code}
   // Create and start the scheduler
   private[spark] var (schedulerBackend, taskScheduler) =
 SparkContext.createTaskScheduler(this, master)
   private val heartbeatReceiver = env.actorSystem.actorOf(
 Props(new HeartbeatReceiver(this, taskScheduler)), "HeartbeatReceiver")
 {code}
 When LocalBackend is created, it starts `LocalActor`, `LocalActor` creates the 
 Executor, and the Executor's constructor looks up `HeartbeatReceiver`.
 So we should make sure this line:
 {code}
 private val heartbeatReceiver = env.actorSystem.actorOf(
 Props(new HeartbeatReceiver(this, taskScheduler)), "HeartbeatReceiver")
 {code}
 runs before `LocalActor` is created.
 However, the current code cannot guarantee that, so creating the Executor 
 sometimes crashes. The issue was reported by sparkdi shopaddr1...@dubna.us in 
 http://apache-spark-user-list.1001560.n3.nabble.com/Actor-not-found-td22265.html#a22324



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6640) Executor may connect to HeartbeartReceiver before it's setup in the driver side

2015-04-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390090#comment-14390090
 ] 

Apache Spark commented on SPARK-6640:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/5306

 Executor may connect to HeartbeartReceiver before it's setup in the driver 
 side
 ---

 Key: SPARK-6640
 URL: https://issues.apache.org/jira/browse/SPARK-6640
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Shixiong Zhu

 Here is the current code about starting LocalBackend and creating 
 HeartbeatReceiver:
 {code}
   // Create and start the scheduler
   private[spark] var (schedulerBackend, taskScheduler) =
 SparkContext.createTaskScheduler(this, master)
   private val heartbeatReceiver = env.actorSystem.actorOf(
 Props(new HeartbeatReceiver(this, taskScheduler)), "HeartbeatReceiver")
 {code}
 When LocalBackend is created, it starts `LocalActor`, `LocalActor` creates the 
 Executor, and the Executor's constructor looks up `HeartbeatReceiver`.
 So we should make sure this line:
 {code}
 private val heartbeatReceiver = env.actorSystem.actorOf(
 Props(new HeartbeatReceiver(this, taskScheduler)), "HeartbeatReceiver")
 {code}
 runs before `LocalActor` is created.
 However, the current code cannot guarantee that, so creating the Executor 
 sometimes crashes. The issue was reported by sparkdi shopaddr1...@dubna.us in 
 http://apache-spark-user-list.1001560.n3.nabble.com/Actor-not-found-td22265.html#a22324



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390153#comment-14390153
 ] 

Sandy Ryza commented on SPARK-6646:
---

This seems like a good opportunity to finally add a DataFrame 
registerTempTablet API.

 Spark 2.0: Rearchitecting Spark for Mobile Platforms
 

 Key: SPARK-6646
 URL: https://issues.apache.org/jira/browse/SPARK-6646
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker
 Attachments: Spark on Mobile - Design Doc - v1.pdf


 Mobile computing is quickly rising to dominance, and by the end of 2017, it 
 is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
 Spark’s project goal can be accomplished only when Spark runs efficiently for 
 the growing population of mobile users.
 Designed and optimized for modern data centers and Big Data applications, 
 Spark is unfortunately not a good fit for mobile computing today. In the past 
 few months, we have been prototyping the feasibility of a mobile-first Spark 
 architecture, and today we would like to share with you our findings. This 
 ticket outlines the technical design of Spark’s mobile support, and shares 
 results from several early prototypes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390160#comment-14390160
 ] 

Yu Ishikawa commented on SPARK-6646:


That sounds very interesting! We should support deploying a trained machine 
learning model to a smartphone. :)

 Spark 2.0: Rearchitecting Spark for Mobile Platforms
 

 Key: SPARK-6646
 URL: https://issues.apache.org/jira/browse/SPARK-6646
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker
 Attachments: Spark on Mobile - Design Doc - v1.pdf


 Mobile computing is quickly rising to dominance, and by the end of 2017, it 
 is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
 Spark’s project goal can be accomplished only when Spark runs efficiently for 
 the growing population of mobile users.
 Designed and optimized for modern data centers and Big Data applications, 
 Spark is unfortunately not a good fit for mobile computing today. In the past 
 few months, we have been prototyping the feasibility of a mobile-first Spark 
 architecture, and today we would like to share with you our findings. This 
 ticket outlines the technical design of Spark’s mobile support, and shares 
 results from several early prototypes.
 Mobile friendly version of the design doc: 
 https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390176#comment-14390176
 ] 

Jeremy Freeman commented on SPARK-6646:
---

Very promising, [~tdas]! We should evaluate the performance of streaming machine 
learning algorithms. In general I think running Spark in JavaScript via 
Scala.js and Node.js is extremely appealing; it will make integration with 
visualization very straightforward. 

 Spark 2.0: Rearchitecting Spark for Mobile Platforms
 

 Key: SPARK-6646
 URL: https://issues.apache.org/jira/browse/SPARK-6646
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker
 Attachments: Spark on Mobile - Design Doc - v1.pdf


 Mobile computing is quickly rising to dominance, and by the end of 2017, it 
 is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
 Spark’s project goal can be accomplished only when Spark runs efficiently for 
 the growing population of mobile users.
 Designed and optimized for modern data centers and Big Data applications, 
 Spark is unfortunately not a good fit for mobile computing today. In the past 
 few months, we have been prototyping the feasibility of a mobile-first Spark 
 architecture, and today we would like to share with you our findings. This 
 ticket outlines the technical design of Spark’s mobile support, and shares 
 results from several early prototypes.
 Mobile friendly version of the design doc: 
 https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6631) I am unable to get the Maven Build file in Example 2.13 to build anything but an empty file

2015-04-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390185#comment-14390185
 ] 

Sean Owen commented on SPARK-6631:
--

The Debian packaging was removed; I don't know how much it worked before.
u...@spark.apache.org is appropriate for this kind of question. Here you're 
tacking on to an unrelated JIRA.

 I am unable to get the Maven Build file in Example 2.13 to build anything but 
 an empty file
 ---

 Key: SPARK-6631
 URL: https://issues.apache.org/jira/browse/SPARK-6631
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.3.0
 Environment: Ubuntu 14.04
Reporter: Frank Domoney
Priority: Blocker

 I have downloaded and built spark 1.3.0 under Ubuntu 14.04 but have been 
 unable to get reduceByKey to work on what seems to be a valid RDD using the 
 command line.
 scala> counts.take(10)
 res17: Array[(String, Int)] = Array((Vladimir,1), (Putin,1), (has,1), 
 (said,1), (Russia,1), (will,1), (fight,1), (for,1), (an,1), (independent,1))
 scala> val counts1 = counts.reduceByKey { case (x, y) => x + y }
 scala> counts1.take(10)
 res16: Array[(String, Int)] = Array()
 I am attempting to build the Maven sequence in example 2.15 but get the 
 following results
 Building example 0.0.1
 [INFO] 
 
 [INFO] 
 [INFO] --- maven-resources-plugin:2.3:resources (default-resources) @ 
 learning-spark-mini-example ---
 [WARNING] Using platform encoding (UTF-8 actually) to copy filtered 
 resources, i.e. build is platform dependent!
 [INFO] skip non existing resourceDirectory 
 /home/panzerfrank/Downloads/spark-1.3.0/wordcount/src/main/resources
 [INFO] 
 [INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ 
 learning-spark-mini-example ---
 [INFO] No sources to compile
 [INFO] 
 [INFO] --- maven-resources-plugin:2.3:testResources (default-testResources) @ 
 learning-spark-mini-example ---
 [WARNING] Using platform encoding (UTF-8 actually) to copy filtered 
 resources, i.e. build is platform dependent!
 [INFO] skip non existing resourceDirectory 
 /home/panzerfrank/Downloads/spark-1.3.0/wordcount/src/test/resources
 [INFO] 
 [INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @ 
 learning-spark-mini-example ---
 [INFO] No sources to compile
 [INFO] 
 [INFO] --- maven-surefire-plugin:2.10:test (default-test) @ 
 learning-spark-mini-example ---
 [INFO] No tests to run.
 [INFO] Surefire report directory: 
 /home/panzerfrank/Downloads/spark-1.3.0/wordcount/target/surefire-reports
  --- maven-jar-plugin:2.2:jar (default-jar) @ learning-spark-mini-example ---
 [WARNING] JAR will be empty - no content was marked for inclusion!
 [INFO] Building jar: 
 /home/panzerfrank/Downloads/spark-1.3.0/wordcount/target/learning-spark-mini-example-0.0.1.jar
 I am using the POM file from Example 2-13. Java is Java 8.
 Am I doing something really stupid?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6646:
---
Description: 
Mobile computing is quickly rising to dominance, and by the end of 2017, it is 
estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s 
project goal can be accomplished only when Spark runs efficiently for the 
growing population of mobile users.

Designed and optimized for modern data centers and Big Data applications, Spark 
is unfortunately not a good fit for mobile computing today. In the past few 
months, we have been prototyping the feasibility of a mobile-first Spark 
architecture, and today we would like to share with you our findings. This 
ticket outlines the technical design of Spark’s mobile support, and shares 
results from several early prototypes.


  was:
Mobile computing is quickly rising to dominance, and by the end of 2017, it is 
estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s 
project goal can be accomplished only when Spark runs efficiently for the 
growing population of mobile users.

Designed and optimized for modern data centers and Big Data applications, Spark 
is unfortunately not a good fit for mobile computing today. In the past few 
months, we have been prototyping the feasibility of a mobile-first Spark 
architecture, and today we would like to share with you our findings. This post 
outlines the technical design of Spark’s mobile support, and shares results 
from several early prototypes. See also SPARK-6646 for community discussion on 
the issue.



 Spark 2.0: Rearchitecting Spark for Mobile Platforms
 

 Key: SPARK-6646
 URL: https://issues.apache.org/jira/browse/SPARK-6646
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker
 Attachments: Spark on Mobile - Design Doc - v1.pdf


 Mobile computing is quickly rising to dominance, and by the end of 2017, it 
 is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
 Spark’s project goal can be accomplished only when Spark runs efficiently for 
 the growing population of mobile users.
 Designed and optimized for modern data centers and Big Data applications, 
 Spark is unfortunately not a good fit for mobile computing today. In the past 
 few months, we have been prototyping the feasibility of a mobile-first Spark 
 architecture, and today we would like to share with you our findings. This 
 ticket outlines the technical design of Spark’s mobile support, and shares 
 results from several early prototypes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390161#comment-14390161
 ] 

Tathagata Das commented on SPARK-6646:
--

I have been working on running NetworkWordCount on our iPhone prototype, and I 
was pleasantly surprised by the performance I was getting. The network 
bandwidth is definitely lower, and there is a higher cost to shuffling data, but 
it's still quite good. The task launch latencies are higher, though, so streaming 
applications will require slightly larger batch sizes. But overall you will be 
surprised. I will post numbers once I have compiled them into graphs. 


 Spark 2.0: Rearchitecting Spark for Mobile Platforms
 

 Key: SPARK-6646
 URL: https://issues.apache.org/jira/browse/SPARK-6646
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker
 Attachments: Spark on Mobile - Design Doc - v1.pdf


 Mobile computing is quickly rising to dominance, and by the end of 2017, it 
 is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
 Spark’s project goal can be accomplished only when Spark runs efficiently for 
 the growing population of mobile users.
 Designed and optimized for modern data centers and Big Data applications, 
 Spark is unfortunately not a good fit for mobile computing today. In the past 
 few months, we have been prototyping the feasibility of a mobile-first Spark 
 architecture, and today we would like to share with you our findings. This 
 ticket outlines the technical design of Spark’s mobile support, and shares 
 results from several early prototypes.
 Mobile friendly version of the design doc: 
 https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Petar Zecevic (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390206#comment-14390206
 ] 

Petar Zecevic commented on SPARK-6646:
--

Good one :)

 Spark 2.0: Rearchitecting Spark for Mobile Platforms
 

 Key: SPARK-6646
 URL: https://issues.apache.org/jira/browse/SPARK-6646
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker
 Attachments: Spark on Mobile - Design Doc - v1.pdf


 Mobile computing is quickly rising to dominance, and by the end of 2017, it 
 is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
 Spark’s project goal can be accomplished only when Spark runs efficiently for 
 the growing population of mobile users.
 Designed and optimized for modern data centers and Big Data applications, 
 Spark is unfortunately not a good fit for mobile computing today. In the past 
 few months, we have been prototyping the feasibility of a mobile-first Spark 
 architecture, and today we would like to share with you our findings. This 
 ticket outlines the technical design of Spark’s mobile support, and shares 
 results from several early prototypes.
 Mobile friendly version of the design doc: 
 https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390183#comment-14390183
 ] 

Sean Owen commented on SPARK-6646:
--

Concept: a smartphone app that lets you find the nearest Spark cluster to join. 
Swipe left/right on photos of the worker nodes to indicate which ones you 
want to join. The only problem is that this *must* be called SparkR to be taken 
seriously, so I think it will have to be rolled into the R library.

 Spark 2.0: Rearchitecting Spark for Mobile Platforms
 

 Key: SPARK-6646
 URL: https://issues.apache.org/jira/browse/SPARK-6646
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker
 Attachments: Spark on Mobile - Design Doc - v1.pdf


 Mobile computing is quickly rising to dominance, and by the end of 2017, it 
 is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
 Spark’s project goal can be accomplished only when Spark runs efficiently for 
 the growing population of mobile users.
 Designed and optimized for modern data centers and Big Data applications, 
 Spark is unfortunately not a good fit for mobile computing today. In the past 
 few months, we have been prototyping the feasibility of a mobile-first Spark 
 architecture, and today we would like to share with you our findings. This 
 ticket outlines the technical design of Spark’s mobile support, and shares 
 results from several early prototypes.
 Mobile friendly version of the design doc: 
 https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6644) After adding new columns to a partitioned table and inserting data to an old partition, data of newly added columns are all NULL

2015-04-01 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6644:
--
Summary: After adding new columns to a partitioned table and inserting data 
to an old partition, data of newly added columns are all NULL  (was: 
[SPARK-SQL]when the partition schema does not match table schema(ADD COLUMN), 
new column value is NULL)

 After adding new columns to a partitioned table and inserting data to an old 
 partition, data of newly added columns are all NULL
 

 Key: SPARK-6644
 URL: https://issues.apache.org/jira/browse/SPARK-6644
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: dongxu

 In Hive, the schema of a partition may differ from the table schema. For 
 example, we may add new columns to the table after importing existing 
 partitions. When using {{spark-sql}} to query the data in a partition whose 
 schema is different from the table schema, problems may arise. Part of them 
 have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. 
 However, after adding new column(s) to the table, when inserting data into 
 old partitions, values of newly added columns are all {{NULL}}.
 The following snippet can be used to reproduce this issue:
 {code}
 case class TestData(key: Int, value: String)
 val testData = TestHive.sparkContext.parallelize((1 to 2).map(i => 
 TestData(i, i.toString))).toDF()
 testData.registerTempTable("testData")
 sql("DROP TABLE IF EXISTS table_with_partition")
 sql(s"CREATE TABLE IF NOT EXISTS table_with_partition(key int, value string) 
 PARTITIONED by (ds string) location '${tmpDir.toURI.toString}'")
 sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT 
 key, value FROM testData")
 // Add new columns to the table
 sql("ALTER TABLE table_with_partition ADD COLUMNS(key1 string)")
 sql("ALTER TABLE table_with_partition ADD COLUMNS(destlng double)")
 sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT 
 key, value, 'test', 1.11 FROM testData")
 sql("SELECT * FROM table_with_partition WHERE ds = 
 '1'").collect().foreach(println)
 {code}
 Actual result:
 {noformat}
 [1,1,null,null,1]
 [2,2,null,null,1]
 {noformat}
 Expected result:
 {noformat}
 [1,1,test,1.11,1]
 [2,2,test,1.11,1]
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4346) YarnClientSchedulerBack.asyncMonitorApplication should be common with Client.monitorApplication

2015-04-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390086#comment-14390086
 ] 

Apache Spark commented on SPARK-4346:
-

User 'Sephiroth-Lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/5305

 YarnClientSchedulerBack.asyncMonitorApplication should be common with 
 Client.monitorApplication
 ---

 Key: SPARK-4346
 URL: https://issues.apache.org/jira/browse/SPARK-4346
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler, YARN
Reporter: Thomas Graves

 The YarnClientSchedulerBackend.asyncMonitorApplication routine should move 
 into ClientBase and be made common with monitorApplication.  Make sure stop 
 is handled properly.
 See discussion on https://github.com/apache/spark/pull/3143



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3596) Support changing the yarn client monitor interval

2015-04-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390087#comment-14390087
 ] 

Apache Spark commented on SPARK-3596:
-

User 'Sephiroth-Lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/5305

 Support changing the yarn client monitor interval 
 --

 Key: SPARK-3596
 URL: https://issues.apache.org/jira/browse/SPARK-3596
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.2.0
Reporter: Thomas Graves

 Right now Spark on YARN has a monitor interval that can be configured via 
 spark.yarn.report.interval. This is how often the client checks with the RM 
 to get status on the running application in cluster mode. We should allow 
 users to set this interval, as some may not need to check so often. There is 
 another JIRA filed to make it so the client doesn't have to stay around for 
 cluster mode.
 With the changes in https://github.com/apache/spark/pull/2350, this further 
 extends to affect client mode. 
 We may want to add specific configs for that. Since the monitorApplication 
 function is now used in several different scenarios, it might actually make 
 sense for it to take the timeout as a parameter; you could want different 
 timeouts for different situations.
 For instance, how quickly we poll on the client side and print information 
 (cluster mode) differs from how quickly we recognize that the application has 
 quit and we want to terminate (client mode). I want the latter to happen 
 quickly, whereas in cluster mode I might not care as much about how often 
 updated info is printed to the screen. I guess it's private, so we could leave 
 it as is and change it if we add support for that later.
 My suggestion for the name would be something like 
 spark.yarn.client.progress.pollinterval. If we were to add separate ones in 
 the future, they could be something like spark.yarn.app.ready.pollinterval 
 and spark.yarn.app.completion.pollinterval.
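For reference, a minimal usage sketch of how such a knob is set from application code. spark.yarn.report.interval (milliseconds) is the existing setting named above; spark.yarn.client.progress.pollinterval is only the name proposed in this ticket and is hypothetical until implemented.
{code}
import org.apache.spark.SparkConf

// Existing setting: how often (ms) the YARN client polls the RM for app status.
val conf = new SparkConf()
  .setAppName("yarn-monitor-interval-demo")
  .set("spark.yarn.report.interval", "5000")
  // Proposed name from this ticket -- hypothetical, not yet implemented:
  .set("spark.yarn.client.progress.pollinterval", "10000")
{code}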



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4346) YarnClientSchedulerBack.asyncMonitorApplication should be common with Client.monitorApplication

2015-04-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4346:
---

Assignee: Apache Spark

 YarnClientSchedulerBack.asyncMonitorApplication should be common with 
 Client.monitorApplication
 ---

 Key: SPARK-4346
 URL: https://issues.apache.org/jira/browse/SPARK-4346
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler, YARN
Reporter: Thomas Graves
Assignee: Apache Spark

 The YarnClientSchedulerBackend.asyncMonitorApplication routine should move 
 into ClientBase and be made common with monitorApplication.  Make sure stop 
 is handled properly.
 See discussion on https://github.com/apache/spark/pull/3143



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4346) YarnClientSchedulerBack.asyncMonitorApplication should be common with Client.monitorApplication

2015-04-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4346:
---

Assignee: (was: Apache Spark)

 YarnClientSchedulerBack.asyncMonitorApplication should be common with 
 Client.monitorApplication
 ---

 Key: SPARK-4346
 URL: https://issues.apache.org/jira/browse/SPARK-4346
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler, YARN
Reporter: Thomas Graves

 The YarnClientSchedulerBackend.asyncMonitorApplication routine should move 
 into ClientBase and be made common with monitorApplication.  Make sure stop 
 is handled properly.
 See discussion on https://github.com/apache/spark/pull/3143



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390158#comment-14390158
 ] 

Reynold Xin commented on SPARK-6646:


[~sandyryza] That's an excellent idea; I hadn't thought of that yet. But now that 
I think about it, there will be a lot of room for optimization using DataFrames 
on tablets.


 Spark 2.0: Rearchitecting Spark for Mobile Platforms
 

 Key: SPARK-6646
 URL: https://issues.apache.org/jira/browse/SPARK-6646
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker
 Attachments: Spark on Mobile - Design Doc - v1.pdf


 Mobile computing is quickly rising to dominance, and by the end of 2017, it 
 is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
 Spark’s project goal can be accomplished only when Spark runs efficiently for 
 the growing population of mobile users.
 Designed and optimized for modern data centers and Big Data applications, 
 Spark is unfortunately not a good fit for mobile computing today. In the past 
 few months, we have been prototyping the feasibility of a mobile-first Spark 
 architecture, and today we would like to share with you our findings. This 
 ticket outlines the technical design of Spark’s mobile support, and shares 
 results from several early prototypes.
 Mobile friendly version of the design doc: 
 https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5682) Add encrypted shuffle in spark

2015-04-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5682:
---

Assignee: (was: Apache Spark)

 Add encrypted shuffle in spark
 --

 Key: SPARK-5682
 URL: https://issues.apache.org/jira/browse/SPARK-5682
 Project: Spark
  Issue Type: New Feature
  Components: Shuffle
Reporter: liyunzhang_intel
 Attachments: Design Document of Encrypted Spark 
 Shuffle_20150209.docx, Design Document of Encrypted Spark 
 Shuffle_20150318.docx


 Encrypted shuffle is enabled in Hadoop 2.6, which makes the shuffle data safer. 
 This feature is necessary in Spark as well. AES is a specification for the 
 encryption of electronic data. There are 5 common modes in AES, and CTR is 
 one of them. We use two codecs, JceAesCtrCryptoCodec and 
 OpensslAesCtrCryptoCodec, to enable Spark encrypted shuffle; they are also used 
 in Hadoop encrypted shuffle. JceAesCtrCryptoCodec uses the encryption 
 algorithms the JDK provides, while OpensslAesCtrCryptoCodec uses the encryption 
 algorithms OpenSSL provides. 
 Because UGI credential info is used in the process of encrypted shuffle, we 
 first enable encrypted shuffle on the Spark-on-YARN framework.
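To make the stream-wrapping idea concrete, here is a minimal, self-contained sketch using the JDK's own AES/CTR cipher streams. It is an illustration only, not the JceAesCtrCryptoCodec/OpensslAesCtrCryptoCodec code from the attached design docs, and real key material would come from the UGI credentials mentioned above rather than being generated locally.
{code}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.security.SecureRandom
import javax.crypto.{Cipher, CipherInputStream, CipherOutputStream}
import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}

object AesCtrStreamDemo extends App {
  // 128-bit key and 16-byte IV; in a real deployment these would be
  // distributed via credentials, not generated on the spot.
  val key = new Array[Byte](16); val iv = new Array[Byte](16)
  val rng = new SecureRandom(); rng.nextBytes(key); rng.nextBytes(iv)
  val keySpec = new SecretKeySpec(key, "AES")
  val ivSpec  = new IvParameterSpec(iv)

  def cipher(mode: Int): Cipher = {
    val c = Cipher.getInstance("AES/CTR/NoPadding")  // JDK-provided AES CTR
    c.init(mode, keySpec, ivSpec)
    c
  }

  // Wrap the "shuffle" output stream so bytes are encrypted on the way out.
  val sink = new ByteArrayOutputStream()
  val out  = new CipherOutputStream(sink, cipher(Cipher.ENCRYPT_MODE))
  out.write("shuffle block payload".getBytes("UTF-8")); out.close()

  // Wrap the input stream so bytes are decrypted on the way back in.
  val in = new CipherInputStream(
    new ByteArrayInputStream(sink.toByteArray), cipher(Cipher.DECRYPT_MODE))
  println(scala.io.Source.fromInputStream(in, "UTF-8").mkString)
}
{code}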



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5682) Add encrypted shuffle in spark

2015-04-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390157#comment-14390157
 ] 

Apache Spark commented on SPARK-5682:
-

User 'kellyzly' has created a pull request for this issue:
https://github.com/apache/spark/pull/5307

 Add encrypted shuffle in spark
 --

 Key: SPARK-5682
 URL: https://issues.apache.org/jira/browse/SPARK-5682
 Project: Spark
  Issue Type: New Feature
  Components: Shuffle
Reporter: liyunzhang_intel
 Attachments: Design Document of Encrypted Spark 
 Shuffle_20150209.docx, Design Document of Encrypted Spark 
 Shuffle_20150318.docx


 Encrypted shuffle is enabled in Hadoop 2.6, which makes the shuffle data safer. 
 This feature is necessary in Spark as well. AES is a specification for the 
 encryption of electronic data. There are 5 common modes in AES, and CTR is 
 one of them. We use two codecs, JceAesCtrCryptoCodec and 
 OpensslAesCtrCryptoCodec, to enable Spark encrypted shuffle; they are also used 
 in Hadoop encrypted shuffle. JceAesCtrCryptoCodec uses the encryption 
 algorithms the JDK provides, while OpensslAesCtrCryptoCodec uses the encryption 
 algorithms OpenSSL provides. 
 Because UGI credential info is used in the process of encrypted shuffle, we 
 first enable encrypted shuffle on the Spark-on-YARN framework.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5682) Add encrypted shuffle in spark

2015-04-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5682:
---

Assignee: Apache Spark

 Add encrypted shuffle in spark
 --

 Key: SPARK-5682
 URL: https://issues.apache.org/jira/browse/SPARK-5682
 Project: Spark
  Issue Type: New Feature
  Components: Shuffle
Reporter: liyunzhang_intel
Assignee: Apache Spark
 Attachments: Design Document of Encrypted Spark 
 Shuffle_20150209.docx, Design Document of Encrypted Spark 
 Shuffle_20150318.docx


 Encrypted shuffle is enabled in Hadoop 2.6, which makes the shuffle data safer. 
 This feature is necessary in Spark as well. AES is a specification for the 
 encryption of electronic data. There are 5 common modes in AES, and CTR is 
 one of them. We use two codecs, JceAesCtrCryptoCodec and 
 OpensslAesCtrCryptoCodec, to enable Spark encrypted shuffle; they are also used 
 in Hadoop encrypted shuffle. JceAesCtrCryptoCodec uses the encryption 
 algorithms the JDK provides, while OpensslAesCtrCryptoCodec uses the encryption 
 algorithms OpenSSL provides. 
 Because UGI credential info is used in the process of encrypted shuffle, we 
 first enable encrypted shuffle on the Spark-on-YARN framework.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6613) Starting stream from checkpoint causes Streaming tab to throw error

2015-04-01 Thread zhichao-li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390221#comment-14390221
 ] 

zhichao-li commented on SPARK-6613:
---

[~msoutier], have you found any solution for this, or did you just report the bug?

 Starting stream from checkpoint causes Streaming tab to throw error
 ---

 Key: SPARK-6613
 URL: https://issues.apache.org/jira/browse/SPARK-6613
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.1
Reporter: Marius Soutier

 When continuing my streaming job from a checkpoint, the job runs, but the 
 Streaming tab in the standard UI initially no longer works (browser just 
 shows HTTP ERROR: 500). Sometimes  it gets back to normal after a while, and 
 sometimes it stays in this state permanently.
 Stacktrace:
 WARN org.eclipse.jetty.servlet.ServletHandler: /streaming/
 java.util.NoSuchElementException: key not found: 0
   at scala.collection.MapLike$class.default(MapLike.scala:228)
   at scala.collection.AbstractMap.default(Map.scala:58)
   at scala.collection.MapLike$class.apply(MapLike.scala:141)
   at scala.collection.AbstractMap.apply(Map.scala:58)
   at 
 org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1$$anonfun$apply$5.apply(StreamingJobProgressListener.scala:151)
   at 
 org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1$$anonfun$apply$5.apply(StreamingJobProgressListener.scala:150)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.Range.foreach(Range.scala:141)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1.apply(StreamingJobProgressListener.scala:150)
   at 
 org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1.apply(StreamingJobProgressListener.scala:149)
   at scala.Option.map(Option.scala:145)
   at 
 org.apache.spark.streaming.ui.StreamingJobProgressListener.lastReceivedBatchRecords(StreamingJobProgressListener.scala:149)
   at 
 org.apache.spark.streaming.ui.StreamingPage.generateReceiverStats(StreamingPage.scala:82)
   at 
 org.apache.spark.streaming.ui.StreamingPage.render(StreamingPage.scala:43)
   at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68)
   at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68)
   at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:68)
   at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
   at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
   at 
 org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
   at 
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
   at 
 org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
   at 
 org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
   at 
 org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
   at 
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
   at 
 org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
   at 
 org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
   at org.eclipse.jetty.server.Server.handle(Server.java:370)
   at 
 org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
   at 
 org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
   at 
 org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
   at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644)
   at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
   at 
 org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
   at 
 org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
   at 
 org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
   at 
 org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
   at 
 org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
   at java.lang.Thread.run(Thread.java:745)
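 The trace shows a bare Map apply inside StreamingJobProgressListener.lastReceivedBatchRecords failing when a receiver id has no entry yet (for example, right after recovering from a checkpoint). A minimal sketch of the defensive-lookup idea, with simplified, hypothetical types rather than the actual listener code:
 {code}
 // Simplified model of the listener state: records received per receiver id.
 // After recovering from a checkpoint, some receiver ids may have no entry yet,
 // so a bare map(id) throws java.util.NoSuchElementException ("key not found").
 object LastBatchRecordsDemo extends App {
   val receivedRecords: Map[Int, Long] = Map(1 -> 42L)   // receiver 0 missing
   val numReceivers = 2

   // Buggy pattern (what the stack trace points at): receivedRecords(id)
   // Defensive pattern: fall back to 0 records for unknown receivers.
   val lastReceivedBatchRecords: Map[Int, Long] =
     (0 until numReceivers).map(id => id -> receivedRecords.getOrElse(id, 0L)).toMap

   println(lastReceivedBatchRecords)  // Map(0 -> 0, 1 -> 42)
 }
 {code}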

[jira] [Updated] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6646:
---
Description: 
Mobile computing is quickly rising to dominance, and by the end of 2017, it is 
estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s 
project goal can be accomplished only when Spark runs efficiently for the 
growing population of mobile users.

Designed and optimized for modern data centers and Big Data applications, Spark 
is unfortunately not a good fit for mobile computing today. In the past few 
months, we have been prototyping the feasibility of a mobile-first Spark 
architecture, and today we would like to share with you our findings. This 
ticket outlines the technical design of Spark’s mobile support, and shares 
results from several early prototypes.

Mobile friendly version of the design doc: 
https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html

  was:
Mobile computing is quickly rising to dominance, and by the end of 2017, it is 
estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s 
project goal can be accomplished only when Spark runs efficiently for the 
growing population of mobile users.

Designed and optimized for modern data centers and Big Data applications, Spark 
is unfortunately not a good fit for mobile computing today. In the past few 
months, we have been prototyping the feasibility of a mobile-first Spark 
architecture, and today we would like to share with you our findings. This 
ticket outlines the technical design of Spark’s mobile support, and shares 
results from several early prototypes.



 Spark 2.0: Rearchitecting Spark for Mobile Platforms
 

 Key: SPARK-6646
 URL: https://issues.apache.org/jira/browse/SPARK-6646
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker
 Attachments: Spark on Mobile - Design Doc - v1.pdf


 Mobile computing is quickly rising to dominance, and by the end of 2017, it 
 is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
 Spark’s project goal can be accomplished only when Spark runs efficiently for 
 the growing population of mobile users.
 Designed and optimized for modern data centers and Big Data applications, 
 Spark is unfortunately not a good fit for mobile computing today. In the past 
 few months, we have been prototyping the feasibility of a mobile-first Spark 
 architecture, and today we would like to share with you our findings. This 
 ticket outlines the technical design of Spark’s mobile support, and shares 
 results from several early prototypes.
 Mobile friendly version of the design doc: 
 https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5989) Model import/export for LDAModel

2015-04-01 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390209#comment-14390209
 ] 

Manoj Kumar commented on SPARK-5989:


Can this be assigned to me? Thanks!

 Model import/export for LDAModel
 

 Key: SPARK-5989
 URL: https://issues.apache.org/jira/browse/SPARK-5989
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 Add save/load for LDAModel and its local and distributed variants.
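 A hypothetical usage sketch of the requested API, assuming it follows the save/load convention of other MLlib models; the save and load calls below are exactly the feature this sub-task asks for, so their signatures are illustrative, not final:
 {code}
 import org.apache.spark.SparkContext
 import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
 import org.apache.spark.mllib.linalg.Vectors

 object LdaSaveLoadSketch {
   def run(sc: SparkContext): Unit = {
     // Tiny toy corpus: (docId, termCountVector)
     val corpus = sc.parallelize(Seq(
       (0L, Vectors.dense(1.0, 2.0, 0.0)),
       (1L, Vectors.dense(0.0, 1.0, 3.0))))

     val model = new LDA().setK(2).run(corpus).asInstanceOf[DistributedLDAModel]

     // Proposed (hypothetical at the time of this ticket) persistence calls:
     model.save(sc, "hdfs:///tmp/lda-model")
     val restored = DistributedLDAModel.load(sc, "hdfs:///tmp/lda-model")
     println(restored.k)
   }
 }
 {code}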



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Kamal Banga (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390208#comment-14390208
 ] 

Kamal Banga commented on SPARK-6646:


We want Spark for Apple Watch. That will be the real breakthrough!

 Spark 2.0: Rearchitecting Spark for Mobile Platforms
 

 Key: SPARK-6646
 URL: https://issues.apache.org/jira/browse/SPARK-6646
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker
 Attachments: Spark on Mobile - Design Doc - v1.pdf


 Mobile computing is quickly rising to dominance, and by the end of 2017, it 
 is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
 Spark’s project goal can be accomplished only when Spark runs efficiently for 
 the growing population of mobile users.
 Designed and optimized for modern data centers and Big Data applications, 
 Spark is unfortunately not a good fit for mobile computing today. In the past 
 few months, we have been prototyping the feasibility of a mobile-first Spark 
 architecture, and today we would like to share with you our findings. This 
 ticket outlines the technical design of Spark’s mobile support, and shares 
 results from several early prototypes.
 Mobile friendly version of the design doc: 
 https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5682) Add encrypted shuffle in spark

2015-04-01 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390235#comment-14390235
 ] 

liyunzhang_intel commented on SPARK-5682:
-

Hi all:
  There are now two candidate approaches for implementing SPARK-5682 (Add encrypted shuffle in 
spark).
  Method 1: use [Chimera|https://github.com/intel-hadoop/chimera] (Chimera is a 
project that extracts the CryptoInputStream/CryptoOutputStream code from 
Hadoop to facilitate AES-NI based data encryption in other projects) to 
implement Spark encrypted shuffle. Pull request: 
https://github.com/apache/spark/pull/5307.
  Method 2: add a crypto package to the spark-core module containing 
CryptoInputStream.scala, CryptoOutputStream.scala, and so on. 
Pull request: https://github.com/apache/spark/pull/4491.

Which one is better? Any advice/guidance is welcome!


 Add encrypted shuffle in spark
 --

 Key: SPARK-5682
 URL: https://issues.apache.org/jira/browse/SPARK-5682
 Project: Spark
  Issue Type: New Feature
  Components: Shuffle
Reporter: liyunzhang_intel
 Attachments: Design Document of Encrypted Spark 
 Shuffle_20150209.docx, Design Document of Encrypted Spark 
 Shuffle_20150318.docx


 Encrypted shuffle is enabled in Hadoop 2.6, which makes the shuffle data path 
 safer. This feature is also needed in Spark. AES is a specification for 
 the encryption of electronic data; it has five common modes, and CTR is 
 one of them. We use the two codecs JceAesCtrCryptoCodec and 
 OpensslAesCtrCryptoCodec, also used by Hadoop encrypted shuffle, to enable 
 Spark encrypted shuffle. JceAesCtrCryptoCodec uses the encryption algorithms 
 provided by the JDK, while OpensslAesCtrCryptoCodec uses the encryption 
 algorithms provided by OpenSSL. 
 Because UGI credential info is used in the process of encrypted shuffle, we 
 first enable encrypted shuffle on the Spark-on-YARN framework.
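
As background on what AES-CTR stream encryption looks like at the JVM level, here is a minimal Scala sketch using only the JDK's javax.crypto classes. It is an illustration under stated assumptions (hard-coded key/IV, in-memory streams), not the Spark or Chimera codec implementation itself:
{code}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import javax.crypto.{Cipher, CipherInputStream, CipherOutputStream}
import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}

object AesCtrShuffleSketch {
  def main(args: Array[String]): Unit = {
    // In a real shuffle the key would come from the job's UGI credentials; here it is hard-coded.
    val key = new SecretKeySpec(Array.fill[Byte](16)(1), "AES")
    val iv  = new IvParameterSpec(Array.fill[Byte](16)(0))

    // Map side: wrap the output stream so shuffle bytes are encrypted as they are written.
    val encCipher = Cipher.getInstance("AES/CTR/NoPadding")
    encCipher.init(Cipher.ENCRYPT_MODE, key, iv)
    val buffer = new ByteArrayOutputStream()
    val encrypted = new CipherOutputStream(buffer, encCipher)
    encrypted.write("shuffle block contents".getBytes("UTF-8"))
    encrypted.close()

    // Reduce side: wrap the input stream with the same key/IV to decrypt transparently.
    val decCipher = Cipher.getInstance("AES/CTR/NoPadding")
    decCipher.init(Cipher.DECRYPT_MODE, key, iv)
    val decrypted = new CipherInputStream(new ByteArrayInputStream(buffer.toByteArray), decCipher)
    println(scala.io.Source.fromInputStream(decrypted, "UTF-8").mkString)
  }
}
{code}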



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4655) Split Stage into ShuffleMapStage and ResultStage subclasses

2015-04-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4655.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 4708
[https://github.com/apache/spark/pull/4708]

 Split Stage into ShuffleMapStage and ResultStage subclasses
 ---

 Key: SPARK-4655
 URL: https://issues.apache.org/jira/browse/SPARK-4655
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Josh Rosen
Assignee: Ilya Ganelin
 Fix For: 1.4.0


 The scheduler's {{Stage}} class has many fields which are only applicable to 
 result stages or shuffle map stages.  As a result, I think that it makes 
 sense to make {{Stage}} into an abstract base class with two subclasses, 
 {{ResultStage}} and {{ShuffleMapStage}}.  This would improve the 
 understandability of the DAGScheduler code. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6597) Replace `input:checkbox` with `input[type=checkbox] in additional-metrics.js

2015-04-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6597:
-
Priority: Trivial  (was: Minor)
Assignee: Kousuke Saruta

 Replace `input:checkbox` with `input[type=checkbox] in additional-metrics.js
 --

 Key: SPARK-6597
 URL: https://issues.apache.org/jira/browse/SPARK-6597
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.2.2, 1.3.1, 1.4.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
Priority: Trivial
 Fix For: 1.4.0


 In additional-metrics.js, some selectors use the `input:checkbox` notation, 
 but jQuery's official documentation recommends `input[type=checkbox]` 
 instead.
 https://api.jquery.com/checkbox-selector/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6600) Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway

2015-04-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6600.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5257
[https://github.com/apache/spark/pull/5257]

 Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway  
 --

 Key: SPARK-6600
 URL: https://issues.apache.org/jira/browse/SPARK-6600
 Project: Spark
  Issue Type: New Feature
  Components: EC2
Reporter: Florian Verhein
 Fix For: 1.4.0


 Use case: a user has set up the Hadoop HDFS NFS gateway service on their 
 spark_ec2.py-launched cluster and wants to mount it on their local 
 machine. 
 This requires the following ports to be opened in the incoming rule set for MASTER, for 
 both UDP and TCP: 111, 2049, 4242.
 (I have tried this and it works.)
 Note that this issue *does not* cover the implementation of an HDFS NFS 
 gateway module in the spark-ec2 project. See the linked issue. 
 Reference:
 https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5682) Add encrypted shuffle in spark

2015-04-01 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated SPARK-5682:

Attachment: Design Document of Encrypted Spark Shuffle_20150401.docx

 Add encrypted shuffle in spark
 --

 Key: SPARK-5682
 URL: https://issues.apache.org/jira/browse/SPARK-5682
 Project: Spark
  Issue Type: New Feature
  Components: Shuffle
Reporter: liyunzhang_intel
 Attachments: Design Document of Encrypted Spark 
 Shuffle_20150209.docx, Design Document of Encrypted Spark 
 Shuffle_20150318.docx, Design Document of Encrypted Spark 
 Shuffle_20150401.docx


 Encrypted shuffle is enabled in Hadoop 2.6, which makes the shuffle data path 
 safer. This feature is also needed in Spark. AES is a specification for 
 the encryption of electronic data; it has five common modes, and CTR is 
 one of them. We use the two codecs JceAesCtrCryptoCodec and 
 OpensslAesCtrCryptoCodec, also used by Hadoop encrypted shuffle, to enable 
 Spark encrypted shuffle. JceAesCtrCryptoCodec uses the encryption algorithms 
 provided by the JDK, while OpensslAesCtrCryptoCodec uses the encryption 
 algorithms provided by OpenSSL. 
 Because UGI credential info is used in the process of encrypted shuffle, we 
 first enable encrypted shuffle on the Spark-on-YARN framework.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6626) TwitterUtils.createStream documentation error

2015-04-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6626:
-
Priority: Trivial  (was: Minor)
Assignee: Jayson Sunshine

 TwitterUtils.createStream documentation error
 -

 Key: SPARK-6626
 URL: https://issues.apache.org/jira/browse/SPARK-6626
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.3.0
Reporter: Jayson Sunshine
Assignee: Jayson Sunshine
Priority: Trivial
  Labels: documentation, easyfix
 Fix For: 1.3.1, 1.4.0

   Original Estimate: 5m
  Remaining Estimate: 5m

 At 
 http://spark.apache.org/docs/1.3.0/streaming-programming-guide.html#input-dstreams-and-receivers,
  under 'Advanced Sources', the documentation provides the following call for 
 Scala:
 TwitterUtils.createStream(ssc)
 However, with only one parameter to this method it appears a jssc object is 
 required, not an ssc object: 
 http://spark.apache.org/docs/1.3.0/api/java/index.html?org/apache/spark/streaming/twitter/TwitterUtils.html
 To make the above call work, one must instead provide an Option argument, for 
 example:
 TwitterUtils.createStream(ssc, None)
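
For reference, a minimal end-to-end sketch of the corrected Scala call, assuming the spark-streaming-twitter artifact is on the classpath and twitter4j OAuth credentials are set as system properties (the app name and batch interval below are arbitrary):
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

object TwitterStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TwitterStreamSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Passing None makes the receiver pick up the default twitter4j OAuth configuration.
    val tweets = TwitterUtils.createStream(ssc, None)
    tweets.map(_.getText).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}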



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6626) TwitterUtils.createStream documentation error

2015-04-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6626.
--
   Resolution: Fixed
Fix Version/s: 1.4.0
   1.3.1

Issue resolved by pull request 5295
[https://github.com/apache/spark/pull/5295]

 TwitterUtils.createStream documentation error
 -

 Key: SPARK-6626
 URL: https://issues.apache.org/jira/browse/SPARK-6626
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.3.0
Reporter: Jayson Sunshine
Priority: Minor
  Labels: documentation, easyfix
 Fix For: 1.3.1, 1.4.0

   Original Estimate: 5m
  Remaining Estimate: 5m

 At 
 http://spark.apache.org/docs/1.3.0/streaming-programming-guide.html#input-dstreams-and-receivers,
  under 'Advanced Sources', the documentation provides the following call for 
 Scala:
 TwitterUtils.createStream(ssc)
 However, with only one parameter to this method it appears a jssc object is 
 required, not an ssc object: 
 http://spark.apache.org/docs/1.3.0/api/java/index.html?org/apache/spark/streaming/twitter/TwitterUtils.html
 To make the above call work, one must instead provide an Option argument, for 
 example:
 TwitterUtils.createStream(ssc, None)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6600) Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway

2015-04-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6600:
-
Priority: Minor  (was: Major)
Assignee: Florian Verhein

 Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway  
 --

 Key: SPARK-6600
 URL: https://issues.apache.org/jira/browse/SPARK-6600
 Project: Spark
  Issue Type: New Feature
  Components: EC2
Reporter: Florian Verhein
Assignee: Florian Verhein
Priority: Minor
 Fix For: 1.4.0


 Use case: a user has set up the Hadoop HDFS NFS gateway service on their 
 spark_ec2.py-launched cluster and wants to mount it on their local 
 machine. 
 This requires the following ports to be opened in the incoming rule set for MASTER, for 
 both UDP and TCP: 111, 2049, 4242.
 (I have tried this and it works.)
 Note that this issue *does not* cover the implementation of an HDFS NFS 
 gateway module in the spark-ec2 project. See the linked issue. 
 Reference:
 https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6597) Replace `input:checkbox` with `input[type=checkbox] in additional-metrics.js

2015-04-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6597.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5254
[https://github.com/apache/spark/pull/5254]

 Replace `input:checkbox` with `input[type=checkbox] in additional-metrics.js
 --

 Key: SPARK-6597
 URL: https://issues.apache.org/jira/browse/SPARK-6597
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.2.2, 1.3.1, 1.4.0
Reporter: Kousuke Saruta
Priority: Minor
 Fix For: 1.4.0


 In additional-metrics.js, some selectors use the `input:checkbox` notation, 
 but jQuery's official documentation recommends `input[type=checkbox]` 
 instead.
 https://api.jquery.com/checkbox-selector/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6630) SparkConf.setIfMissing should only evaluate the assigned value if indeed missing

2015-04-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390350#comment-14390350
 ] 

Sean Owen commented on SPARK-6630:
--

This should be as simple as {{def setIfMissing(key: String, value: => String): SparkConf = ...}}, 
if I'm not mistaken about how by-name parameters work in Scala. 
Would you like to make a PR and verify that it lazily evaluates? I can't think of a 
scenario where it would be important to always evaluate the argument.
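
A minimal sketch of the by-name idea, assuming nothing about the real SparkConf internals (the {{LazyConf}} class below is purely illustrative):
{code}
import scala.collection.mutable

// Illustrative stand-in for SparkConf, showing only the by-name evaluation behaviour.
class LazyConf {
  private val settings = mutable.Map.empty[String, String]

  def set(key: String, value: String): LazyConf = { settings(key) = value; this }

  // `value` is by-name (=> String), so it is evaluated only when the key is actually missing.
  def setIfMissing(key: String, value: => String): LazyConf = {
    if (!settings.contains(key)) settings(key) = value
    this
  }
}

object LazyConfDemo extends App {
  def expensiveDefault(): String = { println("computing default"); "localhost" }

  val conf = new LazyConf().set("spark.driver.host", "driver-1")
  conf.setIfMissing("spark.driver.host", expensiveDefault()) // prints nothing: key present
  conf.setIfMissing("spark.driver.port", expensiveDefault()) // prints "computing default"
}
{code}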

 SparkConf.setIfMissing should only evaluate the assigned value if indeed 
 missing
 

 Key: SPARK-6630
 URL: https://issues.apache.org/jira/browse/SPARK-6630
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Svend Vanderveken
Priority: Minor

 The method setIfMissing() in SparkConf is currently systematically evaluating 
 the right hand side of the assignment even if not used. This leads to 
 unnecessary computation, like in the case of 
 {code}
   conf.setIfMissing("spark.driver.host", Utils.localHostName())
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6644) After adding new columns to a partitioned table and inserting data to an old partition, data of newly added columns are all NULL

2015-04-01 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6644:
--
Description: 
In Hive, the schema of a partition may differ from the table schema. For 
example, we may add new columns to the table after importing existing 
partitions. When using {{spark-sql}} to query the data in a partition whose 
schema is different from the table schema, problems may arise. Part of them 
have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. 
However, after adding new column(s) to the table, when inserting data into old 
partitions, values of newly added columns are all {{NULL}}.

The following snippet can be used to reproduce this issue:
{code}
case class TestData(key: Int, value: String)

val testData = TestHive.sparkContext.parallelize((1 to 2).map(i => TestData(i, i.toString))).toDF()
testData.registerTempTable("testData")

sql("DROP TABLE IF EXISTS table_with_partition")
sql(s"CREATE TABLE IF NOT EXISTS table_with_partition (key INT, value STRING) PARTITIONED BY (ds STRING) LOCATION '${tmpDir.toURI.toString}'")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value FROM testData")

// Add new columns to the table
sql("ALTER TABLE table_with_partition ADD COLUMNS (key1 STRING)")
sql("ALTER TABLE table_with_partition ADD COLUMNS (destlng DOUBLE)")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value, 'test', 1.11 FROM testData")

sql("SELECT * FROM table_with_partition WHERE ds = '1'").collect().foreach(println)
{code}
Actual result:
{noformat}
[1,1,null,null,1]
[2,2,null,null,1]
{noformat}
Expected result:
{noformat}
[1,1,test,1.11,1]
[2,2,test,1.11,1]
{noformat}

  was:
In Hive, the schema of a partition may differ from the table schema. For 
example, we may add new columns to the table after importing existing 
partitions. When using {{spark-sql}} to query the data in a partition whose 
schema is different from the table schema, problems may arise. Part of them 
have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. 
However, after adding new column(s) to the table, when inserting data into old 
partitions, values of newly added columns are all {{NULL}}.

The following snippet can be used to reproduce this issue:
{code}
case class TestData(key: Int, value: String)

val testData = TestHive.sparkContext.parallelize((1 to 2).map(i => TestData(i, i.toString))).toDF()
testData.registerTempTable("testData")

sql("DROP TABLE IF EXISTS table_with_partition")
sql(s"CREATE TABLE IF NOT EXISTS table_with_partition(key int, value string) PARTITIONED by (ds string) location '${tmpDir.toURI.toString}'")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value FROM testData")

// Add new columns to the table
sql("ALTER TABLE table_with_partition ADD COLUMNS(key1 string)")
sql("ALTER TABLE table_with_partition ADD COLUMNS(destlng double)")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value, 'test', 1.11 FROM testData")

sql("SELECT * FROM table_with_partition WHERE ds = '1'").collect().foreach(println)
{code}
Actual result:
{noformat}
[1,1,null,null,1]
[2,2,null,null,1]
{noformat}
Expected result:
{noformat}
[1,1,test,1.11,1]
[2,2,test,1.11,1]
{noformat}


 After adding new columns to a partitioned table and inserting data to an old 
 partition, data of newly added columns are all NULL
 

 Key: SPARK-6644
 URL: https://issues.apache.org/jira/browse/SPARK-6644
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: dongxu

 In Hive, the schema of a partition may differ from the table schema. For 
 example, we may add new columns to the table after importing existing 
 partitions. When using {{spark-sql}} to query the data in a partition whose 
 schema is different from the table schema, problems may arise. Part of them 
 have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. 
 However, after adding new column(s) to the table, when inserting data into 
 old partitions, values of newly added columns are all {{NULL}}.
 The following snippet can be used to reproduce this issue:
 {code}
 case class TestData(key: Int, value: String)
 val testData = TestHive.sparkContext.parallelize((1 to 2).map(i => TestData(i, i.toString))).toDF()
 testData.registerTempTable("testData")
 sql("DROP TABLE IF EXISTS table_with_partition")
 sql(s"CREATE TABLE IF NOT EXISTS table_with_partition (key INT, value STRING) PARTITIONED BY (ds STRING) LOCATION '${tmpDir.toURI.toString}'")
 sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value FROM testData")
 // Add new columns to the table
 

[jira] [Commented] (SPARK-6631) I am unable to get the Maven Build file in Example 2.13 to build anything but an empty file

2015-04-01 Thread Frank Domoney (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390114#comment-14390114
 ] 

Frank Domoney commented on SPARK-6631:
--

Incidentally, can you get the Debian build of Spark 1.3 to work? mvn -Pdeb 
-DskipTests clean package

Mine fails to build. I suspect that the Debian package might be the correct 
one for Ubuntu 14.04 and Java 8.

Caused by: org.vafer.jdeb.PackagingException: Could not create deb package
at org.vafer.jdeb.Processor.createDeb(Processor.java:171)
at org.vafer.jdeb.maven.DebMaker.makeDeb(DebMaker.java:244)
... 22 more
Caused by: org.vafer.jdeb.PackagingException: Control file descriptor keys are 
invalid [Version]. The following keys are mandatory [Package, Version, Section, 
Priority, Architecture, Maintainer, Description]. Please check your 
pom.xml/build.xml and your control file.
at org.vafer.jdeb.Processor.createDeb(Processor.java:142)
... 23 more
[INFO

 I am unable to get the Maven Build file in Example 2.13 to build anything but 
 an empty file
 ---

 Key: SPARK-6631
 URL: https://issues.apache.org/jira/browse/SPARK-6631
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.3.0
 Environment: Ubuntu 14.04
Reporter: Frank Domoney
Priority: Blocker

 I have downloaded and built spark 1.3.0 under Ubuntu 14.04 but have been 
 unable to get reduceByKey to work on what seems to be a valid RDD using the 
 command line.
 scala> counts.take(10)
 res17: Array[(String, Int)] = Array((Vladimir,1), (Putin,1), (has,1), 
 (said,1), (Russia,1), (will,1), (fight,1), (for,1), (an,1), (independent,1))
 scala> val counts1 = counts.reduceByKey{case (x, y) => x + y}
 counts1.take(10)
 res16: Array[(String, Int)] = Array()
 I am attempting to build the Maven sequence in example 2.15 but get the 
 following results
 Building example 0.0.1
 [INFO] 
 
 [INFO] 
 [INFO] --- maven-resources-plugin:2.3:resources (default-resources) @ 
 learning-spark-mini-example ---
 [WARNING] Using platform encoding (UTF-8 actually) to copy filtered 
 resources, i.e. build is platform dependent!
 [INFO] skip non existing resourceDirectory 
 /home/panzerfrank/Downloads/spark-1.3.0/wordcount/src/main/resources
 [INFO] 
 [INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ 
 learning-spark-mini-example ---
 [INFO] No sources to compile
 [INFO] 
 [INFO] --- maven-resources-plugin:2.3:testResources (default-testResources) @ 
 learning-spark-mini-example ---
 [WARNING] Using platform encoding (UTF-8 actually) to copy filtered 
 resources, i.e. build is platform dependent!
 [INFO] skip non existing resourceDirectory 
 /home/panzerfrank/Downloads/spark-1.3.0/wordcount/src/test/resources
 [INFO] 
 [INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @ 
 learning-spark-mini-example ---
 [INFO] No sources to compile
 [INFO] 
 [INFO] --- maven-surefire-plugin:2.10:test (default-test) @ 
 learning-spark-mini-example ---
 [INFO] No tests to run.
 [INFO] Surefire report directory: 
 /home/panzerfrank/Downloads/spark-1.3.0/wordcount/target/surefire-reports
  --- maven-jar-plugin:2.2:jar (default-jar) @ learning-spark-mini-example ---
 [WARNING] JAR will be empty - no content was marked for inclusion!
 [INFO] Building jar: 
 /home/panzerfrank/Downloads/spark-1.3.0/wordcount/target/learning-spark-mini-example-0.0.1.jar
 I am using the POM file from Example 2-13. Java is Java 8.
 Am I doing something really stupid?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Cong Yue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390146#comment-14390146
 ] 

Cong Yue commented on SPARK-6646:
-

Very cool idea. Current smartphones have much better performance than the servers 
of 5-8 years ago.
But on mobile networks, the data transfer speed between nodes cannot be as 
stable as between servers. 
So parallel computing can benefit from the CPUs, but the bottleneck will 
be the mobile network.


 Spark 2.0: Rearchitecting Spark for Mobile Platforms
 

 Key: SPARK-6646
 URL: https://issues.apache.org/jira/browse/SPARK-6646
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker
 Attachments: Spark on Mobile - Design Doc - v1.pdf


 Mobile computing is quickly rising to dominance, and by the end of 2017, it 
 is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
 Spark’s project goal can be accomplished only when Spark runs efficiently for 
 the growing population of mobile users.
 Designed and optimized for modern data centers and Big Data applications, 
 Spark is unfortunately not a good fit for mobile computing today. In the past 
 few months, we have been prototyping the feasibility of a mobile-first Spark 
 architecture, and today we would like to share with you our findings. This 
 ticket outlines the technical design of Spark’s mobile support, and shares 
 results from several early prototypes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390196#comment-14390196
 ] 

Sandy Ryza commented on SPARK-6646:
---

[~srowen] I like the way you think.  I know a lot of good nodes out there 
looking for love or at least a casual shutdown hookup. 

 Spark 2.0: Rearchitecting Spark for Mobile Platforms
 

 Key: SPARK-6646
 URL: https://issues.apache.org/jira/browse/SPARK-6646
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker
 Attachments: Spark on Mobile - Design Doc - v1.pdf


 Mobile computing is quickly rising to dominance, and by the end of 2017, it 
 is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
 Spark’s project goal can be accomplished only when Spark runs efficiently for 
 the growing population of mobile users.
 Designed and optimized for modern data centers and Big Data applications, 
 Spark is unfortunately not a good fit for mobile computing today. In the past 
 few months, we have been prototyping the feasibility of a mobile-first Spark 
 architecture, and today we would like to share with you our findings. This 
 ticket outlines the technical design of Spark’s mobile support, and shares 
 results from several early prototypes.
 Mobile friendly version of the design doc: 
 https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390205#comment-14390205
 ] 

Aaron Davidson commented on SPARK-6646:
---

Please help, I tried putting spark on iphone but it ignited and now no phone.

 Spark 2.0: Rearchitecting Spark for Mobile Platforms
 

 Key: SPARK-6646
 URL: https://issues.apache.org/jira/browse/SPARK-6646
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker
 Attachments: Spark on Mobile - Design Doc - v1.pdf


 Mobile computing is quickly rising to dominance, and by the end of 2017, it 
 is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
 Spark’s project goal can be accomplished only when Spark runs efficiently for 
 the growing population of mobile users.
 Designed and optimized for modern data centers and Big Data applications, 
 Spark is unfortunately not a good fit for mobile computing today. In the past 
 few months, we have been prototyping the feasibility of a mobile-first Spark 
 architecture, and today we would like to share with you our findings. This 
 ticket outlines the technical design of Spark’s mobile support, and shares 
 results from several early prototypes.
 Mobile friendly version of the design doc: 
 https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4927) Spark does not clean up properly during long jobs.

2015-04-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4927.
--
Resolution: Cannot Reproduce

At the moment I've tried to reproduce this a few ways and wasn't able to. It 
may have been fixed since. It can be reopened if there is a 
reproduction against 1.3+.

 Spark does not clean up properly during long jobs. 
 ---

 Key: SPARK-4927
 URL: https://issues.apache.org/jira/browse/SPARK-4927
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Ilya Ganelin

 On a long-running Spark job, Spark will eventually run out of memory on the 
 driver node due to metadata overhead from the shuffle operation. Spark will 
 continue to operate, but with drastically decreased performance (since 
 swapping now occurs with every operation).
 The spark.cleaner.ttl parameter allows a user to configure when cleanup 
 happens, but the issue is that it isn't done safely: if 
 it clears a cached RDD or active task in the middle of processing a stage, 
 this ultimately causes a KeyNotFoundException when the next stage attempts to 
 reference the cleared RDD or task.
 There should be a sustainable mechanism for cleaning up stale metadata that 
 allows the program to continue running. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1001) Memory leak when reading sequence file and then sorting

2015-04-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1001.
--
Resolution: Cannot Reproduce

 Memory leak when reading sequence file and then sorting
 ---

 Key: SPARK-1001
 URL: https://issues.apache.org/jira/browse/SPARK-1001
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 0.8.0
Reporter: Matthew Cheah
  Labels: Hadoop, Memory

 Spark appears to build up a backlog of unreachable byte arrays when an RDD is 
 constructed from a sequence file and that RDD is then sorted.
 I have a class that wraps a Java ArrayList and can be serialized and 
 written to a Hadoop SequenceFile (i.e., it implements the Writable interface). 
 Let's call it WritableDataRow. It takes a Java List to wrap, and also has a 
 copy constructor.
 Setup: 10 slaves launched via EC2, 65.9GB RAM each; the dataset is 100GB of 
 text, 120GB when in sequence file format (not using compression to compact 
 the bytes); CDH4.2.0-backed Hadoop cluster.
 First, building the RDD from a CSV and then sorting on index 1 works fine:
 {code}
 scala> import scala.collection.JavaConversions._ // Other imports here as well
 import scala.collection.JavaConversions._
 scala> val rddAsTextFile = sc.textFile("s3n://some-bucket/events-*.csv")
 rddAsTextFile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at 
 <console>:14
 scala> val rddAsWritableDataRows = rddAsTextFile.map(x => new 
 WritableDataRow(x.split("\\|").toList))
 rddAsWritableDataRows: 
 org.apache.spark.rdd.RDD[com.palantir.finance.datatable.server.spark.WritableDataRow]
  = MappedRDD[2] at map at <console>:19
 scala> val rddAsKeyedWritableDataRows = rddAsWritableDataRows.map(x => 
 (x.getContents().get(1).toString(), x));
 rddAsKeyedWritableDataRows: org.apache.spark.rdd.RDD[(String, 
 com.palantir.finance.datatable.server.spark.WritableDataRow)] = MappedRDD[4] 
 at map at <console>:22
 scala> val orderedFunct = new 
 org.apache.spark.rdd.OrderedRDDFunctions[String, WritableDataRow, (String, 
 WritableDataRow)](rddAsKeyedWritableDataRows)
 orderedFunct: 
 org.apache.spark.rdd.OrderedRDDFunctions[String,com.palantir.finance.datatable.server.spark.WritableDataRow,(String,
  com.palantir.finance.datatable.server.spark.WritableDataRow)] = 
 org.apache.spark.rdd.OrderedRDDFunctions@587acb54
 scala> orderedFunct.sortByKey(true).count(); // Actually triggers the 
 computation, as stated in a different e-mail thread
 res0: org.apache.spark.rdd.RDD[(String, 
 com.palantir.finance.datatable.server.spark.WritableDataRow)] = 
 MapPartitionsRDD[8] at sortByKey at <console>:27
 {code}
 The above works without too many surprises. I then save it as a Sequence File 
 (using JavaPairRDD as a way to more easily call saveAsHadoopFile(), and this 
 is how it's done in our Java-based application):
 {code}
 scala> val pairRDD = new JavaPairRDD(rddAsWritableDataRows.map(x => 
 (NullWritable.get(), x)));
 pairRDD: 
 org.apache.spark.api.java.JavaPairRDD[org.apache.hadoop.io.NullWritable,com.palantir.finance.datatable.server.spark.WritableDataRow]
  = org.apache.spark.api.java.JavaPairRDD@8d2e9d9
 scala> pairRDD.saveAsHadoopFile("hdfs://hdfs-master-url:9010/blah", 
 classOf[NullWritable], classOf[WritableDataRow], 
 classOf[org.apache.hadoop.mapred.SequenceFileOutputFormat[NullWritable, 
 WritableDataRow]]);
 …
 2013-12-11 20:09:14,444 [main] INFO  org.apache.spark.SparkContext - Job 
 finished: saveAsHadoopFile at <console>:26, took 1052.116712748 s
 {code}
 And now I want to get the RDD from the sequence file and sort THAT, and this 
 is when I monitor Ganglia and ps aux and notice the memory usage climbing 
 ridiculously:
 {code}
 scala> val rddAsSequenceFile = 
 sc.sequenceFile("hdfs://hdfs-master-url:9010/blah", classOf[NullWritable], 
 classOf[WritableDataRow]).map(x => new WritableDataRow(x._2)); // Invokes 
 copy constructor to get around re-use of writable objects
 rddAsSequenceFile: 
 org.apache.spark.rdd.RDD[com.palantir.finance.datatable.server.spark.WritableDataRow]
  = MappedRDD[19] at map at <console>:19
 scala> val orderedFunct = new 
 org.apache.spark.rdd.OrderedRDDFunctions[String, WritableDataRow, (String, 
 WritableDataRow)](rddAsSequenceFile.map(x => 
 (x.getContents().get(1).toString(), x)))
 orderedFunct: 
 org.apache.spark.rdd.OrderedRDDFunctions[String,com.palantir.finance.datatable.server.spark.WritableDataRow,(String,
  com.palantir.finance.datatable.server.spark.WritableDataRow)] = 
 org.apache.spark.rdd.OrderedRDDFunctions@6262a9a6
 scala> orderedFunct.sortByKey().count();
 {code}
 (On the necessity to copy writables from hadoop RDDs, see: 
 https://mail-archives.apache.org/mod_mbox/spark-user/201308.mbox/%3ccaf_kkpzrq4otyqvwcoc6plaz9x9_sfo33u4ysatki5ptqoy...@mail.gmail.com%3E
  )
 I got a 

[jira] [Created] (SPARK-6647) Make trait StringComparison as BinaryPredicate and throw error when Predicate can't translate to data source Filter

2015-04-01 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-6647:
--

 Summary: Make trait StringComparison as BinaryPredicate and throw 
error when Predicate can't translate to data source Filter
 Key: SPARK-6647
 URL: https://issues.apache.org/jira/browse/SPARK-6647
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Liang-Chi Hsieh


Currently the trait {{StringComparison}} is a {{BinaryExpression}}. In fact, it should be 
a {{BinaryPredicate}}.

By making {{StringComparison}} a {{BinaryPredicate}}, we can throw an error when 
an {{expressions.Predicate}} can't be translated to a data source {{Filter}} in 
the function {{selectFilters}}.

Without this modification, because we wrap a {{Filter}} around the 
scanned results in {{pruneFilterProjectRaw}}, we can't detect that something 
is wrong in translating predicates to filters in {{selectFilters}}.

The unit test of SPARK-6625 demonstrates this problem. In that PR, even though 
{{expressions.Contains}} is not properly translated to 
{{sources.StringContains}}, the filtering is still performed by the {{Filter}} 
and so the test passes.

Of course, with this modification, every {{expressions.Predicate}} class 
needs a corresponding data source {{Filter}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3884) If deploy mode is cluster, --driver-memory shouldn't apply to client JVM

2015-04-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3884:
-
  Component/s: (was: Spark Core)
   Spark Submit
Affects Version/s: 1.2.0
   1.3.0

 If deploy mode is cluster, --driver-memory shouldn't apply to client JVM
 

 Key: SPARK-3884
 URL: https://issues.apache.org/jira/browse/SPARK-3884
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 1.1.0, 1.2.0, 1.3.0
Reporter: Sandy Ryza
Assignee: Marcelo Vanzin
 Fix For: 1.4.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390373#comment-14390373
 ] 

Steve Loughran commented on SPARK-6646:
---

Obviously the barrier will be data source access; talking to remote data is 
going to run up bills.

# couchdb has an offline mode, so its RDD/Dataframe support would allow 
spark-mobile to work in embedded mode.
# Hadoop 2.8 adds hardware CRC on ARM parts for HDFS (HADOOP-11660). A 
{{MiniHDFSCluster}} could be instantiated locally to benefit from this.
# alternatively, mDNS could be used to discover and dynamically build up an 
HDFS cluster from nearby devices, MANET-style. The limited connectivity 
guarantees of moving devices means that a block size of 1536 bytes would be 
appropriate; probably 1KB blocks are safest.
# Those nodes on the network with limited CPU power but access to external 
power supplies, such as toasters and coffee machines, could have a role as the 
persistent co-ordinators of work and HDFS Namenodes, as well as being used as 
the preferred routers of wifi packets.
# It may be necessary to extend the hadoop {{s3://}} filesystem with the notion 
of monthly data quotas, possibly even separate roaming and non-roaming quotas. The S3 
client would need to query the runtime to determine whether it was at home vs 
roaming and use the relevant quota. Apps could then set something like
{code}
fs.s3.quota.home=15GB
fs.s3.quota.roaming=2GB
{code}
Dealing with use abroad would be more complex, as if a cost value were to be 
included, exchange rates would have to be dynamically assessed.
# It may be interesting to consider the notion of having devices publish some of 
their data (photos, healthkit history, movement history) to other devices 
nearby. If one phone could enumerate those nearby **and submit work to them**, 
the bandwidth problems could be addressed.



 Spark 2.0: Rearchitecting Spark for Mobile Platforms
 

 Key: SPARK-6646
 URL: https://issues.apache.org/jira/browse/SPARK-6646
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker
 Attachments: Spark on Mobile - Design Doc - v1.pdf


 Mobile computing is quickly rising to dominance, and by the end of 2017, it 
 is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
 Spark’s project goal can be accomplished only when Spark runs efficiently for 
 the growing population of mobile users.
 Designed and optimized for modern data centers and Big Data applications, 
 Spark is unfortunately not a good fit for mobile computing today. In the past 
 few months, we have been prototyping the feasibility of a mobile-first Spark 
 architecture, and today we would like to share with you our findings. This 
 ticket outlines the technical design of Spark’s mobile support, and shares 
 results from several early prototypes.
 Mobile friendly version of the design doc: 
 https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4544) Spark JVM Metrics doesn't have context.

2015-04-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4544.
--
Resolution: Duplicate

I'd like to bundle this under SPARK-5847, which proposes more general control 
over the namespacing, which could include instance as a higher-level grouping 
than the current app ID.

 Spark JVM Metrics doesn't have context.
 ---

 Key: SPARK-4544
 URL: https://issues.apache.org/jira/browse/SPARK-4544
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Sreepathi Prasanna

 If we enable JVM metrics for executor, master, worker, and driver instances, we 
 don't have context about where they are coming from.
 This can be an issue if we are collecting all the metrics from different 
 instances and storing them into a common datastore. 
 This mainly concerns running Spark on YARN, but I believe Spark standalone also has 
 this problem.
 It would be good if we attached some context to the JVM metrics. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3967) Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions

2015-04-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3967:
-
Component/s: (was: Spark Core)
 YARN

 Spark applications fail in yarn-cluster mode when the directories configured 
 in yarn.nodemanager.local-dirs are located on different disks/partitions
 -

 Key: SPARK-3967
 URL: https://issues.apache.org/jira/browse/SPARK-3967
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Christophe Préaud
 Attachments: spark-1.1.0-utils-fetch.patch, 
 spark-1.1.0-yarn_cluster_tmpdir.patch


 Spark applications fail from time to time in yarn-cluster mode (but not in 
 yarn-client mode) when yarn.nodemanager.local-dirs (Hadoop YARN config) is 
 set to a comma-separated list of directories which are located on different 
 disks/partitions.
 Steps to reproduce:
 1. Set yarn.nodemanager.local-dirs (in yarn-site.xml) to a list of 
 directories located on different partitions (the more you set, the more 
 likely it will be to reproduce the bug):
 (...)
 <property>
   <name>yarn.nodemanager.local-dirs</name>
   <value>file:/d1/yarn/local/nm-local-dir,file:/d2/yarn/local/nm-local-dir,file:/d3/yarn/local/nm-local-dir,file:/d4/yarn/local/nm-local-dir,file:/d5/yarn/local/nm-local-dir,file:/d6/yarn/local/nm-local-dir,file:/d7/yarn/local/nm-local-dir</value>
 </property>
 (...)
 2. Launch (several times) an application in yarn-cluster mode, it will fail 
 (apparently randomly) from time to time



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6647) Make trait StringComparison as BinaryPredicate and throw error when Predicate can't translate to data source Filter

2015-04-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6647:
---

Assignee: (was: Apache Spark)

 Make trait StringComparison as BinaryPredicate and throw error when Predicate 
 can't translate to data source Filter
 ---

 Key: SPARK-6647
 URL: https://issues.apache.org/jira/browse/SPARK-6647
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Liang-Chi Hsieh

 Currently the trait {{StringComparison}} is a {{BinaryExpression}}. In fact, it should 
 be a {{BinaryPredicate}}.
 By making {{StringComparison}} a {{BinaryPredicate}}, we can throw an error 
 when an {{expressions.Predicate}} can't be translated to a data source {{Filter}} 
 in the function {{selectFilters}}.
 Without this modification, because we wrap a {{Filter}} around the 
 scanned results in {{pruneFilterProjectRaw}}, we can't detect that something 
 is wrong in translating predicates to filters in {{selectFilters}}.
 The unit test of SPARK-6625 demonstrates this problem. In that PR, even though 
 {{expressions.Contains}} is not properly translated to 
 {{sources.StringContains}}, the filtering is still performed by the 
 {{Filter}} and so the test passes.
 Of course, with this modification, every {{expressions.Predicate}} class 
 needs a corresponding data source {{Filter}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3884) If deploy mode is cluster, --driver-memory shouldn't apply to client JVM

2015-04-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3884.
--
  Resolution: Fixed
   Fix Version/s: 1.4.0
Assignee: Marcelo Vanzin  (was: Sandy Ryza)
Target Version/s:   (was: 1.1.2, 1.2.1)

This is fixed in 1.4 due to the new launcher implementation. I verified that in 
yarn-cluster mode the SparkSubmit JVM is not run with -Xms / -Xmx set, but 
instead passes through spark.driver.memory in --conf. In yarn-client mode, it 
does set -Xms / -Xmx.
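
As an illustration of the verified behaviour (the application class and jar path below are hypothetical), a cluster-mode submission where --driver-memory should size the YARN-hosted driver rather than the local SparkSubmit JVM:
{code}
# Equivalent to passing --conf spark.driver.memory=4g; the local client JVM is not given -Xms/-Xmx.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 4g \
  --class com.example.MyApp \
  /path/to/myapp.jar
{code}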

 If deploy mode is cluster, --driver-memory shouldn't apply to client JVM
 

 Key: SPARK-3884
 URL: https://issues.apache.org/jira/browse/SPARK-3884
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Sandy Ryza
Assignee: Marcelo Vanzin
 Fix For: 1.4.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390367#comment-14390367
 ] 

Nan Zhu commented on SPARK-6646:


super cool, Spark enables Bigger than Bigger Data in mobile phones

 Spark 2.0: Rearchitecting Spark for Mobile Platforms
 

 Key: SPARK-6646
 URL: https://issues.apache.org/jira/browse/SPARK-6646
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker
 Attachments: Spark on Mobile - Design Doc - v1.pdf


 Mobile computing is quickly rising to dominance, and by the end of 2017, it 
 is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
 Spark’s project goal can be accomplished only when Spark runs efficiently for 
 the growing population of mobile users.
 Designed and optimized for modern data centers and Big Data applications, 
 Spark is unfortunately not a good fit for mobile computing today. In the past 
 few months, we have been prototyping the feasibility of a mobile-first Spark 
 architecture, and today we would like to share with you our findings. This 
 ticket outlines the technical design of Spark’s mobile support, and shares 
 results from several early prototypes.
 Mobile friendly version of the design doc: 
 https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4799) Spark should not rely on local host being resolvable on every node

2015-04-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4799.
--
  Resolution: Duplicate
Target Version/s:   (was: 1.2.1)

Looks like this was subsumed by SPARK-5078 and SPARK_LOCAL_HOSTNAME

 Spark should not rely on local host being resolvable on every node
 --

 Key: SPARK-4799
 URL: https://issues.apache.org/jira/browse/SPARK-4799
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
 Environment: Tested a Spark+Mesos cluster on top of Docker to 
 reproduce the issue.
Reporter: Santiago M. Mola

 Spark fails when a node hostname is not resolvable by other nodes.
 See an example trace:
 {code}
 14/12/09 17:02:41 ERROR SendingConnection: Error connecting to 
 27e434cf36ac:35093
 java.nio.channels.UnresolvedAddressException
   at sun.nio.ch.Net.checkAddress(Net.java:127)
   at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:644)
   at 
 org.apache.spark.network.SendingConnection.connect(Connection.scala:299)
   at 
 org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:278)
   at 
 org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)
 {code}
 The relevant code is here:
 https://github.com/apache/spark/blob/bcb5cdad614d4fce43725dfec3ce88172d2f8c11/core/src/main/scala/org/apache/spark/network/nio/ConnectionManager.scala#L170
 {code}
 val id = new ConnectionManagerId(Utils.localHostName, 
 serverChannel.socket.getLocalPort)
 {code}
 This piece of code should use the host IP via Utils.localIpAddress or a 
 method that acknowledges user settings (e.g. SPARK_LOCAL_IP). Since I cannot 
 think of a use case for using the hostname here, I'm creating a PR with the 
 former solution, but if you think the latter is better, I'm willing to create 
 a new PR with a more elaborate fix.
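
For illustration, a minimal Scala sketch contrasting the two identifiers using only the standard JDK API (this is not the actual Spark Utils code; SPARK_LOCAL_IP is read here as a plain environment variable):
{code}
import java.net.InetAddress

object LocalAddressSketch {
  def main(args: Array[String]): Unit = {
    val byHostname = InetAddress.getLocalHost.getHostName    // may not resolve on other nodes
    val byIp       = InetAddress.getLocalHost.getHostAddress // numeric address, routable as-is
    val configured = sys.env.get("SPARK_LOCAL_IP")            // explicit user override, if set
    println(s"hostname=$byHostname ip=$byIp override=$configured")
  }
}
{code}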



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6647) Make trait StringComparison as BinaryPredicate and throw error when Predicate can't translate to data source Filter

2015-04-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6647:
---

Assignee: Apache Spark

 Make trait StringComparison as BinaryPredicate and throw error when Predicate 
 can't translate to data source Filter
 ---

 Key: SPARK-6647
 URL: https://issues.apache.org/jira/browse/SPARK-6647
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Liang-Chi Hsieh
Assignee: Apache Spark

 Currently the trait {{StringComparison}} is a {{BinaryExpression}}. In fact, it should 
 be a {{BinaryPredicate}}.
 By making {{StringComparison}} a {{BinaryPredicate}}, we can throw an error 
 when an {{expressions.Predicate}} can't be translated to a data source {{Filter}} 
 in the function {{selectFilters}}.
 Without this modification, because we wrap a {{Filter}} around the 
 scanned results in {{pruneFilterProjectRaw}}, we can't detect that something 
 is wrong in translating predicates to filters in {{selectFilters}}.
 The unit test of SPARK-6625 demonstrates this problem. In that PR, even though 
 {{expressions.Contains}} is not properly translated to 
 {{sources.StringContains}}, the filtering is still performed by the 
 {{Filter}} and so the test passes.
 Of course, with this modification, every {{expressions.Predicate}} class 
 needs a corresponding data source {{Filter}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6630) SparkConf.setIfMissing should only evaluate the assigned value if indeed missing

2015-04-01 Thread Svend Vanderveken (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390421#comment-14390421
 ] 

Svend Vanderveken commented on SPARK-6630:
--

Thanks for your comment. I agree with the proposal; I only found the time 
to open the JIRA yesterday. I'll submit the corresponding PR shortly, promise 
:) 

 SparkConf.setIfMissing should only evaluate the assigned value if indeed 
 missing
 

 Key: SPARK-6630
 URL: https://issues.apache.org/jira/browse/SPARK-6630
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Svend Vanderveken
Priority: Minor

 The method setIfMissing() in SparkConf is currently systematically evaluating 
 the right hand side of the assignment even if not used. This leads to 
 unnecessary computation, like in the case of 
 {code}
   conf.setIfMissing("spark.driver.host", Utils.localHostName())
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6613) Starting stream from checkpoint causes Streaming tab to throw error

2015-04-01 Thread Marius Soutier (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390423#comment-14390423
 ] 

Marius Soutier commented on SPARK-6613:
---

Bug report.

 Starting stream from checkpoint causes Streaming tab to throw error
 ---

 Key: SPARK-6613
 URL: https://issues.apache.org/jira/browse/SPARK-6613
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.1
Reporter: Marius Soutier

 When continuing my streaming job from a checkpoint, the job runs, but the 
 Streaming tab in the standard UI initially no longer works (browser just 
 shows HTTP ERROR: 500). Sometimes  it gets back to normal after a while, and 
 sometimes it stays in this state permanently.
 Stacktrace:
 WARN org.eclipse.jetty.servlet.ServletHandler: /streaming/
 java.util.NoSuchElementException: key not found: 0
   at scala.collection.MapLike$class.default(MapLike.scala:228)
   at scala.collection.AbstractMap.default(Map.scala:58)
   at scala.collection.MapLike$class.apply(MapLike.scala:141)
   at scala.collection.AbstractMap.apply(Map.scala:58)
   at 
 org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1$$anonfun$apply$5.apply(StreamingJobProgressListener.scala:151)
   at 
 org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1$$anonfun$apply$5.apply(StreamingJobProgressListener.scala:150)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.Range.foreach(Range.scala:141)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1.apply(StreamingJobProgressListener.scala:150)
   at 
 org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1.apply(StreamingJobProgressListener.scala:149)
   at scala.Option.map(Option.scala:145)
   at 
 org.apache.spark.streaming.ui.StreamingJobProgressListener.lastReceivedBatchRecords(StreamingJobProgressListener.scala:149)
   at 
 org.apache.spark.streaming.ui.StreamingPage.generateReceiverStats(StreamingPage.scala:82)
   at 
 org.apache.spark.streaming.ui.StreamingPage.render(StreamingPage.scala:43)
   at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68)
   at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68)
   at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:68)
   at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
   at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
   at 
 org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
   at 
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
   at 
 org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
   at 
 org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
   at 
 org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
   at 
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
   at 
 org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
   at 
 org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
   at org.eclipse.jetty.server.Server.handle(Server.java:370)
   at 
 org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
   at 
 org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
   at 
 org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
   at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644)
   at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
   at 
 org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
   at 
 org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
   at 
 org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
   at 
 org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
   at 
 org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA

[jira] [Assigned] (SPARK-6643) Python API for StandardScalerModel

2015-04-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6643:
---

Assignee: (was: Apache Spark)

 Python API for StandardScalerModel
 --

 Key: SPARK-6643
 URL: https://issues.apache.org/jira/browse/SPARK-6643
 Project: Spark
  Issue Type: Task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Kai Sasaki
Priority: Minor
  Labels: mllib, python
 Fix For: 1.4.0


 This is the sub-task of SPARK-6254.
 Wrap missing method for {{StandardScalerModel}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6643) Python API for StandardScalerModel

2015-04-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390610#comment-14390610
 ] 

Apache Spark commented on SPARK-6643:
-

User 'Lewuathe' has created a pull request for this issue:
https://github.com/apache/spark/pull/5310

 Python API for StandardScalerModel
 --

 Key: SPARK-6643
 URL: https://issues.apache.org/jira/browse/SPARK-6643
 Project: Spark
  Issue Type: Task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Kai Sasaki
Priority: Minor
  Labels: mllib, python
 Fix For: 1.4.0


 This is the sub-task of SPARK-6254.
 Wrap missing method for {{StandardScalerModel}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6648) Reading Parquet files with different sub-files doesn't work

2015-04-01 Thread Marius Soutier (JIRA)
Marius Soutier created SPARK-6648:
-

 Summary: Reading Parquet files with different sub-files doesn't 
work
 Key: SPARK-6648
 URL: https://issues.apache.org/jira/browse/SPARK-6648
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
Reporter: Marius Soutier


When reading from multiple parquet files (via 
sqlContext.parquetFile(/path/1.parquet,/path/2.parquet)), if the parquet files 
were created using a different coalesce, the reading fails with:

ERROR c.w.r.websocket.ParquetReader  efault-dispatcher-63 : Failed reading 
parquet file
java.lang.IllegalArgumentException: Could not find Parquet metadata at path 
path
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
 ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]

at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
 ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
at scala.Option.getOrElse(Option.scala:120) 
~[org.scala-lang.scala-library-2.10.4.jar:na]
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.scala:458)
 ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:477)
 ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
at 
org.apache.spark.sql.parquet.ParquetRelation.init(ParquetRelation.scala:65) 
~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:165) 
~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]

I haven't tested with Spark 1.3 yet but will report back after upgrading to 
1.3.1 (as soon as it's released).




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6648) Reading Parquet files with different sub-files doesn't work

2015-04-01 Thread Marius Soutier (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marius Soutier updated SPARK-6648:
--
Description: 
When reading from multiple parquet files (via 
sqlContext.parquetFile(/path/1.parquet,/path/2.parquet)), and one of the parquet 
files is being overwritten using a different coalesce (e.g. one only contains 
part-r-1.parquet, the other also part-r-2.parquet, part-r-3.parquet), the 
reading fails with:

ERROR c.w.r.websocket.ParquetReader  efault-dispatcher-63 : Failed reading 
parquet file
java.lang.IllegalArgumentException: Could not find Parquet metadata at path 
path
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
 ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]

at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
 ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
at scala.Option.getOrElse(Option.scala:120) 
~[org.scala-lang.scala-library-2.10.4.jar:na]
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.scala:458)
 ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:477)
 ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
at 
org.apache.spark.sql.parquet.ParquetRelation.init(ParquetRelation.scala:65) 
~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:165) 
~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]

I haven't tested with Spark 1.3 yet but will report back after upgrading to 
1.3.1 (as soon as it's released).


  was:
When reading from multiple parquet files (via 
sqlContext.parquetFile(/path/1.parquet,/path/2.parquet), if the parquet files 
were created using a different coalesce (e.g. one only contains 
part-r-1.parquet, the other also part-r-2.parquet, part-r-3.parquet), the 
reading fails with:

ERROR c.w.r.websocket.ParquetReader  efault-dispatcher-63 : Failed reading 
parquet file
java.lang.IllegalArgumentException: Could not find Parquet metadata at path 
path
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
 ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]

at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
 ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
at scala.Option.getOrElse(Option.scala:120) 
~[org.scala-lang.scala-library-2.10.4.jar:na]
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.scala:458)
 ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:477)
 ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
at 
org.apache.spark.sql.parquet.ParquetRelation.init(ParquetRelation.scala:65) 
~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:165) 
~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]

I haven't tested with Spark 1.3 yet but will report back after upgrading to 
1.3.1 (as soon as it's released).



 Reading Parquet files with different sub-files doesn't work
 ---

 Key: SPARK-6648
 URL: https://issues.apache.org/jira/browse/SPARK-6648
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
Reporter: Marius Soutier

 When reading from multiple parquet files (via 
 sqlContext.parquetFile(/path/1.parquet,/path/2.parquet), and one of the 
 parquet files is being overwritten using a different coalesce (e.g. one only 
 contains part-r-1.parquet, the other also part-r-2.parquet, 
 part-r-3.parquet), the reading fails with:
 ERROR c.w.r.websocket.ParquetReader  efault-dispatcher-63 : Failed reading 
 parquet file
 java.lang.IllegalArgumentException: Could not find Parquet metadata at path 
 path
 at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
  ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
   at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
  ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
   at scala.Option.getOrElse(Option.scala:120) 
 ~[org.scala-lang.scala-library-2.10.4.jar:na]
   at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.scala:458)
  ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
   at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:477)
  

[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Evan Sparks (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390782#comment-14390782
 ] 

Evan Sparks commented on SPARK-6646:


Guys - you're clearly ignoring prior work. The database community solved this 
problem 20 years ago with the Gubba project - a mature prototype [can be seen 
here|http://i.imgur.com/FJK7K9x.jpg]. 

Additionally, everyone knows that joins don't scale on iOS, and you'll never be 
able to build indexes on this platform.


 Spark 2.0: Rearchitecting Spark for Mobile Platforms
 

 Key: SPARK-6646
 URL: https://issues.apache.org/jira/browse/SPARK-6646
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker
 Attachments: Spark on Mobile - Design Doc - v1.pdf


 Mobile computing is quickly rising to dominance, and by the end of 2017, it 
 is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
 Spark’s project goal can be accomplished only when Spark runs efficiently for 
 the growing population of mobile users.
 Designed and optimized for modern data centers and Big Data applications, 
 Spark is unfortunately not a good fit for mobile computing today. In the past 
 few months, we have been prototyping the feasibility of a mobile-first Spark 
 architecture, and today we would like to share with you our findings. This 
 ticket outlines the technical design of Spark’s mobile support, and shares 
 results from several early prototypes.
 Mobile friendly version of the design doc: 
 https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6433) hive tests to import spark-sql test JAR for QueryTest access

2015-04-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6433:
-
Assignee: Steve Loughran

 hive tests to import spark-sql test JAR for QueryTest access
 

 Key: SPARK-6433
 URL: https://issues.apache.org/jira/browse/SPARK-6433
 Project: Spark
  Issue Type: Improvement
  Components: Build, SQL
Affects Versions: 1.4.0
Reporter: Steve Loughran
Assignee: Steve Loughran
Priority: Minor
 Fix For: 1.4.0

   Original Estimate: 0.5h
  Remaining Estimate: 0.5h

 The hive module has its own clone of {{org.apache.spark.sql.QueryPlan}} and 
 {{org.apache.spark.sql.catalyst.plans.PlanTest}}, which are copied from the 
 spark-sql module because it's hard to have Maven allow one subproject to depend 
 on another subproject's test code.
 It's actually relatively straightforward:
 # tell Maven to build and publish the test JARs
 # import them in your other subprojects
 There is one consequence: the JARs will also end up being published to Maven 
 Central. This is not really a bad thing; it does help downstream projects 
 pick up the JARs too. It does become an issue if a test run depends on a 
 custom file under {{src/test/resources}} containing things like EC2 
 authentication keys, or even just log4j.properties files which can interfere 
 with each other. These need to be excluded; the simplest way is to exclude 
 all of the resources from the test JARs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6649) DataFrame created through SQLContext.jdbc() failed if columns table must be quoted

2015-04-01 Thread JIRA
Frédéric Blanc created SPARK-6649:
-

 Summary: DataFrame created through SQLContext.jdbc() failed if 
columns table must be quoted
 Key: SPARK-6649
 URL: https://issues.apache.org/jira/browse/SPARK-6649
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Frédéric Blanc
Priority: Minor


If I want to import the content of a table from Oracle that contains a column 
named COMMENT (a reserved keyword), I cannot use a DataFrame that maps all 
the columns of this table.

{code:title=ddl.sql|borderStyle=solid}
CREATE TABLE TEST_TABLE (
COMMENT VARCHAR2(10)
);
{code}

{code:title=test.java|borderStyle=solid}
SQLContext sqlContext = ...

DataFrame df = sqlContext.jdbc(databaseURL, "TEST_TABLE");
df.rdd();   // fails if the table contains a column with a reserved keyword
{code}

The same problem can be encountered if reserved keywords are used in table names.

The JDBCRDD Scala class could be improved if the columnList initializer appended 
double quotes around each column name (line 225).
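A sketch of the kind of quoting meant here, written as a hypothetical helper rather than the actual JDBCRDD code:

{code}
// Hypothetical helper: wrap every column name in double quotes so reserved
// words such as COMMENT are safe in the generated SELECT list.
def quotedColumnList(columns: Seq[String]): String =
  columns.map(c => "\"" + c + "\"").mkString(", ")

// quotedColumnList(Seq("ID", "COMMENT")) returns: "ID", "COMMENT"
{code}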






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6433) hive tests to import spark-sql test JAR for QueryTest access

2015-04-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6433.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5119
[https://github.com/apache/spark/pull/5119]

 hive tests to import spark-sql test JAR for QueryTest access
 

 Key: SPARK-6433
 URL: https://issues.apache.org/jira/browse/SPARK-6433
 Project: Spark
  Issue Type: Improvement
  Components: Build, SQL
Affects Versions: 1.4.0
Reporter: Steve Loughran
Priority: Minor
 Fix For: 1.4.0

   Original Estimate: 0.5h
  Remaining Estimate: 0.5h

 The hive module has its own clone of {{org.apache.spark.sql.QueryPlan}} and 
 {{org.apache.spark.sql.catalyst.plans.PlanTest}}, which are copied from the 
 spark-sql module because it's hard to have Maven allow one subproject to depend 
 on another subproject's test code.
 It's actually relatively straightforward:
 # tell Maven to build and publish the test JARs
 # import them in your other subprojects
 There is one consequence: the JARs will also end up being published to Maven 
 Central. This is not really a bad thing; it does help downstream projects 
 pick up the JARs too. It does become an issue if a test run depends on a 
 custom file under {{src/test/resources}} containing things like EC2 
 authentication keys, or even just log4j.properties files which can interfere 
 with each other. These need to be excluded; the simplest way is to exclude 
 all of the resources from the test JARs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6334) spark-local dir not getting cleared during ALS

2015-04-01 Thread Antony Mayi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antony Mayi updated SPARK-6334:
---
Attachment: gc.png

 spark-local dir not getting cleared during ALS
 --

 Key: SPARK-6334
 URL: https://issues.apache.org/jira/browse/SPARK-6334
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Antony Mayi
 Attachments: als-diskusage.png, gc.png


 When running bigger ALS training, Spark spills loads of temporary data into the 
 local-dir (in my case yarn/local/usercache/antony.mayi/appcache/... - running 
 on YARN from CDH 5.3.2), eventually causing all the disks of all nodes to run 
 out of space (in my case I have 12TB of available disk capacity before 
 kicking off the ALS, but it all gets used, and YARN kills the containers when 
 they reach 90%).
 Even with all recommended options (configuring checkpointing and forcing GC 
 when possible) it still doesn't get cleared.
 here is my (pseudo)code (pyspark):
 {code}
 sc.setCheckpointDir('/tmp')
 training = 
 sc.pickleFile('/tmp/dataset').repartition(768).persist(StorageLevel.MEMORY_AND_DISK)
 model = ALS.trainImplicit(training, 50, 15, lambda_=0.1, blocks=-1, alpha=40)
 sc._jvm.System.gc()
 {code}
 The training RDD has about 3.5 billion items (~60GB on disk). After about 
 6 hours the ALS will consume all 12TB of disk space in the local-dir and 
 get killed. My cluster has 192 cores and 1.5TB RAM, and for this task I am using 
 37 executors of 4 cores / 28+4GB RAM each.
 This is the graph of the disk consumption pattern, showing the space all being 
 eaten from 7% to 90% during the ALS (90% is when YARN kills the container):
 !als-diskusage.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6334) spark-local dir not getting cleared during ALS

2015-04-01 Thread Antony Mayi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390771#comment-14390771
 ] 

Antony Mayi commented on SPARK-6334:


bq. btw. I see based on the sourcecode checkpointing should be happening every 
3 iterations - how comes I don't see any drops in the disk usage at least once 
every three iterations? it just seems to be growing constantly... which worries 
me that even more frequent checkpointing wont help...

OK, I am now sure that increasing the checkpointing interval is likely not going to 
help, just as it is not helping now - the disk usage just grows even after 3x 
iterations. I just tried a dirty hack - running a parallel thread that forces GC 
every x minutes - and suddenly I can see the disk space getting cleared upon 
every third iteration when GC runs.

See this pattern - the first run is without forcing GC, and then another one where 
there are noticeable disk usage drops every three steps (ALS iterations):
!gc.png!

So really what's needed to get the shuffles cleaned up upon checkpointing is 
forcing GC.

This was my dirty hack:

{code}
from threading import Thread, Event

# Imports added for completeness (the snippet assumes a running PySpark shell with `sc`).
from pyspark import StorageLevel
from pyspark.mllib.recommendation import ALS

class GC(Thread):
    # Background thread that periodically triggers a JVM GC through the SparkContext's JVM gateway.
    def __init__(self, context, period=600):
        Thread.__init__(self)
        self.context = context
        self.period = period
        self.daemon = True
        self.stopped = Event()

    def stop(self):
        self.stopped.set()

    def run(self):
        self.stopped.clear()
        while not self.stopped.is_set():
            self.stopped.wait(self.period)
            self.context._jvm.System.gc()

sc.setCheckpointDir('/tmp')

gc = GC(sc)
gc.start()

training = sc.pickleFile('/tmp/dataset').repartition(768).persist(StorageLevel.MEMORY_AND_DISK)
model = ALS.trainImplicit(training, 50, 15, lambda_=0.1, blocks=-1, alpha=40)

gc.stop()
{code}

 spark-local dir not getting cleared during ALS
 --

 Key: SPARK-6334
 URL: https://issues.apache.org/jira/browse/SPARK-6334
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Antony Mayi
 Attachments: als-diskusage.png, gc.png


 When running bigger ALS training, Spark spills loads of temporary data into the 
 local-dir (in my case yarn/local/usercache/antony.mayi/appcache/... - running 
 on YARN from CDH 5.3.2), eventually causing all the disks of all nodes to run 
 out of space (in my case I have 12TB of available disk capacity before 
 kicking off the ALS, but it all gets used, and YARN kills the containers when 
 they reach 90%).
 Even with all recommended options (configuring checkpointing and forcing GC 
 when possible) it still doesn't get cleared.
 here is my (pseudo)code (pyspark):
 {code}
 sc.setCheckpointDir('/tmp')
 training = 
 sc.pickleFile('/tmp/dataset').repartition(768).persist(StorageLevel.MEMORY_AND_DISK)
 model = ALS.trainImplicit(training, 50, 15, lambda_=0.1, blocks=-1, alpha=40)
 sc._jvm.System.gc()
 {code}
 The training RDD has about 3.5 billion items (~60GB on disk). After about 
 6 hours the ALS will consume all 12TB of disk space in the local-dir and 
 get killed. My cluster has 192 cores and 1.5TB RAM, and for this task I am using 
 37 executors of 4 cores / 28+4GB RAM each.
 This is the graph of the disk consumption pattern, showing the space all being 
 eaten from 7% to 90% during the ALS (90% is when YARN kills the container):
 !als-diskusage.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-6334) spark-local dir not getting cleared during ALS

2015-04-01 Thread Antony Mayi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antony Mayi reopened SPARK-6334:


 spark-local dir not getting cleared during ALS
 --

 Key: SPARK-6334
 URL: https://issues.apache.org/jira/browse/SPARK-6334
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Antony Mayi
 Attachments: als-diskusage.png, gc.png


 When running bigger ALS training, Spark spills loads of temporary data into the 
 local-dir (in my case yarn/local/usercache/antony.mayi/appcache/... - running 
 on YARN from CDH 5.3.2), eventually causing all the disks of all nodes to run 
 out of space (in my case I have 12TB of available disk capacity before 
 kicking off the ALS, but it all gets used, and YARN kills the containers when 
 they reach 90%).
 Even with all recommended options (configuring checkpointing and forcing GC 
 when possible) it still doesn't get cleared.
 here is my (pseudo)code (pyspark):
 {code}
 sc.setCheckpointDir('/tmp')
 training = 
 sc.pickleFile('/tmp/dataset').repartition(768).persist(StorageLevel.MEMORY_AND_DISK)
 model = ALS.trainImplicit(training, 50, 15, lambda_=0.1, blocks=-1, alpha=40)
 sc._jvm.System.gc()
 {code}
 The training RDD has about 3.5 billion items (~60GB on disk). After about 
 6 hours the ALS will consume all 12TB of disk space in the local-dir and 
 get killed. My cluster has 192 cores and 1.5TB RAM, and for this task I am using 
 37 executors of 4 cores / 28+4GB RAM each.
 This is the graph of the disk consumption pattern, showing the space all being 
 eaten from 7% to 90% during the ALS (90% is when YARN kills the container):
 !als-diskusage.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2243) Support multiple SparkContexts in the same JVM

2015-04-01 Thread Jason Hubbard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391770#comment-14391770
 ] 

Jason Hubbard edited comment on SPARK-2243 at 4/1/15 11:40 PM:
---

Apologizing for being flippant is a bit of an oxymoron isn't it?

The answer you propose is the only one available, but it isn't a real 
solution; it's a workaround. Obviously, running in separate JVMs causes other 
issues with the overhead of starting multiple JVMs and the complexity of having to 
serialize data so they can communicate. Having multiple workloads in the same 
SparkContext is what I have chosen, but sometimes you would like different 
settings for the different workloads, which this does not allow.


was (Author: jahubba):
Apologizing for being flippant is a bit of an oxymoron isn't it?

The answer you purpose is the only one available, but it isn't a real solution, 
it's a workaround.  Obviously running in separate JVMs causes other issues with 
overhead of starting multiple JVMs and the complexity of having to serialize 
data so they can communicate.  Having multiple workloads in the same 
SparkContext is what I have chosen, but sometimes you would like different 
settings for the different workloads which this would now not allow.

 Support multiple SparkContexts in the same JVM
 --

 Key: SPARK-2243
 URL: https://issues.apache.org/jira/browse/SPARK-2243
 Project: Spark
  Issue Type: New Feature
  Components: Block Manager, Spark Core
Affects Versions: 0.7.0, 1.0.0, 1.1.0
Reporter: Miguel Angel Fernandez Diaz

 We're developing a platform where we create several Spark contexts for 
 carrying out different calculations. Is there any restriction when using 
 several Spark contexts? We have two contexts, one for Spark calculations and 
 another one for Spark Streaming jobs. The next error arises when we first 
 execute a Spark calculation and, once the execution is finished, a Spark 
 Streaming job is launched:
 {code}
 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
   at 
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
   at 
 org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
   at 
 org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
   at 
 org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
   at 
 org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
   at 
 java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
   at 
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
   at 
 org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 14/06/23 16:40:08 WARN 

[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2015-04-01 Thread Jason Hubbard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391770#comment-14391770
 ] 

Jason Hubbard commented on SPARK-2243:
--

Apologizing for being flippant is a bit of an oxymoron isn't it?

The answer you propose is the only one available, but it isn't a real 
solution; it's a workaround. Obviously, running in separate JVMs causes other 
issues with the overhead of starting multiple JVMs and the complexity of having to 
serialize data so they can communicate. Having multiple workloads in the same 
SparkContext is what I have chosen, but sometimes you would like different 
settings for the different workloads, which this does not allow.

 Support multiple SparkContexts in the same JVM
 --

 Key: SPARK-2243
 URL: https://issues.apache.org/jira/browse/SPARK-2243
 Project: Spark
  Issue Type: New Feature
  Components: Block Manager, Spark Core
Affects Versions: 0.7.0, 1.0.0, 1.1.0
Reporter: Miguel Angel Fernandez Diaz

 We're developing a platform where we create several Spark contexts for 
 carrying out different calculations. Is there any restriction when using 
 several Spark contexts? We have two contexts, one for Spark calculations and 
 another one for Spark Streaming jobs. The next error arises when we first 
 execute a Spark calculation and, once the execution is finished, a Spark 
 Streaming job is launched:
 {code}
 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
   at 
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
   at 
 org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
   at 
 org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
   at 
 org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
   at 
 org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
   at 
 java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
   at 
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
   at 
 org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to 
 java.io.FileNotFoundException
 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
   at 
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
   at 
 org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
   at 
 org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 

[jira] [Resolved] (SPARK-6553) Support for functools.partial as UserDefinedFunction

2015-04-01 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-6553.
---
   Resolution: Fixed
Fix Version/s: 1.4.0
   1.3.1

 Support for functools.partial as UserDefinedFunction
 

 Key: SPARK-6553
 URL: https://issues.apache.org/jira/browse/SPARK-6553
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.3.0
Reporter: Kalle Jepsen
Assignee: Kalle Jepsen
  Labels: features
 Fix For: 1.3.1, 1.4.0


 Currently {{functools.partial}} s cannot be used as {{UserDefinedFunction}} s 
 for {{DataFrame}} s, as  the {{\_\_name\_\_}} attribute does not exist. 
 Passing a {{functools.partial}} object will raise an Exception at 
 https://github.com/apache/spark/blob/master/python/pyspark/sql/functions.py#L126.
  
 {{functools.partial}} is very widely used and should probably be supported, 
 despite its lack of a {{\_\_name\_\_}}.
 My suggestion is to use {{f.\_\_repr\_\_()}} instead, or check with 
 {{hasattr(f, '\_\_name\_\_')}} and use {{\_\_class\_\_}} if {{False}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6613) Starting stream from checkpoint causes Streaming tab to throw error

2015-04-01 Thread zhichao-li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392102#comment-14392102
 ] 

zhichao-li commented on SPARK-6613:
---

Just trying to understand the issue, but it can't be reproduced on my side. If 
possible, could you elaborate on how to reproduce it? E.g. a code snippet or 
steps.

 Starting stream from checkpoint causes Streaming tab to throw error
 ---

 Key: SPARK-6613
 URL: https://issues.apache.org/jira/browse/SPARK-6613
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.1
Reporter: Marius Soutier

 When continuing my streaming job from a checkpoint, the job runs, but the 
 Streaming tab in the standard UI initially no longer works (browser just 
 shows HTTP ERROR: 500). Sometimes  it gets back to normal after a while, and 
 sometimes it stays in this state permanently.
 Stacktrace:
 WARN org.eclipse.jetty.servlet.ServletHandler: /streaming/
 java.util.NoSuchElementException: key not found: 0
   at scala.collection.MapLike$class.default(MapLike.scala:228)
   at scala.collection.AbstractMap.default(Map.scala:58)
   at scala.collection.MapLike$class.apply(MapLike.scala:141)
   at scala.collection.AbstractMap.apply(Map.scala:58)
   at 
 org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1$$anonfun$apply$5.apply(StreamingJobProgressListener.scala:151)
   at 
 org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1$$anonfun$apply$5.apply(StreamingJobProgressListener.scala:150)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.Range.foreach(Range.scala:141)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1.apply(StreamingJobProgressListener.scala:150)
   at 
 org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1.apply(StreamingJobProgressListener.scala:149)
   at scala.Option.map(Option.scala:145)
   at 
 org.apache.spark.streaming.ui.StreamingJobProgressListener.lastReceivedBatchRecords(StreamingJobProgressListener.scala:149)
   at 
 org.apache.spark.streaming.ui.StreamingPage.generateReceiverStats(StreamingPage.scala:82)
   at 
 org.apache.spark.streaming.ui.StreamingPage.render(StreamingPage.scala:43)
   at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68)
   at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68)
   at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:68)
   at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
   at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
   at 
 org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
   at 
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
   at 
 org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
   at 
 org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
   at 
 org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
   at 
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
   at 
 org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
   at 
 org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
   at org.eclipse.jetty.server.Server.handle(Server.java:370)
   at 
 org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
   at 
 org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
   at 
 org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
   at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644)
   at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
   at 
 org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
   at 
 org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
   at 
 org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
   at 
 org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
   at 
 

[jira] [Updated] (SPARK-6668) repeated asking to remove non-existent executor

2015-04-01 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-6668:
--
Affects Version/s: 1.4.0

 repeated asking to remove non-existent executor
 ---

 Key: SPARK-6668
 URL: https://issues.apache.org/jira/browse/SPARK-6668
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Davies Liu

 {code}
 15/04/01 21:37:15 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
 another address
 15/04/01 21:37:17 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
 Attempting port 4041.
 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes 
 ahead of assembly.
 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes 
 ahead of assembly.
 15/04/01 21:37:17 ERROR SparkDeploySchedulerBackend: Asked to remove 
 non-existent executor 0
 15/04/01 21:37:18 ERROR SparkDeploySchedulerBackend: Asked to remove 
 non-existent executor 1
 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes 
 ahead of assembly.
 [Stage 0:  (0 + 0) / 
 2]15/04/01 21:37:18 ERROR SparkDeploySchedulerBackend: Asked to remove 
 non-existent executor 2
 .
 15/04/01 21:37:44 ERROR SparkDeploySchedulerBackend: Asked to remove 
 non-existent executor 244
 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes 
 ahead of assembly.
 15/04/01 21:37:44 ERROR SparkDeploySchedulerBackend: Asked to remove 
 non-existent executor 245
 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes 
 ahead of assembly.
 15/04/01 21:37:44 ERROR SparkDeploySchedulerBackend: Asked to remove 
 non-existent executor 246
 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes 
 ahead of assembly.
 15/04/01 21:37:45 ERROR SparkDeploySchedulerBackend: Asked to remove 
 non-existent executor 247
 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes 
 ahead of assembly.
 15/04/01 21:37:45 ERROR SparkDeploySchedulerBackend: Asked to remove 
 non-existent executor 248
 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes 
 ahead of assembly.
 15/04/01 21:37:45 ERROR SparkDeploySchedulerBackend: Asked to remove 
 non-existent executor 249
 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes 
 ahead of assembly.
 15/04/01 21:37:45 ERROR SparkDeploySchedulerBackend: Asked to remove 
 non-existent executor 250
 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes 
 ahead of assembly.
 15/04/01 21:37:45 ERROR SparkDeploySchedulerBackend: Asked to remove 
 non-existent executor 251
 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes 
 ahead of assembly.
 15/04/01 21:37:45 ERROR SparkDeploySchedulerBackend: Asked to remove 
 non-existent executor 252
 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes 
 ahead of assembly.
 15/04/01 21:37:45 ERROR SparkDeploySchedulerBackend: Asked to remove 
 non-existent executor 253
 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes 
 ahead of assembly.
 15/04/01 21:37:45 ERROR SparkDeploySchedulerBackend: Asked to remove 
 non-existent executor 254
 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes 
 ahead of assembly.
 15/04/01 21:37:45 ERROR SparkDeploySchedulerBackend: Asked to remove 
 non-existent executor 255
 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes 
 ahead of assembly.
 15/04/01 21:37:45 ERROR SparkDeploySchedulerBackend: Asked to remove 
 non-existent executor 256
 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes 
 ahead of assembly.
 15/04/01 21:37:46 ERROR SparkDeploySchedulerBackend: Asked to remove 
 non-existent executor 257
 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes 
 ahead of assembly.
 15/04/01 21:37:46 ERROR SparkDeploySchedulerBackend: Asked to remove 
 non-existent executor 258
 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes 
 ahead of assembly.
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6670) HiveContext.analyze should throw UnsupportedOperationException instead of NotImplementedError

2015-04-01 Thread Yin Huai (JIRA)
Yin Huai created SPARK-6670:
---

 Summary: HiveContext.analyze should throw 
UnsupportedOperationException instead of NotImplementedError
 Key: SPARK-6670
 URL: https://issues.apache.org/jira/browse/SPARK-6670
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.2.0
Reporter: Yin Huai
Assignee: Michael Armbrust






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6662) Allow variable substitution in spark.yarn.historyServer.address

2015-04-01 Thread Cheolsoo Park (JIRA)
Cheolsoo Park created SPARK-6662:


 Summary: Allow variable substitution in 
spark.yarn.historyServer.address
 Key: SPARK-6662
 URL: https://issues.apache.org/jira/browse/SPARK-6662
 Project: Spark
  Issue Type: Wish
  Components: YARN
Affects Versions: 1.3.0
Reporter: Cheolsoo Park
Priority: Minor


In Spark on YARN, an explicit hostname and port number need to be set for 
spark.yarn.historyServer.address in SparkConf to make the HISTORY link work. If 
the history server address is known and static, this is usually not a problem.

But in the cloud, that is usually not true. In particular, on EMR the history server 
always runs on the same node as the RM. So I could simply set it to 
{{$\{yarn.resourcemanager.hostname\}:18080}} if variable substitution were 
allowed.

In fact, Hadoop configuration already implements variable substitution, so if 
this property is read via YarnConf, this can easily be achieved.
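For illustration, a minimal sketch of Hadoop's built-in substitution behaviour (the keys and values below are made up for the example):

{code}
import org.apache.hadoop.conf.Configuration

val conf = new Configuration()
conf.set("yarn.resourcemanager.hostname", "rm-host.example.com")
conf.set("spark.yarn.historyServer.address", "${yarn.resourcemanager.hostname}:18080")

// Configuration.get() expands ${...} references against other keys
// (and system properties), so the address resolves to the RM host.
println(conf.get("spark.yarn.historyServer.address"))  // rm-host.example.com:18080
{code}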



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6664) Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)

2015-04-01 Thread Florian Verhein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florian Verhein updated SPARK-6664:
---
Description: 
I can't find this functionality (if I missed something, apologies!), but it 
would be very useful for evaluating ml models.  

*Use case example* 
suppose you have pre-processed web logs for a few months, and now want to split 
it into a training set (where you train a model to predict some aspect of site 
accesses, perhaps per user) and an out of time test set (where you evaluate how 
well your model performs in the future). This example has just a single split, 
but in general you could want more for cross validation. You may also want to 
have multiple overlapping intervals.

*Specification* 

1. Given an Ordered RDD and an ordered sequence of n boundaries (i.e. keys), 
return n+1 RDDs such that values in the ith RDD are within the (i-1)th and ith 
boundary.

2. More complex alternative (but similar under the hood): provide a sequence of 
possibly overlapping intervals (ordered by the start key of the interval), and 
return the RDDs containing values within those intervals. 

*Implementation ideas / notes for 1*

- The ordered RDDs are likely RangePartitioned (or there should be a simple way 
to find ranges from partitions in an ordered RDD)
- Find the partitions containing the boundary, and split them in two.  
- Construct the new RDDs from the original partitions (and any split ones)

I suspect this could be done by launching only a few jobs to split the 
partitions containing the boundaries. 
Alternatively, it might be possible to decorate these partitions and use them 
in more than one RDD. I.e. let one of these partitions (for boundary i) be p. 
Apply two decorators p' and p'', where p' masks out values above the ith 
boundary, and p'' masks out values below the ith boundary. Any operations on 
these partitions apply only to values not masked out. Then assign p' to the ith 
output RDD and p'' to the (i+1)th output RDD.
If I understand Spark correctly, this should not require any jobs. Not sure 
whether it's worth trying this optimisation.

*Implementation ideas / notes for 2*
This is very similar, except that we have to handle entire (or parts) of 
partitions belonging to more than one output RDD, since they are no longer 
mutually exclusive. But since RDDs are immutable(??), the decorator idea should 
still work?

Thoughts?
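As a naive baseline for specification 1, here is a sketch that simply runs one {{filter}} per interval over the input RDD, so it scans the input once per output RDD instead of splitting partitions (the method name and signature are assumptions for illustration):

{code}
import org.apache.spark.rdd.RDD

// Naive baseline: one filter per output RDD, no partition splitting.
// Given sorted boundaries b1 < ... < bn, returns n+1 RDDs covering
// (-inf, b1), [b1, b2), ..., [bn, +inf).
def splitByBoundaries[K: Ordering, V](rdd: RDD[(K, V)], boundaries: Seq[K]): Seq[RDD[(K, V)]] = {
  val ord = implicitly[Ordering[K]]
  val lowers: Seq[Option[K]] = None +: boundaries.map(Option(_))
  val uppers: Seq[Option[K]] = boundaries.map(Option(_)) :+ None
  lowers.zip(uppers).map { case (lo, hi) =>
    rdd.filter { case (k, _) =>
      lo.forall(l => ord.gteq(k, l)) && hi.forall(h => ord.lt(k, h))
    }
  }
}
{code}

The partition-splitting approach described above would avoid these repeated scans.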


  was:

I can't find this functionality (if I missed something, apologies!), but it 
would be very useful for evaluating ml models.  

Use case example: 
suppose you have pre-processed web logs for a few months, and now want to split 
it into a training set (where you train a model to predict some aspect of site 
accesses, perhaps per user) and an out of time test set (where you evaluate how 
well your model performs in the future). This example has just a single split, 
but in general you could want more for cross validation. You may also want to 
have multiple overlaping intervals.   

Specification: 

1. Given an Ordered RDD and an ordered sequence of n boundaries (i.e. keys), 
return n+1 RDDs such that values in the ith RDD are within the (i-1)th and ith 
boundary.

2. More complex alternative (but similar under the hood): provide a sequence of 
possibly overlapping intervals, and return the RDDs containing values within 
those intervals. 

Implementation ideas / notes for 1:

- The ordered RDDs are likely RangePartitioned (or there should be a simple way 
to find ranges from partitions in an ordered RDD)
- Find the partitions containing the boundary, and split them in two.  
- Construct the new RDDs from the original partitions (and any split ones)

I suspect this could be done by launching only a few jobs to split the 
partitions containing the boundaries. 
Alternatively, it might be possible to decorate these partitions and use them 
in more than one RDD. I.e. let one of these partitions (for boundary i) be p. 
Apply two decorators p' and p'', where p' is masks out values above the ith 
boundary, and p'' masks out values below the ith boundary. Any operations on 
these partitions apply only to values not masked out. Then assign p' to the ith 
output RDD and p'' to the (i+1)th output RDD.
If I understand Spark correctly, this should not require any jobs. Not sure 
whether it's worth trying this optimisation.

Implementation ideas / notes for 2:
This is very similar, except that we have to handle entire (or parts) of 
partitions belonging to more than one output RDD, since they are no longer 
mutually exclusive. But since RDDs are immutable(?), the decorator idea should 
still work?

Thoughts?



 Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
 --

 Key: SPARK-6664
 URL: https://issues.apache.org/jira/browse/SPARK-6664
   

[jira] [Commented] (SPARK-6106) Support user group mapping and groups in view, modify and admin acls

2015-04-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392188#comment-14392188
 ] 

Apache Spark commented on SPARK-6106:
-

User 'colinmjj' has created a pull request for this issue:
https://github.com/apache/spark/pull/5325

 Support user group mapping and groups in view, modify and admin acls
 

 Key: SPARK-6106
 URL: https://issues.apache.org/jira/browse/SPARK-6106
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Jerry Chen
  Labels: Rhino, Security
 Attachments: SPARK-6106.001.patch

   Original Estimate: 672h
  Remaining Estimate: 672h

 Spark supports various ACL settings for a job to control the visibility of the job 
 and user privileges.
 Currently, the ACLs (view, modify and admin) are specified as a list of users.
 As a convention, Hadoop Common supports a mechanism known as user group 
 mapping, and group names can be specified in ACLs. The ability to do user group 
 mapping and to allow groups to be specified in ACLs would greatly improve 
 flexibility and support enterprise use cases such as AD group integration.
 This JIRA proposes to support user group mapping in Spark ACL control 
 and to allow specifying group names in the various ACLs.
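 A sketch of what a group-aware ACL check could look like (hypothetical names, not an actual Spark API):
{code}
// Hypothetical check: a user passes the ACL if it contains the wildcard,
// the user name, or any group the user belongs to.
def isAuthorized(user: String, userGroups: Set[String], acl: Set[String]): Boolean =
  acl.contains("*") || acl.contains(user) || userGroups.exists(acl.contains)
{code}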



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6580) Optimize LogisticRegressionModel.predictPoint

2015-04-01 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6580:
-
Assignee: Yanbo Liang

 Optimize LogisticRegressionModel.predictPoint
 -

 Key: SPARK-6580
 URL: https://issues.apache.org/jira/browse/SPARK-6580
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Yanbo Liang
Priority: Minor

 LogisticRegressionModel.predictPoint could be optimized somewhat. There are 
 several checks which could be moved outside loops, or even out of 
 predictPoint and into initialization of the model.
 Some include:
 {code}
 require(numFeatures == weightMatrix.size)
 val dataWithBiasSize = weightMatrix.size / (numClasses - 1)
 val weightsArray = weightMatrix match { ...
 if (dataMatrix.size + 1 == dataWithBiasSize) {...
 {code}
 Also, for multiclass, the 2 loops (over numClasses and margins) could be 
 combined into 1 loop.
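 For illustration, a sketch of the hoisting idea using hypothetical names (not the real {{LogisticRegressionModel}} code); the size-derived values are computed once at construction instead of on every call:
{code}
// Hypothetical simplified model; only illustrates moving invariant work out of predictPoint.
class SimpleModel(weights: Array[Double], numClasses: Int, numFeatures: Int) {
  // Computed once here rather than on every predictPoint call.
  private val dataWithBiasSize: Int = weights.length / (numClasses - 1)
  private val withBias: Boolean = numFeatures + 1 == dataWithBiasSize

  def predictPoint(features: Array[Double]): Double = {
    require(features.length == numFeatures)  // cheap per-call sanity check can stay
    // ... the margin loops (combinable into one) would go here, using dataWithBiasSize and withBias ...
    0.0
  }
}
{code}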



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6553) Support for functools.partial as UserDefinedFunction

2015-04-01 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391851#comment-14391851
 ] 

Josh Rosen commented on SPARK-6553:
---

This was fixed by https://github.com/apache/spark/pull/5206 for 1.3.1 and 1.4.0.

 Support for functools.partial as UserDefinedFunction
 

 Key: SPARK-6553
 URL: https://issues.apache.org/jira/browse/SPARK-6553
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.3.0
Reporter: Kalle Jepsen
Assignee: Kalle Jepsen
  Labels: features
 Fix For: 1.3.1, 1.4.0


 Currently, {{functools.partial}} objects cannot be used as {{UserDefinedFunction}}s 
 for {{DataFrame}}s, as the {{\_\_name\_\_}} attribute does not exist. 
 Passing a {{functools.partial}} object will raise an Exception at 
 https://github.com/apache/spark/blob/master/python/pyspark/sql/functions.py#L126.
  
 {{functools.partial}} is very widely used and should probably be supported, 
 despite its lack of a {{\_\_name\_\_}}.
 My suggestion is to use {{f.\_\_repr\_\_()}} instead, or to check with 
 {{hasattr(f, '\_\_name\_\_')}} and use {{\_\_class\_\_}} if {{False}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6553) Support for functools.partial as UserDefinedFunction

2015-04-01 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-6553:
--
Assignee: Kalle Jepsen

 Support for functools.partial as UserDefinedFunction
 

 Key: SPARK-6553
 URL: https://issues.apache.org/jira/browse/SPARK-6553
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.3.0
Reporter: Kalle Jepsen
Assignee: Kalle Jepsen
  Labels: features
 Fix For: 1.3.1, 1.4.0


 Currently, {{functools.partial}} objects cannot be used as {{UserDefinedFunction}}s 
 for {{DataFrame}}s, as the {{\_\_name\_\_}} attribute does not exist. 
 Passing a {{functools.partial}} object will raise an Exception at 
 https://github.com/apache/spark/blob/master/python/pyspark/sql/functions.py#L126.
  
 {{functools.partial}} is very widely used and should probably be supported, 
 despite its lack of a {{\_\_name\_\_}}.
 My suggestion is to use {{f.\_\_repr\_\_()}} instead, or to check with 
 {{hasattr(f, '\_\_name\_\_')}} and use {{\_\_class\_\_}} if {{False}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6653) New configuration property to specify port for sparkYarnAM actor system

2015-04-01 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391993#comment-14391993
 ] 

Shixiong Zhu commented on SPARK-6653:
-

Could you send a pull request to https://github.com/apache/spark ?

And because this is a yarn configuration, I recommend spark.yarn.am.port.

 New configuration property to specify port for sparkYarnAM actor system
 ---

 Key: SPARK-6653
 URL: https://issues.apache.org/jira/browse/SPARK-6653
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.3.0
 Environment: Spark On Yarn
Reporter: Manoj Samel

 In the 1.3.0 code line, the sparkYarnAM actor system is started on a random port. See 
 org.apache.spark.deploy.yarn ApplicationMaster.scala:282
 actorSystem = AkkaUtils.createActorSystem("sparkYarnAM", Utils.localHostName, 
 0, conf = sparkConf, securityManager = securityMgr)._1
 This may be an issue when ports between the Spark client and the YARN cluster are 
 limited by a firewall and not all ports are open between the client and the YARN 
 cluster.
 The proposal is to introduce a new property spark.am.actor.port and change the code to
 val port = sparkConf.getInt("spark.am.actor.port", 0)
 actorSystem = AkkaUtils.createActorSystem("sparkYarnAM", 
 Utils.localHostName, port,
   conf = sparkConf, securityManager = securityMgr)._1
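 A minimal standalone sketch of the proposed lookup (the property name follows the 
 {{spark.yarn.am.port}} suggestion from the comment above; it is not an existing 1.3 
 setting), with 0 as the default to preserve today's random-port behaviour:
 {code}
 import org.apache.spark.SparkConf

 object AmPortSketch {
   def main(args: Array[String]): Unit = {
     val sparkConf = new SparkConf()
     // 0 means "pick a random port", matching the current behaviour.
     val amPort = sparkConf.getInt("spark.yarn.am.port", 0)
     println(s"sparkYarnAM actor system would bind to port: $amPort")
   }
 }
 {code}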



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6665) Randomly Shuffle an RDD

2015-04-01 Thread Florian Verhein (JIRA)
Florian Verhein created SPARK-6665:
--

 Summary: Randomly Shuffle an RDD 
 Key: SPARK-6665
 URL: https://issues.apache.org/jira/browse/SPARK-6665
 Project: Spark
  Issue Type: New Feature
  Components: Spark Shell
Reporter: Florian Verhein
Priority: Minor


*Use case* 
An RDD is created in a way that has some ordering, but you need to shuffle it because 
the ordering would cause problems downstream. E.g.
- it will be used to train an ML algorithm that makes stochastic assumptions (like 
SGD) 
- used as input for cross validation. e.g. after the shuffle, you could just 
grab partitions (or part files if saved to hdfs) as folds

Related question in mailing list: 
http://apache-spark-user-list.1001560.n3.nabble.com/random-shuffle-streaming-RDDs-td17965.html

*Possible implementation*
As mentioned by [~sowen] in the above thread, one could sort by a good hash of 
the element (or key, if it's a pair RDD) combined with a random salt, as sketched below. 
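A sketch of that approach, assuming Spark's standard RDD API and local mode: key each 
element by a hash of (element, salt) and sort by that key.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.hashing.MurmurHash3

object RandomShuffleSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-sketch").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 1000)   // has an obvious ordering we want to destroy

    // A fresh salt per run gives a different order each time; a decent hash of
    // (element, salt) breaks up any pre-existing ordering.
    val salt = scala.util.Random.nextInt()
    val shuffled = rdd.sortBy(x => MurmurHash3.productHash((x, salt)))

    println(shuffled.take(10).mkString(", "))
    sc.stop()
  }
}
{code}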




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6666) org.apache.spark.sql.jdbc.JDBCRDD does not escape/quote column names

2015-04-01 Thread John Ferguson (JIRA)
John Ferguson created SPARK-:


 Summary: org.apache.spark.sql.jdbc.JDBCRDD  does not escape/quote 
column names
 Key: SPARK-
 URL: https://issues.apache.org/jira/browse/SPARK-
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
 Environment:  
Reporter: John Ferguson
Priority: Critical


Is there a way to have JDBC DataFrames use quoted/escaped column names?  Right 
now, it looks like it sees the names correctly in the schema created but does 
not escape them in the SQL it creates when they are not compliant:

org.apache.spark.sql.jdbc.JDBCRDD

private val columnList: String = {
val sb = new StringBuilder()
columns.foreach(x => sb.append(",").append(x))
if (sb.length == 0) "1" else sb.substring(1)
}


If you see value in this, I would take a shot at adding the quoting (escaping) 
of column names here.  If you don't do it, some drivers, like PostgreSQL's, 
will simply fold all names to lower case when parsing the query.  As you can see in the 
TL;DR below, that means they won't match the schema I am given.
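For illustration only (not the actual JDBCRDD patch), a standalone sketch of what quoting 
could look like, using the ANSI double quote and doubling embedded quotes; a real fix would 
take the identifier quote string from the JDBC driver or dialect:
{code}
object QuotedColumnListSketch {
  // Build a SELECT column list with double-quoted identifiers; "1" stands in
  // for an empty projection, mirroring the existing columnList behaviour.
  def columnList(columns: Seq[String]): String =
    if (columns.isEmpty) "1"
    else columns.map(c => "\"" + c.replace("\"", "\"\"") + "\"").mkString(",")

  def main(args: Array[String]): Unit = {
    println(columnList(Seq("Symbol", "Dividend Yield", "Price/Earnings")))
    // prints: "Symbol","Dividend Yield","Price/Earnings"
  }
}
{code}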

TL;DR:
 
I am able to connect to a Postgres database in the shell (with driver 
referenced):

   val jdbcDf = 
 sqlContext.jdbc("jdbc:postgresql://localhost/sparkdemo?user=dbuser", "sp500")

In fact when I run:

   jdbcDf.registerTempTable("sp500")
   val avgEPSNamed = sqlContext.sql("SELECT AVG(`Earnings/Share`) as AvgCPI 
 FROM sp500")

and

   val avgEPSProg = jsonDf.agg(avg(jsonDf.col("Earnings/Share")))

The values come back as expected.  However, if I try:

   jdbcDf.show

Or if I try
   
   val all = sqlContext.sql("SELECT * FROM sp500")
   all.show

I get errors about column names not being found.  In fact the error includes a 
mention of column names all lower cased.  For now I will change my schema to be 
more restrictive.  Right now it is, per a Stack Overflow poster, not ANSI 
compliant: it does things that are only allowed via quoted identifiers in pgsql, MySQL and 
SQLServer.  BTW, our users are giving us tables like this... because various 
tools they already use support non-compliant names.  In fact, this is mild 
compared to what we've had to support.

Currently the schema in question uses mixed case, quoted names with special 
characters and spaces:

CREATE TABLE sp500
(
"Symbol" text,
"Name" text,
"Sector" text,
"Price" double precision,
"Dividend Yield" double precision,
"Price/Earnings" double precision,
"Earnings/Share" double precision,
"Book Value" double precision,
"52 week low" double precision,
"52 week high" double precision,
"Market Cap" double precision,
"EBITDA" double precision,
"Price/Sales" double precision,
"Price/Book" double precision,
"SEC Filings" text
) 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6106) Support user group mapping and groups in view, modify and admin acls

2015-04-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6106:
---

Assignee: Apache Spark

 Support user group mapping and groups in view, modify and admin acls
 

 Key: SPARK-6106
 URL: https://issues.apache.org/jira/browse/SPARK-6106
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Jerry Chen
Assignee: Apache Spark
  Labels: Rhino, Security
 Attachments: SPARK-6106.001.patch

   Original Estimate: 672h
  Remaining Estimate: 672h

 Spark supports various acl settings for a job to control the visibility of the job 
 and user privileges.
 Currently, the acls (view, modify and admin) are specified as a list of users.
 As a convention, Hadoop Common supports a mechanism known as user group 
 mapping, and group names can be specified in acls. The ability to do user group 
 mapping and to allow groups to be specified in acls will greatly improve 
 flexibility and support enterprise use cases such as AD group integration.
 This JIRA proposes to support user group mapping in Spark acl control 
 and to allow specifying group names in the various acls.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6106) Support user group mapping and groups in view, modify and admin acls

2015-04-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6106:
---

Assignee: (was: Apache Spark)

 Support user group mapping and groups in view, modify and admin acls
 

 Key: SPARK-6106
 URL: https://issues.apache.org/jira/browse/SPARK-6106
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Jerry Chen
  Labels: Rhino, Security
 Attachments: SPARK-6106.001.patch

   Original Estimate: 672h
  Remaining Estimate: 672h

 Spark supports various acl settings for a job to control the visibility of the job 
 and user privileges.
 Currently, the acls (view, modify and admin) are specified as a list of users.
 As a convention, Hadoop Common supports a mechanism known as user group 
 mapping, and group names can be specified in acls. The ability to do user group 
 mapping and to allow groups to be specified in acls will greatly improve 
 flexibility and support enterprise use cases such as AD group integration.
 This JIRA proposes to support user group mapping in Spark acl control 
 and to allow specifying group names in the various acls.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2243) Support multiple SparkContexts in the same JVM

2015-04-01 Thread Jason Hubbard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391770#comment-14391770
 ] 

Jason Hubbard edited comment on SPARK-2243 at 4/1/15 11:41 PM:
---

Apologizing for being flippant is a bit of an oxymoron isn't it?

The answer you propose is the only one available, but it isn't a real solution, 
it's a workaround.  Obviously running in separate JVMs causes other issues with 
overhead of starting multiple JVMs and the complexity of having to serialize 
data so they can communicate.  Having multiple workloads in the same 
SparkContext is what I have chosen, but sometimes you would like different 
settings for the different workloads which this would now not allow.


was (Author: jahubba):
Apologizing for being flippant is a bit of an oxymoron isn't it?

The answer you proprose is the only one available, but it isn't a real 
solution, it's a workaround.  Obviously running in separate JVMs causes other 
issues with overhead of starting multiple JVMs and the complexity of having to 
serialize data so they can communicate.  Having multiple workloads in the same 
SparkContext is what I have chosen, but sometimes you would like different 
settings for the different workloads which this would now not allow.

 Support multiple SparkContexts in the same JVM
 --

 Key: SPARK-2243
 URL: https://issues.apache.org/jira/browse/SPARK-2243
 Project: Spark
  Issue Type: New Feature
  Components: Block Manager, Spark Core
Affects Versions: 0.7.0, 1.0.0, 1.1.0
Reporter: Miguel Angel Fernandez Diaz

 We're developing a platform where we create several Spark contexts for 
 carrying out different calculations. Is there any restriction when using 
 several Spark contexts? We have two contexts, one for Spark calculations and 
 another one for Spark Streaming jobs. The next error arises when we first 
 execute a Spark calculation and, once the execution is finished, a Spark 
 Streaming job is launched:
 {code}
 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
   at 
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
   at 
 org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
   at 
 org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
   at 
 org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
   at 
 org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
   at 
 java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
   at 
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
   at 
 org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 14/06/23 16:40:08 WARN 

[jira] [Commented] (SPARK-6664) Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)

2015-04-01 Thread Florian Verhein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391950#comment-14391950
 ] 

Florian Verhein commented on SPARK-6664:


The closest approach I've found that should achieve the same result is calling 
OrderedRDDFunctions.filterByRange n+1 times, as sketched below. I assume this approach 
will be much slower, but it may not be if it's completely lazy. I don't know 
Spark well enough yet to be anywhere near sure of this.
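A sketch of that filterByRange approach under stated assumptions (Spark 1.3+, a sorted 
pair RDD with integer keys; boundary handling is simplified just to keep the slices disjoint):
{code}
import org.apache.spark.{SparkConf, SparkContext}

object SplitByBoundariesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("split-sketch").setMaster("local[2]"))
    val data = sc.parallelize(1 to 100).map(i => (i, s"record-$i")).sortByKey()

    // n boundaries produce n+1 slices; filterByRange can prune whole partitions
    // via the RangePartitioner instead of scanning the full RDD each time.
    val boundaries = Seq(25, 50, 75)
    val edges = Int.MinValue +: boundaries :+ Int.MaxValue
    val slices = edges.sliding(2).collect { case Seq(lo, hi) =>
      data.filterByRange(lo, hi - 1)   // bounds are inclusive, so keep ranges disjoint
    }.toSeq

    slices.zipWithIndex.foreach { case (rdd, i) => println(s"slice $i: ${rdd.count()} rows") }
    sc.stop()
  }
}
{code}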

 Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
 --

 Key: SPARK-6664
 URL: https://issues.apache.org/jira/browse/SPARK-6664
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Florian Verhein

 I can't find this functionality (if I missed something, apologies!), but it 
 would be very useful for evaluating ml models.  
 *Use case example* 
 suppose you have pre-processed web logs for a few months, and now want to 
 split it into a training set (where you train a model to predict some aspect 
 of site accesses, perhaps per user) and an out of time test set (where you 
 evaluate how well your model performs in the future). This example has just a 
 single split, but in general you might want more for cross validation. You 
 may also want to have multiple overlapping intervals.   
 *Specification* 
 1. Given an Ordered RDD and an ordered sequence of n boundaries (i.e. keys), 
 return n+1 RDDs such that values in the ith RDD are within the (i-1)th and 
 ith boundary.
 2. More complex alternative (but similar under the hood): provide a sequence 
 of possibly overlapping intervals (ordered by the start key of the interval), 
 and return the RDDs containing values within those intervals. 
 *Implementation ideas / notes for 1*
 - The ordered RDDs are likely RangePartitioned (or there should be a simple 
 way to find ranges from partitions in an ordered RDD)
 - Find the partitions containing the boundary, and split them in two.  
 - Construct the new RDDs from the original partitions (and any split ones)
 I suspect this could be done by launching only a few jobs to split the 
 partitions containing the boundaries. 
 Alternatively, it might be possible to decorate these partitions and use them 
 in more than one RDD. I.e. let one of these partitions (for boundary i) be p. 
 Apply two decorators p' and p'', where p' masks out values above the ith 
 boundary, and p'' masks out values below the ith boundary. Any operations on 
 these partitions apply only to values not masked out. Then assign p' to the 
 ith output RDD and p'' to the (i+1)th output RDD.
 If I understand Spark correctly, this should not require any jobs. Not sure 
 whether it's worth trying this optimisation.
 *Implementation ideas / notes for 2*
 This is very similar, except that we have to handle entire (or parts) of 
 partitions belonging to more than one output RDD, since they are no longer 
 mutually exclusive. But since RDDs are immutable(??), the decorator idea 
 should still work?
 Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6113) Stabilize DecisionTree and ensembles APIs

2015-04-01 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392054#comment-14392054
 ] 

Joseph K. Bradley commented on SPARK-6113:
--

I just noted that this is blocked by the 2 indexer JIRAs.  (Really, it requires 
at least one of them.)  This is because we made a decision to add this API 
directly to the spark.ml package, rather than creating another tree API within 
the spark.mllib package.  In the spark.ml package, we will require some way to 
test categorical features and multiclass classification, which will require one 
of the indexer JIRAs (to add category metadata).

 Stabilize DecisionTree and ensembles APIs
 -

 Key: SPARK-6113
 URL: https://issues.apache.org/jira/browse/SPARK-6113
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Critical

 *Issue*: The APIs for DecisionTree and ensembles (RandomForests and 
 GradientBoostedTrees) have been experimental for a long time.  The API has 
 become very convoluted because trees and ensembles have many, many variants, 
 some of which we have added incrementally without a long-term design.
 *Proposal*: This JIRA is for discussing changes required to finalize the 
 APIs.  After we discuss, I will make a PR to update the APIs and make them 
 non-Experimental.  This will require making many breaking changes; see the 
 design doc for details.
 [Design doc | 
 https://docs.google.com/document/d/1rJ_DZinyDG3PkYkAKSsQlY0QgCeefn4hUv7GsPkzBP4]:
  This outlines current issues and the proposed API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6667) hang while collect in PySpark

2015-04-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6667:
---

Assignee: Davies Liu  (was: Apache Spark)

 hang while collect in PySpark
 -

 Key: SPARK-6667
 URL: https://issues.apache.org/jira/browse/SPARK-6667
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.3.1, 1.4.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Critical

 PySpark tests hang while collecting:



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6667) hang while collect in PySpark

2015-04-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6667:
---

Assignee: Apache Spark  (was: Davies Liu)

 hang while collect in PySpark
 -

 Key: SPARK-6667
 URL: https://issues.apache.org/jira/browse/SPARK-6667
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.3.1, 1.4.0
Reporter: Davies Liu
Assignee: Apache Spark
Priority: Critical

 PySpark tests hang while collecting:



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6659) Spark SQL 1.3 cannot read json file that only with a record.

2015-04-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6659:
---
Component/s: SQL

 Spark SQL 1.3 cannot read json file that only with a record.
 

 Key: SPARK-6659
 URL: https://issues.apache.org/jira/browse/SPARK-6659
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: luochenghui

 Dear friends:
  
 Spark SQL 1.3 cannot read a JSON file that contains only one record.
 Here is my json file's content:
 {name:milo,age,24}
  
 When I run Spark SQL in local mode, it throws an exception:
 org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input 
 columns _corrupt_record;
  
 What I had done:
 1  ./spark-shell
 2 
 scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 sqlContext: org.apache.spark.sql.SQLContext = 
 org.apache.spark.sql.SQLContext@5f3be6c8
  
 scala> val df = sqlContext.jsonFile("/home/milo/person.json")
 15/03/19 22:11:45 INFO MemoryStore: ensureFreeSpace(163705) called with 
 curMem=0, maxMem=280248975
 15/03/19 22:11:45 INFO MemoryStore: Block broadcast_0 stored as values in 
 memory (estimated size 159.9 KB, free 267.1 MB)
 15/03/19 22:11:45 INFO MemoryStore: ensureFreeSpace(22692) called with 
 curMem=163705, maxMem=280248975
 15/03/19 22:11:45 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
 in memory (estimated size 22.2 KB, free 267.1 MB)
 15/03/19 22:11:45 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
 on localhost:35842 (size: 22.2 KB, free: 267.2 MB)
 15/03/19 22:11:45 INFO BlockManagerMaster: Updated info of block 
 broadcast_0_piece0
 15/03/19 22:11:45 INFO SparkContext: Created broadcast 0 from textFile at 
 JSONRelation.scala:98
 15/03/19 22:11:47 INFO FileInputFormat: Total input paths to process : 1
 15/03/19 22:11:47 INFO SparkContext: Starting job: reduce at JsonRDD.scala:51
 15/03/19 22:11:47 INFO DAGScheduler: Got job 0 (reduce at JsonRDD.scala:51) 
 with 1 output partitions (allowLocal=false)
 15/03/19 22:11:47 INFO DAGScheduler: Final stage: Stage 0(reduce at 
 JsonRDD.scala:51)
 15/03/19 22:11:47 INFO DAGScheduler: Parents of final stage: List()
 15/03/19 22:11:47 INFO DAGScheduler: Missing parents: List()
 15/03/19 22:11:47 INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[3] 
 at map at JsonRDD.scala:51), which has no missing parents
 15/03/19 22:11:47 INFO MemoryStore: ensureFreeSpace(3184) called with 
 curMem=186397, maxMem=280248975
 15/03/19 22:11:47 INFO MemoryStore: Block broadcast_1 stored as values in 
 memory (estimated size 3.1 KB, free 267.1 MB)
 15/03/19 22:11:47 INFO MemoryStore: ensureFreeSpace(2251) called with 
 curMem=189581, maxMem=280248975
 15/03/19 22:11:47 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes 
 in memory (estimated size 2.2 KB, free 267.1 MB)
 15/03/19 22:11:47 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory 
 on localhost:35842 (size: 2.2 KB, free: 267.2 MB)
 15/03/19 22:11:47 INFO BlockManagerMaster: Updated info of block 
 broadcast_1_piece0
 15/03/19 22:11:47 INFO SparkContext: Created broadcast 1 from broadcast at 
 DAGScheduler.scala:839
 15/03/19 22:11:48 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 
 (MapPartitionsRDD[3] at map at JsonRDD.scala:51)
 15/03/19 22:11:48 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
 15/03/19 22:11:48 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
 localhost, PROCESS_LOCAL, 1291 bytes)
 15/03/19 22:11:48 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
 15/03/19 22:11:48 INFO HadoopRDD: Input split: 
 file:/home/milo/person.json:0+26
 15/03/19 22:11:48 INFO deprecation: mapred.tip.id is deprecated. Instead, use 
 mapreduce.task.id
 15/03/19 22:11:48 INFO deprecation: mapred.task.id is deprecated. Instead, 
 use mapreduce.task.attempt.id
 15/03/19 22:11:48 INFO deprecation: mapred.task.is.map is deprecated. 
 Instead, use mapreduce.task.ismap
 15/03/19 22:11:48 INFO deprecation: mapred.task.partition is deprecated. 
 Instead, use mapreduce.task.partition
 15/03/19 22:11:48 INFO deprecation: mapred.job.id is deprecated. Instead, use 
 mapreduce.job.id
 15/03/19 22:11:49 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2023 
 bytes result sent to driver
 15/03/19 22:11:49 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) 
 in 1209 ms on localhost (1/1)
 15/03/19 22:11:49 INFO DAGScheduler: Stage 0 (reduce at JsonRDD.scala:51) 
 finished in 1.308 s
 15/03/19 22:11:49 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks 
 have all completed, from pool 
 15/03/19 22:11:49 INFO DAGScheduler: Job 0 finished: reduce at 
 JsonRDD.scala:51, took 2.002429 s
 df: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
  
 3  
 scala> df.select("name").show()
 15/03/19 22:12:44 INFO BlockManager: 

[jira] [Closed] (SPARK-6659) Spark SQL 1.3 cannot read json file that only with a record.

2015-04-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell closed SPARK-6659.
--
Resolution: Invalid

Per the comment, I think the issue is the JSON is not correctly formatted.
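For reference (the fix is to the data, not to Spark), the record written as valid 
single-line JSON, with a colon before each value and quoted strings, should load as expected:
{code}
{"name":"milo","age":24}
{code}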

 Spark SQL 1.3 cannot read json file that only with a record.
 

 Key: SPARK-6659
 URL: https://issues.apache.org/jira/browse/SPARK-6659
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: luochenghui

 Dear friends:
  
 Spark SQL 1.3 cannot read a JSON file that contains only one record.
 Here is my json file's content:
 {name:milo,age,24}
  
 When I run Spark SQL in local mode, it throws an exception:
 org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input 
 columns _corrupt_record;
  
 What I had done:
 1  ./spark-shell
 2 
 scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 sqlContext: org.apache.spark.sql.SQLContext = 
 org.apache.spark.sql.SQLContext@5f3be6c8
  
 scala> val df = sqlContext.jsonFile("/home/milo/person.json")
 15/03/19 22:11:45 INFO MemoryStore: ensureFreeSpace(163705) called with 
 curMem=0, maxMem=280248975
 15/03/19 22:11:45 INFO MemoryStore: Block broadcast_0 stored as values in 
 memory (estimated size 159.9 KB, free 267.1 MB)
 15/03/19 22:11:45 INFO MemoryStore: ensureFreeSpace(22692) called with 
 curMem=163705, maxMem=280248975
 15/03/19 22:11:45 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
 in memory (estimated size 22.2 KB, free 267.1 MB)
 15/03/19 22:11:45 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
 on localhost:35842 (size: 22.2 KB, free: 267.2 MB)
 15/03/19 22:11:45 INFO BlockManagerMaster: Updated info of block 
 broadcast_0_piece0
 15/03/19 22:11:45 INFO SparkContext: Created broadcast 0 from textFile at 
 JSONRelation.scala:98
 15/03/19 22:11:47 INFO FileInputFormat: Total input paths to process : 1
 15/03/19 22:11:47 INFO SparkContext: Starting job: reduce at JsonRDD.scala:51
 15/03/19 22:11:47 INFO DAGScheduler: Got job 0 (reduce at JsonRDD.scala:51) 
 with 1 output partitions (allowLocal=false)
 15/03/19 22:11:47 INFO DAGScheduler: Final stage: Stage 0(reduce at 
 JsonRDD.scala:51)
 15/03/19 22:11:47 INFO DAGScheduler: Parents of final stage: List()
 15/03/19 22:11:47 INFO DAGScheduler: Missing parents: List()
 15/03/19 22:11:47 INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[3] 
 at map at JsonRDD.scala:51), which has no missing parents
 15/03/19 22:11:47 INFO MemoryStore: ensureFreeSpace(3184) called with 
 curMem=186397, maxMem=280248975
 15/03/19 22:11:47 INFO MemoryStore: Block broadcast_1 stored as values in 
 memory (estimated size 3.1 KB, free 267.1 MB)
 15/03/19 22:11:47 INFO MemoryStore: ensureFreeSpace(2251) called with 
 curMem=189581, maxMem=280248975
 15/03/19 22:11:47 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes 
 in memory (estimated size 2.2 KB, free 267.1 MB)
 15/03/19 22:11:47 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory 
 on localhost:35842 (size: 2.2 KB, free: 267.2 MB)
 15/03/19 22:11:47 INFO BlockManagerMaster: Updated info of block 
 broadcast_1_piece0
 15/03/19 22:11:47 INFO SparkContext: Created broadcast 1 from broadcast at 
 DAGScheduler.scala:839
 15/03/19 22:11:48 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 
 (MapPartitionsRDD[3] at map at JsonRDD.scala:51)
 15/03/19 22:11:48 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
 15/03/19 22:11:48 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
 localhost, PROCESS_LOCAL, 1291 bytes)
 15/03/19 22:11:48 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
 15/03/19 22:11:48 INFO HadoopRDD: Input split: 
 file:/home/milo/person.json:0+26
 15/03/19 22:11:48 INFO deprecation: mapred.tip.id is deprecated. Instead, use 
 mapreduce.task.id
 15/03/19 22:11:48 INFO deprecation: mapred.task.id is deprecated. Instead, 
 use mapreduce.task.attempt.id
 15/03/19 22:11:48 INFO deprecation: mapred.task.is.map is deprecated. 
 Instead, use mapreduce.task.ismap
 15/03/19 22:11:48 INFO deprecation: mapred.task.partition is deprecated. 
 Instead, use mapreduce.task.partition
 15/03/19 22:11:48 INFO deprecation: mapred.job.id is deprecated. Instead, use 
 mapreduce.job.id
 15/03/19 22:11:49 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2023 
 bytes result sent to driver
 15/03/19 22:11:49 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) 
 in 1209 ms on localhost (1/1)
 15/03/19 22:11:49 INFO DAGScheduler: Stage 0 (reduce at JsonRDD.scala:51) 
 finished in 1.308 s
 15/03/19 22:11:49 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks 
 have all completed, from pool 
 15/03/19 22:11:49 INFO DAGScheduler: Job 0 finished: reduce at 
 JsonRDD.scala:51, took 2.002429 s
 df: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
  
 3 

[jira] [Resolved] (SPARK-6642) Change the lambda weight to number of explicit ratings in implicit ALS

2015-04-01 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-6642.
--
   Resolution: Fixed
Fix Version/s: 1.4.0
   1.3.1

Issue resolved by pull request 5314
[https://github.com/apache/spark/pull/5314]

 Change the lambda weight to number of explicit ratings in implicit ALS
 --

 Key: SPARK-6642
 URL: https://issues.apache.org/jira/browse/SPARK-6642
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.3.1, 1.4.0


 Until SPARK-6637 is resolved, we should switch back to the 1.2 lambda 
 weighting strategy to be consistent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3468) WebUI Timeline-View feature

2015-04-01 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-3468:
--
Affects Version/s: 1.4.0

 WebUI Timeline-View feature
 ---

 Key: SPARK-3468
 URL: https://issues.apache.org/jira/browse/SPARK-3468
 Project: Spark
  Issue Type: New Feature
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Kousuke Saruta
 Attachments: executors.png, stage-timeline.png, stages.png, 
 taskDetails.png, tasks.png


 I sometimes troubleshoot and analyse the cause of a long-running job.
 In that case, I find the stages which take a long time or fail, then I find 
 the tasks which take a long time or fail, and next I analyse the proportion of 
 each phase within a task.
 In another case, I find executors which take a long time to run a task and 
 analyse the details of that task.
 In such situations, I think it's helpful to visualize a timeline view of stages 
 / tasks / executors and to visualize the proportion of each activity within a 
 task.
 Now I'm developing prototypes like the captures I attached.
 I'll integrate these viewers into the WebUI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3468) WebUI Timeline-View feature

2015-04-01 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-3468:
--
Target Version/s: 1.4.0

 WebUI Timeline-View feature
 ---

 Key: SPARK-3468
 URL: https://issues.apache.org/jira/browse/SPARK-3468
 Project: Spark
  Issue Type: New Feature
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Kousuke Saruta
 Attachments: executors.png, stage-timeline.png, stages.png, 
 taskDetails.png, tasks.png


 I sometimes troubleshoot and analyse the cause of a long-running job.
 In that case, I find the stages which take a long time or fail, then I find 
 the tasks which take a long time or fail, and next I analyse the proportion of 
 each phase within a task.
 In another case, I find executors which take a long time to run a task and 
 analyse the details of that task.
 In such situations, I think it's helpful to visualize a timeline view of stages 
 / tasks / executors and to visualize the proportion of each activity within a 
 task.
 Now I'm developing prototypes like the captures I attached.
 I'll integrate these viewers into the WebUI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6660) MLLibPythonAPI.pythonToJava doesn't recognize object arrays

2015-04-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6660:
---

Assignee: Apache Spark  (was: Xiangrui Meng)

 MLLibPythonAPI.pythonToJava doesn't recognize object arrays
 ---

 Key: SPARK-6660
 URL: https://issues.apache.org/jira/browse/SPARK-6660
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Reporter: Xiangrui Meng
Assignee: Apache Spark
Priority: Critical

 {code}
 points = MLUtils.loadLabeledPoints(sc, ...)
 _to_java_object_rdd(points).count()
 {code}
 throws exception
 {code}
 ---
 Py4JJavaError Traceback (most recent call last)
 ipython-input-22-5b481e99a111 in module()
  1 jrdd.count()
 /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py
  in __call__(self, *args)
 536 answer = self.gateway_client.send_command(command)
 537 return_value = get_return_value(answer, self.gateway_client,
 -- 538 self.target_id, self.name)
 539 
 540 for temp_arg in temp_args:
 /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py
  in get_return_value(answer, gateway_client, target_id, name)
 298 raise Py4JJavaError(
 299 'An error occurred while calling {0}{1}{2}.\n'.
 -- 300 format(target_id, '.', name), value)
 301 else:
 302 raise Py4JError(
 Py4JJavaError: An error occurred while calling o510.count.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 18 
 in stage 114.0 failed 4 times, most recent failure: Lost task 18.3 in stage 
 114.0 (TID 1133, ip-10-0-130-35.us-west-2.compute.internal): 
 java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to 
 java.util.ArrayList
   at 
 org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:1090)
   at 
 org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:1087)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1472)
   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1006)
   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1006)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1497)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1497)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6662) Allow variable substitution in spark.yarn.historyServer.address

2015-04-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391825#comment-14391825
 ] 

Apache Spark commented on SPARK-6662:
-

User 'piaozhexiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/5321

 Allow variable substitution in spark.yarn.historyServer.address
 ---

 Key: SPARK-6662
 URL: https://issues.apache.org/jira/browse/SPARK-6662
 Project: Spark
  Issue Type: Wish
  Components: YARN
Affects Versions: 1.3.0
Reporter: Cheolsoo Park
Priority: Minor
  Labels: yarn

 In Spark on YARN, an explicit hostname and port number need to be set for 
 spark.yarn.historyServer.address in SparkConf to make the HISTORY link work. If 
 the history server address is known and static, this is usually not a problem.
 But in the cloud, that is usually not true. Particularly in EMR, the history 
 server always runs on the same node as the RM. So I could simply set it to 
 {{$\{yarn.resourcemanager.hostname\}:18080}} if variable substitution were 
 allowed.
 In fact, Hadoop configuration already implements variable substitution, so if 
 this property were read via YarnConf, this would be easily achievable.
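 A sketch of that idea, assuming the hadoop-yarn client libraries are on the classpath 
 and using the property key from this issue: Hadoop's {{Configuration.get()}} already 
 expands {{$\{...\}}} variables, so reading the address through a {{YarnConfiguration}} 
 resolves {{$\{yarn.resourcemanager.hostname\}}} with no extra work.
 {code}
 import org.apache.hadoop.yarn.conf.YarnConfiguration

 object HistoryAddressSketch {
   def main(args: Array[String]): Unit = {
     val yarnConf = new YarnConfiguration()
     yarnConf.set("spark.yarn.historyServer.address",
       "${yarn.resourcemanager.hostname}:18080")
     // get() performs variable substitution against other Hadoop properties,
     // so this prints "<rm-hostname>:18080" (0.0.0.0 in a bare default config).
     println(yarnConf.get("spark.yarn.historyServer.address"))
   }
 }
 {code}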



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6662) Allow variable substitution in spark.yarn.historyServer.address

2015-04-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6662:
---

Assignee: Apache Spark

 Allow variable substitution in spark.yarn.historyServer.address
 ---

 Key: SPARK-6662
 URL: https://issues.apache.org/jira/browse/SPARK-6662
 Project: Spark
  Issue Type: Wish
  Components: YARN
Affects Versions: 1.3.0
Reporter: Cheolsoo Park
Assignee: Apache Spark
Priority: Minor
  Labels: yarn

 In Spark on YARN, an explicit hostname and port number need to be set for 
 spark.yarn.historyServer.address in SparkConf to make the HISTORY link work. If 
 the history server address is known and static, this is usually not a problem.
 But in the cloud, that is usually not true. Particularly in EMR, the history 
 server always runs on the same node as the RM. So I could simply set it to 
 {{$\{yarn.resourcemanager.hostname\}:18080}} if variable substitution were 
 allowed.
 In fact, Hadoop configuration already implements variable substitution, so if 
 this property were read via YarnConf, this would be easily achievable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5682) Add encrypted shuffle in spark

2015-04-01 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390235#comment-14390235
 ] 

liyunzhang_intel edited comment on SPARK-5682 at 4/2/15 1:34 AM:
-

Hi all:
  Now there are two methods to implement SPARK-5682(Add encrypted shuffle in 
spark).
  Method1: use [Chimera|https://github.com/intel-hadoop/chimera](Chimera is a 
project which strips code related to CryptoInputStream/CryptoOutputStream from 
Hadoop to facilitate AES-NI based data encryption in other projects.) to 
implement spark encrypted shuffle.  Pull request: 
https://github.com/apache/spark/pull/5307.
  Method2: Add crypto package in spark-core module and add 
CryptoInputStream.scala and CryptoOutputStream.scala and so on in this package. 
Pull request : https://github.com/apache/spark/pull/4491.
The latest design doc Design Document of Encrypted Spark Shuffle_20150402 has 
been submitted.
Which one is better?  Any advices/guidance are welcome!



was (Author: kellyzly):
Hi all:
  Now there are two methods to implement SPARK-5682(Add encrypted shuffle in 
spark).
  Method1: use [Chimera|https://github.com/intel-hadoop/chimera](Chimera is a 
project which strips code related to CryptoInputStream/CryptoOutputStream from 
Hadoop to facilitate AES-NI based data encryption in other projects.) to 
implement spark encrypted shuffle.  Pull request: 
https://github.com/apache/spark/pull/5307.
  Method2: Add crypto package in spark-core module and add 
CryptoInputStream.scala and CryptoOutputStream.scala and so on in this package. 
Pull request : https://github.com/apache/spark/pull/4491.

Which one is better?  Any advices/guidance are welcome!


 Add encrypted shuffle in spark
 --

 Key: SPARK-5682
 URL: https://issues.apache.org/jira/browse/SPARK-5682
 Project: Spark
  Issue Type: New Feature
  Components: Shuffle
Reporter: liyunzhang_intel
 Attachments: Design Document of Encrypted Spark 
 Shuffle_20150209.docx, Design Document of Encrypted Spark 
 Shuffle_20150318.docx, Design Document of Encrypted Spark 
 Shuffle_20150402.docx


 Encrypted shuffle is enabled in Hadoop 2.6, which makes the process of shuffling 
 data safer. This feature is necessary in Spark. AES is a specification for 
 the encryption of electronic data. There are 5 common modes in AES; CTR is 
 one of them. We use two codecs, JceAesCtrCryptoCodec and 
 OpensslAesCtrCryptoCodec, to enable Spark encrypted shuffle; both are also used 
 in Hadoop encrypted shuffle. JceAesCtrCryptoCodec uses encryption algorithms the JDK 
 provides, while OpensslAesCtrCryptoCodec uses encryption algorithms OpenSSL 
 provides. 
 Because UGI credential info is used in the process of encrypted shuffle, we 
 first enable encrypted shuffle on the Spark-on-YARN framework.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-6658) Incorrect DataFrame Documentation Type References

2015-04-01 Thread Chet Mancini (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chet Mancini closed SPARK-6658.
---

 Incorrect DataFrame Documentation Type References
 -

 Key: SPARK-6658
 URL: https://issues.apache.org/jira/browse/SPARK-6658
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SQL
Affects Versions: 1.3.0
Reporter: Chet Mancini
Priority: Trivial
  Labels: documentation
   Original Estimate: 5m
  Remaining Estimate: 5m

 A few methods under DataFrame incorrectly refer to the receiver as an RDD in 
 their documentation.
 * createJDBCTable
 * insertIntoJDBC
 * registerTempTable



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


