[jira] [Commented] (SPARK-6631) I am unable to get the Maven Build file in Example 2.13 to build anything but an empty file

2015-04-01 Thread Frank Domoney (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390114#comment-14390114
 ] 

Frank Domoney commented on SPARK-6631:
--

Incidentally, can you get the Debian build of Spark 1.3 to work?  mvn -Pdeb 
-DskipTests clean package

Mine fails to build.  I suspect that the Debian package might be the correct 
one for Ubuntu 14.04 and Java 8.

Caused by: org.vafer.jdeb.PackagingException: Could not create deb package
at org.vafer.jdeb.Processor.createDeb(Processor.java:171)
at org.vafer.jdeb.maven.DebMaker.makeDeb(DebMaker.java:244)
... 22 more
Caused by: org.vafer.jdeb.PackagingException: Control file descriptor keys are 
invalid [Version]. The following keys are mandatory [Package, Version, Section, 
Priority, Architecture, Maintainer, Description]. Please check your 
pom.xml/build.xml and your control file.
at org.vafer.jdeb.Processor.createDeb(Processor.java:142)
... 23 more
[INFO

> I am unable to get the Maven Build file in Example 2.13 to build anything but 
> an empty file
> ---
>
> Key: SPARK-6631
> URL: https://issues.apache.org/jira/browse/SPARK-6631
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.3.0
> Environment: Ubuntu 14.04
>Reporter: Frank Domoney
>Priority: Blocker
>
> I have downloaded and built spark 1.3.0 under Ubuntu 14.04 but have been 
> unable to get reduceByKey to work on what seems to be a valid RDD using the 
> command line.
> scala> counts.take(10)
> res17: Array[(String, Int)] = Array((Vladimir,1), (Putin,1), (has,1), 
> (said,1), (Russia,1), (will,1), (fight,1), (for,1), (an,1), (independent,1))
> scala> val counts1 = counts.reduceByKey{case (x, y) => x + y}
> counts1.take(10)
> res16: Array[(String, Int)] = Array()
> I am attempting to build the Maven sequence in example 2.15 but get the 
> following results
> Building example 0.0.1
> [INFO] 
> 
> [INFO] 
> [INFO] --- maven-resources-plugin:2.3:resources (default-resources) @ 
> learning-spark-mini-example ---
> [WARNING] Using platform encoding (UTF-8 actually) to copy filtered 
> resources, i.e. build is platform dependent!
> [INFO] skip non existing resourceDirectory 
> /home/panzerfrank/Downloads/spark-1.3.0/wordcount/src/main/resources
> [INFO] 
> [INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ 
> learning-spark-mini-example ---
> [INFO] No sources to compile
> [INFO] 
> [INFO] --- maven-resources-plugin:2.3:testResources (default-testResources) @ 
> learning-spark-mini-example ---
> [WARNING] Using platform encoding (UTF-8 actually) to copy filtered 
> resources, i.e. build is platform dependent!
> [INFO] skip non existing resourceDirectory 
> /home/panzerfrank/Downloads/spark-1.3.0/wordcount/src/test/resources
> [INFO] 
> [INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @ 
> learning-spark-mini-example ---
> [INFO] No sources to compile
> [INFO] 
> [INFO] --- maven-surefire-plugin:2.10:test (default-test) @ 
> learning-spark-mini-example ---
> [INFO] No tests to run.
> [INFO] Surefire report directory: 
> /home/panzerfrank/Downloads/spark-1.3.0/wordcount/target/surefire-reports
>  --- maven-jar-plugin:2.2:jar (default-jar) @ learning-spark-mini-example ---
> [WARNING] JAR will be empty - no content was marked for inclusion!
> [INFO] Building jar: 
> /home/panzerfrank/Downloads/spark-1.3.0/wordcount/target/learning-spark-mini-example-0.0.1.jar
> I am using the POM file from Example 2-13.  Java is Java 8.
> Am I doing something really stupid?
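
For reference, a minimal stand-alone word count of the kind the mini example builds 
might look like the sketch below (illustrative only; the object name and input path 
are hypothetical, and this is not the exact code from the book):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative word-count sketch; "input.txt" is a hypothetical input path.
object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wordCount"))
    val counts = sc.textFile("input.txt")
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)   // the same operation used in the report above
    counts.take(10).foreach(println)
    sc.stop()
  }
}
{code}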






[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Cong Yue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390146#comment-14390146
 ] 

Cong Yue commented on SPARK-6646:
-

Very cool idea. Current smartphones have much better performance than the servers 
of 5-8 years ago.
But in mobile networks, the data transfer speed between nodes cannot be as 
stable as between servers. 
So parallel computing can get the benefit of the CPUs, but the bottleneck will 
be the mobile network.


> Spark 2.0: Rearchitecting Spark for Mobile Platforms
> 
>
> Key: SPARK-6646
> URL: https://issues.apache.org/jira/browse/SPARK-6646
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
> Attachments: Spark on Mobile - Design Doc - v1.pdf
>
>
> Mobile computing is quickly rising to dominance, and by the end of 2017, it 
> is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
> Spark’s project goal can be accomplished only when Spark runs efficiently for 
> the growing population of mobile users.
> Designed and optimized for modern data centers and Big Data applications, 
> Spark is unfortunately not a good fit for mobile computing today. In the past 
> few months, we have been prototyping the feasibility of a mobile-first Spark 
> architecture, and today we would like to share with you our findings. This 
> ticket outlines the technical design of Spark’s mobile support, and shares 
> results from several early prototypes.






[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390153#comment-14390153
 ] 

Sandy Ryza commented on SPARK-6646:
---

This seems like a good opportunity to finally add a DataFrame 
registerTempTablet API.

> Spark 2.0: Rearchitecting Spark for Mobile Platforms
> 
>
> Key: SPARK-6646
> URL: https://issues.apache.org/jira/browse/SPARK-6646
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
> Attachments: Spark on Mobile - Design Doc - v1.pdf
>
>
> Mobile computing is quickly rising to dominance, and by the end of 2017, it 
> is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
> Spark’s project goal can be accomplished only when Spark runs efficiently for 
> the growing population of mobile users.
> Designed and optimized for modern data centers and Big Data applications, 
> Spark is unfortunately not a good fit for mobile computing today. In the past 
> few months, we have been prototyping the feasibility of a mobile-first Spark 
> architecture, and today we would like to share with you our findings. This 
> ticket outlines the technical design of Spark’s mobile support, and shares 
> results from several early prototypes.






[jira] [Updated] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6646:
---
Description: 
Mobile computing is quickly rising to dominance, and by the end of 2017, it is 
estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s 
project goal can be accomplished only when Spark runs efficiently for the 
growing population of mobile users.

Designed and optimized for modern data centers and Big Data applications, Spark 
is unfortunately not a good fit for mobile computing today. In the past few 
months, we have been prototyping the feasibility of a mobile-first Spark 
architecture, and today we would like to share with you our findings. This 
ticket outlines the technical design of Spark’s mobile support, and shares 
results from several early prototypes.

Mobile friendly version of the design doc: 
https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html

  was:
Mobile computing is quickly rising to dominance, and by the end of 2017, it is 
estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s 
project goal can be accomplished only when Spark runs efficiently for the 
growing population of mobile users.

Designed and optimized for modern data centers and Big Data applications, Spark 
is unfortunately not a good fit for mobile computing today. In the past few 
months, we have been prototyping the feasibility of a mobile-first Spark 
architecture, and today we would like to share with you our findings. This 
ticket outlines the technical design of Spark’s mobile support, and shares 
results from several early prototypes.



> Spark 2.0: Rearchitecting Spark for Mobile Platforms
> 
>
> Key: SPARK-6646
> URL: https://issues.apache.org/jira/browse/SPARK-6646
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
> Attachments: Spark on Mobile - Design Doc - v1.pdf
>
>
> Mobile computing is quickly rising to dominance, and by the end of 2017, it 
> is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
> Spark’s project goal can be accomplished only when Spark runs efficiently for 
> the growing population of mobile users.
> Designed and optimized for modern data centers and Big Data applications, 
> Spark is unfortunately not a good fit for mobile computing today. In the past 
> few months, we have been prototyping the feasibility of a mobile-first Spark 
> architecture, and today we would like to share with you our findings. This 
> ticket outlines the technical design of Spark’s mobile support, and shares 
> results from several early prototypes.
> Mobile friendly version of the design doc: 
> https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html






[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390158#comment-14390158
 ] 

Reynold Xin commented on SPARK-6646:


[~sandyryza] That's an excellent idea. I hadn't thought of that yet. But now that I 
think about it, there will be a lot of room for optimization using DataFrames 
on tablets.


> Spark 2.0: Rearchitecting Spark for Mobile Platforms
> 
>
> Key: SPARK-6646
> URL: https://issues.apache.org/jira/browse/SPARK-6646
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
> Attachments: Spark on Mobile - Design Doc - v1.pdf
>
>
> Mobile computing is quickly rising to dominance, and by the end of 2017, it 
> is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
> Spark’s project goal can be accomplished only when Spark runs efficiently for 
> the growing population of mobile users.
> Designed and optimized for modern data centers and Big Data applications, 
> Spark is unfortunately not a good fit for mobile computing today. In the past 
> few months, we have been prototyping the feasibility of a mobile-first Spark 
> architecture, and today we would like to share with you our findings. This 
> ticket outlines the technical design of Spark’s mobile support, and shares 
> results from several early prototypes.
> Mobile friendly version of the design doc: 
> https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html






[jira] [Assigned] (SPARK-5682) Add encrypted shuffle in spark

2015-04-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5682:
---

Assignee: (was: Apache Spark)

> Add encrypted shuffle in spark
> --
>
> Key: SPARK-5682
> URL: https://issues.apache.org/jira/browse/SPARK-5682
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Reporter: liyunzhang_intel
> Attachments: Design Document of Encrypted Spark 
> Shuffle_20150209.docx, Design Document of Encrypted Spark 
> Shuffle_20150318.docx
>
>
> Encrypted shuffle is enabled in Hadoop 2.6, which makes the process of shuffling 
> data safer. This feature is necessary in Spark. AES is a specification for 
> the encryption of electronic data. There are 5 common modes in AES; CTR is 
> one of them. We use two codecs, JceAesCtrCryptoCodec and 
> OpensslAesCtrCryptoCodec, to enable Spark encrypted shuffle; they are also used 
> in Hadoop encrypted shuffle. JceAesCtrCryptoCodec uses the encryption algorithms 
> the JDK provides, while OpensslAesCtrCryptoCodec uses the encryption algorithms 
> OpenSSL provides. 
> Because UGI credential info is used in the process of encrypted shuffle, we 
> first enable encrypted shuffle on the Spark-on-YARN framework.






[jira] [Commented] (SPARK-5682) Add encrypted shuffle in spark

2015-04-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390157#comment-14390157
 ] 

Apache Spark commented on SPARK-5682:
-

User 'kellyzly' has created a pull request for this issue:
https://github.com/apache/spark/pull/5307

> Add encrypted shuffle in spark
> --
>
> Key: SPARK-5682
> URL: https://issues.apache.org/jira/browse/SPARK-5682
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Reporter: liyunzhang_intel
> Attachments: Design Document of Encrypted Spark 
> Shuffle_20150209.docx, Design Document of Encrypted Spark 
> Shuffle_20150318.docx
>
>
> Encrypted shuffle is enabled in Hadoop 2.6, which makes the process of shuffling 
> data safer. This feature is necessary in Spark. AES is a specification for 
> the encryption of electronic data. There are 5 common modes in AES; CTR is 
> one of them. We use two codecs, JceAesCtrCryptoCodec and 
> OpensslAesCtrCryptoCodec, to enable Spark encrypted shuffle; they are also used 
> in Hadoop encrypted shuffle. JceAesCtrCryptoCodec uses the encryption algorithms 
> the JDK provides, while OpensslAesCtrCryptoCodec uses the encryption algorithms 
> OpenSSL provides. 
> Because UGI credential info is used in the process of encrypted shuffle, we 
> first enable encrypted shuffle on the Spark-on-YARN framework.






[jira] [Assigned] (SPARK-5682) Add encrypted shuffle in spark

2015-04-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5682:
---

Assignee: Apache Spark

> Add encrypted shuffle in spark
> --
>
> Key: SPARK-5682
> URL: https://issues.apache.org/jira/browse/SPARK-5682
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Reporter: liyunzhang_intel
>Assignee: Apache Spark
> Attachments: Design Document of Encrypted Spark 
> Shuffle_20150209.docx, Design Document of Encrypted Spark 
> Shuffle_20150318.docx
>
>
> Encrypted shuffle is enabled in Hadoop 2.6, which makes the process of shuffling 
> data safer. This feature is necessary in Spark. AES is a specification for 
> the encryption of electronic data. There are 5 common modes in AES; CTR is 
> one of them. We use two codecs, JceAesCtrCryptoCodec and 
> OpensslAesCtrCryptoCodec, to enable Spark encrypted shuffle; they are also used 
> in Hadoop encrypted shuffle. JceAesCtrCryptoCodec uses the encryption algorithms 
> the JDK provides, while OpensslAesCtrCryptoCodec uses the encryption algorithms 
> OpenSSL provides. 
> Because UGI credential info is used in the process of encrypted shuffle, we 
> first enable encrypted shuffle on the Spark-on-YARN framework.






[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390160#comment-14390160
 ] 

Yu Ishikawa commented on SPARK-6646:


That sounds very interesting! We should support deploying a trained 
machine learning model to smartphones. :)

> Spark 2.0: Rearchitecting Spark for Mobile Platforms
> 
>
> Key: SPARK-6646
> URL: https://issues.apache.org/jira/browse/SPARK-6646
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
> Attachments: Spark on Mobile - Design Doc - v1.pdf
>
>
> Mobile computing is quickly rising to dominance, and by the end of 2017, it 
> is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
> Spark’s project goal can be accomplished only when Spark runs efficiently for 
> the growing population of mobile users.
> Designed and optimized for modern data centers and Big Data applications, 
> Spark is unfortunately not a good fit for mobile computing today. In the past 
> few months, we have been prototyping the feasibility of a mobile-first Spark 
> architecture, and today we would like to share with you our findings. This 
> ticket outlines the technical design of Spark’s mobile support, and shares 
> results from several early prototypes.
> Mobile friendly version of the design doc: 
> https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html






[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390161#comment-14390161
 ] 

Tathagata Das commented on SPARK-6646:
--

I have been working on running NetworkWordCount on our iPhone prototype, and I 
was pleasantly surprised with the performance I was getting. The network 
bandwidth is definitely lower, and there is a higher cost to shuffling data, but 
it's still quite good. The task launch latencies are higher, though, so streaming 
applications will require slightly larger batch sizes. But overall you will be 
surprised. I will post numbers when I can compile them into graphs. 


> Spark 2.0: Rearchitecting Spark for Mobile Platforms
> 
>
> Key: SPARK-6646
> URL: https://issues.apache.org/jira/browse/SPARK-6646
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
> Attachments: Spark on Mobile - Design Doc - v1.pdf
>
>
> Mobile computing is quickly rising to dominance, and by the end of 2017, it 
> is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
> Spark’s project goal can be accomplished only when Spark runs efficiently for 
> the growing population of mobile users.
> Designed and optimized for modern data centers and Big Data applications, 
> Spark is unfortunately not a good fit for mobile computing today. In the past 
> few months, we have been prototyping the feasibility of a mobile-first Spark 
> architecture, and today we would like to share with you our findings. This 
> ticket outlines the technical design of Spark’s mobile support, and shares 
> results from several early prototypes.
> Mobile friendly version of the design doc: 
> https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html






[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Rahul Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390169#comment-14390169
 ] 

Rahul Kumar commented on SPARK-6646:


Love this idea. What about a "private cloud in your pocket"? :-) Store data on the 
smartphone, do the processing on it, and run a small mobile-based web server that 
powers cool visualization reports. A lot of the time our smartphones are idle, so we 
can share resources :-) 4 GB RAM, a quad-core processor, and an LTE network: not bad 
for a single node in a cluster.

> Spark 2.0: Rearchitecting Spark for Mobile Platforms
> 
>
> Key: SPARK-6646
> URL: https://issues.apache.org/jira/browse/SPARK-6646
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
> Attachments: Spark on Mobile - Design Doc - v1.pdf
>
>
> Mobile computing is quickly rising to dominance, and by the end of 2017, it 
> is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
> Spark’s project goal can be accomplished only when Spark runs efficiently for 
> the growing population of mobile users.
> Designed and optimized for modern data centers and Big Data applications, 
> Spark is unfortunately not a good fit for mobile computing today. In the past 
> few months, we have been prototyping the feasibility of a mobile-first Spark 
> architecture, and today we would like to share with you our findings. This 
> ticket outlines the technical design of Spark’s mobile support, and shares 
> results from several early prototypes.
> Mobile friendly version of the design doc: 
> https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html






[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390176#comment-14390176
 ] 

Jeremy Freeman commented on SPARK-6646:
---

Very promising, [~tdas]! We should evaluate the performance of streaming machine 
learning algorithms. In general I think running Spark in JavaScript via 
Scala.js and Node.js is extremely appealing; it will make integration with 
visualization very straightforward. 

> Spark 2.0: Rearchitecting Spark for Mobile Platforms
> 
>
> Key: SPARK-6646
> URL: https://issues.apache.org/jira/browse/SPARK-6646
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
> Attachments: Spark on Mobile - Design Doc - v1.pdf
>
>
> Mobile computing is quickly rising to dominance, and by the end of 2017, it 
> is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
> Spark’s project goal can be accomplished only when Spark runs efficiently for 
> the growing population of mobile users.
> Designed and optimized for modern data centers and Big Data applications, 
> Spark is unfortunately not a good fit for mobile computing today. In the past 
> few months, we have been prototyping the feasibility of a mobile-first Spark 
> architecture, and today we would like to share with you our findings. This 
> ticket outlines the technical design of Spark’s mobile support, and shares 
> results from several early prototypes.
> Mobile friendly version of the design doc: 
> https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html






[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390183#comment-14390183
 ] 

Sean Owen commented on SPARK-6646:
--

Concept: smartphone app that lets you find the nearest Spark cluster to join. 
Swipe left/right on photos from the worker nodes to indicate which ones you 
want to join. The only problem is that this *must* be called SparkR to be taken 
seriously, so I think it will have to be rolled into the R library.

> Spark 2.0: Rearchitecting Spark for Mobile Platforms
> 
>
> Key: SPARK-6646
> URL: https://issues.apache.org/jira/browse/SPARK-6646
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
> Attachments: Spark on Mobile - Design Doc - v1.pdf
>
>
> Mobile computing is quickly rising to dominance, and by the end of 2017, it 
> is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
> Spark’s project goal can be accomplished only when Spark runs efficiently for 
> the growing population of mobile users.
> Designed and optimized for modern data centers and Big Data applications, 
> Spark is unfortunately not a good fit for mobile computing today. In the past 
> few months, we have been prototyping the feasibility of a mobile-first Spark 
> architecture, and today we would like to share with you our findings. This 
> ticket outlines the technical design of Spark’s mobile support, and shares 
> results from several early prototypes.
> Mobile friendly version of the design doc: 
> https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html






[jira] [Commented] (SPARK-6631) I am unable to get the Maven Build file in Example 2.13 to build anything but an empty file

2015-04-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390185#comment-14390185
 ] 

Sean Owen commented on SPARK-6631:
--

The Debian packaging was removed; I don't know how much it worked before.
u...@spark.apache.org is appropriate for this kind of question. Here you're 
tacking on to an unrelated JIRA.

> I am unable to get the Maven Build file in Example 2.13 to build anything but 
> an empty file
> ---
>
> Key: SPARK-6631
> URL: https://issues.apache.org/jira/browse/SPARK-6631
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.3.0
> Environment: Ubuntu 14.04
>Reporter: Frank Domoney
>Priority: Blocker
>
> I have downloaded and built spark 1.3.0 under Ubuntu 14.04 but have been 
> unable to get reduceByKey to work on what seems to be a valid RDD using the 
> command line.
> scala> counts.take(10)
> res17: Array[(String, Int)] = Array((Vladimir,1), (Putin,1), (has,1), 
> (said,1), (Russia,1), (will,1), (fight,1), (for,1), (an,1), (independent,1))
> scala> val counts1 = counts.reduceByKey{case (x, y) => x + y}
> counts1.take(10)
> res16: Array[(String, Int)] = Array()
> I am attempting to build the Maven sequence in example 2.15 but get the 
> following results
> Building example 0.0.1
> [INFO] 
> 
> [INFO] 
> [INFO] --- maven-resources-plugin:2.3:resources (default-resources) @ 
> learning-spark-mini-example ---
> [WARNING] Using platform encoding (UTF-8 actually) to copy filtered 
> resources, i.e. build is platform dependent!
> [INFO] skip non existing resourceDirectory 
> /home/panzerfrank/Downloads/spark-1.3.0/wordcount/src/main/resources
> [INFO] 
> [INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ 
> learning-spark-mini-example ---
> [INFO] No sources to compile
> [INFO] 
> [INFO] --- maven-resources-plugin:2.3:testResources (default-testResources) @ 
> learning-spark-mini-example ---
> [WARNING] Using platform encoding (UTF-8 actually) to copy filtered 
> resources, i.e. build is platform dependent!
> [INFO] skip non existing resourceDirectory 
> /home/panzerfrank/Downloads/spark-1.3.0/wordcount/src/test/resources
> [INFO] 
> [INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @ 
> learning-spark-mini-example ---
> [INFO] No sources to compile
> [INFO] 
> [INFO] --- maven-surefire-plugin:2.10:test (default-test) @ 
> learning-spark-mini-example ---
> [INFO] No tests to run.
> [INFO] Surefire report directory: 
> /home/panzerfrank/Downloads/spark-1.3.0/wordcount/target/surefire-reports
>  --- maven-jar-plugin:2.2:jar (default-jar) @ learning-spark-mini-example ---
> [WARNING] JAR will be empty - no content was marked for inclusion!
> [INFO] Building jar: 
> /home/panzerfrank/Downloads/spark-1.3.0/wordcount/target/learning-spark-mini-example-0.0.1.jar
> I am using the POM file from Example 2-13.  Java is Java 8.
> Am I doing something really stupid?






[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390196#comment-14390196
 ] 

Sandy Ryza commented on SPARK-6646:
---

[~srowen] I like the way you think.  I know a lot of good nodes out there 
looking for love or at least a casual shutdown hookup. 

> Spark 2.0: Rearchitecting Spark for Mobile Platforms
> 
>
> Key: SPARK-6646
> URL: https://issues.apache.org/jira/browse/SPARK-6646
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
> Attachments: Spark on Mobile - Design Doc - v1.pdf
>
>
> Mobile computing is quickly rising to dominance, and by the end of 2017, it 
> is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
> Spark’s project goal can be accomplished only when Spark runs efficiently for 
> the growing population of mobile users.
> Designed and optimized for modern data centers and Big Data applications, 
> Spark is unfortunately not a good fit for mobile computing today. In the past 
> few months, we have been prototyping the feasibility of a mobile-first Spark 
> architecture, and today we would like to share with you our findings. This 
> ticket outlines the technical design of Spark’s mobile support, and shares 
> results from several early prototypes.
> Mobile friendly version of the design doc: 
> https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html






[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390205#comment-14390205
 ] 

Aaron Davidson commented on SPARK-6646:
---

Please help, I tried putting Spark on my iPhone but it ignited and now I have no phone.

> Spark 2.0: Rearchitecting Spark for Mobile Platforms
> 
>
> Key: SPARK-6646
> URL: https://issues.apache.org/jira/browse/SPARK-6646
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
> Attachments: Spark on Mobile - Design Doc - v1.pdf
>
>
> Mobile computing is quickly rising to dominance, and by the end of 2017, it 
> is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
> Spark’s project goal can be accomplished only when Spark runs efficiently for 
> the growing population of mobile users.
> Designed and optimized for modern data centers and Big Data applications, 
> Spark is unfortunately not a good fit for mobile computing today. In the past 
> few months, we have been prototyping the feasibility of a mobile-first Spark 
> architecture, and today we would like to share with you our findings. This 
> ticket outlines the technical design of Spark’s mobile support, and shares 
> results from several early prototypes.
> Mobile friendly version of the design doc: 
> https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html






[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Petar Zecevic (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390206#comment-14390206
 ] 

Petar Zecevic commented on SPARK-6646:
--

Good one :)

> Spark 2.0: Rearchitecting Spark for Mobile Platforms
> 
>
> Key: SPARK-6646
> URL: https://issues.apache.org/jira/browse/SPARK-6646
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
> Attachments: Spark on Mobile - Design Doc - v1.pdf
>
>
> Mobile computing is quickly rising to dominance, and by the end of 2017, it 
> is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
> Spark’s project goal can be accomplished only when Spark runs efficiently for 
> the growing population of mobile users.
> Designed and optimized for modern data centers and Big Data applications, 
> Spark is unfortunately not a good fit for mobile computing today. In the past 
> few months, we have been prototyping the feasibility of a mobile-first Spark 
> architecture, and today we would like to share with you our findings. This 
> ticket outlines the technical design of Spark’s mobile support, and shares 
> results from several early prototypes.
> Mobile friendly version of the design doc: 
> https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html






[jira] [Commented] (SPARK-5989) Model import/export for LDAModel

2015-04-01 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390209#comment-14390209
 ] 

Manoj Kumar commented on SPARK-5989:


Can this be assigned to me? Thanks!

> Model import/export for LDAModel
> 
>
> Key: SPARK-5989
> URL: https://issues.apache.org/jira/browse/SPARK-5989
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> Add save/load for LDAModel and its local and distributed variants.






[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Kamal Banga (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390208#comment-14390208
 ] 

Kamal Banga commented on SPARK-6646:


We want Spark for Apple Watch. That will be the real breakthrough!

> Spark 2.0: Rearchitecting Spark for Mobile Platforms
> 
>
> Key: SPARK-6646
> URL: https://issues.apache.org/jira/browse/SPARK-6646
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
> Attachments: Spark on Mobile - Design Doc - v1.pdf
>
>
> Mobile computing is quickly rising to dominance, and by the end of 2017, it 
> is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
> Spark’s project goal can be accomplished only when Spark runs efficiently for 
> the growing population of mobile users.
> Designed and optimized for modern data centers and Big Data applications, 
> Spark is unfortunately not a good fit for mobile computing today. In the past 
> few months, we have been prototyping the feasibility of a mobile-first Spark 
> architecture, and today we would like to share with you our findings. This 
> ticket outlines the technical design of Spark’s mobile support, and shares 
> results from several early prototypes.
> Mobile friendly version of the design doc: 
> https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html






[jira] [Commented] (SPARK-6613) Starting stream from checkpoint causes Streaming tab to throw error

2015-04-01 Thread zhichao-li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390221#comment-14390221
 ] 

zhichao-li commented on SPARK-6613:
---

[~msoutier], have you found any solution for this, or did you just report the bug?

> Starting stream from checkpoint causes Streaming tab to throw error
> ---
>
> Key: SPARK-6613
> URL: https://issues.apache.org/jira/browse/SPARK-6613
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.2.1
>Reporter: Marius Soutier
>
> When continuing my streaming job from a checkpoint, the job runs, but the 
> Streaming tab in the standard UI initially no longer works (the browser just 
> shows HTTP ERROR: 500). Sometimes it gets back to normal after a while, and 
> sometimes it stays in this state permanently.
> Stacktrace:
> WARN org.eclipse.jetty.servlet.ServletHandler: /streaming/
> java.util.NoSuchElementException: key not found: 0
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:58)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at scala.collection.AbstractMap.apply(Map.scala:58)
>   at 
> org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1$$anonfun$apply$5.apply(StreamingJobProgressListener.scala:151)
>   at 
> org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1$$anonfun$apply$5.apply(StreamingJobProgressListener.scala:150)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.Range.foreach(Range.scala:141)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1.apply(StreamingJobProgressListener.scala:150)
>   at 
> org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1.apply(StreamingJobProgressListener.scala:149)
>   at scala.Option.map(Option.scala:145)
>   at 
> org.apache.spark.streaming.ui.StreamingJobProgressListener.lastReceivedBatchRecords(StreamingJobProgressListener.scala:149)
>   at 
> org.apache.spark.streaming.ui.StreamingPage.generateReceiverStats(StreamingPage.scala:82)
>   at 
> org.apache.spark.streaming.ui.StreamingPage.render(StreamingPage.scala:43)
>   at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68)
>   at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68)
>   at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:68)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
>   at 
> org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
>   at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
>   at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
>   at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>   at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>   at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>   at org.eclipse.jetty.server.Server.handle(Server.java:370)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
>   at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644)
>   at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
>   at 
> org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
>   at 
> org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
>   at 
> org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadP

[jira] [Commented] (SPARK-5682) Add encrypted shuffle in spark

2015-04-01 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390235#comment-14390235
 ] 

liyunzhang_intel commented on SPARK-5682:
-

Hi all:
  There are now two methods to implement SPARK-5682 (Add encrypted shuffle in 
spark).
  Method 1: use [Chimera|https://github.com/intel-hadoop/chimera] (Chimera is a 
project that strips the code related to CryptoInputStream/CryptoOutputStream out of 
Hadoop to facilitate AES-NI based data encryption in other projects) to 
implement Spark encrypted shuffle.  Pull request: 
https://github.com/apache/spark/pull/5307.
  Method 2: add a crypto package in the spark-core module containing 
CryptoInputStream.scala, CryptoOutputStream.scala, and so on. Pull request: 
https://github.com/apache/spark/pull/4491.

Which one is better?  Any advice/guidance is welcome!
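
For reference, a rough sketch of the stream-wrapping idea behind Method 2, using only 
the JDK's javax.crypto classes (the object and method names here are illustrative; 
this is not the actual API of either pull request):

{code}
import java.io.{InputStream, OutputStream}
import javax.crypto.{Cipher, CipherInputStream, CipherOutputStream}
import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}

// Illustrative only: wrap a shuffle output/input stream with AES/CTR,
// roughly what a JCE-based crypto stream does.
object ShuffleCryptoSketch {
  def encrypting(out: OutputStream, key: Array[Byte], iv: Array[Byte]): OutputStream = {
    val cipher = Cipher.getInstance("AES/CTR/NoPadding")
    cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv))
    new CipherOutputStream(out, cipher)
  }

  def decrypting(in: InputStream, key: Array[Byte], iv: Array[Byte]): InputStream = {
    val cipher = Cipher.getInstance("AES/CTR/NoPadding")
    cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv))
    new CipherInputStream(in, cipher)
  }
}
{code}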


> Add encrypted shuffle in spark
> --
>
> Key: SPARK-5682
> URL: https://issues.apache.org/jira/browse/SPARK-5682
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Reporter: liyunzhang_intel
> Attachments: Design Document of Encrypted Spark 
> Shuffle_20150209.docx, Design Document of Encrypted Spark 
> Shuffle_20150318.docx
>
>
> Encrypted shuffle is enabled in Hadoop 2.6, which makes the process of shuffling 
> data safer. This feature is necessary in Spark. AES is a specification for 
> the encryption of electronic data. There are 5 common modes in AES; CTR is 
> one of them. We use two codecs, JceAesCtrCryptoCodec and 
> OpensslAesCtrCryptoCodec, to enable Spark encrypted shuffle; they are also used 
> in Hadoop encrypted shuffle. JceAesCtrCryptoCodec uses the encryption algorithms 
> the JDK provides, while OpensslAesCtrCryptoCodec uses the encryption algorithms 
> OpenSSL provides. 
> Because UGI credential info is used in the process of encrypted shuffle, we 
> first enable encrypted shuffle on the Spark-on-YARN framework.






[jira] [Updated] (SPARK-5682) Add encrypted shuffle in spark

2015-04-01 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated SPARK-5682:

Attachment: Design Document of Encrypted Spark Shuffle_20150401.docx

> Add encrypted shuffle in spark
> --
>
> Key: SPARK-5682
> URL: https://issues.apache.org/jira/browse/SPARK-5682
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Reporter: liyunzhang_intel
> Attachments: Design Document of Encrypted Spark 
> Shuffle_20150209.docx, Design Document of Encrypted Spark 
> Shuffle_20150318.docx, Design Document of Encrypted Spark 
> Shuffle_20150401.docx
>
>
> Encrypted shuffle is enabled in Hadoop 2.6, which makes the process of shuffling 
> data safer. This feature is necessary in Spark. AES is a specification for 
> the encryption of electronic data. There are 5 common modes in AES; CTR is 
> one of them. We use two codecs, JceAesCtrCryptoCodec and 
> OpensslAesCtrCryptoCodec, to enable Spark encrypted shuffle; they are also used 
> in Hadoop encrypted shuffle. JceAesCtrCryptoCodec uses the encryption algorithms 
> the JDK provides, while OpensslAesCtrCryptoCodec uses the encryption algorithms 
> OpenSSL provides. 
> Because UGI credential info is used in the process of encrypted shuffle, we 
> first enable encrypted shuffle on the Spark-on-YARN framework.






[jira] [Resolved] (SPARK-4655) Split Stage into ShuffleMapStage and ResultStage subclasses

2015-04-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4655.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 4708
[https://github.com/apache/spark/pull/4708]

> Split Stage into ShuffleMapStage and ResultStage subclasses
> ---
>
> Key: SPARK-4655
> URL: https://issues.apache.org/jira/browse/SPARK-4655
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Ilya Ganelin
> Fix For: 1.4.0
>
>
> The scheduler's {{Stage}} class has many fields which are only applicable to 
> result stages or shuffle map stages.  As a result, I think that it makes 
> sense to make {{Stage}} into an abstract base class with two subclasses, 
> {{ResultStage}} and {{ShuffleMapStage}}.  This would improve the 
> understandability of the DAGScheduler code. 
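
A rough sketch of the proposed hierarchy (class names from this ticket; the 
constructors and fields are illustrative only, not the real scheduler code):

{code}
// Illustrative sketch only; the real classes carry scheduler-specific state.
abstract class Stage(val id: Int, val name: String, val numTasks: Int)

// Intermediate stage that writes map output for a shuffle.
class ShuffleMapStage(id: Int, name: String, numTasks: Int, val shuffleId: Int)
  extends Stage(id, name, numTasks)

// Final stage that computes the result of an action.
class ResultStage(id: Int, name: String, numTasks: Int)
  extends Stage(id, name, numTasks)
{code}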






[jira] [Resolved] (SPARK-6600) Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway

2015-04-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6600.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5257
[https://github.com/apache/spark/pull/5257]

> Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway  
> --
>
> Key: SPARK-6600
> URL: https://issues.apache.org/jira/browse/SPARK-6600
> Project: Spark
>  Issue Type: New Feature
>  Components: EC2
>Reporter: Florian Verhein
> Fix For: 1.4.0
>
>
> Use case: a user has set up the Hadoop HDFS NFS gateway service on their 
> spark_ec2.py-launched cluster and wants to mount it on their local 
> machine. 
> This requires the following ports to be opened in the incoming rule set for MASTER 
> for both UDP and TCP: 111, 2049, 4242.
> (I have tried this and it works.)
> Note that this issue *does not* cover the implementation of an HDFS NFS 
> gateway module in the spark-ec2 project. See the linked issue. 
> Reference:
> https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html






[jira] [Updated] (SPARK-6600) Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway

2015-04-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6600:
-
Priority: Minor  (was: Major)
Assignee: Florian Verhein

> Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway  
> --
>
> Key: SPARK-6600
> URL: https://issues.apache.org/jira/browse/SPARK-6600
> Project: Spark
>  Issue Type: New Feature
>  Components: EC2
>Reporter: Florian Verhein
>Assignee: Florian Verhein
>Priority: Minor
> Fix For: 1.4.0
>
>
> Use case: a user has set up the Hadoop HDFS NFS gateway service on their 
> spark_ec2.py-launched cluster and wants to mount it on their local 
> machine. 
> This requires the following ports to be opened in the incoming rule set for MASTER 
> for both UDP and TCP: 111, 2049, 4242.
> (I have tried this and it works.)
> Note that this issue *does not* cover the implementation of an HDFS NFS 
> gateway module in the spark-ec2 project. See the linked issue. 
> Reference:
> https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html






[jira] [Resolved] (SPARK-6597) Replace `input:checkbox` with `input[type="checkbox"] in additional-metrics.js

2015-04-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6597.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5254
[https://github.com/apache/spark/pull/5254]

> Replace `input:checkbox` with `input[type="checkbox"] in additional-metrics.js
> --
>
> Key: SPARK-6597
> URL: https://issues.apache.org/jira/browse/SPARK-6597
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.2.2, 1.3.1, 1.4.0
>Reporter: Kousuke Saruta
>Priority: Minor
> Fix For: 1.4.0
>
>
> In additional-metrics.js, there is some selector notation like 
> `input:checkbox`, but jQuery's official documentation says `input[type="checkbox"]` 
> is better.
> https://api.jquery.com/checkbox-selector/






[jira] [Updated] (SPARK-6597) Replace `input:checkbox` with `input[type="checkbox"] in additional-metrics.js

2015-04-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6597:
-
Priority: Trivial  (was: Minor)
Assignee: Kousuke Saruta

> Replace `input:checkbox` with `input[type="checkbox"] in additional-metrics.js
> --
>
> Key: SPARK-6597
> URL: https://issues.apache.org/jira/browse/SPARK-6597
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.2.2, 1.3.1, 1.4.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Trivial
> Fix For: 1.4.0
>
>
> In additional-metrics.js, there is some selector notation like 
> `input:checkbox`, but jQuery's official documentation says `input[type="checkbox"]` 
> is better.
> https://api.jquery.com/checkbox-selector/






[jira] [Updated] (SPARK-6626) TwitterUtils.createStream documentation error

2015-04-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6626:
-
Priority: Trivial  (was: Minor)
Assignee: Jayson Sunshine

> TwitterUtils.createStream documentation error
> -
>
> Key: SPARK-6626
> URL: https://issues.apache.org/jira/browse/SPARK-6626
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.3.0
>Reporter: Jayson Sunshine
>Assignee: Jayson Sunshine
>Priority: Trivial
>  Labels: documentation, easyfix
> Fix For: 1.3.1, 1.4.0
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> At 
> http://spark.apache.org/docs/1.3.0/streaming-programming-guide.html#input-dstreams-and-receivers,
>  under 'Advanced Sources', the documentation provides the following call for 
> Scala:
> TwitterUtils.createStream(ssc)
> However, with only one parameter this method appears to require a jssc object, 
> not an ssc object: 
> http://spark.apache.org/docs/1.3.0/api/java/index.html?org/apache/spark/streaming/twitter/TwitterUtils.html
> To make the above call work, one must instead provide an Option argument, for 
> example:
> TwitterUtils.createStream(ssc, None)






[jira] [Resolved] (SPARK-6626) TwitterUtils.createStream documentation error

2015-04-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6626.
--
   Resolution: Fixed
Fix Version/s: 1.4.0
   1.3.1

Issue resolved by pull request 5295
[https://github.com/apache/spark/pull/5295]

> TwitterUtils.createStream documentation error
> -
>
> Key: SPARK-6626
> URL: https://issues.apache.org/jira/browse/SPARK-6626
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.3.0
>Reporter: Jayson Sunshine
>Priority: Minor
>  Labels: documentation, easyfix
> Fix For: 1.3.1, 1.4.0
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> At 
> http://spark.apache.org/docs/1.3.0/streaming-programming-guide.html#input-dstreams-and-receivers,
>  under 'Advanced Sources', the documentation provides the following call for 
> Scala:
> TwitterUtils.createStream(ssc)
> However, with only one parameter this method appears to require a jssc object, 
> not an ssc object: 
> http://spark.apache.org/docs/1.3.0/api/java/index.html?org/apache/spark/streaming/twitter/TwitterUtils.html
> To make the above call work, one must instead provide an Option argument, for 
> example:
> TwitterUtils.createStream(ssc, None)






[jira] [Commented] (SPARK-6630) SparkConf.setIfMissing should only evaluate the assigned value if indeed missing

2015-04-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390350#comment-14390350
 ] 

Sean Owen commented on SPARK-6630:
--

This should be as simple as {{  def setIfMissing(key: String, value: => 
String): SparkConf = ... }} if I'm not mistaken about how this works in Scala? 
Would you like to make a PR and verify it lazily evaluates? I can't think of a 
scenario where it would be important to always evaluate the argument.
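A minimal sketch of the idea, written as a standalone helper rather than the actual SparkConf method (the helper name is made up for illustration): the by-name parameter means the right-hand side only runs when the key is absent.
{code}
import org.apache.spark.SparkConf

// Hypothetical helper, not the real SparkConf.setIfMissing signature:
// `value` is by-name, so the expression is evaluated only if the key is missing.
def setIfMissingLazily(conf: SparkConf, key: String, value: => String): SparkConf = {
  if (!conf.contains(key)) conf.set(key, value) else conf
}

val conf = new SparkConf()
// The potentially expensive hostname lookup runs only when the key is not already set.
setIfMissingLazily(conf, "spark.driver.host", java.net.InetAddress.getLocalHost.getHostName)
{code}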

> SparkConf.setIfMissing should only evaluate the assigned value if indeed 
> missing
> 
>
> Key: SPARK-6630
> URL: https://issues.apache.org/jira/browse/SPARK-6630
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Svend Vanderveken
>Priority: Minor
>
> The method setIfMissing() in SparkConf is currently systematically evaluating 
> the right hand side of the assignment even if not used. This leads to 
> unnecessary computation, like in the case of 
> {code}
>   conf.setIfMissing("spark.driver.host", Utils.localHostName())
> {code}






[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390367#comment-14390367
 ] 

Nan Zhu commented on SPARK-6646:


super cool, Spark enables Bigger than Bigger Data in mobile phones

> Spark 2.0: Rearchitecting Spark for Mobile Platforms
> 
>
> Key: SPARK-6646
> URL: https://issues.apache.org/jira/browse/SPARK-6646
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
> Attachments: Spark on Mobile - Design Doc - v1.pdf
>
>
> Mobile computing is quickly rising to dominance, and by the end of 2017, it 
> is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
> Spark’s project goal can be accomplished only when Spark runs efficiently for 
> the growing population of mobile users.
> Designed and optimized for modern data centers and Big Data applications, 
> Spark is unfortunately not a good fit for mobile computing today. In the past 
> few months, we have been prototyping the feasibility of a mobile-first Spark 
> architecture, and today we would like to share with you our findings. This 
> ticket outlines the technical design of Spark’s mobile support, and shares 
> results from several early prototypes.
> Mobile friendly version of the design doc: 
> https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html






[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390373#comment-14390373
 ] 

Steve Loughran commented on SPARK-6646:
---

Obviously the barrier will be data source access; talking to remote data is 
going to run up bills.

# CouchDB has an offline mode, so its RDD/DataFrame support would allow 
spark-mobile to work in embedded mode.
# Hadoop 2.8 adds hardware CRC on ARM parts for HDFS (HADOOP-11660). A 
{{MiniHDFSCluster}} could be instantiated locally to benefit from this.
# Alternatively, mDNS could be used to discover and dynamically build up an 
HDFS cluster from nearby devices, MANET-style. The limited connectivity 
guarantees of moving devices mean that a block size of <1536 bytes would be 
appropriate; probably 1KB blocks are safest.
# Those nodes on the network with limited CPU power but access to external 
power supplies, such as toasters and coffee machines, could have a role as the 
persistent co-ordinators of work and HDFS Namenodes, as well as being used as 
the preferred routers of wifi packets.
# It may be necessary to extend the hadoop {{s3://}} filesystem with the notion 
of monthly data quotas. Possibly even roaming and non-roaming quotas. The S3 
client would need to query the runtime to determine whether it was at home vs 
roaming & use the relevant quota. Apps could then set something like
{code}
fs.s3.quota.home=15GB
fs.s3.quota.roaming=2GB
{code}
Dealing with use abroad would be more complex, as if a cost value were to be 
included, exchange rates would have to be dynamically assessed.
# It may be interesting to consider the notion of having devices publish some of 
their data (photos, healthkit history, movement history) to other devices 
nearby. If one phone could enumerate those nearby **and submit work to them**, 
the bandwidth problems could be addressed.



> Spark 2.0: Rearchitecting Spark for Mobile Platforms
> 
>
> Key: SPARK-6646
> URL: https://issues.apache.org/jira/browse/SPARK-6646
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
> Attachments: Spark on Mobile - Design Doc - v1.pdf
>
>
> Mobile computing is quickly rising to dominance, and by the end of 2017, it 
> is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
> Spark’s project goal can be accomplished only when Spark runs efficiently for 
> the growing population of mobile users.
> Designed and optimized for modern data centers and Big Data applications, 
> Spark is unfortunately not a good fit for mobile computing today. In the past 
> few months, we have been prototyping the feasibility of a mobile-first Spark 
> architecture, and today we would like to share with you our findings. This 
> ticket outlines the technical design of Spark’s mobile support, and shares 
> results from several early prototypes.
> Mobile friendly version of the design doc: 
> https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html






[jira] [Resolved] (SPARK-4927) Spark does not clean up properly during long jobs.

2015-04-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4927.
--
Resolution: Cannot Reproduce

I've tried to reproduce this a few ways and wasn't able to. It may have been 
fixed somewhere along the way. It can be reopened if there is a reproduction 
against 1.3+.

> Spark does not clean up properly during long jobs. 
> ---
>
> Key: SPARK-4927
> URL: https://issues.apache.org/jira/browse/SPARK-4927
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Ilya Ganelin
>
> On a long running Spark job, Spark will eventually run out of memory on the 
> driver node due to metadata overhead from the shuffle operation. Spark will 
> continue to operate, however with drastically decreased performance (since 
> swapping now occurs with every operation).
> The spark.cleaner.ttl parameter allows a user to configure when cleanup 
> happens, but the issue with doing this is that it isn’t done safely: if 
> this clears a cached RDD or active task in the middle of processing a stage, 
> it ultimately causes a KeyNotFoundException when the next stage attempts to 
> reference the cleared RDD or task.
> There should be a sustainable mechanism for cleaning up stale metadata that 
> allows the program to continue running. 
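For context, a minimal sketch of how the TTL-based cleaner mentioned above is enabled (the application name and the 3600-second value are illustrative, not from this ticket):
{code}
import org.apache.spark.SparkConf

// spark.cleaner.ttl takes a duration in seconds; as described above, metadata older
// than this may be dropped even if it is still needed by a running stage.
val conf = new SparkConf()
  .setAppName("LongRunningJob")
  .set("spark.cleaner.ttl", "3600")
{code}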



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4799) Spark should not rely on local host being resolvable on every node

2015-04-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4799.
--
  Resolution: Duplicate
Target Version/s:   (was: 1.2.1)

Looks like this was subsumed by SPARK-5078 and SPARK_LOCAL_HOSTNAME

> Spark should not rely on local host being resolvable on every node
> --
>
> Key: SPARK-4799
> URL: https://issues.apache.org/jira/browse/SPARK-4799
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: Tested a Spark+Mesos cluster on top of Docker to 
> reproduce the issue.
>Reporter: Santiago M. Mola
>
> Spark fails when a node hostname is not resolvable by other nodes.
> See an example trace:
> {code}
> 14/12/09 17:02:41 ERROR SendingConnection: Error connecting to 
> 27e434cf36ac:35093
> java.nio.channels.UnresolvedAddressException
>   at sun.nio.ch.Net.checkAddress(Net.java:127)
>   at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:644)
>   at 
> org.apache.spark.network.SendingConnection.connect(Connection.scala:299)
>   at 
> org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:278)
>   at 
> org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)
> {code}
> The relevant code is here:
> https://github.com/apache/spark/blob/bcb5cdad614d4fce43725dfec3ce88172d2f8c11/core/src/main/scala/org/apache/spark/network/nio/ConnectionManager.scala#L170
> {code}
> val id = new ConnectionManagerId(Utils.localHostName, 
> serverChannel.socket.getLocalPort)
> {code}
> This piece of code should use the host IP with Utils.localIpAddress or a 
> method that acknowledges user settings (e.g. SPARK_LOCAL_IP). Since I cannot 
> think of a use case for using the hostname here, I'm creating a PR with the 
> former solution, but if you think the latter is better, I'm willing to create 
> a new PR with a more elaborate fix.
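As an aside, a minimal sketch of the usual workaround (the IP address is illustrative): point the driver at an address other nodes can actually reach instead of relying on the container hostname resolving; SPARK_LOCAL_IP, exported in spark-env.sh, plays a similar role for the bind address.
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("ResolvableAddressExample")
  // Illustrative address only; not taken from this ticket.
  .set("spark.driver.host", "10.0.0.12")
{code}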



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4544) Spark JVM Metrics doesn't have context.

2015-04-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4544.
--
Resolution: Duplicate

I'd like to bundle this under SPARK-5847, which proposes more general control 
over the namespacing, which could include "instance" as a higher-level grouping 
than the current app ID.

> Spark JVM Metrics doesn't have context.
> ---
>
> Key: SPARK-4544
> URL: https://issues.apache.org/jira/browse/SPARK-4544
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Sreepathi Prasanna
>
> If we enable JVM metrics for executor, master, worker, and driver instances, we 
> don't have context about where they are coming from.
> This can be an issue if we are collecting all the metrics from different 
> instances and storing them into a common datastore. 
> This mainly concerns running Spark on YARN, but I believe Spark standalone has 
> this problem as well.
> It would be good if we attached some context to the JVM metrics. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3967) Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions

2015-04-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3967:
-
Component/s: (was: Spark Core)
 YARN

> Spark applications fail in yarn-cluster mode when the directories configured 
> in yarn.nodemanager.local-dirs are located on different disks/partitions
> -
>
> Key: SPARK-3967
> URL: https://issues.apache.org/jira/browse/SPARK-3967
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Christophe Préaud
> Attachments: spark-1.1.0-utils-fetch.patch, 
> spark-1.1.0-yarn_cluster_tmpdir.patch
>
>
> Spark applications fail from time to time in yarn-cluster mode (but not in 
> yarn-client mode) when yarn.nodemanager.local-dirs (Hadoop YARN config) is 
> set to a comma-separated list of directories which are located on different 
> disks/partitions.
> Steps to reproduce:
> 1. Set yarn.nodemanager.local-dirs (in yarn-site.xml) to a list of 
> directories located on different partitions (the more you set, the more 
> likely it will be to reproduce the bug):
> (...)
> <property>
>   <name>yarn.nodemanager.local-dirs</name>
>   <value>file:/d1/yarn/local/nm-local-dir,file:/d2/yarn/local/nm-local-dir,file:/d3/yarn/local/nm-local-dir,file:/d4/yarn/local/nm-local-dir,file:/d5/yarn/local/nm-local-dir,file:/d6/yarn/local/nm-local-dir,file:/d7/yarn/local/nm-local-dir</value>
> </property>
> (...)
> 2. Launch an application in yarn-cluster mode several times; it will fail 
> (apparently randomly) from time to time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1001) Memory leak when reading sequence file and then sorting

2015-04-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1001.
--
Resolution: Cannot Reproduce

> Memory leak when reading sequence file and then sorting
> ---
>
> Key: SPARK-1001
> URL: https://issues.apache.org/jira/browse/SPARK-1001
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 0.8.0
>Reporter: Matthew Cheah
>  Labels: Hadoop, Memory
>
> Spark appears to build up a backlog of unreachable byte arrays when an RDD is 
> constructed from a sequence file, and then that RDD is sorted.
> I have a class that wraps a Java ArrayList and can be serialized and 
> written to a Hadoop SequenceFile (i.e. it implements the Writable interface). 
> Let's call it WritableDataRow. It can take a Java List as its argument to 
> wrap around, and also has a copy constructor.
> Setup: 10 slaves, launched via EC2, 65.9GB RAM each, dataset is 100GB of 
> text, 120GB when in sequence file format (not using compression to compact 
> the bytes). CDH4.2.0-backed hadoop cluster.
> First, building the RDD from a CSV and then sorting on index 1 works fine:
> {code}
> scala> import scala.collection.JavaConversions._ // Other imports here as well
> import scala.collection.JavaConversions._
> scala> val rddAsTextFile = sc.textFile("s3n://some-bucket/events-*.csv")
> rddAsTextFile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at 
> :14
> scala> val rddAsWritableDataRows = rddAsTextFile.map(x => new 
> WritableDataRow(x.split("\\|").toList))
> rddAsWritableDataRows: 
> org.apache.spark.rdd.RDD[com.palantir.finance.datatable.server.spark.WritableDataRow]
>  = MappedRDD[2] at map at :19
> scala> val rddAsKeyedWritableDataRows = rddAsWritableDataRows.map(x => 
> (x.getContents().get(1).toString(), x));
> rddAsKeyedWritableDataRows: org.apache.spark.rdd.RDD[(String, 
> com.palantir.finance.datatable.server.spark.WritableDataRow)] = MappedRDD[4] 
> at map at :22
> scala> val orderedFunct = new 
> org.apache.spark.rdd.OrderedRDDFunctions[String, WritableDataRow, (String, 
> WritableDataRow)](rddAsKeyedWritableDataRows)
> orderedFunct: 
> org.apache.spark.rdd.OrderedRDDFunctions[String,com.palantir.finance.datatable.server.spark.WritableDataRow,(String,
>  com.palantir.finance.datatable.server.spark.WritableDataRow)] = 
> org.apache.spark.rdd.OrderedRDDFunctions@587acb54
> scala> orderedFunct.sortByKey(true).count(); // Actually triggers the 
> computation, as stated in a different e-mail thread
> res0: org.apache.spark.rdd.RDD[(String, 
> com.palantir.finance.datatable.server.spark.WritableDataRow)] = 
> MapPartitionsRDD[8] at sortByKey at :27
> {code}
> The above works without too many surprises. I then save it as a Sequence File 
> (using JavaPairRDD as a way to more easily call saveAsHadoopFile(), and this 
> is how it's done in our Java-based application):
> {code}
> scala> val pairRDD = new JavaPairRDD(rddAsWritableDataRows.map(x => 
> (NullWritable.get(), x)));
> pairRDD: 
> org.apache.spark.api.java.JavaPairRDD[org.apache.hadoop.io.NullWritable,com.palantir.finance.datatable.server.spark.WritableDataRow]
>  = org.apache.spark.api.java.JavaPairRDD@8d2e9d9
> scala> pairRDD.saveAsHadoopFile("hdfs://:9010/blah", 
> classOf[NullWritable], classOf[WritableDataRow], 
> classOf[org.apache.hadoop.mapred.SequenceFileOutputFormat[NullWritable, 
> WritableDataRow]]);
> …
> 2013-12-11 20:09:14,444 [main] INFO  org.apache.spark.SparkContext - Job 
> finished: saveAsHadoopFile at :26, took 1052.116712748 s
> {code}
> And now I want to get the RDD from the sequence file and sort THAT, and this 
> is when I monitor Ganglia and "ps aux" and notice the memory usage climbing 
> ridiculously:
> {code}
> scala> val rddAsSequenceFile = 
> sc.sequenceFile("hdfs://:9010/blah", classOf[NullWritable], 
> classOf[WritableDataRow]).map(x => new WritableDataRow(x._2)); // Invokes 
> copy constructor to get around re-use of writable objects
> rddAsSequenceFile: 
> org.apache.spark.rdd.RDD[com.palantir.finance.datatable.server.spark.WritableDataRow]
>  = MappedRDD[19] at map at :19
> scala> val orderedFunct = new 
> org.apache.spark.rdd.OrderedRDDFunctions[String, WritableDataRow, (String, 
> WritableDataRow)](rddAsSequenceFile.map(x => 
> (x.getContents().get(1).toString(), x)))
> orderedFunct: 
> org.apache.spark.rdd.OrderedRDDFunctions[String,com.palantir.finance.datatable.server.spark.WritableDataRow,(String,
>  com.palantir.finance.datatable.server.spark.WritableDataRow)] = 
> org.apache.spark.rdd.OrderedRDDFunctions@6262a9a6
> scala>orderedFunct.sortByKey().count();
> {code}
> (On the necessity to copy writables from hadoop RDDs, see: 
> https://mail-archives.apache.org/mod_mbox/spark-user/201308.mbox/%3ccaf_kkpzrq4otyqvwcoc6plaz9x9_sfo33u4ysatki

[jira] [Updated] (SPARK-3231) select on a table in parquet format containing smallint as a field type does not work

2015-04-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3231:
-
Component/s: (was: Spark Core)
 SQL

> select on a table in parquet format containing smallint as a field type does 
> not work
> -
>
> Key: SPARK-3231
> URL: https://issues.apache.org/jira/browse/SPARK-3231
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
> Environment: The table is created through Hive-0.13.
> SparkSql 1.1 is used.
>Reporter: chirag aggarwal
>
> A table is created through Hive. This table has a field of type smallint. The 
> format of the table is Parquet.
> A select on this table works perfectly in the Hive shell.
> But when the select is run on this table from spark-sql, the query 
> fails.
> Steps to reproduce the issue:
> --
> hive> create table abct (a smallint, b int) row format delimited fields 
> terminated by '|' stored as textfile;
> A text file is stored in hdfs for this table.
> hive> create table abc (a smallint, b int) stored as parquet; 
> hive> insert overwrite table abc select * from abct;
> hive> select * from abc;
> 2 1
> 2 2
> 2 3
> spark-sql> select * from abc;
> 10:08:46 ERROR CliDriver: org.apache.spark.SparkException: Job aborted due to 
> stage failure: Task 0.0 in stage 33.0 (TID 2340) had a not serializable 
> result: org.apache.hadoop.io.IntWritable
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1158)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1147)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1146)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1146)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:685)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:685)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:685)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> But if the type of this column is now changed to int, then spark-sql gives 
> the correct results.
> hive> alter table abc change a a int;
> spark-sql> select * from abc;
> 2 1
> 2 2
> 2 3



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6647) Make trait StringComparison as BinaryPredicate and throw error when Predicate can't translate to data source Filter

2015-04-01 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-6647:
--

 Summary: Make trait StringComparison as BinaryPredicate and throw 
error when Predicate can't translate to data source Filter
 Key: SPARK-6647
 URL: https://issues.apache.org/jira/browse/SPARK-6647
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Liang-Chi Hsieh


Currently, trait {{StringComparison}} is a {{BinaryExpression}}. In fact, it should be 
a {{BinaryPredicate}}.

By making {{StringComparison}} a {{BinaryPredicate}}, we can throw an error when 
an {{expressions.Predicate}} can't be translated to a data source {{Filter}} in 
the {{selectFilters}} function.

Without this modification, because we wrap a {{Filter}} around the 
scanned results in {{pruneFilterProjectRaw}}, we can't detect that something 
went wrong when translating predicates to filters in {{selectFilters}}.

The unit test of SPARK-6625 demonstrates this problem: in that PR, even though 
{{expressions.Contains}} is not properly translated to 
{{sources.StringContains}}, the filtering is still performed by the {{Filter}} 
and so the test passes.

Of course, with this modification, every {{expressions.Predicate}} class 
needs a corresponding data source {{Filter}}.
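A schematic sketch of the proposed behaviour, using simplified stand-in types rather than the actual Catalyst and {{sources}} APIs: translation fails loudly instead of silently relying on a residual {{Filter}} to re-check the predicate.
{code}
sealed trait Predicate
case class Contains(attribute: String, value: String) extends Predicate

sealed trait SourceFilter
case class StringContains(attribute: String, value: String) extends SourceFilter

// Translate a predicate to a data source filter, if a translation exists.
def translate(p: Predicate): Option[SourceFilter] = p match {
  case Contains(a, v) => Some(StringContains(a, v))
  case _              => None
}

// The proposal: throw when a predicate has no data source Filter counterpart,
// so missing translations surface as errors instead of silently passing tests.
def selectFilters(predicates: Seq[Predicate]): Seq[SourceFilter] =
  predicates.map { p =>
    translate(p).getOrElse(
      throw new IllegalArgumentException(s"Cannot translate $p to a data source Filter"))
  }
{code}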







[jira] [Assigned] (SPARK-6647) Make trait StringComparison as BinaryPredicate and throw error when Predicate can't translate to data source Filter

2015-04-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6647:
---

Assignee: (was: Apache Spark)

> Make trait StringComparison as BinaryPredicate and throw error when Predicate 
> can't translate to data source Filter
> ---
>
> Key: SPARK-6647
> URL: https://issues.apache.org/jira/browse/SPARK-6647
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> Now trait {{StringComparison}} is a {{BinaryExpression}}. In fact, it should 
> be a {{BinaryPredicate}}.
> By making {{StringComparison}} as {{BinaryPredicate}}, we can throw error 
> when a {{expressions.Predicate}} can't translate to a data source {{Filter}} 
> in function {{selectFilters}}.
> Without this modification, because we will wrap a {{Filter}} outside the 
> scanned results in {{pruneFilterProjectRaw}}, we can't detect about something 
> is wrong in translating predicates to filters in {{selectFilters}}.
> The unit test of SPARK-6625 demonstrates such problem. In that pr, even 
> {{expressions.Contains}} is not properly translated to 
> {{sources.StringContains}}, the filtering is still performed by the 
> {{Filter}} and so the test passes.
> Of course, by doing this modification, all {{expressions.Predicate}} classes 
> need to have its data source {{Filter}} correspondingly.






[jira] [Commented] (SPARK-6647) Make trait StringComparison as BinaryPredicate and throw error when Predicate can't translate to data source Filter

2015-04-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390403#comment-14390403
 ] 

Apache Spark commented on SPARK-6647:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/5309

> Make trait StringComparison as BinaryPredicate and throw error when Predicate 
> can't translate to data source Filter
> ---
>
> Key: SPARK-6647
> URL: https://issues.apache.org/jira/browse/SPARK-6647
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> Now trait {{StringComparison}} is a {{BinaryExpression}}. In fact, it should 
> be a {{BinaryPredicate}}.
> By making {{StringComparison}} as {{BinaryPredicate}}, we can throw error 
> when a {{expressions.Predicate}} can't translate to a data source {{Filter}} 
> in function {{selectFilters}}.
> Without this modification, because we will wrap a {{Filter}} outside the 
> scanned results in {{pruneFilterProjectRaw}}, we can't detect about something 
> is wrong in translating predicates to filters in {{selectFilters}}.
> The unit test of SPARK-6625 demonstrates such problem. In that pr, even 
> {{expressions.Contains}} is not properly translated to 
> {{sources.StringContains}}, the filtering is still performed by the 
> {{Filter}} and so the test passes.
> Of course, by doing this modification, all {{expressions.Predicate}} classes 
> need to have its data source {{Filter}} correspondingly.






[jira] [Assigned] (SPARK-6647) Make trait StringComparison as BinaryPredicate and throw error when Predicate can't translate to data source Filter

2015-04-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6647:
---

Assignee: Apache Spark

> Make trait StringComparison as BinaryPredicate and throw error when Predicate 
> can't translate to data source Filter
> ---
>
> Key: SPARK-6647
> URL: https://issues.apache.org/jira/browse/SPARK-6647
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> Now trait {{StringComparison}} is a {{BinaryExpression}}. In fact, it should 
> be a {{BinaryPredicate}}.
> By making {{StringComparison}} as {{BinaryPredicate}}, we can throw error 
> when a {{expressions.Predicate}} can't translate to a data source {{Filter}} 
> in function {{selectFilters}}.
> Without this modification, because we will wrap a {{Filter}} outside the 
> scanned results in {{pruneFilterProjectRaw}}, we can't detect about something 
> is wrong in translating predicates to filters in {{selectFilters}}.
> The unit test of SPARK-6625 demonstrates such problem. In that pr, even 
> {{expressions.Contains}} is not properly translated to 
> {{sources.StringContains}}, the filtering is still performed by the 
> {{Filter}} and so the test passes.
> Of course, by doing this modification, all {{expressions.Predicate}} classes 
> need to have its data source {{Filter}} correspondingly.






[jira] [Commented] (SPARK-6630) SparkConf.setIfMissing should only evaluate the assigned value if indeed missing

2015-04-01 Thread Svend Vanderveken (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390421#comment-14390421
 ] 

Svend Vanderveken commented on SPARK-6630:
--

Thanks for your comment. I agree with the resolution; I only just found the time 
to open the JIRA yesterday. I'll submit the corresponding PR shortly, promise 
:) 

> SparkConf.setIfMissing should only evaluate the assigned value if indeed 
> missing
> 
>
> Key: SPARK-6630
> URL: https://issues.apache.org/jira/browse/SPARK-6630
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Svend Vanderveken
>Priority: Minor
>
> The method setIfMissing() in SparkConf is currently systematically evaluating 
> the right hand side of the assignment even if not used. This leads to 
> unnecessary computation, like in the case of 
> {code}
>   conf.setIfMissing("spark.driver.host", Utils.localHostName())
> {code}






[jira] [Commented] (SPARK-6613) Starting stream from checkpoint causes Streaming tab to throw error

2015-04-01 Thread Marius Soutier (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390423#comment-14390423
 ] 

Marius Soutier commented on SPARK-6613:
---

Bug report.

> Starting stream from checkpoint causes Streaming tab to throw error
> ---
>
> Key: SPARK-6613
> URL: https://issues.apache.org/jira/browse/SPARK-6613
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.2.1
>Reporter: Marius Soutier
>
> When continuing my streaming job from a checkpoint, the job runs, but the 
> Streaming tab in the standard UI initially no longer works (browser just 
> shows HTTP ERROR: 500). Sometimes  it gets back to normal after a while, and 
> sometimes it stays in this state permanently.
> Stacktrace:
> WARN org.eclipse.jetty.servlet.ServletHandler: /streaming/
> java.util.NoSuchElementException: key not found: 0
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:58)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at scala.collection.AbstractMap.apply(Map.scala:58)
>   at 
> org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1$$anonfun$apply$5.apply(StreamingJobProgressListener.scala:151)
>   at 
> org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1$$anonfun$apply$5.apply(StreamingJobProgressListener.scala:150)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.Range.foreach(Range.scala:141)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1.apply(StreamingJobProgressListener.scala:150)
>   at 
> org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1.apply(StreamingJobProgressListener.scala:149)
>   at scala.Option.map(Option.scala:145)
>   at 
> org.apache.spark.streaming.ui.StreamingJobProgressListener.lastReceivedBatchRecords(StreamingJobProgressListener.scala:149)
>   at 
> org.apache.spark.streaming.ui.StreamingPage.generateReceiverStats(StreamingPage.scala:82)
>   at 
> org.apache.spark.streaming.ui.StreamingPage.render(StreamingPage.scala:43)
>   at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68)
>   at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68)
>   at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:68)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
>   at 
> org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
>   at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
>   at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
>   at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>   at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>   at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>   at org.eclipse.jetty.server.Server.handle(Server.java:370)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
>   at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644)
>   at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
>   at 
> org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
>   at 
> org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
>   at 
> org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>   at java.lang.T

[jira] [Resolved] (SPARK-3884) If deploy mode is cluster, --driver-memory shouldn't apply to client JVM

2015-04-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3884.
--
  Resolution: Fixed
   Fix Version/s: 1.4.0
Assignee: Marcelo Vanzin  (was: Sandy Ryza)
Target Version/s:   (was: 1.1.2, 1.2.1)

This is fixed in 1.4 due to the new launcher implementation. I verified that in 
yarn-cluster mode the SparkSubmit JVM is not run with -Xms / -Xmx set, but 
instead passes through spark.driver.memory in --conf. In yarn-client mode, it 
does set -Xms / -Xmx.

> If deploy mode is cluster, --driver-memory shouldn't apply to client JVM
> 
>
> Key: SPARK-3884
> URL: https://issues.apache.org/jira/browse/SPARK-3884
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Sandy Ryza
>Assignee: Marcelo Vanzin
> Fix For: 1.4.0
>
>







[jira] [Updated] (SPARK-3884) If deploy mode is cluster, --driver-memory shouldn't apply to client JVM

2015-04-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3884:
-
  Component/s: (was: Spark Core)
   Spark Submit
Affects Version/s: 1.2.0
   1.3.0

> If deploy mode is cluster, --driver-memory shouldn't apply to client JVM
> 
>
> Key: SPARK-3884
> URL: https://issues.apache.org/jira/browse/SPARK-3884
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.1.0, 1.2.0, 1.3.0
>Reporter: Sandy Ryza
>Assignee: Marcelo Vanzin
> Fix For: 1.4.0
>
>







[jira] [Updated] (SPARK-6644) After adding new columns to a partitioned table and inserting data to an old partition, data of newly added columns are all NULL

2015-04-01 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6644:
--
Description: 
In Hive, the schema of a partition may differ from the table schema. For 
example, we may add new columns to the table after importing existing 
partitions. When using {{spark-sql}} to query the data in a partition whose 
schema is different from the table schema, problems may arise. Part of them 
have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. 
However, after adding new column(s) to the table, when inserting data into old 
partitions, values of newly added columns are all {{NULL}}.

The following snippet can be used to reproduce this issue:
{code}
case class TestData(key: Int, value: String)

val testData = TestHive.sparkContext.parallelize((1 to 2).map(i => TestData(i, 
i.toString))).toDF()
testData.registerTempTable("testData")

sql("DROP TABLE IF EXISTS table_with_partition ")
sql(s"CREATE TABLE IF NOT EXISTS table_with_partition (key INT, value STRING) 
PARTITIONED BY (ds STRING) LOCATION '${tmpDir.toURI.toString}'")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT 
key, value FROM testData")

// Add new columns to the table
sql("ALTER TABLE table_with_partition ADD COLUMNS (key1 STRING)")
sql("ALTER TABLE table_with_partition ADD COLUMNS (destlng DOUBLE)") 
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT 
key, value, 'test', 1.11 FROM testData")

sql("SELECT * FROM table_with_partition WHERE ds = 
'1'").collect().foreach(println)  
{code}
Actual result:
{noformat}
[1,1,null,null,1]
[2,2,null,null,1]
{noformat}
Expected result:
{noformat}
[1,1,test,1.11,1]
[2,2,test,1.11,1]
{noformat}

  was:
In Hive, the schema of a partition may differ from the table schema. For 
example, we may add new columns to the table after importing existing 
partitions. When using {{spark-sql}} to query the data in a partition whose 
schema is different from the table schema, problems may arise. Part of them 
have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. 
However, after adding new column(s) to the table, when inserting data into old 
partitions, values of newly added columns are all {{NULL}}.

The following snippet can be used to reproduce this issue:
{code}
case class TestData(key: Int, value: String)

val testData = TestHive.sparkContext.parallelize((1 to 2).map(i => TestData(i, 
i.toString))).toDF()
testData.registerTempTable("testData")

sql("DROP TABLE IF EXISTS table_with_partition ")
sql(s"CREATE TABLE IF NOT EXISTS table_with_partition(key int, value string) 
PARTITIONED by (ds string) location '${tmpDir.toURI.toString}'")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT 
key, value FROM testData")

// Add new columns to the table
sql("ALTER TABLE table_with_partition ADD COLUMNS(key1 string)")
sql("ALTER TABLE table_with_partition ADD COLUMNS(destlng double)") 
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT 
key, value, 'test', 1.11 FROM testData")

sql("SELECT * FROM table_with_partition WHERE ds = 
'1'").collect().foreach(println)  
{code}
Actual result:
{noformat}
[1,1,null,null,1]
[2,2,null,null,1]
{noformat}
Expected result:
{noformat}
[1,1,test,1.11,1]
[2,2,test,1.11,1]
{noformat}


> After adding new columns to a partitioned table and inserting data to an old 
> partition, data of newly added columns are all NULL
> 
>
> Key: SPARK-6644
> URL: https://issues.apache.org/jira/browse/SPARK-6644
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: dongxu
>
> In Hive, the schema of a partition may differ from the table schema. For 
> example, we may add new columns to the table after importing existing 
> partitions. When using {{spark-sql}} to query the data in a partition whose 
> schema is different from the table schema, problems may arise. Part of them 
> have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. 
> However, after adding new column(s) to the table, when inserting data into 
> old partitions, values of newly added columns are all {{NULL}}.
> The following snippet can be used to reproduce this issue:
> {code}
> case class TestData(key: Int, value: String)
> val testData = TestHive.sparkContext.parallelize((1 to 2).map(i => 
> TestData(i, i.toString))).toDF()
> testData.registerTempTable("testData")
> sql("DROP TABLE IF EXISTS table_with_partition ")
> sql(s"CREATE TABLE IF NOT EXISTS table_with_partition (key INT, value STRING) 
> PARTITIONED BY (ds STRING) LOCATION '${tmpDir.toURI.toString}'")
> sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') 

[jira] [Resolved] (SPARK-6608) Make DataFrame.rdd a lazy val

2015-04-01 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-6608.
---
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5265
[https://github.com/apache/spark/pull/5265]

> Make DataFrame.rdd a lazy val
> -
>
> Key: SPARK-6608
> URL: https://issues.apache.org/jira/browse/SPARK-6608
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
> Fix For: 1.4.0
>
>
> Before 1.3.0, {{SchemaRDD.id}} worked as a unique identifier of each 
> {{SchemaRDD}}. In 1.3.0, unlike {{SchemaRDD}}, {{DataFrame}} is no longer an 
> RDD, and {{DataFrame.rdd}} is actually a function which always returns a new 
> RDD instance. Making {{DataFrame.rdd}} a {{lazy val}} should bring the unique 
> identifier back.
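A schematic sketch of the change (stand-in class and field names, not the actual DataFrame source): turning the {{def}} into a {{lazy val}} means the underlying RDD is built once and reused, so its id is stable.
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

class DataFrameLike(compute: () => RDD[Row]) {
  // Before (1.3.0 behaviour): def rdd = compute()  // new RDD instance, and new id, per call
  lazy val rdd: RDD[Row] = compute()                // computed once, id stays the same
}
{code}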






[jira] [Assigned] (SPARK-6643) Python API for StandardScalerModel

2015-04-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6643:
---

Assignee: Apache Spark

> Python API for StandardScalerModel
> --
>
> Key: SPARK-6643
> URL: https://issues.apache.org/jira/browse/SPARK-6643
> Project: Spark
>  Issue Type: Task
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Kai Sasaki
>Assignee: Apache Spark
>Priority: Minor
>  Labels: mllib, python
> Fix For: 1.4.0
>
>
> This is the sub-task of SPARK-6254.
> Wrap missing method for {{StandardScalerModel}}.
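For reference, a brief Scala sketch of the existing MLlib API whose model-side methods the Python wrapper would expose (assumes a SparkContext {{sc}}; the data values are illustrative):
{code}
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors

val data = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0)))

// fit() returns a StandardScalerModel; transform() is the method the Python API wraps.
val model = new StandardScaler(withMean = true, withStd = true).fit(data)
val scaled = model.transform(Vectors.dense(2.0, 3.0, 4.0))
{code}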



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6643) Python API for StandardScalerModel

2015-04-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6643:
---

Assignee: (was: Apache Spark)

> Python API for StandardScalerModel
> --
>
> Key: SPARK-6643
> URL: https://issues.apache.org/jira/browse/SPARK-6643
> Project: Spark
>  Issue Type: Task
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Kai Sasaki
>Priority: Minor
>  Labels: mllib, python
> Fix For: 1.4.0
>
>
> This is the sub-task of SPARK-6254.
> Wrap missing method for {{StandardScalerModel}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6643) Python API for StandardScalerModel

2015-04-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390610#comment-14390610
 ] 

Apache Spark commented on SPARK-6643:
-

User 'Lewuathe' has created a pull request for this issue:
https://github.com/apache/spark/pull/5310

> Python API for StandardScalerModel
> --
>
> Key: SPARK-6643
> URL: https://issues.apache.org/jira/browse/SPARK-6643
> Project: Spark
>  Issue Type: Task
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Kai Sasaki
>Priority: Minor
>  Labels: mllib, python
> Fix For: 1.4.0
>
>
> This is the sub-task of SPARK-6254.
> Wrap missing method for {{StandardScalerModel}}.






[jira] [Created] (SPARK-6648) Reading Parquet files with different sub-files doesn't work

2015-04-01 Thread Marius Soutier (JIRA)
Marius Soutier created SPARK-6648:
-

 Summary: Reading Parquet files with different sub-files doesn't 
work
 Key: SPARK-6648
 URL: https://issues.apache.org/jira/browse/SPARK-6648
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
Reporter: Marius Soutier


When reading from multiple parquet files (via 
sqlContext.parquetFile(/path/1.parquet,/path/2.parquet), if the parquet files 
were created using a different coalesce, the reading fails with:

ERROR c.w.r.websocket.ParquetReader  efault-dispatcher-63 : Failed reading 
parquet file
java.lang.IllegalArgumentException: Could not find Parquet metadata at path 

at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
 ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]

at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
 ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
at scala.Option.getOrElse(Option.scala:120) 
~[org.scala-lang.scala-library-2.10.4.jar:na]
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.scala:458)
 ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:477)
 ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
at 
org.apache.spark.sql.parquet.ParquetRelation.(ParquetRelation.scala:65) 
~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:165) 
~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]

I haven't tested with Spark 1.3 yet but will report back after upgrading to 
1.3.1 (as soon as it's released).







[jira] [Updated] (SPARK-6648) Reading Parquet files with different sub-files doesn't work

2015-04-01 Thread Marius Soutier (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marius Soutier updated SPARK-6648:
--
Description: 
When reading from multiple parquet files (via 
sqlContext.parquetFile(/path/1.parquet,/path/2.parquet), if the parquet files 
were created using a different coalesce (e.g. one only contains 
part-r-1.parquet, the other also part-r-2.parquet, part-r-3.parquet), the 
reading fails with:

ERROR c.w.r.websocket.ParquetReader  efault-dispatcher-63 : Failed reading 
parquet file
java.lang.IllegalArgumentException: Could not find Parquet metadata at path 

at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
 ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]

at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
 ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
at scala.Option.getOrElse(Option.scala:120) 
~[org.scala-lang.scala-library-2.10.4.jar:na]
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.scala:458)
 ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:477)
 ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
at 
org.apache.spark.sql.parquet.ParquetRelation.(ParquetRelation.scala:65) 
~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:165) 
~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]

I haven't tested with Spark 1.3 yet but will report back after upgrading to 
1.3.1 (as soon as it's released).


  was:
When reading from multiple parquet files (via 
sqlContext.parquetFile(/path/1.parquet,/path/2.parquet), if the parquet files 
were created using a different coalesce, the reading fails with:

ERROR c.w.r.websocket.ParquetReader  efault-dispatcher-63 : Failed reading 
parquet file
java.lang.IllegalArgumentException: Could not find Parquet metadata at path 

at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
 ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]

at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
 ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
at scala.Option.getOrElse(Option.scala:120) 
~[org.scala-lang.scala-library-2.10.4.jar:na]
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.scala:458)
 ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:477)
 ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
at 
org.apache.spark.sql.parquet.ParquetRelation.(ParquetRelation.scala:65) 
~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:165) 
~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]

I haven't tested with Spark 1.3 yet but will report back after upgrading to 
1.3.1 (as soon as it's released).



> Reading Parquet files with different sub-files doesn't work
> ---
>
> Key: SPARK-6648
> URL: https://issues.apache.org/jira/browse/SPARK-6648
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: Marius Soutier
>
> When reading from multiple parquet files (via 
> sqlContext.parquetFile(/path/1.parquet,/path/2.parquet), if the parquet files 
> were created using a different coalesce (e.g. one only contains 
> part-r-1.parquet, the other also part-r-2.parquet, part-r-3.parquet), the 
> reading fails with:
> ERROR c.w.r.websocket.ParquetReader  efault-dispatcher-63 : Failed reading 
> parquet file
> java.lang.IllegalArgumentException: Could not find Parquet metadata at path 
> 
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
>  ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
>  ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
>   at scala.Option.getOrElse(Option.scala:120) 
> ~[org.scala-lang.scala-library-2.10.4.jar:na]
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.scala:458)
>  ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:477)
>  ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
>   at 
> org.apache.spark.sql.parquet.ParquetRelation.(ParquetRelation.scala:65) 
> ~[org.apache.sp

[jira] [Updated] (SPARK-6648) Reading Parquet files with different sub-files doesn't work

2015-04-01 Thread Marius Soutier (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marius Soutier updated SPARK-6648:
--
Description: 
When reading from multiple parquet files (via 
sqlContext.parquetFile(/path/1.parquet,/path/2.parquet), and one of the parquet 
files is being overwritten using a different coalesce (e.g. one only contains 
part-r-1.parquet, the other also part-r-2.parquet, part-r-3.parquet), the 
reading fails with:

ERROR c.w.r.websocket.ParquetReader  efault-dispatcher-63 : Failed reading 
parquet file
java.lang.IllegalArgumentException: Could not find Parquet metadata at path 

at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
 ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]

at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
 ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
at scala.Option.getOrElse(Option.scala:120) 
~[org.scala-lang.scala-library-2.10.4.jar:na]
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.scala:458)
 ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:477)
 ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
at 
org.apache.spark.sql.parquet.ParquetRelation.(ParquetRelation.scala:65) 
~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:165) 
~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]

I haven't tested with Spark 1.3 yet but will report back after upgrading to 
1.3.1 (as soon as it's released).


  was:
When reading from multiple parquet files (via 
sqlContext.parquetFile(/path/1.parquet,/path/2.parquet), if the parquet files 
were created using a different coalesce (e.g. one only contains 
part-r-1.parquet, the other also part-r-2.parquet, part-r-3.parquet), the 
reading fails with:

ERROR c.w.r.websocket.ParquetReader  efault-dispatcher-63 : Failed reading 
parquet file
java.lang.IllegalArgumentException: Could not find Parquet metadata at path 

at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
 ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]

at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
 ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
at scala.Option.getOrElse(Option.scala:120) 
~[org.scala-lang.scala-library-2.10.4.jar:na]
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.scala:458)
 ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:477)
 ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
at 
org.apache.spark.sql.parquet.ParquetRelation.(ParquetRelation.scala:65) 
~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:165) 
~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]

I haven't tested with Spark 1.3 yet but will report back after upgrading to 
1.3.1 (as soon as it's released).



> Reading Parquet files with different sub-files doesn't work
> ---
>
> Key: SPARK-6648
> URL: https://issues.apache.org/jira/browse/SPARK-6648
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: Marius Soutier
>
> When reading from multiple parquet files (via 
> sqlContext.parquetFile(/path/1.parquet,/path/2.parquet), and one of the 
> parquet files is being overwritten using a different coalesce (e.g. one only 
> contains part-r-1.parquet, the other also part-r-2.parquet, 
> part-r-3.parquet), the reading fails with:
> ERROR c.w.r.websocket.ParquetReader  efault-dispatcher-63 : Failed reading 
> parquet file
> java.lang.IllegalArgumentException: Could not find Parquet metadata at path 
> 
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
>  ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
>  ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
>   at scala.Option.getOrElse(Option.scala:120) 
> ~[org.scala-lang.scala-library-2.10.4.jar:na]
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.scala:458)
>  ~[org.apache.spark.spark-sql_2.10-1.2.1.jar:1.2.1]
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:477)
>  ~[org.apache.spark.spark-sql

[jira] [Updated] (SPARK-6334) spark-local dir not getting cleared during ALS

2015-04-01 Thread Antony Mayi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antony Mayi updated SPARK-6334:
---
Attachment: gc.png

> spark-local dir not getting cleared during ALS
> --
>
> Key: SPARK-6334
> URL: https://issues.apache.org/jira/browse/SPARK-6334
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Antony Mayi
> Attachments: als-diskusage.png, gc.png
>
>
> when running a bigger ALS training, spark spills loads of temp data into the 
> local-dir (in my case yarn/local/usercache/antony.mayi/appcache/... - running 
> on YARN from cdh 5.3.2), eventually causing all the disks of all nodes to run 
> out of space (in my case I have 12TB of available disk capacity before 
> kicking off the ALS, but it all gets used, and yarn kills the containers when 
> they reach 90%).
> even with all recommended options (configuring checkpointing and forcing GC 
> when possible) it still doesn't get cleared.
> here is my (pseudo)code (pyspark):
> {code}
> sc.setCheckpointDir('/tmp')
> training = 
> sc.pickleFile('/tmp/dataset').repartition(768).persist(StorageLevel.MEMORY_AND_DISK)
> model = ALS.trainImplicit(training, 50, 15, lambda_=0.1, blocks=-1, alpha=40)
> sc._jvm.System.gc()
> {code}
> the training RDD has about 3.5 billion items (~60GB on disk). after about 
> 6 hours the ALS consumes all 12TB of disk space in local-dir data and 
> gets killed. my cluster has 192 cores and 1.5TB RAM, and for this task I am using 
> 37 executors of 4 cores/28+4GB RAM each.
> this is the graph of disk consumption pattern showing the space being all 
> eaten from 7% to 90% during the ALS (90% is when YARN kills the container):
> !als-diskusage.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6334) spark-local dir not getting cleared during ALS

2015-04-01 Thread Antony Mayi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390771#comment-14390771
 ] 

Antony Mayi commented on SPARK-6334:


bq. btw. I see based on the source code that checkpointing should be happening every 
3 iterations - how come I don't see any drops in the disk usage at least once 
every three iterations? it just seems to be growing constantly... which worries 
me that even more frequent checkpointing won't help...

ok, I am now sure that more frequent checkpointing is likely not going to 
help, just as it is not helping now - the disk usage just keeps growing even after 
every 3rd iteration. I just tried a dirty hack - running a parallel thread that forces GC 
every x minutes - and suddenly I can see the disk space getting cleared upon 
every three iterations, when GC runs.

see this pattern - first a run without forcing GC and then another one where 
there are noticeable disk usage drops every three steps (ALS iterations):
!gc.png!

so really what's needed to get the shuffles cleaned upon checkpointing is 
forcing GC.

this was my dirty hack:

{code}
from threading import Thread, Event

from pyspark import StorageLevel
from pyspark.mllib.recommendation import ALS


class GC(Thread):
    """Background thread that forces a JVM GC every `period` seconds."""

    def __init__(self, context, period=600):
        Thread.__init__(self)
        self.context = context
        self.period = period
        self.daemon = True
        self.stopped = Event()

    def stop(self):
        self.stopped.set()

    def run(self):
        self.stopped.clear()
        while not self.stopped.is_set():
            self.stopped.wait(self.period)
            self.context._jvm.System.gc()


sc.setCheckpointDir('/tmp')

gc = GC(sc)
gc.start()

training = sc.pickleFile('/tmp/dataset').repartition(768).persist(StorageLevel.MEMORY_AND_DISK)
model = ALS.trainImplicit(training, 50, 15, lambda_=0.1, blocks=-1, alpha=40)

gc.stop()
{code}

> spark-local dir not getting cleared during ALS
> --
>
> Key: SPARK-6334
> URL: https://issues.apache.org/jira/browse/SPARK-6334
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Antony Mayi
> Attachments: als-diskusage.png, gc.png
>
>
> when running a bigger ALS training, spark spills loads of temp data into the 
> local-dir (in my case yarn/local/usercache/antony.mayi/appcache/... - running 
> on YARN from cdh 5.3.2), eventually causing all the disks of all nodes to run 
> out of space (in my case I have 12TB of available disk capacity before 
> kicking off the ALS, but it all gets used, and yarn kills the containers when 
> they reach 90%).
> even with all recommended options (configuring checkpointing and forcing GC 
> when possible) it still doesn't get cleared.
> here is my (pseudo)code (pyspark):
> {code}
> sc.setCheckpointDir('/tmp')
> training = 
> sc.pickleFile('/tmp/dataset').repartition(768).persist(StorageLevel.MEMORY_AND_DISK)
> model = ALS.trainImplicit(training, 50, 15, lambda_=0.1, blocks=-1, alpha=40)
> sc._jvm.System.gc()
> {code}
> the training RDD has about 3.5 billion items (~60GB on disk). after about 
> 6 hours the ALS consumes all 12TB of disk space in local-dir data and 
> gets killed. my cluster has 192 cores and 1.5TB RAM, and for this task I am using 
> 37 executors of 4 cores/28+4GB RAM each.
> this is the graph of disk consumption pattern showing the space being all 
> eaten from 7% to 90% during the ALS (90% is when YARN kills the container):
> !als-diskusage.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-6334) spark-local dir not getting cleared during ALS

2015-04-01 Thread Antony Mayi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antony Mayi reopened SPARK-6334:


> spark-local dir not getting cleared during ALS
> --
>
> Key: SPARK-6334
> URL: https://issues.apache.org/jira/browse/SPARK-6334
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Antony Mayi
> Attachments: als-diskusage.png, gc.png
>
>
> when running a bigger ALS training, spark spills loads of temp data into the 
> local-dir (in my case yarn/local/usercache/antony.mayi/appcache/... - running 
> on YARN from cdh 5.3.2), eventually causing all the disks of all nodes to run 
> out of space (in my case I have 12TB of available disk capacity before 
> kicking off the ALS, but it all gets used, and yarn kills the containers when 
> they reach 90%).
> even with all recommended options (configuring checkpointing and forcing GC 
> when possible) it still doesn't get cleared.
> here is my (pseudo)code (pyspark):
> {code}
> sc.setCheckpointDir('/tmp')
> training = 
> sc.pickleFile('/tmp/dataset').repartition(768).persist(StorageLevel.MEMORY_AND_DISK)
> model = ALS.trainImplicit(training, 50, 15, lambda_=0.1, blocks=-1, alpha=40)
> sc._jvm.System.gc()
> {code}
> the training RDD has about 3.5 billion items (~60GB on disk). after about 
> 6 hours the ALS consumes all 12TB of disk space in local-dir data and 
> gets killed. my cluster has 192 cores and 1.5TB RAM, and for this task I am using 
> 37 executors of 4 cores/28+4GB RAM each.
> this is the graph of disk consumption pattern showing the space being all 
> eaten from 7% to 90% during the ALS (90% is when YARN kills the container):
> !als-diskusage.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Evan Sparks (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390782#comment-14390782
 ] 

Evan Sparks commented on SPARK-6646:


Guys - you're clearly ignoring prior work. The database community solved this 
problem 20 years ago with the Gubba project - a mature prototype [can be seen 
here|http://i.imgur.com/FJK7K9x.jpg]. 

Additionally, everyone knows that joins don't scale on iOS, and you'll never be 
able to build indexes on this platform.


> Spark 2.0: Rearchitecting Spark for Mobile Platforms
> 
>
> Key: SPARK-6646
> URL: https://issues.apache.org/jira/browse/SPARK-6646
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
> Attachments: Spark on Mobile - Design Doc - v1.pdf
>
>
> Mobile computing is quickly rising to dominance, and by the end of 2017, it 
> is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
> Spark’s project goal can be accomplished only when Spark runs efficiently for 
> the growing population of mobile users.
> Designed and optimized for modern data centers and Big Data applications, 
> Spark is unfortunately not a good fit for mobile computing today. In the past 
> few months, we have been prototyping the feasibility of a mobile-first Spark 
> architecture, and today we would like to share with you our findings. This 
> ticket outlines the technical design of Spark’s mobile support, and shares 
> results from several early prototypes.
> Mobile friendly version of the design doc: 
> https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6433) hive tests to import spark-sql test JAR for QueryTest access

2015-04-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6433:
-
Assignee: Steve Loughran

> hive tests to import spark-sql test JAR for QueryTest access
> 
>
> Key: SPARK-6433
> URL: https://issues.apache.org/jira/browse/SPARK-6433
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 1.4.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Fix For: 1.4.0
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> The hive module has its own clone of {{org.apache.spark.sql.QueryPlan}} and 
> {{org.apache.spark.sql.catalyst.plans.PlanTest}} which are copied from the 
> spark-sql module because it's "hard to have maven allow one subproject depend 
> on another subproject's test code".
> It's actually relatively straightforward:
> # tell maven to build & publish the test JARs
> # import them in your other sub projects
> There is one consequence: the JARs will also end up being published to mvn 
> central. This is not really a bad thing; it does help downstream projects 
> pick up the JARs too. It does become an issue if a test run depends on a 
> custom file under {{src/test/resources}} containing things like EC2 
> authentication keys, or even just log4j.properties files which can interfere 
> with each other. These need to be excluded - the simplest way is to exclude 
> all of the resources from test JARs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6433) hive tests to import spark-sql test JAR for QueryTest access

2015-04-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6433.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5119
[https://github.com/apache/spark/pull/5119]

> hive tests to import spark-sql test JAR for QueryTest access
> 
>
> Key: SPARK-6433
> URL: https://issues.apache.org/jira/browse/SPARK-6433
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 1.4.0
>Reporter: Steve Loughran
>Priority: Minor
> Fix For: 1.4.0
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> The hive module has its own clone of {{org.apache.spark.sql.QueryPlan}} and 
> {{org.apache.spark.sql.catalyst.plans.PlanTest}} which are copied from the 
> spark-sql module because it's "hard to have maven allow one subproject depend 
> on another subproject's test code".
> It's actually relatively straightforward:
> # tell maven to build & publish the test JARs
> # import them in your other sub projects
> There is one consequence: the JARs will also end up being published to mvn 
> central. This is not really a bad thing; it does help downstream projects 
> pick up the JARs too. It does become an issue if a test run depends on a 
> custom file under {{src/test/resources}} containing things like EC2 
> authentication keys, or even just log4j.properties files which can interfere 
> with each other. These need to be excluded - the simplest way is to exclude 
> all of the resources from test JARs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6649) DataFrame created through SQLContext.jdbc() failed if columns table must be quoted

2015-04-01 Thread JIRA
Frédéric Blanc created SPARK-6649:
-

 Summary: DataFrame created through SQLContext.jdbc() failed if 
columns table must be quoted
 Key: SPARK-6649
 URL: https://issues.apache.org/jira/browse/SPARK-6649
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Frédéric Blanc
Priority: Minor


If I want to import the content of a table from Oracle that contains a column 
named COMMENT (a reserved keyword), I cannot use a DataFrame that maps all 
the columns of this table.

{code:title=ddl.sql|borderStyle=solid}
CREATE TABLE TEST_TABLE (
"COMMENT" VARCHAR2(10)
);
{code}

{code:title=test.java|borderStyle=solid}
SQLContext sqlContext = ...

DataFrame df = sqlContext.jdbc(databaseURL, "TEST_TABLE");
df.rdd();   // => fails if the table contains a column whose name is a reserved keyword
{code}

The same problem can be encountered if a reserved keyword is used as the table name.

The JDBCRDD Scala class could be improved if the columnList initializer appended 
the double-quote around each column name (line 225).
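
An illustrative sketch of that idea (hypothetical code, not the actual JDBCRDD internals):

{code}
// Wrap every column name in double quotes when building the projection list,
// so reserved words such as COMMENT survive in the generated SQL.
val columns = Seq("COMMENT")
val columnList = columns.map(c => "\"" + c + "\"").mkString(", ")
val sql = "SELECT " + columnList + " FROM \"TEST_TABLE\""
// => SELECT "COMMENT" FROM "TEST_TABLE"
{code}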






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Vinay Shukla (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390950#comment-14390950
 ] 

Vinay Shukla commented on SPARK-6646:
-

This use case can benefit from running Spark inside a Mobile App Server. An App 
server that takes care of horizontal issues such as security, networking, etc. 
will allow Spark to focus on the real hard problem of data processing in a 
lightning fast manner.

There is another idea of having Spark leverage [parallel quantum 
computing | http://people.csail.mit.edu/nhm/pqc.pdf] but I suppose that calls 
for another JIRA.

> Spark 2.0: Rearchitecting Spark for Mobile Platforms
> 
>
> Key: SPARK-6646
> URL: https://issues.apache.org/jira/browse/SPARK-6646
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
> Attachments: Spark on Mobile - Design Doc - v1.pdf
>
>
> Mobile computing is quickly rising to dominance, and by the end of 2017, it 
> is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
> Spark’s project goal can be accomplished only when Spark runs efficiently for 
> the growing population of mobile users.
> Designed and optimized for modern data centers and Big Data applications, 
> Spark is unfortunately not a good fit for mobile computing today. In the past 
> few months, we have been prototyping the feasibility of a mobile-first Spark 
> architecture, and today we would like to share with you our findings. This 
> ticket outlines the technical design of Spark’s mobile support, and shares 
> results from several early prototypes.
> Mobile friendly version of the design doc: 
> https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6650) ExecutorAllocationManager never stops

2015-04-01 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-6650:
-

 Summary: ExecutorAllocationManager never stops
 Key: SPARK-6650
 URL: https://issues.apache.org/jira/browse/SPARK-6650
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Marcelo Vanzin


{{ExecutorAllocationManager}} doesn't even have a stop() method. That means 
that when the owning SparkContext goes away, the internal thread it uses to 
schedule its activities remains alive.

That means it constantly spams the logs and does who knows what else that could 
affect any future contexts that are allocated.

It's particularly evil during unit tests, since it slows down everything else 
after the suite is run, leaving multiple threads behind.
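
A minimal sketch of the missing lifecycle hook, using hypothetical names (this shows the general pattern only, not Spark's actual internals):

{code}
import java.util.concurrent.{Executors, ScheduledExecutorService, TimeUnit}

// A component that owns a scheduler thread should expose stop(), so the owning
// context can shut the thread down instead of leaving it alive forever.
class AllocationPoller {
  private val scheduler: ScheduledExecutorService =
    Executors.newSingleThreadScheduledExecutor()

  def start(): Unit = {
    val task = new Runnable { override def run(): Unit = { /* adjust executor count */ } }
    scheduler.scheduleAtFixedRate(task, 0L, 100L, TimeUnit.MILLISECONDS)
  }

  def stop(): Unit = { scheduler.shutdownNow() }  // to be called when the context shuts down
}
{code}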



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6650) ExecutorAllocationManager never stops

2015-04-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6650:
---

Assignee: Apache Spark

> ExecutorAllocationManager never stops
> -
>
> Key: SPARK-6650
> URL: https://issues.apache.org/jira/browse/SPARK-6650
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>
> {{ExecutorAllocationManager}} doesn't even have a stop() method. That means 
> that when the owning SparkContext goes away, the internal thread it uses to 
> schedule its activities remains alive.
> That means it constantly spams the logs and does who knows what else that 
> could affect any future contexts that are allocated.
> It's particularly evil during unit tests, since it slows down everything else 
> after the suite is run, leaving multiple threads behind.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6650) ExecutorAllocationManager never stops

2015-04-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6650:
---

Assignee: (was: Apache Spark)

> ExecutorAllocationManager never stops
> -
>
> Key: SPARK-6650
> URL: https://issues.apache.org/jira/browse/SPARK-6650
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Marcelo Vanzin
>
> {{ExecutorAllocationManager}} doesn't even have a stop() method. That means 
> that when the owning SparkContext goes away, the internal thread it uses to 
> schedule its activities remains alive.
> That means it constantly spams the logs and does who knows what else that 
> could affect any future contexts that are allocated.
> It's particularly evil during unit tests, since it slows down everything else 
> after the suite is run, leaving multiple threads behind.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6650) ExecutorAllocationManager never stops

2015-04-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390986#comment-14390986
 ] 

Apache Spark commented on SPARK-6650:
-

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/5311

> ExecutorAllocationManager never stops
> -
>
> Key: SPARK-6650
> URL: https://issues.apache.org/jira/browse/SPARK-6650
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Marcelo Vanzin
>
> {{ExecutorAllocationManager}} doesn't even have a stop() method. That means 
> that when the owning SparkContext goes away, the internal thread it uses to 
> schedule its activities remains alive.
> That means it constantly spams the logs and does who knows what else that 
> could affect any future contexts that are allocated.
> It's particularly evil during unit tests, since it slows down everything else 
> after the suite is run, leaving multiple threads behind.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6651) Delegate dense vector arithmetics to the underly numpy array

2015-04-01 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-6651:


 Summary: Delegate dense vector arithmetics to the underly numpy 
array
 Key: SPARK-6651
 URL: https://issues.apache.org/jira/browse/SPARK-6651
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


It is convenient to delegate dense linear algebra operations to numpy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6651) Delegate dense vector arithmetics to the underly numpy array

2015-04-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6651:
---

Assignee: Apache Spark  (was: Xiangrui Meng)

> Delegate dense vector arithmetics to the underly numpy array
> 
>
> Key: SPARK-6651
> URL: https://issues.apache.org/jira/browse/SPARK-6651
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>
> It is convenient to delegate dense linear algebra operations to numpy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6651) Delegate dense vector arithmetics to the underly numpy array

2015-04-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391020#comment-14391020
 ] 

Apache Spark commented on SPARK-6651:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/5312

> Delegate dense vector arithmetics to the underly numpy array
> 
>
> Key: SPARK-6651
> URL: https://issues.apache.org/jira/browse/SPARK-6651
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> It is convenient to delegate dense linear algebra operations to numpy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6651) Delegate dense vector arithmetics to the underly numpy array

2015-04-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6651:
---

Assignee: Xiangrui Meng  (was: Apache Spark)

> Delegate dense vector arithmetics to the underly numpy array
> 
>
> Key: SPARK-6651
> URL: https://issues.apache.org/jira/browse/SPARK-6651
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> It is convenient to delegate dense linear algebra operations to numpy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6587) Inferring schema for case class hierarchy fails with mysterious message

2015-04-01 Thread Spiro Michaylov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391035#comment-14391035
 ] 

Spiro Michaylov commented on SPARK-6587:


I appreciate the comment, and clearly I was confused about a couple of things, 
but I wonder if there's still an interesting RFE here. My example was intended 
to internalize into case classes some really powerful Spark SQL behavior that 
I've observed when inferring schema for JSON: 

{code}
val textConflict = sc.parallelize(Seq(
  "{\"key\":42}",
  "{\"key\":\"hello\"}",
  "{\"key\":false}"
), 4)

val jsonConflict = sqlContext.jsonRDD(textConflict)
jsonConflict.printSchema()
jsonConflict.registerTempTable("conflict")
sqlContext.sql("SELECT * FROM conflict").show()
{code}

Which produces:

{noformat}
root
 |-- key: string (nullable = true)

key  
42   
hello
false
{noformat}

This behavior is IMO a *really* nice compromise: a type is inferred, it is 
approximate, so there are certain things you can't do in the query, but type 
information is still preserved when returning results from the query. 

I was trying to help the poster on StackOverflow to achieve similar behavior 
from case classes, and I thought a hierarchy was necessary. While I was clearly 
barking up the wrong tree, I wonder:

a) Is it intended that these kinds of type "conflicts" be handled as elegantly 
when one is using case classes rather than the JSON parser?
b) Is there already a way to do it that I failed to find? (Suspicion: no, but 
I've been wrong before ...)
c) If respectively YES and NO, how should the RFE be phrased?


> Inferring schema for case class hierarchy fails with mysterious message
> ---
>
> Key: SPARK-6587
> URL: https://issues.apache.org/jira/browse/SPARK-6587
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: At least Windows 8, Scala 2.11.2.  
>Reporter: Spiro Michaylov
>
> (Don't know if this is a functionality bug, error reporting bug or an RFE ...)
> I define the following hierarchy:
> {code}
> private abstract class MyHolder
> private case class StringHolder(s: String) extends MyHolder
> private case class IntHolder(i: Int) extends MyHolder
> private case class BooleanHolder(b: Boolean) extends MyHolder
> {code}
> and a top level case class:
> {code}
> private case class Thing(key: Integer, foo: MyHolder)
> {code}
> When I try to convert it:
> {code}
> val things = Seq(
>   Thing(1, IntHolder(42)),
>   Thing(2, StringHolder("hello")),
>   Thing(3, BooleanHolder(false))
> )
> val thingsDF = sc.parallelize(things, 4).toDF()
> thingsDF.registerTempTable("things")
> val all = sqlContext.sql("SELECT * from things")
> {code}
> I get the following stack trace:
> {noformat}
> Exception in thread "main" scala.MatchError: 
> sql.CaseClassSchemaProblem.MyHolder (of class 
> scala.reflect.internal.Types$ClassNoArgsTypeRef)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:112)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:159)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:157)
>   at scala.collection.immutable.List.map(List.scala:276)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:157)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
>   at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:312)
>   at 
> org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:250)
>   at sql.CaseClassSchemaProblem$.main(CaseClassSchemaProblem.scala:35)
>   at sql.CaseClassSchemaProblem.main(CaseClassSchemaProblem.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
> {noformat}
> I wrote this to answer [a question on 
> StackOverflow|http://stackoverflow.com/questions/29310405/what-is-the-right-way-to-represent-an-any-type-in-spark-sql]
>  whic

[jira] [Created] (SPARK-6652) SQLContext and HiveContext do not handle "tricky" names well

2015-04-01 Thread Max Seiden (JIRA)
Max Seiden created SPARK-6652:
-

 Summary: SQLContext and HiveContext do not handle "tricky" names 
well
 Key: SPARK-6652
 URL: https://issues.apache.org/jira/browse/SPARK-6652
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
Reporter: Max Seiden


h3. Summary
There are cases where both the SQLContext and HiveContext do not handle 
"tricky" names (containing UTF-8, tabs, newlines, etc.) well. For example, the 
following string:

{noformat}
val tricky = "Tricky-\u4E2D[x.][\",/\\n * ? é\n&$(x)\t(':;#!^-Name"
{noformat}

causes the following exceptions during parsing and resolution (respectively).

h5. SQLContext parse failure
{noformat}
// pseudocode
val data = 0 until 100
val rdd = sc.parallelize(data)
val schema = StructType(StructField(tricky, IntegerType, false) :: Nil)
val schemaRDD = sqlContext.applySchema(rdd.map(i => Row(i)), schema)
schemaRDD.registerAsTable(tricky)
sqlContext.sql(s"select `$tricky` from `$tricky`")

java.lang.RuntimeException: [1.33] failure: ``UNION'' expected but 
ErrorToken(``' expected but 
 found) found

select `Tricky-中[x.][",/\n * ? é

^
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33)
at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
at 
org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174)
at 
org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
at 
scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:31)
at 
org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83)
at 
org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:83)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:303)
{noformat}

h5. HiveContext resolution failure
{noformat}
// pseudocode
val data = 0 until 100
val rdd = sc.parallelize(data)
val schema = StructType(StructField(tricky, IntegerType, false) :: Nil)
val schemaRDD = sqlContext.applySchema(rdd.map(i => Row(i)), schema)
schemaRDD.registerAsTable(tricky)
sqlContext.sql(s"select `$tricky` from `$tricky`").collect()

// the parse is ok in this case...
15/04/01 10:41:48 WARN HiveConf: DEPRECATED: hive.metastore.ds.retry.* no 
longer has any effect.  Use hive.hmshandler.retry.* instead
15/04/01 10:41:48 INFO ParseDriver: Parsing command: select `Tricky-中[x.][",/\n 
* ? é
&$(x)   (':;#!^-Name` from `Tricky-中[x.][",/\n * ? é
&$(x)   (':;#!^-Name`
15/04/01 10:41:48 INFO ParseDriver: Parse Completed

// but resolution fails
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
attributes: 'Tricky-中[x.][",/\n * ? é
&$(x)   (':;#!^-Name, tree:
'Project ['Tricky-中[x.][",/\n * ? é
&$(x)   (':;#!^-Name]
 Subquery tricky-中[x.][",/\n * ? é
&$(x)   (':;#!^-name
  LogicalRDD [Tricky-中[x.][",/\n * ? é
&$(x)   (':;#!^-Name#2], MappedRDD[16] at map at :30

at 
org.apache.s

[jira] [Created] (SPARK-6653) New configuration property to specify port for sparkYarnAM actor system

2015-04-01 Thread Manoj Samel (JIRA)
Manoj Samel created SPARK-6653:
--

 Summary: New configuration property to specify port for 
sparkYarnAM actor system
 Key: SPARK-6653
 URL: https://issues.apache.org/jira/browse/SPARK-6653
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.3.0
 Environment: Spark On Yarn
Reporter: Manoj Samel


In the 1.3.0 code line, the sparkYarnAM actor system is started on a random port. See 
org.apache.spark.deploy.yarn ApplicationMaster.scala:282

actorSystem = AkkaUtils.createActorSystem("sparkYarnAM", Utils.localHostName, 
0, conf = sparkConf, securityManager = securityMgr)._1

This may be an issue when ports between the Spark client and the YARN cluster are 
limited by a firewall and not all ports are open between the client and the cluster.

The proposal is to introduce a new property, spark.am.actor.port, and change the code to:

val port = sparkConf.getInt("spark.am.actor.port", 0)
actorSystem = AkkaUtils.createActorSystem("sparkYarnAM", 
Utils.localHostName, port,
  conf = sparkConf, securityManager = securityMgr)._1
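
If such a property were adopted, a client could pin the port through SparkConf (hypothetical usage - the property does not exist in 1.3.0):

{code}
import org.apache.spark.SparkConf

// Assumes the proposed spark.am.actor.port property is implemented as above.
val conf = new SparkConf().set("spark.am.actor.port", "5555")
{code}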








--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6654) Update Kinesis Streaming impls (both KCL-based and Direct) to use latest aws-java-sdk and kinesis-client-library

2015-04-01 Thread Chris Fregly (JIRA)
Chris Fregly created SPARK-6654:
---

 Summary: Update Kinesis Streaming impls (both KCL-based and 
Direct) to use latest aws-java-sdk and kinesis-client-library
 Key: SPARK-6654
 URL: https://issues.apache.org/jira/browse/SPARK-6654
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.1.0
Reporter: Chris Fregly






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6655) We need to read the schema of a data source table stored in spark.sql.sources.schema property

2015-04-01 Thread Yin Huai (JIRA)
Yin Huai created SPARK-6655:
---

 Summary: We need to read the schema of a data source table stored 
in spark.sql.sources.schema property
 Key: SPARK-6655
 URL: https://issues.apache.org/jira/browse/SPARK-6655
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Blocker
 Fix For: 1.3.1






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6655) We need to read the schema of a data source table stored in spark.sql.sources.schema property

2015-04-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6655:
---

Assignee: Apache Spark  (was: Yin Huai)

> We need to read the schema of a data source table stored in 
> spark.sql.sources.schema property
> -
>
> Key: SPARK-6655
> URL: https://issues.apache.org/jira/browse/SPARK-6655
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Yin Huai
>Assignee: Apache Spark
>Priority: Blocker
> Fix For: 1.3.1
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6655) We need to read the schema of a data source table stored in spark.sql.sources.schema property

2015-04-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6655:
---

Assignee: Yin Huai  (was: Apache Spark)

> We need to read the schema of a data source table stored in 
> spark.sql.sources.schema property
> -
>
> Key: SPARK-6655
> URL: https://issues.apache.org/jira/browse/SPARK-6655
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 1.3.1
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6655) We need to read the schema of a data source table stored in spark.sql.sources.schema property

2015-04-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391282#comment-14391282
 ] 

Apache Spark commented on SPARK-6655:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/5313

> We need to read the schema of a data source table stored in 
> spark.sql.sources.schema property
> -
>
> Key: SPARK-6655
> URL: https://issues.apache.org/jira/browse/SPARK-6655
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 1.3.1
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6656) Allow the application name to be passed in versus pulling from SparkContext.getAppName()

2015-04-01 Thread Chris Fregly (JIRA)
Chris Fregly created SPARK-6656:
---

 Summary: Allow the application name to be passed in versus pulling 
from SparkContext.getAppName() 
 Key: SPARK-6656
 URL: https://issues.apache.org/jira/browse/SPARK-6656
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.1.0
Reporter: Chris Fregly


This is useful for the scenario where Kinesis Spark Streaming is being invoked 
from the Spark Shell. In this case, the application name in the SparkContext 
is pre-set to "Spark Shell".

This isn't a common or recommended use case, but it's best to make this 
configurable outside of SparkContext.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4184) Improve Spark Streaming documentation to address commonly-asked questions

2015-04-01 Thread Chris Fregly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Fregly updated SPARK-4184:

Target Version/s: 1.4.0  (was: 1.3.1)

> Improve Spark Streaming documentation to address commonly-asked questions 
> --
>
> Key: SPARK-4184
> URL: https://issues.apache.org/jira/browse/SPARK-4184
> Project: Spark
>  Issue Type: Documentation
>  Components: Streaming
>Reporter: Chris Fregly
>  Labels: documentation, streaming
>
> Improve Streaming documentation including API descriptions, 
> concurrency/thread safety, fault tolerance, replication, checkpointing, 
> scalability, resource allocation and utilization, back pressure, and 
> monitoring.
> also, add a section to the kinesis streaming guide describing how to use IAM 
> roles with the Spark Kinesis Receiver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5960) Allow AWS credentials to be passed to KinesisUtils.createStream()

2015-04-01 Thread Chris Fregly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Fregly updated SPARK-5960:

Target Version/s: 1.4.0  (was: 1.3.1)

> Allow AWS credentials to be passed to KinesisUtils.createStream()
> -
>
> Key: SPARK-5960
> URL: https://issues.apache.org/jira/browse/SPARK-5960
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.1.0
>Reporter: Chris Fregly
>Assignee: Chris Fregly
>
> While IAM roles are preferable, we're seeing a lot of cases where we need to 
> pass AWS credentials when creating the KinesisReceiver.
> Notes:
> * Make sure we don't log the credentials anywhere
> * Maintain compatibility with existing KinesisReceiver-based code.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Deenar Toraskar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391335#comment-14391335
 ] 

Deenar Toraskar commented on SPARK-6646:


maybe Spark 2.0 should be branded i-Spark

> Spark 2.0: Rearchitecting Spark for Mobile Platforms
> 
>
> Key: SPARK-6646
> URL: https://issues.apache.org/jira/browse/SPARK-6646
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
> Attachments: Spark on Mobile - Design Doc - v1.pdf
>
>
> Mobile computing is quickly rising to dominance, and by the end of 2017, it 
> is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
> Spark’s project goal can be accomplished only when Spark runs efficiently for 
> the growing population of mobile users.
> Designed and optimized for modern data centers and Big Data applications, 
> Spark is unfortunately not a good fit for mobile computing today. In the past 
> few months, we have been prototyping the feasibility of a mobile-first Spark 
> architecture, and today we would like to share with you our findings. This 
> ticket outlines the technical design of Spark’s mobile support, and shares 
> results from several early prototypes.
> Mobile friendly version of the design doc: 
> https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6651) Delegate dense vector arithmetics to the underly numpy array

2015-04-01 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-6651.
--
   Resolution: Fixed
Fix Version/s: 1.4.0
   1.3.1

Issue resolved by pull request 5312
[https://github.com/apache/spark/pull/5312]

> Delegate dense vector arithmetics to the underly numpy array
> 
>
> Key: SPARK-6651
> URL: https://issues.apache.org/jira/browse/SPARK-6651
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 1.3.1, 1.4.0
>
>
> It is convenient to delegate dense linear algebra operations to numpy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6642) Change the lambda weight to number of explicit ratings in implicit ALS

2015-04-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391412#comment-14391412
 ] 

Apache Spark commented on SPARK-6642:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/5314

> Change the lambda weight to number of explicit ratings in implicit ALS
> --
>
> Key: SPARK-6642
> URL: https://issues.apache.org/jira/browse/SPARK-6642
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> Until SPARK-6637 is resolved, we should switch back to the 1.2 lambda 
> weighting strategy to be consistent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6642) Change the lambda weight to number of explicit ratings in implicit ALS

2015-04-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6642:
---

Assignee: Xiangrui Meng  (was: Apache Spark)

> Change the lambda weight to number of explicit ratings in implicit ALS
> --
>
> Key: SPARK-6642
> URL: https://issues.apache.org/jira/browse/SPARK-6642
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> Until SPARK-6637 is resolved, we should switch back to the 1.2 lambda 
> weighting strategy to be consistent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6642) Change the lambda weight to number of explicit ratings in implicit ALS

2015-04-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6642:
---

Assignee: Apache Spark  (was: Xiangrui Meng)

> Change the lambda weight to number of explicit ratings in implicit ALS
> --
>
> Key: SPARK-6642
> URL: https://issues.apache.org/jira/browse/SPARK-6642
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>
> Until SPARK-6637 is resolved, we should switch back to the 1.2 lambda 
> weighting strategy to be consistent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6373) Add SSL/TLS for the Netty based BlockTransferService

2015-04-01 Thread Jeffrey Turpin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391452#comment-14391452
 ] 

Jeffrey Turpin commented on SPARK-6373:
---

Hey Aaron,

Sorry for the delay... I have cleaned things up a bit and refactored the 
implementation to be more in line with our earlier conversation... Have a look 
at 
https://github.com/turp1twin/spark/commit/d976a7ab9b57e26fc180d649fd084a6acb9d027e
 and let me know your thoughts...

Jeff


> Add SSL/TLS for the Netty based BlockTransferService 
> -
>
> Key: SPARK-6373
> URL: https://issues.apache.org/jira/browse/SPARK-6373
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Shuffle
>Affects Versions: 1.2.1
>Reporter: Jeffrey Turpin
>
> Add the ability to allow for secure communications (SSL/TLS) for the Netty 
> based BlockTransferService and the ExternalShuffleClient. This ticket will 
> hopefully start the conversation around potential designs... Below is a 
> reference to a WIP prototype which implements this functionality 
> (prototype)... I have attempted to disrupt as little code as possible and 
> tried to follow the current code structure (for the most part) in the areas I 
> modified. I also studied how Hadoop achieves encrypted shuffle 
> (http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html)
> https://github.com/turp1twin/spark/commit/024b559f27945eb63068d1badf7f82e4e7c3621c



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391456#comment-14391456
 ] 

Matei Zaharia commented on SPARK-6646:
--

Not to rain on the parade here, but I worry that focusing on mobile phones is 
short-sighted. Does this design present a path forward for the Internet of 
Things as well? You'd want something that runs on Arduino, Raspberry Pi, etc. 
We already have MQTT input in Spark Streaming so we could consider using MQTT 
to replace Netty for shuffle as well. Has anybody benchmarked that?

> Spark 2.0: Rearchitecting Spark for Mobile Platforms
> 
>
> Key: SPARK-6646
> URL: https://issues.apache.org/jira/browse/SPARK-6646
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
> Attachments: Spark on Mobile - Design Doc - v1.pdf
>
>
> Mobile computing is quickly rising to dominance, and by the end of 2017, it 
> is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
> Spark’s project goal can be accomplished only when Spark runs efficiently for 
> the growing population of mobile users.
> Designed and optimized for modern data centers and Big Data applications, 
> Spark is unfortunately not a good fit for mobile computing today. In the past 
> few months, we have been prototyping the feasibility of a mobile-first Spark 
> architecture, and today we would like to share with you our findings. This 
> ticket outlines the technical design of Spark’s mobile support, and shares 
> results from several early prototypes.
> Mobile friendly version of the design doc: 
> https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5972) Cache residuals for GradientBoostedTrees during training

2015-04-01 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391521#comment-14391521
 ] 

Manoj Kumar commented on SPARK-5972:


[~josephkb] This should be done independently of evaluateEachIteration, right? 
(In the sense that evaluateEachIteration should not be used in the 
GradientBoostedTrees code that does this, i.e. caching the error and 
residuals, since the model has not been trained yet.)
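
For reference, the caching idea under discussion looks roughly like this (an illustrative sketch with hypothetical names, not MLlib's actual implementation):

{code}
// Keep the cumulative prediction per instance and add only the newest tree's
// contribution each iteration, instead of re-evaluating the whole ensemble.
case class Point(features: Array[Double], label: Double)

def boost(data: Seq[Point],
          fitTree: (Seq[Point], Seq[Double]) => (Array[Double] => Double),
          numIterations: Int,
          learningRate: Double): Seq[Double] = {
  var cumPred = Seq.fill(data.length)(0.0)            // cached cumulative predictions
  for (_ <- 1 to numIterations) {
    val residuals = data.zip(cumPred).map { case (p, f) => p.label - f }
    val tree = fitTree(data, residuals)               // train only the newest tree
    cumPred = data.zip(cumPred).map { case (p, f) =>  // incremental update per instance
      f + learningRate * tree(p.features)
    }
  }
  cumPred
}
{code}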



> Cache residuals for GradientBoostedTrees during training
> 
>
> Key: SPARK-5972
> URL: https://issues.apache.org/jira/browse/SPARK-5972
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> In gradient boosting, the current model's prediction is re-computed for each 
> training instance on every iteration.  The current residual (cumulative 
> prediction of previously trained trees in the ensemble) should be cached.  
> That could reduce both computation (only computing the prediction of the most 
> recently trained tree) and communication (only sending the most recently 
> trained tree to the workers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6657) Fix Python doc build warnings

2015-04-01 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-6657:


 Summary: Fix Python doc build warnings
 Key: SPARK-6657
 URL: https://issues.apache.org/jira/browse/SPARK-6657
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib, PySpark, SQL, Streaming
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Trivial


Reported by [~rxin]

{code}
/scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of 
pyspark.mllib.tree.RandomForest.trainClassifier:15: ERROR: Unexpected 
indentation.

/scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of 
pyspark.mllib.tree.RandomForest.trainClassifier:16: WARNING: Block quote ends 
without a blank line; unexpected unindent.

/scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of 
pyspark.mllib.tree.RandomForest.trainClassifier:18: ERROR: Unexpected 
indentation.

/scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of 
pyspark.mllib.tree.RandomForest.trainClassifier:22: WARNING: Definition list 
ends without a blank line; unexpected unindent.

/scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of 
pyspark.mllib.tree.RandomForest.trainClassifier:28: WARNING: Definition list 
ends without a blank line; unexpected unindent.

/scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of 
pyspark.mllib.tree.RandomForest.trainRegressor:13: ERROR: Unexpected 
indentation.

/scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of 
pyspark.mllib.tree.RandomForest.trainRegressor:14: WARNING: Block quote ends 
without a blank line; unexpected unindent.

/scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of 
pyspark.mllib.tree.RandomForest.trainRegressor:16: ERROR: Unexpected 
indentation.

/scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of 
pyspark.mllib.tree.RandomForest.trainRegressor:18: ERROR: Unexpected 
indentation.

/scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of 
pyspark.sql.DataFrame.collect:1: WARNING: Inline interpreted text or phrase 
reference start-string without end-string.

/scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of 
pyspark.sql.DataFrame.orderBy:3: WARNING: Inline interpreted text or phrase 
reference start-string without end-string.

/scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of 
pyspark.sql.DataFrame.sort:3: WARNING: Inline interpreted text or phrase 
reference start-string without end-string.

/scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of 
pyspark.sql.DataFrame.take:1: WARNING: Inline interpreted text or phrase 
reference start-string without end-string.

/scratch/rxin/spark/python/docs/pyspark.streaming.rst:13: WARNING: Title 
underline too short.



pyspark.streaming.kafka module



/scratch/rxin/spark/python/docs/pyspark.streaming.rst:13: WARNING: Title 
underline too short.



pyspark.streaming.kafka module


{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6658) Incorrect DataFrame Documentation Type References

2015-04-01 Thread Chet Mancini (JIRA)
Chet Mancini created SPARK-6658:
---

 Summary: Incorrect DataFrame Documentation Type References
 Key: SPARK-6658
 URL: https://issues.apache.org/jira/browse/SPARK-6658
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SQL
Affects Versions: 1.3.0
Reporter: Chet Mancini


A few methods under DataFrame incorrectly refer to the receiver as an RDD in 
their documentation.

* createJDBCTable
* insertIntoJDBC
* registerTempTable
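
For reference, all three are methods on DataFrame in the 1.3 Scala API, so the receiver in each case is a DataFrame rather than an RDD. A minimal sketch of how they are invoked (the JDBC URL and file path are placeholders, and sc is an existing SparkContext as in the shell):

{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.jsonFile("people.json")   // df is a DataFrame, not an RDD

// The receiver of each call below is the DataFrame `df`.
df.registerTempTable("people")
df.createJDBCTable("jdbc:postgresql://localhost/test", "people", false)  // allowExisting = false
df.insertIntoJDBC("jdbc:postgresql://localhost/test", "people", false)   // overwrite = false
{code}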



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5972) Cache residuals for GradientBoostedTrees during training

2015-04-01 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391536#comment-14391536
 ] 

Joseph K. Bradley commented on SPARK-5972:
--

They should be at least partly separate, in that evaluateEachIteration itself 
will not be used for this. But this JIRA and evaluateEachIteration might be 
able to share some code to avoid duplication.

> Cache residuals for GradientBoostedTrees during training
> 
>
> Key: SPARK-5972
> URL: https://issues.apache.org/jira/browse/SPARK-5972
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> In gradient boosting, the current model's prediction is re-computed for each 
> training instance on every iteration.  The current residual (cumulative 
> prediction of previously trained trees in the ensemble) should be cached.  
> That could reduce both computation (only computing the prediction of the most 
> recently trained tree) and communication (only sending the most recently 
> trained tree to the workers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6658) Incorrect DataFrame Documentation Type References

2015-04-01 Thread Chet Mancini (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chet Mancini updated SPARK-6658:

Priority: Trivial  (was: Major)

> Incorrect DataFrame Documentation Type References
> -
>
> Key: SPARK-6658
> URL: https://issues.apache.org/jira/browse/SPARK-6658
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 1.3.0
>Reporter: Chet Mancini
>Priority: Trivial
>  Labels: docuentation
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> A few methods under DataFrame incorrectly refer to the receiver as an RDD in 
> their documentation.
> * createJDBCTable
> * insertIntoJDBC
> * registerTempTable



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6658) Incorrect DataFrame Documentation Type References

2015-04-01 Thread Chet Mancini (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chet Mancini updated SPARK-6658:

Labels: documentation  (was: docuentation)

> Incorrect DataFrame Documentation Type References
> -
>
> Key: SPARK-6658
> URL: https://issues.apache.org/jira/browse/SPARK-6658
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 1.3.0
>Reporter: Chet Mancini
>Priority: Trivial
>  Labels: documentation
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> A few methods under DataFrame incorrectly refer to the receiver as an RDD in 
> their documentation.
> * createJDBCTable
> * insertIntoJDBC
> * registerTempTable



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6659) Spark SQL 1.3 cannot read a JSON file that contains only a single record.

2015-04-01 Thread luochenghui (JIRA)
luochenghui created SPARK-6659:
--

 Summary: Spark SQL 1.3 cannot read a JSON file that contains only a single record.
 Key: SPARK-6659
 URL: https://issues.apache.org/jira/browse/SPARK-6659
 Project: Spark
  Issue Type: Bug
Reporter: luochenghui


Dear friends:

Spark SQL 1.3 cannot read a JSON file that contains only a single record.
Here is my JSON file's content:
{"name":"milo","age",24}

When I run Spark SQL in local mode, it throws an exception:
org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input 
columns _corrupt_record;

What I did:
1  ./spark-shell
2 
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = 
org.apache.spark.sql.SQLContext@5f3be6c8
 
scala> val df = sqlContext.jsonFile("/home/milo/person.json")
15/03/19 22:11:45 INFO MemoryStore: ensureFreeSpace(163705) called with 
curMem=0, maxMem=280248975
15/03/19 22:11:45 INFO MemoryStore: Block broadcast_0 stored as values in 
memory (estimated size 159.9 KB, free 267.1 MB)
15/03/19 22:11:45 INFO MemoryStore: ensureFreeSpace(22692) called with 
curMem=163705, maxMem=280248975
15/03/19 22:11:45 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in 
memory (estimated size 22.2 KB, free 267.1 MB)
15/03/19 22:11:45 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 
localhost:35842 (size: 22.2 KB, free: 267.2 MB)
15/03/19 22:11:45 INFO BlockManagerMaster: Updated info of block 
broadcast_0_piece0
15/03/19 22:11:45 INFO SparkContext: Created broadcast 0 from textFile at 
JSONRelation.scala:98
15/03/19 22:11:47 INFO FileInputFormat: Total input paths to process : 1
15/03/19 22:11:47 INFO SparkContext: Starting job: reduce at JsonRDD.scala:51
15/03/19 22:11:47 INFO DAGScheduler: Got job 0 (reduce at JsonRDD.scala:51) 
with 1 output partitions (allowLocal=false)
15/03/19 22:11:47 INFO DAGScheduler: Final stage: Stage 0(reduce at 
JsonRDD.scala:51)
15/03/19 22:11:47 INFO DAGScheduler: Parents of final stage: List()
15/03/19 22:11:47 INFO DAGScheduler: Missing parents: List()
15/03/19 22:11:47 INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[3] at 
map at JsonRDD.scala:51), which has no missing parents
15/03/19 22:11:47 INFO MemoryStore: ensureFreeSpace(3184) called with 
curMem=186397, maxMem=280248975
15/03/19 22:11:47 INFO MemoryStore: Block broadcast_1 stored as values in 
memory (estimated size 3.1 KB, free 267.1 MB)
15/03/19 22:11:47 INFO MemoryStore: ensureFreeSpace(2251) called with 
curMem=189581, maxMem=280248975
15/03/19 22:11:47 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in 
memory (estimated size 2.2 KB, free 267.1 MB)
15/03/19 22:11:47 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 
localhost:35842 (size: 2.2 KB, free: 267.2 MB)
15/03/19 22:11:47 INFO BlockManagerMaster: Updated info of block 
broadcast_1_piece0
15/03/19 22:11:47 INFO SparkContext: Created broadcast 1 from broadcast at 
DAGScheduler.scala:839
15/03/19 22:11:48 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 
(MapPartitionsRDD[3] at map at JsonRDD.scala:51)
15/03/19 22:11:48 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/03/19 22:11:48 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
localhost, PROCESS_LOCAL, 1291 bytes)
15/03/19 22:11:48 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/03/19 22:11:48 INFO HadoopRDD: Input split: file:/home/milo/person.json:0+26
15/03/19 22:11:48 INFO deprecation: mapred.tip.id is deprecated. Instead, use 
mapreduce.task.id
15/03/19 22:11:48 INFO deprecation: mapred.task.id is deprecated. Instead, use 
mapreduce.task.attempt.id
15/03/19 22:11:48 INFO deprecation: mapred.task.is.map is deprecated. Instead, 
use mapreduce.task.ismap
15/03/19 22:11:48 INFO deprecation: mapred.task.partition is deprecated. 
Instead, use mapreduce.task.partition
15/03/19 22:11:48 INFO deprecation: mapred.job.id is deprecated. Instead, use 
mapreduce.job.id
15/03/19 22:11:49 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2023 
bytes result sent to driver
15/03/19 22:11:49 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) 
in 1209 ms on localhost (1/1)
15/03/19 22:11:49 INFO DAGScheduler: Stage 0 (reduce at JsonRDD.scala:51) 
finished in 1.308 s
15/03/19 22:11:49 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have 
all completed, from pool 
15/03/19 22:11:49 INFO DAGScheduler: Job 0 finished: reduce at 
JsonRDD.scala:51, took 2.002429 s
df: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
 
3  
scala> df.select("name").show()
15/03/19 22:12:44 INFO BlockManager: Removing broadcast 1
15/03/19 22:12:44 INFO BlockManager: Removing block broadcast_1_piece0
15/03/19 22:12:44 INFO MemoryStore: Block broadcast_1_piece0 of size 2251 
dropped from memory (free 280059394)
15/03/19 22:12:44 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 
localhost:35842 in memory (s
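
A note on the report above: the sample line {"name":"milo","age",24} does not appear to be valid JSON, because the age field uses a comma where a colon is expected; that is why jsonFile infers the schema [_corrupt_record: string] and the later select("name") fails. A minimal sketch of the same steps with a well-formed one-record file (the path is the reporter's; the file contents are corrected):

{code}
// /home/milo/person.json contains a single line:
// {"name":"milo","age":24}

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.jsonFile("/home/milo/person.json")

df.printSchema()          // root
                          //  |-- age: long (nullable = true)
                          //  |-- name: string (nullable = true)
df.select("name").show()  // should print the single row "milo" instead of
                          // failing with "cannot resolve 'name'"
{code}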

[jira] [Assigned] (SPARK-6658) Incorrect DataFrame Documentation Type References

2015-04-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6658:
---

Assignee: Apache Spark

> Incorrect DataFrame Documentation Type References
> -
>
> Key: SPARK-6658
> URL: https://issues.apache.org/jira/browse/SPARK-6658
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 1.3.0
>Reporter: Chet Mancini
>Assignee: Apache Spark
>Priority: Trivial
>  Labels: documentation
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> A few methods under DataFrame incorrectly refer to the receiver as an RDD in 
> their documentation.
> * createJDBCTable
> * insertIntoJDBC
> * registerTempTable



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6658) Incorrect DataFrame Documentation Type References

2015-04-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391563#comment-14391563
 ] 

Apache Spark commented on SPARK-6658:
-

User 'chetmancini' has created a pull request for this issue:
https://github.com/apache/spark/pull/5316

> Incorrect DataFrame Documentation Type References
> -
>
> Key: SPARK-6658
> URL: https://issues.apache.org/jira/browse/SPARK-6658
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 1.3.0
>Reporter: Chet Mancini
>Priority: Trivial
>  Labels: documentation
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> A few methods under DataFrame incorrectly refer to the receiver as an RDD in 
> their documentation.
> * createJDBCTable
> * insertIntoJDBC
> * registerTempTable



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6658) Incorrect DataFrame Documentation Type References

2015-04-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6658:
---

Assignee: (was: Apache Spark)

> Incorrect DataFrame Documentation Type References
> -
>
> Key: SPARK-6658
> URL: https://issues.apache.org/jira/browse/SPARK-6658
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 1.3.0
>Reporter: Chet Mancini
>Priority: Trivial
>  Labels: documentation
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> A few methods under DataFrame incorrectly refer to the receiver as an RDD in 
> their documentation.
> * createJDBCTable
> * insertIntoJDBC
> * registerTempTable



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5989) Model import/export for LDAModel

2015-04-01 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390209#comment-14390209
 ] 

Manoj Kumar edited comment on SPARK-5989 at 4/1/15 10:04 PM:
-

[~josephkb] Can this be assigned to me? Thanks!


was (Author: mechcoder):
Can this be assigned to me? Thanks!

> Model import/export for LDAModel
> 
>
> Key: SPARK-5989
> URL: https://issues.apache.org/jira/browse/SPARK-5989
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> Add save/load for LDAModel and its local and distributed variants.
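
For context, MLlib 1.3 model import/export follows a save(sc, path) / load(sc, path) convention (the Saveable and Loader traits). A minimal sketch with a model that already supports it in 1.3, NaiveBayesModel, illustrating the pattern this sub-task would extend to LDAModel's local and distributed variants (the toy training data and output path are placeholders):

{code}
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Tiny placeholder training set; sc is an existing SparkContext.
val training = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(1.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 1.0))))

val model = NaiveBayes.train(training)

// Persist the model, then reload it (possibly in a different application).
model.save(sc, "/tmp/naive-bayes-model")
val sameModel = NaiveBayesModel.load(sc, "/tmp/naive-bayes-model")
{code}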



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6657) Fix Python doc build warnings

2015-04-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6657:
---

Assignee: Apache Spark  (was: Joseph K. Bradley)

> Fix Python doc build warnings
> -
>
> Key: SPARK-6657
> URL: https://issues.apache.org/jira/browse/SPARK-6657
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib, PySpark, SQL, Streaming
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Trivial
>
> Reported by [~rxin]
> {code}
> /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of 
> pyspark.mllib.tree.RandomForest.trainClassifier:15: ERROR: Unexpected 
> indentation.
> /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of 
> pyspark.mllib.tree.RandomForest.trainClassifier:16: WARNING: Block quote ends 
> without a blank line; unexpected unindent.
> /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of 
> pyspark.mllib.tree.RandomForest.trainClassifier:18: ERROR: Unexpected 
> indentation.
> /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of 
> pyspark.mllib.tree.RandomForest.trainClassifier:22: WARNING: Definition list 
> ends without a blank line; unexpected unindent.
> /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of 
> pyspark.mllib.tree.RandomForest.trainClassifier:28: WARNING: Definition list 
> ends without a blank line; unexpected unindent.
> /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of 
> pyspark.mllib.tree.RandomForest.trainRegressor:13: ERROR: Unexpected 
> indentation.
> /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of 
> pyspark.mllib.tree.RandomForest.trainRegressor:14: WARNING: Block quote ends 
> without a blank line; unexpected unindent.
> /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of 
> pyspark.mllib.tree.RandomForest.trainRegressor:16: ERROR: Unexpected 
> indentation.
> /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of 
> pyspark.mllib.tree.RandomForest.trainRegressor:18: ERROR: Unexpected 
> indentation.
> /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of 
> pyspark.sql.DataFrame.collect:1: WARNING: Inline interpreted text or phrase 
> reference start-string without end-string.
> /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of 
> pyspark.sql.DataFrame.orderBy:3: WARNING: Inline interpreted text or phrase 
> reference start-string without end-string.
> /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of 
> pyspark.sql.DataFrame.sort:3: WARNING: Inline interpreted text or phrase 
> reference start-string without end-string.
> /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of 
> pyspark.sql.DataFrame.take:1: WARNING: Inline interpreted text or phrase 
> reference start-string without end-string.
> /scratch/rxin/spark/python/docs/pyspark.streaming.rst:13: WARNING: Title 
> underline too short.
> pyspark.streaming.kafka module
> 
> /scratch/rxin/spark/python/docs/pyspark.streaming.rst:13: WARNING: Title 
> underline too short.
> pyspark.streaming.kafka module
> 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


