[jira] [Updated] (SPARK-11637) Alias do not work with udf with * parameter

2015-11-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11637:
--
Component/s: SQL

> Alias do not work with udf with * parameter
> ---
>
> Key: SPARK-11637
> URL: https://issues.apache.org/jira/browse/SPARK-11637
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1, 1.5.2
>Reporter: Pierre Borckmans
>
> In Spark < 1.5.0, this used to work:
> {code:java|title=Spark <1.5.0|borderStyle=solid}
> scala> sqlContext.sql("select hash(*) as x from T")
> res2: org.apache.spark.sql.DataFrame = [x: int]
> {code}
> From Spark 1.5.0+, it fails:
> {code:java|title=Spark>=1.5.0|borderStyle=solid}
> scala> sqlContext.sql("select hash(*) as x from T")
> org.apache.spark.sql.AnalysisException: unresolved operator 'Project 
> ['hash(*) AS x#1];
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
> ...
> {code}
> This is not specific to the `hash` UDF; it also applies to user-defined 
> functions.
> The `*` argument seems to be the issue.
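> A minimal sketch of the same failure with a user-defined function (the table 
> `T` and its string columns `a` and `b` are assumptions made for illustration):
> {code:java|title=UDF with *|borderStyle=solid}
> sqlContext.udf.register("myConcat", (a: String, b: String) => a + b)
> // Explicit columns resolve fine:
> sqlContext.sql("select myConcat(a, b) as x from T")
> // Star expansion fails on 1.5.x with the same unresolved 'Project error:
> sqlContext.sql("select myConcat(*) as x from T")
> {code}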






[jira] [Created] (SPARK-11661) We should still pushdown filters returned by a data source's unhandledFilters

2015-11-11 Thread Yin Huai (JIRA)
Yin Huai created SPARK-11661:


 Summary: We should still pushdown filters returned by a data 
source's unhandledFilters
 Key: SPARK-11661
 URL: https://issues.apache.org/jira/browse/SPARK-11661
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Priority: Blocker


We added the unhandledFilters interface in SPARK-10978. It gives a data source a 
chance to tell Spark SQL that it may not apply the returned filters to every row, 
so Spark SQL should use a Filter operator to evaluate those filters. However, even 
if a filter is part of the returned unhandledFilters, we should still push it down. 
For example, our internal data sources do not override this method; if we did not 
push down those filters, we would effectively be turning off the filter pushdown 
feature.
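For context, a sketch of the data source side (BaseRelation, PrunedFilteredScan, 
and the unhandledFilters signature are the real interfaces; the concrete relation 
below is hypothetical):
{code:java|title=unhandledFilters sketch|borderStyle=solid}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical relation that promises to fully evaluate only EqualTo filters;
// everything else is reported back as unhandled, so Spark SQL re-checks it
// with a Filter operator.
class MyRelation(override val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = StructType(StructField("a", IntegerType) :: Nil)

  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot(_.isInstanceOf[EqualTo])

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row(1), Row(2)))
}
{code}
Whether or not a filter appears in the unhandledFilters result should only decide 
whether Spark SQL adds its own Filter operator on top; the filters should still be 
passed down to buildScan.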






[jira] [Updated] (SPARK-11652) Remote code execution with InvokerTransformer

2015-11-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11652:
--
Component/s: Spark Core

> Remote code execution with InvokerTransformer
> -
>
> Key: SPARK-11652
> URL: https://issues.apache.org/jira/browse/SPARK-11652
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Daniel Darabos
>Priority: Minor
>
> There is a remote code execution vulnerability in the Apache Commons 
> collections library (https://issues.apache.org/jira/browse/COLLECTIONS-580) 
> that can be exploited simply by causing malicious data to be deserialized 
> using Java serialization.
> As Spark is used in security-conscious environments I think it's worth taking 
> a closer look at how the vulnerability affects Spark. What are the points 
> where Spark deserializes external data? Which are affected by using Kryo 
> instead of Java serialization? What mitigation strategies are available?
> If the issue is serious enough but mitigation is possible, it may be useful 
> to post about it on the mailing list or blog.
> Thanks!
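> As one data point for the mitigation question: switching data serialization to 
> Kryo is straightforward to try (a sketch; note that closure serialization and 
> some internal paths still use Java serialization, so this is not a complete 
> answer):
> {code:java|title=Kryo data serializer sketch|borderStyle=solid}
> import org.apache.spark.{SparkConf, SparkContext}
>
> val conf = new SparkConf()
>   .setAppName("kryo-serialization-sketch")
>   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>   // Fail fast on classes that were not explicitly registered:
>   .set("spark.kryo.registrationRequired", "true")
> val sc = new SparkContext(conf)
> {code}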






[jira] [Updated] (SPARK-11643) inserting date with leading zero inserts null example '0001-12-10'

2015-11-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11643:
--
Component/s: SQL

> inserting date with leading zero inserts null example '0001-12-10'
> --
>
> Key: SPARK-11643
> URL: https://issues.apache.org/jira/browse/SPARK-11643
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Chip Sands
>
> Inserting a date with a leading zero inserts a null value, for example 
> '0001-12-10'.
> This worked until 1.5/1.5.1.
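> One way to probe the behavior from the shell (a sketch; the original report is 
> about inserts, this cast is only a guess at a smaller reproduction):
> {code:java|title=Date with year 0001|borderStyle=solid}
> // Dates like '0001-12-10' reportedly end up as null from 1.5/1.5.1 onwards:
> sqlContext.sql("select cast('0001-12-10' as date) as d").show()
> {code}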






[jira] [Commented] (SPARK-11655) SparkLauncherBackendSuite leaks child processes

2015-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000856#comment-15000856
 ] 

Apache Spark commented on SPARK-11655:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/9633

> SparkLauncherBackendSuite leaks child processes
> ---
>
> Key: SPARK-11655
> URL: https://issues.apache.org/jira/browse/SPARK-11655
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.6.0
>Reporter: Josh Rosen
>Priority: Blocker
> Attachments: month_of_doom.png, screenshot-1.png, year_or_doom.png
>
>
> We've been combatting an orphaned process issue on AMPLab Jenkins since 
> October and I finally was able to dig in and figure out what's going on.
> After some sleuthing and working around OS limits and JDK bugs, I was able to 
> get the full launch commands for the hanging orphaned processes. It looks 
> like they're all running spark-submit:
> {code}
> org.apache.spark.deploy.SparkSubmit --master local-cluster[1,1,1024] --conf 
> spark.driver.extraClassPath=/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/test-classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/launcher/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/common/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/shuffle/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/unsafe/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/tags/target/scala-2.10/
>  -Xms1g -Xmx1g -Dtest.appender=console -XX:MaxPermSize=256m
> {code}
> Based on the output of some Ganglia graphs, I was able to figure out that 
> these leaks started around October 9.
>  !screenshot-1.png|thumbnail! 
> This roughly lines up with when https://github.com/apache/spark/pull/7052 was 
> merged, which added LauncherBackendSuite. The launch arguments used in this 
> suite seem to line up with the arguments that I observe in the hanging 
> processes' {{jps}} output: 
> https://github.com/apache/spark/blame/1bc41125ee6306e627be212969854f639969c440/core/src/test/scala/org/apache/spark/launcher/LauncherBackendSuite.scala#L46
> Interestingly, Jenkins doesn't show test timing or output for this suite! I 
> think that what might be happening is that we have a mixed Scala/Java 
> package, so maybe the two test runner XML files aren't being merged properly: 
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.5-SBT/746/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/testReport/org.apache.spark.launcher/
> Whenever I try running this suite locally, it looks like it ends up creating 
> a zombie SparkSubmit process! I think that what's happening is that the 
> launcher's {{handle.kill()}} call ends up destroying the bash 
> {{spark-submit}} subprocess such that its child process (a JVM) leaks.
> I think that we'll have to do something similar to what we do in PySpark when 
> launching a child JVM from a Python / Bash process: connect it to a socket or 
> stream such that it can detect its parent's death and clean up after itself 
> appropriately.
> /cc [~shaneknapp] and [~vanzin].
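> For illustration, the kind of parent-death detection described above might look 
> roughly like this in the child JVM (a sketch, not the PySpark or launcher code):
> {code}
> // The parent keeps the child's stdin open; when the parent dies, stdin hits
> // EOF and the child tears itself down instead of becoming a zombie.
> val watchdog = new Thread(new Runnable {
>   override def run(): Unit = {
>     while (System.in.read() != -1) { /* discard heartbeat bytes */ }
>     System.exit(1)
>   }
> })
> watchdog.setDaemon(true)
> watchdog.start()
> {code}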






[jira] [Assigned] (SPARK-11655) SparkLauncherBackendSuite leaks child processes

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11655:


Assignee: Apache Spark

> SparkLauncherBackendSuite leaks child processes
> ---
>
> Key: SPARK-11655
> URL: https://issues.apache.org/jira/browse/SPARK-11655
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.6.0
>Reporter: Josh Rosen
>Assignee: Apache Spark
>Priority: Blocker
> Attachments: month_of_doom.png, screenshot-1.png, year_or_doom.png
>
>
> We've been combatting an orphaned process issue on AMPLab Jenkins since 
> October and I finally was able to dig in and figure out what's going on.
> After some sleuthing and working around OS limits and JDK bugs, I was able to 
> get the full launch commands for the hanging orphaned processes. It looks 
> like they're all running spark-submit:
> {code}
> org.apache.spark.deploy.SparkSubmit --master local-cluster[1,1,1024] --conf 
> spark.driver.extraClassPath=/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/test-classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/launcher/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/common/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/shuffle/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/unsafe/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/tags/target/scala-2.10/
>  -Xms1g -Xmx1g -Dtest.appender=console -XX:MaxPermSize=256m
> {code}
> Based on the output of some Ganglia graphs, I was able to figure out that 
> these leaks started around October 9.
>  !screenshot-1.png|thumbnail! 
> This roughly lines up with when https://github.com/apache/spark/pull/7052 was 
> merged, which added LauncherBackendSuite. The launch arguments used in this 
> suite seem to line up with the arguments that I observe in the hanging 
> processes' {{jps}} output: 
> https://github.com/apache/spark/blame/1bc41125ee6306e627be212969854f639969c440/core/src/test/scala/org/apache/spark/launcher/LauncherBackendSuite.scala#L46
> Interestingly, Jenkins doesn't show test timing or output for this suite! I 
> think that what might be happening is that we have a mixed Scala/Java 
> package, so maybe the two test runner XML files aren't being merged properly: 
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.5-SBT/746/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/testReport/org.apache.spark.launcher/
> Whenever I try running this suite locally, it looks like it ends up creating 
> a zombie SparkSubmit process! I think that what's happening is that the 
> launcher's {{handle.kill()}} call ends up destroying the bash 
> {{spark-submit}} subprocess such that its child process (a JVM) leaks.
> I think that we'll have to do something similar to what we do in PySpark when 
> launching a child JVM from a Python / Bash process: connect it to a socket or 
> stream such that it can detect its parent's death and clean up after itself 
> appropriately.
> /cc [~shaneknapp] and [~vanzin].






[jira] [Assigned] (SPARK-11655) SparkLauncherBackendSuite leaks child processes

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11655:


Assignee: (was: Apache Spark)

> SparkLauncherBackendSuite leaks child processes
> ---
>
> Key: SPARK-11655
> URL: https://issues.apache.org/jira/browse/SPARK-11655
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.6.0
>Reporter: Josh Rosen
>Priority: Blocker
> Attachments: month_of_doom.png, screenshot-1.png, year_or_doom.png
>
>
> We've been combatting an orphaned process issue on AMPLab Jenkins since 
> October and I finally was able to dig in and figure out what's going on.
> After some sleuthing and working around OS limits and JDK bugs, I was able to 
> get the full launch commands for the hanging orphaned processes. It looks 
> like they're all running spark-submit:
> {code}
> org.apache.spark.deploy.SparkSubmit --master local-cluster[1,1,1024] --conf 
> spark.driver.extraClassPath=/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/test-classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/launcher/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/common/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/shuffle/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/unsafe/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/tags/target/scala-2.10/
>  -Xms1g -Xmx1g -Dtest.appender=console -XX:MaxPermSize=256m
> {code}
> Based on the output of some Ganglia graphs, I was able to figure out that 
> these leaks started around October 9.
>  !screenshot-1.png|thumbnail! 
> This roughly lines up with when https://github.com/apache/spark/pull/7052 was 
> merged, which added LauncherBackendSuite. The launch arguments used in this 
> suite seem to line up with the arguments that I observe in the hanging 
> processes' {{jps}} output: 
> https://github.com/apache/spark/blame/1bc41125ee6306e627be212969854f639969c440/core/src/test/scala/org/apache/spark/launcher/LauncherBackendSuite.scala#L46
> Interestingly, Jenkins doesn't show test timing or output for this suite! I 
> think that what might be happening is that we have a mixed Scala/Java 
> package, so maybe the two test runner XML files aren't being merged properly: 
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.5-SBT/746/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/testReport/org.apache.spark.launcher/
> Whenever I try running this suite locally, it looks like it ends up creating 
> a zombie SparkSubmit process! I think that what's happening is that the 
> launcher's {{handle.kill()}} call ends up destroying the bash 
> {{spark-submit}} subprocess such that its child process (a JVM) leaks.
> I think that we'll have to do something similar to what we do in PySpark when 
> launching a child JVM from a Python / Bash process: connect it to a socket or 
> stream such that it can detect its parent's death and clean up after itself 
> appropriately.
> /cc [~shaneknapp] and [~vanzin].






[jira] [Resolved] (SPARK-11642) networkcount is not working using my stream source

2015-11-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11642.
---
Resolution: Invalid

Please use u...@spark.apache.org for questions rather than JIRA.

> networkcount is not working using my stream source
> --
>
> Key: SPARK-11642
> URL: https://issues.apache.org/jira/browse/SPARK-11642
> Project: Spark
>  Issue Type: Question
>Reporter: Amir Rahnama
>
> My small Node.js stream source writes some words to socket port 8000, but the 
> Spark Streaming example NetworkWordCount is not reading that data. Can you guys 
> help me?
> {code}
> var net = require('net'),
>     EXEC_INTERVAL = 1000,
>     words = ['I', 'You', 'He', 'She', 'It', 'We', 'You', 'They'],
>     server = net.createServer(function(socket) {
>       setInterval(function() {
>         var index = Math.floor((Math.random() * 7) + 1);
>         socket.write(words[index] + ' ' + words[index - 1]);
>       }, EXEC_INTERVAL);
>     }).listen(8000);
> {code}
> This is the stream source; `nc localhost 8000` shows the output of the socket 
> writes, but Spark is not getting it.
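> For reference, a sketch of the receiving side (the standard NetworkWordCount 
> pattern). One thing to check: {{socketTextStream}} splits incoming text on 
> newlines, so each record written to the socket needs a trailing '\n' before it 
> is seen as a line:
> {code:java|title=NetworkWordCount sketch|borderStyle=solid}
> import org.apache.spark.SparkConf
> import org.apache.spark.streaming.{Seconds, StreamingContext}
>
> val conf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
> val ssc = new StreamingContext(conf, Seconds(1))
> // Emits one record per '\n'-terminated line received on the socket:
> val lines = ssc.socketTextStream("localhost", 8000)
> lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).print()
> ssc.start()
> ssc.awaitTermination()
> {code}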






[jira] [Updated] (SPARK-11660) Spark Thrift GetResultSetMetadata describes a VARCHAR as a STRING

2015-11-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11660:
--
Component/s: SQL

> Spark Thrift GetResultSetMetadata describes a VARCHAR as a STRING
> -
>
> Key: SPARK-11660
> URL: https://issues.apache.org/jira/browse/SPARK-11660
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.5.0
>Reporter: Chip Sands
>
> In the Spark SQL Thrift interface, the GetResultSetMetadata reply packet that 
> describes the result set metadata reports a column defined as VARCHAR in the 
> database as the native type STRING. Data is still returned correctly as the 
> Thrift string type, but ODBC/JDBC cannot correctly describe the data type being 
> returned or its defined maximum length.
> FYI, Hive returns it correctly.
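> A sketch of how this shows up through JDBC (the connection URL, table {{t}}, and 
> column {{varchar_col}} below are assumptions made for illustration):
> {code:java|title=JDBC metadata check sketch|borderStyle=solid}
> import java.sql.DriverManager
>
> Class.forName("org.apache.hive.jdbc.HiveDriver")
> val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default")
> val rs = conn.createStatement().executeQuery("select varchar_col from t limit 1")
> val md = rs.getMetaData
> // Spark's Thrift server reports STRING here, while Hive reports VARCHAR
> // together with the defined maximum length:
> println(md.getColumnTypeName(1) + ", precision = " + md.getPrecision(1))
> conn.close()
> {code}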






[jira] [Commented] (SPARK-11652) Remote code execution with InvokerTransformer

2015-11-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000858#comment-15000858
 ] 

Sean Owen commented on SPARK-11652:
---

I may be missing some point, but Spark isn't generally consuming serialized data 
from untrusted sources, right? If untrusted sources can send closures to your 
cluster, this risk is way down the list.

> Remote code execution with InvokerTransformer
> -
>
> Key: SPARK-11652
> URL: https://issues.apache.org/jira/browse/SPARK-11652
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Daniel Darabos
>Priority: Minor
>
> There is a remote code execution vulnerability in the Apache Commons 
> collections library (https://issues.apache.org/jira/browse/COLLECTIONS-580) 
> that can be exploited simply by causing malicious data to be deserialized 
> using Java serialization.
> As Spark is used in security-conscious environments I think it's worth taking 
> a closer look at how the vulnerability affects Spark. What are the points 
> where Spark deserializes external data? Which are affected by using Kryo 
> instead of Java serialization? What mitigation strategies are available?
> If the issue is serious enough but mitigation is possible, it may be useful 
> to post about it on the mailing list or blog.
> Thanks!






[jira] [Assigned] (SPARK-11661) We should still pushdown filters returned by a data source's unhandledFilters

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11661:


Assignee: Apache Spark

> We should still pushdown filters returned by a data source's unhandledFilters
> -
>
> Key: SPARK-11661
> URL: https://issues.apache.org/jira/browse/SPARK-11661
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>Priority: Blocker
>
> We added the unhandledFilters interface in SPARK-10978. It gives a data source a 
> chance to tell Spark SQL that it may not apply the returned filters to every row, 
> so Spark SQL should use a Filter operator to evaluate those filters. However, 
> even if a filter is part of the returned unhandledFilters, we should still push 
> it down. For example, our internal data sources do not override this method; if 
> we did not push down those filters, we would effectively be turning off the 
> filter pushdown feature.






[jira] [Commented] (SPARK-11661) We should still pushdown filters returned by a data source's unhandledFilters

2015-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000882#comment-15000882
 ] 

Apache Spark commented on SPARK-11661:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/9634

> We should still pushdown filters returned by a data source's unhandledFilters
> -
>
> Key: SPARK-11661
> URL: https://issues.apache.org/jira/browse/SPARK-11661
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
>
> We added the unhandledFilters interface in SPARK-10978. It gives a data source a 
> chance to tell Spark SQL that it may not apply the returned filters to every row, 
> so Spark SQL should use a Filter operator to evaluate those filters. However, 
> even if a filter is part of the returned unhandledFilters, we should still push 
> it down. For example, our internal data sources do not override this method; if 
> we did not push down those filters, we would effectively be turning off the 
> filter pushdown feature.






[jira] [Assigned] (SPARK-11661) We should still pushdown filters returned by a data source's unhandledFilters

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11661:


Assignee: (was: Apache Spark)

> We should still pushdown filters returned by a data source's unhandledFilters
> -
>
> Key: SPARK-11661
> URL: https://issues.apache.org/jira/browse/SPARK-11661
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
>
> We added the unhandledFilters interface in SPARK-10978. It gives a data source a 
> chance to tell Spark SQL that it may not apply the returned filters to every row, 
> so Spark SQL should use a Filter operator to evaluate those filters. However, 
> even if a filter is part of the returned unhandledFilters, we should still push 
> it down. For example, our internal data sources do not override this method; if 
> we did not push down those filters, we would effectively be turning off the 
> filter pushdown feature.






[jira] [Resolved] (SPARK-6152) Spark does not support Java 8 compiled Scala classes

2015-11-11 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-6152.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9512
[https://github.com/apache/spark/pull/9512]

> Spark does not support Java 8 compiled Scala classes
> 
>
> Key: SPARK-6152
> URL: https://issues.apache.org/jira/browse/SPARK-6152
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
> Environment: Java 8+
> Scala 2.11
>Reporter: Ronald Chen
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.6.0
>
>
> Spark uses reflectasm to check Scala closures, which fails if the *user 
> defined Scala closures* are compiled to the Java 8 class-file version.
> The cause is that reflectasm does not support Java 8:
> https://github.com/EsotericSoftware/reflectasm/issues/35
> Workaround:
> Don't compile Scala classes to Java 8; Scala 2.11 neither supports nor requires 
> any Java 8 features.
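> For example, with sbt the workaround boils down to targeting an older class-file 
> version (a sketch; assumes an sbt build on Scala 2.11, where -target:jvm-1.7 is 
> available):
> {code:java|title=build.sbt sketch|borderStyle=solid}
> // Emit Java 7 class files even when building on a Java 8 JDK, so that
> // reflectasm can still read the generated closure classes.
> scalacOptions += "-target:jvm-1.7"
> javacOptions ++= Seq("-source", "1.7", "-target", "1.7")
> {code}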
> Stack trace:
> {code}
> java.lang.IllegalArgumentException
>   at 
> com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.<init>(Unknown
>  Source)
>   at 
> com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.<init>(Unknown
>  Source)
>   at 
> com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.<init>(Unknown
>  Source)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$getClassReader(ClosureCleaner.scala:41)
>   at 
> org.apache.spark.util.ClosureCleaner$.getInnerClasses(ClosureCleaner.scala:84)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:107)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:1478)
>   at org.apache.spark.rdd.RDD.map(RDD.scala:288)
>   at ...my Scala 2.11 compiled to Java 8 code calling into spark
> {code}






[jira] [Commented] (SPARK-11660) Spark Thrift GetResultSetMetadata describes a VARCHAR as a STRING

2015-11-11 Thread Huaxin Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000925#comment-15000925
 ] 

Huaxin Gao commented on SPARK-11660:


I would like to work on this issue. 

> Spark Thrift GetResultSetMetadata describes a VARCHAR as a STRING
> -
>
> Key: SPARK-11660
> URL: https://issues.apache.org/jira/browse/SPARK-11660
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.5.0
>Reporter: Chip Sands
>
> In the Spark SQL Thrift interface, the GetResultSetMetadata reply packet that 
> describes the result set metadata reports a column defined as VARCHAR in the 
> database as the native type STRING. Data is still returned correctly as the 
> Thrift string type, but ODBC/JDBC cannot correctly describe the data type being 
> returned or its defined maximum length.
> FYI, Hive returns it correctly.






[jira] [Resolved] (SPARK-11639) Flaky test: BatchedWriteAheadLog - name log with aggregated entries with the timestamp of last entry

2015-11-11 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-11639.
---
   Resolution: Fixed
 Assignee: Burak Yavuz
Fix Version/s: 1.6.0

> Flaky test: BatchedWriteAheadLog - name log with aggregated entries with the 
> timestamp of last entry
> 
>
> Key: SPARK-11639
> URL: https://issues.apache.org/jira/browse/SPARK-11639
> Project: Spark
>  Issue Type: Test
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>  Labels: flaky, flaky-test
> Fix For: 1.6.0
>
>
> I added this test yesterday, and it has started showing flakiness.






[jira] [Created] (SPARK-11662) Call startExecutorDelegationTokenRenewer() ahead of client app submission

2015-11-11 Thread Ted Yu (JIRA)
Ted Yu created SPARK-11662:
--

 Summary: Call startExecutorDelegationTokenRenewer() ahead of 
client app submission
 Key: SPARK-11662
 URL: https://issues.apache.org/jira/browse/SPARK-11662
 Project: Spark
  Issue Type: Bug
Reporter: Ted Yu


As reported in the thread 'Creating new Spark context when running in Secure 
YARN fails', an IOException may be thrown when a SparkContext is stopped and 
started again while working with a secure YARN cluster:
{code}
15/11/11 10:19:53 ERROR spark.SparkContext: Error initializing SparkContext.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token
can be issued only with kerberos or web authentication
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:6638)
at
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcServer.java:563)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:987)
at
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)

at org.apache.hadoop.ipc.Client.call(Client.java:1476)
at org.apache.hadoop.ipc.Client.call(Client.java:1407)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy12.getDelegationToken(Unknown Source)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getDelegationToken(ClientNamenodeProtocolTranslatorPB.java:933)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy13.getDelegationToken(Unknown Source)
at
org.apache.hadoop.hdfs.DFSClient.getDelegationToken(DFSClient.java:1044)
at
org.apache.hadoop.hdfs.DistributedFileSystem.getDelegationToken(DistributedFileSystem.java:1543)
at
org.apache.hadoop.fs.FileSystem.collectDelegationTokens(FileSystem.java:530)
at
org.apache.hadoop.fs.FileSystem.addDelegationTokens(FileSystem.java:508)
at
org.apache.hadoop.hdfs.DistributedFileSystem.addDelegationTokens(DistributedFileSystem.java:2228)
at
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$obtainTokensForNamenodes$1.apply(YarnSparkHadoopUtil.scala:126)
at
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$obtainTokensForNamenodes$1.apply(YarnSparkHadoopUtil.scala:123)
at scala.collection.immutable.Set$Set1.foreach(Set.scala:74)
at
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.obtainTokensForNamenodes(YarnSparkHadoopUtil.scala:123)
at
org.apache.spark.deploy.yarn.Client.getTokenRenewalInterval(Client.scala:495)
at
org.apache.spark.deploy.yarn.Client.setupLaunchEnv(Client.scala:528)
at
org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:628)
at
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
at
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
at
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:523)
{code}
One fix is to call startExecutorDelegationTokenRenewer(conf) ahead of client 
app submission.






[jira] [Updated] (SPARK-11662) Call startExecutorDelegationTokenRenewer() ahead of client app submission

2015-11-11 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated SPARK-11662:
---
Component/s: YARN

> Call startExecutorDelegationTokenRenewer() ahead of client app submission
> -
>
> Key: SPARK-11662
> URL: https://issues.apache.org/jira/browse/SPARK-11662
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Ted Yu
>
> As reported in the thread 'Creating new Spark context when running in Secure 
> YARN fails', an IOException may be thrown when a SparkContext is stopped and 
> started again while working with a secure YARN cluster:
> {code}
> 15/11/11 10:19:53 ERROR spark.SparkContext: Error initializing SparkContext.
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token
> can be issued only with kerberos or web authentication
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:6638)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcServer.java:563)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:987)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> at org.apache.hadoop.ipc.Client.call(Client.java:1476)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy12.getDelegationToken(Unknown Source)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getDelegationToken(ClientNamenodeProtocolTranslatorPB.java:933)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy13.getDelegationToken(Unknown Source)
> at
> org.apache.hadoop.hdfs.DFSClient.getDelegationToken(DFSClient.java:1044)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.getDelegationToken(DistributedFileSystem.java:1543)
> at
> org.apache.hadoop.fs.FileSystem.collectDelegationTokens(FileSystem.java:530)
> at
> org.apache.hadoop.fs.FileSystem.addDelegationTokens(FileSystem.java:508)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.addDelegationTokens(DistributedFileSystem.java:2228)
> at
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$obtainTokensForNamenodes$1.apply(YarnSparkHadoopUtil.scala:126)
> at
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$obtainTokensForNamenodes$1.apply(YarnSparkHadoopUtil.scala:123)
> at scala.collection.immutable.Set$Set1.foreach(Set.scala:74)
> at
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.obtainTokensForNamenodes(YarnSparkHadoopUtil.scala:123)
> at
> org.apache.spark.deploy.yarn.Client.getTokenRenewalInterval(Client.scala:495)
> at
> org.apache.spark.deploy.yarn.Client.setupLaunchEnv(Client.scala:528)
> at
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:628)
> at
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
> at
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
> at
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:523)
> {code}
> One fix is to call startExecutorDelegationTokenRenewer(conf) ahead of client 
> app submission.




[jira] [Assigned] (SPARK-11662) Call startExecutorDelegationTokenRenewer() ahead of client app submission

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11662:


Assignee: Apache Spark

> Call startExecutorDelegationTokenRenewer() ahead of client app submission
> -
>
> Key: SPARK-11662
> URL: https://issues.apache.org/jira/browse/SPARK-11662
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Ted Yu
>Assignee: Apache Spark
>
> As reported in the thread 'Creating new Spark context when running in Secure 
> YARN fails', an IOException may be thrown when a SparkContext is stopped and 
> started again while working with a secure YARN cluster:
> {code}
> 15/11/11 10:19:53 ERROR spark.SparkContext: Error initializing SparkContext.
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token
> can be issued only with kerberos or web authentication
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:6638)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcServer.java:563)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:987)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> at org.apache.hadoop.ipc.Client.call(Client.java:1476)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy12.getDelegationToken(Unknown Source)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getDelegationToken(ClientNamenodeProtocolTranslatorPB.java:933)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy13.getDelegationToken(Unknown Source)
> at
> org.apache.hadoop.hdfs.DFSClient.getDelegationToken(DFSClient.java:1044)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.getDelegationToken(DistributedFileSystem.java:1543)
> at
> org.apache.hadoop.fs.FileSystem.collectDelegationTokens(FileSystem.java:530)
> at
> org.apache.hadoop.fs.FileSystem.addDelegationTokens(FileSystem.java:508)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.addDelegationTokens(DistributedFileSystem.java:2228)
> at
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$obtainTokensForNamenodes$1.apply(YarnSparkHadoopUtil.scala:126)
> at
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$obtainTokensForNamenodes$1.apply(YarnSparkHadoopUtil.scala:123)
> at scala.collection.immutable.Set$Set1.foreach(Set.scala:74)
> at
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.obtainTokensForNamenodes(YarnSparkHadoopUtil.scala:123)
> at
> org.apache.spark.deploy.yarn.Client.getTokenRenewalInterval(Client.scala:495)
> at
> org.apache.spark.deploy.yarn.Client.setupLaunchEnv(Client.scala:528)
> at
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:628)
> at
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
> at
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
> at
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:523)
> {code}
> One fix is to call startExecutorDelegationTokenRenewer

[jira] [Assigned] (SPARK-11662) Call startExecutorDelegationTokenRenewer() ahead of client app submission

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11662:


Assignee: (was: Apache Spark)

> Call startExecutorDelegationTokenRenewer() ahead of client app submission
> -
>
> Key: SPARK-11662
> URL: https://issues.apache.org/jira/browse/SPARK-11662
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Ted Yu
>
> As reported in the thread 'Creating new Spark context when running in Secure 
> YARN fails', an IOException may be thrown when a SparkContext is stopped and 
> started again while working with a secure YARN cluster:
> {code}
> 15/11/11 10:19:53 ERROR spark.SparkContext: Error initializing SparkContext.
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token
> can be issued only with kerberos or web authentication
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:6638)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcServer.java:563)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:987)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> at org.apache.hadoop.ipc.Client.call(Client.java:1476)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy12.getDelegationToken(Unknown Source)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getDelegationToken(ClientNamenodeProtocolTranslatorPB.java:933)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy13.getDelegationToken(Unknown Source)
> at
> org.apache.hadoop.hdfs.DFSClient.getDelegationToken(DFSClient.java:1044)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.getDelegationToken(DistributedFileSystem.java:1543)
> at
> org.apache.hadoop.fs.FileSystem.collectDelegationTokens(FileSystem.java:530)
> at
> org.apache.hadoop.fs.FileSystem.addDelegationTokens(FileSystem.java:508)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.addDelegationTokens(DistributedFileSystem.java:2228)
> at
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$obtainTokensForNamenodes$1.apply(YarnSparkHadoopUtil.scala:126)
> at
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$obtainTokensForNamenodes$1.apply(YarnSparkHadoopUtil.scala:123)
> at scala.collection.immutable.Set$Set1.foreach(Set.scala:74)
> at
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.obtainTokensForNamenodes(YarnSparkHadoopUtil.scala:123)
> at
> org.apache.spark.deploy.yarn.Client.getTokenRenewalInterval(Client.scala:495)
> at
> org.apache.spark.deploy.yarn.Client.setupLaunchEnv(Client.scala:528)
> at
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:628)
> at
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
> at
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
> at
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:523)
> {code}
> One fix is to call startExecutorDelegationTokenRenewer(conf) ahead of client 
>

[jira] [Commented] (SPARK-11662) Call startExecutorDelegationTokenRenewer() ahead of client app submission

2015-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000937#comment-15000937
 ] 

Apache Spark commented on SPARK-11662:
--

User 'tedyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/9635

> Call startExecutorDelegationTokenRenewer() ahead of client app submission
> -
>
> Key: SPARK-11662
> URL: https://issues.apache.org/jira/browse/SPARK-11662
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Ted Yu
>
> As reported in the thread 'Creating new Spark context when running in Secure 
> YARN fails', an IOException may be thrown when a SparkContext is stopped and 
> started again while working with a secure YARN cluster:
> {code}
> 15/11/11 10:19:53 ERROR spark.SparkContext: Error initializing SparkContext.
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token
> can be issued only with kerberos or web authentication
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:6638)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcServer.java:563)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:987)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> at org.apache.hadoop.ipc.Client.call(Client.java:1476)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy12.getDelegationToken(Unknown Source)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getDelegationToken(ClientNamenodeProtocolTranslatorPB.java:933)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy13.getDelegationToken(Unknown Source)
> at
> org.apache.hadoop.hdfs.DFSClient.getDelegationToken(DFSClient.java:1044)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.getDelegationToken(DistributedFileSystem.java:1543)
> at
> org.apache.hadoop.fs.FileSystem.collectDelegationTokens(FileSystem.java:530)
> at
> org.apache.hadoop.fs.FileSystem.addDelegationTokens(FileSystem.java:508)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.addDelegationTokens(DistributedFileSystem.java:2228)
> at
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$obtainTokensForNamenodes$1.apply(YarnSparkHadoopUtil.scala:126)
> at
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$obtainTokensForNamenodes$1.apply(YarnSparkHadoopUtil.scala:123)
> at scala.collection.immutable.Set$Set1.foreach(Set.scala:74)
> at
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.obtainTokensForNamenodes(YarnSparkHadoopUtil.scala:123)
> at
> org.apache.spark.deploy.yarn.Client.getTokenRenewalInterval(Client.scala:495)
> at
> org.apache.spark.deploy.yarn.Client.setupLaunchEnv(Client.scala:528)
> at
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:628)
> at
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
> at
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
> at
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
> at org.apache.spark.SparkContext.(Spa

[jira] [Created] (SPARK-11663) Add Java API for trackStateByKey

2015-11-11 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-11663:
-

 Summary: Add Java API for trackStateByKey
 Key: SPARK-11663
 URL: https://issues.apache.org/jira/browse/SPARK-11663
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Tathagata Das
Assignee: Shixiong Zhu









[jira] [Created] (SPARK-11664) Add methods to get bisecting k-means cluster structure

2015-11-11 Thread Yu Ishikawa (JIRA)
Yu Ishikawa created SPARK-11664:
---

 Summary: Add methods to get bisecting k-means cluster structure
 Key: SPARK-11664
 URL: https://issues.apache.org/jira/browse/SPARK-11664
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Yu Ishikawa
Priority: Minor


I think users want to visualize the result of bisecting k-means clustering as a 
dendrogram in order to confirm it. So it would be great to support methods to 
get the cluster tree structure as an adjacency list, a linkage matrix, and so on.
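A purely illustrative sketch of what an adjacency-list view could look like (the 
ClusterNode type and toAdjacencyList helper are hypothetical, not an MLlib API):
{code:java|title=Adjacency-list sketch|borderStyle=solid}
// Hypothetical tree shape: each node has an id and zero or more children.
case class ClusterNode(id: Int, children: Seq[ClusterNode] = Nil)

// Flatten the tree into (parent, child) pairs, i.e. an adjacency list.
def toAdjacencyList(root: ClusterNode): Seq[(Int, Int)] =
  root.children.flatMap(c => (root.id, c.id) +: toAdjacencyList(c))

val tree = ClusterNode(0, Seq(ClusterNode(1), ClusterNode(2, Seq(ClusterNode(3), ClusterNode(4)))))
println(toAdjacencyList(tree))  // List((0,1), (0,2), (2,3), (2,4))
{code}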






[jira] [Created] (SPARK-11665) Support other distance metrics for bisecting k-means

2015-11-11 Thread Yu Ishikawa (JIRA)
Yu Ishikawa created SPARK-11665:
---

 Summary: Support other distance metrics for bisecting k-means
 Key: SPARK-11665
 URL: https://issues.apache.org/jira/browse/SPARK-11665
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Yu Ishikawa
Priority: Minor


Some users have requested support for other distance metrics, such as cosine 
distance and Tanimoto distance, in bisecting k-means.

We should
- design the interfaces for distance metrics
- support those distances (a sketch of a possible interface follows)
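Hypothetical interface design, not an existing MLlib API; the Tanimoto form below 
is the usual extension to real-valued vectors:
{code:java|title=Distance metric interface sketch|borderStyle=solid}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical interface for pluggable distances:
trait DistanceMetric extends Serializable {
  def distance(a: Vector, b: Vector): Double
}

object CosineDistance extends DistanceMetric {
  override def distance(a: Vector, b: Vector): Double = {
    val (da, db) = (a.toArray, b.toArray)
    val dot = da.zip(db).map { case (x, y) => x * y }.sum
    val na = math.sqrt(da.map(x => x * x).sum)
    val nb = math.sqrt(db.map(x => x * x).sum)
    1.0 - dot / (na * nb)
  }
}

object TanimotoDistance extends DistanceMetric {
  override def distance(a: Vector, b: Vector): Double = {
    val (da, db) = (a.toArray, b.toArray)
    val dot = da.zip(db).map { case (x, y) => x * y }.sum
    val na2 = da.map(x => x * x).sum
    val nb2 = db.map(x => x * x).sum
    1.0 - dot / (na2 + nb2 - dot)
  }
}

println(CosineDistance.distance(Vectors.dense(1, 0), Vectors.dense(0, 1)))  // 1.0
{code}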






[jira] [Created] (SPARK-11666) Find the best `k` by cutting bisecting k-means cluster tree without recomputation

2015-11-11 Thread Yu Ishikawa (JIRA)
Yu Ishikawa created SPARK-11666:
---

 Summary: Find the best `k` by cutting bisecting k-means cluster 
tree without recomputation
 Key: SPARK-11666
 URL: https://issues.apache.org/jira/browse/SPARK-11666
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Yu Ishikawa
Priority: Minor


For example, scikit-learn's hierarchical clustering supports extracting a partial 
tree from the clustering result. We should support a similar feature in order to 
reduce computation cost.
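A toy illustration of the idea (hypothetical types; the point is only that cutting 
an existing tree down to k leaves needs no re-clustering):
{code:java|title=Tree cut sketch|borderStyle=solid}
// Hypothetical node: an id, the cost of the cluster it represents, and its
// already-computed children from the bisecting k-means run.
case class Node(id: Int, cost: Double, children: Seq[Node] = Nil)

// Repeatedly replace the most expensive splittable leaf by its children
// until k leaves remain; the tree itself is never recomputed.
def cutToK(root: Node, k: Int): Seq[Node] = {
  var leaves = Seq(root)
  while (leaves.size < k && leaves.exists(_.children.nonEmpty)) {
    val toSplit = leaves.filter(_.children.nonEmpty).maxBy(_.cost)
    leaves = leaves.filterNot(_ eq toSplit) ++ toSplit.children
  }
  leaves
}
{code}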






[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer

2015-11-11 Thread Hitoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001022#comment-15001022
 ] 

Hitoshi Ozawa commented on SPARK-7708:
--

Upgrading to Kryo 3.x will resolve this issue

> Incorrect task serialization with Kryo closure serializer
> -
>
> Key: SPARK-7708
> URL: https://issues.apache.org/jira/browse/SPARK-7708
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.2
>Reporter: Akshat Aranya
>
> I've been investigating the use of Kryo for closure serialization with Spark 
> 1.2, and it seems like I've hit upon a bug:
> When a task is serialized before scheduling, the following log message is 
> generated:
> [info] o.a.s.s.TaskSetManager - Starting task 124.1 in stage 0.0 (TID 342, 
> , PROCESS_LOCAL, 302 bytes)
> This message comes from TaskSetManager which serializes the task using the 
> closure serializer.  Before the message is sent out, the TaskDescription 
> (which included the original task as a byte array), is serialized again into 
> a byte array with the closure serializer.  I added a log message for this in 
> CoarseGrainedSchedulerBackend, which produces the following output:
> [info] o.a.s.s.c.CoarseGrainedSchedulerBackend - 124.1 size=132
> The serialized size of TaskDescription (132 bytes) turns out to be _smaller_ 
> than the serialized task that it contains (302 bytes). This implies that 
> TaskDescription.buffer is not getting serialized correctly.
> On the executor side, the deserialization produces a null value for 
> TaskDescription.buffer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer

2015-11-11 Thread Hitoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001025#comment-15001025
 ] 

Hitoshi Ozawa commented on SPARK-7708:
--

Upgrading to Kryo 3.x may resolve this issue

> Incorrect task serialization with Kryo closure serializer
> -
>
> Key: SPARK-7708
> URL: https://issues.apache.org/jira/browse/SPARK-7708
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.2
>Reporter: Akshat Aranya
>
> I've been investigating the use of Kryo for closure serialization with Spark 
> 1.2, and it seems like I've hit upon a bug:
> When a task is serialized before scheduling, the following log message is 
> generated:
> [info] o.a.s.s.TaskSetManager - Starting task 124.1 in stage 0.0 (TID 342, 
> , PROCESS_LOCAL, 302 bytes)
> This message comes from TaskSetManager which serializes the task using the 
> closure serializer.  Before the message is sent out, the TaskDescription 
> (which included the original task as a byte array), is serialized again into 
> a byte array with the closure serializer.  I added a log message for this in 
> CoarseGrainedSchedulerBackend, which produces the following output:
> [info] o.a.s.s.c.CoarseGrainedSchedulerBackend - 124.1 size=132
> The serialized size of TaskDescription (132 bytes) turns out to be _smaller_ 
> than the serialized task that it contains (302 bytes). This implies that 
> TaskDescription.buffer is not getting serialized correctly.
> On the executor side, the deserialization produces a null value for 
> TaskDescription.buffer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11416) Upgrade kryo package to version 3.0

2015-11-11 Thread Hitoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001031#comment-15001031
 ] 

Hitoshi Ozawa commented on SPARK-11416:
---

We need to coordinate with Hive in order to upgrade Kryo.

> Upgrade kryo package to version 3.0
> ---
>
> Key: SPARK-11416
> URL: https://issues.apache.org/jira/browse/SPARK-11416
> Project: Spark
>  Issue Type: Wish
>  Components: Build
>Affects Versions: 1.5.1
>Reporter: Hitoshi Ozawa
>Priority: Trivial
>
> Would like to have Apache Spark upgrade kryo package from 2.x (current) to 
> 3.x.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11644) Remove the option to turn off unsafe and codegen

2015-11-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-11644.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

> Remove the option to turn off unsafe and codegen
> 
>
> Key: SPARK-11644
> URL: https://issues.apache.org/jira/browse/SPARK-11644
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: releasenotes
> Fix For: 1.6.0
>
>
> We don't sufficiently test the code path with these settings off. It is 
> better to just consolidate and focus on making one code path work well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11644) Remove the option to turn off unsafe and codegen

2015-11-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-11644:

Labels: releasenotes  (was: )

> Remove the option to turn off unsafe and codegen
> 
>
> Key: SPARK-11644
> URL: https://issues.apache.org/jira/browse/SPARK-11644
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: releasenotes
> Fix For: 1.6.0
>
>
> We don't sufficiently test the code path with these settings off. It is 
> better to just consolidate and focus on making one code path work well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11645) Remove OpenHashSet for the old aggregate.

2015-11-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-11645.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

> Remove OpenHashSet for the old aggregate.
> -
>
> Key: SPARK-11645
> URL: https://issues.apache.org/jira/browse/SPARK-11645
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11667) Update list of cluster managers supporting dynamic allocation

2015-11-11 Thread Andrew Or (JIRA)
Andrew Or created SPARK-11667:
-

 Summary: Update list of cluster managers supporting dynamic 
allocation
 Key: SPARK-11667
 URL: https://issues.apache.org/jira/browse/SPARK-11667
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.5.0
Reporter: Andrew Or
Assignee: Andrew Or


It still says it's only supported on YARN. In reality it is supported on all 
coarse-grained modes now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11668) R style summary stats in GLM package SparkR

2015-11-11 Thread Shubhanshu Mishra (JIRA)
Shubhanshu Mishra created SPARK-11668:
-

 Summary: R style summary stats in GLM package SparkR
 Key: SPARK-11668
 URL: https://issues.apache.org/jira/browse/SPARK-11668
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 1.5.1, 1.5.0
 Environment: LINUX
WINDOWS
MAC
Reporter: Shubhanshu Mishra
 Fix For: 1.5.1


In the current SparkR GLM module, the `summary(model)` function call only 
returns the values of the coefficients. In the actual R GLM module, however, the 
function also returns the std. error, z-score, p-value, and confidence intervals 
for the coefficients, as well as model-based statistics like R-squared values, 
AIC, BIC, etc. 

Another source of inspiration for adding these metrics is the format of the Python 
statsmodels package described here: 
http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/formulas.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11669) Python interface to SparkR GLM module

2015-11-11 Thread Shubhanshu Mishra (JIRA)
Shubhanshu Mishra created SPARK-11669:
-

 Summary: Python interface to SparkR GLM module
 Key: SPARK-11669
 URL: https://issues.apache.org/jira/browse/SPARK-11669
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SparkR
Affects Versions: 1.5.1, 1.5.0
 Environment: LINUX
MAC
WINDOWS
Reporter: Shubhanshu Mishra
 Fix For: 1.5.1


There should be a Python interface to the SparkR GLM module. Currently, the only 
Python library that produces R-style GLM results is statsmodels. 

Inspiration for the API can be taken from the following page: 
http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/formulas.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11668) R style summary stats in GLM package SparkR

2015-11-11 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001124#comment-15001124
 ] 

Shivaram Venkataraman commented on SPARK-11668:
---

I think this is already covered in 
https://issues.apache.org/jira/browse/SPARK-11494 ? 

cc [~mengxr]

> R style summary stats in GLM package SparkR
> ---
>
> Key: SPARK-11668
> URL: https://issues.apache.org/jira/browse/SPARK-11668
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.0, 1.5.1
> Environment: LINUX
> WINDOWS
> MAC
>Reporter: Shubhanshu Mishra
>  Labels: GLM, sparkr
> Fix For: 1.5.1
>
>
> In the current SparkR GLM module, the `summary(model)` function call only 
> returns the values of the coefficients. In the actual R GLM module, however, the 
> function also returns the std. error, z-score, p-value, and confidence intervals 
> for the coefficients, as well as model-based statistics like R-squared values, 
> AIC, BIC, etc. 
> Another source of inspiration for adding these metrics is the format of the 
> Python statsmodels package described here: 
> http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/formulas.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11663) Add Java API for trackStateByKey

2015-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001139#comment-15001139
 ] 

Apache Spark commented on SPARK-11663:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/9636

> Add Java API for trackStateByKey
> 
>
> Key: SPARK-11663
> URL: https://issues.apache.org/jira/browse/SPARK-11663
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11663) Add Java API for trackStateByKey

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11663:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Add Java API for trackStateByKey
> 
>
> Key: SPARK-11663
> URL: https://issues.apache.org/jira/browse/SPARK-11663
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11335) Update documentation on accessing Kafka offsets from Python

2015-11-11 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-11335.
---
   Resolution: Fixed
 Assignee: Nick Evans
Fix Version/s: 1.6.0

> Update documentation on accessing Kafka offsets from Python
> ---
>
> Key: SPARK-11335
> URL: https://issues.apache.org/jira/browse/SPARK-11335
> Project: Spark
>  Issue Type: Documentation
>  Components: Streaming
>Reporter: Nick Evans
>Assignee: Nick Evans
>Priority: Minor
>  Labels: docuentation, kafka, pyspark, streaming
> Fix For: 1.6.0
>
>
> The 
> [docs|http://spark.apache.org/docs/latest/streaming-kafka-integration.html] 
> state that the Python API for accessing offsets is not yet available, but it 
> seems to be available via the callable {{offsetRanges}} on a {{KafkaRDD}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11663) Add Java API for trackStateByKey

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11663:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Add Java API for trackStateByKey
> 
>
> Key: SPARK-11663
> URL: https://issues.apache.org/jira/browse/SPARK-11663
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11667) Update dynamic allocation docs to reflect supported cluster managers

2015-11-11 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-11667:
--
Summary: Update dynamic allocation docs to reflect supported cluster 
managers  (was: Update list of cluster managers supporting dynamic allocation)

> Update dynamic allocation docs to reflect supported cluster managers
> 
>
> Key: SPARK-11667
> URL: https://issues.apache.org/jira/browse/SPARK-11667
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> It still says it's only supported on YARN. In reality it is supported on all 
> coarse-grained modes now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11667) Update dynamic allocation docs to reflect supported cluster managers

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11667:


Assignee: Apache Spark  (was: Andrew Or)

> Update dynamic allocation docs to reflect supported cluster managers
> 
>
> Key: SPARK-11667
> URL: https://issues.apache.org/jira/browse/SPARK-11667
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>
> It still says it's only supported on YARN. In reality it is supported on all 
> coarse-grained modes now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11667) Update dynamic allocation docs to reflect supported cluster managers

2015-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001157#comment-15001157
 ] 

Apache Spark commented on SPARK-11667:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/9637

> Update dynamic allocation docs to reflect supported cluster managers
> 
>
> Key: SPARK-11667
> URL: https://issues.apache.org/jira/browse/SPARK-11667
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> It still says it's only supported on YARN. In reality it is supported on all 
> coarse-grained modes now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11667) Update dynamic allocation docs to reflect supported cluster managers

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11667:


Assignee: Andrew Or  (was: Apache Spark)

> Update dynamic allocation docs to reflect supported cluster managers
> 
>
> Key: SPARK-11667
> URL: https://issues.apache.org/jira/browse/SPARK-11667
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> It still says it's only supported on YARN. In reality it is supported on all 
> coarse-grained modes now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11416) Upgrade kryo package to version 3.0

2015-11-11 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001186#comment-15001186
 ] 

Steve Loughran commented on SPARK-11416:


Being in sync with Hive would significantly aid Hive/Spark integration, which 
SPARK-10793 looks at improving. However, given how tightly both Hive and Spark 
are coupled to Kryo, this is something that will scare people.

> Upgrade kryo package to version 3.0
> ---
>
> Key: SPARK-11416
> URL: https://issues.apache.org/jira/browse/SPARK-11416
> Project: Spark
>  Issue Type: Wish
>  Components: Build
>Affects Versions: 1.5.1
>Reporter: Hitoshi Ozawa
>Priority: Trivial
>
> Would like to have Apache Spark upgrade kryo package from 2.x (current) to 
> 3.x.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10793) Make spark's use/subclassing of hive more maintainable

2015-11-11 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-10793:
---
Summary: Make spark's use/subclassing of hive more maintainable  (was: Make 
sparks use/subclassing of hive more maintainable)

> Make spark's use/subclassing of hive more maintainable
> --
>
> Key: SPARK-10793
> URL: https://issues.apache.org/jira/browse/SPARK-10793
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Steve Loughran
>
> The latest spark/hive integration round has closed the gap with Hive 
> versions, but the integration is still pretty complex
> # SparkSQL has deep hooks into the parser
> # hivethriftserver uses "aggressive reflection" to inject spark classes into 
> the Hive base classes.
> # there's a separate org.sparkproject.hive JAR to isolate Kryo versions while 
> avoiding the hive uberjar with all its dependencies getting into the spark 
> uberjar.
> We can improve this with some assistance from the other projects, even though 
> guarantees of stability for things like the parser and Thrift server APIs are 
> unlikely in the near future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11670) Fix incorrect kryo buffer default value in docs

2015-11-11 Thread Andrew Or (JIRA)
Andrew Or created SPARK-11670:
-

 Summary: Fix incorrect kryo buffer default value in docs
 Key: SPARK-11670
 URL: https://issues.apache.org/jira/browse/SPARK-11670
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Minor


The default is 64K, but the doc says it's 2?

https://spark.apache.org/docs/1.5.0/tuning.html#data-serialization
{code}
If your objects are large, you may also need to increase the 
spark.kryoserializer.buffer config property. The default is 2, but this value 
needs to be large enough to hold the largest object you will serialize.
{code}
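
For reference, the property in question is set like any other configuration entry; 
a minimal example using size suffixes (the values shown are just the documented defaults):

{code}
import org.apache.spark.SparkConf

// spark.kryoserializer.buffer accepts a size string such as "64k" or "1m".
val conf = new SparkConf()
  .set("spark.kryoserializer.buffer", "64k")      // initial Kryo serialization buffer
  .set("spark.kryoserializer.buffer.max", "64m")  // ceiling for large objects
{code}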



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11670) Fix incorrect kryo buffer default value in docs

2015-11-11 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-11670:
--
Description: 
The default is 64K, but the doc says it's 2?

https://spark.apache.org/docs/1.5.0/tuning.html#data-serialization
{quote}
If your objects are large, you may also need to increase the 
spark.kryoserializer.buffer config property. The default is 2, but this value 
needs to be large enough to hold the largest object you will serialize.
{quote}

  was:
The default is 64K, but the doc says it's 2?

https://spark.apache.org/docs/1.5.0/tuning.html#data-serialization
{code}
If your objects are large, you may also need to increase the 
spark.kryoserializer.buffer config property. The default is 2, but this value 
needs to be large enough to hold the largest object you will serialize.
{code}


> Fix incorrect kryo buffer default value in docs
> ---
>
> Key: SPARK-11670
> URL: https://issues.apache.org/jira/browse/SPARK-11670
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Minor
>
> The default is 64K, but the doc says it's 2?
> https://spark.apache.org/docs/1.5.0/tuning.html#data-serialization
> {quote}
> If your objects are large, you may also need to increase the 
> spark.kryoserializer.buffer config property. The default is 2, but this value 
> needs to be large enough to hold the largest object you will serialize.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-11660) Spark Thrift GetResultSetMetadata describes a VARCHAR as a STRING

2015-11-11 Thread Huaxin Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao updated SPARK-11660:
---
Comment: was deleted

(was: I would like to work on this issue. )

> Spark Thrift GetResultSetMetadata describes a VARCHAR as a STRING
> -
>
> Key: SPARK-11660
> URL: https://issues.apache.org/jira/browse/SPARK-11660
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.5.0
>Reporter: Chip Sands
>
> In the Spark SQL Thrift interface, the GetResultSetMetadata reply packet that 
> describes the result set metadata reports a column that is defined as a 
> VARCHAR in the database as a native type of STRING. Data is still returned 
> correctly as the Thrift string type, but ODBC/JDBC is not able to correctly 
> describe the data type being returned or its defined maximum length.
> FYI, Hive returns it correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11670) Fix incorrect kryo buffer default value in docs

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11670:


Assignee: Andrew Or  (was: Apache Spark)

> Fix incorrect kryo buffer default value in docs
> ---
>
> Key: SPARK-11670
> URL: https://issues.apache.org/jira/browse/SPARK-11670
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Minor
>
> The default is 64K, but the doc says it's 2?
> https://spark.apache.org/docs/1.5.0/tuning.html#data-serialization
> {quote}
> If your objects are large, you may also need to increase the 
> spark.kryoserializer.buffer config property. The default is 2, but this value 
> needs to be large enough to hold the largest object you will serialize.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11660) Spark Thrift GetResultSetMetadata describes a VARCHAR as a STRING

2015-11-11 Thread Huaxin Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001200#comment-15001200
 ] 

Huaxin Gao commented on SPARK-11660:


This seems to be working as designed. Please refer to SPARK-5918. 

https://issues.apache.org/jira/browse/SPARK-5918

> Spark Thrift GetResultSetMetadata describes a VARCHAR as a STRING
> -
>
> Key: SPARK-11660
> URL: https://issues.apache.org/jira/browse/SPARK-11660
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.5.0
>Reporter: Chip Sands
>
> In the Spark SQL Thrift interface, the GetResultSetMetadata reply packet that 
> describes the result set metadata reports a column that is defined as a 
> VARCHAR in the database as a native type of STRING. Data is still returned 
> correctly as the Thrift string type, but ODBC/JDBC is not able to correctly 
> describe the data type being returned or its defined maximum length.
> FYI, Hive returns it correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11670) Fix incorrect kryo buffer default value in docs

2015-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001201#comment-15001201
 ] 

Apache Spark commented on SPARK-11670:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/9638

> Fix incorrect kryo buffer default value in docs
> ---
>
> Key: SPARK-11670
> URL: https://issues.apache.org/jira/browse/SPARK-11670
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Minor
>
> The default is 64K, but the doc says it's 2?
> https://spark.apache.org/docs/1.5.0/tuning.html#data-serialization
> {quote}
> If your objects are large, you may also need to increase the 
> spark.kryoserializer.buffer config property. The default is 2, but this value 
> needs to be large enough to hold the largest object you will serialize.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11670) Fix incorrect kryo buffer default value in docs

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11670:


Assignee: Apache Spark  (was: Andrew Or)

> Fix incorrect kryo buffer default value in docs
> ---
>
> Key: SPARK-11670
> URL: https://issues.apache.org/jira/browse/SPARK-11670
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>Priority: Minor
>
> The default is 64K, but the doc says it's 2?
> https://spark.apache.org/docs/1.5.0/tuning.html#data-serialization
> {quote}
> If your objects are large, you may also need to increase the 
> spark.kryoserializer.buffer config property. The default is 2, but this value 
> needs to be large enough to hold the largest object you will serialize.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11647) Attempt to reduce flakiness of Hive Cli / SparkSubmit tests via conf. changes

2015-11-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-11647.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

> Attempt to reduce flakiness of Hive Cli / SparkSubmit tests via conf. changes
> -
>
> Key: SPARK-11647
> URL: https://issues.apache.org/jira/browse/SPARK-11647
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.6.0
>
>
> There are a few test configuration changes, such as properly setting system 
> properties, disabling web UIs, and disabling the Derby WAL, which might speed 
> up HiveSparkSubmitSuite and the Thriftserver's CliSuite.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11671) Example for sqlContext.createDataDrame from pandas.DataFrame has a typo

2015-11-11 Thread chris snow (JIRA)
chris snow created SPARK-11671:
--

 Summary: Example for sqlContext.createDataDrame from 
pandas.DataFrame has a typo
 Key: SPARK-11671
 URL: https://issues.apache.org/jira/browse/SPARK-11671
 Project: Spark
  Issue Type: Bug
  Components: Deploy, PySpark
Affects Versions: 1.5.1
Reporter: chris snow
Priority: Minor


PySpark documentation error:

{code}
sqlContext.createDataFrame(pandas.DataFrame([[1, 2]]).collect()) 
{code}

Results in:

{code}
---
AttributeErrorTraceback (most recent call last)
 in ()
> 1 sqlContext.createDataFrame(pandas.DataFrame([[1, 2]]).collect())

/usr/local/src/bluemix_ipythonspark_141/notebook/lib/python2.7/site-packages/pandas-0.14.0-py2.7-linux-x86_64.egg/pandas/core/generic.pyc
 in __getattr__(self, name)
   1841 return self[name]
   1842 raise AttributeError("'%s' object has no attribute '%s'" %
-> 1843  (type(self).__name__, name))
   1844 
   1845 def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'collect'
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11671) Example for sqlContext.createDataDrame from pandas.DataFrame has a typo

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11671:


Assignee: (was: Apache Spark)

> Example for sqlContext.createDataDrame from pandas.DataFrame has a typo
> ---
>
> Key: SPARK-11671
> URL: https://issues.apache.org/jira/browse/SPARK-11671
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, PySpark
>Affects Versions: 1.5.1
>Reporter: chris snow
>Priority: Minor
>
> PySpark documentation error:
> {code}
> sqlContext.createDataFrame(pandas.DataFrame([[1, 2]]).collect()) 
> {code}
> Results in:
> {code}
> ---
> AttributeErrorTraceback (most recent call last)
>  in ()
> > 1 sqlContext.createDataFrame(pandas.DataFrame([[1, 2]]).collect())
> /usr/local/src/bluemix_ipythonspark_141/notebook/lib/python2.7/site-packages/pandas-0.14.0-py2.7-linux-x86_64.egg/pandas/core/generic.pyc
>  in __getattr__(self, name)
>1841 return self[name]
>1842 raise AttributeError("'%s' object has no attribute '%s'" %
> -> 1843  (type(self).__name__, name))
>1844 
>1845 def __setattr__(self, name, value):
> AttributeError: 'DataFrame' object has no attribute 'collect'
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11672) Flaky test: ml.JavaDefaultReadWriteSuite

2015-11-11 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-11672:
-

 Summary: Flaky test: ml.JavaDefaultReadWriteSuite
 Key: SPARK-11672
 URL: https://issues.apache.org/jira/browse/SPARK-11672
 Project: Spark
  Issue Type: Bug
  Components: ML
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical


Saw several failures on Jenkins, e.g., 
https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2040/testReport/org.apache.spark.ml.util/JavaDefaultReadWriteSuite/testDefaultReadWrite/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11671) Example for sqlContext.createDataDrame from pandas.DataFrame has a typo

2015-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001260#comment-15001260
 ] 

Apache Spark commented on SPARK-11671:
--

User 'snowch' has created a pull request for this issue:
https://github.com/apache/spark/pull/9639

> Example for sqlContext.createDataDrame from pandas.DataFrame has a typo
> ---
>
> Key: SPARK-11671
> URL: https://issues.apache.org/jira/browse/SPARK-11671
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, PySpark
>Affects Versions: 1.5.1
>Reporter: chris snow
>Priority: Minor
>
> PySpark documentation error:
> {code}
> sqlContext.createDataFrame(pandas.DataFrame([[1, 2]]).collect()) 
> {code}
> Results in:
> {code}
> ---
> AttributeErrorTraceback (most recent call last)
>  in ()
> > 1 sqlContext.createDataFrame(pandas.DataFrame([[1, 2]]).collect())
> /usr/local/src/bluemix_ipythonspark_141/notebook/lib/python2.7/site-packages/pandas-0.14.0-py2.7-linux-x86_64.egg/pandas/core/generic.pyc
>  in __getattr__(self, name)
>1841 return self[name]
>1842 raise AttributeError("'%s' object has no attribute '%s'" %
> -> 1843  (type(self).__name__, name))
>1844 
>1845 def __setattr__(self, name, value):
> AttributeError: 'DataFrame' object has no attribute 'collect'
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11671) Example for sqlContext.createDataDrame from pandas.DataFrame has a typo

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11671:


Assignee: Apache Spark

> Example for sqlContext.createDataDrame from pandas.DataFrame has a typo
> ---
>
> Key: SPARK-11671
> URL: https://issues.apache.org/jira/browse/SPARK-11671
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, PySpark
>Affects Versions: 1.5.1
>Reporter: chris snow
>Assignee: Apache Spark
>Priority: Minor
>
> PySpark documentation error:
> {code}
> sqlContext.createDataFrame(pandas.DataFrame([[1, 2]]).collect()) 
> {code}
> Results in:
> {code}
> ---
> AttributeErrorTraceback (most recent call last)
>  in ()
> > 1 sqlContext.createDataFrame(pandas.DataFrame([[1, 2]]).collect())
> /usr/local/src/bluemix_ipythonspark_141/notebook/lib/python2.7/site-packages/pandas-0.14.0-py2.7-linux-x86_64.egg/pandas/core/generic.pyc
>  in __getattr__(self, name)
>1841 return self[name]
>1842 raise AttributeError("'%s' object has no attribute '%s'" %
> -> 1843  (type(self).__name__, name))
>1844 
>1845 def __setattr__(self, name, value):
> AttributeError: 'DataFrame' object has no attribute 'collect'
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11658) simplify documentation for PySpark combineByKey

2015-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001266#comment-15001266
 ] 

Apache Spark commented on SPARK-11658:
--

User 'snowch' has created a pull request for this issue:
https://github.com/apache/spark/pull/9640

> simplify documentation for PySpark combineByKey
> ---
>
> Key: SPARK-11658
> URL: https://issues.apache.org/jira/browse/SPARK-11658
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 1.5.1
>Reporter: chris snow
>Priority: Minor
>
> The current documentation for combineByKey looks like this:
> {code}
> >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
> >>> def f(x): return x
> >>> def add(a, b): return a + str(b)
> >>> sorted(x.combineByKey(str, add, add).collect())
> [('a', '11'), ('b', '1')]
> """
> {code}
> I think it could be simplified to:
> {code}
> >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
> >>> def add(a, b): return a + str(b)
> >>> x.combineByKey(str, add, add).collect()
> [('a', '11'), ('b', '1')]
> """
> {code}
> I'll shortly add a patch for this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11672) Flaky test: ml.JavaDefaultReadWriteSuite

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11672:


Assignee: Apache Spark  (was: Xiangrui Meng)

> Flaky test: ml.JavaDefaultReadWriteSuite
> 
>
> Key: SPARK-11672
> URL: https://issues.apache.org/jira/browse/SPARK-11672
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>Priority: Critical
>
> Saw several failures on Jenkins, e.g., 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2040/testReport/org.apache.spark.ml.util/JavaDefaultReadWriteSuite/testDefaultReadWrite/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11672) Flaky test: ml.JavaDefaultReadWriteSuite

2015-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001267#comment-15001267
 ] 

Apache Spark commented on SPARK-11672:
--

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/9641

> Flaky test: ml.JavaDefaultReadWriteSuite
> 
>
> Key: SPARK-11672
> URL: https://issues.apache.org/jira/browse/SPARK-11672
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> Saw several failures on Jenkins, e.g., 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2040/testReport/org.apache.spark.ml.util/JavaDefaultReadWriteSuite/testDefaultReadWrite/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11672) Flaky test: ml.JavaDefaultReadWriteSuite

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11672:


Assignee: Xiangrui Meng  (was: Apache Spark)

> Flaky test: ml.JavaDefaultReadWriteSuite
> 
>
> Key: SPARK-11672
> URL: https://issues.apache.org/jira/browse/SPARK-11672
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> Saw several failures on Jenkins, e.g., 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2040/testReport/org.apache.spark.ml.util/JavaDefaultReadWriteSuite/testDefaultReadWrite/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11553) row.getInt(i) if row[i]=null returns 0

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11553:


Assignee: (was: Apache Spark)

> row.getInt(i) if row[i]=null returns 0
> --
>
> Key: SPARK-11553
> URL: https://issues.apache.org/jira/browse/SPARK-11553
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Tofigh
>Priority: Minor
>
> row.getInt|getFloat|getDouble on a Spark SQL Row returns 0 if row[index] is 
> null (even though, according to the documentation, they should throw a 
> NullPointerException).
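
A small defensive pattern callers can use in the meantime (standard {{Row}} API, 
shown only as a workaround sketch, not as the fix):

{code}
import org.apache.spark.sql.Row

// The primitive getters cannot represent null, so guard with isNullAt first.
def safeGetInt(row: Row, i: Int): Option[Int] =
  if (row.isNullAt(i)) None else Some(row.getInt(i))
{code}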



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11553) row.getInt(i) if row[i]=null returns 0

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11553:


Assignee: Apache Spark

> row.getInt(i) if row[i]=null returns 0
> --
>
> Key: SPARK-11553
> URL: https://issues.apache.org/jira/browse/SPARK-11553
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Tofigh
>Assignee: Apache Spark
>Priority: Minor
>
> row.getInt|getFloat|getDouble on a Spark SQL Row returns 0 if row[index] is 
> null (even though, according to the documentation, they should throw a 
> NullPointerException).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11553) row.getInt(i) if row[i]=null returns 0

2015-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001285#comment-15001285
 ] 

Apache Spark commented on SPARK-11553:
--

User 'alberskib' has created a pull request for this issue:
https://github.com/apache/spark/pull/9642

> row.getInt(i) if row[i]=null returns 0
> --
>
> Key: SPARK-11553
> URL: https://issues.apache.org/jira/browse/SPARK-11553
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Tofigh
>Priority: Minor
>
> row.getInt|getFloat|getDouble on a Spark SQL Row returns 0 if row[index] is 
> null (even though, according to the documentation, they should throw a 
> NullPointerException).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11671) Example for sqlContext.createDataDrame from pandas.DataFrame has a typo

2015-11-11 Thread chris snow (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chris snow updated SPARK-11671:
---
Component/s: (was: Deploy)
 Documentation

> Example for sqlContext.createDataDrame from pandas.DataFrame has a typo
> ---
>
> Key: SPARK-11671
> URL: https://issues.apache.org/jira/browse/SPARK-11671
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, PySpark
>Affects Versions: 1.5.1
>Reporter: chris snow
>Priority: Minor
>
> PySpark documentation error:
> {code}
> sqlContext.createDataFrame(pandas.DataFrame([[1, 2]]).collect()) 
> {code}
> Results in:
> {code}
> ---
> AttributeErrorTraceback (most recent call last)
>  in ()
> > 1 sqlContext.createDataFrame(pandas.DataFrame([[1, 2]]).collect())
> /usr/local/src/bluemix_ipythonspark_141/notebook/lib/python2.7/site-packages/pandas-0.14.0-py2.7-linux-x86_64.egg/pandas/core/generic.pyc
>  in __getattr__(self, name)
>1841 return self[name]
>1842 raise AttributeError("'%s' object has no attribute '%s'" %
> -> 1843  (type(self).__name__, name))
>1844 
>1845 def __setattr__(self, name, value):
> AttributeError: 'DataFrame' object has no attribute 'collect'
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11648) IllegalReferenceCountException in Spark workloads

2015-11-11 Thread Nishkam Ravi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001291#comment-15001291
 ] 

Nishkam Ravi commented on SPARK-11648:
--

The two issues are closely related, but I am not sure whether they are duplicates. The 
proposed PR for SPARK-11617 doesn't fix this issue, so I will keep this one open 
for now.

> IllegalReferenceCountException in Spark workloads
> -
>
> Key: SPARK-11648
> URL: https://issues.apache.org/jira/browse/SPARK-11648
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Nishkam Ravi
>
> This exception is thrown for multiple workloads. Can be reproduced with 
> WordCount/PageRank/TeraSort.
> -
> Stack trace:
> 15/11/10 01:11:31 WARN TaskSetManager: Lost task 6.0 in stage 1.0 (TID 459, 
> 10.20.78.15): io.netty.util.IllegalReferenceCountException: refCnt: 0
>   at 
> io.netty.buffer.AbstractByteBuf.ensureAccessible(AbstractByteBuf.java:1178)
>   at io.netty.buffer.AbstractByteBuf.checkIndex(AbstractByteBuf.java:1129)
>   at io.netty.buffer.SlicedByteBuf.getBytes(SlicedByteBuf.java:180)
>   at io.netty.buffer.CompositeByteBuf.getBytes(CompositeByteBuf.java:687)
>   at io.netty.buffer.CompositeByteBuf.getBytes(CompositeByteBuf.java:42)
>   at io.netty.buffer.SlicedByteBuf.getBytes(SlicedByteBuf.java:181)
>   at io.netty.buffer.AbstractByteBuf.readBytes(AbstractByteBuf.java:677)
>   at io.netty.buffer.ByteBufInputStream.read(ByteBufInputStream.java:120)
>   at 
> org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:360)
>   at com.ning.compress.lzf.ChunkDecoder.readHeader(ChunkDecoder.java:213)
>   at 
> com.ning.compress.lzf.impl.UnsafeChunkDecoder.decodeChunk(UnsafeChunkDecoder.java:49)
>   at 
> com.ning.compress.lzf.LZFInputStream.readyBuffer(LZFInputStream.java:363)
>   at com.ning.compress.lzf.LZFInputStream.read(LZFInputStream.java:193)
>   at 
> java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310)
>   at 
> java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2323)
>   at 
> java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2794)
>   at 
> java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:801)
>   at java.io.ObjectInputStream.(ObjectInputStream.java:299)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.(JavaSerializer.scala:64)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.(JavaSerializer.scala:64)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:123)
>   at 
> org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$3.apply(BlockStoreShuffleReader.scala:64)
>   at 
> org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$3.apply(BlockStoreShuffleReader.scala:60)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:152)
>   at 
> org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:58)
>   at 
> org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:83)
>   at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11673) Remove the normal Project physical operator (and keep TungstenProject)

2015-11-11 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-11673:
---

 Summary: Remove the normal Project physical operator (and keep 
TungstenProject)
 Key: SPARK-11673
 URL: https://issues.apache.org/jira/browse/SPARK-11673
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5968) Parquet warning in spark-shell

2015-11-11 Thread swetha k (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001294#comment-15001294
 ] 

swetha k commented on SPARK-5968:
-

[~lian cheng]

Is this just a logger issue or would it have any potential impact on the 
functionality?

> Parquet warning in spark-shell
> --
>
> Key: SPARK-5968
> URL: https://issues.apache.org/jira/browse/SPARK-5968
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Michael Armbrust
>Assignee: Cheng Lian
>Priority: Critical
> Fix For: 1.3.0
>
>
> This may happen in the case of schema evolution, namely appending new Parquet 
> data with a different but compatible schema to existing Parquet files:
> {code}
> 15/02/23 23:29:24 WARN ParquetOutputCommitter: could not write summary file 
> for rankings
> parquet.io.ParquetEncodingException: 
> file:/Users/matei/workspace/apache-spark/rankings/part-r-1.parquet 
> invalid: all the files must be contained in the root rankings
> at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
> at 
> parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
> at 
> parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
> {code}
> The reason is that the Spark SQL schemas stored in the Parquet key-value metadata 
> differ. Parquet doesn't know how to "merge" this opaque user-defined 
> metadata, so it just throws an exception and gives up writing summary files. 
> Since the Parquet data source in Spark 1.3.0 supports schema merging, this is 
> harmless.  But it is kind of scary for the user.  We should try to suppress 
> this through the logger. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11673) Remove the normal Project physical operator (and keep TungstenProject)

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11673:


Assignee: Apache Spark  (was: Reynold Xin)

> Remove the normal Project physical operator (and keep TungstenProject)
> --
>
> Key: SPARK-11673
> URL: https://issues.apache.org/jira/browse/SPARK-11673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11673) Remove the normal Project physical operator (and keep TungstenProject)

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11673:


Assignee: Reynold Xin  (was: Apache Spark)

> Remove the normal Project physical operator (and keep TungstenProject)
> --
>
> Key: SPARK-11673
> URL: https://issues.apache.org/jira/browse/SPARK-11673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11673) Remove the normal Project physical operator (and keep TungstenProject)

2015-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001298#comment-15001298
 ] 

Apache Spark commented on SPARK-11673:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/9643

> Remove the normal Project physical operator (and keep TungstenProject)
> --
>
> Key: SPARK-11673
> URL: https://issues.apache.org/jira/browse/SPARK-11673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11154) make specificaition spark.yarn.executor.memoryOverhead consistent with typical JVM options

2015-11-11 Thread Chris Howard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001316#comment-15001316
 ] 

Chris Howard commented on SPARK-11154:
--

Hi [~tgraves] - this overlaps to some extent with 
[SPARK-3374|https://issues.apache.org/jira/browse/SPARK-3374] and 
[SPARK-4408|https://issues.apache.org/jira/browse/SPARK-4408].

I would agree with [~srowen] that 2.0 provides an opportunity for housekeeping 
/ consolidation, and there is scope to clean up the config / args for cluster 
and client modes.

I would prefer not to create new configs and would rather stick with the 
current naming and support k | m | g modifiers, unless somebody has a strong 
view on what to rename the existing configs.
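
A minimal, self-contained sketch of what accepting those modifiers could look like 
(this is not Spark's actual parsing code, and normalizing to megabytes is only an 
assumption for illustration):

{code}
// Parse "384", "384m", or "2g" into megabytes, keeping bare numbers
// backwards-compatible with the current megabyte interpretation.
def memoryOverheadAsMb(s: String): Long = {
  val v = s.trim.toLowerCase
  if (v.endsWith("g")) v.dropRight(1).trim.toLong * 1024L
  else if (v.endsWith("m")) v.dropRight(1).trim.toLong
  else if (v.endsWith("k")) math.max(1L, v.dropRight(1).trim.toLong / 1024L)
  else v.toLong
}
{code}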

> make specificaition spark.yarn.executor.memoryOverhead consistent with 
> typical JVM options
> --
>
> Key: SPARK-11154
> URL: https://issues.apache.org/jira/browse/SPARK-11154
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Submit
>Reporter: Dustin Cote
>Priority: Minor
>
> spark.yarn.executor.memoryOverhead is currently specified in megabytes by 
> default, but it would be nice to allow users to specify the size as though it 
> were a typical -Xmx option to a JVM where you can have 'm' and 'g' appended 
> to the end to explicitly specify megabytes or gigabytes.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer

2015-11-11 Thread Akshat Aranya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001320#comment-15001320
 ] 

Akshat Aranya commented on SPARK-7708:
--

Kryo 3.x will resolve this issue, but it can't be used (yet) because Spark 
relies on Chill and Chill is pegged to Kryo 2.2.1. 
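
For context, a minimal sketch of the configuration under test, assuming the 
{{spark.closure.serializer}} property as documented for Spark 1.x (not a claim 
about the fix):

{code}
import org.apache.spark.SparkConf

// Switch the closure serializer from the default JavaSerializer to Kryo.
val conf = new SparkConf()
  .setAppName("kryo-closure-test")
  .set("spark.closure.serializer", "org.apache.spark.serializer.KryoSerializer")
{code}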

> Incorrect task serialization with Kryo closure serializer
> -
>
> Key: SPARK-7708
> URL: https://issues.apache.org/jira/browse/SPARK-7708
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.2
>Reporter: Akshat Aranya
>
> I've been investigating the use of Kryo for closure serialization with Spark 
> 1.2, and it seems like I've hit upon a bug:
> When a task is serialized before scheduling, the following log message is 
> generated:
> [info] o.a.s.s.TaskSetManager - Starting task 124.1 in stage 0.0 (TID 342, 
> , PROCESS_LOCAL, 302 bytes)
> This message comes from TaskSetManager, which serializes the task using the 
> closure serializer.  Before the message is sent out, the TaskDescription 
> (which includes the original task as a byte array) is serialized again into 
> a byte array with the closure serializer.  I added a log message for this in 
> CoarseGrainedSchedulerBackend, which produces the following output:
> [info] o.a.s.s.c.CoarseGrainedSchedulerBackend - 124.1 size=132
> The serialized size of the TaskDescription (132 bytes) turns out to be _smaller_ 
> than the serialized task it contains (302 bytes). This implies that 
> TaskDescription.buffer is not getting serialized correctly.
> On the executor side, the deserialization produces a null value for 
> TaskDescription.buffer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11621) ORC filter pushdown not working properly after new unhandled filter interface.

2015-11-11 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001327#comment-15001327
 ] 

Hyukjin Kwon commented on SPARK-11621:
--

This is a duplicate of (or rather a subset of) 
https://issues.apache.org/jira/browse/SPARK-11661.

> ORC filter pushdown not working properly after new unhandled filter interface.
> --
>
> Key: SPARK-11621
> URL: https://issues.apache.org/jira/browse/SPARK-11621
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Hyukjin Kwon
>
> After the new interface for excluding predicate-pushdown filters that are 
> already processed at the datasource level 
> (https://github.com/apache/spark/pull/9399), filters are no longer pushed down 
> for ORC.
> This is because at {{DataSourceStrategy}}, all the filters are treated as 
> unhandled filters.
> Also, since ORC does not support filtering fully record by record but instead 
> returns rough results, the filters for ORC should not go to unhandled 
> filters.
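
For context, a minimal sketch of the {{unhandledFilters}} hook involved here 
(an illustrative relation, not the ORC implementation):

{code}
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter}

// Filters returned from unhandledFilters are re-evaluated by Spark SQL in a
// Filter operator; per SPARK-11661 they should still be pushed to the source
// as a best-effort pre-filter.
abstract class CoarseFilteringRelation extends BaseRelation {
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot(_.isInstanceOf[EqualTo]) // claim only exact equality as fully handled
}
{code}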



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11621) ORC filter pushdown not working properly after new unhandled filter interface.

2015-11-11 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001327#comment-15001327
 ] 

Hyukjin Kwon edited comment on SPARK-11621 at 11/11/15 11:30 PM:
-

This is a duplicate of (or rather a subset of) 
https://issues.apache.org/jira/browse/SPARK-11661.


was (Author: hyukjin.kwon):
This is a duplicate (rather a sebset of) 
ofhttps://issues.apache.org/jira/browse/SPARK-11661.

> ORC filter pushdown not working properly after new unhandled filter interface.
> --
>
> Key: SPARK-11621
> URL: https://issues.apache.org/jira/browse/SPARK-11621
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Hyukjin Kwon
>
> After the new interface for excluding predicate-pushdown filters that are 
> already processed at the datasource level 
> (https://github.com/apache/spark/pull/9399), filters are no longer pushed down 
> for ORC.
> This is because at {{DataSourceStrategy}}, all the filters are treated as 
> unhandled filters.
> Also, since ORC does not support filtering fully record by record but instead 
> returns rough results, the filters for ORC should not go to unhandled 
> filters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11674) Word2Vec code failed compile in Scala 2.11

2015-11-11 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-11674:
-

 Summary: Word2Vec code failed compile in Scala 2.11
 Key: SPARK-11674
 URL: https://issues.apache.org/jira/browse/SPARK-11674
 Project: Spark
  Issue Type: Bug
  Components: ML
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical


https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/job/Spark-Master-Scala211-Compile/2007/consoleFull

{code}
[error] [warn] 
/home/jenkins/workspace/Spark-Master-Scala211-Compile/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala:149:
 no valid targets for annotation on value wordVectors - it is discarded unused. 
You may specify targets with meta-annotations, e.g. @(transient @param)
[error] [warn] @transient wordVectors: feature.Word2VecModel)
[error] [warn] 
{code}
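
For illustration, a minimal sketch of the meta-annotation pattern the warning 
points at (a made-up class, not the actual patch):

{code}
import scala.annotation.meta.field

// With a meta-annotation target, @transient is attached to the generated field
// instead of being discarded on the constructor parameter.
class Holder(@(transient @field) val wordVectors: AnyRef) extends Serializable
{code}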



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11674) Word2Vec code failed compile in Scala 2.11

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11674:


Assignee: Xiangrui Meng  (was: Apache Spark)

> Word2Vec code failed compile in Scala 2.11
> --
>
> Key: SPARK-11674
> URL: https://issues.apache.org/jira/browse/SPARK-11674
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/job/Spark-Master-Scala211-Compile/2007/consoleFull
> {code}
> [error] [warn] 
> /home/jenkins/workspace/Spark-Master-Scala211-Compile/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala:149:
>  no valid targets for annotation on value wordVectors - it is discarded 
> unused. You may specify targets with meta-annotations, e.g. @(transient 
> @param)
> [error] [warn] @transient wordVectors: feature.Word2VecModel)
> [error] [warn] 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11674) Word2Vec code failed compile in Scala 2.11

2015-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001373#comment-15001373
 ] 

Apache Spark commented on SPARK-11674:
--

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/9644

> Word2Vec code failed compile in Scala 2.11
> --
>
> Key: SPARK-11674
> URL: https://issues.apache.org/jira/browse/SPARK-11674
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/job/Spark-Master-Scala211-Compile/2007/consoleFull
> {code}
> [error] [warn] 
> /home/jenkins/workspace/Spark-Master-Scala211-Compile/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala:149:
>  no valid targets for annotation on value wordVectors - it is discarded 
> unused. You may specify targets with meta-annotations, e.g. @(transient 
> @param)
> [error] [warn] @transient wordVectors: feature.Word2VecModel)
> [error] [warn] 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11674) Word2Vec code failed compile in Scala 2.11

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11674:


Assignee: Apache Spark  (was: Xiangrui Meng)

> Word2Vec code failed compile in Scala 2.11
> --
>
> Key: SPARK-11674
> URL: https://issues.apache.org/jira/browse/SPARK-11674
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>Priority: Critical
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/job/Spark-Master-Scala211-Compile/2007/consoleFull
> {code}
> [error] [warn] 
> /home/jenkins/workspace/Spark-Master-Scala211-Compile/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala:149:
>  no valid targets for annotation on value wordVectors - it is discarded 
> unused. You may specify targets with meta-annotations, e.g. @(transient 
> @param)
> [error] [warn] @transient wordVectors: feature.Word2VecModel)
> [error] [warn] 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11675) Remove shuffle hash joins

2015-11-11 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-11675:
---

 Summary: Remove shuffle hash joins
 Key: SPARK-11675
 URL: https://issues.apache.org/jira/browse/SPARK-11675
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


They are off by default. I think we should just standardize on sort merge join 
for large joins for now, and create better implementations of hash joins if 
needed in the future.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11675) Remove shuffle hash joins

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11675:


Assignee: Apache Spark  (was: Reynold Xin)

> Remove shuffle hash joins
> -
>
> Key: SPARK-11675
> URL: https://issues.apache.org/jira/browse/SPARK-11675
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> They are off by default. I think we should just standardize on sort merge 
> join for large joins for now, and create better implementations of hash joins 
> if needed in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11675) Remove shuffle hash joins

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11675:


Assignee: Reynold Xin  (was: Apache Spark)

> Remove shuffle hash joins
> -
>
> Key: SPARK-11675
> URL: https://issues.apache.org/jira/browse/SPARK-11675
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> They are off by default. I think we should just standardize on sort merge 
> join for large joins for now, and create better implementations of hash joins 
> if needed in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11675) Remove shuffle hash joins

2015-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001383#comment-15001383
 ] 

Apache Spark commented on SPARK-11675:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/9645

> Remove shuffle hash joins
> -
>
> Key: SPARK-11675
> URL: https://issues.apache.org/jira/browse/SPARK-11675
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> They are off by default. I think we should just standardize on sort merge 
> join for large joins for now, and create better implementations of hash joins 
> if needed in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11676) Parquet filter tests all pass if filters are not really pushed down

2015-11-11 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-11676:


 Summary: Parquet filter tests all pass if filters are not really 
pushed down
 Key: SPARK-11676
 URL: https://issues.apache.org/jira/browse/SPARK-11676
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 1.6.0
Reporter: Hyukjin Kwon
Priority: Critical


All the tests in {{ParquetFilterSuite}} pass even if the filters are not 
actually pushed down.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11677) ORC filter tests all pass if filters are actually not pushed down.

2015-11-11 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-11677:


 Summary: ORC filter tests all pass if filters are actually not 
pushed down.
 Key: SPARK-11677
 URL: https://issues.apache.org/jira/browse/SPARK-11677
 Project: Spark
  Issue Type: Test
Affects Versions: 1.6.0
Reporter: Hyukjin Kwon
Priority: Critical


Several ORC filter tests just pass even if the filters are actually not pushed 
down.

Maybe it needs some more separate tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11676) Parquet filter tests all pass if filters are not really pushed down

2015-11-11 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001393#comment-15001393
 ] 

Hyukjin Kwon commented on SPARK-11676:
--

I will work on this.


> Parquet filter tests all pass if filters are not really pushed down
> ---
>
> Key: SPARK-11676
> URL: https://issues.apache.org/jira/browse/SPARK-11676
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Hyukjin Kwon
>Priority: Critical
>
> All the tests in {{ParquetFilterSuite}} pass even if the filters are not 
> actually pushed down.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11677) ORC filter tests all pass if filters are actually not pushed down.

2015-11-11 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001391#comment-15001391
 ] 

Hyukjin Kwon commented on SPARK-11677:
--

I will work on this.


> ORC filter tests all pass if filters are actually not pushed down.
> --
>
> Key: SPARK-11677
> URL: https://issues.apache.org/jira/browse/SPARK-11677
> Project: Spark
>  Issue Type: Test
>Affects Versions: 1.6.0
>Reporter: Hyukjin Kwon
>Priority: Critical
>
> Several ORC filter tests just pass even if the filters are actually not 
> pushed down.
> Maybe it needs some more separate tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8992) Add Pivot functionality to Spark SQL

2015-11-11 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-8992.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 7841
[https://github.com/apache/spark/pull/7841]

> Add Pivot functionality to Spark SQL
> 
>
> Key: SPARK-8992
> URL: https://issues.apache.org/jira/browse/SPARK-8992
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Richard Williamson
> Fix For: 1.6.0
>
>
> SQLServer and other databases provide capabilities to transpose data from 
> columns to rows and rows to columns; the latter is very useful for analytic 
> use cases like turning rows into feature columns for an MLlib model. Here is 
> a reference for the SQLServer implementation: 
> http://sqlhints.com/2014/03/10/pivot-and-unpivot-in-sql-server/
> Additional Referenced Request:
> http://stackoverflow.com/questions/30244910/pivot-spark-dataframe
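
For reference, a minimal sketch of the DataFrame-side usage added in 1.6 
(assuming an existing DataFrame {{ordersDf}} with illustrative column names):

{code}
import org.apache.spark.sql.functions.sum

// Turn the distinct values of "productCode" into columns, aggregating "qty" per customer.
val pivoted = ordersDf
  .groupBy("customerId")
  .pivot("productCode")
  .agg(sum("qty"))
{code}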



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8992) Add Pivot functionality to Spark SQL

2015-11-11 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-8992:

Assignee: Andrew Ray

> Add Pivot functionality to Spark SQL
> 
>
> Key: SPARK-8992
> URL: https://issues.apache.org/jira/browse/SPARK-8992
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Richard Williamson
>Assignee: Andrew Ray
> Fix For: 1.6.0
>
>
> SQLServer and other databases provide capabilities to transpose data from 
> columns to rows and rows to columns; the latter is very useful for analytic 
> use cases like turning rows into feature columns for an MLlib model. Here is 
> a reference for the SQLServer implementation: 
> http://sqlhints.com/2014/03/10/pivot-and-unpivot-in-sql-server/
> Additional Referenced Request:
> http://stackoverflow.com/questions/30244910/pivot-spark-dataframe



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11677) ORC filter tests all pass if filters are actually not pushed down.

2015-11-11 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-11677:
-
Component/s: SQL

> ORC filter tests all pass if filters are actually not pushed down.
> --
>
> Key: SPARK-11677
> URL: https://issues.apache.org/jira/browse/SPARK-11677
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Hyukjin Kwon
>Priority: Critical
>
> Several ORC filter tests just pass even if the filters are actually not 
> pushed down.
> Maybe it needs some more separate tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10848) Applied JSON Schema Works for json RDD but not when loading json file

2015-11-11 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001404#comment-15001404
 ] 

Xin Wu commented on SPARK-10848:


I can recreate this as described. Looking into the code. 

> Applied JSON Schema Works for json RDD but not when loading json file
> -
>
> Key: SPARK-10848
> URL: https://issues.apache.org/jira/browse/SPARK-10848
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Miklos Christine
>Priority: Minor
>
> Using a defined schema to load a JSON RDD works as expected. Loading the JSON 
> records from a file does not apply the supplied schema; mainly, the nullable 
> field isn't applied correctly. Loading from a file uses nullable=true on all 
> fields regardless of the applied schema. 
> Code to reproduce:
> {code}
> import  org.apache.spark.sql.types._
> val jsonRdd = sc.parallelize(List(
>   """{"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16", 
> "ProductCode": "WQT648", "Qty": 5}""",
>   """{"OrderID": 2, "CustomerID":16  , "OrderDate": "2015-07-11", 
> "ProductCode": "LG4-Z5", "Qty": 10, "Discount":0.25, 
> "expressDelivery":true}"""))
> val mySchema = StructType(Array(
>   StructField(name="OrderID"   , dataType=LongType, nullable=false),
>   StructField("CustomerID", IntegerType, false),
>   StructField("OrderDate", DateType, false),
>   StructField("ProductCode", StringType, false),
>   StructField("Qty", IntegerType, false),
>   StructField("Discount", FloatType, true),
>   StructField("expressDelivery", BooleanType, true)
> ))
> val myDF = sqlContext.read.schema(mySchema).json(jsonRdd)
> val schema1 = myDF.printSchema
> val dfDFfromFile = sqlContext.read.schema(mySchema).json("Orders.json")
> val schema2 = dfDFfromFile.printSchema
> {code}
> Orders.json
> {code}
> {"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16", "ProductCode": 
> "WQT648", "Qty": 5}
> {"OrderID": 2, "CustomerID":16  , "OrderDate": "2015-07-11", "ProductCode": 
> "LG4-Z5", "Qty": 10, "Discount":0.25, "expressDelivery":true}
> {code}
> The behavior should be consistent. 
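
One way to see the discrepancy is to compare the nullability flags directly, 
reusing the {{myDF}} and {{dfDFfromFile}} frames from the reproduction above:

{code}
// With the reported bug, the file-based frame shows nullable=true even for
// fields declared non-nullable in mySchema.
myDF.schema.fields.foreach(f => println(s"rdd  ${f.name}: nullable=${f.nullable}"))
dfDFfromFile.schema.fields.foreach(f => println(s"file ${f.name}: nullable=${f.nullable}"))
{code}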



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11598) Add tests for ShuffledHashOuterJoin

2015-11-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-11598:

Fix Version/s: (was: 1.7.0)
   1.6.0

> Add tests for ShuffledHashOuterJoin
> ---
>
> Key: SPARK-11598
> URL: https://issues.apache.org/jira/browse/SPARK-11598
> Project: Spark
>  Issue Type: Test
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 1.6.0
>
>
> We only test the default algorithm (SortMergeOuterJoin) for outer joins; 
> ShuffledHashOuterJoin is not well tested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11583) Make MapStatus use less memory usage

2015-11-11 Thread Kent Yao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001420#comment-15001420
 ] 

Kent Yao commented on SPARK-11583:
--

Are these methods {noformat} r.add(0,20) {noformat} and 
{noformat}runOptimize {noformat} unavailable in the current version of Roaring that 
Spark uses?

According to my test, I would say Roaring is better than BitSet, but it is not 
being used well enough...

For most Spark tasks, dense cases may be the usual ones.

For those sparse cases, we may use Roaring with {noformat}runOptimize {noformat} 
or just track the non-empty blocks.

> Make MapStatus use less memory usage
> ---
>
> Key: SPARK-11583
> URL: https://issues.apache.org/jira/browse/SPARK-11583
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Reporter: Kent Yao
>
> In the resolved issue https://issues.apache.org/jira/browse/SPARK-11271, as I 
> said, using BitSet can save ≈20% memory usage compared to RoaringBitMap. 
> For a Spark job that contains quite a lot of tasks, 20% seems a drop in the ocean. 
> Essentially, BitSet uses long[]. For example, a BitSet[200k] = long[3125].
> So we can use a HashSet[Int] to store reduceIds (when non-empty blocks are 
> dense, store the reduceIds of empty blocks; when sparse, store the non-empty ones). 
> For dense cases: if HashSet[Int](numNonEmptyBlocks).size < 
> BitSet[totalBlockNum], I use MapStatusTrackingNoEmptyBlocks
> For sparse cases: if HashSet[Int](numEmptyBlocks).size < 
> BitSet[totalBlockNum], I use MapStatusTrackingEmptyBlocks
> sparse case, 299/300 are empty
> sc.makeRDD(1 to 3, 3000).groupBy(x=>x).top(5)
> dense case, no block is empty
> sc.makeRDD(1 to 900, 3000).groupBy(x=>x).top(5)
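
A rough back-of-the-envelope sketch of the trade-off described above (byte 
counts are approximate and the per-entry HashSet cost is an assumption):

{code}
// A BitSet over N blocks needs ceil(N / 64) longs, roughly N / 8 bytes:
// e.g. 200,000 blocks -> 3,125 longs -> ~25 KB, regardless of how many blocks are empty.
def bitSetBytes(totalBlocks: Int): Long = math.ceil(totalBlocks / 64.0).toLong * 8L

// A HashSet[Int] costs tens of bytes per stored id (~32 bytes/entry assumed here),
// so it only wins when the tracked side (empty or non-empty blocks) is small.
def hashSetBytes(trackedIds: Int, bytesPerEntry: Int = 32): Long = trackedIds.toLong * bytesPerEntry

// bitSetBytes(200000) == 25000L; hashSetBytes(300) == 9600L -> HashSet wins for sparse maps
{code}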



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10865) [Spark SQL] [UDF] the ceil/ceiling function got wrong return value type

2015-11-11 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001422#comment-15001422
 ] 

Cheng Hao commented on SPARK-10865:
---

We actually follow Hive's behavior here; I also tested it in MySQL and it works 
in the same way. 

> [Spark SQL] [UDF] the ceil/ceiling function got wrong return value type
> ---
>
> Key: SPARK-10865
> URL: https://issues.apache.org/jira/browse/SPARK-10865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Cheng Hao
> Fix For: 1.6.0
>
>
> As per the ceil/ceiling definition, it should return a BIGINT value:
> -ceil(DOUBLE a), ceiling(DOUBLE a)
> -Returns the minimum BIGINT value that is equal to or greater than a.
> But the current Spark implementation returns the wrong value type.
> e.g., 
> select ceil(2642.12) from udf_test_web_sales limit 1;
> 2643.0
> The Hive implementation returns the value type like below:
> hive> select ceil(2642.12) from udf_test_web_sales limit 1;
> OK
> 2643



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10865) [Spark SQL] [UDF] the ceil/ceiling function got wrong return value type

2015-11-11 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001423#comment-15001423
 ] 

Cheng Hao commented on SPARK-10865:
---

1.5.2 has been released; I am not sure whether this is part of it now or not.

> [Spark SQL] [UDF] the ceil/ceiling function got wrong return value type
> ---
>
> Key: SPARK-10865
> URL: https://issues.apache.org/jira/browse/SPARK-10865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Cheng Hao
> Fix For: 1.6.0
>
>
> As per the ceil/ceiling definition, it should return a BIGINT value:
> -ceil(DOUBLE a), ceiling(DOUBLE a)
> -Returns the minimum BIGINT value that is equal to or greater than a.
> But the current Spark implementation returns the wrong value type.
> e.g., 
> select ceil(2642.12) from udf_test_web_sales limit 1;
> 2643.0
> The Hive implementation returns the value type like below:
> hive> select ceil(2642.12) from udf_test_web_sales limit 1;
> OK
> 2643



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11661) We should still pushdown filters returned by a data source's unhandledFilters

2015-11-11 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai reassigned SPARK-11661:


Assignee: Yin Huai

> We should still pushdown filters returned by a data source's unhandledFilters
> -
>
> Key: SPARK-11661
> URL: https://issues.apache.org/jira/browse/SPARK-11661
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
>
> We added unhandledFilters interface to SPARK-10978. So, a data source has a 
> chance to let Spark SQL know that for those returned filters, it is possible 
> that the data source will not apply them to every row. So, Spark SQL should 
> use a Filter operator to evaluate those filters. However, if a filter is a 
> part of returned unhandledFilters, we should still push it down. For example, 
> our internal data sources do not override this method, if we do not push down 
> those filters, we are actually turning off the filter pushdown feature.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11583) Make MapStatus use less memory usage

2015-11-11 Thread Kent Yao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001420#comment-15001420
 ] 

Kent Yao edited comment on SPARK-11583 at 11/12/15 12:48 AM:
-

Are these methods *r.add(0,20)* and *runOptimize* unavailable in the current 
version of Roaring that Spark uses?

According to my test, I would say Roaring is better than BitSet, but maybe it is 
not used well enough in Spark currently...

For most Spark tasks, dense cases may be the usual ones.

For those sparse cases, we may use Roaring with *runOptimize* or just track 
the non-empty blocks.


was (Author: qin yao):
Are these methods {noformat} r.add(0,20) {noformat} and 
{noformat}runOptimize {noformat} unavailable in current version of Roaring that 
spark uses?

Accord to my test, i would like to say Roaring is better than BitSet, but not 
be used good enough...

To most of spark tasks, dense cases may be usual.

To those sparse cases, we may use Roaring with {noformat}runOptimize {noformat} 
or just track those non-empty block.

> Make MapStatus use less memory usage
> ---
>
> Key: SPARK-11583
> URL: https://issues.apache.org/jira/browse/SPARK-11583
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Reporter: Kent Yao
>
> In the resolved issue https://issues.apache.org/jira/browse/SPARK-11271, as I 
> said, using BitSet can save ≈20% memory usage compared to RoaringBitMap. 
> For a Spark job that contains quite a lot of tasks, 20% seems a drop in the ocean. 
> Essentially, BitSet uses long[]. For example, a BitSet[200k] = long[3125].
> So we can use a HashSet[Int] to store reduceIds (when non-empty blocks are 
> dense, store the reduceIds of empty blocks; when sparse, store the non-empty ones). 
> For dense cases: if HashSet[Int](numNonEmptyBlocks).size < 
> BitSet[totalBlockNum], I use MapStatusTrackingNoEmptyBlocks
> For sparse cases: if HashSet[Int](numEmptyBlocks).size < 
> BitSet[totalBlockNum], I use MapStatusTrackingEmptyBlocks
> sparse case, 299/300 are empty
> sc.makeRDD(1 to 3, 3000).groupBy(x=>x).top(5)
> dense case, no block is empty
> sc.makeRDD(1 to 900, 3000).groupBy(x=>x).top(5)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11678) Partition discovery fail if there is a _SUCCESS file in the table's root dir

2015-11-11 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai reassigned SPARK-11678:


Assignee: Yin Huai

> Partition discovery fail if there is a _SUCCESS file in the table's root dir
> 
>
> Key: SPARK-11678
> URL: https://issues.apache.org/jira/browse/SPARK-11678
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11678) Partition discovery fail if there is a _SUCCESS file in the table's root dir

2015-11-11 Thread Yin Huai (JIRA)
Yin Huai created SPARK-11678:


 Summary: Partition discovery fail if there is a _SUCCESS file in 
the table's root dir
 Key: SPARK-11678
 URL: https://issues.apache.org/jira/browse/SPARK-11678
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Priority: Blocker






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


