[jira] [Assigned] (SPARK-16771) Infinite recursion loop in org.apache.spark.sql.catalyst.trees.TreeNode when table name collides.

2016-07-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16771:


Assignee: (was: Apache Spark)

> Infinite recursion loop in org.apache.spark.sql.catalyst.trees.TreeNode when 
> table name collides.
> -
>
> Key: SPARK-16771
> URL: https://issues.apache.org/jira/browse/SPARK-16771
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Furcy Pin
>
> How to reproduce: 
> In spark-sql on Hive
> {code}
> DROP TABLE IF EXISTS t1 ;
> CREATE TABLE test.t1(col1 string) ;
> WITH t1 AS (
>   SELECT  col1
>   FROM t1 
> )
> SELECT col1
> FROM t1
> LIMIT 2
> ;
> {code}
> This makes a nice StackOverflowError:
> {code}
> java.lang.StackOverflowError
>   at 
> scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:230)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:348)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:170)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:170)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$4.apply(QueryPlan.scala:175)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:175)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:144)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CTESubstitution$$anonfun$substituteCTE$1.applyOrElse(Analyzer.scala:147)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CTESubstitution$$anonfun$substituteCTE$1.applyOrElse(Analyzer.scala:133)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$2.apply(TreeNode.scala:286)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$2.apply(TreeNode.scala:286)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:286)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
> ...
> {code}
> This does not happen if I change the name of the CTE.
> I guess Catalyst gets caught in an infinite recursion loop because the CTE and 
> the source 

[jira] [Commented] (SPARK-16771) Infinite recursion loop in org.apache.spark.sql.catalyst.trees.TreeNode when table name collides.

2016-07-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398714#comment-15398714
 ] 

Apache Spark commented on SPARK-16771:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/14397


[jira] [Assigned] (SPARK-16771) Infinite recursion loop in org.apache.spark.sql.catalyst.trees.TreeNode when table name collides.

2016-07-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16771:


Assignee: Apache Spark


[jira] [Assigned] (SPARK-16787) SparkContext.addFile() should not fail if called twice with the same file

2016-07-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16787:


Assignee: Apache Spark  (was: Josh Rosen)

> SparkContext.addFile() should not fail if called twice with the same file
> -
>
> Key: SPARK-16787
> URL: https://issues.apache.org/jira/browse/SPARK-16787
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> The behavior of SparkContext.addFile() changed slightly with the introduction 
> of the Netty-RPC-based file server, which was introduced in Spark 1.6 (where 
> it was disabled by default) and became the default / only file server in 
> Spark 2.0.0.
> Prior to 2.0, calling SparkContext.addFile() twice with the same path would 
> succeed and would cause future tasks to receive an updated copy of the file. 
> This behavior was never explicitly documented but Spark has behaved this way 
> since very early 1.x versions (some of the relevant lines in 
> Executor.updateDependencies() have existed since 2012).
> In 2.0 (or 1.6 with the Netty file server enabled), the second addFile() call 
> will fail with a requirement error because NettyStreamManager tries to guard 
> against duplicate file registration.
> I believe that this change of behavior was unintentional and propose to 
> remove the {{require}} check so that Spark 2.0 matches 1.x's default behavior.
> This problem also affects addJar() in a more subtle way: the 
> fileServer.addJar() call will also fail with an exception but that exception 
> is logged and ignored due to some code which was added in 2014 in order to 
> ignore errors caused by missing Spark examples JARs when running on YARN 
> cluster mode (AFAIK).
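Below is a toy Scala sketch, not the actual NettyStreamManager code, of the behaviour the ticket proposes: re-registering a file under the same name simply replaces the previous entry instead of tripping a require() check. The class and method names are invented for illustration.

{code}
import java.io.File
import java.util.concurrent.ConcurrentHashMap

// Hypothetical file registry: addFile() is idempotent, so calling it twice
// with the same file name overwrites the earlier entry rather than throwing.
class FileRegistry {
  private val files = new ConcurrentHashMap[String, File]()

  def addFile(file: File): String = {
    files.put(file.getName, file) // previously a require(existing == null, ...) would fail here
    s"/files/${file.getName}"
  }
}
{code}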






[jira] [Commented] (SPARK-16787) SparkContext.addFile() should not fail if called twice with the same file

2016-07-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398709#comment-15398709
 ] 

Apache Spark commented on SPARK-16787:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/14396




[jira] [Assigned] (SPARK-16787) SparkContext.addFile() should not fail if called twice with the same file

2016-07-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16787:


Assignee: Josh Rosen  (was: Apache Spark)




[jira] [Commented] (SPARK-16771) Infinite recursion loop in org.apache.spark.sql.catalyst.trees.TreeNode when table name collides.

2016-07-28 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398701#comment-15398701
 ] 

Dongjoon Hyun commented on SPARK-16771:
---

Hi, [~fpin]. You're right. I can reproduce it with the following.

{code}
spark.range(10).createOrReplaceTempView("t")
sql("WITH t AS (SELECT 1 FROM t) SELECT * FROM t")
{code}

{code}
spark.range(10).createOrReplaceTempView("t1")
spark.range(10).createOrReplaceTempView("t2")
sql("WITH t1 AS (SELECT 1 FROM t2), t2 AS (SELECT 1 FROM t1) SELECT * FROM t1, 
t2")
{code}

I'll make a PR soon.
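As a self-contained toy sketch of the idea behind avoiding the loop (none of these classes are Catalyst code; they are simplified stand-ins), the substitution can be applied exactly once per reference, without recursing into the definition that was just inserted, so a CTE whose name collides with a real table never resolves to itself:

{code}
// Toy plan model, not Spark classes.
sealed trait Plan
case class Table(name: String) extends Plan
case class Select(child: Plan) extends Plan

// Substitute CTE references in one pass and do not recurse into the inserted
// definition, so WITH t1 AS (SELECT ... FROM t1) keeps the inner t1 as a real
// table reference instead of expanding forever.
def substituteOnce(plan: Plan, ctes: Map[String, Plan]): Plan = plan match {
  case Table(name) if ctes.contains(name) => ctes(name)
  case t: Table                           => t
  case Select(child)                      => Select(substituteOnce(child, ctes))
}

val ctes = Map("t1" -> Select(Table("t1")))               // WITH t1 AS (SELECT ... FROM t1)
val resolved = substituteOnce(Select(Table("t1")), ctes)  // == Select(Select(Table("t1"))), terminates
{code}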


[jira] [Commented] (SPARK-15694) Implement ScriptTransformation in sql/core

2016-07-28 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398694#comment-15398694
 ] 

Tejas Patil commented on SPARK-15694:
-

I spent some time over the last weekend working on this (WIP: 
https://github.com/tejasapatil/spark/commit/71c2596a1929a890c6e6f3f0d956557fadbf2403).
You are right that some parts of Hive need to be copied into Spark. To begin 
with, I plan to have a patch without that, i.e. have it working without a custom 
serde and record reader/writer, and add those in a follow-up patch.

> Implement ScriptTransformation in sql/core
> --
>
> Key: SPARK-15694
> URL: https://issues.apache.org/jira/browse/SPARK-15694
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> ScriptTransformation currently relies on Hive internals. It'd be great if we 
> can implement a native ScriptTransformation in sql/core module to remove the 
> extra Hive dependency here.
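For readers unfamiliar with the feature, here is a hedged illustration of Hive-style script transformation as it can be invoked today; it assumes a SparkSession named spark built with enableHiveSupport() and a placeholder table some_table. The ticket is about supporting this path natively in sql/core without the Hive classes.

{code}
// TRANSFORM pipes each row through an external command; 'cat' simply echoes
// the input back, which is enough to exercise the serialization path.
spark.sql("""
  SELECT TRANSFORM (col1)
  USING 'cat' AS (col1)
  FROM some_table
""").show()
{code}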






[jira] [Commented] (SPARK-16646) LEAST doesn't accept numeric arguments with different data types

2016-07-28 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398650#comment-15398650
 ] 

Hyukjin Kwon commented on SPARK-16646:
--

Sure, I will close the PR for now. Please update me after that.

> LEAST doesn't accept numeric arguments with different data types
> 
>
> Key: SPARK-16646
> URL: https://issues.apache.org/jira/browse/SPARK-16646
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Hyukjin Kwon
>
> {code:sql}
> SELECT LEAST(1, 1.5);
> {code}
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: cannot resolve 'least(1, 
> CAST(2.1 AS DECIMAL(2,1)))' due to data type mismatch: The expressions should 
> all have the same type, got LEAST (ArrayBuffer(IntegerType, 
> DecimalType(2,1))).; line 1 pos 7 (state=,code=0)
> {noformat}
> This query works for 1.6.
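Until the coercion rule is settled, one workaround sketch (assuming a SparkSession named spark) is to cast the arguments to a common type yourself so that LEAST sees identical input types:

{code}
// Both arguments are DECIMAL(2,1) after the explicit cast, so the
// same-type requirement in Least is satisfied.
spark.sql("SELECT LEAST(CAST(1 AS DECIMAL(2,1)), 1.5)").show()
{code}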






[jira] [Commented] (SPARK-16788) Investigate JSR-310 & scala-time alternatives to our own datetime utils

2016-07-28 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398632#comment-15398632
 ] 

holdenk commented on SPARK-16788:
-

cc [~davies] [~ckadner] :)

> Investigate JSR-310 & scala-time alternatives to our own datetime utils
> ---
>
> Key: SPARK-16788
> URL: https://issues.apache.org/jira/browse/SPARK-16788
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: holdenk
>
> Our own timezone handling code is esoteric and may have some bugs. We should 
> investigate replacing the logic with a more common library like scala-date or 
> directly using JSR-310 (included in Java 8+, and also available as a backport 
> on Maven through http://www.threeten.org/).






[jira] [Created] (SPARK-16789) Can't run saveAsTable with database name

2016-07-28 Thread SonixLegend (JIRA)
SonixLegend created SPARK-16789:
---

 Summary: Can't run saveAsTable with database name
 Key: SPARK-16789
 URL: https://issues.apache.org/jira/browse/SPARK-16789
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
 Environment: CentOS 7 JDK 1.8 Hive 1.2.1
Reporter: SonixLegend


The function "saveAsTable" with database and table name is running via 1.6.2 
successfully. But when I upgrade 2.0, it's got the error. There are my code and 
error message. Can you help me?

conf/hive-site.xml:

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://localhost:9083</value>
</property>

val spark = 
SparkSession.builder().appName("SparkHive").enableHiveSupport().getOrCreate()

import spark.implicits._
import spark.sql

val source = sql("select * from sample.sample")

source.createOrReplaceTempView("test")
source.collect.foreach{tuple => println(tuple(0) + ":" + tuple(1))}

val target = sql("select key, 'Spark' as value from test")

println(target.count())

target.write.mode(SaveMode.Append).saveAsTable("sample.sample")

spark.stop()

Exception in thread "main" org.apache.spark.sql.AnalysisException: Saving data 
in MetastoreRelation sample, sample
 is not supported.;
at 
org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:218)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
at 
org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:378)
at 
org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:354)
at com.newtouch.sample.SparkHive$.main(SparkHive.scala:25)
at com.newtouch.sample.SparkHive.main(SparkHive.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
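One possible workaround sketch, not verified against this exact setup: since the table sample.sample already exists in the metastore, append to it with insertInto, which resolves the database-qualified name through the session catalog. insertInto is position-based, so the DataFrame's column order must match the table's.

{code}
// Append into the existing Hive table by position instead of saveAsTable.
target.write.mode("append").insertInto("sample.sample")
{code}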






[jira] [Created] (SPARK-16788) Investigate JSR-310 & scala-time alternatives to our own datetime utils

2016-07-28 Thread holdenk (JIRA)
holdenk created SPARK-16788:
---

 Summary: Investigate JSR-310 & scala-time alternatives to our own 
datetime utils
 Key: SPARK-16788
 URL: https://issues.apache.org/jira/browse/SPARK-16788
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: holdenk


Our own timezone handling code is esoteric and may have some bugs. We should 
investigate replacing the logic with a more common library like scala-date or 
directly using JSR-310 (included in Java 8+, and also available as a backport on 
Maven through http://www.threeten.org/).
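As a hedged illustration of what JSR-310 offers (plain java.time calls, not Spark code; the zone and epoch values below are arbitrary examples), the kind of conversion currently hand-rolled in Spark's datetime utils looks like this:

{code}
import java.time.{Instant, LocalDate, ZoneId}

val zone = ZoneId.of("America/Los_Angeles")              // arbitrary example zone
val micros = 1469750400000000L                            // microseconds since the epoch
val instant = Instant.ofEpochSecond(micros / 1000000L, (micros % 1000000L) * 1000L)
val localDate: LocalDate = instant.atZone(zone).toLocalDate
val daysSinceEpoch: Long = localDate.toEpochDay           // days since 1970-01-01
{code}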







[jira] [Commented] (SPARK-16746) Spark streaming lost data when ReceiverTracker writes Blockinfo to hdfs timeout

2016-07-28 Thread Hongyao Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398615#comment-15398615
 ] 

Hongyao Zhao commented on SPARK-16746:
--

I did some tests yesterday. It seems that the Spark 1.6 direct API can consume 
messages from Kafka 0.9 brokers, so I can work around this problem by using the 
direct API. That is good news for me, but I think the issue I reported has 
nothing to do with what kind of receiver I use, because ReceiverTracker is an 
internal class in the Spark source code.

> Spark streaming lost data when ReceiverTracker writes Blockinfo to hdfs 
> timeout
> ---
>
> Key: SPARK-16746
> URL: https://issues.apache.org/jira/browse/SPARK-16746
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.1
>Reporter: Hongyao Zhao
>Priority: Minor
>
> I wrote a Spark Streaming program which consumes 1000 messages from one topic 
> of Kafka, does some transformation, and writes the result back to another 
> topic, but I found only 988 messages in the second topic. I checked the logs 
> and confirmed that all messages were received by the receivers, but I also 
> found an HDFS write timeout message printed from the class BatchedWriteAheadLog. 
> 
> I checked the source code and found code like this: 
>   
> {code:borderStyle=solid}
> /** Add received block. This event will get written to the write ahead 
> log (if enabled). */ 
>   def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = { 
> try { 
>   val writeResult = writeToLog(BlockAdditionEvent(receivedBlockInfo)) 
>   if (writeResult) { 
> synchronized { 
>   getReceivedBlockQueue(receivedBlockInfo.streamId) += 
> receivedBlockInfo 
> } 
> logDebug(s"Stream ${receivedBlockInfo.streamId} received " + 
>   s"block ${receivedBlockInfo.blockStoreResult.blockId}") 
>   } else { 
> logDebug(s"Failed to acknowledge stream ${receivedBlockInfo.streamId} 
> receiving " + 
>   s"block ${receivedBlockInfo.blockStoreResult.blockId} in the Write 
> Ahead Log.") 
>   } 
>   writeResult 
> } catch { 
>   case NonFatal(e) => 
> logError(s"Error adding block $receivedBlockInfo", e) 
> false 
> } 
>   } 
> {code}
> 
> It seems that ReceiverTracker tries to write the block info to HDFS, but the 
> write operation times out. This causes the writeToLog function to return false, 
> so the line "getReceivedBlockQueue(receivedBlockInfo.streamId) += 
> receivedBlockInfo" is skipped and the block info is lost. 
> The Spark version I use is 1.6.1 and I did not turn on 
> spark.streaming.receiver.writeAheadLog.enable. 
> 
> I want to know whether or not this is the designed behaviour. 
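A toy mitigation sketch in plain Scala, not Spark internals (any real change would have to live inside ReceivedBlockTracker): retry a write-ahead-log write that reports failure a few times before giving up, instead of silently dropping the block info as described above.

{code}
// Generic retry helper around a write that returns true on success.
@annotation.tailrec
def writeWithRetry(attempt: () => Boolean, retriesLeft: Int): Boolean =
  if (attempt()) true
  else if (retriesLeft <= 0) false
  else writeWithRetry(attempt, retriesLeft - 1)

// e.g. writeWithRetry(() => writeToLog(BlockAdditionEvent(receivedBlockInfo)), retriesLeft = 3)
{code}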






[jira] [Commented] (SPARK-16748) Errors thrown by UDFs cause TreeNodeException when the query has an ORDER BY clause

2016-07-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398587#comment-15398587
 ] 

Apache Spark commented on SPARK-16748:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/14395

> Errors thrown by UDFs cause TreeNodeException when the query has an ORDER BY 
> clause
> ---
>
> Key: SPARK-16748
> URL: https://issues.apache.org/jira/browse/SPARK-16748
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Tathagata Das
>
> {code}
> import org.apache.spark.sql.functions._
> val myUDF = udf((c: String) => s"""${c.take(5)}""")
> spark.sql("SELECT cast(null as string) as 
> a").select(myUDF($"a").as("b")).orderBy($"b").collect
> {code}
> {code}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
> Exchange rangepartitioning(b#345 ASC, 200)
> +- *Project [UDF(null) AS b#345]
>+- Scan OneRowRelation[]
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:113)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:233)
>   at org.apache.spark.sql.execution.SortExec.inputRDDs(SortExec.scala:113)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:361)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:272)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2187)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2187)
>   at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2545)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2187)
>   at org.apache.spark.sql.Dataset.collect(Dataset.scala:2163)
> {code}
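Independently of the fix, a workaround sketch for the snippet above is to make the UDF null-safe so it never throws on NULL input (this assumes the same spark-shell context as the reproduction):

{code}
import org.apache.spark.sql.functions.udf

// Option(...) absorbs the NULL instead of letting c.take(5) throw an NPE.
val safeUDF = udf((c: String) => Option(c).map(_.take(5)).orNull)
spark.sql("SELECT cast(null as string) as a")
  .select(safeUDF($"a").as("b"))
  .orderBy($"b")
  .collect()
{code}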






[jira] [Assigned] (SPARK-16748) Errors thrown by UDFs cause TreeNodeException when the query has an ORDER BY clause

2016-07-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16748:


Assignee: Tathagata Das  (was: Apache Spark)




[jira] [Assigned] (SPARK-16748) Errors thrown by UDFs cause TreeNodeException when the query has an ORDER BY clause

2016-07-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16748:


Assignee: Apache Spark  (was: Tathagata Das)




[jira] [Comment Edited] (SPARK-16768) pyspark calls incorrect version of logistic regression

2016-07-28 Thread Colin Beckingham (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398571#comment-15398571
 ] 

Colin Beckingham edited comment on SPARK-16768 at 7/29/16 2:08 AM:
---

This is very strange then. I can launch Spark 2.1.0 with pyspark, run "from 
pyspark.mllib.classification import LogisticRegressionWithLBFGS" and the import 
succeeds, and I can call help on the import and get a description of what it 
does. If there is no longer an LBFGS version should not the import fail with 
some warning that the command is deprecated? I see from 
http://spark.apache.org/docs/latest/mllib-optimization.html that implementation 
of LBFGS is an issue that is "being worked on". It raises the issue of whether 
the currently working version in 1.6.2 is reliable; right now running the same 
problem on both 1.6.2 and 2.1.0 produces a much faster and more accurate result 
on the former.



> pyspark calls incorrect version of logistic regression
> --
>
> Key: SPARK-16768
> URL: https://issues.apache.org/jira/browse/SPARK-16768
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
> Environment: Linux openSUSE Leap 42.1 Gnome
>Reporter: Colin Beckingham
> Fix For: 2.1.0
>
>
> PySpark call with Spark 1.6.2 "LogisticRegressionWithLBFGS.train()"  runs 
> "treeAggregate at LBFGS.scala:218" but the same command in pyspark with Spark 
> 2.1 runs "treeAggregate at LogisticRegression.scala:1092". This non-optimized 
> version is much slower and produces a different answer from LBFGS.






[jira] [Commented] (SPARK-16646) LEAST doesn't accept numeric arguments with different data types

2016-07-28 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398582#comment-15398582
 ] 

Wenchen Fan commented on SPARK-16646:
-

We are discussing this internally; can you hold it for a while? We may decide 
to increase the max precision to 76 and keep the max scale at 38, in which case 
we would not have this problem.




[jira] [Commented] (SPARK-16768) pyspark calls incorrect version of logistic regression

2016-07-28 Thread Colin Beckingham (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398571#comment-15398571
 ] 

Colin Beckingham commented on SPARK-16768:
--

This is very strange then. I can launch Spark 2.1.0 with pyspark, run "from 
pyspark.mllib.classification import LogisticRegressionWithLBFGS" and the import 
succeeds, and I can call help on the import and get a description of what it 
does. If there is no longer an LBFGS version, should the import not fail with 
some warning that the command is deprecated? I see from 
http://spark.apache.org/docs/latest/mllib-optimization.html that the 
implementation of LBFGS is an issue that is "being worked on". It raises the 
question of whether the currently working version in 1.6.2 is reliable; right 
now, running the same problem on both 1.6.2 and 2.1.0 produces a much faster and 
more accurate result on the former.




[jira] [Commented] (SPARK-16786) LDA topic distributions for new documents in PySpark

2016-07-28 Thread Jordan Beauchamp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398560#comment-15398560
 ] 

Jordan Beauchamp commented on SPARK-16786:
--

topicDistributions() has been implemented as part of this issue.

> LDA topic distributions for new documents in PySpark
> 
>
> Key: SPARK-16786
> URL: https://issues.apache.org/jira/browse/SPARK-16786
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 2.0.0
> Environment: N/A
>Reporter: Jordan Beauchamp
>Priority: Minor
>  Labels: patch
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> pyspark.mllib.clustering.LDAModel has no way to estimate the topic 
> distribution for new documents. However, this functionality exists in 
> org.apache.spark.mllib.clustering.LDAModel. This change would only require 
> setting up the API calls. I have forked the Spark repo and implemented the 
> changes locally.






[jira] [Commented] (SPARK-16786) LDA topic distributions for new documents in PySpark

2016-07-28 Thread Jordan Beauchamp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398559#comment-15398559
 ] 

Jordan Beauchamp commented on SPARK-16786:
--

One minor difficulty is that downstream LocalLDAModel implements 
topicDistributions() while DistributedLDAModel does not. 
I've chosen to throw a NotImplementedException in the DistributedLDAModel 
implementation with a warning about converting the model to Local or using the 
online optimizer.
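A hedged JVM-side sketch of the behaviour described above; the helper name is invented, and the branch that converts the distributed model is only one of the options the comment mentions:

{code}
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDAModel, LocalLDAModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Only LocalLDAModel exposes topicDistributions for new documents, so a
// distributed model must either be converted or rejected with an exception.
def topicsForNewDocs(model: LDAModel, docs: RDD[(Long, Vector)]): RDD[(Long, Vector)] = model match {
  case local: LocalLDAModel      => local.topicDistributions(docs)
  case dist: DistributedLDAModel => dist.toLocal.topicDistributions(docs)
}
{code}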




[jira] [Assigned] (SPARK-16786) LDA topic distributions for new documents in PySpark

2016-07-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16786:


Assignee: Apache Spark

> LDA topic distributions for new documents in PySpark
> 
>
> Key: SPARK-16786
> URL: https://issues.apache.org/jira/browse/SPARK-16786
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 2.0.0
> Environment: N/A
>Reporter: Jordan Beauchamp
>Assignee: Apache Spark
>Priority: Minor
>  Labels: patch
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> pyspark.mllib.clustering.LDAModel has no way to estimate the topic 
> distribution for new documents. However, this functionality exists in 
> org.apache.spark.mllib.clustering.LDAModel. This change would only require 
> setting up the API calls. I have forked the Spark repo and implemented the 
> changes locally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16786) LDA topic distributions for new documents in PySpark

2016-07-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16786:


Assignee: (was: Apache Spark)

> LDA topic distributions for new documents in PySpark
> 
>
> Key: SPARK-16786
> URL: https://issues.apache.org/jira/browse/SPARK-16786
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 2.0.0
> Environment: N/A
>Reporter: Jordan Beauchamp
>Priority: Minor
>  Labels: patch
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> pyspark.mllib.clustering.LDAModel has no way to estimate the topic 
> distribution for new documents. However, this functionality exists in 
> org.apache.spark.mllib.clustering.LDAModel. This change would only require 
> setting up the API calls. I have forked the Spark repo and implemented the 
> changes locally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16786) LDA topic distributions for new documents in PySpark

2016-07-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398558#comment-15398558
 ] 

Apache Spark commented on SPARK-16786:
--

User 'supremekai' has created a pull request for this issue:
https://github.com/apache/spark/pull/14394

> LDA topic distributions for new documents in PySpark
> 
>
> Key: SPARK-16786
> URL: https://issues.apache.org/jira/browse/SPARK-16786
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 2.0.0
> Environment: N/A
>Reporter: Jordan Beauchamp
>Priority: Minor
>  Labels: patch
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> pyspark.mllib.clustering.LDAModel has no way to estimate the topic 
> distribution for new documents. However, this functionality exists in 
> org.apache.spark.mllib.clustering.LDAModel. This change would only require 
> setting up the API calls. I have forked the Spark repo and implemented the 
> changes locally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16753) Spark SQL doesn't handle skewed dataset joins properly

2016-07-28 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398551#comment-15398551
 ] 

Reynold Xin commented on SPARK-16753:
-

Got it - it would definitely be good to support skew joins. There are a lot of 
different ways to implement this. I think Teradata had a paper on skewed joins 
too.


> Spark SQL doesn't handle skewed dataset joins properly
> --
>
> Key: SPARK-16753
> URL: https://issues.apache.org/jira/browse/SPARK-16753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 1.6.2, 2.0.0, 2.0.1
>Reporter: Jurriaan Pruis
> Attachments: screenshot-1.png
>
>
> I'm having issues joining a 1-billion-row dataframe containing skewed data 
> with multiple dataframes whose sizes range from 100,000 to 10 million rows. 
> This means some of the joins (about half of them) can be done using 
> broadcast, but not all.
> Because the data in the large dataframe is skewed, we get out-of-memory errors 
> in the executors, or errors like 
> `org.apache.spark.shuffle.FetchFailedException: Too large frame`.
> We tried a lot of things, such as broadcast-joining the skewed rows separately 
> and unioning them with the dataset containing the sort-merge-joined data. 
> That works perfectly when doing one or two joins, but when doing 10 joins 
> like this the query planner gets confused (see [SPARK-15326]).
> As most of the rows are skewed on the NULL value, we use a hack where we put 
> unique values in those NULL columns so the data is properly distributed over 
> all partitions. This works fine for NULL values, but since this table is 
> growing rapidly and we have skewed data for non-NULL values as well, this 
> isn't a full solution to the problem.
> Right now this specific Spark task runs well 30% of the time, and it's getting 
> worse and worse because of the increasing amount of data.
> How should these kinds of joins be approached in Spark? It seems odd that I 
> can't find proper solutions to this problem, or other people having the same 
> kind of issues, when Spark positions itself as a large-scale data processing 
> engine. Doing joins on big datasets should be something Spark handles out of 
> the box.
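
As a concrete illustration, here is a minimal Scala sketch of the NULL-salting 
hack described above. It is only a sketch under stated assumptions: the 
DataFrames {{big}} and {{small}}, the string join column {{key}}, and the salt 
prefix are hypothetical, and this is one possible mitigation rather than a 
built-in Spark feature.

{code}
// Hedged sketch of the NULL-salting workaround; `big`, `small`, and the string
// column "key" are hypothetical names.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

def joinWithNullSalting(big: DataFrame, small: DataFrame): DataFrame = {
  // Replace NULL join keys with unique salt values so those rows spread across
  // partitions instead of all landing on a single reducer.
  val salted = big.withColumn(
    "salted_key",
    when(col("key").isNull,
      concat(lit("null_salt_"), monotonically_increasing_id().cast("string")))
      .otherwise(col("key")))

  // NULL keys never match in an equi-join anyway, so a left outer join on the
  // salted key keeps the original semantics while fixing the data distribution.
  salted.join(small, salted("salted_key") === small("key"), "left_outer")
}
{code}

The same idea extends to non-NULL hot keys by salting them over a bounded range 
and replicating the matching rows on the small side, which is roughly what the 
broadcast-plus-union workaround above does by hand.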



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16748) Errors thrown by UDFs cause TreeNodeException when the query has an ORDER BY clause

2016-07-28 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das reassigned SPARK-16748:
-

Assignee: Tathagata Das

> Errors thrown by UDFs cause TreeNodeException when the query has an ORDER BY 
> clause
> ---
>
> Key: SPARK-16748
> URL: https://issues.apache.org/jira/browse/SPARK-16748
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Tathagata Das
>
> {code}
> import org.apache.spark.sql.functions._
> val myUDF = udf((c: String) => s"""${c.take(5)}""")
> spark.sql("SELECT cast(null as string) as 
> a").select(myUDF($"a").as("b")).orderBy($"b").collect
> {code}
> {code}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
> Exchange rangepartitioning(b#345 ASC, 200)
> +- *Project [UDF(null) AS b#345]
>+- Scan OneRowRelation[]
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:113)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:233)
>   at org.apache.spark.sql.execution.SortExec.inputRDDs(SortExec.scala:113)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:361)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:272)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2187)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2187)
>   at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2545)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2187)
>   at org.apache.spark.sql.Dataset.collect(Dataset.scala:2163)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16787) SparkContext.addFile() should not fail if called twice with the same file

2016-07-28 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-16787:
---
Target Version/s: 2.0.1  (was: 1.6.3, 2.0.1)

> SparkContext.addFile() should not fail if called twice with the same file
> -
>
> Key: SPARK-16787
> URL: https://issues.apache.org/jira/browse/SPARK-16787
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> The behavior of SparkContext.addFile() changed slightly with the introduction 
> of the Netty-RPC-based file server, which was introduced in Spark 1.6 (where 
> it was disabled by default) and became the default / only file server in 
> Spark 2.0.0.
> Prior to 2.0, calling SparkContext.addFile() twice with the same path would 
> succeed and would cause future tasks to receive an updated copy of the file. 
> This behavior was never explicitly documented but Spark has behaved this way 
> since very early 1.x versions (some of the relevant lines in 
> Executor.updateDependencies() have existed since 2012).
> In 2.0 (or 1.6 with the Netty file server enabled), the second addFile() call 
> will fail with a requirement error because NettyStreamManager tries to guard 
> against duplicate file registration.
> I believe that this change of behavior was unintentional and propose to 
> remove the {{require}} check so that Spark 2.0 matches 1.x's default behavior.
> This problem also affects addJar() in a more subtle way: the 
> fileServer.addJar() call will also fail with an exception but that exception 
> is logged and ignored due to some code which was added in 2014 in order to 
> ignore errors caused by missing Spark examples JARs when running on YARN 
> cluster mode (AFAIK).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16787) SparkContext.addFile() should not fail if called twice with the same file

2016-07-28 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-16787:
--

 Summary: SparkContext.addFile() should not fail if called twice 
with the same file
 Key: SPARK-16787
 URL: https://issues.apache.org/jira/browse/SPARK-16787
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0, 1.6.2
Reporter: Josh Rosen
Assignee: Josh Rosen


The behavior of SparkContext.addFile() changed slightly with the introduction 
of the Netty-RPC-based file server, which was introduced in Spark 1.6 (where it 
was disabled by default) and became the default / only file server in Spark 
2.0.0.

Prior to 2.0, calling SparkContext.addFile() twice with the same path would 
succeed and would cause future tasks to receive an updated copy of the file. 
This behavior was never explicitly documented but Spark has behaved this way 
since very early 1.x versions (some of the relevant lines in 
Executor.updateDependencies() have existed since 2012).

In 2.0 (or 1.6 with the Netty file server enabled), the second addFile() call 
will fail with a requirement error because NettyStreamManager tries to guard 
against duplicate file registration.

I believe that this change of behavior was unintentional and propose to remove 
the {{require}} check so that Spark 2.0 matches 1.x's default behavior.

This problem also affects addJar() in a more subtle way: the 
fileServer.addJar() call will also fail with an exception but that exception is 
logged and ignored due to some code which was added in 2014 in order to ignore 
errors caused by missing Spark examples JARs when running on YARN cluster mode 
(AFAIK).
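
A minimal Scala sketch of the calls in question (assuming an existing 
SparkContext {{sc}}; the file path is hypothetical):

{code}
import org.apache.spark.SparkFiles

sc.addFile("/tmp/lookup.txt")   // first call registers the file
sc.addFile("/tmp/lookup.txt")   // pre-2.0 default: accepted, tasks later see the
                                // updated copy; 2.0.0: fails because the Netty
                                // file server's `require` rejects the duplicate

// Inside a task, the distributed copy is resolved via SparkFiles:
val localPath = SparkFiles.get("lookup.txt")
{code}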



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16774) Fix use of deprecated TimeStamp constructor (also providing incorrect results)

2016-07-28 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398424#comment-15398424
 ] 

holdenk edited comment on SPARK-16774 at 7/28/16 11:38 PM:
---

While diving into this (relatedly, I hate timezones) - I'm pretty sure I've 
encountered a logic bug in our handling close to daylight saving time boundaries - 
namely, the inline comments suggest the timestamp is being constructed against 
UTC, but careful reading of the JDK code indicates it's actually using the system 
default. I'm going to add some tests around this and replace the fall-back code 
with the calendar-based approach.


was (Author: holdenk):
While diving into this (relatedly I hate timezones) - I'm pretty sure I've 
encountered a logic bug for our handling close to timezone boundaries - namely 
the inline comments suggest the timestamp is being constructed against UTC but 
careful reading of the JDK code indicates its actually using the system 
default. I'm going to add some tests around this and replace the fall-back code 
with the calendar based approach.

> Fix use of deprecated TimeStamp constructor (also providing incorrect results)
> --
>
> Key: SPARK-16774
> URL: https://issues.apache.org/jira/browse/SPARK-16774
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: holdenk
>
> The TimeStamp constructor we use inside of DateTime utils has been deprecated 
> since JDK 1.1 - while Java does take a long time to remove deprecated 
> functionality we might as well address this. Additionally it does not handle 
> DST boundaries correctly all the time



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16774) Fix use of deprecated TimeStamp constructor

2016-07-28 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-16774:

Description: The TimeStamp constructor we use inside of DateTime utils has 
been deprecated since JDK 1.1 - while Java does take a long time to remove 
deprecated functionality we might as well address this. Additionally it does   
(was: The TimeStamp constructor we use inside of DateTime utils has been 
deprecated since JDK 1.1 - while Java does take a long time to remove 
deprecated functionality we might as well address this.)

> Fix use of deprecated TimeStamp constructor
> ---
>
> Key: SPARK-16774
> URL: https://issues.apache.org/jira/browse/SPARK-16774
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: holdenk
>
> The TimeStamp constructor we use inside of DateTime utils has been deprecated 
> since JDK 1.1 - while Java does take a long time to remove deprecated 
> functionality we might as well address this. Additionally it does 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16774) Fix use of deprecated TimeStamp constructor (also providing incorrect results)

2016-07-28 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-16774:

Description: The TimeStamp constructor we use inside of DateTime utils has 
been deprecated since JDK 1.1 - while Java does take a long time to remove 
deprecated functionality we might as well address this. Additionally it does 
not handle DST boundaries correctly all the time  (was: The TimeStamp 
constructor we use inside of DateTime utils has been deprecated since JDK 1.1 - 
while Java does take a long time to remove deprecated functionality we might as 
well address this. Additionally it does )

> Fix use of deprecated TimeStamp constructor (also providing incorrect results)
> --
>
> Key: SPARK-16774
> URL: https://issues.apache.org/jira/browse/SPARK-16774
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: holdenk
>
> The TimeStamp constructor we use inside of DateTime utils has been deprecated 
> since JDK 1.1 - while Java does take a long time to remove deprecated 
> functionality we might as well address this. Additionally it does not handle 
> DST boundaries correctly all the time



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16774) Fix use of deprecated TimeStamp constructor (also providing incorrect results)

2016-07-28 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-16774:

Summary: Fix use of deprecated TimeStamp constructor (also providing 
incorrect results)  (was: Fix use of deprecated TimeStamp constructor)

> Fix use of deprecated TimeStamp constructor (also providing incorrect results)
> --
>
> Key: SPARK-16774
> URL: https://issues.apache.org/jira/browse/SPARK-16774
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: holdenk
>
> The TimeStamp constructor we use inside of DateTime utils has been deprecated 
> since JDK 1.1 - while Java does take a long time to remove deprecated 
> functionality we might as well address this. Additionally it does 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16774) Fix use of deprecated TimeStamp constructor

2016-07-28 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398424#comment-15398424
 ] 

holdenk commented on SPARK-16774:
-

While diving into this (relatedly, I hate timezones) - I'm pretty sure I've 
encountered a logic bug in our handling close to timezone boundaries - namely, 
the inline comments suggest the timestamp is being constructed against UTC, but 
careful reading of the JDK code indicates it's actually using the system 
default. I'm going to add some tests around this and replace the fall-back code 
with the calendar-based approach.
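
To make the distinction concrete, a standalone Scala sketch (illustration only, 
not the DateTimeUtils code): the deprecated {{java.sql.Timestamp}} field 
constructor interprets its arguments in the JVM's default time zone, while a 
{{Calendar}} can be pinned to an explicit zone such as UTC.

{code}
import java.sql.Timestamp
import java.util.{Calendar, TimeZone}

// Deprecated field constructor: year is an offset from 1900, month is
// zero-based, and the fields are resolved against TimeZone.getDefault.
val viaDeprecatedCtor = new Timestamp(116, 6, 28, 12, 0, 0, 0)  // 2016-07-28 12:00, local zone

// Calendar-based construction makes the time zone explicit.
val cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"))
cal.clear()
cal.set(2016, Calendar.JULY, 28, 12, 0, 0)
val viaCalendar = new Timestamp(cal.getTimeInMillis)            // 2016-07-28 12:00 UTC
{code}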

> Fix use of deprecated TimeStamp constructor
> ---
>
> Key: SPARK-16774
> URL: https://issues.apache.org/jira/browse/SPARK-16774
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: holdenk
>
> The TimeStamp constructor we use inside of DateTime utils has been deprecated 
> since JDK 1.1 - while Java does take a long time to remove deprecated 
> functionality we might as well address this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16768) pyspark calls incorrect version of logistic regression

2016-07-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398409#comment-15398409
 ] 

Sean Owen commented on SPARK-16768:
---

That's spark.ml.LogisticRegression, right? There's no longer a separate LBFGS 
version in .ml.
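
For orientation, these are the underlying Scala entry points whose stage names 
appear in the report (a hedged sketch; the inputs {{labeledPoints}} and 
{{trainingDf}} are hypothetical, so the training calls are shown commented out):

{code}
// RDD-based API (spark.mllib): the code path behind "treeAggregate at LBFGS.scala".
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
// val mllibModel = new LogisticRegressionWithLBFGS().setNumClasses(2).run(labeledPoints)

// DataFrame-based API (spark.ml): the code path behind
// "treeAggregate at LogisticRegression.scala".
import org.apache.spark.ml.classification.LogisticRegression
// val mlModel = new LogisticRegression().setMaxIter(100).fit(trainingDf)
{code}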

> pyspark calls incorrect version of logistic regression
> --
>
> Key: SPARK-16768
> URL: https://issues.apache.org/jira/browse/SPARK-16768
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
> Environment: Linux openSUSE Leap 42.1 Gnome
>Reporter: Colin Beckingham
> Fix For: 2.1.0
>
>
> A PySpark call to "LogisticRegressionWithLBFGS.train()" with Spark 1.6.2 runs 
> "treeAggregate at LBFGS.scala:218", but the same command in PySpark with Spark 
> 2.1 runs "treeAggregate at LogisticRegression.scala:1092". This non-optimized 
> version is much slower and produces a different answer from LBFGS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16785) dapply doesn't return array or raw columns

2016-07-28 Thread Clark Fitzgerald (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398369#comment-15398369
 ] 

Clark Fitzgerald commented on SPARK-16785:
--

To fix this I propose to treat the rows as a list of dataframes instead of as a 
list of lists:

{{rowlist = split(df, 1:nrow(df))}}
{{df2 = do.call(rbind, rowlist)}}

> dapply doesn't return array or raw columns
> --
>
> Key: SPARK-16785
> URL: https://issues.apache.org/jira/browse/SPARK-16785
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
> Environment: Mac OS X
>Reporter: Clark Fitzgerald
>Priority: Minor
>
> Calling SparkR::dapplyCollect with R functions that return dataframes 
> produces an error. This comes up when returning columns of binary data, i.e. 
> serialized fitted models. Also happens when functions return columns 
> containing vectors.
> The error message:
> R computation failed with
>  Error in (function (..., deparse.level = 1, make.row.names = TRUE, 
> stringsAsFactors = default.stringsAsFactors())  :
>   invalid list argument: all variables should have the same length
> Reproducible example: 
> https://github.com/clarkfitzg/phd_research/blob/master/ddR/spark/sparkR_dapplyCollect7.R
> Relates to SPARK-16611



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16785) dapply doesn't return array or raw columns

2016-07-28 Thread Clark Fitzgerald (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398312#comment-15398312
 ] 

Clark Fitzgerald edited comment on SPARK-16785 at 7/28/16 10:51 PM:


[~shivaram] and I have had some email correspondence on this: 

bq. parallelizing data containing binary values: To create a SparkDataFrame 
from a local R data frame we first convert it to a list of rows and then 
partition the rows. This happens on the driver 1. The rows are then 
reconstructed back into a data.frame on the workers 2. The rbind.data.frame 
fails for the binary example. I've created a small reproducible example at 
https://gist.github.com/shivaram/2f07208a8ebc0832098328fa0a3fac9d

1 
https://github.com/apache/spark/blob/0869b3a5f028b64c2da511e70b02ab42f65fc949/R/pkg/R/SQLContext.R#L209
2 
https://github.com/apache/spark/blob/0869b3a5f028b64c2da511e70b02ab42f65fc949/R/pkg/inst/worker/worker.R#L39


was (Author: clarkfitzg):
[~shivaram] and I have had some email correspondence on this: 

bq. parallelizing data containing binary values: To create a SparkDataFrame 
from a local R data frame we first convert it to a list of rows and then 
partition the rows. This happens on the driver 1. The rows are then 
reconstructed back into a data.frame on the workers 2. The rbind.data.frame 
fails for the the binary example. I've created a small reproducible example at 
https://gist.github.com/shivaram/2f07208a8ebc0832098328fa0a3fac9d

1 
https://github.com/apache/spark/blob/0869b3a5f028b64c2da511e70b02ab42f65fc949/R/pkg/R/SQLContext.R#L209
2 
https://github.com/apache/spark/blob/0869b3a5f028b64c2da511e70b02ab42f65fc949/R/pkg/inst/worker/worker.R#L39

To fix this I propose to treat the rows as a list of dataframes instead of as a 
list of lists:

{{rowlist = split(df, 1:nrow(df))}}
{{df2 = do.call(rbind, rowlist)}}



> dapply doesn't return array or raw columns
> --
>
> Key: SPARK-16785
> URL: https://issues.apache.org/jira/browse/SPARK-16785
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
> Environment: Mac OS X
>Reporter: Clark Fitzgerald
>Priority: Minor
>
> Calling SparkR::dapplyCollect with R functions that return dataframes 
> produces an error. This comes up when returning columns of binary data, i.e. 
> serialized fitted models. Also happens when functions return columns 
> containing vectors.
> The error message:
> R computation failed with
>  Error in (function (..., deparse.level = 1, make.row.names = TRUE, 
> stringsAsFactors = default.stringsAsFactors())  :
>   invalid list argument: all variables should have the same length
> Reproducible example: 
> https://github.com/clarkfitzg/phd_research/blob/master/ddR/spark/sparkR_dapplyCollect7.R
> Relates to SPARK-16611



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions

2016-07-28 Thread Clark Fitzgerald (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398346#comment-15398346
 ] 

Clark Fitzgerald commented on SPARK-16611:
--

+1 for more direct access to the RDDs. This would be very helpful for me as I 
try to implement general R objects using Spark as a backend for ddR 
https://github.com/vertica/ddR

Longer term it might make sense to organize SparkR into separate packages 
offering various levels of abstraction:
# DataFrames - for most end users
# RDDs - for package authors or special applications
# Java objects - for directly invoking methods in Spark. This is what sparkapi 
does.

For my application it would be much better to be working at this middle layer.

> Expose several hidden DataFrame/RDD functions
> -
>
> Key: SPARK-16611
> URL: https://issues.apache.org/jira/browse/SPARK-16611
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Oscar D. Lara Yejas
>
> Expose the following functions:
> - lapply or map
> - lapplyPartition or mapPartition
> - flatMap
> - RDD
> - toRDD
> - getJRDD
> - cleanup.jobj
> cc:
> [~javierluraschi] [~j...@rstudio.com] [~shivaram]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16786) LDA topic distributions for new documents in PySpark

2016-07-28 Thread Jordan Beauchamp (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jordan Beauchamp updated SPARK-16786:
-
Target Version/s:   (was: 2.0.0)

> LDA topic distributions for new documents in PySpark
> 
>
> Key: SPARK-16786
> URL: https://issues.apache.org/jira/browse/SPARK-16786
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 2.0.0
> Environment: N/A
>Reporter: Jordan Beauchamp
>Priority: Minor
>  Labels: patch
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> pyspark.mllib.clustering.LDAModel has no way to estimate the topic 
> distribution for new documents. However, this functionality exists in 
> org.apache.spark.mllib.clustering.LDAModel. This change would only require 
> setting up the API calls. I have forked the Spark repo and implemented the 
> changes locally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16786) LDA topic distributions for new documents in PySpark

2016-07-28 Thread Jordan Beauchamp (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jordan Beauchamp updated SPARK-16786:
-
Fix Version/s: (was: 2.0.0)

> LDA topic distributions for new documents in PySpark
> 
>
> Key: SPARK-16786
> URL: https://issues.apache.org/jira/browse/SPARK-16786
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 2.0.0
> Environment: N/A
>Reporter: Jordan Beauchamp
>Priority: Minor
>  Labels: patch
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> pyspark.mllib.clustering.LDAModel has no way to estimate the topic 
> distribution for new documents. However, this functionality exists in 
> org.apache.spark.mllib.clustering.LDAModel. This change would only require 
> setting up the API calls. I have forked the Spark repo and implemented the 
> changes locally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16786) LDA topic distributions for new documents in PySpark

2016-07-28 Thread Jordan Beauchamp (JIRA)
Jordan Beauchamp created SPARK-16786:


 Summary: LDA topic distributions for new documents in PySpark
 Key: SPARK-16786
 URL: https://issues.apache.org/jira/browse/SPARK-16786
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Affects Versions: 2.0.0
 Environment: N/A
Reporter: Jordan Beauchamp
Priority: Minor
 Fix For: 2.0.0


pyspark.mllib.clustering.LDAModel has no way to estimate the topic distribution 
for new documents. However, this functionality exists in 
org.apache.spark.mllib.clustering.LDAModel. This change would only require 
setting up the API calls. I have forked the Spark repo and implemented the 
changes locally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16785) dapply doesn't return array or raw columns

2016-07-28 Thread Clark Fitzgerald (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Clark Fitzgerald updated SPARK-16785:
-
Priority: Minor  (was: Major)

> dapply doesn't return array or raw columns
> --
>
> Key: SPARK-16785
> URL: https://issues.apache.org/jira/browse/SPARK-16785
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
> Environment: Mac OS X
>Reporter: Clark Fitzgerald
>Priority: Minor
>
> Calling SparkR::dapplyCollect with R functions that return dataframes 
> produces an error. This comes up when returning columns of binary data, i.e. 
> serialized fitted models. Also happens when functions return columns 
> containing vectors.
> The error message:
> R computation failed with
>  Error in (function (..., deparse.level = 1, make.row.names = TRUE, 
> stringsAsFactors = default.stringsAsFactors())  :
>   invalid list argument: all variables should have the same length
> Reproducible example: 
> https://github.com/clarkfitzg/phd_research/blob/master/ddR/spark/sparkR_dapplyCollect7.R
> Relates to SPARK-16611



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16785) dapply doesn't return array or raw columns

2016-07-28 Thread Clark Fitzgerald (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398312#comment-15398312
 ] 

Clark Fitzgerald commented on SPARK-16785:
--

[~shivaram] and I have had some email correspondence on this: 

bq. parallelizing data containing binary values: To create a SparkDataFrame 
from a local R data frame we first convert it to a list of rows and then 
partition the rows. This happens on the driver 1. The rows are then 
reconstructed back into a data.frame on the workers 2. The rbind.data.frame 
fails for the binary example. I've created a small reproducible example at 
https://gist.github.com/shivaram/2f07208a8ebc0832098328fa0a3fac9d

1 
https://github.com/apache/spark/blob/0869b3a5f028b64c2da511e70b02ab42f65fc949/R/pkg/R/SQLContext.R#L209
2 
https://github.com/apache/spark/blob/0869b3a5f028b64c2da511e70b02ab42f65fc949/R/pkg/inst/worker/worker.R#L39

To fix this I propose to treat the rows as a list of dataframes instead of as a 
list of lists:

{{rowlist = split(df, 1:nrow(df))}}
{{df2 = do.call(rbind, rowlist)}}



> dapply doesn't return array or raw columns
> --
>
> Key: SPARK-16785
> URL: https://issues.apache.org/jira/browse/SPARK-16785
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
> Environment: Mac OS X
>Reporter: Clark Fitzgerald
>
> Calling SparkR::dapplyCollect with R functions that return dataframes 
> produces an error. This comes up when returning columns of binary data, i.e. 
> serialized fitted models. Also happens when functions return columns 
> containing vectors.
> The error message:
> R computation failed with
>  Error in (function (..., deparse.level = 1, make.row.names = TRUE, 
> stringsAsFactors = default.stringsAsFactors())  :
>   invalid list argument: all variables should have the same length
> Reproducible example: 
> https://github.com/clarkfitzg/phd_research/blob/master/ddR/spark/sparkR_dapplyCollect7.R
> Relates to SPARK-16611



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16785) dapply doesn't return array or raw columns

2016-07-28 Thread Clark Fitzgerald (JIRA)
Clark Fitzgerald created SPARK-16785:


 Summary: dapply doesn't return array or raw columns
 Key: SPARK-16785
 URL: https://issues.apache.org/jira/browse/SPARK-16785
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.0.0
 Environment: Mac OS X
Reporter: Clark Fitzgerald


Calling SparkR::dapplyCollect with R functions that return dataframes produces 
an error. This comes up when returning columns of binary data, i.e. serialized 
fitted models. Also happens when functions return columns containing vectors.

The error message:

R computation failed with
 Error in (function (..., deparse.level = 1, make.row.names = TRUE, 
stringsAsFactors = default.stringsAsFactors())  :
  invalid list argument: all variables should have the same length

Reproducible example: 
https://github.com/clarkfitzg/phd_research/blob/master/ddR/spark/sparkR_dapplyCollect7.R

Relates to SPARK-16611



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory

2016-07-28 Thread Denis Serduik (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398289#comment-15398289
 ] 

Denis Serduik commented on SPARK-2984:
--

Something like this...
The problem occurs when I change importPartition to importPartitionAsync. 
service.getHitRecordsFromFile returns a DataFrame.


{code}
// Imports reconstructed from usage below; the original message truncated them.
import java.io.IOException
import java.net.URI
import java.nio.file.{Path => JavaPath, Files}
import java.util.concurrent.ConcurrentHashMap

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

import org.apache.hadoop.fs.{FileStatus, FileSystem, FileUtil, Path, PathFilter}
import org.apache.hadoop.util.Shell.ShellCommandExecutor
import org.apache.spark.sql.SaveMode
class DefaultHitDataManager(service : SparkService) extends DataSourceManager 
with Logging {


  val proceeded_prefix = "_proceeded_"
  val hit_data_pattern = "*_data*"

  val dsCache = new ConcurrentHashMap[String, HitsDataSource]()


  override def loadDataSource(remoteLocationBase: URI, alias: String): 
Future[HitsDataSource] = {

val dsWorkingDir = Files.createTempDirectory(alias)
val fs = FileSystem.get(remoteLocationBase, 
service.sqlContext.sparkContext.hadoopConfiguration)
val loadFuture = scanPartitions(remoteLocationBase).flatMap { case 
newPartitions =>
  Future.traverse(newPartitions.toList)( {  case remotePartitionFile =>
importPartitionAsync(alias, dsWorkingDir, fs, remotePartitionFile)
  }).flatMap { case statuses =>
FileUtil.fullyDelete(dsWorkingDir.toFile)
val ds = HitsDataSource(remoteLocationBase, alias)
dsCache.put(ds.alias, ds)
Future.successful(ds)
  }
}
loadFuture
  }
  ///XXX: async loading causes race condition in FileSystem and things like 
that https://issues.apache.org/jira/browse/SPARK-2984
  def importPartitionAsync(alias: String, dsWorkingDir: JavaPath, partitionFS: 
FileSystem,
   remotePartitionFile: FileStatus) = Future {
importPartition(alias, dsWorkingDir,partitionFS, remotePartitionFile )
  }

  def importPartition(alias: String, dsWorkingDir: JavaPath, partitionFS: 
FileSystem,
  remotePartitionFile: FileStatus) = {
val partitionWorkingDir = Files.createTempDirectory(dsWorkingDir, 
remotePartitionFile.getPath.getName)
val localPath = new Path("file://" + partitionWorkingDir.toString, 
remotePartitionFile.getPath.getName)

partitionFS.copyToLocalFile(remotePartitionFile.getPath, localPath)
val unzippedPartition = s"$partitionWorkingDir/hit_data*.tsv"

  val filePath = localPath.toUri.getPath
  val untarCommand = s"tar -xzf $filePath  --wildcards --no-anchored 
'$hit_data_pattern'"
  val shellCmd = Array("bash", "-c", untarCommand.toString)
  val shexec = new ShellCommandExecutor(shellCmd, 
partitionWorkingDir.toFile)
  shexec.execute()
  val exitcode = shexec.getExitCode
  if (exitcode != 0) {
throw new IOException(s"Error untarring file $filePath. Tar process 
exited with exit code $exitcode")
  }
  val partitionData = service.getHitRecordsFromFile("file://" + 
unzippedPartition)
  partitionData.coalesce(1).write.partitionBy("createdAtYear", 
"createdAtMonth").
mode(SaveMode.Append).saveAsTable(alias)

  val newPartitionPath = new Path(remotePartitionFile.getPath.getParent,
proceeded_prefix + remotePartitionFile.getPath.getName)
  partitionFS.rename(remotePartitionFile.getPath, newPartitionPath)
  }

  def scanPartitions(locationBase: URI) = Future {
val fs = FileSystem.get(locationBase, 
service.sqlContext.sparkContext.hadoopConfiguration)
val location = new Path(locationBase)
val newPartitions  = if (fs.isDirectory(location)) {
  fs.listStatus(location, new PathFilter {
override def accept(path: Path): Boolean = {
  !path.getName.contains(proceeded_prefix)
}
  })
} else {
  Array(fs.getFileStatus(location))
}
logDebug(s"Found new partitions : $newPartitions")
newPartitions
  }
}
{code}

I hope this will help

> FileNotFoundException on _temporary directory
> -
>
> Key: SPARK-2984
> URL: https://issues.apache.org/jira/browse/SPARK-2984
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Ash
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.3.0
>
>
> We've seen several stacktraces and threads on the user mailing list where 
> people are having issues with a {{FileNotFoundException}} stemming from an 
> HDFS path containing {{_temporary}}.
> I ([~aash]) think this may be related to {{spark.speculation}}.  I think the 
> error condition might manifest in this circumstance:
> 1) task T starts on an executor E1
> 2) it takes a long time, so task T' is started on another executor E2
> 3) T finishes in E1 so moves its data from {{_temporary}} to the final 
> destination and deletes the {{_temporary}} directory during cleanup
> 4) T' finishes in E2 and attempts to move its data from {{_temporary}}, but 
> those files no longer exist!  
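
A hedged mitigation sketch, assuming the race really is tied to speculative 
execution ({{spark.speculation}} is off by default, so this only matters where 
it has been enabled):

{code}
// Disable speculative execution so only one task attempt writes to and cleans
// up the _temporary directory for a given partition.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("no-speculation-example")   // hypothetical app name
  .set("spark.speculation", "false")
val sc = new SparkContext(conf)
{code}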

[jira] [Closed] (SPARK-16783) make-distri

2016-07-28 Thread Michael Gummelt (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gummelt closed SPARK-16783.
---
Resolution: Not A Problem

> make-distri
> ---
>
> Key: SPARK-16783
> URL: https://issues.apache.org/jira/browse/SPARK-16783
> Project: Spark
>  Issue Type: Bug
>Reporter: Michael Gummelt
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16784) Configurable log4j settings

2016-07-28 Thread Michael Gummelt (JIRA)
Michael Gummelt created SPARK-16784:
---

 Summary: Configurable log4j settings
 Key: SPARK-16784
 URL: https://issues.apache.org/jira/browse/SPARK-16784
 Project: Spark
  Issue Type: Improvement
Affects Versions: 2.0.0
Reporter: Michael Gummelt


I often want to change the logging configuration for a single Spark job. This 
is easy in client mode: I just modify log4j.properties. It's difficult in 
cluster mode, because I need to modify the log4j.properties in the distribution 
in which the driver runs. I'd like a way of setting this dynamically, such as a 
Java system property. Some brief searching suggested that log4j doesn't accept 
such a property, but I'd like to open this idea up for further comment. Maybe 
we can find a solution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16772) Correct API doc references to PySpark classes + formatting fixes

2016-07-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16772.
-
   Resolution: Fixed
 Assignee: Nicholas Chammas
Fix Version/s: 2.1.0
   2.0.1

> Correct API doc references to PySpark classes + formatting fixes
> 
>
> Key: SPARK-16772
> URL: https://issues.apache.org/jira/browse/SPARK-16772
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Trivial
> Fix For: 2.0.1, 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16769) httpclient classic dependency - potentially a patch required?

2016-07-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16769:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Question)

I think the issue is that jets3t needs to match up with the Hadoop version. It 
should be OK to update to maintenance releases, but not necessarily across minor 
releases; I recall issues with that. However, 0.7.1 is only used for Hadoop 2.2. 
See the profiles further down that set this to 0.9.3. You can update those. See 
if that fixes it.

You don't need another JIRA. Make a PR for this one.

> httpclient classic dependency - potentially a patch required?
> -
>
> Key: SPARK-16769
> URL: https://issues.apache.org/jira/browse/SPARK-16769
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.6.2, 2.0.0
> Environment: All Spark versions, any environment
>Reporter: Adam Roberts
>Priority: Minor
>
> In our jars folder for Spark we provide a jar with a CVE 
> https://www.versioneye.com/java/commons-httpclient:commons-httpclient/3.1. 
> CVE-2012-5783
> This paper outlines the problem
> www.cs.utexas.edu/~shmat/shmat_ccs12.pdf
> My question is: do we need to ship this version as well or is it only used 
> for tests? Is it a patched version? I plan to run without this dependency and 
> if there are NoClassDefFound problems I'll add test so we 
> don't ship it (downloading it in the first place is bad enough though)
> Note that this is valid for all versions, suggesting it be raised to a 
> critical if Spark functionality is depending on it because of what the pdf 
> I've linked to mentions
> Here is the jar being included:
> ls $SPARK_HOME/jars | grep "httpclient"
> commons-httpclient-3.1.jar
> httpclient-4.5.2.jar
> The first jar potentially contains the security issue, could be a patched 
> version, need to verify. SHA1 sum for this jar is 
> 964cd74171f427720480efdec40a7c7f6e58426a



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16745) Spark job completed however have to wait for 13 mins (data size is small)

2016-07-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398272#comment-15398272
 ] 

Sean Owen commented on SPARK-16745:
---

I suspect this is a duplicate of one of the many issues related to janino 
codegen. I don't know which one. Many were fixed in 2.0. Can you try that?

> Spark job completed however have to wait for 13 mins (data size is small)
> -
>
> Key: SPARK-16745
> URL: https://issues.apache.org/jira/browse/SPARK-16745
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.6.1
> Environment: Mac OS X Yosemite, Terminal, MacBook Air Late 2014
>Reporter: Joe Chong
>Priority: Minor
>
> I submitted a job in scala spark shell to show a DataFrame. The data size is 
> about 43K. The job was successful in the end, but took more than 13 minutes 
> to finish. Upon checking the log, there are multiple exceptions raised on 
> "Failed to check existence of class" with a java.net.ConnectException 
> message indicating a timeout trying to connect to port 52067, the REPL port 
> that Spark set up. Please assist with troubleshooting. Thanks. 
> Started Spark in standalone mode
> $ spark-shell --driver-memory 5g --master local[*]
> 16/07/26 21:05:29 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 16/07/26 21:05:30 INFO spark.SecurityManager: Changing view acls to: joechong
> 16/07/26 21:05:30 INFO spark.SecurityManager: Changing modify acls to: 
> joechong
> 16/07/26 21:05:30 INFO spark.SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(joechong); users 
> with modify permissions: Set(joechong)
> 16/07/26 21:05:30 INFO spark.HttpServer: Starting HTTP Server
> 16/07/26 21:05:30 INFO server.Server: jetty-8.y.z-SNAPSHOT
> 16/07/26 21:05:30 INFO server.AbstractConnector: Started 
> SocketConnector@0.0.0.0:52067
> 16/07/26 21:05:30 INFO util.Utils: Successfully started service 'HTTP class 
> server' on port 52067.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.6.1
>   /_/
> Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)
> Type in expressions to have them evaluated.
> Type :help for more information.
> 16/07/26 21:05:34 INFO spark.SparkContext: Running Spark version 1.6.1
> 16/07/26 21:05:34 INFO spark.SecurityManager: Changing view acls to: joechong
> 16/07/26 21:05:34 INFO spark.SecurityManager: Changing modify acls to: 
> joechong
> 16/07/26 21:05:34 INFO spark.SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(joechong); users 
> with modify permissions: Set(joechong)
> 16/07/26 21:05:35 INFO util.Utils: Successfully started service 'sparkDriver' 
> on port 52072.
> 16/07/26 21:05:35 INFO slf4j.Slf4jLogger: Slf4jLogger started
> 16/07/26 21:05:35 INFO Remoting: Starting remoting
> 16/07/26 21:05:35 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://sparkDriverActorSystem@10.199.29.218:52074]
> 16/07/26 21:05:35 INFO util.Utils: Successfully started service 
> 'sparkDriverActorSystem' on port 52074.
> 16/07/26 21:05:35 INFO spark.SparkEnv: Registering MapOutputTracker
> 16/07/26 21:05:35 INFO spark.SparkEnv: Registering BlockManagerMaster
> 16/07/26 21:05:35 INFO storage.DiskBlockManager: Created local directory at 
> /private/var/folders/r7/bs2f87nj6lnd5vm51lvxcw68gn/T/blockmgr-cd542a27-6ff1-4f51-a72b-78654142fdb6
> 16/07/26 21:05:35 INFO storage.MemoryStore: MemoryStore started with capacity 
> 3.4 GB
> 16/07/26 21:05:35 INFO spark.SparkEnv: Registering OutputCommitCoordinator
> 16/07/26 21:05:36 INFO server.Server: jetty-8.y.z-SNAPSHOT
> 16/07/26 21:05:36 INFO server.AbstractConnector: Started 
> SelectChannelConnector@0.0.0.0:4040
> 16/07/26 21:05:36 INFO util.Utils: Successfully started service 'SparkUI' on 
> port 4040.
> 16/07/26 21:05:36 INFO ui.SparkUI: Started SparkUI at 
> http://10.199.29.218:4040
> 16/07/26 21:05:36 INFO executor.Executor: Starting executor ID driver on host 
> localhost
> 16/07/26 21:05:36 INFO executor.Executor: Using REPL class URI: 
> http://10.199.29.218:52067
> 16/07/26 21:05:36 INFO util.Utils: Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 52075.
> 16/07/26 21:05:36 INFO netty.NettyBlockTransferService: Server created on 
> 52075
> 16/07/26 21:05:36 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/07/26 21:05:36 INFO storage.BlockManagerMasterEndpoint: Registering block 
> manager localhost:52075 with 3.4 GB RAM, BlockManagerId(driver, 

[jira] [Commented] (SPARK-12157) Support numpy types as return values of Python UDFs

2016-07-28 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398267#comment-15398267
 ] 

Nicholas Chammas commented on SPARK-12157:
--

I'm looking to define a UDF in PySpark that returns a 
{{pyspark.ml.linalg.Vector}}. Since {{Vector}} is a wrapper for numpy types, I 
believe this issue covers what I'm looking for.

My use case is that I want a UDF that takes in several DataFrame columns and 
extracts/computes features, returning them as a new {{Vector}} column. I 
believe {{VectorAssembler}} is for when you already have the features and you 
just want them put in a {{Vector}}.

[~josephkb] [~zjffdu] So is it possible to do that today? Have I misunderstood 
how to approach my use case?

> Support numpy types as return values of Python UDFs
> ---
>
> Key: SPARK-12157
> URL: https://issues.apache.org/jira/browse/SPARK-12157
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.2
>Reporter: Justin Uang
>
> Currently, if I have a python UDF
> {code}
> import pyspark.sql.types as T
> import pyspark.sql.functions as F
> from pyspark.sql import Row
> import numpy as np
> argmax = F.udf(lambda x: np.argmax(x), T.IntegerType())
> df = sqlContext.createDataFrame([Row(array=[1,2,3])])
> df.select(argmax("array")).count()
> {code}
> I get an exception that is fairly opaque:
> {code}
> Caused by: net.razorvine.pickle.PickleException: expected zero arguments for 
> construction of ClassDict (for numpy.dtype)
> at 
> net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
> at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:701)
> at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:171)
> at net.razorvine.pickle.Unpickler.load(Unpickler.java:85)
> at net.razorvine.pickle.Unpickler.loads(Unpickler.java:98)
> at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:404)
> at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:403)
> {code}
> NumPy types like np.int and np.float64 should automatically be cast to the 
> corresponding Spark SQL types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2016-07-28 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398237#comment-15398237
 ] 

Reynold Xin commented on SPARK-6305:


Thanks, Mikael. Would love to get some help here. 

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16783) make-distri

2016-07-28 Thread Michael Gummelt (JIRA)
Michael Gummelt created SPARK-16783:
---

 Summary: make-distri
 Key: SPARK-16783
 URL: https://issues.apache.org/jira/browse/SPARK-16783
 Project: Spark
  Issue Type: Bug
Reporter: Michael Gummelt






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16365) Ideas for moving "mllib-local" forward

2016-07-28 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398178#comment-15398178
 ] 

Joseph K. Bradley commented on SPARK-16365:
---

This JIRA is covering multiple potential projects within mllib-local.  I'd find 
it useful to separate these.  Here are the ones I see so far:

h3. Local model implementations

IMO this is a no-brainer: we should definitely work towards moving model 
implementations into mllib-local.  This is _separate_ from model serving, etc. 
but is an important dependency of non-Spark model serving efforts.  
Essentially, what is required now is:
* duplicate transformation and prediction functionality outside of Spark
* build model serving infra based on the duplicate code path

What we should have is:
* local model implementation in mllib-local
* MLlib depends on that implementation
* external model serving infra depends on mllib-local

The key benefit is utilizing exactly the same code paths for prediction within 
MLlib and in production systems, making it easy to keep them in sync if MLlib 
changes its behavior.

CCing [~chromaticbum] with whom I had a great discussion on this.

h3. Local model serialization format

This is easily conflated with model implementations, but I believe it should be 
a separate discussion.  The community did a great job on ML persistence for the 
DataFrame-based API, so I do believe it is possible to do a good job here, 
though we must achieve consensus first.  This would of course need to happen 
after the local model implementations were built.

h3. Local linear algebra

This is already being discussed elsewhere, e.g. [SPARK-6442], so let's not 
discuss it here.

h3. Local model training

I completely agree with what was said above about (a) not duplicating 
functionality in other single-machine libraries and (b) doing all training with 
Spark proper.

> Ideas for moving "mllib-local" forward
> --
>
> Key: SPARK-16365
> URL: https://issues.apache.org/jira/browse/SPARK-16365
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Nick Pentreath
>
> Since SPARK-13944 is all done, we should all think about what the "next 
> steps" might be for {{mllib-local}}. E.g., it could be "improve Spark's 
> linear algebra", or "investigate how we will implement local models/pipelines 
> in Spark", etc.
> This ticket is for comments, ideas, brainstormings and PoCs. The separation 
> of linalg into a standalone project turned out to be significantly more 
> complex than originally expected. So I vote we devote sufficient discussion 
> and time to planning out the next move :)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16780) spark-streaming-kafka_2.10 version 2.0.0 not on maven central

2016-07-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398179#comment-15398179
 ] 

Sean Owen commented on SPARK-16780:
---

There is also a "0.8" artifact, which is likely what you need.

> spark-streaming-kafka_2.10 version 2.0.0 not on maven central
> -
>
> Key: SPARK-16780
> URL: https://issues.apache.org/jira/browse/SPARK-16780
> Project: Spark
>  Issue Type: Bug
>Reporter: Andrew B
>
> I cannot seem to find spark-streaming-kafka_2.10 version 2.0.0 on maven 
> central. Has this been released?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16334) [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException

2016-07-28 Thread Keith Kraus (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398130#comment-15398130
 ] 

Keith Kraus commented on SPARK-16334:
-

[~sameerag] I have just built branch-2.0 which should have included your patch, 
but I am still experiencing this issue. The parquet file I am using was written 
using Spark 1.6 and has ~450 columns.

Issuing {{spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")}} 
prevents the issue from occurring, so it's definitely the vectorized parquet 
reader.

Let me know if I can provide any additional information to help resolve this 
issue.
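A minimal spark-shell sketch of that workaround, assuming the default {{spark}} session and the table name from the original report:

{code}
// Fall back to the non-vectorized Parquet reader for this session, then re-run the query.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.sql("SELECT * FROM blabla WHERE user_id = 415706251").show()
{code}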

> [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException
> -
>
> Key: SPARK-16334
> URL: https://issues.apache.org/jira/browse/SPARK-16334
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Assignee: Sameer Agarwal
>Priority: Critical
>  Labels: sql
> Fix For: 2.0.1, 2.1.0
>
>
> Query:
> {code}
> select * from blabla where user_id = 415706251
> {code}
> Error:
> {code}
> 16/06/30 14:07:27 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 0.0 
> (TID 3, hadoop6): java.lang.ArrayIndexOutOfBoundsException: 6934
> at 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:119)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:273)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:170)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
> at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Work on 1.6.1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16764) Recommend disabling vectorized parquet reader on OutOfMemoryError

2016-07-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16764.
-
   Resolution: Fixed
 Assignee: Sameer Agarwal
Fix Version/s: 2.1.0
   2.0.1

> Recommend disabling vectorized parquet reader on OutOfMemoryError
> -
>
> Key: SPARK-16764
> URL: https://issues.apache.org/jira/browse/SPARK-16764
> Project: Spark
>  Issue Type: Improvement
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
> Fix For: 2.0.1, 2.1.0
>
>
> We currently don't bound or manage the data array size used by column vectors 
> in the vectorized reader (they're just bound by INT.MAX) which may lead to 
> OOMs while reading data. In the short term, we can probably intercept this 
> exception and suggest the user to disable the vectorized parquet reader. 
> Longer term, we should probably do explicit memory management for this.
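A hypothetical sketch of that short-term mitigation (the wrapper and its name are assumptions, not the actual patch): catch the error around the vectorized read and resurface it with a pointer to the config flag.

{code}
// Illustrative only: intercept an OutOfMemoryError from the vectorized read path and
// rethrow it with a hint about spark.sql.parquet.enableVectorizedReader.
def withVectorizedReaderHint[T](readBatch: => T): T = {
  try {
    readBatch
  } catch {
    case oom: OutOfMemoryError =>
      throw new OutOfMemoryError(
        s"${oom.getMessage}. Consider disabling the vectorized Parquet reader by " +
          "setting spark.sql.parquet.enableVectorizedReader to false.")
  }
}
{code}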



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16780) spark-streaming-kafka_2.10 version 2.0.0 not on maven central

2016-07-28 Thread Andrew B (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398105#comment-15398105
 ] 

Andrew B commented on SPARK-16780:
--

How are the new artifacts used with the example below? 

https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java#L3

The example contains a reference to the KafkaUtils class, which provides a 
createStream() method. However, org.apache.spark.streaming.kafka010.KafkaUtils, 
which is in spark-streaming-kafka-0-10_2.10, has switched over to the direct 
DStream API, so it does not have a createStream method.

> spark-streaming-kafka_2.10 version 2.0.0 not on maven central
> -
>
> Key: SPARK-16780
> URL: https://issues.apache.org/jira/browse/SPARK-16780
> Project: Spark
>  Issue Type: Bug
>Reporter: Andrew B
>
> I cannot seem to find spark-streaming-kafka_2.10 version 2.0.0 on maven 
> central. Has this been released?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16762) spark hanging when action method print

2016-07-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16762.
---
Resolution: Invalid

Yeah, there are too many possible things that could be wrong here; it should be 
narrowed down separately first, possibly discussed on user@. If there's a fairly 
isolatable issue here, we can reopen.

> spark hanging when action method print
> --
>
> Key: SPARK-16762
> URL: https://issues.apache.org/jira/browse/SPARK-16762
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Reporter: Anh Nguyen
> Attachments: Screen Shot 2016-07-28 at 12.33.35 PM.png, Screen Shot 
> 2016-07-28 at 12.34.22 PM.png
>
>
> I wrote a Spark Streaming consumer that integrates with Kafka and deployed it 
> on the Mesos framework:
> import org.apache.spark.SparkConf
> import org.apache.spark.streaming._
> import org.apache.spark.streaming.kafka._
> import org.apache.spark.streaming.kafka._
> object consumer {
> def main(args: Array[String]) {
>   if (args.length < 4) {
> System.err.println("Usage: KafkaWordCount   
> ")
> System.exit(1)
>   }
>   val Array(zkQuorum, group, topics, numThreads) = args
>   val sparkConf = new 
> SparkConf().setAppName("KafkaWordCount").set("spark.rpc.netty.dispatcher.numThreads","4")
>   val ssc = new StreamingContext(sparkConf, Seconds(2))
>   ssc.checkpoint("checkpoint")
>   val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
>   val lines = KafkaUtils.createStream(ssc, zkQuorum, group, 
> topicMap).map(_._2)
>   lines.print()
>   val words = lines.flatMap(_.split(" "))
>   val wordCounts = words.map(x => (x, 1L)).reduceByKeyAndWindow(_ + _, _ 
> - _, Minutes(10), Seconds(2), 2)
>   ssc.start()
>   ssc.awaitTermination()
> }
>   }
> This log:
> I0728 04:28:05.439469  4295 exec.cpp:143] Version: 0.28.2
> I0728 04:28:05.443464  4296 exec.cpp:217] Executor registered on slave 
> 8e00b1ea-3f70-428a-8125-fb0eed88aede-S0
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 16/07/28 04:28:07 INFO CoarseGrainedExecutorBackend: Registered signal 
> handlers for [TERM, HUP, INT]
> 16/07/28 04:28:08 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/07/28 04:28:09 INFO SecurityManager: Changing view acls to: vagrant
> 16/07/28 04:28:09 INFO SecurityManager: Changing modify acls to: vagrant
> 16/07/28 04:28:09 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(vagrant); users 
> with modify permissions: Set(vagrant)
> 16/07/28 04:28:11 INFO SecurityManager: Changing view acls to: vagrant
> 16/07/28 04:28:11 INFO SecurityManager: Changing modify acls to: vagrant
> 16/07/28 04:28:11 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(vagrant); users 
> with modify permissions: Set(vagrant)
> 16/07/28 04:28:12 INFO Slf4jLogger: Slf4jLogger started
> 16/07/28 04:28:12 INFO Remoting: Starting remoting
> 16/07/28 04:28:13 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://sparkExecutorActorSystem@slave1:57380]
> 16/07/28 04:28:13 INFO Utils: Successfully started service 
> 'sparkExecutorActorSystem' on port 57380.
> 16/07/28 04:28:13 INFO DiskBlockManager: Created local directory at 
> /home/vagrant/mesos/slave1/work/slaves/8e00b1ea-3f70-428a-8125-fb0eed88aede-S0/frameworks/3b189004-daea-47a3-9ba5-dfac5143f4ab-/executors/0/runs/0433234f-332d-4858-9527-64558fe7fb90/blockmgr-4e94f4ea-f074-4dbe-bfa1-34054e6e079c
> 16/07/28 04:28:13 INFO MemoryStore: MemoryStore started with capacity 517.4 MB
> 16/07/28 04:28:13 INFO CoarseGrainedExecutorBackend: Connecting to driver: 
> spark://CoarseGrainedScheduler@192.168.33.30:36179
> 16/07/28 04:28:13 INFO CoarseGrainedExecutorBackend: Successfully registered 
> with driver
> 16/07/28 04:28:13 INFO Executor: Starting executor ID 
> 8e00b1ea-3f70-428a-8125-fb0eed88aede-S0 on host slave1
> 16/07/28 04:28:13 INFO Utils: Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 48672.
> 16/07/28 04:28:13 INFO NettyBlockTransferService: Server created on 48672
> 16/07/28 04:28:13 INFO BlockManagerMaster: Trying to register BlockManager
> 16/07/28 04:28:13 INFO BlockManagerMaster: Registered BlockManager
> 16/07/28 04:28:13 INFO CoarseGrainedExecutorBackend: Got assigned task 0
> 16/07/28 04:28:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
> 16/07/28 04:28:13 INFO Executor: Fetching 
> http://192.168.33.30:52977/jars/spaka-1.0-SNAPSHOT-jar-with-dependencies.jar 
> with timestamp 1469680084686
> 16/07/28 04:28:14 INFO Utils: Fetching 
> 

[jira] [Commented] (SPARK-16770) Spark shell not usable with german keyboard due to JLine version

2016-07-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398072#comment-15398072
 ] 

Sean Owen commented on SPARK-16770:
---

Sure, if you would please open a pull request to update the dependency. Check 
'mvn dependency:tree' to make sure it causes all instances of the dependency to 
be updated.

> Spark shell not usable with german keyboard due to JLine version
> 
>
> Key: SPARK-16770
> URL: https://issues.apache.org/jira/browse/SPARK-16770
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.0
>Reporter: Stefan Schulze
>Priority: Minor
>
> It is impossible to enter a right square bracket with a single keystroke 
> using a german keyboard layout. The problem is known from former Scala 
> version, responsible is jline-2.12.jar (see 
> https://issues.scala-lang.org/browse/SI-8759).
> Workaround: Replace jline-2.12.jar by jline.2.12.1.jar in the jars folder.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16774) Fix use of deprecated TimeStamp constructor

2016-07-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398063#comment-15398063
 ] 

Sean Owen commented on SPARK-16774:
---

Yeah, I remember this one. Seems like it would take some careful use of Calendar 
to resolve, but it certainly can be done. +1
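A rough sketch of the kind of Calendar-based replacement being discussed here (illustrative only, not the actual DateTimeUtils change):

{code}
import java.sql.Timestamp
import java.util.Calendar

// Build a Timestamp without the constructor that has been deprecated since JDK 1.1,
// going through Calendar instead (note that Calendar months are 0-based).
def makeTimestamp(year: Int, month: Int, day: Int,
    hour: Int, minute: Int, second: Int): Timestamp = {
  val cal = Calendar.getInstance()
  cal.clear()
  cal.set(year, month - 1, day, hour, minute, second)
  new Timestamp(cal.getTimeInMillis)
}
{code}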

> Fix use of deprecated TimeStamp constructor
> ---
>
> Key: SPARK-16774
> URL: https://issues.apache.org/jira/browse/SPARK-16774
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: holdenk
>
> The TimeStamp constructor we use inside of DateTime utils has been deprecated 
> since JDK 1.1 - while Java does take a long time to remove deprecated 
> functionality we might as well address this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16782) Use Sphinx autodoc to eliminate duplication of Python docstrings

2016-07-28 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398059#comment-15398059
 ] 

Nicholas Chammas commented on SPARK-16782:
--

[~davies] [~joshrosen] - I can take this on if the idea sounds good to you.

One thing I will have to look into is how the doctests show up, since we don't 
want those to be identical (or do we?) due to the different invocation methods. 
I'm not sure if autodoc will make that possible.

> Use Sphinx autodoc to eliminate duplication of Python docstrings
> 
>
> Key: SPARK-16782
> URL: https://issues.apache.org/jira/browse/SPARK-16782
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Reporter: Nicholas Chammas
>Priority: Minor
>
> In several cases it appears that we duplicate docstrings for methods that are 
> exposed in multiple places.
> For example, here are two instances of {{createDataFrame}} with identical 
> docstrings:
> * 
> https://github.com/apache/spark/blob/ab6e4aea5f39c429d5ea62a5170c8a1da612b74a/python/pyspark/sql/context.py#L218
> * 
> https://github.com/apache/spark/blob/39c836e976fcae51568bed5ebab28e148383b5d4/python/pyspark/sql/session.py#L406
> I believe we can eliminate this duplication by using the [Sphinx autodoc 
> extension|http://www.sphinx-doc.org/en/stable/ext/autodoc.html].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16782) Use Sphinx autodoc to eliminate duplication of Python docstrings

2016-07-28 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-16782:


 Summary: Use Sphinx autodoc to eliminate duplication of Python 
docstrings
 Key: SPARK-16782
 URL: https://issues.apache.org/jira/browse/SPARK-16782
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, PySpark
Reporter: Nicholas Chammas
Priority: Minor


In several cases it appears that we duplicate docstrings for methods that are 
exposed in multiple places.

For example, here are two instances of {{createDataFrame}} with identical 
docstrings:
* 
https://github.com/apache/spark/blob/ab6e4aea5f39c429d5ea62a5170c8a1da612b74a/python/pyspark/sql/context.py#L218
* 
https://github.com/apache/spark/blob/39c836e976fcae51568bed5ebab28e148383b5d4/python/pyspark/sql/session.py#L406

I believe we can eliminate this duplication by using the [Sphinx autodoc 
extension|http://www.sphinx-doc.org/en/stable/ext/autodoc.html].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16780) spark-streaming-kafka_2.10 version 2.0.0 not on maven central

2016-07-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16780.
---
Resolution: Not A Problem

These have become artifacts like spark-streaming-kafka-0-10_2.10 in 2.0.0.
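For example, in sbt coordinates (the 0.8-compatible artifact is the one that still provides the receiver-based KafkaUtils.createStream API):

{code}
// Spark 2.0.0 splits the Kafka integration by Kafka broker version:
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-8"  % "2.0.0"  // receiver-based createStream
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.0"  // new consumer API
{code}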

> spark-streaming-kafka_2.10 version 2.0.0 not on maven central
> -
>
> Key: SPARK-16780
> URL: https://issues.apache.org/jira/browse/SPARK-16780
> Project: Spark
>  Issue Type: Bug
>Reporter: Andrew B
>
> I cannot seem to find spark-streaming-kafka_2.10 version 2.0.0 on maven 
> central. Has this been released?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16769) httpclient classic dependency - potentially a patch required?

2016-07-28 Thread Adam Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398018#comment-15398018
 ] 

Adam Roberts commented on SPARK-16769:
--

Thanks. Looking on Maven Central, I see the latest jets3t version is 0.9.4, so 
I've removed the commons-httpclient dependency from both the main pom.xml and 
the Hive pom.xml, and I'm building/testing now to see if anything breaks on 
master before trying the same on branch-1.6.

This line in the main pom is indeed pulling in the bad jar:

{code}<jets3t.version>0.7.1</jets3t.version>{code}

and no jets3t 0.9.x version mentions the commons-httpclient dependency.

Pending anything breaking, I'll create a JIRA and pull request to bump the version 
and get rid of the classic one - unless something really depends on 0.7.1, in 
which case we may have some code to modify if the API changed.

> httpclient classic dependency - potentially a patch required?
> -
>
> Key: SPARK-16769
> URL: https://issues.apache.org/jira/browse/SPARK-16769
> Project: Spark
>  Issue Type: Question
>  Components: Build
>Affects Versions: 1.6.2, 2.0.0
> Environment: All Spark versions, any environment
>Reporter: Adam Roberts
>
> In our jars folder for Spark we provide a jar with a CVE 
> https://www.versioneye.com/java/commons-httpclient:commons-httpclient/3.1. 
> CVE-2012-5783
> This paper outlines the problem
> www.cs.utexas.edu/~shmat/shmat_ccs12.pdf
> My question is: do we need to ship this version as well or is it only used 
> for tests? Is it a patched version? I plan to run without this dependency and 
> if there are NoClassDefFound problems I'll add test so we 
> don't ship it (downloading it in the first place is bad enough though)
> Note that this is valid for all versions, suggesting it be raised to a 
> critical if Spark functionality is depending on it because of what the pdf 
> I've linked to mentions
> Here is the jar being included:
> ls $SPARK_HOME/jars | grep "httpclient"
> commons-httpclient-3.1.jar
> httpclient-4.5.2.jar
> The first jar potentially contains the security issue, could be a patched 
> version, need to verify. SHA1 sum for this jar is 
> 964cd74171f427720480efdec40a7c7f6e58426a



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16765) Add Pipeline API example for KMeans

2016-07-28 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397994#comment-15397994
 ] 

Bryan Cutler commented on SPARK-16765:
--

Was there some specific use of Pipelines with KMeans that you would like to 
see?  Generally speaking, you can build a pipeline and transform features for 
KMeans in the same way as any other algorithm, so I don't think another example 
would really add value.

> Add Pipeline API example for KMeans
> ---
>
> Key: SPARK-16765
> URL: https://issues.apache.org/jira/browse/SPARK-16765
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Manish Mishra
>Priority: Trivial
>
> A pipeline API example for K Means would be nicer to have in examples package 
> since it is one of the widely used ML algorithms. A pipeline API example of 
> Logistic Regression is already added to the ml example package. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16779) Fix unnecessary use of postfix operations

2016-07-28 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397984#comment-15397984
 ] 

holdenk edited comment on SPARK-16779 at 7/28/16 6:36 PM:
--

I'm sort of on the fence about fixing this as well - but we are also using it for 
command execution with {quote}!!{quote}, which (in my opinion) doesn't increase 
readability at all. So my current plan is to make a small fix to the one place 
where we don't have the warning squelched when doing {quote}!!{quote}, and if 
people agree that it's cleaner without it, I'll go through and clean up the rest 
of the {quote}!!{quote} places.


was (Author: holdenk):
I'm sort of on the fence with fixing as well - but we are also using it for 
command execution with "!!" which (in my opinion) doesn't increase readability 
at all. So my current plan is to make a small fix to the one place where we 
don't have the warning squelched with doing "!!" and if people agree that its 
cleaner without it - I'll go through and cleanup the rest of the "!!" places.

> Fix unnecessary use of postfix operations
> -
>
> Key: SPARK-16779
> URL: https://issues.apache.org/jira/browse/SPARK-16779
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: holdenk
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16779) Fix unnecessary use of postfix operations

2016-07-28 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397984#comment-15397984
 ] 

holdenk commented on SPARK-16779:
-

I'm sort of on the fence about fixing this as well - but we are also using it for 
command execution with "!!", which (in my opinion) doesn't increase readability 
at all. So my current plan is to make a small fix to the one place where we 
don't have the warning squelched when doing "!!", and if people agree that it's 
cleaner without it, I'll go through and clean up the rest of the "!!" places.

> Fix unnecessary use of postfix operations
> -
>
> Key: SPARK-16779
> URL: https://issues.apache.org/jira/browse/SPARK-16779
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: holdenk
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16781) java launched by PySpark as gateway may not be the same java used in the spark environment

2016-07-28 Thread Michael Berman (JIRA)
Michael Berman created SPARK-16781:
--

 Summary: java launched by PySpark as gateway may not be the same 
java used in the spark environment
 Key: SPARK-16781
 URL: https://issues.apache.org/jira/browse/SPARK-16781
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.6.2
Reporter: Michael Berman


When launching spark on a system with multiple javas installed, there are a few 
options for choosing which JRE to use, setting `JAVA_HOME` being the most 
straightforward.

However, when pyspark's internal py4j launches its JavaGateway, it always 
invokes `java` directly, without qualification. This means you get whatever 
java's first on your path, which is not necessarily the same one in spark's 
JAVA_HOME.

This could be seen as a py4j issue, but from their point of view, the fix is 
easy: make sure the java you want is first on your path. I can't figure out a 
way to make that reliably happen through the pyspark executor launch path, and 
it seems like something that would ideally happen automatically. If I set 
JAVA_HOME when launching spark, I would expect that to be the only java used 
throughout the stack.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16779) Fix unnecessary use of postfix operations

2016-07-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397968#comment-15397968
 ] 

Sean Owen commented on SPARK-16779:
---

Yeah, I tend to agree with fixing this up, because Scala 2.11 actually warns on 
this unless you explicitly add an import to enable it. That sounds like something 
to be avoided. I'm only on the fence about cases where it makes for nice 
natural syntax like "0 to 10".
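A small sketch of both cases mentioned in this thread (not Spark code): the {{!!}} process call written postfix triggers the Scala 2.11 feature warning unless {{scala.language.postfixOps}} is imported, while the same call with a dot needs no feature import.

{code}
import scala.sys.process._

// Postfix style: would need "import scala.language.postfixOps" to compile without the warning.
// val output = Seq("echo", "hello") !!

// Equivalent dot-notation call, no feature import required:
val output = Seq("echo", "hello").!!
println(output)
{code}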

> Fix unnecessary use of postfix operations
> -
>
> Key: SPARK-16779
> URL: https://issues.apache.org/jira/browse/SPARK-16779
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: holdenk
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16780) spark-streaming-kafka_2.10 version 2.0.0 not on maven central

2016-07-28 Thread Andrew B (JIRA)
Andrew B created SPARK-16780:


 Summary: spark-streaming-kafka_2.10 version 2.0.0 not on maven 
central
 Key: SPARK-16780
 URL: https://issues.apache.org/jira/browse/SPARK-16780
 Project: Spark
  Issue Type: Bug
Reporter: Andrew B


I cannot seem to find spark-streaming-kafka_2.10 version 2.0.0 on maven 
central. Has this been released?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16779) Fix unnecessary use of postfix operations

2016-07-28 Thread holdenk (JIRA)
holdenk created SPARK-16779:
---

 Summary: Fix unnecessary use of postfix operations
 Key: SPARK-16779
 URL: https://issues.apache.org/jira/browse/SPARK-16779
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: holdenk






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16773) Post Spark 2.0 deprecation & warnings cleanup

2016-07-28 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-16773:

Summary: Post Spark 2.0 deprecation & warnings cleanup  (was: Post Spark 
2.0 deprecation cleanup)

> Post Spark 2.0 deprecation & warnings cleanup
> -
>
> Key: SPARK-16773
> URL: https://issues.apache.org/jira/browse/SPARK-16773
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark, Spark Core, SQL
>Reporter: holdenk
>
> As part of the 2.0 release we deprecated a number of different internal 
> components (one of the largest ones being the old accumulator API), and also 
> upgraded our default build to Scala 2.11.
> This has added a large number of deprecation warnings (internal and external) 
> - some of which can be worked around - and some of which can't (mostly in the 
> Scala 2.10 -> 2.11 reflection API and various tests).
> We should attempt to limit the number of warnings in our build so that we can 
> notice new ones and thoughtfully consider if they are warranted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16769) httpclient classic dependency - potentially a patch required?

2016-07-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397961#comment-15397961
 ] 

Sean Owen commented on SPARK-16769:
---

I'm pretty certain it's there only because some dependency needs it, and it looks 
like you found it: jets3t. It wouldn't be a patched version. I don't think it can 
be excluded, because that would break S3. However, maybe there's a newer version 
of jets3t, or of the old httpclient, that we can manage up to?

> httpclient classic dependency - potentially a patch required?
> -
>
> Key: SPARK-16769
> URL: https://issues.apache.org/jira/browse/SPARK-16769
> Project: Spark
>  Issue Type: Question
>  Components: Build
>Affects Versions: 1.6.2, 2.0.0
> Environment: All Spark versions, any environment
>Reporter: Adam Roberts
>
> In our jars folder for Spark we provide a jar with a CVE 
> https://www.versioneye.com/java/commons-httpclient:commons-httpclient/3.1. 
> CVE-2012-5783
> This paper outlines the problem
> www.cs.utexas.edu/~shmat/shmat_ccs12.pdf
> My question is: do we need to ship this version as well or is it only used 
> for tests? Is it a patched version? I plan to run without this dependency and 
> if there are NoClassDefFound problems I'll add test so we 
> don't ship it (downloading it in the first place is bad enough though)
> Note that this is valid for all versions, suggesting it be raised to a 
> critical if Spark functionality is depending on it because of what the pdf 
> I've linked to mentions
> Here is the jar being included:
> ls $SPARK_HOME/jars | grep "httpclient"
> commons-httpclient-3.1.jar
> httpclient-4.5.2.jar
> The first jar potentially contains the security issue, could be a patched 
> version, need to verify. SHA1 sum for this jar is 
> 964cd74171f427720480efdec40a7c7f6e58426a



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16778) Fix use of deprecated SQLContext constructor

2016-07-28 Thread holdenk (JIRA)
holdenk created SPARK-16778:
---

 Summary: Fix use of deprecated SQLContext constructor
 Key: SPARK-16778
 URL: https://issues.apache.org/jira/browse/SPARK-16778
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: holdenk






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16775) Reduce internal warnings from deprecated accumulator API

2016-07-28 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-16775:

Component/s: SQL

> Reduce internal warnings from deprecated accumulator API
> 
>
> Key: SPARK-16775
> URL: https://issues.apache.org/jira/browse/SPARK-16775
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Spark Core, SQL
>Reporter: holdenk
>
> Deprecating the old accumulator API added a large number of warnings - many 
> of these could be fixed with a bit of refactoring to offer a non-deprecated 
> internal class while still preserving the external deprecation warnings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16777) Parquet schema converter depends on deprecated APIs

2016-07-28 Thread holdenk (JIRA)
holdenk created SPARK-16777:
---

 Summary: Parquet schema converter depends on deprecated APIs
 Key: SPARK-16777
 URL: https://issues.apache.org/jira/browse/SPARK-16777
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: holdenk






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16775) Reduce internal warnings from deprecated accumulator API

2016-07-28 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-16775:

Component/s: (was: ML)
 (was: SQL)
 (was: MLlib)

> Reduce internal warnings from deprecated accumulator API
> 
>
> Key: SPARK-16775
> URL: https://issues.apache.org/jira/browse/SPARK-16775
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Spark Core, SQL
>Reporter: holdenk
>
> Deprecating the old accumulator API added a large number of warnings - many 
> of these could be fixed with a bit of refactoring to offer a non-deprecated 
> internal class while still preserving the external deprecation warnings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16776) Fix Kafka deprecation warnings

2016-07-28 Thread holdenk (JIRA)
holdenk created SPARK-16776:
---

 Summary: Fix Kafka deprecation warnings
 Key: SPARK-16776
 URL: https://issues.apache.org/jira/browse/SPARK-16776
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Reporter: holdenk


The new KafkaTestUtils depends on many deprecated APIs - see if we can refactor 
this to be avoided.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16775) Reduce internal warnings from deprecated accumulator API

2016-07-28 Thread holdenk (JIRA)
holdenk created SPARK-16775:
---

 Summary: Reduce internal warnings from deprecated accumulator API
 Key: SPARK-16775
 URL: https://issues.apache.org/jira/browse/SPARK-16775
 Project: Spark
  Issue Type: Sub-task
Reporter: holdenk


Deprecating the old accumulator API added a large number of warnings - many of 
these could be fixed with a bit of refactoring to offer a non-deprecated 
internal class while still preserving the external deprecation warnings.
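A minimal, hypothetical sketch of that refactoring pattern (the class names are made up, not Spark's real ones): keep the implementation in a non-deprecated internal class and let the deprecated public API forward to it, so internal call sites stop producing warnings while external users still see the deprecation.

{code}
// Internal, non-deprecated implementation (in Spark this would be private[spark]).
class InternalLegacyAccumulator[T](initial: T)(implicit num: Numeric[T]) {
  private var current: T = initial
  def add(v: T): Unit = { current = num.plus(current, v) }
  def value: T = current
}

// Deprecated public wrapper: external callers still get the warning,
// but Spark's own code can use the internal class directly.
@deprecated("use AccumulatorV2 instead", "2.0.0")
class LegacyAccumulator[T](initial: T)(implicit num: Numeric[T]) {
  private val impl = new InternalLegacyAccumulator[T](initial)
  def add(v: T): Unit = impl.add(v)
  def value: T = impl.value
}
{code}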



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16774) Fix use of deprecated TimeStamp constructor

2016-07-28 Thread holdenk (JIRA)
holdenk created SPARK-16774:
---

 Summary: Fix use of deprecated TimeStamp constructor
 Key: SPARK-16774
 URL: https://issues.apache.org/jira/browse/SPARK-16774
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: holdenk


The TimeStamp constructor we use inside of DateTime utils has been deprecated 
since JDK 1.1 - while Java does take a long time to remove deprecated 
functionality we might as well address this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16773) Post Spark 2.0 deprecation cleanup

2016-07-28 Thread holdenk (JIRA)
holdenk created SPARK-16773:
---

 Summary: Post Spark 2.0 deprecation cleanup
 Key: SPARK-16773
 URL: https://issues.apache.org/jira/browse/SPARK-16773
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib, PySpark, Spark Core, SQL
Reporter: holdenk


As part of the 2.0 release we deprecated a number of different internal 
components (one of the largest ones being the old accumulator API), and also 
upgraded our default build to Scala 2.11.
This has added a large number of deprecation warnings (internal and external) - 
some of which can be worked around - and some of which can't (mostly in the 
Scala 2.10 -> 2.11 reflection API and various tests).
We should attempt to limit the number of warnings in our build so that we can 
notice new ones and thoughtfully consider if they are warranted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16769) httpclient classic dependency - potentially a patch required?

2016-07-28 Thread Adam Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397889#comment-15397889
 ] 

Adam Roberts commented on SPARK-16769:
--

[~rxin] [~srowen] Reynold and Sean, interested in what you think: a test run on 
my VM hasn't produced any new failures after deleting the jar from my m2 repo. 
I'm also trying to figure out whether this version carries a patch, so I'm 
running the test case described at 
https://issues.apache.org/jira/browse/HTTPCLIENT-1265

A quick grep of the Spark source doesn't show it being used directly in our 
code either:
aroberts@aroberts-VirtualBox:~/Desktop/Spark-DK$ grep -R "org.apache.commons.httpclient" .
Binary file ./dist/jars/commons-httpclient-3.1.jar matches
Binary file ./dist/jars/jets3t-0.9.3.jar matches
Binary file 
./sql/hive/target/tmp/hive-ivy-cache/cache/commons-httpclient/commons-httpclient/jars/commons-httpclient-3.1.jar
 matches
Binary file 
./sql/hive/target/tmp/hive-ivy-cache/jars/commons-httpclient_commons-httpclient-3.1.jar
 matches

Can you please clarify why we have this dependency? 

> httpclient classic dependency - potentially a patch required?
> -
>
> Key: SPARK-16769
> URL: https://issues.apache.org/jira/browse/SPARK-16769
> Project: Spark
>  Issue Type: Question
>  Components: Build
>Affects Versions: 1.6.2, 2.0.0
> Environment: All Spark versions, any environment
>Reporter: Adam Roberts
>
> In our jars folder for Spark we provide a jar with a CVE 
> https://www.versioneye.com/java/commons-httpclient:commons-httpclient/3.1. 
> CVE-2012-5783
> This paper outlines the problem
> www.cs.utexas.edu/~shmat/shmat_ccs12.pdf
> My question is: do we need to ship this version as well or is it only used 
> for tests? Is it a patched version? I plan to run without this dependency and 
> if there are NoClassDefFound problems I'll add test so we 
> don't ship it (downloading it in the first place is bad enough though)
> Note that this is valid for all versions, suggesting it be raised to a 
> critical if Spark functionality is depending on it because of what the pdf 
> I've linked to mentions
> Here is the jar being included:
> ls $SPARK_HOME/jars | grep "httpclient"
> commons-httpclient-3.1.jar
> httpclient-4.5.2.jar
> The first jar potentially contains the security issue, could be a patched 
> version, need to verify. SHA1 sum for this jar is 
> 964cd74171f427720480efdec40a7c7f6e58426a



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11248) Spark hivethriftserver is using the wrong user to while getting HDFS permissions

2016-07-28 Thread Furcy Pin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397882#comment-15397882
 ] 

Furcy Pin commented on SPARK-11248:
---

+1

I'm trying on spark 2.0.0, and I've configured 
hive.exec.stagingdir=/tmp/spark-staging/spark-staging

then the spark thrift server creates temporary files here:
/tmp/spark-staging/spark-staging_hive_2016-07-28_17-30-40_226_8664570533244973761-9/-ext-1/_temporary/0/task_201607281730_0015_m_00

with the following rights:
{code}
drwxrwxrwx spark:spark /tmp/spark-staging
drwxrwxrwx  user:spark 
/tmp/spark-staging/spark-staging_hive_2016-07-28_17-30-40_226_8664570533244973761-9
drwxrwxr-x  user:spark 
/tmp/spark-staging/spark-staging_hive_2016-07-28_17-30-40_226_8664570533244973761-9/-ext-1
drwxrwxr-x  user:spark 
/tmp/spark-staging/spark-staging_hive_2016-07-28_17-30-40_226_8664570533244973761-9/-ext-1/_temporary/
drwxrwxr-x  user:spark 
/tmp/spark-staging/spark-staging_hive_2016-07-28_17-30-40_226_8664570533244973761-9/-ext-1/_temporary/0
drwxrwxr-x spark:spark 
/tmp/spark-staging/spark-staging_hive_2016-07-28_17-30-40_226_8664570533244973761-9/-ext-1/_temporary/0/task_201607281730_0015_m_00
{code}

I then get an error when trying to move the staging files to the hive table.

org.apache.hadoop.security.AccessControlException: Permission denied: 
user=user, access=WRITE, 
inode="/tmp/spark-staging/spark-staging_hive_2016-07-28_17-30-40_226_8664570533244973761-9/-ext-1/_temporary/0/task_201607281730_0015_m_00":spark:spark:drwxrwxr-x

I tried setting the property "hive.server2.enable.doAs" to true or false, but 
it didn't seem to change anything.



> Spark hivethriftserver is using the wrong user to while getting HDFS 
> permissions
> 
>
> Key: SPARK-11248
> URL: https://issues.apache.org/jira/browse/SPARK-11248
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Trystan Leftwich
>
> While running spark as a hivethrift-server via Yarn Spark will use the user 
> running the Hivethrift server rather than the user connecting via JDBC to 
> check HDFS perms.
> i.e.
> In HDFS the perms are
> rwx--   3 testuser testuser /user/testuser/table/testtable
> And i connect via beeline as user testuser
> beeline -u 'jdbc:hive2://localhost:10511' -n 'testuser' -p ''
> If i try to hit that table
> select count(*) from test_table;
> I get the following error
> Error: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch 
> table test_table. java.security.AccessControlException: Permission denied: 
> user=hive, access=READ, 
> inode="/user/testuser/table/testtable":testuser:testuser:drwxr-x--x
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:185)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6795)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6777)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPathAccess(FSNamesystem.java:6702)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:9529)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1516)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1433)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) 
> (state=,code=0)
> I have the following set in hive-site.xml so it should be using the 
> correct user.
> <property>
>   <name>hive.server2.enable.doAs</name>
>   <value>true</value>
> </property>
> <property>
>   <name>hive.metastore.execute.setugi</name>
>   <value>true</value>

[jira] [Commented] (SPARK-16751) Upgrade derby to 10.12.1.1 from 10.11.1.1

2016-07-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397865#comment-15397865
 ] 

Sean Owen commented on SPARK-16751:
---

Yeah, I see from the PR that it's actually packaged. I don't know that it 
actually surfaces the attack vector in Spark in practice, but, let's just apply 
this of course.

> Upgrade derby to 10.12.1.1 from 10.11.1.1
> -
>
> Key: SPARK-16751
> URL: https://issues.apache.org/jira/browse/SPARK-16751
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.2, 2.0.0
> Environment: All platforms and major Spark releases
>Reporter: Adam Roberts
>Priority: Minor
>
> This JIRA is to upgrade the derby version from 10.11.1.1 to 10.12.1.1
> Sean and I figured that we only use derby for tests and so the initial pull 
> request was to not include it in the jars folder for Spark. I now believe it 
> is required based on comments for the pull request and so this is only a 
> dependency upgrade.
> The upgrade is due to an already disclosed vulnerability (CVE-2015-1832) in 
> derby 10.11.1.1. We used https://www.versioneye.com/search and will be 
> checking for any other problems in a variety of libraries too: investigating 
> if we can set up a Jenkins job to check our pom on a regular basis so we can 
> stay ahead of the game for matters like this.
> This was raised on the mailing list at 
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-2-0-0-RC5-tp18367p18465.html
>  by Stephen Hellberg and replied to by Sean Owen.
> I've checked the impact to previous Spark releases and this particular 
> version of derby is the only relatively recent and without vulnerabilities 
> version (I checked up to the 1.3 branch) so ideally we'd backport this for 
> all impacted Spark releases.
> I've marked this as critical and ticked the important checkbox as it's going 
> to impact every user, there isn't a security component (should we add one?) 
> and hence the build tag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16763) Factor out Tungsten for General Purpose Use

2016-07-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16763.
---
Resolution: Invalid

I'm going to close this because it's a question for the user@ list. However, I 
don't think Tungsten as a whole is separable from Spark, though bits of it may 
be; see the 'unsafe' module, for example, which is already separated.

> Factor out Tungsten for General Purpose Use
> --
>
> Key: SPARK-16763
> URL: https://issues.apache.org/jira/browse/SPARK-16763
> Project: Spark
>  Issue Type: Improvement
>Reporter: Suminda Dharmasena
>
> Can you factor out Tungsten so that it can be used as a general-purpose library?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16772) Correct API doc references to PySpark classes + formatting fixes

2016-07-28 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-16772:
-
Summary: Correct API doc references to PySpark classes + formatting fixes  
(was: Correct API doc references to PySpark classes)

> Correct API doc references to PySpark classes + formatting fixes
> 
>
> Key: SPARK-16772
> URL: https://issues.apache.org/jira/browse/SPARK-16772
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Reporter: Nicholas Chammas
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16772) Correct API doc references to PySpark classes

2016-07-28 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-16772:
-
Summary: Correct API doc references to PySpark classes  (was: Correct API 
doc references to DataType + other minor doc tweaks)

> Correct API doc references to PySpark classes
> -
>
> Key: SPARK-16772
> URL: https://issues.apache.org/jira/browse/SPARK-16772
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Reporter: Nicholas Chammas
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions

2016-07-28 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397827#comment-15397827
 ] 

Shivaram Venkataraman commented on SPARK-16611:
---

1. lapply: From an API perspective we can add an lapply that is implemented 
using dapply, i.e. dapply runs on each partition with a data.frame as input, 
while lapply could operate on each row. Also, dapply should not shuffle any data 
and should work the same as lapplyPartition in terms of execution.

It would be more interesting to know if dapply is somehow worse in terms of 
functionality. Could you explain what the issue with useBroadcast is? From 
what I can see, the code path is the same in RDD.R and DataFrame.R.

2. getJRDD: So it looks like the function required here is a way to extract the 
java rdd object from the Dataframe ? In that case there is a `toRDD` function 
in Dataset.scala that we should be able to expose (I think it should just 
involve using `callJMethod(df@sdf, "toRDD")`). This can be made to return a 
jobj instead of a RDD object in R.

3. Thanks for explaining this -- I think we can expose cleanup.jobj as a public 
function to be used to register finalizers.

This JIRA isn't about removing anything yet -- but short term we will need to 
remove / rename parts of the R API of RDDs to satisfy CRAN checks (see 
SPARK-16519). Longer term it would be great to have one well-maintained code 
path, so knowing what doesn't work with the dapply family of functions will 
be very useful.

> Expose several hidden DataFrame/RDD functions
> -
>
> Key: SPARK-16611
> URL: https://issues.apache.org/jira/browse/SPARK-16611
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Oscar D. Lara Yejas
>
> Expose the following functions:
> - lapply or map
> - lapplyPartition or mapPartition
> - flatMap
> - RDD
> - toRDD
> - getJRDD
> - cleanup.jobj
> cc:
> [~javierluraschi] [~j...@rstudio.com] [~shivaram]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16740) joins.LongToUnsafeRowMap crashes with NegativeArraySizeException

2016-07-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16740.
-
   Resolution: Fixed
 Assignee: Sylvain Zimmer
Fix Version/s: 2.1.0
   2.0.1

> joins.LongToUnsafeRowMap crashes with NegativeArraySizeException
> 
>
> Key: SPARK-16740
> URL: https://issues.apache.org/jira/browse/SPARK-16740
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>Assignee: Sylvain Zimmer
> Fix For: 2.0.1, 2.1.0
>
>
> Hello,
> Here is a crash in Spark SQL joins, with a minimal reproducible test case. 
> Interestingly, it only seems to happen when reading Parquet data (I added a 
> {{crash = True}} variable to show it)
> This is an {{left_outer}} example, but it also crashes with a regular 
> {{inner}} join.
> {code}
> import os
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
> sc = SparkContext()
> sqlc = SQLContext(sc)
> schema1 = SparkTypes.StructType([
> SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True)
> ])
> schema2 = SparkTypes.StructType([
> SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True)
> ])
> # Valid Long values (-9223372036854775808 < -5543241376386463808 , 
> 4661454128115150227 < 9223372036854775807)
> data1 = [(4661454128115150227,), (-5543241376386463808,)]
> data2 = [(650460285, )]
> df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1)
> df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2)
> crash = True
> if crash:
> os.system("rm -rf /tmp/sparkbug")
> df1.write.parquet("/tmp/sparkbug/vertex")
> df2.write.parquet("/tmp/sparkbug/edge")
> df1 = sqlc.read.load("/tmp/sparkbug/vertex")
> df2 = sqlc.read.load("/tmp/sparkbug/edge")
> result_df = df2.join(df1, on=(df1.id1 == df2.id2), how="left_outer")
> # Should print [Row(id2=650460285, id1=None)]
> print result_df.collect()
> {code}
> When ran with {{spark-submit}}, the final {{collect()}} call crashes with 
> this:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o61.collectToPython.
> : org.apache.spark.SparkException: Exception thrown in awaitResult:
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120)
>   at 
> org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:242)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>   at 
> org.apache.spark.sql.execution.BatchedDataSourceScanExec.consume(ExistingRDD.scala:225)
>   at 
> org.apache.spark.sql.execution.BatchedDataSourceScanExec.doProduce(ExistingRDD.scala:328)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.BatchedDataSourceScanExec.produce(ExistingRDD.scala:225)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doProduce(BroadcastHashJoinExec.scala:77)
>   at 
> 

[jira] [Assigned] (SPARK-16772) Correct API doc references to DataType + other minor doc tweaks

2016-07-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16772:


Assignee: (was: Apache Spark)

> Correct API doc references to DataType + other minor doc tweaks
> ---
>
> Key: SPARK-16772
> URL: https://issues.apache.org/jira/browse/SPARK-16772
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Reporter: Nicholas Chammas
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16772) Correct API doc references to DataType + other minor doc tweaks

2016-07-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397796#comment-15397796
 ] 

Apache Spark commented on SPARK-16772:
--

User 'nchammas' has created a pull request for this issue:
https://github.com/apache/spark/pull/14393

> Correct API doc references to DataType + other minor doc tweaks
> ---
>
> Key: SPARK-16772
> URL: https://issues.apache.org/jira/browse/SPARK-16772
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Reporter: Nicholas Chammas
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16772) Correct API doc references to DataType + other minor doc tweaks

2016-07-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16772:


Assignee: Apache Spark

> Correct API doc references to DataType + other minor doc tweaks
> ---
>
> Key: SPARK-16772
> URL: https://issues.apache.org/jira/browse/SPARK-16772
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Reporter: Nicholas Chammas
>Assignee: Apache Spark
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16772) Correct API doc references to DataType + other minor doc tweaks

2016-07-28 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-16772:


 Summary: Correct API doc references to DataType + other minor doc 
tweaks
 Key: SPARK-16772
 URL: https://issues.apache.org/jira/browse/SPARK-16772
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, PySpark
Reporter: Nicholas Chammas
Priority: Trivial






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory

2016-07-28 Thread Gaurav Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397654#comment-15397654
 ] 

Gaurav Shah commented on SPARK-2984:


[~dmaverick] Can you tell me a little more about sequential import vs. append? 
I didn't get that.

> FileNotFoundException on _temporary directory
> -
>
> Key: SPARK-2984
> URL: https://issues.apache.org/jira/browse/SPARK-2984
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Ash
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.3.0
>
>
> We've seen several stacktraces and threads on the user mailing list where 
> people are having issues with a {{FileNotFoundException}} stemming from an 
> HDFS path containing {{_temporary}}.
> I ([~aash]) think this may be related to {{spark.speculation}}.  I think the 
> error condition might manifest in this circumstance:
> 1) task T starts on an executor E1
> 2) it takes a long time, so task T' is started on another executor E2
> 3) T finishes in E1 so moves its data from {{_temporary}} to the final 
> destination and deletes the {{_temporary}} directory during cleanup
> 4) T' finishes in E2 and attempts to move its data from {{_temporary}}, but 
> those files no longer exist, hence the exception.
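> A minimal mitigation sketch that follows from this hypothesis (assuming the 
> race above; this is not the fix that was eventually merged, just a way to keep 
> a second attempt from ever touching {{_temporary}}) is to disable speculation:
> {code}
> # Hypothetical workaround sketch: with speculative execution disabled, only one
> # attempt of each task writes and commits output, so the cleanup race between
> # two attempts of the same task cannot occur.
> from pyspark import SparkConf, SparkContext
>
> conf = SparkConf().set("spark.speculation", "false")
> sc = SparkContext(conf=conf)
> {code}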
> Some samples:
> {noformat}
> 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job 
> 140774430 ms.0
> java.io.FileNotFoundException: File 
> hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07
>  does not exist.
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
> at 
> org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136)
> at 
> org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643)
> at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at scala.util.Try$.apply(Try.scala:161)
> at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
> at 
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> -- Chen Song at 
> http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFiles-file-not-found-exception-td10686.html
> {noformat}
> I am running a Spark Streaming job that uses saveAsTextFiles to save results 
> into hdfs files. However, it has an exception after 20 batches
> result-140631234/_temporary/0/task_201407251119__m_03 does not 
> exist.
> {noformat}
> and
> {noformat}
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>  No lease on /apps/data/vddil/real-time/checkpoint/temp: File does not exist. 
> Holder DFSClient_NONMAPREDUCE_327993456_13 does not have any open files.
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2946)
>   at 
> 

[jira] [Updated] (SPARK-16639) query fails if having condition contains grouping column

2016-07-28 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-16639:

Component/s: SQL

> query fails if having condition contains grouping column
> 
>
> Key: SPARK-16639
> URL: https://issues.apache.org/jira/browse/SPARK-16639
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.1, 2.1.0
>
>
> {code}
> create table tbl(a int, b string);
> select count(b) from tbl group by a + 1 having a + 1 = 2;
> {code}
> This will fail analysis.
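> A hedged PySpark sketch of one way to express the same query without the 
> HAVING reference (assumes a SparkSession named {{spark}} and the table above; 
> not verified against this exact bug):
> {code}
> # Hypothetical workaround sketch: alias the grouping expression and filter on
> # the alias in an outer query rather than in a HAVING clause.
> workaround = spark.sql("""
>     SELECT cnt
>     FROM (
>         SELECT a + 1 AS a_plus_1, count(b) AS cnt
>         FROM tbl
>         GROUP BY a + 1
>     ) t
>     WHERE a_plus_1 = 2
> """)
> workaround.show()
> {code}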



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16639) query fails if having condition contains grouping column

2016-07-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-16639:

Fix Version/s: 2.0.1

> query fails if having condition contains grouping column
> 
>
> Key: SPARK-16639
> URL: https://issues.apache.org/jira/browse/SPARK-16639
> Project: Spark
>  Issue Type: Bug
>Reporter: Wenchen Fan
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.1, 2.1.0
>
>
> {code}
> create table tbl(a int, b string);
> select count(b) from tbl group by a + 1 having a + 1 = 2;
> {code}
> This will fail analysis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16639) query fails if having condition contains grouping column

2016-07-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-16639:

Assignee: Liang-Chi Hsieh

> query fails if having condition contains grouping column
> 
>
> Key: SPARK-16639
> URL: https://issues.apache.org/jira/browse/SPARK-16639
> Project: Spark
>  Issue Type: Bug
>Reporter: Wenchen Fan
>Assignee: Liang-Chi Hsieh
> Fix For: 2.1.0
>
>
> {code}
> create table tbl(a int, b string);
> select count(b) from tbl group by a + 1 having a + 1 = 2;
> {code}
> This will fail analysis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16639) query fails if having condition contains grouping column

2016-07-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-16639.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14296
[https://github.com/apache/spark/pull/14296]

> query fails if having condition contains grouping column
> 
>
> Key: SPARK-16639
> URL: https://issues.apache.org/jira/browse/SPARK-16639
> Project: Spark
>  Issue Type: Bug
>Reporter: Wenchen Fan
>Assignee: Liang-Chi Hsieh
> Fix For: 2.1.0
>
>
> {code}
> create table tbl(a int, b string);
> select count(b) from tbl group by a + 1 having a + 1 = 2;
> {code}
> This will fail analysis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16771) Infinite recursion loop in org.apache.spark.sql.catalyst.trees.TreeNode when table name collides.

2016-07-28 Thread Furcy Pin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Furcy Pin updated SPARK-16771:
--
Description: 
How to reproduce: 

In spark-sql on Hive

{code}
DROP TABLE IF EXISTS t1 ;
CREATE TABLE test.t1(col1 string) ;

WITH t1 AS (
  SELECT  col1
  FROM t1 
)
SELECT col1
FROM t1
LIMIT 2
;
{code}

This makes a nice StackOverflowError:

{code}
java.lang.StackOverflowError
at 
scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:230)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
at scala.collection.immutable.List.map(List.scala:285)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:348)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:170)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:170)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$4.apply(QueryPlan.scala:175)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:175)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:144)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CTESubstitution$$anonfun$substituteCTE$1.applyOrElse(Analyzer.scala:147)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CTESubstitution$$anonfun$substituteCTE$1.applyOrElse(Analyzer.scala:133)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$2.apply(TreeNode.scala:286)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$2.apply(TreeNode.scala:286)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:286)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
...
{code}

This does not happen if I change the name of the CTE.

I guess Catalyst gets caught in an infinite recursion loop because the CTE and 
the source table have the same name.
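
A minimal sketch of the workaround mentioned above, i.e. giving the CTE a name that 
does not collide with the source table (assumes a SparkSession named {{spark}} with 
Hive support and the {{test.t1}} table from the reproducer; illustrative only):

{code}
# Hypothetical sketch: with a distinct CTE name there is no collision, so CTE
# substitution cannot keep substituting the CTE definition into itself.
ok = spark.sql("""
    WITH t1_renamed AS (
        SELECT col1
        FROM test.t1
    )
    SELECT col1
    FROM t1_renamed
    LIMIT 2
""")
ok.show()
{code}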





  was:
How to reproduce: 

In spark-sql on Hive

{code}
DROP TABLE IF EXISTS t1 ;
CREATE TABLE test.t1(col1 string) ;

WITH t1 AS (
  SELECT  col1
  FROM t1 
)
SELECT col1
FROM t1
LIMIT 2
;
{code}

This makes a nice StackOverflowError:

{code}
java.lang.StackOverflowError
at 
scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:230)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
at scala.collection.immutable.List.map(List.scala:285)
at 

[jira] [Created] (SPARK-16771) Infinite recursion loop in org.apache.spark.sql.catalyst.trees.TreeNode when table name collides.

2016-07-28 Thread Furcy Pin (JIRA)
Furcy Pin created SPARK-16771:
-

 Summary: Infinite recursion loop in 
org.apache.spark.sql.catalyst.trees.TreeNode when table name collides.
 Key: SPARK-16771
 URL: https://issues.apache.org/jira/browse/SPARK-16771
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0, 1.6.2
Reporter: Furcy Pin


How to reproduce: 

In spark-sql on Hive

{code}
DROP TABLE IF EXISTS t1 ;
CREATE TABLE test.t1(col1 string) ;

WITH t1 AS (
  SELECT  col1
  FROM t1 
)
SELECT col1
FROM t1
LIMIT 2
;
{code}

This makes a nice StackOverflowError:

{code}
java.lang.StackOverflowError
at 
scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:230)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
at scala.collection.immutable.List.map(List.scala:285)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:348)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:170)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:170)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$4.apply(QueryPlan.scala:175)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:175)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:144)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CTESubstitution$$anonfun$substituteCTE$1.applyOrElse(Analyzer.scala:147)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CTESubstitution$$anonfun$substituteCTE$1.applyOrElse(Analyzer.scala:133)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$2.apply(TreeNode.scala:286)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$2.apply(TreeNode.scala:286)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:286)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
{code}

This does not happen if I change the name of the CTE.

I guess Catalyst gets caught in an infinite recursion loop because the CTE and 
the source table have the same name.







--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


