[jira] [Assigned] (SPARK-16771) Infinite recursion loop in org.apache.spark.sql.catalyst.trees.TreeNode when table name collides.
[ https://issues.apache.org/jira/browse/SPARK-16771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16771:
------------------------------------

Assignee: (was: Apache Spark)

> Infinite recursion loop in org.apache.spark.sql.catalyst.trees.TreeNode when
> table name collides.
> -----------------------------------------------------------------------------
>
> Key: SPARK-16771
> URL: https://issues.apache.org/jira/browse/SPARK-16771
> Project: Spark
> Issue Type: Bug
> Affects Versions: 1.6.2, 2.0.0
> Reporter: Furcy Pin
>
> How to reproduce, in spark-sql on Hive:
> {code}
> DROP TABLE IF EXISTS t1 ;
> CREATE TABLE test.t1(col1 string) ;
>
> WITH t1 AS (
>   SELECT col1
>   FROM t1
> )
> SELECT col1
> FROM t1
> LIMIT 2 ;
> {code}
> This makes a nice StackOverflowError:
> {code}
> java.lang.StackOverflowError
>   at scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:230)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:348)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:170)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:170)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$4.apply(QueryPlan.scala:175)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:175)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:144)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer$CTESubstitution$$anonfun$substituteCTE$1.applyOrElse(Analyzer.scala:147)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer$CTESubstitution$$anonfun$substituteCTE$1.applyOrElse(Analyzer.scala:133)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$2.apply(TreeNode.scala:286)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$2.apply(TreeNode.scala:286)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:286)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   ...
> {code}
> This does not happen if I change the name of the CTE.
> I guess Catalyst gets caught in an infinite recursion loop because the CTE and the source
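The failure mode described above can be modeled outside of Catalyst. The following is a hypothetical, self-contained sketch (not Spark's implementation; `Plan`, `Table`, and `Select` are invented stand-ins for `TreeNode`): if the substitution map still contains the name being expanded, a CTE body that reads from its own name is expanded again, forever, whereas dropping the name first lets inner references resolve to the real table.

```scala
import scala.util.Try

// Toy logical plan, standing in for Catalyst's TreeNode hierarchy.
sealed trait Plan
case class Table(name: String) extends Plan
case class Select(child: Plan) extends Plan

// Buggy substitution: the CTE map still contains the name being expanded,
// so a CTE body reading FROM its own name is substituted again, forever.
// The depth guard only makes the failure visible without a real stack overflow.
def substitute(plan: Plan, ctes: Map[String, Plan], depth: Int = 0): Plan = {
  require(depth < 1000, "unbounded recursion -- this models the StackOverflowError")
  plan match {
    case Table(n) if ctes.contains(n) => substitute(ctes(n), ctes, depth + 1)
    case t: Table                     => t
    case Select(c)                    => Select(substitute(c, ctes, depth + 1))
  }
}

// Fixed substitution: remove the name being expanded, so references inside
// the CTE body resolve to the underlying table instead of the CTE itself.
def substituteFixed(plan: Plan, ctes: Map[String, Plan]): Plan = plan match {
  case Table(n) if ctes.contains(n) => substituteFixed(ctes(n), ctes - n)
  case t: Table                     => t
  case Select(c)                    => Select(substituteFixed(c, ctes))
}
```

With `WITH t1 AS (SELECT ... FROM t1)` modeled as `Map("t1" -> Select(Table("t1")))`, the buggy version loops until the guard trips, while the fixed one terminates with `Select(Table("t1"))`.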
[jira] [Commented] (SPARK-16771) Infinite recursion loop in org.apache.spark.sql.catalyst.trees.TreeNode when table name collides.
[ https://issues.apache.org/jira/browse/SPARK-16771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398714#comment-15398714 ]

Apache Spark commented on SPARK-16771:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/14397
[jira] [Assigned] (SPARK-16771) Infinite recursion loop in org.apache.spark.sql.catalyst.trees.TreeNode when table name collides.
[ https://issues.apache.org/jira/browse/SPARK-16771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16771:
------------------------------------

Assignee: Apache Spark
[jira] [Assigned] (SPARK-16787) SparkContext.addFile() should not fail if called twice with the same file
[ https://issues.apache.org/jira/browse/SPARK-16787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16787:
------------------------------------

Assignee: Apache Spark (was: Josh Rosen)

> SparkContext.addFile() should not fail if called twice with the same file
> -------------------------------------------------------------------------
>
> Key: SPARK-16787
> URL: https://issues.apache.org/jira/browse/SPARK-16787
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.6.2, 2.0.0
> Reporter: Josh Rosen
> Assignee: Apache Spark
>
> The behavior of SparkContext.addFile() changed slightly with the introduction of the Netty-RPC-based file server, which was introduced in Spark 1.6 (where it was disabled by default) and became the default / only file server in Spark 2.0.0.
> Prior to 2.0, calling SparkContext.addFile() twice with the same path would succeed and would cause future tasks to receive an updated copy of the file. This behavior was never explicitly documented, but Spark has behaved this way since very early 1.x versions (some of the relevant lines in Executor.updateDependencies() have existed since 2012).
> In 2.0 (or 1.6 with the Netty file server enabled), the second addFile() call will fail with a requirement error because NettyStreamManager tries to guard against duplicate file registration.
> I believe that this change of behavior was unintentional and propose to remove the {{require}} check so that Spark 2.0 matches 1.x's default behavior.
> This problem also affects addJar() in a more subtle way: the fileServer.addJar() call will also fail with an exception, but that exception is logged and ignored due to some code which was added in 2014 in order to ignore errors caused by missing Spark examples JARs when running in YARN cluster mode (AFAIK).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16787) SparkContext.addFile() should not fail if called twice with the same file
[ https://issues.apache.org/jira/browse/SPARK-16787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398709#comment-15398709 ]

Apache Spark commented on SPARK-16787:
--------------------------------------

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/14396
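The proposed fix, making a second registration of the same file a benign overwrite rather than a {{require}} failure, can be sketched as follows. This is an illustrative model only; `FileRegistry` and its methods are invented names, not the actual NettyStreamManager API.

```scala
import java.util.concurrent.ConcurrentHashMap

// Toy model of a file server's registry of served files.
class FileRegistry {
  private val files = new ConcurrentHashMap[String, String]()

  // Spark 2.0 behaviour being removed: a duplicate registration fails fast.
  def addFileStrict(name: String, path: String): Unit =
    require(files.putIfAbsent(name, path) == null, s"File $name was already registered.")

  // Proposed 1.x-compatible behaviour: last registration wins, so callers
  // may re-add a file and later tasks pick up the updated copy.
  def addFile(name: String, path: String): Unit = files.put(name, path)

  def lookup(name: String): Option[String] = Option(files.get(name))
}
```

With the idempotent variant, calling `addFile("data.txt", ...)` twice simply replaces the entry, which matches the 1.x contract that tasks receive the most recently added copy.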
[jira] [Assigned] (SPARK-16787) SparkContext.addFile() should not fail if called twice with the same file
[ https://issues.apache.org/jira/browse/SPARK-16787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16787:
------------------------------------

Assignee: Josh Rosen (was: Apache Spark)
[jira] [Commented] (SPARK-16771) Infinite recursion loop in org.apache.spark.sql.catalyst.trees.TreeNode when table name collides.
[ https://issues.apache.org/jira/browse/SPARK-16771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398701#comment-15398701 ]

Dongjoon Hyun commented on SPARK-16771:
---------------------------------------

Hi, [~fpin]. You're right. I can reproduce it like the following.
{code}
spark.range(10).createOrReplaceTempView("t")
sql("WITH t AS (SELECT 1 FROM t) SELECT * FROM t")
{code}
{code}
spark.range(10).createOrReplaceTempView("t1")
spark.range(10).createOrReplaceTempView("t2")
sql("WITH t1 AS (SELECT 1 FROM t2), t2 AS (SELECT 1 FROM t1) SELECT * FROM t1, t2")
{code}
I'll make a PR soon.
[jira] [Commented] (SPARK-15694) Implement ScriptTransformation in sql/core
[ https://issues.apache.org/jira/browse/SPARK-15694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398694#comment-15398694 ]

Tejas Patil commented on SPARK-15694:
-------------------------------------

I spent some time over the last weekend working on this (WIP: https://github.com/tejasapatil/spark/commit/71c2596a1929a890c6e6f3f0d956557fadbf2403). You are right that some parts of Hive need to be copied into Spark. To begin with, I plan to have a patch without that, i.e. have it working without custom serde and record reader / writer, and add those in a follow-up patch.

> Implement ScriptTransformation in sql/core
> ------------------------------------------
>
> Key: SPARK-15694
> URL: https://issues.apache.org/jira/browse/SPARK-15694
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
>
> ScriptTransformation currently relies on Hive internals. It'd be great if we can implement a native ScriptTransformation in the sql/core module to remove the extra Hive dependency here.
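At its core, a native ScriptTransformation only has to feed serialized rows to a child process's stdin and read transformed rows back from its stdout, without Hive's serdes. A minimal, hypothetical sketch of that core (not the WIP patch above; real Spark code streams per partition with dedicated writer threads rather than buffering everything):

```scala
import java.io.ByteArrayInputStream
import scala.sys.process._

// Pipe one input row per line into `command`'s stdin and collect one output
// row per line from its stdout, e.g. command = Seq("tr", "a-z", "A-Z").
def scriptTransform(rows: Seq[String], command: Seq[String]): Seq[String] = {
  val stdin = new ByteArrayInputStream(rows.mkString("", "\n", "\n").getBytes("UTF-8"))
  (command #< stdin).!!.linesIterator.toList
}
```

For example, `scriptTransform(Seq("a", "b"), Seq("cat"))` round-trips the rows unchanged. The custom serde / record reader-writer support mentioned in the comment would replace the line-based encoding on either side of the pipe.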
[jira] [Commented] (SPARK-16646) LEAST doesn't accept numeric arguments with different data types
[ https://issues.apache.org/jira/browse/SPARK-16646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398650#comment-15398650 ]

Hyukjin Kwon commented on SPARK-16646:
--------------------------------------

Sure, I will close the PR for the meanwhile. Please update me after that.

> LEAST doesn't accept numeric arguments with different data types
> ----------------------------------------------------------------
>
> Key: SPARK-16646
> URL: https://issues.apache.org/jira/browse/SPARK-16646
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Cheng Lian
> Assignee: Hyukjin Kwon
>
> {code:sql}
> SELECT LEAST(1, 1.5);
> {code}
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: cannot resolve 'least(1, CAST(2.1 AS DECIMAL(2,1)))' due to data type mismatch: The expressions should all have the same type, got LEAST (ArrayBuffer(IntegerType, DecimalType(2,1))).; line 1 pos 7 (state=,code=0)
> {noformat}
> This query works on 1.6.
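The underlying issue is one of type coercion: LEAST should first widen its arguments to a common type and only then require that all types match. A simplified, hypothetical sketch of such widening (the rules below are illustrative only; Spark's actual decimal-promotion rules are more involved):

```scala
// Simplified stand-ins for Catalyst's DataType hierarchy.
sealed trait DataType
case object IntegerType extends DataType
case object DoubleType extends DataType
case class DecimalType(precision: Int, scale: Int) extends DataType

// Illustrative widening: Double absorbs everything; an Int mixed with a
// Decimal widens to a Decimal wide enough to hold a 32-bit integer.
def widerType(a: DataType, b: DataType): DataType = (a, b) match {
  case (x, y) if x == y                           => x
  case (DoubleType, _) | (_, DoubleType)          => DoubleType
  case (IntegerType, DecimalType(p, s))           => DecimalType(math.max(10 + s, p), s)
  case (DecimalType(p, s), IntegerType)           => DecimalType(math.max(10 + s, p), s)
  case (DecimalType(p1, s1), DecimalType(p2, s2)) => DecimalType(math.max(p1, p2), math.max(s1, s2))
}

// The common type LEAST would coerce all arguments to before comparing them.
def leastResultType(types: Seq[DataType]): DataType = types.reduce(widerType)
```

Under these toy rules, the failing case above, `IntegerType` mixed with `DecimalType(2,1)`, would coerce both sides to a wider decimal instead of raising an analysis error.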
[jira] [Commented] (SPARK-16788) Investigate JSR-310 & scala-time alternatives to our own datetime utils
[ https://issues.apache.org/jira/browse/SPARK-16788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398632#comment-15398632 ]

holdenk commented on SPARK-16788:
---------------------------------

cc [~davies] [~ckadner] :)
[jira] [Created] (SPARK-16789) Can't run saveAsTable with database name
SonixLegend created SPARK-16789:
-----------------------------------

Summary: Can't run saveAsTable with database name
Key: SPARK-16789
URL: https://issues.apache.org/jira/browse/SPARK-16789
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.0.0
Environment: CentOS 7, JDK 1.8, Hive 1.2.1
Reporter: SonixLegend

The function "saveAsTable" with a database and table name runs successfully on 1.6.2, but after upgrading to 2.0 it fails with the error below. Here are my code and the error message. Can you help me?

conf/hive-site.xml:
{code}
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://localhost:9083</value>
</property>
{code}

{code}
val spark = SparkSession.builder().appName("SparkHive").enableHiveSupport().getOrCreate()
import spark.implicits._
import spark.sql
val source = sql("select * from sample.sample")
source.createOrReplaceTempView("test")
source.collect.foreach{tuple => println(tuple(0) + ":" + tuple(1))}
val target = sql("select key, 'Spark' as value from test")
println(target.count())
target.write.mode(SaveMode.Append).saveAsTable("sample.sample")
spark.stop()
{code}

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Saving data in MetastoreRelation sample, sample is not supported.;
  at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:218)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
  at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:378)
  at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:354)
  at com.newtouch.sample.SparkHive$.main(SparkHive.scala:25)
  at com.newtouch.sample.SparkHive.main(SparkHive.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}
[jira] [Created] (SPARK-16788) Investigate JSR-310 & scala-time alternatives to our own datetime utils
holdenk created SPARK-16788:
-------------------------------

Summary: Investigate JSR-310 & scala-time alternatives to our own datetime utils
Key: SPARK-16788
URL: https://issues.apache.org/jira/browse/SPARK-16788
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: holdenk

Our own timezone handling code is esoteric and may have some bugs. We should investigate replacing the logic with a more common library like scala-date or directly using JSR-310 (included in Java 8+, but also available as a backport on Maven through http://www.threeten.org/).
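For reference, this is the kind of logic JSR-310 (the java.time package) provides out of the box, shown from Scala since it is directly usable on the JVM: timezone conversion and formatting where DST is handled by the zone rules database instead of hand-rolled offset arithmetic.

```scala
import java.time.{Instant, LocalDate, ZoneId, ZonedDateTime}
import java.time.format.DateTimeFormatter

// 2016-07-31T00:00:00Z as an epoch instant.
val instant = Instant.ofEpochSecond(1469923200L)

// Converting to a zone applies that zone's rules, including DST: Paris is
// on UTC+2 (CEST) at this date, so the local wall-clock time is 02:00.
val inParis: ZonedDateTime = ZonedDateTime.ofInstant(instant, ZoneId.of("Europe/Paris"))
val localDate: LocalDate = inParis.toLocalDate

// ISO-8601 formatting ships with the library rather than custom format code.
val printed: String = inParis.format(DateTimeFormatter.ISO_OFFSET_DATE_TIME)
```

The same conversions done by Spark's internal datetime utils require manually tracking offsets and DST transitions, which is where the suspected bugs would live.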
[jira] [Commented] (SPARK-16746) Spark streaming lost data when ReceiverTracker writes Blockinfo to hdfs timeout
[ https://issues.apache.org/jira/browse/SPARK-16746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398615#comment-15398615 ] Hongyao Zhao commented on SPARK-16746: -- I did some tests yesterday. It seems that the Spark 1.6 direct API can consume messages from Kafka 0.9 brokers, so I can get around this problem by using the direct API. That is good news for me, but I think what I mentioned in this issue has nothing to do with what kind of receiver I use, because ReceiverTracker is an internal class in the Spark source code. > Spark streaming lost data when ReceiverTracker writes Blockinfo to hdfs > timeout > --- > > Key: SPARK-16746 > URL: https://issues.apache.org/jira/browse/SPARK-16746 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.6.1 >Reporter: Hongyao Zhao >Priority: Minor > > I wrote a Spark Streaming program which consumes 1000 messages from one topic > of Kafka, does some transformation, and writes the result back to another > topic, but I only found 988 messages in the second topic. I checked the logs > and confirmed all messages were received by the receivers, but I found an HDFS > write timeout message printed from the class BatchedWriteAheadLog. > > I checked out the source code and found code like this: > > {code:borderStyle=solid} > /** Add received block. This event will get written to the write ahead > log (if enabled). 
*/ > def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = { > try { > val writeResult = writeToLog(BlockAdditionEvent(receivedBlockInfo)) > if (writeResult) { > synchronized { > getReceivedBlockQueue(receivedBlockInfo.streamId) += > receivedBlockInfo > } > logDebug(s"Stream ${receivedBlockInfo.streamId} received " + > s"block ${receivedBlockInfo.blockStoreResult.blockId}") > } else { > logDebug(s"Failed to acknowledge stream ${receivedBlockInfo.streamId} > receiving " + > s"block ${receivedBlockInfo.blockStoreResult.blockId} in the Write > Ahead Log.") > } > writeResult > } catch { > case NonFatal(e) => > logError(s"Error adding block $receivedBlockInfo", e) > false > } > } > {code} > > It seems that ReceiverTracker tries to write the block info to HDFS, but the > write operation times out. This causes the writeToLog function to return false, > so the line "getReceivedBlockQueue(receivedBlockInfo.streamId) += > receivedBlockInfo" is skipped and the block info is lost. > The Spark version I use is 1.6.1 and I did not turn on > spark.streaming.receiver.writeAheadLog.enable. > > I want to know whether or not this is intended behaviour. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
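The failure mode described above — a timed-out WAL write silently dropping the block — suggests retrying before giving up. A minimal sketch of that pattern (illustrative only; this is not Spark's actual fix, and the simulated write stands in for writeToLog):

```scala
import scala.util.control.NonFatal

// Hypothetical retry wrapper around a fallible write. A write that times out
// or throws is attempted again up to maxAttempts times instead of the failure
// silently skipping the block-queue update.
def writeWithRetry(maxAttempts: Int)(write: () => Boolean): Boolean = {
  var attempt = 0
  var ok = false
  while (!ok && attempt < maxAttempts) {
    attempt += 1
    ok = try write() catch { case NonFatal(_) => false }
  }
  ok
}

// Simulate a write that fails twice (e.g. HDFS timeout) and succeeds third time.
var calls = 0
val result = writeWithRetry(3) { () => calls += 1; calls >= 3 }
```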
[jira] [Commented] (SPARK-16748) Errors thrown by UDFs cause TreeNodeException when the query has an ORDER BY clause
[ https://issues.apache.org/jira/browse/SPARK-16748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398587#comment-15398587 ] Apache Spark commented on SPARK-16748: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/14395 > Errors thrown by UDFs cause TreeNodeException when the query has an ORDER BY > clause > --- > > Key: SPARK-16748 > URL: https://issues.apache.org/jira/browse/SPARK-16748 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Tathagata Das > > {code} > import org.apache.spark.sql.functions._ > val myUDF = udf((c: String) => s"""${c.take(5)}""") > spark.sql("SELECT cast(null as string) as > a").select(myUDF($"a").as("b")).orderBy($"b").collect > {code} > {code} > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: > Exchange rangepartitioning(b#345 ASC, 200) > +- *Project [UDF(null) AS b#345] >+- Scan OneRowRelation[] > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50) > at > org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:113) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:233) > at org.apache.spark.sql.execution.SortExec.inputRDDs(SortExec.scala:113) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:361) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:272) > at > org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) > at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182) > at > org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2187) > at > org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2187) > at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2545) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2187) > at org.apache.spark.sql.Dataset.collect(Dataset.scala:2163) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16748) Errors thrown by UDFs cause TreeNodeException when the query has an ORDER BY clause
[ https://issues.apache.org/jira/browse/SPARK-16748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16748: Assignee: Tathagata Das (was: Apache Spark) > Errors thrown by UDFs cause TreeNodeException when the query has an ORDER BY > clause > --- > > Key: SPARK-16748 > URL: https://issues.apache.org/jira/browse/SPARK-16748 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Tathagata Das > > {code} > import org.apache.spark.sql.functions._ > val myUDF = udf((c: String) => s"""${c.take(5)}""") > spark.sql("SELECT cast(null as string) as > a").select(myUDF($"a").as("b")).orderBy($"b").collect > {code} > {code} > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: > Exchange rangepartitioning(b#345 ASC, 200) > +- *Project [UDF(null) AS b#345] >+- Scan OneRowRelation[] > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50) > at > org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:113) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:233) > at org.apache.spark.sql.execution.SortExec.inputRDDs(SortExec.scala:113) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:361) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:272) > at > org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) > at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182) > at > org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2187) > at > org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2187) > at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2545) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2187) > at org.apache.spark.sql.Dataset.collect(Dataset.scala:2163) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16748) Errors thrown by UDFs cause TreeNodeException when the query has an ORDER BY clause
[ https://issues.apache.org/jira/browse/SPARK-16748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16748: Assignee: Apache Spark (was: Tathagata Das) > Errors thrown by UDFs cause TreeNodeException when the query has an ORDER BY > clause > --- > > Key: SPARK-16748 > URL: https://issues.apache.org/jira/browse/SPARK-16748 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark > > {code} > import org.apache.spark.sql.functions._ > val myUDF = udf((c: String) => s"""${c.take(5)}""") > spark.sql("SELECT cast(null as string) as > a").select(myUDF($"a").as("b")).orderBy($"b").collect > {code} > {code} > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: > Exchange rangepartitioning(b#345 ASC, 200) > +- *Project [UDF(null) AS b#345] >+- Scan OneRowRelation[] > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50) > at > org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:113) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:233) > at org.apache.spark.sql.execution.SortExec.inputRDDs(SortExec.scala:113) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:361) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:272) > at > org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) > at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182) > at > org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2187) > at > org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2187) > at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2545) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2187) > at org.apache.spark.sql.Dataset.collect(Dataset.scala:2163) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16768) pyspark calls incorrect version of logistic regression
[ https://issues.apache.org/jira/browse/SPARK-16768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398571#comment-15398571 ] Colin Beckingham edited comment on SPARK-16768 at 7/29/16 2:08 AM: --- This is very strange then. I can launch Spark 2.1.0 with pyspark, run "from pyspark.mllib.classification import LogisticRegressionWithLBFGS" and the import succeeds, and I can call help on the import and get a description of what it does. If there is no longer an LBFGS version should not the import fail with some warning that the command is deprecated? I see from http://spark.apache.org/docs/latest/mllib-optimization.html that implementation of LBFGS is an issue that is "being worked on". It raises the issue of whether the currently working version in 1.6.2 is reliable; right now running the same problem on both 1.6.2 and 2.1.0 produces a much faster and accurate result on the former. was (Author: colbec): This is very strange then. I can launch Spark 2.1.0 with pyspark, run "from pyspark.mllib.classification import LogisticRegressionWithLBFGS" and the import succeeds, and I can call help on the import and get a description of what it does. If there is no longer an LBGFS version should not the import fail with some warning that the command is deprecated? I see from http://spark.apache.org/docs/latest/mllib-optimization.html that implementation of LGBFS is an issue that is "being worked on". It raises the issue of whether the currently working version in 1.6.2 is reliable; right now running the same problem on both 1.6.2 and 2.1.0 produces a much faster and accurate result on the former. 
> pyspark calls incorrect version of logistic regression > -- > > Key: SPARK-16768 > URL: https://issues.apache.org/jira/browse/SPARK-16768 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark > Environment: Linux openSUSE Leap 42.1 Gnome >Reporter: Colin Beckingham > Fix For: 2.1.0 > > > PySpark call with Spark 1.6.2 "LogisticRegressionWithLBFGS.train()" runs > "treeAggregate at LBFGS.scala:218" but the same command in pyspark with Spark > 2.1 runs "treeAggregate at LogisticRegression.scala:1092". This non-optimized > version is much slower and produces a different answer from LBFGS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16646) LEAST doesn't accept numeric arguments with different data types
[ https://issues.apache.org/jira/browse/SPARK-16646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398582#comment-15398582 ] Wenchen Fan commented on SPARK-16646: - We are discussing this internally, can you hold it for a while? We may decide to increase the max precision to 76 and keep max scale as 38, then we don't have this problem. > LEAST doesn't accept numeric arguments with different data types > > > Key: SPARK-16646 > URL: https://issues.apache.org/jira/browse/SPARK-16646 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Hyukjin Kwon > > {code:sql} > SELECT LEAST(1, 1.5); > {code} > {noformat} > Error: org.apache.spark.sql.AnalysisException: cannot resolve 'least(1, > CAST(2.1 AS DECIMAL(2,1)))' due to data type mismatch: The expressions should > all have the same type, got LEAST (ArrayBuffer(IntegerType, > DecimalType(2,1))).; line 1 pos 7 (state=,code=0) > {noformat} > This query works for 1.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16768) pyspark calls incorrect version of logistic regression
[ https://issues.apache.org/jira/browse/SPARK-16768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398571#comment-15398571 ] Colin Beckingham commented on SPARK-16768: -- This is very strange then. I can launch Spark 2.1.0 with pyspark, run "from pyspark.mllib.classification import LogisticRegressionWithLBFGS" and the import succeeds, and I can call help on the import and get a description of what it does. If there is no longer an LBFGS version, should not the import fail with some warning that the command is deprecated? I see from http://spark.apache.org/docs/latest/mllib-optimization.html that the implementation of LBFGS is an issue that is "being worked on". It raises the issue of whether the currently working version in 1.6.2 is reliable; right now, running the same problem on both 1.6.2 and 2.1.0 produces a much faster and more accurate result on the former. > pyspark calls incorrect version of logistic regression > -- > > Key: SPARK-16768 > URL: https://issues.apache.org/jira/browse/SPARK-16768 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark > Environment: Linux openSUSE Leap 42.1 Gnome >Reporter: Colin Beckingham > Fix For: 2.1.0 > > > PySpark call with Spark 1.6.2 "LogisticRegressionWithLBFGS.train()" runs > "treeAggregate at LBFGS.scala:218" but the same command in pyspark with Spark > 2.1 runs "treeAggregate at LogisticRegression.scala:1092". This non-optimized > version is much slower and produces a different answer from LBFGS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16786) LDA topic distributions for new documents in PySpark
[ https://issues.apache.org/jira/browse/SPARK-16786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398560#comment-15398560 ] Jordan Beauchamp commented on SPARK-16786: -- topicDistribution has been implemented by this issue. > LDA topic distributions for new documents in PySpark > > > Key: SPARK-16786 > URL: https://issues.apache.org/jira/browse/SPARK-16786 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Affects Versions: 2.0.0 > Environment: N/A >Reporter: Jordan Beauchamp >Priority: Minor > Labels: patch > Original Estimate: 24h > Remaining Estimate: 24h > > pyspark.mllib.clustering.LDAModel has no way to estimate the topic > distribution for new documents. However, this functionality exists in > org.apache.spark.mllib.clustering.LDAModel. This change would only require > setting up the API calls. I have forked the spark repo and implemented the > changes locally -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16786) LDA topic distributions for new documents in PySpark
[ https://issues.apache.org/jira/browse/SPARK-16786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398559#comment-15398559 ] Jordan Beauchamp commented on SPARK-16786: -- One minor difficulty is that downstream LocalLDAModel implements topicDistributions() while DistributedLDAModel does not. I've chosen to throw a NotImplementedException in the DistributedLDAModel implementation with a warning about converting the model to Local or using the online optimizer. > LDA topic distributions for new documents in PySpark > > > Key: SPARK-16786 > URL: https://issues.apache.org/jira/browse/SPARK-16786 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Affects Versions: 2.0.0 > Environment: N/A >Reporter: Jordan Beauchamp >Priority: Minor > Labels: patch > Original Estimate: 24h > Remaining Estimate: 24h > > pyspark.mllib.clustering.LDAModel has no way to estimate the topic > distribution for new documents. However, this functionality exists in > org.apache.spark.mllib.clustering.LDAModel. This change would only require > setting up the API calls. I have forked the spark repo and implemented the > changes locally -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
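The dispatch choice described in the comment above — local models answering the query, distributed models failing fast with guidance — can be sketched in plain Scala. This is illustrative only, not the actual MLlib code; the class names and the placeholder inference are assumptions:

```scala
// Sketch of the API shape discussed above: only the local variant supports
// per-document topic distributions; the distributed variant fails fast and
// tells the caller how to proceed.
trait LDAModelLike {
  def topicDistributions(doc: Vector[Double]): Vector[Double]
}

class LocalModel extends LDAModelLike {
  // Placeholder "inference": normalize word counts. Stands in for real LDA math.
  def topicDistributions(doc: Vector[Double]): Vector[Double] =
    doc.map(w => w / doc.sum.max(1e-9))
}

class DistributedModel extends LDAModelLike {
  def toLocal: LocalModel = new LocalModel
  def topicDistributions(doc: Vector[Double]): Vector[Double] =
    throw new NotImplementedError(
      "topicDistributions is only supported on local models; call toLocal " +
      "first or train with the online optimizer.")
}
```

Failing fast with an actionable message is usually kinder than silently returning wrong results when an operation is unsupported on one implementation of a shared interface.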
[jira] [Assigned] (SPARK-16786) LDA topic distributions for new documents in PySpark
[ https://issues.apache.org/jira/browse/SPARK-16786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16786: Assignee: Apache Spark > LDA topic distributions for new documents in PySpark > > > Key: SPARK-16786 > URL: https://issues.apache.org/jira/browse/SPARK-16786 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Affects Versions: 2.0.0 > Environment: N/A >Reporter: Jordan Beauchamp >Assignee: Apache Spark >Priority: Minor > Labels: patch > Original Estimate: 24h > Remaining Estimate: 24h > > pyspark.mllib.clustering.LDAModel has no way to estimate the topic > distribution for new documents. However, this functionality exists in > org.apache.spark.mllib.clustering.LDAModel. This change would only require > setting up the API calls. I have forked the spark repo and implemented the > changes locally -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16786) LDA topic distributions for new documents in PySpark
[ https://issues.apache.org/jira/browse/SPARK-16786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16786: Assignee: (was: Apache Spark) > LDA topic distributions for new documents in PySpark > > > Key: SPARK-16786 > URL: https://issues.apache.org/jira/browse/SPARK-16786 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Affects Versions: 2.0.0 > Environment: N/A >Reporter: Jordan Beauchamp >Priority: Minor > Labels: patch > Original Estimate: 24h > Remaining Estimate: 24h > > pyspark.mllib.clustering.LDAModel has no way to estimate the topic > distribution for new documents. However, this functionality exists in > org.apache.spark.mllib.clustering.LDAModel. This change would only require > setting up the API calls. I have forked the spark repo and implemented the > changes locally -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16786) LDA topic distributions for new documents in PySpark
[ https://issues.apache.org/jira/browse/SPARK-16786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398558#comment-15398558 ] Apache Spark commented on SPARK-16786: -- User 'supremekai' has created a pull request for this issue: https://github.com/apache/spark/pull/14394 > LDA topic distributions for new documents in PySpark > > > Key: SPARK-16786 > URL: https://issues.apache.org/jira/browse/SPARK-16786 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Affects Versions: 2.0.0 > Environment: N/A >Reporter: Jordan Beauchamp >Priority: Minor > Labels: patch > Original Estimate: 24h > Remaining Estimate: 24h > > pyspark.mllib.clustering.LDAModel has no way to estimate the topic > distribution for new documents. However, this functionality exists in > org.apache.spark.mllib.clustering.LDAModel. This change would only require > setting up the API calls. I have forked the spark repo and implemented the > changes locally -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16753) Spark SQL doesn't handle skewed dataset joins properly
[ https://issues.apache.org/jira/browse/SPARK-16753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398551#comment-15398551 ] Reynold Xin commented on SPARK-16753: - Got it - definitely good to do skew join. There are a lot of different ways to implement this. I think teradata had a paper on skewed join too. > Spark SQL doesn't handle skewed dataset joins properly > -- > > Key: SPARK-16753 > URL: https://issues.apache.org/jira/browse/SPARK-16753 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 1.6.2, 2.0.0, 2.0.1 >Reporter: Jurriaan Pruis > Attachments: screenshot-1.png > > > I'm having issues with joining a 1 billion row dataframe with skewed data > with multiple dataframes with sizes ranging from 100,000 to 10 million rows. > This means some of the joins (about half of them) can be done using > broadcast, but not all. > Because the data in the large dataframe is skewed we get out of memory errors > in the executors or errors like: > `org.apache.spark.shuffle.FetchFailedException: Too large frame`. > We tried a lot of things, like broadcast joining the skewed rows separately > and unioning them with the dataset containing the sort merge joined data. > Which works perfectly when doing one or two joins, but when doing 10 joins > like this the query planner gets confused (see [SPARK-15326]). > As most of the rows are skewed on the NULL value we use a hack where we put > unique values in those NULL columns so the data is properly distributed over > all partitions. This works fine for NULL values, but since this table is > growing rapidly and we have skewed data for non-NULL values as well this > isn't a full solution to the problem. > Right now this specific spark task runs well 30% of the time and it's getting > worse and worse because of the increasing amount of data. > How to approach these kinds of joins using Spark? 
It seems weird that I can't > find proper solutions for this problem, or other people having the same kind of > issues, given that Spark profiles itself as a large-scale data processing engine. > Joins on big datasets should be something Spark has no problem > with out of the box. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
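For reference, the NULL-salting workaround the reporter describes can be sketched without Spark: rows whose join key is NULL can never match an equi-join anyway, so giving each one a unique surrogate key spreads them across shuffle partitions instead of piling them into one. The helper name and key prefix here are illustrative:

```scala
import java.util.UUID

// Replace NULL join keys with unique surrogate keys. The salted rows still
// match nothing in an equi-join (as NULLs already did), but they now hash to
// many partitions instead of one hot partition.
def saltNullKeys(rows: Seq[(Option[String], Int)]): Seq[(String, Int)] =
  rows.map {
    case (Some(k), v) => (k, v)
    case (None, v)    => (s"__null_${UUID.randomUUID()}", v)
  }

val salted = saltNullKeys(Seq((Some("a"), 1), (None, 2), (None, 3)))
```

As the reporter notes, the same idea extends to skewed non-NULL keys, but there the salt must be replicated on the other side of the join, which is considerably more involved.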
[jira] [Assigned] (SPARK-16748) Errors thrown by UDFs cause TreeNodeException when the query has an ORDER BY clause
[ https://issues.apache.org/jira/browse/SPARK-16748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das reassigned SPARK-16748: - Assignee: Tathagata Das > Errors thrown by UDFs cause TreeNodeException when the query has an ORDER BY > clause > --- > > Key: SPARK-16748 > URL: https://issues.apache.org/jira/browse/SPARK-16748 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Tathagata Das > > {code} > import org.apache.spark.sql.functions._ > val myUDF = udf((c: String) => s"""${c.take(5)}""") > spark.sql("SELECT cast(null as string) as > a").select(myUDF($"a").as("b")).orderBy($"b").collect > {code} > {code} > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: > Exchange rangepartitioning(b#345 ASC, 200) > +- *Project [UDF(null) AS b#345] >+- Scan OneRowRelation[] > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50) > at > org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:113) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:233) > at org.apache.spark.sql.execution.SortExec.inputRDDs(SortExec.scala:113) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:361) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:272) > at > org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) > at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182) > at > org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2187) > at > org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2187) > at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2545) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2187) > at org.apache.spark.sql.Dataset.collect(Dataset.scala:2163) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16787) SparkContext.addFile() should not fail if called twice with the same file
[ https://issues.apache.org/jira/browse/SPARK-16787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-16787: --- Target Version/s: 2.0.1 (was: 1.6.3, 2.0.1) > SparkContext.addFile() should not fail if called twice with the same file > - > > Key: SPARK-16787 > URL: https://issues.apache.org/jira/browse/SPARK-16787 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > The behavior of SparkContext.addFile() changed slightly with the introduction > of the Netty-RPC-based file server, which was introduced in Spark 1.6 (where > it was disabled by default) and became the default / only file server in > Spark 2.0.0. > Prior to 2.0, calling SparkContext.addFile() twice with the same path would > succeed and would cause future tasks to receive an updated copy of the file. > This behavior was never explicitly documented but Spark has behaved this way > since very early 1.x versions (some of the relevant lines in > Executor.updateDependencies() have existed since 2012). > In 2.0 (or 1.6 with the Netty file server enabled), the second addFile() call > will fail with a requirement error because NettyStreamManager tries to guard > against duplicate file registration. > I believe that this change of behavior was unintentional and propose to > remove the {{require}} check so that Spark 2.0 matches 1.x's default behavior. > This problem also affects addJar() in a more subtle way: the > fileServer.addJar() call will also fail with an exception but that exception > is logged and ignored due to some code which was added in 2014 in order to > ignore errors caused by missing Spark examples JARs when running on YARN > cluster mode (AFAIK). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16787) SparkContext.addFile() should not fail if called twice with the same file
Josh Rosen created SPARK-16787: -- Summary: SparkContext.addFile() should not fail if called twice with the same file Key: SPARK-16787 URL: https://issues.apache.org/jira/browse/SPARK-16787 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.0, 1.6.2 Reporter: Josh Rosen Assignee: Josh Rosen The behavior of SparkContext.addFile() changed slightly with the introduction of the Netty-RPC-based file server, which was introduced in Spark 1.6 (where it was disabled by default) and became the default / only file server in Spark 2.0.0. Prior to 2.0, calling SparkContext.addFile() twice with the same path would succeed and would cause future tasks to receive an updated copy of the file. This behavior was never explicitly documented but Spark has behaved this way since very early 1.x versions (some of the relevant lines in Executor.updateDependencies() have existed since 2012). In 2.0 (or 1.6 with the Netty file server enabled), the second addFile() call will fail with a requirement error because NettyStreamManager tries to guard against duplicate file registration. I believe that this change of behavior was unintentional and propose to remove the {{require}} check so that Spark 2.0 matches 1.x's default behavior. This problem also affects addJar() in a more subtle way: the fileServer.addJar() call will also fail with an exception but that exception is logged and ignored due to some code which was added in 2014 in order to ignore errors caused by missing Spark examples JARs when running on YARN cluster mode (AFAIK). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
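The behavioral difference described in this report can be sketched with a tiny standalone model (illustrative names only; this is not the actual SparkContext or NettyStreamManager code): one method guards against duplicate registration the way the 2.0 require check does, the other overwrites the entry the way pre-2.0 Spark effectively behaved.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a file-server registry; names are hypothetical, not Spark APIs.
public class FileRegistryDemo {
    private final Map<String, String> files = new HashMap<>();

    // 2.0 / Netty-file-server style: a require()-like guard rejects
    // registering the same path twice.
    void addFileStrict(String name, String contents) {
        if (files.containsKey(name)) {
            throw new IllegalArgumentException("File " + name + " was already registered");
        }
        files.put(name, contents);
    }

    // Pre-2.0 style: the second call replaces the entry, so future tasks
    // fetch the updated copy of the file.
    void addFileOverwrite(String name, String contents) {
        files.put(name, contents);
    }

    String get(String name) { return files.get(name); }

    public static void main(String[] args) {
        FileRegistryDemo pre20 = new FileRegistryDemo();
        pre20.addFileOverwrite("data.txt", "v1");
        pre20.addFileOverwrite("data.txt", "v2"); // succeeds; tasks see "v2"
        System.out.println(pre20.get("data.txt"));

        FileRegistryDemo v20 = new FileRegistryDemo();
        v20.addFileStrict("data.txt", "v1");
        try {
            v20.addFileStrict("data.txt", "v2"); // the 2.0 failure mode
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The proposed fix amounts to switching the strict variant back to the overwrite variant.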
[jira] [Comment Edited] (SPARK-16774) Fix use of deprecated TimeStamp constructor (also providing incorrect results)
[ https://issues.apache.org/jira/browse/SPARK-16774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398424#comment-15398424 ] holdenk edited comment on SPARK-16774 at 7/28/16 11:38 PM: --- While diving into this (relatedly I hate timezones) - I'm pretty sure I've encountered a logic bug for our handling close to daylight savings boundaries - namely the inline comments suggest the timestamp is being constructed against UTC but careful reading of the JDK code indicates it's actually using the system default. I'm going to add some tests around this and replace the fall-back code with the calendar-based approach. was (Author: holdenk): While diving into this (relatedly I hate timezones) - I'm pretty sure I've encountered a logic bug for our handling close to timezone boundaries - namely the inline comments suggest the timestamp is being constructed against UTC but careful reading of the JDK code indicates it's actually using the system default. I'm going to add some tests around this and replace the fall-back code with the calendar-based approach. > Fix use of deprecated TimeStamp constructor (also providing incorrect results) > -- > > Key: SPARK-16774 > URL: https://issues.apache.org/jira/browse/SPARK-16774 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: holdenk > > The TimeStamp constructor we use inside of DateTime utils has been deprecated > since JDK 1.1 - while Java does take a long time to remove deprecated > functionality we might as well address this. Additionally it does not handle > DST boundaries correctly all the time -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16774) Fix use of deprecated TimeStamp constructor
[ https://issues.apache.org/jira/browse/SPARK-16774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-16774: Description: The TimeStamp constructor we use inside of DateTime utils has been deprecated since JDK 1.1 - while Java does take a long time to remove deprecated functionality we might as well address this. Additionally it does (was: The TimeStamp constructor we use inside of DateTime utils has been deprecated since JDK 1.1 - while Java does take a long time to remove deprecated functionality we might as well address this.) > Fix use of deprecated TimeStamp constructor > --- > > Key: SPARK-16774 > URL: https://issues.apache.org/jira/browse/SPARK-16774 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: holdenk > > The TimeStamp constructor we use inside of DateTime utils has been deprecated > since JDK 1.1 - while Java does take a long time to remove deprecated > functionality we might as well address this. Additionally it does -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16774) Fix use of deprecated TimeStamp constructor (also providing incorrect results)
[ https://issues.apache.org/jira/browse/SPARK-16774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-16774: Description: The TimeStamp constructor we use inside of DateTime utils has been deprecated since JDK 1.1 - while Java does take a long time to remove deprecated functionality we might as well address this. Additionally it does not handle DST boundaries correctly all the time (was: The TimeStamp constructor we use inside of DateTime utils has been deprecated since JDK 1.1 - while Java does take a long time to remove deprecated functionality we might as well address this. Additionally it does ) > Fix use of deprecated TimeStamp constructor (also providing incorrect results) > -- > > Key: SPARK-16774 > URL: https://issues.apache.org/jira/browse/SPARK-16774 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: holdenk > > The TimeStamp constructor we use inside of DateTime utils has been deprecated > since JDK 1.1 - while Java does take a long time to remove deprecated > functionality we might as well address this. Additionally it does not handle > DST boundaries correctly all the time -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16774) Fix use of deprecated TimeStamp constructor (also providing incorrect results)
[ https://issues.apache.org/jira/browse/SPARK-16774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-16774: Summary: Fix use of deprecated TimeStamp constructor (also providing incorrect results) (was: Fix use of deprecated TimeStamp constructor) > Fix use of deprecated TimeStamp constructor (also providing incorrect results) > -- > > Key: SPARK-16774 > URL: https://issues.apache.org/jira/browse/SPARK-16774 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: holdenk > > The TimeStamp constructor we use inside of DateTime utils has been deprecated > since JDK 1.1 - while Java does take a long time to remove deprecated > functionality we might as well address this. Additionally it does -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16774) Fix use of deprecated TimeStamp constructor
[ https://issues.apache.org/jira/browse/SPARK-16774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398424#comment-15398424 ] holdenk commented on SPARK-16774: - While diving into this (relatedly I hate timezones) - I'm pretty sure I've encountered a logic bug for our handling close to timezone boundaries - namely the inline comments suggest the timestamp is being constructed against UTC but careful reading of the JDK code indicates it's actually using the system default. I'm going to add some tests around this and replace the fall-back code with the calendar-based approach. > Fix use of deprecated TimeStamp constructor > --- > > Key: SPARK-16774 > URL: https://issues.apache.org/jira/browse/SPARK-16774 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: holdenk > > The TimeStamp constructor we use inside of DateTime utils has been deprecated > since JDK 1.1 - while Java does take a long time to remove deprecated > functionality we might as well address this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
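The system-default-timezone behavior holdenk describes is straightforward to verify outside Spark. A minimal sketch (not Spark's DateTimeUtils code): the deprecated java.sql.Timestamp field constructor and a Calendar.getInstance() with no explicit zone interpret the same wall-clock fields identically, i.e. neither uses UTC.

```java
import java.sql.Timestamp;
import java.util.Calendar;

public class TimestampCheck {
    public static void main(String[] args) {
        // Deprecated since JDK 1.1: fields are year-1900 and a 0-based month,
        // and they are interpreted in the system default time zone.
        @SuppressWarnings("deprecation")
        Timestamp deprecated = new Timestamp(116, 6, 28, 23, 38, 0, 0); // 2016-07-28 23:38:00

        // Calendar-based replacement; Calendar.getInstance() also uses the
        // default zone, so the two agree on any machine regardless of zone.
        Calendar cal = Calendar.getInstance();
        cal.clear();
        cal.set(2016, Calendar.JULY, 28, 23, 38, 0);
        Timestamp viaCalendar = new Timestamp(cal.getTimeInMillis());

        System.out.println(deprecated.equals(viaCalendar)); // true
    }
}
```

If UTC interpretation were actually intended, the Calendar would need to be constructed with TimeZone.getTimeZone("UTC") explicitly, and the two values above would then differ by the local UTC offset.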
[jira] [Commented] (SPARK-16768) pyspark calls incorrect version of logistic regression
[ https://issues.apache.org/jira/browse/SPARK-16768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398409#comment-15398409 ] Sean Owen commented on SPARK-16768: --- That's spark.ml.LogisticRegression, right? there's no longer a separate LBFGS version in .ml. > pyspark calls incorrect version of logistic regression > -- > > Key: SPARK-16768 > URL: https://issues.apache.org/jira/browse/SPARK-16768 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark > Environment: Linux openSUSE Leap 42.1 Gnome >Reporter: Colin Beckingham > Fix For: 2.1.0 > > > PySpark call with Spark 1.6.2 "LogisticRegressionWithLBFGS.train()" runs > "treeAggregate at LBFGS.scala:218" but the same command in pyspark with Spark > 2.1 runs "treeAggregate at LogisticRegression.scala:1092". This non-optimized > version is much slower and produces a different answer from LBFGS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16785) dapply doesn't return array or raw columns
[ https://issues.apache.org/jira/browse/SPARK-16785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398369#comment-15398369 ] Clark Fitzgerald commented on SPARK-16785: -- To fix this I propose to treat the rows as a list of dataframes instead of as a list of lists: {{rowlist = split(df, 1:nrow(df))}} {{df2 = do.call(rbind, rowlist)}} > dapply doesn't return array or raw columns > -- > > Key: SPARK-16785 > URL: https://issues.apache.org/jira/browse/SPARK-16785 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 > Environment: Mac OS X >Reporter: Clark Fitzgerald >Priority: Minor > > Calling SparkR::dapplyCollect with R functions that return dataframes > produces an error. This comes up when returning columns of binary data- ie. > serialized fitted models. Also happens when functions return columns > containing vectors. > The error message: > R computation failed with > Error in (function (..., deparse.level = 1, make.row.names = TRUE, > stringsAsFactors = default.stringsAsFactors()) : > invalid list argument: all variables should have the same length > Reproducible example: > https://github.com/clarkfitzg/phd_research/blob/master/ddR/spark/sparkR_dapplyCollect7.R > Relates to SPARK-16611 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16785) dapply doesn't return array or raw columns
[ https://issues.apache.org/jira/browse/SPARK-16785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398312#comment-15398312 ] Clark Fitzgerald edited comment on SPARK-16785 at 7/28/16 10:51 PM: [~shivaram] and I have had some email correspondence on this: bq. parallelizing data containing binary values: To create a SparkDataFrame from a local R data frame we first convert it to a list of rows and then partition the rows. This happens on the driver 1. The rows are then reconstructed back into a data.frame on the workers 2. The rbind.data.frame fails for the the binary example. I've created a small reproducible example at https://gist.github.com/shivaram/2f07208a8ebc0832098328fa0a3fac9d 1 https://github.com/apache/spark/blob/0869b3a5f028b64c2da511e70b02ab42f65fc949/R/pkg/R/SQLContext.R#L209 2 https://github.com/apache/spark/blob/0869b3a5f028b64c2da511e70b02ab42f65fc949/R/pkg/inst/worker/worker.R#L39 was (Author: clarkfitzg): [~shivaram] and I have had some email correspondence on this: bq. parallelizing data containing binary values: To create a SparkDataFrame from a local R data frame we first convert it to a list of rows and then partition the rows. This happens on the driver 1. The rows are then reconstructed back into a data.frame on the workers 2. The rbind.data.frame fails for the the binary example. 
I've created a small reproducible example at https://gist.github.com/shivaram/2f07208a8ebc0832098328fa0a3fac9d 1 https://github.com/apache/spark/blob/0869b3a5f028b64c2da511e70b02ab42f65fc949/R/pkg/R/SQLContext.R#L209 2 https://github.com/apache/spark/blob/0869b3a5f028b64c2da511e70b02ab42f65fc949/R/pkg/inst/worker/worker.R#L39 To fix this I propose to treat the rows as a list of dataframes instead of as a list of lists: {{rowlist = split(df, 1:nrow(df))}} {{df2 = do.call(rbind, rowlist)}} > dapply doesn't return array or raw columns > -- > > Key: SPARK-16785 > URL: https://issues.apache.org/jira/browse/SPARK-16785 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 > Environment: Mac OS X >Reporter: Clark Fitzgerald >Priority: Minor > > Calling SparkR::dapplyCollect with R functions that return dataframes > produces an error. This comes up when returning columns of binary data- ie. > serialized fitted models. Also happens when functions return columns > containing vectors. > The error message: > R computation failed with > Error in (function (..., deparse.level = 1, make.row.names = TRUE, > stringsAsFactors = default.stringsAsFactors()) : > invalid list argument: all variables should have the same length > Reproducible example: > https://github.com/clarkfitzg/phd_research/blob/master/ddR/spark/sparkR_dapplyCollect7.R > Relates to SPARK-16611 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions
[ https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398346#comment-15398346 ] Clark Fitzgerald commented on SPARK-16611: -- +1 for more direct access to the RDD's. This would be very helpful for me as I try to implement general R objects using Spark as a backend for ddR https://github.com/vertica/ddR Longer term it might make sense to organize SparkR into separate packages offering various levels of abstraction: # dataframes - for most end users # RDD's - for package authors or special applications # Java objects - for directly invoking methods in Spark. This is what sparkapi does. For my application it would be much better to be working at this middle layer. > Expose several hidden DataFrame/RDD functions > - > > Key: SPARK-16611 > URL: https://issues.apache.org/jira/browse/SPARK-16611 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Oscar D. Lara Yejas > > Expose the following functions: > - lapply or map > - lapplyPartition or mapPartition > - flatMap > - RDD > - toRDD > - getJRDD > - cleanup.jobj > cc: > [~javierluraschi] [~j...@rstudio.com] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16786) LDA topic distributions for new documents in PySpark
[ https://issues.apache.org/jira/browse/SPARK-16786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jordan Beauchamp updated SPARK-16786: - Target Version/s: (was: 2.0.0) > LDA topic distributions for new documents in PySpark > > > Key: SPARK-16786 > URL: https://issues.apache.org/jira/browse/SPARK-16786 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Affects Versions: 2.0.0 > Environment: N/A >Reporter: Jordan Beauchamp >Priority: Minor > Labels: patch > Original Estimate: 24h > Remaining Estimate: 24h > > pyspark.mllib.clustering.LDAModel has no way to estimate the topic > distribution for new documents. However, this functionality exists in > org.apache.spark.mllib.clustering.LDAModel. This change would only require > setting up the API calls. I have forked the spark repo and implemented the > changes locally -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16786) LDA topic distributions for new documents in PySpark
[ https://issues.apache.org/jira/browse/SPARK-16786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jordan Beauchamp updated SPARK-16786: - Fix Version/s: (was: 2.0.0) > LDA topic distributions for new documents in PySpark > > > Key: SPARK-16786 > URL: https://issues.apache.org/jira/browse/SPARK-16786 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Affects Versions: 2.0.0 > Environment: N/A >Reporter: Jordan Beauchamp >Priority: Minor > Labels: patch > Original Estimate: 24h > Remaining Estimate: 24h > > pyspark.mllib.clustering.LDAModel has no way to estimate the topic > distribution for new documents. However, this functionality exists in > org.apache.spark.mllib.clustering.LDAModel. This change would only require > setting up the API calls. I have forked the spark repo and implemented the > changes locally -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16786) LDA topic distributions for new documents in PySpark
Jordan Beauchamp created SPARK-16786: Summary: LDA topic distributions for new documents in PySpark Key: SPARK-16786 URL: https://issues.apache.org/jira/browse/SPARK-16786 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 2.0.0 Environment: N/A Reporter: Jordan Beauchamp Priority: Minor Fix For: 2.0.0 pyspark.mllib.clustering.LDAModel has no way to estimate the topic distribution for new documents. However, this functionality exists in org.apache.spark.mllib.clustering.LDAModel. This change would only require setting up the API calls. I have forked the spark repo and implemented the changes locally -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16785) dapply doesn't return array or raw columns
[ https://issues.apache.org/jira/browse/SPARK-16785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Clark Fitzgerald updated SPARK-16785: - Priority: Minor (was: Major) > dapply doesn't return array or raw columns > -- > > Key: SPARK-16785 > URL: https://issues.apache.org/jira/browse/SPARK-16785 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 > Environment: Mac OS X >Reporter: Clark Fitzgerald >Priority: Minor > > Calling SparkR::dapplyCollect with R functions that return dataframes > produces an error. This comes up when returning columns of binary data- ie. > serialized fitted models. Also happens when functions return columns > containing vectors. > The error message: > R computation failed with > Error in (function (..., deparse.level = 1, make.row.names = TRUE, > stringsAsFactors = default.stringsAsFactors()) : > invalid list argument: all variables should have the same length > Reproducible example: > https://github.com/clarkfitzg/phd_research/blob/master/ddR/spark/sparkR_dapplyCollect7.R > Relates to SPARK-16611 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16785) dapply doesn't return array or raw columns
[ https://issues.apache.org/jira/browse/SPARK-16785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398312#comment-15398312 ] Clark Fitzgerald commented on SPARK-16785: -- [~shivaram] and I have had some email correspondence on this: bq. parallelizing data containing binary values: To create a SparkDataFrame from a local R data frame we first convert it to a list of rows and then partition the rows. This happens on the driver [1]. The rows are then reconstructed back into a data.frame on the workers [2]. The rbind.data.frame fails for the binary example. I've created a small reproducible example at https://gist.github.com/shivaram/2f07208a8ebc0832098328fa0a3fac9d [1] https://github.com/apache/spark/blob/0869b3a5f028b64c2da511e70b02ab42f65fc949/R/pkg/R/SQLContext.R#L209 [2] https://github.com/apache/spark/blob/0869b3a5f028b64c2da511e70b02ab42f65fc949/R/pkg/inst/worker/worker.R#L39 To fix this I propose to treat the rows as a list of dataframes instead of as a list of lists: {{rowlist = split(df, 1:nrow(df))}} {{df2 = do.call(rbind, rowlist)}} > dapply doesn't return array or raw columns > -- > > Key: SPARK-16785 > URL: https://issues.apache.org/jira/browse/SPARK-16785 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 > Environment: Mac OS X >Reporter: Clark Fitzgerald > > Calling SparkR::dapplyCollect with R functions that return dataframes > produces an error. This comes up when returning columns of binary data- ie. > serialized fitted models. Also happens when functions return columns > containing vectors. 
> The error message: > R computation failed with > Error in (function (..., deparse.level = 1, make.row.names = TRUE, > stringsAsFactors = default.stringsAsFactors()) : > invalid list argument: all variables should have the same length > Reproducible example: > https://github.com/clarkfitzg/phd_research/blob/master/ddR/spark/sparkR_dapplyCollect7.R > Relates to SPARK-16611 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16785) dapply doesn't return array or raw columns
Clark Fitzgerald created SPARK-16785: Summary: dapply doesn't return array or raw columns Key: SPARK-16785 URL: https://issues.apache.org/jira/browse/SPARK-16785 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.0.0 Environment: Mac OS X Reporter: Clark Fitzgerald Calling SparkR::dapplyCollect with R functions that return dataframes produces an error. This comes up when returning columns of binary data- ie. serialized fitted models. Also happens when functions return columns containing vectors. The error message: R computation failed with Error in (function (..., deparse.level = 1, make.row.names = TRUE, stringsAsFactors = default.stringsAsFactors()) : invalid list argument: all variables should have the same length Reproducible example: https://github.com/clarkfitzg/phd_research/blob/master/ddR/spark/sparkR_dapplyCollect7.R Relates to SPARK-16611 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory
[ https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398289#comment-15398289 ] Denis Serduik commented on SPARK-2984: -- Something like this... The problem occurs when I change importPartition to importPartitionAsync; service.getHitRecordsFromFile returns a DataFrame. {code} import java.nio.file.{Path => JavaPath, Paths, Files} class DefaultHitDataManager(service : SparkService) extends DataSourceManager with Logging { val proceeded_prefix = "_proceeded_" val hit_data_pattern = "*_data*" val dsCache = new ConcurrentHashMap[String, HitsDataSource]() override def loadDataSource(remoteLocationBase: URI, alias: String): Future[HitsDataSource] = { val dsWorkingDir = Files.createTempDirectory(alias) val fs = FileSystem.get(remoteLocationBase, service.sqlContext.sparkContext.hadoopConfiguration) val loadFuture = scanPartitions(remoteLocationBase).flatMap { case newPartitions => Future.traverse(newPartitions.toList)( { case remotePartitionFile => importPartitionAsync(alias, dsWorkingDir, fs, remotePartitionFile) }).flatMap { case statuses => FileUtil.fullyDelete(dsWorkingDir.toFile) val ds = HitsDataSource(remoteLocationBase, alias) dsCache.put(ds.alias, ds) Future.successful(ds) } } loadFuture } ///XXX: async loading causes race condition in FileSystem and things like that https://issues.apache.org/jira/browse/SPARK-2984 def importPartitionAsync(alias: String, dsWorkingDir: JavaPath, partitionFS: FileSystem, remotePartitionFile: FileStatus) = Future { importPartition(alias, dsWorkingDir, partitionFS, remotePartitionFile) } def importPartition(alias: String, dsWorkingDir: JavaPath, partitionFS: FileSystem, remotePartitionFile: FileStatus) = { val partitionWorkingDir = Files.createTempDirectory(dsWorkingDir, remotePartitionFile.getPath.getName) val localPath = new Path("file://" + partitionWorkingDir.toString, remotePartitionFile.getPath.getName) 
partitionFS.copyToLocalFile(remotePartitionFile.getPath, localPath) val unzippedPartition = s"$partitionWorkingDir/hit_data*.tsv" val filePath = localPath.toUri.getPath val untarCommand = s"tar -xzf $filePath --wildcards --no-anchored '$hit_data_pattern'" val shellCmd = Array("bash", "-c", untarCommand.toString) val shexec = new ShellCommandExecutor(shellCmd, partitionWorkingDir.toFile) shexec.execute() val exitcode = shexec.getExitCode if (exitcode != 0) { throw new IOException(s"Error untarring file $filePath. Tar process exited with exit code $exitcode") } val partitionData = service.getHitRecordsFromFile("file://" + unzippedPartition) partitionData.coalesce(1).write.partitionBy("createdAtYear", "createdAtMonth"). mode(SaveMode.Append).saveAsTable(alias) val newPartitionPath = new Path(remotePartitionFile.getPath.getParent, proceeded_prefix + remotePartitionFile.getPath.getName) partitionFS.rename(remotePartitionFile.getPath, newPartitionPath) } def scanPartitions(locationBase: URI) = Future { val fs = FileSystem.get(locationBase, service.sqlContext.sparkContext.hadoopConfiguration) val location = new Path(locationBase) val newPartitions = if (fs.isDirectory(location)) { fs.listStatus(location, new PathFilter { override def accept(path: Path): Boolean = { !path.getName.contains(proceeded_prefix) } }) } else { Array(fs.getFileStatus(location)) } logDebug(s"Found new partitions : $newPartitions") newPartitions } } {code} I hope this will help > FileNotFoundException on _temporary directory > - > > Key: SPARK-2984 > URL: https://issues.apache.org/jira/browse/SPARK-2984 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Ash >Assignee: Josh Rosen >Priority: Critical > Fix For: 1.3.0 > > > We've seen several stacktraces and threads on the user mailing list where > people are having issues with a {{FileNotFoundException}} stemming from an > HDFS path containing {{_temporary}}. 
> I ([~aash]) think this may be related to {{spark.speculation}}. I think the > error condition might manifest in this circumstance: > 1) task T starts on a executor E1 > 2) it takes a long time, so task T' is started on another executor E2 > 3) T finishes in E1 so moves its data from {{_temporary}} to the final > destination and deletes the {{_temporary}} directory during cleanup > 4) T' finishes in E2 and attempts to move its data from {{_temporary}}, but > those files no longer exist!
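The suspected interleaving in steps 1-4 can be reproduced deterministically with plain java.nio.file (a sequential toy model, no Spark or speculation involved): once the first attempt's cleanup removes {{_temporary}}, the second attempt's commit fails with the same missing-file flavor of error.

```java
import java.io.IOException;
import java.nio.file.*;

public class TemporaryRaceDemo {
    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("job-output");
        Path tmp = Files.createDirectories(out.resolve("_temporary"));

        // 1) + 2) Task T (attempt 1) and speculative task T' (attempt 2)
        //         both write their output under _temporary.
        Path attempt1 = Files.write(tmp.resolve("part-00000.a1"), "data".getBytes());
        Path attempt2 = Files.write(tmp.resolve("part-00000.a2"), "data".getBytes());

        // 3) T finishes first: commits its file, then cleanup deletes
        //    _temporary, which also removes T''s still-pending output.
        Files.move(attempt1, out.resolve("part-00000"));
        Files.delete(attempt2);
        Files.delete(tmp);

        // 4) T' now tries to commit, but its _temporary file is gone.
        try {
            Files.move(attempt2, out.resolve("part-00000"), StandardCopyOption.REPLACE_EXISTING);
        } catch (NoSuchFileException e) {
            System.out.println("missing _temporary path: " + e.getFile());
        }
    }
}
```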
[jira] [Closed] (SPARK-16783) make-distri
[ https://issues.apache.org/jira/browse/SPARK-16783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt closed SPARK-16783. --- Resolution: Not A Problem > make-distri > --- > > Key: SPARK-16783 > URL: https://issues.apache.org/jira/browse/SPARK-16783 > Project: Spark > Issue Type: Bug >Reporter: Michael Gummelt > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16784) Configurable log4j settings
Michael Gummelt created SPARK-16784: --- Summary: Configurable log4j settings Key: SPARK-16784 URL: https://issues.apache.org/jira/browse/SPARK-16784 Project: Spark Issue Type: Improvement Affects Versions: 2.0.0 Reporter: Michael Gummelt I often want to change the logging configuration on a single spark job. This is easy in client mode. I just modify log4j.properties. It's difficult in cluster mode, because I need to modify the log4j.properties in the distribution in which the driver runs. I'd like a way of setting this dynamically, such as a java system property. Some brief searching showed that log4j doesn't seem to accept such a property, but I'd like to open up this idea for further comment. Maybe we can find a solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
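For what it's worth, log4j 1.x does consult a {{log4j.configuration}} system property during its default initialization, so one workaround worth trying (a sketch; the file names, paths, and master URL below are placeholders) is to ship a properties file with the job and point the driver's JVM at it via the existing extra-Java-options settings:

```shell
# Ship a custom log4j.properties with the job and point the driver (and,
# if desired, the executors) at it. All paths below are placeholders.
spark-submit \
  --master <master-url> \
  --files /local/path/log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
  my-job.jar
```

In cluster mode the {{--files}} upload lands in the driver's working directory, which is why the relative {{file:log4j.properties}} URL can resolve there; whether this covers every cluster manager is exactly the kind of thing this ticket could pin down.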
[jira] [Resolved] (SPARK-16772) Correct API doc references to PySpark classes + formatting fixes
[ https://issues.apache.org/jira/browse/SPARK-16772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16772. - Resolution: Fixed Assignee: Nicholas Chammas Fix Version/s: 2.1.0 2.0.1 > Correct API doc references to PySpark classes + formatting fixes > > > Key: SPARK-16772 > URL: https://issues.apache.org/jira/browse/SPARK-16772 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Trivial > Fix For: 2.0.1, 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16769) httpclient classic dependency - potentially a patch required?
[ https://issues.apache.org/jira/browse/SPARK-16769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16769: -- Priority: Minor (was: Major) Issue Type: Improvement (was: Question) I think the issue is that jets3t needs to match up with Hadoop versions. It should be OK to update to maintenance releases, but not across minor releases necessarily. I recall issues with that. However 0.7.1 is only used for Hadoop 2.2. See the profiles later that set this to 0.9.3. You can update those. See if that fixes it. You don't need another JIRA. Make a PR for this one. > httpclient classic dependency - potentially a patch required? > - > > Key: SPARK-16769 > URL: https://issues.apache.org/jira/browse/SPARK-16769 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 1.6.2, 2.0.0 > Environment: All Spark versions, any environment >Reporter: Adam Roberts >Priority: Minor > > In our jars folder for Spark we provide a jar with a CVE > https://www.versioneye.com/java/commons-httpclient:commons-httpclient/3.1. > CVE-2012-5783 > This paper outlines the problem > www.cs.utexas.edu/~shmat/shmat_ccs12.pdf > My question is: do we need to ship this version as well or is it only used > for tests? Is it a patched version? I plan to run without this dependency and > if there are NoClassDefFound problems I'll add test so we > don't ship it (downloading it in the first place is bad enough though) > Note that this is valid for all versions, suggesting it be raised to a > critical if Spark functionality is depending on it because of what the pdf > I've linked to mentions > Here is the jar being included: > ls $SPARK_HOME/jars | grep "httpclient" > commons-httpclient-3.1.jar > httpclient-4.5.2.jar > The first jar potentially contains the security issue, could be a patched > version, need to verify. 
SHA1 sum for this jar is > 964cd74171f427720480efdec40a7c7f6e58426a -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16745) Spark job completed however have to wait for 13 mins (data size is small)
[ https://issues.apache.org/jira/browse/SPARK-16745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398272#comment-15398272 ] Sean Owen commented on SPARK-16745: --- I suspect this is a duplicate of one of the many issues related to janino codegen. I don't know which one. Many were fixed in 2.0. Can you try that? > Spark job completed however have to wait for 13 mins (data size is small) > - > > Key: SPARK-16745 > URL: https://issues.apache.org/jira/browse/SPARK-16745 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.6.1 > Environment: Max OS X Yosemite, Terminal, MacBook Air Late 2014 >Reporter: Joe Chong >Priority: Minor > > I submitted a job in scala spark shell to show a DataFrame. The data size is > about 43K. The job was successful in the end, but took more than 13 minutes > to resolve. Upon checking the log, there's multiple exception raised on > "Failed to check existence of class" with a java.net.connectionexpcetion > message indicating timeout trying to connect to the port 52067, the repl port > that Spark setup. Please assist to troubleshoot. Thanks. > Started Spark in standalone mode > $ spark-shell --driver-memory 5g --master local[*] > 16/07/26 21:05:29 WARN util.NativeCodeLoader: Unable to load native-hadoop > library for your platform... 
using builtin-java classes where applicable > 16/07/26 21:05:30 INFO spark.SecurityManager: Changing view acls to: joechong > 16/07/26 21:05:30 INFO spark.SecurityManager: Changing modify acls to: > joechong > 16/07/26 21:05:30 INFO spark.SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(joechong); users > with modify permissions: Set(joechong) > 16/07/26 21:05:30 INFO spark.HttpServer: Starting HTTP Server > 16/07/26 21:05:30 INFO server.Server: jetty-8.y.z-SNAPSHOT > 16/07/26 21:05:30 INFO server.AbstractConnector: Started > SocketConnector@0.0.0.0:52067 > 16/07/26 21:05:30 INFO util.Utils: Successfully started service 'HTTP class > server' on port 52067. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 1.6.1 > /_/ > Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66) > Type in expressions to have them evaluated. > Type :help for more information. > 16/07/26 21:05:34 INFO spark.SparkContext: Running Spark version 1.6.1 > 16/07/26 21:05:34 INFO spark.SecurityManager: Changing view acls to: joechong > 16/07/26 21:05:34 INFO spark.SecurityManager: Changing modify acls to: > joechong > 16/07/26 21:05:34 INFO spark.SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(joechong); users > with modify permissions: Set(joechong) > 16/07/26 21:05:35 INFO util.Utils: Successfully started service 'sparkDriver' > on port 52072. > 16/07/26 21:05:35 INFO slf4j.Slf4jLogger: Slf4jLogger started > 16/07/26 21:05:35 INFO Remoting: Starting remoting > 16/07/26 21:05:35 INFO Remoting: Remoting started; listening on addresses > :[akka.tcp://sparkDriverActorSystem@10.199.29.218:52074] > 16/07/26 21:05:35 INFO util.Utils: Successfully started service > 'sparkDriverActorSystem' on port 52074. 
> 16/07/26 21:05:35 INFO spark.SparkEnv: Registering MapOutputTracker > 16/07/26 21:05:35 INFO spark.SparkEnv: Registering BlockManagerMaster > 16/07/26 21:05:35 INFO storage.DiskBlockManager: Created local directory at > /private/var/folders/r7/bs2f87nj6lnd5vm51lvxcw68gn/T/blockmgr-cd542a27-6ff1-4f51-a72b-78654142fdb6 > 16/07/26 21:05:35 INFO storage.MemoryStore: MemoryStore started with capacity > 3.4 GB > 16/07/26 21:05:35 INFO spark.SparkEnv: Registering OutputCommitCoordinator > 16/07/26 21:05:36 INFO server.Server: jetty-8.y.z-SNAPSHOT > 16/07/26 21:05:36 INFO server.AbstractConnector: Started > SelectChannelConnector@0.0.0.0:4040 > 16/07/26 21:05:36 INFO util.Utils: Successfully started service 'SparkUI' on > port 4040. > 16/07/26 21:05:36 INFO ui.SparkUI: Started SparkUI at > http://10.199.29.218:4040 > 16/07/26 21:05:36 INFO executor.Executor: Starting executor ID driver on host > localhost > 16/07/26 21:05:36 INFO executor.Executor: Using REPL class URI: > http://10.199.29.218:52067 > 16/07/26 21:05:36 INFO util.Utils: Successfully started service > 'org.apache.spark.network.netty.NettyBlockTransferService' on port 52075. > 16/07/26 21:05:36 INFO netty.NettyBlockTransferService: Server created on > 52075 > 16/07/26 21:05:36 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/07/26 21:05:36 INFO storage.BlockManagerMasterEndpoint: Registering block > manager localhost:52075 with 3.4 GB RAM, BlockManagerId(driver,
[jira] [Commented] (SPARK-12157) Support numpy types as return values of Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-12157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398267#comment-15398267 ] Nicholas Chammas commented on SPARK-12157: -- I'm looking to define a UDF in PySpark that returns a {{pyspark.ml.linalg.Vector}}. Since {{Vector}} is a wrapper for numpy types, I believe this issue covers what I'm looking for. My use case is that I want a UDF that takes in several DataFrame columns and extracts/computes features, returning them as a new {{Vector}} column. I believe {{VectorAssembler}} is for when you already have the features and you just want them put in a {{Vector}}. [~josephkb] [~zjffdu] So is it possible to do that today? Have I misunderstood how to approach my use case? > Support numpy types as return values of Python UDFs > --- > > Key: SPARK-12157 > URL: https://issues.apache.org/jira/browse/SPARK-12157 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 1.5.2 >Reporter: Justin Uang > > Currently, if I have a python UDF > {code} > import pyspark.sql.types as T > import pyspark.sql.functions as F > from pyspark.sql import Row > import numpy as np > argmax = F.udf(lambda x: np.argmax(x), T.IntegerType()) > df = sqlContext.createDataFrame([Row(array=[1,2,3])]) > df.select(argmax("array")).count() > {code} > I get an exception that is fairly opaque: > {code} > Caused by: net.razorvine.pickle.PickleException: expected zero arguments for > construction of ClassDict (for numpy.dtype) > at > net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23) > at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:701) > at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:171) > at net.razorvine.pickle.Unpickler.load(Unpickler.java:85) > at net.razorvine.pickle.Unpickler.loads(Unpickler.java:98) > at > org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:404) > at > 
org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:403) > {code} > Numpy types like np.int and np.float64 should automatically be cast to the > proper dtypes.
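The opaque pickle error above comes from the UDF returning a NumPy scalar (e.g. {{np.int64}}) rather than a built-in Python {{int}}, which Pyrolite on the JVM side cannot unpickle. A common workaround is to cast inside the UDF before returning. The sketch below shows the cast with plain NumPy only; the PySpark wiring appears in a comment and mirrors the {{F}}/{{T}} aliases from the report above:

```python
import numpy as np

# np.argmax returns a NumPy scalar (a numpy integer type), not a Python int;
# the JVM-side unpickler cannot reconstruct numpy.dtype, hence the opaque
# "expected zero arguments for construction of ClassDict" error.
raw = np.argmax([1, 2, 3])

# Workaround: cast to a built-in type before the value leaves the UDF, e.g.
#   argmax = F.udf(lambda x: int(np.argmax(x)), T.IntegerType())
fixed = int(np.argmax([1, 2, 3]))

print(type(fixed).__name__)  # prints "int"
```

The same cast (`float(...)` for `np.float64`) applies to any numpy scalar returned from a Python UDF until automatic coercion lands.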
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398237#comment-15398237 ] Reynold Xin commented on SPARK-6305: Thanks, Mikael. Would love to get some help here. > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Tal Sliwowicz >Priority: Minor > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build.
[jira] [Created] (SPARK-16783) make-distri
Michael Gummelt created SPARK-16783: --- Summary: make-distri Key: SPARK-16783 URL: https://issues.apache.org/jira/browse/SPARK-16783 Project: Spark Issue Type: Bug Reporter: Michael Gummelt
[jira] [Commented] (SPARK-16365) Ideas for moving "mllib-local" forward
[ https://issues.apache.org/jira/browse/SPARK-16365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398178#comment-15398178 ] Joseph K. Bradley commented on SPARK-16365: --- This JIRA is covering multiple potential projects within mllib-local. I'd find it useful to separate these. Here are the ones I see so far: h3. Local model implementations IMO this is a no brainer: we should definitely work towards moving model implementations into mllib-local. This is _separate_ from model serving, etc. but is an important dependency of non-Spark model serving efforts. Essentially, what is required now is: * duplicate transformation and prediction functionality outside of Spark * build model serving infra based on the duplicate code path What we should have is: * local model implementation in mllib-local * MLlib depends on that implementation * external model serving infra depends on mllib-local The key benefit is utilizing exactly the same code paths for prediction within MLlib and in production systems, making it easy to keep them in sync if MLlib changes its behavior. CCing [~chromaticbum] with whom I had a great discussion on this. h3. Local model serialization format This is easily conflated with model implementations, but I believe it should be a separate discussion. The community did a great job on ML persistence for the DataFrame-based API, so I do believe it is possible to do a good job here, though we must achieve consensus first. This would of course need to happen after the local model implementations were built. h3. Local linear algebra This is already being discussed elsewhere, e.g. [SPARK-6442], so let's not discuss it here. h3. Local model training I completely agree with what was said above about (a) not duplicating functionality in other single-machine libraries and (b) doing all training with Spark proper. 
> Ideas for moving "mllib-local" forward > -- > > Key: SPARK-16365 > URL: https://issues.apache.org/jira/browse/SPARK-16365 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Nick Pentreath > > Since SPARK-13944 is all done, we should all think about what the "next > steps" might be for {{mllib-local}}. E.g., it could be "improve Spark's > linear algebra", or "investigate how we will implement local models/pipelines > in Spark", etc. > This ticket is for comments, ideas, brainstormings and PoCs. The separation > of linalg into a standalone project turned out to be significantly more > complex than originally expected. So I vote we devote sufficient discussion > and time to planning out the next move :) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16780) spark-streaming-kafka_2.10 version 2.0.0 not on maven central
[ https://issues.apache.org/jira/browse/SPARK-16780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398179#comment-15398179 ] Sean Owen commented on SPARK-16780: --- There is also an "0.8" artifact, which is likely what you need. > spark-streaming-kafka_2.10 version 2.0.0 not on maven central > - > > Key: SPARK-16780 > URL: https://issues.apache.org/jira/browse/SPARK-16780 > Project: Spark > Issue Type: Bug >Reporter: Andrew B > > I cannot seem to find spark-streaming-kafka_2.10 version 2.0.0 on maven > central. Has this been released?
[jira] [Commented] (SPARK-16334) [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398130#comment-15398130 ] Keith Kraus commented on SPARK-16334: - [~sameerag] I have just built branch-2.0 which should have included your patch, but I am still experiencing this issue. The parquet file I am using was written using Spark 1.6 and has ~450 columns. Issuing {{spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")}} prevents the issue from occurring, so it's definitely the vectorized parquet reader. Let me know if I can provide any additional information to help resolve this issue. > [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException > - > > Key: SPARK-16334 > URL: https://issues.apache.org/jira/browse/SPARK-16334 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Egor Pahomov >Assignee: Sameer Agarwal >Priority: Critical > Labels: sql > Fix For: 2.0.1, 2.1.0 > > > Query: > {code} > select * from blabla where user_id = 415706251 > {code} > Error: > {code} > 16/06/30 14:07:27 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 0.0 > (TID 3, hadoop6): java.lang.ArrayIndexOutOfBoundsException: 6934 > at > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:119) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:273) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:170) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > 
at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Work on 1.6.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16764) Recommend disabling vectorized parquet reader on OutOfMemoryError
[ https://issues.apache.org/jira/browse/SPARK-16764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16764. - Resolution: Fixed Assignee: Sameer Agarwal Fix Version/s: 2.1.0 2.0.1 > Recommend disabling vectorized parquet reader on OutOfMemoryError > - > > Key: SPARK-16764 > URL: https://issues.apache.org/jira/browse/SPARK-16764 > Project: Spark > Issue Type: Improvement >Reporter: Sameer Agarwal >Assignee: Sameer Agarwal > Fix For: 2.0.1, 2.1.0 > > > We currently don't bound or manage the data array size used by column vectors > in the vectorized reader (they're just bound by INT.MAX) which may lead to > OOMs while reading data. In the short term, we can probably intercept this > exception and suggest the user to disable the vectorized parquet reader. > Longer term, we should probably do explicit memory management for this.
[jira] [Commented] (SPARK-16780) spark-streaming-kafka_2.10 version 2.0.0 not on maven central
[ https://issues.apache.org/jira/browse/SPARK-16780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398105#comment-15398105 ] Andrew B commented on SPARK-16780: -- How are the new artifacts used with the example below? https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java#L3 The example contains a reference to KafkaUtil class which contains a createStream() method. However, org.apache.spark.streaming.kafka010.KafkaUtil, which is in spark-streaming-kafka-0-10_2.10, has switched over to DStream API, so it does not have a createStream method. > spark-streaming-kafka_2.10 version 2.0.0 not on maven central > - > > Key: SPARK-16780 > URL: https://issues.apache.org/jira/browse/SPARK-16780 > Project: Spark > Issue Type: Bug >Reporter: Andrew B > > I cannot seem to find spark-streaming-kafka_2.10 version 2.0.0 on maven > central. Has this been released? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16762) spark hanging when action method print
[ https://issues.apache.org/jira/browse/SPARK-16762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16762. --- Resolution: Invalid Yeah, too many possible things wrong; it should be narrowed down separately first, possibly discussed on user@. If there's a fairly isolateable issue here, we can reopen. > spark hanging when action method print > -- > > Key: SPARK-16762 > URL: https://issues.apache.org/jira/browse/SPARK-16762 > Project: Spark > Issue Type: Bug > Components: Deploy >Reporter: Anh Nguyen > Attachments: Screen Shot 2016-07-28 at 12.33.35 PM.png, Screen Shot > 2016-07-28 at 12.34.22 PM.png > > > I write code spark Streaming (consumer) intergate kafaka and deploy on mesos > FW: > import org.apache.spark.SparkConf > import org.apache.spark.streaming._ > import org.apache.spark.streaming.kafka._ > import org.apache.spark.streaming.kafka._ > object consumer { > def main(args: Array[String]) { > if (args.length < 4) { > System.err.println("Usage: KafkaWordCount > ") > System.exit(1) > } > val Array(zkQuorum, group, topics, numThreads) = args > val sparkConf = new > SparkConf().setAppName("KafkaWordCount").set("spark.rpc.netty.dispatcher.numThreads","4") > val ssc = new StreamingContext(sparkConf, Seconds(2)) > ssc.checkpoint("checkpoint") > val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap > val lines = KafkaUtils.createStream(ssc, zkQuorum, group, > topicMap).map(_._2) > lines.print() > val words = lines.flatMap(_.split(" ")) > val wordCounts = words.map(x => (x, 1L)).reduceByKeyAndWindow(_ + _, _ > - _, Minutes(10), Seconds(2), 2) > ssc.start() > ssc.awaitTermination() > } > } > This log: > I0728 04:28:05.439469 4295 exec.cpp:143] Version: 0.28.2 > I0728 04:28:05.443464 4296 exec.cpp:217] Executor registered on slave > 8e00b1ea-3f70-428a-8125-fb0eed88aede-S0 > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 16/07/28 04:28:07 INFO CoarseGrainedExecutorBackend: 
Registered signal > handlers for [TERM, HUP, INT] > 16/07/28 04:28:08 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 16/07/28 04:28:09 INFO SecurityManager: Changing view acls to: vagrant > 16/07/28 04:28:09 INFO SecurityManager: Changing modify acls to: vagrant > 16/07/28 04:28:09 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(vagrant); users > with modify permissions: Set(vagrant) > 16/07/28 04:28:11 INFO SecurityManager: Changing view acls to: vagrant > 16/07/28 04:28:11 INFO SecurityManager: Changing modify acls to: vagrant > 16/07/28 04:28:11 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(vagrant); users > with modify permissions: Set(vagrant) > 16/07/28 04:28:12 INFO Slf4jLogger: Slf4jLogger started > 16/07/28 04:28:12 INFO Remoting: Starting remoting > 16/07/28 04:28:13 INFO Remoting: Remoting started; listening on addresses > :[akka.tcp://sparkExecutorActorSystem@slave1:57380] > 16/07/28 04:28:13 INFO Utils: Successfully started service > 'sparkExecutorActorSystem' on port 57380. 
> 16/07/28 04:28:13 INFO DiskBlockManager: Created local directory at > /home/vagrant/mesos/slave1/work/slaves/8e00b1ea-3f70-428a-8125-fb0eed88aede-S0/frameworks/3b189004-daea-47a3-9ba5-dfac5143f4ab-/executors/0/runs/0433234f-332d-4858-9527-64558fe7fb90/blockmgr-4e94f4ea-f074-4dbe-bfa1-34054e6e079c > 16/07/28 04:28:13 INFO MemoryStore: MemoryStore started with capacity 517.4 MB > 16/07/28 04:28:13 INFO CoarseGrainedExecutorBackend: Connecting to driver: > spark://CoarseGrainedScheduler@192.168.33.30:36179 > 16/07/28 04:28:13 INFO CoarseGrainedExecutorBackend: Successfully registered > with driver > 16/07/28 04:28:13 INFO Executor: Starting executor ID > 8e00b1ea-3f70-428a-8125-fb0eed88aede-S0 on host slave1 > 16/07/28 04:28:13 INFO Utils: Successfully started service > 'org.apache.spark.network.netty.NettyBlockTransferService' on port 48672. > 16/07/28 04:28:13 INFO NettyBlockTransferService: Server created on 48672 > 16/07/28 04:28:13 INFO BlockManagerMaster: Trying to register BlockManager > 16/07/28 04:28:13 INFO BlockManagerMaster: Registered BlockManager > 16/07/28 04:28:13 INFO CoarseGrainedExecutorBackend: Got assigned task 0 > 16/07/28 04:28:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) > 16/07/28 04:28:13 INFO Executor: Fetching > http://192.168.33.30:52977/jars/spaka-1.0-SNAPSHOT-jar-with-dependencies.jar > with timestamp 1469680084686 > 16/07/28 04:28:14 INFO Utils: Fetching >
[jira] [Commented] (SPARK-16770) Spark shell not usable with german keyboard due to JLine version
[ https://issues.apache.org/jira/browse/SPARK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398072#comment-15398072 ] Sean Owen commented on SPARK-16770: --- Sure, if you would please open a pull request to update the dependency. Check 'mvn dependency:tree' to make sure it causes all instances of the dependency to be updated. > Spark shell not usable with german keyboard due to JLine version > > > Key: SPARK-16770 > URL: https://issues.apache.org/jira/browse/SPARK-16770 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.0 >Reporter: Stefan Schulze >Priority: Minor > > It is impossible to enter a right square bracket with a single keystroke > using a german keyboard layout. The problem is known from former Scala > version, responsible is jline-2.12.jar (see > https://issues.scala-lang.org/browse/SI-8759). > Workaround: Replace jline-2.12.jar by jline.2.12.1.jar in the jars folder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16774) Fix use of deprecated TimeStamp constructor
[ https://issues.apache.org/jira/browse/SPARK-16774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398063#comment-15398063 ] Sean Owen commented on SPARK-16774: --- Yeah I remember this one. Seems like it would take some careful use of Calendar to resolve, but certainly can be done. +1 > Fix use of deprecated TimeStamp constructor > --- > > Key: SPARK-16774 > URL: https://issues.apache.org/jira/browse/SPARK-16774 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: holdenk > > The TimeStamp constructor we use inside of DateTime utils has been deprecated > since JDK 1.1 - while Java does take a long time to remove deprecated > functionality we might as well address this.
[jira] [Commented] (SPARK-16782) Use Sphinx autodoc to eliminate duplication of Python docstrings
[ https://issues.apache.org/jira/browse/SPARK-16782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398059#comment-15398059 ] Nicholas Chammas commented on SPARK-16782: -- [~davies] [~joshrosen] - I can take this on if the idea sounds good to you. One thing I will have to look into is how the doctests show up, since we don't want those to be identical (or do we?) due to the different invocation methods. I'm not sure if autodoc will make that possible. > Use Sphinx autodoc to eliminate duplication of Python docstrings > > > Key: SPARK-16782 > URL: https://issues.apache.org/jira/browse/SPARK-16782 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Reporter: Nicholas Chammas >Priority: Minor > > In several cases it appears that we duplicate docstrings for methods that are > exposed in multiple places. > For example, here are two instances of {{createDataFrame}} with identical > docstrings: > * > https://github.com/apache/spark/blob/ab6e4aea5f39c429d5ea62a5170c8a1da612b74a/python/pyspark/sql/context.py#L218 > * > https://github.com/apache/spark/blob/39c836e976fcae51568bed5ebab28e148383b5d4/python/pyspark/sql/session.py#L406 > I believe we can eliminate this duplication by using the [Sphinx autodoc > extension|http://www.sphinx-doc.org/en/stable/ext/autodoc.html]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16782) Use Sphinx autodoc to eliminate duplication of Python docstrings
Nicholas Chammas created SPARK-16782: Summary: Use Sphinx autodoc to eliminate duplication of Python docstrings Key: SPARK-16782 URL: https://issues.apache.org/jira/browse/SPARK-16782 Project: Spark Issue Type: Improvement Components: Documentation, PySpark Reporter: Nicholas Chammas Priority: Minor In several cases it appears that we duplicate docstrings for methods that are exposed in multiple places. For example, here are two instances of {{createDataFrame}} with identical docstrings: * https://github.com/apache/spark/blob/ab6e4aea5f39c429d5ea62a5170c8a1da612b74a/python/pyspark/sql/context.py#L218 * https://github.com/apache/spark/blob/39c836e976fcae51568bed5ebab28e148383b5d4/python/pyspark/sql/session.py#L406 I believe we can eliminate this duplication by using the [Sphinx autodoc extension|http://www.sphinx-doc.org/en/stable/ext/autodoc.html].
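A lightweight alternative worth comparing against autodoc is to alias the canonical docstring at definition time instead of hand-copying its text. The classes below are stand-ins (no Spark installation assumed) that only illustrate the mechanism:

```python
# Stand-ins for pyspark.sql.SparkSession / SQLContext; the real classes are
# referenced only for context. The point: one canonical docstring, reused.

class SparkSession:
    def createDataFrame(self, data, schema=None):
        """Creates a :class:`DataFrame` from data.  (canonical docstring)"""

class SQLContext:
    def createDataFrame(self, data, schema=None):
        return SparkSession().createDataFrame(data, schema)
    # Reuse the docstring rather than maintaining a duplicate copy; Sphinx
    # then renders identical text for both entry points automatically.
    createDataFrame.__doc__ = SparkSession.createDataFrame.__doc__
```

This keeps the two docstrings identical by construction, though unlike autodoc it does not help with the doctest-invocation question raised above.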
[jira] [Resolved] (SPARK-16780) spark-streaming-kafka_2.10 version 2.0.0 not on maven central
[ https://issues.apache.org/jira/browse/SPARK-16780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16780. --- Resolution: Not A Problem These have become artifacts like spark-streaming-kafka-0-10_2.10 in 2.0.0 > spark-streaming-kafka_2.10 version 2.0.0 not on maven central > - > > Key: SPARK-16780 > URL: https://issues.apache.org/jira/browse/SPARK-16780 > Project: Spark > Issue Type: Bug >Reporter: Andrew B > > I cannot seem to find spark-streaming-kafka_2.10 version 2.0.0 on maven > central. Has this been released?
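Concretely, the per-Kafka-version artifacts are declared like the POM fragment below (a sketch; confirm the exact artifact name and version on Maven Central before depending on it):

```xml
<!-- Spark 2.0.0 split the old spark-streaming-kafka_2.10 artifact by Kafka
     broker version; the "0-8" integration is the one that keeps the
     receiver-based createStream API mentioned in the comments above. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-kafka-0-8_2.10</artifactId>
  <version>2.0.0</version>
</dependency>
```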
[jira] [Commented] (SPARK-16769) httpclient classic dependency - potentially a patch required?
[ https://issues.apache.org/jira/browse/SPARK-16769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398018#comment-15398018 ] Adam Roberts commented on SPARK-16769: -- Thanks, looking on Maven central I see the latest jets3t version is 0.9.4 so I've removed the commons-httpclient dependency from both the main pom.xml and the hive pom.xml and I'm building/testing now to see if anything breaks on master before trying the same on branch-1.6 This line in the main pom is indeed pulling in the bad jar {code}0.7.1{code} and any jets3t 0.9.x version doesn't mention the commons-httpclient dependency Pending anything breaking I'll create a JIRA and pull request to up the version and get rid of the classic one - unless something is really dependent on 0.7.1 in which case we may have some code to modify if the API changed > httpclient classic dependency - potentially a patch required? > - > > Key: SPARK-16769 > URL: https://issues.apache.org/jira/browse/SPARK-16769 > Project: Spark > Issue Type: Question > Components: Build >Affects Versions: 1.6.2, 2.0.0 > Environment: All Spark versions, any environment >Reporter: Adam Roberts > > In our jars folder for Spark we provide a jar with a CVE > https://www.versioneye.com/java/commons-httpclient:commons-httpclient/3.1. > CVE-2012-5783 > This paper outlines the problem > www.cs.utexas.edu/~shmat/shmat_ccs12.pdf > My question is: do we need to ship this version as well or is it only used > for tests? Is it a patched version? 
I plan to run without this dependency and > if there are NoClassDefFound problems I'll add test so we > don't ship it (downloading it in the first place is bad enough though) > Note that this is valid for all versions, suggesting it be raised to a > critical if Spark functionality is depending on it because of what the pdf > I've linked to mentions > Here is the jar being included: > ls $SPARK_HOME/jars | grep "httpclient" > commons-httpclient-3.1.jar > httpclient-4.5.2.jar > The first jar potentially contains the security issue, could be a patched > version, need to verify. SHA1 sum for this jar is > 964cd74171f427720480efdec40a7c7f6e58426a -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16765) Add Pipeline API example for KMeans
[ https://issues.apache.org/jira/browse/SPARK-16765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397994#comment-15397994 ] Bryan Cutler commented on SPARK-16765: -- Was there some specific use of Pipelines with KMeans that you would like to see? Generally speaking, you can build a pipeline and transform features for KMeans in the same way as any other algorithm, so I don't think another example would really add value. > Add Pipeline API example for KMeans > --- > > Key: SPARK-16765 > URL: https://issues.apache.org/jira/browse/SPARK-16765 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.0 >Reporter: Manish Mishra >Priority: Trivial > > A Pipeline API example for KMeans would be nice to have in the examples > package since it is one of the most widely used ML algorithms. A Pipeline API > example of Logistic Regression has already been added to the ml examples > package.
[jira] [Comment Edited] (SPARK-16779) Fix unnecessary use of postfix operations
[ https://issues.apache.org/jira/browse/SPARK-16779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397984#comment-15397984 ] holdenk edited comment on SPARK-16779 at 7/28/16 6:36 PM: -- I'm sort of on the fence with fixing as well - but we are also using it for command execution with {quote}!!{quote} which (in my opinion) doesn't increase readability at all. So my current plan is to make a small fix to the one place where we don't have the warning squelched with doing {quote}!!{quote} and if people agree that its cleaner without it - I'll go through and cleanup the rest of the {quote}!!{quote} places. was (Author: holdenk): I'm sort of on the fence with fixing as well - but we are also using it for command execution with "!!" which (in my opinion) doesn't increase readability at all. So my current plan is to make a small fix to the one place where we don't have the warning squelched with doing "!!" and if people agree that its cleaner without it - I'll go through and cleanup the rest of the "!!" places. > Fix unnecessary use of postfix operations > - > > Key: SPARK-16779 > URL: https://issues.apache.org/jira/browse/SPARK-16779 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: holdenk > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16779) Fix unnecessary use of postfix operations
[ https://issues.apache.org/jira/browse/SPARK-16779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397984#comment-15397984 ] holdenk commented on SPARK-16779: - I'm sort of on the fence with fixing as well - but we are also using it for command execution with "!!" which (in my opinion) doesn't increase readability at all. So my current plan is to make a small fix to the one place where we don't have the warning squelched with doing "!!" and if people agree that its cleaner without it - I'll go through and cleanup the rest of the "!!" places. > Fix unnecessary use of postfix operations > - > > Key: SPARK-16779 > URL: https://issues.apache.org/jira/browse/SPARK-16779 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: holdenk > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16781) java launched by PySpark as gateway may not be the same java used in the spark environment
Michael Berman created SPARK-16781: -- Summary: java launched by PySpark as gateway may not be the same java used in the spark environment Key: SPARK-16781 URL: https://issues.apache.org/jira/browse/SPARK-16781 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.6.2 Reporter: Michael Berman When launching spark on a system with multiple javas installed, there are a few options for choosing which JRE to use, setting `JAVA_HOME` being the most straightforward. However, when pyspark's internal py4j launches its JavaGateway, it always invokes `java` directly, without qualification. This means you get whatever java's first on your path, which is not necessarily the same one in spark's JAVA_HOME. This could be seen as a py4j issue, but from their point of view, the fix is easy: make sure the java you want is first on your path. I can't figure out a way to make that reliably happen through the pyspark executor launch path, and it seems like something that would ideally happen automatically. If I set JAVA_HOME when launching spark, I would expect that to be the only java used throughout the stack. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
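A minimal workaround sketch for the report above (this is not pyspark's actual launch path; the helper name and fallback behavior are assumptions): resolve the gateway binary from `JAVA_HOME` first, and only fall back to whatever `java` is first on the PATH, which is all py4j effectively does today.

```python
import os
import shutil


def resolve_java():
    """Prefer $JAVA_HOME/bin/java; fall back to the first `java` on PATH."""
    java_home = os.environ.get("JAVA_HOME")
    if java_home:
        candidate = os.path.join(java_home, "bin", "java")
        # Only use it if it actually exists and is executable.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    # Same behavior the gateway gets today: first `java` on the PATH.
    return shutil.which("java") or "java"
```

Pointing the gateway launch at the resolved path would keep the whole stack on the JRE named by `JAVA_HOME`, which is the behavior the reporter expected.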
[jira] [Commented] (SPARK-16779) Fix unnecessary use of postfix operations
[ https://issues.apache.org/jira/browse/SPARK-16779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397968#comment-15397968 ] Sean Owen commented on SPARK-16779: --- Yeah I tend to agree with fixing this up, because Scala 2.11 actually warns on this unless you explicitly make an import to enable it. Sounds like something to be avoided. I'm only on the fence about cases where it makes for nice natural syntax like "0 to 10" > Fix unnecessary use of postfix operations > - > > Key: SPARK-16779 > URL: https://issues.apache.org/jira/browse/SPARK-16779 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: holdenk > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16780) spark-streaming-kafka_2.10 version 2.0.0 not on maven central
Andrew B created SPARK-16780: Summary: spark-streaming-kafka_2.10 version 2.0.0 not on maven central Key: SPARK-16780 URL: https://issues.apache.org/jira/browse/SPARK-16780 Project: Spark Issue Type: Bug Reporter: Andrew B I cannot seem to find spark-streaming-kafka_2.10 version 2.0.0 on maven central. Has this been released?
[jira] [Created] (SPARK-16779) Fix unnecessary use of postfix operations
holdenk created SPARK-16779: --- Summary: Fix unnecessary use of postfix operations Key: SPARK-16779 URL: https://issues.apache.org/jira/browse/SPARK-16779 Project: Spark Issue Type: Sub-task Components: SQL Reporter: holdenk
[jira] [Updated] (SPARK-16773) Post Spark 2.0 deprecation & warnings cleanup
[ https://issues.apache.org/jira/browse/SPARK-16773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-16773: Summary: Post Spark 2.0 deprecation & warnings cleanup (was: Post Spark 2.0 deprecation cleanup) > Post Spark 2.0 deprecation & warnings cleanup > - > > Key: SPARK-16773 > URL: https://issues.apache.org/jira/browse/SPARK-16773 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, PySpark, Spark Core, SQL >Reporter: holdenk > > As part of the 2.0 release we deprecated a number of different internal > components (one of the largest ones being the old accumulator API), and also > upgraded our default build to Scala 2.11. > This has added a large number of deprecation warnings (internal and external) > - some of which can be worked around - and some of which can't (mostly in the > Scala 2.10 -> 2.11 reflection API and various tests). > We should attempt to limit the number of warnings in our build so that we can > notice new ones and thoughtfully consider if they are warranted.
[jira] [Commented] (SPARK-16769) httpclient classic dependency - potentially a patch required?
[ https://issues.apache.org/jira/browse/SPARK-16769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397961#comment-15397961 ] Sean Owen commented on SPARK-16769: --- I'm pretty certain it's there only because some dependency needs it, and looks like you found it: jets3t. It wouldn't be a patched version. I don't think that can be excluded because it will break S3. However, maybe there's a newer version of jets3t or the old httpclient we can manage up to? > httpclient classic dependency - potentially a patch required? > - > > Key: SPARK-16769 > URL: https://issues.apache.org/jira/browse/SPARK-16769 > Project: Spark > Issue Type: Question > Components: Build >Affects Versions: 1.6.2, 2.0.0 > Environment: All Spark versions, any environment >Reporter: Adam Roberts > > In our jars folder for Spark we provide a jar with a CVE > https://www.versioneye.com/java/commons-httpclient:commons-httpclient/3.1. > CVE-2012-5783 > This paper outlines the problem > www.cs.utexas.edu/~shmat/shmat_ccs12.pdf > My question is: do we need to ship this version as well or is it only used > for tests? Is it a patched version? I plan to run without this dependency and > if there are NoClassDefFound problems I'll add test so we > don't ship it (downloading it in the first place is bad enough though) > Note that this is valid for all versions, suggesting it be raised to a > critical if Spark functionality is depending on it because of what the pdf > I've linked to mentions > Here is the jar being included: > ls $SPARK_HOME/jars | grep "httpclient" > commons-httpclient-3.1.jar > httpclient-4.5.2.jar > The first jar potentially contains the security issue, could be a patched > version, need to verify. SHA1 sum for this jar is > 964cd74171f427720480efdec40a7c7f6e58426a -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
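For anyone wanting to experiment with Sean's suggestion above, a sketch of what excluding the transitive jar from jets3t could look like in a pom (coordinates assumed from the usual Maven Central ones; as noted in the comment, dropping it will likely break S3, so this is for testing only):

```xml
<dependency>
  <groupId>net.java.dev.jets3t</groupId>
  <artifactId>jets3t</artifactId>
  <exclusions>
    <!-- Keep the CVE-affected classic httpclient out of the assembly -->
    <exclusion>
      <groupId>commons-httpclient</groupId>
      <artifactId>commons-httpclient</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

A `mvn dependency:tree` run after such a change would confirm whether anything else still pulls commons-httpclient 3.1 in.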
[jira] [Created] (SPARK-16778) Fix use of deprecated SQLContext constructor
holdenk created SPARK-16778: --- Summary: Fix use of deprecated SQLContext constructor Key: SPARK-16778 URL: https://issues.apache.org/jira/browse/SPARK-16778 Project: Spark Issue Type: Sub-task Components: SQL Reporter: holdenk
[jira] [Updated] (SPARK-16775) Reduce internal warnings from deprecated accumulator API
[ https://issues.apache.org/jira/browse/SPARK-16775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-16775: Component/s: SQL > Reduce internal warnings from deprecated accumulator API > > > Key: SPARK-16775 > URL: https://issues.apache.org/jira/browse/SPARK-16775 > Project: Spark > Issue Type: Sub-task > Components: PySpark, Spark Core, SQL >Reporter: holdenk > > Deprecating the old accumulator API added a large number of warnings - many > of these could be fixed with a bit of refactoring to offer a non-deprecated > internal class while still preserving the external deprecation warnings.
[jira] [Created] (SPARK-16777) Parquet schema converter depends on deprecated APIs
holdenk created SPARK-16777: --- Summary: Parquet schema converter depends on deprecated APIs Key: SPARK-16777 URL: https://issues.apache.org/jira/browse/SPARK-16777 Project: Spark Issue Type: Sub-task Components: SQL Reporter: holdenk
[jira] [Updated] (SPARK-16775) Reduce internal warnings from deprecated accumulator API
[ https://issues.apache.org/jira/browse/SPARK-16775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-16775: Component/s: (was: ML) (was: SQL) (was: MLlib) > Reduce internal warnings from deprecated accumulator API > > > Key: SPARK-16775 > URL: https://issues.apache.org/jira/browse/SPARK-16775 > Project: Spark > Issue Type: Sub-task > Components: PySpark, Spark Core, SQL >Reporter: holdenk > > Deprecating the old accumulator API added a large number of warnings - many > of these could be fixed with a bit of refactoring to offer a non-deprecated > internal class while still preserving the external deprecation warnings.
[jira] [Created] (SPARK-16776) Fix Kafka deprecation warnings
holdenk created SPARK-16776: --- Summary: Fix Kafka deprecation warnings Key: SPARK-16776 URL: https://issues.apache.org/jira/browse/SPARK-16776 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: holdenk The new KafkaTestUtils depends on many deprecated APIs - see if we can refactor this to be avoided.
[jira] [Created] (SPARK-16775) Reduce internal warnings from deprecated accumulator API
holdenk created SPARK-16775: --- Summary: Reduce internal warnings from deprecated accumulator API Key: SPARK-16775 URL: https://issues.apache.org/jira/browse/SPARK-16775 Project: Spark Issue Type: Sub-task Reporter: holdenk Deprecating the old accumulator API added a large number of warnings - many of these could be fixed with a bit of refactoring to offer a non-deprecated internal class while still preserving the external deprecation warnings.
[jira] [Created] (SPARK-16774) Fix use of deprecated TimeStamp constructor
holdenk created SPARK-16774: --- Summary: Fix use of deprecated TimeStamp constructor Key: SPARK-16774 URL: https://issues.apache.org/jira/browse/SPARK-16774 Project: Spark Issue Type: Sub-task Components: SQL Reporter: holdenk The TimeStamp constructor we use inside of DateTime utils has been deprecated since JDK 1.1 - while Java does take a long time to remove deprecated functionality we might as well address this.
[jira] [Created] (SPARK-16773) Post Spark 2.0 deprecation cleanup
holdenk created SPARK-16773: --- Summary: Post Spark 2.0 deprecation cleanup Key: SPARK-16773 URL: https://issues.apache.org/jira/browse/SPARK-16773 Project: Spark Issue Type: Improvement Components: ML, MLlib, PySpark, Spark Core, SQL Reporter: holdenk As part of the 2.0 release we deprecated a number of different internal components (one of the largest ones being the old accumulator API), and also upgraded our default build to Scala 2.11. This has added a large number of deprecation warnings (internal and external) - some of which can be worked around - and some of which can't (mostly in the Scala 2.10 -> 2.11 reflection API and various tests). We should attempt to limit the number of warnings in our build so that we can notice new ones and thoughtfully consider if they are warranted.
[jira] [Commented] (SPARK-16769) httpclient classic dependency - potentially a patch required?
[ https://issues.apache.org/jira/browse/SPARK-16769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397889#comment-15397889 ] Adam Roberts commented on SPARK-16769: -- [~rxin] [~srowen] Reynold and Sean, interested in what you think, a test run on my VM hasn't produced any new failures after deleting the jar from my m2 repo, I'm also trying to figure out if this version has a patch or not so I'm running with a test case described at https://issues.apache.org/jira/browse/HTTPCLIENT-1265 A quick grep of the source code for Spark itself doesn't show it directly being used in our code either aroberts@aroberts-VirtualBox:~/Desktop/Spark-DK$ grep -R "org.apache.commons.httpclient" . Binary file ./dist/jars/commons-httpclient-3.1.jar matches Binary file ./dist/jars/jets3t-0.9.3.jar matches Binary file ./sql/hive/target/tmp/hive-ivy-cache/cache/commons-httpclient/commons-httpclient/jars/commons-httpclient-3.1.jar matches Binary file ./sql/hive/target/tmp/hive-ivy-cache/jars/commons-httpclient_commons-httpclient-3.1.jar matches Can you please clarify why we have this dependency? > httpclient classic dependency - potentially a patch required? > - > > Key: SPARK-16769 > URL: https://issues.apache.org/jira/browse/SPARK-16769 > Project: Spark > Issue Type: Question > Components: Build >Affects Versions: 1.6.2, 2.0.0 > Environment: All Spark versions, any environment >Reporter: Adam Roberts > > In our jars folder for Spark we provide a jar with a CVE > https://www.versioneye.com/java/commons-httpclient:commons-httpclient/3.1. > CVE-2012-5783 > This paper outlines the problem > www.cs.utexas.edu/~shmat/shmat_ccs12.pdf > My question is: do we need to ship this version as well or is it only used > for tests? Is it a patched version? 
I plan to run without this dependency and > if there are NoClassDefFound problems I'll add test so we > don't ship it (downloading it in the first place is bad enough though) > Note that this is valid for all versions, suggesting it be raised to a > critical if Spark functionality is depending on it because of what the pdf > I've linked to mentions > Here is the jar being included: > ls $SPARK_HOME/jars | grep "httpclient" > commons-httpclient-3.1.jar > httpclient-4.5.2.jar > The first jar potentially contains the security issue, could be a patched > version, need to verify. SHA1 sum for this jar is > 964cd74171f427720480efdec40a7c7f6e58426a
[jira] [Commented] (SPARK-11248) Spark hivethriftserver is using the wrong user while getting HDFS permissions
[ https://issues.apache.org/jira/browse/SPARK-11248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397882#comment-15397882 ] Furcy Pin commented on SPARK-11248: --- +1 I'm trying on spark 2.0.0, and I've configured hive.exec.stagingdir=/tmp/spark-staging/spark-staging then the spark thrift server creates temporary files here: /tmp/spark-staging/spark-staging_hive_2016-07-28_17-30-40_226_8664570533244973761-9/-ext-1/_temporary/0/task_201607281730_0015_m_00 with the following rights:
{code}
drwxrwxrwx spark:spark /tmp/spark-staging
drwxrwxrwx user:spark  /tmp/spark-staging/spark-staging_hive_2016-07-28_17-30-40_226_8664570533244973761-9
drwxrwxr-x user:spark  /tmp/spark-staging/spark-staging_hive_2016-07-28_17-30-40_226_8664570533244973761-9/-ext-1
drwxrwxr-x user:spark  /tmp/spark-staging/spark-staging_hive_2016-07-28_17-30-40_226_8664570533244973761-9/-ext-1/_temporary/
drwxrwxr-x user:spark  /tmp/spark-staging/spark-staging_hive_2016-07-28_17-30-40_226_8664570533244973761-9/-ext-1/_temporary/0
drwxrwxr-x spark:spark /tmp/spark-staging/spark-staging_hive_2016-07-28_17-30-40_226_8664570533244973761-9/-ext-1/_temporary/0/task_201607281730_0015_m_00
{code}
I then get an error when trying to move the staging files to the hive table. org.apache.hadoop.security.AccessControlException: Permission denied: user=user, access=WRITE, inode="/tmp/spark-staging/spark-staging_hive_2016-07-28_17-30-40_226_8664570533244973761-9/-ext-1/_temporary/0/task_201607281730_0015_m_00":spark:spark:drwxrwxr-x I tried setting the property "hive.server2.enable.doAs" to true or false, but it didn't seem to change anything.
> Spark hivethriftserver is using the wrong user to while getting HDFS > permissions > > > Key: SPARK-11248 > URL: https://issues.apache.org/jira/browse/SPARK-11248 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Trystan Leftwich > > While running spark as a hivethrift-server via Yarn Spark will use the user > running the Hivethrift server rather than the user connecting via JDBC to > check HDFS perms. > i.e. > In HDFS the perms are > rwx-- 3 testuser testuser /user/testuser/table/testtable > And i connect via beeline as user testuser > beeline -u 'jdbc:hive2://localhost:10511' -n 'testuser' -p '' > If i try to hit that table > select count(*) from test_table; > I get the following error > Error: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch > table test_table. java.security.AccessControlException: Permission denied: > user=hive, access=READ, > inode="/user/testuser/table/testtable":testuser:testuser:drwxr-x--x > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:185) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6795) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6777) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPathAccess(FSNamesystem.java:6702) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:9529) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1516) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1433) > at > 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) > (state=,code=0) > I have the following in set in hive-site.xml so it should be using the > correct user. > > hive.server2.enable.doAs > true > > > hive.metastore.execute.setugi > true
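The Permission denied in the report above is plain POSIX-style mode evaluation: the task directory is owned spark:spark with drwxrwxr-x, so a connecting user who is neither the owner nor in the group gets no write bit. A minimal sketch of that check (function name and simplifications are mine; real HDFS also consults the superuser and ACLs):

```python
def can_write(mode, owner, group, user, user_groups):
    """POSIX-style write check, as in the HDFS errors quoted above.

    mode is the 9-character permission string, e.g. 'rwxrwxr-x'.
    """
    if user == owner:
        return mode[1] == "w"   # owner write bit
    if group in user_groups:
        return mode[4] == "w"   # group write bit
    return mode[7] == "w"       # other write bit


# The failing inode: owned spark:spark, mode drwxrwxr-x, accessed as plain `user`
print(can_write("rwxrwxr-x", "spark", "spark", "user", {"user"}))  # -> False
```

So unless impersonation actually takes effect end to end (every staging file created as the connecting user), the final move into the table will fail no matter where hive.exec.stagingdir points.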
[jira] [Commented] (SPARK-16751) Upgrade derby to 10.12.1.1 from 10.11.1.1
[ https://issues.apache.org/jira/browse/SPARK-16751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397865#comment-15397865 ] Sean Owen commented on SPARK-16751: --- Yeah, I see from the PR that it's actually packaged. I don't know that it actually surfaces the attack vector in Spark in practice, but, let's just apply this of course. > Upgrade derby to 10.12.1.1 from 10.11.1.1 > - > > Key: SPARK-16751 > URL: https://issues.apache.org/jira/browse/SPARK-16751 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.2, 2.0.0 > Environment: All platforms and major Spark releases >Reporter: Adam Roberts >Priority: Minor > > This JIRA is to upgrade the derby version from 10.11.1.1 to 10.12.1.1 > Sean and I figured that we only use derby for tests and so the initial pull > request was to not include it in the jars folder for Spark. I now believe it > is required based on comments for the pull request and so this is only a > dependency upgrade. > The upgrade is due to an already disclosed vulnerability (CVE-2015-1832) in > derby 10.11.1.1. We used https://www.versioneye.com/search and will be > checking for any other problems in a variety of libraries too: investigating > if we can set up a Jenkins job to check our pom on a regular basis so we can > stay ahead of the game for matters like this. > This was raised on the mailing list at > http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-2-0-0-RC5-tp18367p18465.html > by Stephen Hellberg and replied to by Sean Owen. > I've checked the impact to previous Spark releases and this particular > version of derby is the only relatively recent and without vulnerabilities > version (I checked up to the 1.3 branch) so ideally we'd backport this for > all impacted Spark releases. 
> I've marked this as critical and ticked the important checkbox as it's going > to impact every user, there isn't a security component (should we add one?) > and hence the build tag.
[jira] [Resolved] (SPARK-16763) Factor out Tungsten for General Purpose Use
[ https://issues.apache.org/jira/browse/SPARK-16763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16763. --- Resolution: Invalid I'm going to close this because it's a question for the user@ list. However I don't think Tungsten is separable from Spark. But bits may be; see the 'unsafe' module for example. It's already separated. > Factor out Tungsten for General Purpose Use > -- > > Key: SPARK-16763 > URL: https://issues.apache.org/jira/browse/SPARK-16763 > Project: Spark > Issue Type: Improvement >Reporter: Suminda Dharmasena > > Can you factor your Tungsten to be used as a general-purpose library
[jira] [Updated] (SPARK-16772) Correct API doc references to PySpark classes + formatting fixes
[ https://issues.apache.org/jira/browse/SPARK-16772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-16772: - Summary: Correct API doc references to PySpark classes + formatting fixes (was: Correct API doc references to PySpark classes) > Correct API doc references to PySpark classes + formatting fixes > > > Key: SPARK-16772 > URL: https://issues.apache.org/jira/browse/SPARK-16772 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Reporter: Nicholas Chammas >Priority: Trivial >
[jira] [Updated] (SPARK-16772) Correct API doc references to PySpark classes
[ https://issues.apache.org/jira/browse/SPARK-16772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-16772: - Summary: Correct API doc references to PySpark classes (was: Correct API doc references to DataType + other minor doc tweaks) > Correct API doc references to PySpark classes > - > > Key: SPARK-16772 > URL: https://issues.apache.org/jira/browse/SPARK-16772 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Reporter: Nicholas Chammas >Priority: Trivial >
[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions
[ https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397827#comment-15397827 ] Shivaram Venkataraman commented on SPARK-16611: --- 1. lapply: From an API perspective we can add an lapply that is implemented using dapply. i.e. dapply runs on each partition with input as a data.frame while lapply could operate on each row. Also dapply should not shuffle any data and should work the same as lapplyPartition in terms of execution. It would be more interesting to know if dapply is somehow worse in terms of functionality. Could you explain what the issue with useBroadcast is ? From what I can see the code path is same in RDD.R and DataFrame.R 2. getJRDD: So it looks like the function required here is a way to extract the java rdd object from the Dataframe ? In that case there is a `toRDD` function in Dataset.scala that we should be able to expose (I think it should just involve using `callJMethod(df@sdf, "toRDD")`). This can be made to return a jobj instead of a RDD object in R. 3. Thanks for explaining this -- I think we can expose cleanup.jobj as a public function to be used to register finalizers. This JIRA isnt about removing anything yet -- but short term we will need to remove / rename parts of the R api of RDDs to satisfy CRAN checks (See SPARK-16519). Longer term it would be great to have one code path that is well maintained, so knowing what doesn't work with dapply family of functions will be very useful. > Expose several hidden DataFrame/RDD functions > - > > Key: SPARK-16611 > URL: https://issues.apache.org/jira/browse/SPARK-16611 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Oscar D. 
Lara Yejas > > Expose the following functions: > - lapply or map > - lapplyPartition or mapPartition > - flatMap > - RDD > - toRDD > - getJRDD > - cleanup.jobj > cc: > [~javierluraschi] [~j...@rstudio.com] [~shivaram]
[jira] [Resolved] (SPARK-16740) joins.LongToUnsafeRowMap crashes with NegativeArraySizeException
[ https://issues.apache.org/jira/browse/SPARK-16740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16740. - Resolution: Fixed Assignee: Sylvain Zimmer Fix Version/s: 2.1.0 2.0.1 > joins.LongToUnsafeRowMap crashes with NegativeArraySizeException > > > Key: SPARK-16740 > URL: https://issues.apache.org/jira/browse/SPARK-16740 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, SQL >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer >Assignee: Sylvain Zimmer > Fix For: 2.0.1, 2.1.0 > > > Hello, > Here is a crash in Spark SQL joins, with a minimal reproducible test case. > Interestingly, it only seems to happen when reading Parquet data (I added a > {{crash = True}} variable to show it) > This is an {{left_outer}} example, but it also crashes with a regular > {{inner}} join. > {code} > import os > from pyspark import SparkContext > from pyspark.sql import types as SparkTypes > from pyspark.sql import SQLContext > sc = SparkContext() > sqlc = SQLContext(sc) > schema1 = SparkTypes.StructType([ > SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True) > ]) > schema2 = SparkTypes.StructType([ > SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True) > ]) > # Valid Long values (-9223372036854775808 < -5543241376386463808 , > 4661454128115150227 < 9223372036854775807) > data1 = [(4661454128115150227,), (-5543241376386463808,)] > data2 = [(650460285, )] > df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1) > df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2) > crash = True > if crash: > os.system("rm -rf /tmp/sparkbug") > df1.write.parquet("/tmp/sparkbug/vertex") > df2.write.parquet("/tmp/sparkbug/edge") > df1 = sqlc.read.load("/tmp/sparkbug/vertex") > df2 = sqlc.read.load("/tmp/sparkbug/edge") > result_df = df2.join(df1, on=(df1.id1 == df2.id2), how="left_outer") > # Should print [Row(id2=650460285, id1=None)] > print result_df.collect() > {code} > When ran with 
{{spark-submit}}, the final {{collect()}} call crashes with > this: > {code} > py4j.protocol.Py4JJavaError: An error occurred while calling > o61.collectToPython. > : org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194) > at > org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120) > at > org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:242) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153) > at > org.apache.spark.sql.execution.BatchedDataSourceScanExec.consume(ExistingRDD.scala:225) > at > org.apache.spark.sql.execution.BatchedDataSourceScanExec.doProduce(ExistingRDD.scala:328) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.BatchedDataSourceScanExec.produce(ExistingRDD.scala:225) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doProduce(BroadcastHashJoinExec.scala:77) > at >
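[Editorial note] The {{NegativeArraySizeException}} is consistent with a dense long-keyed map sizing an internal array from the span between the smallest and largest key: for the two Long values in the test case above, that span does not fit in a signed 64-bit long. This is a hedged reading of the symptom, not a confirmed diagnosis of {{LongToUnsafeRowMap}} internals. A minimal sketch simulating Java's wrapping 64-bit arithmetic in Python:

```python
# Sketch: simulate Java's wrapping signed 64-bit arithmetic to show how the
# key span between the two reported Long values turns negative.
def to_int64(x):
    """Wrap an arbitrary Python int into the signed 64-bit range, like a Java long."""
    x &= (1 << 64) - 1
    return x - (1 << 64) if x >= (1 << 63) else x

# The exact key values from the reproduction above.
min_key = -5543241376386463808
max_key = 4661454128115150227

# A dense map sized as (max - min + 1) would overflow a Java long here.
span = to_int64(max_key - min_key + 1)
print(span)  # negative, hence NegativeArraySizeException if used as an array size
```

Whether Spark actually sizes the map this way is an assumption inferred from the exception type; the key values themselves are taken verbatim from the reporter's test case.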
[jira] [Assigned] (SPARK-16772) Correct API doc references to DataType + other minor doc tweaks
[ https://issues.apache.org/jira/browse/SPARK-16772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16772: Assignee: (was: Apache Spark) > Correct API doc references to DataType + other minor doc tweaks > --- > > Key: SPARK-16772 > URL: https://issues.apache.org/jira/browse/SPARK-16772 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Reporter: Nicholas Chammas >Priority: Trivial > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16772) Correct API doc references to DataType + other minor doc tweaks
[ https://issues.apache.org/jira/browse/SPARK-16772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397796#comment-15397796 ] Apache Spark commented on SPARK-16772: -- User 'nchammas' has created a pull request for this issue: https://github.com/apache/spark/pull/14393 > Correct API doc references to DataType + other minor doc tweaks > --- > > Key: SPARK-16772 > URL: https://issues.apache.org/jira/browse/SPARK-16772 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Reporter: Nicholas Chammas >Priority: Trivial >
[jira] [Assigned] (SPARK-16772) Correct API doc references to DataType + other minor doc tweaks
[ https://issues.apache.org/jira/browse/SPARK-16772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16772: Assignee: Apache Spark > Correct API doc references to DataType + other minor doc tweaks > --- > > Key: SPARK-16772 > URL: https://issues.apache.org/jira/browse/SPARK-16772 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Reporter: Nicholas Chammas >Assignee: Apache Spark >Priority: Trivial >
[jira] [Created] (SPARK-16772) Correct API doc references to DataType + other minor doc tweaks
Nicholas Chammas created SPARK-16772: Summary: Correct API doc references to DataType + other minor doc tweaks Key: SPARK-16772 URL: https://issues.apache.org/jira/browse/SPARK-16772 Project: Spark Issue Type: Improvement Components: Documentation, PySpark Reporter: Nicholas Chammas Priority: Trivial
[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory
[ https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397654#comment-15397654 ] Gaurav Shah commented on SPARK-2984: [~dmaverick] Can you tell a little more on sequential import vs append ? didn't get that > FileNotFoundException on _temporary directory > - > > Key: SPARK-2984 > URL: https://issues.apache.org/jira/browse/SPARK-2984 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Ash >Assignee: Josh Rosen >Priority: Critical > Fix For: 1.3.0 > > > We've seen several stacktraces and threads on the user mailing list where > people are having issues with a {{FileNotFoundException}} stemming from an > HDFS path containing {{_temporary}}. > I ([~aash]) think this may be related to {{spark.speculation}}. I think the > error condition might manifest in this circumstance: > 1) task T starts on a executor E1 > 2) it takes a long time, so task T' is started on another executor E2 > 3) T finishes in E1 so moves its data from {{_temporary}} to the final > destination and deletes the {{_temporary}} directory during cleanup > 4) T' finishes in E2 and attempts to move its data from {{_temporary}}, but > those files no longer exist! exception > Some samples: > {noformat} > 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job > 140774430 ms.0 > java.io.FileNotFoundException: File > hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07 > does not exist. 
> at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654) > at > org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102) > at > org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712) > at > org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708) > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360) > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310) > at > org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136) > at > org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126) > at > org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841) > at > org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724) > at > org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643) > at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) > at scala.util.Try$.apply(Try.scala:161) > at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32) > at > org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > -- Chen Song at > http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFiles-file-not-found-exception-td10686.html > {noformat} > I am running a Spark Streaming job that uses saveAsTextFiles to save results > into hdfs files. However, it has an exception after 20 batches > result-140631234/_temporary/0/task_201407251119__m_03 does not > exist. > {noformat} > and > {noformat} > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): > No lease on /apps/data/vddil/real-time/checkpoint/temp: File does not exist. > Holder DFSClient_NONMAPREDUCE_327993456_13 does not have any open files. > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2946) > at >
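[Editorial note] The four-step speculation race described above can be sketched outside Spark entirely: two task attempts stage output under one shared {{_temporary}} directory, the first to finish commits and deletes the staging directory, and the second attempt's commit then fails. All paths and names below are illustrative, not Spark's actual committer code:

```python
import os
import shutil
import tempfile

out = tempfile.mkdtemp()
staging = os.path.join(out, "_temporary")
os.makedirs(staging)

# Steps 1-2: task T (executor E1) and speculative copy T' (executor E2)
# stage the same partition file under the shared _temporary directory.
with open(os.path.join(staging, "part-00000"), "w") as f:
    f.write("data")

# Step 3: T finishes first, moves its data to the final destination and
# deletes _temporary during cleanup.
os.rename(os.path.join(staging, "part-00000"), os.path.join(out, "part-00000"))
shutil.rmtree(staging)

# Step 4: T' finishes later and tries to commit the same staged file,
# which no longer exists -- the FileNotFoundException analogue.
try:
    os.rename(os.path.join(staging, "part-00000"),
              os.path.join(out, "part-00000"))
    speculative_failed = False
except FileNotFoundError:
    speculative_failed = True

print("speculative commit failed:", speculative_failed)
```

The committed output survives either way; only the losing speculative attempt sees the missing-file error, which matches the reports above where the job's data was written but the streaming batch failed.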
[jira] [Updated] (SPARK-16639) query fails if having condition contains grouping column
[ https://issues.apache.org/jira/browse/SPARK-16639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-16639: Component/s: SQL > query fails if having condition contains grouping column > > > Key: SPARK-16639 > URL: https://issues.apache.org/jira/browse/SPARK-16639 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Liang-Chi Hsieh > Fix For: 2.0.1, 2.1.0 > > > {code} > create table tbl(a int, b string); > select count(b) from tbl group by a + 1 having a + 1 = 2; > {code} > this will fail analysis
[jira] [Updated] (SPARK-16639) query fails if having condition contains grouping column
[ https://issues.apache.org/jira/browse/SPARK-16639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-16639: Fix Version/s: 2.0.1 > query fails if having condition contains grouping column > > > Key: SPARK-16639 > URL: https://issues.apache.org/jira/browse/SPARK-16639 > Project: Spark > Issue Type: Bug >Reporter: Wenchen Fan >Assignee: Liang-Chi Hsieh > Fix For: 2.0.1, 2.1.0 > > > {code} > create table tbl(a int, b string); > select count(b) from tbl group by a + 1 having a + 1 = 2; > {code} > this will fail analysis
[jira] [Updated] (SPARK-16639) query fails if having condition contains grouping column
[ https://issues.apache.org/jira/browse/SPARK-16639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-16639: Assignee: Liang-Chi Hsieh > query fails if having condition contains grouping column > > > Key: SPARK-16639 > URL: https://issues.apache.org/jira/browse/SPARK-16639 > Project: Spark > Issue Type: Bug >Reporter: Wenchen Fan >Assignee: Liang-Chi Hsieh > Fix For: 2.1.0 > > > {code} > create table tbl(a int, b string); > select count(b) from tbl group by a + 1 having a + 1 = 2; > {code} > this will fail analysis
[jira] [Resolved] (SPARK-16639) query fails if having condition contains grouping column
[ https://issues.apache.org/jira/browse/SPARK-16639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-16639. - Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14296 [https://github.com/apache/spark/pull/14296] > query fails if having condition contains grouping column > > > Key: SPARK-16639 > URL: https://issues.apache.org/jira/browse/SPARK-16639 > Project: Spark > Issue Type: Bug >Reporter: Wenchen Fan >Assignee: Liang-Chi Hsieh > Fix For: 2.1.0 > > > {code} > create table tbl(a int, b string); > select count(b) from tbl group by a + 1 having a + 1 = 2; > {code} > this will fail analysis
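[Editorial note] A common rewrite for queries like the one above is to alias the grouping expression in a subquery, so the filter matches on the alias instead of re-resolving {{a}} inside HAVING. The sketch below uses sqlite3 purely as a stand-in engine to show the rewrite's semantics; whether it sidesteps the analysis failure on the affected Spark versions is untested here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table tbl(a int, b text)")
conn.executemany("insert into tbl values (?, ?)",
                 [(1, "x"), (1, "y"), (2, "z")])

# Alias a + 1 once in a subquery, then filter on the alias rather than
# writing HAVING a + 1 = 2 against the raw column.
rows = conn.execute("""
    select cnt from (
        select a + 1 as grp, count(b) as cnt
        from tbl
        group by a + 1
    )
    where grp = 2
""").fetchall()
print(rows)  # [(2,)] -- the two rows with a = 1
```

The table name and sample data mirror the reproduction in the issue; the {{grp}}/{{cnt}} aliases are illustrative.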
[jira] [Updated] (SPARK-16771) Infinite recursion loop in org.apache.spark.sql.catalyst.trees.TreeNode when table name collides.
[ https://issues.apache.org/jira/browse/SPARK-16771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Furcy Pin updated SPARK-16771: -- Description: How to reproduce: In spark-sql on Hive {code} DROP TABLE IF EXISTS t1 ; CREATE TABLE test.t1(col1 string) ; WITH t1 AS ( SELECT col1 FROM t1 ) SELECT col1 FROM t1 LIMIT 2 ; {code} This make a nice StackOverflowError: {code} java.lang.StackOverflowError at scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:230) at scala.collection.TraversableLike$class.map(TraversableLike.scala:233) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:348) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:170) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:170) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$4.apply(QueryPlan.scala:175) at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:175) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:144) at org.apache.spark.sql.catalyst.analysis.Analyzer$CTESubstitution$$anonfun$substituteCTE$1.applyOrElse(Analyzer.scala:147) at org.apache.spark.sql.catalyst.analysis.Analyzer$CTESubstitution$$anonfun$substituteCTE$1.applyOrElse(Analyzer.scala:133) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$2.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$2.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) ... {code} This does not happen if I change the name of the CTE. 
I guess Catalyst gets caught in an infinite recursion loop because the CTE and the source table have the same name. was: How to reproduce: In spark-sql on Hive {code} DROP TABLE IF EXISTS t1 ; CREATE TABLE test.t1(col1 string) ; WITH t1 AS ( SELECT col1 FROM t1 ) SELECT col1 FROM t1 LIMIT 2 ; {code} This makes a nice StackOverflowError: {code} java.lang.StackOverflowError at scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:230) at scala.collection.TraversableLike$class.map(TraversableLike.scala:233) at scala.collection.immutable.List.map(List.scala:285) at
[jira] [Created] (SPARK-16771) Infinite recursion loop in org.apache.spark.sql.catalyst.trees.TreeNode when table name collides.
Furcy Pin created SPARK-16771: - Summary: Infinite recursion loop in org.apache.spark.sql.catalyst.trees.TreeNode when table name collides. Key: SPARK-16771 URL: https://issues.apache.org/jira/browse/SPARK-16771 Project: Spark Issue Type: Bug Affects Versions: 2.0.0, 1.6.2 Reporter: Furcy Pin How to reproduce: In spark-sql on Hive {code} DROP TABLE IF EXISTS t1 ; CREATE TABLE test.t1(col1 string) ; WITH t1 AS ( SELECT col1 FROM t1 ) SELECT col1 FROM t1 LIMIT 2 ; {code} This make a nice StackOverflowError: {code} java.lang.StackOverflowError at scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:230) at scala.collection.TraversableLike$class.map(TraversableLike.scala:233) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:348) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:170) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:170) at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$4.apply(QueryPlan.scala:175) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:175) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:144) at org.apache.spark.sql.catalyst.analysis.Analyzer$CTESubstitution$$anonfun$substituteCTE$1.applyOrElse(Analyzer.scala:147) at org.apache.spark.sql.catalyst.analysis.Analyzer$CTESubstitution$$anonfun$substituteCTE$1.applyOrElse(Analyzer.scala:133) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$2.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$2.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) {code} This does not happen if I change the name of the CTE. I guess Catalyst gets caught in an infinite recursion loop because the CTE and the source table have the same name.
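[Editorial note] As the reporter observes, renaming the CTE so it no longer shadows the source table avoids the recursion. The sketch below runs the renamed query against sqlite3 purely to show the intended semantics; the original failure is specific to Spark's {{CTESubstitution}} rule, and the {{t1_renamed}} name is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t1(col1 text)")
conn.executemany("insert into t1 values (?)", [("a",), ("b",), ("c",)])

# Same shape as the reproduction, but the CTE name no longer collides
# with the source table, so FROM t1 unambiguously means the base table.
rows = conn.execute("""
    with t1_renamed as (
        select col1 from t1
    )
    select col1 from t1_renamed limit 2
""").fetchall()
print(len(rows))  # 2
```

In the failing query, the substitution rule apparently re-resolves {{t1}} inside its own definition to the CTE itself, hence the unbounded transformDown recursion seen in the trace; that reading follows the reporter's own guess above.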