[jira] [Commented] (SPARK-13966) Regression using .withColumn() on a parquet
[ https://issues.apache.org/jira/browse/SPARK-13966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230128#comment-15230128 ] Federico Ponzi commented on SPARK-13966:

Seems to be working now for me too. Thanks.

> Regression using .withColumn() on a parquet
> ---
>
> Key: SPARK-13966
> URL: https://issues.apache.org/jira/browse/SPARK-13966
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Environment: Linux 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt20-1+deb8u3 (2016-01-17) x86_64 GNU/Linux
> Reporter: Federico Ponzi
> Assignee: Davies Liu
> Priority: Critical
>
> If we load a parquet file, add a timestamp column with {{withColumn()}}, and then select a join of the table with itself, we get a {{java.util.NoSuchElementException: key not found: key#6}}.
> Here is a simple program to reproduce it:
> {code}
> from pyspark.sql import SQLContext, Row
> from pyspark import SparkContext
> from pyspark.sql.functions import from_unixtime, lit
>
> sc = SparkContext()
> sqlContext = SQLContext(sc)
> df = sqlContext.createDataFrame(sc.parallelize([Row(x=123)]))
> df.write.parquet("/tmp/testcase", mode="overwrite")
> df = sqlContext.read.parquet("/tmp/testcase")
> # df = df.unionAll(df.limit(0))  # WORKAROUND
> df = df.withColumn("key", from_unixtime(lit(1457650800)))  # also happens with a .cast("timestamp")
> df.registerTempTable("test")
> res = sqlContext.sql("SELECT COUNT(1) from test t1, test t2 where t1.key = t2.key")
> res.show()
> {code}
> This only occurs when the added column is of type timestamp, and it doesn't happen in Spark 1.6.x.
> {noformat}
> Traceback (most recent call last):
>   File "/tmp/bug.py", line 17, in <module>
>     res.show()
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 217, in show
>   File "/usr/local/spark/python/lib/py4j-0.9.2-src.zip/py4j/java_gateway.py", line 836, in __call__
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
>   File "/usr/local/spark/python/lib/py4j-0.9.2-src.zip/py4j/protocol.py", line 310, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o67.showString.
> : java.util.NoSuchElementException: key not found: key#6
> 	at scala.collection.MapLike$class.default(MapLike.scala:228)
> 	at org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:38)
> 	at scala.collection.MapLike$class.apply(MapLike.scala:141)
> 	at org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:38)
> 	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$35$$anonfun$apply$2.applyOrElse(DataSourceStrategy.scala:566)
> 	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$35$$anonfun$apply$2.applyOrElse(DataSourceStrategy.scala:565)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259)
> 	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:67)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:258)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:264)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:264)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:301)
> 	at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
> 	at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> 	at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
> 	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> 	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> 	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> 	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
> 	at scala.collection.AbstractIterator.to(Iterator.scala:1194)
> 	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
> 	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
> 	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
> 	at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:350)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:264)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.transform(Tr
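For reference, the {{unionAll}} workaround flagged in the repro script can be applied on its own; a minimal sketch, assuming the same {{/tmp/testcase}} parquet has already been written and the build is an affected 2.0.0 snapshot:

{code}
# Sketch of the WORKAROUND line from the repro: unioning the DataFrame with
# an empty slice of itself before adding the timestamp column avoided the
# NoSuchElementException on affected builds. Assumes /tmp/testcase exists.
from pyspark.sql import SQLContext
from pyspark import SparkContext
from pyspark.sql.functions import from_unixtime, lit

sc = SparkContext()
sqlContext = SQLContext(sc)
df = sqlContext.read.parquet("/tmp/testcase")
df = df.unionAll(df.limit(0))  # WORKAROUND from the description
df = df.withColumn("key", from_unixtime(lit(1457650800)))
df.registerTempTable("test")
# With the workaround in place the self-join plans and runs normally.
sqlContext.sql("SELECT COUNT(1) from test t1, test t2 where t1.key = t2.key").show()
{code}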
[jira] [Commented] (SPARK-13966) Regression using .withColumn() on a parquet
[ https://issues.apache.org/jira/browse/SPARK-13966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15229042#comment-15229042 ] Davies Liu commented on SPARK-13966:

I checked this on the latest master and it works. Could you check this again?
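For anyone re-verifying, a minimal re-check sketch matching the original repro with no workaround (the exact master revision being tested is not stated in this thread):

{code}
# Re-check sketch: same steps as the original repro. On a build containing
# the fix this should print a single row with count 1 (the one row joined
# with itself) instead of raising java.util.NoSuchElementException.
from pyspark.sql import SQLContext, Row
from pyspark import SparkContext
from pyspark.sql.functions import from_unixtime, lit

sc = SparkContext()
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame(sc.parallelize([Row(x=123)]))
df.write.parquet("/tmp/testcase", mode="overwrite")
df = sqlContext.read.parquet("/tmp/testcase")
df = df.withColumn("key", from_unixtime(lit(1457650800)))
df.registerTempTable("test")
sqlContext.sql("SELECT COUNT(1) from test t1, test t2 where t1.key = t2.key").show()
{code}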