[ https://issues.apache.org/jira/browse/SPARK-7182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Don Drake updated SPARK-7182:
-----------------------------
Description:
I'm having trouble saving a DataFrame as Parquet after performing a simple table join. Below is a trivial example that demonstrates the issue. The following is from a pyspark session:

{code}
d1 = [{'a': 1, 'b': 2, 'c': 3}]
d2 = [{'a': 1, 'b': 2, 'd': 4}]
t1 = sqlContext.createDataFrame(d1)
t2 = sqlContext.createDataFrame(d2)
j = t1.join(t2, t1.a==t2.a and t1.b==t2.b)
>>> j
DataFrame[a: bigint, b: bigint, c: bigint, a: bigint, b: bigint, d: bigint]
{code}

Trying to get a unique list of the columns fails:

{code}
u = sorted(list(set(j.columns)))
>>> nt = j.select(*u)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/dataframe.py", line 586, in select
    jdf = self._jdf.select(self.sql_ctx._sc._jvm.PythonUtils.toSeq(jcols))
  File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o829.select.
: org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#0L, a#3L.;
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:229)
{code}

That didn't work, so save the file (which succeeds) and read it back in, which fails:

{code}
j.saveAsParquetFile('j')
>>> z = sqlContext.parquetFile('j')
>>> z.take(1)
...
: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 171 in stage 104.0 failed 1 times, most recent failure: Lost task 171.0 in stage 104.0 (TID 1235, localhost): parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/Users/drake/fd/spark/j/part-r-00172.parquet
	at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
{code}
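An editorial aside on the join condition in the session above: Python's `and` keyword cannot be overloaded, so `t1.a==t2.a and t1.b==t2.b` truth-tests the first Column object and simply returns one of the two operands unchanged, while pyspark overloads the `&` operator for combining Column predicates. A minimal pure-Python sketch of the difference, using a hypothetical stand-in class rather than a real pyspark Column (no Spark required):

```python
class Cond:
    """Hypothetical stand-in for the object a pyspark Column comparison returns."""
    def __init__(self, expr):
        self.expr = expr

    def __and__(self, other):
        # pyspark overloads `&` (not `and`) to build a conjunction of predicates
        return Cond(f"({self.expr} AND {other.expr})")

c1 = Cond("a#0L = a#3L")
c2 = Cond("b#1L = b#4L")

# `and` truth-tests c1 (truthy by default) and returns c2 unchanged,
# so only the second predicate ever reaches the join:
assert (c1 and c2) is c2

# `&` invokes __and__ and keeps both predicates:
assert (c1 & c2).expr == "(a#0L = a#3L AND b#1L = b#4L)"
```

This is orthogonal to the duplicate-column bug itself, but it means the join shown likely matched on `b` alone.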
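It may also be worth spelling out why the dedup attempt above cannot work even in principle: `j.columns` is a list of plain strings, so the set collapses the duplicate names, but the joined schema still carries two physical columns for each join key, and that is what makes the subsequent `select` ambiguous. A plain-Python illustration of the mismatch (no Spark required):

```python
from collections import Counter

# Column names as reported by j.columns after the join
cols = ['a', 'b', 'c', 'a', 'b', 'd']

# The dedup from the session above yields four distinct names...
u = sorted(set(cols))
assert u == ['a', 'b', 'c', 'd']

# ...but the schema itself still holds six columns, two of which
# share each join-key name, so select('a') remains ambiguous:
dupes = [name for name, n in Counter(cols).items() if n > 1]
assert sorted(dupes) == ['a', 'b']
```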
> [SQL] Can't remove or save DataFrame from a join due to duplicate columns
> -------------------------------------------------------------------------
>
>                 Key: SPARK-7182
>                 URL: https://issues.apache.org/jira/browse/SPARK-7182
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.1
>            Reporter: Don Drake
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org