Don Drake created SPARK-7182:
--------------------------------

             Summary: [SQL] Can't remove or save DataFrame from a join due to duplicate columns
                 Key: SPARK-7182
                 URL: https://issues.apache.org/jira/browse/SPARK-7182
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.3.1
            Reporter: Don Drake

I'm having trouble saving a dataframe as parquet after performing a simple table join.

Below is a trivial example that demonstrates the issue.

The following is from a pyspark session:

{code}
d1=[{'a':1, 'b':2, 'c':3}]
d2=[{'a':1, 'b':2, 'd':4}]
t1 = sqlContext.createDataFrame(d1)
t2 = sqlContext.createDataFrame(d2)
j = t1.join(t2, t1.a==t2.a and t1.b==t2.b)
>>> j
DataFrame[a: bigint, b: bigint, c: bigint, a: bigint, b: bigint, d: bigint]

u = sorted(list(set(j.columns)))
>>> nt = j.select(*u)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/dataframe.py", line 586, in select
    jdf = self._jdf.select(self.sql_ctx._sc._jvm.PythonUtils.toSeq(jcols))
  File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o829.select.
: org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#0L, a#3L.;
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:229)

j.saveAsParquetFile('j')
>>> z = sqlContext.parquetFile('j')
>>> z.take(1)
...
: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 171 in stage 104.0 failed 1 times, most recent failure: Lost task 171.0 in stage 104.0 (TID 1235, localhost): parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/Users/drake/fd/spark/j/part-r-00172.parquet
        at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
{code}
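
To make the failure mode concrete without a Spark session, here is a minimal plain-Python sketch of what happens to the column list. The list is copied from the `j` output in the session; `duplicated_names` and `positional_aliases` are hypothetical helpers written for this illustration and are not part of PySpark. It only shows that `set(j.columns)` collapses the two `a` and two `b` columns into single names, and a bare name cannot tell the analyzer which side of the join is meant.

```python
from collections import Counter

# Column names as printed for the joined DataFrame `j` in the report.
joined_columns = ['a', 'b', 'c', 'a', 'b', 'd']


def duplicated_names(columns):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(columns)
    result = []
    for name in columns:
        if counts[name] > 1 and name not in result:
            result.append(name)
    return result


def positional_aliases(columns):
    """Hypothetical helper: suffix repeated names with their occurrence index.

    Assumes no original column is already named like 'a_1'.
    """
    seen = Counter()
    aliases = []
    for name in columns:
        occurrence = seen[name]
        aliases.append(name if occurrence == 0 else "%s_%d" % (name, occurrence))
        seen[name] += 1
    return aliases


# Same expression as in the report: the duplicates disappear from the list,
# but each remaining name still matches two columns of `j`.
u = sorted(set(joined_columns))
print(u)
print(duplicated_names(joined_columns))
print(positional_aliases(joined_columns))
```

One possible way to apply this is to rename the shared columns on one side before the join (for example with `withColumnRenamed`), so the joined DataFrame never carries duplicate names. That is a suggestion only and has not been verified against 1.3.1 here.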