Don Drake created SPARK-7182:
--------------------------------

             Summary: [SQL] Can't remove or save DataFrame from a join due to duplicate columns
                 Key: SPARK-7182
                 URL: https://issues.apache.org/jira/browse/SPARK-7182
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.3.1
            Reporter: Don Drake



I'm having trouble saving a DataFrame as Parquet after performing a simple table join: the joined DataFrame contains duplicate column names, which makes the columns impossible to select or read back.

Below is a trivial example that demonstrates the issue.


The following is from a pyspark session:

{code}
d1=[{'a':1, 'b':2, 'c':3}]
d2=[{'a':1, 'b':2, 'd':4}]

t1 = sqlContext.createDataFrame(d1)
t2 = sqlContext.createDataFrame(d2)

j = t1.join(t2, t1.a==t2.a and t1.b==t2.b)

>>> j
DataFrame[a: bigint, b: bigint, c: bigint, a: bigint, b: bigint, d: bigint]



u = sorted(list(set(j.columns)))

>>> nt = j.select(*u)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/dataframe.py", line 586, in select
    jdf = self._jdf.select(self.sql_ctx._sc._jvm.PythonUtils.toSeq(jcols))
  File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o829.select.
: org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#0L, a#3L.;
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:229)

j.saveAsParquetFile('j')

>>> z = sqlContext.parquetFile('j')
>>> z.take(1)
...
: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 171 in stage 104.0 failed 1 times, most recent failure: Lost task 171.0 in stage 104.0 (TID 1235, localhost): parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/Users/drake/fd/spark/j/part-r-00172.parquet
        at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
{code}
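A possible workaround until the ambiguity is handled: rename the right-hand side's colliding columns before the join, so the joined schema never contains duplicates. A minimal sketch ({{dedupe_columns}} is a hypothetical helper, not part of the pyspark API):

{code}
def dedupe_columns(left_cols, right_cols, suffix='_r'):
    """Return right-side column names, suffixing any that collide with the left side.

    Hypothetical helper for illustration, not part of the pyspark API.
    """
    left = set(left_cols)
    return [c + suffix if c in left else c for c in right_cols]

# Applied to the example above (pyspark usage, sketched in comments):
#   t2r = t2
#   for old, new in zip(t2.columns, dedupe_columns(t1.columns, t2.columns)):
#       t2r = t2r.withColumnRenamed(old, new)
#   j = t1.join(t2r, (t1.a == t2r.a_r) & (t1.b == t2r.b_r))
# j.columns is now unambiguous, so select() and saveAsParquetFile() work.

print(dedupe_columns(['a', 'b', 'c'], ['a', 'b', 'd']))  # -> ['a_r', 'b_r', 'd']
{code}

Note also that the join condition in the original session uses Python's {{and}}, which does not combine two Column expressions into a compound condition; {{&}} with parenthesized operands is needed for that.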



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
