[ https://issues.apache.org/jira/browse/SPARK-7182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Don Drake updated SPARK-7182:
-----------------------------
Description:
I'm having trouble saving a DataFrame as Parquet after performing a simple table join. Below is a trivial example that demonstrates the issue. The following is from a pyspark session:

{code}
d1 = [{'a': 1, 'b': 2, 'c': 3}]
d2 = [{'a': 1, 'b': 2, 'd': 4}]
t1 = sqlContext.createDataFrame(d1)
t2 = sqlContext.createDataFrame(d2)
j = t1.join(t2, t1.a==t2.a and t1.b==t2.b)
>>> j
DataFrame[a: bigint, b: bigint, c: bigint, a: bigint, b: bigint, d: bigint]
{code}

Trying to get a unique list of the columns fails:

{code}
u = sorted(list(set(j.columns)))
>>> nt = j.select(*u)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/dataframe.py", line 586, in select
    jdf = self._jdf.select(self.sql_ctx._sc._jvm.PythonUtils.toSeq(jcols))
  File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o829.select.
: org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#0L, a#3L.;
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:229)
{code}

That didn't work, so save the file (which succeeds) and read it back in, which fails:

{code}
j.saveAsParquetFile('j')
>>> z = sqlContext.parquetFile('j')
>>> z.take(1)
...
: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 171 in stage 104.0 failed 1 times, most recent failure: Lost task 171.0 in stage 104.0 (TID 1235, localhost): parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/Users/drake/fd/spark/j/part-r-00172.parquet
	at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
{code}
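An editorial aside on the join condition in the session above: Python's `and` keyword cannot be overloaded, so `t1.a==t2.a and t1.b==t2.b` truth-tests the first Column object and simply returns one of the two operands unchanged, while pyspark overloads the `&` operator for combining Column predicates. A minimal pure-Python sketch of the difference, using a hypothetical stand-in class rather than a real pyspark Column (no Spark required):

```python
class Cond:
    """Hypothetical stand-in for the object a pyspark Column comparison returns."""
    def __init__(self, expr):
        self.expr = expr

    def __and__(self, other):
        # pyspark overloads `&` (not `and`) to build a conjunction of predicates
        return Cond(f"({self.expr} AND {other.expr})")

c1 = Cond("a#0L = a#3L")
c2 = Cond("b#1L = b#4L")

# `and` truth-tests c1 (truthy by default) and returns c2 unchanged,
# so only the second predicate ever reaches the join:
assert (c1 and c2) is c2

# `&` invokes __and__ and keeps both predicates:
assert (c1 & c2).expr == "(a#0L = a#3L AND b#1L = b#4L)"
```

This is orthogonal to the duplicate-column bug itself, but it means the join shown likely matched on `b` alone.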
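It may also be worth spelling out why the dedup attempt above cannot work even in principle: `j.columns` is a list of plain strings, so the set collapses the duplicate names, but the joined schema still carries two physical columns for each join key, and that is what makes the subsequent `select` ambiguous. A plain-Python illustration of the mismatch (no Spark required):

```python
from collections import Counter

# Column names as reported by j.columns after the join
cols = ['a', 'b', 'c', 'a', 'b', 'd']

# The dedup from the session above yields four distinct names...
u = sorted(set(cols))
assert u == ['a', 'b', 'c', 'd']

# ...but the schema itself still holds six columns, two of which
# share each join-key name, so select('a') remains ambiguous:
dupes = [name for name, n in Counter(cols).items() if n > 1]
assert sorted(dupes) == ['a', 'b']
```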
> [SQL] Can't remove or save DataFrame from a join due to duplicate columns
> -------------------------------------------------------------------------
>
>                 Key: SPARK-7182
>                 URL: https://issues.apache.org/jira/browse/SPARK-7182
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.1
>            Reporter: Don Drake
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org