[ https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16088921#comment-16088921 ]

Hyukjin Kwon commented on SPARK-21392:
--------------------------------------

Thanks for the investigation and details. I tried to reproduce this as below:

{code}
response = "mi_or_chd_5"
data = [[226, None], [442, None], [978, 0], [851, 0], [428, 0]]
spark.createDataFrame(data, "eid: int, mi_or_chd_5: short").createOrReplaceTempView("outcomes")

df = sql("SELECT eid,mi_or_chd_5 FROM outcomes")
df.write.parquet(response, mode="overwrite")
spark.read.parquet(response).show()
{code}

but I couldn't. Would you mind sharing the output files from {{.write.parquet}}, and also writing the data out via {{.write.csv}} and checking those output files with {{cat}}?
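To make that check concrete, here is a minimal sketch (plain Python, no Spark needed; the directory layout is synthetic and the helper names are illustrative) for inspecting what a write actually produced. A healthy Parquet output directory contains {{part-*}} files, and per the Parquet spec each file starts and ends with the 4-byte magic {{PAR1}}; a directory with only a {{_SUCCESS}} marker is one known trigger of "Unable to infer schema for Parquet":

```python
import os
import tempfile

def data_files(path):
    """List the part files schema inference would look at
    (ignores _SUCCESS markers and .crc checksums)."""
    return sorted(f for f in os.listdir(path)
                  if f.startswith("part-") and not f.endswith(".crc"))

def looks_like_parquet(path):
    """A Parquet file begins and ends with the 4-byte magic b'PAR1'."""
    with open(path, "rb") as fh:
        head = fh.read(4)
        fh.seek(-4, os.SEEK_END)
        tail = fh.read(4)
    return head == b"PAR1" and tail == b"PAR1"

# Demo on a synthetic directory that mimics an "empty" Spark output:
out = tempfile.mkdtemp()
open(os.path.join(out, "_SUCCESS"), "w").close()
print(data_files(out))  # -> [] : no Parquet footers, nothing to infer from
```

If {{data_files}} on the real output path is empty, or {{looks_like_parquet}} is false for a part file, that would explain the read failure independently of any Spark-side bug.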

Also, it would be helpful if you could remove the custom code parts.

> Unable to infer schema when loading large Parquet file
> ------------------------------------------------------
>
>                 Key: SPARK-21392
>                 URL: https://issues.apache.org/jira/browse/SPARK-21392
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.1.1, 2.2.0
>         Environment: Spark 2.1.1. python 2.7.6
>            Reporter: Stuart Reynolds
>              Labels: parquet, pyspark
>
> The following boring code works up until when I read in the parquet file.
> {code:none}
> response = "mi_or_chd_5"
> sc = get_spark_context() # custom
> sqlc = get_sparkSQLContextWithTables(sc, tables=["outcomes"]) # custom
> rdd = sqlc.sql("SELECT eid,mi_or_chd_5 FROM outcomes")
> print rdd.schema
> #>>    
> StructType(List(StructField(eid,IntegerType,true),StructField(mi_or_chd_5,ShortType,true)))
> rdd.show()
> #+---+-----------+
> #|eid|mi_or_chd_5|
> #+---+-----------+
> #|226|       null|
> #|442|       null|
> #|978|          0|
> #|851|          0|
> #|428|          0|
> #+---+-----------+
> rdd.write.parquet(response, mode="overwrite") # success!
> rdd2 = sqlc.read.parquet(response) # fail
> {code}
>     
> fails with:
> {code:none}
> AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
> {code}
> in
> {code:none}
> /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw)
> {code}
> The Parquet documentation says the format is self-describing, and the full schema was available when the parquet file was saved. What gives?
> The error doesn't happen if I add "limit 10" to the sql query. The whole 
> selected table is 500k rows with an int and short column.
> Seems related to https://issues.apache.org/jira/browse/SPARK-16975, which was reportedly fixed in 2.0.1 and 2.1.0. (The current bug is in 2.1.1.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
