[ 
https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart Reynolds updated SPARK-21392:
------------------------------------
    Description: 
The following boring code works up until when I read in the parquet file.

{code:none}
import numpy as np
import pandas as pd
import pyspark
from pyspark import SQLContext, SparkContext, SparkConf

print pyspark.__version__
sc = SparkContext(conf=SparkConf().setMaster('local'))
df = pd.DataFrame({"mi":np.arange(100), "eid":np.arange(100)})
print df
sqlc = SQLContext(sc)
df = sqlc.createDataFrame(df)
df = df.createOrReplaceTempView("outcomes")
rdd = sqlc.sql("SELECT eid,mi FROM outcomes limit 5")
print rdd.schema
rdd.show()
rdd.write.parquet("mi", mode="overwrite")
rdd2 = sqlc.read.parquet("mi")
{code}

{code:none}
# print pyspark.__version__
2.2.0

# print df
    eid  mi
0     0   0
1     1   1
2     2   2
3     3   3
---

[100 rows x 2 columns]

# print rdd.schema
StructType(List(StructField(eid,LongType,true),StructField(mi,LongType,true)))

# rdd.show()
+---+---+
|eid| mi|
+---+---+
|  0|  0|
|  1|  1|
|  2|  2|
|  3|  3|
|  4|  4|
+---+---+
{code}
    
fails with:

{code:none}
    rdd2 = sqlc.read.parquet("mixx")
  File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/readwriter.py", line 
291, in parquet
    return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
  File "/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.py", line 
1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/utils.py", line 69, 
in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It 
must be specified manually.;'
{code}

in 

{code:none} 
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
{code}

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

Works with master='local', but fails with my cluster is specified.


  was:
The following boring code works up until when I read in the parquet file.

{code:none}
response = "mi_or_chd_5"
sc = get_spark_context() # custom
sqlc = get_sparkSQLContextWithTables(sc, tables=["outcomes"]) # custom


rdd = sqlc.sql("SELECT eid,mi_or_chd_5 FROM outcomes")
print rdd.schema
#>>    
StructType(List(StructField(eid,IntegerType,true),StructField(mi_or_chd_5,ShortType,true)))
rdd.show()
#+-------+-----------+
#|eid|mi_or_chd_5|
#+-------+-----------+
#|226|       null|
#|442|       null|
#|978|          0|
#|851|          0|
#|428|          0|

rdd.write.parquet(response, mode="overwrite") # success!
rdd2 = sqlc.read.parquet(response) # fail
{code}
    
fails with:

{code:none}AnalysisException: u'Unable to infer schema for Parquet. It must be 
specified manually.;'
{code}

in 

{code:none} 
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
 in deco(*a, **kw)
{code}

The documentation for parquet says the format is self describing, and the full 
schema was available when the parquet file was saved. What gives?

The error doesn't happen if I add "limit 10" to the sql query. The whole 
selected table is 500k rows with an int and short column.

Seems related to: https://issues.apache.org/jira/browse/SPARK-16975, but which 
claims it was fixed in 2.0.1, 2.1.0. (Current bug is 2.1.1)



> Unable to infer schema when loading large Parquet file
> ------------------------------------------------------
>
>                 Key: SPARK-21392
>                 URL: https://issues.apache.org/jira/browse/SPARK-21392
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.1.1, 2.2.0
>         Environment: Spark 2.1.1. python 2.7.6
>            Reporter: Stuart Reynolds
>              Labels: parquet, pyspark
>
> The following boring code works up until when I read in the parquet file.
> {code:none}
> import numpy as np
> import pandas as pd
> import pyspark
> from pyspark import SQLContext, SparkContext, SparkConf
> print pyspark.__version__
> sc = SparkContext(conf=SparkConf().setMaster('local'))
> df = pd.DataFrame({"mi":np.arange(100), "eid":np.arange(100)})
> print df
> sqlc = SQLContext(sc)
> df = sqlc.createDataFrame(df)
> df = df.createOrReplaceTempView("outcomes")
> rdd = sqlc.sql("SELECT eid,mi FROM outcomes limit 5")
> print rdd.schema
> rdd.show()
> rdd.write.parquet("mi", mode="overwrite")
> rdd2 = sqlc.read.parquet("mi")
> {code}
> {code:none}
> # print pyspark.__version__
> 2.2.0
> # print df
>     eid  mi
> 0     0   0
> 1     1   1
> 2     2   2
> 3     3   3
> ---
> [100 rows x 2 columns]
> # print rdd.schema
> StructType(List(StructField(eid,LongType,true),StructField(mi,LongType,true)))
> # rdd.show()
> +---+---+
> |eid| mi|
> +---+---+
> |  0|  0|
> |  1|  1|
> |  2|  2|
> |  3|  3|
> |  4|  4|
> +---+---+
> {code}
>     
> fails with:
> {code:none}
>     rdd2 = sqlc.read.parquet("mixx")
>   File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/readwriter.py", 
> line 291, in parquet
>     return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
>   File "/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.py", line 
> 1133, in __call__
>     answer, self.gateway_client, self.target_id, self.name)
>   File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/utils.py", line 
> 69, in deco
>     raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It 
> must be specified manually.;'
> {code}
> in 
> {code:none} 
> /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc
>  in deco(*a, **kw)
> {code}
> The documentation for parquet says the format is self describing, and the 
> full schema was available when the parquet file was saved. What gives?
> Works with master='local', but fails with my cluster is specified.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to