Paul Pearce created SPARK-19381:
-----------------------------------

             Summary: spark 2.1.0 raises unrelated (unhelpful) error for parquet files beginning with '_'
                 Key: SPARK-19381
                 URL: https://issues.apache.org/jira/browse/SPARK-19381
             Project: Spark
          Issue Type: Bug
    Affects Versions: 2.1.0
            Reporter: Paul Pearce
            Priority: Minor


Under Spark 2.1.0, if you attempt to read a parquet file whose filename begins 
with '_', the error returned is:

"Unable to infer schema for Parquet. It must be specified manually."

The bug is not the inability to read the file, but rather that the error 
message is unrelated to the actual problem. The sessions below show the 
generation of parquet files under Spark 2.0.0 and the attempted reading of 
them under Spark 2.1.0.

Generation:
{code}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0.cloudera1
      /_/

Using Python version 2.7.6 (default, Oct 26 2016 20:30:19)
SparkSession available as 'spark'.

>>> from pyspark.sql import Row
>>> df = spark.createDataFrame(sc.parallelize(range(1, 6)).map(lambda i: Row(single=i, double=i ** 2)))
>>> df.write.parquet("debug.parquet")
>>> df.write.parquet("_debug.parquet")
{code}

Reading:
{code}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Python version 2.7.6 (default, Oct 26 2016 20:30:19)
SparkSession available as 'spark'.
>>> df = spark.read.parquet("debug.parquet")
>>> df = spark.read.parquet("_debug.parquet")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/apache/spark-2.1.0-bin-hadoop2.6/python/pyspark/sql/readwriter.py", line 274, in parquet
    return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
  File "/opt/apache/spark-2.1.0-bin-hadoop2.6/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/opt/apache/spark-2.1.0-bin-hadoop2.6/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
{code}

I only realized the source of the problem after reading 
https://issues.apache.org/jira/browse/SPARK-16975, which describes a similar 
problem but with column names.
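
To illustrate the kind of message that would have pointed at the real cause, here is a minimal sketch of a client-side pre-check. It assumes that names beginning with '_' or '.' are treated as hidden during file listing (the usual Hadoop convention). `last_component`, `is_hidden_name` and `check_parquet_path` are hypothetical helpers written for this example, not part of Spark or PySpark.

```python
def last_component(path):
    """Return the final component of a path or URI, ignoring trailing slashes."""
    return path.rstrip("/").rsplit("/", 1)[-1]


def is_hidden_name(path):
    """True if the final path component starts with '_' or '.'.

    Assumption: such names are skipped when input files are listed,
    following the common Hadoop hidden-file convention.
    """
    name = last_component(path)
    # '.' and '..' are directory references, not hidden names.
    if name in (".", ".."):
        return False
    return name.startswith(("_", "."))


def check_parquet_path(path):
    """Return `path` unchanged, or raise a descriptive error for hidden names."""
    if is_hidden_name(path):
        raise ValueError(
            "Path {0!r} ends in {1!r}, which starts with '_' or '.'; "
            "such names are likely skipped as hidden when files are listed, "
            "so no data files would be found.".format(path, last_component(path))
        )
    return path
```

With this in place, `spark.read.parquet(check_parquet_path("_debug.parquet"))` would fail before reaching Spark, with a message that names the leading underscore rather than schema inference.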


