[ 
https://issues.apache.org/jira/browse/SPARK-19381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-19381.
----------------------------------
    Resolution: Cannot Reproduce

{code}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0-SNAPSHOT
      /_/

Using Python version 2.7.10 (default, Jul 30 2016 19:40:32)
SparkSession available as 'spark'.
>>> from pyspark.sql import Row
>>> df = spark.createDataFrame(sc.parallelize(range(1, 6)).map(lambda i: Row(single=i, double=i ** 2)))
>>> df.write.parquet("debug.parquet")
>>> df.write.parquet("_debug.parquet")
>>> df = spark.read.parquet("debug.parquet")
>>> df = spark.read.parquet("_debug.parquet")
{code}

This seems to be fixed in the current master. I am resolving this because it 
cannot be reproduced as reported in the current master. It would be nice if 
someone identifies the JIRA that fixed it and backports that fix if applicable.
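
For anyone still on 2.1.0, a small pre-check on the path can give a clearer 
message than the schema inference error. This is only a sketch: it assumes the 
cause is that Spark's file listing skips names beginning with '_' or '.' as 
hidden or metadata files, and it does not model exceptions such as partition 
directories. The helper names hidden_path_components and check_readable_path 
are hypothetical and not part of PySpark.

```python
def hidden_path_components(path):
    """Return the components of `path` that begin with '_' or '.'.

    Assumption: such names are treated as hidden by the file listing,
    so a path containing one may yield no data files at all.
    """
    # Drop empty pieces and the relative markers "." and "..".
    parts = [p for p in path.split("/") if p not in ("", ".", "..")]
    return [p for p in parts if p.startswith(("_", "."))]


def check_readable_path(path):
    """Raise a descriptive error before handing `path` to a reader."""
    hidden = hidden_path_components(path)
    if hidden:
        raise ValueError(
            "Path %r contains components that may be skipped as hidden: %s"
            % (path, ", ".join(hidden))
        )
    return path
```

In the shell this could be used as 
spark.read.parquet(check_readable_path("_debug.parquet")), which would fail 
early with a message naming "_debug.parquet" instead of the schema error.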

> spark 2.1.0 raises unrelated (unhelpful) error for parquet filenames 
> beginning with '_'
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-19381
>                 URL: https://issues.apache.org/jira/browse/SPARK-19381
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.1.0
>            Reporter: Paul Pearce
>            Priority: Minor
>
> Under spark 2.1.0 if you attempt to read a parquet file with filename 
> beginning with '_' the error returned is 
> "Unable to infer schema for Parquet. It must be specified manually."
> The bug is not the inability to read the file, rather that the error is 
> unrelated to the actual problem. Below shows the generation of parquet files 
> under spark 2.0.0 and the attempted reading of them under spark 2.1.0.
> Generation:
> {code}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 2.0.0.cloudera1
>       /_/
> Using Python version 2.7.6 (default, Oct 26 2016 20:30:19)
> SparkSession available as 'spark'.
> >>> from pyspark.sql import Row
> >>> df = spark.createDataFrame(sc.parallelize(range(1, 6)).map(lambda i: Row(single=i, double=i ** 2)))
> >>> df.write.parquet("debug.parquet")
> >>> df.write.parquet("_debug.parquet")
> {code}
> Reading:
> {code}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
>       /_/
> Using Python version 2.7.6 (default, Oct 26 2016 20:30:19)
> SparkSession available as 'spark'.
> >>> df = spark.read.parquet("debug.parquet")
> >>> df = spark.read.parquet("_debug.parquet")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/opt/apache/spark-2.1.0-bin-hadoop2.6/python/pyspark/sql/readwriter.py", line 274, in parquet
>     return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
>   File "/opt/apache/spark-2.1.0-bin-hadoop2.6/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
>   File "/opt/apache/spark-2.1.0-bin-hadoop2.6/python/pyspark/sql/utils.py", line 69, in deco
>     raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
> {code}
> I only realized the source of the problem when reading issue: 
> https://issues.apache.org/jira/browse/SPARK-16975 which describes a similar 
> problem but with column names.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
