[ https://issues.apache.org/jira/browse/SPARK-19381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-19381.
----------------------------------
    Resolution: Cannot Reproduce

{code}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0-SNAPSHOT
      /_/

Using Python version 2.7.10 (default, Jul 30 2016 19:40:32)
SparkSession available as 'spark'.
>>> from pyspark.sql import Row
>>> df = spark.createDataFrame(sc.parallelize(range(1, 6)).map(lambda i: Row(single=i, double=i ** 2)))
>>> df.write.parquet("debug.parquet")
>>> df.write.parquet("_debug.parquet")
>>> df = spark.read.parquet("debug.parquet")
>>> df = spark.read.parquet("_debug.parquet")
{code}

This seems fixed in the current master. I am resolving this because the issue cannot be reproduced in the current master as reported. It would be nice if someone could identify the change that fixed this and backport it if applicable.

> spark 2.1.0 raises unrelated (unhelpful) error for parquet filenames beginning with '_'
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-19381
>                 URL: https://issues.apache.org/jira/browse/SPARK-19381
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.1.0
>            Reporter: Paul Pearce
>            Priority: Minor
>
> Under Spark 2.1.0, if you attempt to read a parquet file whose filename begins with '_', the error returned is:
> "Unable to infer schema for Parquet. It must be specified manually."
> The bug is not the inability to read the file, but rather that the error is unrelated to the actual problem. Below is the generation of the parquet files under Spark 2.0.0 and the attempted reading of them under Spark 2.1.0.
> Generation:
> {code}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/ '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 2.0.0.cloudera1
>       /_/
>
> Using Python version 2.7.6 (default, Oct 26 2016 20:30:19)
> SparkSession available as 'spark'.
> >>> from pyspark.sql import Row
> >>> df = spark.createDataFrame(sc.parallelize(range(1, 6)).map(lambda i: Row(single=i, double=i ** 2)))
> >>> df.write.parquet("debug.parquet")
> >>> df.write.parquet("_debug.parquet")
> {code}
> Reading:
> {code}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/ '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
>       /_/
>
> Using Python version 2.7.6 (default, Oct 26 2016 20:30:19)
> SparkSession available as 'spark'.
> >>> df = spark.read.parquet("debug.parquet")
> >>> df = spark.read.parquet("_debug.parquet")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/opt/apache/spark-2.1.0-bin-hadoop2.6/python/pyspark/sql/readwriter.py", line 274, in parquet
>     return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
>   File "/opt/apache/spark-2.1.0-bin-hadoop2.6/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
>   File "/opt/apache/spark-2.1.0-bin-hadoop2.6/python/pyspark/sql/utils.py", line 69, in deco
>     raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
> {code}
> I only realized the source of the problem when reading issue https://issues.apache.org/jira/browse/SPARK-16975, which describes a similar problem with column names.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
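For context on why the error mentions schema inference rather than the filename: Spark follows the Hadoop convention of treating paths whose final component begins with '_' or '.' as hidden metadata files (such as `_SUCCESS` or `_metadata`) and filters them out when listing input files, so schema inference then sees no data files at all. The sketch below illustrates that filtering convention in plain Python; the helper names are hypothetical and this is not Spark's actual implementation.

```python
# Sketch of the Hadoop/Spark convention: paths whose last component
# starts with '_' or '.' are treated as hidden metadata (e.g. _SUCCESS,
# _metadata) and skipped when listing input files.
# Helper names are hypothetical illustrations, not Spark internals.

def is_hidden(path: str) -> bool:
    """Return True if the final path component starts with '_' or '.'."""
    name = path.rstrip("/").rsplit("/", 1)[-1]
    return name.startswith("_") or name.startswith(".")

def visible_data_files(paths):
    """Keep only non-hidden paths, mimicking the file listing Spark scans."""
    return [p for p in paths if not is_hidden(p)]

files = ["debug.parquet", "_debug.parquet", "_SUCCESS", ".part-tmp",
         "part-00000.parquet"]
print(visible_data_files(files))  # ['debug.parquet', 'part-00000.parquet']
```

Because `_debug.parquet` is dropped at listing time under this convention, the reader is left with zero files and reports "Unable to infer schema for Parquet" instead of pointing at the leading underscore; writing output to a name that does not start with '_' avoids the problem.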