Paul Pearce created SPARK-19381:
-----------------------------------

             Summary: spark 2.1.0 raises unrelated (unhelpful) error for parquet files beginning with '_'
                 Key: SPARK-19381
                 URL: https://issues.apache.org/jira/browse/SPARK-19381
             Project: Spark
          Issue Type: Bug
    Affects Versions: 2.1.0
            Reporter: Paul Pearce
            Priority: Minor

Under Spark 2.1.0, attempting to read a Parquet file whose name begins with '_' fails with the error "Unable to infer schema for Parquet. It must be specified manually."

The bug is not the inability to read the file, but rather that the error message is unrelated to the actual problem.

The sessions below show the generation of Parquet files under Spark 2.0.0 and the attempted reading of them under Spark 2.1.0.

Generation:
{code}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0.cloudera1
      /_/

Using Python version 2.7.6 (default, Oct 26 2016 20:30:19)
SparkSession available as 'spark'.
>>> from pyspark.sql import Row
>>> df = spark.createDataFrame(sc.parallelize(range(1, 6)).map(lambda i: Row(single=i, double=i ** 2)))
>>> df.write.parquet("debug.parquet")
>>> df.write.parquet("_debug.parquet")
{code}

Reading:
{code}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Python version 2.7.6 (default, Oct 26 2016 20:30:19)
SparkSession available as 'spark'.
>>> df = spark.read.parquet("debug.parquet")
>>> df = spark.read.parquet("_debug.parquet")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/apache/spark-2.1.0-bin-hadoop2.6/python/pyspark/sql/readwriter.py", line 274, in parquet
    return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
  File "/opt/apache/spark-2.1.0-bin-hadoop2.6/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/opt/apache/spark-2.1.0-bin-hadoop2.6/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
{code}

I only realized the source of the problem when reading https://issues.apache.org/jira/browse/SPARK-16975, which describes a similar problem but with column names.
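
Until the error message is improved, a caller could guard against this case on the client side. The following is only a minimal sketch, not part of Spark: `hidden_paths` and `check_readable` are hypothetical helper names, and the rule that names beginning with '_' or '.' are skipped as hidden files is an assumption based on the behavior reported here. It checks only the final path component.

```python
import os

# Assumption: Spark's file listing skips names starting with "_" or ".",
# which matches the behavior observed above for "_debug.parquet".
HIDDEN_PREFIXES = ("_", ".")


def hidden_paths(paths):
    """Return the paths whose final component starts with a hidden prefix."""
    hidden = []
    for path in paths:
        # Strip trailing separators so "dir/_data/" is checked as "_data".
        name = os.path.basename(path.rstrip("/"))
        if name.startswith(HIDDEN_PREFIXES):
            hidden.append(path)
    return hidden


def check_readable(paths):
    """Raise a descriptive error before the paths are handed to Spark."""
    bad = hidden_paths(paths)
    if bad:
        raise ValueError(
            "Paths beginning with '_' or '.' may be skipped as hidden files: "
            + ", ".join(bad)
        )
    return list(paths)
```

With a live session this could be used as `spark.read.parquet(*check_readable(["_debug.parquet"]))`, which would fail early with a message naming the offending path instead of the schema inference error.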