Re: sqlContext.parquetFile(path) fails if path is a file but succeeds if a directory

2014-08-19 Thread chutium
It is definitely a bug: sqlContext.parquetFile should accept either a
directory or a single file as its parameter.

The if-check for isDir makes no sense after this commit:
https://github.com/apache/spark/pull/1370/files#r14967550

I opened a ticket for this issue:
https://issues.apache.org/jira/browse/SPARK-3138

The ticket shows how to reproduce the bug.







sqlContext.parquetFile(path) fails if path is a file but succeeds if a directory

2014-08-18 Thread Fengyun RAO
I'm using CDH 5.1 with Spark 1.0.

When I try to run Spark SQL following the Programming Guide:

val parquetFile = sqlContext.parquetFile(path)

If the path is a file, it throws an exception:

 Exception in thread "main" java.lang.IllegalArgumentException: Expected hdfs://*/file.parquet for be a directory with Parquet files/metadata
     at org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetRelation.scala:301)
     at org.apache.spark.sql.parquet.ParquetRelation.parquetSchema(ParquetRelation.scala:62)
     at org.apache.spark.sql.parquet.ParquetRelation.<init>(ParquetRelation.scala:69)
     at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:98)

However, if the path is the parent directory of the file, it succeeds.
Note: there is only one file in that directory.
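
The workaround described above (passing the parent directory instead of the file) can be sketched as a small helper. This is a minimal sketch, not part of Spark: `parquetDirFor` is a hypothetical name, it assumes the Hadoop FileSystem API is on the classpath, and it only makes sense when the parent directory holds nothing but the Parquet data you intend to read, because the whole directory is then handed to parquetFile.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper (not a Spark API): returns the path itself when it is
// already a directory, otherwise the parent directory of the file.
def parquetDirFor(pathString: String,
                  conf: Configuration = new Configuration()): Path = {
  val path = new Path(pathString)
  val fs = path.getFileSystem(conf)
  // isDir is the same (deprecated) call used by the check quoted in this thread.
  if (fs.getFileStatus(path).isDir) path else path.getParent
}
```

It would be used as `sqlContext.parquetFile(parquetDirFor(path).toString)`, which avoids the directory check at the cost of reading every Parquet file in that directory.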

I looked into the source:

  /**
   * Try to read Parquet metadata at the given Path. We first see if there is a summary file
   * in the parent directory. If so, this is used. Else we read the actual footer at the given
   * location.
   * @param origPath The path at which we expect one (or more) Parquet files.
   * @return The `ParquetMetadata` containing among other things the schema.
   */
  def readMetaData(origPath: Path): ParquetMetadata

The documentation doesn't require a directory, but the code throws an exception when the path is not one:

  if (!fs.getFileStatus(path).isDir) {
    throw new IllegalArgumentException(
      s"Expected $path for be a directory with Parquet files/metadata")
  }


This seems odd to me. Can anybody explain why, and how to read a single
file rather than a directory?