I'm using CDH 5.1 with Spark 1.0.
When I try to load a Parquet file with Spark SQL, following the Programming Guide:
val parquetFile = sqlContext.parquetFile(path)
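For context, here is my full setup (a minimal sketch; the app name and the HDFS path are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("ParquetRead"))
val sqlContext = new SQLContext(sc)

// path points at a single Parquet file on HDFS
val path = "hdfs://namenode/data/file.parquet"
val parquetFile = sqlContext.parquetFile(path)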
If the path is a file, it throws an exception:
Exception in thread "main" java.lang.IllegalArgumentException: Expected hdfs://*/file.parquet for be a directory with Parquet files/metadata
    at org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetRelation.scala:301)
    at org.apache.spark.sql.parquet.ParquetRelation.parquetSchema(ParquetRelation.scala:62)
    at org.apache.spark.sql.parquet.ParquetRelation.<init>(ParquetRelation.scala:69)
    at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:98)
However, if the path is the parent directory of the file, it succeeds.
Note: there is only one file in that directory.
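In other words, something along these lines succeeds (a sketch; the path is a placeholder, and the directory contains only that one Parquet file):

// Passing the directory that contains file.parquet works fine:
val records = sqlContext.parquetFile("hdfs://namenode/data")
records.registerAsTable("records")
sqlContext.sql("SELECT COUNT(*) FROM records").collect()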
I looked into the source:
/**
 * Try to read Parquet metadata at the given Path. We first see if there is a summary file
 * in the parent directory. If so, this is used. Else we read the actual footer at the given
 * location.
 * @param origPath The path at which we expect one (or more) Parquet files.
 * @return The `ParquetMetadata` containing among other things the schema.
 */
def readMetaData(origPath: Path): ParquetMetadata
The doc comment doesn't require a directory, yet the implementation throws an exception anyway:
if (!fs.getFileStatus(path).isDir) {
  throw new IllegalArgumentException(
    s"Expected $path for be a directory with Parquet files/metadata")
}
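The only workaround I've come up with is to resolve a file path to its parent directory myself before calling parquetFile (an untested sketch using Hadoop's FileSystem API; it assumes the parent directory holds nothing but that Parquet file):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// If the path is a plain file, fall back to its parent directory,
// since parquetFile only accepts directories.
def parquetDirOf(pathStr: String): String = {
  val path = new Path(pathStr)
  val fs = FileSystem.get(path.toUri, new Configuration())
  if (fs.getFileStatus(path).isDir) pathStr
  else path.getParent.toString
}

val data = sqlContext.parquetFile(parquetDirOf("hdfs://namenode/data/file.parquet"))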
This seems odd to me. Can anybody explain why, and whether there is a proper way to read a single file rather than a directory?