Github user chutium commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1959#discussion_r16530668

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
    @@ -373,9 +373,11 @@ private[parquet] object ParquetTypesConverter extends Logging {
         }
         ParquetRelation.enableLogForwarding()
    +    // NOTE: Explicitly list "_temporary" because hadoop 0.23 removed the variable TEMP_DIR_NAME
    +    // from FileOutputCommitter. Check MAPREDUCE-5229 for the detail.
         val children = fs.listStatus(path).filterNot { status =>
           val name = status.getPath.getName
    -      name(0) == '.' || name == FileOutputCommitter.SUCCEEDED_FILE_NAME
    +      name(0) == '.' || name == FileOutputCommitter.SUCCEEDED_FILE_NAME || name == "_temporary"
         }

         // NOTE (lian): Parquet "_metadata" file can be very slow if the file consists of lots of row
    --- End diff --

    hmm, a better solution for all of this could be: drop the ```val children = fs.listStatus(path)...``` listing entirely, and instead do:

    ```
    val metafile = fs.listStatus(path).find(_.getPath.getName == ParquetFileWriter.PARQUET_METADATA_FILE)
    val datafile = fs.listStatus(path).find(status => isNotHiddenFile(status.getPath.getName))
    ```

    where ```isNotHiddenFile``` simply checks ```name(0) != '.' && name(0) != '_'```. Then something like:

    ```
    datafile.orElse(metafile) match {
      case Some(file) => ParquetFileReader.readFooter(conf, file.getPath)
      case None => throw new IllegalArgumentException(s"No Parquet data or metadata file found under $path")
    }
    ```

    moreover, @liancheng, after carefully reading the following comments, I finally understand what you mean by "complete Parquet file on HDFS should be directory":

    https://github.com/apache/spark/pull/2044#issuecomment-52733594

    you mean the whole directory is "a single Parquet file", and the files inside it are the "data"? but such a definition is really very confusing... are you sure about this definition?
    I just googled, but found nothing, only statements like "Parquet files are self-describing so the schema is preserved". So, since they are self-describing, in my mind each "data file" in a Parquet file folder is also a valid Parquet-format file on its own, and it should also be usable as an input source for a Parquet reader such as our Spark SQLContext...
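    The "prefer a data file, fall back to `_metadata`" selection suggested above can be sketched in isolation. This is a hypothetical illustration, not the actual Spark code: it models the listing as plain file names instead of Hadoop `FileStatus` objects, and `FooterSelection`/`footerCandidate` are names invented here for the sketch.

    ```scala
    object FooterSelection {
      // A file is hidden (and never a footer candidate) if its name starts with
      // '.' or '_', which covers "_SUCCESS", "_temporary" and "_metadata".
      def isNotHiddenFile(name: String): Boolean =
        !name.startsWith(".") && !name.startsWith("_")

      // Pick the file whose footer should be read: any visible data file first,
      // otherwise the "_metadata" summary file if present.
      def footerCandidate(names: Seq[String]): Option[String] = {
        val metafile = names.find(_ == "_metadata")
        val datafile = names.find(isNotHiddenFile)
        datafile.orElse(metafile)
      }

      def main(args: Array[String]): Unit = {
        val listing = Seq("_SUCCESS", "_temporary", "_metadata", "part-r-00001.parquet")
        println(footerCandidate(listing).get)          // part-r-00001.parquet
        println(footerCandidate(Seq("_metadata")).get) // _metadata
      }
    }
    ```

    With this shape, the fallback is a single `orElse` instead of a null check, and an empty directory naturally yields `None` rather than a NullPointerException.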