[GitHub] spark pull request: [SPARK-2119][SQL] Improved Parquet performance...

liancheng Tue, 15 Jul 2014 20:00:17 -0700

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1370#discussion_r14979405
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
    @@ -365,20 +366,23 @@ private[parquet] object ParquetTypesConverter extends Logging {
             s"Expected $path for be a directory with Parquet files/metadata")
         }
         ParquetRelation.enableLogForwarding()
    -    val metadataPath = new Path(path, ParquetFileWriter.PARQUET_METADATA_FILE)
    -    // if this is a new table that was just created we will find only the metadata file
    -    if (fs.exists(metadataPath) && fs.isFile(metadataPath)) {
    -      ParquetFileReader.readFooter(conf, metadataPath)
    -    } else {
    -      // there may be one or more Parquet files in the given directory
    -      val footers = ParquetFileReader.readFooters(conf, fs.getFileStatus(path))
    -      // TODO: for now we assume that all footers (if there is more than one) have identical
    -      // metadata; we may want to add a check here at some point
    -      if (footers.size() == 0) {
    -        throw new IllegalArgumentException(s"Could not find Parquet metadata at path $path")
    -      }
    -      footers(0).getParquetMetadata
    +
    +    val children = fs.listStatus(path).filterNot {
    +      _.getPath.getName == FileOutputCommitter.SUCCEEDED_FILE_NAME
         }
    +
    +    // NOTE (lian): Parquet "_metadata" file can be very slow if the file consists of lots of row
    +    // groups. Since Parquet schema is replicated among all row groups, we only need to touch a
    --- End diff --
    
    Yes, we are making this assumption; I will add a comment here. (Checking schema consistency could also be inefficient for a large Parquet file with lots of row groups.)
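
For illustration only, the filtering step in the added lines of the diff can be sketched without Hadoop on the classpath. This is a minimal sketch, not the code in the pull request: the object `FooterCandidates` and its methods are hypothetical names, file names are passed in as plain strings instead of being listed through `fs.listStatus`, and the marker name is hard-coded as `_SUCCESS`, which is the value of `FileOutputCommitter.SUCCEEDED_FILE_NAME`. Which of the remaining files the real code reads is not visible in the quoted diff, so the sketch simply takes the first one and relies on the assumption discussed above that all footers carry the same schema.

```scala
object FooterCandidates {
  // Assumption: same value as Hadoop's FileOutputCommitter.SUCCEEDED_FILE_NAME.
  val SuccessMarker = "_SUCCESS"

  // Mirrors the filterNot in the diff: drop the job-success marker,
  // keep every other child of the table directory.
  def candidates(childNames: Seq[String]): Seq[String] =
    childNames.filterNot(_ == SuccessMarker)

  // Hypothetical helper: take any one remaining file, assuming every footer
  // has identical schema metadata; fail the same way the removed code did
  // when nothing is left to read.
  def firstCandidate(childNames: Seq[String], path: String): String =
    candidates(childNames).headOption.getOrElse(
      throw new IllegalArgumentException(
        s"Could not find Parquet metadata at path $path"))
}
```

Reading a single file avoids comparing schemas across every footer, which is the cost the comment above is concerned about; the trade-off is that an inconsistent schema in another file would go unnoticed.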



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
