[ https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706875#comment-14706875 ]

Ryan Blue commented on SPARK-10143:
-----------------------------------

[~yhuai], you're right that the input format now delegates to FileInputFormat 
to calculate splits. The main goal of PARQUET-139 was to be able to calculate 
splits without information from the file footers, because reading all of the 
footers before submitting a job creates a performance problem.
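
For reference, here is a minimal sketch of that footer-free split math, assuming the 
standard FileInputFormat rule splitSize = max(minSize, min(maxSize, blockSize)); the 
names below are illustrative, not the actual parquet-mr code:

{code}
// Illustrative sketch only: mirrors FileInputFormat's split-size rule,
// not the actual parquet-mr implementation.
object SplitSketch {
  // FileInputFormat: splitSize = max(minSize, min(maxSize, blockSize))
  def computeSplitSize(minSize: Long, maxSize: Long, blockSize: Long): Long =
    math.max(minSize, math.min(maxSize, blockSize))

  def main(args: Array[String]): Unit = {
    val blockSize    = 128L * 1024 * 1024   // HDFS block size: 128MB
    val rowGroupSize = 512L * 1024 * 1024   // misconfigured writer: 512MB row groups
    val splitSize    = computeSplitSize(minSize = 1L, maxSize = Long.MaxValue, blockSize)

    // With 512MB row groups and 128MB splits, only 1 of every 4 splits covers
    // a row-group start; the other 3 become empty tasks at read time.
    println(s"splitSize=$splitSize, splits per row group=${rowGroupSize / splitSize}")
  }
}
{code}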

What you're suggesting, ensuring that the min split size is at least the row group 
size, would require reading the footers, so you would end up trading a minor problem 
(empty tasks) that affects only badly-written files for a big problem (reading all 
footers) that affects every file. I suggest not taking any action here and instead 
recommending that users rewrite the data. The row group size should never be larger 
than the HDFS block size.
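
For example, a rewrite along these lines keeps each row group within a 128MB HDFS 
block. This is only a sketch: the paths are placeholders, and it assumes the writer 
picks up the standard parquet.block.size property from the Hadoop configuration.

{code}
import org.apache.spark.sql.SQLContext

// Sketch: rewrite the data so each row group fits in one HDFS block.
// Paths are placeholders; parquet.block.size is the parquet-mr row group size.
val sqlContext = new SQLContext(sc)
sc.hadoopConfiguration.setInt("parquet.block.size", 128 * 1024 * 1024) // <= dfs.blocksize

sqlContext.read.parquet("/data/bad_table")
  .repartition(200)                        // optional: controls output file count
  .write.parquet("/data/rewritten_table")
{code}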

PARQUET-308 actually bumps the block size up to the row group size if it is 
smaller. It also adds padding to avoid row groups that span HDFS blocks, though 
you should set the max padding size to something reasonable, like 8MB, to use it.
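
With a parquet-mr release that includes PARQUET-308, the padding cap is a writer 
setting. A sketch, assuming the property name is parquet.writer.max-padding (verify 
it against the ParquetOutputFormat in your release):

{code}
// Sketch: cap the padding the writer may insert so row groups do not span
// HDFS blocks. The property name is assumed; check ParquetOutputFormat.
sc.hadoopConfiguration.setInt("parquet.writer.max-padding", 8 * 1024 * 1024) // 8MB
{code}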

> Parquet changed the behavior of calculating splits
> --------------------------------------------------
>
>                 Key: SPARK-10143
>                 URL: https://issues.apache.org/jira/browse/SPARK-10143
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Yin Huai
>            Priority: Critical
>
> When Parquet's task-side metadata is enabled (it is enabled by default, and it 
> needs to be enabled to deal with tables with many files), Parquet delegates 
> the work of calculating the initial splits to FileInputFormat (see 
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311).
> If the filesystem's block size is smaller than the row group size and users do 
> not set a min split size, the initial split list will contain lots of dummy 
> splits that produce empty tasks (because the range covered by such a split 
> does not contain the starting point of any row group). 


