[ 
https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707067#comment-14707067
 ] 

Yin Huai commented on SPARK-10143:
----------------------------------

[~rdblue] Thank you for the detailed info! One thing I did not explain 
clearly: for now, I will use {{ParquetOutputFormat.getLongBlockSize}} to read 
the block size setting from the conf, which avoids touching Parquet footers. 
What do you think? 

The main motivation for this workaround is the native S3 file system. There, 
we only have the concept of an HDFS block size; we do not actually break a 
file larger than this block size into multiple physical S3 files. So a large 
Parquet file in S3 is still a single S3 object. Since bumping up the default 
block size may affect the parallelism of workloads using other file formats, 
I am bumping the min split size only for the Parquet input format.


> Parquet changed the behavior of calculating splits
> --------------------------------------------------
>
>                 Key: SPARK-10143
>                 URL: https://issues.apache.org/jira/browse/SPARK-10143
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Yin Huai
>            Priority: Critical
>
> When Parquet's task-side metadata is enabled (it is enabled by default, and 
> it needs to be enabled to deal with tables with many files), Parquet 
> delegates the work of calculating initial splits to FileInputFormat (see 
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311).
> If the filesystem's block size is smaller than the row group size and users 
> do not set a min split size, the initial split list will contain lots of 
> dummy splits, which result in empty tasks (because the range between the 
> starting point and ending point of such a split does not cover the starting 
> point of any row group). 
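
The dummy-split effect described in the issue can be sketched with a small simulation. This is a simplified model, not Parquet's actual code: it cuts a file into fixed-size splits, ignores FileInputFormat's handling of the last split, and treats a split as empty when no row group starts inside it, following the wording of the description. The class name, file length, and row group offsets are hypothetical.

```java
// Hypothetical simulation of the empty-split problem. All sizes and offsets
// are made-up illustration values, not taken from a real Parquet file.
class EmptySplitSketch {

    // Counts splits whose byte range [start, end) contains no row group start.
    static int countEmptySplits(long fileLength, long splitSize, long[] rowGroupStarts) {
        int empty = 0;
        for (long start = 0; start < fileLength; start += splitSize) {
            long end = Math.min(start + splitSize, fileLength);
            boolean hasRowGroup = false;
            for (long rg : rowGroupStarts) {
                if (rg >= start && rg < end) {
                    hasRowGroup = true;
                    break;
                }
            }
            if (!hasRowGroup) {
                empty++;
            }
        }
        return empty;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileLength = 512 * mb;
        // Four row groups of 128 MB each (assumed layout).
        long[] rowGroupStarts = {0L, 128 * mb, 256 * mb, 384 * mb};

        // Split size smaller than the row group size: 16 splits, 12 of them empty.
        System.out.println(countEmptySplits(fileLength, 32 * mb, rowGroupStarts));

        // Split size equal to the row group size: 4 splits, none empty.
        System.out.println(countEmptySplits(fileLength, 128 * mb, rowGroupStarts));
    }
}
```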



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

