[ https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707148#comment-14707148 ]

Ryan Blue commented on SPARK-10143:
-----------------------------------

[~yhuai] if you do that, you will get the current value for the configuration, 
not what was used to write the file. If you want to know what the value was 
when the file was written, you have to read its footer.

As for solving the challenge of S3 input splits: if you're running on S3, 
why not split the files based on total length? Example:
* 2 files: 500 MB and 700 MB
* Want 5 reducers
* Splits: file 1: 0-250 MB, file 1: 250-500 MB, file 2: 0-250 MB, 
file 2: 250-500 MB, file 2: 500-700 MB

Even without knowing the block size, you can control parallelism. If there are 
lots of small blocks (say, a 64 MB block size), then you get approximately the 
parallelism you wanted. If there are big blocks (256 MB), then you are still 
okay. If you have gigantic blocks (500 MB), then you waste a couple of tasks 
but still get as much parallelism as possible.
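The length-based strategy above can be sketched in a few lines of plain Python 
(this is illustrative pseudocode, not Spark or Parquet API; `plan_splits` is a 
hypothetical name): carve each file into fixed-size byte ranges, ignoring the 
filesystem block size entirely.

```python
def plan_splits(file_lengths, split_size):
    """Return (file_index, start, end) byte ranges of at most split_size bytes."""
    splits = []
    for i, length in enumerate(file_lengths):
        start = 0
        while start < length:
            # The last range in a file may be shorter than split_size.
            end = min(start + split_size, length)
            splits.append((i, start, end))
            start = end
    return splits

# The example above: two files of 500 MB and 700 MB, with a split size of
# roughly total / desired reducers (250 MB here), giving exactly 5 splits.
MB = 1024 * 1024
print(plan_splits([500 * MB, 700 * MB], 250 * MB))
```

The split size would be chosen as approximately the total input length divided 
by the desired parallelism; rounding only changes how short the final range of 
each file is.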

> Parquet changed the behavior of calculating splits
> --------------------------------------------------
>
>                 Key: SPARK-10143
>                 URL: https://issues.apache.org/jira/browse/SPARK-10143
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Yin Huai
>            Priority: Critical
>
> When Parquet's task-side metadata is enabled (it is enabled by default, and 
> it needs to be enabled to deal with tables with many files), Parquet 
> delegates the work of calculating initial splits to FileInputFormat (see 
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311).
>  If the filesystem's block size is smaller than the row group size and users 
> do not set a minimum split size, the initial split list will contain lots of 
> dummy splits, which produce empty tasks (because the byte range of such a 
> split does not cover the starting point of any row group). 
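The dummy-split effect described in the issue can be modeled with a short 
sketch (plain Python, not parquet-mr code; `non_empty_splits` and the exact 
alignment of row groups to byte offsets are simplifying assumptions): with 
block-sized splits, a split only does real work if some row group starts 
inside it.

```python
def non_empty_splits(file_length, block_size, row_group_size):
    """Count block-sized splits that contain the start offset of a row group."""
    # Simplifying assumption: row groups start at exact multiples of
    # row_group_size; real Parquet row group boundaries vary slightly.
    row_group_starts = range(0, file_length, row_group_size)
    busy = 0
    for start in range(0, file_length, block_size):
        end = min(start + block_size, file_length)
        if any(start <= rg < end for rg in row_group_starts):
            busy += 1
    return busy

# A 1 GB file with 64 MB blocks and 256 MB row groups: 16 splits are
# created, but only 4 contain a row group start; the other 12 are empty.
MB = 1024 * 1024
print(non_empty_splits(1024 * MB, 64 * MB, 256 * MB))
```

Setting the minimum split size to at least the row group size collapses the 
empty splits into the busy ones, which is why the configuration matters here.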



