[ 
https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707363#comment-14707363
 ] 

Yin Huai commented on SPARK-10143:
----------------------------------

Yeah, the setting is not the real row group size, and we will probably always 
assume a row group is 128MB. But I feel that, in general, end users do not 
tweak this setting often. With our current master, we are creating too many 
tasks, which causes significant regressions in some cases.

My thought on splitting the files based on total length is that it requires us 
to determine the number of tasks first, right?

> Parquet changed the behavior of calculating splits
> --------------------------------------------------
>
>                 Key: SPARK-10143
>                 URL: https://issues.apache.org/jira/browse/SPARK-10143
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Yin Huai
>            Priority: Critical
>
> When Parquet's task side metadata is enabled (by default it is enabled and it 
> needs to be enabled to deal with tables with many files), Parquet delegates 
> the work of calculating initial splits to FileInputFormat (see 
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311).
>  If the filesystem's block size is smaller than the row group size and users 
> do not set a min split size, the initial split list will contain lots of 
> dummy splits, and these turn into empty tasks (because the range between the 
> starting point and ending point of such a split does not cover the starting 
> point of any row group). 
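
To make the effect described above concrete, here is a rough sketch in Python. It is not Parquet or Hadoop code. The split-size formula follows Hadoop's FileInputFormat.computeSplitSize (max of the min size and the smaller of max size and block size); everything else is an assumption made for illustration: the helper names, the 512MB file, the 32MB block size, and the idealized row group offsets at exact 128MB boundaries. It also ignores the slop Hadoop allows on the last split.

```python
MB = 1024 * 1024

def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    # Same formula as Hadoop's FileInputFormat.computeSplitSize.
    return max(min_size, min(max_size, block_size))

def make_splits(file_length, split_size):
    # Fixed-size [start, end) byte ranges (hypothetical helper; no split slop).
    return [(start, min(start + split_size, file_length))
            for start in range(0, file_length, split_size)]

def count_empty_splits(splits, row_group_starts):
    # A split is "empty" when no row group starts inside its byte range.
    return sum(1 for start, end in splits
               if not any(start <= rg < end for rg in row_group_starts))

# Hypothetical file: 512MB, four row groups of 128MB each (idealized offsets).
file_length = 512 * MB
row_group_starts = [0, 128 * MB, 256 * MB, 384 * MB]

# Block size (32MB) smaller than the row group size, no min split size set.
default_splits = make_splits(file_length, compute_split_size(32 * MB))
print(len(default_splits), count_empty_splits(default_splits, row_group_starts))
# prints: 16 12

# Same file with the min split size raised to the row group size.
tuned_splits = make_splits(
    file_length, compute_split_size(32 * MB, min_size=128 * MB))
print(len(tuned_splits), count_empty_splits(tuned_splits, row_group_starts))
# prints: 4 0
```

In this toy setup, 12 of the 16 initial splits contain no row group start and would become empty tasks, while a min split size equal to the row group size leaves none.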



