[ https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707222#comment-14707222 ]

Ryan Blue commented on SPARK-10143:
-----------------------------------

I think you're going to end up assuming every row group is 128MB then. That's 
not terrible, but you may as well allocate the number of tasks that you want 
and divide the input up evenly by size. In the worst case, you get fewer real 
tasks because that's the way the data is laid out (fewer row groups than 
tasks). In the other cases, your executors are responsible for contiguous 
chunks of files from S3. If you do it based on the default row group size, I 
think you're going to hit the case where you create too many tasks fairly 
often.
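
For illustration, here is a rough Scala sketch of that size-based planning. 
The names (FileChunk, planSplits) are hypothetical, not Spark or parquet-mr 
APIs:

    // Pick a target task count up front and divide the input evenly by
    // size, instead of assuming every row group is 128MB.
    case class FileChunk(path: String, start: Long, length: Long)

    def planSplits(files: Seq[(String, Long)], targetTasks: Int): Seq[FileChunk] = {
      val totalBytes = files.map(_._2).sum
      // Bytes each task should own; at least 1 to avoid zero-length chunks.
      val bytesPerTask = math.max(1L, math.ceil(totalBytes.toDouble / targetTasks).toLong)
      files.flatMap { case (path, len) =>
        // Carve each (path, size) pair into contiguous ~bytesPerTask chunks,
        // so a task reads one continuous byte range from S3.
        (0L until len by bytesPerTask).map { start =>
          FileChunk(path, start, math.min(bytesPerTask, len - start))
        }
      }
    }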

> Parquet changed the behavior of calculating splits
> --------------------------------------------------
>
>                 Key: SPARK-10143
>                 URL: https://issues.apache.org/jira/browse/SPARK-10143
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Yin Huai
>            Priority: Critical
>
> When Parquet's task-side metadata is enabled (it is enabled by default, and 
> it needs to be enabled to deal with tables with many files), Parquet 
> delegates the work of calculating the initial splits to FileInputFormat (see 
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311).
> If the filesystem's block size is smaller than the row group size and users 
> do not set a min split size, the initial split list will contain lots of 
> dummy splits, and these contribute empty tasks (because such a split's 
> starting and ending points do not cover the starting point of any row 
> group). 
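
To make that failure mode concrete, here is a toy Scala sketch of the split 
arithmetic (the 64MB/128MB sizes are example numbers, not values taken from 
the code):

    // One split per filesystem block, FileInputFormat-style. A split only
    // produces work if some row group *starts* inside it; the rest become
    // the dummy splits / empty tasks described above.
    val blockSize    = 64L   * 1024 * 1024   // filesystem block size
    val rowGroupSize = 128L  * 1024 * 1024   // Parquet row group size
    val fileSize     = 1024L * 1024 * 1024   // one 1GB Parquet file

    val splits    = (0L until fileSize by blockSize).map(s => (s, blockSize))
    val rowGroups = 0L until fileSize by rowGroupSize

    val (real, dummy) = splits.partition { case (start, len) =>
      rowGroups.exists(rg => rg >= start && rg < start + len)
    }
    println(s"${splits.size} splits: ${real.size} real, ${dummy.size} empty")
    // prints: 16 splits: 8 real, 8 empty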


