Yin Huai reassigned SPARK-10143:
--------------------------------

    Assignee: Yin Huai

> Parquet changed the behavior of calculating splits
> --------------------------------------------------
>
>                 Key: SPARK-10143
>                 URL: https://issues.apache.org/jira/browse/SPARK-10143
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Yin Huai
>            Assignee: Yin Huai
>            Priority: Critical
>             Fix For: 1.5.0
>
>
> When Parquet's task-side metadata is enabled (it is enabled by default, and it needs to be
> enabled to handle tables with many files), Parquet delegates the work of calculating the
> initial splits to FileInputFormat (see
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311).
> If the filesystem's block size is smaller than the row group size and users do not set a
> minimum split size, the initial split list will contain many dummy splits, which turn into
> empty tasks (because such a split's start and end offsets do not cover the starting offset
> of any row group).
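For illustration, here is a minimal sketch (not necessarily the fix applied in this ticket) of
the failure mode and one workaround, assuming a Spark 1.5-era SparkContext, files written with
a row group size (parquet.block.size) of 256 MB, and a filesystem block size smaller than that;
the 256 MB value and the HDFS path are hypothetical. FileInputFormat cuts each file into
block-sized splits, so most splits never contain a row group's starting offset and become empty
tasks; raising the Hadoop minimum split size to at least the row group size removes the dummy
splits.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object ParquetSplitWorkaround {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("parquet-split-sizing"))

        // Assumed row group size of the Parquet files being read (parquet.block.size at write time).
        val rowGroupBytes = 256L * 1024 * 1024

        // Ask FileInputFormat not to create splits smaller than a row group, so every initial
        // split covers at least one row group's starting offset and no empty tasks are produced.
        sc.hadoopConfiguration.setLong("mapreduce.input.fileinputformat.split.minsize", rowGroupBytes)

        val sqlContext = new SQLContext(sc)
        val df = sqlContext.read.parquet("hdfs:///path/to/table")  // hypothetical path
        println(df.count())
      }
    }

Setting the minimum split size is only one way to keep splits from starting inside a row group
they cannot read; the change merged for this issue may address the split calculation differently.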