[jira] [Assigned] (SPARK-10143) Parquet changed the behavior of calculating splits
[ https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai reassigned SPARK-10143:
--------------------------------

    Assignee: Yin Huai

> Parquet changed the behavior of calculating splits
> --------------------------------------------------
>
>                 Key: SPARK-10143
>                 URL: https://issues.apache.org/jira/browse/SPARK-10143
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Yin Huai
>            Assignee: Yin Huai
>            Priority: Critical
>             Fix For: 1.5.0
>
>
> When Parquet's task-side metadata is enabled (it is enabled by default, and
> it must be enabled to handle tables with many files), Parquet delegates the
> calculation of initial splits to FileInputFormat (see
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311).
> If the filesystem's block size is smaller than the row group size and users
> do not set a min split size, the initial split list will contain many dummy
> splits, which result in empty tasks (because the range between a dummy
> split's starting point and ending point does not cover the starting point
> of any row group).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
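To make the effect described in the issue concrete, here is a minimal Python sketch that counts dummy splits. It is a simplified model, not Parquet's or Hadoop's actual code: it assumes fixed-size row groups laid out back to back from offset 0, ignores FileInputFormat's split slop, and treats a split as non-empty only when it contains the starting offset of some row group (the rule stated in the issue). The split-size formula max(minSize, min(maxSize, blockSize)) mirrors the one FileInputFormat uses; the function names are hypothetical.

```python
MB = 1024 * 1024


def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    # Split-size formula modeled on Hadoop's FileInputFormat.
    return max(min_size, min(max_size, block_size))


def file_splits(file_len, split_size):
    # Byte ranges [start, end) covering the file; split slop is ignored.
    return [(start, min(start + split_size, file_len))
            for start in range(0, file_len, split_size)]


def count_empty_splits(file_len, row_group_size, block_size, min_size=1):
    # Assumption: row groups are fixed-size and start at offset 0.
    row_group_starts = range(0, file_len, row_group_size)
    splits = file_splits(file_len, compute_split_size(block_size, min_size))
    # A split is a dummy if no row group starts inside its byte range.
    empty = [s for s in splits
             if not any(s[0] <= rg < s[1] for rg in row_group_starts)]
    return len(splits), len(empty)


# 1 GiB file, 128 MiB row groups, 32 MiB filesystem blocks, no min split size.
print(count_empty_splits(1024 * MB, 128 * MB, 32 * MB))
# Same file with the min split size raised to the row group size.
print(count_empty_splits(1024 * MB, 128 * MB, 32 * MB, min_size=128 * MB))
```

Under these assumptions the first call yields 32 splits, 24 of them empty, while raising the min split size to the row group size yields 8 splits and no empty ones.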
[jira] [Assigned] (SPARK-10143) Parquet changed the behavior of calculating splits
[ https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-10143:
------------------------------------

    Assignee:     (was: Apache Spark)

> Parquet changed the behavior of calculating splits
> --------------------------------------------------
>
>                 Key: SPARK-10143
>                 URL: https://issues.apache.org/jira/browse/SPARK-10143
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.5.0
>            Reporter: Yin Huai
>            Priority: Critical
[jira] [Assigned] (SPARK-10143) Parquet changed the behavior of calculating splits
[ https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-10143:
------------------------------------

    Assignee: Apache Spark

> Parquet changed the behavior of calculating splits
> --------------------------------------------------
>
>                 Key: SPARK-10143
>                 URL: https://issues.apache.org/jira/browse/SPARK-10143
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.5.0
>            Reporter: Yin Huai
>            Assignee: Apache Spark
>            Priority: Critical