[ https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707414#comment-14707414 ]
Ryan Blue commented on SPARK-10143:
-----------------------------------

[~yhuai], yes, you'd want to determine the number of tasks first. Or, if you don't care, you can choose a reasonable task size like 256 MB. That should not be smaller than the default row group size, but could easily be larger.

> Parquet changed the behavior of calculating splits
> --------------------------------------------------
>
>                 Key: SPARK-10143
>                 URL: https://issues.apache.org/jira/browse/SPARK-10143
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Yin Huai
>            Priority: Critical
>
> When Parquet's task-side metadata is enabled (it is enabled by default, and it
> needs to be enabled to deal with tables with many files), Parquet delegates
> the work of calculating the initial splits to FileInputFormat (see
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311).
> If the filesystem's block size is smaller than the row group size and users do
> not set a minimum split size, the initial split list will contain many dummy
> splits that become empty tasks, because a split whose start and end points do
> not cover the starting point of any row group reads no data.
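The 256 MB suggestion works by raising FileInputFormat's minimum split size, which feeds into its split-size formula `max(minSize, min(maxSize, blockSize))`. A minimal sketch of that formula (the 64 MB block size is an assumed value for illustration):

```java
// Sketch of FileInputFormat's split-size computation; the block size below
// is an illustrative assumption, not a measured value.
public class SplitSize {
    static final long MB = 1024L * 1024;

    // FileInputFormat computes: max(minSize, min(maxSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64 * MB;      // assumed filesystem block size
        long maxSize = Long.MAX_VALUE; // default maximum split size

        // With the default tiny minimum, splits follow the 64 MB blocks.
        System.out.println(computeSplitSize(blockSize, 1, maxSize) / MB);        // 64
        // Raising the minimum to 256 MB yields splits >= the row group size.
        System.out.println(computeSplitSize(blockSize, 256 * MB, maxSize) / MB); // 256
    }
}
```

In Hadoop 2.x the minimum can be set through the configuration key `mapreduce.input.fileinputformat.split.minsize`, which Spark passes along via its Hadoop configuration.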
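To see how the dummy splits arise, here is a self-contained sketch that counts the splits whose byte range contains no row-group start; those are the ones that become empty tasks. The file, block, and row-group sizes are illustrative assumptions:

```java
// Sketch: small filesystem blocks + large Parquet row groups => empty tasks.
// All sizes below are assumed for illustration.
public class EmptySplits {
    static final long MB = 1024L * 1024;

    // A split does real work only if some row group *starts* inside
    // [start, start + splitSize); count the splits that do.
    static long nonEmptySplits(long fileLen, long splitSize, long rowGroupSize) {
        long nonEmpty = 0;
        for (long s = 0; s < fileLen; s += splitSize) {
            for (long rg = 0; rg < fileLen; rg += rowGroupSize) {
                if (rg >= s && rg < s + splitSize) { nonEmpty++; break; }
            }
        }
        return nonEmpty;
    }

    public static void main(String[] args) {
        long fileLen = 1024 * MB;     // 1 GB Parquet file (assumed)
        long blockSize = 64 * MB;     // 64 MB filesystem block size (assumed)
        long rowGroupSize = 256 * MB; // 256 MB row groups (assumed)

        long total = (fileLen + blockSize - 1) / blockSize;
        long nonEmpty = nonEmptySplits(fileLen, blockSize, rowGroupSize);
        System.out.println(total + " splits, " + (total - nonEmpty) + " empty");
        // prints "16 splits, 12 empty"
    }
}
```

With these numbers, 12 of the 16 block-sized splits cover no row-group start, so three quarters of the tasks read nothing.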