[ https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707148#comment-14707148 ]
Ryan Blue commented on SPARK-10143:
-----------------------------------

[~yhuai] if you do that, you will get the current value for the configuration, not what was used to write the file. If you want to know what the value was when the file was written, you have to read its footer.

As far as solving the challenge of S3 input splits, if you're running in S3, why not split the files based on total length? Example:

* 2 files: 500 MB and 700 MB
* Want 5 reducers
* Splits: file 1:0-250MB, file 1:250-500MB, file 2:0-250MB, file 2:250-500MB, file 2:500-700MB

Even without knowing the block size, you can control parallelism. If there are lots of small blocks (say 64MB block size), then you get approximately what you wanted. If there are big blocks (256MB) then you are still okay. If you have gigantic blocks (500MB) then you waste a couple tasks and get as much parallelism as possible anyway.

> Parquet changed the behavior of calculating splits
> --------------------------------------------------
>
>                 Key: SPARK-10143
>                 URL: https://issues.apache.org/jira/browse/SPARK-10143
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Yin Huai
>            Priority: Critical
>
> When Parquet's task side metadata is enabled (by default it is enabled and it
> needs to be enabled to deal with tables with many files), Parquet delegates
> the work of calculating initial splits to FileInputFormat (see
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311).
> If filesystem's block size is smaller than the row group size and users do
> not set min split size, splits in the initial split list will have lots of
> dummy splits and they contribute to empty tasks (because the starting point
> and ending point of a split does not cover the starting point of a row
> group).
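The length-based splitting suggested in the comment above can be sketched roughly as follows. This is only an illustration in Python, not code from Spark or Parquet: plan_splits, row_group_starts and count_empty_splits are hypothetical helper names. The 250 MB split size is taken directly from the example. The sketch also assumes, for simplicity, that row groups start at exact multiples of the row group size, which real Parquet files do not guarantee. A split is treated as non-empty when its byte range covers the starting point of at least one row group, as described in the issue.

```python
MB = 1024 * 1024

def plan_splits(files, split_size):
    """Cut each (name, length) file into byte ranges of at most split_size.

    The filesystem block size is not consulted at all.
    """
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits = []
    for name, length in files:
        start = 0
        while start < length:
            end = min(start + split_size, length)
            splits.append((name, start, end))
            start = end
    return splits

def row_group_starts(length, row_group_size):
    """Assumption for illustration: row groups begin at multiples of row_group_size."""
    return list(range(0, length, row_group_size))

def count_empty_splits(files, splits, row_group_size):
    """Count splits whose range [start, end) covers no row group start."""
    starts = {name: row_group_starts(length, row_group_size) for name, length in files}
    empty = 0
    for name, start, end in splits:
        if not any(start <= s < end for s in starts[name]):
            empty += 1
    return empty

# The two files and the 250 MB split size from the example above.
files = [("file 1", 500 * MB), ("file 2", 700 * MB)]
splits = plan_splits(files, 250 * MB)
for name, start, end in splits:
    print(f"{name}: {start // MB}-{end // MB}MB")

# Empty tasks for the three row group sizes discussed in the comment.
for size_mb in (64, 256, 500):
    empty = count_empty_splits(files, splits, size_mb * MB)
    print(f"row group size {size_mb}MB: {empty} empty of {len(splits)} splits")
```

Under these assumptions the plan reproduces the five splits listed in the example, and only the 500 MB case leaves tasks without a row group to read.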
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org