[ https://issues.apache.org/jira/browse/SPARK-33534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Takeshi Yamamuro updated SPARK-33534:
-------------------------------------
    Component/s:     (was: Input/Output)
                     SQL

> Allow specifying a minimum number of bytes in a split of a file
> ---------------------------------------------------------------
>
>                 Key: SPARK-33534
>                 URL: https://issues.apache.org/jira/browse/SPARK-33534
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.1
>            Reporter: Niels Basjes
>            Priority: Major
>
> *Background*
> A long time ago I wrote a way of reading a (usually large) gzipped file that allows better distribution of the load over an Apache Hadoop cluster: [https://github.com/nielsbasjes/splittablegzip]
> It seems people still need this kind of functionality, and it turns out my code works without modification in conjunction with Apache Spark.
> See for example:
> - SPARK-29102
> - [https://stackoverflow.com/q/28127119/877069]
> - [https://stackoverflow.com/q/27531816/877069]
> A while ago [~nchammas] contributed documentation to my project on how to use it with Spark:
> [https://github.com/nielsbasjes/splittablegzip/blob/master/README-Spark.md]
> *The problem*
> Some people have reported getting errors from this feature of mine.
> The fact is that this functionality cannot read a split if it is too small (the number of bytes read from disk and the number of bytes coming out of the decompression differ). So my code uses the {{io.file.buffer.size}} setting but also has a hard-coded lower limit on the split size of 4 KiB.
> The problem I found when looking into the reports I got is that Spark does not enforce a minimum number of bytes in a split.
> In fact: when I created a test file and then set {{spark.sql.files.maxPartitionBytes}} to exactly 1 byte less than the size of my test file, my library gave the error:
> {{java.lang.IllegalArgumentException: The provided InputSplit (562686;562687] is 1 bytes which is too small. (Minimum is 65536)}}
> The code that does this calculation is here:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L74
> *Proposed enhancement*
> What I propose is a new setting ({{spark.sql.files.minPartitionBytes}}?) that guarantees no split of a file is smaller than a configured number of bytes.
> I also propose a default of something like 64 KiB.
> Having some constraints on the value of {{spark.sql.files.minPartitionBytes}}, possibly in relation to {{spark.sql.files.maxPartitionBytes}}, would be fine.
> *Notes*
> Hadoop already has code that does this:
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L456
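The README-Spark.md linked in the description documents the Spark-side setup. As a hedged sketch only (the jar path is a placeholder and the exact conf keys should be checked against that README; the codec class name is taken from the splittablegzip project), registering the codec for a Spark job might look like:

```shell
# Sketch, not an authoritative incantation -- see README-Spark.md.
# The jar path is a placeholder; the codec class is from splittablegzip.
spark-submit \
  --jars /path/to/splittablegzip.jar \
  --conf spark.hadoop.io.compression.codecs=nl.basjes.hadoop.io.compress.SplittableGzipCodec \
  --conf spark.sql.files.maxPartitionBytes=134217728 \
  my_job.py
```

With the codec registered, Spark plans multiple splits per gzipped file, which is exactly where the too-small-tail-split problem below surfaces.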
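The split-planning behaviour described above, and the proposed fix, can be modelled in a few lines. This is a simplified sketch, not Spark's actual FilePartition code (which also packs splits from multiple files into partitions); the function names are mine:

```python
def plan_splits(file_size, max_partition_bytes):
    """Cut one file into (offset, length) splits of at most
    max_partition_bytes bytes. As in current Spark, the tail split
    can be arbitrarily small -- down to a single byte."""
    splits, offset = [], 0
    while offset < file_size:
        length = min(max_partition_bytes, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits


def plan_splits_with_floor(file_size, max_partition_bytes, min_partition_bytes):
    """The proposed behaviour: merge a too-small tail split into its
    predecessor so no split is shorter than min_partition_bytes."""
    splits = plan_splits(file_size, max_partition_bytes)
    if len(splits) > 1 and splits[-1][1] < min_partition_bytes:
        tail_offset, tail_length = splits.pop()
        offset, length = splits[-1]
        splits[-1] = (offset, length + tail_length)
    return splits


# The scenario from the report: a 562687-byte file with
# maxPartitionBytes one byte smaller yields a 1-byte tail split.
print(plan_splits(562687, 562686))
# With a 64 KiB floor the tail is merged into the previous split.
print(plan_splits_with_floor(562687, 562686, 65536))
```

Note that merging can push a split slightly past {{spark.sql.files.maxPartitionBytes}}, which is one reason the ticket suggests constraining the two settings relative to each other.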
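For comparison, the Hadoop FileInputFormat code linked in the notes derives its split size from three settings and tolerates a slightly oversized last split (the SPLIT_SLOP factor of 1.1), so the tail split of a multi-split file is never a sliver. A simplified Python model of that logic (function names are mine, not Hadoop's):

```python
SPLIT_SLOP = 1.1  # Hadoop's tolerance factor for the final split


def compute_split_size(block_size, min_size, max_size):
    # Hadoop: Math.max(minSize, Math.min(maxSize, blockSize))
    return max(min_size, min(max_size, block_size))


def hadoop_style_splits(file_size, block_size, min_size, max_size):
    """Emit fixed-size splits while more than SLOP * split_size
    remains, then emit whatever is left as the final split."""
    split_size = compute_split_size(block_size, min_size, max_size)
    splits, remaining = [], file_size
    while remaining / split_size > SPLIT_SLOP:
        splits.append((file_size - remaining, split_size))
        remaining -= split_size
    if remaining > 0:
        splits.append((file_size - remaining, remaining))
    return splits


# A 101-byte file with a 100-byte split size stays a single split
# instead of producing a 1-byte tail.
print(hadoop_style_splits(101, 100, 1, 100))
```

Because the loop only continues while more than 1.1 split sizes remain, any multi-split file ends with a tail larger than 10% of the split size, which is the guarantee this ticket asks Spark to offer in some form.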