[
https://issues.apache.org/jira/browse/SPARK-33534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Takeshi Yamamuro updated SPARK-33534:
-------------------------------------
Component/s: (was: Input/Output)
SQL
> Allow specifying a minimum number of bytes in a split of a file
> ---------------------------------------------------------------
>
> Key: SPARK-33534
> URL: https://issues.apache.org/jira/browse/SPARK-33534
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.1
> Reporter: Niels Basjes
> Priority: Major
>
> *Background*
> A long time ago I wrote a way of reading a (usually large) gzipped
> file that allows the load to be distributed better over an Apache
> Hadoop cluster: [https://github.com/nielsbasjes/splittablegzip]
> It seems people still need this kind of functionality, and it turns out my
> code works without modification in conjunction with Apache Spark.
> See for example:
> - SPARK-29102
> - [https://stackoverflow.com/q/28127119/877069]
> - [https://stackoverflow.com/q/27531816/877069]
> A while ago [~nchammas] contributed documentation to my project on how to
> use it with Spark:
> [https://github.com/nielsbasjes/splittablegzip/blob/master/README-Spark.md]
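> For context, wiring the codec into a Spark session looks roughly like this.
> This is an untested sketch: the codec class name
> {{nl.basjes.hadoop.io.compress.SplittableGzipCodec}} is taken from the
> project's README, and it assumes the splittablegzip jar is on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Register the splittable gzip codec via the Hadoop configuration
// that Spark passes through to its file readers.
val spark = SparkSession.builder()
  .appName("splittable-gzip-demo")
  .config("spark.hadoop.io.compression.codecs",
    "nl.basjes.hadoop.io.compress.SplittableGzipCodec")
  .getOrCreate()

// Reads of .gz files now go through the splittable codec,
// so a single large gzipped file can be split across tasks.
val lines = spark.read.textFile("large-file.txt.gz")
```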
> *The problem*
> Some people have reported getting errors from this feature of mine.
> The fact is that this functionality cannot read a split if it is too small
> (because the number of bytes read from disk and the number of bytes coming
> out of the decompression are different). So my code uses the
> {{io.file.buffer.size}} setting, but also has a hard-coded lower limit on
> the split size of 4 KiB.
> The problem I found when looking into these reports is that Spark does not
> enforce a minimum number of bytes in a split.
> In fact, when I created a test file and then set
> {{spark.sql.files.maxPartitionBytes}} to exactly 1 byte less than the size
> of my test file, my library gave this error:
> {{java.lang.IllegalArgumentException: The provided InputSplit (562686;562687]
> is 1 bytes which is too small. (Minimum is 65536)}}
> The code that does this calculation is here:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L74
> *Proposed enhancement*
> I propose adding a new setting (perhaps {{spark.sql.files.minPartitionBytes}})
> that guarantees that no split of a file is smaller than a configured number
> of bytes.
> I also propose setting its default to something like 64 KiB.
> Putting some constraints on the value of
> {{spark.sql.files.minPartitionBytes}}, possibly in relation to
> {{spark.sql.files.maxPartitionBytes}}, would be fine.
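> To illustrate the idea: when a file is chopped into splits of at most
> {{spark.sql.files.maxPartitionBytes}}, the tail split can end up arbitrarily
> small (1 byte in my test above). A minimal sketch of the proposed behavior,
> not Spark's actual code, where a too-small tail is merged into the previous
> split instead of being emitted on its own:

```scala
// Split a file of totalBytes into (offset, length) pieces of at most
// maxSplitBytes, guaranteeing that no split is smaller than minSplitBytes
// (except when the whole file is smaller than the minimum).
def splitFile(totalBytes: Long, maxSplitBytes: Long,
              minSplitBytes: Long): Seq[(Long, Long)] = {
  val splits = scala.collection.mutable.ArrayBuffer.empty[(Long, Long)]
  var offset = 0L
  while (offset < totalBytes) {
    val remaining = totalBytes - offset
    // If cutting a full-size split here would leave a tail below the
    // minimum, extend this split to the end of the file instead.
    val size =
      if (remaining - math.min(maxSplitBytes, remaining) < minSplitBytes)
        remaining
      else maxSplitBytes
    splits += ((offset, size))
    offset += size
  }
  splits.toSeq
}
```

> With the numbers from my test (file of 562687 bytes, max split 562686,
> minimum 65536) this yields a single split covering the whole file instead
> of a 1-byte tail.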
> *Notes*
> Hadoop already has code that does this:
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L456
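> The essence of Hadoop's {{FileInputFormat.computeSplitSize}}, restated as
> standalone Scala: the block size is capped above by the configured maximum
> and below by the configured minimum
> ({{mapreduce.input.fileinputformat.split.minsize}}).

```scala
// Hadoop's split-size clamp: minSize wins over maxSize, so even an
// absurdly small maxSize cannot produce a split below the minimum.
def computeSplitSize(blockSize: Long, minSize: Long, maxSize: Long): Long =
  math.max(minSize, math.min(maxSize, blockSize))
```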
--
This message was sent by Atlassian Jira
(v8.3.4#803005)