[ 
https://issues.apache.org/jira/browse/SPARK-33534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-33534:
-------------------------------------
    Component/s:     (was: Input/Output)
                 SQL

> Allow specifying a minimum number of bytes in a split of a file
> ---------------------------------------------------------------
>
>                 Key: SPARK-33534
>                 URL: https://issues.apache.org/jira/browse/SPARK-33534
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.1
>            Reporter: Niels Basjes
>            Priority: Major
>
> *Background*
>  A long time ago I wrote a way of reading a (usually large) gzipped file 
> that allows better distribution of the load over an Apache Hadoop cluster: 
> [https://github.com/nielsbasjes/splittablegzip]
> It seems people still need this kind of functionality, and it turns out my 
> code works without modification in conjunction with Apache Spark.
>  See for example:
>  - SPARK-29102
>  - [https://stackoverflow.com/q/28127119/877069]
>  - [https://stackoverflow.com/q/27531816/877069]
> A while ago [~nchammas] contributed documentation to my project on how to 
> use it with Spark:
>  [https://github.com/nielsbasjes/splittablegzip/blob/master/README-Spark.md]
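> As a rough sketch of what that setup looks like (the codec class name comes 
> from my project; check the README above for the exact settings and values):
> {code:scala}
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder()
>   .appName("splittable-gzip-example")
>   // Register the splittable gzip codec with the Hadoop input layer.
>   .config("spark.hadoop.io.compression.codecs",
>     "nl.basjes.hadoop.io.compress.SplittableGzipCodec")
>   // Cap the split size so a single large .gz file becomes multiple tasks.
>   .config("spark.sql.files.maxPartitionBytes", 128 * 1024 * 1024)
>   .getOrCreate()
>
> // Each task decompresses from the start of the file and discards the bytes
> // before its own split, trading CPU for parallelism.
> val lines = spark.read.textFile("/data/big-file.gz")
> {code}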
> *The problem*
>  Now some people have reported getting errors from this feature of mine.
> The fact is that this functionality cannot read a split if it is too small 
> (the number of bytes read from disk and the number of bytes coming out of 
> the decompression are different). So my code uses the 
> {{io.file.buffer.size}} setting but also has a hard-coded lower limit on 
> the split size of 4 KiB.
> The problem I found when looking into these reports is that Spark does not 
> enforce a minimum number of bytes in a split.
> In fact, when I created a test file and set 
> {{spark.sql.files.maxPartitionBytes}} to exactly 1 byte less than the size 
> of that file, my library gave this error:
> {{java.lang.IllegalArgumentException: The provided InputSplit (562686;562687] 
> is 1 bytes which is too small. (Minimum is 65536)}}
> I found the code that does this calculation here: 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L74
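> To illustrate, here is a simplified Scala sketch of that slicing logic (not 
> the actual Spark code; the names here are mine):
> {code:scala}
> // Slice a file of fileLen bytes into (offset, length) splits of at most
> // maxSplitBytes each, the way the datasource file splitting does it.
> def splitOffsets(fileLen: Long, maxSplitBytes: Long): Seq[(Long, Long)] =
>   (0L until fileLen by maxSplitBytes).map { offset =>
>     (offset, math.min(maxSplitBytes, fileLen - offset))
>   }
>
> // With my 562687-byte test file and maxPartitionBytes = 562686,
> // the tail split is exactly 1 byte:
> splitOffsets(562687L, 562686L)
> // => Vector((0,562686), (562686,1))
> {code}
> There is no lower bound applied anywhere in this calculation, so the tail 
> split of a file can be arbitrarily small.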
> *Proposed enhancement*
> So what I propose is a new setting ({{spark.sql.files.minPartitionBytes}}?) 
> that guarantees that no split of a file is smaller than a configured number 
> of bytes.
> I also propose setting this to something like 64 KiB by default.
> Having some constraints on the value of 
> {{spark.sql.files.minPartitionBytes}}, possibly in relation to 
> {{spark.sql.files.maxPartitionBytes}}, would be fine.
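> To make the idea concrete, here is a sketch of one possible way to enforce 
> it (the setting name is the tentative one from above, and merging an 
> undersized tail into the previous split is just one option):
> {code:scala}
> // Hypothetical: same slicing as today, but a tail split smaller than
> // minPartitionBytes is merged into the split before it, so no split of
> // a file ends up below the configured minimum (default here: 64 KiB).
> def splitOffsetsWithMin(
>     fileLen: Long,
>     maxSplitBytes: Long,
>     minPartitionBytes: Long = 64L * 1024): Seq[(Long, Long)] = {
>   val raw = (0L until fileLen by maxSplitBytes).map { offset =>
>     (offset, math.min(maxSplitBytes, fileLen - offset))
>   }
>   if (raw.size > 1 && raw.last._2 < minPartitionBytes) {
>     val (prevOffset, prevLen) = raw(raw.size - 2)
>     val merged = (prevOffset, prevLen + raw.last._2)
>     raw.dropRight(2) :+ merged
>   } else {
>     raw
>   }
> }
>
> splitOffsetsWithMin(562687L, 562686L)
> // => Vector((0,562687))  -- the 1-byte tail is absorbed, no tiny split
> {code}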
> *Notes*
> Hadoop already has code that does this: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L456
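> For reference, the formula behind that Hadoop code 
> ({{FileInputFormat.computeSplitSize}}), transcribed here from Java to Scala:
> {code:scala}
> // Hadoop clamps the split size between the configured minimum and maximum
> // before slicing the file.
> def computeSplitSize(blockSize: Long, minSize: Long, maxSize: Long): Long =
>   math.max(minSize, math.min(maxSize, blockSize))
> {code}
> Hadoop additionally keeps a slightly oversized final split (up to 10% over, 
> via its SPLIT_SLOP constant) rather than emitting a tiny tail split.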


