[jira] [Commented] (SPARK-33534) Allow specifying a minimum number of bytes in a split of a file
[ https://issues.apache.org/jira/browse/SPARK-33534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17873539#comment-17873539 ]

Mayur Dubey commented on SPARK-33534:
--------------------------------------

Hi, is "spark.sql.files.minPartitionBytes" available now, or is there an alternate solution that works well? [~nielsbasjes] I am getting this error in an AWS Glue Spark session; I tried updating spark.sql.files.maxPartitionBytes, but it does not seem to take effect in Glue:

"An error occurred while calling o147.collectToPython. The provided InputSplit (12582912;12642402] is 59490 bytes which is too small. (Minimum is 65536)"

> Allow specifying a minimum number of bytes in a split of a file
> ----------------------------------------------------------------
>
>                 Key: SPARK-33534
>                 URL: https://issues.apache.org/jira/browse/SPARK-33534
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.1
>            Reporter: Niels Basjes
>            Priority: Major
>
> *Background*
> A long time ago I wrote a way of reading a (usually large) gzipped file that allows the load to be distributed better over an Apache Hadoop cluster: [https://github.com/nielsbasjes/splittablegzip]
> It seems people still need this kind of functionality, and it turns out my code works without modification in conjunction with Apache Spark. See for example:
> - SPARK-29102
> - [https://stackoverflow.com/q/28127119/877069]
> - [https://stackoverflow.com/q/27531816/877069]
> A while ago [~nchammas] contributed documentation to my project on how to use it with Spark:
> [https://github.com/nielsbasjes/splittablegzip/blob/master/README-Spark.md]
>
> *The problem*
> Some people have reported errors from this feature of mine. The functionality cannot read a split that is too small, because the number of bytes read from disk and the number of bytes coming out of the decompression are different. My code therefore uses the {{io.file.buffer.size}} setting, but it also has a hard-coded lower limit on the split size of 4 KiB.
> The problem I found when looking into these reports is that Spark does not enforce a minimum number of bytes in a split. In fact, when I created a test file and set {{spark.sql.files.maxPartitionBytes}} to exactly 1 byte less than the size of that file, my library gave this error:
> {{java.lang.IllegalArgumentException: The provided InputSplit (562686;562687] is 1 bytes which is too small. (Minimum is 65536)}}
> The code that does this calculation is here:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L74
>
> *Proposed enhancement*
> I propose a new setting ({{spark.sql.files.minPartitionBytes}}?) that guarantees no split of a file is smaller than a configured number of bytes, with a default of something like 64 KiB. Having some constraints on the value of {{spark.sql.files.minPartitionBytes}}, possibly in relation to {{spark.sql.files.maxPartitionBytes}}, would be fine. A sketch of this idea follows below.
>
> *Notes*
> Hadoop already has code that does this:
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L456
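To make the proposed guarantee concrete, here is a minimal sketch in Scala. It is illustrative only, not Spark's FilePartition code: it carves one file into splits of at most a maximum size and folds an undersized final remainder into the preceding split. The Hadoop FileInputFormat code linked in the notes does something related by clamping its target split size to Math.max(minSize, Math.min(maxSize, blockSize)). All function and parameter names here are made up for the example.

{code:scala}
// Illustrative only, not Spark's actual FilePartition logic.
// Splits a single file of `fileLen` bytes into (start, length) ranges
// of at most `maxSplitBytes`, merging a too-small tail into the
// previous split so no split is smaller than `minSplitBytes`.
def splitOffsets(fileLen: Long,
                 maxSplitBytes: Long,
                 minSplitBytes: Long): Seq[(Long, Long)] = {
  val raw = (0L until fileLen by maxSplitBytes)
    .map(start => (start, math.min(maxSplitBytes, fileLen - start)))
  raw match {
    // Undersized tail with at least one earlier split: merge them.
    case init :+ ((_, lastLen)) if lastLen < minSplitBytes && init.nonEmpty =>
      val (prevStart, prevLen) = init.last
      val merged = (prevStart, prevLen + lastLen)
      init.init :+ merged
    case other => other
  }
}

// With the numbers from the error above: a 562687-byte file and a
// maximum of 562686 bytes would naively produce a 1-byte tail split;
// this sketch instead yields the single split (0, 562687).
val splits = splitOffsets(562687L, 562686L, 65536L)
{code}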
[jira] [Commented] (SPARK-33534) Allow specifying a minimum number of bytes in a split of a file
[ https://issues.apache.org/jira/browse/SPARK-33534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338295#comment-17338295 ]

Niels Basjes commented on SPARK-33534:
--------------------------------------

[~Suhass] To be clear: I wrote this tool, which is apparently useful to people who use Spark; I personally do not use Spark very often.

Do I have another way around this? No. The only thing I can think of (which would not be usable in real scenarios) is to first list all of the input files and then, using some kind of numerical analysis, find a value for "spark.sql.files.maxPartitionBytes" that does not trigger this problem, and run the job with that setting. When reading multiple files there will be scenarios where no such value exists, so my answer to your question is 'No'. A sketch of that search follows below.
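Purely to illustrate why that workaround is impractical, here is a hedged sketch. It assumes a simplification: that Spark carves each file into consecutive chunks of exactly the configured maximum, so a file's final chunk is its length modulo that maximum. All names and numbers are invented for the example.

{code:scala}
// Hypothetical brute-force search for a usable value of
// spark.sql.files.maxPartitionBytes, under the simplifying assumption
// that the last split of each file is `fileLen % candidate` bytes.
val codecMinimum = 65536L // minimum split size from the error message

def tailIsReadable(fileLen: Long, candidate: Long): Boolean = {
  val tail = fileLen % candidate
  tail == 0L || tail >= codecMinimum
}

// Try candidate settings until one leaves no file with an unreadable tail.
def findWorkableSetting(fileLengths: Seq[Long],
                        candidates: Seq[Long]): Option[Long] =
  candidates.find(c => fileLengths.forall(len => tailIsReadable(len, c)))

// With many files of arbitrary sizes, `find` may return None:
// no single setting avoids every too-small tail, which is exactly
// the failure mode described in the comment above.
val setting = findWorkableSetting(
  Seq(562687L, 12642402L),                  // hypothetical file sizes
  (1L to 64L).map(_ * 16L * 1024L * 1024L)) // candidates in 16 MiB steps
{code}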
[jira] [Commented] (SPARK-33534) Allow specifying a minimum number of bytes in a split of a file
[ https://issues.apache.org/jira/browse/SPARK-33534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17305154#comment-17305154 ]

Suhas Jaladi commented on SPARK-33534:
--------------------------------------

[~nielsbasjes] Just checking whether you have any alternate solution until "spark.sql.files.minPartitionBytes" is developed.
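For anyone looking for the interim answer to this question: the README-Spark.md linked in the description documents using the splittable gzip codec directly from Spark. Below is a minimal sketch of that setup; the codec class name is taken from that project, while the application name, partition size, and input path are made-up placeholders.

{code:scala}
import org.apache.spark.sql.SparkSession

// Minimal sketch based on the project's README-Spark.md; this is a
// third-party codec, not a built-in Spark feature. Keep the maximum
// partition size well above the codec's minimum split size.
val spark = SparkSession.builder()
  .appName("splittable-gzip-example")
  // Register the codec so .gz files can be split at all.
  .config("spark.hadoop.io.compression.codecs",
          "nl.basjes.hadoop.io.compress.SplittableGzipCodec")
  // Large splits stay clear of the "InputSplit ... too small" error.
  .config("spark.sql.files.maxPartitionBytes", (128L * 1024 * 1024).toString)
  .getOrCreate()

val df = spark.read.text("s3://example-bucket/large-file.txt.gz") // hypothetical path
{code}

In an environment where the session already exists (such as AWS Glue), {{spark.sql.files.maxPartitionBytes}} is a runtime SQL configuration, so setting it with spark.conf.set before triggering the read should take effect; whether Glue itself overrides it is a separate question.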