Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21638#discussion_r198120457

--- Diff: core/src/main/scala/org/apache/spark/input/PortableDataStream.scala ---
@@ -45,7 +45,8 @@ private[spark] abstract class StreamFileInputFormat[T]
    * which is set through setMaxSplitSize
    */
   def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
-    val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
+    val defaultMaxSplitBytes = Math.max(
+      sc.getConf.get(config.FILES_MAX_PARTITION_BYTES), minPartitions)
     val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
     val defaultParallelism = sc.defaultParallelism
--- End diff --

Could you describe the use case in which `minPartitions` needs to be taken into account? By default, `FILES_MAX_PARTITION_BYTES` is 128MB. Let's say it is set to as little as 1000, and `minPartitions` equals 10 000. What is the reason to set the max size of splits in **bytes** to the min **number** of partitions? Why should a bigger number of partitions require a bigger split size? Could you add more details to the PR description, please?
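For illustration, here is a minimal, self-contained Scala sketch of the arithmetic being questioned, using the hypothetical values from above (1000 bytes and 10 000 partitions); `filesMaxPartitionBytes` and `minPartitions` are stand-in locals, not the real Spark config plumbing:

```scala
// Sketch of the unit mismatch in the proposed change: a size in bytes is
// compared against a partition count. Values are the hypothetical ones
// from the review comment, not Spark defaults.
object SplitSizeSketch {
  def main(args: Array[String]): Unit = {
    val filesMaxPartitionBytes = 1000L // spark.files.maxPartitionBytes, in bytes
    val minPartitions = 10000          // requested minimum number of partitions

    // With the patch, the max split size becomes the larger of the two,
    // so a *count* (10 000 partitions) silently becomes a *size* (10 000 bytes).
    val defaultMaxSplitBytes = Math.max(filesMaxPartitionBytes, minPartitions.toLong)
    println(s"defaultMaxSplitBytes = $defaultMaxSplitBytes bytes") // prints 10000, not 1000
  }
}
```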