Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21638#discussion_r215030825

    --- Diff: core/src/main/scala/org/apache/spark/input/PortableDataStream.scala ---
    @@ -47,7 +47,7 @@ private[spark] abstract class StreamFileInputFormat[T]
       def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
         val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
         val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
    -    val defaultParallelism = sc.defaultParallelism
    +    val defaultParallelism = Math.max(sc.defaultParallelism, minPartitions)
    --- End diff --

```
sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local")
  .set(config.FILES_OPEN_COST_IN_BYTES.key, "0")
  .set("spark.default.parallelism", "1"))
println(sc.binaryFiles(dirpath1, minPartitions = 50).getNumPartitions)
println(sc.binaryFiles(dirpath1, minPartitions = 1).getNumPartitions)
```

It is not hard to verify whether the parameter `minPartitions` takes effect. Currently, the description of this parameter is unclear. We need to document clearly which factors affect the actual number of partitions; otherwise, users will not understand how to use it.
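To see why `minPartitions` had little effect before the patch, here is a rough Python model of the split-size arithmetic (an assumption based on the surrounding code in `setMinPartitions`: the max split size is `min(defaultMaxSplitBytes, max(openCostInBytes, totalBytes / defaultParallelism))`; the names below are illustrative, not Spark API):

```python
# Hypothetical model of how setMinPartitions derives the max split size.
# Assumption: mirrors min(defaultMaxSplitBytes, max(openCostInBytes, bytesPerCore)).
def max_split_size(total_bytes, default_max_split_bytes,
                   open_cost_in_bytes, default_parallelism):
    bytes_per_core = total_bytes // default_parallelism
    return min(default_max_split_bytes, max(open_cost_in_bytes, bytes_per_core))

# With openCostInBytes = 0 and spark.default.parallelism = 1 (as in the repro
# above), bytesPerCore equals totalBytes, so the split size spans all input and
# minPartitions is effectively ignored unless it is folded into the parallelism,
# which is what the patched line Math.max(sc.defaultParallelism, minPartitions) does.
print(max_split_size(10_000_000, 128 * 1024 * 1024, 0, 1))   # one split covers everything
print(max_split_size(10_000_000, 128 * 1024 * 1024, 0, 50))  # smaller splits, more partitions
```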