Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21638#discussion_r215030825
  
    --- Diff: core/src/main/scala/org/apache/spark/input/PortableDataStream.scala ---
    @@ -47,7 +47,7 @@ private[spark] abstract class StreamFileInputFormat[T]
       def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
         val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
         val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
    -    val defaultParallelism = sc.defaultParallelism
    +    val defaultParallelism = Math.max(sc.defaultParallelism, minPartitions)
    --- End diff --
    
    ```
          sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local")
            .set(config.FILES_OPEN_COST_IN_BYTES.key, "0")
            .set("spark.default.parallelism", "1"))
    
          println(sc.binaryFiles(dirpath1, minPartitions = 50).getNumPartitions)
          println(sc.binaryFiles(dirpath1, minPartitions = 1).getNumPartitions)
    ```
    
    It is not hard to verify whether the parameter `minPartitions` takes effect. Currently, the description of this parameter is unclear. We need to document clearly which factors influence the actual number of partitions; otherwise, users will not understand how to use it.
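    To make the interplay of these factors concrete, here is a minimal, self-contained sketch of the split-size computation that the patched line feeds into. It is an illustration, not the actual Spark code: the object name `SplitSizeSketch`, the parameter names, and the sample file sizes are all invented for this example; the formula mirrors the `bytesPerCore`/clamping pattern used in `setMinPartitions`, under the assumption that the listed files contribute their length plus the open cost.

    ```scala
    object SplitSizeSketch {
      // Hypothetical stand-in for the sizing logic; names and inputs are illustrative.
      def maxSplitSize(fileSizes: Seq[Long],
                       defaultMaxSplitBytes: Long,   // spark.files.maxPartitionBytes
                       openCostInBytes: Long,        // spark.files.openCostInBytes
                       defaultParallelism: Int,
                       minPartitions: Int): Long = {
        // With the patch above, minPartitions now raises the parallelism floor.
        val parallelism = math.max(defaultParallelism, minPartitions)
        // Each file is charged an open cost on top of its length.
        val totalBytes = fileSizes.map(_ + openCostInBytes).sum
        val bytesPerCore = totalBytes / parallelism
        // The split size is clamped between the open cost and the configured max.
        math.min(defaultMaxSplitBytes, math.max(openCostInBytes, bytesPerCore))
      }

      def main(args: Array[String]): Unit = {
        // 10 files of 1 MiB, open cost 0, default parallelism 1 (as in the repro
        // above): asking for minPartitions = 50 shrinks the target split size.
        println(maxSplitSize(Seq.fill(10)(1L << 20),
          defaultMaxSplitBytes = 128L * 1024 * 1024,
          openCostInBytes = 0L,
          defaultParallelism = 1,
          minPartitions = 50))
      }
    }
    ```

    Under these toy numbers, a larger `minPartitions` lowers `bytesPerCore` and hence the split size, which is why the repro above yields more partitions for `minPartitions = 50` than for `minPartitions = 1`.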


---
