Github user michalsenkyr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16157#discussion_r91716176

    --- Diff: docs/programming-guide.md ---
    @@ -347,7 +347,7 @@ Some notes on reading files with Spark:

    Apart from text files, Spark's Scala API also supports several other data formats:

    -* `SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file.
    +* `SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file. It takes an optional second argument for controlling the minimal number of partitions (by default this is 2). It uses [CombineFileInputFormat](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.html) internally in order to process large numbers of small files effectively by grouping files on the same executor into a single partition. This can lead to sub-optimal partitioning when the file sets would benefit from residing in multiple partitions (e.g., larger partitions would not fit in memory, files are replicated but a large subset is locally reachable from a single executor, subsequent transformations would benefit from multi-core processing). In those cases, set the `minPartitions` argument to enforce splitting.

    --- End diff ---

Yes, it is different in what the elements are. However, there is no indication that the partitioning policy differs that much. I always understood `textFile`'s partitioning policy as "we will split it if we can", down to individual blocks. `wholeTextFiles`' partitioning seems to be more like "we will merge it if we can", up to the executor boundary. The polar opposite.
This manifests in the way the developer has to think about handling partitions. With `textFile` it is generally safe to let Spark figure it all out without sacrificing much performance, whereas with `wholeTextFiles` you may frequently run into performance problems due to having too few partitions. An alternative solution would be to unify the partitioning policies of the `textFile` and `wholeTextFiles` methods by having the latter also "split if we can", in this case down to individual files (presently achievable by setting `minPartitions` to an arbitrarily large number). Each file would then, by default, have its own partition. However, this approach would be a significant behavioral change that might break existing applications.
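To make the contrast concrete, here is a small sketch of both policies and the `minPartitions` workaround. It assumes a running `SparkContext` named `sc`; the path and the partition count are hypothetical:

```scala
// Assumes an existing SparkContext `sc`; the path and the value 10000 are illustrative.

// textFile: "split if we can" — partitioning follows the input splits,
// so a directory of many small files yields at least one partition per file.
val lines = sc.textFile("hdfs:///data/small-files/")
println(lines.getNumPartitions)

// wholeTextFiles: "merge if we can" — CombineFileInputFormat groups small
// files reachable from the same executor, so the default partition count
// can stay at the minimum (2) regardless of how many files there are.
val files = sc.wholeTextFiles("hdfs:///data/small-files/")
println(files.getNumPartitions)

// Workaround from the doc change above: pass a large minPartitions to
// enforce splitting, approaching one partition per file.
val split = sc.wholeTextFiles("hdfs:///data/small-files/", minPartitions = 10000)
println(split.getNumPartitions)
```

Downstream transformations on `split` would then parallelize across cores, which is the scenario the proposed doc note is warning about.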