Github user michalsenkyr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16157#discussion_r90975179

    --- Diff: docs/programming-guide.md ---
    @@ -347,7 +347,7 @@ Some notes on reading files with Spark:

     Apart from text files, Spark's Scala API also supports several other data formats:

    -* `SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file.
    +* `SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file. It takes an optional second argument for controlling the minimal number of partitions (by default this is 2). It uses [CombineFileInputFormat](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.html) internally in order to process large numbers of small files effectively by grouping files on the same node into a single split. (This can lead to non-optimal partitioning. It is therefore advisable to set the minimal number of partitions explicitly.)
    --- End diff --

    Yes, you are right. A 'node' should probably be an 'executor'. Also, 'split' should probably be replaced by 'partition' to prevent confusion. I used Hadoop terminology because it appears in the linked Hadoop documentation, but Spark terminology may be more appropriate here.

    As I understand it, this behavior is guaranteed, since setting minPartitions directly sets maxSplitSize by dividing the total length of the files ([code](https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala#L57)).
    Please note that [minSplitSize](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.html#setMinSplitSizeNode-long-) is only used for leftover data. As the programming guide already contains quite a detailed description of partitioning for `textFile`, I thought it would make sense to mention it for `wholeTextFiles` as well, especially since the behavior differs so much.
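    To illustrate what I mean, here is a rough sketch (not the actual Spark source) of how the linked code derives `maxSplitSize` from `minPartitions`: the combined byte length of all input files is divided by the requested minimum number of partitions, so each combined split can hold at most that many bytes. The names `maxSplitSize` and `totalLen` follow the linked file; everything else here is illustrative.

    ```scala
    // Sketch of the maxSplitSize computation in WholeTextFileInputFormat.setMinPartitions.
    // totalLen is assumed to be the combined byte length of all files in the directory.
    object MaxSplitSizeSketch {
      def maxSplitSize(totalLen: Long, minPartitions: Int): Long = {
        // Guard against division by zero, then round up so that
        // totalLen / maxSplitSize never exceeds minPartitions splits' worth of data.
        val parts = if (minPartitions == 0) 1 else minPartitions
        math.ceil(totalLen * 1.0 / parts).toLong
      }

      def main(args: Array[String]): Unit = {
        // 1000 bytes of small files, minPartitions = 4 => at most 250 bytes per split.
        println(maxSplitSize(1000L, 4))
        // 1000 bytes, minPartitions = 3 => ceil(333.3...) = 334 bytes per split.
        println(maxSplitSize(1000L, 3))
      }
    }
    ```

    So a larger `minPartitions` shrinks the per-split byte budget, which is why setting it explicitly works around the coarse default grouping.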