[GitHub] spark pull request #16157: [SPARK-18723][DOC] Expanded programming guide inf...

srowen Mon, 05 Dec 2016 18:39:18 -0800

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16157#discussion_r91002001
  
    --- Diff: docs/programming-guide.md ---
    @@ -347,7 +347,7 @@ Some notes on reading files with Spark:
     
     Apart from text files, Spark's Scala API also supports several other data formats:
     
    -* `SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file.
    +* `SparkContext.wholeTextFiles` lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with `textFile`, which would return one record per line in each file. It takes an optional second argument for controlling the minimal number of partitions (by default this is 2). It uses [CombineFileInputFormat](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.html) internally in order to process large numbers of small files effectively by grouping files on the same node into a single split. (This can lead to non-optimal partitioning. It is therefore advisable to set the minimal number of partitions explicitly.)
    --- End diff --
    
    A few more sentences in the docs here, and possibly in the scaladoc, can't
    hurt. They would call attention to the fact that you may wish to set a
    minimum number of partitions, and why you would do that.
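
    For illustration only, and not part of the PR: a minimal sketch of what
    setting the minimum looks like from the Scala API. The object name, the
    helper `writeSmallFiles`, the temporary directory, and the value 4 are all
    hypothetical; only `wholeTextFiles` and its `minPartitions` parameter come
    from the diff above. It assumes a local Spark installation on the classpath.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

import org.apache.spark.{SparkConf, SparkContext}

object WholeTextFilesSketch {

  // Hypothetical helper: writes n small two-line text files into a fresh temp directory.
  def writeSmallFiles(n: Int): Path = {
    val dir = Files.createTempDirectory("small-files")
    (1 to n).foreach { i =>
      val body = s"first line of file $i\nsecond line of file $i\n"
      Files.write(dir.resolve(s"file-$i.txt"), body.getBytes(StandardCharsets.UTF_8))
    }
    dir
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("wholeTextFiles-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val dir = writeSmallFiles(8).toString

      // Without the second argument, Spark falls back to sc.defaultMinPartitions.
      val byDefault = sc.wholeTextFiles(dir)

      // With the second argument, we ask for at least 4 partitions (a hint, chosen arbitrarily here).
      val withMinimum = sc.wholeTextFiles(dir, minPartitions = 4)

      // Each record is a (filename, content) pair, one per file rather than one per line.
      println(s"records: ${withMinimum.count()}")
      println(s"partitions without a minimum: ${byDefault.getNumPartitions}")
      println(s"partitions with minPartitions = 4: ${withMinimum.getNumPartitions}")
    } finally {
      sc.stop()
    }
  }
}
```

    Printing the partition counts side by side, rather than asserting them, is
    deliberate: the resulting number depends on how files are combined into
    splits, which is exactly the behavior the added doc text should explain.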



