spark git commit: [SPARK-8437] [DOCS] Using directory path without wildcard for filename slow for large number of files with wholeTextFiles and binaryFiles

andrewor14 Mon, 29 Jun 2015 17:22:22 -0700

Repository: spark
Updated Branches:
  refs/heads/branch-1.4 cdfa388dd -> b2684557f



[SPARK-8437] [DOCS] Using directory path without wildcard for filename slow for 
large number of files with wholeTextFiles and binaryFiles

Note that 'dir/*' can be more efficient in some Hadoop FS implementations than 
'dir/'

Author: Sean Owen <so...@cloudera.com>

Closes #7036 from srowen/SPARK-8437 and squashes the following commits:

0e813ae [Sean Owen] Note that 'dir/*' can be more efficient in some Hadoop FS 
implementations than 'dir/'

(cherry picked from commit 5d30eae56051c563a8427f330b09ef66db0a0d21)
Signed-off-by: Andrew Or <and...@databricks.com>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b2684557
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b2684557
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b2684557

Branch: refs/heads/branch-1.4
Commit: b2684557fa0d2ec14b7529324443c8154d81c348
Parents: cdfa388
Author: Sean Owen <so...@cloudera.com>
Authored: Mon Jun 29 17:21:35 2015 -0700
Committer: Andrew Or <and...@databricks.com>
Committed: Mon Jun 29 17:21:47 2015 -0700

----------------------------------------------------------------------
 core/src/main/scala/org/apache/spark/SparkContext.scala | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/b2684557/core/src/main/scala/org/apache/spark/SparkContext.scala
----------------------------------------------------------------------
diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala b/core/src/main/scala/org/apache/spark/SparkContext.scala
index b4c0d4c..f8af710 100644
--- a/core/src/main/scala/org/apache/spark/SparkContext.scala
+++ b/core/src/main/scala/org/apache/spark/SparkContext.scala
@@ -824,6 +824,8 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
    * }}}
    *
    * @note Small files are preferred, large file is also allowable, but may cause bad performance.
+   * @note On some filesystems, `.../path/*` can be a more efficient way to read all files in a directory
+   *       rather than `.../path/` or `.../path`
    *
    * @param minPartitions A suggestion value of the minimal splitting number for input data.
    */
@@ -871,9 +873,11 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
    *   (a-hdfs-path/part-nnnnn, its content)
    * }}}
    *
-   * @param minPartitions A suggestion value of the minimal splitting number for input data.
-   *
    * @note Small files are preferred; very large files may cause bad performance.
+   * @note On some filesystems, `.../path/*` can be a more efficient way to read all files in a directory
+   *       rather than `.../path/` or `.../path`
+   *
+   * @param minPartitions A suggestion value of the minimal splitting number for input data.
    */
   @Experimental
   def binaryFiles(
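
For illustration only, the note added in this commit could be applied from calling code roughly as sketched below. `DirGlob` and `asGlob` are hypothetical names and are not part of Spark. The helper assumes its argument is a plain directory path that contains no glob pattern of its own, and whether `dir/*` is actually faster than `dir/` depends on the Hadoop FS implementation in use.

```scala
// Hypothetical helper (not part of Spark): turns a directory path into the
// 'dir/*' form that the new @note suggests for some filesystems.
object DirGlob {
  // Assumes `dir` is a plain directory path with no existing glob pattern.
  def asGlob(dir: String): String =
    if (dir.endsWith("/*")) dir          // already in wildcard form
    else if (dir.endsWith("/")) dir + "*" // 'dir/'  becomes 'dir/*'
    else dir + "/*"                       // 'dir'   becomes 'dir/*'
}

// Possible use with an existing SparkContext `sc` (the path is a placeholder):
//   val texts = sc.wholeTextFiles(DirGlob.asGlob("hdfs://a-hdfs-path"))
//   val blobs = sc.binaryFiles(DirGlob.asGlob("hdfs://a-hdfs-path/"))
```

Keeping the conversion in one small function makes it easy to compare both path forms on a given filesystem before settling on either.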


