Repository: spark
Updated Branches:
  refs/heads/master d5ab42ceb -> 0ee5419b6


[SPARK-14970][SQL] Prevent DataSource from enumerating all files in a directory 
if there is a user-specified schema

## What changes were proposed in this pull request?
The FileCatalog object gets created even if the user specifies a schema, which 
means files in the directory are enumerated even though it's not necessary. For 
large directories this is very slow. Users would want to specify a schema in such 
scenarios of large dirs, and this defeats the purpose quite a bit.

## How was this patch tested?
Hard to test this with a unit test.

Author: Tathagata Das <tathagata.das1...@gmail.com>

Closes #12748 from tdas/SPARK-14970.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0ee5419b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0ee5419b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0ee5419b

Branch: refs/heads/master
Commit: 0ee5419b6ce535c714718d0d33b80eedd4b0a5fd
Parents: d5ab42c
Author: Tathagata Das <tathagata.das1...@gmail.com>
Authored: Thu Apr 28 12:59:08 2016 -0700
Committer: Michael Armbrust <mich...@databricks.com>
Committed: Thu Apr 28 12:59:08 2016 -0700

----------------------------------------------------------------------
 .../sql/execution/datasources/DataSource.scala   | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/0ee5419b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
----------------------------------------------------------------------
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
index 2f3826f..63dc1fd 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
@@ -127,17 +127,16 @@ case class DataSource(
   }
 
   private def inferFileFormatSchema(format: FileFormat): StructType = {
-    val caseInsensitiveOptions = new CaseInsensitiveMap(options)
-    val allPaths = caseInsensitiveOptions.get("path")
-    val globbedPaths = allPaths.toSeq.flatMap { path =>
-      val hdfsPath = new Path(path)
-      val fs = 
hdfsPath.getFileSystem(sparkSession.sessionState.newHadoopConf())
-      val qualified = hdfsPath.makeQualified(fs.getUri, fs.getWorkingDirectory)
-      SparkHadoopUtil.get.globPathIfNecessary(qualified)
-    }.toArray
-
-    val fileCatalog: FileCatalog = new HDFSFileCatalog(sparkSession, options, 
globbedPaths, None)
     userSpecifiedSchema.orElse {
+      val caseInsensitiveOptions = new CaseInsensitiveMap(options)
+      val allPaths = caseInsensitiveOptions.get("path")
+      val globbedPaths = allPaths.toSeq.flatMap { path =>
+        val hdfsPath = new Path(path)
+        val fs = 
hdfsPath.getFileSystem(sparkSession.sessionState.newHadoopConf())
+        val qualified = hdfsPath.makeQualified(fs.getUri, 
fs.getWorkingDirectory)
+        SparkHadoopUtil.get.globPathIfNecessary(qualified)
+      }.toArray
+      val fileCatalog: FileCatalog = new HDFSFileCatalog(sparkSession, 
options, globbedPaths, None)
       format.inferSchema(
         sparkSession,
         caseInsensitiveOptions,


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

Reply via email to