[jira] [Updated] (SPARK-16121) ListingFileCatalog does not list in parallel anymore

2016-06-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16121:
--
Fix Version/s: (was: 2.0.0)
   2.0.1
   2.1.0

> ListingFileCatalog does not list in parallel anymore
> 
>
> Key: SPARK-16121
> URL: https://issues.apache.org/jira/browse/SPARK-16121
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 2.1.0, 2.0.1
>
>
> In ListingFileCatalog, the implementation of {{listLeafFiles}} is shown 
> below. When the number of user-provided paths is less than the value of 
> {{sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold}}, we 
> do not use parallel listing at all. This differs from Spark 1.6: there, if the 
> number of children of any inner directory exceeded the threshold, we switched 
> to parallel listing.
> {code}
> protected def listLeafFiles(paths: Seq[Path]): mutable.LinkedHashSet[FileStatus] = {
>   if (paths.length >= sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) {
>     HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, sparkSession)
>   } else {
>     // Dummy JobConf to get to the pathFilter defined in the configuration
>     val jobConf = new JobConf(hadoopConf, this.getClass)
>     val pathFilter = FileInputFormat.getInputPathFilter(jobConf)
>     val statuses: Seq[FileStatus] = paths.flatMap { path =>
>       val fs = path.getFileSystem(hadoopConf)
>       logInfo(s"Listing $path on driver")
>       Try {
>         HadoopFsRelation.listLeafFiles(fs, fs.getFileStatus(path), pathFilter)
>       }.getOrElse(Array.empty[FileStatus])
>     }
>     mutable.LinkedHashSet(statuses: _*)
>   }
> }
> {code}
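
For comparison, below is a minimal, hypothetical sketch (not the actual 1.6 code) of a listing routine whose parallelism trigger depends on the fan-out observed while walking the directories, rather than on {{paths.length}} alone. The names {{listLeafFilesSketch}}, {{threshold}}, and {{listInParallel}} are stand-ins introduced here for illustration: {{threshold}} plays the role of {{parallelPartitionDiscoveryThreshold}} and {{listInParallel}} the role of the distributed listing job.
{code}
import scala.collection.mutable
import scala.util.Try

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

object ListingSketch {
  // Hypothetical sketch only: `threshold` stands in for
  // parallelPartitionDiscoveryThreshold, and `listInParallel` for the
  // distributed listing job (e.g. listLeafFilesInParallel).
  def listLeafFilesSketch(
      paths: Seq[Path],
      hadoopConf: Configuration,
      threshold: Int)(
      listInParallel: Seq[Path] => Seq[FileStatus]): mutable.LinkedHashSet[FileStatus] = {

    // List the immediate children of every input path on the driver.
    val children: Seq[FileStatus] = paths.flatMap { path =>
      val fs = path.getFileSystem(hadoopConf)
      Try(fs.listStatus(path).toSeq).getOrElse(Seq.empty[FileStatus])
    }

    val (dirs, files) = children.partition(_.isDirectory)

    val statuses: Seq[FileStatus] =
      if (dirs.length >= threshold) {
        // 1.6-style trigger: a wide directory level switches to the parallel job.
        files ++ listInParallel(dirs.map(_.getPath))
      } else {
        // Narrow level: keep recursing on the driver.
        files ++ dirs.flatMap { d =>
          listLeafFilesSketch(Seq(d.getPath), hadoopConf, threshold)(listInParallel)
        }
      }

    mutable.LinkedHashSet(statuses: _*)
  }
}
{code}
The sketch is only meant to illustrate the difference in trigger condition: 1.6 reacted to wide inner directories, while the 2.0 code above counts only the user-provided paths. In Spark itself the threshold maps to the {{spark.sql.sources.parallelPartitionDiscovery.threshold}} setting.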



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16121) ListingFileCatalog does not list in parallel anymore

2016-06-22 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-16121:
---
Assignee: Yin Huai

> ListingFileCatalog does not list in parallel anymore
> 
>
> Key: SPARK-16121
> URL: https://issues.apache.org/jira/browse/SPARK-16121
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 2.0.0
>


