[jira] [Updated] (SPARK-16121) ListingFileCatalog does not list in parallel anymore
[ https://issues.apache.org/jira/browse/SPARK-16121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-16121:
----------------------------------
    Fix Version/s:     (was: 2.0.0)
                   2.0.1
                   2.1.0

> ListingFileCatalog does not list in parallel anymore
> ----------------------------------------------------
>
>                 Key: SPARK-16121
>                 URL: https://issues.apache.org/jira/browse/SPARK-16121
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Yin Huai
>            Assignee: Yin Huai
>            Priority: Blocker
>             Fix For: 2.1.0, 2.0.1
>
>
> In ListingFileCatalog, the implementation of {{listLeafFiles}} is shown
> below. When the number of user-provided paths is less than the value of
> {{sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold}}, we
> do not use parallel listing. This differs from 1.6: there, if the number of
> children of any inner directory was larger than the threshold, we used
> parallel listing.
> {code}
> protected def listLeafFiles(paths: Seq[Path]): mutable.LinkedHashSet[FileStatus] = {
>   if (paths.length >= sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) {
>     HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, sparkSession)
>   } else {
>     // Dummy JobConf to get to the pathFilter defined in configuration
>     val jobConf = new JobConf(hadoopConf, this.getClass)
>     val pathFilter = FileInputFormat.getInputPathFilter(jobConf)
>     val statuses: Seq[FileStatus] = paths.flatMap { path =>
>       val fs = path.getFileSystem(hadoopConf)
>       logInfo(s"Listing $path on driver")
>       Try {
>         HadoopFsRelation.listLeafFiles(fs, fs.getFileStatus(path), pathFilter)
>       }.getOrElse(Array.empty[FileStatus])
>     }
>     mutable.LinkedHashSet(statuses: _*)
>   }
> }
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
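The behavioral difference the report describes can be sketched in plain Scala. This is a simplified model, not Spark's actual API: `Node`, `File`, `Dir`, the `listInParallel` callback, and both function names are illustrative assumptions. The 1.6-style walk checks the threshold at every directory level, so a single input path whose directory has many children can still trigger parallel listing; the 2.0-style walk checks it only once, against the user-provided top-level paths.

```scala
// Simplified model of a file tree: a node is either a file or a directory.
sealed trait Node { def name: String }
case class File(name: String) extends Node
case class Dir(name: String, children: Seq[Node]) extends Node

// Serial listing on the driver: recurse and collect leaf files.
def serialList(paths: Seq[Node]): Seq[File] = paths.flatMap {
  case f: File          => Seq(f)
  case Dir(_, children) => serialList(children)
}

// 1.6-style (sketch): the threshold is re-checked at every directory,
// so a wide directory anywhere in the tree fans out to `listInParallel`
// (standing in for HadoopFsRelation.listLeafFilesInParallel).
def listLeafFiles16(paths: Seq[Node], threshold: Int,
                    listInParallel: Seq[Node] => Seq[File]): Seq[File] =
  paths.flatMap {
    case f: File => Seq(f)
    case Dir(_, children) if children.length >= threshold =>
      listInParallel(children)                              // fan out
    case Dir(_, children) =>
      listLeafFiles16(children, threshold, listInParallel)  // keep walking
  }

// 2.0-style (sketch): the threshold is checked once, against the
// top-level paths only; one path with millions of children stays serial.
def listLeafFiles20(paths: Seq[Node], threshold: Int,
                    listInParallel: Seq[Node] => Seq[File]): Seq[File] =
  if (paths.length >= threshold) listInParallel(paths)
  else serialList(paths)
```

With a single top-level path over a ten-child directory and a threshold of 3, the 1.6-style sketch invokes the parallel path while the 2.0-style sketch never does, which is the regression the issue reports.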
[jira] [Updated] (SPARK-16121) ListingFileCatalog does not list in parallel anymore
[ https://issues.apache.org/jira/browse/SPARK-16121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated SPARK-16121:
-------------------------------
    Assignee: Yin Huai