[ https://issues.apache.org/jira/browse/SPARK-16121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15496234#comment-15496234 ]
Gaurav Shah commented on SPARK-16121:
-------------------------------------

[~mengxr] Was this fixed in 2.0.0, or is it planned for 2.0.1? My partition discovery takes about 10 minutes, and I expect this fix would address that.

> ListingFileCatalog does not list in parallel anymore
> ----------------------------------------------------
>
>                 Key: SPARK-16121
>                 URL: https://issues.apache.org/jira/browse/SPARK-16121
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Yin Huai
>            Assignee: Yin Huai
>            Priority: Blocker
>             Fix For: 2.0.0
>
>
> In ListingFileCatalog, the implementation of {{listLeafFiles}} is shown below. When the number of user-provided paths is less than the value of {{sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold}}, we do not use parallel listing. This differs from 1.6, where parallel listing kicked in whenever the number of children of any inner directory exceeded the threshold.
> {code}
>   protected def listLeafFiles(paths: Seq[Path]): mutable.LinkedHashSet[FileStatus] = {
>     if (paths.length >= sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) {
>       HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, sparkSession)
>     } else {
>       // Dummy jobconf to get to the pathFilter defined in configuration
>       val jobConf = new JobConf(hadoopConf, this.getClass)
>       val pathFilter = FileInputFormat.getInputPathFilter(jobConf)
>       val statuses: Seq[FileStatus] = paths.flatMap { path =>
>         val fs = path.getFileSystem(hadoopConf)
>         logInfo(s"Listing $path on driver")
>         Try {
>           HadoopFsRelation.listLeafFiles(fs, fs.getFileStatus(path), pathFilter)
>         }.getOrElse(Array.empty[FileStatus])
>       }
>       mutable.LinkedHashSet(statuses: _*)
>     }
>   }
> {code}
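Since the issue turns on *where* the threshold comparison happens, here is a rough, hypothetical sketch (not the actual 1.6 code) of the per-directory decision the description refers to: recurse on the driver, but hand any directory with at least {{threshold}} children off to a cluster-side lister. {{listLeafFiles16Style}} and the {{listInParallel}} parameter are made-up names; the latter stands in for {{HadoopFsRelation.listLeafFilesInParallel}}.

{code}
import scala.collection.mutable

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

// Hypothetical sketch of the 1.6-style decision described above: list each
// directory on the driver, but switch to a parallel (cluster-side) listing
// as soon as any single directory has at least `threshold` children.
def listLeafFiles16Style(
    paths: Seq[Path],
    hadoopConf: Configuration,
    threshold: Int,
    listInParallel: Seq[Path] => Seq[FileStatus]): mutable.LinkedHashSet[FileStatus] = {
  val leaves = paths.flatMap { path =>
    val fs = path.getFileSystem(hadoopConf)
    val children = fs.listStatus(path)
    if (children.length >= threshold) {
      // This directory is wide enough to justify distributing the listing
      // of its whole subtree across the cluster.
      listInParallel(Seq(path))
    } else {
      // Keep the plain files and recurse into the (few) child directories.
      val (dirs, files) = children.partition(_.isDirectory)
      files ++ listLeafFiles16Style(
        dirs.map(_.getPath).toSeq, hadoopConf, threshold, listInParallel)
    }
  }
  mutable.LinkedHashSet(leaves: _*)
}
{code}

Until a fix ships, lowering {{spark.sql.sources.parallelPartitionDiscovery.threshold}} (e.g. to 1) should force the parallel branch in the 2.0.0 code above even for a short list of input paths; I have not verified this, so treat it as a guess rather than a confirmed workaround.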