[GitHub] spark pull request #17745: [SPARK-17159][Streaming] optimise check for new f...

ScrapCodes Thu, 23 Aug 2018 04:03:36 -0700

Github user ScrapCodes commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17745#discussion_r212267619
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala ---
    @@ -196,29 +191,29 @@ class FileInputDStream[K, V, F <: NewInputFormat[K, V]](
           logDebug(s"Getting new files for time $currentTime, " +
             s"ignoring files older than $modTimeIgnoreThreshold")
     
    -      val newFileFilter = new PathFilter {
    -        def accept(path: Path): Boolean = isNewFile(path, currentTime, modTimeIgnoreThreshold)
    -      }
    -      val directoryFilter = new PathFilter {
    -        override def accept(path: Path): Boolean = fs.getFileStatus(path).isDirectory
    -      }
    -      val directories = fs.globStatus(directoryPath, directoryFilter).map(_.getPath)
    +      val directories = Option(fs.globStatus(directoryPath)).getOrElse(Array.empty[FileStatus])
    --- End diff --
    
    Looking at the code of globStatus, it applies the filter only at the end, so doing something like the above might be just fine.
    
    Also, globStatus does a listStatus() per child directory, or a getFileStatus() if the input pattern is not a glob. Each call to listStatus makes 3+ HTTP calls, and each call to getFileStatus makes 2 HTTP calls.
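
To make the point concrete, here is a minimal sketch of the idea (not the code from the PR): take the FileStatus objects that globStatus already returns and filter on them locally, instead of passing a PathFilter that calls getFileStatus for every path. The object name GlobDirectories and the helper matchingDirectories are hypothetical, and the main method uses the local file system purely for illustration.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Hypothetical helper for illustration; these names are not from the PR.
object GlobDirectories {

  /** Directories matching `pattern`, filtered from the statuses globStatus returns. */
  def matchingDirectories(fs: FileSystem, pattern: Path): Array[Path] = {
    // globStatus may return null when a non-glob path does not exist,
    // so wrap the result in Option and fall back to an empty array.
    val statuses = Option(fs.globStatus(pattern)).getOrElse(Array.empty[FileStatus])
    // isDirectory is read from the status already in hand, so this filter
    // does not issue another getFileStatus call per path.
    statuses.filter(_.isDirectory).map(_.getPath)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local file system and a default pattern, for demonstration only.
    val fs = FileSystem.getLocal(new Configuration())
    val pattern = new Path(args.headOption.getOrElse("/tmp/*"))
    matchingDirectories(fs, pattern).foreach(println)
  }
}
```

Against an object store, the difference is the per-path getFileStatus call that the removed directoryFilter made on top of whatever globStatus itself does.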



