Steve Loughran created SPARK-17159:
--------------------------------------

             Summary: Improve FileInputDStream.findNewFiles list performance
                 Key: SPARK-17159
                 URL: https://issues.apache.org/jira/browse/SPARK-17159
             Project: Spark
          Issue Type: Improvement
          Components: Streaming
    Affects Versions: 2.0.0
         Environment: spark against object stores
            Reporter: Steve Loughran
            Priority: Minor


{{FileInputDStream.findNewFiles()}} does a globStatus() with a filter that 
calls getFileStatus() on every file, then takes the output and calls 
listStatus() on each path in it.

This is going to suffer on object stores, where directory listings and 
getFileStatus() calls are so expensive. It's clear this is already a problem, 
as the method has code to detect timeouts in the window and warn of problems.
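
To make the cost concrete, here is a rough, self-contained sketch that 
counts simulated remote calls for the pattern described above, next to a 
variant that reuses the status objects a listing already returns. It is not 
the Spark or Hadoop code: CountingStore, FileStatus, make_demo_store and both 
find_* functions are hypothetical names invented for this illustration, and 
the model assumes every getFileStatus() or listStatus() costs exactly one 
request.

```python
from dataclasses import dataclass


@dataclass
class FileStatus:
    path: str
    is_dir: bool
    mod_time: int


@dataclass
class CountingStore:
    """Hypothetical in-memory stand-in for an object store client.

    Assumption: each method call below equals one remote request.
    """
    entries: dict  # path -> FileStatus
    calls: int = 0

    def get_file_status(self, path):
        self.calls += 1
        return self.entries[path]

    def list_status(self, directory):
        self.calls += 1
        prefix = directory.rstrip("/") + "/"
        # Direct children only: nothing after the prefix may contain "/".
        return [s for p, s in self.entries.items()
                if p.startswith(prefix) and "/" not in p[len(prefix):]]


def find_new_files_probing(store, root, threshold):
    """Glob-like scan whose filter probes every match, then lists each dir."""
    matches = store.list_status(root)
    # The filter issues one extra status request per matched path.
    dirs = [s.path for s in matches if store.get_file_status(s.path).is_dir]
    found = []
    for d in dirs:
        found += [s.path for s in store.list_status(d)
                  if not s.is_dir and s.mod_time >= threshold]
    return sorted(found)


def find_new_files_reusing(store, root, threshold):
    """Same result, but trusts the statuses the listing already returned."""
    dirs = [s.path for s in store.list_status(root) if s.is_dir]
    found = []
    for d in dirs:
        found += [s.path for s in store.list_status(d)
                  if not s.is_dir and s.mod_time >= threshold]
    return sorted(found)


def make_demo_store():
    """Two directories with two files each, plus one stray file in the root."""
    items = [
        FileStatus("/in/a", True, 0),
        FileStatus("/in/b", True, 0),
        FileStatus("/in/stray.txt", False, 400),
        FileStatus("/in/a/1.txt", False, 100),
        FileStatus("/in/a/2.txt", False, 200),
        FileStatus("/in/b/3.txt", False, 50),
        FileStatus("/in/b/4.txt", False, 300),
    ]
    return CountingStore({s.path: s for s in items})
```

On this toy layout the probing variant issues 6 simulated requests and the 
reusing variant 3, for the same result. The extra requests grow with the 
number of paths the glob matches, which is where an object store would hurt 
most.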

It should be possible to make this faster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

