Steve Loughran created SPARK-17159:
--------------------------------------

             Summary: Improve FileInputDStream.findNewFiles list performance
                 Key: SPARK-17159
                 URL: https://issues.apache.org/jira/browse/SPARK-17159
             Project: Spark
          Issue Type: Improvement
          Components: Streaming
    Affects Versions: 2.0.0
         Environment: spark against object stores
            Reporter: Steve Loughran
            Priority: Minor

{{FileInputDStream.findNewFiles()}} does a globStatus() with a filter that calls getFileStatus() on every file, then takes the output and calls listStatus() on it. This is going to suffer on object stores, where directory listings and getFileStatus() calls are expensive. It's clear this is already a problem, as the method has code to detect timeouts in the window and warn of problems.

It should be possible to make this faster.
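
To make the cost concrete, here is a rough sketch of the call pattern described above, written against a toy in-memory store rather than the real Hadoop FileSystem API. Everything in it is an assumption made for illustration: the names (Status, CountingStore, find_new_files_per_path, find_new_files_reusing_status), the rule that each glob, listing, or status lookup costs exactly one round trip, and the sample paths and timestamps. The second function shows one possible direction, reusing the status records the glob already returned. It is not necessarily the change that should go into Spark.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class Status:
    """Toy status record (hypothetical; not Hadoop's FileStatus)."""
    path: str
    is_dir: bool
    mtime: int


class CountingStore:
    """In-memory stand-in for an object store that counts simulated round trips."""

    def __init__(self, entries):
        self._entries = {e.path: e for e in entries}
        self.calls = 0

    def _children(self, parent):
        # Direct children only: nothing deeper than one path segment below parent.
        prefix = parent.rstrip("/") + "/"
        return [e for p, e in self._entries.items()
                if p.startswith(prefix) and "/" not in p[len(prefix):]]

    def get_status(self, path):
        self.calls += 1  # assumed: one round trip per status lookup
        return self._entries[path]

    def list_children(self, parent):
        self.calls += 1  # assumed: one round trip per directory listing
        return self._children(parent)

    def glob(self, pattern):
        self.calls += 1  # assumed: a single-level glob costs one listing
        parent, _, name_pattern = pattern.rpartition("/")
        return [e for e in self._children(parent)
                if fnmatch(e.path.rpartition("/")[2], name_pattern)]


def _new_files_in(store, dirs, since):
    # List each matched directory and keep files modified after `since`.
    found = []
    for d in dirs:
        for child in store.list_children(d):
            if not child.is_dir and child.mtime > since:
                found.append(child.path)
    return sorted(found)


def find_new_files_per_path(store, pattern, since):
    """Glob filter re-fetches the status of every matched path."""
    dirs = [s.path for s in store.glob(pattern)
            if store.get_status(s.path).is_dir]
    return _new_files_in(store, dirs, since)


def find_new_files_reusing_status(store, pattern, since):
    """Glob filter reuses the status records the glob already returned."""
    dirs = [s.path for s in store.glob(pattern) if s.is_dir]
    return _new_files_in(store, dirs, since)


def sample_entries():
    # Made-up data: two directories and one plain file under /in.
    return [
        Status("/in/a", True, 0),
        Status("/in/b", True, 0),
        Status("/in/readme", False, 5),
        Status("/in/a/f1", False, 10),
        Status("/in/a/f2", False, 30),
        Status("/in/b/f3", False, 40),
    ]


if __name__ == "__main__":
    for finder in (find_new_files_per_path, find_new_files_reusing_status):
        store = CountingStore(sample_entries())
        files = finder(store, "/in/*", since=20)
        print(finder.__name__, files, "calls:", store.calls)
```

With this made-up data both variants return the same two files, but the first makes 6 simulated calls and the second makes 3. The gap grows with the number of paths the glob matches, since the first variant pays one extra lookup per match. On a real object store each of those lookups is a remote request, which is where the time goes.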