[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

steveloughran Tue, 23 Aug 2016 11:04:54 -0700

Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/14731
  
    1. Updated the code to bypass the glob routine when the path has no 
wildcard; this avoids a fairly inefficient code path. 
    2. Changed how a FileNotFoundException (FNFE) on that base dir is 
reported: the stack trace is skipped (maybe: log at a lower level?). 
    3. Updated the docs with a special list of blobstore best practices.
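
A minimal sketch of the first two points, not the code in this PR. It uses Python's standard library on a local filesystem as a stand-in for the object store. The names `has_wildcard`, `resolve_dirs` and `list_files` are hypothetical, the metacharacter set is only what Python's `glob` module understands (Hadoop-style globs also support forms such as `{a,b}`, which this sketch ignores), and logging the missing directory at debug level is just one of the options raised above.

```python
import glob
import logging
import os

log = logging.getLogger("new_file_scan")

# Assumption: only the metacharacters Python's glob module understands.
WILDCARD_CHARS = set("*?[")


def has_wildcard(path):
    """Return True if the path contains any glob metacharacter."""
    return any(ch in WILDCARD_CHARS for ch in path)


def resolve_dirs(path):
    """Return the directories a path refers to, skipping the glob when possible."""
    if has_wildcard(path):
        # Slow path: pattern expansion has to list and match directory entries.
        return sorted(p for p in glob.glob(path) if os.path.isdir(p))
    # Fast path: no wildcard, so the path itself is the only candidate.
    return [path]


def list_files(path):
    """List the files directly under every directory matched by path."""
    found = []
    for directory in resolve_dirs(path):
        try:
            names = os.listdir(directory)
        except FileNotFoundError:
            # A missing base directory is reported in one line, with no stack trace.
            log.debug("Directory not found: %s", directory)
            continue
        found.extend(
            os.path.join(directory, name)
            for name in sorted(names)
            if os.path.isfile(os.path.join(directory, name))
        )
    return found
```

With this shape, a plain path such as `data/in` costs a single directory listing, while `data/*` still goes through pattern expansion first.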
    
    It's a bit hard to get the phrasing of what the wildcard does right; that 
text needs careful review.
    
    Tested using my S3 streaming test, which did use a * wildcard in the path. 
All works, but there is no improvement in speed on what is a fairly unrealistic 
directory structure. Recursively listing object stores remotely is tangibly 
slow. Maybe that should go in the text too: "it can take seconds to scan object 
stores for new data, with the time being proportional to directory depth and 
the number of files in a directory. Shallow and wide directory trees are faster"
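
To illustrate why shallow and wide trees scan faster, here is a rough cost model rather than a measurement. It assumes each directory costs one remote list request, which is a simplification (real stores page their results and have their own latency profile), and it uses a local temporary directory in place of an object store. `make_tree` and `count_listings` are hypothetical helper names.

```python
import os


def make_tree(root, depth, files_per_dir):
    """Create a chain of `depth` nested directories under root, each holding files."""
    current = root
    for level in range(depth):
        current = os.path.join(current, "d%d" % level)
        os.mkdir(current)
        for i in range(files_per_dir):
            open(os.path.join(current, "f%d.txt" % i), "w").close()


def count_listings(root):
    """Return (directories listed, files found) for a recursive scan of root."""
    dirs_listed = 0
    files_found = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        # Assumption: one list request per directory visited.
        dirs_listed += 1
        files_found += len(filenames)
    return dirs_listed, files_found
```

Under this model, eight files spread one per level down a chain of eight directories need nine listings (the root plus eight levels), while the same eight files in a single subdirectory need two.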



