Github user steveloughran commented on the pull request:

    https://github.com/apache/spark/pull/8512#issuecomment-162974033
  
    Has anyone compared the performance of this with s3a in Hadoop 2.7+? 
While I do agree this will dramatically improve s3n: and s3: performance, all 
ongoing Hadoop work is on the s3a FS, with s3n left alone on the grounds that 
every jets3t upgrade or other change breaks things. s3a does use `ListRequest`, 
and I'd expect it not only to list faster but to read faster too.
    
    That doesn't mean this patch won't be useful: if anyone still uses s3:, 
it'll be essential (there's no maintenance going on there), and the code here 
will also benefit Hadoop <= 2.6. It's just that for 2.7+ I would say "use s3a 
and be done with it". That said, there's a lot of work on s3a that remains to 
be looked at, especially around lazy seeks.
    
    What could be very useful for the Hadoop team here is some tests for Spark 
using S3, so as to catch regressions in functionality, performance, and scale:
    
    1. Measure ls() performance. Maybe we can find or get someone to create 
an S3 store pre-populated with many files.
    2. Look at the costs of read + seek + close on big files. 
[HADOOP-12376](https://issues.apache.org/jira/browse/HADOOP-12376) turned out 
to be a surprise there: if you close() a multi-GB file 3 bytes in, that close() 
still completes. Again, having some public reference files would aid testing 
here.
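
    As a rough illustration of the two measurements above, here is a minimal timing sketch. It is only an assumption about what such a test might look like: it uses the local filesystem as a stand-in for an S3 store, and the function names `time_listing` and `time_open_seek_read_close` are hypothetical. A real regression test would go through the Hadoop FileSystem API against s3a or s3n URLs instead.

```python
import os
import shutil
import tempfile
import time


def time_listing(path):
    """Return (entry_count, seconds) for one flat listing of `path`."""
    start = time.perf_counter()
    count = len(os.listdir(path))
    return count, time.perf_counter() - start


def time_open_seek_read_close(path, offset, nbytes):
    """Time open, seek, read and close separately.

    Returns (bytes_read, timings), where timings maps each phase name
    to its duration in seconds. Timing close() on its own is the point:
    it is the phase that was the surprise in HADOOP-12376.
    """
    t0 = time.perf_counter()
    f = open(path, "rb")
    t1 = time.perf_counter()
    f.seek(offset)
    t2 = time.perf_counter()
    data = f.read(nbytes)
    t3 = time.perf_counter()
    f.close()
    t4 = time.perf_counter()
    timings = {
        "open": t1 - t0,
        "seek": t2 - t1,
        "read": t3 - t2,
        "close": t4 - t3,
    }
    return len(data), timings


if __name__ == "__main__":
    # Local stand-in for "a store pre-populated with many files".
    workdir = tempfile.mkdtemp()
    try:
        for i in range(1000):
            with open(os.path.join(workdir, "part-%05d" % i), "wb") as out:
                out.write(b"x")
        count, seconds = time_listing(workdir)
        print("listed %d entries in %.6f s" % (count, seconds))

        # Local stand-in for a big reference file: read 3 bytes, then close.
        big = os.path.join(workdir, "big.bin")
        with open(big, "wb") as out:
            out.write(b"\0" * (8 * 1024 * 1024))
        nread, timings = time_open_seek_read_close(big, offset=0, nbytes=3)
        print("read %d bytes; phase timings: %r" % (nread, timings))
    finally:
        shutil.rmtree(workdir)
```

    Reporting each phase separately, rather than one total, is what would make a slow close() after a short read show up as a regression.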


