On 18 Nov 2016, at 14:31, Keith Bourgoin <ke...@parsely.com<mailto:ke...@parsely.com>> wrote:
> We thread the file processing to amortize the cost of things like getting files from S3.

Define "cost" here: an actual $ amount, or merely the time to read the data?

If it's read time, you should really be trying the new stuff coming in the hadoop-2.8+ s3a client, which has put a lot of work into higher-performance reading of ORC & Parquet data, plus general improvements in listing/opening, etc., trying to cut down on slow metadata queries. You are still going to see delays of tens to hundreds of millis on every HTTP request (bigger ones for DNS problems and/or S3 load-balancer overload), but once a file is open, seek + read of S3 data will be much faster (not end-to-end read of an S3 file though; that's just a bandwidth limitation after the HTTPS negotiation).

http://www.slideshare.net/steve_l/hadoop-hive-spark-and-object-stores

Also, do make sure you are using s3a URLs, if you weren't already.

-Steve
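As a minimal sketch of what switching on the Hadoop 2.8+ random-read behaviour can look like: the `fs.s3a.experimental.input.fadvise` property controls the s3a input policy, and setting it to `random` is what helps the seek-heavy ORC/Parquet read pattern. (The property name is from the Hadoop 2.8 s3a docs; treat the snippet as an illustration to adapt, not a drop-in config.)

```xml
<!-- core-site.xml fragment: sketch, assumes a hadoop-2.8+ s3a client -->
<configuration>
  <!-- "random" fadvise mode optimises seek()+read() over whole-file
       streaming, which suits ORC/Parquet column reads -->
  <property>
    <name>fs.s3a.experimental.input.fadvise</name>
    <value>random</value>
  </property>
</configuration>
```

And, as noted above, the paths themselves must use the s3a scheme (e.g. `s3a://mybucket/logs/` rather than an `s3n://` URL) for any of this to take effect; `mybucket` here is just a placeholder.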