On 18 Nov 2016, at 14:31, Keith Bourgoin <ke...@parsely.com> wrote:

> We thread the file processing to amortize the cost of things like getting
> files from S3.

Define cost here: actual $ amount, or merely time to read the data?

If it's read time, you should really try the new work coming in the
Hadoop 2.8+ s3a client. A lot of effort has gone into higher-performance
reading of ORC & Parquet data, plus general improvements in listing, opening,
etc., to cut down on slow metadata queries. You will still see delays of tens
to hundreds of milliseconds on every HTTP request (more with DNS problems
and/or S3 load balancer overload), but once a file is open, seek + read of S3
data will be much faster. An end-to-end read of a whole S3 file won't speed up,
though; that's just limited by bandwidth after the HTTPS negotiation.
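
For what it's worth, here is a minimal sketch of how the seek-heavy read mode
might be switched on from the Spark side. It assumes your Spark build runs
against Hadoop 2.8+ JARs with the s3a client on the classpath. The property
fs.s3a.experimental.input.fadvise is the s3a setting I have in mind; the
helper name and the spark-submit framing are only illustrative, not something
from your setup.

```python
# Sketch only: assumes Spark running against Hadoop 2.8+ with the s3a client.
S3A_SEEK_HEAVY_CONF = {
    # Ask s3a to favour random access (seek + short reads), which tends to
    # suit columnar formats such as ORC and Parquet. The "spark.hadoop."
    # prefix hands the property through to the Hadoop configuration.
    "spark.hadoop.fs.s3a.experimental.input.fadvise": "random",
}


def to_spark_submit_args(conf):
    """Turn a dict of Spark properties into spark-submit --conf arguments.

    Hypothetical helper, here only to show where the setting would go.
    """
    args = []
    for key in sorted(conf):
        args.extend(["--conf", "{}={}".format(key, conf[key])])
    return args


print(" ".join(to_spark_submit_args(S3A_SEEK_HEAVY_CONF)))
```

Random-access mode can hurt full sequential scans, so treat it as something
to benchmark on your own workload rather than a default.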

http://www.slideshare.net/steve_l/hadoop-hive-spark-and-object-stores

Also, do make sure you are using s3a:// URLs, if you aren't already.
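
If your job takes its input paths from a list or a config file, one way to
check is to normalise them before handing them to Spark. A small sketch,
assuming the older paths use the s3:// or s3n:// schemes of the legacy Hadoop
connectors (on EMR, s3:// means something else, so don't rewrite blindly
there). The function name to_s3a is made up for this example.

```python
# Schemes assumed to belong to the older Hadoop S3 connectors.
LEGACY_SCHEMES = ("s3", "s3n")


def to_s3a(url):
    """Rewrite s3:// or s3n:// URLs to s3a://, leaving anything else untouched."""
    scheme, sep, rest = url.partition("://")
    if sep and scheme in LEGACY_SCHEMES:
        return "s3a://" + rest
    return url


# Bucket and key names below are placeholders.
paths = ["s3n://example-bucket/events/part-0000.parquet", "hdfs://nn/tmp/x"]
print([to_s3a(p) for p in paths])
```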

-Steve
