Re: Leveraging S3 select

Steve Loughran Wed, 13 Dec 2017 04:01:11 -0800


On 8 Dec 2017, at 17:05, Andrew Duffy 
<adu...@palantir.com<mailto:adu...@palantir.com>> wrote:


Hey Steve,

Happen to have a link to the TPC-DS benchmark data w/random S3 reads? I've done 
a decent amount of digging, but all I've found is a reference in a slide deck

Is that one of mine?

We haven't done any benchmarking with/without random IO for a while, as we've 
taken that as a given and worrying about the other aspects of the problem: 
speeding up directory listings & getFileStatus calls (used a lot in the 
serialized partitioning phase), and making direct commits of work to S3 both 
correct and performant.

I'm trying to sort out some benchmarking there, which involves: cherry picking 
the new changes to an internal release, building that, having someone who 
understands benchmarking set up a cluster and run the tests, which involves 
their time and the cost of the clusters. I say clusters as it'll inevitably 
involve playing with different VM options and some EMR clusters alongside(*).

One bit of fun there becomes the fact that different instances of the same 
cluster specs may give different numbers; it depends on actual CPUs allocated, 
network, neighbours. When we do publish some numbers, we do it from a single 
cluster instance, rather than doing "best per-test outcome on multiple 
clusters". Good to check if others do the same.

Otherwise: test with your own code & the Hadoop 2.8.1+ JARs; see what numbers 
you get. If you are using Parquet or ORC, I would not consider using the 
sequential IO code. At the same time, if you are working with CSV, Avro, gzip, 
you don't want to use it, because what would be a single file GET with some 
forward skips of read & discard of data is now a slow sequence of GETs with 
latency between each one.
HADOOP-14965<https://issues.apache.org/jira/browse/HADOOP-14965> (not yet 
committed) changes the default policy of an input stream to "switch to random 
IO mode on the first backwards seek", so you don't need to decide upfront that 
to use. There's the potential cost of the first HTTPS abort on that initial 
backwards seek, but after, random IO all the way. the Wasb client has been 
doing this for a while and everyone is happy, not least because its one less 
tuning option to document & test, and eliminates a whole class of support calls 
"client is fast to read .csv but not .orc files".

-Steve


(*) I have a hadoop branch-2.9 fork with the new committer stuff in if someone 
wants to compare numbers there. Bear in mind that the current 
RDD.write(s3a://something) command, when it uses the Hadoop FileOutputFormats 
and hence the FileOutputCommitter is not just observably a slow O(data) kind of 
operation, it is *not correct*, so the performance is just a detail. It's the 
one you notice, but not the issue to fear. Fixed by HADOOP-13786 & a bit of 
glue to keep Spark happy

Re: Leveraging S3 select

Reply via email to