On 14 Feb 2018, at 13:51, Tayyebi, Ameen 
<tayye...@amazon.com<mailto:tayye...@amazon.com>> wrote:

Thanks a lot Steve. I’ll go through the Jira’s you linked in detail. I took a 
quick look and am sufficiently scared for now. I had run into that warning from 
the S3 stream before. Sigh.


Things like that are trouble because they don't get picked up in automated test runs (or they do, but unless you look at the console output you don't see them). It's why, for a Hadoop release, we generally fix the version at least 4-6 weeks before the release and play through the command line, downstream tests, etc.

The other trouble spot is changes in performance & scale which don't show up in the smaller tests that check functionality ("can I distcp 5 directories?"), but kick in when the problem is "can I use distcp to back up 4 PB of data without too many DELETE calls being throttled?". If you look at HADOOP-15209/HADOOP-15191, the tests instrument the S3A client to count the number of delete calls made, so I can make assertions that an LRU cache of recently deleted paths actually works, using the test mechanisms covered in 
http://steveloughran.blogspot.co.uk/2016/04/distributed-testing-making-use-of.html
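The real instrumentation lives in the S3A client itself; a minimal sketch of the counting idea, with all names hypothetical and a plain set standing in for the LRU cache, might look like:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: a counter around the delete path plus a cache of
// recently deleted keys, so a test can assert on the number of
// real DELETE calls issued. None of these names are the real S3A API.
public class DeleteCountSketch {
    static final AtomicLong deleteCalls = new AtomicLong();
    static final Set<String> recentlyDeleted = new HashSet<>();

    static void delete(String path) {
        if (recentlyDeleted.contains(path)) {
            return; // cache hit: skip the redundant DELETE call
        }
        deleteCalls.incrementAndGet(); // a real client would issue the DELETE here
        recentlyDeleted.add(path);
    }

    public static void main(String[] args) {
        delete("/a");
        delete("/b");
        delete("/a"); // third call is a cache hit, no DELETE issued
        // the kind of assertion the tests make: only two real DELETEs
        if (deleteCalls.get() != 2) {
            throw new AssertionError("expected 2 DELETE calls, got " + deleteCalls.get());
        }
        System.out.println("DELETE calls: " + deleteCalls.get());
    }
}
```

The point is that the assertion runs against an operation count, not wall-clock time, so it stays stable across machines and network conditions.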

Other than that, it's sitting in front of the screen watching a test run Spark ORC/Parquet jobs using S3 as source & destination, and noticing when it takes much, much longer than you'd expect: that's a cue of a regression. Or that I'm just accidentally using an object store on a different continent from normal.

It'd be interesting to consider what you could do with the scalatest/JUnit test runners to actually catch performance regressions here: have something take the Ant-format XML reports & convert them to something where you can diff performance over time, and so build up your own local model of how long things should take. An interesting project for someone.
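A rough sketch of that idea, assuming the standard Ant/surefire report layout where each `<testcase>` carries `name` and `time` attributes (the inline XML samples here are made up; a real tool would read the report files from two runs):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.util.HashMap;
import java.util.Map;

// Sketch: pull per-test durations out of Ant-format JUnit XML reports
// and diff two runs, flagging anything that got dramatically slower.
public class PerfDiffSketch {

    static Map<String, Double> parseTimes(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        Map<String, Double> times = new HashMap<>();
        NodeList cases = doc.getElementsByTagName("testcase");
        for (int i = 0; i < cases.getLength(); i++) {
            Element e = (Element) cases.item(i);
            times.put(e.getAttribute("name"),
                      Double.parseDouble(e.getAttribute("time")));
        }
        return times;
    }

    public static void main(String[] args) throws Exception {
        // two fabricated report snippets standing in for baseline/current runs
        String baseline = "<testsuite><testcase name='testDistcp' time='12.0'/></testsuite>";
        String current  = "<testsuite><testcase name='testDistcp' time='30.0'/></testsuite>";
        Map<String, Double> before = parseTimes(baseline);
        Map<String, Double> after = parseTimes(current);
        for (String name : after.keySet()) {
            Double old = before.get(name);
            if (old == null) {
                continue; // new test, nothing to compare against
            }
            if (after.get(name) / old > 2.0) { // flag anything over 2x slower
                System.out.println("REGRESSION " + name + " "
                    + old + "s -> " + after.get(name) + "s");
            }
        }
    }
}
```

Accumulating those parsed times per test over many runs is the "local model" part: the threshold then comes from your own history rather than a fixed constant.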

FWIW, Hadoop 3.1 is down for v 1.11.271 (shaded), which does have your stuff in; so far all is good.

-Steve
