On 14 Feb 2018, at 13:51, Tayyebi, Ameen <tayye...@amazon.com> wrote:

> Thanks a lot Steve. I'll go through the JIRAs you linked in detail. I took a quick look and am sufficiently scared for now.

I had run into that warning from the S3 stream before. Sigh. Things like that are trouble, as they don't get picked up in automated test runs — or they do, but unless you look at the console output, you don't see them. That's why, for a Hadoop release, we generally fix the version at least 4-6 weeks before the release and play with it through the command line, downstream tests, etc.

The other trouble spot is changes in perf & scale, which don't show up in the smaller tests that check functionality — "can I distcp 5 directories" — but kick in when the problem is "can I use distcp to back up 4 PB of data without too many DELETE calls being throttled".

If you look at HADOOP-15209/HADOOP-15191, the tests instrument the S3A client to count the number of DELETE calls made, so I can make assertions that an LRU cache of recently deleted paths actually works, using the test mechanisms covered in http://steveloughran.blogspot.co.uk/2016/04/distributed-testing-making-use-of.html

Other than that: sitting in front of the screen watching a test run Spark ORC/Parquet jobs using S3 as a src & dest, and noticing when it takes much, much longer than you'd expect, is a cue that there's a regression. Or that I'm just accidentally using an object store on a different continent from normal.

It'd be interesting to consider what you can do with scalatest/JUnit test runners to actually catch performance regressions here: have something take the Ant-format XML reports & convert them to something where you can diff performance over time & so build up your own local model of how long things should take. An interesting project for someone.

FWIW, Hadoop 3.1 is down for v 1.11.271 (shaded), which does have your stuff in, so far all is good.

-Steve
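To make the delete-counting idea concrete, here's a minimal sketch of the pattern: a client wrapper counts DELETE calls and an LRU cache of recently deleted paths suppresses duplicates, so a test can assert on the call count. All names here are hypothetical illustrations — the real S3A instrumentation in HADOOP-15209/HADOOP-15191 is Java and wired into the filesystem statistics, not this toy.

```python
from collections import OrderedDict

class DeleteCountingStore:
    """Toy stand-in for an instrumented object-store client: counts
    DELETE calls, and keeps an LRU cache of recently deleted paths so
    repeat deletes of the same path skip the round trip."""

    def __init__(self, cache_size=4):
        self.delete_calls = 0          # the counter a test asserts on
        self.cache_size = cache_size
        self._recent = OrderedDict()   # LRU of recently deleted paths

    def delete(self, path):
        if path in self._recent:
            # cache hit: path was deleted recently, no DELETE issued
            self._recent.move_to_end(path)
            return
        self.delete_calls += 1         # cache miss: issue the DELETE
        self._recent[path] = True
        if len(self._recent) > self.cache_size:
            self._recent.popitem(last=False)   # evict oldest entry

store = DeleteCountingStore()
for p in ["/a", "/b", "/a", "/a", "/c"]:
    store.delete(p)
# /a is deleted once; the two repeats hit the cache
assert store.delete_calls == 3
```

The point of the pattern is that the assertion fails loudly if the cache regresses — exactly the kind of scale problem that "can I distcp 5 directories" tests never notice.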
Thanks a lot Steve. I’ll go through the Jira’s you linked in detail. I took a quick look and am sufficiently scared for now. I had run into that warning from the S3 stream before. Sigh. things like that are trouble as they don't get picked up in automated test runs -or they do, but unless you look at the console output, you don't see them. It's why when generally for a Hadoop release we fix the version at least 4-6 weeks before the release and play through the command line, downstream tests, etc. The other troublespot is changes in perf & scale which don't show up on the smaller tests which look for functionality "can I distcp 5 directories", but kick off when the problem is "can I use distcp to back up 4 PB of data without too many DELETE calls being throttled". If you look at HADOOP-15209/HADOOP-15191 the tests instrument the S3A client to count the #of delete calls made so I can make assertions that an LRU cache of recently deleted paths actually works, using the test mechanisms covered in http://steveloughran.blogspot.co.uk/2016/04/distributed-testing-making-use-of.html Other than that, sitting in front of the screen watching that a test to do spark ORC/Parquet jobs using S3 as a src & dest, and noticing when it takes much much longer than you'd expect is a cue of a regression. Or that I'm just accidentally using an object store on a different continent from normal. It'd be interesting to consider what you can do with scaltest/junit test runners to actually catch performance regressions here: have something take the Ant-format XML reports & convert to something where you can diff performance over time & so build up your own local model of how long things should take. An interesting project for someone. FWIW, Hadoop 3.1 is down for v 1.11.271 (shaded), which does have your stuff in, so far all is good. -Steve