Hello experienced Hadoop users,

I have a data pipeline consisting of two Java MR jobs coordinated by the Oozie scheduler. Both of them process the same data, but the first one is more than 10 times slower than the second one. The job counters on the RM page are not much help in this matter. I have verified from our monitoring system that there were no hardware constraints such as IO, CPU, or network. Specifically, the job was using just a fraction of the resources allocated to the given container.
Is there a way to get profiling statistics out of a Hadoop cluster task? What are the best available tools, required settings, etc.? I have read the job tuning section of Hadoop: The Definitive Guide, but I am not sure that those settings are still valid for Hadoop 2.2.0. Could someone point me to a good resource where I can look for information (blog, manual, book, etc.)? I am a bit confused about which settings refer to Hadoop 1 and which apply to Hadoop 2 / MR2.

The dataset size is around 500 MB compressed, and it is a map-only task.

Thanks for any experience shared,
Jakub