[jira] [Commented] (MAPREDUCE-3402) AMScalability test of Sleep job with 100K 1-sec maps regressed into running very slowly

2011-12-07 Thread Vinod Kumar Vavilapalli (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/MAPREDUCE-3402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13164900#comment-13164900 ]

Vinod Kumar Vavilapalli commented on MAPREDUCE-3402:


[~karams] has been extremely helpful in running various tests to hunt this 
down, and we finally got some results after a couple of weeks of hard work.

It turns out that most of the issues stem from our switch from 32-bit JVMs to 
64-bit. Using compressed references dramatically increased the AM's speed, and 
the job now finishes in around 30-35 mins. That is still a regression, but at 
least the job finishes with the compressed-oops setting and/or after changing 
the JVM back to 32-bit.
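
For readers who want to confirm whether a given JVM is actually running with 
compressed references, the following is a minimal sketch, assuming a HotSpot 
JVM where the setting is exposed as the UseCompressedOops flag. The class and 
method names are hypothetical and are not part of Hadoop; this is not the 
procedure used in the tests above.

```java
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

/** Hypothetical helper: reports whether this JVM uses compressed references. */
class CompressedOopsCheck {

    /** Returns "true" or "false" for UseCompressedOops, or "unavailable". */
    static String compressedOopsSetting() {
        try {
            HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean == null) {
                return "unavailable";
            }
            // Throws IllegalArgumentException if the JVM has no such flag.
            return bean.getVMOption("UseCompressedOops").getValue();
        } catch (RuntimeException | LinkageError e) {
            // Assumption: non-HotSpot or 32-bit JVMs may not expose the flag.
            return "unavailable";
        }
    }

    public static void main(String[] args) {
        System.out.println("os.arch = " + System.getProperty("os.arch"));
        System.out.println("UseCompressedOops = " + compressedOopsSetting());
    }
}
```

On HotSpot the flag is switched on with -XX:+UseCompressedOops in the JVM 
options of the process being checked.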

Giving more heap to the 32-bit JVM, around 3GB, lets the job finish in around 
7-8 mins. But that isn't something we want to do for all jobs. The fact that 
extra heap brings back the original speed definitely means that the AM is 
wasting its time in GCs. Some of the observations Sid made earlier may hint at 
the root culprit.
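
One way to put a number on "time wasted in GCs" from inside a JVM is the 
standard java.lang.management API. The sketch below is only an illustration 
under that assumption; GcOverhead and its methods are hypothetical names, and 
the investigation above used heap dumps and profilers rather than this code.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper: estimates how much of the JVM's uptime went to GC. */
class GcOverhead {

    /** Accumulated collection time over all collectors, in milliseconds. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Accumulated GC time divided by JVM uptime; 0.0 if uptime is not positive. */
    static double gcFraction() {
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        return uptimeMillis <= 0 ? 0.0 : (double) totalGcMillis() / uptimeMillis;
    }

    public static void main(String[] args) {
        System.out.println("GC time (ms): " + totalGcMillis());
        System.out.printf("GC fraction of uptime: %.3f%n", gcFraction());
    }
}
```

A fraction that climbs as the job progresses would be consistent with the heap 
being too small for the live data, though the reported times are approximate.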

Will file separate tickets to fix the inefficiencies.

 AMScalability test of Sleep job with 100K 1-sec maps regressed into running 
 very slowly
 ---

              Key: MAPREDUCE-3402
              URL: https://issues.apache.org/jira/browse/MAPREDUCE-3402
          Project: Hadoop Map/Reduce
       Issue Type: Bug
       Components: applicationmaster, mrv2
 Affects Versions: 0.23.0
         Reporter: Vinod Kumar Vavilapalli
         Assignee: Vinod Kumar Vavilapalli
         Priority: Blocker
          Fix For: 0.23.1


 The world was rosier before October 19-25, [~karams] says.
 The 100K 1-second sleep job used to take around 800 secs, or 13-14 mins. It 
 now runs for 45 mins and still manages to complete only about 45K tasks.
 One or more of the flurry of commits for 0.23.0 deserve(s) the blame.
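
To make the size of the regression concrete, the figures above can be turned 
into task throughput. This is a small worked example only, reading the baseline 
as roughly 800 seconds (13-14 minutes); ThroughputSketch is a hypothetical name.

```java
/** Hypothetical helper comparing the two runs quoted in the description. */
class ThroughputSketch {

    /** Completed tasks per second of wall-clock time. */
    static double tasksPerSecond(long tasks, long seconds) {
        return (double) tasks / seconds;
    }

    public static void main(String[] args) {
        // Baseline: 100K maps in roughly 800 seconds (13-14 minutes).
        double before = tasksPerSecond(100_000, 800);
        // Regressed run: about 45K maps completed after 45 minutes.
        double after = tasksPerSecond(45_000, 45 * 60);
        System.out.printf("before:   %.1f tasks/s%n", before);
        System.out.printf("after:    %.1f tasks/s%n", after);
        System.out.printf("slowdown: %.1fx%n", before / after);
    }
}
```

That works out to about 125 tasks per second before versus about 17 after, a 
slowdown of roughly 7.5x.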

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (MAPREDUCE-3402) AMScalability test of Sleep job with 100K 1-sec maps regressed into running very slowly

2011-12-05 Thread Siddharth Seth (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/MAPREDUCE-3402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13163206#comment-13163206 ]

Siddharth Seth commented on MAPREDUCE-3402:
---

Possibly different from Vinod's leads. With some changes to the environment, 
and maybe as a result of a few more commits, the job does complete.
A couple of observations: 
- The first tens of thousands of maps finish pretty fast.
- GC kicks in midway through the job and can't reclaim much. It spends several 
cycles reclaiming nothing before managing to reclaim a small amount.
- Counters are taking up a good amount of heap.
- JobHistory writes cannot keep up.
- Bumping up the AM heapsize does help.
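
The "GC can't reclaim much" observation can be checked from inside a JVM by 
looking at how much heap is still occupied right after a collection. The sketch 
below uses the standard MemoryPoolMXBean API; PostGcOccupancy is a hypothetical 
name, and this is an illustration rather than the tooling used for the 
observations above.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

/** Hypothetical helper: reports heap still occupied after the last GC. */
class PostGcOccupancy {

    /** Sum of bytes in use in each heap pool right after its most recent collection. */
    static long liveHeapBytesAfterLastGc() {
        long live = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue;
            }
            MemoryUsage afterGc = pool.getCollectionUsage(); // null if unsupported
            if (afterGc != null) {
                live += afterGc.getUsed();
            }
        }
        return live;
    }

    public static void main(String[] args) {
        System.gc(); // a request only; the JVM may ignore it
        long live = liveHeapBytesAfterLastGc();
        long max = Runtime.getRuntime().maxMemory();
        System.out.println("live after GC (bytes): " + live);
        System.out.println("max heap (bytes):      " + max);
    }
}
```

If the post-GC figure stays close to the maximum heap across successive 
collections, most of the heap is live data (counters, for example) and further 
collections will reclaim little.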

This doesn't explain why performance was better before Oct 19, though. I am 
opening and linking two JIRAs (non-blockers, since increasing the heap works 
well) for possible changes to counters and JobHistory.





[jira] [Commented] (MAPREDUCE-3402) AMScalability test of Sleep job with 100K 1-sec maps regressed into running very slowly

2011-11-16 Thread Vinod Kumar Vavilapalli (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/MAPREDUCE-3402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151099#comment-13151099 ]

Vinod Kumar Vavilapalli commented on MAPREDUCE-3402:


Independent invention! I was so deep into debugging that I didn't check the 
JIRA posts. Yes, I am using the same benchmark, reproduced many an oddity with 
100K maps, and was extolling you along the way for the benchmark :)

Playing with heap-dumps and profilers on this benchmark now.





[jira] [Commented] (MAPREDUCE-3402) AMScalability test of Sleep job with 100K 1-sec maps regressed into running very slowly

2011-11-15 Thread Sharad Agarwal (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/MAPREDUCE-3402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151029#comment-13151029 ]

Sharad Agarwal commented on MAPREDUCE-3402:
---

Just FYI, org.apache.hadoop.mapreduce.v2.app.MRAppBenchmark can be used to 
benchmark the AM, mainly for memory usage, job latencies and state machine 
transitions. However, it doesn't capture remoting/RPC issues, as it doesn't 
run on a real cluster.
