get rid of excessive flushes from PipeMapper/Reducer
----------------------------------------------------

                 Key: HADOOP-3196
                 URL: https://issues.apache.org/jira/browse/HADOOP-3196
             Project: Hadoop Core
          Issue Type: Bug
          Components: contrib/streaming
    Affects Versions: 0.16.2
            Reporter: Joydeep Sen Sarma


There is a flush on the buffered output stream in the streaming mapper/reducer
for every row of data, which defeats the purpose of the buffering:

      // 2/4 Hadoop to Tool
      if (numExceptions_ == 0) {
        if (!this.ignoreKey) {
          write(key);
          clientOut_.write('\t');
        }
        write(value);
        if (!this.skipNewline) {
          clientOut_.write('\n');
        }
        clientOut_.flush();  // runs once per record; this is the flush in question
      } else {
        numRecSkipped_++;
      }
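
To make the cost of the per-record flush concrete, here is a minimal,
self-contained sketch. It is not the PipeMapper code: FlushDemo, CountingSink
and countSinkWrites are hypothetical names, and the 8192-byte buffer size is an
assumption. The sink stands in for the pipe to the streaming tool and counts
how many write calls actually reach it.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class FlushDemo {
    /** Hypothetical stand-in for the pipe to the tool: counts write calls that reach it. */
    static final class CountingSink extends OutputStream {
        int writeCalls = 0;
        @Override public void write(int b) { writeCalls++; }
        @Override public void write(byte[] b, int off, int len) { writeCalls++; }
    }

    /** Writes `records` key/value lines; returns how many writes reached the sink. */
    public static int countSinkWrites(int records, boolean flushEveryRecord) {
        CountingSink sink = new CountingSink();
        byte[] key = "key".getBytes(StandardCharsets.UTF_8);
        byte[] value = "value".getBytes(StandardCharsets.UTF_8);
        // Buffer size is an assumption for this sketch.
        try (OutputStream clientOut = new BufferedOutputStream(sink, 8192)) {
            for (int i = 0; i < records; i++) {
                clientOut.write(key);
                clientOut.write('\t');
                clientOut.write(value);
                clientOut.write('\n');
                if (flushEveryRecord) {
                    clientOut.flush();  // pushes every record through individually
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // close() has flushed whatever was still buffered.
        return sink.writeCalls;
    }

    public static void main(String[] args) {
        System.out.println("flush per record: " + countSinkWrites(1000, true));
        System.out.println("flush on close:   " + countSinkWrites(1000, false));
    }
}
```

With the per-record flush, every record becomes its own write to the sink;
without it, the buffer hands data over only when it fills up or is closed. On
a real pipe each of those writes is a system call that can wake the process on
the other end, which is plausibly where the extra context switches come from.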

I tried to measure the impact of removing this flush. The number of context
switches reported by vmstat (the cs column) shows a marked decline.

With flush (vmstat at 10-second intervals):
 r  b   swpd    free   buff   cache   si   so    bi    bo   in    cs us sy id wa
 4  2    784   23140  83352 3114648    0    0  4819 32397 1175 13220 59 11 13 17
 1  2    784  129724  80704 3075696    0    0  4614 27196 1156 14797 49 11 19 21
 4  0    784   24160  83440 3174880    0    0    96 36070 1337 10976 67 11  9 12
 5  0    784  155872  84400 3158840    0    0   125 44084 1280 11044 68 14 10  8
 2  1    784  365128  87048 2892032    0    0   119 38472 1317 11610 69 14 10  7

Without flush:
 r  b   swpd    free   buff   cache   si   so    bi    bo   in    cs us sy id wa
 5  0    784   24652  56056 3217864    0    0   310 29499 1379  7603 76  9  7  8
 5  3    784  118456  54568 3209992    0    0  3249 33426 1173  6828 63 11 12 14
 0  2    784  227628  54820 3198560    0    0  7840 30063 1146  8899 60 10 15 15
 3  1    784   25608  55048 3313512    0    0  3251 36276 1194  7915 60 10 15 15
 1  2    784  197324  49968 3194572    0    0  4714 35479 1281  8204 62 13 12 13

cs goes down by about 20-30%, but I am having trouble measuring the overall
speed improvement (too many variables, such as speculative execution; a better
benchmark is needed).
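
As a rough cross-check, the cs values from the two vmstat listings can be
averaged directly. The sketch below does only that arithmetic; CsDrop, mean and
percentDrop are hypothetical names, and the numbers are copied from the cs
column of the listings.

```java
import java.util.Arrays;

public class CsDrop {
    /** Arithmetic mean of the samples (NaN for an empty array). */
    public static double mean(int[] samples) {
        return Arrays.stream(samples).average().orElse(Double.NaN);
    }

    /** Relative decline of the mean from `before` to `after`, in percent. */
    public static double percentDrop(int[] before, int[] after) {
        double base = mean(before);
        return 100.0 * (base - mean(after)) / base;
    }

    public static void main(String[] args) {
        // cs column, copied from the vmstat samples in this report.
        int[] withFlush = {13220, 14797, 10976, 11044, 11610};
        int[] withoutFlush = {7603, 6828, 8899, 7915, 8204};

        System.out.println("mean cs with flush:    " + mean(withFlush));
        System.out.println("mean cs without flush: " + mean(withoutFlush));
        System.out.println("drop in percent:       " + percentDrop(withFlush, withoutFlush));
    }
}
```

For these ten samples the mean falls from roughly 12,300 to roughly 7,900
context switches per second, a drop of about 36%. Five samples per run is a
thin basis, so this is best read as agreeing with the direction of the 20-30%
estimate rather than as a refined figure.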

Removing the flush can't hurt.

