I'm still trying to solve this problem.  One person mentioned that mappers have 
to sort the data and that the sort buffer may be relevant but I'm seeing the 
same linear slowdown from the reducer, and more importantly, my data sizes are 
so small (a few MBs) that if the Hadoop settings out-of-the-box didn't work at 
these scales, Hadoop wouldn't work at all.  I'm only seeing this linear 
slowdown with streaming jobs btw.

The bottom line is -- as quoted below from my earlier post on this issue -- 
that the time required to write a block of a given size (say 1 MB, it doesn't 
matter for the purpose of this discussion) goes up linearly with each 
additional block written.  Even at scales of a few 10s of MBs total data, this 
results in write-times of several minutes and given the superlinearity, only a 
few more 10s of MBs quickly stretches to half an hour!

Does no one have any idea what could possibly be causing this?  As far as I can 
tell, anyone running a streaming job ought to see this behavior because my 
program doesn't do anything weird w.r.t. writing output data, it simply dumps 
it straight onto stdout.  Consequently, it seems to me that all streaming jobs 
should produce this behavior...and therefore that streaming should be virtually 
useless.  I'm sure I'm doing something silly, but my approach is so simple I 
just don't know what I could be doing wrong.

Here is the basic outline of my output routine.  It just shows that I'm not 
doing anything "funny", I'm just dumping bytes onto stdout:

        double accumulatedWriteBlockTime = 0;
        int lastPosToWrite = bufferLength - 1;  //Inclusive
        int pos = 0;
        while (pos <= lastPosToWrite)
        {
                long timeWriteOneBlockStart = getTimestamp_100s();
                
                int writeBlockSize = kWriteBlockSize;   //1048576 B (1 MB)
                if (pos + writeBlockSize > bufferLength)
                        writeBlockSize = bufferLength - pos;
                
                cout.write(buffer + pos, writeBlockSize);
                pos += writeBlockSize;
                
                long timeWriteOneBlockEnd = getTimestamp_100s();
                double diffTimeWriteOneBlock = 
diffTimes_100s(timeWriteOneBlockStart, timeWriteOneBlockEnd);
                accumulatedWriteBlockTime += diffTimeWriteOneBlock;
                
                OUT "diffTimeWriteOneBlock: "
                        << setw(8) << secsToTimeStr(diffTimeWriteOneBlock)
                        << "        cumulativeWriteBlockTime: "
                        << setw(8) << secsToTimeStr(accumulatedWriteBlockTime)
                        << "        Progress: " << setw(8) << setprecision(2)
                        << (((double)pos / (double)bufferLength) * 100.0) << "%"
                        << endl;
        }

I only broke it up into an iteration over blocks after I discovered this 
problem.  My original code was much simpler.  It just called write() once with 
the entire buffer.  I broke it up like this to investigate the problem and that 
was when I discovered that it took longer to write the same amount of data as 
time went on (actually I suspected this after observing the superlinear 
relationship between the datasize and the write time).  Here's a sample output 
from the code with the middle section removed for brevity:

                diffTimeWriteOneBlock:    0.35s        
cumulativeWriteBlockTime:     0.35s        Progress:     3.99%
                diffTimeWriteOneBlock:    0.85s        
cumulativeWriteBlockTime:      1.2s        Progress:     7.98%
                diffTimeWriteOneBlock:    1.28s        
cumulativeWriteBlockTime:     2.48s        Progress:    11.97%
                diffTimeWriteOneBlock:    1.83s        
cumulativeWriteBlockTime:     4.31s        Progress:    15.96%
                diffTimeWriteOneBlock:    2.34s        
cumulativeWriteBlockTime:     6.65s        Progress:    19.95%
                diffTimeWriteOneBlock:    2.99s        
cumulativeWriteBlockTime:     9.64s        Progress:    23.95%
[middle section removed for brevity]
                diffTimeWriteOneBlock:   10.25s        
cumulativeWriteBlockTime: 1m 45.17s        Progress:    79.82%
                diffTimeWriteOneBlock:   10.72s        
cumulativeWriteBlockTime: 1m 55.89s        Progress:    83.81%
                diffTimeWriteOneBlock:   11.32s        
cumulativeWriteBlockTime:  2m 7.21s        Progress:    87.80%
                diffTimeWriteOneBlock:   12.25s        
cumulativeWriteBlockTime: 2m 19.46s        Progress:    91.79%
                diffTimeWriteOneBlock:   12.48s        
cumulativeWriteBlockTime: 2m 31.94s        Progress:    95.78%
                diffTimeWriteOneBlock:   13.06s        
cumulativeWriteBlockTime:    2m 45s        Progress:    99.77%


Thanks for any help.

________________________________________________________________________________
Keith Wiley     kwi...@keithwiley.com     keithwiley.com    music.keithwiley.com

"I used to be with it, but then they changed what it was.  Now, what I'm with
isn't it, and what's it seems weird and scary to me."
                                           --  Abe (Grandpa) Simpson
________________________________________________________________________________

Reply via email to