Guojun is right, the reduce() inputs are buffered and read off of disk. You
are in no danger there.
On Fri, Jun 29, 2012 at 11:02 PM, GUOJUN Zhu wrote:
>
> If you are referring to the iterable in the reducer, they are special and not
> in the memory at all. Once the iterator passes a value, it is lost
Thanks, Arun. Switching to CapacityScheduler seems to have fixed much of
the issue: TeraGen and TeraSort are now evenly distributed and run almost
twice as fast. However, TeraValidate only ran on one node, leaving 3
completely idle (except for the AM). I browsed the block locations of the
output pa
If you are referring to the iterable in the reducer, it is special and not
held in memory at all. Once the iterator passes a value, it is lost and you
cannot recover it. There is no LinkedList behind it.
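The single-pass, object-reuse behavior described above can be simulated in plain Java, without any Hadoop dependency. This is only a sketch: the class and field names below are illustrative, not Hadoop's actual internals, but it mirrors the key property that the iterator reuses one mutable holder and can only be traversed once, so storing references instead of copies silently gives you N copies of the last value.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Simulates a reducer's value Iterable: a single pass over the data,
// with one mutable holder object reused for every value (much like
// Hadoop reuses Writable instances). All names here are hypothetical.
public class OnePassValues implements Iterable<int[]> {
    private final int[] source;
    private boolean consumed = false;

    public OnePassValues(int[] source) { this.source = source; }

    @Override
    public Iterator<int[]> iterator() {
        if (consumed) throw new IllegalStateException("values can be read only once");
        consumed = true;
        final int[] holder = new int[1];   // the single reused holder
        return new Iterator<int[]>() {
            private int i = 0;
            public boolean hasNext() { return i < source.length; }
            public int[] next() { holder[0] = source[i++]; return holder; }
        };
    }

    public static void main(String[] args) {
        OnePassValues values = new OnePassValues(new int[] {1, 2, 3});
        List<int[]> collected = new ArrayList<>();
        int sum = 0;
        for (int[] v : values) {
            sum += v[0];          // streaming use: correct
            collected.add(v);     // storing the reference: every entry
        }                         // ends up pointing at the same holder
        System.out.println("sum=" + sum);                // sum=6
        System.out.println(collected.get(0)[0] + ","
                + collected.get(1)[0] + ","
                + collected.get(2)[0]);                  // 3,3,3
    }
}
```

Running it shows the streaming sum is right, while the "collected" list holds three references to the same holder, all reporting the last value; a second for-each over `values` would throw. This is why copying (or not collecting at all) is required.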
Zhu, Guojun
Modeling Sr Graduate
571-3824370
guojun_...@freddiemac.com
Financial E
I was actually quite curious as to how Hadoop was managing to get all of the
records into the Iterable in the first place. I thought they were using a very
specialized object that implements Iterable, but a heap dump shows they're
likely just using a LinkedList. All I was doing was duplicating
Hey Matt,
As far as I can tell, Hadoop isn't truly at fault here.
If your issue is that you collect into a list before you store, you
should focus on that and avoid collecting entirely. Why
don't you serialize as you receive, if the incoming order is already
taken care of? As far as I can
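The "serialize as you receive" suggestion above can be sketched in plain Java. This is a minimal illustration, not Hadoop API code: the method and class names are hypothetical, and a real reducer would write through its output context, but the point is the same: emit each value the moment the iterator hands it over, so memory use stays constant no matter how many values a key has.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Arrays;
import java.util.Iterator;

// Sketch of streaming output: write each value as it arrives instead of
// accumulating a list and writing it at the end. Names are hypothetical.
public class StreamingWrite {
    static void writeAsReceived(String key, Iterator<String> values, Writer out)
            throws IOException {
        while (values.hasNext()) {
            // No intermediate list: one value in flight at a time.
            out.write(key + "\t" + values.next() + "\n");
        }
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        writeAsReceived("k1", Arrays.asList("a", "b", "c").iterator(), out);
        System.out.print(out);
        // k1	a
        // k1	b
        // k1	c
    }
}
```

Since the reducer already receives values in a deterministic order for each key, nothing is lost by writing immediately; only if you genuinely need random access or re-sorting would buffering be justified, and then a bounded or disk-backed structure is safer than an unbounded list.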
The folder contains files with text and other folders with text files. The
text is not key/value, it's just text. Something like this:
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Lorem Ipsum has been the industry's standard dumm...
I'm thinking about 3 options:
Firs