Re: generate task timeline figures like "Hadoop Sorts a Petabyte..." blog

2009-07-22 Thread Miles Osborne
nope, if i recall the data is randomly generated (the task itself requires fixed-length binary strings to be sorted)
Miles
2009/7/22 Harish Mallipeddi
> On Wed, Jul 22, 2009 at 8:52 PM, Rares Vernica wrote:
>
> > Hello,
> >
> > I wonder how did the Yahoo! developers generate the Task Timeline

Re: .gz as input files in streaming.

2009-07-14 Thread Miles Osborne
here is a part of a shell script i wrote which deals with compressed input and produces compressed output (for streaming):
hadoop dfs -rmr $4
hadoop jar /usr/local/share/hadoop/contrib/streaming/hadoop-*-streaming.jar -mapper $1 -reducer $2 -input $3/* -output $4 -file $1 -file $2 -jobconf mapre

Re: Disk configuration.:1

2009-07-13 Thread Miles Osborne
we have 7.B T data nodes and soon will be getting 9 T nodes, and the good news is that it all works well. one wrinkle i've noticed is that should a disk or two fill up then the entire machine can get blacklisted (if you have smaller capacity machines then this is probably the correct behaviour)

Re: Sort by value

2009-07-09 Thread Miles Osborne
if you have <key, value> pairs, then have your mapper emit <value, key>
this will result in your data being resorted by the value
Miles
2009/7/9 Marcus Herou
> Really ? Will that work ?
>
> input something like this
>
> tag
> tag2
> tag
> tag2
> tag3
> ...
> produces output
>
> tag 2
> tag2 2
> tag3 1
>
>
> Sw
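A minimal sketch of the idea in the reply above, written as the kind of logic a streaming mapper could apply. It is not code from the thread. The names `swap_key_value` and `simulate_sort_by_value`, the tab separator, and the zero-padding width are assumptions made for illustration. The padding is there because streaming compares keys as text by default, so an unpadded "10" would sort before "2".

```python
def swap_key_value(line, sep="\t", pad=10):
    """Turn 'key<sep>value' into 'value<sep>key' so the value becomes the sort key.

    Purely numeric values are zero-padded (assumed width: 10 digits) so that
    a plain text comparison orders them numerically.
    """
    key, value = line.rstrip("\n").split(sep, 1)
    if value.isdigit():
        value = value.zfill(pad)
    return f"{value}{sep}{key}"


def simulate_sort_by_value(lines):
    """Local stand-in for the shuffle phase, which sorts mapper output by key."""
    return sorted(swap_key_value(line) for line in lines)


if __name__ == "__main__":
    # Counts taken from the quoted example: tag 2, tag2 2, tag3 1.
    counts = ["tag\t2", "tag2\t2", "tag3\t1"]
    for record in simulate_sort_by_value(counts):
        print(record)
```

In a real streaming job the mapper would read lines from standard input and write each swapped record to standard output; the `sorted` call here only imitates what the framework does between map and reduce.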