I compared the 2 results they were the same,
for the system command the sed before the sort, is working properly, i did ctrl V then tab to input a tab character in the terminal, and viewed the result its stripping out the rest of the data okay.



Ashish Venugopal wrote:
Just a small note (does not answer your question, but deals with your
testing command), when running the system command version below, its
important to test with
sort -k 1 -t $TAB

where TAB is something like:

TAB=`echo "\t"`

to ensure that you sort by key, rather than the whole line. Sorting by the
whole line can cause your reduce code to seem to work during testing (if you
are testing on the command line), but then not work correctly via Hadoop.



On Tue, Jun 10, 2008 at 6:56 PM, Elia Mazzawi <[EMAIL PROTECTED]>
wrote:

Hello,

we were considering using hadoop to process some data,
we have it set up on 8 nodes ( 1 master + 7 slaves)

we filled the cluster up with files that contain tab delimited data.
string \tab string etc
then we ran the example grep with a regular expression to count the number
of each unique starting string.
we had 3500 files containing 3,015,294 lines totaling 5 GB.

to benchmark it we ran
bin/hadoop jar hadoop-0.17.0-examples.jar grep data/*  output
'^[a-zA-Z]+\t'
it took 26 minutes

then to compare, we ran this bash command on one of the nodes, which
produced the same output out of the data:

cat * | sed -e s/\  .*// |sort | uniq -c > /tmp/out
(sed regexpr is tab not spaces)

which took 2.5 minutes

Then we added 10X the data into the cluster and reran Hadoop, it took 214
minutes which is less than 10X the time, but still not that impressive.


so we are seeing a 10X performance penalty for using Hadoop vs the system
commands,
is that expected?
we were expecting hadoop to be faster since it is distributed?
perhaps there is too much overhead involved here?
is the data too small?



Reply via email to