Re: hadoop benchmarked, too slow to use
Thanks for the suggestions. I'm going to rerun the same test with files close to 64 MB, and with 7 and then 14 reducers.

We've also done another test to see whether more servers would speed up the cluster. With 2 nodes down, the 10X data set took 322 minutes (about 5.4 hours) vs 214 minutes with all nodes online. The test was started after HDFS had marked the nodes as dead, and there were no timeouts. 322/214 is roughly 1.5, i.e. about 50% more time with 5/7 = 71% of the servers. So our conclusion is that more servers will make the cluster faster.

Ashish Thusoo wrote:
Try by first just reducing the number of files and increasing the data in each file so you have close to 64 MB of data per file. In your case that would amount to about 700-800 files in the 10X test case (instead of the 35,000 that you have). See if that gives substantially better results on your larger test case. For the smaller one, I don't think you will be able to do better than the unix command - the data set is too small.

Ashish
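For what it's worth, a rough way to do that concatenation on the local side before uploading is a plain bash loop that packs files into ~64 MB chunks. This is only a sketch under assumptions: the ./data and ./chunks paths are placeholders, stat -c %s is the GNU form, and the 64 MB target should really match the cluster's configured block size.

#!/usr/bin/env bash
# Sketch: pack many small tab-delimited files into ~64 MB chunks before
# uploading to HDFS. Paths and the 64 MB target are hypothetical.
set -e

SRC=./data          # directory of small input files (placeholder)
DST=./chunks        # where the merged ~64 MB files go (placeholder)
LIMIT=$((64 * 1024 * 1024))

mkdir -p "$DST"
chunk=0
size=0
out="$DST/part-$(printf '%05d' "$chunk")"

for f in "$SRC"/*; do
    fsize=$(stat -c %s "$f")   # GNU stat; use stat -f %z on BSD
    if [ $((size + fsize)) -gt "$LIMIT" ] && [ "$size" -gt 0 ]; then
        chunk=$((chunk + 1))
        size=0
        out="$DST/part-$(printf '%05d' "$chunk")"
    fi
    cat "$f" >> "$out"
    size=$((size + fsize))
done

# Then upload the merged files, e.g.:
# bin/hadoop dfs -put "$DST" data-merged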
Re: hadoop benchmarked, too slow to use
On Jun 11, 2008, at 11:53 AM, Elia Mazzawi wrote:
we concatenated the files to bring them close to, and less than, 64 MB, and the difference was huge. Without changing anything else we went from 214 minutes to 3 minutes!

*smile* How many reduces are you running now? 1 or more?

Arun
Re: hadoop benchmarked, too slow to use
That was with 7 reducers, but I meant to run it with 1. I'll re-run to compare.

Arun C Murthy wrote:
*smile* How many reduces are you running now? 1 or more?
Re: hadoop benchmarked, too slow to use
Yes. That does count as huge. Congratulations!

On Wed, Jun 11, 2008 at 11:53 AM, Elia Mazzawi [EMAIL PROTECTED] wrote:
we concatenated the files to bring them close to, and less than, 64 MB, and the difference was huge. Without changing anything else we went from 214 minutes to 3 minutes!

-- ted
RE: hadoop benchmarked, too slow to use
Good to know... this puppy does scale :) and hadoop is awesome for what it does...

Ashish

-----Original Message-----
From: Elia Mazzawi [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, June 11, 2008 11:54 AM
To: core-user@hadoop.apache.org
Subject: Re: hadoop benchmarked, too slow to use

we concatenated the files to bring them close to, and less than, 64 MB, and the difference was huge. Without changing anything else we went from 214 minutes to 3 minutes!
Re: hadoop benchmarked, too slow to use
Just a small note (it does not answer your question, but deals with your testing command): when running the system-command version below, it's important to test with

sort -k 1 -t $TAB

where TAB is something like TAB=`echo \t`, to ensure that you sort by key rather than by the whole line. Sorting by the whole line can cause your reduce code to seem to work during testing (if you are testing on the command line) but then not work correctly via Hadoop.

On Tue, Jun 10, 2008 at 6:56 PM, Elia Mazzawi [EMAIL PROTECTED] wrote:

Hello,

We were considering using hadoop to process some data. We have it set up on 8 nodes (1 master + 7 slaves).

We filled the cluster up with files that contain tab-delimited data:
string \tab string
etc.

Then we ran the example grep with a regular expression to count the number of each unique starting string. We had 3500 files containing 3,015,294 lines, totaling 5 GB.

To benchmark it we ran:
bin/hadoop jar hadoop-0.17.0-examples.jar grep data/* output '^[a-zA-Z]+\t'
It took 26 minutes.

Then, to compare, we ran this bash command on one of the nodes, which produced the same output out of the data:
cat * | sed -e s/\ .*// | sort | uniq -c > /tmp/out
(the character in the sed regexpr is a tab, not a space)
It took 2.5 minutes.

Then we added 10X the data into the cluster and reran Hadoop. It took 214 minutes, which is less than 10X the time, but still not that impressive.

So we are seeing a 10X performance penalty for using Hadoop vs the system commands. Is that expected? We were expecting hadoop to be faster since it is distributed. Perhaps there is too much overhead involved here? Is the data too small?
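A minimal sketch of that key-only sort, assuming GNU sort and a bash-style shell (the $'\t' quoting and the input.txt file name are illustrative, not from the original post):

# Sort by the first tab-delimited field only, so the ordering matches the way
# Hadoop's shuffle groups records by key.
TAB=$'\t'
sort -t "$TAB" -k 1,1 input.txt > sorted_by_key.txt

# Compare with a whole-line sort, which can order lines differently whenever
# the value part of a record differs:
sort input.txt > sorted_whole_line.txt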
Re: hadoop benchmarked, too slow to use
I suspect that many people are using Hadoop with a moderate number of nodes and expecting to see a win over a sequential, single-node version. The result (and I've seen this too) is typically that the single-node version wins hands-down.

Apart from speeding up the Hadoop job (eg via compression, tweaking magic parameters and the like), it is better to consider jobs that are simply too large to fit into memory on a single machine. Under those circumstances --and with many machines-- you will see Hadoop winning. Horses for courses.

(Another dimension is redundancy; we have had a long-running Nutch setup downloading and parsing blog posts. Only very recently was it noticed that one of our machines was in fact dead. The show still went on and nothing was lost.)

Miles

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Re: hadoop benchmarked, too slow to use
I compared the 2 results and they were the same. For the system command, the sed before the sort is working properly: I did Ctrl-V then Tab to input a tab character in the terminal, and I viewed the result - it is stripping out the rest of the data okay.

Ashish Venugopal wrote:
Just a small note (it does not answer your question, but deals with your testing command): when running the system-command version, it's important to test with sort -k 1 -t $TAB, to ensure that you sort by key rather than by the whole line. Sorting by the whole line can cause your reduce code to seem to work during testing but then not work correctly via Hadoop.
RE: hadoop benchmarked, too slow to use
How many reducers? Perhaps u are defaulting to one reducer.

One variable is how fast the java regex evaluation is wrt sed. One option is to use hadoop streaming and use ur sed fragment as the mapper; that will be another way of measuring hadoop overhead that eliminates some variables.

Hadoop also has quite a few variables to tune performance (check the hadoop wiki for yahoo's sort benchmark settings, for example).
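A rough sketch of that streaming variant is below. It is only illustrative: the streaming jar path, the HDFS input/output paths, and the exact sed expression are assumptions (in 0.17 the streaming jar usually sits under contrib/streaming/), and the mapper just emits the leading tab-delimited field so the framework's sort plus a uniq -c reducer do the counting.

# Hypothetical streaming run: the mapper cuts each line down to its first
# field, the shuffle sorts by that key, and the reducer counts occurrences.
bin/hadoop jar contrib/streaming/hadoop-0.17.0-streaming.jar \
    -input   data \
    -output  streaming-out \
    -mapper  "sed -e 's/\t.*//'" \
    -reducer "uniq -c" \
    -jobconf mapred.reduce.tasks=7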
RE: hadoop benchmarked, too slow to use
Apart from the setup times, the fact that you have 3500 files means that you are going after around 220 GB of data, as each file would have at least one chunk (this calculation assumes a chunk size of 64 MB and that each file has at least some data). Mappers would probably need to read up to this amount of data, and with 7 nodes you may just have 14 map slots. I may be wrong here, but just out of curiosity, how many mappers does your job use?

Don't know why the 10X data was not better, though, if the bad performance of the smaller test case was due to fragmentation. For that test did you also increase the number of files, or did you simply increase the amount of data in each file?

Plus, on small sets of data (of the order of 2-3 GB), unix commands can't really be beaten :)

Ashish
Re: hadoop benchmarked, too slow to use
I could rerun the benchmark with a single-node server to see what happens.

My concern is that the 8-node setup was 10X slower than the bash command, so I was starting to suspect that the cluster is not running properly, but everything looks good in the logs - no timeouts and such.

Miles Osborne wrote:
I suspect that many people are using Hadoop with a moderate number of nodes and expecting to see a win over a sequential, single-node version. The result (and I've seen this too) is typically that the single-node version wins hands-down. Apart from speeding up the Hadoop job (eg via compression, tweaking magic parameters and the like), it is better to consider jobs that are simply too large to fit into memory on a single machine. Under those circumstances --and with many machines-- you will see Hadoop winning.
Re: hadoop benchmarked, too slow to use
Yes, there was only 1 reducer. How many should I try?

Joydeep Sen Sarma wrote:
How many reducers? Perhaps u are defaulting to one reducer. One option is to use hadoop streaming and use ur sed fragment as the mapper; that will be another way of measuring hadoop overhead that eliminates some variables.
Re: hadoop benchmarked, too slow to use
Why not do a little experiment and see what the timing results are when using a range of reducers, eg 1, 2, 5, 7, 13?

Miles

2008/6/11 Elia Mazzawi [EMAIL PROTECTED]:
yes there was only 1 reducer, how many should i try ?

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
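One way to run that sweep without writing any Java is to reuse the streaming variant sketched earlier and vary mapred.reduce.tasks; the jar path and the HDFS data/output paths are still assumptions, as is using wall-clock time of the client command as the measurement.

# Hypothetical timing sweep over reducer counts using the streaming job.
for R in 1 2 5 7 13; do
    bin/hadoop dfs -rmr "streaming-out-r$R" 2>/dev/null
    time bin/hadoop jar contrib/streaming/hadoop-0.17.0-streaming.jar \
        -input   data \
        -output  "streaming-out-r$R" \
        -mapper  "sed -e 's/\t.*//'" \
        -reducer "uniq -c" \
        -jobconf mapred.reduce.tasks=$R
done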
Re: hadoop benchmarked, too slow to use
Yes, the chunk size was 64 MB, and each file has some data. It used 7 mappers and 1 reducer.

10X the data took 214 minutes vs 26 minutes for the smaller set. I uploaded the same data 10 times into different directories (so more files, each the same size).

Ashish Thusoo wrote:
Apart from the setup times, the fact that you have 3500 files means that you are going after around 220 GB of data, as each file would have at least one chunk. Mappers would probably need to read up to this amount of data, and with 7 nodes you may just have 14 map slots. Just out of curiosity, how many mappers does your job use?
RE: hadoop benchmarked, too slow to use
Perhaps something of the order of the number of cores in ur system. At least 7 for sure - but perhaps 14 if u have multi-core machines (assuming tasks per node is at least 2).

Also - Ashish hit the nail on the head - way too many small files. Hadoop job overhead is killing you. At this point I wish I could say "just use a TextMultiFileInputFormat" - except there isn't one (and I guess the nearest alternative, the hadoop archival stuff, is not in 17). Bad luck.

-----Original Message-----
From: Elia Mazzawi [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, June 10, 2008 4:26 PM
To: core-user@hadoop.apache.org
Subject: Re: hadoop benchmarked, too slow to use

yes there was only 1 reducer, how many should i try ?
Re: hadoop benchmarked, too slow to use
Okay, I'll try that. Thanks.

Joydeep Sen Sarma wrote:
Perhaps something of the order of the number of cores in ur system. At least 7 for sure - but perhaps 14 if u have multi-core machines (assuming tasks per node is at least 2). Also - Ashish hit the nail on the head - way too many small files. Hadoop job overhead is killing you.
Re: hadoop benchmarked, too slow to use
So it would make sense for me to configure hadoop for smaller chunks?

Elia Mazzawi wrote:
Yes, the chunk size was 64 MB, and each file has some data. It used 7 mappers and 1 reducer. 10X the data took 214 minutes vs 26 minutes for the smaller set. I uploaded the same data 10 times into different directories (so more files, each the same size).
Re: hadoop benchmarked, too slow to use
Hey all,

I'm not a world-class hadoop expert yet, so I won't try to advise anything in particular, except to point out that the regular expression implementation makes a big difference [1]. It may be that Java vs. sed is an unfair fight.

One valid test /may be/ to run a 1-mapper hadoop job and see how that fares. In a perfect world, you'd expect it to run about 7x slower than on the 7-slave cluster. How far off it is might tell you something.

Good luck!

[1] http://swtch.com/~rsc/regexp/regexp1.html

-- Jim R. Wilson (jimbojw)

On Tue, Jun 10, 2008 at 7:14 PM, Ashish Thusoo [EMAIL PROTECTED] wrote:
Try by first just reducing the number of files and increasing the data in each file so you have close to 64 MB of data per file. In your case that would amount to about 700-800 files in the 10X test case (instead of the 35,000 that you have). See if that gives substantially better results on your larger test case. For the smaller one, I don't think you will be able to do better than the unix command - the data set is too small.
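A rough way to get that single-machine baseline is to time the plain shell pipeline against a local-mode run of the same example on the same files. This is only a sketch: /data/local-copy and the output directory are placeholders, and it assumes the 0.17 grep example honours the generic -fs/-jt options; if it does not, the fallback is to point HADOOP_CONF_DIR at a config with mapred.job.tracker=local and a local fs.default.name.

# Hypothetical single-node comparison on a local copy of the data.
time cat /data/local-copy/* | sed -e $'s/\t.*//' | sort | uniq -c > /tmp/shell-out

# Non-distributed Hadoop run over the same local files (assumes generic
# options are parsed by the example; otherwise configure local mode via
# HADOOP_CONF_DIR as noted above).
time bin/hadoop jar hadoop-0.17.0-examples.jar grep \
    -fs local -jt local \
    /data/local-copy hadoop-local-out '^[a-zA-Z]+\t'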