Help!! The problem about Hadoop

2010-10-05 Thread Jander
Hi all, I am running an application using Hadoop. I take 1GB of text data as input, and the results are as follows: (1) a cluster of 3 PCs: the time consumed is 1020 seconds; (2) a cluster of 4 PCs: the time is about 680 seconds. But the application took only about 280 seconds before I used Hadoop, so as the sp…

Re: Help!! The problem about Hadoop

2010-10-05 Thread Jeff Zhang
Hi Jander, Hadoop has overhead compared to a single-machine solution. How many tasks did you get when you ran your Hadoop job? And how much time did each map and reduce task take? There are lots of tips for performance tuning of Hadoop, such as compression and JVM reuse. 2010/10/5 Jander <442…
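
For reference, the two tips Jeff names correspond to standard configuration properties. A minimal mapred-site.xml sketch using the 0.20-era property names (the values shown are illustrative, not from the thread):

```xml
<!-- mapred-site.xml: illustrative values, 0.20-era property names -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>  <!-- compress intermediate map output before the shuffle -->
</property>
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>  <!-- -1 = reuse one child JVM for unlimited tasks of a job -->
</property>
```

JVM reuse matters most for jobs like Jander's with many short map tasks, where JVM startup is a large fraction of each task's 8-20 seconds.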

Re: Re: Help!! The problem about Hadoop

2010-10-05 Thread Jander
Hi Jeff, thank you very much for your reply. I know Hadoop has overhead, but isn't it too large in my case? The 1GB text input produces about 500 map tasks because the input is composed of many little text files. And the time each map takes is from 8 seconds to 20 seconds. I use compres…

Re: Re: Help!! The problem about Hadoop

2010-10-05 Thread Harsh J
500 small files comprising one gigabyte? Perhaps you should try concatenating them all into one big file and trying again; optimally, a mapper should run for at least a minute. And small files don't make good use of the HDFS block feature. Have a read: http://www.cloudera.com/blog/2009/02/the-sma…
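
Harsh's suggestion can be sketched locally before the data is loaded into HDFS (file names and paths here are illustrative, not from the thread):

```shell
# Merge many small inputs into one large file before the HDFS upload,
# so each mapper gets a full block's worth of work instead of one tiny file.
mkdir -p small merged
for i in 1 2 3; do printf 'record %s\n' "$i" > "small/part-$i.txt"; done
cat small/part-*.txt > merged/big.txt
wc -l < merged/big.txt        # 3 lines in one file instead of three files
# then, on the cluster (not run here):
# hadoop fs -put merged/big.txt /user/jander/input/
```

With ~500 files merged down to a handful of block-sized inputs, the job would launch a few long-running mappers rather than 500 short ones.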

Re: is there no streaming.jar file in hadoop-0.21.0??

2010-10-05 Thread Alejandro Abdelnur
Edward, Yep, you should use the one from contrib/ Alejandro On Tue, Oct 5, 2010 at 1:55 PM, edward choi wrote: > Thanks, Tom. > Didn't expect the author of THE BOOK would answer my question. Very > surprised and honored :-) > Just one more question if you don't mind. > I read it on the Internet

Re: Re: Help!! The problem about Hadoop

2010-10-05 Thread Alejandro Abdelnur
Or you could try using MultiFileInputFormat for your MR job. http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapred/MultiFileInputFormat.html Alejandro On Tue, Oct 5, 2010 at 4:55 PM, Harsh J wrote: > 500 small files comprising one gigabyte? Perhaps you should try > concat…

how to set different VM parameters for mappers and reducers?

2010-10-05 Thread Vitaliy Semochkin
Hello, I have mappers that do not need much RAM, but my combiners and reducers need a lot. Is it possible to set different VM parameters for mappers and reducers? PS: I often face an interesting problem: on the same set of data, I receive java.lang.OutOfMemoryError: Java heap space in the combiner, but i…

RE: how to set different VM parameters for mappers and reducers?

2010-10-05 Thread Michael Segel
Hi, You don't say which version of Hadoop you are using. Going from memory, I believe that in the CDH3 release from Cloudera there are some specific OPTs you can set in hadoop-env.sh. HTH -Mike > Date: Tue, 5 Oct 2010 16:59:35 +0400 > Subject: how to set different VM parameters for mappers and re…

Re: how to set different VM parameters for mappers and reducers?

2010-10-05 Thread Jeff Zhang
You can set mapred.child.java.opts in mapred-site.xml. BTW, the combiner can run on both the map side and the reduce side. On Tue, Oct 5, 2010 at 8:59 PM, Vitaliy Semochkin wrote: > Hello, > > > I have mappers that do not need much ram but combiners and reducers need a > lot. > Is it possible to set…
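
As a sketch, the property Jeff names goes in mapred-site.xml; the heap value below is illustrative:

```xml
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>  <!-- one setting for both map and reduce child JVMs -->
</property>
```

Note that this single property cannot give mappers and reducers different heaps, which is exactly Vitaliy's problem; that is what the per-phase variants in the next reply address.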

Re: how to set different VM parameters for mappers and reducers?

2010-10-05 Thread Alejandro Abdelnur
The following two properties should work: mapred.map.child.java.opts and mapred.reduce.child.java.opts. Alejandro On Tue, Oct 5, 2010 at 9:02 PM, Michael Segel wrote: > > Hi, > > You don't say which version of Hadoop you are using. > Going from memory, I believe in the CDH3 release from Cloudera, the…
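
A sketch of Alejandro's suggestion in mapred-site.xml, with illustrative heap sizes matching Vitaliy's scenario (small mappers, large reducers). Caveat: these per-phase variants are not present in every release; a vanilla 0.20.2 install, which Vitaliy reports using below, may only honor the combined mapred.child.java.opts:

```xml
<!-- Illustrative per-phase overrides, where the release supports them -->
<property>
  <name>mapred.map.child.java.opts</name>
  <value>-Xmx256m</value>  <!-- mappers need little RAM -->
</property>
<property>
  <name>mapred.reduce.child.java.opts</name>
  <value>-Xmx2048m</value>  <!-- combiners/reducers need a lot -->
</property>
```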

does reduce > copy (at 0.52 MB/s) mean network or other IO problems?

2010-10-05 Thread Vitaliy Semochkin
Hello, I often see the reduce > copy phase running at speeds like 0.52 MB/s, even though all 5 nodes in my cluster are in the same rack. Does this indicate network or other IO problems, or can other reasons cause such slow speed? Thanks in advance, Vitaliy S

Re: how to set different VM parameters for mappers and reducers?

2010-10-05 Thread Vitaliy Semochkin
I'm using Apache hadoop-0.20.2, the most recent version I found in the Maven central repo. Regards, Vitaliy S On Tue, Oct 5, 2010 at 5:02 PM, Michael Segel wrote: > > Hi, > > You don't say which version of Hadoop you are using. > Going from memory, I believe in the CDH3 release from Cloudera, there are…

Set number of Reducers per machine.

2010-10-05 Thread Pramy Bhats
Hi, I am trying to run a job on my Hadoop cluster, where I consistently get a heap space error. I increased the heap space to 4 GB in hadoop-env.sh and rebooted the cluster. However, I still get the heap space error. One of the things I want to try is to reduce the number of map/reduce processes p…

Re: Set number of Reducers per machine.

2010-10-05 Thread Marcos Medrado Rubinelli
You can set the mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties in your mapred-site.xml file, but you may also want to check your current mapred.child.java.opts and mapred.child.ulimit values to make sure they aren't overriding the 4GB you set globally.
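
A sketch of Marcos's suggestion in mapred-site.xml (the slot counts are illustrative; these are per-tasktracker settings and take effect after a tasktracker restart):

```xml
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>  <!-- concurrent map slots per node -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>  <!-- concurrent reduce slots per node -->
</property>
```

Fewer simultaneous child JVMs per node means each one can be given a larger heap without exhausting the machine's physical memory.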

Re: Set number of Reducers per machine.

2010-10-05 Thread ed
Hi Pramod, How much memory does each node in your cluster have? What type of processors do those nodes have? (dual core, quad core, dual quad core? etc.) In what step are you seeing the heap space error (mapper or reducer?) It's quite possible that your mapper or reducer code could be improv…

Re: does reduce > copy (at 0.52 MB/s) mean network or other IO problems?

2010-10-05 Thread Harsh J
The reduce phase begins copying map outputs as they complete (starting once 5% of them have finished), and early on this transfer may be very meagre, hence the low rate of transfer. Observe once all maps finish or near completion in their last wave; if the rate shown is still slow, then there is a problem, whose commo…

Re: Set number of Reducers per machine.

2010-10-05 Thread Pramy Bhats
Hi Ed, I was trying to benchmark some application code available online (http://github.com/lintool/Cloud9), namely the program computing concurrentmatrix strips. However, the code itself seems problematic, because it throws a heap-space error even for very small data sets. Thanks, --Pramod On Tue, Oct 5…

Re: Set number of Reducers per machine.

2010-10-05 Thread ed
What are the exact files you are using for the mapper and reducer from the Cloud9 package? On Tue, Oct 5, 2010 at 2:15 PM, Pramy Bhats wrote: > Hi Ed, > > I was trying to benchmark some application code available online. > http://github.com/lintool/Cloud9 > > For the program computing concurrentm…

Re: Problem with DistributedCache after upgrading to CDH3b2

2010-10-05 Thread Kim Vogt
I'm experiencing the same problem. I was hoping there would be a reply to this. Anyone? Bueller? -Kim On Fri, Jul 16, 2010 at 1:58 AM, Jamie Cockrill wrote: > Dear All, > > We recently upgraded from CDH3b1 to b2 and ever since, all our > mapreduce jobs that use the DistributedCache have failed.

Re: Problem with DistributedCache after upgrading to CDH3b2

2010-10-05 Thread Jamie Cockrill
Hi Kim, We didn't fix it in the end. I just ended up manually writing the files to the cluster using the FileSystem class, and then reading them back out again on the other side. Not terribly efficient, as I guess the point of DistributedCache is that the files get distributed to every node, wherea…

conf.setCombinerClass in Map/Reduce

2010-10-05 Thread Shi Yu
Hi, I am still confused about the effect of using a Combiner in Hadoop Map/Reduce. The performance tips (http://www.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/) suggest writing a combiner to do initial aggregation before the data hits the reducer, for performance ad…
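
The combiner's effect can be illustrated outside Hadoop entirely (this is an analogy, not the Hadoop API): it pre-aggregates sorted map output on the map side, so far less data crosses the network to the reducer, much like the local `uniq -c` below:

```shell
# Word-count analogy: the sorted "map output" is pre-aggregated locally
# ("combined") before it would be shuffled to a reducer.
printf 'a\nb\na\na\nb\n' | sort | uniq -c
# each key now carries a partial count instead of repeated records
```

Because the framework may run the combiner zero, one, or many times, combiner output must be acceptable as reducer input; that is the constraint Shi Yu's MultipleOutputs setup violates later in this thread.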

Re: Problem with DistributedCache after upgrading to CDH3b2

2010-10-05 Thread Kim Vogt
Hey Jamie, Thanks for the reply. I asked about it on the Cloudera IRC, so maybe they'll look into it; in the meantime, I'm going to go ahead and copy that file over to my datanodes :-) -Kim On Tue, Oct 5, 2010 at 2:54 PM, Jamie Cockrill wrote: > Hi Kim, > > We didn't fix it in the end. I just ende…

Re: Help!! The problem about Hadoop

2010-10-05 Thread Sudhir Vallamkondu
You should try implementing some suggestions from this blog post: http://www.cloudera.com/blog/2009/02/the-small-files-problem/ In general, just google for tuning map/reduce programs and you will find some good articles like these: http://www.docstoc.com/docs/3766688/Hadoop-Map-Reduce-Tuning-and-Deb…

Re: is there no streaming.jar file in hadoop-0.21.0??

2010-10-05 Thread edward choi
Thanks for the reply Alejandro. Appreciate it. Ed. 2010/10/5 Alejandro Abdelnur > Edward, > > Yep, you should use the one from contrib/ > > Alejandro > > On Tue, Oct 5, 2010 at 1:55 PM, edward choi wrote: > > Thanks, Tom. > > Didn't expect the author of THE BOOK would answer my question. Very

Re: Help!! The problem about Hadoop

2010-10-05 Thread Allen Wittenauer
On Oct 5, 2010, at 12:34 AM, Jander wrote: > Hi, all > I do an application using hadoop. > I take 1GB text data as input the result as follows: > (1) the cluster of 3 PCs: the time consumed is 1020 seconds. > (2) the cluster of 4 PCs: the time is about 680 seconds. > But the application bef…

Re: Datanode Registration DataXceiver java.io.EOFException

2010-10-05 Thread Sudhir Vallamkondu
We use Ganglia for monitoring our cluster, and use a Nagios plugin that interfaces with the gmetad node to set up various rules around the number of datanodes, missing/corrupted blocks, etc. http://www.cloudera.com/blog/2009/03/hadoop-metrics/ http://exchange.nagios.org/directory/Plugins/Network-and-Systems-…

Re: conf.setCombinerClass in Map/Reduce

2010-10-05 Thread Antonio Piccolboni
On Tue, Oct 5, 2010 at 4:32 PM, Shi Yu wrote: > Hi, > > I am still confused about the effect of using Combiner in Hadoop > Map/Reduce. The performance tips ( > http://www.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/) > suggest us to write a combiner to do initial aggregat…

Re: conf.setCombinerClass in Map/Reduce

2010-10-05 Thread Shi Yu
Hi, thanks for the answer, Antonio. I have found one of the main problems: I used MultipleOutputs in the Reduce class, so when I set that class as both the Combiner and the Reducer, the Combiner did not provide a normal data flow to the Reducer. Therefore, the program stops at the Combiner and…

Re: Efficient query to directory-num-files?

2010-10-05 Thread Keith Wiley
On Oct 4, 2010, at 11:38 AM, Harsh J wrote: On Mon, Oct 4, 2010 at 11:11 PM, Keith Wiley wrote: - I want to know how many files are in a directory. - Well, actually, I want to know how many files are in a few thousand directories. - I anticipate the answer to be approximately four millio…
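
One efficient option for this kind of query is `hadoop fs -count`, which returns one summary line per path (directory count, file count, content size) without streaming every file's listing to the client. The HDFS paths below are hypothetical, and the local half is only an analogue of the question being asked:

```shell
# On the cluster (not run here): one summary line per argument path,
#   DIR_COUNT  FILE_COUNT  CONTENT_SIZE  PATHNAME
# hadoop fs -count /user/keith/dir1 /user/keith/dir2
# Local analogue of "how many files are in a directory":
mkdir -p demo && touch demo/f1 demo/f2 demo/f3
find demo -maxdepth 1 -type f | wc -l    # 3
```

For a few thousand directories, batching many paths per `-count` invocation (or using a glob) avoids paying JVM startup cost once per directory.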