Re: delay the execution of reducers
You can see whether your property is in effect by looking at the following URL:

http://&lt;jobtracker-ip&gt;:50030/jobconf.jsp?jobid=&lt;job-id&gt;

Replace &lt;jobtracker-ip&gt; with your JobTracker's IP and &lt;job-id&gt; with the running job's ID. Have you restarted MapReduce after changing mapred-site.xml?

On Mon, Nov 29, 2010 at 6:56 AM, li ping wrote:
> org.apache.hadoop.mapred.JobInProgress
>
> Maybe you will find this class useful.
>
> On Mon, Nov 29, 2010 at 4:36 AM, Da Zheng wrote:
> > I have a problem with subscribing to the mapreduce mailing list.
> >
> > I use hadoop-0.20.2. I have added this parameter to mapred-site.xml. Is
> > there any way for me to check whether the parameter has been read and
> > activated?
> >
> > BTW, what do you mean by opening a jira?
> >
> > Thanks,
> > Da
> >
> > On 11/28/2010 05:03 AM, Arun C Murthy wrote:
> > > Moving to mapreduce-user@, bcc common-u...@. Please use
> > > project-specific lists.
> > >
> > > mapreduce.reduce.slowstart.completed.maps is the right knob. Which
> > > version of hadoop are you running? If it isn't working, please open a
> > > jira. Thanks.
> > >
> > > Arun
> > >
> > > On Nov 27, 2010, at 11:40 PM, Da Zheng wrote:
> > >
> > > > Hello,
> > > >
> > > > I found that in Hadoop, reducers start when a fraction of the mappers
> > > > is complete. However, in my case, I want reducers to start only when
> > > > all mappers are complete. I searched the Hadoop configuration
> > > > parameters and found mapred.reduce.slowstart.completed.maps, which
> > > > seems to do what I want. But no matter what value (0.99, 1.00, etc.)
> > > > I set mapred.reduce.slowstart.completed.maps to, reducers always
> > > > start to execute when about 10% of the mappers are complete.
> > > >
> > > > Am I setting the right parameter? Is there any other parameter I can
> > > > use for this purpose?
> > > >
> > > > Thanks,
> > > > Da

--
-李平

--
Thanks & Regards,
Chandra Prakash Bhagtani,
Nokia India Pvt. Ltd.
Equivalence of MultipleInputs.addInputPath(...) without a JobConf
Hi all,

I'm having difficulty figuring out what the equivalent of org.apache.hadoop.mapred.lib.MultipleInputs.addInputPath(JobConf conf, Path path, Class inputFormatClass, Class mapperClass) is in org.apache.hadoop.mapreduce, i.e. not using a JobConf, but rather Job or Configuration. Would appreciate any help.

Thanks,
Alan
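For what it's worth, later Hadoop releases (0.21 and up) ship a new-API port of this class at org.apache.hadoop.mapreduce.lib.input.MultipleInputs, whose addInputPath takes a Job instead of a JobConf. A minimal sketch, assuming that port is on your classpath; the driver and mapper class names are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MultiInputDriver {
  // Hypothetical per-source mappers; replace with your own.
  static class FirstMapper extends Mapper<LongWritable, Text, Text, Text> {}
  static class SecondMapper extends Mapper<LongWritable, Text, Text, Text> {}

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "multi-input");

    // New-API equivalent of MultipleInputs.addInputPath(JobConf, ...):
    // each input path gets its own InputFormat and Mapper.
    MultipleInputs.addInputPath(job, new Path(args[0]),
        TextInputFormat.class, FirstMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]),
        TextInputFormat.class, SecondMapper.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```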
Re: delay the execution of reducers
On 11/29/2010 05:42 AM, Chandraprakash Bhagtani wrote:
> You can see whether your property is in effect by looking at
> http://&lt;jobtracker-ip&gt;:50030/jobconf.jsp?jobid=&lt;job-id&gt;
> Have you restarted MapReduce after changing mapred-site.xml?

It shows me the value is still 0.05. I am a little confused. Since Hadoop on each machine has its own configuration files, which ones should I change? For mapred-site.xml, do I only need to change the one on the master node? (I always start my MapReduce program from the master node.) What about other configuration files such as core-site.xml and hdfs-site.xml? I guess I have to change them on all machines in the cluster.

Thanks,
Da
Re: delay the execution of reducers
Just set it for your job. In your launching program do something like:

  jobConf.setFloat("mapred.reduce.slowstart.completed.maps", 0.5f);

On Nov 29, 2010, at 9:46 AM, Da Zheng wrote:
> It shows me the value is still 0.05. I am a little confused. Since Hadoop
> on each machine has its own configuration files, which ones should I
> change? For mapred-site.xml, do I only need to change the one on the
> master node? (I always start my MapReduce program from the master node.)
> What about other configuration files such as core-site.xml and
> hdfs-site.xml? I guess I have to change them on all machines in the
> cluster.
>
> Thanks,
> Da
Re: delay the execution of reducers
Changing the parameter for a specific job works better for me. But I was asking, in general, in which configuration file(s) I should change the value of a parameter. For parameters in hdfs-site.xml, I have to change the configuration file on each machine. But for parameters in mapred-site.xml, it seems enough to change the configuration file on the machine where the job is launched.

Thanks,
Da

On 11/29/2010 01:31 PM, Arun C Murthy wrote:
> Just set it for your job. In your launching program do something like:
>
>   jobConf.setFloat("mapred.reduce.slowstart.completed.maps", 0.5f);
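Arun's one-liner in the context of a fuller driver, using the old 0.20 API this thread is discussing; the class and path names are illustrative, and the value 1.0f matches the original goal of holding reducers until every map has finished:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SlowStartDriver {
  public static void main(String[] args) throws Exception {
    JobConf jobConf = new JobConf(SlowStartDriver.class);
    jobConf.setJobName("slow-start-demo");

    // Don't schedule any reducer until 100% of the maps have finished.
    // Per-job settings override mapred-site.xml, so no cluster-wide
    // configuration change or restart is needed.
    jobConf.setFloat("mapred.reduce.slowstart.completed.maps", 1.0f);

    FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
    FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));
    JobClient.runJob(jobConf);
  }
}
```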
small files and number of mappers
Hey there,

I am doing some tests and wondering about the best practices for dealing with very small files that are continuously being generated (1 MB or even less). I see that if I have hundreds of small files in HDFS, Hadoop automatically creates A LOT of map tasks to consume them. Each map task takes 10 seconds or less... I don't know if it's possible to change the number of map tasks from Java code using the new API (I know it can be done with the old one). I would like to do something like NumMapTasksCalculatedByHadoop * 0.3. That way, fewer map tasks would be instantiated and each would work for more time. I have had a look at Hadoop archives as well, but I don't think they can help me here. Any advice or similar experience? Thanks in advance.
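One common approach (a hedged suggestion, not from this thread) is CombineFileInputFormat, which packs many small files into each split so one mapper processes many files; mapred.max.split.size caps the combined split size. A minimal old-API (0.20) sketch, where MyLineReaderWrapper is a hypothetical per-file RecordReader wrapper you would have to supply:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

// Packs many small files into each split so one mapper handles many files.
public class CombinedTextInputFormat
    extends CombineFileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf conf, Reporter reporter) throws IOException {
    // CombineFileRecordReader iterates over the files in the CombineFileSplit,
    // delegating each one to an instance of the wrapper class.
    return new CombineFileRecordReader<LongWritable, Text>(
        conf, (CombineFileSplit) split, reporter, MyLineReaderWrapper.class);
  }
}

// In the driver:
//   conf.setInputFormat(CombinedTextInputFormat.class);
//   conf.setLong("mapred.max.split.size", 128L * 1024 * 1024); // ~128 MB per split
```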
a bug? inputsplit cannot return the location correctly
Hello,

I guess this might not be the right mailing list to send to, but I cannot send emails to the MapReduce mailing list. I don't know why.

I have a 6-node cluster and have stored 25GB of data in HDFS. I ran a simple MapReduce program and used mapred.reduce.slowstart.completed.maps to delay the execution of the reducers. That is, during the map phase, only the mappers are running. Normally, there shouldn't be much network traffic when only mappers are running. However, I can see that almost 25GB of data is transmitted over the network. So I printed all splits, the files they point to, and the nodes where they are located at the time the InputSplits are generated. I also printed the same thing for each split when its RecordReader is initialized. I was surprised to find that the InputSplit (in my case, a FileSplit) passed to the RecordReader doesn't have any location information. That seems to explain why Hadoop cannot take data locality into account when launching mapper tasks. It looks like a bug to me. I use hadoop v0.20.2. Has anyone experienced a similar problem?

Thanks,
Da
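For anyone debugging this, the client-side splits can be inspected with a sketch like the one below (new 0.20 API; the input path is illustrative). One caveat worth knowing: FileSplit's serialization does not include the host list, so the copy of the split a RecordReader sees on the task side reports no locations by design; the scheduler uses the client-side split metadata for locality, so an empty getLocations() inside a task is not by itself evidence of a bug.

```java
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitLocations {
  public static void main(String[] args) throws Exception {
    Job job = new Job();
    FileInputFormat.addInputPath(job, new Path(args[0]));

    // Client side: splits still carry block locations here, and this is
    // the information the JobTracker uses to schedule node-local maps.
    List<InputSplit> splits = new TextInputFormat().getSplits(job);
    for (InputSplit s : splits) {
      System.out.println(s + " -> " + Arrays.toString(s.getLocations()));
    }
    // Task side: FileSplit.write()/readFields() drop the host list, so
    // getLocations() inside a RecordReader comes back empty.
  }
}
```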
HDFS Rsync process??
We have two Hadoop clusters in two separate buildings. Both clusters are loading the same data from the same sources (the second cluster is for DR). We're looking at how we can recover the primary cluster and catch it back up again as new data continues to feed into the DR cluster. It's been suggested we use rsync across the network; however, my concern is that the amount of data we would have to copy over would take several days (at a minimum) to sync them, even with our dual bonded 1-gig network cards. I'm curious whether anyone has come up with a solution short of just reloading the source logs into HDFS. Is there a way to even rsync two clusters and get them in sync? I've been googling around but haven't found anything of substance yet. Thanks!
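Not from this thread, but the usual tool for cluster-to-cluster copies is DistCp, which runs the copy as a parallel MapReduce job; its -update flag skips files that already match on the target, giving a coarse rsync-like behavior (it compares sizes/checksums per file, not byte-level deltas). The NameNode host names below are illustrative:

```shell
# Copy /data from the DR cluster back to the rebuilt primary cluster.
# -update skips files whose size and checksum already match on the target,
# so re-running the command after the initial bulk copy only moves what
# changed since the last run.
hadoop distcp -update \
    hdfs://dr-namenode:8020/data \
    hdfs://primary-namenode:8020/data
```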
where is example of the configuration about multi nodes on one machine?
I have only one machine, and it's powerful. So I want to run all the slaves and the master on the one machine. How do I configure that?

Thanks in advance
Re: where is example of the configuration about multi nodes on one machine?
You can create virtual machines on your single machine: install Sun VirtualBox (other tools, such as VMware, are also available), then create as many virtual machines as you want, and make one of them the master and the rest slaves.

--
Thanks and Regards,
Rahul Patodi
Associate Software Engineer,
Impetus Infotech (India) Private Limited,
www.impetus.com
Mob: 09907074413

2010/11/30 beneo_7
> i have only one machine and it's powerful.
> so, i want the all the slaves and master on one machine?
>
> thx in advanced
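As an aside, Hadoop can also run all daemons (NameNode, DataNode, JobTracker, TaskTracker) on one machine without VMs via pseudo-distributed mode; the standard configuration fragments from the Hadoop 0.20 quickstart look like this:

```xml
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: single node, so only one replica per block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```

This gives real daemon processes and real HDFS on a single box, which is usually enough for testing; VMs are mainly useful if you specifically want to exercise multi-host behavior.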