Re: delay the execution of reducers

2010-11-29 Thread Chandraprakash Bhagtani
you can see whether your property is in effect by looking at the following
URL
http://<jobtracker_host>:50030/jobconf.jsp?jobid=<job_id>

replace <jobtracker_host> with your jobtracker IP and <job_id> with the id of the
running job

have you restarted mapreduce after changing mapred-site.xml?
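
If it doesn't show up there, it can also help to print what the submitting JVM
itself sees; a minimal client-side sketch, assuming the old (JobConf) API and
that your mapred-site.xml is on the classpath of the JVM that submits the job:

import org.apache.hadoop.mapred.JobConf;

// Prints the value the job submission would pick up from the local config files.
public class SlowstartCheck {
  public static void main(String[] args) {
    JobConf conf = new JobConf();  // loads *-default.xml and *-site.xml from the classpath
    System.out.println("mapred.reduce.slowstart.completed.maps = "
        + conf.get("mapred.reduce.slowstart.completed.maps", "<not set>"));
  }
}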




On Mon, Nov 29, 2010 at 6:56 AM, li ping  wrote:

> org.apache.hadoop.mapred.JobInProgress
>
> Maybe you will find this class helpful.
>
> On Mon, Nov 29, 2010 at 4:36 AM, Da Zheng  wrote:
>
> > I have a problem with subscribing mapreduce mailing list.
> >
> > I use hadoop-0.20.2. I have added this parameter to mapred-site.xml. Is
> > there any way for me to check whether the parameter has been read and
> > activated?
> >
> > BTW, what do you mean by opening a jira?
> >
> > Thanks,
> > Da
> >
> >
> > On 11/28/2010 05:03 AM, Arun C Murthy wrote:
> >
> >> Moving to mapreduce-user@, bcc common-u...@. Please use project
> >> specific lists.
> >>
> >> mapreduce.reduce.slowstart.completed.maps is the right knob. Which
> >> version of hadoop are you running? If it isn't working, please open a
> >> jira. Thanks.
> >>
> >> Arun
> >>
> >> On Nov 27, 2010, at 11:40 PM, Da Zheng wrote:
> >>
> >>> Hello,
> >>>
> >>> I found in Hadoop that reducers start when a fraction of the number of
> >>> mappers is complete. However, in my case, I want reducers to start only
> >>> when all mappers are complete. I searched the Hadoop configuration
> >>> parameters and found mapred.reduce.slowstart.completed.maps, which seems
> >>> to do what I want. But no matter what value (0.99, 1.00, etc.) I set for
> >>> mapred.reduce.slowstart.completed.maps, reducers always start to execute
> >>> when about 10% of the mappers are complete.
> >>>
> >>> Do I set the right parameter? Is there any other parameter I can use for
> >>> this purpose?
> >>>
> >>> Thanks,
> >>> Da
> >>>
> >>
> >>
> >
>
>
> --
> -李平
>



-- 
Thanks & Regards,
Chandra Prakash Bhagtani,
Nokia India Pvt. Ltd.


Equivalence of MultipleInputs.addInputPath(...) without a JobConf

2010-11-29 Thread Alan Said
Hi all,
I'm having difficulties figuring out what the equivalent of
org.apache.hadoop.mapred.lib.MultipleInputs.addInputPath(JobConf conf, Path path, Class<? extends InputFormat> inputFormatClass, Class<? extends Mapper> mapperClass)
is in org.apache.hadoop.mapreduce, i.e. not using a JobConf, but rather a Job or
Configuration?

Would appreciate any help.

Thanks,
Alan
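
(For what it's worth: I believe 0.20.2 itself does not ship a new-API
MultipleInputs, but later releases do, under
org.apache.hadoop.mapreduce.lib.input. A minimal sketch of what that looks
like, assuming such a release; the paths, job name, and the FirstMapper /
SecondMapper classes below are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MultiInputDriver {
  // Trivial placeholder mappers, only here to keep the example self-contained.
  public static class FirstMapper extends Mapper<LongWritable, Text, LongWritable, Text> {}
  public static class SecondMapper extends Mapper<LongWritable, Text, LongWritable, Text> {}

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "multi-input");
    job.setJarByClass(MultiInputDriver.class);
    // Same idea as the old-API call, but takes a Job instead of a JobConf
    MultipleInputs.addInputPath(job, new Path("/input/a"),
        TextInputFormat.class, FirstMapper.class);
    MultipleInputs.addInputPath(job, new Path("/input/b"),
        TextInputFormat.class, SecondMapper.class);
    // ... reducer, output path, job.waitForCompletion(true), etc. ...
  }
}
)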



Re: delay the execution of reducers

2010-11-29 Thread Da Zheng

On 11/29/2010 05:42 AM, Chandraprakash Bhagtani wrote:

you can see whether your property is in effect by looking at the following
URL
http://<jobtracker_host>:50030/jobconf.jsp?jobid=<job_id>

replace <jobtracker_host> with your jobtracker IP and <job_id> with the id of the
running job

have you restarted mapreduce after changing mapred-site.xml?

It shows me the value is still 0.05. I am a little confused. Since 
hadoop in each machine has configuration files, which configuration 
files should I change? For mapred-site.xml, I only need to change the 
one in the master node? (I always start my MapReduce program from the 
master node). What about other configuration files such as core-site.xml 
and hdfs-site.xml? I guess I have to change them on all machines in the 
cluster.


Thanks,
Da


Re: delay the execution of reducers

2010-11-29 Thread Arun C Murthy

Just set it for your job.

In your launching program do something like:

jobConf.setFloat("mapred.reduce.slowstart.completed.maps", 0.5f);
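
For the original goal in this thread (no reducer until every map has finished),
the same call with 1.0f should do it; a minimal driver sketch, assuming the old
(JobConf) API, with the rest of the job setup elided:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class DelayedReduceDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(DelayedReduceDriver.class);
    // 1.0f: do not schedule any reduce task until 100% of the maps are complete
    conf.setFloat("mapred.reduce.slowstart.completed.maps", 1.0f);
    // ... set mapper, reducer, input/output paths, etc. ...
    JobClient.runJob(conf);
  }
}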

On Nov 29, 2010, at 9:46 AM, Da Zheng wrote:


On 11/29/2010 05:42 AM, Chandraprakash Bhagtani wrote:
you can see whether your property is in effect by looking at the following
URL
http://<jobtracker_host>:50030/jobconf.jsp?jobid=<job_id>

replace <jobtracker_host> with your jobtracker IP and <job_id> with the id of
the running job

have you restarted mapreduce after changing mapred-site.xml?


It shows me the value is still 0.05. I am a little confused. Since
hadoop in each machine has configuration files, which configuration
files should I change? For mapred-site.xml, I only need to change the
one in the master node? (I always start my MapReduce program from the
master node). What about other configuration files such as core-site.xml
and hdfs-site.xml? I guess I have to change them on all machines in the
cluster.

Thanks,
Da




Re: delay the execution of reducers

2010-11-29 Thread Da Zheng

Changing the parameter for a specific job works better for me.

But I was asking in general in which configuration file(s) I should
change the value of the parameters.
For parameters in hdfs-site.xml, I should change the configuration file
on each machine. But for parameters in mapred-site.xml, it seems enough
to change the configuration file on the machine where the job is launched.


Thanks,
Da

On 11/29/2010 01:31 PM, Arun C Murthy wrote:

Just set it for your job.

In your launching program do something like:

jobConf.setFloat("mapred.reduce.slowstart.completed.maps", 0.5f);

On Nov 29, 2010, at 9:46 AM, Da Zheng wrote:


On 11/29/2010 05:42 AM, Chandraprakash Bhagtani wrote:
you can see whether your property is in effect by looking at the following
URL
http://<jobtracker_host>:50030/jobconf.jsp?jobid=<job_id>

replace <jobtracker_host> with your jobtracker IP and <job_id> with the id of
the running job

have you restarted mapreduce after changing mapred-site.xml?


It shows me the value is still 0.05. I am a little confused. Since
hadoop in each machine has configuration files, which configuration
files should I change? For mapred-site.xml, I only need to change the
one in the master node? (I always start my MapReduce program from the
master node). What about other configuration files such as core-site.xml
and hdfs-site.xml? I guess I have to change them on all machines in the
cluster.

Thanks,
Da






small files and number of mappers

2010-11-29 Thread Marc Sturlese

Hey there,
I am doing some tests and wondering what the best practices are to deal
with very small files which are continuously being generated (1 MB or even
less).

I see that if I have hundreds of small files in HDFS, Hadoop will
automatically create A LOT of map tasks to consume them. Each map task will
take 10 seconds or less... I don't know if it's possible to change the number
of map tasks from Java code using the new API (I know it can be done with the
old one). I would like to do something like NumMapTasksCalculatedByHadoop * 0.3.
This way, fewer map tasks would be instantiated and each would be working
for more time.

I have had a look at Hadoop archives as well but don't think they can help me
here.

Any advice or similar experience?
Thanks in advance.
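
For what it's worth, a minimal sketch of the map-count hint in both APIs (the
class and job names are made up; in 0.20.x the new-API Job has no
setNumMapTasks, but the underlying config key can still be set):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;

public class MapCountHint {
  public static void main(String[] args) throws Exception {
    // Old API: an explicit (advisory) hint for the number of map tasks
    JobConf oldConf = new JobConf();
    oldConf.setNumMapTasks(10);

    // New API: pass the same hint through the underlying Configuration
    Job job = new Job(new Configuration(), "small-files");
    job.getConfiguration().setInt("mapred.map.tasks", 10);

    // Note: this is only a hint; FileInputFormat still creates at least one
    // split per file, so with many small files the map count usually won't
    // drop below the number of files without combining them first.
  }
}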




a bug? inputsplit cannot return the location correctly

2010-11-29 Thread Da Zheng

Hello,

I guess this might not be the right mailing list to send this to, but I cannot
send emails to the MapReduce mailing list. I don't know why.


I have a 6-node cluster and stored 25 GB of data in HDFS. I ran a
simple MapReduce program and used mapred.reduce.slowstart.completed.maps
to delay the execution of reducers. That is, during the map phase,
only the mappers are running. Normally, there shouldn't be much network
traffic when only mappers are running. However, I can see almost 25 GB
of data being transmitted over the network.


So I printed all splits, the files they point to, and the nodes where they
reside when the InputSplits are generated. I also printed the same thing for
each split when a RecordReader is initialized. I was surprised to find that
the InputSplit (in my case, a FileSplit) passed to the RecordReader doesn't
have any location information. That seems to explain why Hadoop cannot
take data locality into account when launching map tasks.
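
For reference, a minimal sketch of that kind of check, assuming the new
(mapreduce) API and text input; LocationLoggingReader is a made-up name, and
it would have to be returned by a custom InputFormat to actually be used:

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// A LineRecordReader that logs the split it receives, including its locations.
public class LocationLoggingReader extends LineRecordReader {
  @Override
  public void initialize(InputSplit split, TaskAttemptContext context)
      throws IOException {
    FileSplit fileSplit = (FileSplit) split;
    System.err.println("split " + fileSplit.getPath()
        + " start " + fileSplit.getStart()
        + " length " + fileSplit.getLength()
        + " locations " + Arrays.toString(fileSplit.getLocations()));
    super.initialize(split, context);
  }
}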


It seems like a bug to me. I use Hadoop 0.20.2. Has anyone experienced
a similar problem?


Thanks,
Da


HDFS Rsync process??

2010-11-29 Thread hadoopman
We have two Hadoop clusters in two separate buildings.  Both clusters 
are loading the same data from the same sources (the second cluster is 
for DR).


We're looking at how we can recover the primary cluster and catch it
back up again, as new data will continue to feed into the DR cluster.
It's been suggested we use rsync across the network; however, my concern
is that the amount of data we would have to copy over would take several days
(at a minimum) to sync, even with our dual bonded 1 Gb network cards.


I'm curious if anyone has come up with a solution short of just loading
the source logs into HDFS.  Is there even a way to rsync two clusters
and get them in sync?  I've been googling around but haven't found anything
of substance yet.


Thanks!


where is example of the configuration about multi nodes on one machine?

2010-11-29 Thread beneo_7
I have only one machine and it's powerful,
so I want to run the master and all the slaves on this one machine.

Thanks in advance


Re: where is example of the configuration about multi nodes on one machine?

2010-11-29 Thread rahul patodi
You can create virtual machines on your single machine:
for this you have to install Sun VirtualBox (other tools are also available, like
VMware).
Now you can create as many virtual machines as you want,
then make one of them the master and the rest slaves.

-Thanks and Regards,
Rahul Patodi
Associate Software Engineer,
Impetus Infotech (India) Private Limited,
www.impetus.com
Mob:09907074413

2010/11/30 beneo_7 

> I have only one machine and it's powerful,
> so I want to run the master and all the slaves on this one machine.
>
> Thanks in advance
>



--