Please help me: is there a way to "chown" in Hadoop?
I need to change a file's owner from userA to userB. Is there such a command? Thanks a lot!

% hadoop dfs -ls file
/user/userA/file    2008-08-25 20:00    rwxr-xr-x    userA    supergroup
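For reference, the filesystem shell has a chown built in (it arrived together with HDFS permissions); a minimal sketch, noting that changing ownership normally requires superuser privileges:

% hadoop dfs -chown userB /user/userA/file
% hadoop dfs -chown -R userB:groupB /user/userA/dir    # -R recurses; the :group part is optional

Here userB, groupB and the paths are placeholders for your own values.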
Fw: Write permission of file/dir in Hadoop
Would anybody help with this? Thanks. - Forwarded Message From: Gopal Gandhi <[EMAIL PROTECTED]> To: core-user@hadoop.apache.org Sent: Thursday, August 21, 2008 3:36:40 PM Subject: Write permission of file/dir in Hadoop Folks, Is it possible to "chmod" a dir in Hadoop so that user X can only write files to it but cannot remove files from it? Thanks.
Write permission of file/dir in Hadoop
Folks, Is it possible to chmod a dir in Hadoop so that user X can only write files to it but cannot remove files from it? Thanks.
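For context on what permissions can express: in the POSIX-style model HDFS follows, the write bit on a directory governs both creating and deleting entries, so plain mode bits cannot grant one without the other. On local Unix filesystems the sticky bit restricts deletion to a file's owner, sketched here with a placeholder path:

% hadoop dfs -chmod 1777 /user/shared    # leading 1 = sticky bit; /user/shared is a made-up path

Whether a given Hadoop release actually enforces the sticky bit on HDFS is version-dependent, so this is something to verify before relying on it.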
[Streaming] How to pass arguments to a map/reduce script
I am using Hadoop streaming and I need to pass arguments to my map/reduce script, because the script is launched by Hadoop like this:

hadoop -file MAPPER -mapper "$MAPPER" -file REDUCER -reducer "$REDUCER" ...

How can I pass arguments to MAPPER? I tried -cmdenv name=val, but it does not work. Can anybody help me? Thanks a lot.
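For what it's worth, a sketch of two approaches (MY_ARG, mapper.pl and the jar path are placeholders): -cmdenv exports the value as an environment variable, not a command-line argument, so the script has to read it from the environment:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -cmdenv MY_ARG=some_value \
    -file mapper.pl -mapper mapper.pl \
    -file reducer.pl -reducer reducer.pl \
    -input in -output out

# inside mapper.pl: read the value from the environment, not from @ARGV
my $arg = $ENV{MY_ARG};

Alternatively, streaming treats the -mapper value as a full command line, so arguments can be embedded directly, e.g. -mapper "mapper.pl some_value".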
How to write Java code for Hadoop streaming.
I am using Hadoop streaming and I want to write the map/reduce scripts in Java, rather than Perl, etc. Would anybody give me a sample? Thanks
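For the record, a streaming mapper is just a program that reads lines on stdin and writes key<TAB>value lines to stdout, so a Java version is plain stdin/stdout code. A minimal word-count mapper as a sketch (the class name is made up):

import java.io.BufferedReader;
import java.io.InputStreamReader;

// StreamingMapper is an arbitrary name; compile and ship it with -file.
public class StreamingMapper {
    public static void main(String[] args) throws Exception {
        // Streaming feeds input records as lines on stdin.
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (word.length() > 0) {
                    // Emit key<TAB>value; streaming splits on the first tab.
                    System.out.println(word + "\t1");
                }
            }
        }
    }
}

You would then invoke it as the mapper command, e.g. -mapper "java StreamingMapper". That said, if the logic is in Java anyway, the native org.apache.hadoop.mapred API is usually a better fit; streaming mainly buys you non-JVM languages.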
Re: how to increase number of reduce tasks
I think this is because your job's input data exists on one node. Mappers are preferentially launched on the nodes that hold the input data blocks. For the reducer, I am not sure why there's only 1. Can anybody explain that? - Original Message From: Alexander Aristov <[EMAIL PROTECTED]> To: core-user@hadoop.apache.org Sent: Thursday, July 31, 2008 12:06:59 PM Subject: how to increase number of reduce tasks Hi I am running Nutch on Hadoop 0.17.1. I launch 5 nodes to perform crawling. When I look at the job statistics I see that only 1 reduce task is started for all steps, and hence I conclude that Hadoop doesn't consume all available resources. Only one node is extremely busy; the other nodes are idle. How can I configure Hadoop to consume all resources? I added the mapred.map.tasks and mapred.reduce.tasks parameters but they have no effect. I also increased the max number of mapred tasks; the job tracker shows it. During all stages map tasks reach a maximum of 3, and reduce only 1. -- Best Regards Alexander Aristov
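For reference, the reducer count is a per-job setting; a sketch of the usual knobs (the values are examples only). In hadoop-site.xml:

<property>
  <name>mapred.reduce.tasks</name>
  <value>10</value>
</property>

or in the job's Java setup code (old org.apache.hadoop.mapred API):

jobConf.setNumReduceTasks(10);

One caveat: a job can override the site-wide setting, and some jobs legitimately force a single reducer (anything that needs one globally sorted or merged output file), which may be why the parameter appeared to have no effect.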
Re: How can I control Number of Mappers of a job?
Thank you, finally someone has taken an interest in my question =) My cluster contains more than one machine. Please don't get me wrong :-). I don't want to limit the total mappers on one node (by mapred.map.tasks). What I want is to limit the total mappers for one job. The motivation is that I have 2 jobs to run at the same time, and they have the same input data in Hadoop. I found that one job has to wait until the other finishes its mapping. Because the 2 jobs are submitted by 2 different people, I don't want one job to starve. So I want to limit the first job's total mappers so that the 2 jobs will run simultaneously. - Original Message From: "Goel, Ankur" <[EMAIL PROTECTED]> To: core-user@hadoop.apache.org Cc: [EMAIL PROTECTED] Sent: Wednesday, July 30, 2008 10:17:53 PM Subject: RE: How can I control Number of Mappers of a job? How big is your cluster? Assuming you are running a single-node cluster: hadoop-default.xml has a parameter 'mapred.map.tasks' that is set to 2, so by default, no matter how many map tasks are calculated by the framework, only 2 map tasks will execute on a single-node cluster. -Original Message- From: Gopal Gandhi [mailto:[EMAIL PROTECTED]] Sent: Thursday, July 31, 2008 4:38 AM To: core-user@hadoop.apache.org Cc: [EMAIL PROTECTED] Subject: How can I control Number of Mappers of a job? The motivation is to control the max # of mappers of a job. For example, the input data is 246MB, which divided by 64MB gives 4 blocks, so by default 4 mappers will be launched on the 4 blocks. What I want is to set the max # of mappers to 2, so that 2 mappers are launched first, and when they complete on the first 2 blocks, another 2 mappers start on the remaining 2 blocks. Does Hadoop provide a way?
How can I control Number of Mappers of a job?
The motivation is to control the max # of mappers of a job. For example, the input data is 246MB, which divided by 64MB gives 4 blocks, so by default 4 mappers will be launched on the 4 blocks. What I want is to set the max # of mappers to 2, so that 2 mappers are launched first, and when they complete on the first 2 blocks, another 2 mappers start on the remaining 2 blocks. Does Hadoop provide a way?
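For what it's worth, a sketch of the knobs that exist, with the caveat that none of them is a hard per-job cap on concurrent mappers (behavior is version-dependent; jobConf is the job's JobConf instance, old org.apache.hadoop.mapred API):

// Only a hint: the actual number of map tasks comes from the input splits.
jobConf.setNumMapTasks(2);
// Raising the minimum split size yields fewer, larger splits
// (hypothetical 128 MB minimum: 246 MB of input becomes 2 splits, not 4).
jobConf.set("mapred.min.split.size", String.valueOf(128L * 1024 * 1024));

Fewer, larger splits means fewer mappers overall rather than the same mappers run in two waves, so this only approximates the behavior described above. Fair sharing between two users' jobs is really a scheduler concern, which later Hadoop releases address with pluggable schedulers.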
Re: How to control the map and reduce step sequentially
Yes, the reducer starts early, but only to copy and sort map output; it does not actually reduce until the maps are done. - Original Message From: 晋光峰 <[EMAIL PROTECTED]> To: core-user@hadoop.apache.org Sent: Monday, July 28, 2008 10:08:33 PM Subject: Re: How to control the map and reduce step sequentially I got it. Thanks! 2008/7/28 Shengkai Zhu <[EMAIL PROTECTED]> > The real reduce logic is actually started only when all map tasks are finished. Is it still unexpected? > On 7/28/08, 晋光峰 <[EMAIL PROTECTED]> wrote: > > Dear All, when I use Hadoop, I notice that the reduce step starts immediately, while the mappers are still running. According to my project requirement, the reduce step should not start until all the mappers finish their execution. Does anybody know how to use the Hadoop API to achieve this? When all the mappers finish their processing, then the reducer is started. > > Thanks > > -- Guangfeng Jin -- 朱盛凯 Jash Zhu 复旦大学软件学院 Software School, Fudan University -- Guangfeng Jin
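To make the phases concrete: what starts early is only the shuffle (copying map output) and the sort; the user-supplied reduce() is invoked only after the last map task finishes, so nothing extra is needed to get that ordering. If the concern is reducers occupying slots too early, later Hadoop releases (not necessarily the version discussed in this thread) expose a knob controlling when reducers launch:

<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>1.00</value>
  <!-- do not launch reducers until 100% of maps have completed -->
</property>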
Fw: question on HDFS
Hi folks, Does anybody have a comment on this? Why does the reducer fetch local data through HTTP rather than SSH? - Forwarded Message From: Gopal Gandhi <[EMAIL PROTECTED]> To: core-user@hadoop.apache.org Cc: [EMAIL PROTECTED] Sent: Tuesday, July 22, 2008 6:30:49 PM Subject: Re: question on HDFS That's interesting. Why does the reducer fetch local data through HTTP rather than SSH? - Original Message From: Arun C Murthy <[EMAIL PROTECTED]> To: core-user@hadoop.apache.org Sent: Tuesday, July 22, 2008 2:19:36 PM Subject: Re: question on HDFS Mori, On Jul 22, 2008, at 12:22 PM, Mori Bellamy wrote: > Hey all, let us say that I have 3 boxes, A, B and C. Initially, map tasks are running on all 3. After most of the mapping is done, C is 32% done with reduce (so still copying stuff to its local disk) and A is stuck on a particularly long map task (it got an ill-behaved record from the input splits). Does A's intermediate map output data go directly to C's local disk, or is it still written to HDFS and therefore distributed amongst all the machines? Also, will A's disk be a favored target for A's output bytes, or is the target volume independent of the corresponding mapper? Intermediate outputs (i.e. map outputs) are written to the local disk and not to HDFS. The reduce fetches the intermediate outputs via HTTP. hth, Arun > Thanks! The answer to this question should clear a lot of things up for me.
Re: question on HDFS
That's interesting. Why does the reducer fetch local data through HTTP rather than SSH? - Original Message From: Arun C Murthy <[EMAIL PROTECTED]> To: core-user@hadoop.apache.org Sent: Tuesday, July 22, 2008 2:19:36 PM Subject: Re: question on HDFS Mori, On Jul 22, 2008, at 12:22 PM, Mori Bellamy wrote: > Hey all, let us say that I have 3 boxes, A, B and C. Initially, map tasks are running on all 3. After most of the mapping is done, C is 32% done with reduce (so still copying stuff to its local disk) and A is stuck on a particularly long map task (it got an ill-behaved record from the input splits). Does A's intermediate map output data go directly to C's local disk, or is it still written to HDFS and therefore distributed amongst all the machines? Also, will A's disk be a favored target for A's output bytes, or is the target volume independent of the corresponding mapper? Intermediate outputs (i.e. map outputs) are written to the local disk and not to HDFS. The reduce fetches the intermediate outputs via HTTP. hth, Arun > Thanks! The answer to this question should clear a lot of things up for me.
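On the design choice (my reading, not stated in the thread): each TaskTracker runs an embedded HTTP server from which reducers pull map output, which is cheaper than SSH since there is no per-connection key exchange or encryption overhead, no login credentials to manage between every pair of nodes, and plain HTTP GETs are easy to retry and serve concurrently. The configuration involved in that era's releases looks like this (values are the usual defaults):

<property>
  <name>mapred.local.dir</name>
  <value>${hadoop.tmp.dir}/mapred/local</value>
  <!-- where map outputs are written on local disk -->
</property>
<property>
  <name>tasktracker.http.threads</name>
  <value>40</value>
  <!-- worker threads for the HTTP server that serves map outputs -->
</property>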
Problem with Hadoop's Partitioner
I am following the example in http://hadoop.apache.org/core/docs/current/streaming.html about Hadoop's partitioner, org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner. It seems that the values are sorted lexicographically (as strings), e.g.:

1
12
15
2
28

What if I want a numerically sorted list:

1
2
12
15
28

What partitioner shall I use?
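A sketch of one answer, with the caveat that it depends on the Hadoop version: sort order is actually controlled by the output key comparator, not the partitioner (the partitioner only decides which reducer a key is sent to). Streaming can be pointed at a sort-style numeric comparator like this (jar path abbreviated):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -jobconf mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -jobconf mapred.text.key.comparator.options=-n \
    ...

The -n option mirrors Unix sort's numeric flag. Note this orders keys within each reducer; with more than one reducer the overall output is only sorted per partition.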
[Streaming] I figured out a way to do combining in the mapper, would anybody check it?
I am using Hadoop Streaming. I figured out a way to do combining inside the mapper; is it the same as using a separate combiner? For example: the input is a list of words, and I want the total count for each word. The traditional mapper is:

while (<STDIN>) {
    chomp;
    my $word = $_;
    print "$word\t1\n";
}

Instead of using an additional combiner, I modify the mapper to use a hash:

my %hash = ();
while (<STDIN>) {
    chomp;
    $hash{$_}++;
}
foreach my $key (keys %hash) {
    print "$key\t$hash{$key}\n";
}

Is it the same as using a separate combiner?
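One difference worth noting (a general observation, not from the thread): the in-mapper hash only aggregates the records seen by that single map task and must fit entirely in memory, while a framework combiner is applied by Hadoop zero or more times on already-buffered output. The results are equivalent as long as the reducer sums partial counts. A common memory-safe variant flushes the hash periodically, sketched here (the threshold is arbitrary):

my %hash;
my $n = 0;
while (<STDIN>) {
    chomp;
    $hash{$_}++;
    if (++$n >= 100_000) {                      # flush to bound memory use
        print "$_\t$hash{$_}\n" for keys %hash;
        %hash = ();
        $n = 0;
    }
}
print "$_\t$hash{$_}\n" for keys %hash;         # final flush

This may emit several partial counts for the same word, which is exactly the contract a combiner relies on anyway.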