Please help me: is there a way to "chown" in Hadoop?

2008-08-26 Thread Gopal Gandhi
I need to change a file's owner from userA to userB. Is there such a command? 
Thanks a lot!

% hadoop dfs -ls file
/user/userA/file    2008-08-25 20:00    rwxr-xr-x    userA    supergroup
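
For reference, a hedged sketch (assuming a release with the FsShell permission commands, roughly 0.16 and later, and that it is run as the HDFS superuser):

% hadoop dfs -chown userB /user/userA/file
% hadoop dfs -chown userB:groupB /user/userA/file
% hadoop dfs -ls /user/userA/file

The first form changes only the owner, the second changes owner and group together, and the -ls is just to verify the result.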



  

Fw: Write permission of file/dir in Hadoop

2008-08-22 Thread Gopal Gandhi
Would anybody help with that? Thanks.



- Forwarded Message 
From: Gopal Gandhi <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Thursday, August 21, 2008 3:36:40 PM
Subject: Write permission of file/dir in Hadoop

Folks,

  Is it possible to "chmod" a dir in Hadoop so that user X can only write files 
to it but cannot remove files from it? Thanks.
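
A note, hedged since it depends on the release: in the HDFS permission model (as in POSIX), removing a file is governed by write permission on the parent directory, which is the same bit that allows creating files in it, so a plain chmod cannot express "write but not delete". The chmod syntax itself looks like:

% hadoop dfs -chmod 733 /user/shared/dropbox

(the path is only illustrative; mode 733 lets other users create files in the directory, but the same write bit also lets them remove files, which is exactly the limitation described above).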


  

Write permission of file/dir in Hadoop

2008-08-21 Thread Gopal Gandhi
Folks,

  Is it possible to chmod a dir in Hadoop so that user X can only write files 
to it but cannot remove files from it? Thanks.



  

[Streaming] How to pass arguments to a map/reduce script

2008-08-21 Thread Gopal Gandhi
I am using Hadoop streaming and I need to pass arguments to my map/reduce 
script. The map/reduce script is launched by Hadoop with something like
hadoop   -file MAPPER -mapper "$MAPPER" -file REDUCER -reducer "$REDUCER" 
...
How can I pass arguments to MAPPER?

I tried -cmdenv name=val, but it does not work.
Can anybody help me? Thanks a lot.
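
Two approaches that may work, sketched here with hypothetical names (mapper.pl, MY_THRESHOLD) and untested:

% hadoop jar contrib/streaming/hadoop-*-streaming.jar \
    -input in -output out \
    -file mapper.pl -mapper "mapper.pl 0.5" \
    -file reducer.pl -reducer reducer.pl

Here the argument is simply quoted into the -mapper command, and the script reads it from its argument list as usual. Alternatively, -cmdenv passes the value as an environment variable rather than an argument:

% hadoop jar contrib/streaming/hadoop-*-streaming.jar \
    -cmdenv MY_THRESHOLD=0.5 \
    -file mapper.pl -mapper mapper.pl \
    -file reducer.pl -reducer reducer.pl \
    -input in -output out

so the script has to read $ENV{MY_THRESHOLD} (Perl) or $MY_THRESHOLD (shell) rather than a command-line argument; if it expects an argument, -cmdenv will indeed appear to "not work".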



  

How to write JAVA code for Hadoop streaming.

2008-08-05 Thread Gopal Gandhi
I am using Hadoop streaming and I want to write the map/reduce scripts in Java, 
rather than Perl, etc. Would anybody give me a sample? Thanks.
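
Not an official sample, but a minimal sketch of a streaming mapper in Java (word-count style). It only reads lines from stdin and writes tab-separated key/value pairs to stdout, which is all streaming requires; the class and jar names are hypothetical:

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class WordCountMapper {
    public static void main(String[] args) throws Exception {
        // Streaming feeds input records to the mapper on stdin, one line per record.
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            // Emit "key<TAB>value"; streaming splits the output on the first tab.
            for (String word : line.trim().split("\\s+")) {
                if (word.length() > 0) {
                    System.out.println(word + "\t1");
                }
            }
        }
    }
}

It would be packaged into a jar, shipped with -file (or placed on every node), and invoked with something like -mapper "java -cp wordcount.jar WordCountMapper". Of course, if the logic is going to be in Java anyway, writing a normal Mapper/Reducer against the MapReduce API instead of streaming is usually the simpler route.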



  

Re: how to increase number of reduce tasks

2008-07-31 Thread Gopal Gandhi
I think this is because your job's input data exists on only one node. Mappers 
are preferentially launched on the nodes that hold the data (Hadoop stores it in "blocks").
As for the reducer, I am not sure why there's only 1. Can anybody explain that?
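
One thing worth double-checking (a sketch, not verified against Nutch): the number of reduces is a per-job setting, so setting mapred.reduce.tasks only in the site files can be overridden by whatever the job itself requests. In Java job code it is set like this:

import org.apache.hadoop.mapred.JobConf;

public class ReduceCountExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Per-job setting; equivalent to mapred.reduce.tasks in the job's configuration.
        conf.setNumReduceTasks(5);   // e.g. roughly one reduce per node
        System.out.println("mapred.reduce.tasks = " + conf.get("mapred.reduce.tasks"));
    }
}

For a command-line job the equivalent would be passing -jobconf mapred.reduce.tasks=5 (or -D mapred.reduce.tasks=5 on releases that accept it).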



- Original Message 
From: Alexander Aristov <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Thursday, July 31, 2008 12:06:59 PM
Subject: how to increase number of reduce tasks

Hi

I am running nutch on hadoop 0.17.1. I launch 5 nodes to perform crawling.

When I look at the job statistics I see that only 1 reduce task is started
for all steps, and hence I conclude that hadoop doesn't consume all
available resources.

Only one node is extremely busy; the other nodes are idle. How can I configure
hadoop to consume all resources?

I added mapred.map.tasks and mapred.reduce.tasks parameters but they have no
effect.
I also increased the max number of mapred tasks, and the job tracker shows the new value.

During all stages the number of map tasks reaches a maximum of 3, and reduce only 1.

-- 
Best Regards
Alexander Aristov



  

Re: How can I control Number of Mappers of a job?

2008-07-31 Thread Gopal Gandhi
Thank you, finally someone is interested in my questions =)
My cluster contains more than one machine. Please don't get me wrong :-). I 
don't want to limit the total mappers on one node (by mapred.map.tasks). What I 
want is to limit the total mappers for one job. The motivation is that I have 2 
jobs to run at the same time; they have "the same input data in Hadoop". I 
found that one job has to wait until the other finishes its mapping. Because 
the 2 jobs are submitted by 2 different people, I don't want either job to 
starve. So I want to limit the first job's total mappers so that the 2 jobs 
will run simultaneously.
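
For reference, a hedged sketch of the knobs that do exist (assuming the standard FileInputFormat split behaviour): mapred.map.tasks / setNumMapTasks is only a hint, because the number of maps is decided by the input splits, so the more reliable way to get 2 maps out of a 246 MB input is to ask for larger splits. Note this lowers the total number of maps rather than capping how many run at once; fairness between two people's jobs is really a scheduler question.

import org.apache.hadoop.mapred.JobConf;

public class MapCountExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Only a hint to the framework; the InputFormat's splits decide the real map count.
        conf.setNumMapTasks(2);
        // Asking for ~128 MB splits makes a 246 MB input produce 2 splits instead of 4.
        conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);
        System.out.println("mapred.map.tasks = " + conf.get("mapred.map.tasks"));
    }
}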



- Original Message 
From: "Goel, Ankur" <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Cc: [EMAIL PROTECTED]
Sent: Wednesday, July 30, 2008 10:17:53 PM
Subject: RE: How can I control Number of Mappers of a job?

How big is your cluster? Assuming you are running a single-node cluster:

hadoop-default.xml has a parameter 'mapred.map.tasks' that is set to 2. So
by default, no matter how many map tasks are calculated by the framework,
only 2 map tasks will execute on a single-node cluster.

-Original Message-
From: Gopal Gandhi [mailto:[EMAIL PROTECTED] 
Sent: Thursday, July 31, 2008 4:38 AM
To: core-user@hadoop.apache.org
Cc: [EMAIL PROTECTED]
Subject: How can I control Number of Mappers of a job?

The motivation is to control the max # of mappers of a job. For example,
the input data is 246MB; divided by the 64MB block size, that is 4 blocks. So by
default there will be 4 mappers launched on the 4 blocks.
What I want is to set its max # of mappers to 2, so that 2 mappers are
launched first and, when they complete on the first 2 blocks, another 2
mappers start on the remaining 2 blocks. Does Hadoop provide a way?


  

How can I control Number of Mappers of a job?

2008-07-30 Thread Gopal Gandhi
The motivation is to control the max # of mappers of a job. For example, the 
input data is 246MB; divided by the 64MB block size, that is 4 blocks. So by default there will be 4 mappers 
launched on the 4 blocks. 
What I want is to set its max # of mappers to 2, so that 2 mappers are launched 
first and, when they complete on the first 2 blocks, another 2 mappers start on 
the remaining 2 blocks. Does Hadoop provide a way?



  

Re: How to control the map and reduce step sequentially

2008-07-30 Thread Gopal Gandhi
Yes, the reduce tasks start early, but only to copy and sort the map outputs; the actual reduce does not run until all maps have finished.



- Original Message 
From: 晋光峰 <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Monday, July 28, 2008 10:08:33 PM
Subject: Re: How to control the map and reduce step sequentially

I got it. Thanks!

2008/7/28 Shengkai Zhu <[EMAIL PROTECTED]>

> The real reduce logic is actually started when all map tasks are finished.
>
> Is it still unexpected?
>
>
> On 7/28/08, 晋光峰 <[EMAIL PROTECTED]> wrote:
> >
> > Dear All,
> >
> > When i using Hadoop, I noticed that the reducer step is started
> immediately
> > when the mappers are still running. According to my project requirement,
> > the
> > reducer step should not start until all the mappers finish their
> execution.
> > Anybody knows how to use some Hadoop API to achieve this? When all the
> > mappers finish their process, then the reducer is started.
> >
> > Thanks
> > --
> > Guangfeng Jin
> >
>
>
>
> --
>
> 朱盛凯
>
> Jash Zhu
>
> 复旦大学软件学院
>
> Software School, Fudan University
>



-- 
Guangfeng Jin



  

Fw: question on HDFS

2008-07-23 Thread Gopal Gandhi
Hi folks,

  Does anybody have a comment on that? Why does the reducer fetch the local map data 
over HTTP rather than SSH?


- Forwarded Message 
From: Gopal Gandhi <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Cc: [EMAIL PROTECTED]
Sent: Tuesday, July 22, 2008 6:30:49 PM
Subject: Re: question on HDFS

That's interesting. Why does the reducer fetch the local data over HTTP rather than SSH?



- Original Message 
From: Arun C Murthy <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Tuesday, July 22, 2008 2:19:36 PM
Subject: Re: question on HDFS

Mori,

On Jul 22, 2008, at 12:22 PM, Mori Bellamy wrote:

> hey all,
> let us say that i have 3 boxes, A B and C. initially, map tasks are  
> running on all 3. after most of the mapping is done, C is 32% done  
> with reduce (so still copying stuff to its local disk) and A is  
> stuck on a particularly long map-task (it got an ill-behaved record  
> from the input splits). does A's intermediate map output data go  
> directly to C's local disk, or is it still written to HDFS and  
> therefore distributed amongst all the machines? also, will A's disk  
> be a favored target for A's output bytes, or is the target volume  
> independent of the corresponding mapper?
>

Intermediate outputs (i.e. map outputs) are written to the local disk  
and not to HDFS. The reduce fetches the intermediate outputs via HTTP.

hth,
Arun

> Thanks! The answer to this question should clear a lot of things up  
> for me.


  

Re: question on HDFS

2008-07-22 Thread Gopal Gandhi
That's interesting. Why does the reducer fetch the local data over HTTP rather than SSH?



- Original Message 
From: Arun C Murthy <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Tuesday, July 22, 2008 2:19:36 PM
Subject: Re: question on HDFS

Mori,

On Jul 22, 2008, at 12:22 PM, Mori Bellamy wrote:

> hey all,
> let us say that i have 3 boxes, A B and C. initially, map tasks are  
> running on all 3. after most of the mapping is done, C is 32% done  
> with reduce (so still copying stuff to its local disk) and A is  
> stuck on a particularly long map-task (it got an ill-behaved record  
> from the input splits). does A's intermediate map output data go  
> directly to C's local disk, or is it still written to HDFS and  
> therefore distributed amongst all the machines? also, will A's disk  
> be a favored target for A's output bytes, or is the target volume  
> independent of the corresponding mapper?
>

Intermediate outputs (i.e. map outputs) are written to the local disk  
and not to HDFS. The reduce fetches the intermediate outputs via HTTP.

hth,
Arun

> Thanks! The answer to this question should clear a lot of things up  
> for me.


  

Problem of Hadoop's Partitioner

2008-07-21 Thread Gopal Gandhi
I am following the example in 
http://hadoop.apache.org/core/docs/current/streaming.html about Hadoop's 
partitioner, org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner. It seems 
that the values are sorted lexicographically (dictionary order), for example:
1
12
15
2
28

What if I want a numerically sorted list:
1
2
12
15
28
What partitioner shall I use?
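
A hedged pointer (the relevant class may not exist in every 0.1x release, and this is untested here): the sort order is actually controlled by the output key comparator rather than the partitioner. The streaming docs describe org.apache.hadoop.mapred.lib.KeyFieldBasedComparator, whose options follow Unix sort, so a numeric flag asks for numeric comparison:

% hadoop jar contrib/streaming/hadoop-*-streaming.jar \
    -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -D mapred.text.key.comparator.options=-k1,1n \
    -input in -output out \
    -mapper cat -reducer cat

(on older streaming releases the -D options are passed as -jobconf name=value instead).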



  

[Streaming] I figured out a way to do combining using mapper, would anybody check it?

2008-07-21 Thread Gopal Gandhi
I am using Hadoop Streaming. 
I figured out a way to do combining in the mapper; is it the same as using a 
separate combiner?

For example: the input is a list of words, and I want to count the total number 
of occurrences of each word. 
The traditional mapper is:

while (<STDIN>) {
  chomp ($_);
  $word = $_;
  print "$word\t1\n";
}


Instead of using an additional combiner, I modified the mapper to use a hash:

%hash = ();
while (<STDIN>) {
  chomp ($_);
  $word = $_;
  $hash{$word}++;
}

foreach $key (keys %hash) {
  # every distinct word is held in memory until the mapper finishes
  print "$key\t$hash{$key}\n";
}

Is it the same as using a separate combiner?