fair scheduler in 0.19

2009-03-07 Thread Rong-en Fan
Hi,

It seems to me that anyone can change pool assignments via the scheduler's
web UI. In other words, admins cannot strictly enforce the pool
assignment. Is this still true in 0.20?

Thanks,
Rong-En Fan


Re: Accessing local files

2008-12-22 Thread Rong-en Fan
On Tue, Dec 23, 2008 at 9:12 AM, Rodrigo Schmidt rschm...@facebook.com wrote:


 Hi,

 I want to use a local file (present on the file system of a machine in my
 cluster) as the input to be used by mappers in my job. Is there an easy way
 to do that?


I think you can use a file:// path (i.e., LocalFileSystem), or you can
use the DistributedCache.
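
Roughly, you could set it up like this (an untested sketch using the old
mapred API; the paths are just placeholders):

  import java.net.URI;

  import org.apache.hadoop.filecache.DistributedCache;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.JobConf;

  public class LocalInputSetup {
    public static void setup(JobConf conf) throws Exception {
      // Option 1: point the job input at a file:// path (LocalFileSystem).
      // This only works if the file exists at the same path on every node
      // that runs a map task.
      FileInputFormat.setInputPaths(conf, new Path("file:///data/local-input.txt"));

      // Option 2: ship a copy of the file to every task node with the
      // DistributedCache (the file must first live on a shared FS such as
      // HDFS); mappers can then read it from the local cache directory.
      DistributedCache.addCacheFile(new URI("hdfs:///user/me/lookup.txt"), conf);
    }
  }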

Regards,
Rong-En Fan


Re: hadoop 0.18.2 Checksum ok was sent and should not be sent again

2008-11-17 Thread Rong-en Fan
I believe it was for debugging purposes and was removed after 0.18.2 was released.

On Mon, Nov 17, 2008 at 8:57 PM, Alexander Aristov 
[EMAIL PROTECTED] wrote:

 Hi all,
 I upgraded Hadoop to version 0.18.2 and tried to run a test job, a
 distcp from S3 to HDFS.


 I got a lot of info-level errors although the job successfully finished.

 Any ideas? Can I simply suppress INFOs in log4j and forget about the error?


 08/11/17 07:43:09 INFO fs.FSInputChecker: java.io.IOException: Checksum ok was sent and should not be sent again
        at org.apache.hadoop.dfs.DFSClient$BlockReader.read(DFSClient.java:863)
        at org.apache.hadoop.dfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:1392)
        at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1428)
        at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1377)
        at java.io.DataInputStream.readInt(DataInputStream.java:372)
        at org.apache.hadoop.io.SequenceFile$Reader.readRecordLength(SequenceFile.java:1898)
        at org.apache.hadoop.io.SequenceFile$Reader.nextRaw(SequenceFile.java:1961)
        at org.apache.hadoop.io.SequenceFile$Sorter$SortPass.run(SequenceFile.java:2399)
        at org.apache.hadoop.io.SequenceFile$Sorter.sortPass(SequenceFile.java:2335)
        at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:2285)
        at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:2326)
        at org.apache.hadoop.tools.DistCp.checkDuplication(DistCp.java:1032)
        at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1013)
        at org.apache.hadoop.tools.DistCp.copy(DistCp.java:618)
        at org.apache.hadoop.tools.DistCp.run(DistCp.java:768)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.tools.DistCp.main(DistCp.java:788)





 --
 Best Regards
 Alexander Aristov



network topology script, datanode rejoin

2008-10-11 Thread Rong-en Fan
Hi,

Recently I've been playing with the network topology script and
the scenario where a datanode comes back from the dead with its
rack location changed. I did a few experiments:

1) just stop datanode
2) just stop datanode, remove data storage dir
3) decommission node
4) decommission node, remove data storage dir

The time from death to coming back is about one week. However,
it seems that the namenode somehow keeps using the old rack
location for those nodes. I have to restart the namenode
in order to get the rack information right.

However, from a rough look at the source, it seems to me that
under some circumstances the namenode will re-query the
topology script. Could someone please explain this in more
detail? (The HDFS docs on the Hadoop site are not very clear about
the network topology part.)

Thanks,
Rong-En Fan


slow copy makes reduce hang

2008-09-18 Thread Rong-en Fan
Hi,

I'm using 0.17.2.1 and see a reduce hang in the shuffle phase due
to an unresponsive node. From the reduce log (sorry that I didn't
keep it around), it got stuck copying map output from a dead
node (I cannot ssh to it). At that point, all maps had already
finished. I'm wondering why this slowness does not trigger a reduce
task failure, mark the corresponding map as failed (even though it finished),
and then redo the map task on another node so that the reduce can proceed.

Thanks,
Rong-En Fan


Re: slow copy makes reduce hang

2008-09-18 Thread Rong-en Fan
Replying to myself: I'm using streaming and the task timeout was set to 0,
so that's why.

On Fri, Sep 19, 2008 at 3:34 AM, Rong-en Fan [EMAIL PROTECTED] wrote:
 Hi,

 I'm using 0.17.2.1 and see a reduce hang in the shuffle phase due
 to an unresponsive node. From the reduce log (sorry that I didn't
 keep it around), it got stuck copying map output from a dead
 node (I cannot ssh to it). At that point, all maps had already
 finished. I'm wondering why this slowness does not trigger a reduce
 task failure, mark the corresponding map as failed (even though it finished),
 and then redo the map task on another node so that the reduce can proceed.

 Thanks,
 Rong-En Fan



Re: slow copy makes reduce hang

2008-09-18 Thread Rong-en Fan
This time, I set the task timeout to 10 minutes via

  -jobconf mapred.task.timeout=600000

However, I still see the hang at the shuffle stage, and lots of
messages like the ones below appear in the log:

2008-09-19 12:34:02,289 INFO org.apache.hadoop.mapred.ReduceTask:
task_200809190308_0007_r_01_1 Need 6 map output(s)
2008-09-19 12:34:02,290 INFO org.apache.hadoop.mapred.ReduceTask:
task_200809190308_0007_r_01_1: Got 0 new map-outputs  0 obsolete
map-outputs from tasktracker and 0 map-outputs from previous failures
2008-09-19 12:34:02,290 INFO org.apache.hadoop.mapred.ReduceTask:
task_200809190308_0007_r_01_1 Got 6 known map output location(s);
scheduling...
2008-09-19 12:34:02,290 INFO org.apache.hadoop.mapred.ReduceTask:
task_200809190308_0007_r_01_1 Scheduled 0 of 6 known outputs (6
slow hosts and 0 dup hosts)

When fetching map output from one misbehaving node (it actually has a dead disk),
the HTTP daemon returns a 500 Internal Server Error.

It seems to me that the reducer is stuck in an infinite retry loop... I'm wondering
whether this behavior is fixed in 0.18.x, or whether there are some configuration
parameters that I should tune?

Thanks,
Rong-En Fan

On Fri, Sep 19, 2008 at 9:42 AM, Rong-en Fan [EMAIL PROTECTED] wrote:
 Replying to myself: I'm using streaming and the task timeout was set to 0,
 so that's why.

 On Fri, Sep 19, 2008 at 3:34 AM, Rong-en Fan [EMAIL PROTECTED] wrote:
 Hi,

 I'm using 0.17.2.1 and see a reduce hang in the shuffle phase due
 to an unresponsive node. From the reduce log (sorry that I didn't
 keep it around), it got stuck copying map output from a dead
 node (I cannot ssh to it). At that point, all maps had already
 finished. I'm wondering why this slowness does not trigger a reduce
 task failure, mark the corresponding map as failed (even though it finished),
 and then redo the map task on another node so that the reduce can proceed.

 Thanks,
 Rong-En Fan




Re: [Streaming] How to pass arguments to a map/reduce script

2008-08-21 Thread Rong-en Fan
On Thu, Aug 21, 2008 at 3:14 PM, Gopal Gandhi
[EMAIL PROTECTED] wrote:
 I am using Hadoop streaming and I need to pass arguments to my map/reduce 
 script. Because a map/reduce script is triggered by hadoop, like
 hadoop   -file MAPPER -mapper $MAPPER -file REDUCER -reducer $REDUCER 
 ...
 How can I pass arguments to MAPPER?

 I tried -cmdenv name=val, but it does not work.
 Can anybody help me? Thanks a lot.

I use -jobconf, for example

hadoop ... -jobconf my.mapper.arg1=foobar

and in the map script, I get this by reading the environment variable

my_mapper_arg1

Hope this helps,
Rong-En Fan


Re: [Streaming] How to pass arguments to a map/reduce script

2008-08-21 Thread Rong-en Fan
On Fri, Aug 22, 2008 at 12:51 AM, Steve Gao [EMAIL PROTECTED] wrote:
 That's interesting. Suppose your mapper script is a Perl script; how do you
 assign my.mapper.arg1's value to a variable $x?
 $x = $my.mapper.arg1
 I just tried that, and my Perl script does not recognize $my.mapper.arg1.

$ENV{my_mapper_arg1}

 --- On Thu, 8/21/08, Rong-en Fan [EMAIL PROTECTED] wrote:
 From: Rong-en Fan [EMAIL PROTECTED]
 Subject: Re: [Streaming] How to pass arguments to a map/reduce script
 To: core-user@hadoop.apache.org
 Cc: [EMAIL PROTECTED]
 Date: Thursday, August 21, 2008, 11:09 AM

 On Thu, Aug 21, 2008 at 3:14 PM, Gopal Gandhi
 [EMAIL PROTECTED] wrote:
 I am using Hadoop streaming and I need to pass arguments to my map/reduce
 script. Because a map/reduce script is triggered by hadoop, like
 hadoop   -file MAPPER -mapper $MAPPER -file REDUCER
 -reducer $REDUCER ...
 How can I pass arguments to MAPPER?

 I tried -cmdenv name=val, but it does not work.
 Can anybody help me? Thanks a lot.

 I use -jobconf, for example

 hadoop ... -jobconf my.mapper.arg1=foobar

 and in the map script, I get this by reading the environment variable

 my_mapper_arg1

 Hope this helps,
 Rong-En Fan


access jobconf in streaming job

2008-08-08 Thread Rong-en Fan
I'm using streaming with a mapper written in Perl. However, an
issue is that I want to pass some arguments via the command line.
In a regular Java mapper, I can access the JobConf in the Mapper.
Is there a way to do this with streaming?
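
(By accessing the JobConf in a Java Mapper, I mean roughly the following
untested sketch with the old mapred API; the parameter name is just an example.)

  import java.io.IOException;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class ArgMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    private String arg1;

    public void configure(JobConf job) {
      // The framework hands the JobConf to a Java mapper here, so a value
      // set on the command line (e.g. -D my.mapper.arg1=foobar) is easy to read.
      arg1 = job.get("my.mapper.arg1", "default-value");
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      // Tag every input line with the argument, just to show it is available.
      output.collect(new Text(arg1), value);
    }
  }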

Thanks,
Rong-En Fan


Re: access jobconf in streaming job

2008-08-08 Thread Rong-en Fan
After looking into the streaming source, the answer is: via environment
variables. For example, mapred.task.timeout is available in the
mapred_task_timeout environment variable (dots become underscores).

On Fri, Aug 8, 2008 at 4:26 PM, Rong-en Fan [EMAIL PROTECTED] wrote:
 I'm using streaming with a mapper written in Perl. However, an
 issue is that I want to pass some arguments via the command line.
 In a regular Java mapper, I can access the JobConf in the Mapper.
 Is there a way to do this with streaming?

 Thanks,
 Rong-En Fan



different dfs block size

2008-07-14 Thread Rong-en Fan
Hi,

I'm wondering what the NameNode memory consumption would be for a
fixed set of data as a function of dfs.block.size. I know it is
determined by the number of blocks and the number of replicas, but
how much memory does one block use in the NameNode?
In addition, what are the pros and cons of bigger/smaller
block sizes?

Thanks,
Rong-En Fan


Re: How to set up rack awareness?

2008-04-17 Thread Rong-en Fan
On Thu, Apr 17, 2008 at 2:41 AM, Nate Carlson [EMAIL PROTECTED] wrote:
 I'm setting up a hadoop cluster across two data centers (with gig bandwidth
 between them).. I'd like to use the rack awareness features to help Hadoop
 know which nodes are local.. I see that it's possible, but haven't found
 any guides on how to set it up. If anyone's got a quick primer I'd
 appreciate it!

I think you have to configure a network topology script (the
topology.script.file.name property) in hadoop-site.xml. The script tells
Hadoop which rack a given host belongs to.

Regards,
Rong-En Fan


  -nc



MapFile and MapFileOutputFormat

2008-03-20 Thread Rong-en Fan
Hi,

I have two questions regarding MapFile in Hadoop/HDFS. First, when using
MapFileOutputFormat as the reducer's output format, is there any way to change
the index interval (i.e., the equivalent of calling setIndexInterval() on the
output MapFile)?
Second, is it possible to tell the position in the data file for a given
key, assuming the index interval is 1 and the number of keys is small?

Thanks,
Rong-En Fan