RE: Issue with usage of fs -test

2009-05-28 Thread Koji Noguchi
Maybe 
https://issues.apache.org/jira/browse/HADOOP-3792 ?

Koji

-Original Message-
From: pankaj jairath [mailto:pjair...@yahoo-inc.com] 
Sent: Thursday, May 28, 2009 4:49 AM
To: core-user@hadoop.apache.org
Subject: Issue with usage of fs -test

Hello,

I am facing a strange issue where 'fs -test -e' fails while 'fs -ls'
succeeds in listing the file. Below is a transcript of such a result:

bin]$ hadoop fs -ls /projects/myproject///.done
Found 1 items
-rw---   3 user hdfs  0 2009-03-19 22:28 
/projects/myproject///.done
[...@mymachine bin]$ echo $?
0
[...@mymachine bin]$ hadoop fs -test -e /projects/myproject///.done
[...@mymachine bin]$ echo $?
1


What is the cause of such behaviour? Any pointers would be much
appreciated. (HADOOP_CONF_DIR and HADOOP_HOME are set correctly as
environment variables.)


Thanks
Pankaj 
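
For reference, a minimal sketch (untested, 0.18-era API) of checking existence
programmatically through FileSystem.exists(), which sidesteps shell quoting and
exit-code quirks like the one above; the path is the poster's example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ExistsCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads hadoop-site.xml via HADOOP_CONF_DIR
    FileSystem fs = FileSystem.get(conf);
    Path p = new Path("/projects/myproject///.done");
    // Same contract as 'hadoop fs -test -e': exit 0 if the path exists, 1 otherwise.
    System.exit(fs.exists(p) ? 0 : 1);
  }
}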





RE: Setting up another machine as secondary node

2009-05-14 Thread Koji Noguchi
 The secondary namenode takes a snapshot 
 at 5 minute (configurable) intervals,

This is a bit too aggressive.
Checkpointing is still an expensive operation.
I'd say every hour or even every day.

Isn't the default 3600 seconds?

Koji
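
For reference, a minimal sketch (untested; assumes the 0.18-era property name
fs.checkpoint.period, whose default of 3600 seconds matches the figure above)
showing where the checkpoint interval comes from:

import org.apache.hadoop.conf.Configuration;

public class CheckpointPeriod {
  public static void main(String[] args) {
    // Loads hadoop-default.xml and hadoop-site.xml from the classpath.
    Configuration conf = new Configuration();
    // Seconds between secondary-namenode checkpoints; override in hadoop-site.xml.
    long period = conf.getLong("fs.checkpoint.period", 3600);
    System.out.println("Checkpointing every " + period + " seconds");
  }
}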

-Original Message-
From: jason hadoop [mailto:jason.had...@gmail.com] 
Sent: Thursday, May 14, 2009 7:46 AM
To: core-user@hadoop.apache.org
Subject: Re: Setting up another machine as secondary node

any machine put in the conf/masters file becomes a secondary namenode.

At some point there was confusion about the safety of running more than
one, which I believe was settled: multiple secondary namenodes are safe.

The secondary namenode takes a snapshot at 5 minute (configurable)
intervals, rebuilds the fsimage and sends that back to the namenode.
There is some performance advantage of having it on the local machine,
and
some safety advantage of having it on an alternate machine.
Could someone who remembers speak up on single vs. multiple secondary
namenodes?


On Thu, May 14, 2009 at 6:07 AM, David Ritch david.ri...@gmail.com
wrote:

 First of all, the secondary namenode is not what you might think a
 secondary is - it's not a failover device.  It does make a copy of the
 filesystem metadata periodically, and it integrates the edits into the
 image.  It does *not* provide failover.

 Second, you specify its IP address in hadoop-site.xml.  This is where
you
 can override the defaults set in hadoop-default.xml.

 dbr

 On Thu, May 14, 2009 at 9:03 AM, Rakhi Khatwani
rakhi.khatw...@gmail.com
 wrote:

  Hi,
  I want to set up a cluster of 5 nodes in such a way that
  node1 - master
  node2 - secondary namenode
  node3 - slave
  node4 - slave
  node5 - slave
 
 
  How do we go about that?
  There is no property in hadoop-env where I can set the IP address for
  the secondary namenode.
 
  If I set node-1 and node-2 in masters, then when we start dfs, the
  namenode and secondary namenode processes are present on both machines,
  but I think only node1 is active, and my namenode failover operation
  fails.
 
  Any suggestions?
 
  Regards,
  Rakhi
 




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


RE: Setting up another machine as secondary node

2009-05-14 Thread Koji Noguchi
Before 0.19, fsimage/edits were in the same directory.
So whenever the secondary finished checkpointing, it copied back the
fsimage while the namenode kept on writing to the edits file.

We usually observed some latency on the namenode side during that time.

HADOOP-3948 would probably help on 0.19 or later.

Koji

-Original Message-
From: Brian Bockelman [mailto:bbock...@cse.unl.edu] 
Sent: Thursday, May 14, 2009 10:32 AM
To: core-user@hadoop.apache.org
Subject: Re: Setting up another machine as secondary node

Hey Koji,

It's an expensive operation - for the secondary namenode, not the  
namenode itself, right?  I don't particularly care if I stress out a  
dedicated node that doesn't have to respond to queries ;)

Locally we checkpoint+backup fairly frequently (not 5 minutes ...  
maybe less than the default hour) due to sheer paranoia of losing  
metadata.

Brian

On May 14, 2009, at 12:25 PM, Koji Noguchi wrote:

 The secondary namenode takes a snapshot
 at 5 minute (configurable) intervals,

 This is a bit too aggressive.
 Checkpointing is still an expensive operation.
 I'd say every hour or even every day.

 Isn't the default 3600 seconds?

 Koji

 -Original Message-
 From: jason hadoop [mailto:jason.had...@gmail.com]
 Sent: Thursday, May 14, 2009 7:46 AM
 To: core-user@hadoop.apache.org
 Subject: Re: Setting up another machine as secondary node

 any machine put in the conf/masters file becomes a secondary namenode.

 At some point there was confusion about the safety of running more than
 one, which I believe was settled: multiple secondary namenodes are safe.

 The secondary namenode takes a snapshot at 5 minute (configurable)
 intervals, rebuilds the fsimage and sends that back to the namenode.
 There is some performance advantage of having it on the local machine,
 and
 some safety advantage of having it on an alternate machine.
 Could someone who remembers speak up on single vs. multiple secondary
 namenodes?


 On Thu, May 14, 2009 at 6:07 AM, David Ritch david.ri...@gmail.com
 wrote:

 First of all, the secondary namenode is not what you might think a
 secondary is - it's not a failover device.  It does make a copy of the
 filesystem metadata periodically, and it integrates the edits into  
 the
 image.  It does *not* provide failover.

 Second, you specify its IP address in hadoop-site.xml.  This is where
 you
 can override the defaults set in hadoop-default.xml.

 dbr

 On Thu, May 14, 2009 at 9:03 AM, Rakhi Khatwani
 rakhi.khatw...@gmail.com
 wrote:

 Hi,
 I want to set up a cluster of 5 nodes in such a way that
 node1 - master
 node2 - secondary namenode
 node3 - slave
 node4 - slave
 node5 - slave


 How do we go about that?
 There is no property in hadoop-env where I can set the IP address for
 the secondary namenode.

 If I set node-1 and node-2 in masters, then when we start dfs, the
 namenode and secondary namenode processes are present on both machines,
 but I think only node1 is active, and my namenode failover operation
 fails.

 Any suggestions?

 Regards,
 Rakhi





 -- 
 Alpha Chapters of my book on Hadoop are available
 http://www.apress.com/book/view/9781430219422
 www.prohadoopbook.com a community for Hadoop Professionals



RE: Blocks replication in downtime even

2009-04-27 Thread Koji Noguchi
http://hadoop.apache.org/core/docs/current/hdfs_design.html#Data+Disk+Failure%2C+Heartbeats+and+Re-Replication

hope this helps.

Koji
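
For reference, a minimal sketch (untested, 0.18-era FileSystem API) of watching
where a file's block replicas currently live, which is one way to observe
re-replication after a node is declared dead:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicaLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus st = fs.getFileStatus(new Path(args[0]));
    System.out.println("target replication = " + st.getReplication());
    // Hosts holding each block right now; no file read is required.
    for (BlockLocation b : fs.getFileBlockLocations(st, 0, st.getLen())) {
      System.out.println("offset " + b.getOffset() + " -> "
          + java.util.Arrays.toString(b.getHosts()));
    }
  }
}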

-Original Message-
From: Stas Oskin [mailto:stas.os...@gmail.com] 
Sent: Monday, April 27, 2009 4:11 AM
To: core-user@hadoop.apache.org
Subject: Blocks replication in downtime even

Hi.

I have a question:

If I have N DataNodes and one or several of the nodes become
unavailable, would HDFS re-synchronize the blocks automatically,
according to the replication level set?
And if yes, when? As soon as the offline node is detected, or only on
file access?

Regards.


RE: core-user Digest 23 Apr 2009 02:09:48 -0000 Issue 887

2009-04-23 Thread Koji Noguchi
Owen, 

 Is it just the patches that have already been applied 
 to the 18 branch? Or are there more?

The former - just the patches that have already been applied to the 0.18 branch.
I especially want HADOOP-5465 in for the 'stable' release.
(This patch is also missing in 0.19.1)

Koji


-Original Message-
From: Owen O'Malley [mailto:omal...@apache.org] 
Sent: Thursday, April 23, 2009 11:54 AM
To: core-user@hadoop.apache.org
Subject: Re: core-user Digest 23 Apr 2009 02:09:48 - Issue 887


On Apr 22, 2009, at 10:44 PM, Koji Noguchi wrote:

 Nigel,

 When you have time, could you release 0.18.4 that contains some of the
 patches that make our clusters 'stable'?

Is it just the patches that have already been applied to the 18  
branch? Or are there more?

-- Owen


RE: core-user Digest 23 Apr 2009 02:09:48 -0000 Issue 887

2009-04-22 Thread Koji Noguchi
Nigel, 

When you have time, could you release 0.18.4 that contains some of the
patches that make our clusters 'stable'?
 
Koji

-Original Message-
From: Nigel Daley [mailto:nda...@yahoo-inc.com] 
Sent: Wednesday, April 22, 2009 10:31 PM
To: core-user@hadoop.apache.org
Subject: Re: core-user Digest 23 Apr 2009 02:09:48 - Issue 887

No, I didn't mark 0.19.1 stable.  I left 0.18.3 as our most stable  
release.

My company skipped deploying 0.19.x so I have no experience with that  
branch.  Others?

Nige

 Has the release 0.19 now become a stable one?

 On Wed, Apr 22, 2009 at 4:53 PM, Nigel Daley nda...@yahoo-inc.com  
 wrote:

 Release 0.20.0 contains many improvements, new features, bug fixes  
 and
 optimizations.

 For Hadoop release details and downloads, visit:
 http://hadoop.apache.org/core/releases.html

 Hadoop 0.20.0 Release Notes are at
 http://hadoop.apache.org/core/docs/r0.20.0/releasenotes.html

 Thanks to all who contributed to this release!

 Nigel



RE: Multiple outputs and getmerge?

2009-04-21 Thread Koji Noguchi
Stuart, 

I once used MultipleOutputFormat and created
   (mapred.work.output.dir)/type1/part-_
   (mapred.work.output.dir)/type2/part-_
...

And JobTracker took care of the renaming to 
   (mapred.output.dir)/type{1,2}/part-__

Would that work for you?

Koji

-Original Message-
From: Stuart White [mailto:stuart.whi...@gmail.com] 
Sent: Monday, April 20, 2009 1:15 PM
To: core-user@hadoop.apache.org
Subject: Multiple outputs and getmerge?

I've written a MR job with multiple outputs.  The normal output goes
to files named part-X and my secondary output records go to files
I've chosen to name ExceptionDocuments (and therefore are named
ExceptionDocuments-m-X).

I'd like to pull merged copies of these files to my local filesystem
(two separate merged files, one containing the normal output and one
containing the ExceptionDocuments output).  But, since hadoop lands
both of these outputs to files residing in the same directory, when I
issue hadoop dfs -getmerge, what I get is a file that contains both
outputs.

To get around this, I have to move files around on HDFS so that my
different outputs are in different directories.

Is this the best/only way to deal with this?  It would be better if
hadoop offered the option of writing different outputs to different
output directories, or if getmerge offered the ability to specify a
file prefix for files desired to be merged.

Thanks!


RE: Multiple outputs and getmerge?

2009-04-21 Thread Koji Noguchi
Something along the lines of

... class MyOutputFormat extends MultipleTextOutputFormat<Text, Text> {
      protected String generateFileNameForKeyValue(Text key,
                                                   Text v, String name) {
        Path outpath = new Path(key.toString(), name);
        return outpath.toString();
      }
    }

would create a directory per key.
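
To make the wiring concrete, a minimal sketch (untested, old
org.apache.hadoop.mapred API; class and path names are placeholders) of a job
that plugs in such an output format:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class PerKeyOutputJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(PerKeyOutputJob.class);
    conf.setJobName("per-key-output");
    conf.setInputFormat(KeyValueTextInputFormat.class);   // tab-separated key/value lines
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    conf.setOutputFormat(MyOutputFormat.class);           // one subdirectory per key, as above
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}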

If you just want to keep your side-effect files separate, then 
get your working dir by 
FileOutputFormat.getWorkOutputPath(...) 
or $mapred_work_output_dir

and dfs -mkdir workdir/NewDir and put the secondary files there.

Explained in 

http://hadoop.apache.org/core/docs/r0.18.3/api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath(org.apache.hadoop.mapred.JobConf)


Koji


-Original Message-
From: Stuart White [mailto:stuart.whi...@gmail.com] 
Sent: Tuesday, April 21, 2009 11:46 AM
To: core-user@hadoop.apache.org
Subject: Re: Multiple outputs and getmerge?

On Tue, Apr 21, 2009 at 1:00 PM, Koji Noguchi knogu...@yahoo-inc.com wrote:

 I once used MultipleOutputFormat and created
   (mapred.work.output.dir)/type1/part-_
   (mapred.work.output.dir)/type2/part-_
    ...

 And JobTracker took care of the renaming to
   (mapred.output.dir)/type{1,2}/part-__

 Would that work for you?

Can you please explain this in more detail?  It looks like you're
using MultipleOutputFormat for *both* of your outputs?  So, you simply
don't use the OutputCollector passed as a param to Mapper#map()?


RE: mapred.tasktracker.map.tasks.maximum

2009-04-21 Thread Koji Noguchi
It's probably a silly question, but you do have more than 2 mappers on
your second job?

If yes, I have no idea what's happening.

Koji
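
A minimal sketch (untested) of the two different knobs involved here: the
tasktracker-side slot limit, which each tracker reads from the cluster's
hadoop-site.xml at daemon startup, and the per-job map count, which is really
driven by the number of input splits:

import org.apache.hadoop.mapred.JobConf;

public class MapSlotSettings {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // Cluster-side: read once when the TaskTracker starts; changing it in a
    // job's configuration has no effect on trackers that are already running.
    int slotsPerTracker = conf.getInt("mapred.tasktracker.map.tasks.maximum", 2);
    // Job-side: only a hint; the actual number of map tasks follows the input
    // splits, so a small input can still produce just a couple of maps.
    int suggestedMaps = conf.getNumMapTasks();
    System.out.println(slotsPerTracker + " map slots per tracker, "
        + suggestedMaps + " maps suggested for this job");
  }
}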

-Original Message-
From: javateck javateck [mailto:javat...@gmail.com] 
Sent: Tuesday, April 21, 2009 1:38 PM
To: core-user@hadoop.apache.org
Subject: Re: mapred.tasktracker.map.tasks.maximum

Right - I set it in hadoop-site.xml before starting the Hadoop
processes. I have one job that fully utilizes the 10 map tasks, but
subsequent queries are only using 2 of them; I don't know why.
I also have enough RAM - no paging out is happening - and I'm running
0.18.3.
Right now I have all processes (namenode, datanode, jobtracker,
tasktracker) on one machine, with two quad-core CPUs and 20 GB of RAM.


On Tue, Apr 21, 2009 at 1:25 PM, Koji Noguchi
knogu...@yahoo-inc.comwrote:

 This is a cluster config and not a per job config.

 So this has to be set when the mapreduce cluster first comes up.

 Koji


 -Original Message-
 From: javateck javateck [mailto:javat...@gmail.com]
 Sent: Tuesday, April 21, 2009 1:20 PM
 To: core-user@hadoop.apache.org
 Subject: mapred.tasktracker.map.tasks.maximum

 I set my mapred.tasktracker.map.tasks.maximum to 10, but when I run a
 task, it's only using 2 out of 10. Is there any way to know why it's
 only using 2?
 thanks



RE: reduce task specific jvm arg

2009-04-15 Thread Koji Noguchi
This sounds like a reasonable request.

Created 
https://issues.apache.org/jira/browse/HADOOP-5684

On our clusters, sometimes users want thin mappers and large reducers.

Koji
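
For context, a minimal sketch (untested) of the single knob available before
HADOOP-5684: mapred.child.java.opts applies to both map and reduce child JVMs,
which is exactly why thin mappers and large reducers cannot be expressed
separately:

import org.apache.hadoop.mapred.JobConf;

public class ChildJvmOpts {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // One setting shared by map and reduce children; separate map/reduce
    // variants are what HADOOP-5684 asks for.
    conf.set("mapred.child.java.opts", "-Xmx1536m");
    System.out.println(conf.get("mapred.child.java.opts"));
  }
}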

-Original Message-
From: Jun Rao [mailto:jun...@almaden.ibm.com] 
Sent: Thursday, April 09, 2009 10:30 AM
To: core-user@hadoop.apache.org
Subject: reduce task specific jvm arg

Hi,

Is there a way to set jvm parameters only for reduce tasks in Hadoop?
Thanks,

Jun
IBM Almaden Research Center
K55/B1, 650 Harry Road, San Jose, CA  95120-6099

jun...@almaden.ibm.com


RE: Issue distcp'ing from 0.19.2 to 0.18.3

2009-04-09 Thread Koji Noguchi
Bryan,

hftp://ds-nn1:7276
hdfs://ds-nn2:7276

Are you using the same port number for hftp and hdfs?

Looking at the stack trace, it seems like it failed before starting a
distcp job.

Koji

-Original Message-
From: Bryan Duxbury [mailto:br...@rapleaf.com] 
Sent: Wednesday, April 08, 2009 11:40 PM
To: core-user@hadoop.apache.org
Subject: Issue distcp'ing from 0.19.2 to 0.18.3

Hey all,

I was trying to copy some data from our cluster on 0.19.2 to a new  
cluster on 0.18.3 by using distcp and the hftp:// filesystem.  
Everything seemed to be going fine for a few hours, but then a few  
tasks failed because a few files got 500 errors when trying to be  
read from the 19 cluster. As a result the job died. Now that I'm  
trying to restart it, I get this error:

[rapl...@ds-nn2 ~]$ hadoop distcp hftp://ds-nn1:7276/ hdfs://ds- 
nn2:7276/cluster-a
09/04/08 23:32:39 INFO tools.DistCp: srcPaths=[hftp://ds-nn1:7276/]
09/04/08 23:32:39 INFO tools.DistCp: destPath=hdfs://ds-nn2:7276/ 
cluster-a
With failures, global counters are inaccurate; consider running with -i
Copy failed: java.net.SocketException: Unexpected end of file from  
server
 at sun.net.www.http.HttpClient.parseHTTPHeader 
(HttpClient.java:769)
 at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
 at sun.net.www.http.HttpClient.parseHTTPHeader 
(HttpClient.java:766)
 at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
 at sun.net.www.protocol.http.HttpURLConnection.getInputStream 
(HttpURLConnection.java:1000)
 at org.apache.hadoop.dfs.HftpFileSystem$LsParser.fetchList 
(HftpFileSystem.java:183)
 at org.apache.hadoop.dfs.HftpFileSystem 
$LsParser.getFileStatus(HftpFileSystem.java:193)
 at org.apache.hadoop.dfs.HftpFileSystem.getFileStatus 
(HftpFileSystem.java:222)
 at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:667)
 at org.apache.hadoop.tools.DistCp.checkSrcPath(DistCp.java:588)
 at org.apache.hadoop.tools.DistCp.copy(DistCp.java:609)
 at org.apache.hadoop.tools.DistCp.run(DistCp.java:768)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
 at org.apache.hadoop.tools.DistCp.main(DistCp.java:788)

I changed nothing at all between the first attempt and the subsequent  
failed attempts. The only clues in the namenode log for the 19  
cluster are:

2009-04-08 23:29:09,786 WARN org.apache.hadoop.ipc.Server: Incorrect  
header or version mismatch from 10.100.50.252:47733 got version 47  
expected version 2

Anyone have any ideas?

-Bryan


RE: Very assymetric data allocation

2009-04-07 Thread Koji Noguchi
Marcus,

One known issue in 0.18.3 is HADOOP-5465.

Copy-paste from
https://issues.apache.org/jira/browse/HADOOP-4489?focusedCommentId=12693956&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12693956

Hairong said:
 This bug might be caused by HADOOP-5465. Once a datanode hits
HADOOP-5465, NameNode sends an empty replication request to the data
node on every reply to a heartbeat, thus not a single scheduled block
deletion request can be sent to the data node.

(Also, if you're always writing from one of the nodes, that node is more
likely to get full.)



Nigel, not sure if this is the issue, but it would be nice to have
0.18.4 out.


Koji



-Original Message-
From: Marcus Herou [mailto:marcus.he...@tailsweep.com] 
Sent: Tuesday, April 07, 2009 12:45 AM
To: hadoop-u...@lucene.apache.org
Subject: Very assymetric data allocation

Hi.

We are running Hadoop 0.18.3 and noticed a strange issue when one of our
machines ran out of disk yesterday.
As you can see in the table below, the server mapredcoord is 66.91%
allocated while the others are almost empty.
How can that be?

Any information about this would be very helpful.

mapredcoord is as well our jobtracker.

//Marcus

Node          Last Contact  Admin State  Size (GB)  Used (%)  Remaining (GB)/Blocks
mapredcoord   2             In Service   416.69     66.91     90.9419806
mapreduce2    2             In Service   416.69     6.71      303.54456
mapreduce3    2             In Service   416.69     0.44      351.693975
mapreduce4    0             In Service   416.69     0.25      355.821549
mapreduce5    2             In Service   416.69     0.42      347.683995
mapreduce6    0             In Service   416.69     0.43      352.73982
mapreduce7    0             In Service   416.69     0.5       351.914079
mapreduce8    1             In Service   416.69     0.48      350.154169


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/


RE: Socket closed Exception

2009-03-30 Thread Koji Noguchi
Hi Lohit,

My initial guess would be
https://issues.apache.org/jira/browse/HADOOP-4040

When this happened on our 0.17 cluster, all of our (task) clients were
using the max idle time of 1 hour due to this bug instead of the
configured value of a few seconds.
Thus each client kept the connection up much longer than we expected.
(Not sure if this applies to your 0.15 cluster, but it sounds similar to
what we observed.)

This worked until the namenode started hitting the max limit of
'ipc.client.idlethreshold'.

  <name>ipc.client.idlethreshold</name>
  <value>4000</value>
  <description>Defines the threshold number of connections after which
   connections will be inspected for idleness.
  </description>

When inspecting for idleness, namenode uses

  <name>ipc.client.maxidletime</name>
  <value>12</value>
  <description>Defines the maximum idle time for a connected client
   after which it may be disconnected.
  </description>

As a result, many connections got disconnected at once.
Clients only see the timeouts when they try to reuse those sockets the
next time and wait for 1 minute.  That's why the failures are not exactly
at the same time, but *almost* the same time.


# If this solves your problem, Raghu should get the credit. 
  He spent so many hours to solve this mystery for us. :)


Koji


-Original Message-
From: lohit [mailto:lohit...@yahoo.com] 
Sent: Sunday, March 29, 2009 11:56 AM
To: core-user@hadoop.apache.org
Subject: Socket closed Exception


Recently we are seeing a lot of 'Socket closed' exceptions in our cluster.
Many tasks' open/create/getFileInfo calls get back a SocketException
with the message 'Socket closed'. We see many tasks fail with the same
error around the same time. There are no warning or info messages in the
NameNode/TaskTracker/Task logs. (This is on HDFS 0.15.) Are there cases
where the NameNode closes sockets due to heavy load or contention for
resources of any kind?

Thanks,
Lohit



RE: corrupt unreplicated block in dfs (0.18.3)

2009-03-26 Thread Koji Noguchi
Mike, you might want to look at the -move option in fsck.

bash-3.00$ hadoop fsck
Usage: DFSck <path> [-move | -delete | -openforwrite] [-files [-blocks
[-locations | -racks]]]
<path>  start checking from this path
-move   move corrupted files to /lost+found
-delete delete corrupted files
-files  print out files being checked
-openforwrite   print out files opened for write
-blocks print out block report
-locations  print out locations for every block
-racks  print out network topology for data-node locations



I never use it since I would rather have users' jobs fail than have jobs
succeed with incomplete inputs.

Koji


-Original Message-
From: Aaron Kimball [mailto:aa...@cloudera.com] 
Sent: Thursday, March 26, 2009 9:41 AM
To: core-user@hadoop.apache.org
Subject: Re: corrupt unreplicated block in dfs (0.18.3)

Just because a block is corrupt doesn't mean the entire file is corrupt.
Furthermore, the presence/absence of a file in the namespace is a
completely separate issue from the data in the file. I think it would be
a surprising interface change if files suddenly disappeared just because
1 out of potentially many blocks was corrupt.

- Aaron

On Thu, Mar 26, 2009 at 1:21 PM, Mike Andrews m...@xoba.com wrote:

 i noticed that when a file with no replication (i.e., replication=1)
 develops a corrupt block, hadoop takes no action aside from the
 datanode throwing an exception to the client trying to read the file.
 i manually corrupted a block in order to observe this.

 obviously, with replication=1 its impossible to fix the block, but i
 thought perhaps hadoop would take some other action, such as deleting
 the file outright, or moving it to a corrupt directory, or marking
 it or keeping track of it somehow to note that there's un-fixable
 corruption in the filesystem? thus, the current behaviour seems to
 sweep the corruption under the rug and allows its continued existence,
 aside from notifying the specific client doing the read with an
 exception.

 if anyone has any information about this issue or how to work around
 it, please let me know.

 on the other hand, i tested that corrupting a block in a replication=3
 file causes hadoop to re-replicate the block from another existing
 copy, which is good and is what i expected.

 best,
 mike


 --
 permanent contact information at http://mikerandrews.com



RE: streaming error when submit the job:Cannot run program chmod: java.io.IOException: error=12, Cannot allocate memory

2009-03-11 Thread Koji Noguchi
Shixing,

Discussion on 
   https://issues.apache.org/jira/browse/HADOOP-5059
may be related.

Koji

-Original Message-
From: shixing [mailto:paradise...@gmail.com] 
Sent: Wednesday, March 11, 2009 1:31 AM
To: core-user@hadoop.apache.org
Subject: streaming error when submit the job:Cannot run program chmod:
java.io.IOException: error=12, Cannot allocate memory

09/03/11 15:43:55 ERROR streaming.StreamJob: Error Launching job :
java.io.IOException: Cannot run program chmod: java.io.IOException:
error=12, Cannot allocate memory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
at org.apache.hadoop.util.Shell.run(Shell.java:134)
at
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286
)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:338)
at
org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.j
ava:480)
at
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem
.java:472)
at
org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.jav
a:274)
at
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:3
64)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:468)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:375)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:208)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:142)
at
org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1238)
at
org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1219)
at
org.apache.hadoop.mapred.JobInProgress.<init>(JobInProgress.java:247)
at
org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:2426)
at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessor
Impl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:467)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:902)
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot
allocate memory
at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
... 22 more

and when I resubmit the job, it succeeds!
-- 
Best wishes!
My Friend~


RE: Potential race condition (Hadoop 18.3)

2009-03-02 Thread Koji Noguchi
Ryan,

If you're using getOutputPath, try replacing it with getWorkOutputPath.

http://hadoop.apache.org/core/docs/r0.18.3/api/org/apache/hadoop/mapred/
FileOutputFormat.html#getWorkOutputPath(org.apache.hadoop.mapred.JobConf
)

Koji
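
For illustration, a minimal sketch (untested, 0.18 mapred API) of writing task
output under the directory returned by getWorkOutputPath, so whichever
speculative attempt loses simply has its files discarded at commit time:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class SideEffectFiles {
  // Files created here are private to the task attempt and are promoted to
  // ${mapred.output.dir} only when that attempt commits.
  public static Path sideEffectFile(JobConf conf, String name) {
    return new Path(FileOutputFormat.getWorkOutputPath(conf), name);
  }

  public static void writeMarker(JobConf conf, String name) throws IOException {
    Path p = sideEffectFile(conf, name);
    FileSystem fs = p.getFileSystem(conf);
    fs.create(p).close();   // zero-length marker, just to show the pattern
  }
}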

-Original Message-
From: Ryan Shih [mailto:ryan.s...@gmail.com] 
Sent: Monday, March 02, 2009 11:01 AM
To: core-user@hadoop.apache.org
Subject: Potential race condition (Hadoop 18.3)

Hi - I'm not sure yet, but I think I might be hitting a race condition in
Hadoop 18.3. What seems to happen is that in the reduce phase, some of my
tasks perform speculative execution, but when the initial task completes
successfully, it sends a kill to the newly started task. After all is said
and done, perhaps one in every five or ten tasks that kill their second
attempt ends up with zero or truncated output.  When I code it to turn off
speculative execution, the problem goes away. Are there known race
conditions that I should be aware of around this area?

Thanks in advance,
Ryan


RE: how can I decommission nodes on-the-fly?

2008-11-26 Thread Koji Noguchi
+1 

Created Jira.
https://issues.apache.org/jira/browse/HADOOP-4733

Koji

 Steve Loughran wrote: 
 At some point in the future, I could imagine it being handy to 
 have the ability to decommission a task tracker, which would tell 
 it to stop accepting new work, and run the rest down. This would be
 good when tasks take time to run but you still want to be agile 
 in your cluster management.



RE: Cannot run program bash: java.io.IOException: error=12, Cannot allocate memory

2008-11-18 Thread Koji Noguchi


We had a similar issue before with Secondary Namenode failing with 

2008-10-09 02:00:58,288 ERROR org.apache.hadoop.dfs.NameNode.Secondary:
java.io.IOException:
javax.security.auth.login.LoginException: Login failed: Cannot run
program whoami: java.io.IOException:
error=12, Cannot allocate memory

In our case, simply increasing the swap space fixed our problem.

http://hudson.gotdns.com/wiki/display/HUDSON/IOException+Not+enough+space

When checking with strace, it was failing at 

[pid  7927] clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x4133c9f0) = -1 ENOMEM (Cannot allocate memory)


Without CLONE_VM. In the clone man page, 

 If  CLONE_VM  is not set, the child process runs in a separate copy of
the memory space of the calling process
at the time of clone.  Memory writes or file mappings/unmappings
performed by one of the processes do not affect the 
other,  as with fork(2). 

Koji


-Original Message-
From: Brian Bockelman [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, November 18, 2008 3:12 PM
To: core-user@hadoop.apache.org
Subject: Re: Cannot run program bash: java.io.IOException: error=12,
Cannot allocate memory

Hey Xavier,

Don't forget, the Linux kernel reserves the memory; current heap space  
is disregarded.  How much heap space does your data node and  
tasktracker get?  (PS: overcommit ratio is disregarded if  
overcommit_memory=2).

You also have to remember that there is some overhead from the OS, the  
Java code cache, and a bit from running the JVM.  Add at least 64 MB  
per JVM for code cache and running, and we get 400MB of memory left  
for the OS and any other process running.

You're definitely running out of memory.  Either allow overcommitting  
(which will mean Java is no longer locked out of swap) or reduce  
memory consumption.

Brian

On Nov 18, 2008, at 4:57 PM, Xavier Stevens wrote:

 1) It doesn't look like I'm out of memory but it is coming really  
 close.
 2) overcommit_memory is set to 2, overcommit_ratio = 100

 As for the JVM, I am using Java 1.6.

 **Note of Interest**: The virtual memory I see allocated in top for  
 each
 task is more than what I am specifying in the hadoop job/site configs.

 Currently each physical box has 16 GB of memory.  I see the datanode  
 and
 tasktracker using:

                RES    VIRT
 Datanode       145m   1408m
 Tasktracker    206m   1439m

 When idle.

 So taking that into account I do 16000 MB - (1408+1439) MB which would
 leave me with 13200 MB.  In my old settings I was using 8 map tasks   
 so
 13200 / 8 = 1650 MB.

 My mapred.child.java.opts is -Xmx1536m which should leave me a little
 head room.

 When running though I see some tasks reporting 1900m.


 -Xavier
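
To make the arithmetic above explicit, a small sketch using the poster's own
figures (rough VIRT readings, not guaranteed numbers):

public class SlotMemoryBudget {
  public static void main(String[] args) {
    int totalMb = 16000;          // physical RAM on the node (approx.)
    int datanodeMb = 1408;        // DataNode JVM VIRT when idle
    int taskTrackerMb = 1439;     // TaskTracker JVM VIRT when idle
    int mapSlots = 8;
    int leftMb = totalMb - (datanodeMb + taskTrackerMb);   // ~13150 MB
    int perSlotMb = leftMb / mapSlots;                     // ~1644 MB per map slot
    // With -Xmx1536m plus 64+ MB of JVM/code-cache overhead per child, eight
    // children leave almost nothing for the OS once overcommit is disabled.
    System.out.println(perSlotMb + " MB per map slot before JVM overhead");
  }
}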


 -Original Message-
 From: Brian Bockelman [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, November 18, 2008 2:42 PM
 To: core-user@hadoop.apache.org
 Subject: Re: Cannot run program bash: java.io.IOException: error=12,
 Cannot allocate memory

 Hey Xavier,

 1) Are you out of memory (dumb question, but doesn't hurt to ask...)?
 What does Ganglia tell you about the node?
 2) Do you have /proc/sys/vm/overcommit_memory set to 2?

 Telling Linux not to overcommit memory on Java 1.5 JVMs can be very
 problematic.  Java 1.5 asks for min heap size + 1 GB of reserved, non-
 swap memory on Linux systems by default.  The 1GB of reserved, non-  
 swap
 memory is used for the JIT to compile code; this bug wasn't fixed  
 until
 later Java 1.5 updates.

 Brian

 On Nov 18, 2008, at 4:32 PM, Xavier Stevens wrote:

 I'm still seeing this problem on a cluster using Hadoop 0.18.2.  I
 tried
 dropping the max number of map tasks per node from 8 to 7.  I still
 get
 the error although it's less frequent.  But I don't get the error at
 all
 when using Hadoop 0.17.2.

 Anyone have any suggestions?


 -Xavier

 -Original Message-
 From: [EMAIL PROTECTED] On Behalf Of Edward J. Yoon
 Sent: Thursday, October 09, 2008 2:07 AM
 To: core-user@hadoop.apache.org
 Subject: Re: Cannot run program bash: java.io.IOException:  
 error=12,
 Cannot allocate memory

 Thanks Alexander!!

 On Thu, Oct 9, 2008 at 4:49 PM, Alexander Aristov
 [EMAIL PROTECTED] wrote:
 I received such errors when I overloaded data nodes. You may  
 increase
 swap space or run fewer tasks.

 Alexander

 2008/10/9 Edward J. Yoon [EMAIL PROTECTED]

 Hi,

 I received below message. Can anyone explain this?

 08/10/09 11:53:33 INFO mapred.JobClient: Task Id :
 task_200810081842_0004_m_00_0, Status : FAILED
 java.io.IOException: Cannot run program bash:  
 java.io.IOException:
 error=12, Cannot allocate memory
  at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
  at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
  at org.apache.hadoop.util.Shell.run(Shell.java:134)
  at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
  at

 org.apache.hadoop.fs.LocalDirAllocator
 $AllocatorPerContext.getLocalPathF
 orWrite(LocalDirAllocator.java:296)
  

RE: Specify per file replication factor in dfs -put command line

2008-08-29 Thread Koji Noguchi
Try 

hadoop dfs -D dfs.replication=2 -put abc bcd

Koji
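
A minimal sketch (untested, 0.18-era FileSystem API) of the programmatic
equivalent: set dfs.replication before the copy, or adjust an existing file
afterwards. The paths are the ones from the question:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerFileReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("dfs.replication", 2);       // same effect as -D dfs.replication=2
    FileSystem fs = FileSystem.get(conf);    // create the client after the override
    fs.copyFromLocalFile(new Path("abc"), new Path("bcd"));
    // Or change the replication of a file that is already in HDFS:
    fs.setReplication(new Path("bcd"), (short) 2);
  }
}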

-Original Message-
From: Kevin [mailto:[EMAIL PROTECTED] 
Sent: Friday, August 29, 2008 11:11 AM
To: core-user@hadoop.apache.org
Subject: Specify per file replication factor in dfs -put command line

Hi,

Does anyone happen to know how to specify the replication factor of a
file when I upload it with the hadoop dfs -put command? Thank you!

Best,
-Kevin


RE: HDFS -rmr permissions

2008-08-14 Thread Koji Noguchi
Hi Brian, 

I believe dfs -rmr does check the permission for each file.
What's allowing you to delete other users' data is the trash feature. 
Each user's Trash is expunged by the namenode process, which is a
superuser.
More discussion on 
(http://issues.apache.org/jira/browse/HADOOP-2514)

My guess is that what we really need is a 'sticky bit' that won't allow
dfs -mv of files/directories under a dir with 777 permission.  I couldn't
find a Jira, so I opened a new one:
https://issues.apache.org/jira/browse/HADOOP-3953

Koji

===
(userB) hadoop dfs -ls / | grep ' /tmp'
drwxrwxrwx   - knoguchi supergroup  0 2008-08-14 16:47 /tmp

(userB) hadoop dfs -Dfs.trash.interval=0 -ls /tmp
Found 1 items
drwxr-xr-x   - userA users  0 2008-08-14 16:45 /tmp/userA-dir
(userB) hadoop dfs -Dfs.trash.interval=0 -lsr /tmp
drwxr-xr-x   - userA users  0 2008-08-14 16:45 /tmp/userA-dir
drwxr-xr-x   - userA users  0 2008-08-14 16:45
/tmp/userA-dir/foo1
-rw-r--r--   1 userA users 13 2008-08-14 16:45
/tmp/userA-dir/foo1/a
-rw-r--r--   1 userA users 15 2008-08-14 16:45
/tmp/userA-dir/foo1/b
-rw-r--r--   1 userA users 25 2008-08-14 16:45
/tmp/userA-dir/foo1/c

(userB) hadoop dfs -Dfs.trash.interval=0 -rmr /tmp/userA-dir
rmr: org.apache.hadoop.fs.permission.AccessControlException: Permission
denied: user=userB, access=ALL, inode=userA-dir:userA:users:rwxr-xr-x

(userB) hadoop dfs -Dfs.trash.interval=1 -rmr /tmp/userA-dir
Moved to trash: hdfs://ucdev13.inktomisearch.com:47522/tmp/userA-dir
===


-Original Message-
From: Brian Karlak [mailto:[EMAIL PROTECTED] 
Sent: Thursday, August 07, 2008 11:27 AM
To: core-user@hadoop.apache.org
Cc: Colin Evans
Subject: HDFS -rmr permissions

Hello --

As far as I can tell, hadoop dfs -rmr only checks the permissions of
the directory to be deleted and its parent.  Unlike Unix, however, it
does not seem to check the permissions of the directories / files  
contained within the directory to be deleted.

Is this by design?  It seems dangerous.  For instance, we have a  
directory where we want to allow people to deposit common resources  
for a project.  Its permissions need to be 777, otherwise only one  
person can write to it.  But with 777 permissions, any fool can  
accidentally wipe it.

(Of course, if we have /trash set up, accidental writes are not as big  
a deal, but still ...)

Thoughts / comments?  Is there a way to make -rmr check the  
permissions of the files within the directories it's deleting, just as  
unix does?  If not, is this a legit feature request?  (I checked JIRA,  
but I didn't find anything on this ...)

Thanks,
Brian


RE: MapReduce with multi-languages

2008-07-11 Thread Koji Noguchi
Hi.

Asked Runping about this.
Here's his reply.

Koji 


=
On 7/10/08 11:16 PM, Koji Noguchi [EMAIL PROTECTED] wrote:
  Runping,
  
  Can they use Buffer class?
  
  Koji

Yes, use Buffer or ByteWritable for the key/value classes.
But the critical point is to implement their own record reader/input
format classes.
Runping

=
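
As a concrete illustration of Runping's suggestion, a minimal sketch (untested;
the Shift_JIS bytes are just for demonstration) of carrying raw bytes through
map-reduce and decoding them with the real source charset instead of the UTF-8
that Hadoop's Text assumes:

import java.io.UnsupportedEncodingException;
import org.apache.hadoop.io.BytesWritable;

public class RawBytesDecode {
  // If the custom record reader emits untouched bytes (e.g. as BytesWritable),
  // the map function can pick the charset itself.
  static String decode(BytesWritable value, String charsetName)
      throws UnsupportedEncodingException {
    return new String(value.getBytes(), 0, value.getLength(), charsetName);
  }

  public static void main(String[] args) throws Exception {
    byte[] raw = {(byte) 0x93, (byte) 0xfa};   // one Shift_JIS-encoded character
    System.out.println(decode(new BytesWritable(raw), "Shift_JIS"));
  }
}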

-Original Message-
From: NOMURA Yoshihide [mailto:[EMAIL PROTECTED] 
Sent: Thursday, July 10, 2008 10:36 PM
To: core-user@hadoop.apache.org
Subject: Re: MapReduce with multi-languages

Mr. Taeho Kang,

I need to analyze text in different character encodings too,
and I suggested supporting an encoding configuration in TextInputFormat.

https://issues.apache.org/jira/browse/HADOOP-3481

But for now, I think you should convert the text file encoding to UTF-8.

Regards,

Taeho Kang:
 Dear Hadoop User Group,
 
 What are elegant ways to do mapred jobs on text-based data encoded
with
 something other than UTF-8?
 
 It looks like Hadoop assumes the text data is always in UTF-8 and
handles
 data that way - encoding with UTF-8 and decoding with UTF-8.
 And whenever the data is not in UTF-8 encoded format, problems arise.
 
 Here is what I'm thinking of to clear up the situation - please correct
 and advise me if my approaches look bad!
 
 (1) Re-encode the original data with UTF-8?
 (2) Replace the part of source code where UTF-8 encoder and decoder
are
 used?
 
 Or has anyone of you guys had trouble with running map-red job on data
with
 multi-languages?
 
 Any suggestions/advices are welcome and appreciated!
 
 Regards,
 
 Taeho
 

-- 
NOMURA Yoshihide:
 Software Innovation Laboratory, Fujitsu Labs. Ltd., Japan
 Tel: 044-754-2675 (Ext: 7106-6916)
 Fax: 044-754-2570 (Ext: 7108-7060)
 E-Mail: [EMAIL PROTECTED]