[Fwd: Re: runtime exceptions not killing job]
Or maybe I can't use attachments, so here are the stack traces inline:

--task tracker

2008-03-17 21:58:30
Full thread dump Java HotSpot(TM) 64-Bit Server VM (1.6.0_03-b05 mixed mode):

"Attach Listener" daemon prio=10 tid=0x2aab1205c400 nid=0x523d waiting on condition [0x..0x]
   java.lang.Thread.State: RUNNABLE

"IPC Client connection to bigmike.internal.persai.com/192.168.1.3:9001" daemon prio=10 tid=0x2aab14317000 nid=0x5230 in Object.wait() [0x41c44000..0x41c44ba0]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x2aaaf304da08> (a org.apache.hadoop.ipc.Client$Connection)
	at java.lang.Object.wait(Object.java:485)
	at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:234)
	- locked <0x2aaaf304da08> (a org.apache.hadoop.ipc.Client$Connection)
	at org.apache.hadoop.ipc.Client$Connection.run(Client.java:273)

"process reaper" daemon prio=10 tid=0x2aab1205bc00 nid=0x51c6 runnable [0x41f47000..0x41f47da0]
   java.lang.Thread.State: RUNNABLE
	at java.lang.UNIXProcess.waitForProcessExit(Native Method)
	at java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
	at java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)

"Thread-408" prio=10 tid=0x2aab14316000 nid=0x51c5 in Object.wait() [0x41d45000..0x41d45a20]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x2aaaf2cf0948> (a java.lang.UNIXProcess)
	at java.lang.Object.wait(Object.java:485)
	at java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
	- locked <0x2aaaf2cf0948> (a java.lang.UNIXProcess)
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:152)
	at org.apache.hadoop.util.Shell.run(Shell.java:100)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:252)
	at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:456)
	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:379)

"SocketListener0-0" prio=10 tid=0x2aab1205e400 nid=0x519d in Object.wait() [0x41038000..0x41038da0]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x2aaaf2650c20> (a org.mortbay.util.ThreadPool$PoolThread)
	at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:522)
	- locked <0x2aaaf2650c20> (a org.mortbay.util.ThreadPool$PoolThread)

"[EMAIL PROTECTED]" daemon prio=10 tid=0x2aab183a9000 nid=0x46f5 waiting on condition [0x41a42000..0x41a42aa0]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:597)
	at java.lang.Thread.run(Thread.java:619)

"[EMAIL PROTECTED]" daemon prio=10 tid=0x2aab183ce000 nid=0x46ef waiting on condition [0x4184..0x41840c20]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:597)
	at java.lang.Thread.run(Thread.java:619)

"Map-events fetcher for all reduce tasks on tracker_kentbox.internal.persai.com:localhost/127.0.0.1:43477" daemon prio=10 tid=0x2aab18438400 nid=0x4631 in Object.wait() [0x4173f000..0x4173fda0]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x2aaab3f0ace0> (a java.lang.Object)
	at org.apache.hadoop.mapred.TaskTracker$MapEventsFetcherThread.run(TaskTracker.java:534)
	- locked <0x2aaab3f0ace0> (a java.lang.Object)

"org.apache.hadoop.io.ObjectWritable Connection Culler" daemon prio=10 tid=0x2aab18427400 nid=0x462f waiting on condition [0x4153d000..0x4153daa0]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java:423)

"IPC Server handler 1 on 43477" daemon prio=10 tid=0x2aab18476c00 nid=0x462e in Object.wait() [0x4143c000..0x4143cb20]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x2aaab41356b0> (a java.util.LinkedList)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:869)
	- locked <0x2aaab41356b0> (a java.util.LinkedList)

"IPC Server handler 0 on 43477" daemon prio=10 tid=0x2aab18389c00 nid=0x462d in Object.wait() [0x4133b000..0x4133bba0]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x2aaab41356b0> (a java.util.LinkedList)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:869)
	- locked <0x2aaab41356b0> (a
Re: runtime exceptions not killing job
It seems to happen only with reduce tasks, not map tasks. I reproduced it by having a dummy reduce task throw an NPE immediately. The error is shown on the reduce details page, but the job does not register the task as failed. I've attached the task tracker stack trace, the child stack trace and a screenshot of the task list page.

Matt

Owen O'Malley wrote:
> On Mar 17, 2008, at 3:14 PM, Matt Kent wrote:
> > I recently upgraded from Hadoop 0.14.1 to 0.16.1. Previously in 0.14.1, if a map or reduce task threw a runtime exception such as an NPE, the task, and ultimately the job, would fail in short order. I was running a job on my local 0.16.1 cluster today, and when the reduce tasks started throwing NPEs, the tasks just hung. Eventually they timed out and were killed, but is this expected behavior in 0.16.1? I'd prefer the job to fail quickly if NPEs are being thrown.
> This sounds like a bug. Tasks should certainly fail immediately if an exception is thrown. Do you know where the exception is being thrown? Can you get a stack trace of the task from jstack after the exception and before the task times out?
Thanks, Owen

--
Matt Kent
Co-Founder
Persai
1221 40th St #113
Emeryville, CA 94608
[EMAIL PROTECTED]
Re: Hadoop-Patch build is not progressing for 6 hours
org.apache.hadoop.streaming.TestGzipInput was stuck. I killed it. Nige

On Mar 17, 2008, at 8:17 PM, Konstantin Shvachko wrote:
> Usually a build takes 2 hours or less. This one is stuck, and I don't see changes in the QUEUE OF PENDING PATCHES when I submit a patch. I guess something is wrong with Hudson. Could anybody please check? --Konstantin
Re: runtime exceptions not killing job
On Mar 17, 2008, at 3:14 PM, Matt Kent wrote:
> I recently upgraded from Hadoop 0.14.1 to 0.16.1. Previously in 0.14.1, if a map or reduce task threw a runtime exception such as an NPE, the task, and ultimately the job, would fail in short order. I was running a job on my local 0.16.1 cluster today, and when the reduce tasks started throwing NPEs, the tasks just hung. Eventually they timed out and were killed, but is this expected behavior in 0.16.1? I'd prefer the job to fail quickly if NPEs are being thrown.

This sounds like a bug. Tasks should certainly fail immediately if an exception is thrown. Do you know where the exception is being thrown? Can you get a stack trace of the task from jstack after the exception and before the task times out?

Thanks, Owen
Problems with 0.16.1
We believe that there has been a regression in release 0.16.1 with respect to the reliability of HDFS. In particular, tasks get stuck talking to the data nodes when they are under load. It seems to be HADOOP-3033. We are currently testing it further. In the meantime, I would suggest holding off on 0.16.1. -- Owen
Hadoop-Patch build is not progressing for 6 hours
Usually a build takes 2 hours or less. This one is stuck, and I don't see changes in the QUEUE OF PENDING PATCHES when I submit a patch. I guess something is wrong with Hudson. Could anybody please check? --Konstantin
Re: runtime exceptions not killing job
I've noticed this behavior as well in 0.16.0 with RuntimeExceptions in general.

Chris

On Mon, Mar 17, 2008 at 6:14 PM, Matt Kent <[EMAIL PROTECTED]> wrote:
> I recently upgraded from Hadoop 0.14.1 to 0.16.1. Previously in 0.14.1,
> if a map or reduce task threw a runtime exception such as an NPE, the
> task, and ultimately the job, would fail in short order. I was running
> a job on my local 0.16.1 cluster today, and when the reduce tasks
> started throwing NPEs, the tasks just hung. Eventually they timed out
> and were killed, but is this expected behavior in 0.16.1? I'd prefer the
> job to fail quickly if NPEs are being thrown.
>
> Matt
>
> --
> Matt Kent
> Co-Founder
> Persai
> 1221 40th St #113
> Emeryville, CA 94608
> [EMAIL PROTECTED]
Reading from and writing to a particular block
I examined org.apache.hadoop.dfs.DistributedFileSystem. For reads, I couldn't find a method to read a particular block or chunk; I only see the open method, which opens an input stream for a path. How can I read only a specific block? Also, for writes, I cannot write to a particular block; I have to write the entire file again to modify a single byte. Am I correct?

Thanks for your responses,
Best,
Cagdas
runtime exceptions not killing job
I recently upgraded from Hadoop 0.14.1 to 0.16.1. Previously in 0.14.1, if a map or reduce task threw a runtime exception such as an NPE, the task, and ultimately the job, would fail in short order. I was running a job on my local 0.16.1 cluster today, and when the reduce tasks started throwing NPEs, the tasks just hung. Eventually they timed out and were killed, but is this expected behavior in 0.16.1? I'd prefer the job to fail quickly if NPEs are being thrown.

Matt

--
Matt Kent
Co-Founder
Persai
1221 40th St #113
Emeryville, CA 94608
[EMAIL PROTECTED]
secondarynamenode
Hi, Anyone have stats for the impact on system resources (disk/memory/cpu) of running the secondarynamenode on a host? Are people running multiple secondaries for redundancy, or just going the route of having the namenode (& secondary) replicate edits & fsimage to multiple physical disks (and NFS when possible)? Basically, I'm trying to determine the possible performance impact of running the secondarynamenode daemon on several machines, as well as whether there's any potential benefit to running more than one. What are other people doing for disaster recovery on the namenode?

-jorgenj

--
"Liberties are not given, they are taken." - Aldous Huxley
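For the multiple-disk route mentioned above, one concrete sketch: dfs.name.dir accepts a comma-separated list of directories, and the namenode writes its fsimage and edits to every one of them, so listing a second local disk and an NFS mount gives on-disk redundancy without any extra daemons. The directory paths below are placeholders, not defaults:

```xml
<!-- hadoop-site.xml: keep redundant copies of fsimage and edits on
     two local disks plus an NFS mount (example paths). -->
<property>
  <name>dfs.name.dir</name>
  <value>/disk1/hdfs/name,/disk2/hdfs/name,/mnt/nfs/hdfs/name</value>
</property>
```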
Re: single node Hbase
Try our 'getting started': http://hadoop.apache.org/hbase/docs/current/api/index.html
St.Ack

Peter W. wrote:
> Hello, Are there any Hadoop documentation resources showing how to run the current version of Hbase on a single node? Thanks, Peter W.
single node Hbase
Hello, Are there any Hadoop documentation resources showing how to run the current version of Hbase on a single node? Thanks, Peter W.
Re: [some bugs] Re: file permission problem
Hi Stefan,

> any magic we can do with hadoop.dfs.umask?

dfs.umask is similar to the Unix umask.

> Or is there any other off switch for the file security?

If dfs.permissions is set to false, then the security will be turned off. For the two questions above, see http://hadoop.apache.org/core/docs/r0.16.1/hdfs_permissions_guide.html for more details.

> I definitely can reproduce the problem Johannes describes ...

I guess you are using the nightly builds, which have the bug. Please try the 0.16.1 release or current trunk.

> Beside of that I had some interesting observations.
> If I have permissions to write to a folder A I can delete folder A and
> file B that is inside of folder A even if I do have no permissions for B.

This is also true on POSIX/Unix, which Hadoop permissions are based on.

> Also I noticed the following in my dfs
> [EMAIL PROTECTED] hadoop]$ bin/hadoop fs -ls /user/joa23/myApp-1205474968598
> Found 1 items
> /user/joa23/myApp-1205474968598/VOICE_CALL    2008-03-13 16:00    rwxr-xr-x    hadoop    supergroup
> [EMAIL PROTECTED] hadoop]$ bin/hadoop fs -ls /user/joa23/myApp-1205474968598/VOICE_CALL
> Found 1 items
> /user/joa23/myApp-1205474968598/VOICE_CALL/part-027311    2008-03-13 16:00    rw-r--r--    joa23    supergroup
>
> Do I miss something or was I able to write as user joa23 into a
> folder owned by hadoop where I should have no permissions. :-O.
> Should I open some jira issues?

Suppose joa23 is not a superuser. Then, no. The output above only shows that a file owned by joa23 exists in a directory owned by hadoop. This can definitely be done by a sequence of commands with chmod/chown. Suppose joa23 is not a superuser. If joa23 can create a file, say by "hadoop fs -put ...", under hadoop's directory with rwxr-xr-x, then it is a bug. But I don't think we can do this.

Hope this helps.
Nicholas
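To make the "off switch" Nicholas mentions concrete, the site-config change would be a single property (a sketch per the permissions guide linked above, not a recommendation for shared clusters):

```xml
<!-- hadoop-site.xml: disable HDFS permission checking entirely -->
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>
```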
Re: [some bugs] Re: file permission problem
Hi,

Let me clarify the versions having this problem:

- 0.16.0 release, 0.16.1 release, current trunk: no problem
- Nightly builds between 0.16.0 and 0.16.1 before HADOOP-2391 or after HADOOP-2915: no problem
- Nightly builds between 0.16.0 and 0.16.1 after HADOOP-2391 and before HADOOP-2915: bug exists

Similarly:

- Code checked out from trunk before HADOOP-2391 or after HADOOP-2915: no problem
- Code checked out from trunk after HADOOP-2391 and before HADOOP-2915: bug exists

Sorry for the confusion.
Nicholas

----- Original Message -----
From: Stefan Groschupf <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Saturday, March 15, 2008 8:02:07 PM
Subject: Re: [some bugs] Re: file permission problem

Great - it is even already fixed in 0.16.1! Thanks for the hint!
Stefan

On Mar 14, 2008, at 2:49 PM, Andy Li wrote:
> I think this is the same problem related to this mail thread.
>
> http://www.mail-archive.com/[EMAIL PROTECTED]/msg02759.html
>
> A JIRA has been filed, please see HADOOP-2915.
Re: [core-user] Processing binary files Howto??
You can certainly do this, but you are simply repeating the work that hadoop developers have already done. Can you say what kind of satellite data you will be processing? If it is imagery, then I would imagine that Google's use of map-reduce to prepare image tiles for Google maps would be an interesting example. On 3/17/08 11:12 AM, "Alfonso Olias Sanz" <[EMAIL PROTECTED]> wrote: > Another thing we want to considerer is to make our simple grid aware > of the data location in order to move the task to the node which > contains the data. A way of getting the hostname were the > filename-block is and then calling the dfs API from that node.
[core-user] Integration of dispersed clusters.
Hi, I haven't gone so far with the hadoop dfs documentation, but I would like to know if it is possible to integrate dispersed clusters (separated by firewalls). We cannot open ports, so we have to traverse the firewall through http/https (tunnelling). Apart from its performance drawbacks, is it possible?

Regards
Alfonso
Re: [core-user] Processing binary files Howto??
Use the default block size of hadoop if you can (64 MB). Then don't worry about splitting or replication. Hadoop will do that. On 3/17/08 11:12 AM, "Alfonso Olias Sanz" <[EMAIL PROTECTED]> wrote: > But we could also try to make every binary file as big as > a block size (then we need to decide the block size). Or if they are > bigger than a block, the split them before adding the data to the > cluster.
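If you do decide to change it, the block size in this era of Hadoop is the dfs.block.size site property, given in bytes; a sketch keeping the 64 MB default explicit:

```xml
<!-- hadoop-site.xml: dfs.block.size is in bytes;
     67108864 = 64 MB, which is also the default. -->
<property>
  <name>dfs.block.size</name>
  <value>67108864</value>
</property>
```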
Re: [core-user] Processing binary files Howto??
Hi Ted

Thanks for the info :) We do not know yet how the input data is going to be generated, because the data is going to be observations from a satellite... still under design. But we could also try to make every binary file as big as a block (then we need to decide the block size), or if they are bigger than a block, split them before adding the data to the cluster. But the file should be large enough, as you said, for its computation to last more than 10 seconds.

Then each task (map) will process the files that are locally stored in a node (the framework controls this??)

All this is fine. We already have a grid solution with agents on every node polling for jobs. Each job sent to a node computes 1-n files (could be zipped) of simulated data. One solution is to move to map/reduce and let the framework do the distribution of tasks and data. Another thing we want to consider is to make our simple grid aware of the data location, in order to move the task to the node which contains the data: a way of getting the hostname where the filename-block is, and then calling the dfs API from that node.

Cheers
Alfonso

On 17/03/2008, Ted Dunning <[EMAIL PROTECTED]> wrote:
> This sounds very different from your earlier questions.
>
> If you have a moderate (10's to 1000's) number of binary files, then it is very easy to write a special purpose InputFormat that tells hadoop that the file is not splittable. This allows you to add all of the files as inputs to the map step and you will get the locality that you want. The files should be large enough so that you take at least 10 seconds or more processing them to get good performance relative to startup costs. If they are not, then you may want to package them up in a form that can be read sequentially. This need not be splittable, but it would be nice if it were.
>
> If you are producing a single file per hour, then this style works pretty well. In my own work, we have a few compressed and encrypted files each hour that are map-reduced into a more congenial and splittable form each hour. Then subsequent steps are used to aggregate or process the data as needed.
>
> This gives you all of the locality that you were looking for.
>
> On 3/17/08 6:19 AM, "Alfonso Olias Sanz" <[EMAIL PROTECTED]> wrote:
> > Hi there.
> >
> > After reading a bit of the hadoop framework and trying the WordCount example, I have several doubts about how to use map/reduce with binary files.
> >
> > In my case binary files are generated on a timeline basis, let's say 1 file per hour. The size of each file is different (briefly, we are getting pictures from space and the star density is different between observations). The mappers, rather than receiving the file content, have to receive the file name. I read that if the input files are big (several blocks), they are split among several tasks in same/different node/s (block sizes?). But we want each map task to process a file rather than a block (or a line of a file as in the WordCount sample).
> >
> > In a previous post I did to this forum, I was recommended to use an input file with all the file names, so the mappers would receive the file name. But there is a drawback related to data location (this was also mentioned), because data then has to be moved from one node to another. Data is not going to be replicated to all the nodes. So a task taskA that has to process fileB on nodeN has to be executed on nodeN. How can we achieve that??? What if a task requires a file that is on another node? Does the framework move the logic to that node? We need to define a URI file map in each node (hostname/path/filename) for all the files. Tasks would access the local URI file map in order to process the files.
> >
> > Another approach we have thought of is to use the distributed file system to load balance the data among the nodes, and have our processes running on every node (without using the map/reduce framework). Then each process has to access the local node to process the data, using the dfs API (or checking the local URI file map). This approach would be more flexible for us, because depending on the machine (quad-core, dual-core) we know how many Java threads we can run in order to get the maximum performance of the machine. Using the framework we can only say a number of tasks to be executed on every node, but all the nodes have to be the same.
> >
> > URI file map. Once the files are copied to the distributed file system, we need to create this table map. Or is there a way to access a at the data node and retrieve the files it handles, rather than getting all the files in all the nodes? i.e.
> >
> > NodeA /tmp/.../mytask/input/fileA-1
> > /tmp/.../mytask/input
Re: Multiple Output Value Classes
Stu Hood wrote:
> But I'm trying to _output_ multiple different value classes from a Mapper, and not having any luck.

You can wrap things in ObjectWritable. When writing, this records the class name with each instance; then, when reading, it constructs an appropriate instance and reads it. It can wrap Writable, String, primitive types, and arrays of these.

Doug
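The mechanism Doug describes can be sketched in plain Java: write the concrete class name before each value, so the reader knows what type to reconstruct. This is a self-contained model of the technique only, not Hadoop's actual ObjectWritable; the names here (TaggedValueDemo, writeTagged, readTagged) are made up for illustration, and the real class handles many more types via reflection.

```java
import java.io.*;

// Model of the ObjectWritable idea: tag each serialized value with its
// class name so heterogeneous values can share one stream.
public class TaggedValueDemo {
    public static void writeTagged(DataOutput out, Object value) throws IOException {
        out.writeUTF(value.getClass().getName());   // record the concrete class
        if (value instanceof Integer) {
            out.writeInt((Integer) value);
        } else if (value instanceof String) {
            out.writeUTF((String) value);
        } else {
            throw new IOException("unsupported type: " + value.getClass());
        }
    }

    public static Object readTagged(DataInput in) throws IOException {
        String className = in.readUTF();            // class name drives the decode
        if (className.equals(Integer.class.getName())) {
            return in.readInt();
        } else if (className.equals(String.class.getName())) {
            return in.readUTF();
        }
        throw new IOException("unsupported type: " + className);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        writeTagged(out, 42);          // an Integer value
        writeTagged(out, "hello");     // followed by a String value

        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
        System.out.println(readTagged(in));  // prints 42
        System.out.println(readTagged(in));  // prints hello
    }
}
```

The cost of this scheme, as with ObjectWritable, is that every record carries the class-name string; Hadoop pays it in exchange for polymorphic values.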
Re: [core-user] Processing binary files Howto??
This sounds very different from your earlier questions.

If you have a moderate (10's to 1000's) number of binary files, then it is very easy to write a special purpose InputFormat that tells hadoop that the file is not splittable. This allows you to add all of the files as inputs to the map step and you will get the locality that you want. The files should be large enough so that you take at least 10 seconds or more processing them to get good performance relative to startup costs. If they are not, then you may want to package them up in a form that can be read sequentially. This need not be splittable, but it would be nice if it were.

If you are producing a single file per hour, then this style works pretty well. In my own work, we have a few compressed and encrypted files each hour that are map-reduced into a more congenial and splittable form each hour. Then subsequent steps are used to aggregate or process the data as needed.

This gives you all of the locality that you were looking for.

On 3/17/08 6:19 AM, "Alfonso Olias Sanz" <[EMAIL PROTECTED]> wrote:
> Hi there.
>
> After reading a bit of the hadoop framework and trying the WordCount example, I have several doubts about how to use map/reduce with binary files.
>
> In my case binary files are generated on a timeline basis, let's say 1 file per hour. The size of each file is different (briefly, we are getting pictures from space and the star density is different between observations). The mappers, rather than receiving the file content, have to receive the file name. I read that if the input files are big (several blocks), they are split among several tasks in same/different node/s (block sizes?). But we want each map task to process a file rather than a block (or a line of a file as in the WordCount sample).
>
> In a previous post I did to this forum, I was recommended to use an input file with all the file names, so the mappers would receive the file name. But there is a drawback related to data location (this was also mentioned), because data then has to be moved from one node to another. Data is not going to be replicated to all the nodes. So a task taskA that has to process fileB on nodeN has to be executed on nodeN. How can we achieve that??? What if a task requires a file that is on another node? Does the framework move the logic to that node? We need to define a URI file map in each node (hostname/path/filename) for all the files. Tasks would access the local URI file map in order to process the files.
>
> Another approach we have thought of is to use the distributed file system to load balance the data among the nodes, and have our processes running on every node (without using the map/reduce framework). Then each process has to access the local node to process the data, using the dfs API (or checking the local URI file map). This approach would be more flexible for us, because depending on the machine (quad-core, dual-core) we know how many Java threads we can run in order to get the maximum performance of the machine. Using the framework we can only say a number of tasks to be executed on every node, but all the nodes have to be the same.
>
> URI file map. Once the files are copied to the distributed file system, we need to create this table map. Or is there a way to access a at the data node and retrieve the files it handles, rather than getting all the files in all the nodes? i.e.
>
> NodeA /tmp/.../mytask/input/fileA-1
> /tmp/.../mytask/input/fileA-2
>
> NodeB /tmp/.../mytask/input/fileB
>
> A process at nodeB listing the /tmp/.../input directory would get only fileB.
>
> Any ideas?
> Thanks
> Alfonso.
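The effect of the non-splittable InputFormat Ted describes can be modeled without the Hadoop API at all: a splittable file of size S with block size B becomes roughly ceil(S/B) splits (one map task each), while a non-splittable file is always exactly one split, and hence one map task that Hadoop will try to schedule near the file's blocks. The class below is an illustrative model of that split arithmetic, not Hadoop's FileInputFormat.

```java
// Illustrative model of input-split computation (not the Hadoop API):
// splittable  -> ceil(fileSize / blockSize) splits
// unsplittable -> always one split, i.e. one map task per file.
public class SplitModel {
    public static int numSplits(long fileSize, long blockSize, boolean splittable) {
        if (!splittable || fileSize <= blockSize) {
            return 1;                                            // whole file -> one map task
        }
        return (int) ((fileSize + blockSize - 1) / blockSize);   // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println(numSplits(200 * mb, 64 * mb, true));   // prints 4
        System.out.println(numSplits(200 * mb, 64 * mb, false));  // prints 1
    }
}
```

This is why marking files non-splittable gives each map task exactly one whole binary file, which is the behavior Alfonso is asking for.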
Re: [core-user] Move application to Map/Reduce architecture with Hadoop
Replication is vital in large or even medium-sized clusters for reliability. Replication also helps distribution. On 3/17/08 2:48 AM, "Alfonso Olias Sanz" <[EMAIL PROTECTED]> wrote: > But what I wanted to say was that we need to set up a > cluster in a way that the data is distributed among all the data > nodes. Instead of replicated.
[core-user] Processing binary files Howto??
Hi there. After reading a bit of the hadoop framework and trying the WordCount example, I have several doubts about how to use map/reduce with binary files.

In my case binary files are generated on a timeline basis, let's say 1 file per hour. The size of each file is different (briefly, we are getting pictures from space and the star density is different between observations). The mappers, rather than receiving the file content, have to receive the file name. I read that if the input files are big (several blocks), they are split among several tasks in same/different node/s (block sizes?). But we want each map task to process a file rather than a block (or a line of a file as in the WordCount sample).

In a previous post I did to this forum, I was recommended to use an input file with all the file names, so the mappers would receive the file name. But there is a drawback related to data location (this was also mentioned), because data then has to be moved from one node to another. Data is not going to be replicated to all the nodes. So a task taskA that has to process fileB on nodeN has to be executed on nodeN. How can we achieve that??? What if a task requires a file that is on another node? Does the framework move the logic to that node? We need to define a URI file map in each node (hostname/path/filename) for all the files. Tasks would access the local URI file map in order to process the files.

Another approach we have thought of is to use the distributed file system to load balance the data among the nodes, and have our processes running on every node (without using the map/reduce framework). Then each process has to access the local node to process the data, using the dfs API (or checking the local URI file map). This approach would be more flexible for us, because depending on the machine (quad-core, dual-core) we know how many Java threads we can run in order to get the maximum performance of the machine. Using the framework we can only say a number of tasks to be executed on every node, but all the nodes have to be the same.

URI file map. Once the files are copied to the distributed file system, we need to create this table map. Or is there a way to access a at the data node and retrieve the files it handles, rather than getting all the files in all the nodes? i.e.

NodeA /tmp/.../mytask/input/fileA-1
/tmp/.../mytask/input/fileA-2

NodeB /tmp/.../mytask/input/fileB

A process at nodeB listing the /tmp/.../input directory would get only fileB.

Any ideas?
Thanks
Alfonso.
Re: [core-user] Move application to Map/Reduce architecture with Hadoop
Hi Stu thanks for the link. Well, I should have been more precise. I do not think we will have 1 millon files. But what I wanted to say was that we need to set up a cluster in a way that the data is distributed among all the data nodes. Instead of replicated. Thats the first step, the second is to think in fault tolreance/replication of the data nodes. Rui pointed to have a look at the "distcp" I will also have a look at the number of mappers per node. Thanks Alfonso On 15/03/2008, Stu Hood <[EMAIL PROTECTED]> wrote: > Hello Alfonso, > > A Hadoop namenode with a reasonable amount of memory (4-8GB) should be able > to handle a few million files (but I can't find a reference to back that > statement up: anyone?). > > Unfortunately, the Hadoop jobtracker is not currently capable of handling > jobs with that many inputs. See > https://issues.apache.org/jira/browse/HADOOP-2119 > > The only approach I know of currently (without modifying your input files) > is the following, but it has the nasty side-effect of losing all input > locality for the tasks: > > Do all of your processing in the Map tasks, and implement them a lot like > the distcp tool is implemented: the input for your Map task is a file > containing a list of your files to be processed, one per line. You then use > the Hadoop Filesystem API to read and process the files, and write your > outputs. You would then set the number of Reducers for the job to 0, since > you won't have any key->value output from the Mappers. > > Thanks, > Stu > > > > -Original Message- > From: Alfonso Olias Sanz <[EMAIL PROTECTED]> > Sent: Friday, March 14, 2008 8:05pm > To: core-user@hadoop.apache.org > Subject: [core-user] Move application to Map/Reduce architecture with Hadoop > > Hi > > I have just started using hadoop and HDFS. I have done the WordCount > test application which gets some input files, process the files, and > generates and output file. 
> I have a similar application that has a million input files and has
> to produce a million output files. The correlation between
> input and output is 1:1.
>
> This process is suitable to run with a Map/Reduce approach, but I
> have several doubts I hope somebody can answer.
>
> *** What should the Reduce function be? Because there is no merge of
> data from the running map processes.
>
> *** I set up a 2-node cluster for the WordCount test. The nodes work as
> master+slave and slave. Before launching the process I copied the files
> using $HADOOP_HOME/bin/hadoop dfs -copyFromLocal
>
> These files were replicated on both nodes. Is there any way to avoid
> the files being replicated to all the nodes and instead have them
> distributed among all the nodes, with no replication of files?
>
> *** During the WordCount test, 25 map jobs were launched! For our
> application that is overkill. We have done several performance tests
> without using Hadoop, and we have seen that we can launch 1
> application per core. So is there any way to configure the cluster
> to launch a given number of tasks per core? Then, depending on
> dual-core or quad-core PCs, the number of running processes would be
> different.
>
> Thanks in advance.
> alf.
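The distcp-style approach Stu describes above boils down to: the map input is a text file listing paths, one per line, and the map body opens each listed file and writes exactly one output per input, with zero reducers. Stripped of the Hadoop plumbing (JobConf, the Mapper interface, the FileSystem API), the per-task loop is roughly the following sketch in plain java.nio; the uppercase transform and the ".out" suffix are placeholders for the real 1:1 processing:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ListDrivenTask {
    // Process every file named in `listing` (one path per line), writing a
    // 1:1 output into `outDir` -- the zero-reducer pattern from the thread,
    // minus the Hadoop plumbing. Returns the number of files processed.
    public static int run(Path listing, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        int processed = 0;
        for (String line : Files.readAllLines(listing)) {
            if (line.trim().isEmpty()) continue;        // skip blank lines
            Path in = Paths.get(line.trim());
            String content = new String(Files.readAllBytes(in));
            // Placeholder "processing": in the real job, your per-file transform
            Files.write(outDir.resolve(in.getFileName() + ".out"),
                        content.toUpperCase().getBytes());
            processed++;
        }
        return processed;
    }
}
```

In the real mapper the java.nio calls would be replaced by the Hadoop FileSystem open/create calls, so the same loop works against HDFS paths; the structure of the loop is unchanged.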
Re: [ANNOUNCE] Hadoop release 0.16.1 available
No trouble at all. Just wanted to make sure I test my code against the appropriate code base and would not hit bugs that were fixed in releases in between.

Thanks
Erwan

On Sun, Mar 16, 2008 at 9:45 PM, stack <[EMAIL PROTECTED]> wrote:
> Erwan:
>
> I just posted a longer note to the hbase-user mailing list on this topic.
>
> In short, the most recent hbase release is that which is bundled as a
> contrib in hadoop 0.16.0. The hbase that is in hadoop 0.16.1 is the
> same as is in the hadoop-0.16.0-hbase.jar. Since the 0.16.0 release,
> hbase graduated out of contrib and became a subproject homed at a
> different location in svn. We also got our own area in JIRA,
> mailing lists, website, etc. (see hbase.org). All development since
> has gone on at the new svn location, not under hadoop at
> src/contrib/hbase. The next hbase release will be 0.1.0, made from the
> hbase 0.1 branch. Our 0.1 branch is intended to ride along atop the
> hadoop 0.16 branch. 0.1.0 will be the hbase that is in hadoop 0.16.0,
> with fixes and some changes to accommodate our new standing as a project
> dependent on, but apart from, our parent hadoop. We hope to put up a
> candidate later this week.
>
> Originally, to cut confusion, the idea was that hbase would quickly
> release a 0.1.0 from our new location, before hadoop 0.16.1 went out,
> and then hbase would be cleaned out of the hadoop 0.16 branch as it has
> been from hadoop TRUNK, but we've been laggards.
>
> We'll do up a better explanation of the transition at hbase.org over the
> next few days. Sorry for any trouble caused in the meantime.
> St.Ack
>
> Erwan Arzur wrote:
> > Hello everybody,
> >
> > I am a bit confused by the release notes (and
> > http://hadoop.apache.org/hbase/releases.html) stating that hadoop and hbase
> > are going to be separate from now on, and:
> >
> > ~/hadoop-0.16.1$ ls -l contrib/hbase/
> > total 700
> > drwxr-xr-x 2 hadoop hadoop   4096 2008-03-15 08:56 bin
> > drwxr-xr-x 2 hadoop hadoop   4096 2008-03-15 08:55 conf
> > -rw-r--r-- 1 hadoop hadoop 693401 2008-03-09 05:43 hadoop-0.16.1-hbase.jar
> > drwxr-xr-x 2 hadoop hadoop   4096 2008-03-15 08:55 lib
> > drwxr-xr-x 6 hadoop hadoop   4096 2008-03-09 05:43 webapps
> >
> > Is the release that shipped with hadoop 0.16.1 nevertheless the equivalent
> > of 0.1.0? Which hbase release should be stable enough for us to use?
> >
> > Thanks in advance,
> >
> > Erwan
> >
> > On Fri, Mar 14, 2008 at 7:55 PM, Nigel Daley <[EMAIL PROTECTED]> wrote:
> >
> >> Release 0.16.1 fixes critical bugs in 0.16.0. Note that HBase
> >> releases are now maintained at http://hadoop.apache.org/hbase/
> >>
> >> For Hadoop release details and downloads, visit:
> >>
> >> http://hadoop.apache.org/core/releases.html
> >>
> >> Thanks to all who contributed to this release!
> >>
> >> Nigel