RE: Stop MR jobs after N records have been produced ?
Well well, hello dear Amir, I see you are also interested in Hadoop :D --- On Fri, 9/5/08, Amir Youssefi [EMAIL PROTECTED] wrote: From: Amir Youssefi [EMAIL PROTECTED] Subject: RE: Stop MR jobs after N records have been produced ? To: core-user@hadoop.apache.org Date: Friday, September 5, 2008, 3:48 AM Also see the following JIRA for more discussion: http://issues.apache.org/jira/browse/HADOOP-3973 I will close this as the new interface will address the issue. - Amir -Original Message- From: Owen O'Malley [mailto:[EMAIL PROTECTED] On Behalf Of Owen O'Malley Sent: Thursday, September 04, 2008 2:13 PM To: core-user@hadoop.apache.org Subject: Re: Stop MR jobs after N records have been produced ? On Sep 4, 2008, at 12:48 PM, Tarandeep Singh wrote: Can I stop Map-Reduce jobs after mappers (or reducers) have produced N records ? You could do this pretty easily by implementing a custom MapRunnable. There is no equivalent for reduces. The interface proposed in HADOOP-1230 would support that kind of application. See: http://svn.apache.org/repos/asf/hadoop/core/trunk/src/mapred/org/apache/hadoop/mapreduce/ Look at the new Mapper and Reducer interfaces. -- Owen
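[Editor's note] For illustration only, here is a minimal, hedged sketch of the custom MapRunnable approach Owen describes, written against the old org.apache.hadoop.mapred API of this era. The map.record.limit property name and the echo-style output are made up for the example, and note that this only bounds the records produced per map task, not across the whole job.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class BoundedMapRunner implements MapRunnable<LongWritable, Text, Text, Text> {

  private long limit;

  public void configure(JobConf job) {
    // "map.record.limit" is a made-up property name for this sketch
    limit = job.getLong("map.record.limit", Long.MAX_VALUE);
  }

  public void run(RecordReader<LongWritable, Text> input,
                  OutputCollector<Text, Text> output,
                  Reporter reporter) throws IOException {
    LongWritable key = input.createKey();
    Text value = input.createValue();
    long produced = 0;
    while (produced < limit && input.next(key, value)) {
      // whatever the real map logic would be; here the line is simply echoed
      output.collect(value, value);
      reporter.progress();
      produced++;
    }
    // once the limit is reached, stop consuming input for this task
  }
}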
Sharing Memory across Map tasks [multiple cores] running in same machine
Can we use something like RAM FS to share static data across map tasks? Scenario: 1) Quad-core machine 2) Two 1-TB disks 3) 8 GB RAM. Now I need ~2.7 GB of RAM per map process to load some static data in memory, using which I would be processing data (CPU-intensive jobs). Can I share memory across mappers on the same machine so that the memory footprint is smaller and I can run more than 4 mappers simultaneously, utilizing all 4 cores? Can we use stuff like RamFS?
Re: Sharing Memory across Map tasks [multiple cores] running in same machine
Hadoop doesn't support this natively, so if you need this kind of functionality you'd need to code your application that way. But I am worried about the race conditions in determining which task should be the first to create the ramfs and load the data. If you can atomically check whether the ramfs has been created and the data loaded, and do the creation/load only if not, then things should work. If atomicity cannot be guaranteed, you might consider this: 1) Run a map-only job that creates the ramfs and loads the data (if your cluster is small you can do this manually); you can use the distributed cache to ship the data you want to load. 2) Run your job that processes the data. 3) Run a third job to delete the ramfs. On 9/5/08 1:29 PM, Amit Kumar Singh [EMAIL PROTECTED] wrote: Can we use something like RAM FS to share static data across map tasks? Scenario: 1) Quad-core machine 2) Two 1-TB disks 3) 8 GB RAM. Now I need ~2.7 GB of RAM per map process to load some static data in memory, using which I would be processing data (CPU-intensive jobs). Can I share memory across mappers on the same machine so that the memory footprint is smaller and I can run more than 4 mappers simultaneously, utilizing all 4 cores? Can we use stuff like RamFS?
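[Editor's note] A rough sketch of the atomic check-then-load idea, not anything Hadoop provides: one task per node wins a File.mkdir() "lock" and populates the ramfs while the other tasks wait on a ready flag. The paths, the already-mounted ramfs, and the loadStaticData() helper are all hypothetical.

import java.io.File;

public class SharedDataLoader {
  private static final File RAMFS_DIR = new File("/mnt/ramfs/static-data");   // hypothetical mount
  private static final File READY_FLAG = new File(RAMFS_DIR, "_READY");
  private static final File LOCK_DIR = new File("/tmp/static-data.lock");

  public static void ensureLoaded() throws Exception {
    if (READY_FLAG.exists()) {
      return;                        // another task on this node already loaded the data
    }
    if (LOCK_DIR.mkdir()) {          // mkdir is atomic: only one task wins the lock
      try {
        loadStaticData(RAMFS_DIR);   // hypothetical: copy from the DistributedCache into ramfs
        READY_FLAG.createNewFile();
      } finally {
        LOCK_DIR.delete();
      }
    } else {
      while (!READY_FLAG.exists()) { // another task is loading; wait for it to finish
        Thread.sleep(1000);
      }
    }
  }

  private static void loadStaticData(File target) {
    // application-specific: read the data from the DistributedCache local copy
    // and write it under the ramfs mount point
  }
}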
Re: Integrate HADOOP and Map/Reduce paradigm into HPC environment
Filippo Spiga wrote: This procedure allows me to - use a persistent HDFS on the whole cluster, placing the namenode on the frontend (always up and running) and datanodes on the other nodes - submit a lot of jobs to the resource manager transparently without any problem and manage job priority/reservation with MAUI as simply as other classical HPC jobs - execute jobtracker and tasktracker services on the nodes chosen by TORQUE (in particular, the first node selected becomes the jobtracker) - store logs for different users in separate directories - run only one job at a time (but probably multiple map/reduce jobs can run together because different jobs use different subsets of nodes) Probably HOD does what I can do with my raw script... it's possible that I haven't understood the user guide well... Filippo, HOD indeed allows you to do all these things, and a little bit more. On the other hand, your script always executes the jobtracker on the first node, which also seems useful to me. It would be nice if you could still try HOD and see if it makes your life simpler in any way. :-) Sorry for my English :-P Regards 2008/9/2 Hemanth Yamijala [EMAIL PROTECTED] Allen Wittenauer wrote: On 8/18/08 11:33 AM, Filippo Spiga [EMAIL PROTECTED] wrote: Well, but I haven't understood how I should configure HOD to work in this manner. For HDFS I follow this sequence of steps - conf/master contains only the master node of my cluster - conf/slaves contains all nodes - I start HDFS using bin/start-dfs.sh Right, fine... Potentially I would allow all nodes to be used for MapReduce. For HOD, which parameters should I set in contrib/hod/conf/hodrc? Should I change only the gridservice-hdfs section? I was hoping the HOD folks would answer this question for you, but they are apparently sleeping. :) Whoops! Sorry, I missed this. Anyway, yes, if you point gridservice-hdfs to a static HDFS, it should use that as the -default- HDFS. That doesn't prevent a user from using HOD to create a custom HDFS as part of their job submission. Allen's answer is perfect. Please refer to http://hadoop.apache.org/core/docs/current/hod_user_guide.html#Using+an+external+HDFS for more information about how to set up the gridservice-hdfs section to use a static or external HDFS.
Re: Hadoop for computationally intensive tasks (no data)
Tenaali Ram wrote: Hi, I am new to hadoop. What I have understood so far is: hadoop is used to process huge data using the map-reduce paradigm. I am working on a problem where I need to perform a large number of computations; most computations can be done independently of each other (so I think each mapper can handle one or more such computations). However, there is no data involved. It's just a number-crunching job. Is it suited to Hadoop? Well, you can have the MR jobs stick data out into the filesystem. So even though they don't start off located near any data, they end up running where the output needs to go. Has anyone used hadoop purely for number crunching? If yes, how should I define input for the job and ensure that computations are distributed to all nodes in the grid? The current scheduler moves work to near where the data sources are, going for the same machine or same rack, looking for a task tracker with a spare slot. There isn't yet any scheduler that worries more about pure computation, where you need to consider current CPU load, memory consumption and power budget - whether your rack is running so hot it's at risk of being shut down, or at least throttled back. That's the kind of scheduling where the e-science and grid toolkit people have the edge. But now that the version of hadoop in SVN has support for plug-in scheduling, someone has the opportunity to write a new scheduler, one that focuses on pure computation...
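[Editor's note] One common way to drive such a job, sketched below under a few assumptions: write a small task-list file with one work unit per line and let each line become its own map task. This assumes your Hadoop version ships org.apache.hadoop.mapred.lib.NLineInputFormat; the file names, the linespermap property spelling, and the placeholder computation are illustrative only.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class CrunchDriver {

  public static class CrunchMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text workUnit,
                    OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      // each input line describes one independent computation; placeholder "work" here
      String result = String.valueOf(workUnit.toString().hashCode());
      out.collect(workUnit, new Text(result));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CrunchDriver.class);
    conf.setJobName("number-crunching");
    conf.setInputFormat(NLineInputFormat.class);                 // one split per line of the task list
    conf.setInt("mapred.line.input.format.linespermap", 1);      // property name as recalled; verify in your release
    FileInputFormat.setInputPaths(conf, new Path("tasks.txt"));  // hypothetical: one work unit per line
    FileOutputFormat.setOutputPath(conf, new Path("results"));
    conf.setMapperClass(CrunchMapper.class);
    conf.setNumReduceTasks(0);                                   // map-only: no shuffle needed
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    JobClient.runJob(conf);
  }
}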
critical name node problem
Hi! My namenode has run out of space, and now I'm getting the following:

08/09/05 09:23:22 WARN dfs.StateChange: DIR* FSDirectory.unprotectedDelete: failed to remove /data_v1/2008/06/26/12/pub1-access-2008-06-26-11_52_07.log.gz because it does not exist
08/09/05 09:23:22 INFO ipc.Server: Stopping server on 9000
08/09/05 09:23:22 ERROR dfs.NameNode: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)
at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:441)
at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:766)
at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:640)
at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:223)
at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80)
at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:274)
at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:255)
at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:133)
at org.apache.hadoop.dfs.NameNode.init(NameNode.java:178)
at org.apache.hadoop.dfs.NameNode.init(NameNode.java:164)
at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:848)
at org.apache.hadoop.dfs.NameNode.main(NameNode.java:857)
08/09/05 09:23:22 INFO dfs.NameNode: SHUTDOWN_MSG:
SHUTDOWN_MSG: Shutting down NameNode at ec2-67-202-42-251.compute-1.amazonaws.com/10.251.39.196

hadoop-0.17.1, btw. What do I do now? Andreas
contrib join package
Hi, Is there any detailed documentation on the org.apache.hadoop.contrib.utils.join package ? I have a simple Join task consisting of 2 input datasets. Each contains tab-separated records. Set1: Record format = field1\tfield2\tfield3\tfield4\tfield5 Set2: Record format = field1\tfield2\tfield3 Join criterion: Set1.field1 = Set2.field1 Output: Set2.field2\tSet1.field2\tSet1.field3\tSet1.field4 The org.apache.hadoop.contrib.utils.join package contains DataJoinMapperBase and DataJoinReducerBase abstract classes, and a TaggedMapOutput class which should be the base class for the mapper output values. But there aren't any examples showing how these classes should be used to implement inner or outer joins in a generic manner. If anybody has used this package and would like to share their experience, please let me know. Thanks, Rahul Sood [EMAIL PROTECTED]
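[Editor's note] A rough, hedged sketch of how these base classes are typically subclassed for a tab-separated join like the one described, assuming the 0.18-era contrib API (generateInputTag / generateTaggedMapOutput / generateGroupKey on the mapper side, combine on the reducer side). The tag names, the file-name test, and the field selection are illustrative only and should be checked against the contrib source and its examples directory.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.contrib.utils.join.DataJoinMapperBase;
import org.apache.hadoop.contrib.utils.join.DataJoinReducerBase;
import org.apache.hadoop.contrib.utils.join.TaggedMapOutput;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class TabJoin {

  // Wraps one tab-separated record plus a tag saying which data set it came from.
  public static class TaggedRecord extends TaggedMapOutput {
    private Text data = new Text();

    public TaggedRecord() {
      this.tag = new Text("");
    }

    public TaggedRecord(Text data) {
      this.tag = new Text("");
      this.data = data;
    }

    public Writable getData() { return data; }

    public void write(DataOutput out) throws IOException {
      tag.write(out);
      data.write(out);
    }

    public void readFields(DataInput in) throws IOException {
      tag.readFields(in);
      data.readFields(in);
    }
  }

  public static class JoinMapper extends DataJoinMapperBase {
    // Tag each record with the data set it came from, derived from the file name.
    protected Text generateInputTag(String inputFile) {
      return new Text(inputFile.contains("set1") ? "Set1" : "Set2");
    }

    // The join key is field1, the first tab-separated column in both sets.
    protected Text generateGroupKey(TaggedMapOutput record) {
      String line = ((Text) record.getData()).toString();
      return new Text(line.split("\t")[0]);
    }

    protected TaggedMapOutput generateTaggedMapOutput(Object value) {
      TaggedRecord record = new TaggedRecord(new Text(value.toString()));
      record.setTag(this.inputTag);   // inputTag is set per input file by the base class
      return record;
    }
  }

  public static class JoinReducer extends DataJoinReducerBase {
    // Called with one record per tag for each combination within a group key;
    // returning null drops the combination (giving inner-join semantics).
    protected TaggedMapOutput combine(Object[] tags, Object[] values) {
      if (tags.length < 2) {
        return null;                  // need a record from both Set1 and Set2
      }
      StringBuilder joined = new StringBuilder();
      for (int i = 0; i < values.length; i++) {
        String line = ((Text) ((TaggedMapOutput) values[i]).getData()).toString();
        String[] fields = line.split("\t");
        for (int j = 1; j < fields.length; j++) {   // skip the join key itself
          if (joined.length() > 0) {
            joined.append("\t");
          }
          joined.append(fields[j]);   // real code would pick exactly Set2.field2, Set1.field2..field4
        }
      }
      TaggedRecord result = new TaggedRecord(new Text(joined.toString()));
      result.setTag((Text) tags[0]);
      return result;
    }
  }
}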
Re: JVM Spawning
LocalJobRunner allows you to test your code with everything running in a single JVM. Just set mapred.job.tracker=local. Doug Ryan LeCompte wrote: I see... so there really isn't a way for me to test a map/reduce program using a single node without incurring the overhead of upping/downing JVMs... My input is broken up into 5 text files. Is there a way I could start the job such that it only uses 1 map to process the whole thing? I guess I'd have to concatenate the files into 1 file and somehow turn off splitting? Ryan On Wed, Sep 3, 2008 at 12:09 AM, Owen O'Malley [EMAIL PROTECTED] wrote: On Sep 2, 2008, at 9:00 PM, Ryan LeCompte wrote: Beginner's question: If I have a cluster with a single node that has a max of 1 map/1 reduce, and the job submitted has 50 maps... Then it will process only 1 map at a time. Does that mean that it's spawning 1 new JVM for each map processed? Or re-using the same JVM when a new map can be processed? It creates a new JVM for each task. Devaraj is working on https://issues.apache.org/jira/browse/HADOOP-249 which will allow the JVMs to run multiple tasks sequentially. -- Owen
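[Editor's note] For reference, a minimal sketch of what Doug's suggestion looks like in a job driver, assuming the JobConf API of this era. The class name is a placeholder and the fs.default.name line is optional.

import org.apache.hadoop.mapred.JobConf;

public class LocalTestDriver {
  public static void main(String[] args) {
    JobConf conf = new JobConf(LocalTestDriver.class);
    conf.set("mapred.job.tracker", "local");   // use LocalJobRunner: tasks run inside this JVM
    conf.set("fs.default.name", "file:///");   // optional: read/write the local filesystem too
    // ... configure mapper/reducer/input/output as usual, then JobClient.runJob(conf) ...
  }
}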
Re: critical name node problem
Ok, googling around a little bit, the solution seems to be to either delete the edits file, which in my case would be non-cool (24 MB worth of edits in there), or truncate it correctly. So I used the following script to figure out how much data needs to be dropped:

LEN=25497570
while true
do
  dd if=edits.org of=edits bs=$LEN count=1
  time hadoop namenode
  if [[ $? -ne 255 ]]
  then
    echo $LEN seems to have worked.
    exit 0
  fi
  LEN=$(expr $LEN - 1)
done

Guess something like this might make sense to add to http://wiki.apache.org/hadoop/TroubleShooting; not everyone will be able to figure out how to get rid of the last incomplete record. Another idea would be a tool or namenode startup mode that would make it ignore EOFExceptions to recover as much of the edits as possible. Andreas

On Friday 05 September 2008 13:30:34 Andreas Kostyrka wrote: Hi! My namenode has run out of space, and now I'm getting the following: 08/09/05 09:23:22 WARN dfs.StateChange: DIR* FSDirectory.unprotectedDelete: failed to remove /data_v1/2008/06/26/12/pub1-access-2008-06-26-11_52_07.log.gz because it does not exist 08/09/05 09:23:22 INFO ipc.Server: Stopping server on 9000 08/09/05 09:23:22 ERROR dfs.NameNode: java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:180) at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106) at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90) at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:441) at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:766) at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:640) at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:223) at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80) at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:274) at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:255) at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:133) at org.apache.hadoop.dfs.NameNode.init(NameNode.java:178) at org.apache.hadoop.dfs.NameNode.init(NameNode.java:164) at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:848) at org.apache.hadoop.dfs.NameNode.main(NameNode.java:857) 08/09/05 09:23:22 INFO dfs.NameNode: SHUTDOWN_MSG: SHUTDOWN_MSG: Shutting down NameNode at ec2-67-202-42-251.compute-1.amazonaws.com/10.251.39.196 hadoop-0.17.1, btw. What do I do now? Andreas
Re: runtime change of hadoop.tmp.dir
Did you succeed by changing it in the config file? 2008/9/4, Deepak Diwakar [EMAIL PROTECTED]: Hi I am using hadoop in standalone mode. I want to change hadoop.tmp.dir at runtime. I found that there is a -D option on the command line to set a configurable parameter, but I did not succeed. It would be really helpful if somebody could give the exact syntax for that. Thanks regards, -- - Deepak Diwakar,
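[Editor's note] One detail worth noting: the -D option is handled by GenericOptionsParser, so it only takes effect if the job driver goes through ToolRunner. A minimal sketch, assuming the 0.17/0.18-era API; MyJobDriver and the echoed property are just for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJobDriver extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    // anything passed as -D key=value on the command line is already in getConf() here
    JobConf conf = new JobConf(getConf(), MyJobDriver.class);
    System.out.println("hadoop.tmp.dir = " + conf.get("hadoop.tmp.dir"));
    // ... configure and run the actual job with this conf ...
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // e.g.: hadoop jar myjob.jar MyJobDriver -D hadoop.tmp.dir=/other/tmp <input> <output>
    System.exit(ToolRunner.run(new Configuration(), new MyJobDriver(), args));
  }
}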
Re: Timeouts at reduce stage
:) 2008/9/5, Doug Cutting [EMAIL PROTECTED]: Jason Venner wrote: We have modified the /main/ that launches the children of the task tracker to explicitly exit, in its finally block. That helps substantially. Have you submitted this as a patch? Doug
Re: critical name node problem
On 9/5/08 5:53 AM, Andreas Kostyrka [EMAIL PROTECTED] wrote: Another idea would be a tool or namenode startup mode that would make it ignore EOFExceptions to recover as much of the edits as possible. We clearly need to change the how-to-configure docs to make sure people put at least two directories on two different storage systems for dfs.name.dir. This problem seems to happen quite often, and having two+ dirs helps protect against it. We recently had one of the disks on one of our copies go bad. The system kept going just fine until we had a chance to reconfigure the name node. That said, I've just filed HADOOP-4080 to help alert admins in these situations.
Re: contrib join package
Please look at the examples in the source directory: http://svn.apache.org/repos/asf/hadoop/core/trunk/src/contrib/data_join/src/examples/org/apache/hadoop/contrib/utils/join/ Unfortunately, I don't know of any other documentation on it. -- Owen
Custom Writables
Hello, Can a custom Writable object used as a key/value contain other Writables, like MapWritable? Thanks, Ryan
Re: Custom Writables
Yes, it is pretty easy to compose Writables. Just have the write and readFields call the nested types. Look at the way that JobStatus's write method calls jobid.write: http://svn.apache.org/repos/asf/hadoop/core/trunk/src/mapred/org/apache/hadoop/mapred/JobStatus.java -- Owen
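[Editor's note] To make the pattern concrete, a small hedged sketch of a composite Writable along the lines Owen describes; the class and field names are made up, and the only point is that write() and readFields() delegate to the nested types in a fixed order.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class UserRecord implements Writable {
  private Text userId = new Text();
  private MapWritable attributes = new MapWritable();

  public void write(DataOutput out) throws IOException {
    userId.write(out);        // serialize each nested Writable in a fixed order
    attributes.write(out);
  }

  public void readFields(DataInput in) throws IOException {
    userId.readFields(in);    // deserialize in the same order
    attributes.readFields(in);
  }
}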
Re: Custom Writables
On Fri, Sep 5, 2008 at 10:18 AM, Ryan LeCompte [EMAIL PROTECTED] wrote: Thanks! Quick question on that particular class: why are the methods synchronized? I didn't think that key/value objects needed to be thread safe? I picked a simple example out of the map/reduce framework to demonstrate the approach. The framework has a lot of threads and so needs synchronization on its own data types. Your maps and reduces are only called by a single thread and so probably don't need synchronization. (The framework won't call readFields/write on your types from multiple threads either...) -- Owen
Re: Compare data on HDFS side
On 4-Sep-08, at 6:51 AM, Andrey Pankov wrote: Hello, Does anyone know whether it is possible to compare data on HDFS while avoiding copying the data to a local box? I mean, if I'd like to find the difference between local text files I can use the diff command. If the files are on HDFS then I have to get them from HDFS to the local box and only then do the diff. Copying files to the local fs is a bit annoying and could be problematic when files are huge, say 2-5 GB. You could always do this as a mapreduce task. diff --brief is trivial, actually finding the diffs is left as an exercise for the reader :) I'm currently doing a line-oriented diff of two files where the order of the lines is unimportant, so I just have my reducer flag lines that show up an odd number of times. Karl Anderson [EMAIL PROTECTED] http://monkey.org/~kra
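[Editor's note] A hedged sketch of the odd-count approach Karl describes, against the old API of this era: map each line to a count of one, sum counts across both inputs in the reducer, and emit only lines whose total is odd (i.e. the two files differ on that line). Class names are made up; both files would be given as inputs to the same job, and the trick has the same limitations Karl's description implies (order-insensitive, parity-based).

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class OddLineDiff {

  public static class LineMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      out.collect(line, ONE);                // every line counts once, whichever file it came from
    }
  }

  public static class ParityReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text line, Iterator<IntWritable> counts,
                       OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      int total = 0;
      while (counts.hasNext()) {
        total += counts.next().get();
      }
      if (total % 2 != 0) {                  // line appears an odd number of times: the files differ here
        out.collect(line, new IntWritable(total));
      }
    }
  }
}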
Hadoop + Elastic Block Stores
Hello, I was wondering if anyone has gotten far at all with getting Hadoop up and running with EC2 + EBS? Any luck getting this to work in a way that the HDFS runs on the EBS so that it isn't blown away every time you bring up/down the EC2 Hadoop cluster? I'd like to experiment with this next, and was curious if anyone had any luck. :) Thanks! Ryan
Re: rsync on 2 HDFS
Hi Deepika, We have a utility called distcp - distributed copy. Note that distcp itself is different from rsync. However, distcp -delete is similar to rsync --delete. distcp -delete is a new feature in 0.19. See HADOOP-3939. For more details about distcp, see http://hadoop.apache.org/core/docs/r0.18.0/distcp.html (the doc is for 0.18, so it won't mention distcp -delete. The 0.19 doc will be updated in HADOOP-3942.) Nicholas Sze - Original Message From: Deepika Khera [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Friday, September 5, 2008 2:42:09 PM Subject: rsync on 2 HDFS Hi, I wanted to do an rsync --delete between data in 2 HDFS system directories. Do we have a utility that could do this? I am aware that HDFS does not allow partial writes. An alternative would be to write a program to generate the list of differences in paths and then use distcp to copy the files and delete the appropriate files. Any pointers to implementations (or partial implementations)? Thanks, Deepika
How do specify certain IP to be used by datanode/namenode
Hi, The machines I am using each have multiple network cards, and hence multiple IP addresses. Some of them look like a default IP. However, I wish to use some of the other IPs, and tried to modify the hadoop-site.xml, masters, and slaves files. But it looks like the IP still jumps to the default ones when hadoop runs. Does anyone have an idea how I could possibly make it work? Thank you! -Kevin
Re: Master Recommended Hardware
Camilo, See http://wiki.apache.org/hadoop/NameNode And see the discussion "NameNode Hardware specs" started here: http://www.mail-archive.com/core-user@hadoop.apache.org/msg04109.html This should give you the basics. Regards, J-D On Fri, Sep 5, 2008 at 10:31 PM, Camilo Gonzalez [EMAIL PROTECTED] wrote: Hi! Just wondering, there have been discussions/articles about Recommended Hardware for the Cluster Nodes (ex: http://wiki.apache.org/hadoop/MachineScaling). Have there been any discussions/articles about Recommended Hardware Configurations for the Master (Namenode) machine? (Any links to previous discussions are appreciated.) My guess is that this machine should have much more memory than the data nodes, but I don't know if this should necessarily be true. Thanks in advance, -- Camilo A. Gonzalez Ing. de Sistemas Tel: 300 657 96 96
Re: Hadoop + Elastic Block Stores
Ryan, I currently have a Hadoop/HBase setup that uses EBS. It works, but using EBS implies an additional configuration overhead (too bad you can't spawn instances with volumes already attached to them, though I'm sure that'll come). Shutting down instances and bringing others up also requires more micro-management, but I think Tom White wrote about it and there was a link to it in another discussion you were part of. Hope this helps, J-D On Fri, Sep 5, 2008 at 7:00 PM, Ryan LeCompte [EMAIL PROTECTED] wrote: Hello, I was wondering if anyone has gotten far at all with getting Hadoop up and running with EC2 + EBS? Any luck getting this to work in a way that the HDFS runs on the EBS so that it isn't blown away every time you bring up/down the EC2 Hadoop cluster? I'd like to experiment with this next, and was curious if anyone had any luck. :) Thanks! Ryan
Re: How do specify certain IP to be used by datanode/namenode
Kevin, Did you try changing the dfs.datanode.dns.interface/dfs.datanode.dns.nameserver/mapred.tasktracker.dns.interface/mapred.tasktracker.dns.nameserver parameters? J-D On Fri, Sep 5, 2008 at 8:14 PM, Kevin [EMAIL PROTECTED] wrote: Hi, The machines I am using each has multiple network cards, and hence multiple IP addresses. Some of them look like a default IP. However, I wish to use some other IPs, and tried to modify the hadoop-site.xml, masters, and slaves files. But it looks that the IP still jumps to the default ones when hadoop runs. Does anyone have an idea how I could possibly make it work? Thank you! -Kevin
Re: help! how can i control special data to specific datanode?
Hi, I suggest that you read how data is stored in HDFS, see http://hadoop.apache.org/core/docs/r0.18.0/hdfs_design.html J-D On Sat, Sep 6, 2008 at 12:11 AM, ZhiHong Fu [EMAIL PROTECTED] wrote: hello. I'm a new user of hadoop, and now I have a problem understanding HDFS. Here is the scenario: I have several databases and want to index them. When I run the indexing over a database, I have to control which database index is stored on which datanode, so that when the database is updated, the index can be updated correspondingly. How can I get this done? thanks
Re: Hadoop + Elastic Block Stores
Good to know that you got it up and running. I'd really love to one day see some scripts under src/contrib/ec2/bin that can setup/mount the EBS volumes automatically. :-) On Sep 5, 2008, at 11:38 PM, Jean-Daniel Cryans [EMAIL PROTECTED] wrote: Ryan, I currently have a Hadoop/HBase setup that uses EBS. It works but using EBS implied an additional overhead of configuration (too bad you can't spawn instances with volumes already attached to it tho I'm sure that'll come). Shutting down instances and bringing others up also requires more micro-management but I think Tom White wrote about it and there was a link to it in another discussion you were part of. Hope this helps, J-D On Fri, Sep 5, 2008 at 7:00 PM, Ryan LeCompte [EMAIL PROTECTED] wrote: Hello, I was wondering if anyone has gotten far at all with getting Hadoop up and running with EC2 + EBS? Any luck getting this to work in a way that the HDFS runs on the EBS so that it isn't blown away every time you bring up/down the EC2 Hadoop cluster? I'd like to experiment with this next, and was curious if anyone had any luck. :) Thanks! Ryan