Polymorphic behavior of Maps in One Job?
If I have two input format classes set up, both giving different name and value pairs, is it possible to configure multiple map and reduce classes in one job based on the different key/value pairs? If I overload the map() method, does the framework call the overloads polymorphically based on the varying parameters (the key and the value), or do we need separate classes? For adding multiple mappers I am thinking of using MultipleInputs.addInputPath(JobConf conf, Path path, Class<? extends InputFormat> inputFormatClass, Class<? extends Mapper> mapperClass) to add the mappers and my input formats, and the MultipleOutputs class to configure the output from the mappers. If this is right, where do I add the multiple implementations for the reducers in the JobConf?
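For reference, a minimal sketch of what the MultipleInputs wiring might look like with the 0.19 mapred API; FormatAMapper, FormatBMapper, JoinReducer and the paths below are hypothetical placeholders, not classes from the original job. Note that a job still carries a single reducer class on the JobConf, so MultipleInputs varies the mapper per input path but the merged map output all flows into that one reducer.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.MultipleInputs;

public class TwoInputJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(TwoInputJob.class);
    conf.setJobName("two-input-job");

    // One Mapper class per input format (FormatAMapper / FormatBMapper are
    // hypothetical user classes, and the paths are illustrative); the
    // framework selects the mapper by the class registered for each path,
    // not by method overloading.
    MultipleInputs.addInputPath(conf, new Path("/data/formatA"),
        TextInputFormat.class, FormatAMapper.class);
    MultipleInputs.addInputPath(conf, new Path("/data/formatB"),
        KeyValueTextInputFormat.class, FormatBMapper.class);

    // Both mappers must emit the same intermediate key/value types, because
    // a single reducer class (JoinReducer, also hypothetical) consumes the
    // merged map output.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setReducerClass(JoinReducer.class);
    FileOutputFormat.setOutputPath(conf, new Path("/out/joined"));

    JobClient.runJob(conf);
  }
}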
Re: best practice: mapred.local vs dfs drives
Thanks for the heads-up. C Owen O'Malley wrote: We always share the drives. -- Owen On Apr 5, 2009, at 0:52, zsongbo zson...@gmail.com wrote: I usually set mapred.local.dir to share the disk space with DFS, since some MapReduce jobs need big temp space. On Fri, Apr 3, 2009 at 8:36 PM, Craig Macdonald cra...@dcs.gla.ac.uk wrote: Hello all, Following recent hardware discussions, I thought I'd ask a related question. Our cluster nodes have 3 drives: 1x 160GB system/scratch and 2x 500GB DFS drives. The 160GB system drive is partitioned such that 100GB is for job mapred.local space. However, we find that for our application, mapred.local free space for map output is the limiting parameter on the number of reducers we can have (our application prefers fewer reducers). How do people normally handle DFS vs mapred.local space? Do you (a) share the DFS drives with the task tracker temporary files, or do you (b) keep them on separate partitions or drives? We originally went with (b) because it prevented a run-away job from eating all the DFS space on the machine; however, I'm beginning to realise the disadvantages. Any comments? Thanks Craig
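For option (a), the relevant bit of hadoop-site.xml would look roughly like the sketch below; the mount points are illustrative only, not Craig's actual layout. The idea is simply that mapred.local.dir lists directories on the same disks that dfs.data.dir uses.

<property>
  <name>mapred.local.dir</name>
  <!-- comma-separated list; illustrative mount points - task-tracker temp space shares the two DFS drives -->
  <value>/mnt/disk1/mapred/local,/mnt/disk2/mapred/local</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <!-- illustrative mount points for the DataNode block directories -->
  <value>/mnt/disk1/dfs/data,/mnt/disk2/dfs/data</value>
</property>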
Re: RPM spec file for 0.19.1
Simon Lewis si...@lewis.li writes: On 3 Apr 2009, at 15:11, Ian Soboroff wrote: Steve Loughran ste...@apache.org writes: I think from your perspective it makes sense as it stops anyone getting itchy fingers and doing their own RPMs. Um, what's wrong with that? I would certainly like the ability to build RPMs from a source checkout; has anyone thought of putting a standard spec file in with the source somewhere? Another vote for a .spec file to be included in the standard distribution as a contrib. If it's OK with Cloudera (since my spec file just came from them), I will edit my JIRA to offer that proposal. If it's Cloudera's spec that's included, we should also include the init.d script templates (which are already Apache licensed). Ian
Hadoop Reduce Job errors, job gets killed.
Hi, My Hadoop Map/Reduce job is giving the following error message right about when it is 95% complete with the reduce step on one node, and the process gets killed. The error messages from the logs are noted below. *java.io.IOException: Filesystem closed* -- any ideas please?
2009-04-06 10:41:07,202 INFO org.apache.hadoop.streaming.PipeMapRed: Records R/W=10263370/642860
2009-04-06 10:41:17,203 INFO org.apache.hadoop.streaming.PipeMapRed: Records R/W=10263370/1033247
2009-04-06 10:41:27,437 INFO org.apache.hadoop.streaming.PipeMapRed: Records R/W=10263370/1844222
2009-04-06 10:41:37,438 INFO org.apache.hadoop.streaming.PipeMapRed: Records R/W=10263370/2884839
2009-04-06 10:41:44,350 WARN org.apache.hadoop.streaming.PipeMapRed: java.io.IOException: Filesystem closed
    at org.apache.hadoop.dfs.DFSClient.checkOpen(DFSClient.java:166)
    at org.apache.hadoop.dfs.DFSClient.access$500(DFSClient.java:58)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.writeChunk(DFSClient.java:2104)
    at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:141)
    at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:124)
    at org.apache.hadoop.fs.FSOutputSummer.write1(FSOutputSummer.java:112)
    at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:86)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:41)
    at java.io.DataOutputStream.write(DataOutputStream.java:90)
    at org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.writeObject(TextOutputFormat.java:72)
    at org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.write(TextOutputFormat.java:87)
    at org.apache.hadoop.mapred.ReduceTask$2.collect(ReduceTask.java:315)
    at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:346)
2009-04-06 10:41:44,478 INFO org.apache.hadoop.streaming.PipeMapRed: MRErrorThread done
2009-04-06 10:41:44,478 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed.waitOutputThreads(): subprocess failed with code 141 in org.apache.hadoop.streaming.PipeMapRed
2009-04-06 10:41:44,480 INFO org.apache.hadoop.streaming.PipeMapRed: mapRedFinished
2009-04-06 10:41:44,480 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed.waitOutputThreads(): subprocess failed with code 141 in org.apache.hadoop.streaming.PipeMapRed
2009-04-06 10:41:44,481 WARN org.apache.hadoop.mapred.TaskTracker: Error running child java.io.IOException: Filesystem closed
    at org.apache.hadoop.dfs.DFSClient.checkOpen(DFSClient.java:166)
    at org.apache.hadoop.dfs.DFSClient.access$500(DFSClient.java:58)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.flush(DFSClient.java:2176)
    at java.io.FilterOutputStream.flush(FilterOutputStream.java:123)
    at java.io.DataOutputStream.flush(DataOutputStream.java:106)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:66)
    at org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:99)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:340)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2084)
Re: Using HDFS to serve www requests
Indeed, it would be a very nice interface to have (if anyone has some free time)! I know a few Caltech people who'd like to see how their WAN transfer product (http://monalisa.cern.ch/FDT/) would work with HDFS; if there were an HDFS NIO interface, playing around with HDFS and FDT would be fairly trivial. Brian On Apr 3, 2009, at 5:16 AM, Steve Loughran wrote: Snehal Nagmote wrote: can you please explain exactly what adding an NIO bridge means and how it can be done, and what the advantages would be in this case? NIO: Java non-blocking IO. It's a standard API to talk to different filesystems; support has been discussed in jira. If the DFS APIs were accessible under an NIO front end, then applications written for the NIO APIs would work with the supported filesystems, with no need to code specifically for hadoop's not-yet-stable APIs Steve Loughran wrote: Edward Capriolo wrote: It is a little more natural to connect to HDFS from Apache Tomcat. This will allow you to skip the FUSE mounts and just use the HDFS API. I have modified this code to run inside Tomcat. http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample I will not testify to how well this setup will perform under internet traffic, but it does work. If someone adds an NIO bridge to hadoop filesystems then it would be easier, leaving you only with the performance issues.
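The wiki example referenced above is essentially the plain FileSystem API; a minimal read-side sketch is below. The namenode URI and file path are illustrative, and inside Tomcat the Configuration would normally be loaded from the cluster's hadoop-site.xml rather than set by hand.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Point the client at the cluster; normally picked up from hadoop-site.xml
    // on the classpath rather than hard-coded like this (illustrative URI).
    conf.set("fs.default.name", "hdfs://namenode:9000");

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/web/page.html");   // illustrative path

    // Stream the file; in a servlet this would be copied to the HTTP response.
    FSDataInputStream in = fs.open(file);
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    String line;
    while ((line = reader.readLine()) != null) {
      System.out.println(line);
    }
    reader.close();
  }
}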
Re: Amazon Elastic MapReduce
Are intermediate results stored in S3 as well? Also, any plans to support HTable? Chris K Wensel-2 wrote: FYI Amazon's new Hadoop offering: http://aws.amazon.com/elasticmapreduce/ And Cascading 1.0 supports it: http://www.cascading.org/2009/04/amazon-elastic-mapreduce.html cheers, ckw -- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/
Re: Amazon Elastic MapReduce
Intermediate results can be stored in HDFS on the EC2 machines, or in S3 using s3n... performance is better if you store on HDFS: -input, s3n://elasticmapreduce/samples/similarity/lastfm/input/, -output, hdfs:///home/hadoop/output2/, On Mon, Apr 6, 2009 at 11:27 AM, Patrick A. patrickange...@gmail.com wrote: Are intermediate results stored in S3 as well? Also, any plans to support HTable? Chris K Wensel-2 wrote: FYI Amazon's new Hadoop offering: http://aws.amazon.com/elasticmapreduce/ And Cascading 1.0 supports it: http://www.cascading.org/2009/04/amazon-elastic-mapreduce.html cheers, ckw -- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/ -- Peter N. Skomoroch 617.285.8348 http://www.datawrangling.com http://delicious.com/pskomoroch http://twitter.com/peteskomoroch
Reduce task attempt retry strategy
Hi, I had a flaky machine the other day that was still accepting jobs and sending heartbeats, but caused all reduce task attempts to fail. This in turn caused the whole job to fail because the same reduce task was retried 3 times on that particular machine. Perhaps I'm confusing this with the block placement strategy in hdfs, but I always thought that the framework would retry jobs on a different machine if retries on the original machine keep failing. E.g. I would have expected to retry once or twice on the same machine, but then switch to a different one to minimize the likelihood of getting stuck on a bad machine. What is the expected behavior in 0.19.1 (which I'm running)? Any plans for improving on this in the future? Thanks, Stefan
problem running on a cluster of mixed hardware due to Incompatible buildVersion of JobTracker and TaskTracker
I am using a cluster of mixed hardware, 32-bit and 64-bit machines, to run Hadoop 0.18.3. I can't use the distribution tar ball since I need to apply a couple of patches. So I build my own Hadoop binaries after applying the patches that I need. I build two copies, one for 32-bit machines and one for 64-bit machines. I am having problems starting the TaskTrackers that are not the same hardware type as the JobTracker. I get the Incompatible buildVersion error because the compile time is part of the buildVersion:
JobTracker's: 0.18.3 from by httpd on Mon Apr 6 07:35:15 PDT 2009
TaskTracker's: 0.18.3 from by httpd on Mon Apr 6 07:34:56 PDT 2009
Any advice on how I can get around this problem? Is there a way to build a single version of Hadoop that will run on both 32-bit and 64-bit machines? I notice that there are some native libraries under $HADOOP_HOME/lib/native/Linux-amd64-64 and $HADOOP_HOME/lib/native/Linux-i386-32. Do I need to compile my own version of those libraries since I am applying patches to the distribution? I hope I don't have to hack the code to take the compile time out of buildVersion. Thanks in advance for your help. Bill
Job tracker not responding during streaming job
I am running Hadoop streaming. After around 42 jobs on an 18-node cluster, the jobtracker stops responding. This happens on normally-working code. Here are the symptoms.
1. A job is running, but it pauses with reduce stuck at XX%
2. hadoop job -list hangs or takes a very long time to return
3. In the Ganglia metrics on the Jobtracker node:
   a. jvm.metrics__JobTracker__gcTimeMillis rises above 20 k (20 seconds) before failure
   b. jvm.metrics__JobTracker__memHeapUsedM rises above 600 before failure
   c. jvm.metrics__JobTracker__gcCount rises above 1 k before failure
The ticker looks like this.
09/04/06 03:06:28 INFO streaming.StreamJob: map 24% reduce 7%
09/04/06 03:13:44 INFO streaming.StreamJob: map 25% reduce 7%
After the 03:13:44 line, it hangs for more than 15 minutes. In the jobtracker log, I see this.
2009-04-04 04:19:13,563 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-8143535428142072268_95993 failed because recovery from primary datanode 10.1.0.156:50010 failed 4 times. Will retry...
After restarting both dfs and mapreduce on all nodes, the problem goes away, and the formerly non-working job proceeds without failure. Does anyone else see this problem? David Kellogg
Re: problem running on a cluster of mixed hardware due to Incompatible buildVersion of JobTracker and TaskTracker
Hey Bill, I might be giving you bad advice (I've only verified this for HDFS components on the 0.19.x branch, not the JT/TT or the 0.18.x branch), but... In my understanding, Hadoop only compares the base SVN revision number, not the build strings. Make sure that both have the SVN rev. Our build machines don't have svn installed, so we actually have to generate the correct SVN rev, then submit it to the build clusters. We certainly don't have our build-time strings matching up. Hope this helps! Brian On Apr 6, 2009, at 4:46 PM, Bill Au wrote: I am using a cluster of mixed hardware, 32-bit and 64-bit machines, to run Hadoop 0.18.3. I can't use the distribution tar ball since I need to apply a couple of patches. So I build my own Hadoop binaries after applying the patches that I need. I build two copies, one for 32-bit machines and one for 64-bit machines. I am having problems starting the TaskTrackers that are not the same hardware type as the JobTracker. I get the Incompatible buildVersion error because the compile time is part of the buildVersion: JobTracker's: 0.18.3 from by httpd on Mon Apr 6 07:35:15 PDT 2009 TaskTracker's: 0.18.3 from by httpd on Mon Apr 6 07:34:56 PDT 2009 Any advice on how I can get around this problem? Is there a way to build a single version of Hadoop that will run on both 32-bit and 64-bit machines? I notice that there are some native libraries under $HADOOP_HOME/lib/native/Linux-amd64-64 and $HADOOP_HOME/lib/native/Linux-i386-32. Do I need to compile my own version of those libraries since I am applying patches to the distribution? I hope I don't have to hack the code to take the compile time out of buildVersion. Thanks in advance for your help. Bill
Re: problem running on a cluster of mixed hardware due to Incompatible buildVersion of JobTracker and TaskTracker
On Mon, Apr 6, 2009 at 4:01 PM, Brian Bockelman bbock...@cse.unl.edu wrote: Hey Bill, I might be giving you bad advice (I've only verified this for HDFS components on the 0.19.x branch, not the JT/TT or the 0.18.x branch), but... In my understanding, Hadoop only compares the base SVN revision number, not the build strings. Make sure that both have the SVN rev. Our build machines don't have svn installed, so we actually have to generate the correct SVN rev, then submit it to the build clusters. We certainly don't have our build time strings match up. Nope - the JT/TT definitely do verify the entirety of the build string (at least in the case when not built from SVN). I saw the same behavior as Bill is reporting last week on 0.18.3. -Todd
Re: problem running on a cluster of mixed hardware due to Incompatible buildVersion of JobTracker and TaskTracker
Ah yes, there you go ... so much for extrapolating on a Monday :). Sorry Bill! Brian On Apr 6, 2009, at 6:03 PM, Todd Lipcon wrote: On Mon, Apr 6, 2009 at 4:01 PM, Brian Bockelman bbock...@cse.unl.edu wrote: Hey Bill, I might be giving you bad advice (I've only verified this for HDFS components on the 0.19.x branch, not the JT/TT or the 0.18.x branch), but... In my understanding, Hadoop only compares the base SVN revision number, not the build strings. Make sure that both have the SVN rev. Our build machines don't have svn installed, so we actually have to generate the correct SVN rev, then submit it to the build clusters. We certainly don't have our build time strings match up. Nope - the JT/TT definitely do verify the entirety of the build string (at least in the case when not built from SVN). I saw the same behavior as Bill is reporting last week on 0.18.3. -Todd
Re: problem running on a cluster of mixed hardware due to Incompatible buildVersion of JobTracker and TaskTracker
This was discussed over on: https://issues.apache.org/jira/browse/HADOOP-5203 Doug uploaded a patch, but no one seems to be working on it. -- Owen
hadoop 0.18.3 writing not flushing to hadoop server?
I have a strange issue: when I write to Hadoop, the content is not transferred to HDFS even after a long time. Is there any way to force-flush the local temp files to HDFS after writing? When I shut down the VM, the data does get flushed. thanks,
Modeling WordCount in a different way
Hi, I want to experiment with the wordcount example in a different way. Suppose we have very large data. Instead of splitting all the data at once, we want to feed some splits into the map-reduce job at a time. I want to model the Hadoop job like this: a batch of input splits arrives at the beginning and goes to the maps, and the reduce gives the (word, frequency) pairs for this batch of input splits. Then another batch of input splits arrives, and the results from the subsequent reduce are aggregated with the previous results (if a word had frequency 2 in the previous run and occurs 1 time in this run, its frequency is now maintained as 3; if in the next map-reduce it occurs 4 times, its frequency becomes 7). And this process goes on like this. How would I model input splits like this, and how can these continuous map-reduces be kept running? In what form should I keep the results of each map-reduce so that I can aggregate them with the output of the next map-reduce? Thanks, Aayush
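One possible way to sketch the aggregation (just an illustration, not an established recipe): write each round's totals to HDFS as plain word<TAB>count text, and in the next round add the previous output directory as an extra input path alongside the new batch, so a standard summing reducer folds the old totals into the new counts. All class names below are hypothetical, and the sketch assumes the previous output really is tab-separated word/count lines.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical incremental word count: the mapper accepts both fresh text
// lines and previous-round "word<TAB>count" lines, emitting (word, count).
public class IncrementalWordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {
    public void map(LongWritable key, Text value,
        OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      String line = value.toString();
      int tab = line.indexOf('\t');
      if (tab >= 0) {
        // A line from a previous round's output: carry its count forward.
        out.collect(new Text(line.substring(0, tab)),
            new LongWritable(Long.parseLong(line.substring(tab + 1))));
      } else {
        // Fresh input: count each whitespace-separated token once.
        for (String word : line.split("\\s+")) {
          if (word.length() > 0) {
            out.collect(new Text(word), new LongWritable(1));
          }
        }
      }
    }
  }

  // Standard summing reducer: old totals and new occurrences simply add up.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text key, Iterator<LongWritable> values,
        OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      long sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      out.collect(key, new LongWritable(sum));
    }
  }
}

In the driver you would then register both the new batch directory and the previous round's output directory as input paths, writing each round to a fresh output directory.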
Re: hadoop 0.18.3 writing not flushing to hadoop server?
The data is flushed when the file is closed, or when the amount written is an even multiple of the block size specified for the file, which by default is 64 MB. There is no other way to flush the data to HDFS at present. There is an attempt at this in 0.19.0, but it caused data corruption issues and was backed out for 0.19.1. Hopefully a working version will appear soon. On Mon, Apr 6, 2009 at 5:05 PM, javateck javateck javat...@gmail.com wrote: I have a strange issue: when I write to Hadoop, the content is not transferred to HDFS even after a long time. Is there any way to force-flush the local temp files to HDFS after writing? When I shut down the VM, the data does get flushed. thanks, -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422
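In practice that means the write only becomes visible to other readers once the stream is closed (or a complete block has been written); a minimal sketch of the pattern, with an illustrative path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteAndClose {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();     // reads hadoop-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);

    Path out = new Path("/tmp/example.txt");      // illustrative path
    FSDataOutputStream stream = fs.create(out);
    stream.writeBytes("some data\n");
    // Nothing is guaranteed to reach HDFS yet; until a block fills or the
    // file is closed, readers won't see the data.
    stream.close();                               // close() is what makes the data visible
  }
}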
Re: problem running on a cluster of mixed hardware due to Incompatible buildVersion of JobTracker and TaskTracker
Owen, thanks for pointing out that Jira. Bill On Mon, Apr 6, 2009 at 7:20 PM, Owen O'Malley omal...@apache.org wrote: This was discussed over on: https://issues.apache.org/jira/browse/HADOOP-5203 Doug uploaded a patch, but no one seems to be working on it. -- Owen
Re: Reduce task attempt retry strategy
Stefan Will wrote: Hi, I had a flaky machine the other day that was still accepting jobs and sending heartbeats, but caused all reduce task attempts to fail. This in turn caused the whole job to fail because the same reduce task was retried 3 times on that particular machine. What is your cluster size? If a task fails on a machine then it is retried on some other machine (based on the number of good machines left in the cluster). After a certain number of failures, the machine will be blacklisted (again based on the number of machines left in the cluster). 3 different reducers might be scheduled on that machine, but that should not lead to job failure. Can you explain in detail what exactly happened? Find out from the jobtracker's log where the attempts got scheduled. Amar Perhaps I'm confusing this with the block placement strategy in hdfs, but I always thought that the framework would retry jobs on a different machine if retries on the original machine keep failing. E.g. I would have expected to retry once or twice on the same machine, but then switch to a different one to minimize the likelihood of getting stuck on a bad machine. What is the expected behavior in 0.19.1 (which I'm running)? Any plans for improving on this in the future? Thanks, Stefan
Re: Job tracker not responding during streaming job
David Kellogg wrote: I am running Hadoop streaming. After around 42 jobs on an 18-node cluster, the jobtracker stops responding. This happens on normally-working code. Here are the symptoms. 1. A job is running, but it pauses with reduce stuck at XX% 2. hadoop job -list hangs or takes a very long time to return 3. In the Ganglia metrics on the Jobtracker node: a. jvm.metrics__JobTracker__gcTimeMillis rises above 20 k (20 seconds) before failure b. jvm.metrics__JobTracker__memHeapUsedM rises above 600 before failure c. jvm.metrics__JobTracker__gcCount rises above 1 k before failure The ticker looks like this. 09/04/06 03:06:28 INFO streaming.StreamJob: map 24% reduce 7% 09/04/06 03:13:44 INFO streaming.StreamJob: map 25% reduce 7% After the 03:13:44 line, it hangs for more than 15 minutes. In the jobtracker log, I see this. 2009-04-04 04:19:13,563 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-8143535428142072268_95993 failed because recovery from primary datanode 10.1.0.156:50010 failed 4 times. Will retry... After restarting both dfs and mapreduce on all nodes, the problem goes away, and the formerly non-working job proceeds without failure. David, What version are you using? This can be because of:
1) The number of tasks in the jobtracker's memory might exceed its limits. What is the total number of tasks in the jobtracker's memory? What is the jobtracker's heap size? Try increasing the heap size, and also try setting the mapred.jobtracker.completeuserjobs.maximum parameter to some low value.
2) Sometimes a slow/bad datanode causes the jobtracker to get stuck. As you have mentioned, this might be the cause. Can you let us know the output of 'kill -3' on the jobtracker process?
Does anyone else see this problem? David Kellogg
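For cause (1), the property Amar mentions would go into hadoop-site.xml on the JobTracker along these lines; the value here is an arbitrary illustration, and the heap itself is typically raised via HADOOP_HEAPSIZE in hadoop-env.sh rather than in this file.

<property>
  <name>mapred.jobtracker.completeuserjobs.maximum</name>
  <!-- keep fewer completed jobs per user in JobTracker memory; value is illustrative -->
  <value>5</value>
</property>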
connecting two clusters
Hey all I'm trying to connect two separate Hadoop clusters. Is it possible to do so? I need data to be shuttled back and forth between the two clusters. Any suggestions? Thank you! Mithila Nagendra Arizona State University
Re: connecting two clusters
DistCp is the standard way to copy data between clusters. What it does is run a mapreduce job to copy data between a source cluster and a destination cluster. See http://hadoop.apache.org/core/docs/r0.19.1/distcp.html On Mon, Apr 6, 2009 at 9:49 PM, Mithila Nagendra mnage...@asu.edu wrote: Hey all I'm trying to connect two separate Hadoop clusters. Is it possible to do so? I need data to be shuttled back and forth between the two clusters. Any suggestions? Thank you! Mithila Nagendra Arizona State University
Re: connecting two clusters
On Apr 6, 2009, at 9:49 PM, Mithila Nagendra wrote: Hey all I'm trying to connect two separate Hadoop clusters. Is it possible to do so? I need data to be shuttled back and forth between the two clusters. Any suggestions? You should use hadoop distcp. It is a map/reduce program that copies data, typically from one cluster to another. If you have the hftp interface enabled, you can use that to copy between hdfs clusters that are different versions. hadoop distcp hftp://namenode1:1234/foo/bar hdfs://foo/bar -- Owen
Re: connecting two clusters
Thanks! I was looking at the link sent by Philip. The copy is done with the following command: hadoop distcp hdfs://nn1:8020/foo/bar \ hdfs://nn2:8020/bar/foo I was wondering if nn1 and nn2 are the names of the clusters or the names of the masters on each cluster. I wanted map/reduce tasks running on each of the two clusters to communicate with each other. I don't know if Hadoop provides for synchronization between two map/reduce tasks. The tasks run simultaneously, and they need to access a common file - something like a map/reduce task at a higher level utilizing the data produced by the map/reduce at the lower level. Mithila On Tue, Apr 7, 2009 at 7:57 AM, Owen O'Malley omal...@apache.org wrote: On Apr 6, 2009, at 9:49 PM, Mithila Nagendra wrote: Hey all I'm trying to connect two separate Hadoop clusters. Is it possible to do so? I need data to be shuttled back and forth between the two clusters. Any suggestions? You should use hadoop distcp. It is a map/reduce program that copies data, typically from one cluster to another. If you have the hftp interface enabled, you can use that to copy between hdfs clusters that are different versions. hadoop distcp hftp://namenode1:1234/foo/bar hdfs://foo/bar -- Owen
Re: Reduce task attempt retry strategy
I've seen the same thing happening on the 0.19 branch. When a task fails on the reduce end it always retries on the same node until it kills the job for too many failed tries on one reduce task. I am running a cluster of 7 nodes. Billy Stefan Will stefan.w...@gmx.net wrote in message news:c5ff7f91.18c09%stefan.w...@gmx.net... Hi, I had a flaky machine the other day that was still accepting jobs and sending heartbeats, but caused all reduce task attempts to fail. This in turn caused the whole job to fail because the same reduce task was retried 3 times on that particular machine. Perhaps I'm confusing this with the block placement strategy in hdfs, but I always thought that the framework would retry jobs on a different machine if retries on the original machine keep failing. E.g. I would have expected to retry once or twice on the same machine, but then switch to a different one to minimize the likelihood of getting stuck on a bad machine. What is the expected behavior in 0.19.1 (which I'm running)? Any plans for improving on this in the future? Thanks, Stefan