Re: can't find java home
Can you run dos2unix /home/HadoopAdmin/hadoop-0.18.3/bin/../conf/*.sh and then try again? Thanks, -Vikas.

On Wed, Aug 26, 2009 at 11:56 AM, Puri, Aseem aseem.p...@honeywell.com wrote:

Hi, I am facing an issue while starting my Hadoop cluster. When I run the command

$ bin/hadoop namenode -format

I get these errors:

/home/HadoopAdmin/hadoop-0.18.3/bin/../conf/hadoop-env.sh: line 2: $'\r': command not found
/home/HadoopAdmin/hadoop-0.18.3/bin/../conf/hadoop-env.sh: line 7: $'\r': command not found
/home/HadoopAdmin/hadoop-0.18.3/bin/../conf/hadoop-env.sh: line 10: $'\r': command not found
/home/HadoopAdmin/hadoop-0.18.3/bin/../conf/hadoop-env.sh: line 13: $'\r': command not found
/home/HadoopAdmin/hadoop-0.18.3/bin/../conf/hadoop-env.sh: line 16: $'\r': command not found
/home/HadoopAdmin/hadoop-0.18.3/bin/../conf/hadoop-env.sh: line 19: $'\r': command not found
/home/HadoopAdmin/hadoop-0.18.3/bin/../conf/hadoop-env.sh: line 29: $'\r': command not found
/home/HadoopAdmin/hadoop-0.18.3/bin/../conf/hadoop-env.sh: line 32: $'\r': command not found
/home/HadoopAdmin/hadoop-0.18.3/bin/../conf/hadoop-env.sh: line 35: $'\r': command not found
/home/HadoopAdmin/hadoop-0.18.3/bin/../conf/hadoop-env.sh: line 38: $'\r': command not found
/home/HadoopAdmin/hadoop-0.18.3/bin/../conf/hadoop-env.sh: line 41: $'\r': command not found
/home/HadoopAdmin/hadoop-0.18.3/bin/../conf/hadoop-env.sh: line 46: $'\r': command not found
/home/HadoopAdmin/hadoop-0.18.3/bin/../conf/hadoop-env.sh: line 49: $'\r': command not found
/home/HadoopAdmin/hadoop-0.18.3/bin/../conf/hadoop-env.sh: line 52: $'\r': command not found
/bin/java: No such file or directorymin/java/jdk1.6.0_13
/bin/java: No such file or directorymin/java/jdk1.6.0_13
/bin/java: cannot execute: No such file or directorydk1.6.0_13

Also, in /home/HadoopAdmin/hadoop-0.18.3/conf/hadoop-env.sh I set JAVA_HOME to where I have installed Java:

export JAVA_HOME=/home/HadoopAdmin/java/jdk1.6.0_13

Please help me with this issue. Regards, Aseem Puri
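For reference: these errors come from Windows-style CRLF line endings in the conf scripts. The stray \r both breaks each line ("$'\r': command not found") and gets embedded in JAVA_HOME, which is why the java path prints garbled. A minimal sketch of the fix Vikas suggests, assuming dos2unix is installed and using the paths from the post:

    # Strip the \r from every conf script so bash can parse them.
    dos2unix /home/HadoopAdmin/hadoop-0.18.3/conf/*.sh

    # Re-run the format step from the Hadoop home to confirm the fix.
    cd /home/HadoopAdmin/hadoop-0.18.3
    bin/hadoop namenode -format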
HBase master not starting
Hello, I am trying to set up HBase 0.20 with Hadoop 0.20 in fully distributed mode. I have a problem while starting the HBase master. The stack trace is as follows:

2009-08-26 01:18:31,454 INFO org.apache.hadoop.hbase.master.HMaster: My address is domU-12-31-39-00-0A-52.compute-1.internal:6
2009-08-26 01:18:32,600 FATAL org.apache.hadoop.hbase.master.HMaster: Not starting HMaster because:
java.io.EOFException
    at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:323)
    at java.io.DataInputStream.readUTF(DataInputStream.java:572)

Please help me out with this. Below is my hbase-site.xml configuration:

<property>
  <name>hbase.rootdir</name>
  <value>hdfs://domU-12-31-39-00-28-52.compute-1.internal:40010/hbase</value>
  <description>The directory shared by region servers.</description>
</property>
<property>
  <name>hbase.master</name>
  <value>domU-12-31-39-00-28-52.compute-1.internal:6</value>
  <description>The host and port that the HBase master runs at.</description>
</property>
<property>
  <name>hbase.master.port</name>
  <value>6</value>
  <description>The port the master should bind to.</description>
</property>
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
  <description>The mode the cluster will be in. Possible values are
    false: standalone and pseudo-distributed setups with managed ZooKeeper
    true: fully-distributed with unmanaged ZooKeeper Quorum (see hbase-env.sh)
  </description>
</property>
</configuration>
Testing Hadoop job
Hi, can you suggest a Hadoop unit-testing framework apart from MRUnit? I have used MRUnit, but I am not sure about its feasibility and its support for Hadoop 0.20. I also could not find proper documentation for MRUnit; is it available anywhere? -- cheers nikhil
0.19.1 infinite loop
I'm using Hadoop 0.19.1 on a 60-node cluster; each node has 8GB of RAM and 4 cores. I have several jobs that run every day, and last night one of them triggered an infinite loop that rendered the cluster inoperable. As the job finishes, the following is logged to the jobtracker logs:

2009-08-25 22:08:04,633 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_200908220740_0126_r_01_0' has completed task_200908220740_0126_r_01 successfully.
2009-08-25 22:08:04,633 INFO org.apache.hadoop.mapred.JobInProgress: Job job_200908220740_0126 has completed successfully.
2009-08-25 22:08:09,897 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /proc/statpump/incremental/200908260200/_logs/history/dup-jt_1250941231725_job_200908220740_0126_hadoop_statpump-incremental retrying...

That last line, "Could not complete file...", then repeats forever, at which point the jobtracker UI stops responding and no more tasks will run. The only way to free things up is to restart the jobtracker.

Both prior to and during the infinite loop, I see this in the namenode logs. Because it starts long before the infinite loop, I can't tell for sure if it's related, and it is still happening now, even after the restart and with jobs finishing without issue:

2009-08-25 22:08:05,760 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 54310, call nextGenerationStamp(blk_2796235715791117970_4385127) from 172.21.30.2:48164: error: java.io.IOException: blk_2796235715791117970_4385127 is already commited, storedBlock == null.
java.io.IOException: blk_2796235715791117970_4385127 is already commited, storedBlock == null.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.nextGenerationStampForBlock(FSNamesystem.java:4552)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.nextGenerationStamp(NameNode.java:402)
    at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)

And finally, this warning appears in the namenode logs just prior as well:

2009-08-25 22:07:22,580 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Inconsistent size for block blk_-1458477261945758787_4416123 reported from 172.21.30.4:50010 current size is 5396992 reported size is 67108864

Can anyone point me in a direction to determine what's going on here? Thanks
Re: 0.19.1 infinite loop
Hey Jeremy, Glad someone else has run into this! I always thought this specific infinite loop was in my code. I had an issue open for it earlier, but I ultimately was not sure whether it was in my code or in HDFS, so we closed it: https://issues.apache.org/jira/browse/HADOOP-4866 We [and others] get these daily. It would be nice to figure out a way to replicate this. Brian

On Aug 26, 2009, at 8:27 AM, Jeremy Pinkham wrote: [...]
RE: 0.19.1 infinite loop
Thanks Brian. I'm trying to find a way to reliably replicate it, and will certainly update this list if I manage to do so. It is happening with more frequency in our QA environment, which is a much smaller cluster (only 2 nodes), but still not deterministically. Hopefully we can hone in on something.

-Original Message- From: Brian Bockelman [mailto:bbock...@cse.unl.edu] Sent: Wednesday, August 26, 2009 9:54 AM To: common-user@hadoop.apache.org Subject: Re: 0.19.1 infinite loop [...]
Re: Intra-datanode balancing?
But I mean, how does the datanode then know that these files were copied from one partition to another, into this new directory? I'm not sure of the inner workings of how a datanode knows what files are on it. I was assuming that it keeps track of the subdir directory... or is that just a placeholder name, and whatever directory is under that parent directory will be scanned and picked up by the datanode? Kris.

On Tue, Aug 25, 2009 at 6:24 PM, Raghu Angadi rang...@yahoo-inc.com wrote:

Kris Jirapinyo wrote: How does copying the subdir work? What if that partition already has the same subdir (in the case that our partition is not new but relatively new... with maybe 10% used)?

You can copy the files. There isn't really any requirement on the number of files in a directory. Something like cp -r subdir5 dest/subdir5 might do (or rsync without the --delete option). Just make sure you delete the directory from the source. Raghu.

Thanks for the suggestions so far, guys. Kris.

On Tue, Aug 25, 2009 at 5:01 PM, Raghu Angadi rang...@yahoo-inc.com wrote:

For now you are stuck with the hack. Sooner or later Hadoop has to handle heterogeneous nodes better. In general it tries to write to all the disks irrespective of % full, since that gives the best performance (assuming each partition's capabilities are the same). But it is lame at handling skews. Regarding your hack: 1. You can copy a subdir to the new partition rather than deleting it (datanodes should be shut down). 2. I would think it is less work to implement a better policy in the DataNode for this case. It would be a pretty local change. When choosing a partition for a new block, the DN already knows how much free space is left on each one. For the simplest implementation, skip partitions that have less than 25% of the average free space, or choose with a probability proportional to relative free space. If it works well, file a jira. I don't think HDFS-343 is directly related to this or is likely to be fixed. There is another jira that makes the placement policy at the NameNode pluggable (it does not affect the DataNode). Raghu.

Kris Jirapinyo wrote: Hi all, I know this has been filed as a JIRA improvement already (http://issues.apache.org/jira/browse/HDFS-343), but is there any good workaround at the moment? What's happening is I have added a few new EBS volumes to half of the cluster, but Hadoop doesn't want to write to them. When I try to do cluster rebalancing, since the new disks make the percentage used lower, it fills up the first two existing local disks, which is exactly what I don't want to happen. Currently, I just delete several subdirs from dfs, since I know that with a replication factor of 3 it'll be OK, so that fixes the problem in the short term. But I still cannot get Hadoop to use those new larger disks efficiently. Any thoughts? -- Kris.
Re: Intra-datanode balancing?
Kris Jirapinyo wrote: But I mean, how does the datanode then know that these files were copied from one partition to another, into this new directory? [...]

Correct — the directory name does not matter. The only requirement is that a block file and its .meta file are in the same directory. When the datanode starts up, it scans all these directories and stores their paths in memory. Of course, this is still a big hack! (Just making that clear for readers who haven't seen the full context.) Raghu.
Re: Intra-datanode balancing?
Hmm, in that case it should be possible for me to manually load-balance those datanodes by moving most of the files onto the new, larger partition. I will try it. Thanks! -- Kris J.

On Wed, Aug 26, 2009 at 10:13 AM, Raghu Angadi rang...@yahoo-inc.com wrote: [...]
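For reference, a minimal sketch of the manual move Raghu describes — the partition paths /data1 and /data2 are hypothetical, and this assumes the 0.18/0.19-era block layout under dfs.data.dir:

    # Stop the datanode before touching its block directories.
    bin/hadoop-daemon.sh stop datanode

    # Copy one subdir (block files plus their .meta files) from the full
    # partition to the new one; the directory name itself does not matter.
    cp -r /data1/dfs/data/current/subdir5 /data2/dfs/data/current/subdir5

    # Delete the source copy so the blocks are not reported twice.
    rm -r /data1/dfs/data/current/subdir5

    # On restart, the datanode rescans its directories and picks up the moved blocks.
    bin/hadoop-daemon.sh start datanode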
Re: Seattle / NW Hadoop, HBase Lucene, etc. Meetup , Wed August 26th, 6:45pm
Hello, My apologies, but there was a mix-up reserving our meeting location, and we don't have access to it. I'm very sorry, and beer is on me next month. Promise :) Sent from my Internets

On Aug 25, 2009, at 4:21 PM, Bradford Stephens bradfordsteph...@gmail.com wrote: Hey there, Apologies for this not going out sooner -- apparently it was sitting as a draft in my inbox. A few of you have pinged me, so thanks for your vigilance. It's time for another Hadoop/Lucene/Apache Stack meetup! We've had great attendance in the past few months; let's keep it up! I'm always amazed by the things I learn from everyone. We're back at the University of Washington, Allen Computer Science Center (not Computer Engineering). Map: http://www.washington.edu/home/maps/?CSE Room: 303 -or- the entry level. If there are changes, signs will be posted. More info: The meetup is about 2 hours; we'll have two in-depth talks of 15-20 minutes each, and then several lightning talks of 5 minutes. We'll then have discussion and 'social time'. If no one offers to speak, we'll just have general discussion. Let me know if you're interested in speaking or attending. We'd like to focus on education, so every presentation *needs* to ask some questions at the end. We can talk about these after the presentations, and I'll record what we've learned in a wiki and share that with the rest of us. Contact: Bradford Stephens, 904-415-3009, bradfordsteph...@gmail.com -- http://www.roadtofailure.com -- The Fringes of Scalability, Social Media, and Computer Science
Symlink support
Hi! Could someone tell me about the status of symbolic link support in HDFS (HDFS-245)? It looks like a patch has been merged into the latest trunk, so I would like to know how well it works and whether or not the patch is applicable to the current release of Hadoop. We have just started testing HDFS as part of our FTP mirror site (http://ftp.kddilabs.jp/). It works fine so far, but if HDFS supported symlinks, we could save a lot of capacity :-) Thanks. Yasu
control map to split assignment
Hello, I wonder if there is a way to control how maps are assigned to splits in order to balance the load across the cluster. Here is a simplified example. I have two types of inputs: long and short. Each input is in a different file and will be processed by a single map task. Suppose the long inputs take 10s to process while the short inputs take 3s. I have two long inputs and two short inputs. My cluster has 2 nodes, and each node can execute only one map task at a time. A possible schedule of the tasks could be the following:

Node 1: long map, short map -> 10s + 3s = 13s
Node 2: long map, short map -> 10s + 3s = 13s

So my job will be done in 13s. Another possible schedule is:

Node 1: long map -> 10s
Node 2: short map, short map, long map -> 3s + 3s + 10s = 16s

And my job will be done in 16s. Clearly, the first schedule is better. Is there a way to control how the schedule is built? If I could control which inputs are processed first, I could schedule the long inputs to be processed first; they would then be balanced across the nodes, and I would end up with something similar to the first schedule. I could also configure the job so that a long input gets processed by more than one map, and so end up balancing the work, but I noticed that, overall, this takes more time than a bad schedule with only one map per input. Thanks! Cheers, Rares Vernica
How does reducer get intermediate output?
Hi all, In my cluster the reducers often can't fetch the mappers' output. I know there are many possible reasons for this, so I think it's necessary to understand how a reducer gets the intermediate output. I have read the source code, but I'm not clear about the whole process. Could you explain it? How does each node communicate with the others, and how does the ReduceCopier class work? Thank you. Inifok
Re: Concatenating files on HDFS
HDFS files are write-once, so you cannot append to them (at the moment). What you can do is copy your local file into the HDFS directory containing the file you want to append to. Once that is done, you can run a simple (identity mapper, identity reducer) MapReduce job with that directory as input and the number of reducers set to 1.

- Original Message - From: Turner Kunkel thkun...@gmail.com To: core-u...@hadoop.apache.org Sent: Wednesday, August 26, 2009 10:02:41 PM GMT +05:30 Chennai, Kolkata, Mumbai, New Delhi Subject: Concatenating files on HDFS

Is there any way to concatenate/append a local file to a file on HDFS without copying the HDFS file down locally first? I tried:

bin/hadoop dfs -cat file:///[local file] >> hdfs://[hdfs file]

But it just tries to look for hdfs://[hdfs file] as a local file, since I suppose the dfs -cat command doesn't support the >> operator. Thanks. -- -Turner Kunkel
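For what it's worth, a minimal sketch of the identity-job approach using Hadoop Streaming with cat as both mapper and reducer — the paths /data/toappend and /data/merged are hypothetical, and the streaming jar name varies by release:

    # Put the local file next to the existing HDFS file so both become job input.
    bin/hadoop dfs -put local-part.log /data/toappend/

    # An identity map/reduce with a single reducer writes one merged output file.
    bin/hadoop jar contrib/streaming/hadoop-*-streaming.jar \
      -input /data/toappend \
      -output /data/merged \
      -mapper cat \
      -reducer cat \
      -numReduceTasks 1

Note that the shuffle sorts records by key, so the merged file will not preserve the original line order; the Java IdentityMapper/IdentityReducer route has the same caveat.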