Re: running hadoop on heterogeneous hardware
Bill Au wrote: Is Hadoop designed to run on homogeneous hardware only, or does it work just as well on heterogeneous hardware? If the datanodes have different disk capacities, does HDFS still spread the data blocks equally among all the datanodes, or will the datanodes with higher disk capacity end up storing more data blocks? Similarly, if the tasktrackers have different numbers of CPUs, is there a way to configure Hadoop to run more tasks on those tasktrackers that have more CPUs? Is that simply a matter of setting mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum differently on the tasktrackers? Bill

Life is simpler on homogeneous boxes. By setting the maximum tasks differently for the different machines, you do limit the amount of work that gets pushed out to those boxes. More troublesome are slower CPUs/HDDs: they aren't picked up directly, though speculative execution can handle some of this. One interesting bit of research would be something adaptive: something to monitor throughput and tune those values based on performance. That would detect variations in a cluster and work with it, rather than requiring you to know the capabilities of every machine. -steve
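Setting those maxima differently is a per-node hadoop-site.xml edit; a minimal sketch for a tasktracker with more CPUs, using the 0.18-era property names from the question (the values are illustrative, not recommendations):

    <!-- hadoop-site.xml on the bigger box; values are illustrative -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>8</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>4</value>
    </property>

Each tasktracker reads its own copy of this file at startup, so the settings can differ per machine.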
Hadoop with many input/output files?
I am seeing the MultiFileInputFormat and the MultipleOutputFormat Input/Output formats for the job configuration. How can I properly use them? I had previously used the default input and output format types, which for my PDF concatenation project merely reduced Hadoop to a scheduler. The idea is, per directory, to concatenate all PDFs in that directory into one PDF, and for this I'm using iText. How can I use these format types? What would my input into the mapper be, and what would my InputKeyValue and OutputKeyValue classes be? Thank you! I can't find documentation on these other than the Javadoc, which doesn't help much. Richard J. Zak
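For the input side, a hedged sketch of the usual pattern with the old mapred API: subclass MultiFileInputFormat and supply a RecordReader. MyPdfRecordReader here is hypothetical — it would have to be written to emit one (filename, file bytes) pair per PDF in the MultiFileSplit:

    // Sketch only: MyPdfRecordReader is hypothetical and must be written
    // to iterate over the files packed into the MultiFileSplit.
    public class PdfInputFormat extends MultiFileInputFormat<Text, BytesWritable> {
      @Override
      public RecordReader<Text, BytesWritable> getRecordReader(
          InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new MyPdfRecordReader((MultiFileSplit) split, job);
      }
    }
    // wired up with: conf.setInputFormat(PdfInputFormat.class);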
Re: Hadoop with many input/output files?
I have a very similar question: how do I recursively list all files in a given directory, so that all of the files get processed by MapReduce? And if I just copy them to the output, say, is there any problem with dropping them all in the same output directory in HDFS? To use a bad example, Windows chokes on many files in one directory. Thank you, Mark
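A minimal sketch of the recursive listing, assuming the FileStatus API of 0.18-era releases; error handling is omitted:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RecursiveLister {
      // Walk dir depth-first, collecting every plain file.
      static void listFiles(FileSystem fs, Path dir, List<Path> result) throws IOException {
        for (FileStatus stat : fs.listStatus(dir)) {
          if (stat.isDir()) {
            listFiles(fs, stat.getPath(), result);
          } else {
            result.add(stat.getPath());
          }
        }
      }

      public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        List<Path> files = new ArrayList<Path>();
        listFiles(fs, new Path(args[0]), files);
        // each path can then be fed to FileInputFormat.addInputPath(conf, p)
        for (Path p : files) {
          System.out.println(p);
        }
      }
    }

As for many files in one HDFS directory: the namenode keeps the listing in memory, so it degrades more gracefully than a desktop filesystem, but very large directories still cost namenode memory.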
Set the Order of the Keys in Reduce
Hello, any tips would be greatly appreciated. Is there a way to set the order of the keys in reduce, as shown below, no matter what order the collection in map occurs in? Thanks, Brian

public void map(WritableComparable key, Text values,
    OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
  // collect many CAT_A and CAT_B in random order
  output.collect(CAT_A, details);
  output.collect(CAT_B, details);
}

public void reduce(Text key, Iterator<Text> values,
    OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
  // always reduce CAT_A first, then reduce CAT_B
}
Re: Set the Order of the Keys in Reduce
Hi Brian, The CAT_A and CAT_B keys will be processed by different reducer instances, so they run independently and may run in any order. What's the output that you're trying to get? Cheers, Tom
Archive?
Hi, is there an archive of the messages? I am a newcomer, granted, but Google Groups has all the discussion capabilities, and it has a searchable archive. It seems strange to have just a mailing list. Am I missing something? Thank you, Mark
Re: Archive?
Hi Mark, The archives are listed on http://wiki.apache.org/hadoop/MailingListArchives Tom
RE: Set the Order of the Keys in Reduce
Hello Tom, I would like to apply some rules to CAT_A, then use the output of CAT_A to reduce CAT_B. I'd rather not run two jobs, so perhaps I need two reducers? The first reducer processes CAT_A; when it completes, the second reducer does CAT_B. I suppose this would accomplish the same thing?
Re: Set the Order of the Keys in Reduce
Reducers run independently and without knowledge of one another, so you can't get one reducer to depend on the output of another. I think having two jobs is the simplest way to achieve what you're trying to do. Tom
Re: Set the Order of the Keys in Reduce
On Jan 22, 2009, at 7:25 AM, Brian MacKay wrote: Is there a way to set the order of the keys in reduce as shown below, no matter what order the collection in MAP occurs in.

The keys to reduce are *always* sorted. If the default order is not correct, you can change the compare function. As Tom points out, the critical thing is making sure that all of the keys that you need to group together go to the same reduce. So let's make it a little more concrete and say that you have:

public class TextPair implements Writable {
  public TextPair() {}
  public void set(String left, String right);
  public String getLeft();
  ...
}

And your map 0 does:

key.set("CAT", "B");
output.collect(key, value);
key.set("DOG", "A");
output.collect(key, value);

While map 1 does:

key.set("CAT", "A");
output.collect(key, value);
key.set("DOG", "B");
output.collect(key, value);

If you want to make sure that all of the cats go to the same reduce and all of the dogs go to the same reduce, you need to set the partitioner. It would look like:

public class MyPartitioner<V> implements Partitioner<TextPair, V> {
  public void configure(JobConf job) {}
  public int getPartition(TextPair key, V value, int numReduceTasks) {
    return (key.getLeft().hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

Then define a raw comparator that sorts based on both the left and right parts of the TextPair, and you are set. -- Owen
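A hedged sketch of the comparator Owen leaves as an exercise, assuming TextPair is extended to implement WritableComparable and that deserialized-object comparison is acceptable (a true byte-level RawComparator would be faster but longer):

public class TextPairComparator extends WritableComparator {
  protected TextPairComparator() {
    super(TextPair.class, true);  // true: instantiate keys for compare()
  }
  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    TextPair p1 = (TextPair) a;
    TextPair p2 = (TextPair) b;
    int cmp = p1.getLeft().compareTo(p2.getLeft());  // sort by left first...
    return cmp != 0 ? cmp : p1.getRight().compareTo(p2.getRight());  // ...then right
  }
}

It would be registered with conf.setOutputKeyComparatorClass(TextPairComparator.class); and, to make every "CAT" arrive in a single reduce() call regardless of the right part, a grouping comparator that compares only getLeft() can be set via conf.setOutputValueGroupingComparator(...).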
RE: Decommissioning Nodes
I wasn't able to get decommissioning to work at all and found that just taking the node down got it out of the cluster. What version are you running, and how are you initiating the decommissioning? -Rob

Rob Hamilton - VP Network Operations P +1 (410) 379-2195 x 240 E r...@lotame.com 6085 Marshalee Drive, Suite 210 Elkridge, MD 21075

-Original Message- From: Hargraves, Alyssa [mailto:aly...@wpi.edu] Sent: Wednesday, January 21, 2009 7:35 PM To: core-user@hadoop.apache.org Subject: Decommissioning Nodes

Hello Hadoop Users, I was hoping someone would be able to answer a question about node decommissioning. I have a test Hadoop cluster set up which consists only of my computer and a master node. I am looking at the removal and addition of nodes. Adding a node is nearly instant (only about 5 seconds), but removing a node by decommissioning it takes a while, and I don't understand why. Currently, the systems are running no map/reduce tasks and storing no data. DFS Health reports: 7 files and directories, 0 blocks = 7 total. Heap Size is 6.68 MB / 992.31 MB (0%). Capacity: 298.02 GB. DFS Remaining: 245.79 GB. DFS Used: 4 KB. DFS Used%: 0%. Live Nodes: 2. Dead Nodes: 0.

Node    Last Contact  Admin State               Size (GB)  Used (%)  Remaining (GB)  Blocks
master  0             In Service                149.01     0         122.22          0
slave   82            Decommission In Progress  149.01     0         123.58          0

However, even with nothing stored and nothing running, the decommission process takes 3 to 5 minutes, and I'm not quite sure why. There isn't any data to move anywhere, and there aren't any jobs to worry about. I am using 0.18.2. Thank you for any help in solving this, Alyssa Hargraves
RE: Set the Order of the Keys in Reduce
Owen, thanks for joining in. I suppose what is needed is a new config setting, call it SequenceReducer. In it you would specify multiple reducer classes in the order you would like them executed by the JobTracker. When the map phase completes, MyReducerA.class would run, and in it would be specified the keys it should reduce, not all existing keys. In Owen's example, this could be CAT. When all instances of MyReducerA complete reducing CAT, the JobTracker would move on to the next reducer in the list. MyReducerB could then retrieve the values reduced down from CAT in HDFS as a filter to reduce DOG.

// hypothetical API, sketching the proposal
List<Class> list = new ArrayList<Class>();
list.add(MyReducerA.class);  // reduces CAT
list.add(MyReducerB.class);  // reduces DOG
conf.setSequenceReducer(list);

I agree with the previous posts and appreciate everyone's insights and participation. What I proposed above is not simple. But when one considers the size of the job, running it twice doesn't make a lot of sense. Should one rerun a 40 GB job because the values reduced in CAT are needed to filter the reduce of DOG? A better way must exist! Owen, maybe I misunderstood your message, but it seems that even with the addition of a partitioner and raw comparator, Tom's point would still prevent what I'm trying to do without something like the above: "you can't get one reducer to depend on the output of another." Thanks, Brian
RE: Decommissioning Nodes
I was following the steps at http://wiki.apache.org/hadoop/FAQ#17 to do the decommission. However, you have to be patient with it, since it seems to take a long time. If it took 3-5 minutes with my nodes that have no data and no jobs running, I can't imagine how long it would be for a real cluster. One thing that I had trouble with originally was that it doesn't seem to work if your replication is set to the same as your number of machines (since I was just testing things, I had replication set to 2 with 2 machines, but that's not a good real-world example). The problem I'm having, though (from Jeremy's reply earlier it sounds like he misinterpreted it), isn't how long it takes for the node to go from decommissioned to being recognized by the master as dead. Whether or not it's recognized as dead isn't something that matters for what I'm doing. The real problem is that going from the In Service to Decommissioned state is taking forever. Decommission In Progress lasts 3 to 5 minutes despite the fact that there aren't jobs or data on those nodes. If anyone else has any idea why that might be (I can see why it would take time if there are jobs or data, but not otherwise), please let me know. - Alyssa
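For anyone following along, the FAQ#17 steps amount to roughly the following; the exclude-file path is illustrative:

    <!-- hadoop-site.xml on the namenode; path is illustrative -->
    <property>
      <name>dfs.hosts.exclude</name>
      <value>/home/hadoop/excludes</value>
    </property>

    # add the hostname of each node to decommission to the exclude file,
    # then tell the namenode to re-read it:
    bin/hadoop dfsadmin -refreshNodes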
watch out: Hadoop and Linux kernel 2.6.27
Hi, we just came across a very serious problem with Hadoop (and any other nio-intensive Java application) and kernel 2.6.27. Short story: increase the epoll maximum instances limit (/proc/sys/fs/epoll/max_user_instances) to prevent "Too many open files" errors, regardless of your ulimit -n settings. Long story: http://pero.blogs.aprilmayjune.org/2009/01/22/hadoop-and-linux-kernel-2627-epoll-limits/ I just wanted to drop this note since it took us 2 days to figure it out... :( Regards Peter
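A sketch of the workaround using the /proc path Peter names; 1024 is an illustrative value, and the sysctl key name is an assumption based on that path:

    # raise the per-user epoll instance limit (value illustrative)
    echo 1024 > /proc/sys/fs/epoll/max_user_instances

    # to persist across reboots, the equivalent line in /etc/sysctl.conf
    # would presumably be:
    #   fs.epoll.max_user_instances = 1024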
Re: Decommissioning Nodes
Can you try setting the following in hadoop-site.xml at the name node and see if the time comes down to around a minute?

<property>
  <name>heartbeat.recheck.interval</name>
  <value>1</value>
</property>

This effectively shortens how long the namenode waits between rechecks of datanode status, which is on the order of minutes by default and would line up with the 3-5 minutes you're seeing.
-- Kumar Pandey http://www.linkedin.com/in/kumarpandey
FileOutputFormat.getWorkOutputPath and map-to-reduce-only side-effect files
Hello Hadoop Core, I have a very brief question. Our map tasks create side-effect files in the directory returned by FileOutputFormat.getWorkOutputPath(). This works fine for getting side-effect files that can be accessed by the reducers. However, as these map-generated side-effect files are only of use to the reducers, it would be nice to have them deleted from the output directory. We can't delete them in a reducer's close(), as this would prevent them from being accessible to other reduce tasks (speculative or otherwise). Any suggestions, short of deleting them after the job completes? Craig
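Absent a better hook, a minimal sketch of the delete-after-completion fallback Craig mentions, assuming the side-effect files share a recognizable suffix (the ".side" suffix is hypothetical):

    // Post-job cleanup sketch (old mapred API); ".side" is a hypothetical
    // suffix the map tasks would use for their side-effect files.
    RunningJob job = JobClient.runJob(conf);   // blocks until the job finishes
    if (job.isSuccessful()) {
      FileSystem fs = FileSystem.get(conf);
      Path outDir = FileOutputFormat.getOutputPath(conf);
      for (FileStatus stat : fs.listStatus(outDir)) {
        if (stat.getPath().getName().endsWith(".side")) {
          fs.delete(stat.getPath(), false);    // drop the map-only artifact
        }
      }
    }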
Re: using distcp for http source files
Aaron Kimball wrote: Doesn't the WebDAV protocol use HTTP for file transfer, and support reads / writes / listings / etc.?

Yes. Getting a WebDAV-based FileSystem in Hadoop has long been a goal. It could replace libhdfs, since there are already WebDAV-based FUSE filesystems for Linux (wdfs, davfs2). WebDAV is also mountable from Windows, etc.

Is anyone aware of an OSS WebDAV library that could be wrapped in a FileSystem implementation?

Apache Slide did, but it's dead. Apache Jackrabbit also does, and it is alive (http://jackrabbit.apache.org/). Doug
Re: Distributed cache testing in local mode
Hi Bhupesh, I've noticed the same problem -- LocalJobRunner makes the DistributedCache effectively not work, so my code often winds up with two codepaths to retrieve the local data :\ You could try running in pseudo-distributed mode to test, though then you lose the ability to run a single-stepping debugger on the whole end-to-end process. - Aaron

On Thu, Jan 22, 2009 at 11:29 AM, Bhupesh Bansal bban...@linkedin.com wrote: Hey folks, I am trying to use the distributed cache in Hadoop jobs to pass around configuration files, external jars (job-specific), and some archive data. I want to test the job end-to-end in local mode, but I think the distributed caches are localized in TaskTracker code, which is not called in local mode through LocalJobRunner. I can do some fairly simple workarounds for this, but was just wondering if folks have more ideas about it. Thanks Bhupesh
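A sketch of the two-codepath workaround Aaron describes, assuming the convention that mapred.job.tracker is the literal string "local" under LocalJobRunner; the my.config.file property name is hypothetical:

    // In configure(): read the file directly under LocalJobRunner,
    // and the localized DistributedCache copy otherwise.
    public void configure(JobConf conf) {
      try {
        Path cached;
        if ("local".equals(conf.get("mapred.job.tracker"))) {
          // local mode: the cache is never localized; fall back to the
          // original path (my.config.file is a hypothetical property)
          cached = new Path(conf.get("my.config.file"));
        } else {
          Path[] files = DistributedCache.getLocalCacheFiles(conf);
          cached = files[0];  // assumes a single cached file
        }
        // ... open 'cached' and load the configuration
      } catch (IOException e) {
        throw new RuntimeException("failed to locate cached file", e);
      }
    }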