0.20.0 mapreduce package documentation
I just started playing with 0.20.0. I see that the mapred package is deprecated in favor of the mapreduce package. Is there any migration documentation for the new API (i.e., something more touristy than Javadoc)? All the website docs and Wiki examples are on the old API. Sorry if this is on the mailing list... I searched a bit but came up dry... In the same vein, it would be nice if release notes could have some narrative to go along with the random assortment of JIRA issue numbers. Especially when major API migration is in the works. Ian
Re: Command-line jobConf options in 0.18.3
bin/hadoop jar -files collopts -D prise.collopts=collopts p3l-3.5.jar gov.nist.nlpir.prise.mapred.MapReduceIndexer input output

The 'prise.collopts' option doesn't appear in the JobConf. Ian

Aaron Kimball aa...@cloudera.com writes: Can you give an example of the exact arguments you're sending on the command line? - Aaron

On Wed, Jun 3, 2009 at 5:46 PM, Ian Soboroff ian.sobor...@nist.gov wrote: If after I call getConf to get the conf object, I manually add the key/value pair, it's there when I need it. So it feels like ToolRunner isn't parsing my args for some reason. Ian

On Jun 3, 2009, at 8:45 PM, Ian Soboroff wrote: Yes, and I get the JobConf via 'JobConf job = new JobConf(conf, the.class)'. The conf is the Configuration object that comes from getConf. Pretty much copied from the WordCount example (which this program used to be a long while back...) thanks, Ian

On Jun 3, 2009, at 7:09 PM, Aaron Kimball wrote: Are you running your program via ToolRunner.run()? How do you instantiate the JobConf object? - Aaron

On Wed, Jun 3, 2009 at 10:19 AM, Ian Soboroff ian.sobor...@nist.gov wrote: I'm backporting some code I wrote for 0.19.1 to 0.18.3 (long story), and I'm finding that when I run a job and try to pass options with -D on the command line, the option values aren't showing up in my JobConf. I logged all the key/value pairs in the JobConf, and the option I passed through with -D isn't there. This worked in 0.19.1... did something change with command-line options from 18 to 19? Thanks, Ian
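For reference, ToolRunner hands the argument array to GenericOptionsParser, which pulls out generic options such as -D before the application sees the rest. As a rough stand-in for that step (a simplified illustration in plain Java, not Hadoop's actual parser), the -D sweep amounts to:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DashDParser {
    // Collect "-D key=value" pairs into a map; pass everything else through.
    // A simplified stand-in for what GenericOptionsParser does with -D.
    public static Map<String, String> parse(String[] args, List<String> remaining) {
        Map<String, String> conf = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-D") && i + 1 < args.length) {
                String[] kv = args[++i].split("=", 2);
                conf.put(kv[0], kv.length > 1 ? kv[1] : "");
            } else {
                remaining.add(args[i]);
            }
        }
        return conf;
    }

    public static void main(String[] args) {
        List<String> rest = new ArrayList<>();
        Map<String, String> conf = parse(
            new String[] {"-D", "prise.collopts=collopts", "input", "output"}, rest);
        System.out.println(conf); // {prise.collopts=collopts}
        System.out.println(rest); // [input, output]
    }
}
```

One thing worth checking against the command above: with bin/hadoop jar, only the arguments after the jar and main-class name reach the program's main() and hence ToolRunner. Generic options placed before the jar name are never seen by GenericOptionsParser, which could produce exactly this symptom.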
Re: Subdirectory question revisited
Here's how I solved the problem using a custom InputFormat... the key part is in listStatus(), where we traverse the directory tree. Since HDFS doesn't have links this code is probably safe, but if you have a filesystem with cycles you will get trapped. Ian

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ArrayDeque;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.InvalidInputException;
import org.apache.hadoop.mapred.LineRecordReader;

public class TrecWebInputFormat extends FileInputFormat<DocLocation, Text> {
  @Override
  public boolean isSplitable(FileSystem fs, Path filename) {
    return false;
  }

  @Override
  public RecordReader<DocLocation, Text> getRecordReader(InputSplit split, JobConf job,
      Reporter reporter) throws IOException {
    return new TrecWebRecordReader(job, (FileSplit) split);
  }

  // The following are incomprehensibly private in FileInputFormat...
  private static final PathFilter hiddenFileFilter = new PathFilter() {
    public boolean accept(Path p) {
      String name = p.getName();
      return !name.startsWith("_") && !name.startsWith(".");
    }
  };

  /**
   * Proxy PathFilter that accepts a path only if all filters given in the
   * constructor do. Used by listPaths() to apply the built-in
   * hiddenFileFilter together with a user-provided one (if any).
   */
  private static class MultiPathFilter implements PathFilter {
    private List<PathFilter> filters;

    public MultiPathFilter(List<PathFilter> filters) {
      this.filters = filters;
    }

    public boolean accept(Path path) {
      for (PathFilter filter : filters) {
        if (!filter.accept(path)) {
          return false;
        }
      }
      return true;
    }
  }

  @Override
  protected FileStatus[] listStatus(JobConf job) throws IOException {
    Path[] dirs = getInputPaths(job);
    if (dirs.length == 0) {
      throw new IOException("No input paths specified in job");
    }

    List<FileStatus> result = new ArrayList<FileStatus>();
    List<IOException> errors = new ArrayList<IOException>();
    ArrayDeque<FileStatus> stats = new ArrayDeque<FileStatus>(dirs.length);

    // creates a MultiPathFilter with the hiddenFileFilter and the
    // user provided one (if any).
    List<PathFilter> filters = new ArrayList<PathFilter>();
    filters.add(hiddenFileFilter);
    PathFilter jobFilter = getInputPathFilter(job);
    if (jobFilter != null) {
      filters.add(jobFilter);
    }
    PathFilter inputFilter = new MultiPathFilter(filters);

    // Set up traversal from input paths, which may be globs
    for (Path p : dirs) {
      FileSystem fs = p.getFileSystem(job);
      FileStatus[] matches = fs.globStatus(p, inputFilter);
      if (matches == null) {
        errors.add(new IOException("Input path does not exist: " + p));
      } else if (matches.length == 0) {
        errors.add(new IOException("Input Pattern " + p + " matches 0 files"));
      } else {
        for (FileStatus globStat : matches) {
          stats.add(globStat);
        }
      }
    }

    // Depth-first traversal: pop a status; push the children of
    // directories, collect plain files.
    while (!stats.isEmpty()) {
      FileStatus stat = stats.pop();
      if (stat.isDir()) {
        FileSystem fs = stat.getPath().getFileSystem(job);
        for (FileStatus sub : fs.listStatus(stat.getPath(), inputFilter)) {
          stats.push(sub);
        }
      } else {
        result.add(stat);
      }
    }

    if (!errors.isEmpty()) {
      throw new InvalidInputException(errors);
    }
    LOG.info("Total input paths to process : " + result.size());
    return result.toArray(new FileStatus[result.size()]);
  }

  public static class TrecWebRecordReader implements RecordReader<DocLocation, Text> {
    private CompressionCodecFactory compressionCodecs = null;
Re: *.gz input files
If your case is like mine, where you have lots of .gz files and you don't want splits in the middle of those files, you can use the code I just sent in the thread about traversing subdirectories. In brief, your RecordReader could do something like:

public static class MyRecordReader implements RecordReader<DocLocation, Text> {
  private CompressionCodecFactory compressionCodecs = null;
  private long start;
  private long end;
  private long pos;
  private Path file;
  private LineRecordReader.LineReader in;

  public MyRecordReader(JobConf job, FileSplit split) throws IOException {
    file = split.getPath();
    start = 0;
    end = split.getLength();
    compressionCodecs = new CompressionCodecFactory(job);
    CompressionCodec codec = compressionCodecs.getCodec(file);
    FileSystem fs = file.getFileSystem(job);
    FSDataInputStream fileIn = fs.open(file);
    if (codec != null) {
      in = new LineRecordReader.LineReader(codec.createInputStream(fileIn), job);
    } else {
      in = new LineRecordReader.LineReader(fileIn, job);
    }
    pos = 0;
  }

Alex Loddengaard a...@cloudera.com writes: Hi Adam, Gzipped files don't play that nicely with Hadoop, because they aren't splittable. Can you use bzip2 instead? bzip2 files play more nicely with Hadoop, because they're splittable. If you're stuck with gzip, then take a look here: http://issues.apache.org/jira/browse/HADOOP-437. I don't know if you'll have to set the same JobConf parameter in newer versions of Hadoop, but it's worth trying out. Hope this helps. Alex

On Wed, Jun 3, 2009 at 11:50 AM, Adam Silberstein silbe...@yahoo-inc.com wrote: Hi, I have some hadoop code that works properly when the input files are not compressed, but it is not working for the gzipped versions of those files. My files are named with *.gz, but the format is not being recognized. I'm under the impression I don't need to set any JobConf parameters to indicate compressed input.
I'm actually taking a directory name as input, and modeled that aspect of my application after the MultiFileWordCount.java example in org.apache.hadoop.examples. Not sure if this is part of the problem. Thanks, Adam
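For completeness, the suffix-driven codec selection that CompressionCodecFactory performs can be illustrated with nothing but the JDK. This sketch (file names invented) wraps the raw stream in a GZIPInputStream when the name ends in .gz, which is essentially what the record reader above does with Hadoop's codec classes:

```java
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class SuffixAwareReader {
    // Open a text stream, decompressing if the file name ends in .gz --
    // the same suffix-based lookup CompressionCodecFactory performs.
    public static BufferedReader open(String path) throws IOException {
        InputStream raw = new FileInputStream(path);
        InputStream in = path.endsWith(".gz") ? new GZIPInputStream(raw) : raw;
        return new BufferedReader(new InputStreamReader(in, "UTF-8"));
    }

    public static void main(String[] args) throws IOException {
        // Round-trip a small gzipped file to show the reader is transparent.
        File f = File.createTempFile("demo", ".gz");
        f.deleteOnExit();
        try (Writer w = new OutputStreamWriter(
                new GZIPOutputStream(new FileOutputStream(f)), "UTF-8")) {
            w.write("hello\nworld\n");
        }
        try (BufferedReader r = open(f.getPath())) {
            System.out.println(r.readLine()); // hello
            System.out.println(r.readLine()); // world
        }
    }
}
```

Note the decompressed stream is sequential, which is why a gzipped file cannot be split: there is no way to start reading in the middle.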
Command-line jobConf options in 0.18.3
I'm backporting some code I wrote for 0.19.1 to 0.18.3 (long story), and I'm finding that when I run a job and try to pass options with -D on the command line, that the option values aren't showing up in my JobConf. I logged all the key/value pairs in the JobConf, and the option I passed through with -D isn't there. This worked in 0.19.1... did something change with command-line options from 18 to 19? Thanks, Ian
Re: Command-line jobConf options in 0.18.3
Yes, and I get the JobConf via 'JobConf job = new JobConf(conf, the.class)'. The conf is the Configuration object that comes from getConf. Pretty much copied from the WordCount example (which this program used to be a long while back...) thanks, Ian

On Jun 3, 2009, at 7:09 PM, Aaron Kimball wrote: Are you running your program via ToolRunner.run()? How do you instantiate the JobConf object? - Aaron

On Wed, Jun 3, 2009 at 10:19 AM, Ian Soboroff ian.sobor...@nist.gov wrote: I'm backporting some code I wrote for 0.19.1 to 0.18.3 (long story), and I'm finding that when I run a job and try to pass options with -D on the command line, the option values aren't showing up in my JobConf. I logged all the key/value pairs in the JobConf, and the option I passed through with -D isn't there. This worked in 0.19.1... did something change with command-line options from 18 to 19? Thanks, Ian
Re: Command-line jobConf options in 0.18.3
If after I call getConf to get the conf object, I manually add the key/value pair, it's there when I need it. So it feels like ToolRunner isn't parsing my args for some reason. Ian

On Jun 3, 2009, at 8:45 PM, Ian Soboroff wrote: Yes, and I get the JobConf via 'JobConf job = new JobConf(conf, the.class)'. The conf is the Configuration object that comes from getConf. Pretty much copied from the WordCount example (which this program used to be a long while back...) thanks, Ian

On Jun 3, 2009, at 7:09 PM, Aaron Kimball wrote: Are you running your program via ToolRunner.run()? How do you instantiate the JobConf object? - Aaron

On Wed, Jun 3, 2009 at 10:19 AM, Ian Soboroff ian.sobor...@nist.gov wrote: I'm backporting some code I wrote for 0.19.1 to 0.18.3 (long story), and I'm finding that when I run a job and try to pass options with -D on the command line, the option values aren't showing up in my JobConf. I logged all the key/value pairs in the JobConf, and the option I passed through with -D isn't there. This worked in 0.19.1... did something change with command-line options from 18 to 19? Thanks, Ian
Task files in _temporary not getting promoted out
Ok, help. I am trying to create local task outputs in my reduce job, and they get created, then go poof when the job's done.

My first take was to use FileOutputFormat.getWorkOutputPath, and create directories in there for my outputs (which are Lucene indexes). Exasperated, I then wrote a small OutputFormat/RecordWriter pair to write the indexes. In each case, I can see directories being created in attempt_foo/_temporary, but when the task is over they're gone. I've stared at TextOutputFormat and I can't figure out why its files survive and mine don't. Help!

Again, this is 0.18.3. Thanks, Ian
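For anyone hitting the same wall: files under the temporary attempt directory are promoted to the job output directory only when the framework commits the task attempt; whatever the commit pass does not move is deleted along with the attempt directory. The promote step itself is simple. Here is a plain-JDK sketch of the idea (an illustration of the mechanism, not Hadoop's actual FileOutputCommitter; the directory layout and attempt id are made up):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

public class CommitSketch {
    // Move everything under outputDir/_temporary/attemptId up into outputDir,
    // then remove the temporary area -- roughly what committing a task does.
    public static void commitTask(Path outputDir, String attemptId) throws IOException {
        Path attemptDir = outputDir.resolve("_temporary").resolve(attemptId);
        if (!Files.isDirectory(attemptDir)) return; // nothing to promote
        try (Stream<Path> entries = Files.list(attemptDir)) {
            for (Path p : (Iterable<Path>) entries::iterator) {
                // Promote each file or subdirectory wholesale.
                Files.move(p, outputDir.resolve(p.getFileName()),
                           StandardCopyOption.REPLACE_EXISTING);
            }
        }
        Files.delete(attemptDir); // now empty
    }
}
```

One common cause of vanishing side files is writing them anywhere other than under the path the framework reports as the task's work directory: the commit pass never sees them, so they go away with the cleanup, which matches the symptom above.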
Re: hadoop hardware configuration
Brian Bockelman bbock...@cse.unl.edu writes: Despite my trying, I've never been able to come even close to pegging the CPUs on our NN. I'd recommend going for the fastest dual-cores which are affordable -- latency is king. Clue? Surely the latencies in Hadoop that dominate are not cured with faster processors, but with more RAM and faster disks? I've followed your posts for a while, so I know you are very experienced with this stuff... help me out here. Ian
Re: RPM spec file for 0.19.1
Simon Lewis si...@lewis.li writes: On 3 Apr 2009, at 15:11, Ian Soboroff wrote: Steve Loughran ste...@apache.org writes: I think from your perspective it makes sense as it stops anyone getting itchy fingers and doing their own RPMs. Um, what's wrong with that? I would certainly like the ability to build RPMs from a source checkout, anyone thought of putting a standard spec file in with the source somewhere?

Another vote for a .spec file to be included in the standard distribution as a contrib. If it's ok with Cloudera (since my spec file just came from them), I will edit my JIRA to offer that proposal. If it's Cloudera's spec that's included, we should also include the init.d script templates (which are already Apache licensed). Ian
Re: RPM spec file for 0.19.1
If you guys want to spin RPMs for the community, that's great. My main motivation was that I wanted the current version rather than 0.18.3.

There is of course (as Steve points out) a larger discussion about if you want RPMs, what should be in them. In particular, some might want to include the configuration in the RPMs. That's a good reason to post SRPMs, because then it's not so hard to re-roll the RPMs with different configurations. (Personally I wouldn't manage configs with RPM, it's just a pain to propagate changes. Instead, we are looking at using Puppet for general cluster configuration needs, and RPMs for the basic binaries.) Ian

Christophe Bisciglia christo...@cloudera.com writes: Hey Ian, we are totally fine with this - the only reason we didn't contribute the SPEC file is that it is the output of our internal build system, and we don't have the bandwidth to properly maintain multiple RPMs. That said, we chatted about this a bit today, and were wondering if the community would like us to host RPMs for all releases in our devel repository. We can't stand behind these from a reliability angle the same way we can with our blessed RPMs, but it's a manageable amount of additional work to have our build system spit those out as well. If you'd like us to do this, please add a "me too" to this page: http://www.getsatisfaction.com/cloudera/topics/should_we_release_host_rpms_for_all_releases We could even skip the branding on the devel releases :-) Cheers, Christophe

On Thu, Apr 2, 2009 at 12:46 PM, Ian Soboroff ian.sobor...@nist.gov wrote: I created a JIRA (https://issues.apache.org/jira/browse/HADOOP-5615) with a spec file for building a 0.19.1 RPM. I like the idea of Cloudera's RPM file very much. In particular, it has nifty /etc/init.d scripts and RPM is nice for managing updates. However, it's for an older, patched version of Hadoop. This spec file is actually just Cloudera's, with suitable edits. The spec file does not contain an explicit license... if Cloudera have strong feelings about it, let me know and I'll pull the JIRA attachment. The JIRA includes instructions on how to roll the RPMs yourself. I would have attached the SRPM but they're too big for JIRA. I can offer noarch RPMs built with this spec file if someone wants to host them. Ian
Re: RPM spec file for 0.19.1
Steve Loughran ste...@apache.org writes: I think from your perspective it makes sense as it stops anyone getting itchy fingers and doing their own RPMs. Um, what's wrong with that? Ian
Re: RPM spec file for 0.19.1
Steve Loughran ste...@apache.org writes: -RPM and deb packaging would be nice Indeed. The best thing would be to have the hadoop build system output them, for some sensible subset of systems.

-the jdk requirements are too harsh as it should run on openjdk's JRE or jrockit; no need for sun only. Too bad the only way to say that is leave off all jdk dependencies. I haven't tried running Hadoop on anything but the Sun JDK, much less built it from source (well, the rpmbuild did that so I guess I have).

-I worry about how they patch the rc.d files. I can see why, but wonder what that does with the RPM ownership Those are just fine: (from hadoop-init.tmpl)

#!/bin/bash
#
# (c) Copyright 2009 Cloudera, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
...

Ian
Re: swap hard drives between datanodes
Or if you have a node blow a motherboard but the disks are fine... Ian On Mar 30, 2009, at 10:03 PM, Mike Andrews wrote: i tried swapping two hot-swap sata drives between two nodes in a cluster, but it didn't work: after restart, one of the datanodes shut down since namenode said it reported a block belonging to another node, which i guess namenode thinks is a fatal error. is this caused by the hadoop/datanode/current/VERSION file having the IP address and other ID information of the datanode hard-coded? it'd be great to be able to do a manual gross cluster rebalance by just physically swapping hard drives, but seems like this is not possible in the current version 0.18.3. -- permanent contact information at http://mikerandrews.com
Re: Creating Lucene index in Hadoop
I understand why you would index in the reduce phase, because the anchor text gets shuffled to be next to the document. However, when you index in the map phase, don't you just have to reindex later? The main point to the OP is that HDFS is a bad FS for writing Lucene indexes because of how Lucene works. The simple approach is to write your index outside of HDFS in the reduce phase, and then merge the indexes from each reducer manually. Ian Ning Li ning.li...@gmail.com writes: Or you can check out the index contrib. The difference of the two is that: - In Nutch's indexing map/reduce job, indexes are built in the reduce phase. Afterwards, they are merged into smaller number of shards if necessary. The last time I checked, the merge process does not use map/reduce. - In contrib/index, small indexes are built in the map phase. They are merged into the desired number of shards in the reduce phase. In addition, they can be merged into existing shards. Cheers, Ning On Fri, Mar 13, 2009 at 1:34 AM, 王红宝 imcap...@126.com wrote: you can see the nutch code. 2009/3/13 Mark Kerzner markkerz...@gmail.com Hi, How do I allow multiple nodes to write to the same index file in HDFS? Thank you, Mark
Re: Creating Lucene index in Hadoop
Does anyone have stats on how multiple readers on an optimized Lucene index in HDFS compare with a ParallelMultiReader (or whatever it's called) over RPC on a local filesystem? I'm missing why you would ever want the Lucene index in HDFS for reading. Ian

Ning Li ning.li...@gmail.com writes: I should have pointed out that Nutch index build and contrib/index target different applications. The latter is for applications who simply want to build Lucene index from a set of documents - e.g. no link analysis. As to writing Lucene indexes, both work the same way - write the final results to local file system and then copy to HDFS. In contrib/index, the intermediate results are in memory and not written to HDFS. Hope it clarifies things. Cheers, Ning

On Mon, Mar 16, 2009 at 2:57 PM, Ian Soboroff ian.sobor...@nist.gov wrote: I understand why you would index in the reduce phase, because the anchor text gets shuffled to be next to the document. However, when you index in the map phase, don't you just have to reindex later? The main point to the OP is that HDFS is a bad FS for writing Lucene indexes because of how Lucene works. The simple approach is to write your index outside of HDFS in the reduce phase, and then merge the indexes from each reducer manually. Ian

Ning Li ning.li...@gmail.com writes: Or you can check out the index contrib. The difference of the two is that: - In Nutch's indexing map/reduce job, indexes are built in the reduce phase. Afterwards, they are merged into smaller number of shards if necessary. The last time I checked, the merge process does not use map/reduce. - In contrib/index, small indexes are built in the map phase. They are merged into the desired number of shards in the reduce phase. In addition, they can be merged into existing shards. Cheers, Ning

On Fri, Mar 13, 2009 at 1:34 AM, 王红宝 imcap...@126.com wrote: you can see the nutch code.
2009/3/13 Mark Kerzner markkerz...@gmail.com Hi, How do I allow multiple nodes to write to the same index file in HDFS? Thank you, Mark
Re: Hadoop job using multiple input files
Amandeep Khurana ama...@gmail.com writes: Is it possible to write a map reduce job using multiple input files? For example: File 1 has data like - Name, Number File 2 has data like - Number, Address Using these, I want to create a third file which has something like - Name, Address How can a map reduce job be written to do this? Have one map job read both files in sequence, and map them to (number, name or address). Then reduce on number. Ian
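The suggestion above is a reduce-side join: map both files to records keyed by the shared number, tag each value with which file it came from, and pair the name with the address in the reduce. A plain-Java simulation of that shuffle-and-reduce (toy data, no Hadoop dependencies) looks like:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceJoin {
    // "Map" both inputs to (number -> tagged value), as two mappers would,
    // then "reduce" per number by pairing the name with the address.
    public static Map<String, String> join(Map<String, String> nameByNumber,
                                           Map<String, String> addressByNumber) {
        // Shuffle: group tagged values by the join key.
        Map<String, List<String>> grouped = new TreeMap<>();
        nameByNumber.forEach((num, name) ->
            grouped.computeIfAbsent(num, k -> new ArrayList<>()).add("N:" + name));
        addressByNumber.forEach((num, addr) ->
            grouped.computeIfAbsent(num, k -> new ArrayList<>()).add("A:" + addr));

        // Reduce: emit (name -> address) when both sides met at the key.
        Map<String, String> out = new LinkedHashMap<>();
        for (List<String> vals : grouped.values()) {
            String name = null, addr = null;
            for (String v : vals) {
                if (v.startsWith("N:")) name = v.substring(2);
                else addr = v.substring(2);
            }
            if (name != null && addr != null) out.put(name, addr);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> names = Map.of("17", "Alice", "42", "Bob");
        Map<String, String> addrs = Map.of("17", "1 Main St", "42", "2 Oak Ave");
        System.out.println(join(names, addrs)); // {Alice=1 Main St, Bob=2 Oak Ave}
    }
}
```

In a real job the tag would travel with the value (or in a composite key), and the framework's shuffle would do the grouping; the per-key pairing logic in the reducer is the same.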
Re: Regarding Hadoop multi cluster set-up
I would love to see someplace a complete list of the ports that the various Hadoop daemons expect to have open. Does anyone have that? Ian

On Feb 4, 2009, at 1:16 PM, shefali pawar wrote: Hi, I will have to check. I can do that tomorrow in college. But if that is the case what should i do? Should i change the port number and try again? Shefali

On Wed, 04 Feb 2009 S D wrote: Shefali, Is your firewall blocking port 54310 on the master? John

On Wed, Feb 4, 2009 at 12:34 PM, shefali pawar shefal...@rediffmail.com wrote: Hi, I am trying to set-up a two node cluster using Hadoop 0.19.0, with 1 master (which should also work as a slave) and 1 slave node. But while running bin/start-dfs.sh the datanode is not starting on the slave. I had read the previous mails on the list, but nothing seems to be working in this case. I am getting the following error in the hadoop-root-datanode-slave log file while running the command bin/start-dfs.sh:

2009-02-03 13:00:27,516 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
/
STARTUP_MSG: Starting DataNode
STARTUP_MSG: host = slave/172.16.0.32
STARTUP_MSG: args = []
STARTUP_MSG: version = 0.19.0
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19 -r 713890; compiled by 'ndaley' on Fri Nov 14 03:12:29 UTC 2008
/
2009-02-03 13:00:28,725 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/172.16.0.46:54310. Already tried 0 time(s).
2009-02-03 13:00:29,726 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/172.16.0.46:54310. Already tried 1 time(s).
2009-02-03 13:00:30,727 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/172.16.0.46:54310. Already tried 2 time(s).
2009-02-03 13:00:31,728 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/172.16.0.46:54310. Already tried 3 time(s).
2009-02-03 13:00:32,729 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/172.16.0.46:54310. Already tried 4 time(s).
2009-02-03 13:00:33,730 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/172.16.0.46:54310. Already tried 5 time(s).
2009-02-03 13:00:34,731 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/172.16.0.46:54310. Already tried 6 time(s).
2009-02-03 13:00:35,732 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/172.16.0.46:54310. Already tried 7 time(s).
2009-02-03 13:00:36,733 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/172.16.0.46:54310. Already tried 8 time(s).
2009-02-03 13:00:37,734 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/172.16.0.46:54310. Already tried 9 time(s).
2009-02-03 13:00:37,738 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Call to master/172.16.0.46:54310 failed on local exception: No route to host
    at org.apache.hadoop.ipc.Client.call(Client.java:699)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
    at $Proxy4.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:306)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:343)
    at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:288)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:258)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.init(DataNode.java:205)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1199)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1154)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1162)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1284)
Caused by: java.net.NoRouteToHostException: No route to host
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
    at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:100)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:299)
    at org.apache.hadoop.ipc.Client$Connection.access$1700(Client.java:176)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:772)
    at org.apache.hadoop.ipc.Client.call(Client.java:685)
    ... 12 more
2009-02-03 13:00:37,739 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/
SHUTDOWN_MSG: Shutting down DataNode at slave/172.16.0.32
/

Also, the Pseudo distributed operation is working on both the machines. And i am able to
Re: FileInputFormat directory traversal
Hmm. Based on your reasons, an extension to FileInputFormat for the lib package seems more in order. I'll try to hack something up and file a Jira issue. Ian On Feb 3, 2009, at 4:28 PM, Doug Cutting wrote: Hi, Ian. One reason is that a MapFile is represented by a directory containing two files named index and data. SequenceFileInputFormat handles MapFiles too by, if an input file is a directory containing a data file, using that file. Another reason is that's what reduces generate. Neither reason implies that this is the best or only way of doing things. It would probably be better if FileInputFormat optionally supported recursive file enumeration. (It would be incompatible and thus cannot be the default mode.) Please file an issue in Jira for this and attach your patch. Thanks, Doug Ian Soboroff wrote: Is there a reason FileInputFormat only traverses the first level of directories in its InputPaths? (i.e., given an InputPath of 'foo', it will get foo/* but not foo/bar/*). I wrote a full depth-first traversal in my custom InputFormat which I can offer as a patch. But to do it I had to duplicate the PathFilter classes in FileInputFormat which are marked private, so a mainline patch would also touch FileInputFormat. Ian
FileInputFormat directory traversal
Is there a reason FileInputFormat only traverses the first level of directories in its InputPaths? (i.e., given an InputPath of 'foo', it will get foo/* but not foo/bar/*). I wrote a full depth-first traversal in my custom InputFormat which I can offer as a patch. But to do it I had to duplicate the PathFilter classes in FileInputFormat which are marked private, so a mainline patch would also touch FileInputFormat. Ian
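The traversal in question is just an explicit-stack depth-first walk, and nothing about it is HDFS-specific. Here is the same pattern against java.nio.file (a plain-JDK illustration, not the Hadoop patch itself), including hidden-file filtering like FileInputFormat's; by not following symlinks it also avoids the cycle trap mentioned elsewhere in this thread:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.stream.Stream;

public class DeepList {
    // Depth-first listing of all regular files under the given roots,
    // skipping hidden entries (names starting with "." or "_"),
    // mirroring FileInputFormat's hiddenFileFilter. Assumes named roots
    // (not "/"). Symlinked directories are not descended into.
    public static List<Path> listFiles(Path... roots) throws IOException {
        List<Path> result = new ArrayList<>();
        Deque<Path> stack = new ArrayDeque<>(Arrays.asList(roots));
        while (!stack.isEmpty()) {
            Path p = stack.pop();
            String name = p.getFileName().toString();
            if (name.startsWith(".") || name.startsWith("_")) continue;
            if (Files.isDirectory(p, LinkOption.NOFOLLOW_LINKS)) {
                try (Stream<Path> children = Files.list(p)) {
                    children.forEach(stack::push);
                }
            } else if (Files.isRegularFile(p)) {
                result.add(p);
            }
        }
        return result;
    }
}
```

The Hadoop version differs only in vocabulary: FileStatus for Path, FileSystem.listStatus for Files.list, and a PathFilter in place of the inline name checks.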
My tasktrackers keep getting lost...
I hope someone can help me out. I'm getting started with Hadoop, have written the first part of my project (a custom InputFormat), and am now using that to test out my cluster setup. I'm running 0.19.0. I have five dual-core Linux workstations with most of a 250GB disk available for playing, and am controlling things from my Mac Pro. (This is not the production cluster, that hasn't been assembled yet. This is just to get the code working and figure out the bumps.)

My test data is about 18GB of web pages, and the test app at the moment just counts the number of web pages in each bundle file. The map jobs run just fine, but when it gets into the reduce, the TaskTrackers all get lost to the JobTracker. I can't see why, because the TaskTrackers are all still running on the slaves. Also, the jobdetails URL starts returning an HTTP 500 error, although other links from that page still work.

I've tried going onto the slaves and manually restarting the tasktrackers with hadoop-daemon.sh, and also turning on job restarting in the site conf and then running stop-mapred/start-mapred. The trackers start up and try to clean up and get going again, but they then just get lost again. Here's some error output from the master jobtracker:

2009-02-02 13:39:40,904 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_200902021252_0002_r_05_1' from 'tracker_darling:localhost.localdomain/127.0.0.1:58336'
2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: attempt_200902021252_0002_m_004592_1 is 796370 ms debug.
2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: Launching task attempt_200902021252_0002_m_004592_1 timed out.
2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: attempt_200902021252_0002_m_004582_1 is 794199 ms debug.
2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: Launching task attempt_200902021252_0002_m_004582_1 timed out.
2009-02-02 13:41:22,271 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_cheyenne:localhost.localdomain/127.0.0.1:52769'; resending the previous 'lost' response
2009-02-02 13:41:22,272 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_tigris:localhost.localdomain/127.0.0.1:52808'; resending the previous 'lost' response
2009-02-02 13:41:22,272 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_monocacy:localhost.localdomain/127.0.0.1:54464'; resending the previous 'lost' response
2009-02-02 13:41:22,298 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_129.6.101.41:127.0.0.1/127.0.0.1:58744'; resending the previous 'lost' response
2009-02-02 13:41:22,421 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_rhone:localhost.localdomain/127.0.0.1:45749'; resending the previous 'lost' response
2009-02-02 13:41:22,421 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 54311 caught: java.lang.NullPointerException
    at org.apache.hadoop.mapred.MapTask.write(MapTask.java:123)
    at org.apache.hadoop.mapred.LaunchTaskAction.write(LaunchTaskAction.java:48)
    at org.apache.hadoop.mapred.HeartbeatResponse.write(HeartbeatResponse.java:101)
    at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:159)
    at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:70)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:907)
2009-02-02 13:41:27,275 WARN org.apache.hadoop.mapred.JobTracker: Status from unknown Tracker : tracker_monocacy:localhost.localdomain/127.0.0.1:54464

And from a slave:

2009-02-02 13:26:39,440 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 129.6.101.18:50060, dest: 129.6.101.12:37304, bytes: 6, op: MAPRED_SHUFFLE, cliID: attempt_200902021252_0002_m_000111_0
2009-02-02 13:41:40,165 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception:
java.io.IOException: Call to rogue/129.6.101.41:54311 failed on local exception: null
    at org.apache.hadoop.ipc.Client.call(Client.java:699)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
    at org.apache.hadoop.mapred.$Proxy4.heartbeat(Unknown Source)
    at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1164)
    at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:997)
    at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1678)
    at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2698)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:180)
    at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
    at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
    at org.apache.hadoop.io.UTF8.readChars(UTF8.java:211)
    at