0.20.0 mapreduce package documentation

2009-06-05 Thread Ian Soboroff

I just started playing with 0.20.0.  I see that the mapred package is
deprecated in favor of the mapreduce package.  Is there any
migration documentation for the new API (i.e., something more touristy
than Javadoc)?  All the website docs and Wiki examples are on the old API.
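
For concreteness, here is roughly what a driver and mapper look like under the
new API, as far as I can piece together from the 0.20.0 Javadoc (the class
names and the word-count-style mapper below are just my own illustration, not
an official example):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NewApiExample {

    // The new Mapper is a class with a Context, replacing OutputCollector/Reporter.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String tok : value.toString().split("\\s+")) {
                context.write(new Text(tok), ONE);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "new-api example");   // Job replaces JobConf here
        job.setJarByClass(NewApiExample.class);
        job.setMapperClass(TokenMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}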

Sorry if this is on the mailing list... I searched a bit but came up
dry...

In the same vein, it would be nice if release notes could have some
narrative to go along with the random assortment of JIRA issue numbers.
Especially when major API migration is in the works.

Ian


Re: Command-line jobConf options in 0.18.3

2009-06-04 Thread Ian Soboroff

bin/hadoop jar -files collopts -D prise.collopts=collopts p3l-3.5.jar gov.nist.nlpir.prise.mapred.MapReduceIndexer input output

The 'prise.collopts' option doesn't appear in the JobConf.

Ian

Aaron Kimball aa...@cloudera.com writes:

 Can you give an example of the exact arguments you're sending on the command
 line?
 - Aaron

 On Wed, Jun 3, 2009 at 5:46 PM, Ian Soboroff ian.sobor...@nist.gov wrote:

 If after I call getConf to get the conf object, I manually add the key/
 value pair, it's there when I need it.  So it feels like ToolRunner isn't
 parsing my args for some reason.

 Ian

 On Jun 3, 2009, at 8:45 PM, Ian Soboroff wrote:

 Yes, and I get the JobConf via 'JobConf job = new JobConf(conf,
 the.class)'.  The conf is the Configuration object that comes from
 getConf.  Pretty much copied from the WordCount example (which this
 program used to be a long while back...)

 thanks,
 Ian

 On Jun 3, 2009, at 7:09 PM, Aaron Kimball wrote:

 Are you running your program via ToolRunner.run()? How do you
 instantiate the JobConf object?
 - Aaron

 On Wed, Jun 3, 2009 at 10:19 AM, Ian Soboroff 
 ian.sobor...@nist.gov wrote:
 I'm backporting some code I wrote for 0.19.1 to 0.18.3 (long
 story), and I'm finding that when I run a job and try to pass
 options with -D on the command line, that the option values aren't
 showing up in my JobConf.  I logged all the key/value pairs in the
 JobConf, and the option I passed through with -D isn't there.

 This worked in 0.19.1... did something change with command-line
 options from 18 to 19?

 Thanks,
 Ian


Re: Subdirectory question revisited

2009-06-04 Thread Ian Soboroff

Here's how I solved the problem using a custom InputFormat... the key
part is in listStatus(), where we traverse the directory tree.  Since
HDFS doesn't have links this code is probably safe, but if you have a
filesystem with cycles you will get trapped.

Ian

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ArrayDeque;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.InvalidInputException;
import org.apache.hadoop.mapred.LineRecordReader;

public class TrecWebInputFormat extends FileInputFormat<DocLocation, Text> {
    private static final Log LOG = LogFactory.getLog(TrecWebInputFormat.class);

    @Override
    public boolean isSplitable(FileSystem fs, Path filename) {
        return false;
    }

    @Override
    public RecordReader<DocLocation, Text>
        getRecordReader(InputSplit split, JobConf job, Reporter reporter)
        throws IOException {
        return new TrecWebRecordReader(job, (FileSplit) split);
    }

    // The following are incomprehensibly private in FileInputFormat...
    private static final PathFilter hiddenFileFilter = new PathFilter() {
        public boolean accept(Path p) {
            String name = p.getName();
            return !name.startsWith("_") && !name.startsWith(".");
        }
    };

    /**
     * Proxy PathFilter that accepts a path only if all filters given in the
     * constructor do. Used by the listPaths() to apply the built-in
     * hiddenFileFilter together with a user provided one (if any).
     */
    private static class MultiPathFilter implements PathFilter {
        private List<PathFilter> filters;

        public MultiPathFilter(List<PathFilter> filters) {
            this.filters = filters;
        }

        public boolean accept(Path path) {
            for (PathFilter filter : filters) {
                if (!filter.accept(path)) {
                    return false;
                }
            }
            return true;
        }
    }


    @Override
    protected FileStatus[] listStatus(JobConf job) throws IOException {
        Path[] dirs = getInputPaths(job);
        if (dirs.length == 0) {
            throw new IOException("No input paths specified in job");
        }

        List<FileStatus> result = new ArrayList<FileStatus>();
        List<IOException> errors = new ArrayList<IOException>();
        ArrayDeque<FileStatus> stats = new ArrayDeque<FileStatus>(dirs.length);

        // creates a MultiPathFilter with the hiddenFileFilter and the
        // user provided one (if any).
        List<PathFilter> filters = new ArrayList<PathFilter>();
        filters.add(hiddenFileFilter);
        PathFilter jobFilter = getInputPathFilter(job);
        if (jobFilter != null) {
            filters.add(jobFilter);
        }
        PathFilter inputFilter = new MultiPathFilter(filters);

        // Set up traversal from input paths, which may be globs
        for (Path p : dirs) {
            FileSystem fs = p.getFileSystem(job);
            FileStatus[] matches = fs.globStatus(p, inputFilter);
            if (matches == null) {
                errors.add(new IOException("Input path does not exist: " + p));
            } else if (matches.length == 0) {
                errors.add(new IOException("Input Pattern " + p + " matches 0 files"));
            } else {
                for (FileStatus globStat : matches) {
                    stats.add(globStat);
                }
            }
        }

        while (!stats.isEmpty()) {
            FileStatus stat = stats.pop();
            if (stat.isDir()) {
                FileSystem fs = stat.getPath().getFileSystem(job);
                for (FileStatus sub : fs.listStatus(stat.getPath(),
                                                    inputFilter)) {
                    stats.push(sub);
                }
            } else {
                result.add(stat);
            }
        }

        if (!errors.isEmpty()) {
            throw new InvalidInputException(errors);
        }
        LOG.info("Total input paths to process : " + result.size());
        return result.toArray(new FileStatus[result.size()]);
    }


    public static class TrecWebRecordReader
        implements RecordReader<DocLocation, Text> {
        private CompressionCodecFactory compressionCodecs = null;

Re: *.gz input files

2009-06-04 Thread Ian Soboroff

If your case is like mine, where you have lots of .gz files and you
don't want splits in the middle of those files, you can use the code I
just sent in the thread about traversing subdirectories.  In brief, your
RecordReader could do something like:

public static class MyRecordReader
    implements RecordReader<DocLocation, Text> {
    private CompressionCodecFactory compressionCodecs = null;
    private long start;
    private long end;
    private long pos;
    private Path file;
    private LineRecordReader.LineReader in;

    public MyRecordReader(JobConf job, FileSplit split)
        throws IOException {
        file = split.getPath();
        start = 0;
        end = split.getLength();
        compressionCodecs = new CompressionCodecFactory(job);
        CompressionCodec codec = compressionCodecs.getCodec(file);

        FileSystem fs = file.getFileSystem(job);
        FSDataInputStream fileIn = fs.open(file);

        if (codec != null) {
            in = new LineRecordReader.LineReader(codec.createInputStream(fileIn), job);
        } else {
            in = new LineRecordReader.LineReader(fileIn, job);
        }
        pos = 0;
    }



Alex Loddengaard a...@cloudera.com writes:

 Hi Adam,

 Gzipped files don't play that nicely with Hadoop, because they aren't
 splittable.  Can you use bzip2 instead?  bzip2 files play more nicely with
 Hadoop, because they're splittable.  If you're stuck with gzip, then take a
 look here: http://issues.apache.org/jira/browse/HADOOP-437.  I don't know
 if you'll have to set the same JobConf parameter in newer versions of
 Hadoop, but it's worth trying out.

 Hope this helps.

 Alex

 On Wed, Jun 3, 2009 at 11:50 AM, Adam Silberstein 
 silbe...@yahoo-inc.comwrote:

 Hi,

 I have some hadoop code that works properly when the input files are not
 compressed, but it is not working for the gzipped versions of those
 files.  My files are named with *.gz, but the format is not being
 recognized.  I'm under the impression I don't need to set any JobConf
 parameters to indicate compressed input.



 I'm actually taking a directory name as input, and modeled that aspect
 of my application after the MultiFileWordCount.java example in
 org.apache.hadoop.examples.  Not sure if this is part of the problem.



 Thanks,

 Adam




Command-line jobConf options in 0.18.3

2009-06-03 Thread Ian Soboroff
I'm backporting some code I wrote for 0.19.1 to 0.18.3 (long story),  
and I'm finding that when I run a job and try to pass options with -D  
on the command line, that the option values aren't showing up in my  
JobConf.  I logged all the key/value pairs in the JobConf, and the  
option I passed through with -D isn't there.


This worked in 0.19.1... did something change with command-line  
options from 18 to 19?


Thanks,
Ian
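
For reference, the driver follows the usual Tool/ToolRunner pattern, roughly
like this minimal sketch (0.18/0.19 mapred API; the class and job names are
made up).  ToolRunner is supposed to run GenericOptionsParser over the
arguments, so -D key=value pairs should already be in getConf() by the time
run() is called:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        // Build the JobConf from the already-parsed Configuration, as in WordCount.
        JobConf job = new JobConf(getConf(), MyDriver.class);
        job.setJobName("my driver");
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        JobClient.runJob(job);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips the generic options (-D, -files, ...) before calling run().
        int rc = ToolRunner.run(new Configuration(), new MyDriver(), args);
        System.exit(rc);
    }
}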



Re: Command-line jobConf options in 0.18.3

2009-06-03 Thread Ian Soboroff
Yes, and I get the JobConf via 'JobConf job = new JobConf(conf,  
the.class)'.  The conf is the Configuration object that comes from  
getConf.  Pretty much copied from the WordCount example (which this  
program used to be a long while back...)


thanks,
Ian

On Jun 3, 2009, at 7:09 PM, Aaron Kimball wrote:

Are you running your program via ToolRunner.run()? How do you  
instantiate the JobConf object?

- Aaron

On Wed, Jun 3, 2009 at 10:19 AM, Ian Soboroff  
ian.sobor...@nist.gov wrote:
I'm backporting some code I wrote for 0.19.1 to 0.18.3 (long story),  
and I'm finding that when I run a job and try to pass options with -D
on the command line, that the option values aren't showing up in
my JobConf.  I logged all the key/value pairs in the JobConf, and  
the option I passed through with -D isn't there.


This worked in 0.19.1... did something change with command-line  
options from 18 to 19?


Thanks,
Ian






Re: Command-line jobConf options in 0.18.3

2009-06-03 Thread Ian Soboroff
If after I call getConf to get the conf object, I manually add the key/ 
value pair, it's there when I need it.  So it feels like ToolRunner  
isn't parsing my args for some reason.


Ian

On Jun 3, 2009, at 8:45 PM, Ian Soboroff wrote:

Yes, and I get the JobConf via 'JobConf job = new JobConf(conf,  
the.class)'.  The conf is the Configuration object that comes from  
getConf.  Pretty much copied from the WordCount example (which this  
program used to be a long while back...)


thanks,
Ian

On Jun 3, 2009, at 7:09 PM, Aaron Kimball wrote:

Are you running your program via ToolRunner.run()? How do you  
instantiate the JobConf object?

- Aaron

On Wed, Jun 3, 2009 at 10:19 AM, Ian Soboroff  
ian.sobor...@nist.gov wrote:
I'm backporting some code I wrote for 0.19.1 to 0.18.3 (long  
story), and I'm finding that when I run a job and try to pass  
options with -D on the command line, that the option values aren't  
showing up in my JobConf.  I logged all the key/value pairs in the  
JobConf, and the option I passed through with -D isn't there.


This worked in 0.19.1... did something change with command-line  
options from 18 to 19?


Thanks,
Ian








Task files in _temporary not getting promoted out

2009-06-03 Thread Ian Soboroff
Ok, help.  I am trying to create local task outputs in my reduce job,  
and they get created, then go poof when the job's done.


My first take was to use FileOutputFormat.getWorkOutputPath, and  
create directories in there for my outputs (which are Lucene  
indexes).  Exasperated, I then wrote a small OutputFormat/RecordWriter  
pair to write the indexes.  In each case, I can see directories being  
created in attempt_foo/_temporary, but when the task is over they're  
gone.


I've stared at TextOutputFormat and I can't figure out why its files
survive and mine don't.  Help!  Again, this is 0.18.3.
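
To make the pattern concrete, this is a minimal sketch of what I mean by
writing side files under the work output path (0.18.3 mapred API; the class
name and the "index" subdirectory are just placeholders).  My understanding is
that anything created under getWorkOutputPath() should be promoted into the
job output directory when the task commits:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SideFileReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    private Path sideDir;
    private FileSystem fs;

    @Override
    public void configure(JobConf job) {
        try {
            // Per-task scratch space under .../_temporary/_attempt_...
            sideDir = new Path(FileOutputFormat.getWorkOutputPath(job), "index");
            fs = sideDir.getFileSystem(job);
            fs.mkdirs(sideDir);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // ... build whatever side output belongs under sideDir here ...
    }
}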


Thanks,
Ian



Re: hadoop hardware configuration

2009-05-28 Thread Ian Soboroff
Brian Bockelman bbock...@cse.unl.edu writes:

 Despite my trying, I've never been able to come even close to pegging
 the CPUs on our NN.

 I'd recommend going for the fastest dual-cores which are affordable -- 
 latency is king.

Clue?

Surely the latencies in Hadoop that dominate are not cured with faster
processors, but with more RAM and faster disks?

I've followed your posts for a while, so I know you are very experienced
with this stuff... help me out here.

Ian


Re: RPM spec file for 0.19.1

2009-04-06 Thread Ian Soboroff
Simon Lewis si...@lewis.li writes:

 On 3 Apr 2009, at 15:11, Ian Soboroff wrote:
 Steve Loughran ste...@apache.org writes:

 I think from your perspective it makes sense as it stops anyone getting
 itchy fingers and doing their own RPMs.

 Um, what's wrong with that?

 I would certainly like the ability to build RPMs from a source
 checkout, anyone thought of putting a standard spec file in with the
 source somewhere?

Another vote for a .spec file to be included in the standard
distribution as a contrib.

If it's ok with Cloudera (since my spec file just came from them), I
will edit my JIRA to offer that proposal.  If it's Cloudera's spec
that's included, we should also include the init.d script templates
(which are already Apache licensed).

Ian



Re: RPM spec file for 0.19.1

2009-04-03 Thread Ian Soboroff

If you guys want to spin RPMs for the community, that's great.  My main
motivation was that I wanted the current version rather than 0.18.3.

There is of course (as Steve points out) a larger discussion about if
you want RPMs, what should be in them.  In particular, some might want
to include the configuration in the RPMs.  That's a good reason to post
SRPMs, because then it's not so hard to re-roll the RPMs with different
configurations.

(Personally I wouldn't manage configs with RPM, it's just a pain to
propagate changes.  Instead, we are looking at using Puppet for general
cluster configuration needs, and RPMs for the basic binaries.)

Ian

Christophe Bisciglia christo...@cloudera.com writes:

 Hey Ian, we are totally fine with this - the only reason we didn't
 contribute the SPEC file is that it is the output of our internal
 build system, and we don't have the bandwidth to properly maintain
 multiple RPMs.

 That said, we chatted about this a bit today, and were wondering if
 the community would like us to host RPMs for all releases in our
 devel repository. We can't stand behind these from a reliability
 angle the same way we can with our blessed RPMs, but it's a
 manageable amount of additional work to have our build system spit
 those out as well.

 If you'd like us to do this, please add a "me too" to this page:
 http://www.getsatisfaction.com/cloudera/topics/should_we_release_host_rpms_for_all_releases

 We could even skip the branding on the devel releases :-)

 Cheers,
 Christophe

 On Thu, Apr 2, 2009 at 12:46 PM, Ian Soboroff ian.sobor...@nist.gov wrote:

 I created a JIRA (https://issues.apache.org/jira/browse/HADOOP-5615)
 with a spec file for building a 0.19.1 RPM.

 I like the idea of Cloudera's RPM file very much.  In particular, it has
 nifty /etc/init.d scripts and RPM is nice for managing updates.
 However, it's for an older, patched version of Hadoop.

 This spec file is actually just Cloudera's, with suitable edits.  The
 spec file does not contain an explicit license... if Cloudera have
 strong feelings about it, let me know and I'll pull the JIRA attachment.

 The JIRA includes instructions on how to roll the RPMs yourself.  I
 would have attached the SRPM but they're too big for JIRA.  I can offer
 noarch RPMs build with this spec file if someone wants to host them.

 Ian





Re: RPM spec file for 0.19.1

2009-04-03 Thread Ian Soboroff
Steve Loughran ste...@apache.org writes:

 I think from your perspective it makes sense as it stops anyone getting
 itchy fingers and doing their own RPMs. 

Um, what's wrong with that?

Ian




Re: RPM spec file for 0.19.1

2009-04-03 Thread Ian Soboroff
Steve Loughran ste...@apache.org writes:

 -RPM and deb packaging would be nice

Indeed.  The best thing would be to have the hadoop build system output
them, for some sensible subset of systems.

 -the jdk requirements are too harsh as it should run on openjdk's JRE
 or jrockit; no need for sun only. Too bad the only way to say that is
 leave off all jdk dependencies.

I haven't tried running Hadoop on anything but the Sun JDK, much less
built it from source (well, the rpmbuild did that so I guess I have).

 -I worry about how they patch the rc.d files. I can see why, but
 wonder what that does with the RPM ownership

Those are just fine: (from hadoop-init.tmpl)

#!/bin/bash
# 
# (c) Copyright 2009 Cloudera, Inc.
# 
#   Licensed under the Apache License, Version 2.0 (the "License");
#   you may not use this file except in compliance with the License.
...

Ian



Re: swap hard drives between datanodes

2009-03-31 Thread Ian Soboroff

Or if you have a node blow a motherboard but the disks are fine...

Ian

On Mar 30, 2009, at 10:03 PM, Mike Andrews wrote:


i tried swapping two hot-swap sata drives between two nodes in a
cluster, but it didn't work: after restart, one of the datanodes shut
down since namenode said it reported a block belonging to another
node, which i guess namenode thinks is a fatal error. is this caused
by the hadoop/datanode/current/VERSION file having the IP address and
other ID information of the datanode hard-coded? it'd be great to be
able to do a manual gross cluster rebalance by just physically
swapping hard drives, but seems like this is not possible in the
current version 0.18.3.

--
permanent contact information at http://mikerandrews.com




Re: Creating Lucene index in Hadoop

2009-03-16 Thread Ian Soboroff

I understand why you would index in the reduce phase, because the anchor
text gets shuffled to be next to the document.  However, when you index
in the map phase, don't you just have to reindex later?

The main point to the OP is that HDFS is a bad FS for writing Lucene
indexes because of how Lucene works.  The simple approach is to write
your index outside of HDFS in the reduce phase, and then merge the
indexes from each reducer manually.
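
For example, the merge step can be a small single-process program along these
lines (a sketch against the Lucene 2.x-era API; the command-line layout is an
assumption):

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeIndexes {
    // args[0] = destination index, args[1..n] = the per-reducer indexes
    public static void main(String[] args) throws Exception {
        IndexWriter writer =
            new IndexWriter(FSDirectory.getDirectory(new File(args[0])),
                            new StandardAnalyzer(), true);
        Directory[] parts = new Directory[args.length - 1];
        for (int i = 1; i < args.length; i++) {
            parts[i - 1] = FSDirectory.getDirectory(new File(args[i]));
        }
        writer.addIndexesNoOptimize(parts);  // fold the shards into one index
        writer.optimize();
        writer.close();
    }
}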

Ian

Ning Li ning.li...@gmail.com writes:

 Or you can check out the index contrib. The difference of the two is that:
   - In Nutch's indexing map/reduce job, indexes are built in the
 reduce phase. Afterwards, they are merged into smaller number of
 shards if necessary. The last time I checked, the merge process does
 not use map/reduce.
   - In contrib/index, small indexes are built in the map phase. They
 are merged into the desired number of shards in the reduce phase. In
 addition, they can be merged into existing shards.

 Cheers,
 Ning


 On Fri, Mar 13, 2009 at 1:34 AM, 王红宝 imcap...@126.com wrote:
 you can see the nutch code.

 2009/3/13 Mark Kerzner markkerz...@gmail.com

 Hi,

 How do I allow multiple nodes to write to the same index file in HDFS?

 Thank you,
 Mark





Re: Creating Lucene index in Hadoop

2009-03-16 Thread Ian Soboroff

Does anyone have stats on how multiple readers on an optimized Lucene
index in HDFS compare with a ParallelMultiReader (or whatever it's
called) over RPC on a local filesystem?

I'm missing why you would ever want the Lucene index in HDFS for
reading.

Ian

Ning Li ning.li...@gmail.com writes:

 I should have pointed out that Nutch index build and contrib/index
 targets different applications. The latter is for applications who
 simply want to build Lucene index from a set of documents - e.g. no
 link analysis.

 As to writing Lucene indexes, both work the same way - write the final
 results to local file system and then copy to HDFS. In contrib/index,
 the intermediate results are in memory and not written to HDFS.

 Hope it clarifies things.

 Cheers,
 Ning


 On Mon, Mar 16, 2009 at 2:57 PM, Ian Soboroff ian.sobor...@nist.gov wrote:

 I understand why you would index in the reduce phase, because the anchor
 text gets shuffled to be next to the document.  However, when you index
 in the map phase, don't you just have to reindex later?

 The main point to the OP is that HDFS is a bad FS for writing Lucene
 indexes because of how Lucene works.  The simple approach is to write
 your index outside of HDFS in the reduce phase, and then merge the
 indexes from each reducer manually.

 Ian

 Ning Li ning.li...@gmail.com writes:

 Or you can check out the index contrib. The difference of the two is that:
   - In Nutch's indexing map/reduce job, indexes are built in the
 reduce phase. Afterwards, they are merged into smaller number of
 shards if necessary. The last time I checked, the merge process does
 not use map/reduce.
   - In contrib/index, small indexes are built in the map phase. They
 are merged into the desired number of shards in the reduce phase. In
 addition, they can be merged into existing shards.

 Cheers,
 Ning


 On Fri, Mar 13, 2009 at 1:34 AM, 王红宝 imcap...@126.com wrote:
 you can see the nutch code.

 2009/3/13 Mark Kerzner markkerz...@gmail.com

 Hi,

 How do I allow multiple nodes to write to the same index file in HDFS?

 Thank you,
 Mark







Re: Hadoop job using multiple input files

2009-02-06 Thread Ian Soboroff
Amandeep Khurana ama...@gmail.com writes:

 Is it possible to write a map reduce job using multiple input files?

 For example:
 File 1 has data like - Name, Number
 File 2 has data like - Number, Address

 Using these, I want to create a third file which has something like - Name,
 Address

 How can a map reduce job be written to do this?

Have one map job read both files in sequence, and map them to (number,
name or address).  Then reduce on number.
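
A minimal sketch of that against the 0.19 mapred API, assuming comma-separated
records and using an all-digits test to tell the two record shapes apart (you
could instead check map.input.file in configure()):

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class JoinOnNumber {

    // Emits (number, "N:name") for file 1 records and (number, "A:address") for file 2.
    public static class JoinMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            String[] fields = value.toString().split(",", 2);
            if (fields.length != 2) return;
            String a = fields[0].trim(), b = fields[1].trim();
            if (a.matches("\\d+")) {
                out.collect(new Text(a), new Text("A:" + b));   // number, address
            } else {
                out.collect(new Text(b), new Text("N:" + a));   // number, name
            }
        }
    }

    // All tagged values for one number arrive together; pair each name with each address.
    public static class JoinReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text number, Iterator<Text> values,
                           OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            List<String> names = new ArrayList<String>();
            List<String> addrs = new ArrayList<String>();
            while (values.hasNext()) {
                String v = values.next().toString();
                if (v.startsWith("N:")) names.add(v.substring(2));
                else addrs.add(v.substring(2));
            }
            for (String name : names)
                for (String addr : addrs)
                    out.collect(new Text(name), new Text(addr));
        }
    }
}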

Ian



Re: Regarding Hadoop multi cluster set-up

2009-02-04 Thread Ian Soboroff
I would love to see someplace a complete list of the ports that the  
various Hadoop daemons expect to have open.  Does anyone have that?


Ian

On Feb 4, 2009, at 1:16 PM, shefali pawar wrote:



Hi,

I will have to check. I can do that tomorrow in college. But if that
is the case, what should I do?


Should I change the port number and try again?

Shefali

On Wed, 04 Feb 2009 S D wrote :

Shefali,

Is your firewall blocking port 54310 on the master?

John

On Wed, Feb 4, 2009 at 12:34 PM, shefali pawar shefal...@rediffmail.com 
wrote:



Hi,

I am trying to set up a two-node cluster using Hadoop 0.19.0, with 1
master (which should also work as a slave) and 1 slave node.

But while running bin/start-dfs.sh the datanode is not starting on the
slave. I had read the previous mails on the list, but nothing seems to be
working in this case. I am getting the following error in the
hadoop-root-datanode-slave log file while running the command
bin/start-dfs.sh =

2009-02-03 13:00:27,516 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
/
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = slave/172.16.0.32
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.19.0
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19 -r 713890; compiled by 'ndaley' on Fri Nov 14 03:12:29 UTC 2008
/
2009-02-03 13:00:28,725 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/172.16.0.46:54310. Already tried 0 time(s).
2009-02-03 13:00:29,726 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/172.16.0.46:54310. Already tried 1 time(s).
2009-02-03 13:00:30,727 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/172.16.0.46:54310. Already tried 2 time(s).
2009-02-03 13:00:31,728 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/172.16.0.46:54310. Already tried 3 time(s).
2009-02-03 13:00:32,729 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/172.16.0.46:54310. Already tried 4 time(s).
2009-02-03 13:00:33,730 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/172.16.0.46:54310. Already tried 5 time(s).
2009-02-03 13:00:34,731 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/172.16.0.46:54310. Already tried 6 time(s).
2009-02-03 13:00:35,732 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/172.16.0.46:54310. Already tried 7 time(s).
2009-02-03 13:00:36,733 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/172.16.0.46:54310. Already tried 8 time(s).
2009-02-03 13:00:37,734 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/172.16.0.46:54310. Already tried 9 time(s).
2009-02-03 13:00:37,738 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Call to master/172.16.0.46:54310 failed on local exception: No route to host
  at org.apache.hadoop.ipc.Client.call(Client.java:699)
  at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
  at $Proxy4.getProtocolVersion(Unknown Source)
  at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
  at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:306)
  at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:343)
  at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:288)
  at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:258)
  at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:205)
  at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1199)
  at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1154)
  at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1162)
  at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1284)
Caused by: java.net.NoRouteToHostException: No route to host
  at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
  at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
  at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:100)
  at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:299)
  at org.apache.hadoop.ipc.Client$Connection.access$1700(Client.java:176)
  at org.apache.hadoop.ipc.Client.getConnection(Client.java:772)
  at org.apache.hadoop.ipc.Client.call(Client.java:685)
  ... 12 more

2009-02-03 13:00:37,739 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/
SHUTDOWN_MSG: Shutting down DataNode at slave/172.16.0.32
/


Also, the pseudo-distributed operation is working on both the machines.
And I am able to

Re: FileInputFormat directory traversal

2009-02-03 Thread Ian Soboroff
Hmm.  Based on your reasons, an extension to FileInputFormat for the  
lib package seems more in order.


I'll try to hack something up and file a Jira issue.

Ian

On Feb 3, 2009, at 4:28 PM, Doug Cutting wrote:


Hi, Ian.

One reason is that a MapFile is represented by a directory  
containing two files named index and data.   
SequenceFileInputFormat handles MapFiles too by, if an input file is  
a directory containing a data file, using that file.
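
(A tiny sketch makes the layout concrete: this is plain MapFile usage from the
0.18/0.19 io package, with made-up keys and values. Writing one creates a
directory holding a data file and an index file.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileLayout {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);
        // Creates the directory "example.map" containing "data" and "index".
        MapFile.Writer writer =
            new MapFile.Writer(conf, fs, "example.map", Text.class, Text.class);
        writer.append(new Text("a"), new Text("1"));   // keys must be appended in sorted order
        writer.append(new Text("b"), new Text("2"));
        writer.close();
    }
}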


Another reason is that's what reduces generate.

Neither reason implies that this is the best or only way of doing  
things.  It would probably be better if FileInputFormat optionally  
supported recursive file enumeration.  (It would be incompatible and  
thus cannot be the default mode.)


Please file an issue in Jira for this and attach your patch.

Thanks,

Doug

Ian Soboroff wrote:
Is there a reason FileInputFormat only traverses the first level of  
directories in its InputPaths?  (i.e., given an InputPath of 'foo',  
it will get foo/* but not foo/bar/*).
I wrote a full depth-first traversal in my custom InputFormat which  
I can offer as a patch.  But to do it I had to duplicate the  
PathFilter classes in FileInputFormat which are marked private, so  
a mainline patch would also touch FileInputFormat.

Ian




FileInputFormat directory traversal

2009-02-03 Thread Ian Soboroff
Is there a reason FileInputFormat only traverses the first level of  
directories in its InputPaths?  (i.e., given an InputPath of 'foo', it  
will get foo/* but not foo/bar/*).


I wrote a full depth-first traversal in my custom InputFormat which I  
can offer as a patch.  But to do it I had to duplicate the PathFilter  
classes in FileInputFormat which are marked private, so a mainline  
patch would also touch FileInputFormat.


Ian



My tasktrackers keep getting lost...

2009-02-02 Thread Ian Soboroff

I hope someone can help me out.  I'm getting started with Hadoop, 
have written the first part of my project (a custom InputFormat), and am
now using that to test out my cluster setup.

I'm running 0.19.0.  I have five dual-core Linux workstations with most
of a 250GB disk available for playing, and am controlling things from my
Mac Pro.  (This is not the production cluster, that hasn't been
assembled yet.  This is just to get the code working and figure out the
bumps.)

My test data is about 18GB of web pages, and the test app at the moment
just counts the number of web pages in each bundle file.  The map jobs
run just fine, but when it gets into the reduce, the TaskTrackers all
get lost to the JobTracker.  I can't see why, because the TaskTrackers
are all still running on the slaves.  Also, the jobdetails URL starts
returning an HTTP 500 error, although other links from that page still
work.

I've tried going onto the slaves and manually restarting the
tasktrackers with hadoop-daemon.sh, and also turning on job restarting
in the site conf and then running stop-mapred/start-mapred.  The
trackers start up and try to clean up and get going again, but they then
just get lost again.

Here's some error output from the master jobtracker:

2009-02-02 13:39:40,904 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_200902021252_0002_r_05_1' from 'tracker_darling:localhost.localdomain/127.0.0.1:58336'
2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: attempt_200902021252_0002_m_004592_1 is 796370 ms debug.
2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: Launching task attempt_200902021252_0002_m_004592_1 timed out.
2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: attempt_200902021252_0002_m_004582_1 is 794199 ms debug.
2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: Launching task attempt_200902021252_0002_m_004582_1 timed out.
2009-02-02 13:41:22,271 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_cheyenne:localhost.localdomain/127.0.0.1:52769'; resending the previous 'lost' response
2009-02-02 13:41:22,272 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_tigris:localhost.localdomain/127.0.0.1:52808'; resending the previous 'lost' response
2009-02-02 13:41:22,272 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_monocacy:localhost.localdomain/127.0.0.1:54464'; resending the previous 'lost' response
2009-02-02 13:41:22,298 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_129.6.101.41:127.0.0.1/127.0.0.1:58744'; resending the previous 'lost' response
2009-02-02 13:41:22,421 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_rhone:localhost.localdomain/127.0.0.1:45749'; resending the previous 'lost' response
2009-02-02 13:41:22,421 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 54311 caught: java.lang.NullPointerException
at org.apache.hadoop.mapred.MapTask.write(MapTask.java:123)
at org.apache.hadoop.mapred.LaunchTaskAction.write(LaunchTaskAction.java:48)
at org.apache.hadoop.mapred.HeartbeatResponse.write(HeartbeatResponse.java:101)
at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:159)
at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:70)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:907)

2009-02-02 13:41:27,275 WARN org.apache.hadoop.mapred.JobTracker: Status from unknown Tracker : tracker_monocacy:localhost.localdomain/127.0.0.1:54464

And from a slave:

2009-02-02 13:26:39,440 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 129.6.101.18:50060, dest: 129.6.101.12:37304, bytes: 6, op: MAPRED_SHUFFLE, cliID: attempt_200902021252_0002_m_000111_0
2009-02-02 13:41:40,165 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception: java.io.IOException: Call to rogue/129.6.101.41:54311 failed on local exception: null
at org.apache.hadoop.ipc.Client.call(Client.java:699)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
at org.apache.hadoop.mapred.$Proxy4.heartbeat(Unknown Source)
at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1164)
at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:997)
at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1678)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2698)
Caused by: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
at org.apache.hadoop.io.UTF8.readChars(UTF8.java:211)
at