0.20.0 mapreduce package documentation

2009-06-05 Thread Ian Soboroff
I just started playing with 0.20.0. I see that the mapred package is deprecated in favor of the mapreduce package. Is there any migration documentation for the new API (i.e., something more touristy than Javadoc)? All the website docs and Wiki examples are on the old API. Sorry if this is on t
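For reference, a minimal sketch of the new-API job setup being asked about, assuming a plain identity-map/identity-reduce job (the class name and argument handling are illustrative, not from any official migration guide):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NewApiExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "new-api-example"); // replaces JobConf/JobClient
    job.setJarByClass(NewApiExample.class);
    job.setMapperClass(Mapper.class);    // base mapreduce.Mapper is identity
    job.setReducerClass(Reducer.class);  // base mapreduce.Reducer is identity
    job.setOutputKeyClass(LongWritable.class);  // TextInputFormat's default keys
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}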

Re: Task files in _temporary not getting promoted out

2009-06-04 Thread Ian Soboroff
> ...only successfully completed tasks have the files moved up.
>
> I don't recall if the FileOutputCommitter class appeared in 0.18
>
> On Wed, Jun 3, 2009 at 6:43 PM, Ian Soboroff wrote:
>> Ok, help. I am trying to create local task outputs in my reduce job,

Re: *.gz input files

2009-06-04 Thread Ian Soboroff
If your case is like mine, where you have lots of .gz files and you don't want splits in the middle of those files, you can use the code I just sent in the thread about traversing subdirectories. In brief, your RecordReader could do something like: public static class MyRecordReader
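The RecordReader code in this message is cut off above; as a hedged sketch of the standard trick it relies on (assumed here, not recovered from the thread), overriding isSplitable() in the old 0.18/0.19 mapred API keeps each .gz file in a single split:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Sketch: force one whole file per split so no task starts mid-.gz stream.
public class WholeFileTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }
}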

Re: Subdirectory question revisited

2009-06-04 Thread Ian Soboroff
Here's how I solved the problem using a custom InputFormat... the key part is in listStatus(), where we traverse the directory tree. Since HDFS doesn't have links, this code is probably safe, but if you have a filesystem with cycles you will get trapped. Ian

import java.io.IOException;
import ja
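Ian's attached code is truncated above; a minimal sketch of the listStatus() idea he describes (a reconstruction under assumptions, not his original source), against the old 0.18/0.19 mapred API:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class RecursiveTextInputFormat extends TextInputFormat {
  @Override
  protected FileStatus[] listStatus(JobConf job) throws IOException {
    List<FileStatus> result = new ArrayList<FileStatus>();
    for (FileStatus status : super.listStatus(job)) {
      addRecursively(status.getPath().getFileSystem(job), status, result);
    }
    return result.toArray(new FileStatus[result.size()]);
  }

  // Depth-first walk; note there is no cycle detection, as the message warns.
  private void addRecursively(FileSystem fs, FileStatus status,
                              List<FileStatus> result) throws IOException {
    if (status.isDir()) {
      for (FileStatus child : fs.listStatus(status.getPath())) {
        addRecursively(fs, child, result);
      }
    } else {
      result.add(status);
    }
  }
}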

Re: Command-line jobConf options in 0.18.3

2009-06-04 Thread Ian Soboroff
> ...sending on the command line?
> - Aaron
>
> On Wed, Jun 3, 2009 at 5:46 PM, Ian Soboroff wrote:
>> If after I call getConf to get the conf object, I manually add the key/value pair, it's there when I need it. So it feels like ToolRunner isn't

Task files in _temporary not getting promoted out

2009-06-03 Thread Ian Soboroff
Ok, help. I am trying to create local task outputs in my reduce job, and they get created, then go poof when the job's done. My first take was to use FileOutputFormat.getWorkOutputPath, and create directories in there for my outputs (which are Lucene indexes). Exasperated, I then wrote a
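For context, a minimal sketch (an assumption, not the poster's code) of the pattern under discussion: side files written under FileOutputFormat.getWorkOutputPath() land in _temporary and are promoted to the job output directory only when the task attempt succeeds:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class SideFileHelper {
  // Call from inside a task; configure(JobConf) hands you the JobConf.
  public static void writeSideFile(JobConf job, String name)
      throws IOException {
    // Points into .../_temporary/_attempt_xxx; the OutputCommitter moves
    // files from here to the final output dir only on task success.
    Path work = FileOutputFormat.getWorkOutputPath(job);
    FileSystem fs = work.getFileSystem(job);
    FSDataOutputStream out = fs.create(new Path(work, name));
    out.writeUTF("side output");
    out.close();
  }
}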

Re: Command-line jobConf options in 0.18.3

2009-06-03 Thread Ian Soboroff
If after I call getConf to get the conf object, I manually add the key/value pair, it's there when I need it. So it feels like ToolRunner isn't parsing my args for some reason. Ian

On Jun 3, 2009, at 8:45 PM, Ian Soboroff wrote:
> Yes, and I get the JobConf via 'JobConf jo

Re: Command-line jobConf options in 0.18.3

2009-06-03 Thread Ian Soboroff
...PM, Aaron Kimball wrote:
> Are you running your program via ToolRunner.run()? How do you instantiate the JobConf object?
> - Aaron
>
> On Wed, Jun 3, 2009 at 10:19 AM, Ian Soboroff wrote:
>> I'm backporting some code I wrote for 0.19.1 to 0.18.3 (long story), and I'm finding that when I run a

Command-line jobConf options in 0.18.3

2009-06-03 Thread Ian Soboroff
I'm backporting some code I wrote for 0.19.1 to 0.18.3 (long story), and I'm finding that when I run a job and pass options with -D on the command line, the option values aren't showing up in my JobConf. I logged all the key/value pairs in the JobConf, and the option I passed t
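A minimal sketch of the ToolRunner pattern at issue, with a hypothetical MyJob class: GenericOptionsParser puts the -D options into the Configuration that ToolRunner hands to run() via getConf(), so the JobConf must be built from getConf() rather than created fresh:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    // Wrap getConf() -- a fresh "new JobConf()" would silently drop the
    // -D key/value pairs the user passed on the command line.
    JobConf job = new JobConf(getConf(), MyJob.class);
    // ... set input/output paths, formats, mapper, reducer here ...
    JobClient.runJob(job);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new MyJob(), args));
  }
}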

Re: hadoop hardware configuration

2009-05-28 Thread Ian Soboroff
Brian Bockelman writes:
> Despite my trying, I've never been able to come even close to pegging the CPUs on our NN.
>
> I'd recommend going for the fastest dual-cores which are affordable -- latency is king.

Clue? Surely the latencies in Hadoop that dominate are not cured with faster proce

Re: RPM spec file for 0.19.1

2009-04-06 Thread Ian Soboroff
Simon Lewis writes:
> On 3 Apr 2009, at 15:11, Ian Soboroff wrote:
>> Steve Loughran writes:
>>> I think from your perspective it makes sense as it stops anyone getting itchy fingers and doing their own RPMs.
>>
>> Um, what's

Re: RPM spec file for 0.19.1

2009-04-03 Thread Ian Soboroff
Steve Loughran writes:
> -RPM and deb packaging would be nice

Indeed. The best thing would be to have the hadoop build system output them, for some sensible subset of systems.

> -the jdk requirements are too harsh as it should run on openjdk's JRE or jrockit; no need for sun only.

Too bad th

Re: RPM spec file for 0.19.1

2009-04-03 Thread Ian Soboroff
Steve Loughran writes:
> I think from your perspective it makes sense as it stops anyone getting itchy fingers and doing their own RPMs.

Um, what's wrong with that? Ian

Re: RPM spec file for 0.19.1

2009-04-03 Thread Ian Soboroff
> ...faction.com/cloudera/topics/should_we_release_host_rpms_for_all_releases
>
> We could even skip the branding on the "devel" releases :-)
>
> Cheers,
> Christophe
>
> On Thu, Apr 2, 2009 at 12:46 PM, Ian Soboroff wrote:
>> I created a JIRA (https://issues.

RPM spec file for 0.19.1

2009-04-02 Thread Ian Soboroff
I created a JIRA (https://issues.apache.org/jira/browse/HADOOP-5615) with a spec file for building a 0.19.1 RPM. I like the idea of Cloudera's RPM file very much. In particular, it has nifty /etc/init.d scripts and RPM is nice for managing updates. However, it's for an older, patched version of

Re: swap hard drives between datanodes

2009-03-31 Thread Ian Soboroff
Or if you have a node blow a motherboard but the disks are fine... Ian

On Mar 30, 2009, at 10:03 PM, Mike Andrews wrote:
> i tried swapping two hot-swap sata drives between two nodes in a cluster, but it didn't work: after restart, one of the datanodes shut down since namenode said it reported a

Re: Creating Lucene index in Hadoop

2009-03-16 Thread Ian Soboroff
> ...inal results to local file system and then copy to HDFS. In contrib/index, the intermediate results are in memory and not written to HDFS.
>
> Hope it clarifies things.
>
> Cheers,
> Ning
>
> On Mon, Mar 16, 2009 at 2:57 PM, Ian Soboroff wrote:
>>

Re: Creating Lucene index in Hadoop

2009-03-16 Thread Ian Soboroff
I understand why you would index in the reduce phase, because the anchor text gets shuffled to be next to the document. However, when you index in the map phase, don't you just have to reindex later? The main point to the OP is that HDFS is a bad FS for writing Lucene indexes because of how Luce
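A hedged sketch of the approach quoted in the previous message (build the Lucene index on local disk, then copy it into HDFS when the reducer finishes); the helper class and paths are illustrative assumptions, not code from the thread:

import java.io.File;
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class IndexCopier {
  // Call from the reducer's close(), after the Lucene IndexWriter is closed.
  public static void copyIndexToHdfs(JobConf job, File localIndexDir,
                                     Path hdfsDest) throws IOException {
    FileSystem fs = FileSystem.get(job);
    // Copy the finished index directory in one shot; this sidesteps
    // Lucene's write patterns, which HDFS's append-only files can't serve.
    fs.copyFromLocalFile(new Path(localIndexDir.getAbsolutePath()), hdfsDest);
  }
}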

Re: Hadoop job using multiple input files

2009-02-06 Thread Ian Soboroff
Amandeep Khurana writes:
> Is it possible to write a map reduce job using multiple input files?
>
> For example:
> File 1 has data like - Name, Number
> File 2 has data like - Number, Address
>
> Using these, I want to create a third file which has something like - Name, Address
>
> How can a m
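One common answer (sketched here as an assumption; the thread's replies are not shown) is a reduce-side join: give each file its own mapper via MultipleInputs, key both on Number, and pair the tagged values in the reducer:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.MultipleInputs;

public class JoinJobSetup {
  // "Name,Number" lines -> key Number, value tagged with "N:".
  public static class NameMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text line,
                    OutputCollector<Text, Text> out, Reporter r)
        throws IOException {
      String[] f = line.toString().split(",");
      out.collect(new Text(f[1].trim()), new Text("N:" + f[0].trim()));
    }
  }

  // "Number,Address" lines -> key Number, value tagged with "A:".
  public static class AddressMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text line,
                    OutputCollector<Text, Text> out, Reporter r)
        throws IOException {
      String[] f = line.toString().split(",");
      out.collect(new Text(f[0].trim()), new Text("A:" + f[1].trim()));
    }
  }

  public static void configure(JobConf job) {
    MultipleInputs.addInputPath(job, new Path("file1"),
        TextInputFormat.class, NameMapper.class);
    MultipleInputs.addInputPath(job, new Path("file2"),
        TextInputFormat.class, AddressMapper.class);
    // A reducer (omitted) then emits Name, Address per Number key by
    // matching the "N:" and "A:" tagged values.
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
  }
}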

Re: Regarding "Hadoop multi cluster" set-up

2009-02-04 Thread Ian Soboroff
I would love to see someplace a complete list of the ports that the various Hadoop daemons expect to have open. Does anyone have that? Ian

On Feb 4, 2009, at 1:16 PM, shefali pawar wrote:
> Hi, I will have to check. I can do that tomorrow in college. But if that is the case what should i

Re: FileInputFormat directory traversal

2009-02-03 Thread Ian Soboroff
...y of doing things. It would probably be better if FileInputFormat optionally supported recursive file enumeration. (It would be incompatible and thus cannot be the default mode.) Please file an issue in Jira for this and attach your patch.

Thanks,
Doug

Ian Soboroff wrote:
> Is there

FileInputFormat directory traversal

2009-02-03 Thread Ian Soboroff
Is there a reason FileInputFormat only traverses the first level of directories in its InputPaths? (i.e., given an InputPath of 'foo', it will get foo/* but not foo/bar/*). I wrote a full depth-first traversal in my custom InputFormat which I can offer as a patch. But to do it I had to du

Re: My tasktrackers keep getting lost...

2009-02-03 Thread Ian Soboroff
So staring at these logs a bit more and reading hadoop-default.xml and thinking a bit, it seems to me that for some reason my slave tasktrackers are having trouble sending heartbeats back to the master. I'm not sure why this is. It is happening during the shuffle phase of the reduce setup

Re: My tasktrackers keep getting lost...

2009-02-03 Thread Ian Soboroff
On Feb 2, 2009, at 11:38 PM, Sagar Naik wrote:
> Can u post the output from hadoop-argus--jobtracker.out

Sure:

Exception closing file /user/soboroff/output/_logs/history/rogue_1233597148110_job_200902021252_0002_soboroff_index
java.io.IOException: Filesystem closed
        at org.apache.hado

My tasktrackers keep getting lost...

2009-02-02 Thread Ian Soboroff
I hope someone can help me out. I'm getting started with Hadoop, have written the first part of my project (a custom InputFormat), and am now using that to test out my cluster setup. I'm running 0.19.0. I have five dual-core Linux workstations with most of a 250GB disk available for playing, an