Re: duplicate tasks getting started/killed

2010-02-10 Thread Meng Mao
Right, so have you ever seen your non-idempotent DEFINE command have an incorrect result? That would essentially point to duplicate attempts behaving badly. To your second question -- I think spec exec assumes that not all machines run at the same speed. If a machine is free (not used for some oth
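One common way to keep non-idempotent side effects safe from duplicate speculative attempts (not something this thread prescribes, just a sketch; the property names are the pre-0.21 mapred.* ones):

    import org.apache.hadoop.mapred.JobConf;

    public class NoSpeculation {
        // Sketch: turn speculative execution off so each task runs at most one
        // attempt unless the first attempt actually fails.
        public static JobConf disableSpeculation(JobConf conf) {
            conf.setBoolean("mapred.map.tasks.speculative.execution", false);
            conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
            return conf;
        }
    }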

Re: Problem getting secondary sort running.

2010-02-10 Thread Eric Sammer
On 2/11/10 12:40 AM, Winton Davies wrote: > ahhahhahahahahaha... I thought it was single-pass, and in this case, an > 'echo'. > Yea, the combiner can be confusing at first. It may run N times where N is zero or greater. And yes, this means that even if you supply a combiner the framework may opt

Re: Problem getting secondary sort running.

2010-02-10 Thread Winton Davies
ahhahhahahahahaha... I thought it was single-pass, and in this case, an 'echo'. Thanks ! W On Wed, Feb 10, 2010 at 8:05 PM, Eric Sammer wrote: > Winton: > > The combiner is always optional. Simply leave it out to not have one. The > reason you're seeing extra records is because a combiner can r

Re: Problem getting secondary sort running.

2010-02-10 Thread Eric Sammer
Winton: The combiner is always optional. Simply leave it out to not have one. The reason you're seeing extra records is that a combiner can run multiple times. This means you're growing your dataset after the mapper. HTH Eric On Feb 10, 2010, at 10:30 PM, Winton Davies wrote:
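For reference, a minimal sketch of a combiner that is safe to run any number of times (a sum reducer); the class and job wiring below are illustrative, not Eric's code:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CombinerSetup {
        // A sum reducer is safe to use as a combiner: running it 0, 1, or N
        // times over partial map output never changes the final sums.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }

        // Hypothetical job wiring; leaving setCombinerClass out entirely is also
        // fine, since the combiner is always optional.
        public static void configure(Job job) {
            job.setCombinerClass(IntSumReducer.class);
        }
    }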

Many child processes dont exit

2010-02-10 Thread Zheng Lv
Hello Everyone, We often find many child processes on the datanodes which finished a long time ago. The following is the jstack log: Full thread dump Java HotSpot(TM) 64-Bit Server VM (14.3-b01 mixed mode): "DestroyJavaVM" prio=10 tid=0x2aaac8019800 nid=0x2422 waiting on condition

Re: Problem getting secondary sort running.

2010-02-10 Thread Winton Davies
Thanks Eric, I think I may have found the cause of the problem, but have no idea how to fix it. My mapper is STDOUT.puts "key1 tab key2 tab text" -- and the job tracker shows the total number of records being emitted as, say, 35 million. It then goes through -combiner /bin/cat (i.e. a NOOP, in theor

Announcement: New Training Offering: Hadoop for System Administrators

2010-02-10 Thread Christophe Bisciglia
Hadoop Fans, we have scheduled additional developer sessions in both the Bay Area and NYC. Also, due to popular demand, we'll be offering a public sysadmin training session immediately following our March developer session in the Bay Area. If this goes well, we'll make this a regular offering. Als

Re: Problem getting secondary sort running.

2010-02-10 Thread E. Sammer
Winton: I don't know the exact streaming options you're looking for, but what you have looks correct. Generally, to do what you want all you should have to do is 1. sort on both fields zero and one of the key and 2. partition on field zero only. This ensures all keys containing 'AA' go to the same r
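Something along these lines should express "sort on two key fields, partition on one" in streaming (untested sketch; the jar path, input/output paths and mapper/reducer names are placeholders, and the option names are the 0.20-era ones):

    # Two-field key: partition on field 1 only, sort on fields 1 and 2
    # (field 2 numerically), so all 'AA ...' records reach the same reducer
    # in secondary-sort order.
    hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-*-streaming.jar \
      -D stream.num.map.output.key.fields=2 \
      -D mapred.text.key.partitioner.options=-k1,1 \
      -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
      -D mapred.text.key.comparator.options="-k1,1 -k2,2n" \
      -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
      -input /in -output /out \
      -mapper my_mapper.rb -reducer my_reducer.rb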

Problem getting secondary sort running.

2010-02-10 Thread Winton Davies
I'm using streaming Hadoop, installed via Cloudera on EC2. My job should be straightforward: 1) Map task, emits 2 keys and 1 VALUE, e.g. AA 0 QUICK BROWN FOX AA 1 QUICK BROWN FOX BB 1 QUICK RED DOG 2) Reduce task, assuming these are all in its standard input and flag, runs through the stdin. Whe

Re: Ubuntu Single Node Tutorial failure. No live or dead nodes.

2010-02-10 Thread E. Sammer
On 2/10/10 5:19 PM, Nick Klosterman wrote: @E.Sammer, no I don't *think* that it is part of another cluster. The tutorial is for a single node cluster, just as an initial setup to see if you can get things up and running. I have reformatted the namenode several times in my effort to get hadoop to

Re: Ubuntu Single Node Tutorial failure. No live or dead nodes.

2010-02-10 Thread Nick Klosterman
@E.Sammer, no I don't *think* that it is part of another cluster. The tutorial is for a single node cluster, just as an initial setup to see if you can get things up and running. I have reformatted the namenode several times in my effort to get hadoop to work. @abhishek I tried the workaround y

Re: Ubuntu Single Node Tutorial failure. No live or dead nodes.

2010-02-10 Thread E. Sammer
On 2/10/10 3:57 PM, Nick Klosterman wrote: It appears I have incompatible namespaceIDs. Any thoughts on how to resolve that? This is what the full datanode log is saying: Was this data node part of another DFS cluster at some point? It looks like you've reformatted the name node since the d

Re: Ubuntu Single Node Tutorial failure. No live or dead nodes.

2010-02-10 Thread abhishek sharma
So Michael Noll's tutorial page has the following tips for the error you are facing. http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)#java.io.IOException:_Incompatible_namespaceIDs Abhishek On Wed, Feb 10, 2010 at 12:57 PM, Nick Klosterman wrote: > It appears
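The workaround on that page boils down to wiping the datanode's data directory so it picks up the new namespaceID on restart (sketch only; the path is whatever dfs.data.dir / hadoop.tmp.dir points at on your machine, and this deletes all block data held by that node):

    # Stop the daemons, clear the datanode's storage, restart.
    bin/stop-all.sh
    rm -rf /app/hadoop/tmp/dfs/data/*
    bin/start-all.sh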

Re: Ubuntu Single Node Tutorial failure. No live or dead nodes.

2010-02-10 Thread Nick Klosterman
It appears I have incompatible namespaceIDs. Any thoughts on how to resolve that? This is what the full datanode log is saying: 2010-02-10 15:25:09,125 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG: / STARTUP_MSG:

Re: Ubuntu Single Node Tutorial failure. No live or dead nodes.

2010-02-10 Thread Allen Wittenauer
On 2/10/10 12:42 PM, "Nick Klosterman" wrote: > I've been following Michael Noll's Single node cluster tutorial but am > unable to run the wordcount example successfully. > > It appears that I'm having some sort of problem involving the nodes. Using > copyFromLocal fails to replicate the dat

Re: Ubuntu Single Node Tutorial failure. No live or dead nodes.

2010-02-10 Thread E. Sammer
Nick: It appears that the datanode daemon isn't running. > /usr/local/hadoop/bin$ jps > 24440 SecondaryNameNode > 24626 TaskTracker > 24527 JobTracker > 24218 NameNode > 24725 Jps There's no process for DataNode. This is the process that is responsible for storing blocks. In other words, no da
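A quick way to see why the DataNode died is to read its log (the file-name pattern below is the default; the install prefix is a guess based on the jps path above):

    # Last 100 lines of the datanode log usually contain the fatal exception.
    tail -n 100 /usr/local/hadoop/logs/hadoop-*-datanode-*.log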

Ubuntu Single Node Tutorial failure. No live or dead nodes.

2010-02-10 Thread Nick Klosterman
I've been following Michael Noll's Single node cluster tutorial but am unable to run the wordcount example successfully. It appears that I'm having some sort of problem involving the nodes. Using copyFromLocal fails to replicate the data across 1 node. When I try to look at the hadoop web inte

Re: how to pass arguments to a map reduce job

2010-02-10 Thread David Hawthorne
Thanks, that worked! On Feb 10, 2010, at 11:44 AM, Alex Kozlov wrote: David, to parse the -Dkey=value flags you need to implement Tool. Otherwise, you can just set the values yourself using conf.set(name, value) call. On Wed, Feb 10, 2010 at 11:25 AM, David Hawthorne wrote: For the ot

Re: how to pass arguments to a map reduce job

2010-02-10 Thread Alex Kozlov
David, to parse the -Dkey=value flags you need to implement Tool. Otherwise, you can just set the values yourself using a conf.set(name, value) call. On Wed, Feb 10, 2010 at 11:25 AM, David Hawthorne wrote: > For the other method I was using, with otherArgs and public static > variables for field
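A rough sketch of the Tool approach (class and property names here are made up, not David's): ToolRunner peels the -D key=value pairs off the command line into the Configuration before run() is called, and the mappers read the same values back via context.getConfiguration().get(...).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class ArgDemo extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            Configuration conf = getConf();
            // -D field.name=foo -D interval.length=60 from the command line land here:
            String fieldName = conf.get("field.name", "default_field");
            long interval = conf.getLong("interval.length", 60L);

            Job job = new Job(conf, "arg demo");
            job.setJarByClass(ArgDemo.class);
            // ... set mapper/reducer/input/output here ...
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new Configuration(), new ArgDemo(), args));
        }
    }

Invoked roughly as: hadoop jar argdemo.jar ArgDemo -D field.name=foo -D interval.length=120 <in> <out>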

Re: how to pass arguments to a map reduce job

2010-02-10 Thread David Hawthorne
For the other method I was using, with otherArgs and public static variables for field_name and interval_length, here's the code for that: public class FooBar { public static class FooMapper extends Mapper<Object, Text, Text, IntWritable> { private final static IntWritable one = new I

Re: Cleaning jobcache manually

2010-02-10 Thread Allen Wittenauer
On 2/10/10 12:15 AM, "Marcus Herou" wrote: > We run hadoop-0.18.3 and it seems that the jobcache does not get cleaned out > properly. > > Would this cron script be to any harm to hadoop ? > > # Clean all files which are two or more days old > /usr/bin/find ${JOB_CACHE_PATH} -type f -mtime +2

how to pass arguments to a map reduce job

2010-02-10 Thread David Hawthorne
I've tried what it shows in the examples, but those don't seem to work. Aside from that, they also complain about a deprecated interface when I compile. Any help you guys can give would be greatly appreciated. Here's what I need to do in the mapper: Read through some logs. Modulo the

Re: What framework Hadoop uses for daemonizing?

2010-02-10 Thread Steve Loughran
Thomas Koch wrote: I'm working on a hadoop package for Debian, which also includes init scripts using the daemon program (Debian package "daemon") from http://www.libslack.org/daemon Can these scripts be used on other distributions, like Red Hat? Or is it a Debian-only daemon? I'm not familiar en

Re: duplicate tasks getting started/killed

2010-02-10 Thread prasenjit mukherjee
Correctness of the results actually depends on my DEFINE command. If the commands are idempotent (which is not the case for me) then I believe it won't have any effect on the results; otherwise it will indeed make the results incorrect. For example, if my command fetches some data and appends to a mys

Re: duplicate tasks getting started/killed

2010-02-10 Thread Meng Mao
That cleanup action looks promising in terms of preventing duplication. What I'd meant was, could you ever find an instance where the results of your DEFINE statement were made incorrect by multiple attempts? On Wed, Feb 10, 2010 at 5:05 AM, prasenjit mukherjee < pmukher...@quattrowireless.com> wr

Re: What framework Hadoop uses for daemonizing?

2010-02-10 Thread Thomas Koch
> > I'm working on a hadoop package for Debian, which also includes init > > scripts > > using the daemon program (Debian package "daemon") from > > http://www.libslack.org/daemon > > Can these scripts be used on other distributions, like Red Hat? Or it's a > Debian only daemon? I'm not familiar e

Re: duplicate tasks getting started/killed

2010-02-10 Thread prasenjit mukherjee
Below is the log : attempt_201002090552_0009_m_01_0 /default-rack/ip-10-242-142-193.ec2.internal SUCCEEDED 100.00% 9-Feb-2010 07:04:37 9-Feb-2010 07:07:00 (2mins, 23sec) attempt_201002090552_0009_m_01_1 Task attempt: /default-rack/ip-10-212-147-129.ec2.internal

Re: Cleaning jobcache manually

2010-02-10 Thread Vibhooti Verma
We faced the same issue and we also use cron to delete the older entries. Please be careful that the mtime threshold for deletion is never less than the longest job you can ever have. On Wed, Feb 10, 2010 at 1:45 PM, Marcus Herou wrote: > Hi. > > We run hadoop-0.18.3 and it seems that the jobc
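A variant of the cron cleanup with that caveat baked in (sketch; JOB_CACHE_PATH is site-specific, and +2 days is only safe if no job ever runs longer than that):

    # Remove files untouched for more than 2 days, then sweep out empty
    # attempt directories left behind.
    /usr/bin/find "${JOB_CACHE_PATH}" -type f -mtime +2 -exec rm -f {} \;
    /usr/bin/find "${JOB_CACHE_PATH}" -mindepth 1 -type d -empty -mtime +2 -exec rmdir {} \;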

Re: unique output names for each reducers

2010-02-10 Thread Mark N
Maybe using MultipleOutputFormat will solve my problem. Thanks. On Wed, Feb 10, 2010 at 1:12 PM, Oded Rotem wrote: > Did you try one of the subclasses of MultipleOutputFormat to override the > filename in generateFileNameForKeyValue()? > > -Original Message- > From: Mark N [mailto
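For the archives, a minimal sketch of what Oded describes, using the old mapred API (naming the output file after the key is just an illustration):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    // Each record's output file name is derived from its key, so every reducer
    // writes per-key files instead of a single part-NNNNN file.
    public class KeyNamedOutputFormat extends MultipleTextOutputFormat<Text, Text> {
        @Override
        protected String generateFileNameForKeyValue(Text key, Text value, String name) {
            // "name" is the default part-NNNNN name; prefixing with the key keeps
            // file names unique across reducers.
            return key.toString() + "/" + name;
        }
    }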

Cleaning jobcache manually

2010-02-10 Thread Marcus Herou
Hi. We run hadoop-0.18.3 and it seems that the jobcache does not get cleaned out properly. Would this cron script do any harm to hadoop? # Clean all files which are two or more days old /usr/bin/find ${JOB_CACHE_PATH} -type f -mtime +2 -exec rm {} \; Need to start cleaning today so hoping f