Re: Different Map and Reduce output types - weird error message

2008-08-29 Thread Jim Twensky
I think I've found the problem. When I removed the following line:

conf.setCombinerClass(Reduce.class);

everything worked fine. When the combiner uses Reduce.class as the reducer
during the map phase, the combined map (key,value) pairs are written with the
Reducer's output types, which contradicts the declared Mapper output types.
If I'm right, am I supposed to write a separate reducer class for the local
combiner in order to speed things up?

Jim
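
That is the usual fix. A combiner consumes map output and must re-emit the
map output types (here IntWritable keys and values), not the job's final
FloatWritable values, so reusing Reduce.class as the combiner cannot work
here. A minimal sketch of a combiner that preserves the map output types,
nested in the job class alongside Reduce (the class name is illustrative):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical combiner: sums the IntWritable counts emitted by the map
// and re-emits IntWritable, so the declared map output types still hold.
public static class Combine extends MapReduceBase
    implements Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {

    public void reduce(IntWritable key, Iterator<IntWritable> values,
                       OutputCollector<IntWritable, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

It would then be registered with conf.setCombinerClass(Combine.class), while
the real reducer keeps emitting FloatWritable.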


On Fri, Aug 29, 2008 at 6:30 PM, Jim Twensky <[EMAIL PROTECTED]> wrote:

> Here is the relevant part of my mapper:
>
> (...)
>
> private final static IntWritable one = new IntWritable(1);
> private IntWritable bound = new IntWritable();
>
> (...)
>
> while(...) {
>
> output.collect(bound,one);
>
>}
>
>so I'm not sure why my mapper tries to output a FloatWritable.
>
>
>
>
> On Fri, Aug 29, 2008 at 6:17 PM, Owen O'Malley <[EMAIL PROTECTED]> wrote:
>
>> The error message is saying that your map tried to output a FloatWritable.
>>
>
>


Re[2]: Timeouts at reduce stage

2008-08-29 Thread Иван
Thank you, this suggestion seems to be very close to the real situation. The
cluster has already been left looping these (relatively) frequently failing
mapreduce jobs for a long period of time to produce a clearer picture of the
problem, and I tried to investigate this suggestion more closely when I read
it. After taking a look at the Ganglia monitoring system running on that same
cluster, it became clear that the cluster's computing resources apparently are
exhausted. The next step was simple and straightforward - log in to one random
node and find out what is consuming the server's resources. The answer became
clear almost instantly, because top and jps produced a huge list of orphaned
TaskTracker$Child processes consuming tons of CPU time and RAM (in fact,
almost all of it). Some nodes had even run out of their 16G of RAM plus a few
GB of swap and stopped responding at all.

This situation apparently isn't normal. I am going to try to repeat such a
test with some simpler jobs (probably something from the Hadoop distribution,
to make sure that everything is fine with the code) to find out more
definitely whether this orphaning of forked processes depends on the exact MR
job being run or not (theoretically it could still be something wrong with the
Hadoop/HBase configuration, or even with the operating system, some
additionally installed software or, as was suggested earlier, the hardware).

I would be glad if someone could help me in this process with some advice
(googling on this topic has already proved to be hard because the $ is treated
as a separator, so lookups usually return material about real child
processes). Maybe this situation is quite common and there is a definite
reason or solution?

Thanks!

Ivan Blinkov

-Original Message-
From: Karl Anderson <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Date: Fri, 29 Aug 2008 13:17:18 -0700
Subject: Re: Timeouts at reduce stage

> 
> On 29-Aug-08, at 3:53 AM, Иван wrote:
> 
> > Thanks for a fast reply, but in fact it sometimes fails even on  
> > default MR jobs like, for example, rowcounter job from HBase 0.2.0  
> > distribution. Hardware problems are theoretically possible, but they  
> > doesn't seem to be the case because everything else is operating  
> > fine on the same set of servers. It seems that all major components  
> > of each server are fine, even disk arrays are regularly checked by  
> > datacenter stuff.
> 
> It could be due to a resource problem, I've found these hard to debug  
> at times.  Tasks or parts of the framework can fail due to other tasks  
> using up resources, and sometimes the errors you see don't make the  
> cause easy to find.  I've had memory consumption in a mapper cause  
> errors in other mappers, reducers, and fetching HDFS blocks, as well  
> as job infrastructure failures that I don't really understand (for  
> example, one task unable to find a file that was put in a job jar and  
> found by other tasks).  I think all of my timeouts have been  
> straightforward, but I could imagine resource consumption causing that  
> in an otherwise unrelated task - IO blocking, swap, etc.
> 



Re: basic questions about Hadoop!

2008-08-29 Thread Gerardo Velez
Hi Victor!

I ran into problems with remote writing as well, so I tried to dig a bit
further into this. I would like to share what I did - maybe you have more
luck than me.

1) As I'm working as user gvelez on the remote host, I had to give write
access to all, like this:

bin/hadoop dfs -chmod -R a+w input

2) After that, there is no more connection-refused error, but instead I got
the following exception:



$ bin/hadoop dfs -copyFromLocal README.txt /user/hadoop/input/README.txt
cygpath: cannot create short name of d:\hadoop\hadoop-0.17.2\logs
08/08/29 19:06:51 INFO dfs.DFSClient: org.apache.hadoop.ipc.RemoteException: java.io.IOException:
File /user/hadoop/input/README.txt could only be replicated to 0 nodes, instead of 1
    at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1145)
    at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:300)
    at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:585)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
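
As Jeff suggests further down the thread, the Java client can also write to
HDFS directly from a machine outside the cluster. A minimal hedged sketch,
assuming the cluster's hadoop-site.xml is on the client classpath and the
NameNode is reachable (the path and payload below are illustrative only):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hedged sketch of remote writing through the Java client API.
public class RemoteHdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up hadoop-site.xml
        FileSystem fs = FileSystem.get(conf);      // DFS if fs.default.name points at it
        FSDataOutputStream out = fs.create(new Path("/user/hadoop/input/README.txt"));
        out.write("hello from a remote client".getBytes());
        out.close();
        fs.close();
    }
}

This still talks to the NameNode and DataNodes over their usual ports, so the
same "replicated to 0 nodes" error will appear if no DataNode is reachable or
registered.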



On Fri, Aug 29, 2008 at 9:53 AM, Victor Samoylov <[EMAIL PROTECTED]
> wrote:

> Jeff,
>
> Thanks for detailed instructions, but on machine that is not hadoop server
> I
> got error:
> ~/hadoop-0.17.2$ ./bin/hadoop dfs -copyFromLocal NOTICE.txt test
> 08/08/29 19:33:07 INFO dfs.DFSClient: Exception in createBlockOutputStream
> java.net.ConnectException: Connection refused
> 08/08/29 19:33:07 INFO dfs.DFSClient: Abandoning block
> blk_-7622891475776838399
> The thing is that file was created, but with zero size.
>
> Do you have ideas why this happened?
>
> Thanks,
> Victor
>
> On Fri, Aug 29, 2008 at 4:10 AM, Jeff Payne <[EMAIL PROTECTED]> wrote:
>
> > You can use the hadoop command line on machines that aren't hadoop
> servers.
> > If you copy the hadoop configuration from one of your master servers or
> > data
> > node to the client machine and run the command line dfs tools, it will
> copy
> > the files directly to the data node.
> >
> > Or, you could use one of the client libraries.  The java client, for
> > example, allows you to open up an output stream and start dumping bytes
> on
> > it.
> >
> > On Thu, Aug 28, 2008 at 5:05 PM, Gerardo Velez <[EMAIL PROTECTED]
> > >wrote:
> >
> > > Hi Jeff, thank you for answering!
> > >
> > > What about remote writing on HDFS, lets suppose I got an application
> > server
> > > on a
> > > linux server A and I got a Hadoop cluster on servers B (master), C
> > (slave),
> > > D (slave)
> > >
> > > What I would like is sent some files from Server A to be processed by
> > > hadoop. So in order to do so, what I need to do do I need send
> those
> > > files to master server first and then copy those to HDFS?
> > >
> > > or can I pass those files to any slave server?
> > >
> > > basically I'm looking for remote writing due to files to be process are
> > not
> > > being generated on any haddop server.
> > >
> > > Thanks again!
> > >
> > > -- Gerardo
> > >
> > >
> > >
> > > Regarding
> > >
> > > On Thu, Aug 28, 2008 at 4:04 PM, Jeff Payne <[EMAIL PROTECTED]>
> wrote:
> > >
> > > > Gerardo:
> > > >
> > > > I can't really speak to all of your questions, but the master/slave
> > issue
> > > > is
> > > > a common concern with hadoop.  A cluster has a single namenode and
> > > > therefore
> > > > a single point of failure.  There is also a secondary name node
> process
> > > > which runs on the same machine as the name node in most default
> > > > configurations.  You can make it a different machine by adjusting the
> > > > master
> > > > file.  One of the more experienced lurkers should feel free to
> correct
> > > me,
> > > > but my understanding is that the secondary name node keeps track of
> all
> > > the
> > > > same index information used by the primary name node.  So, if the
> > > namenode
> > > > fails, there is no automatic recovery, but you can always tweak your
> > > > cluster
> > > > configuration to make the secondary namenode the primary and safely
> > > restart
> > > > the cluster.
> > > >
> > > > As for the storage of files, the name node is really just the traffic
> > cop
> > > > for HDFS.  No HDFS files are actually stored on that machine.  It's
> > > > basically used as a directory and lock manager, etc.  The files are
> > > stored
> > > > on multiple datanodes and I'm pretty sure all the actual file I/O
> > happens
> > > > directly between the client and the respective datanodes.
> > > >
> > > > Perhaps one of the more hardcore hadoop people on here will point it
> > out
> > > if
> > > > I'm giving bad advice.
> > > >
> > > >
> > > > On Thu, Aug 28, 2008 at 2:28 PM, Gerardo Velez <
> > [EMAIL PROTECTED]
> > > > >wrote:
> > > >
> > > > > Hi Everybody!
> > > > >
> > > > > I'm a newbie with Hadoop, I've installed it as a single node as a
> 

Re: Different Map and Reduce output types - weird error message

2008-08-29 Thread Jim Twensky
Here is the relevant part of my mapper:

(...)

private final static IntWritable one = new IntWritable(1);
private IntWritable bound = new IntWritable();

(...)

while(...) {

output.collect(bound,one);

   }

   so I'm not sure why my mapper tries to output a FloatWritable.




On Fri, Aug 29, 2008 at 6:17 PM, Owen O'Malley <[EMAIL PROTECTED]> wrote:

> The error message is saying that your map tried to output a FloatWritable.
>


Re: Different Map and Reduce output types - weird error message

2008-08-29 Thread Owen O'Malley
The error message is saying that your map tried to output a FloatWritable.


Different Map and Reduce output types - weird error message

2008-08-29 Thread Jim Twensky
Hello, I am working on a Hadoop application that produces different
(key,value) types after the map and reduce phases so I'm aware that I need
to use "JobConf.setMapOutputKeyClass" and "JobConf.setMapOutputValueClass".
However, I still keep getting the following runtime error when I run my
application:

java.io.IOException: wrong value class: org.apache.hadoop.io.FloatWritable is not class org.apache.hadoop.io.IntWritable
    at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:938)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$1.collect(MapTask.java:414)
    at test.DistributionCreator$Reduce.reduce(DistributionCreator.java:104)
    at test.DistributionCreator$Reduce.reduce(DistributionCreator.java:85)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:439)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(MapTask.java:418)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:604)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:193)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1804)

My mapper class goes like:

  public static class MapClass extends MapReduceBase
      implements Mapper<LongWritable, Text, IntWritable, IntWritable> {

    (...)

    public void map(LongWritable key, Text value,
                    OutputCollector<IntWritable, IntWritable> output,
                    Reporter reporter) throws IOException {

      (...)
    }

  }

and my Reducer goes like:

  public static class Reduce extends MapReduceBase
      implements Reducer<IntWritable, IntWritable, IntWritable, FloatWritable> {

    (...)

    public void reduce(IntWritable key, Iterator<IntWritable> values,
                       OutputCollector<IntWritable, FloatWritable> output,
                       Reporter reporter) throws IOException {

      float sum = 0;

      (...)

      output.collect(key, new FloatWritable(sum));

    }

  }

   and the corresponding part of my configuration goes as follows:

conf.setMapOutputValueClass(IntWritable.class);
conf.setMapOutputKeyClass(IntWritable.class);
conf.setOutputKeyClass(IntWritable.class);
conf.setOutputValueClass(FloatWritable.class);

   which I believe is consistent with the mapper and the reducer classes.
Can you please let me know what I'm missing here?

   Thanks in advance,

   Jim


open files not being cleaned up?

2008-08-29 Thread Karl Anderson
I'm running several Hadoop jobs sequentially on one cluster.  I'm  
noticing that later jobs are dying because of too many open files, and  
that earlier runs tend to cause later runs to die - in other words,  
file resources aren't being freed somewhere.


By running a job over and over again, I can cause all subsequent jobs  
to die, even jobs that had successfully run earlier.


I'm using streaming on a hadoop-ec2 cluster, hadoop version 18.0, and  
my inputs and outputs are all HDFS controlled by streaming (stdin and  
stdout), never writing or reading as a side effect.   Each job uses  
the HDFS output of a previous job as its input, but the jobs are all  
separate Hadoop processes, and only one is running at a time.


I have increased the open file limit for root to 65536 in limits.conf  
on my ec2 image, no help.


Is there any solution other than firing up a new cluster for each job?

I could file a bug, but I'm not sure what's consuming the files.  On a
random job box, /proc/<pid>/fd shows only 359 fd entries for the
entire box, and the most open for any process is 174.




RE: Specify per file replication factor in "dfs -put" command line

2008-08-29 Thread Koji Noguchi
Try 

hadoop dfs -D dfs.replication=2 -put abc bcd

Koji
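
If the upload is done from Java rather than the shell, roughly the same
effect can be had through the FileSystem API; a hedged sketch (the paths are
illustrative, mirroring the command above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hedged sketch: upload a local file, then set its replication factor to 2.
public class PutWithReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path dst = new Path("bcd");                  // destination in DFS
        fs.copyFromLocalFile(new Path("abc"), dst);  // like "dfs -put abc bcd"
        fs.setReplication(dst, (short) 2);           // per-file replication factor
        fs.close();
    }
}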

-Original Message-
From: Kevin [mailto:[EMAIL PROTECTED] 
Sent: Friday, August 29, 2008 11:11 AM
To: core-user@hadoop.apache.org
Subject: Specify per file replication factor in "dfs -put" command line

Hi,

Does any one happen to know how to specify the replication factor of a
file when I upload it by the "hadoop dfs -put" command? Thank you!

Best,
-Kevin


Re: Hadoop over Lustre?

2008-08-29 Thread Joel Welling
That seems to have done the trick!  I am now running Hadoop 0.18
straight out of Lustre, without an intervening HDFS.  The unusual things
about my hadoop-site.xml are:


<property>
  <name>fs.default.name</name>
  <value>file:///bessemer/welling</value>
</property>

<property>
  <name>mapred.system.dir</name>
  <value>${fs.default.name}/hadoop_tmp/mapred/system</value>
  <description>The shared directory where MapReduce stores control
  files.</description>
</property>


where /bessemer/welling is a directory on a mounted Lustre filesystem.
I then do 'bin/start-mapred.sh' (without starting dfs), and I can run
Hadoop programs normally.  I do have to specify full input and output
file paths - they don't seem to be relative to fs.default.name.  That's
not too troublesome, though.

Thanks very much!  
-Joel
 [EMAIL PROTECTED]

On Fri, 2008-08-29 at 10:52 -0700, Owen O'Malley wrote:
> Check the setting for mapred.system.dir. This needs to be a path that is on
> a distributed file system. In old versions of Hadoop, it had to be on the
> default file system, but that is no longer true. In recent versions, the
> system dir only needs to be configured on the JobTracker and it is passed to
> the TaskTrackers and clients.



Re: EC2 AMI for Hadoop 0.18.0

2008-08-29 Thread Karl Anderson


On 29-Aug-08, at 6:49 AM, Stuart Sierra wrote:


Anybody have one?  Any success building it with create-hadoop-image?
Thanks,
-Stuart


I was able to build one following the instructions in the wiki.   
You'll need to find the Java download url (see wiki) and put it and  
your own S3 bucket name in hadoop-ec2-env.sh.





Re: Timeouts at reduce stage

2008-08-29 Thread Karl Anderson


On 29-Aug-08, at 3:53 AM, Иван wrote:

Thanks for a fast reply, but in fact it sometimes fails even on  
default MR jobs like, for example, rowcounter job from HBase 0.2.0  
distribution. Hardware problems are theoretically possible, but they  
doesn't seem to be the case because everything else is operating  
fine on the same set of servers. It seems that all major components  
of each server are fine, even disk arrays are regularly checked by  
datacenter stuff.


It could be due to a resource problem, I've found these hard to debug  
at times.  Tasks or parts of the framework can fail due to other tasks  
using up resources, and sometimes the errors you see don't make the  
cause easy to find.  I've had memory consumption in a mapper cause  
errors in other mappers, reducers, and fetching HDFS blocks, as well  
as job infrastructure failures that I don't really understand (for  
example, one task unable to find a file that was put in a job jar and  
found by other tasks).  I think all of my timeouts have been  
straightforward, but I could imagine resource consumption causing that  
in an otherwise unrelated task - IO blocking, swap, etc.




Specify per file replication factor in "dfs -put" command line

2008-08-29 Thread Kevin
Hi,

Does any one happen to know how to specify the replication factor of a
file when I upload it by the "hadoop dfs -put" command? Thank you!

Best,
-Kevin


Re: Hadoop over Lustre?

2008-08-29 Thread Owen O'Malley
Check the setting for mapred.system.dir. This needs to be a path that is on
a distributed file system. In old versions of Hadoop, it had to be on the
default file system, but that is no longer true. In recent versions, the
system dir only needs to be configured on the JobTracker and it is passed to
the TaskTrackers and clients.


Re: Hadoop over Lustre?

2008-08-29 Thread Joel Welling
Sorry; I'm picking this thread up after a couple day's delay.  Setting
fs.default.name to the equivalent of file:///path/to/lustre and changing
mapred.job.tracker to just a hostname and port does allow mapreduce to
start up.  However, test jobs fail with the exceptions below.  It looks
like TaskTracker.localizeJob is looking for job.xml in the local
filesystem; I would have expected it to look in lustre.

I can't find that particular job.xml anywhere on the system after the
run aborts, I'm afraid.  I guess it's getting cleaned up.

Thanks,
-Joel

08/08/28 18:46:07 INFO mapred.FileInputFormat: Total input paths to process : 15
08/08/28 18:46:07 INFO mapred.FileInputFormat: Total input paths to process : 15
08/08/28 18:46:08 INFO mapred.JobClient: Running job: job_200808281828_0002
08/08/28 18:46:09 INFO mapred.JobClient:  map 0% reduce 0%
08/08/28 18:46:12 INFO mapred.JobClient: Task Id : attempt_200808281828_0002_m_00_0, Status : FAILED
Error initializing attempt_200808281828_0002_m_00_0:
java.io.IOException: file:/tmp/hadoop-welling/mapred/system/job_200808281828_0002/job.xml: No such file or directory
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:216)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:150)
    at org.apache.hadoop.fs.LocalFileSystem.copyToLocalFile(LocalFileSystem.java:55)
    at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1193)
    at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:668)
    at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1306)
    at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:946)
    at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
    at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2354)

08/08/28 18:46:12 WARN mapred.JobClient: Error reading task
outputhttp://foo.psc.edu:50060/tasklog?plaintext=true&taskid=attempt_200808281828_0002_m_00_0&filter=stdout
08/08/28 18:46:12 WARN mapred.JobClient: Error reading task
outputhttp://foo.psc.edu:50060/tasklog?plaintext=true&taskid=attempt_200808281828_0002_m_00_0&filter=stderr



On Mon, 2008-08-25 at 14:24 -0700, Konstantin Shvachko wrote:
> mapred.job.tracker is the address and port of the JobTracker - the main 
> server that controls map-reduce jobs.
> Every task tracker needs to know the address in order to connect.
> Do you follow the docs, e.g. that one
> http://wiki.apache.org/hadoop/GettingStartedWithHadoop
> 
> Can you start one node cluster?
> 
>  > Are there standard tests of hadoop performance?
> 
> There is the sort benchmark. We also run DFSIO benchmark for read and write 
> throughputs.
> 
> --Konstantin
> 
> Joel Welling wrote:
> > So far no success, Konstantin- the hadoop job seems to start up, but
> > fails immediately leaving no logs.  What is the appropriate setting for
> > mapred.job.tracker ?  The generic value references hdfs, but it also has
> > a port number- I'm not sure what that means.
> > 
> > My cluster is small, but if I get this working I'd be very happy to run
> > some benchmarks.  Are there standard tests of hadoop performance?
> > 
> > -Joel
> >  [EMAIL PROTECTED]
> > 
> > On Fri, 2008-08-22 at 15:59 -0700, Konstantin Shvachko wrote:
> >> I think the solution should be easier than Arun and Steve advise.
> >> Lustre is already mounted as a local directory on each cluster machines, 
> >> right?
> >> Say, it is mounted on /mnt/lustre.
> >> Then you configure hadoop-site.xml and set
> >> 
> >> <property>
> >>   <name>fs.default.name</name>
> >>   <value>file:///mnt/lustre</value>
> >> </property>
> >> 
> >> And then you start map-reduce only without hdfs using start-mapred.sh
> >>
> >> By this you basically redirect all FileSystem requests to Lustre and you 
> >> don't need
> >> data-nodes or the name-node.
> >>
> >> Please let me know if that works.
> >>
> >> Also it would very interesting to have your experience shared on this list.
> >> Problems, performance - everything is quite interesting.
> >>
> >> Cheers,
> >> --Konstantin
> >>
> >> Joel Welling wrote:
>  2. Could you set up symlinks from the local filesystem, so point every 
>  node at a local dir
>    /tmp/hadoop
>  with each node pointing to a different subdir in the big filesystem?
> >>> Yes, I could do that!  Do I need to do it for the log directories as
> >>> well, or can they be shared?
> >>>
> >>> -Joel
> >>>
> >>> On Fri, 2008-08-22 at 15:48 +0100, Steve Loughran wrote:
>  Joel Welling wrote:
> > Thanks, Steve and Arun.  I'll definitely try to write something based on
> > the KFS interface.  I think that for our applications putting the mapper
> > on the right rack is not going to be that useful.  A lot of our
> > calculations are going to be disordered stuff based on 3D spatial
> > relationships like nearest-neighbor finding, so things will be in a
> > random access pattern most of the time.
> >
> > Is there a way to set up the configuration for HDFS so that differe

Re: Some comparators defined via .JobConf.setOutputKeyComparatorClass no longer work in 0.18.0

2008-08-29 Thread Owen O'Malley
I see your problem. *smile* I'd suggest a patch for 0.18.1 that changes the
constructor with createInstances from private to protected. I created the
jira HADOOP-4046 for this. That said, you *really* should create a
RawComparator for your type. It will perform much, much better in the sort.
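
A hedged sketch of what such a raw comparator could look like for the key in
the thread below, assuming the key is serialized as two 4-byte ints (visitorId
followed by interestValue) - adjust the offsets to the actual serialization.
Because the byte-level compare is overridden, the uninitialised buffer/key
instances from the default constructor are never touched, which should also
sidestep the HADOOP-3665 NullPointerException:

import org.apache.hadoop.io.WritableComparator;

// Hypothetical raw comparator; the two-int field layout is an assumption.
public class ProfoutRawComparator extends WritableComparator {

    public ProfoutRawComparator() {
        super(ProfoutMapKey.class);
    }

    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        // visitorId ascending (first 4 bytes of the serialized key)
        int v1 = readInt(b1, s1);
        int v2 = readInt(b2, s2);
        if (v1 != v2) {
            return (v1 < v2) ? -1 : 1;
        }
        // interestValue descending (next 4 bytes)
        int i1 = readInt(b1, s1 + 4);
        int i2 = readInt(b2, s2 + 4);
        if (i1 != i2) {
            return (i1 > i2) ? -1 : 1;
        }
        return 0;
    }
}

It would be registered per job exactly like the object-level version, via
conf.setOutputKeyComparatorClass(ProfoutRawComparator.class).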


Re: basic questions about Hadoop!

2008-08-29 Thread Victor Samoylov
Jeff,

Thanks for the detailed instructions, but on a machine that is not a hadoop
server I got this error:
~/hadoop-0.17.2$ ./bin/hadoop dfs -copyFromLocal NOTICE.txt test
08/08/29 19:33:07 INFO dfs.DFSClient: Exception in createBlockOutputStream
java.net.ConnectException: Connection refused
08/08/29 19:33:07 INFO dfs.DFSClient: Abandoning block
blk_-7622891475776838399
The thing is that the file was created, but with zero size.

Do you have ideas why this happened?

Thanks,
Victor

On Fri, Aug 29, 2008 at 4:10 AM, Jeff Payne <[EMAIL PROTECTED]> wrote:

> You can use the hadoop command line on machines that aren't hadoop servers.
> If you copy the hadoop configuration from one of your master servers or
> data
> node to the client machine and run the command line dfs tools, it will copy
> the files directly to the data node.
>
> Or, you could use one of the client libraries.  The java client, for
> example, allows you to open up an output stream and start dumping bytes on
> it.
>
> On Thu, Aug 28, 2008 at 5:05 PM, Gerardo Velez <[EMAIL PROTECTED]
> >wrote:
>
> > Hi Jeff, thank you for answering!
> >
> > What about remote writing on HDFS, lets suppose I got an application
> server
> > on a
> > linux server A and I got a Hadoop cluster on servers B (master), C
> (slave),
> > D (slave)
> >
> > What I would like is sent some files from Server A to be processed by
> > hadoop. So in order to do so, what I need to do do I need send those
> > files to master server first and then copy those to HDFS?
> >
> > or can I pass those files to any slave server?
> >
> > basically I'm looking for remote writing due to files to be process are
> not
> > being generated on any haddop server.
> >
> > Thanks again!
> >
> > -- Gerardo
> >
> >
> >
> > Regarding
> >
> > On Thu, Aug 28, 2008 at 4:04 PM, Jeff Payne <[EMAIL PROTECTED]> wrote:
> >
> > > Gerardo:
> > >
> > > I can't really speak to all of your questions, but the master/slave
> issue
> > > is
> > > a common concern with hadoop.  A cluster has a single namenode and
> > > therefore
> > > a single point of failure.  There is also a secondary name node process
> > > which runs on the same machine as the name node in most default
> > > configurations.  You can make it a different machine by adjusting the
> > > master
> > > file.  One of the more experienced lurkers should feel free to correct
> > me,
> > > but my understanding is that the secondary name node keeps track of all
> > the
> > > same index information used by the primary name node.  So, if the
> > namenode
> > > fails, there is no automatic recovery, but you can always tweak your
> > > cluster
> > > configuration to make the secondary namenode the primary and safely
> > restart
> > > the cluster.
> > >
> > > As for the storage of files, the name node is really just the traffic
> cop
> > > for HDFS.  No HDFS files are actually stored on that machine.  It's
> > > basically used as a directory and lock manager, etc.  The files are
> > stored
> > > on multiple datanodes and I'm pretty sure all the actual file I/O
> happens
> > > directly between the client and the respective datanodes.
> > >
> > > Perhaps one of the more hardcore hadoop people on here will point it
> out
> > if
> > > I'm giving bad advice.
> > >
> > >
> > > On Thu, Aug 28, 2008 at 2:28 PM, Gerardo Velez <
> [EMAIL PROTECTED]
> > > >wrote:
> > >
> > > > Hi Everybody!
> > > >
> > > > I'm a newbie with Hadoop, I've installed it as a single node as a
> > > > pseudo-distributed environment, but I would like to go further and
> > > > configure
> > > > a complete hadoop cluster. But I got the following questions.
> > > >
> > > > 1.- I undertsand that HDFS has a master/slave architecture. So master
> > and
> > > > the master server manages the file system namespace and regulates
> > access
> > > to
> > > > files by clients. So, what happens in a cluster environment if the
> > master
> > > > server fails or is down due to network issues?
> > > > the slave become as master server or something?
> > > >
> > > >
> > > > 2.- What about Haddop Filesystem, from client point of view. the
> client
> > > > should only store files in the HDFS on master server, or clients are
> > able
> > > > to
> > > > store the file to be processed on a HDFS from a slave server as well?
> > > >
> > > >
> > > > 3.- Until now, what I;m doing to run hadoop is:
> > > >
> > > >1.- copy file to be processes from Linux File System to HDFS
> > > >2.- Run hadoop shell   hadoop   -jarfile  input output
> > > >3.- The results are stored on output directory
> > > >
> > > >
> > > > There is anyway to have hadoop as a deamon, so that, when the file is
> > > > stored
> > > > in HDFS the file is processed automatically with hadoop?
> > > >
> > > > (witout to run hadoop shell everytime)
> > > >
> > > >
> > > > 4.- What happens with processed files, they are deleted form HDFS
> > > > automatically?
> > > >
> > > >
> > > > Thanks in advance!
> > > >
> > > >
> > > > -- Gerardo Velez
> > > >
> > >
> > >
> > 

Some comparators defined via .JobConf.setOutputKeyComparatorClass no longer work in 0.18.0

2008-08-29 Thread Igor Maximchuk

Hello,

I use custom comparators that subclass WritableComparator and override
int compare(WritableComparable lh, WritableComparable rh).


for example:

public class ProfoutComparator extends WritableComparator {
    public ProfoutComparator() {
        super(ProfoutMapKey.class);
    }

    public int compare(WritableComparable lh, WritableComparable rh) {
        ProfoutMapKey lk = (ProfoutMapKey) lh;
        ProfoutMapKey rk = (ProfoutMapKey) rh;
        if (lk.getVisitorId() > rk.getVisitorId()) return 1;
        if (lk.getVisitorId() < rk.getVisitorId()) return -1;
        if (lk.getInterestValue() < rk.getInterestValue()) return 1;
        if (lk.getInterestValue() > rk.getInterestValue()) return -1;
        return 0;
    }
}

I need different comparators for MyKey in different mapreduce jobs, so I
cannot simply override the compareTo method in MyKey, and I cannot register a
single custom comparator for MyKey, because each job needs its own.


After upgrading to 0.18 I started receiving NullPointerExceptions when
running tasks that define the comparator via
JobConf.setOutputKeyComparatorClass:


08/08/29 19:49:20 INFO mapred.TaskInProgress: Error from attempt_200808291949_0001_m_00_1: java.lang.NullPointerException
    at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:96)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:487)
    at org.apache.hadoop.util.QuickSort.fix(QuickSort.java:30)
    at org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:83)
    at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:750)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:228)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)



It looks like this is because the "buffer" variable is not initialised: the
default constructor for WritableComparator no longer initialises it, and there
is no way to make it do so. See the patches in
https://issues.apache.org/jira/browse/HADOOP-3665.


Is it possible to change the visibility of WritableComparator(Class keyClass,
boolean createInstances) from private to protected so it can be called from a
subclass, or to provide another way to tell WritableComparator to initialize
fully?


Re: MultipleOutputFormat versus MultipleOutputs

2008-08-29 Thread Shirley Cohen

Thanks, Benjamin. Your example saved me a lot of time :))

Shirley

On Aug 28, 2008, at 8:03 AM, Benjamin Gufler wrote:


Hi Shirley,

On 2008-08-28 14:32, Shirley Cohen wrote:

Do you have an example that shows how to use MultipleOutputFormat?


using MultipleOutputFormat is actually pretty easy. Derive a class from it,
overriding - if you want to base the destination file name on the key and/or
value - the method "generateFileNameForKeyValue". I'm using it this way:

protected String generateFileNameForKeyValue(K key, V value,
                                             String name) {
    return name + "-" + key.toString();
}

Pay attention to not generating too many different file names, however: all
the files are kept open until the Reducer terminates, and operating systems
usually impose a limit on the number of open files you can have.

Also, if you haven't done so yet, please upgrade to the latest release, 0.18,
if you want to use MultipleOutputFormat. Up to 0.17.2, there was some trouble
with Reducers having more than one output file (see HADOOP-3639 for the
details).

Benjamin
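
For a fuller picture, a hedged sketch of how such a subclass might be
declared and wired into a job - the class name, key/value types and the
JobConf variable are illustrative, not taken from Shirley's actual code:

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Hypothetical output format that appends the key to the part file name,
// following Benjamin's generateFileNameForKeyValue override above.
public class KeyPartitionedOutputFormat
        extends MultipleTextOutputFormat<IntWritable, FloatWritable> {

    protected String generateFileNameForKeyValue(IntWritable key,
                                                 FloatWritable value,
                                                 String name) {
        // "name" is the default part-NNNNN name handed in by the framework.
        return name + "-" + key.toString();
    }
}

In the job setup it would then be registered with
conf.setOutputFormat(KeyPartitionedOutputFormat.class).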




EC2 AMI for Hadoop 0.18.0

2008-08-29 Thread Stuart Sierra
Anybody have one?  Any success building it with create-hadoop-image?
Thanks,
-Stuart


Re: Problem in Map/Reduce

2008-08-29 Thread John Meagher
Did you override the equals and hashCode methods?  These are the
methods usually used in a map to determine equality for put/get
operations.  The comparator is probably only used for sorting, not
equality checks.
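
A hedged sketch of a two-string key with compareTo, equals and hashCode kept
consistent - the class and field names are illustrative, not Ilay's actual
datum class:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical two-string key. Keeping equals/hashCode consistent with
// compareTo means hash partitioning and map lookups agree on key identity.
public class PairKey implements WritableComparable {
    private String first = "";
    private String second = "";

    public void write(DataOutput out) throws IOException {
        out.writeUTF(first);
        out.writeUTF(second);
    }

    public void readFields(DataInput in) throws IOException {
        first = in.readUTF();
        second = in.readUTF();
    }

    public int compareTo(Object o) {
        PairKey other = (PairKey) o;
        int cmp = first.compareTo(other.first);
        return (cmp != 0) ? cmp : second.compareTo(other.second);
    }

    public boolean equals(Object o) {
        if (!(o instanceof PairKey)) {
            return false;
        }
        PairKey k = (PairKey) o;
        return first.equals(k.first) && second.equals(k.second);
    }

    public int hashCode() {
        return first.hashCode() * 163 + second.hashCode();
    }
}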



On Fri, Aug 29, 2008 at 2:55 AM, P.ILAYARAJA <[EMAIL PROTECTED]> wrote:
>
> Hello:
>
> I wrote a simple Map/Reduce program. The output key of the Map function is a 
> user defined datum(class)
> with two member strings. The OutputKeyComparatorClass is set to this datum 
> class and the class
> overrides the "compareTo" function.
>
> The problem is the final MapOutput from reduce has the same "key" occurring in 
> more than one record.
> Any thoughts on why this could happen?
>
> Also I see that the compareTo function never gets as input the pair of "keys" 
> that are same for comparison.
>
> Regards,
> Ilay


Re: Re: Timeouts at reduce stage

2008-08-29 Thread Иван
Thanks for a fast reply, but in fact it sometimes fails even on default MR
jobs like, for example, the rowcounter job from the HBase 0.2.0 distribution.
Hardware problems are theoretically possible, but they don't seem to be the
cause, because everything else is operating fine on the same set of servers.
It seems that all major components of each server are fine; even the disk
arrays are regularly checked by datacenter staff.

Ivan Blinkov


Re: parallel hadoop process reading same input file

2008-08-29 Thread Deepak Diwakar
My good luck - I resolved the problem. To run more than one map task you need
a different hadoop directory for each. Then go to /hadoop-home/conf and copy
the following property from hadoop-default.xml:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>

Paste it into hadoop-site.xml and set a different value for each hadoop
directory. Then there is no conflict over where the intermediate files of the
different map tasks are kept.

Thanks
Deepak,

2008/8/29 Deepak Diwakar <[EMAIL PROTECTED]>

> I am running  two different hadoop map/reduce task in standalone mode on
> single node which read same folder. I found that Task1 was not  able to
> processed those file which have  been processed by Task2 and vice-versa. It
> gave some IO error. It seems that in standalone mode  while processing the
> file map task usually locks the file internally (Hoping that should not  be
> the case in DFS mode)
>
> One more observation I found that two map task can't be run on single task
> tracker or single node simultaneously(even if you setup two different hadoop
> directory and try to run map task from both places) . Possible reason I
> could think for is " Hadoop stores its intermediate map /reduce task output
> into some file format in /tmp/ folder. Hence if we run two map task
> simultaneously then it finds conflict keep the intermediate files at the
> same location and results error.
>
> This is my interpretation.
>
> Any feasible solution are appreciable for the standalone mode.
>
> Thanks
> Deepak
>
>
>
> 2008/8/28 lohit <[EMAIL PROTECTED]>
>
> Hi Deepak,
>> Can you explain what process and what files they are trying to read? If
>> you are talking about map/reduce tasks reading files on DFS, then, yes
>> parallel reads are allowed. Multiple writers are not.
>> -Lohit
>>
>>
>>
>> - Original Message 
>> From: Deepak Diwakar <[EMAIL PROTECTED]>
>> To: core-user@hadoop.apache.org
>> Sent: Thursday, August 28, 2008 6:06:58 AM
>> Subject: parallel hadoop process reading same input file
>>
>> Hi,
>>
>> When I am running two hadoop processes in parallel and both process has to
>> read same file. It fails.
>> Of course one solution is to keep copy of file into different location so
>> that accessing simultaneously would not cause any problem. But what if we
>> don't want to do so because it costs extra space.
>> Plz do suggest me any suitable solution to this.
>>
>> Thanks & Regards,
>> Deepak
>>
>>
>
>
>
>
>


Re: Timeouts at reduce stage

2008-08-29 Thread Miles Osborne
The problem here is that when a mapper fails, it may either be due to
some bug within that mapper OR it may be due to hardware problems of
one kind or another (disks getting full, etc.). If you configure
Hadoop to use job replication, then in either case a failing job will
get resubmitted multiple times.

For the first case, this is fine - the whole job will complete,
possibly taking longer.

For the second case, the entire job will fail.

If you want the entire job to fail at once, you should disable
preemptive scheduling. Naturally this means you are hoping that your
hardware is fine and that the only source of mapper failures is
bug-related.

Miles
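
If "preemptive scheduling" here refers to Hadoop's speculative execution, a
hedged sketch of switching it off for a single job (the JobConf plumbing
around it is assumed) would be:

import org.apache.hadoop.mapred.JobConf;

// Hedged sketch: disable speculative re-execution of slow tasks for one job;
// the cluster-wide equivalent is setting mapred.speculative.execution=false.
public class NoSpeculationExample {
    public static void configure(JobConf conf) {
        conf.setSpeculativeExecution(false);
    }
}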

2008/8/29 Иван <[EMAIL PROTECTED]>:
> From time to time I'm experiencing huge decrease of performance while running 
> some MR jobs.
> The reason have revealed itself quite easily - some tasks have failed 
> according to JobTracker's web interface.
> Record reporting such a failure usually looks somehow like this (usually 
> appears at exact reduce stage):
> "Task task_200808270610_0085_m_000242_0 failed to report status for 600 
> seconds. Killing!"
>
> In fact it doesn't seems to be somehow related with exact type of job which 
> is currently running - it just appears from time to time with different ones. 
> But if that's the case - the execution time of job becomes several times 
> longer and finally usually results in job failure. The changing of some 
> configuration options like mapred.task.timeout generally only makes the death 
> of a job faster, but really doesn't somehow help to cure the problem.
>
> Are there any suggestions about the possible reasons of such a behavior of 
> mapreduce framework or maybe someone have already experienced the same 
> problems?
>
> Thanks!
>
> Ivan Blinkov
>
>



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


Timeouts at reduce stage

2008-08-29 Thread Иван
From time to time I'm experiencing a huge decrease in performance while
running some MR jobs.
The reason revealed itself quite easily - some tasks have failed according to
the JobTracker's web interface.
A record reporting such a failure usually looks something like this (it
usually appears at the reduce stage):
"Task task_200808270610_0085_m_000242_0 failed to report status for 600
seconds. Killing!"

In fact it doesn't seem to be related to the exact type of job that is
currently running - it just appears from time to time with different ones. But
when that's the case, the execution time of the job becomes several times
longer and it usually ends in job failure. Changing configuration options like
mapred.task.timeout generally only makes the job die faster, but doesn't
really help to cure the problem.

Are there any suggestions about the possible reasons for such behavior of the
mapreduce framework, or has someone already experienced the same problems?

Thanks!

Ivan Blinkov