RandomWriter not responding to parameter changes

2008-08-14 Thread James Graham (Greywolf)

I have altered the values described in randomwriter, but they don't seem
to have any effect on the amount of data generated.

I am specifying the configuration file as the last parameter; it seems
to have no effect whatsoever.

Go figure.  What am I doing wrong?
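
For reference, the properties I believe the 0.16 RandomWriter example actually reads (names unverified against the source, and the values below are only illustrative, not the ones I used) are:

<property>
  <name>test.randomwriter.maps_per_host</name>
  <value>20</value>
  <!-- believed to control how many map tasks are started per host -->
</property>

<property>
  <name>test.randomwrite.bytes_per_map</name>
  <value>1073741824</value>
  <!-- believed to control how many bytes each map writes (1 GB here) -->
</property>

Note the apparent mismatch between "randomwriter" and "randomwrite" in the two names; if I have either name wrong, the settings would be silent no-ops, which would explain the lack of effect.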


sort failing, help?

2008-08-12 Thread James Graham (Greywolf)

Environment specifications:

Hadoop 0.16.4 (stable)
MACHINES: 20 (18 datanodes)
RAM: 8G
SWAP: none (most of our production machines do not use swap as it kills
response; this may change for the hadoop machines)
CPU: 4
OS: Gentoo Linux; kernel 2.6.23

Problem:
The sort example routine is failing.  The map completes successfully, but the
reduce fails with a GC out-of-memory (heap) error.

PARAMETERS (in human-readable format)
io.sort.mb = 256
io.file.buffer.size = 65536
io.bytes.per.checksum = 4096
fs.inmemory.size.mb = 2048
dfs.namenode.handler.count = 128
dfs.balance.bandwidthPerSec = 131072
mapred.job.tracker.handler.count = 1
local.cache.size = 238435456
mapred.map.tasks = 67
mapred.reduce.tasks = 23
mapred.reduce.parallel.copies = 4
mapred.child.java.opts = default (changing the heap size doesn't seem to help)
mapred.inmem.merge.threshold = 0 (let the ramfs memory consumption trigger)
mapred.submit.replication = 5
tasktracker.http.threads = 128
ipc.server.listen.queue.size = 128

# all others are default values.

What should I be looking at, here?
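
For reference, if the culprit is the reduce-side in-memory merge not fitting in the task JVM heap (fs.inmemory.size.mb is 2048 above, while the child heap is left at its default), the overrides would look something like the sketch below. The 512 MB and -Xmx1024m figures are guesses for 8 GB nodes, not tested values; shrinking the in-memory buffer may matter more than growing the heap, given that changing the heap alone did not seem to help.

<property>
  <name>fs.inmemory.size.mb</name>
  <value>512</value>
  <!-- guess: keep the reduce-side in-memory merge buffer well under the child heap -->
</property>

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
  <!-- guess: give each task JVM headroom for the merge buffer plus framework overhead -->
</property>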



Re: performance not great, or did I miss something?

2008-08-11 Thread James Graham (Greywolf)

Thus spake Allen Wittenauer::

> On 8/8/08 1:25 PM, "James Graham (Greywolf)" <[EMAIL PROTECTED]> wrote:
> > 226GB of available disk space on each one;
> > 4 processors (2 x dualcore)
> > 8GB of RAM each.
>
> Some simple stuff:
>
> (Assuming SATA):
> Are you using AHCI?
> Do you have the write cache enabled?

I will investigate this...

> Is the topologyProgram providing proper results?

The whowhat, now?

> Is DNS performing as expected? Is it fast?

DNS seems appropriately configured...

> How many tasks per node?

four, I think, each of map and reduce.

> How much heap does your name node have? Is it going into garbage collection
> or swapping?

Maybe GC; no swapping (our systems do not have swap allocated).



Re: namenode & jobtracker: joint or separate, which is better?

2008-08-08 Thread James Graham (Greywolf)

Thus spake lohit::
It depends on your machine configuration, how much resource it has, and
what you can afford to lose in case of failures.
It would be good to run the NameNode and jobtracker on their own dedicated
nodes, and the datanodes and tasktrackers on the rest of the nodes. We have seen
cases where tasktrackers take down nodes because of malicious programs; in such
cases you do not want your jobtracker or namenode to be on those nodes.
Also, running multiple JVMs might slow down the node and your process. I
would recommend you run at least the NameNode on a dedicated node.

Thanks,
Lohit


Good to know; thank you.



namenode & jobtracker: joint or separate, which is better?

2008-08-08 Thread James Graham (Greywolf)

Which is better, to have the namenode and jobtracker as distinct nodes
or as a single node, and are there pros/cons regarding using either or
both as datanodes?


performance not great, or did I miss something?

2008-08-08 Thread James Graham (Greywolf)

Greetings,

I'm very very new to this (as you could probably tell from my other postings).

I have 20 nodes available as a cluster, less one as the namenode and one as
the jobtracker (unless I can use them too).  Specs are:

226GB of available disk space on each one;
4 processors (2 x dualcore)
8GB of RAM each.

The RandomWriter takes just over 17 minutes to complete;
the Sorter takes three to four hours or more to complete
on only about half a terabyte of data.

This is certainly not the speed or power I had been led to expect from
Hadoop, so I am guessing I have some things tuned wrong (actually, I'm
certain some are tuned wrong, since I'm seeing processes die from lack of
memory during the reduce phase...).

Given the above hardware specs, what should I expect as a theoretical maximum
throughput?  Machines 3-10 are on one 1GbE switch, machines 11-20 are on a
second 1GbE switch, and the two are connected by a mutual 1GbE uplink (another switch).
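
(Rough arithmetic on my part, not a benchmark: a single 1GbE link tops out around 125 MB/s, so any shuffle traffic crossing between the two switch halves shares that one ~125 MB/s uplink. If roughly half of a 500 GB sort has to cross it, that is about 250 GB / 125 MB/s, roughly 2000 seconds, i.e. over half an hour for the cross-switch copying alone, before any map, merge, or reduce time. I have not measured where the traffic actually flows, so this is only a sanity check on my expectations.)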





mapred/map only at 2, always?

2008-08-07 Thread James Graham (Greywolf)

hadoop 0.16.4

Why are mapred.reduce.tasks and mapred.map.tasks always showing up
as "2"?

I have the same config on all nodes.
hadoop-site.xml contains the following parameters:


<property>
  <name>mapred.map.tasks</name>
  <value>67</value>
  <description>The default number of map tasks per job.  Typically set
  to a prime several times greater than number of available hosts.
  Ignored when mapred.job.tracker is "local".
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>23</value>
  <description>The default number of reduce tasks per job.  Typically set
  to a prime close to the number of available hosts.  Ignored when
  mapred.job.tracker is "local".
  </description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>idx1-r70:50030</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>




Re: Configuration: I need help.

2008-08-06 Thread James Graham (Greywolf)

Thus spake James Graham (Greywolf)::


Now I have something interesting going on. Given the following configuration
file, what am I doing wrong? When I type "start-dfs.sh" on the namenode,
as instructed in the docs, I end up with, effectively, "Address already in use;
shutting down NameNode".

I do not understand this. It's like it's trying to start it twice; netstat
shows no port 50070 in use after shutdown.

I feel like an idiot trying to wrap my mind around this! What the heck am
I doing wrong?


Never mind.  Declaring multiple services at the same port never works.





Re: Configuration: I need help.

2008-08-06 Thread James Graham (Greywolf)

Thus spake Otis Gospodnetic::

Hi James,

You can put the same hadoop-site.xml on all machines. Yes, you do want a 
secondary NN - a single NN is a SPOF. Browse the archives a few days 
back to find an email from Paul about DRBD (disk replication) to avoid 
this SPOF.


Okay, thank you!  Good to know (even though the documentation seems to state
that "secondary (NN)" is a misnomer, since it never takes over for the primary
NN).

Now I have something interesting going on.  Given the following configuration
file, what am I doing wrong?  When I type "start-dfs.sh" on the namenode,
as instructed in the docs, I end up with, effectively, "Address already in use;
shutting down NameNode".

I do not understand this.  It's like it's trying to start it twice; netstat
shows no port 50070 in use after shutdown.

I feel like an idiot trying to wrap my mind around this!  What the heck am
I doing wrong?





<property>
  <name>dfs.secondary.http.address</name>
  <value>0.0.0.0:50090</value>
  <description>
    The secondary namenode http server address and port.
    If the port is 0 then the server will start on a free port.
  </description>
</property>

<property>
  <name>dfs.datanode.address</name>
  <value>0.0.0.0:50010</value>
  <description>
    The address where the datanode server will listen to.
    If the port is 0 then the server will start on a free port.
  </description>
</property>

<property>
  <name>dfs.datanode.http.address</name>
  <value>0.0.0.0:50075</value>
  <description>
    The datanode http server address and port.
    If the port is 0 then the server will start on a free port.
  </description>
</property>

<property>
  <name>dfs.http.address</name>
  <value>idx2-r70:50070</value>
  <description>
    The address and the base port where the dfs namenode web ui will listen on.
    If the port is 0 then the server will start on a free port.
  </description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>idx1-r70:50030</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

<property>
  <name>mapred.job.tracker.http.address</name>
  <value>idx1-r70:50030</value>
  <description>
    The job tracker http server address and port the server will listen on.
    If the port is 0 then the server will start on a free port.
  </description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://idx2-r70:50070/</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.
  </description>
</property>



###
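
The collision, as it turned out, is visible above: fs.default.name and dfs.http.address both point at idx2-r70:50070, and mapred.job.tracker and mapred.job.tracker.http.address both point at idx1-r70:50030. Giving the filesystem and the job tracker RPC their own ports is the shape of the fix; the 9000/9001 below are only illustrative choices, not values I have validated:

<property>
  <name>fs.default.name</name>
  <value>hdfs://idx2-r70:9000/</value>
  <!-- illustrative port; must not clash with dfs.http.address (50070) -->
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>idx1-r70:9001</value>
  <!-- illustrative port; must not clash with mapred.job.tracker.http.address (50030) -->
</property>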


Configuration: I need help.

2008-08-06 Thread James Graham (Greywolf)

Seeing as there is no search function on the archives, I'm relegated
to asking a possibly redundant question or four:

I have, as a sample setup:

idx1-tracker    JobTracker
idx2-namenode   NameNode
idx3-slave  DataTracker
...
idx20-slave DataTracker

Q1: Can I put the same hadoop-site.xml file on all machines or do I need
to configure each machine separately?

Q2: My current setup does not seem to find a primary namenode, but instead
wants to put idx1 and idx2 as secondary namenodes; as a result, I am
not getting anything usable on any of the web addresses (50030, 50050,
50070, 50090).

Q3: Possibly connected to Q1:  The current setup seems to go out and start
on all machines (masters/slaves); when I say "bin/start-mapred.sh" on
the JobTracker, I get the answer "jobtracker running...kill it first".

Q4: Do I even *need* a secondary namenode?

IWBN if I did not have to maintain three separate configuration files
(jobtracker/namenode/datatracker).
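
For what it's worth, the piece I believe I was missing (this is my reading of the 0.16 start scripts, not verified line by line): start-dfs.sh starts the namenode on whichever machine you run it from, datanodes on every host listed in conf/slaves, and a secondary namenode on every host listed in conf/masters; start-mapred.sh does the analogous thing for the jobtracker and tasktrackers. That would explain idx1 and idx2 both coming up as secondary namenodes if both appear in conf/masters. With that layout, the same hadoop-site.xml (plus identical masters/slaves files) can be copied to every machine, with the two files looking something like:

conf/masters (hosts that run a secondary namenode):
idx1-tracker

conf/slaves (hosts that run a datanode and tasktracker):
idx3-slave
...
idx20-slave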