Hadoop - is it good for me and performance question

2008-06-29 Thread yair gotdanker
Hello all,



I am a newbie to Hadoop. The technology seems very interesting, but I am not
sure it suits my needs. I would really appreciate your feedback.



The problem:

I have multiple logservers, each receiving 10-100 MB/minute. The received
data is processed to produce aggregated data.
Processing the data should take a few minutes at most (10 min).
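
A rough sizing sketch for the numbers above may help frame the question. It
assumes the rate is in megabytes per minute, and it uses a made-up server
count (NUM_SERVERS), since the post only says "multiple" logservers. The
function names are illustrative only.

```python
# Back-of-envelope sizing for the log-aggregation workload described above.
# Assumptions (not from the post): the rate is in megabytes per minute, and
# NUM_SERVERS is a hypothetical figure.

NUM_SERVERS = 5               # hypothetical; the post only says "multiple"
RATE_MB_PER_MIN = (10, 100)   # per-server ingest range from the post
WINDOW_MIN = 10               # upper bound on processing time from the post


def window_volume_mb(rate_mb_per_min, servers, window_min):
    """Data arriving across all servers during one processing window."""
    return rate_mb_per_min * servers * window_min


def required_throughput_mb_per_s(rate_mb_per_min, servers):
    """Sustained throughput needed just to keep up with ingest."""
    return rate_mb_per_min * servers / 60.0


for rate in RATE_MB_PER_MIN:
    volume = window_volume_mb(rate, NUM_SERVERS, WINDOW_MIN)
    throughput = required_throughput_mb_per_s(rate, NUM_SERVERS)
    print(f"{rate} MB/min/server: {volume} MB per window, "
          f"{throughput:.2f} MB/s sustained")
```

Under these assumptions, the high end works out to 5000 MB arriving per
10-minute window, so the job has to sustain a bit over 8 MB/s just to keep
up with ingest.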

In addition, I benchmarked the wordcount example from the quickstart
tutorial on my PC (pseudo-distributed, using the quickstart configuration
files), and it took about 40 seconds!
I must be missing something or doing something wrong, since 40 seconds is
way too long.
The map/reduce functions should be very fast since they do almost no
processing, so I guess most of the time is spent in the Hadoop framework.

I would appreciate any help in understanding this and in improving the
performance.
BTW:
Does anyone know of a good behind-the-scenes tutorial that explains in more
detail how the jobtracker and tasktracker communicate, and so on?


configuration basics

2008-06-29 Thread Chris Anderson
On Fri, Jun 27, 2008 at 9:15 AM, Rick Cox [EMAIL PROTECTED] wrote:

 Yes, mapred.tasktracker.map.tasks.maximum is configured per
 tasktracker on startup. It can't be configured per job because it's
 not a job-scope parameter (if there are multiple concurrent jobs, they
 have to share the task limit).

 rick
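
For illustration, a tasktracker-scope setting like this one lives in each
node's site configuration file (hadoop-site.xml in releases of this era) and
is read when the tasktracker daemon starts. The value 4 below is an
arbitrary placeholder, not a recommendation:

```xml
<!-- conf/hadoop-site.xml on each tasktracker node -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <!-- placeholder value; maximum concurrent map tasks on this node -->
  <value>4</value>
</property>
```

A change to this value takes effect only after the tasktracker on that node
is restarted, which is why it cannot be overridden from an individual job's
configuration.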


Is there a good way to discover which parameters can be configured on
a job basis, vs a tasktracker or site basis?

E.g., I'd like to change my dfs.replication.min when I add new nodes to
my cluster, if possible. Restarting dfs leaves me with namenodeId
mismatches (and that's not good), so that's not really an option.

Thanks!
Chris

-- 
Chris Anderson
http://jchris.mfdz.com