Hadoop - is it good for me and performance question
Hello all, I am a newbie to Hadoop. The technology seems very interesting, but I am not sure it suits my needs, so I'd really appreciate your feedback.

The problem: I have multiple log servers, each receiving 10-100 MB/minute. The received data is processed to produce aggregated data, and the processing should take a few minutes at most (10 min).

In addition, I ran a performance benchmark on the wordcount example from the quickstart tutorial on my PC (pseudo-distributed, using the quickstart configuration files), and it took about 40 seconds! I must be missing something, or doing something wrong, since 40 seconds is way too long: the map/reduce functions should be very fast, as there is almost no processing to do. So I guess most of the time is spent in the Hadoop framework itself. I would appreciate any help understanding this and how I can improve the performance.

By the way: does anyone know of a good behind-the-scenes tutorial, one that explains in more detail how the jobtracker and tasktrackers communicate and so on?
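For what it's worth, 40 seconds on a tiny input is consistent with Hadoop paying a roughly fixed per-job cost (JVM startup, job setup and scheduling, tasktracker heartbeat intervals) that swamps small jobs but amortizes away on large ones. The sketch below is back-of-the-envelope arithmetic only; the overhead and throughput constants are illustrative assumptions, not measured Hadoop figures:

```java
// Illustrative only: a toy model of fixed per-job overhead.
// The constants are assumptions, not measured Hadoop numbers.
public class JobOverheadModel {
    // Total job time = fixed framework overhead + data / throughput.
    static double jobSeconds(double fixedOverheadSec, double dataMb, double mbPerSec) {
        return fixedOverheadSec + dataMb / mbPerSec;
    }

    public static void main(String[] args) {
        // Assume ~35s of framework overhead and ~20 MB/s of processing.
        double tiny = jobSeconds(35, 1, 20);     // quickstart-sized input
        double big  = jobSeconds(35, 10000, 20); // a 10 GB input
        // For the tiny input, nearly all of the runtime is overhead;
        // for the big input, overhead is under 7% of the total.
        System.out.printf("tiny job: %.1fs, big job: %.1fs%n", tiny, big);
    }
}
```

The point is that the quickstart benchmark measures almost pure framework overhead, so it says little about throughput on your 10-100 MB/minute workload.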
configuration basics
On Fri, Jun 27, 2008 at 9:15 AM, Rick Cox [EMAIL PROTECTED] wrote:

> Yes, mapred.tasktracker.map.tasks.maximum is configured per tasktracker
> at startup. It can't be configured per job because it's not a job-scope
> parameter (if there are multiple concurrent jobs, they have to share the
> task limit).
>
> rick

Is there a good way to discover which parameters can be configured on a per-job basis, vs. a per-tasktracker or per-site basis? E.g., I'd like to change my dfs.replication.min when I add new nodes to my cluster, if possible. Restarting DFS leaves me with namenodeId mismatches (and that's not good), so that's not really an option.

Thanks!

Chris

--
Chris Anderson
http://jchris.mfdz.com
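A rough rule of thumb (my reading, not an authoritative list): parameters that describe a daemon or the cluster, such as dfs.* and mapred.tasktracker.*, are read from hadoop-site.xml when the daemon starts, while many plain mapred.* job parameters can be overridden per job, e.g. with -D on the job command line or via JobConf. A sketch of the site-scope side, with illustrative values:

```xml
<!-- hadoop-site.xml: daemon/site-scope settings, read at startup -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value> <!-- per tasktracker; cannot be set per job -->
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value> <!-- default replication for newly created files -->
</property>
```

If the underlying goal is to raise the replication factor of files that already exist after adding nodes, that can be done at runtime with `bin/hadoop fs -setrep -R <n> <path>`, with no restart needed; dfs.replication.min itself is, as far as I know, a namenode-scope setting read at startup.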