Re: reducers hanging problem

2008-06-30 Thread Chris Anderson
On Mon, Jun 30, 2008 at 8:30 AM, Andreas Kostyrka <[EMAIL PROTECTED]> wrote:
>  Plus it seems to be deterministic; it always stops with 3 reduce parts
> not finishing, although I haven't yet checked whether they are always
> the same errors or not.

I've been struggling to get my streaming tasks (map only, no reduce)
to run across large Nutch crawls, and I'd been having deterministic
failures as well. It's worth checking your streaming job against the
input data (maybe on a local workstation) to see that it doesn't blow
up on some of it. It turns out my Ruby/Hpricot XML parsers were having
a hard time swallowing large binary files (surprise), and as a result
some map tasks would always die in the same place.

I got my data to test locally by running the streaming jar with cat as
its mapper, then copying the results to my workstation and piping them
into my script. I haven't tried using cat as a reducer, but it should
yield output files suitable for running your streaming reducer over in
an instrumented environment.
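
Roughly, the extraction step looks like this (the jar version, paths,
and my_parser.rb are placeholders for whatever your setup uses):

  # identity map, no reduce: the output is just the raw input records
  bin/hadoop jar contrib/streaming/hadoop-0.17.0-streaming.jar \
    -input crawl-data -output extracted \
    -mapper /bin/cat \
    -jobconf mapred.reduce.tasks=0

  # copy the part files down and feed them to the script locally
  bin/hadoop dfs -get extracted extracted-local
  cat extracted-local/part-* | ./my_parser.rb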


-- 
Chris Anderson
http://jchris.mfdz.com


configuration basics

2008-06-29 Thread Chris Anderson
On Fri, Jun 27, 2008 at 9:15 AM, Rick Cox <[EMAIL PROTECTED]> wrote:
>
> Yes, mapred.tasktracker.map.tasks.maximum is configured per
> tasktracker on startup. It can't be configured per job because it's
> not a job-scope parameter (if there are multiple concurrent jobs, they
> have to share the task limit).
>
> rick
>

Is there a good way to discover which parameters can be configured on
a per-job basis, vs. a per-tasktracker or per-site basis?
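
For instance (a made-up streaming invocation, just to illustrate the
distinction I mean):

  # job scope: can be overridden per run
  bin/hadoop jar contrib/streaming/hadoop-0.17.0-streaming.jar \
    -input in -output out -mapper /bin/cat \
    -jobconf mapred.reduce.tasks=4

  # tasktracker/site scope: e.g. mapred.tasktracker.map.tasks.maximum,
  # which each tasktracker only reads from hadoop-site.xml at startup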

E.g. I'd like to change my dfs.replication.min when I add new nodes to
my cluster, if possible. Restarting DFS leaves me with namenodeId
mismatches (and that's not good), so that's not really an option.

Thanks!
Chris

-- 
Chris Anderson
http://jchris.mfdz.com


Re: process limits for streaming jar

2008-06-27 Thread Chris Anderson
Having experimented some more, I've found that the simple solution is
to limit resource usage by capping the number of map tasks and the
memory they are allowed to consume.

I'm specifying the constraints on the command line like this:

-jobconf mapred.tasktracker.map.tasks.maximum=2 \
  -jobconf mapred.child.ulimit=1048576

The configuration parameters seem to take; in the job.xml available
from the web console, I see these lines:

mapred.child.ulimit                    1048576
mapred.tasktracker.map.tasks.maximum   2
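
For completeness, the full invocation looks roughly like this (the
mapper script and paths are stand-ins for our actual job):

  bin/hadoop jar contrib/streaming/hadoop-0.17.0-streaming.jar \
    -input input -output output \
    -mapper my_parser.rb -file my_parser.rb \
    -jobconf mapred.tasktracker.map.tasks.maximum=2 \
    -jobconf mapred.child.ulimit=1048576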

The problem is that when there are a large number of map tasks to
complete, Hadoop doesn't seem to obey the map.tasks.maximum setting.
Instead, it spawns 8 map tasks per tasktracker (even when I change
mapred.tasktracker.map.tasks.maximum to 2 in hadoop-site.xml on the
master). The cluster was booted with the setting at 8. Do I need to
change hadoop-site.xml on all the slaves and restart the tasktrackers
to make the limit apply? That seems unlikely - I'd really like to
manage this parameter per job.

Thanks for any input!

Chris

-- 
Chris Anderson
http://jchris.mfdz.com


process limits for streaming jar

2008-06-25 Thread Chris Anderson
Hi there,

I'm running some streaming jobs on EC2 (Ruby parsing scripts), and in
my most recent test I managed to spike the load on my large instances
to 25 or so. As a result, I lost communication with one instance. I
think I took down sshd. Whoops.

My question is: has anyone got strategies for managing the resources
used by the processes spawned by the streaming jar? Ideally I'd like
to run my Ruby scripts under nice.
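
For example, I'm picturing a wrapper along these lines (parse.rb
stands in for the real script):

  #!/bin/sh
  # nice_wrapper.sh -- run the actual mapper at low priority
  exec nice -n 19 ./parse.rb "$@"

shipped with the job via something like:

  bin/hadoop jar contrib/streaming/hadoop-0.17.0-streaming.jar \
    -input in -output out \
    -mapper nice_wrapper.sh \
    -file nice_wrapper.sh -file parse.rb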

I can hack something together with wrappers like that, but I'm thinking
there might be a configuration option to handle this within the
streaming jar.
Thanks for any suggestions!

-- 
Chris Anderson
http://jchris.mfdz.com


Re: realtime hadoop

2008-06-23 Thread Chris Anderson
Vadim,

Depending on the nature of your data, CouchDB (http://couchdb.org)
might be worth looking into. It speaks JSON natively, and has
real-time map/reduce support. The 0.8.0 release is imminent (don't
bother with 0.7.2), and the community is active. We're using it for
something similar to what you describe, and it's working well.

Chris

-- 
Chris Anderson
http://jchris.mfdz.com


Re: contrib EC2 with hadoop 0.17

2008-06-09 Thread Chris Anderson
On Mon, Jun 9, 2008 at 9:01 AM, Chris K Wensel <[EMAIL PROTECTED]> wrote:
>
> configuration values should be set in conf/hadoop-site.xml. Those particular
> values you are referring to probably should be set per job and generally
> don't have anything to do with instance sizes but more to do with cluster
> size and the job being run.
>
> different instance sizes have mapred.tasktracker.map.tasks.maximum and
> mapred.tasktracker.reduce.tasks.maximum set accordingly (see hadoop-init),
> but again might/should be tuned to your application (cpu or io bound).
>

Thanks for clearing all this up, Chris. We're actually doing just
that, and having the recommendation from you to do it this way makes
me believe we're doing it right.

So far, Hadoop has been treating us well!

-- 
Chris Anderson
http://jchris.mfdz.com


Re: contrib EC2 with hadoop 0.17

2008-06-07 Thread Chris Anderson
On Sat, Jun 7, 2008 at 5:25 PM, Chris K Wensel <[EMAIL PROTECTED]> wrote:
> The new scripts do not use the start/stop-all.sh scripts, and thus do not
> maintain the slaves file. This is so cluster startup is much faster and a
> bit more reliable (keys do not need to be pushed to the slaves). Also we can
> grow the cluster lazily just by starting slave nodes.

Thanks for the description, Chris. Now that I understand the basic
model, I'm starting to see how the configuration is passed to the
slaves using the -d option of ec2-run-instances.

One config question: on our cluster (hadoop 0.17 with
INSTANCE_TYPE="m1.small") the conf/hadoop-default.xml has
mapred.reduce.tasks set to 1, and mapred.map.tasks set to 2.

From experimenting and reading the FAQ, it looks like those numbers
should be higher unless you have a single-machine cluster. Maybe
there's something I'm missing, but by upping mapred.map.tasks and
mapred.reduce.tasks to 5 and 15 (in our job jar) we're getting much
better performance. Is there a reason hadoop-init doesn't build a
hadoop-site.xml file with higher or configurable values for these
fields?
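
For our streaming runs the equivalent is just passing the values on
the command line, something like this (the paths and the cat mapper
are placeholders):

  bin/hadoop jar contrib/streaming/hadoop-0.17.0-streaming.jar \
    -input in -output out -mapper /bin/cat \
    -jobconf mapred.map.tasks=5 \
    -jobconf mapred.reduce.tasks=15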

> But it probably would be wise to provide scripts to build/refresh the slaves
> file, and push keys to slaves, so the cluster can be traditionally
> maintained, instead of just re-instantiated with new parameters etc.

I'm still getting the hang of best practices for deploying and
managing clusters, but for EC2 the all-or-nothing cluster approach
seems right. Maybe the slave scripts aren't needed.

>
> I wonder if these scripts would make sense in general, instead of being ec2
> specific?

There's so much functionality being handled by the ec2 script suite
that using Eucalyptus (http://eucalyptus.cs.ucsb.edu/, which allows
any data center to be managed like EC2) might make more sense.

Thanks again for the response. I think I'm starting to get the hang of this.

-- 
Chris Anderson
http://jchris.mfdz.com


contrib EC2 with hadoop 0.17

2008-06-07 Thread Chris Anderson
First of all, thanks to whoever maintains the hadoop-ec2 scripts.
They've saved us untold time and frustration getting started with a
small testing cluster (5 instances).

A question: when we log into the newly created cluster and run jobs
from the example jar (pi, etc.), everything works great. We expect our
custom jobs will run just as smoothly.

However, when we restart the namenode and tasktrackers by running
bin/stop-all.sh on the master, it tries to stop activity only on
localhost. Running start-all.sh then boots up a localhost-only cluster
(on which jobs run just fine).

The only way we've been able to recover from this situation is to use
bin/terminate-hadoop-cluster and bin/destroy-hadoop-cluster and then
start again from scratch with a new cluster.

There must be a simple way to restart the namenode, jobtracker, and
worker daemons across all machines from the master. Also, I think
understanding the answer to this question might put a lot more into
perspective for me, so I can go on to do more advanced things on my
own.

Thanks for any assistance / insight!

Chris


output from stop-all.sh
==

stopping jobtracker
localhost: Warning: Permanently added 'localhost' (RSA) to the list of
known hosts.
localhost: no tasktracker to stop
stopping namenode
localhost: no datanode to stop
localhost: no secondarynamenode to stop


conf files in /usr/local/hadoop-0.17.0
==

# cat conf/slaves
localhost
# cat conf/masters
localhost




-- 
Chris Anderson
http://jchris.mfdz.com


Re: hadoop on EC2

2008-05-28 Thread Chris Anderson
On Wed, May 28, 2008 at 2:23 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
> That doesn't work because the various web pages have links or redirects to
> other pages on other machines.
>
> Also, you would need to ssh to ALL of your cluster to get the file browser
> to work.

True. That makes it a little impractical.

>
> Better to do the proxy thing.
>

This would be a nice addition to the Hadoop EC2 AMI (which is super
helpful, by the way). Thanks to whoever put it together.


-- 
Chris Anderson
http://jchris.mfdz.com


Re: hadoop on EC2

2008-05-28 Thread Chris Anderson
Andreas,

If you can ssh into the nodes, you can always set up port-forwarding
with ssh -L to bring those ports to your local machine.
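
Something like this works (the hostnames and the login user are
placeholders for whatever your nodes use):

  # jobtracker web UI -> http://localhost:50030/
  ssh -L 50030:localhost:50030 root@ec2-master-public-dns

  # a tasktracker web UI -> http://localhost:50060/ (repeat per node)
  ssh -L 50060:localhost:50060 root@ec2-slave-public-dns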

On Wed, May 28, 2008 at 1:51 PM, Andreas Kostyrka <[EMAIL PROTECTED]> wrote:
> What I wonder is what ports do I need to access?
>
> 50060 on all nodes.
> 50030 on the jobtracker.
>
> Any other ports?
>
> Andreas
>
>> On Wednesday, 28.05.2008, at 13:37 -0700, Allen Wittenauer wrote:
>>
>>
>> On 5/28/08 1:22 PM, "Andreas Kostyrka" <[EMAIL PROTECTED]> wrote:
>> > I just wondered what other people use to access the hadoop webservers,
>> > when running on EC2?
>>
>> While we don't run on EC2 :), we do protect the hadoop web processes by
>> putting a proxy in front of it.  A user connects to the proxy,
>> authenticates, and then gets the output from the hadoop process.  All of the
>> redirection magic happens via a localhost connection, so no data is leaked
>> unprotected.
>>
>



-- 
Chris Anderson
http://jchris.mfdz.com