On Dec 2, 2009, at 4:08 PM, Habermaas, William wrote:

> Hadoop isn't going to like losing its datanodes when people shutdown their 
> computers.  

Of course, that's what makes it a fun project ;)

Maciej, this is definitely possible, but it is a large project.  My 
recommendations are:
1) Talk to the Condor folks, who are working on a Hadoop-on-Demand-like 
system integrated with Condor.  Condor has a huge number of knobs for things 
like shutting down jobs when mouse/keyboard activity is detected.  It also 
works on Windows.
2) See the new code slated for 0.21.0 that gives you a pluggable framework for 
data placement.  This would allow you to pick and choose which hosts your data 
goes to (as it will have to go away when the owners come back).
3) In conjunction with (2), talk to David Anderson's research team at Berkeley. 
 IIRC, he had a grad student doing work along the lines of "in order for a 
service to have 99% uptime, how many BOINC hosts must it be running on?". 
Similarly, you should be able to get good availability by replicating your 
data to enough different hosts (although BOINC was lucky in that it could run 
during the "night hours" of any time zone across the world).
4) Security.  I haven't even begun to think about how you'd secure this.
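For (2), my understanding is that the pluggable placement framework is selected 
via a NameNode config key.  A sketch of what the hdfs-site.xml entry would look 
like -- the policy class name here is hypothetical (something you'd write 
yourself), and you should check the 0.21.0 docs for the exact property name:

```xml
<!-- hdfs-site.xml: plug in a custom block placement policy.
     VolunteerAwarePlacementPolicy is a hypothetical class implementing
     the pluggable placement interface; it could, e.g., avoid hosts whose
     owners are currently active. -->
<property>
  <name>dfs.block.replicator.classname</name>
  <value>org.example.VolunteerAwarePlacementPolicy</value>
</property>
```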
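A back-of-the-envelope version of the availability question in (3): if each 
volunteer host is independently online with probability p, a block replicated 
to n hosts is readable with probability 1 - (1 - p)^n.  The function name and 
the independence assumption are mine, not BOINC's actual model:

```python
import math

def replicas_needed(p, target):
    """Smallest replica count n such that 1 - (1 - p)**n >= target,
    assuming each host is independently online with probability p."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))

# Hosts online half the time, 99% block-availability target:
print(replicas_needed(0.5, 0.99))   # -> 7
print(1 - (1 - 0.5) ** 7)           # -> 0.9921875
```

So even with hosts that are off half the time, a replication factor around 7 
gets you to 99% for a single block -- though real-world host uptimes are 
correlated (everyone sleeps at night in the same time zone), which is exactly 
BOINC's advantage of being globally distributed.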

There are lots of challenges and good, hard problems to think about.  No 
guarantee of success.  I guess that's what makes it a research project.

Brian

> More importantly, when the datanodes are running, your users will be impacted 
> by data replication. Unlike SETI@home, Hadoop doesn't know when the user's 
> screensaver is running, so it will start doing things whenever it feels like 
> it.
> 
> Can someone else comment on whether HOD (hadoop-on-demand) would fit this 
> scenario? 
> Bill   
> 
> -----Original Message-----
> From: Maciej Trebacz [mailto:maciej.treb...@gmail.com] 
> Sent: Wednesday, December 02, 2009 4:50 PM
> To: common-user@hadoop.apache.org
> Subject: Using Hadoop in non-typical large scale user-driven environment
> 
> First of all, I'd like to say hi to all people on the list.
> 
> I ran across the Hadoop and Cloudera projects recently, and I was
> immediately intrigued by them, because I'm in the middle of writing a
> project that will use large-scale distributed computing for a degree
> at my school. It seems like a perfect tool for me to use, but I have
> some questions to make sure this is the right tool for my needs.
> 
> The project I'm making assumes that there is one master node
> distributing data and several (in theory, hundreds, thousands or
> more) slave nodes. Up to this point, this is exactly what Hadoop is
> for. But now comes the tricky part. I want the slaves to be computers
> that people use every day. Think SETI@home. So the user installs a
> Hadoop client and ideally forgets about it, and his computer helps do
> the computations. Also, the user will not want to spend much of his
> hard drive on the computation data.
> 
> The problem with this model, as far as I understand it, is that users
> will often shut down their computers (for whatever reason), once a day
> or even more often. Will that be a big problem for the Hadoop server
> to handle? I mean, I am afraid that most of the processing power and
> bandwidth will be used for controlling traffic in the network and it
> will not be effective.
> 
> I will appreciate any opinions on this.
> 
> -- 
> Best regards,
> Maciej "mav" Trębacz from Poland.
