That's true, the server can be represented as a sole "reducer". The objective, though, is to distribute as much as possible and reduce sequential work, so it would be beneficial in some cases to have a distributed group of "reducers".
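To make the key partitioning concrete, here is a minimal Python sketch of such a distributed reduce, borrowing the weather-station example quoted later in this thread (each Map parses one shard of station records; each Reduce owns a two-decade slice of years). All names are illustrative only - this is not code from the BOINC prototype:

```python
def map_task(records):
    """records: iterable of (station, year, temp) tuples from one input shard.
    Emit the max temperature seen per year - at most ~100 small values."""
    maxima = {}
    for station, year, temp in records:
        if year not in maxima or temp > maxima[year]:
            maxima[year] = temp
    return maxima

def partition(year, first_year=1900, years_per_reducer=20):
    """Route a key (year) to the reducer owning its two-decade slice."""
    return (year - first_year) // years_per_reducer

def reduce_task(map_outputs, my_years):
    """Combine the per-shard maxima for the years this reducer owns,
    keeping the global max per year."""
    collected = {y: [] for y in my_years}
    for out in map_outputs:
        for year, m in out.items():
            if year in collected:
                collected[year].append(m)
    return {y: max(vals) for y, vals in collected.items() if vals}
```

The point is that each Map ships only a handful of (year, value) pairs, and each Reduce needs only the Map outputs whose keys hash/partition to it, so Mapper-to-Reducer traffic stays small.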
>> BOINC only needs an outbound connection to the internet, would map-reduce
>> change that?

Not necessarily. Other projects would still work regardless of MapReduce integration (this is a requirement: existing clients, and projects without MR jobs, would have to work as usual). Even on projects using MapReduce, there are alternatives. The idea is to make this as transparent to the user as possible; we don't want to raise the barrier to entry into a project, or make life harder for those already using BOINC.

I can think of several ways to get past the firewall issue:

- Have an option in the client GUI to choose the port on which to receive incoming connections, much like BitTorrent does it. People who want to make the most of it have to open ports on the router/firewall. Not an ideal solution, but simple enough to at least be an option.

- A publish/subscribe distributed data center, with a mechanism similar to Volpex. All communication is initiated by the client that has the data, which is then stored in and retrieved from a "mirror".

- Super-peers/nodes. They could act as the data center mentioned above, or simply be clients that have more hardware capability, faster network throughput, higher availability, and possibly an external IP address. They could be rewarded according to the amount of data retrieved and/or sent through them. It would be a Skype- or KaZaA-like arrangement; the problem here would be the amount of data. There is a project under development in Cardiff, called Attic (http://www.atticfs.org/), that was supposedly dealing with this exact issue. I helped out a couple of years ago, so I'm not sure they're still going in the same direction.

- Go through the central server. This is what is already being done, as Nicolas said. The only possible difference would be to ship the outputs out to a "reduce" phase on clients, instead of running everything on the central server.
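As a toy illustration of the publish/subscribe option, the sketch below uses an in-memory dict to stand in for the mirror; the point is only that both the producer and the consumer initiate their own transfers (outbound push and outbound pull), so neither client ever needs to accept an inbound connection. The class and function names are hypothetical - they are not Attic's or Volpex's API:

```python
class Mirror:
    """Stand-in for a publicly reachable data-center node; a real one
    would be an HTTP store that clients PUT to and GET from."""
    def __init__(self):
        self._store = {}

    def put(self, key: str, payload: bytes) -> None:
        self._store[key] = payload

    def get(self, key: str) -> bytes:
        return self._store[key]

def map_worker_finish(mirror: Mirror, task_id: str, output: bytes) -> None:
    # Outbound push: the firewalled Map client uploads its result.
    mirror.put(f"map-output/{task_id}", output)

def reduce_worker_fetch(mirror: Mirror, task_id: str) -> bytes:
    # Outbound pull: the firewalled Reduce client downloads it later.
    return mirror.get(f"map-output/{task_id}")
```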
There are applications that could benefit from this; I'm sure some projects have enough bandwidth but not the computing power or storage to handle larger analyses. BitTorrent is actually a good example of a system that works well with inter-client data transfers. Something like 20% of the users are responsible for 80% of the uploads, which lends some support to the possibility of having a small set of trusted nodes, rewarded for their extra work. In BitTorrent they are rewarded with better speeds - although one can argue whether that's enough of an incentive, or whether they're simply altruistic and working for the community.

There are other possibilities, such as moving inter-client communication to UDP and using hole-punching techniques (which I'm told does not work too well in the real world). I'm sure it would be possible to get around it.

Fernando

----- Original Message -----
From: "Nicolás Alvarez" <[email protected]>
To: <[email protected]>
Cc: "Fernando Costa" <[email protected]>; <[email protected]>
Sent: Friday, November 05, 2010 4:12 AM
Subject: Re: [boinc_dev] Hadoop and BOINC

> IMHO most BOINC projects are already map-reduce. The "map" steps are done
> by BOINC clients (mapping an input file into an output file), and the
> "reduce" step is done by the server later (taking the output files from
> BOINC clients and reducing them into the answer to life, the universe and
> everything).
>
> El 04/11/2010, a las 09:58, [email protected] escribió:
>> OK, and what do you do with the reduced data? If the next step is to
>> send the reduced data to another client, there is a major problem. BOINC
>> clients cannot, in general, talk to each other, as each one may be
>> behind its own firewall. If the next step is to send the data to the
>> same client, is there a point?
>>
>> In general, the graph looks like:
>>
>> Your Machine on your network:
>> BOINC Client ----- Firewall ----- Cloud ----- Project Server
>>                                              /
>> My Machine on my network:                   /
>> BOINC Client ----- Firewall ----- Cloud ---/
>>
>> Most BOINC clients are singleton clients, as opposed to BOINC farms
>> where a single individual may own several computers that are all
>> attached to BOINC. Even in the case where a single individual has
>> control over two computers, one may be in the home office and the other
>> in the work office, with separate firewalls and separate firewall
>> policies, so that the clients cannot talk to each other.
>>
>> BOINC only needs an outbound connection to the internet; would
>> map-reduce change that?
>>
>> jm7
>>
>> Fernando Costa <flco...@student.dei.uc.pt>
>> 11/04/2010 07:54 AM
>> To: [email protected], [email protected]
>> cc: Ali Gholami <[email protected]>
>> Subject: Re: [boinc_dev] Hadoop and BOINC
>>
>> Hi,
>>
>> I'm actually working on the subject, and have a BOINC prototype that
>> can run MapReduce jobs. It is not a BOINC-Hadoop integration, though: I
>> did not use any of Hadoop's code and cannot run any of its apps. I
>> simply made changes to the BOINC client and server to be able to run a
>> Map phase and then use the outputs in a Reduce function afterwards.
>>
>> The patent part I was not aware of, but if Hadoop has received a
>> license for it, and is still used in clusters by many big names like
>> IBM, Cloudera, Yahoo and MS Bing, I don't see why it could not be
>> applied in an Internet environment. Map and Reduce operations have been
>> around for decades, the patent does not prevent their use, and there
>> are so many differences when moving it out of a data center that I
>> don't think there will be a problem.
>>
>> From the patent itself:
>> http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/PTO/srchnum.htm&r=1&f=G&l=50&s1=7,650,331.PN.&OS=PN/7,650,331&RS=PN/7,650,331
>>
>> "What is claimed is:
>>
>> 1. A system for large-scale processing of data, comprising: a plurality
>> of processes executing on a plurality of interconnected processors; the
>> plurality of processes including a master process, for coordinating a
>> data processing job for processing a set of input data, and worker
>> processes;"
>>
>> In BOINC's case, there are no interconnected processors, the master
>> process is the server, the tasks are not assigned by the master per se
>> (they are requested by the clients themselves), and there is no Google
>> File System (or Hadoop's HDFS) - the patent refers to that as "a
>> plurality of intermediate data structures [that] are used to store the
>> intermediate data values".
>>
>> Anyway, talking to Google directly would probably be best, and I don't
>> think they would have any problem with it. If MapReduce could
>> effectively be applied to BOINC, and Volunteer Computing in general,
>> the patent should not be enough of a reason to stop us from at least
>> trying.
>>
>> I'm starting to run the first tests on a smaller scale, on a cluster.
>> There are still many issues to tackle, such as connectivity (Volpex,
>> super-peers, even going through the server come to mind), but the fact
>> that there are so many different MR applications out there means that
>> we can experiment with several alternatives before dismissing it as a
>> data-intensive paradigm for clusters only.
>>
>> Just as a quick example: an MR job to get the average of the max
>> temperature of each year for the past 100 years, where the input is
>> measurements from thousands of weather stations around the world.
>> Each Map task would gather part of the input data, parse it, and output
>> the max/average for every year (which means only 100 values - 1 per
>> year - as output for each Map). The Map part is already done by BOINC,
>> since it's embarrassingly parallel. This output would then be sent to
>> different Reduce workers, each responsible for a unique set of keys
>> (for example, each Reduce would get the output for 2 decades, so we
>> would have 5 Reduce tasks).
>> The communication between Mappers and Reducers would be minimal, and
>> the initial data would either be downloaded from the central server or
>> be previously distributed and stored on clients - like the stor...@home
>> project wanted to do for fold...@home.
>>
>> Just my 2 cents; this could all be a mistake, but it's worth a shot.
>>
>> Fernando
>>
>> [email protected] wrote:
>>> Mapreduce looks like it is designed for multiple steps in the process
>>> of breaking up the problem on a tightly linked, trusted server
>>> cluster. BOINC is loosely linked, and the devices are not to be
>>> trusted. The end hosts also cannot talk to each other, as many are
>>> behind firewalls and will not allow incoming connections.
>>>
>>> It is also true that Google is claiming a patent on the algorithm.
>>> BOINC needs to stay away from patented code if at all possible.
>>>
>>> Mapreduce might work as part of the splitter for a single project if
>>> the data set makes sense for that. I do not see how it would work
>>> anywhere else in BOINC.
>>>
>>> Anyone have any other ideas?
>>>
>>> jm7
>>>
>>> Ali Gholami <aligh.mail...@gmail.com>
>>> Sent by: <[email protected]>
>>> 11/02/2010 02:30 PM
>>> To: [email protected]
>>> Subject: [boinc_dev] Hadoop and BOINC
>>>
>>> Hi everyone,
>>>
>>> I have a question about integrating BOINC with Hadoop (an open
>>> implementation of the MapReduce framework). I've read a little bit
>>> about BOINC and how it supports other projects, particularly in the
>>> fold...@home area. I'm just wondering if Hadoop can be useful in terms
>>> of integration with BOINC. I'd appreciate it a lot if you have some
>>> ideas or some guides so that I can understand this problem better.
>>>
>>> Best regards
>>> Ali Gholami
>>> _______________________________________________
>>> boinc_dev mailing list
>>> [email protected]
>>> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
>>> To unsubscribe, visit the above URL and
>>> (near bottom of page) enter your email address.

_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
