OK, and what do you do with the reduced data? If the next step is to send
the reduced data to another client, there is a major problem: BOINC
clients cannot, in general, talk to each other, as each one may be behind
its own firewall. If the next step is to send the data to the same client,
is there a point?
In general, the graph looks like:

Your machine on your network:
  BOINC Client ---- Firewall ---- Cloud ----+---- Project Server
                                            |
My machine on my network:                   |
  BOINC Client ---- Firewall ---- Cloud ----+

Most BOINC clients are singleton clients, as opposed to BOINC farms where
a single individual owns several computers that are all attached to
BOINC. Even where a single individual controls two computers, one may be
in the home office and the other in the work office, with separate
firewalls and separate firewall policies, so the clients cannot talk to
each other.

BOINC only needs an outbound connection to the internet. Would map-reduce
change that?
jm7
Fernando Costa <flco...@student.dei.uc.pt>
11/04/2010 07:54 AM
To: [email protected], [email protected]
cc: Ali Gholami <[email protected]>
Subject: Re: [boinc_dev] Hadoop and BOINC

Hi,
I'm actually working on the subject, and have a BOINC prototype that can
run MapReduce jobs. It is not a BOINC-Hadoop integration, though: I did
not use any of Hadoop's code and cannot run any of its apps. I simply made
changes to the BOINC client and server to be able to run a Map phase and
then use the outputs in a Reduce function afterwards.

I was not aware of the patent, but if Hadoop has received a license for
it and is still used in clusters by many big names like IBM, Cloudera,
Yahoo and MS Bing, I don't see why it could not be applied in an Internet
environment. Map and Reduce operations have been around for decades, the
patent does not prevent their use, and there are so many differences when
moving MapReduce out of a data center that I don't think there will be a
problem.
From the patent itself
http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/PTO/srchnum.htm&r=1&f=G&l=50&s1=7,650,331.PN.&OS=PN/7,650,331&RS=PN/7,650,331
"What is claimed is:
1. A system for large-scale processing of data, comprising: a plurality
of processes executing on a plurality of interconnected processors; the
plurality of processes including a master process, for coordinating a
data processing job for processing a set of input data, and worker
processes;"
In BOINC's case, there are no interconnected processors; the master
process is the server; the tasks are not assigned by the master "per se",
since they are requested by the clients themselves; and there is no Google
File System (or Hadoop's HDFS), which the patent refers to as
"A plurality of intermediate data structures are used to store the
intermediate data values".
Anyway, talking to Google directly would probably be best, and I don't
think they would have any problem with it. If MapReduce could
effectively be applied to BOINC, and Volunteer Computing in general, the
patent should not be enough of a reason to stop us from at least trying.
I'm starting to run the first tests on a smaller scale, on a cluster.
There are still many issues to tackle, such as connectivity (Volpex,
super-peers, even going through the server come to mind), but the fact
that there are so many different MR applications out there means that we
can experiment with several alternatives before dismissing it as a
data-intensive paradigm for clusters only.
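
To make the "going through the server" option concrete, here is a minimal
sketch in Python. It is not BOINC code and uses no BOINC API: ProjectServer,
run_mapper and run_reducer are hypothetical names, the server is simulated
in memory, and word counting is used only as a stand-in job. The point it
tries to show is that Map and Reduce clients never contact each other; each
one only makes calls to the server, as BOINC clients already do.

```python
from collections import defaultdict


class ProjectServer:
    """Hypothetical stand-in for the project server (not a BOINC API).

    It is the only host that accepts connections in this sketch.
    """

    def __init__(self):
        # reducer id -> list of intermediate (key, value) pairs
        self._mailboxes = defaultdict(list)

    def upload(self, reducer_id, pairs):
        """Called by a Map client over its own outbound connection."""
        self._mailboxes[reducer_id].extend(pairs)

    def download(self, reducer_id):
        """Called by a Reduce client over its own outbound connection."""
        return list(self._mailboxes[reducer_id])


def run_mapper(server, words, num_reducers):
    """Count words locally, then push each count to its reducer's mailbox."""
    counts = defaultdict(int)
    for word in words:
        counts[word] += 1
    for word, n in counts.items():
        # Deterministic partition so every mapper picks the same reducer
        # for the same key (Python's built-in hash() of str varies per run).
        reducer_id = sum(word.encode()) % num_reducers
        server.upload(reducer_id, [(word, n)])


def run_reducer(server, reducer_id):
    """Pull this reducer's intermediate pairs from the server and sum them."""
    totals = defaultdict(int)
    for word, n in server.download(reducer_id):
        totals[word] += n
    return dict(totals)
```

The obvious cost of this design is that all intermediate data crosses the
server twice, so it only looks attractive when Map outputs are small.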
Just as a quick example: an MR job to get the average of the max
temperature of each year for the past 100 years, where the input is
measurements from thousands of weather stations around the world.
The Map task would have to gather part of the input data, parse it, and
output the max/avg for every year (which means only 100 values, 1 per
year, as output for each Map). Map is already done by BOINC, since it's
embarrassingly parallel. This output would then have to be sent to
different Reduce workers, each responsible for a unique set of keys (for
example, each Reduce would get the output for 2 decades, so we would
have 5 Reduce tasks).
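
A rough sketch of this example in Python, under assumptions that are mine
and not part of the prototype: each Map task handles the readings of one
station and emits that station's maximum per year, each Reduce task
averages those maxima for its years, and the 100 years are taken to be
1911-2010. All function names are made up for illustration. Note that 100
years split into 2-decade key ranges gives 5 Reduce tasks.

```python
from collections import defaultdict

FIRST_YEAR = 1911          # assumption: "past 100 years" means 1911-2010
YEARS_PER_REDUCER = 20     # 2 decades of keys per Reduce task


def map_task(readings):
    """readings: iterable of (year, temperature) from one station.

    Returns {year: max temperature}, at most one value per year.
    """
    maxima = {}
    for year, temp in readings:
        if year not in maxima or temp > maxima[year]:
            maxima[year] = temp
    return maxima


def partition(year):
    """Map a year to the id of the Reduce task that owns it."""
    return (year - FIRST_YEAR) // YEARS_PER_REDUCER


def shuffle(map_outputs):
    """Group Map outputs as {reducer id: {year: [maxima...]}}."""
    buckets = defaultdict(lambda: defaultdict(list))
    for maxima in map_outputs:
        for year, temp in maxima.items():
            buckets[partition(year)][year].append(temp)
    return buckets


def reduce_task(grouped):
    """grouped: {year: [maxima...]}. Returns {year: average of maxima}."""
    return {year: sum(vals) / len(vals) for year, vals in grouped.items()}


# Tiny usage example with two stations.
station_a = [(1911, 30.0), (1911, 35.0), (1950, 28.0)]
station_b = [(1911, 33.0), (1950, 32.0)]
buckets = shuffle([map_task(station_a), map_task(station_b)])
results = {rid: reduce_task(grouped) for rid, grouped in buckets.items()}
```

Each Map output is bounded by the number of years, regardless of how many
raw readings the station produced, which is why the intermediate traffic
stays small.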
The communication between Mappers and Reducers would be minimal, and the
initial data would either be downloaded from the central server, or be
previously distributed and stored in clients - like the stor...@home
project wanted to, in fold...@home.
Just my 2 cents, this could all be a mistake but it's worth a shot.
Fernando
[email protected] wrote:
> Mapreduce looks like it is designed for multiple steps in the process of
> breaking up the problem on a tightly linked trusted server cluster. BOINC
> is loosely linked, and the devices are not to be trusted. The end hosts
> also cannot talk to each other as many are behind firewalls and will not
> allow incoming connections.
>
> It is also true that Google is claiming a patent on the algorithm. BOINC
> needs to stay away from patented code if at all possible.
>
> Mapreduce might work as a part of the splitter for a single project if the
> data set makes sense for that. I do not see how it would work anywhere
> else in BOINC.
>
> Anyone have any other ideas?
>
> jm7
>
>
>
> Ali Gholami <aligh.mail...@gmail.com>
> Sent by: <boinc_dev-bounce[email protected]u>
> 11/02/2010 02:30 PM
> To: [email protected]
> Subject: [boinc_dev] Hadoop and BOINC
>
> Hi everyone,
>
> I have a question about integrating BOINC with Hadoop (an open
> implementation of the MapReduce framework). I've read a little about
> BOINC and how it enables other projects, particularly in the
> fold...@home area. I'm just wondering whether Hadoop could be useful in
> terms of integration with BOINC. I'd appreciate it a lot if you have some
> ideas or guides that would help me understand this problem better.
>
>
> Best regards
> Ali Gholami
> _______________________________________________
> boinc_dev mailing list
> [email protected]
> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
> To unsubscribe, visit the above URL and
> (near bottom of page) enter your email address.