Hi everyone,
I'm trying to find a good way to run "distcc" on a cluster that is running
Scyld ClusterWare from Penguin Computing. This architecture consists of
several compute nodes which are hidden from the external network behind
a single master node which is responsible for managing a work queue and
dispatching jobs to appropriate compute nodes. The master and the
compute nodes are on a private network and can see each other, but the
only external access is to the master node. The "proper" way to use the
system is to submit jobs via the queuing system. I managed to come up
with a job script that does just that: it submits a job which reserves
several nodes, and when it gets scheduled, it runs "distccd" on the
assigned nodes and then does a "distcc" compile on the master node.
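For reference, the job script is roughly along these lines. This is only a sketch: the "#PBS" directive, the $PBS_NODEFILE variable, the node count, and the subnet are placeholders for whatever your queuing setup actually provides, and the script is written under /tmp here just so the snippet stands alone.

```shell
# Rough sketch of the queue-submitted job script described above.
# Scheduler directives and variables are placeholders.
cat > /tmp/distcc-job.sh <<'EOF'
#!/bin/sh
#PBS -l nodes=4
# Start a distccd on every node the scheduler assigned to this job.
# $PBS_NODEFILE (a PBS convention) lists the assigned nodes.
for node in $(sort -u "$PBS_NODEFILE"); do
    bpsh "$node" distccd --daemon --allow 192.168.0.0/16
done
# Point distcc at those nodes and run the build from the master node.
DISTCC_HOSTS=$(sort -u "$PBS_NODEFILE" | tr '\n' ' ')
export DISTCC_HOSTS
make -j8 CC=distcc
EOF
chmod +x /tmp/distcc-job.sh
```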
This works, but there are several disadvantages. First, it's not all
that interactive. Submitting a compile job and having to wait some
indeterminate amount of time for it to execute is sort of perverse...
developers might as well compile on their own machines. Second, and I
think this is the most frustrating problem: since "distcc" is
running on the head node, the head node must have access to all the
source code. That means developers must upload their code to the head
node, or put it on an NFS drive. Doing either of these defeats distcc's
best feature, namely the transport protocol that it provides to allow
you to compile stuff on your *own* desktop machine using local storage.
So, we've been searching for alternatives. Two ideas came up, but both
are rather iffy, so I thought I would ping this group before spending a
lot of time fiddling with them. The first idea was to "daisy-chain"
distcc. The idea is that we would run "distccd" on each of the worker
nodes (outside of the queuing system), and run another "distccd" on the
head node. The daemon on the head node would accept connections from the
outside world, and when it tried to run "gcc", it would really be
running "distcc", which would forward the request to a "distccd" on a
worker node. So, in the outside network, developers would run "distcc",
but set their DISTCC_HOSTS to only one machine -- the head node of the
cluster.
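To make the daisy-chain idea concrete, the "gcc" that the head node's distccd runs would be a small wrapper placed early in the daemon's PATH (distcc's masquerade-directory style), which re-dispatches the compile to the inner ring. This is just a sketch: "n0 n1 n2" are placeholder names for the private compute nodes, and /tmp is used only so the snippet stands alone.

```shell
# Sketch of the daisy-chain wrapper: a fake "gcc" seen only by the
# head node's distccd, which forwards the compile to the compute nodes.
mkdir -p /tmp/distcc-masq
cat > /tmp/distcc-masq/gcc <<'EOF'
#!/bin/sh
# Forward this compile to the inner distccd servers. Call the real
# compiler by its full path so distcc doesn't recurse into this wrapper.
DISTCC_HOSTS="n0 n1 n2" exec distcc /usr/bin/gcc "$@"
EOF
chmod +x /tmp/distcc-masq/gcc
# The head node's daemon would then be started with the wrapper first
# in its PATH, e.g.:
#   PATH=/tmp/distcc-masq:$PATH distccd --daemon --allow <office subnet>
```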
So that's the first idea. The second idea is similar in that the head
node runs "distccd", and that developers have only that machine in their
DISTCC_HOSTS, but now, when the "distccd" runs "gcc", it instead runs a
wrapper script which calls "gcc" via the Scyld ClusterWare "bpsh" wrapper,
which starts the job on the head node, then migrates it to a compute
node. My concern with this approach is the possibility of there being
a lot of overhead in migrating hundreds of small "gcc" tasks to the
compute nodes one at a time.
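Concretely, the wrapper I have in mind is only a few lines. Again a sketch, not a tested implementation: I'm assuming bpsh's usual "bpsh <node> command args..." form, the 0..3 node range is a placeholder, and the naive per-process node pick is just for illustration.

```shell
# Sketch of the bpsh-based wrapper the head node's distccd would run
# as its "gcc". Written under /tmp so the snippet stands alone.
mkdir -p /tmp/bpsh-wrap
cat > /tmp/bpsh-wrap/gcc <<'EOF'
#!/bin/sh
# Pick a compute node (placeholder: spread by PID over nodes 0..3)
# and migrate the compile there with bpsh, which keeps stdin/stdout/
# stderr attached -- which is what distccd needs.
NODE=$(( $$ % 4 ))
exec bpsh "$NODE" /usr/bin/gcc "$@"
EOF
chmod +x /tmp/bpsh-wrap/gcc
```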
So those are the two ideas. Neither of them seems ideal. Since I doubt
anyone can comment on the second idea (it's likely something we just
have to try), my questions to the group are 1) whether there is a third,
better alternative someone has come up with, and 2) whether I should
even attempt the "daisy-chaining" approach (would distcc be able to
handle this, or would it get hopelessly confused?).
Any thoughts would be very much appreciated! Thank you!
-- Marcio
__
distcc mailing list http://distcc.samba.org/
To unsubscribe or change options:
https://lists.samba.org/mailman/listinfo/distcc