Hi everyone,

I'm trying to find a good way to run "distcc" on a cluster that is running Scyld ClusterWare from Penguin Computing. The architecture consists of several compute nodes which are hidden from the external network behind a single master node, which is responsible for managing a work queue and dispatching jobs to the appropriate compute nodes. The master and the compute nodes are on a private network and can see each other, but the only external access is to the master node. The "proper" way to use the system is to submit jobs via the queuing system.

I managed to come up with a job script that does just that: it submits a job which reserves several nodes, and when it gets scheduled, it runs "distccd" on the assigned nodes and then does a "distcc" compile on the master node. This works, but there are several disadvantages. First, it's not all that interactive. Submitting a compile job and having to wait some indeterminate amount of time for it to execute is sort of perverse; developers might as well compile on their own machines. Second, and I think this is the most frustrating problem: since "distcc" is running on the head node, the head node must have access to all the source code. That means developers must upload their code to the head node, or put it on an NFS share. Doing either of these defeats distcc's best feature, namely the transport protocol it provides to let you compile on your *own* desktop machine using local storage.
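For what it's worth, the job script is roughly the following. This is a simplified sketch: I'm assuming a PBS-style scheduler here, so $PBS_NODEFILE, the network range, and the -j count are placeholders for whatever your queuing system and setup actually provide.

    #!/bin/bash
    # Sketch of the queued compile job.  Assumes the scheduler lists
    # the assigned nodes in $PBS_NODEFILE (PBS/TORQUE convention) and
    # that the entries are usable as Scyld node numbers for bpsh --
    # both assumptions, adjust for your site.

    # Start a distccd on each assigned node.
    for n in $(sort -u "$PBS_NODEFILE"); do
        bpsh "$n" distccd --daemon --allow 192.168.0.0/24 &
    done

    # Point distcc at those nodes and compile on the master node.
    export DISTCC_HOSTS="$(sort -u "$PBS_NODEFILE" | tr '\n' ' ')"
    make -j8 CC="distcc gcc"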

So, we've been searching for alternatives. Two ideas came up, but both are rather iffy, so I thought I would ping this group before spending a lot of time fiddling with them. The first idea is to "daisy-chain" distcc: we would run "distccd" on each of the worker nodes (outside of the queuing system), and run another "distccd" on the head node. The daemon on the head node would accept connections from the outside world, and when it tried to run "gcc", it would really be running "distcc", which would forward the request to a "distccd" on a worker node. So, on the outside network, developers would run "distcc", but set their DISTCC_HOSTS to only one machine: the head node of the cluster.
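If it helps to make the wiring concrete, here is a minimal sketch of how I imagine setting that up, using distcc's masquerade trick (a directory of compiler-named symlinks to distcc, placed at the front of the outer distccd's PATH). The directory path, host names, and network ranges are all assumptions.

    # On each worker node: an ordinary distccd, reachable only from
    # the head node's private address (assumed to be 192.168.0.1).
    distccd --daemon --allow 192.168.0.1

    # On the head node: a masquerade directory, so that when the
    # outer distccd runs "gcc", it actually runs distcc.
    mkdir -p /usr/local/lib/distcc-chain
    ln -s "$(command -v distcc)" /usr/local/lib/distcc-chain/gcc
    ln -s "$(command -v distcc)" /usr/local/lib/distcc-chain/cc

    # The inner distcc inherits DISTCC_HOSTS from the outer distccd's
    # environment.  Worker names and the outside network are placeholders.
    export DISTCC_HOSTS="n0 n1 n2 n3"
    PATH=/usr/local/lib/distcc-chain:$PATH \
        distccd --daemon --allow 10.0.0.0/8

Developers outside would then set DISTCC_HOSTS to just the head node, as described above.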

So that's the first idea. The second idea is similar in that the head node runs "distccd", and developers have only that machine in their DISTCC_HOSTS; but now, when "distccd" runs "gcc", it instead runs a wrapper script which calls "gcc" through Scyld ClusterWare's "bpsh" wrapper, which starts the job on the head node and then migrates it to a compute node. My concern with this approach is the possibility of there being a lot of overhead in migrating hundreds of small "gcc" tasks to the compute nodes one at a time.
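Concretely, the wrapper I have in mind would be something like this naive sketch. The node count and the random node choice are placeholders; a real version would need smarter load balancing.

    #!/bin/bash
    # Hypothetical "gcc" wrapper placed in distccd's PATH on the head
    # node.  Picks a compute node and migrates the real gcc there via
    # bpsh.  Assumes 8 compute nodes numbered 0-7.
    node=$(( RANDOM % 8 ))
    exec bpsh "$node" /usr/bin/gcc "$@"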

So those are the two ideas. Neither of them seems ideal. Since I doubt anyone can comment on the second idea (it's likely something we just have to try), my questions to the group are: 1) is there a third, better alternative someone has come up with, and 2) should I even attempt the "daisy-chaining" approach (would distcc be able to handle this, or would it get hopelessly confused)?

Any thoughts would be very much appreciated! Thank you!

-- Marcio

