Hi,

We just started to use distcc at our company (a dozen developers on the same project), and I'm pleased to see that the distcc project is active!

I read your enhancement proposals carefully, especially points 4 and 5, and I would like to share some ideas with you.

On Friday 18 November 2005 23.39, Daniel Kegel wrote:

> 4. When a distccd server is full up on active jobs, and other nearby
>    servers are not, it's a shame that clients which connect to the
>    wrong server have to wait
...
> 5. If Alice has already compiled everything on client A, and Bob starts a job
>    to compile the same everything on client B, it's a shame that Bob has to wait;
>    perhaps distccd (or a load balancer!) should (carefully) cache results.

I'm very interested in combining ccache and distcc. I think that would bring a huge performance improvement in our situation, where many developers compile roughly the same set of files each day.

But installing ccache on each distccd host means as many separate caches.
Since filling a cache has a price (compiling!), I would prefer to fill as few caches as possible with the same content. I know that some people use a file server to share the cache, but that means extra network traffic, which we may be able to avoid.

As far as I know, ccache uses an MD4 hash of the preprocessor output as a short signature of the file to compile. Why not send only this hash to the hosts instead of the full preprocessor output?
If the host has the result in its cache, we win. Otherwise the client can either try another host or stay with this one, sending it the preprocessor output to compile and store in its cache for the next developer who comes along.
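
To make the exchange concrete, here is a minimal in-process sketch of the hash-first idea. Everything in it is an assumption for illustration: `CachingHost`, `client_build` and `signature` are hypothetical names, not distcc or ccache code, and SHA-256 stands in for MD4 because Python's hashlib does not always provide MD4.

```python
import hashlib
from typing import Callable, Dict, Optional


def signature(preprocessed: bytes) -> str:
    """Short signature of the preprocessor output.

    Assumption: SHA-256 is used as a stand-in for the MD4 hash mentioned
    above, since MD4 may be unavailable in hashlib. A real cache key would
    probably also have to cover the compiler and its options.
    """
    return hashlib.sha256(preprocessed).hexdigest()


class CachingHost:
    """Hypothetical distccd-like host with a result cache keyed by signature."""

    def __init__(self, compile_fn: Callable[[bytes], bytes]) -> None:
        self._compile = compile_fn
        self._cache: Dict[str, bytes] = {}
        self.compile_count = 0

    def lookup(self, sig: str) -> Optional[bytes]:
        """Answer a hash-only request: the cached object, or None on a miss."""
        return self._cache.get(sig)

    def compile_and_store(self, preprocessed: bytes) -> bytes:
        """Compile the full preprocessor output and remember the result."""
        self.compile_count += 1
        obj = self._compile(preprocessed)
        self._cache[signature(preprocessed)] = obj
        return obj


def client_build(host: CachingHost, preprocessed: bytes) -> bytes:
    """Send only the hash first; send the full source only on a cache miss."""
    cached = host.lookup(signature(preprocessed))
    if cached is not None:
        return cached
    return host.compile_and_store(preprocessed)
```

With a fake compiler such as `lambda src: b"OBJ:" + src`, two calls to `client_build` with the same input should trigger only one compilation on the host; the second call is served from the cache.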

One objection to this scenario could be the network overhead of asking every host whether it has the cached output. That brings me to another area where distcc may need improvement: host selection.

For now, as far as I know, host selection doesn't use any server-side status information (such as current load or number of pending connections), and since every client uses the same algorithm to select a host from its hosts list, chances are that distcc clients will all tend to connect to the same servers.
Imagine a farm of 10 servers and 15 developers distributing 5 files to compile. 15 distcc clients will try to connect to the first server while 5 servers will remain idle (if I'm wrong about this scenario, please tell me). I saw a patch submission that, as I understood it, tends to eliminate this problem by randomizing the host selection. That may solve the problem in this case.
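
The effect can be illustrated with a toy simulation. This is only a sketch of the scenario as I described it (every client walking the same list in the same order), not a model of distcc's actual scheduler; `first_host_policy`, `random_policy` and `simulate` are hypothetical names.

```python
import random
from collections import Counter
from typing import Callable, Sequence


def first_host_policy(hosts: Sequence[str], rng: random.Random) -> str:
    """Every client uses the same rule, so every client picks the same host."""
    return hosts[0]


def random_policy(hosts: Sequence[str], rng: random.Random) -> str:
    """Randomized selection, in the spirit of the patch mentioned above."""
    return rng.choice(hosts)


def simulate(policy: Callable[[Sequence[str], random.Random], str],
             n_clients: int, hosts: Sequence[str], seed: int = 0) -> Counter:
    """Count how many clients land on each host under the given policy."""
    rng = random.Random(seed)
    return Counter(policy(hosts, rng) for _ in range(n_clients))
```

For example, `simulate(first_host_policy, 15, hosts)` puts all 15 clients on the first host, while `simulate(random_policy, 15, hosts)` spreads them over several hosts.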

Ultimately we may want a way for the client to select the "best" host based on different criteria:
- does it already have the output in its cache?
- does it have a slot available?
- is it the most powerful?
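
As a rough sketch, the three criteria above could be combined into a simple ranking. The `HostStatus` fields and the priority order (cache hit first, then a free slot, then power) are my assumptions for illustration; nothing here exists in distcc today.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class HostStatus:
    """Hypothetical status a distccd host could report about itself."""
    name: str
    has_cached_output: bool
    free_slots: int
    power_ratio: float


def pick_best_host(replies: Sequence[HostStatus]) -> Optional[HostStatus]:
    """Return the preferred host, or None if no host answered.

    Assumed priority: a cache hit beats everything, then an available
    slot, then raw power. Tuples compare element by element, so max()
    applies the criteria in that order.
    """
    if not replies:
        return None
    return max(
        replies,
        key=lambda h: (h.has_cached_output, h.free_slots > 0, h.power_ratio),
    )
```

Other orderings are possible, for instance preferring a free slot over a cache hit on a saturated host; the point is only that the client decides from the reported status.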

So why not just shout out what we need by broadcasting (or multicasting) the MD4 hash in a single UDP packet? Available hosts would reply by describing their status (availability, cached output available, power ratio, ...) so that the distcc client could choose the best one.
UDP is not reliable, but reliability is not mandatory here since we use it only as a way to improve host selection. And distcc clients would wait for answers only for a very limited time (which is also a way to select the most responsive server).
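
Here is a minimal sketch of what the client side of such a query could look like. The wire format (the `DCQ1`/`DCR1` magic values and the field layout) is entirely made up for illustration, as are the function names; the only thing taken from the idea above is one datagram out and a short wait for whatever replies arrive.

```python
import socket
import struct
import time
from typing import List, Optional, Tuple

# Hypothetical wire format, not part of distcc.
QUERY = struct.Struct("!4s16s")  # magic, 16-byte MD4 digest
REPLY = struct.Struct("!4sBBH")  # magic, cached flag, free slots, power x100
MAGIC_QUERY = b"DCQ1"
MAGIC_REPLY = b"DCR1"

Status = Tuple[bool, int, float]  # (cached, free slots, power ratio)


def pack_query(digest: bytes) -> bytes:
    """Build the single query datagram carrying the hash."""
    if len(digest) != 16:
        raise ValueError("expected a 16-byte digest")
    return QUERY.pack(MAGIC_QUERY, digest)


def unpack_reply(data: bytes) -> Optional[Status]:
    """Decode a host reply; malformed packets are ignored (None)."""
    if len(data) != REPLY.size:
        return None
    magic, cached, slots, power = REPLY.unpack(data)
    if magic != MAGIC_REPLY:
        return None
    return bool(cached), slots, power / 100.0


def query_hosts(digest: bytes, address: Tuple[str, int],
                timeout: float = 0.05) -> List[Tuple[str, Status]]:
    """Send one datagram and collect replies until the deadline passes.

    Lost packets simply mean fewer candidates; slow hosts miss the
    deadline and are left out.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    replies: List[Tuple[str, Status]] = []
    deadline = time.monotonic() + timeout
    try:
        sock.sendto(pack_query(digest), address)
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            sock.settimeout(remaining)
            try:
                data, sender = sock.recvfrom(64)
            except socket.timeout:
                break
            status = unpack_reply(data)
            if status is not None:
                replies.append((sender[0], status))
    finally:
        sock.close()
    return replies
```

A client would call `query_hosts` with the broadcast (or multicast) address of the build network and a UDP port agreed on for this purpose, then pick a host from the returned list; if the list is empty it can fall back to the normal hosts list.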

The work needed for all that would be merging some ccache code into distcc so that distcc and distccd can exchange just the hash instead of the whole preprocessed file.
The host selection protocol can be kept totally separate and should have minimal impact on the existing distcc source code.

Additionally, if distccd hosts can reply to broadcasts, one may want to take this opportunity to implement automatic detection of available distccd servers. But I personally think zeroconf would be better suited for this, and I saw a patch submission for it.

Thank you for reading!
Laurent
__ 
distcc mailing list            http://distcc.samba.org/
To unsubscribe or change options: 
https://lists.samba.org/mailman/listinfo/distcc
