Martin Pool wrote:

On 11/04/2006, at 7:27 PM, Zdenek Behan wrote:

Hi,

I encountered a very strange problem with distcc. Let me explain:

I have 2 machines (both Gentoo). One is i686 (fast) and the other is ppc (slow). I have a working (tested) ppc cross-compiler on the i686 and a native compiler on the ppc [same versions - 3.4.6].

I emerged exactly the same version of distcc on both (tried multiple versions), and ran them with:

(slow machine)
PATH="/usr/powerpc-unknown-linux-gnu/bin:/usr/powerpc-unknown-linux-gnu/gcc-bin/3.4.6/" /usr/bin/distccd -p 55555 -N 10 --allow 192.168.1.0/24 --listen=192.168.1.15 --no-detach --user distcc --log-stderr

(fast machine)
PATH="/usr/powerpc-unknown-linux-gnu/bin:/usr/powerpc-unknown-linux-gnu/gcc-bin/3.4.6/" /usr/bin/distccd -p 55555 -N 10 --allow 192.168.1.0/24 --listen=192.168.1.201 --no-detach --user distcc --log-stderr


I put both hosts (201, 15) into /etc/distcc/hosts.

The daemons work fine until I try to actually compile something.

--
#include <stdio.h>

int main( int argc, char ** argv )
{
        printf("Hello world!\n");

        return 1;
}
--

I created a simple hello.c to demonstrate. The command used is:
distcc powerpc-unknown-linux-gnu-gcc -c -o hello.o hello.c

Now I have 4 variants of using distcc: Fast to Fast (localhost), Fast to Slow, Slow to Fast and Slow to Slow.

When doing any of the pointless variants (F->S, F->F, S->S), distccd creates /tmp/distccd_key.i on the local machine containing the preprocessed source, and within a fraction of a second it's done. The verbose distccd output says something like:


When you say "/tmp/distccd_key.i", I presume the "key" is actually some random hex characters?

Naturally :)


distccd[12392] (dcc_check_client) connection from 192.168.1.15:1895
distccd[12392] compile from hello.c to hello.o
distccd[12392] (dcc_r_file_timed) 16695 bytes received in 0.001372s, rate 11883kB/s
distccd[12392] (dcc_collect_child) cc times: user 0.170000s, system 0.040000s, 501 minflt, 1030 majflt
distccd[12392] powerpc-unknown-linux-gnu-gcc hello.c on localhost completed ok
distccd[12392] job complete

In the last variant (Slow -> Fast), it creates /tmp/distccd_key.i as well, however, what it contains can hardly be compared to preprocessed source. It's basically a binary file containing a random dump of some disk data. I have a copy of such a file in case anyone wants to see it, but there's not much to see, really.

Naturally this fails with a megabytes-long error log that goes as follows:
/tmp/distccd_237f6a6d.i:122: error: stray '\242' in program
/tmp/distccd_237f6a6d.i:122: error: stray '\160' in program
/tmp/distccd_237f6a6d.i:122: error: stray '\195' in program
/tmp/distccd_237f6a6d.i:122: error: stray '\242' in program
...

Output looks like this:
distccd[11797] (dcc_check_client) connection from 192.168.1.15:1894
distccd[11797] compile from hello.c to hello.o
distccd[11797] (dcc_r_file_timed) 16695 bytes received in 0.002057s, rate 7926kB/s
distccd[11797] (dcc_collect_child) cc times: user 0.425935s, system 0.989849s, 889 minflt, 0 majflt
distccd[11797] powerpc-unknown-linux-gnu-gcc hello.c on localhost failed
distccd[11797] job complete

Notice that the file size is actually the same. It's the content that is scrambled, for a reason completely unknown to me. Neither side crashes; they only report the huge error log and then go on.

Just for the record, distcc is built on both systems natively with the native compiler (same version - 3.4.6). The glibc versions are not the same, but I can hardly imagine that being a problem.

Can anyone help me, or at least point me to where I should be looking for the problem? This seems to be purely a distcc issue, as it never gets to actually compiling anything; besides, I believe my cross-compiler setup is correct.

My first guess was an endianness swap (ppc is big endian), but since there is some totally out-of-place text mixed up with garbage binary data in the temporary file, I think that's not the explanation. So now I'm left completely clueless, and any help will be appreciated.


I suspect you have a kernel bug on the ppc machine which is making it transmit the wrong data across the network. To check it, please run on the ppc host

  tcpdump -w distcc.pcap 'tcp port 2622'

and compile a file. Then stop tcpdump and post the capture file to me, or have a look at it in ethereal if you like. I suspect we will see garbage in the DOTI field because sendfile isn't working properly. What kernel are you running there? Do you have a known good one you could try?
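For anyone who wants to check the DOTI field without reading the capture by eye, here is a minimal Python sketch. It assumes the client-to-server TCP payload has already been reassembled and saved as raw bytes (for example from a "follow stream" export), and it assumes the distcc protocol version 1 layout as I understand it: each token is four ASCII letters followed by eight hex digits, with DIST and ARGC carrying a plain value and ARGV and DOTI carrying a length followed by that many bytes. The function names are made up for illustration and are not part of distcc.

```python
def read_token(buf, pos):
    """Read a 4-letter token name and its 8-hex-digit parameter at pos."""
    name = buf[pos:pos + 4].decode("ascii", "replace")
    value = int(buf[pos + 4:pos + 12], 16)
    return name, value, pos + 12


def expect(buf, pos, wanted):
    """Read one token and fail loudly if it is not the one we expected."""
    name, value, pos = read_token(buf, pos)
    if name != wanted:
        raise ValueError("expected %s, got %r" % (wanted, name))
    return value, pos


def parse_request(buf):
    """Split a raw request stream into (version, argv, doti_payload).

    Assumed layout: DIST <version>, ARGC <n>, n x (ARGV <len> <bytes>),
    DOTI <len> <bytes>.
    """
    version, pos = expect(buf, 0, "DIST")
    argc, pos = expect(buf, pos, "ARGC")
    argv = []
    for _ in range(argc):
        length, pos = expect(buf, pos, "ARGV")
        argv.append(buf[pos:pos + length].decode("ascii", "replace"))
        pos += length
    length, pos = expect(buf, pos, "DOTI")
    return version, argv, buf[pos:pos + length]


def printable_fraction(data):
    """Fraction of bytes that are printable ASCII, tab, LF or CR."""
    if not data:
        return 1.0
    ok = sum(1 for b in data if 32 <= b < 127 or b in (9, 10, 13))
    return ok / len(data)
```

Preprocessed C should come out almost entirely printable, so a low printable_fraction() on the DOTI payload would point at the data being damaged before or while it was sent, rather than on the receiving side.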

You were absolutely right there. I checked with ethereal, and it's obviously being transmitted badly. I'm not sure whether it's a kernel bug or not, but I suspect so, since I replaced everything else already, to no avail. I even discovered one more application misbehaving in a similar way: Python's bzip2 library. Every file compressed with it ends up with random contents of the disk instead of the data. I also rebuilt the whole system, and that still didn't fix it, so it's not just some dynamic linking issue. That suggests it may be a problem with certain types of file descriptors in the kernel, but who knows what sort of obscure problem it really is.

The machine is a network disk storage box to which I'm trying to port some reasonable Linux distribution for development directly on the target platform. It has ppcboot, the supplied kernel (2.4.20) and a busybox-based initrd in flash already. Unfortunately it's a very slow machine, which is why I was installing distcc there in the first place, to build the system and a new kernel. Generally, cross-compiling those on a faster machine is a pain: some makefiles simply do "if crosscompiling; then die; fi" or, in the worse cases, silently use the native host compiler, unlike distcc, which can easily be limited to the right toolchain by the PATH variable and generally works without problems. So, distcc failing was really a painful blow. :)

However, I discovered a workaround for the problem. After digging through the distcc source, I found some comments about ssh and local connections being handled differently. I did not examine the problem much further; instead I just tunnelled one local port to the target distccd on the fast machine and set the remote host to localhost, and it just worked (TM). I already rebuilt the system, and the problem did not go away, but it does not bother me that much anymore.

I'd still be interested in what could be the reason for this bug, however, because I always thought remote network connections are generally treated equally, and in this case connecting to the machine's own IP address (not 127.0.0.1, the outer interface) works, but connecting elsewhere fails. Also, all other system services (sshd) work perfectly and didn't crash a single time, so it's not exactly a network problem. I also have issues replacing the (possibly buggy) kernel in flash, and I'm stuck with using the old one with my new root image for now, so knowing what could potentially go wrong in the source would be very valuable info. Hopefully even for someone else who might bump into the same problem later. :)
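To illustrate the shape of that workaround for anyone hitting the same thing: the message does not say which tool did the tunnelling, so the following is only a sketch of the idea, not the setup actually used. It is a small Python relay that listens on a local port and copies bytes to a remote distccd using plain recv/sendall. All function names here are hypothetical.

```python
import socket
import threading


def pump(src, dst):
    """Copy bytes from src to dst with plain recv/sendall until EOF."""
    try:
        while True:
            chunk = src.recv(65536)
            if not chunk:
                break
            dst.sendall(chunk)
    except OSError:
        pass  # peer went away; just stop copying
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)  # pass the EOF along
        except OSError:
            pass


def make_listener(local_port):
    """Listen on 127.0.0.1 only, so distcc talks to the relay over loopback."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("127.0.0.1", local_port))
    listener.listen(5)
    return listener


def relay_once(listener, remote_addr):
    """Accept one client and relay it to remote_addr in both directions."""
    client, _ = listener.accept()
    upstream = socket.create_connection(remote_addr)
    back = threading.Thread(target=pump, args=(upstream, client), daemon=True)
    back.start()
    pump(client, upstream)
    back.join()
    client.close()
    upstream.close()


def serve(local_port, remote_addr):
    """Relay connections one at a time, forever."""
    listener = make_listener(local_port)
    while True:
        relay_once(listener, remote_addr)
```

With the addresses from this thread, that would be something like serve(55555, ("192.168.1.201", 55555)) on the slow machine, with the hosts file there pointing at localhost on that port. The distcc client then only ever connects over loopback, and the relay is what puts the data on the wire.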


Zdenek
__ distcc mailing list http://distcc.samba.org/ To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/distcc
