Hi,
I have a problem that has been driving me mad for some time.
We just set up a web farm to support our application, which runs entirely under mod_perl.
Until now we used a traditional Apache with vhosts to serve our customers, but as that became unmanageable, we started this new system to serve them better.
The basic structure is a reverse proxy as a frontend that forwards requests to a bunch of different machines, each one running a bunch of Apache instances on different ports, one per customer.
All these little Apaches run as non-root users on ports > 50000, to better isolate one customer from another.

Well, once everything was set up, all seemed to go well. But on one page that uses the Mail::Sendmail module to send an e-mail, the server crashed with a segmentation fault.
After tracing as far as we could into the Perl modules, we found that the server crashed
when Mail::Sendmail tried to open the network socket.
Then we did a little test and wrote a program that just opens a socket; as soon as the page is called, the server segfaults...
The same test works perfectly outside mod_perl...

The server is a fully updated Red Hat 9 with:
custom "WOLK" 2.4 kernel  ( 2.4.20-wolk4.9s )
perl-5.8 (first tried with the stock Red Hat package; then I recompiled my own RPM without threads, with the same results)
apache-1.3.29 (EAPI + no EXPAT options; tested with those options enabled and disabled, with no success)
mod_ssl
mod_perl 1.29 (I also tested 1.27 and 1.28, with the same results)

Then I started debugging Apache, to see what was happening...

gdb httpd
(gdb) run -X
<Click on the fatal page with the Mail::Sendmail>

Program received signal SIGSEGV, Segmentation fault.
0x1555ef8e in do_lookup_versioned () from /lib/ld-linux.so.2
(gdb) where
#0  0x1555ef8e in do_lookup_versioned () from /lib/ld-linux.so.2
#1  0x1555e156 in _dl_lookup_versioned_symbol_internal () from /lib/ld-linux.so.2
#2  0x15561e03 in fixup () from /lib/ld-linux.so.2
#3  0x15561cc0 in _dl_runtime_resolve () from /lib/ld-linux.so.2
#4  0x156e60a8 in getprotobyname_r@@GLIBC_2.1.2 () from /lib/libc.so.6
#5  0x156e5f5f in getprotobyname () from /lib/libc.so.6
#6  0x15d234eb in Perl_pp_gprotoent () at pp_sys.c:4856
#7  0x15d23299 in Perl_pp_gpbyname () at pp_sys.c:4823
#8  0x15cd08d2 in Perl_runops_debug () at dump.c:1414
#9  0x15c8c54e in S_call_body (myop=0x3fffdc40, is_eval=0) at perl.c:2069
#10 0x15c8c1fd in Perl_call_sv (sv=0x15d72d54, flags=4) at perl.c:1987
#11 0x157c82ad in perl_call_handler (sv=0x914d048, r=0x985fffc, args=0x0) at mod_perl.c:1661
#12 0x157c7a90 in perl_run_stacked_handlers (hook=0x914d048 "4?\024\t\001", r=0x985fffc, handlers=0x9121560)
    at mod_perl.c:1374
#13 0x157c60da in perl_handler (r=0x9121560) at mod_perl.c:914
#14 0x08054d0c in ap_invoke_handler ()
#15 0x0806b2aa in process_request_internal ()
#16 0x0806b307 in ap_process_request ()
#17 0x08061b5d in child_main ()
#18 0x08061d30 in make_child ()
#19 0x08061eaf in startup_children ()
#20 0x080625a8 in standalone_main ()
#21 0x08062e61 in main ()
#22 0x15602917 in __libc_start_main () from /lib/libc.so.6

(gdb) quit

As the program fails in the glibc getprotobyname function, I suppose the problem is Red Hat's infamous buggy glibc, or perhaps an incompatibility with my current WOLK kernel...

The strace output suggests that the problem is related to the /etc/protocols file (used by getprotobyname):

# strace -X
...
...
brk(0)                                  = 0x9bdc000
brk(0x9bdd000)                          = 0x9bdd000
brk(0)                                  = 0x9bdd000
brk(0x9bde000)                          = 0x9bde000
brk(0)                                  = 0x9bde000
brk(0x9bdf000)                          = 0x9bdf000
time(NULL)                              = 1084532294
open("/etc/protocols", O_RDONLY)        = 9
fcntl64(9, F_GETFD)                     = 0
fcntl64(9, F_SETFD, FD_CLOEXEC)         = 0
fstat64(9, {st_mode=S_IFREG|0744, st_size=2168, ...}) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x1556d000
read(9, "ip\t0\tIP\nicmp\t1\tICMP\t\t\nigmp\t2\tIGM"..., 4096) = 2168
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
+++ killed by SIGSEGV +++

At first I suspected a permissions problem, as all our Apaches run as normal users, but changing the permissions of that file didn't help. Judging by the strace, the file seems to be opened OK in all cases.
To rule out silly parsing problems I removed all comments and blank lines from that file, with no difference.
I also downgraded glibc from the upgraded glibc-2.3.2-27 to the Red Hat default (glibc-2.3.2-11.9). No success...
I'm stuck on this... I still suspect a permissions problem, but due to the nature of the system (httpd.conf heavily customized in real time based on the username) I can't test it easily. I also suspect a DNS problem (Red Hat caused me some strange problems in previous versions) or a multiple-interface problem (every machine has a public IP and a private IP). It could also be the custom kernel...

While I do some more tests, can someone help me with this?
I'm really desperate...
