Re: [osol-code] Odd socket read/write problem with Apache 2.0.54 on Solaris 10

Alexander Kolbasov Fri, 26 Aug 2005 16:12:23 -0700

> Hello,
> 
> Apache httpd 2.x is running into some odd CGI problems on Solaris 10.  We've 
> had a number of people hit this in the wild - including myself using Apache 
> 2.0.54 (our latest stable release).  We're stumped and we're trying to see if 
> we can find some help from people who know Solaris fairly well.  =)
> 
> We've narrowed it down to some weirdness with reading and writing sockets 
> between processes.


Do you use AF_INET or AF_UNIX sockets?

I'd guess that these are AF_UNIX sockets. It looks similar to bug

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6227895 or
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6249138

If you are using AF_INET sockets, it is something else. 

The two bugs above are fixed in SolarisExpress/OpenSolaris bits.
There is an S10 port, which will show up with the update.

- Alexander Kolbasov


> 
> First off, here's our httpd bug report:
> 
> <http://issues.apache.org/bugzilla/show_bug.cgi?id=34264>
> 
> We have confirmed this with Solaris 10 GA with all current Sol10 GA patches 
> applied.  We have heard reports that 'go[ing] back to the next to last 
> Solaris Express driver prior to GA' works.  So, this bug must have been 
> introduced just before 10 went GA.  There were no issues with Solaris 9 or 
> earlier.
> 
> Here's the overview: When httpd 2.x is threaded (with the 'worker' MPM), CGIs 
> are handled by a dedicated process that sits in an accept() loop.  When an 
> incoming request gets assigned to a thread and needs to exec() a CGI, the 
> thread creates a socket to this standalone cgid process.  The thread then 
> writes a bunch of information over this socket - such as the program name, 
> arguments, etc.  The cgid process then reads this data and executes the 
> script accordingly and shuffles back the program output over that socket.
> 
> Here's the issue we have: the environment variables to use in the CGI are 
> passed in a <4-byte len><value> format on the socket from the thread.  At 
> times, the environment variable length is 'skipped' or corrupted.  This 
> causes httpd to think that there's a *lot* of data to be read - it then 
> calloc's roughly 1GB of memory (ouch!).
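
A minimal sketch of the <4-byte len><value> framing described above, with a
sanity check on the length before anything is allocated. This is not the
mod_cgid code: the function names, the native-byte-order length, and the
MAX_ENV_LEN cap are all assumptions made for illustration.

```python
import socket
import struct

MAX_ENV_LEN = 64 * 1024  # hypothetical cap; not a value taken from mod_cgid


def read_exact(sock, n):
    """Read exactly n bytes, looping because a stream socket may return short reads."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise EOFError("peer closed with %d bytes still expected" % remaining)
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def send_value(sock, value):
    """Write one <4-byte len><value> record as two writes (length, then value)."""
    sock.sendall(struct.pack("=I", len(value)))  # assumed: native byte order
    sock.sendall(value)


def recv_value(sock):
    """Read one record, rejecting an implausible length before allocating for it."""
    (length,) = struct.unpack("=I", read_exact(sock, 4))
    if length > MAX_ENV_LEN:
        raise ValueError("implausible length %d; stream is probably out of sync" % length)
    return read_exact(sock, length)


if __name__ == "__main__":
    writer, reader = socket.socketpair()
    send_value(writer, b"TZ=US/Pacific")
    print(recv_value(reader))          # the record round-trips intact
    writer.sendall(b"TZ=US/Pacific")   # simulate a record whose length prefix went missing
    try:
        recv_value(reader)
    except ValueError as exc:
        print("rejected:", exc)
```

A cap like this would not fix whatever is eating the length on the socket, but
it would turn a huge allocation into an error the server can log and recover
from.
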
> 
> Through dtrace, we know that we're writing it successfully to the socket - 
> but it will occasionally come out on the other side corrupted.  N.B. if you 
> truss the program, it'll work just fine.
> 
> Here's a dtrace file you might find helpful:
> 
> <http://people.apache.org/~jerenkrantz/httpd.d>
> 
> (I'm new to dtrace, so there might be easier ways to write this script.)
> 
> Build httpd with the worker MPM (--with-mpm=worker) and configure it such 
> that we only have 1 worker thread:
> 
> <IfModule worker.c>
> StartServers         1
> MaxClients           1
> MinSpareThreads      1
> MaxSpareThreads      1
> ThreadsPerChild      1
> MaxRequestsPerChild  0
> </IfModule>
> 
> Run it with:
> 
> ./httpd.d <pid of worker process> <pid of cgid process>
> 
> The worker process is the httpd with multiple threads; look for a thread that 
> has these stack characteristics (this is the idle 'worker' waiting for a 
> connection):
> -----------------  lwp# 3 / thread# 3  --------------------
>  cba7d3a9 lwp_park (0, 0, 0)
>  cba77c2a cond_wait_queue (81e5768, 81e5734, 0, 0) + 3b
>  cba78123 _cond_wait (81e5768, 81e5734) + 66
>  cba78165 cond_wait (81e5768, 81e5734) + 21
>  cba7819e pthread_cond_wait (81e5768, 81e5734) + 1b
>  cbcaf9a6 apr_thread_cond_wait (81e5760, 81e5730) + 36
>  080ada99 ap_queue_pop (81e5718, ca49dfa4, ca49df98) + 69
>  080aac3f worker_thread (81e5838, 87cfdd8) + 10f
>  cbcaac0a dummy_worker (81e5838) + 3a
>  cba7d03f _thr_setup (ca9b0000) + 4e
>  cba7d330 _lwp_start (ca9b0000, 0, 0, ca49dff8, cba7d330, ca9b0000)
> 
> The cgid program will have a pstack output similar to:
> 
> 15089:  httpd -k start
>  cba7d905 accept   (25, 8046b7a, 8046b64, 1)
>  0809c1e4 cgid_server (811f820) + 314
>  0809c852 cgid_start (811b0a0, 811f820, 8119e18) + a2
>  0809b405 cgid_maint (0, 8119e18, f) + 95
>  cbcad0cd apr_proc_other_child_alert (8046c7c, 0, f) + 7d
>  cbcad300 apr_proc_other_child_read (8046c7c, f) + 30
>  080ac1e9 server_main_loop (0) + 199
>  080ac57b ap_mpm_run (811b0a0, 8153180, 811f820) + 2eb
>  080b5a34 main     (3, 8046d44, 8046d54) + 9c4
>  0807ba3a ???????? (3, 8046e84, 8046ea4, 8046ea7, 0, 8046ead)
> 
> Pretty much any CGI script will demonstrate the problem.  There are a few in 
> the httpd PR link above.
> 
> So, you could run dtrace with:
> 
> ./httpd.d 14581 15089
> 
> Here's a 'bad' output from a Solaris 10/Intel (SMP) box:
> 
> -----
>   0  40471                    get_req:entry (15274) Entering get_req!
>   0     10                       read:entry read (15274) fd: 38 - 64 bytes
>   0     11                      read:return read (15274) fd: 38 - 64 bytes
>   0     11                      read:return
>   0     10                       read:entry read (15274) fd: 38 - 42 bytes
>   0     11                      read:return read (15274) fd: 38 - 42 bytes
>   0     11                      read:return /home/jerenk/public_html/weblog/weblog.cgi
>   0     10                       read:entry read (15274) fd: 38 - 10 bytes
>   0     11                      read:return read (15274) fd: 38 - 10 bytes
>   0     11                      read:return weblog.cgii/software/
>   0     10                       read:entry read (15274) fd: 38 - 20 bytes
>   0     11                      read:return read (15274) fd: 38 - 20 bytes
>   0     11                      read:return /weblog.cgi/software
>   0     10                       read:entry read (15274) fd: 38 - 4 bytes
>   0     11                      read:return read (15274) fd: 38 - 4 bytes
>   0     11                      read:return 1430084180
>   0     11                      read:return TZ=U
> ...UH-OH.  This isn't correct....15274 will now go allocate that much memory...
>   1     12                      write:entry write (14581) fd: 38 - 64 bytes
>   1     12                      write:entry
>   1     13                     write:return wrote (14581): 64
>   1     12                      write:entry write (14581) fd: 38 - 42 bytes
>   1     12                      write:entry /home/jerenk/public_html/weblog/weblog.cgi
>   1     13                     write:return wrote (14581): 42
>   1     12                      write:entry write (14581) fd: 38 - 10 bytes
>   1     12                      write:entry weblog.cgi
>   1     13                     write:return wrote (14581): 10
>   1     12                      write:entry write (14581) fd: 38 - 20 bytes
>   1     12                      write:entry /weblog.cgi/software
>   1     13                     write:return wrote (14581): 20
>   1     12                      write:entry write (14581) fd: 38 - 4 bytes
>   1     12                      write:entry
>   1     13                     write:return wrote (14581): 4
>   1     12                      write:entry write (14581) fd: 38 - 13 bytes
>   1     12                      write:entry TZ=US/Pacific
>   1     13                     write:return wrote (14581): 13
>   1     12                      write:entry write (14581) fd: 38 - 4 bytes
>   1     12                      write:entry $
>   1     13                     write:return wrote (14581): 4
>   1     12                      write:entry write (14581) fd: 38 - 36 bytes
>   1     12                      write:entry HTTP_HOST=weblog.erenkrantz.com:8080
>   1     13                     write:return wrote (14581): 36
>   1     12                      write:entry write (14581) fd: 38 - 4 bytes
>   1     12                      write:entry l
>   1     13                     write:return wrote (14581): 4
>   1     12                      write:entry write (14581) fd: 38 - 108 bytes
>   1     12                      write:entry HTTP_USER_AGENT=Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.8b4) Gecko/20050819 Firefox/1.0+
>   1     13                     write:return wrote (14581): 108
> ...snip...pid 15274 is done allocating 1430084180 bytes....
>   1     10                       read:entry read (15274) fd: 38 - 1430084180 bytes
>   1     11                      read:return read (15274): Short read: 1430084180, 165
>   1     11                      read:return S/Pacific
>   1     10                       read:entry read (15274) fd: 38 - 1430084015 bytes
>   1     11                      read:return read (15274): Short read: 1430084015, 1849
>   1     11                      read:return Data too large
>   1     10                       read:entry read (15274) fd: 38 - 1430082166 bytes
>   1     11                      read:return read (15274): Short read: 1430082166, 0
>   1     11                      read:return
>   1  40472                   get_req:return (15274) Leaving get_req!
> -----
> 
> Since it's an SMP box, I'm guessing that partially explains why the reads and 
> writes appear out of order.  (This is the first connection, so it's not from 
> somewhere else.)
> 
> For 4 byte reads, I have dtrace doing the length conversion on the read data.
> 
> Notice the write pattern: 64 bytes, 42 bytes, 10, 20, 4, 13, 4, 36....
> Notice the read pattern:  64 bytes, 42 bytes, 10, 20, 4**, 1430084180...
> 
> That 4-byte read is corrupted.  This causes the reader to allocate 1430084180 
> bytes.
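
One way to read the numbers in the trace, offered as an observation rather than
a diagnosis: 1430084180 is exactly what you get when the first four bytes of
"TZ=US/Pacific" are interpreted as a little-endian 32-bit integer, as they
would be on an Intel box. That fits the 4-byte length (13) being skipped
instead of garbled, and it fits the next read starting with "S/Pacific". The
variable names below are made up for illustration.

```python
import struct

value = b"TZ=US/Pacific"          # the 13-byte value written after its length

first_four = value[:4]            # b"TZ=U", as printed in the trace
(bogus_len,) = struct.unpack("<I", first_four)  # little-endian, as on x86
print(bogus_len)                  # 1430084180

# With the first four bytes consumed as a "length", the rest of the value is
# the next thing the reader sees on the socket.
leftover = value[4:]
print(leftover)                   # b'S/Pacific'
```
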
> 
> So, does this ring a bell for anyone?
> 
> Our code worked fine on Solaris 9 - which I was using until yesterday on this 
> particular machine.  And, as mentioned before, it also worked fine on pre-GA 
> releases of Sol10.  (It also works identically on FreeBSD, Linux, etc, etc.)
> 
> The cgid source file is here:
> 
> <http://svn.apache.org/repos/asf/httpd/httpd/tags/2.0.54/modules/generators/mod_cgid.c>
> 
> The thread from [email protected] today is here:
> 
> <http://mail-archives.apache.org/mod_mbox/httpd-dev/200508.mbox/[EMAIL PROTECTED]>
> 
> Thanks in advance for any help.  -- justin
> _______________________________________________
> opensolaris-code mailing list
> [email protected]
> https://opensolaris.org:444/mailman/listinfo/opensolaris-code
> 


