re: [osol-code] Odd socket read/write problem with Apache 2.0.54 on Solaris 10

Peter Memishian Fri, 26 Aug 2005 16:12:21 -0700

is this an AF_UNIX socket?  if so, the implementation was massively
overhauled for solaris 10, right near GA.


sasha: does this problem sound familiar?

 > Apache httpd 2.x is running into some odd CGI problems on Solaris 10.  We've
 > had a number of people hit this in the wild - including myself using Apache
 > 2.0.54 (our latest stable release).  We're stumped and we're trying to see
 > if we can find some help from people who know Solaris fairly well.  =)
 > 
 > We've narrowed it down to some weirdness with reading and writing sockets 
 > between processes.
 > 
 > First off, here's our httpd bug report:
 > 
 > <http://issues.apache.org/bugzilla/show_bug.cgi?id=34264>
 > 
 > We have confirmed this with Solaris 10 GA with all current Sol10 GA patches
 > applied.  We have heard reports that 'go[ing] back to the next to last
 > Solaris Express driver prior to GA' works.  So, this bug must have been
 > introduced just before 10 went GA.  There were no issues with Solaris 9 or
 > earlier.
 > 
 > Here's the overview: When httpd 2.x is threaded (with the 'worker' MPM),
 > CGIs are handled by a dedicated process that sits in an accept() loop.
 > When an incoming request gets assigned to a thread and needs to exec() a
 > CGI, the thread creates a socket to this standalone cgid process.  The
 > thread then writes a bunch of information over this socket - such as the
 > program name, arguments, etc.  The cgid process then reads this data,
 > executes the script accordingly, and shuffles the program output back over
 > that socket.
 > 
 > Here's the issue we have: the environment variables to use in the CGI are
 > passed in a <4-byte len><value> format on the socket from the thread.  At
 > times, the environment variable length is 'skipped' or corrupted.  This
 > causes httpd to think that there's a *lot* of data to be read - it then
 > calloc's roughly 1GB of memory (ouch!).
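
Whatever the root cause turns out to be, a reader of a <4-byte len><value> stream can refuse an implausible length instead of allocating it. The sketch below is not mod_cgid's actual code: `read_full`, `read_field`, and the 1 MiB `MAX_FIELD_LEN` cap are hypothetical names and values, and it assumes the length travels in host byte order because both ends run on the same machine.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical sanity cap; not a value taken from mod_cgid. */
#define MAX_FIELD_LEN (1u << 20)

/* Read exactly n bytes, retrying on short reads and EINTR.
 * Returns 0 on success, -1 on error or premature EOF. */
static int read_full(int fd, void *buf, size_t n)
{
    unsigned char *p = buf;

    while (n > 0) {
        ssize_t got = read(fd, p, n);
        if (got < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (got == 0)
            return -1;      /* peer closed before the field was complete */
        p += got;
        n -= (size_t)got;
    }
    return 0;
}

/* Read one <4-byte len><value> field.  Returns a malloc'd, NUL-terminated
 * copy of the value (caller frees), or NULL if the read fails or the
 * length exceeds MAX_FIELD_LEN. */
static char *read_field(int fd, uint32_t *len_out)
{
    uint32_t len;
    char *val;

    if (read_full(fd, &len, sizeof len) != 0)
        return NULL;
    if (len > MAX_FIELD_LEN)
        return NULL;        /* e.g. a skipped or corrupted length prefix */

    val = malloc((size_t)len + 1);
    if (val == NULL)
        return NULL;
    if (read_full(fd, val, len) != 0) {
        free(val);
        return NULL;
    }
    val[len] = '\0';
    if (len_out != NULL)
        *len_out = len;
    return val;
}
```

With a cap like this, a damaged prefix makes the request fail fast rather than triggering a gigabyte-sized allocation; it does not fix the underlying corruption.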
 > 
 > Through dtrace, we know that we're writing it successfully to the socket -
 > but it will occasionally come out on the other side corrupted.  N.B. if you
 > truss the program, it'll work just fine.
 > 
 > Here's a dtrace file you might find helpful:
 > 
 > <http://people.apache.org/~jerenkrantz/httpd.d>
 > 
 > (I'm new to dtrace, so there might be easier ways to write this script.)
 > 
 > Build httpd with --with-mpm=worker and configure it so that we only have 1
 > worker thread:
 > 
 > <IfModule worker.c>
 > StartServers         1
 > MaxClients           1
 > MinSpareThreads      1
 > MaxSpareThreads      1
 > ThreadsPerChild      1
 > MaxRequestsPerChild  0
 > </IfModule>
 > 
 > Run it with:
 > 
 > ./httpd.d <pid of worker process> <pid of cgid process>
 > 
 > The worker process is the httpd with multiple threads; look for a thread
 > that has these stack characteristics (this is the idle 'worker' waiting for
 > a connection):
 > -----------------  lwp# 3 / thread# 3  --------------------
 >  cba7d3a9 lwp_park (0, 0, 0)
 >  cba77c2a cond_wait_queue (81e5768, 81e5734, 0, 0) + 3b
 >  cba78123 _cond_wait (81e5768, 81e5734) + 66
 >  cba78165 cond_wait (81e5768, 81e5734) + 21
 >  cba7819e pthread_cond_wait (81e5768, 81e5734) + 1b
 >  cbcaf9a6 apr_thread_cond_wait (81e5760, 81e5730) + 36
 >  080ada99 ap_queue_pop (81e5718, ca49dfa4, ca49df98) + 69
 >  080aac3f worker_thread (81e5838, 87cfdd8) + 10f
 >  cbcaac0a dummy_worker (81e5838) + 3a
 >  cba7d03f _thr_setup (ca9b0000) + 4e
 >  cba7d330 _lwp_start (ca9b0000, 0, 0, ca49dff8, cba7d330, ca9b0000)
 > 
 > The cgid program will have a pstack output similar to:
 > 
 > 15089:  httpd -k start
 >  cba7d905 accept   (25, 8046b7a, 8046b64, 1)
 >  0809c1e4 cgid_server (811f820) + 314
 >  0809c852 cgid_start (811b0a0, 811f820, 8119e18) + a2
 >  0809b405 cgid_maint (0, 8119e18, f) + 95
 >  cbcad0cd apr_proc_other_child_alert (8046c7c, 0, f) + 7d
 >  cbcad300 apr_proc_other_child_read (8046c7c, f) + 30
 >  080ac1e9 server_main_loop (0) + 199
 >  080ac57b ap_mpm_run (811b0a0, 8153180, 811f820) + 2eb
 >  080b5a34 main     (3, 8046d44, 8046d54) + 9c4
 >  0807ba3a ???????? (3, 8046e84, 8046ea4, 8046ea7, 0, 8046ead)
 > 
 > Pretty much any CGI script will demonstrate the problem.  There are a few
 > in the httpd PR link above.
 > 
 > So, you could run dtrace with:
 > 
 > ./httpd.d 14581 15089
 > 
 > Here's a 'bad' output from a Solaris 10/Intel (SMP) box:
 > 
 > -----
 >   0  40471                    get_req:entry (15274) Entering get_req!
 >   0     10                       read:entry read (15274) fd: 38 - 64 bytes
 >   0     11                      read:return read (15274) fd: 38 - 64 bytes
 >   0     11                      read:return
 >   0     10                       read:entry read (15274) fd: 38 - 42 bytes
 >   0     11                      read:return read (15274) fd: 38 - 42 bytes
 >   0     11                      read:return /home/jerenk/public_html/weblog/weblog.cgi
 >   0     10                       read:entry read (15274) fd: 38 - 10 bytes
 >   0     11                      read:return read (15274) fd: 38 - 10 bytes
 >   0     11                      read:return weblog.cgii/software/
 >   0     10                       read:entry read (15274) fd: 38 - 20 bytes
 >   0     11                      read:return read (15274) fd: 38 - 20 bytes
 >   0     11                      read:return /weblog.cgi/software
 >   0     10                       read:entry read (15274) fd: 38 - 4 bytes
 >   0     11                      read:return read (15274) fd: 38 - 4 bytes
 >   0     11                      read:return 1430084180
 >   0     11                      read:return TZ=U
 > ...UH-OH.  This isn't correct....15274 will now go allocate that much memory...
 >   1     12                      write:entry write (14581) fd: 38 - 64 bytes
 >   1     12                      write:entry
 >   1     13                     write:return wrote (14581): 64
 >   1     12                      write:entry write (14581) fd: 38 - 42 bytes
 >   1     12                      write:entry /home/jerenk/public_html/weblog/weblog.cgi
 >   1     13                     write:return wrote (14581): 42
 >   1     12                      write:entry write (14581) fd: 38 - 10 bytes
 >   1     12                      write:entry weblog.cgi
 >   1     13                     write:return wrote (14581): 10
 >   1     12                      write:entry write (14581) fd: 38 - 20 bytes
 >   1     12                      write:entry /weblog.cgi/software
 >   1     13                     write:return wrote (14581): 20
 >   1     12                      write:entry write (14581) fd: 38 - 4 bytes
 >   1     12                      write:entry
 >   1     13                     write:return wrote (14581): 4
 >   1     12                      write:entry write (14581) fd: 38 - 13 bytes
 >   1     12                      write:entry TZ=US/Pacific
 >   1     13                     write:return wrote (14581): 13
 >   1     12                      write:entry write (14581) fd: 38 - 4 bytes
 >   1     12                      write:entry $
 >   1     13                     write:return wrote (14581): 4
 >   1     12                      write:entry write (14581) fd: 38 - 36 bytes
 >   1     12                      write:entry HTTP_HOST=weblog.erenkrantz.com:8080
 >   1     13                     write:return wrote (14581): 36
 >   1     12                      write:entry write (14581) fd: 38 - 4 bytes
 >   1     12                      write:entry l
 >   1     13                     write:return wrote (14581): 4
 >   1     12                      write:entry write (14581) fd: 38 - 108 bytes
 >   1     12                      write:entry HTTP_USER_AGENT=Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.8b4) Gecko/20050819 Firefox/1.0+
 >   1     13                     write:return wrote (14581): 108
 > ...snip...pid 15274 is done allocating 1430084180 bytes....
 >   1     10                       read:entry read (15274) fd: 38 - 1430084180 bytes
 >   1     11                      read:return read (15274): Short read: 1430084180, 165
 >   1     11                      read:return S/Pacific
 >   1     10                       read:entry read (15274) fd: 38 - 1430084015 bytes
 >   1     11                      read:return read (15274): Short read: 1430084015, 1849
 >   1     11                      read:return Data too large
 >   1     10                       read:entry read (15274) fd: 38 - 1430082166 bytes
 >   1     11                      read:return read (15274): Short read: 1430082166, 0
 >   1     11                      read:return
 >   1  40472                   get_req:return (15274) Leaving get_req!
 > -----
 > 
 > Since it's an SMP box, I'm guessing that'll partially explain why the reads
 > and writes appear out of order.  (This is the first connection, so it's not
 > from somewhere else.)
 > 
 > For 4 byte reads, I have dtrace doing the length conversion on the read data.
 > 
 > Notice the write pattern: 64 bytes, 42 bytes, 10, 20, 4, 13, 4, 36....
 > Notice the read pattern:  64 bytes, 42 bytes, 10, 20, 4**, 1430084180...
 > 
 > That 4-byte read is corrupted.  This causes the reader to allocate
 > 1430084180 bytes.
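
The bogus length looks like payload rather than noise. Reading the four ASCII bytes `TZ=U` as a little-endian 32-bit integer, as an x86 box would, gives exactly 1430084180. That suggests the reader picked up the first four bytes of `TZ=US/Pacific` where it expected the length 13, which also fits the next read returning `S/Pacific`. The helper name `le32_from_bytes` below is hypothetical; it only illustrates the arithmetic.

```c
#include <stdint.h>

/* Interpret four bytes as a little-endian 32-bit value, the way an x86
 * reader sees a 4-byte length field copied straight off the socket. */
static uint32_t le32_from_bytes(const unsigned char b[4])
{
    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}
```

`le32_from_bytes((const unsigned char *)"TZ=U")` evaluates to 1430084180 (0x553D5A54), while the expected prefix bytes 0d 00 00 00 decode to 13.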
 > 
 > So, does this ring a bell for anyone?
 > 
 > Our code worked fine on Solaris 9 - which I was using until yesterday on
 > this particular machine.  And, as mentioned before, it also worked fine on
 > pre-GA releases of Sol10.  (The same code also works on FreeBSD, Linux, etc.)
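
To check whether the socket layer alone misbehaves, independent of httpd, the same kind of write pattern can be replayed over a bare socket. The sketch below assumes the cgid connection is an AF_UNIX stream socket and uses socketpair() for brevity rather than the connect()/accept() setup described above. `stress_unix_stream`, `writer`, and `read_exact` are hypothetical names, and nothing here is confirmed to reproduce the bug.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Two environment strings taken from the trace (13 and 36 bytes). */
static const char *const fields[] = {
    "TZ=US/Pacific",
    "HTTP_HOST=weblog.erenkrantz.com:8080",
};
enum { NFIELDS = sizeof fields / sizeof fields[0] };

/* Read exactly n bytes; 0 on success, -1 on error or EOF. */
static int read_exact(int fd, void *buf, size_t n)
{
    char *p = buf;

    while (n > 0) {
        ssize_t got = read(fd, p, n);
        if (got <= 0)
            return -1;
        p += got;
        n -= (size_t)got;
    }
    return 0;
}

/* Child side: one write() for each length and one for each value,
 * mirroring the 4, 13, 4, 36 pattern in the trace.  Never returns. */
static void writer(int fd, int rounds)
{
    for (int r = 0; r < rounds; r++) {
        for (int i = 0; i < NFIELDS; i++) {
            uint32_t len = (uint32_t)strlen(fields[i]);
            if (write(fd, &len, sizeof len) != (ssize_t)sizeof len ||
                write(fd, fields[i], len) != (ssize_t)len)
                _exit(1);
        }
    }
    _exit(0);
}

/* Returns 0 if every record arrived intact, 1 at the first damaged
 * record, and -1 if the socket or the child could not be set up. */
static int stress_unix_stream(int rounds)
{
    int sv[2];
    int bad = 0;
    char buf[64];
    pid_t pid;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    pid = fork();
    if (pid < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (pid == 0) {
        close(sv[0]);
        writer(sv[1], rounds);      /* exits the child */
    }
    close(sv[1]);

    for (int r = 0; r < rounds && !bad; r++) {
        for (int i = 0; i < NFIELDS && !bad; i++) {
            uint32_t want = (uint32_t)strlen(fields[i]);
            uint32_t len = 0;
            /* The length check runs first, so buf cannot overflow. */
            if (read_exact(sv[0], &len, sizeof len) != 0 || len != want ||
                read_exact(sv[0], buf, len) != 0 ||
                memcmp(buf, fields[i], len) != 0)
                bad = 1;
        }
    }
    close(sv[0]);
    waitpid(pid, NULL, 0);
    return bad;
}
```

Calling `stress_unix_stream` with a large round count on an affected box and on a known-good one would show whether the corruption depends on anything httpd-specific.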
 > 
 > The cgid source file is here:
 > 
 > <http://svn.apache.org/repos/asf/httpd/httpd/tags/2.0.54/modules/generators/mod_cgid.c>
 > 
 > The thread from the httpd-dev list today is here:
 > 
 > <http://mail-archives.apache.org/mod_mbox/httpd-dev/200508.mbox/[EMAIL PROTECTED]>
 > 
 > Thanks in advance for any help.  -- justin
 > _______________________________________________
 > opensolaris-code mailing list
 > [email protected]
 > https://opensolaris.org:444/mailman/listinfo/opensolaris-code

-- 
meem