Re: [PATCH] Re: the most common seg fault on daedalus
On Wed, 10 Apr 2002, Greg Ames wrote:

> We have another similar looking dump, with this patch on. It's
> /usr/local/apache2.0.35c/corefiles/httpd.core.2
>
> #0 mmap_cleanup (themmap=0x8153ac0) at mmap.c:98
> #1 0x280cc8f0 in apr_mmap_delete (mm=0x8153ac0) at mmap.c:202

Ahh, this is different.

> Didn't I see a refcount inside the object that's being deleted? If so,
> how can we trust it after the thing is deleted?

Nope, no refcount. The bucket is refcounted, but not the mmap. And the
is_owner should guarantee us that only _one_ bucket will ever try to
delete the thing and succeed. The check for -1 should be saving us the
rest of the time... don't know what's going on. Will investigate.

--Cliff

--
Cliff Woolley
[EMAIL PROTECTED]
Charlottesville, VA
Re: [PATCH] Re: the most common seg fault on daedalus
Cliff Woolley wrote:

> Greg, can you try this patch on daedalus for me?

:-( We have another similar looking dump, with this patch on. It's
/usr/local/apache2.0.35c/corefiles/httpd.core.2

#0  mmap_cleanup (themmap=0x8153ac0) at mmap.c:98
#1  0x280cc8f0 in apr_mmap_delete (mm=0x8153ac0) at mmap.c:202
#2  0x280ad926 in mmap_destroy (data=0x8131088) at apr_buckets_mmap.c:82
#3  0x280adf08 in apr_brigade_cleanup (data=0x814c4f8) at apr_brigade.c:86
#4  0x280adebe in brigade_cleanup (data=0x814c4f8) at apr_brigade.c:72
#5  0x280cdd3b in run_cleanups (c=0x81393f8) at apr_pools.c:1713
#6  0x280cd51c in apr_pool_destroy (pool=0x813f010) at apr_pools.c:638
#7  0x805ea1c in ap_process_http_connection (c=0x812c120) at http_core.c:300
#8  0x806ecf3 in ap_run_process_connection (c=0x812c120) at connection.c:85
#9  0x806f008 in ap_process_connection (c=0x812c120, csd=0x812c048) at connection.c:207
#10 0x80648f0 in child_main (child_num_arg=547) at prefork.c:671

Didn't I see a refcount inside the object that's being deleted? If so,
how can we trust it after the thing is deleted?

Greg
Re: [PATCH] Re: the most common seg fault on daedalus
On Tue, 9 Apr 2002, Greg Ames wrote:

> But this morning we have a core dump
> (/usr/local/apache2.0.35c/corefiles/httpd.core.1). The good news is
> that it doesn't look anything like the mmap cleanup dumps, so I think
> this patch is good.

Cool. Yeah, I was pretty confident I had that one pinned down. I'll go
ahead and do the corresponding Win32 patch.

--Cliff

--
Cliff Woolley
[EMAIL PROTECTED]
Charlottesville, VA
Re: [PATCH] Re: the most common seg fault on daedalus
Greg Ames wrote:
>
> Cliff Woolley wrote:
> >
> > Greg, can you try this patch on daedalus for me? (sorry if there were
> > line wraps)
>
> OK, it's running on port 8092, and passes everything I know how/remember
> to throw at it in test mode. I'll put it into production after band
> practice tonight when I have time to watch it for a while. Hopefully it
> will be another Maytag Man experience.

We did go live with 2.0.35 + this patch last night. I posted a message
to that effect from an unusual mail server but don't see it this AM. It
may have been eaten by a spam filter or is waiting for the moderator.

The biggest excitement while I was watching it was to see that the
2.0.35 download rate is about 33% higher than 1.3.24.

But this morning we have a core dump
(/usr/local/apache2.0.35c/corefiles/httpd.core.1). The good news is that
it doesn't look anything like the mmap cleanup dumps, so I think this
patch is good. I'll post the info from the new dump in a separate
thread.

Greg
Re: [PATCH] Re: the most common seg fault on daedalus
Cliff Woolley wrote:

> Greg, can you try this patch on daedalus for me? (sorry if there were
> line wraps)

OK, it's running on port 8092, and passes everything I know how/remember
to throw at it in test mode. I'll put it into production after band
practice tonight when I have time to watch it for a while. Hopefully it
will be another Maytag Man experience.

Greg

p.s. I did have to hand-apply the second half of the patch. I couldn't
see what the mismatch was via eyeball. It wasn't line wrap; it may have
been differences in whitespace; diff'ing the patches did highlight what
looked like blank lines. No big deal.
Re: [PATCH] Re: the most common seg fault on daedalus
On Mon, 8 Apr 2002, Greg Ames wrote:

> hmmm, I wonder what changed since 2.0.32 to trigger the failure then?
> Maybe it's some of the internal_fast_redirect stuff, which affects this
> URL.

Could be. It's most likely some combination of setaside being called and
then a brigade being cleaned up without all of the buckets being emptied
from it beforehand. If the internal_fast_redirect stuff could cause that
(I suppose it possibly could), then that would explain it.

> > Greg, can you try this patch on daedalus for me?
>
> Sure, no problem. But it will probably be a while before it goes live.
> Last time I checked we had 415 active requests, and we've hit 600
> (ServerLimit) today. Mondays typically are heavy load days, plus I
> think we're getting /.ed .

'kay. In the meanwhile I'll try to pin down an exact request that will
cause it to segfault.

--
Cliff Woolley
[EMAIL PROTECTED]
Charlottesville, VA
Re: [PATCH] Re: the most common seg fault on daedalus
Cliff Woolley wrote:

> As a side note, the buckets code is okay because (and it even has a
> comment to this effect), it assumes that apr_mmap_delete() will do the
> Right Thing, and if the mmap is not owned or already deleted, it will
> just be a no-op.

Sorry, I didn't notice before you pointed it out that the code involved
left bucket-land.

hmmm, I wonder what changed since 2.0.32 to trigger the failure then?
Maybe it's some of the internal_fast_redirect stuff, which affects this
URL.

> Greg, can you try this patch on daedalus for me?

Sure, no problem. But it will probably be a while before it goes live.
Last time I checked we had 415 active requests, and we've hit 600
(ServerLimit) today. Mondays typically are heavy load days, plus I think
we're getting /.ed .

Greg
[PATCH] Re: the most common seg fault on daedalus
On Mon, 8 Apr 2002, Cliff Woolley wrote:

> Ha. ;-) Actually I think I see a number of potential problems in
> apr/mmap/unix/mmap.c. I'm working on a patch and I'll post it here.

Okay, there are two or three issues or potential issues addressed here.

(1) apr_mmap_dup() would register a cleanup on the *new_mmap even if the
new one was not the owner. That turned out to be okay, though, because
mmap_cleanup would detect the !mm->is_owner case and just pretend
everything was cool. It's just extra work that doesn't need to be done.

(2) In mmap_cleanup(), if munmap() fails for some reason, we would not
set mm->mm to -1, meaning if we ever tried to run the cleanup a second
time for some reason, we would not know we'd already tried to munmap()
the thing and we'd do it again. This is the same kind of bug that killed
us a while back on directory cleanups, if you'll recall.

(3) Finally (and this is, I believe, the source of our current
segfaults), here's a weird sequence of events:

 - create an mmap
 - dup it (with or without transferring ownership)
 - cleanup or delete one copy
 - delete the other one

Because apr_mmap_delete() would assume that any APR_SUCCESS from
mmap_cleanup() meant "OK, go ahead and kill the cleanup", and
!mm->is_owner would return APR_SUCCESS, we would kill the cleanup even
if there wasn't one to kill. Boom. There are two or three related
sequences of events that cause this, but they all boil down to
essentially the above.

There are parallel problems in the win32 mmap code that will need
patching as well.

As a side note, the buckets code is okay because (and it even has a
comment to this effect) it assumes that apr_mmap_delete() will do the
Right Thing: if the mmap is not owned or already deleted, it will just
be a no-op.

Greg, can you try this patch on daedalus for me?
(sorry if there were line wraps)

Thanks,
Cliff

Index: mmap.c
===================================================================
RCS file: /home/cvs/apr/mmap/unix/mmap.c,v
retrieving revision 1.40
diff -u -d -r1.40 mmap.c
--- mmap.c	13 Mar 2002 20:39:24 -0000	1.40
+++ mmap.c	8 Apr 2002 18:26:36 -0000
@@ -85,25 +85,21 @@
     apr_mmap_t *mm = themmap;
     int rv;
 
-    if (!mm->is_owner) {
-        return APR_SUCCESS;
+    if ((!mm->is_owner) || (mm->mm == (void *)-1)) {
+        /* XXX: we shouldn't ever get here */
+        return APR_ENOENT;
     }
 
 #ifdef BEOS
     rv = delete_area(mm->area);
-
-    if (rv == 0) {
-        mm->mm = (void *)-1;
-        return APR_SUCCESS;
-    }
 #else
     rv = munmap(mm->mm, mm->size);
+#endif
+    mm->mm = (void *)-1;
 
     if (rv == 0) {
-        mm->mm = (void *)-1;
         return APR_SUCCESS;
     }
-#endif
 
     return errno;
 }
@@ -189,24 +185,21 @@
             (*new_mmap)->is_owner = 1;
             old_mmap->is_owner = 0;
             apr_pool_cleanup_kill(old_mmap->cntxt, old_mmap, mmap_cleanup);
+            apr_pool_cleanup_register(p, *new_mmap, mmap_cleanup,
+                                      apr_pool_cleanup_null);
         }
         else {
             (*new_mmap)->is_owner = 0;
         }
-        apr_pool_cleanup_register(p, *new_mmap, mmap_cleanup,
-                                  apr_pool_cleanup_null);
     }
     return APR_SUCCESS;
 }
 
 APR_DECLARE(apr_status_t) apr_mmap_delete(apr_mmap_t *mm)
 {
-    apr_status_t rv;
+    apr_status_t rv = APR_SUCCESS;
 
-    if (mm->mm == (void *)-1)
-        return APR_ENOENT;
-
-    if ((rv = mmap_cleanup(mm)) == APR_SUCCESS) {
+    if (mm->is_owner && ((rv = mmap_cleanup(mm)) == APR_SUCCESS)) {
         apr_pool_cleanup_kill(mm->cntxt, mm, mmap_cleanup);
         return APR_SUCCESS;
     }

--
Cliff Woolley
[EMAIL PROTECTED]
Charlottesville, VA