Re: Release Sched and futex bug

Robert L. Millner Tue, 18 Mar 2003 16:04:39 -0800

> Is that problem related to the Native POSIX Thread Library
> issues that are described in the 8.0.94/RELEASE-NOTES file?


> If so, that doc says that the workaround is to either set
> "LD_ASSUME_KERNEL=2.2.5" or boot with the option "nosysinfo"

I'll try that out as a workaround.  Thanks (and I should have read that in
the first place after switching up from earlier 8.0.9x versions).

> We've found one problem with rpm and SIGPIPE.  If you do something
> like "rpm -qa | /bin/true" as root, you'll get a stale lock.  You'll
> also get stale locks any time you use SIGKILL or any other

Ok, I'll check that.  If this is the culprit, then it's likely that the
problems I was seeing yesterday come from using rpm as part of shell
scripts and having the output feed other scripts.


> That doesn't prove it is a kernel bug, because rebooting also clears
> rpm's lock files.

Right, that was the wrong culprit.  So, looking a little deeper into
this...

Looping over:
rpm -Uvh cpan2rpm-2.014-1.noarch.rpm
rpm -e cpan2rpm

[ side note: cpan2rpm is quite useful. ]

...appears to go a hundred iterations without producing a hang.
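
For anyone who wants to repeat that loop, here is a minimal sketch.  The
function name cycle_rpm and the RPM override variable are made up for
illustration; only the two rpm invocations come from the commands above.
Note that a hang would simply block the loop rather than return an error.

```shell
# Hypothetical helper: repeat the install/erase cycle COUNT times and stop
# at the first command that fails. RPM can be overridden for a dry run.
cycle_rpm() {
    pkg_file=$1
    pkg_name=$2
    count=${3:-100}
    rpm_cmd=${RPM:-rpm}
    i=1
    while [ "$i" -le "$count" ]; do
        "$rpm_cmd" -Uvh "$pkg_file" || return 1
        "$rpm_cmd" -e "$pkg_name" || return 1
        i=$((i + 1))
    done
    echo "completed $count iterations"
}

# Usage, as root:
#   cycle_rpm cpan2rpm-2.014-1.noarch.rpm cpan2rpm 100
```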



The hang can be reproduced reliably with this set of commands:

1. reboot
2. log in as root
3. rpm -qa | /bin/true
4. rpm -e cpan2rpm     # installed previously

...confirming Matt's message.  Once the "rpm -qa | /bin/true" command has
been issued, successive "rpm -e" and "rpm -U" commands reliably hang.


The problem does not occur with this sequence:
1. reboot
2. log in as root
3. LD_ASSUME_KERNEL="2.2.5" rpm -qa | /bin/true
4. rpm -e cpan2rpm
5. rpm -Uvh cpan2rpm-2.014-1.noarch.rpm

...or with the sequence:
1. reboot
2. log in as root
3. rpm -qa | /bin/true
4. LD_ASSUME_KERNEL="2.2.5" rpm -e cpan2rpm
5. LD_ASSUME_KERNEL="2.2.5" rpm -Uvh cpan2rpm-2.014-1.noarch.rpm

...and prefixing all subsequent rpm commands with
LD_ASSUME_KERNEL="2.2.5".
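
To avoid typing the prefix every time, a small wrapper along these lines
should do.  The function name is hypothetical; the variable and its value
are the ones from the release notes quoted above.

```shell
# Hypothetical wrapper: run the given command with LD_ASSUME_KERNEL=2.2.5
# placed in that command's environment.
with_old_kernel_abi() {
    LD_ASSUME_KERNEL=2.2.5 "$@"
}

# Usage, mirroring the sequences above:
#   with_old_kernel_abi rpm -e cpan2rpm
#   with_old_kernel_abi rpm -Uvh cpan2rpm-2.014-1.noarch.rpm
```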

Without a reboot, the rpm command reliably stops hanging once you run
rm -rf /var/lib/rpm/__db.*; rc.sysinit does the same removal at boot.
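
As a sketch, that cleanup can be wrapped in a function.  The name
clear_rpm_regions and the optional directory argument are my own additions;
I am also assuming no other rpm process is using the database when this
runs, since the files are shared between processes.

```shell
# Hypothetical helper: remove the __db.* files from an rpm database
# directory. Defaults to the /var/lib/rpm path used in this message.
clear_rpm_regions() {
    db_dir=${1:-/var/lib/rpm}
    # -f keeps rm quiet when no __db.* file exists.
    rm -f "$db_dir"/__db.*
}

# Usage, as root, with no rpm process running (assumption):
#   clear_rpm_regions
```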

So the release notes (which I should have applied before ranting) are
appropriate to this instance.  Good to know there's a workaround and that
the cause is probably far more mundane than a kernel bug.


The waiting futex syscall in "rpm -e":
futex(0x4059130c, FUTEX_WAIT, 0, NULL <unfinished ...>


[EMAIL PROTECTED]:/proc/17240#grep '__db.' maps
40017000-4001b000 rw-s 00000000 08:11 229714     /var/lib/rpm/__db.001
40406000-40548000 rw-s 00000000 08:11 229715     /var/lib/rpm/__db.002
40548000-405b8000 rw-s 00000000 08:11 229716     /var/lib/rpm/__db.003


...is referencing an address in the range of the mmapped file
"__db.003" (40548000-405b8000).  Rpm's usage of shared regions is
interesting code reading that doesn't need to be rehashed here.  The hang
is clearly a case of waiting for a mutex lock on a structure in that file.
Either the lock needs to be cleared as part of a lock reclamation step, or
rpm needs to be able to bail out earlier and alert the user that there's a
problem.
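
The address check can be done mechanically rather than by eye.  This is a
rough sketch; find_mapping is a hypothetical name, and it only assumes the
"start-end perms offset dev inode path" layout shown in the maps output
above.

```shell
# Hypothetical helper: read /proc/<pid>/maps-style lines on stdin and print
# the path of the mapping whose [start, end) range contains ADDR (hex).
find_mapping() {
    addr=$((0x${1#0x}))
    while read -r range perms offset dev inode path; do
        start=$((0x${range%-*}))
        end=$((0x${range#*-}))
        if [ "$addr" -ge "$start" ] && [ "$addr" -lt "$end" ]; then
            echo "$path"
            return 0
        fi
    done
    return 1
}

# Usage against the hung process from this message:
#   find_mapping 0x4059130c < /proc/17240/maps
```

Fed the three maps lines above, it prints /var/lib/rpm/__db.003.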

Thanks for the support.

        Cheers,
        Rob




-- 
Phoebe-list mailing list
[EMAIL PROTECTED]
https://listman.redhat.com/mailman/listinfo/phoebe-list