On Thu, 2018-07-26 at 05:48 +0200, Matt Darfeuille wrote:
> 
> As illustrated by this lingering thread, issues that are only present
> on one platform makes me moved away from OpenWRT/LEDE.

The platform is not the problem. The platform is just providing the
tools.
Or are you suggesting that the "lock" tool on OpenWRT/LEDE is actually
buggy? Given that it's just a wrapper around flock(), that seems
unlikely. But I'm happy to be proven wrong if you can provide a
reproducer for the bug that I can submit upstream. In all the testing
I have done with the "lock" tool, it has operated as expected when
used as expected.
Given the evidence, it seems like the file being locked is getting
removed before the lock is released.
A reboot of my router this morning has reproduced the situation and
this is what I see:
# ps -ef | grep lock
root 2700 2666 0 07:13 ? 00:00:00 lock /etc/shorewall-lite/state/lock
root 3234 1 0 07:13 ? 00:00:00 lock /etc/shorewall-lite/state/lock
# lsof -n -p 3234
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
lock 3234 root cwd DIR 0,15 656 258 /
lock 3234 root rtd DIR 0,15 656 258 /
lock 3234 root txt REG 254,0 308533 1786 /bin/busybox
lock 3234 root mem REG 254,0 77040 213 /lib/libgcc_s.so.1
lock 3234 root mem REG 254,0 601968 402 /lib/libc.so
lock 3234 root 0u CHR 1,3 0t0 317 /dev/null
lock 3234 root 1u CHR 1,3 0t0 317 /dev/null
lock 3234 root 2u CHR 1,3 0t0 317 /dev/null
lock 3234 root 3u REG 0,14 5 61617 /etc/shorewall-lite/state/lock (deleted)
lock 3234 root 13w FIFO 0,8 0t0 1732 pipe
# cat /proc/2700/fd/3
3234
# strace -f -p 3234
strace: Process 3234 attached
restart_syscall(<... resuming interrupted syscall_516 ...>) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcd900) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcd900) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcd900) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcd900) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcd900) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcd900) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffcd900) = 0
nanosleep({tv_sec=1, tv_nsec=0}, ^Cstrace: Process 3234 detached
<detached ...>
# strace -f -p 2700
strace: Process 2700 attached
flock(3, LOCK_EX^Cstrace: Process 2700 detached
<detached ...>
Hrm. Given:
g_havemutex="lock -u ${lockf} && rm -f ${lockf}"
Observe this particular set of operations:
tty1# lock /tmp/mylockfile
tty1# [has the lock and returns]
tty2# lock /tmp/mylockfile
[blocks waiting for locker1 to release the lock as we can see:]
# lsof | grep /tmp/mylockfile
lock 1249 root 3u REG 0,13 5 352778 /tmp/mylockfile
lock 1250 root 3u REG 0,13 5 352778 /tmp/mylockfile
tty1# lock -u /tmp/mylockfile && rm -f /tmp/mylockfile
tty1# [returns, releasing the lock to tty2]
tty2# [returns from blocked state, now holds the lock]
# lsof | grep /tmp/mylockfile
lock 1404 root 3u REG 0,13 5 352778 /tmp/mylockfile (deleted)
tty3# lock /tmp/mylockfile
tty3# [wait, what? it returns even though tty2 has the lock!]
# lsof | grep /tmp/mylockfile
lock 1404 root 3u REG 0,13 5 352778 /tmp/mylockfile (deleted)
lock 1439 root 3u REG 0,13 5 362181 /tmp/mylockfile
So at this point both tty2 and tty3 believe they have the lock and have
returned, allowing them to do their work on top of each other.
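
The same sequence can be sketched at the flock() level. This is only an
illustration, assuming Python's fcntl module on Linux rather than the
BusyBox "lock" applet; the function name and the "remove" parameter are
hypothetical:

```python
import fcntl
import os
import tempfile


def demonstrate_unlink_race(path, remove=True):
    """Return (first_inode, second_inode, second_lock_acquired)."""
    # "tty2": open the lock file and take an exclusive flock() on it.
    first = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    fcntl.flock(first, fcntl.LOCK_EX)
    first_inode = os.fstat(first).st_ino

    if remove:
        # The previous holder's "rm -f": the name goes away, but the inode
        # locked through `first` lives on while the descriptor is open.
        os.unlink(path)

    # "tty3": open the same path. If the name was removed, O_CREAT makes a
    # brand-new file with a different inode.
    second = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    second_inode = os.fstat(second).st_ino
    try:
        # Non-blocking, so the sketch reports a conflict instead of hanging.
        fcntl.flock(second, fcntl.LOCK_EX | fcntl.LOCK_NB)
        acquired = True
    except BlockingIOError:
        acquired = False

    # Clean up: closing a descriptor drops its flock().
    os.close(second)
    os.close(first)
    os.unlink(path)
    return first_inode, second_inode, acquired


if __name__ == "__main__":
    lockfile = os.path.join(tempfile.mkdtemp(), "mylockfile")
    print("with rm:   ", demonstrate_unlink_race(lockfile, remove=True))
    print("without rm:", demonstrate_unlink_race(lockfile, remove=False))
```

With the removal, the second open gets a different inode and its lock
attempt succeeds while the first lock is still held, matching the lsof
output above. Without the removal, both descriptors refer to the same
inode and the second attempt is refused.
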
I don't think a process can simply remove the lock file just because it
has released its lock on it. It can only be removed if there are no
more outstanding locks on it. Or just don't remove it. lock seems to
function perfectly fine with the file pre-existing.
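
For a locker that does want to remove the file, one common pattern is to
re-check, after flock() returns, that the path still names the inode
that was locked, and to retry if it does not. The following is a sketch
of that pattern only, assuming Python's fcntl module; it is not what the
BusyBox "lock" applet does, and acquire_lock/release_lock are
hypothetical names:

```python
import fcntl
import os
import tempfile


def acquire_lock(path):
    """Block until we hold flock() on the file that `path` currently names."""
    while True:
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        fcntl.flock(fd, fcntl.LOCK_EX)
        held = os.fstat(fd)
        try:
            named = os.stat(path)
        except FileNotFoundError:
            named = None
        same_file = named is not None and (
            (named.st_dev, named.st_ino) == (held.st_dev, held.st_ino)
        )
        if same_file:
            return fd
        # The file we locked was removed or replaced while we waited,
        # so this lock protects nothing. Drop it and start over.
        os.close(fd)


def release_lock(fd, path, remove=False):
    """Release the lock; optionally unlink the file while still holding it."""
    if remove:
        try:
            os.unlink(path)
        except FileNotFoundError:
            pass
    os.close(fd)  # closing the descriptor releases the flock()


if __name__ == "__main__":
    lockfile = os.path.join(tempfile.mkdtemp(), "mylockfile")
    fd = acquire_lock(lockfile)
    print("holding lock on inode", os.fstat(fd).st_ino)
    release_lock(fd, lockfile, remove=True)
```

Unlinking before closing means any waiter that wakes up with a lock on
the deleted inode notices the mismatch and retries against the new file.
Simply never removing the file avoids the problem with less machinery.
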
I'm not sure I can draw a line from this problem to the stale locks
problem, but it's probably a good thing to fix before continuing to try
to debug the stale locks problem.
Cheers,
b.
