Bugs item #1599254, was opened at 2006-11-19 11:03 Message generated for change (Comment added) made by akuchling You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1599254&group_id=5470
Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: Python Library Group: Python 2.5 Status: Open Resolution: None Priority: 9 Private: No Submitted By: David Watson (baikie) Assigned to: A.M. Kuchling (akuchling) Summary: mailbox: other programs' messages can vanish without trace Initial Comment: The mailbox classes based on _singlefileMailbox (mbox, MMDF, Babyl) implement the flush() method by writing the new mailbox contents into a temporary file which is then renamed over the original. Unfortunately, if another program tries to deliver messages while mailbox.py is working, and uses only fcntl() locking, it will have the old file open and be blocked waiting for the lock to become available. Once mailbox.py has replaced the old file and closed it, making the lock available, the other program will write its messages into the now-deleted "old" file, consigning them to oblivion. I've caused Postfix on Linux to lose mail this way (although I did have to turn off its use of dot-locking to do so). A possible fix is attached. Instead of new_file being renamed, its contents are copied back to the original file. If file.truncate() is available, the mailbox is then truncated to size. Otherwise, if truncation is required, it's truncated to zero length beforehand by reopening self._path with mode wb+. In the latter case, there's a check to see if the mailbox was replaced while we weren't looking, but there's still a race condition. Any alternative ideas? Incidentally, this fixes a problem whereby Postfix wouldn't deliver to the replacement file as it had the execute bit set. ---------------------------------------------------------------------- >Comment By: A.M. Kuchling (akuchling) Date: 2007-01-17 15:53 Message: Logged In: YES user_id=11375 Originator: NO mailbox-unified-patch contains David's copy-back-new and fcntl-warn patches, plus the test-mailbox patch and some additional changes to mailbox.py from me. (I'll upload a diff to David's diffs in a minute.) This is the change I'd like to check in. test_mailbox.py now passes, as does the mailbox-break.py script I'm using. ---------------------------------------------------------------------- Comment By: A.M. Kuchling (akuchling) Date: 2007-01-17 14:56 Message: Logged In: YES user_id=11375 Originator: NO Committed a modified version of the doc. patch in rev. 53472 (trunk) and rev. 53474 (release25-maint). ---------------------------------------------------------------------- Comment By: Neal Norwitz (nnorwitz) Date: 2007-01-17 01:48 Message: Logged In: YES user_id=33168 Originator: NO Andrew, do you need any help with this? ---------------------------------------------------------------------- Comment By: A.M. Kuchling (akuchling) Date: 2007-01-15 14:01 Message: Logged In: YES user_id=11375 Originator: NO Comment from Andrew MacIntyre (os2vacpp is the OS/2 that lacks ftruncate()): ================ I actively support the OS/2 EMX port (sys.platform == "os2emx"; build directory is PC/os2emx). I would like to keep the VAC++ port alive, but the reality is I don't have the resources to do so. The VAC++ port was the subject of discussion about removal of build support support from the source tree for 2.6 - I don't recall there being a definitive outcome, but if someone does delete the PC/os2vacpp directory, I'm not in a position to argue. AMK: (mailbox.py has a separate section of code used when file.truncate() isn't available, and the existence of this section is bedevilling me. It would be convenient if platforms without file.truncate() weren't a factor; then this branch could just be removed. In your opinion, would it hurt OS/2 users of mailbox.py if support for platforms without file.truncate() was removed?) aimacintyre: No. From what documentation I can quickly check, ftruncate() operates on file descriptors rather than FILE pointers. As such I am sure that, if it became an issue, it would not be difficult to write a ftruncate() emulation wrapper for the underlying OS/2 APIs that implement the required functionality. ---------------------------------------------------------------------- Comment By: David Watson (baikie) Date: 2007-01-13 13:32 Message: Logged In: YES user_id=1504904 Originator: YES I like the warning idea - it seems appropriate if the problem is relatively rare. How about this? File Added: mailbox-fcntl-warn.diff ---------------------------------------------------------------------- Comment By: A.M. Kuchling (akuchling) Date: 2007-01-12 14:41 Message: Logged In: YES user_id=11375 Originator: NO One OS/2 port lacks truncate(), and so does RISCOS. ---------------------------------------------------------------------- Comment By: A.M. Kuchling (akuchling) Date: 2007-01-12 13:41 Message: Logged In: YES user_id=11375 Originator: NO I realized that making flush() invalidate keys breaks the final example in the docs, which loops over inbox.iterkeys() and removes messages, doing a pack() after each message. Which platforms lack file.truncate()? Windows has it; POSIX has it, so modern Unix variants should all have it. Maybe mailbox should simply raise an exception (or trigger a warning?) if truncate is missing, and we should then assume that flush() has no effect upon keys. ---------------------------------------------------------------------- Comment By: A.M. Kuchling (akuchling) Date: 2007-01-12 12:12 Message: Logged In: YES user_id=11375 Originator: NO So shall we document flush() as invalidating keys, then? ---------------------------------------------------------------------- Comment By: David Watson (baikie) Date: 2007-01-06 14:57 Message: Logged In: YES user_id=1504904 Originator: YES Oops, length checking had made the first two lines of this patch redundant; update-toc applies OK with fuzz. File Added: mailbox-copy-back-new.diff ---------------------------------------------------------------------- Comment By: David Watson (baikie) Date: 2007-01-06 10:30 Message: Logged In: YES user_id=1504904 Originator: YES File Added: mailbox-copy-back-53287.diff ---------------------------------------------------------------------- Comment By: David Watson (baikie) Date: 2007-01-06 10:24 Message: Logged In: YES user_id=1504904 Originator: YES Aack, yes, that should be _next_user_key. Attaching a fixed version. I've been thinking, though: flush() does in fact invalidate the keys on platforms without a file.truncate(), when the fcntl() lock is momentarily released afterwards. It seems hard to avoid this as, perversely, fcntl() locks are supposed to be released automatically on all file descriptors referring to the file whenever the process closes any one of them - even one the lock was never set on. So, code using mailbox.py such platforms could inadvertently be carrying keys across an unlocked period, which is not made safe by the update-toc patch (as it's only meant to avert disasters resulting from doing this *and* rebuilding the table of contents, *assuming* that another process hasn't deleted or rearranged messages). File Added: mailbox-update-toc-fixed.diff ---------------------------------------------------------------------- Comment By: A.M. Kuchling (akuchling) Date: 2007-01-05 14:51 Message: Logged In: YES user_id=11375 Originator: NO Question about mailbox-update-doc: the add() method still returns self._next_key - 1; should this be self._next_user_key - 1? The keys in _user_toc are the ones returned to external users of the mailbox, right? (A good test case would be to initialize _next_key to 0 and _next_user_key to a different value like 123456.) I'm still staring at the patch, trying to convince myself that it will help -- haven't spotted any problems, but this bug is making me nervous... ---------------------------------------------------------------------- Comment By: A.M. Kuchling (akuchling) Date: 2007-01-05 14:24 Message: Logged In: YES user_id=11375 Originator: NO As a step toward improving matters, I've attached the suggested doc patch (for both 25-maint and trunk). It encourages people to use Maildir :), explicitly states that modifications should be bracketed by lock(), and fixes the examples to match. It does not say that keys are invalidated by doing a flush(), because we're going to try to avoid the necessity for that. File Added: mailbox-docs.diff ---------------------------------------------------------------------- Comment By: A.M. Kuchling (akuchling) Date: 2006-12-20 14:48 Message: Logged In: YES user_id=11375 Originator: NO Committed length-checking.diff to trunk in rev. 53110. ---------------------------------------------------------------------- Comment By: David Watson (baikie) Date: 2006-12-20 14:19 Message: Logged In: YES user_id=1504904 Originator: YES File Added: mailbox-test-lock.diff ---------------------------------------------------------------------- Comment By: David Watson (baikie) Date: 2006-12-20 14:17 Message: Logged In: YES user_id=1504904 Originator: YES Yeah, I think that should definitely go in. ExternalClashError or a subclass sounds fine to me (although you could make a whole taxonomy of these errors, really). It would be good to have the code actually keep up with other programs' changes, though; a program might just want to count the messages at first, say, and not make changes until much later. I've been trying out the second option (patch attached, to apply on top of mailbox-copy-back), regenerating _toc on locking, but preserving existing keys. The patch allows existing _generate_toc()s to work unmodified, but means that _toc now holds the entire last known contents of the mailbox file, with the 'pending' (user-visible) mailbox state being held in a new attribute, _user_toc, which is a mapping from keys issued to the program to the keys of _toc (i.e. sequence numbers in the file). When _toc is updated, any new messages that have appeared are given keys in _user_toc that haven't been issued before, and any messages that have disappeared are removed from it. The code basically assumes that messages with the same sequence number are the same message, though, so even if most cases are caught by the length check, programs that make deletions/replacements before locking could still delete the wrong messages. This behaviour could be trapped, though, by raising an exception in lock() if self._pending is set (after all, code like that would be incorrect unless it could be assumed that the mailbox module kept hashes of each message or something). Also attached is a patch to the test case, adding a lock/unlock around the message count to make sure _toc is up-to-date if the parent process finishes first; without it, there are still intermittent failures. File Added: mailbox-update-toc.diff ---------------------------------------------------------------------- Comment By: A.M. Kuchling (akuchling) Date: 2006-12-20 09:46 Message: Logged In: YES user_id=11375 Originator: NO Attaching a patch that adds length checking: before doing a flush() on a single-file mailbox, seek to the end and verify its length is unchanged. It raises an ExternalClashError if the file's length has changed. (Should there be a different exception for this case, perhaps a subclass of ExternalClashError?) I verified that this change works by running a program that added 25 messages, pausing between each one, and then did 'echo "new line" > /tmp/mbox' from a shell while the program was running. I also noticed that the self._lookup() call in self.flush() wasn't necessary, and replaced it by an assertion. I think this change should go on both the trunk and 25-maint branches. File Added: length-checking.diff ---------------------------------------------------------------------- Comment By: A.M. Kuchling (akuchling) Date: 2006-12-18 12:43 Message: Logged In: YES user_id=11375 Originator: NO Eep, correct; changing the key IDs would be a big problem for existing code. We could say 'discard all keys' after doing lock() or unlock(), but this is an API change that means the fix couldn't be backported to 2.5-maint. We could make generating the ToC more complicated, preserving key IDs when possible; that may not be too difficult, though the code might be messy. Maybe it's best to just catch this error condition: save the size of the mailbox, updating it in _append_message(), and then make .flush() raise an exception if the mailbox size has unexpectedly changed. ---------------------------------------------------------------------- Comment By: David Watson (baikie) Date: 2006-12-16 14:09 Message: Logged In: YES user_id=1504904 Originator: YES Yes, I see what you mean. I had tried multiple flushes, but only inside a single lock/unlock. But this means that in the no-truncate() code path, even this is currently unsafe, as the lock is momentarily released after flushing. I think _toc should be regenerated after every lock(), as with the risk of another process replacing/deleting/rearranging the messages, it isn't valid to carry sequence numbers from one locked period to another anyway, or from unlocked to locked. However, this runs the risk of dangerously breaking code that thinks it is valid to do so, even in the case where the mailbox was *not* modified (i.e. turning possible failure into certain failure). For instance, if the program removes message 1, then as things stand, the key "1" is no longer used, and removing message 2 will remove the message that followed 1. If _toc is regenerated in between, however (using the current code, so that the messages are renumbered from 0), then the old message 2 becomes message 1, and removing message 2 will therefore remove the wrong message. You'd also have things like pending deletions and replacements (also unsafe generally) being forgotten. So it would take some work to get right, if it's to work at all... ---------------------------------------------------------------------- Comment By: A.M. Kuchling (akuchling) Date: 2006-12-15 09:06 Message: Logged In: YES user_id=11375 Originator: NO I'm testing the fix using two Python processes running mailbox.py, and my test case fails even with your patch. This is due to another bug, even in the patched version. mbox has a dictionary attribute, _toc, mapping message keys to positions in the file. flush() writes out all the messages in self._toc and constructs a new _toc with the new file offsets. It doesn't re-read the file to see if new messages were added by another process. One fix that seems to work: instead of doing 'self._toc = new_toc' after flush() has done its work, do self._toc = None. The ToC will be regenerated the next time _lookup() is called, causing a re-read of all the contents of the mbox. Inefficient, but I see no way around the necessity for doing this. It's not clear to me that my suggested fix is enough, though. Process #1 opens a mailbox, reads the ToC, and the process does something else for 5 minutes. In the meantime, process #2 adds a file to the mbox. Process #1 then adds a message to the mbox and writes it out; it never notices process #2's change. Maybe the _toc has to be regenerated every time you call lock(), because at this point you know there will be no further updates to the mbox by any other process. Any unlocked usage of _toc should also really be regenerating _toc every time, because you never know if another process has added a message... but that would be really inefficient. ---------------------------------------------------------------------- Comment By: A.M. Kuchling (akuchling) Date: 2006-12-15 08:17 Message: Logged In: YES user_id=11375 Originator: NO The attached patch adds a test case to test_mailbox.py that demonstrates the problem. No modifications to mailbox.py are needed to show data loss. Now looking at the patch... File Added: mailbox-test.patch ---------------------------------------------------------------------- Comment By: A.M. Kuchling (akuchling) Date: 2006-12-12 16:04 Message: Logged In: YES user_id=11375 Originator: NO I agree with David's analysis; this is in fact a bug. I'll try to look at the patch. ---------------------------------------------------------------------- Comment By: David Watson (baikie) Date: 2006-11-19 15:44 Message: Logged In: YES user_id=1504904 Originator: YES This is a bug. The point is that the code is subverting the protection of its own fcntl locking. I should have pointed out that Postfix was still using fcntl locking, and that should have been sufficient. (In fact, it was due to its use of fcntl locking that it chose precisely the wrong moment to deliver mail.) Dot-locking does protect against this, but not every program uses it - which is precisely the reason that the code implements fcntl locking in the first place. ---------------------------------------------------------------------- Comment By: Martin v. Löwis (loewis) Date: 2006-11-19 15:02 Message: Logged In: YES user_id=21627 Originator: NO Mailbox locking was invented precisely to support this kind of operation. Why do you complain that things break if you deliberately turn off the mechanism preventing breakage? I fail to see a bug here. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1599254&group_id=5470 _______________________________________________ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com