Bugs item #1599254, was opened at 2006-11-19 11:03 Message generated for change (Comment added) made by akuchling You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1599254&group_id=5470
Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: Python Library Group: Python 2.5 Status: Open Resolution: None >Priority: 7 Private: No Submitted By: David Watson (baikie) Assigned to: A.M. Kuchling (akuchling) Summary: mailbox: other programs' messages can vanish without trace Initial Comment: The mailbox classes based on _singlefileMailbox (mbox, MMDF, Babyl) implement the flush() method by writing the new mailbox contents into a temporary file which is then renamed over the original. Unfortunately, if another program tries to deliver messages while mailbox.py is working, and uses only fcntl() locking, it will have the old file open and be blocked waiting for the lock to become available. Once mailbox.py has replaced the old file and closed it, making the lock available, the other program will write its messages into the now-deleted "old" file, consigning them to oblivion. I've caused Postfix on Linux to lose mail this way (although I did have to turn off its use of dot-locking to do so). A possible fix is attached. Instead of new_file being renamed, its contents are copied back to the original file. If file.truncate() is available, the mailbox is then truncated to size. Otherwise, if truncation is required, it's truncated to zero length beforehand by reopening self._path with mode wb+. In the latter case, there's a check to see if the mailbox was replaced while we weren't looking, but there's still a race condition. Any alternative ideas? Incidentally, this fixes a problem whereby Postfix wouldn't deliver to the replacement file as it had the execute bit set. ---------------------------------------------------------------------- >Comment By: A.M. Kuchling (akuchling) Date: 2007-01-22 10:46 Message: Logged In: YES user_id=11375 Originator: NO This would be an API change, and therefore out-of-bounds for 2.5. I suggest giving up on this for 2.5.1 and only fixing it in 2.6. I'll add another warning to the docs, and maybe to the module as well. ---------------------------------------------------------------------- Comment By: David Watson (baikie) Date: 2007-01-21 17:10 Message: Logged In: YES user_id=1504904 Originator: YES Hold on, I have a plan. If _toc is only regenerated on locking, or at the end of a flush(), then the only way self._pending can be set at that time is if the application has made modifications before calling lock(). If we make that an exception-raising offence, then we can assume that self._toc is a faithful representation of the last known contents of the file. That means we can preserve the existing message keys on a reread without any of that _user_toc nonsense. Diff attached, to apply on top of mailbox-unified2. It's probably had even less review and testing than the previous version, but it appears to pass all the regression tests and doesn't change any existing semantics. File Added: mailbox-update-toc-new.diff ---------------------------------------------------------------------- Comment By: A.M. Kuchling (akuchling) Date: 2007-01-20 22:16 Message: Logged In: YES user_id=11375 Originator: NO <sigh> I'm starting to lose track of all the variations on the bug. Maybe we should just add more warnings to the documentation about locking the mailbox when modifying it and not try to fix this at all. ---------------------------------------------------------------------- Comment By: David Watson (baikie) Date: 2007-01-20 13:20 Message: Logged In: YES user_id=1504904 Originator: YES Hang on. If a message's key changes after recreating _toc, that does not mean that another process has modified the mailbox. If the application removes a message and then (inadvertently) causes _toc to be regenerated, the keys of all subsequent messages will be decremented by one, due only to the application's own actions. That's what happens in the "broken locking" test case: the program intends to remove message 0, flush, and then remove message 1, but because _toc is regenerated in between, message 1 is renumbered as 0, message 2 is renumbered as 1, and so the program deletes message 2 instead. To clear _toc in such code without attempting to preserve the message keys turns possible data loss (in the case that another process modified the mailbox) into certain data loss. That's what I'm concerned about. ---------------------------------------------------------------------- Comment By: A.M. Kuchling (akuchling) Date: 2007-01-19 10:24 Message: Logged In: YES user_id=11375 Originator: NO After reflection, I don't think the potential changing actually makes things any worse. _generate() always starts numbering keys with 1, so if a message's key changes because of lock()'s, re-reading, that means someone else has already modified the mailbox. Without the ToC clearing, you're already fated to have a corrupted mailbox because the new mailbox will be written using outdated file offsets. With the ToC clearing, you delete the wrong message. Neither outcome is good, but data is lost either way. The new behaviour is maybe a little bit better in that you're losing a single message but still generating a well-formed mailbox, and not a randomly jumbled mailbox. I suggest applying the patch to clear self._toc, and noting in the documentation that keys might possibly change after doing a lock(). ---------------------------------------------------------------------- Comment By: David Watson (baikie) Date: 2007-01-18 13:15 Message: Logged In: YES user_id=1504904 Originator: YES This version passes the new tests (it fixes the length checking bug, and no longer clears self._toc), but goes back to failing test_concurrent_add. File Added: mailbox-unified2-module.diff ---------------------------------------------------------------------- Comment By: David Watson (baikie) Date: 2007-01-18 13:14 Message: Logged In: YES user_id=1504904 Originator: YES Unfortunately, there is a problem with clearing _toc: it's basically the one I alluded to in my comment of 2006-12-16. Back then I thought it could be caught by the test you issue the warning for, but the application may instead do its second remove() *after* the lock() (so that self._pending is not set at lock() time), using the key carried over from before it called unlock(). As before, this would result in the wrong message being deleted. I've added a test case for this (diff attached), and a bug I found in the process whereby flush() wasn't updating the length, which could cause subsequent flushes to fail (I've got a fix for this). These seem to have turned up a third bug in the MH class, but I haven't looked at that yet. File Added: mailbox-unified2-test.diff ---------------------------------------------------------------------- Comment By: A.M. Kuchling (akuchling) Date: 2007-01-17 16:06 Message: Logged In: YES user_id=11375 Originator: NO Add mailbox-unified-patch. File Added: mailbox-unified-patch.diff ---------------------------------------------------------------------- Comment By: A.M. Kuchling (akuchling) Date: 2007-01-17 16:05 Message: Logged In: YES user_id=11375 Originator: NO mailbox-pending-lock is the difference between David's copy-back-new + fcntl-warning and my -unified patch, uploaded so that he can comment on the changes. The only change is to make _singleFileMailbox.lock() clear self._toc, forcing a re-read of the mailbox file. If _pending is true, the ToC isn't cleared and a warning is logged. I think this lets existing code run (albeit possibly with a warning if the mailbox is modified before .lock() is called), but fixes the risk of missing changes written by another process. Triggering a new warning is sort of an API change, but IMHO it's still worth backporting; code that triggers this warning is at risk of losing messages or corrupting the mailbox. Clearing the _toc on .lock() is also sort of an API change, but it's necessary to avoid the potential data loss. It may lead to some redundant scanning of mailbox files, but this price is worth paying, I think; people probably don't have 2Gb mbox files (I hope not, anyway!) and no extra read is done if you create the mailbox and immediately lock it before looking anything up. Neal: if you want to discuss this patch or want an explanation of something, feel free to chat with me about it. I'll wait a day or two and see if David spots any problems. If nothing turns up, I'll commit it to both trunk and release25-maint. File Added: mailbox-pending-lock.diff ---------------------------------------------------------------------- Comment By: A.M. Kuchling (akuchling) Date: 2007-01-17 15:53 Message: Logged In: YES user_id=11375 Originator: NO mailbox-unified-patch contains David's copy-back-new and fcntl-warn patches, plus the test-mailbox patch and some additional changes to mailbox.py from me. (I'll upload a diff to David's diffs in a minute.) This is the change I'd like to check in. test_mailbox.py now passes, as does the mailbox-break.py script I'm using. ---------------------------------------------------------------------- Comment By: A.M. Kuchling (akuchling) Date: 2007-01-17 14:56 Message: Logged In: YES user_id=11375 Originator: NO Committed a modified version of the doc. patch in rev. 53472 (trunk) and rev. 53474 (release25-maint). ---------------------------------------------------------------------- Comment By: Neal Norwitz (nnorwitz) Date: 2007-01-17 01:48 Message: Logged In: YES user_id=33168 Originator: NO Andrew, do you need any help with this? ---------------------------------------------------------------------- Comment By: A.M. Kuchling (akuchling) Date: 2007-01-15 14:01 Message: Logged In: YES user_id=11375 Originator: NO Comment from Andrew MacIntyre (os2vacpp is the OS/2 that lacks ftruncate()): ================ I actively support the OS/2 EMX port (sys.platform == "os2emx"; build directory is PC/os2emx). I would like to keep the VAC++ port alive, but the reality is I don't have the resources to do so. The VAC++ port was the subject of discussion about removal of build support support from the source tree for 2.6 - I don't recall there being a definitive outcome, but if someone does delete the PC/os2vacpp directory, I'm not in a position to argue. AMK: (mailbox.py has a separate section of code used when file.truncate() isn't available, and the existence of this section is bedevilling me. It would be convenient if platforms without file.truncate() weren't a factor; then this branch could just be removed. In your opinion, would it hurt OS/2 users of mailbox.py if support for platforms without file.truncate() was removed?) aimacintyre: No. From what documentation I can quickly check, ftruncate() operates on file descriptors rather than FILE pointers. As such I am sure that, if it became an issue, it would not be difficult to write a ftruncate() emulation wrapper for the underlying OS/2 APIs that implement the required functionality. ---------------------------------------------------------------------- Comment By: David Watson (baikie) Date: 2007-01-13 13:32 Message: Logged In: YES user_id=1504904 Originator: YES I like the warning idea - it seems appropriate if the problem is relatively rare. How about this? File Added: mailbox-fcntl-warn.diff ---------------------------------------------------------------------- Comment By: A.M. Kuchling (akuchling) Date: 2007-01-12 14:41 Message: Logged In: YES user_id=11375 Originator: NO One OS/2 port lacks truncate(), and so does RISCOS. ---------------------------------------------------------------------- Comment By: A.M. Kuchling (akuchling) Date: 2007-01-12 13:41 Message: Logged In: YES user_id=11375 Originator: NO I realized that making flush() invalidate keys breaks the final example in the docs, which loops over inbox.iterkeys() and removes messages, doing a pack() after each message. Which platforms lack file.truncate()? Windows has it; POSIX has it, so modern Unix variants should all have it. Maybe mailbox should simply raise an exception (or trigger a warning?) if truncate is missing, and we should then assume that flush() has no effect upon keys. ---------------------------------------------------------------------- Comment By: A.M. Kuchling (akuchling) Date: 2007-01-12 12:12 Message: Logged In: YES user_id=11375 Originator: NO So shall we document flush() as invalidating keys, then? ---------------------------------------------------------------------- Comment By: David Watson (baikie) Date: 2007-01-06 14:57 Message: Logged In: YES user_id=1504904 Originator: YES Oops, length checking had made the first two lines of this patch redundant; update-toc applies OK with fuzz. File Added: mailbox-copy-back-new.diff ---------------------------------------------------------------------- Comment By: David Watson (baikie) Date: 2007-01-06 10:30 Message: Logged In: YES user_id=1504904 Originator: YES File Added: mailbox-copy-back-53287.diff ---------------------------------------------------------------------- Comment By: David Watson (baikie) Date: 2007-01-06 10:24 Message: Logged In: YES user_id=1504904 Originator: YES Aack, yes, that should be _next_user_key. Attaching a fixed version. I've been thinking, though: flush() does in fact invalidate the keys on platforms without a file.truncate(), when the fcntl() lock is momentarily released afterwards. It seems hard to avoid this as, perversely, fcntl() locks are supposed to be released automatically on all file descriptors referring to the file whenever the process closes any one of them - even one the lock was never set on. So, code using mailbox.py such platforms could inadvertently be carrying keys across an unlocked period, which is not made safe by the update-toc patch (as it's only meant to avert disasters resulting from doing this *and* rebuilding the table of contents, *assuming* that another process hasn't deleted or rearranged messages). File Added: mailbox-update-toc-fixed.diff ---------------------------------------------------------------------- Comment By: A.M. Kuchling (akuchling) Date: 2007-01-05 14:51 Message: Logged In: YES user_id=11375 Originator: NO Question about mailbox-update-doc: the add() method still returns self._next_key - 1; should this be self._next_user_key - 1? The keys in _user_toc are the ones returned to external users of the mailbox, right? (A good test case would be to initialize _next_key to 0 and _next_user_key to a different value like 123456.) I'm still staring at the patch, trying to convince myself that it will help -- haven't spotted any problems, but this bug is making me nervous... ---------------------------------------------------------------------- Comment By: A.M. Kuchling (akuchling) Date: 2007-01-05 14:24 Message: Logged In: YES user_id=11375 Originator: NO As a step toward improving matters, I've attached the suggested doc patch (for both 25-maint and trunk). It encourages people to use Maildir :), explicitly states that modifications should be bracketed by lock(), and fixes the examples to match. It does not say that keys are invalidated by doing a flush(), because we're going to try to avoid the necessity for that. File Added: mailbox-docs.diff ---------------------------------------------------------------------- Comment By: A.M. Kuchling (akuchling) Date: 2006-12-20 14:48 Message: Logged In: YES user_id=11375 Originator: NO Committed length-checking.diff to trunk in rev. 53110. ---------------------------------------------------------------------- Comment By: David Watson (baikie) Date: 2006-12-20 14:19 Message: Logged In: YES user_id=1504904 Originator: YES File Added: mailbox-test-lock.diff ---------------------------------------------------------------------- Comment By: David Watson (baikie) Date: 2006-12-20 14:17 Message: Logged In: YES user_id=1504904 Originator: YES Yeah, I think that should definitely go in. ExternalClashError or a subclass sounds fine to me (although you could make a whole taxonomy of these errors, really). It would be good to have the code actually keep up with other programs' changes, though; a program might just want to count the messages at first, say, and not make changes until much later. I've been trying out the second option (patch attached, to apply on top of mailbox-copy-back), regenerating _toc on locking, but preserving existing keys. The patch allows existing _generate_toc()s to work unmodified, but means that _toc now holds the entire last known contents of the mailbox file, with the 'pending' (user-visible) mailbox state being held in a new attribute, _user_toc, which is a mapping from keys issued to the program to the keys of _toc (i.e. sequence numbers in the file). When _toc is updated, any new messages that have appeared are given keys in _user_toc that haven't been issued before, and any messages that have disappeared are removed from it. The code basically assumes that messages with the same sequence number are the same message, though, so even if most cases are caught by the length check, programs that make deletions/replacements before locking could still delete the wrong messages. This behaviour could be trapped, though, by raising an exception in lock() if self._pending is set (after all, code like that would be incorrect unless it could be assumed that the mailbox module kept hashes of each message or something). Also attached is a patch to the test case, adding a lock/unlock around the message count to make sure _toc is up-to-date if the parent process finishes first; without it, there are still intermittent failures. File Added: mailbox-update-toc.diff ---------------------------------------------------------------------- Comment By: A.M. Kuchling (akuchling) Date: 2006-12-20 09:46 Message: Logged In: YES user_id=11375 Originator: NO Attaching a patch that adds length checking: before doing a flush() on a single-file mailbox, seek to the end and verify its length is unchanged. It raises an ExternalClashError if the file's length has changed. (Should there be a different exception for this case, perhaps a subclass of ExternalClashError?) I verified that this change works by running a program that added 25 messages, pausing between each one, and then did 'echo "new line" > /tmp/mbox' from a shell while the program was running. I also noticed that the self._lookup() call in self.flush() wasn't necessary, and replaced it by an assertion. I think this change should go on both the trunk and 25-maint branches. File Added: length-checking.diff ---------------------------------------------------------------------- Comment By: A.M. Kuchling (akuchling) Date: 2006-12-18 12:43 Message: Logged In: YES user_id=11375 Originator: NO Eep, correct; changing the key IDs would be a big problem for existing code. We could say 'discard all keys' after doing lock() or unlock(), but this is an API change that means the fix couldn't be backported to 2.5-maint. We could make generating the ToC more complicated, preserving key IDs when possible; that may not be too difficult, though the code might be messy. Maybe it's best to just catch this error condition: save the size of the mailbox, updating it in _append_message(), and then make .flush() raise an exception if the mailbox size has unexpectedly changed. ---------------------------------------------------------------------- Comment By: David Watson (baikie) Date: 2006-12-16 14:09 Message: Logged In: YES user_id=1504904 Originator: YES Yes, I see what you mean. I had tried multiple flushes, but only inside a single lock/unlock. But this means that in the no-truncate() code path, even this is currently unsafe, as the lock is momentarily released after flushing. I think _toc should be regenerated after every lock(), as with the risk of another process replacing/deleting/rearranging the messages, it isn't valid to carry sequence numbers from one locked period to another anyway, or from unlocked to locked. However, this runs the risk of dangerously breaking code that thinks it is valid to do so, even in the case where the mailbox was *not* modified (i.e. turning possible failure into certain failure). For instance, if the program removes message 1, then as things stand, the key "1" is no longer used, and removing message 2 will remove the message that followed 1. If _toc is regenerated in between, however (using the current code, so that the messages are renumbered from 0), then the old message 2 becomes message 1, and removing message 2 will therefore remove the wrong message. You'd also have things like pending deletions and replacements (also unsafe generally) being forgotten. So it would take some work to get right, if it's to work at all... ---------------------------------------------------------------------- Comment By: A.M. Kuchling (akuchling) Date: 2006-12-15 09:06 Message: Logged In: YES user_id=11375 Originator: NO I'm testing the fix using two Python processes running mailbox.py, and my test case fails even with your patch. This is due to another bug, even in the patched version. mbox has a dictionary attribute, _toc, mapping message keys to positions in the file. flush() writes out all the messages in self._toc and constructs a new _toc with the new file offsets. It doesn't re-read the file to see if new messages were added by another process. One fix that seems to work: instead of doing 'self._toc = new_toc' after flush() has done its work, do self._toc = None. The ToC will be regenerated the next time _lookup() is called, causing a re-read of all the contents of the mbox. Inefficient, but I see no way around the necessity for doing this. It's not clear to me that my suggested fix is enough, though. Process #1 opens a mailbox, reads the ToC, and the process does something else for 5 minutes. In the meantime, process #2 adds a file to the mbox. Process #1 then adds a message to the mbox and writes it out; it never notices process #2's change. Maybe the _toc has to be regenerated every time you call lock(), because at this point you know there will be no further updates to the mbox by any other process. Any unlocked usage of _toc should also really be regenerating _toc every time, because you never know if another process has added a message... but that would be really inefficient. ---------------------------------------------------------------------- Comment By: A.M. Kuchling (akuchling) Date: 2006-12-15 08:17 Message: Logged In: YES user_id=11375 Originator: NO The attached patch adds a test case to test_mailbox.py that demonstrates the problem. No modifications to mailbox.py are needed to show data loss. Now looking at the patch... File Added: mailbox-test.patch ---------------------------------------------------------------------- Comment By: A.M. Kuchling (akuchling) Date: 2006-12-12 16:04 Message: Logged In: YES user_id=11375 Originator: NO I agree with David's analysis; this is in fact a bug. I'll try to look at the patch. ---------------------------------------------------------------------- Comment By: David Watson (baikie) Date: 2006-11-19 15:44 Message: Logged In: YES user_id=1504904 Originator: YES This is a bug. The point is that the code is subverting the protection of its own fcntl locking. I should have pointed out that Postfix was still using fcntl locking, and that should have been sufficient. (In fact, it was due to its use of fcntl locking that it chose precisely the wrong moment to deliver mail.) Dot-locking does protect against this, but not every program uses it - which is precisely the reason that the code implements fcntl locking in the first place. ---------------------------------------------------------------------- Comment By: Martin v. Löwis (loewis) Date: 2006-11-19 15:02 Message: Logged In: YES user_id=21627 Originator: NO Mailbox locking was invented precisely to support this kind of operation. Why do you complain that things break if you deliberately turn off the mechanism preventing breakage? I fail to see a bug here. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1599254&group_id=5470 _______________________________________________ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com