Bug#499070: dpkg leaves system in unusable state after running out of diskspace
Hi,

On Wed, 2008-10-22 at 09:51:05 +0300, Guillem Jover wrote:
>   http://git.hadrons.org/?p=debian/dpkg.git

> The correct approach is the one in ood-abort, implementing what I guess was meant to be done with onerr_abort. What's missing is untangling onerr_abort being activated by the archives/packages processing loop when too many errors happen.

I've done that now and updated pu/ood-abort. Overall, the relevant changes from the branch are not that much code:

  6 files changed, 30 insertions(+), 7 deletions(-)

of which 10 of the inserted lines are comments or blank lines. The main issue is the new string for translation. This code should, though, be far more robust against unrecoverable error conditions. I still need to review all onerr_abort uses once more, and test what happens when onerr_abort is activated due to too many processing errors.

> The ood-unwind branch might imply fewer behaviour changes, but it's not the correct solution long term, and some side effects might still be there as the execution continues when it should not.

I guess this branch can be reduced to just the cb1bdc7d commit (even also removing the fseek call), as the rest just papers over the real problems and will not cover all situations anyway. On unrecoverable errors it becomes a mess, as the code tries to continue running even though it's not expecting to, but that behaviour has been present since forever and people have been living with it. So it would be a slight improvement, which should at least fix the reported problem of the bogus update file.

regards,
guillem
Bug#499070: dpkg leaves system in unusable state after running out of diskspace
Hi,

Hmm, all this time passed already...

On Tue, 2008-10-07 at 17:47:52 +0200, Raphael Hertzog wrote:
> On Tue, 07 Oct 2008, Guillem Jover wrote:
> > Yeah, got the code the day after that mail, but I've not found the time to test it. I guess the easiest is to change one of the function return values to the output of rand() or similar and see from there. I'll try to get to it this week.

I worked on and tested that last week, but didn't clean things up until today. I've pushed two branches, pu/ood-abort and pu/ood-unwind, with different approaches for the fix, to:

  http://git.hadrons.org/?p=debian/dpkg.git

> Just share it, others might have the time to test (using quota on an OpenVZ virtual environment, for example).

The problem is triggering the exact conditions for this bug. Being out of disk space (OOD) is not enough, and it would probably take a lot of iterations to maybe reproduce it. That's why I wanted to add the targeted random().

The initial patch (f0efc5cc in ood-unwind) I mentioned in the previous mail only fixed part of the problem. In some conditions, during error unwinding, some of the cu_ functions from src/cleanup.c get called, and those call modstatdb_note, which triggers this bug.

So the root problem here is that the onerr_abort logic is broken: the code assumes that once onerr_abort has been flagged the program will terminate, but that does not happen, and some code unexpectedly runs again. At the same time the code is trying to deal with onerr_abort in random places.

The correct approach is the one in ood-abort, implementing what I guess was meant to be done with onerr_abort. What's missing is untangling onerr_abort being activated by the archives/packages processing loop when too many errors happen.

The ood-unwind branch might imply fewer behaviour changes, but it's not the correct solution long term, and some side effects might still be there as the execution continues when it should not.

I'll finish the ood-abort code and do some testing. If I feel comfortable with that one I'll push it; otherwise I might consider ood-unwind for lenny and switch to ood-abort for squeeze. Let's see.

regards,
guillem
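As a rough illustration of the separation described in this message (an unrecoverable error ends processing at once, while "too many errors" is a distinct, ordinary stop condition of the loop), here is a minimal sketch in C. It is not dpkg's code: process_items(), demo_handler(), both enums and the max_errors parameter are hypothetical names invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-item outcome; none of these names come from dpkg. */
enum item_result { ITEM_OK, ITEM_FAILED, ITEM_FATAL };

/* Hypothetical reason why the processing loop ended. */
enum stop_reason { STOP_DONE, STOP_TOO_MANY_ERRORS, STOP_FATAL };

typedef enum item_result (*item_fn)(int item);

/* Example handler: negative items are unrecoverable, odd ones merely fail. */
enum item_result demo_handler(int item)
{
    if (item < 0)
        return ITEM_FATAL;
    return (item % 2) ? ITEM_FAILED : ITEM_OK;
}

/* Process items in order and report why the loop stopped.
 * A recoverable failure is counted and the loop moves on until the
 * error budget is used up. A fatal result returns immediately, so no
 * later item is ever touched after an unrecoverable error. */
enum stop_reason process_items(const int *items, size_t n, item_fn handle,
                               size_t max_errors, size_t *attempted)
{
    size_t failed = 0;

    assert(items != NULL && handle != NULL && attempted != NULL);
    *attempted = 0;
    for (size_t i = 0; i < n; i++) {
        (*attempted)++;
        switch (handle(items[i])) {
        case ITEM_OK:
            break;
        case ITEM_FAILED:
            if (++failed >= max_errors)
                return STOP_TOO_MANY_ERRORS;
            break;
        case ITEM_FATAL:
            return STOP_FATAL;
        }
    }
    return STOP_DONE;
}
```

The point of returning a reason instead of sharing one flag is that the caller can tell the two cases apart: on STOP_FATAL it would only clean up and exit, while on STOP_TOO_MANY_ERRORS it can still do the normal end-of-run bookkeeping.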
Bug#499070: dpkg leaves system in unusable state after running out of diskspace
On Thu, 18 Sep 2008, Guillem Jover wrote:
> > The other is that when onerr_abort is signaled dpkg should not continue processing anything anymore, it should just do whatever cleanup is required and exit. But that can probably wait until post-lenny.
>
> So this is the proper fix, and it should not be that big, probably less than 10 lines? Will cook something today or tomorrow...

Any progress?

Cheers,
-- 
Raphaël Hertzog

The French best-seller, updated for Debian Etch:
http://www.ouaza.com/livre/admin-debian/
Bug#499070: dpkg leaves system in unusable state after running out of diskspace
Hey!

On Tue, 2008-10-07 at 09:02:54 +0200, Raphael Hertzog wrote:
> On Thu, 18 Sep 2008, Guillem Jover wrote:
> > > The other is that when onerr_abort is signaled dpkg should not continue processing anything anymore, it should just do whatever cleanup is required and exit. But that can probably wait until post-lenny.
> >
> > So this is the proper fix, and it should not be that big, probably less than 10 lines? Will cook something today or tomorrow...
>
> Any progress?

Yeah, got the code the day after that mail, but I've not found the time to test it. I guess the easiest is to change one of the function return values to the output of rand() or similar and see from there. I'll try to get to it this week.

And it ended up being a 7-line patch, although I think I'll be changing part of that error recovery logic for squeeze.

regards,
guillem
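The testing idea mentioned above, making a function's return value depend on rand() so that failures can be provoked on demand, could look roughly like the following. This is only a sketch under assumptions: fault_inject_setup() and fflush_faulty() are hypothetical helpers, not part of dpkg, and ENOSPC is just one plausible error to simulate.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical knob: fail roughly one call in fault_one_in; 0 disables. */
static unsigned fault_one_in = 0;

/* Seed the generator so that a failing run can be replayed exactly. */
void fault_inject_setup(unsigned seed, unsigned one_in)
{
    srand(seed);
    fault_one_in = one_in;
}

/* Wrapper for a per-stream fflush() that sometimes pretends the disk is
 * full. The caller sees the same contract as fflush(): EOF plus errno. */
int fflush_faulty(FILE *fp)
{
    assert(fp != NULL); /* the sketch only wraps flushes of one stream */
    if (fault_one_in != 0 && (unsigned)rand() % fault_one_in == 0) {
        errno = ENOSPC;
        return EOF;
    }
    return fflush(fp);
}
```

Swapping such a wrapper in at the one call site under suspicion keeps the failure targeted, and the fixed seed makes a run that hits the bug reproducible.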
Bug#499070: dpkg leaves system in unusable state after running out of diskspace
On Tue, 07 Oct 2008, Guillem Jover wrote:
> > Any progress?
>
> Yeah, got the code the day after that mail, but I've not found the time to test it. I guess the easiest is to change one of the function return values to the output of rand() or similar and see from there. I'll try to get to it this week.

Just share it, others might have the time to test (using quota on an OpenVZ virtual environment, for example).

Cheers,
-- 
Raphaël Hertzog

The French best-seller, updated for Debian Etch:
http://www.ouaza.com/livre/admin-debian/
Bug#499070: dpkg leaves system in unusable state after running out of diskspace
Hey,

On Tue, 2008-09-16 at 23:14:04 +0300, Guillem Jover wrote:
> So I think there are two things to fix here. One is that the fseek() in createimptmp() should be done just before the fwrite() in modstatdb_note_core(), to guarantee that we are always going to be at the beginning, and that we make proper use of the reserved space allocated previously with those '#padding' lines to avoid the out of space condition. That's a 4-liner patch, which should be fine for lenny, and prevents this bogus condition where the user is most probably going to remove that file to be able to continue, which might produce an inconsistent state in the dpkg db. And the real problem is that there might not be an easy manual fix by the users if the status data could not be completely written to the update file.

I take this back: the fseek() needs to be there to guarantee that the data has been written. So moving it would only help partially, as we would always be writing at the beginning, but the fwrite() might not have enough space from the reserved padding, and the data might get truncated.

> The other is that when onerr_abort is signaled dpkg should not continue processing anything anymore, it should just do whatever cleanup is required and exit. But that can probably wait until post-lenny.

So this is the proper fix, and it should not be that big, probably less than 10 lines? Will cook something today or tomorrow...

regards,
guillem
Bug#499070: dpkg leaves system in unusable state after running out of diskspace
forcemerge 499070 497041
thanks

On Tue, 2008-09-16 at 01:06:05 +0200, Alexander Prinsier wrote:
> Package: dpkg
> Version: 1.14.22
> Severity: serious
>
> I was installing php5 while my system ran out of disk space. dpkg breaks, and leaves the system in a state where I can no longer use dpkg.
>
> To find out the version number of dpkg I used /var/log/dpkg.log, as I couldn't use dpkg to find its own version number... Hope it's the correct one.
>
> This is what happened:
>
> [...]
> Selecting previously deselected package libjpeg62.
> Unpacking libjpeg62 (from .../libjpeg62_6b-14_i386.deb) ...
> dpkg: error processing /var/cache/apt/archives/libjpeg62_6b-14_i386.deb (--unpack):
>  failed in buffer_write(fd) (10, ret=-1): backend dpkg-deb during `./usr/share/doc/libjpeg62/copyright': Disk quota exceeded
> Selecting previously deselected package libdjvulibre21.
> Unpacking libdjvulibre21 (from .../libdjvulibre21_3.5.20-8_i386.deb) ...
> dpkg: error processing /var/cache/apt/archives/libdjvulibre21_3.5.20-8_i386.deb (--unpack):
>  failed in buffer_write(fd) (10, ret=-1): backend dpkg-deb during `./usr/lib/libdjvulibre.so.21.0.0': Disk quota exceeded
> dpkg-deb: subprocess paste killed by signal (Broken pipe)
> Selecting previously deselected package libxpm4.
> Unpacking libxpm4 (from .../libxpm4_1%3a3.5.7-1_i386.deb) ...
> Selecting previously deselected package libgd2-xpm.
> Unpacking libgd2-xpm (from .../libgd2-xpm_2.0.36~rc1~dfsg-3_i386.deb) ...
> dpkg: error processing /var/cache/apt/archives/libgd2-xpm_2.0.36~rc1~dfsg-3_i386.deb (--unpack):
>  unable to flush /var/lib/dpkg/updates/tmp.i after padding: Disk quota exceeded

Here's where the problem starts. That «tmp.i» file is the one that has been filled with '#padding' lines. The fflush() in createimptmp() fails and it ohshite()s, just before it would have rewound the file to the beginning.

> Processing triggers for man-db ...
> /usr/bin/mandb: can't write to /var/cache/man/19933: Disk quota exceeded
> gdbm fatal: read error
> dpkg: failed to write status record about `libcairo2' to `/var/lib/dpkg/status': Disk quota exceeded
> E: Sub-process /usr/bin/dpkg returned an error code (2)
> A package failed to install. Trying to recover:
> dpkg: parse error, in file `/var/lib/dpkg/updates/0119' near line 1:
>  newline in field name `#padding'
> Press return to continue.

The second problem is that even though the ohshite() called from createimptmp() is protected inside an onerr_abort section, the execution continues in archivefiles(), out of process_archive() but into process_queue(). At some point that calls modstatdb_note(), which moves the «tmp.i» file into the «0119» one. Then another ohshite() is called and the onerr_abort is sensed again, making process_queue() terminate its loop.

> # dpkg --configure -a
> dpkg: parse error, in file `/var/lib/dpkg/updates/0119' near line 1:
>  newline in field name `#padding'
> #

But at the point that modstatdb_note() was called, the function wrote at the end of the file, preserving all the '#padding' lines and making dpkg barf subsequently.

So I think there are two things to fix here. One is that the fseek() in createimptmp() should be done just before the fwrite() in modstatdb_note_core(), to guarantee that we are always going to be at the beginning, and that we make proper use of the reserved space allocated previously with those '#padding' lines to avoid the out of space condition. That's a 4-liner patch, which should be fine for lenny, and prevents this bogus condition where the user is most probably going to remove that file to be able to continue, which might produce an inconsistent state in the dpkg db. And the real problem is that there might not be an easy manual fix by the users if the status data could not be completely written to the update file.

The other is that when onerr_abort is signaled dpkg should not continue processing anything anymore, it should just do whatever cleanup is required and exit. But that can probably wait until post-lenny.

regards,
guillem
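The padding scheme discussed in this message (reserve disk space up front with '#padding' lines, then overwrite it from the start of the file) can be sketched as below. This is a simplified illustration, not the dpkg implementation: reserve_padding(), write_record() and the PAD_LINES value are hypothetical, and the sketch assumes a POSIX system for fileno() and ftruncate(). It also assumes that a record larger than the reserved space should be refused rather than written partially.

```c
#define _POSIX_C_SOURCE 200809L /* for fileno() and ftruncate() */

#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define PAD_LINE  "#padding\n"
#define PAD_LINES 64 /* hypothetical amount of reserved space */

/* Reserve space by filling the file with padding lines and flushing
 * them. Returns 0 on success, -1 if the space could not be reserved
 * (for example because the disk or the quota is exhausted). */
int reserve_padding(FILE *fp)
{
    assert(fp != NULL);
    for (int i = 0; i < PAD_LINES; i++)
        if (fputs(PAD_LINE, fp) == EOF)
            return -1;
    return fflush(fp) == 0 ? 0 : -1;
}

/* Write the record over the reserved space. The rewind happens here,
 * right before the write, so the record can never end up appended
 * after leftover padding. Returns 0 on success, -1 on any failure. */
int write_record(FILE *fp, const char *record)
{
    assert(fp != NULL && record != NULL);
    size_t len = strlen(record);
    size_t reserved = PAD_LINES * (sizeof(PAD_LINE) - 1);

    if (len > reserved)
        return -1; /* would need space that was never reserved */
    if (fseek(fp, 0L, SEEK_SET) != 0)
        return -1;
    if (fwrite(record, 1, len, fp) != len || fflush(fp) != 0)
        return -1;
    /* Drop the padding that the record did not overwrite. */
    return ftruncate(fileno(fp), (off_t)len) == 0 ? 0 : -1;
}
```

With this ordering, a failure in reserve_padding() leaves a file that nothing later writes into without rewinding first, which is the property the bogus «0119» file in the report was missing.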