Bug#239111: Grub is shockingly bad code

2009-01-13 Thread Robert Millan
tags 239111 patch
thanks

Hi,

Please could you try the attached patch, and confirm that it works?  Thanks

-- 
Robert Millan

  The DRM opt-in fallacy: "Your data belongs to us. We will decide when (and
  how) you may access your data; but nobody's threatening your freedom: we
  still allow you to remove your data and not access it at all."
diff -ur grub-0.97/util/grub-install.in grub-0.97.new/util/grub-install.in
--- grub-0.97/util/grub-install.in	2004-07-24 20:57:31.0 +0200
+++ grub-0.97.new/util/grub-install.in	2009-01-13 12:18:43.0 +0100
@@ -450,6 +450,17 @@
 # Create a safe temporary file.
 test -n "$mklog" && log_file=`$mklog`
 
+# GRUB will try to verify that stage2 is accessible using its own
+# filesystem drivers.  Make sure it's committed to disk.
+sync
+
+# On XFS, sync() is not enough.
+if [ `grub-probe -t fs ${grubdir}` = "xfs" ] ; then
+  xfs_freeze -f ${grubdir} && xfs_freeze -u ${grubdir}
+  # We don't have set -e.  If xfs_freeze failed, it's worth trying anyway,
+  # maybe we're lucky.
+fi
+
 # Now perform the installation.
 $grub_shell --batch $no_floppy --device-map=$device_map <$log_file
 root $root_drive


Bug#239111: Grub is shockingly bad code

2009-01-12 Thread Robert Millan
On Mon, Jan 12, 2009 at 07:57:44PM +, Robert McQueen wrote:
> I didn't test that, I was basing on my experiences of installing GRUB
> manually. I usually do:
> 
> mkdir /boot/grub
> cp -a /usr/lib/grub/i386-pc/* /boot/grub
> vi /boot/grub/device.map
> grub --device-map=/boot/grub/device.map
> # doesn't work because XFS hasn't put /boot/grub onto disk ...
> # swear a bit
> xfs_freeze -f /
> xfs_freeze -u /
> grub --device-map=/boot/grub/device.map
> # works
> update-grub
> # makes me a menu.lst file
> 
> This never froze for me.

I wasn't aware you were running xfs_freeze by hand.

But I don't trust that this won't hang the kernel.  See the link I pasted in
the previous mail.

This situation is very unfortunate.  If we call xfs_freeze, we risk hanging
Linux.  If we don't, we risk the file being inaccesible.  As you point
out, an error message is preferable to hanging the kernel, but it's a
certain error message against a possible but unlikely hang.  Maybe we should
go with the latter then?

In any case, both are better than a certain hang which is what we have now.

> > And if we _really_ need xfs_freeze, it means we're doing something wrong.
> 
> I thought the wrongness was well known: the GRUB shell tries to read the
> block device directly and traverse the filesystem directories itself to
> to find which blocks on the disk are where the stage* files are located.

No, this is a missunderstanding (which I got misled with, too).  GRUB only
tries to verify that the file will be readable later on (that is, when
Linux isn't running).

stage1 is the MBR.  stage1.5 is supposed to be embedded after the MBR.
stage2 is in the filesystem, and only accessed using the filesystem driver
in stage1.5.

> XFS considers putting the metadata into the journal an acceptable
> implementation of sync(), but grub's filesystem reading code ignores the
> journal, so can't find the files. If the shell used FIBMAP ioctl to find
> the blocks when running under the OS, it'd always find them and all
> would be sweetness and light.

We can't do that.  Using kernel facilities to find the blocks would defeat
the point of verifiing that GRUB can read the file.

-- 
Robert Millan

  The DRM opt-in fallacy: "Your data belongs to us. We will decide when (and
  how) you may access your data; but nobody's threatening your freedom: we
  still allow you to remove your data and not access it at all."



-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org



Bug#239111: Grub is shockingly bad code

2009-01-12 Thread Robert Millan
On Mon, Jan 12, 2009 at 08:13:52PM +, Robert McQueen wrote:
> >   - sync is not enough
> > (see http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=239111#53)
> 
> Correct. This is a property of XFS. As I said, it considers that putting
> metadata into the journal, and knowing it's flushed there, to meet the
> API guarantee of sync(). sync() doesn't make any guarantee that the
> files can be found by an an incomplete implementation of the filesystem
> (such as GRUB's, which ignores the journal).

There was a conscious decision to avoid implementing journaling.  I don't
remember why;  code size, I guess.

> >   - xfs_freeze could hang even if you don't write anything
> > (see http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=239111#194)
> 
> I believe this report is unreliable, perhaps it froze because it's made
> an echo in a process whose stdout is being logged?

Good catch.  Ok, then I'm fine with that approach.

> I will try to find some time to test
> it as well as Ben's simple freeze; unfreeze patch which I prefer,

You mean the one in #242 ?  This looks good to me.

Btw, Ben, could you test your patch on GRUB 2 too?  It probably has exactly
the same problem.

-- 
Robert Millan

  The DRM opt-in fallacy: "Your data belongs to us. We will decide when (and
  how) you may access your data; but nobody's threatening your freedom: we
  still allow you to remove your data and not access it at all."



-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org



Bug#239111: Grub is shockingly bad code

2009-01-12 Thread Robert McQueen
Robert Millan wrote:
> It would seem that running sync would suffice for that.  Unfortunately, it
> seems that:
> 
>   - sync is not enough
> (see http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=239111#53)

Correct. This is a property of XFS. As I said, it considers that putting
metadata into the journal, and knowing it's flushed there, to meet the
API guarantee of sync(). sync() doesn't make any guarantee that the
files can be found by an an incomplete implementation of the filesystem
(such as GRUB's, which ignores the journal).

>   - xfs_freeze could hang even if you don't write anything
> (see http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=239111#194)

I believe this report is unreliable, perhaps it froze because it's made
an echo in a process whose stdout is being logged?

> This is odd.  I think I'm going to assume sync does what is expected from
> it.  If it doesn't maybe it's a bug in Linux, which maybe got fixed already,
> or maybe it's a (lesser) bug in GRUB which we can't find untill it's
> exposed.

It's not a bug, it's the behaviour of XFS which I've explained a few
times already. The bug in GRUB is trying to read the filesystem at a
block level, behind the kernel's back.

> In any case, we're better off not using xfs_freeze untill we know for sure
> why it is needed, and are certain it won't cause hangs.

I know why it's needed, and have done for around 4 years, and have had
this confirmed from the XFS developers too.

> I'm waiting for Ron's confirmation that this works for him, and then I'll
> remove the patch & upload.

I'm pretty sure it won't work reliably without some xfs_freeze, so I
still prefer including the patch. I will try to find some time to test
it as well as Ben's simple freeze; unfreeze patch which I prefer, as I
believe it will make things work in lenny's kernel (although not etch,
where freeze is sadly broken, which is why the patch in #179 has such
horrible looping sleep/freeze hacks anyway).

Cheers,
Rob



-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org



Bug#239111: Grub is shockingly bad code

2009-01-12 Thread Robert McQueen
Robert Millan wrote:
> Grmf.  I was making wrong assumptions.  This is not about block lists
> (I still think block lists suck, but let's be fair...):
...
> So we freeze the filesystem and afterwards try to write to it.  Not a
> good idea...

Indeed not.

> #239111 initial report claims GRUB hangs before we added
> xfs_freeze.diff, but according to what Rob says, not freezing makes it
> work for him.

I didn't test that, I was basing on my experiences of installing GRUB
manually. I usually do:

mkdir /boot/grub
cp -a /usr/lib/grub/i386-pc/* /boot/grub
vi /boot/grub/device.map
grub --device-map=/boot/grub/device.map
# doesn't work because XFS hasn't put /boot/grub onto disk ...
# swear a bit
xfs_freeze -f /
xfs_freeze -u /
grub --device-map=/boot/grub/device.map
# works
update-grub
# makes me a menu.lst file

This never froze for me.

> Rob, can you test removing the patch completely?

I suppose, it would at least make it fail rather than fuck my system,
but based on my understanding of the problem, won't make installations
on XFS work reliably.

> I'm almost certain it's the fwrite() call in install_func.  I could be
> wrong, but I still don't see why we would want to use xfs_freeze anyway.

You use xfs_freeze to force XFS to flush the metadata for the newly
created stage* files out of the journal and into the directory entries
themselves. sync() puts the data on disk and the metadata in the
journal, but xfs_freeze gives the extra guarantee of the journal being
clean and empty so the filesystem can be shapshotted and mounted as
read-only (with no journal replay).

> And if we _really_ need xfs_freeze, it means we're doing something wrong.

I thought the wrongness was well known: the GRUB shell tries to read the
block device directly and traverse the filesystem directories itself to
to find which blocks on the disk are where the stage* files are located.
XFS considers putting the metadata into the journal an acceptable
implementation of sync(), but grub's filesystem reading code ignores the
journal, so can't find the files. If the shell used FIBMAP ioctl to find
the blocks when running under the OS, it'd always find them and all
would be sweetness and light. Thats what Steve was trying to do, I
think, before he gave up and sent the OP on this thread.

Regards,
Rob



-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org



Bug#239111: Grub is shockingly bad code

2009-01-12 Thread Robert Millan
On Mon, Jan 12, 2009 at 08:28:19PM +0100, Robert Millan wrote:
> 
> It's not the log file.  Joeyh tried that (see the bug log).
> 
> I'm almost certain it's the fwrite() call in install_func.  I could be
> wrong, but I still don't see why we would want to use xfs_freeze anyway.
> 
> And if we _really_ need xfs_freeze, it means we're doing something wrong.

Ah, I remember.  There was a sanity check.  The GRUB shell tries to read the
file using its own filesystem driver, to make sure real GRUB will be able to
do it as well.  At least, that's what we do in GRUB 2.

It would seem that running sync would suffice for that.  Unfortunately, it
seems that:

  - sync is not enough
(see http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=239111#53)

  - xfs_freeze could hang even if you don't write anything
(see http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=239111#194)

This is odd.  I think I'm going to assume sync does what is expected from
it.  If it doesn't maybe it's a bug in Linux, which maybe got fixed already,
or maybe it's a (lesser) bug in GRUB which we can't find untill it's
exposed.

In any case, we're better off not using xfs_freeze untill we know for sure
why it is needed, and are certain it won't cause hangs.

I'm waiting for Ron's confirmation that this works for him, and then I'll
remove the patch & upload.

-- 
Robert Millan

  The DRM opt-in fallacy: "Your data belongs to us. We will decide when (and
  how) you may access your data; but nobody's threatening your freedom: we
  still allow you to remove your data and not access it at all."



-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org



Bug#239111: Grub is shockingly bad code

2009-01-12 Thread Robert Millan
On Mon, Jan 12, 2009 at 06:57:55PM +, Robert McQueen wrote:
> 
> So no, it's nothing to do with where I'm trying to install GRUB to, the
> problem is the Debian patch to grub-install did xfs_freeze on my root
> filesystem and then did various crap including trying to write to the
> log file (and make a /boot/grub/default file, I think?),

It's not the log file.  Joeyh tried that (see the bug log).

I'm almost certain it's the fwrite() call in install_func.  I could be
wrong, but I still don't see why we would want to use xfs_freeze anyway.

And if we _really_ need xfs_freeze, it means we're doing something wrong.

-- 
Robert Millan

  The DRM opt-in fallacy: "Your data belongs to us. We will decide when (and
  how) you may access your data; but nobody's threatening your freedom: we
  still allow you to remove your data and not access it at all."



-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org



Bug#239111: Grub is shockingly bad code

2009-01-12 Thread Robert Millan
On Mon, Jan 12, 2009 at 07:32:49PM +0100, Robert Millan wrote:
> 
> I (and upstream in general) believe that the only right way to rely on a
> hardcoded list of blocks that live inside a filesystem is _not to_.

Grmf.  I was making wrong assumptions.  This is not about block lists
(I still think block lists suck, but let's be fair...):

#ifdef GRUB_UTIL
  /* In the grub shell, access the Stage 2 via the OS filesystem
 service, if possible.  */
  if (stage2_os_file)
{
[...]
  if (fwrite (stage2_buffer, 1, SECTOR_SIZE, fp)
  != SECTOR_SIZE)
[...]
}
  else
#endif /* GRUB_UTIL */
{
  if (! devwrite (saved_sector - part_start, 1, stage2_buffer))
goto fail;
}

So we freeze the filesystem and afterwards try to write to it.  Not a good
idea...

#239111 initial report claims GRUB hangs before we added xfs_freeze.diff, but
according to what Rob says, not freezing makes it work for him.

Rob, can you test removing the patch completely?

-- 
Robert Millan

  The DRM opt-in fallacy: "Your data belongs to us. We will decide when (and
  how) you may access your data; but nobody's threatening your freedom: we
  still allow you to remove your data and not access it at all."



-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org



Bug#239111: Grub is shockingly bad code

2009-01-12 Thread Robert McQueen
Robert Millan wrote:
> Rob,
> 
> Did you hit this problem when installing GRUB to a partition, or to the
> whole disk?

I was upgrading from etch to lenny on a box where / is XFS and /boot and
/var are on the same partition. GRUB is installed into the MBR. I know
you can't install bootloaders onto XFS partitions.

(If that had happened, based on previous experience the FS would've
panicked and all IO returned with an error, so I'd probably have been
able to log in and get a lot of errors, assuming bash was still in the
cache. :D)

So no, it's nothing to do with where I'm trying to install GRUB to, the
problem is the Debian patch to grub-install did xfs_freeze on my root
filesystem and then did various crap including trying to write to the
log file (and make a /boot/grub/default file, I think?), all of which go
into D state while the filesystem is frozen so don't finish, and
meanwhile everything else on my system ground to a halt too. Obviously
hard for me to debug given I lost all my access to the system, but
asking SysRq for backtraces showed everything was blocked on FS access.
(Actually as a result of my experience I think XFS devs are adding an
unfreeze SysRq atm...)

Just make the script not try do anything at all which might potentially
cause a write to the filesystem while it's frozen, is the main thing...
Backing out the bullshit XFS Debian patch would do that, applying Ben's
might even make it work properly on lenny.

Regards,
Rob



-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org



Bug#239111: Grub is shockingly bad code

2009-01-12 Thread Robert Millan

Rob,

Did you hit this problem when installing GRUB to a partition, or to the
whole disk?

-- 
Robert Millan

  The DRM opt-in fallacy: "Your data belongs to us. We will decide when (and
  how) you may access your data; but nobody's threatening your freedom: we
  still allow you to remove your data and not access it at all."



-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org



Bug#239111: Grub is shockingly bad code

2009-01-12 Thread Robert McQueen
Robert Millan wrote:
> Hi Rob,

Hi Robert,

> You just convinced me that this is completely fucked up.  This is not the
> first time someone claims to have fixed this problem, only to discover that
> it wasn't, and I'm not going to gamble with ioctls, freeze/unfreeze combos
> or Linux version checks.

We all know this is fucked up. :) I'm not claiming to fix anything so it
somehow works flawlessly. I just want to avoid my system being locked up
when I follow the instructions in NEWS.Debian. So the changes I'm
proposing for lenny are just:

 a) change the Debian XFS hack to grub-install so that it does "copy
stage files; freeze; unfreeze; install grub" rather than "copy stage
files; freeze; lots of shit; unfreeze" which is guaranteed. I don't
care if it *works*, but it will at least print me an error message
which is fine because everyone who has Grub and XFS probably
installed it manually once anyway. Anyway, Ben's simpler patch (just
xfs_freeze followed by unfreeze) probably *does* work on lenny and
at least fails gracefully on etch. I'll test it.

 b) clarify the documentation in NEWS.Debian which encourages people to
run "grub-install" unequivocally, so that it tells people with an
XFS /boot filesystem to either upgrade after rebooting, or upgrade
later. If you can tell me which kernel versions that etch's Grub can
boot, I can propose a patch.

For what it's worth I've chatted to XFS developers on IRC (Eric Sandeen
amongst others) and pointed them to this bug, and they agree with my
suggested approach here.

> I (and upstream in general) believe that the only right way to rely on a
> hardcoded list of blocks that live inside a filesystem is _not to_.

You're in 100% agreement with the XFS developers here.

> I'll see if it's feasible to make it use embedding, and if not, XFS support
> in GRUB Legacy will be terminated, and users will be advised to either
> leave /boot/grub untouched or upgrade to GRUB 2.

Maybe sensible, but I suggest the above two changes are made as well to
avoid fucking people's systems if they /do/ run grub-install, because
the current Debian XFS hack makes things much much worse than upstream.

Regards,
Rob



-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org



Bug#239111: Grub is shockingly bad code

2009-01-12 Thread Robert Millan
On Mon, Jan 12, 2009 at 06:15:52PM +, Robert McQueen wrote:
> severity 239111 grave
> thanks
> 
> Robert Millan wrote:
> > The whole approach is wrong, so maybe it makes sense to avoid it, or maybe
> > it's too late for that, and we should issue a critical debconf warning when
> > XFS is detected.
> >
> > I will have to think about it.
> 
> The problem is that Debian's patch to try and make it work on XFS makes
> it /worse/. It turns "grub-install fails" into "grub-install fucks my
> system". There's no question we should just fix that ASAP by applying a
> patch which just does freeze immediately followed by unfreeze. That's
> why I raised the severity, and I still consider this RC for lenny
> because the upgrade instructions in NEWS.Debian encouraged me to do
> something which bricked my system.
> 
> (Note that although I didn't try it yet, I'm almost certain that Ben's
> approach with a helper binary to run FIBMAP ioctl won't help any more
> than calling sync() will. sync is /not/ broken in XFS - if you call it
> then it does flush the data to disk - the "problem" is that it considers
> metadata synced to the journal as safely sync()'d, so calling it isn't
> sufficient to udate the dentry's for grub to read directly.)
> 
> Given even xfs_freeze is broken in etch's kernel, we can't make
> grub-install work reliably there, but if we put freeze/unfreeze between
> copying the files and invoking grub then it will a) not hose people's
> systems under etch, and b) will work with lenny's kernel. There's also
> no need for a critical warning, because grub-install has to be done by
> the user manually anyway. We should just include the correct advice.
> 
> If you'd answer my question about which kernel / grub versions are
> actually incompatible, then we can choose between the two behaviours.
> If etch's grub can boot lenny's kernel, we can tell XFS users to just
> update grub after rebooting, otherwise we can direct them at some manual
> install instruction and say "please freeze and unfreeze the filesystem,
> run sync a few times, and then have a cup of tea, between copying the
> stage* files and running the grub shell".

Hi Rob,

You just convinced me that this is completely fucked up.  This is not the
first time someone claims to have fixed this problem, only to discover that
it wasn't, and I'm not going to gamble with ioctls, freeze/unfreeze combos
or Linux version checks.

I (and upstream in general) believe that the only right way to rely on a
hardcoded list of blocks that live inside a filesystem is _not to_.

I'll see if it's feasible to make it use embedding, and if not, XFS support
in GRUB Legacy will be terminated, and users will be advised to either
leave /boot/grub untouched or upgrade to GRUB 2.

-- 
Robert Millan

  The DRM opt-in fallacy: "Your data belongs to us. We will decide when (and
  how) you may access your data; but nobody's threatening your freedom: we
  still allow you to remove your data and not access it at all."



-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org



Bug#239111: Grub is shockingly bad code

2009-01-12 Thread Steve McIntyre
On Mon, Jan 12, 2009 at 07:00:58PM +0100, Robert Millan wrote:
>On Sun, Jan 11, 2009 at 05:16:03PM +, Steve McIntyre wrote:
>> Yes, I've looked at the grub2 source and I'm much happier. It doesn't
>> look like a novice had written it, which is a major improvement. I'm
>> still curious WTH anybody would think the nested functions are a good
>> thing, though...
>
>It's kind of a tradition here.  I think the point is just to limit the
>scope of "hook functions" same way as is usually done for variables.

:-/ As an experienced C programmer, they make me shudder. For simple
readability of the code, I'd give up on them altogether. But it's your
code... 

>> The thing I'm more bothered about right now is what will happen to
>> people with existing systems who upgrade. That's why Rob raised this
>> bug to grave: his remote system locked up as part of an upgrade. What
>> do you plan to do to stop this happening again?
>
>The whole approach is wrong, so maybe it makes sense to avoid it, or maybe
>it's too late for that, and we should issue a critical debconf warning when
>XFS is detected.
>
>I will have to think about it.

OK; please be quick, as we could really do with something in place for
Lenny and we're not far away.

-- 
Steve McIntyre, Cambridge, UK.st...@einval.com
You lock the door
And throw away the key
There's someone in my head but it's not me 




-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org



Bug#239111: Grub is shockingly bad code

2009-01-12 Thread Robert McQueen
severity 239111 grave
thanks

Robert Millan wrote:
> The whole approach is wrong, so maybe it makes sense to avoid it, or maybe
> it's too late for that, and we should issue a critical debconf warning when
> XFS is detected.
>
> I will have to think about it.

The problem is that Debian's patch to try and make it work on XFS makes
it /worse/. It turns "grub-install fails" into "grub-install fucks my
system". There's no question we should just fix that ASAP by applying a
patch which just does freeze immediately followed by unfreeze. That's
why I raised the severity, and I still consider this RC for lenny
because the upgrade instructions in NEWS.Debian encouraged me to do
something which bricked my system.

(Note that although I didn't try it yet, I'm almost certain that Ben's
approach with a helper binary to run FIBMAP ioctl won't help any more
than calling sync() will. sync is /not/ broken in XFS - if you call it
then it does flush the data to disk - the "problem" is that it considers
metadata synced to the journal as safely sync()'d, so calling it isn't
sufficient to udate the dentry's for grub to read directly.)

Given even xfs_freeze is broken in etch's kernel, we can't make
grub-install work reliably there, but if we put freeze/unfreeze between
copying the files and invoking grub then it will a) not hose people's
systems under etch, and b) will work with lenny's kernel. There's also
no need for a critical warning, because grub-install has to be done by
the user manually anyway. We should just include the correct advice.

If you'd answer my question about which kernel / grub versions are
actually incompatible, then we can choose between the two behaviours.
If etch's grub can boot lenny's kernel, we can tell XFS users to just
update grub after rebooting, otherwise we can direct them at some manual
install instruction and say "please freeze and unfreeze the filesystem,
run sync a few times, and then have a cup of tea, between copying the
stage* files and running the grub shell".

Regards,
Rob



-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org



Bug#239111: Grub is shockingly bad code

2009-01-12 Thread Robert Millan
On Sun, Jan 11, 2009 at 05:16:03PM +, Steve McIntyre wrote:
> Yes, I've looked at the grub2 source and I'm much happier. It doesn't
> look like a novice had written it, which is a major improvement. I'm
> still curious WTH anybody would think the nested functions are a good
> thing, though...

It's kind of a tradition here.  I think the point is just to limit the
scope of "hook functions" same way as is usually done for variables.

> >As for the bug severity, this bug has been around for a while, and we never
> >considered it to be Release Critical.  D-I avoids using GRUB Legacy when user
> >selected XFS, and rightly so.  As I said, with GRUB 2 it's going to be
> >different (if someone wants to know more about the technical details on how
> >we changed our approach, feel free to contact me).
> 
> The thing I'm more bothered about right now is what will happen to
> people with existing systems who upgrade. That's why Rob raised this
> bug to grave: his remote system locked up as part of an upgrade. What
> do you plan to do to stop this happening again?

The whole approach is wrong, so maybe it makes sense to avoid it, or maybe
it's too late for that, and we should issue a critical debconf warning when
XFS is detected.

I will have to think about it.

-- 
Robert Millan

  The DRM opt-in fallacy: "Your data belongs to us. We will decide when (and
  how) you may access your data; but nobody's threatening your freedom: we
  still allow you to remove your data and not access it at all."



-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org



Bug#239111: Grub is shockingly bad code

2009-01-11 Thread Steve McIntyre
On Sun, Jan 11, 2009 at 12:26:48PM +0100, Robert Millan wrote:
>severity 239111 important
>clone 239111 -1
>retitle -1 should refuse to install on XFS unless embedding can be used
>reassign -1 grub2
>thanks
>
>On Sun, Jan 04, 2009 at 02:25:28AM +, Steve McIntyre wrote:
>> 
>> After several hours of working through the source, I give up. It's a
>> total mess and I'd much rather see grub simply removed from Debian
>> altogether than fix this bug and allow it to live on. Highlights:
>> 
>>  * gratuitous use of nested functions
>>  * horrific interfaces that don't pass enough information around
>>internally, leading to:
>>  * functions passing internal state around via umarked global
>>variables, relying on side effects
>>  * significantly obfuscated code
>>  * the core bug as described by Rob: accessing a block device
>>underneath an active filesystem and assuming that the results will
>>be safe and/or correct.
>> 
>> I *know* that grub is a bootloader, so it's always going to end up
>> having some nasty lowlevel code somewhere. But that's no excuse for
>> the code quality I've just seen. After this experience, I'm about to
>> remove grub from all my systems. Come back lilo, all is forgiven.
>
>You're right.  Code in GRUB Legacy is such a mess that we would have to
>rewrite it from scratch in order to get things right.
>
>Oh, wait, we already did.  It's called GRUB 2.  Though, we're still fond of
>the nested functions.  :-)

Yes, I've looked at the grub2 source and I'm much happier. It doesn't
look like a novice had written it, which is a major improvement. I'm
still curious WTH anybody would think the nested functions are a good
thing, though...

>As for accessing a block device underneath an active filesystem, we don't
>do this anymore, except when we have no other choice.

Yay.

>For XFS, this means that as long as you install GRUB to your MBR (and
>implicitly, post-MBR area), you'll never hit this problem.  That is,
>as long as your core.img fits in 32 kiB, which we are very careful to
>ensure it will always do.
>
>This reminds me, when it doesn't, in the particular case of XFS we need to
>abort install and tell the user why.  I'm cloning this bug for GRUB 2 with
>that purpose.
>
>Robert (et al):  Sorry that you spent so much effort debugging GRUB Legacy;
>we (upstream) have abandoned this codebase long ago.
>
>As for the bug severity, this bug has been around for a while, and we never
>considered it to be Release Critical.  D-I avoids using GRUB Legacy when user
>selected XFS, and rightly so.  As I said, with GRUB 2 it's going to be
>different (if someone wants to know more about the technical details on how
>we changed our approach, feel free to contact me).

The thing I'm more bothered about right now is what will happen to
people with existing systems who upgrade. That's why Rob raised this
bug to grave: his remote system locked up as part of an upgrade. What
do you plan to do to stop this happening again?

-- 
Steve McIntyre, Cambridge, UK.st...@einval.com
"This dress doesn't reverse." -- Alden Spiess




-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org



Bug#239111: Grub is shockingly bad code

2009-01-11 Thread Robert Millan
severity 239111 important
clone 239111 -1
retitle -1 should refuse to install on XFS unless embedding can be used
reassign -1 grub2
thanks

On Sun, Jan 04, 2009 at 02:25:28AM +, Steve McIntyre wrote:
> 
> After several hours of working through the source, I give up. It's a
> total mess and I'd much rather see grub simply removed from Debian
> altogether than fix this bug and allow it to live on. Highlights:
> 
>  * gratuitous use of nested functions
>  * horrific interfaces that don't pass enough information around
>internally, leading to:
>  * functions passing internal state around via umarked global
>variables, relying on side effects
>  * significantly obfuscated code
>  * the core bug as described by Rob: accessing a block device
>underneath an active filesystem and assuming that the results will
>be safe and/or correct.
> 
> I *know* that grub is a bootloader, so it's always going to end up
> having some nasty lowlevel code somewhere. But that's no excuse for
> the code quality I've just seen. After this experience, I'm about to
> remove grub from all my systems. Come back lilo, all is forgiven.

You're right.  Code in GRUB Legacy is such a mess that we would have to
rewrite it from scratch in order to get things right.

Oh, wait, we already did.  It's called GRUB 2.  Though, we're still fond of
the nested functions.  :-)

As for accessing a block device underneath an active filesystem, we don't
do this anymore, except when we have no other choice.

For XFS, this means that as long as you install GRUB to your MBR (and
implicitly, post-MBR area), you'll never hit this problem.  That is,
as long as your core.img fits in 32 kiB, which we are very careful to
ensure it will always do.

This reminds me, when it doesn't, in the particular case of XFS we need to
abort install and tell the user why.  I'm cloning this bug for GRUB 2 with
that purpose.

Robert (et al):  Sorry that you spent so much effort debugging GRUB Legacy;
we (upstream) have abandoned this codebase long ago.

As for the bug severity, this bug has been around for a while, and we never
considered it to be Release Critical.  D-I avoids using GRUB Legacy when user
selected XFS, and rightly so.  As I said, with GRUB 2 it's going to be
different (if someone wants to know more about the technical details on how
we changed our approach, feel free to contact me).

-- 
Robert Millan

  The DRM opt-in fallacy: "Your data belongs to us. We will decide when (and
  how) you may access your data; but nobody's threatening your freedom: we
  still allow you to remove your data and not access it at all."



-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org



Bug#239111: Grub is shockingly bad code

2009-01-03 Thread Steve McIntyre
Thanks to Rob McQueen for doing a lot of research into this for his
mail about this bug on 12 Dec 2008. Based on that mail, I've looked
through the code in an attempt to find a way to implement a reasonable
fix: using FIBMAP as he suggested.

After several hours of working through the source, I give up. It's a
total mess and I'd much rather see grub simply removed from Debian
altogether than fix this bug and allow it to live on. Highlights:

 * gratuitous use of nested functions
 * horrific interfaces that don't pass enough information around
   internally, leading to:
 * functions passing internal state around via umarked global
   variables, relying on side effects
 * significantly obfuscated code
 * the core bug as described by Rob: accessing a block device
   underneath an active filesystem and assuming that the results will
   be safe and/or correct.

I *know* that grub is a bootloader, so it's always going to end up
having some nasty lowlevel code somewhere. But that's no excuse for
the code quality I've just seen. After this experience, I'm about to
remove grub from all my systems. Come back lilo, all is forgiven.

-- 
Steve McIntyre, Cambridge, UK.st...@einval.com
Is there anybody out there?




-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org