Re: bin/144214: zfsboot fails on gang block after upgrade to zfs v14

2010-06-03 Thread Andriy Gapon
on 28/05/2010 17:49 Doug Rabson said the following:
> Attached. I thought I sent it to the list before but perhaps I only sent
> to one of the participants in the last gang block thread.

Thanks a lot!
I am sure that I will find it useful more than once.


-- 
Andriy Gapon
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: bin/144214: zfsboot fails on gang block after upgrade to zfs v14

2010-05-28 Thread Doug Rabson
On 27 May 2010 16:13, Andriy Gapon  wrote:

> on 27/05/2010 17:40 Doug Rabson said the following:
> >
> > Excellent work - thanks for looking into this. I still think it's easier
> > to debug this code in userland using a shim that redirects the zfsboot
> > i/o calls to simple read system calls...
>
> Absolutely! That should be much easier.
> Do you have such a shim that you could share?
> I'd be much obliged for it.  And not only I, I think.
> Thanks!
>

Attached. I thought I sent it to the list before but perhaps I only sent to
one of the participants in the last gang block thread.


zfstest.c
Description: Binary data

Re: bin/144214: zfsboot fails on gang block after upgrade to zfs v14

2010-05-27 Thread Andriy Gapon
on 27/05/2010 17:40 Doug Rabson said the following:
> 
> Excellent work - thanks for looking into this. I still think it's easier
> to debug this code in userland using a shim that redirects the zfsboot
> i/o calls to simple read system calls...

Absolutely! That should be much easier.
Do you have such a shim that you could share?
I'd be much obliged for it.  And not only I, I think.
Thanks!
-- 
Andriy Gapon


Re: bin/144214: zfsboot fails on gang block after upgrade to zfs v14

2010-05-27 Thread Doug Rabson
On 27 May 2010 09:35, Andriy Gapon  wrote:

> I think I nailed this problem now.
> [...]

Excellent work - thanks for looking into this. I still think it's easier to
debug this code in userland using a shim that redirects the zfsboot i/o
calls to simple read system calls...
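A minimal sketch of the kind of shim described above, assuming the boot code's physical-read hook can be pointed at a function that takes a byte offset, a buffer and a length. This is not the attached zfstest.c: the names shim_dev and shim_phys_read are hypothetical, and the only real API used is POSIX pread(2).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical device handle: in the boot code the equivalent state
 * lives behind the vdev's private read data. */
struct shim_dev {
	int fd;		/* descriptor of a disk image file */
};

/*
 * Hypothetical replacement for a BIOS-backed physical read: read
 * exactly size bytes at offset from the image file.  Returns 0 on
 * success or EIO on failure, mirroring the boot code's convention.
 */
static int
shim_phys_read(struct shim_dev *dev, off_t offset, void *buf, size_t size)
{
	char *p = buf;

	while (size > 0) {
		ssize_t n = pread(dev->fd, p, size, offset);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted: retry */
			return (EIO);
		}
		if (n == 0)
			return (EIO);	/* read past end of image */
		p += n;
		offset += n;
		size -= (size_t)n;
	}
	return (0);
}
```

In a test harness one would open a copy of the VM's disk image, install something like shim_phys_read as the read hook, and call the zfsboot reading functions directly, so the gang block path can be stepped through with an ordinary debugger.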


Re: bin/144214: zfsboot fails on gang block after upgrade to zfs v14

2010-05-27 Thread Robert Noland

Andriy Gapon wrote:

> I think I nailed this problem now.
> What was additionally needed was the following change:
> [...]
> Full patch is here:
> http://people.freebsd.org/~avg/boot-zfs-gang.diff

Excellent, I'm glad that this is finally tested and seems to be working.
When I initially added the code, I wasn't able to test it, and it
turned out that the issue I was trying to resolve wasn't actually
gang block related anyway.


robert.


Re: bin/144214: zfsboot fails on gang block after upgrade to zfs v14

2010-05-27 Thread Andriy Gapon


I think I nailed this problem now.
What was additionally needed was the following change:
 	if (!vdev || !vdev->v_read)
 		return (EIO);
-	if (vdev->v_read(vdev, bp, &zio_gb, offset, SPA_GANGBLOCKSIZE))
+	if (vdev->v_read(vdev, NULL, &zio_gb, offset, SPA_GANGBLOCKSIZE))
 		return (EIO);

Full patch is here:
http://people.freebsd.org/~avg/boot-zfs-gang.diff

Apparently I am not as smart as Roman :) because I couldn't find the bug by just
staring at this rather small function (for a couple of hours), so I had to
reproduce the problem to catch it.  Hence I am copying hackers@ to share a couple
of tricks that were new to me.  Perhaps they could help someone else some other
day.

First, after very helpful hints that I received in parallel from pjd and two
Oracle/Sun developers, it became very easy to produce a pool with files that
have gang blocks in them.
One can set the metaslab_gang_bang variable in metaslab.c to some value < 128K,
and then blocks larger than metaslab_gang_bang will be allocated as gang
blocks with a 25% chance.  I personally did something similar but slightly more
deterministic:
--- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
+++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
@@ -1572,6 +1572,12 @@ zio_dva_allocate(zio_t *zio)
 	ASSERT3U(zio->io_prop.zp_ndvas, <=, spa_max_replication(spa));
 	ASSERT3U(zio->io_size, ==, BP_GET_PSIZE(bp));
 
+	/*XXX XXX XXX XXX*/
+	if (zio->io_size > 8 * 1024) {
+		return (zio_write_gang_block(zio));
+	}
+	/*XXX XXX XXX XXX*/
+
 	error = metaslab_alloc(spa, mc, zio->io_size, bp,
 	    zio->io_prop.zp_ndvas, zio->io_txg, NULL, 0);

This ensured that any block > 8K would be a gang block.
Then I compiled zfs.ko with this change and put it into a virtual machine where
I created a pool and populated its root/boot filesystem with the /boot directory.
I then booted the virtual machine from the new virtual disk and immediately hit
the problem.
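The two ways of getting gang blocks described above (the probabilistic metaslab_gang_bang setting and the forced > 8K hack in zio_dva_allocate()) can be contrasted in a small userland model. This is a sketch of the described behaviour, not the ZFS code: gang_random(), gang_forced(), count_random_gangs() and the use of a tick counter for the 1-in-4 draw are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the metaslab_gang_bang behaviour as described: blocks
 * larger than the threshold become gang blocks about 25% of the time.
 * The 1-in-4 draw is modeled with the low two bits of a caller-supplied
 * tick value (an assumption for illustration).
 */
static bool
gang_random(uint64_t size, uint64_t threshold, uint64_t tick)
{
	return (size > threshold && (tick & 3) == 0);
}

/* Model of the deterministic hack: every block larger than 8K is
 * written as a gang block. */
static bool
gang_forced(uint64_t size)
{
	return (size > 8 * 1024);
}

/* Counts how many of n consecutive ticks would gang a block of the
 * given size under the random policy. */
static unsigned
count_random_gangs(uint64_t size, uint64_t threshold, unsigned n)
{
	unsigned hits = 0;

	for (uint64_t t = 0; t < n; t++)
		if (gang_random(size, threshold, t))
			hits++;
	return (hits);
}
```

The forced variant is what makes the reproduction reliable: with the random policy only about a quarter of the large blocks in /boot would be ganged, while the hack guarantees that every one of them is.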

So far, so good, but I still had no clue why zfsboot crashes upon encountering
a gang block.

So I decided to debug the crash with gdb.
Standard steps:
$ qemu ... -S -s
$ gdb
...
(gdb) target remote localhost:1234

Now I didn't want to single-step through the whole boot process, so I decided to
get some help from gdb. Here's a trick:
(gdb) add-symbol-file /usr/obj/usr/src/sys/boot/i386/gptzfsboot/gptzfsboot.out 0xa000

gptzfsboot.out is an ELF image produced by GCC, which then gets transformed into
a raw binary and then into the final BTX binary (gptzfsboot).
gptzfsboot.out is built without much debugging data, but at least it contains
information about function names.  Perhaps it's even possible to compile
gptzfsboot.out with a higher debug level, which would make debugging much more
pleasant.

0xA000 is where the _code_ from gptzfsboot.out ends up being loaded in memory.
BTW, having only shallow knowledge of the boot chain and BTX, I didn't know this
address.  Another GDB trick helped me:
(gdb) append memory boot.memdump  0x0 0x1

This command dumps the memory contents in the range 0x0-0x1 to a file named
boot.memdump.  Then I produced a hex dump and searched for the byte sequence
with which gptzfsboot.bin (the raw binary produced from gptzfsboot.out) starts.
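The search described above is easy to automate. A minimal sketch, assuming the dump and the first few bytes of gptzfsboot.bin have already been read into memory; find_bytes() is a hypothetical helper, not part of any FreeBSD tool, and the byte values in any example are made up. Because the dump starts at address 0x0, the returned offset is directly the address where the binary was loaded.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Returns the offset of the first occurrence of needle in haystack,
 * or -1 if it is absent.  Here haystack would be the contents of
 * boot.memdump and needle the first bytes of gptzfsboot.bin.  A naive
 * scan is plenty for a low-memory dump.
 */
static long
find_bytes(const unsigned char *haystack, size_t hlen,
    const unsigned char *needle, size_t nlen)
{
	if (nlen == 0 || nlen > hlen)
		return (-1);
	for (size_t i = 0; i + nlen <= hlen; i++)
		if (memcmp(haystack + i, needle, nlen) == 0)
			return ((long)i);
	return (-1);
}
```

Using more than a couple of bytes of the binary's prologue as the needle makes a false match in unrelated memory much less likely.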

Of course, the memory dump should be taken after gptzfsboot is loaded into memory :)
Catching the right moment requires a little knowledge of the boot process.
I caught it with:
(gdb) b *0xC000

That is, the memory dump was taken after gdb stopped at the above breakpoint.

After that it was a piece of cake.  I set a breakpoint on the zio_read_gang
function (after the add-symbol-file command) and then stepi-ed through the code
(that is, instruction by instruction).  The following command made it easier to
see what was getting executed:
(gdb) display/i 0xA000 + $eip
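The arithmetic behind display/i 0xA000 + $eip (and the 0xa000 argument to add-symbol-file) amounts to adding the client's load base to a client-relative EIP, as the commands above imply. The sketch below models that, plus the reverse lookup gdb performs with the loaded symbols; eip_to_linear(), lookup_sym() and struct sym are hypothetical, and any symbol addresses used with them are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CLIENT_BASE 0xA000u	/* load address found via the memory dump */

/* Hypothetical symbol table entry, as if taken from gptzfsboot.out;
 * addresses are relative to the load base. */
struct sym {
	uint32_t addr;
	const char *name;
};

/* Converts a client-relative EIP to the address where the instruction
 * actually sits, which is what "display/i 0xA000 + $eip" computes. */
static uint32_t
eip_to_linear(uint32_t eip)
{
	return (CLIENT_BASE + eip);
}

/* Returns the name of the symbol containing eip in a table sorted by
 * addr, or NULL if eip precedes every symbol. */
static const char *
lookup_sym(const struct sym *tab, size_t n, uint32_t eip)
{
	const char *best = NULL;

	for (size_t i = 0; i < n && tab[i].addr <= eip; i++)
		best = tab[i].name;
	return (best);
}
```

This is only a model of the bookkeeping; in practice add-symbol-file lets gdb do the lookup itself, which is why breaking on zio_read_gang by name works.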

I quickly stepped through the code and saw that a large value was passed to
vdev_read as the 'bytes' parameter, but it should have been 512
(SPA_GANGBLOCKSIZE).  The oversized read into a buffer allocated on the stack
smashed the stack, and that was the end.
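For context on the 512 figure: a gang block header is SPA_GANGBLOCKSIZE (512) bytes and holds three block pointers followed by filler and a checksum tail. The sketch below uses simplified, hypothetically named stand-ins modeled on the ZFS on-disk layout of that era (zio_gbh_phys_t) to check the arithmetic at compile time; it is not the boot code's definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins: a block pointer is 128 bytes (16 64-bit words),
 * and the block tail is a 64-bit magic plus a 256-bit checksum. */
typedef struct { uint64_t w[16]; } model_blkptr_t;
typedef struct { uint64_t magic; uint64_t cksum[4]; } model_tail_t;

#define MODEL_GANGBLOCKSIZE 512

/* As many block pointers as fit in front of the tail... */
#define MODEL_GBH_NBLKPTRS \
	((MODEL_GANGBLOCKSIZE - sizeof(model_tail_t)) / sizeof(model_blkptr_t))
/* ...and the leftover space, in 64-bit words, is filler. */
#define MODEL_GBH_FILLER \
	((MODEL_GANGBLOCKSIZE - sizeof(model_tail_t) - \
	  MODEL_GBH_NBLKPTRS * sizeof(model_blkptr_t)) / sizeof(uint64_t))

/* Model of a gang block header: the object a 512-byte stack buffer
 * such as zio_gb is meant to hold. */
typedef struct {
	model_blkptr_t blkptr[MODEL_GBH_NBLKPTRS];
	uint64_t filler[MODEL_GBH_FILLER];
	model_tail_t tail;
} model_gbh_t;

_Static_assert(sizeof(model_gbh_t) == MODEL_GANGBLOCKSIZE,
    "gang block header must be exactly 512 bytes");
```

A read sized for the whole logical block (say, 128K) into a buffer of this size overruns it by orders of magnitude, which matches the stack smash observed above.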

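Tracing the call chain back through the source code, I immediately noticed the
bp condition in vdev_read_phys (when a block pointer is passed, the read size is
taken from the block pointer rather than from the size argument) and realized
what the problem was.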
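The failure mode can be reduced to a few lines. A minimal model, assuming that vdev_read_phys() uses the size recorded in the block pointer (BP_GET_PSIZE) whenever bp is non-NULL and the caller's size argument only when bp is NULL; model_bp, model_read_size() and model_read_fits() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a block pointer: only the physical size
 * matters for this illustration. */
struct model_bp {
	size_t psize;	/* plays the role of BP_GET_PSIZE(bp) */
};

#define MODEL_GANGBLOCKSIZE 512	/* SPA_GANGBLOCKSIZE in the real code */

/* Models the size selection: with a block pointer, the caller's size
 * argument is ignored. */
static size_t
model_read_size(const struct model_bp *bp, size_t size)
{
	return (bp != NULL ? bp->psize : size);
}

/* Returns 0 if the selected read size fits in a buffer of bufsize
 * bytes, -1 if the read would overrun it. */
static int
model_read_fits(const struct model_bp *bp, size_t size, size_t bufsize)
{
	return (model_read_size(bp, size) <= bufsize ? 0 : -1);
}
```

With a block pointer describing a 128K block, model_read_size(&bp, 512) yields 131072, which does not fit in a 512-byte stack buffer; passing NULL, as the fix does, yields 512.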

I hope this will be useful reading.
-- 
Andriy Gapon