Re: [developer] spa_namespace_lock vs sync thread

2016-07-08 Thread George Wilson
Andriy,

I think this is basically a rule, although I don't think it's stated
anywhere. We rely heavily on this locking strategy, since there are many
places that hold the namespace lock to prevent spa config changes and
yet wait for a txg to sync.

Is it honored everywhere? Well, I hope so, but there probably hasn't been a
detailed analysis done to see whether there are places where we do
something like a spa_open() from some obscure path in the sync thread. I
would hope that our testing would have discovered that quickly, and that
developers sufficiently exercise their code to make sure they don't
violate this.

Have you seen instances where we deadlock as a result of this?

Thanks,
George

On Fri, Jul 8, 2016 at 9:40 AM, Andriy Gapon  wrote:

> 
> I see a few places in the code that do the following:
> mutex_enter(&spa_namespace_lock);
> dsl_sync_task(...);
> mutex_exit(&spa_namespace_lock);
> One example is spa_change_guid().
> 
> In my understanding this implies that the sync thread must never acquire
> spa_namespace_lock.  Is this a real rule?  Do we always honor it?
> 
> Thank you!
> --
> Andriy Gapon
> 





Re: [developer] panic: solaris assert: ba.ba_phys->bt_bytes == 0 (0x400 == 0x0)

2016-07-03 Thread George Wilson
Andriy,

Can you give me some details about how you're able to reproduce this panic?
I would like to help debug this. I'm also looking into the range_tree()
panic, so any details you can provide would be very helpful.

If you can publish the crash dumps, I can also download them and take a
look.

Thanks,
George

On Wed, Jun 22, 2016 at 4:53 PM, Andriy Gapon  wrote:

>
> Igor,
>
> your suggestion was certainly a good one, however I took a path of a
> lesser effort and tested my workload on the latest illumos kernel:
>
> panic[cpu3]/thread=ff000bc56c40: assertion failed:
> ba.ba_phys->bt_bytes == 0 (0x400 == 0x0), file:
> ../../common/fs/zfs/bptree.c, line: 293
>
> ff000bc56890 genunix:process_type+164b75 ()
> ff000bc56a20 zfs:bptree_iterate+4bf ()
> ff000bc56a90 zfs:dsl_scan_sync+17c ()
> ff000bc56b50 zfs:spa_sync+2bb ()
> ff000bc56c20 zfs:txg_sync_thread+260 ()
> ff000bc56c30 unix:thread_start+8 ()
>
> syncing file systems... done
> dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel
> dumping:  0:34 100% done
> 100% done: 339495 pages dumped, dump succeeded
> rebooting...
>
> So, if anyone is interested I can provide any requested information from
> the crash dump or try your debugging suggestions.
>
> On 22/06/2016 17:45, Igor Kozhukhov wrote:
> > based on your changeset number - it is an old update:
> >
> https://github.com/illumos/illumos-gate/commit/26455f9efcf9b1e44937d4d86d1ce37b006f25a9
> > 6052 decouple lzc_create() from the implementation details
> >
> > we have a lot of other changes in the illumos tree, and I can say I have
> > no panic on my system with a gcc48 build; I have tested with the zfs tests.
> >
> > Maybe, as a solution, you can try to merge the latest changes and
> > check again?
> > I had a panic with a gcc48 build, but Matt pointed to a Delphix update;
> > we upstreamed it and I have no panics any more with the full list
> > of zfs tests available in the illumos tree.
> >
> > best regards,
> > -Igor
> >
> >
> >> On Jun 22, 2016, at 5:17 PM, Andriy Gapon wrote:
> >>
> >>
> >> I am not yet convinced that the problem has anything to do with
> >> miscompiled code.  I am using exactly the same optimizations and exactly
> >> the same compiler as the official FreeBSD builds.
> >>
> >> On 22/06/2016 17:03, Igor Kozhukhov wrote:
> >>> Hi Andriy,
> >>>
> >>> I have DilOS with gcc-4.8.5 (+ special patches) for illumos builds.
> >>> I had some problems with zdb, found by the zfs tests.
> >>>
> >>> The problem was fixed by disabling an optimization:
> >>> -fno-aggressive-loop-optimizations
> >>>
> >>> also, I have added:
> >>> -fno-ipa-sra
> >>>
> >>> but I don't remember why I added it ;)
> >>> probably it was added for another illumos component and the new gcc-4.8
> >>>
> >>> As you know, illumos is still using gcc-4.4.4, and some newer compilers
> >>> can produce new issues with older code :)
> >>>
> >>> I think you can try playing with your clang optimization flags too.
> >>> I have no experience with clang.
> >>>
> >>> best regards,
> >>> -Igor
> >>>
> >>>
>  On Jun 22, 2016, at 4:21 PM, Andriy Gapon wrote:
> 
> 
>  I am getting the following panic using the latest FreeBSD head that is
>  synchronized with OpenZFS code as of
>  illumos/illumos-gate@26455f9efcf9b1e44937d4d86d1ce37b006f25a9.
> 
>  panic: solaris assert: ba.ba_phys->bt_bytes == 0 (0x400 == 0x0), file:
>  /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/bptree.c,
>  line: 292
>  cpuid = 1
>  KDB: stack backtrace:
>  db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame
>  0xfe004db9d310
>  vpanic() at vpanic+0x182/frame 0xfe004db9d390
>  panic() at panic+0x43/frame 0xfe004db9d3f0
>  assfail3() at assfail3+0x2c/frame 0xfe004db9d410
>  bptree_iterate() at bptree_iterate+0x35e/frame 0xfe004db9d540
>  dsl_scan_sync() at dsl_scan_sync+0x24f/frame 0xfe004db9d890
>  spa_sync() at spa_sync+0x897/frame 0xfe004db9dad0
>  txg_sync_thread() at txg_sync_thread+0x383/frame 0xfe004db9dbb0
>  fork_exit() at fork_exit+0x84/frame 0xfe004db9dbf0
>  fork_trampoline() at fork_trampoline+0xe/frame 0xfe004db9dbf0
>  --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
> 
>  I have a crash dump, but unfortunately it's hard to work with it,
>  because a lot of useful information got "optimized out" by clang.
> 
>  I can reproduce the panic using a synthetic workload, but I do not have
>  a concise reproduction scenario.  Every time the panic happens bt_bytes
>  is 0x400, I haven't seen any other number there.
> 
>  Does anyone have an idea what could be causing this?
>  I can try any diagnostic code that might shed more light.
>  Thank you!
> 
>  --
>  

Re: [OpenZFS Developer] [openzfs] Possible access beyond end of string in zpool comment (#47)

2015-12-18 Thread George Wilson
LGTM

---
Reply to this email directly or view it on GitHub:
https://github.com/openzfs/openzfs/pull/47#issuecomment-165781640


Re: [OpenZFS Developer] [openzfs] ASSERT supported zio_types for file and disk vdevs (#43)

2015-12-03 Thread George Wilson
LGTM.

---
Reply to this email directly or view it on GitHub:
https://github.com/openzfs/openzfs/pull/43#issuecomment-161622259


Re: [OpenZFS Developer] [openzfs] Fix mutex leak in dmu_objset_find_dp (#44)

2015-12-03 Thread George Wilson
LGTM

---
Reply to this email directly or view it on GitHub:
https://github.com/openzfs/openzfs/pull/44#issuecomment-161621968


Re: [OpenZFS Developer] [openzfs] 6458 kmem reap thread gets blocked in reclaim callback (#38)

2015-11-17 Thread George Wilson
> @@ -1508,7 +1509,15 @@ exi_cache_trim(struct exportinfo *exi)
>	 * used for NFSAUTH_CACHE_TRIM seconds.
>	 */
>	for (c = avl_first(tree); c != NULL; c = AVL_NEXT(tree, c)) {
> -		rw_enter(&c->authc_lock, RW_WRITER);
> +		/*
> +		 * We are being called by the kmem subsystem to reclaim
> +		 * memory so don't block if we can't get the lock.
> +		 */
> +		if (rw_tryenter(&c->exi_cache_lock, RW_WRITER) == 0) {

Yes, this should be the authc_lock here. This was a result of a mismerge.

---
Reply to this email directly or view it on GitHub:
https://github.com/openzfs/openzfs/pull/38/files#r45091793


Re: [OpenZFS Developer] [openzfs] 6392 zdb: Introduce -V for verbatim import (#22)

2015-11-01 Thread George Wilson
LGTM

---
Reply to this email directly or view it on GitHub:
https://github.com/openzfs/openzfs/pull/22#issuecomment-152831278


Re: [OpenZFS Developer] [openzfs] 6414 vdev_config_sync could be simpler (#31)

2015-11-01 Thread George Wilson
Looks good to me

---
Reply to this email directly or view it on GitHub:
https://github.com/openzfs/openzfs/pull/31#issuecomment-152831553


Re: [OpenZFS Developer] [openzfs] 6391 Override default SPA config location via environment (#21)

2015-11-01 Thread George Wilson
LGTM

---
Reply to this email directly or view it on GitHub:
https://github.com/openzfs/openzfs/pull/21#issuecomment-152831323


Re: [OpenZFS Developer] Review Request 250: DLPX-37488 exporting a pool while an async destroy is running can leave entries in the deferred tree

2015-10-13 Thread George Wilson


> On Oct. 7, 2015, 5:42 p.m., Justin Gibbs wrote:
> > usr/src/uts/common/fs/zfs/dsl_scan.c, lines 1421-1427
> > <https://reviews.csiden.org/r/250/diff/1/?file=17649#file17649line1421>
> >
> > Wouldn't it be better to merge this logic into dsl_scan_active() so 
> > that the policy is contained in one place?  Something like:
> > 
> > ```c
> > if (!dsl_scan_active(scn, INCLUDE_STALLED_SCANS))
> > return;
> > ```
> 
> Justin Gibbs wrote:
> Grr.  Review board posted a dup and I somehow failed to mark the right 
> lines...

I'm not convinced it would be better since proceeding with a scan when a scan 
is not running is not common policy but I don't have a strong opinion either 
way. I think if we did change the interface I would prefer to see a "force" 
flag or something like that so that it's clear that we're asking for an 
exception. Using "include" makes it seem like that's the preferred behavior (at 
least to me).


- George


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.csiden.org/r/250/#review818
---


On Oct. 7, 2015, 3:02 a.m., Matthew Ahrens wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.csiden.org/r/250/
> ---
> 
> (Updated Oct. 7, 2015, 3:02 a.m.)
> 
> 
> Review request for OpenZFS Developer Mailing List and Christopher Siden.
> 
> 
> Bugs: 6292
> https://www.illumos.org/issues/6292
> 
> 
> Repository: illumos-gate
> 
> 
> Description
> ---
> 
> 6292 exporting a pool while an async destroy is running can leave entries in 
> the deferred tree
> Reviewed by: Paul Dagnelie <p...@delphix.com>
> Reviewed by: Matthew Ahrens <mahr...@delphix.com>
> 
> Original author: George Wilson
> 
> 
> Diffs
> -
> 
>   usr/src/uts/common/fs/zfs/dsl_scan.c 
> 6ba5cb6a1c831f681fba16dbc8d25fd8b59b13c9 
> 
> Diff: https://reviews.csiden.org/r/250/diff/
> 
> 
> Testing
> ---
> 
> http://jenkins.delphix.com/job/zfs-precommit/3149/
> 
> 
> Thanks,
> 
> Matthew Ahrens
> 
>



Re: [OpenZFS Developer] Review Request 258: 6319 assertion failed in zio_ddt_write: bp->blk_birth == txg

2015-10-11 Thread George Wilson

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.csiden.org/r/258/#review826
---

Ship it!


Ship It!

- George Wilson


On Oct. 10, 2015, 11:15 p.m., Matthew Ahrens wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.csiden.org/r/258/
> ---
> 
> (Updated Oct. 10, 2015, 11:15 p.m.)
> 
> 
> Review request for OpenZFS Developer Mailing List and Christopher Siden.
> 
> 
> Bugs: 6319
> https://www.illumos.org/issues/6319
> 
> 
> Repository: illumos-gate
> 
> 
> Description
> ---
> 
> 6319 assertion failed in zio_ddt_write: bp->blk_birth == txg
> 
> Revert "5693 ztest fails in dbuf_verify: buf[i] == 0, due to dedup and 
> bp_override"
> 
> This reverts commit 7f7ace370074e350853da254c65688fd43ddc695.
> 
> Original author: Matthew Ahrens
> 
> 
> Diffs
> -
> 
>   usr/src/uts/common/fs/zfs/zio.c 7fa795ea8cadafa0aa7bc241b0735b2fd4ce2593 
> 
> Diff: https://reviews.csiden.org/r/258/diff/
> 
> 
> Testing
> ---
> 
> ztest; zfs test suite
> http://jenkins.delphix.com/job/zfs-precommit/3194/
> 
> 
> Thanks,
> 
> Matthew Ahrens
> 
>



Re: [OpenZFS Developer] cache device sharing

2015-09-11 Thread George Wilson
I don't think L2ARC can be shared; only spare devices can be shared.

- George

On Fri, Sep 11, 2015 at 7:23 AM, Andriy Gapon 
wrote:

>
> I am curious what was the original reason to not allow sharing of cache
> devices
> (L2ARC) between pools?
> After all, the main ARC is shared for all pools.
> If more control is needed, perhaps we could have a pool property like
> shared-cache-devices (but less verbose) which would tell if data from other
> pools can be cached on this pool's cache devices.
>
> --
> Andriy Gapon
>


Re: [OpenZFS Developer] cache device sharing

2015-09-11 Thread George Wilson
The idea was that only one pool could own it, so that failover would be
possible and the cache device would follow that pool. It also would be
required for persistent l2arc.

However, it should be possible to create an independent l2arc which never
fails over but can be added to multiple pools. It would not be a trivial
implementation but seems doable.

- George

On Fri, Sep 11, 2015 at 7:40 AM, Andriy Gapon <andriy.ga...@clusterhq.com>
wrote:

> On 11/09/2015 16:36, George Wilson wrote:
> > I don't think L2ARC can be shared only spare devices can be shared.
>
> But why?
> And I do not mean multiple ownership of the cache devices. I mean that if
> a pool
> with a cache device is imported then the cache device could be used by ARC
> for
> the second level caching of data from any pool.
>
> > On Fri, Sep 11, 2015 at 7:23 AM, Andriy Gapon <andriy.ga...@clusterhq.com> wrote:
> >
> >
> > I am curious what was the original reason to not allow sharing of
> cache
> > devices
> > (L2ARC) between pools?
> > After all, the main ARC is shared for all pools.
> > If more control is needed, perhaps we could have a pool property like
> > shared-cache-devices (but less verbose) which would tell if data
> from other
> > pools can be cached on this pool's cache devices.
> >
> > --
> > Andriy Gapon
> >
> >
>
>
> --
> Andriy Gapon
>
>


Re: [OpenZFS Developer] rt->rt_space is not zero. range_tree.c:153

2015-08-27 Thread George Wilson
Jorgen,

Since you're unloading the pool, there should not be any new allocations or
frees happening. This is actually prevented by the call to txg_sync_stop()
from spa_unload(), which should perform the final txg_wait_synced() to
clear out all the ms_freetrees and then stop the sync thread.

You could add an ASSERT at the end of txg_sync_stop() to verify that the
trees are actually empty. I would also look at callers of
metaslab_free_dva() after the txg sync thread has been stopped.
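
A minimal sketch of such an assertion (hypothetical, not in the tree,
assuming the current per-txg ms_freetree layout):

```
/*
 * Hypothetical debugging aid: after the sync thread has been stopped,
 * walk a vdev's metaslabs and assert that every per-txg free tree has
 * been drained.
 */
static void
vdev_assert_freetrees_empty(vdev_t *vd)
{
	for (uint64_t m = 0; m < vd->vdev_ms_count; m++) {
		metaslab_t *msp = vd->vdev_ms[m];

		for (int t = 0; t < TXG_SIZE; t++) {
			if (msp->ms_freetree[t] != NULL)
				ASSERT0(range_tree_space(msp->ms_freetree[t]));
		}
	}
}
```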

Let me know what you find out.

Thanks,
George

On Thu, Aug 27, 2015 at 4:42 AM, Jorgen Lundman lund...@lundman.net wrote:


 Can you provide some details about the stack trace when you hit this
 failure? All of the ms_freetrees should be empty by the time you call
 range_tree_destroy(). So to debug this we need to understand the calling
 stack to determine why that isn't happening.


 The stack at the moment of the VERIFY0() triggered panic is:

 Aug 27 03:26:24 big-vm-mac kernel[0]: ZFS: rt->rt_space is !0 : space 6144
 from metaslab.c:1982
 Aug 27 03:26:24 big-vm-mac kernel[0]: zfs: 2103296 (512)
 Aug 27 03:26:24 big-vm-mac kernel[0]: zfs: 2125824 (1024)
 Aug 27 03:26:24 big-vm-mac kernel[0]: zfs: 2128896 (2560)
 Aug 27 03:26:24 big-vm-mac kernel[0]: zfs: 2144256 (2048)
 Aug 27 03:26:24 big-vm-mac kernel[0]: SPL: backtrace range_tree_destroy
 Aug 27 03:26:24 big-vm-mac kernel[0]: SPL: 0xff88a06738e0 :
 0xff7fa4c0471e net.lundman.zfs : _range_tree_destroy + 0x6e
 Aug 27 03:26:24 big-vm-mac kernel[0]: SPL: 0xff88a0673900 :
 0xff7fa4c00510 net.lundman.zfs : _metaslab_fini + 0x140
 Aug 27 03:26:24 big-vm-mac kernel[0]: SPL: 0xff88a0673940 :
 0xff7fa4c2b804 net.lundman.zfs : _vdev_metaslab_fini + 0x84
 Aug 27 03:26:24 big-vm-mac kernel[0]: SPL: 0xff88a0673970 :
 0xff7fa4c2b453 net.lundman.zfs : _vdev_free + 0x83
 Aug 27 03:26:24 big-vm-mac kernel[0]: SPL: 0xff88a06739a0 :
 0xff7fa4c2b425 net.lundman.zfs : _vdev_free + 0x55
 Aug 27 03:26:24 big-vm-mac kernel[0]: SPL: 0xff88a06739d0 :
 0xff7fa4c1121f net.lundman.zfs : _spa_unload + 0x11f
 Aug 27 03:26:24 big-vm-mac kernel[0]: SPL: 0xff88a06739f0 :
 0xff7fa4c1363a net.lundman.zfs : _spa_export_common + 0x29a
 Aug 27 03:26:24 big-vm-mac kernel[0]: SPL: 0xff88a0673a30 :
 0xff7fa4c13395 net.lundman.zfs : _spa_destroy + 0x25
 Aug 27 03:26:24 big-vm-mac kernel[0]: SPL: 0xff88a0673a50 :
 0xff7fa4c65b2e net.lundman.zfs : _zfs_ioc_pool_destroy + 0x1e
 Aug 27 03:26:24 big-vm-mac kernel[0]: SPL: 0xff88a0673a70 :
 0xff7fa4c622fc net.lundman.zfs : _zfsdev_ioctl + 0x66c





Re: [OpenZFS Developer] Review Request 229: account for ashift when choosing buffers to be written to l2arc device

2015-08-26 Thread George Wilson

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.csiden.org/r/229/#review764
---

Ship it!


Ship It!

- George Wilson


On June 25, 2015, 12:34 p.m., Andriy Gapon wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.csiden.org/r/229/
 ---
 
 (Updated June 25, 2015, 12:34 p.m.)
 
 
 Review request for OpenZFS Developer Mailing List, Justin Gibbs, George 
 Wilson, Matthew Ahrens, and Saso Kiselkov.
 
 
 Bugs: 5219
 https://www.illumos.org/issues/5219
 
 
 Repository: illumos-gate
 
 
 Description
 ---
 
 **NOTE: this review request supersedes 
 [#112](https://reviews.csiden.org/r/112/).**
 
 If we don't account for that, then we might end up overwriting disk
 area of buffers that have not been evicted yet, because l2arc_evict
 operates in terms of disk addresses.
 
 The discrepancy between the write size calculation and the actual increment
 to l2ad_hand was introduced in
 commit 
 [aad02571bc59671aa3103bb070ae365f531b0b62](https://github.com/illumos/illumos-gate/commit/aad02571bc59671aa3103bb070ae365f531b0b62)
 
 The change that introduced l2ad_hand alignment was almost correct
 as the write size was accumulated as a sum of rounded buffer sizes.
 See commit 
 [e14bb3258d05c1b1077e2db7cf77088924e56919](https://github.com/illumos/illumos-gate/commit/e14bb3258d05c1b1077e2db7cf77088924e56919)
 
 Also, we now consistently use asize / a_sz for the allocated size and
 psize / p_sz for the physical size.
 The latter accounts for a possible size reduction because of the compression,
 whereas the former accounts for a possible subsequent size expansion
 because of the alignment requirements.
 
 The code still assumes that either underlying storage subsystems or
 hardware is able to do read-modify-write when an L2ARC buffer size is
 not a multiple of a disk's block size.  This is true for 4KB sector disks
 that provide 512B sector emulation, but may not be true in general.
 In other words, we currently do not have any code to make sure that
 an L2ARC buffer, whether compressed or not, which is used for physical I/O
 has a suitable size.
 
 Note that currently the cache device utilization is calculated based
 on the physical size, not the allocated size.  The same applies to l2_asize
 kstat. That is wrong, but this commit does not fix that.
 The accounting problem was introduced partially in
 commit 
 [aad02571bc59671aa3103bb070ae365f531b0b62](https://github.com/illumos/illumos-gate/commit/aad02571bc59671aa3103bb070ae365f531b0b62)
 and partially in 
 [3038a2b421b40dc5ac11cd88423696618584f85a](https://github.com/illumos/illumos-gate/commit/3038a2b421b40dc5ac11cd88423696618584f85a)
 (accounting became consistent but in favour of the wrong size).
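
 The core of the fix, roughly: when an L2ARC buffer of physical size
 psize is written, both the device's write hand and the accumulated
 write size must advance by the ashift-aligned allocated size, or
 l2arc_evict()'s address arithmetic drifts. A sketch of that accounting
 rule using the existing vdev_psize_to_asize() helper (a sketch only,
 not the literal diff):

 ```
 uint64_t asize = vdev_psize_to_asize(dev->l2ad_vdev, psize);

 write_asize += asize;		/* accumulate allocated, not physical, size */
 dev->l2ad_hand += asize;	/* keep the hand ashift-aligned */
 ```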
 
 
 Diffs
 -
 
   usr/src/uts/common/fs/zfs/arc.c 09d2e1dd8fdf6648d863d4c036c16b3f7981dbf5 
 
 Diff: https://reviews.csiden.org/r/229/diff/
 
 
 Testing
 ---
 
 FreeBSD and Linux.
 
 
 Thanks,
 
 Andriy Gapon
 




Re: [OpenZFS Developer] Review Request 229: account for ashift when choosing buffers to be written to l2arc device

2015-08-26 Thread George Wilson
Andriy,

Looking now.

Thanks,
George

On Wed, Aug 26, 2015 at 10:05 AM, Andriy Gapon nore...@csiden.org wrote:

 This is an automatically generated e-mail. To reply, visit:
 https://reviews.csiden.org/r/229/

 On June 19th, 2015, 12:09 a.m. EEST, *Matthew Ahrens* wrote:

 Looks good to me.  Would like to see review from Saso and George as well.

 George, Saso, ping.


 - Andriy

 On June 25th, 2015, 3:34 p.m. EEST, Andriy Gapon wrote:
 Review request for OpenZFS Developer Mailing List, Justin Gibbs, George
 Wilson, Matthew Ahrens, and Saso Kiselkov.
 By Andriy Gapon.

 *Updated June 25, 2015, 3:34 p.m.*
 *Bugs: * 5219 https://www.illumos.org/issues/5219
 *Repository: * illumos-gate
 Description

 *NOTE: this review request supersedes #112 
 https://reviews.csiden.org/r/112/.*

 If we don't account for that, then we might end up overwriting disk
 area of buffers that have not been evicted yet, because l2arc_evict
 operates in terms of disk addresses.

 The discrepancy between the write size calculation and the actual increment
 to l2ad_hand was introduced in
 commit aad02571bc59671aa3103bb070ae365f531b0b62 
 https://github.com/illumos/illumos-gate/commit/aad02571bc59671aa3103bb070ae365f531b0b62

 The change that introduced l2ad_hand alignment was almost correct
 as the write size was accumulated as a sum of rounded buffer sizes.
 See commit e14bb3258d05c1b1077e2db7cf77088924e56919 
 https://github.com/illumos/illumos-gate/commit/e14bb3258d05c1b1077e2db7cf77088924e56919

 Also, we now consistently use asize / a_sz for the allocated size and
 psize / p_sz for the physical size.
 The latter accounts for a possible size reduction because of the compression,
 whereas the former accounts for a possible subsequent size expansion
 because of the alignment requirements.

 The code still assumes that either underlying storage subsystems or
 hardware is able to do read-modify-write when an L2ARC buffer size is
 not a multiple of a disk's block size.  This is true for 4KB sector disks
 that provide 512B sector emulation, but may not be true in general.
 In other words, we currently do not have any code to make sure that
 an L2ARC buffer, whether compressed or not, which is used for physical I/O
 has a suitable size.

 Note that currently the cache device utilization is calculated based
 on the physical size, not the allocated size.  The same applies to l2_asize
 kstat. That is wrong, but this commit does not fix that.
 The accounting problem was introduced partially in
 commit aad02571bc59671aa3103bb070ae365f531b0b62 
 https://github.com/illumos/illumos-gate/commit/aad02571bc59671aa3103bb070ae365f531b0b62
 and partially in 3038a2b421b40dc5ac11cd88423696618584f85a 
 https://github.com/illumos/illumos-gate/commit/3038a2b421b40dc5ac11cd88423696618584f85a
 (accounting became consistent but in favour of the wrong size).

 Testing

 FreeBSD and Linux.

 Diffs

- usr/src/uts/common/fs/zfs/arc.c
(09d2e1dd8fdf6648d863d4c036c16b3f7981dbf5)

 View Diff https://reviews.csiden.org/r/229/diff/



Re: [OpenZFS Developer] rt->rt_space is not zero. range_tree.c:153

2015-08-26 Thread George Wilson
Jorgen,

Can you provide some details about the stack trace when you hit this
failure? All of the ms_freetrees should be empty by the time you call
range_tree_destroy(). So to debug this we need to understand the calling
stack to determine why that isn't happening.

Thanks,
George

On Sat, Aug 22, 2015 at 8:13 PM, Jorgen Lundman lund...@lundman.net wrote:


 Hello list;

 So running the test environment, we can regularly trigger the following
 VERIFY:

 panic(cpu 0 caller 0xff7f8a065e08): VERIFY3( 0 == rt->rt_space )
 failed ( 0 == 14336 )\n@range_tree.c:153

 I threw in some code to determine the allocator;

 ```
 range_tree_destroy(range_tree_t *rt)
 {
 	if (rt->rt_space) {
 		printf("ZFS: rt->rt_space is !0 : space %llu from %s:%u\n",
 		    rt->rt_space, rt->rt_debug_allocator, rt->rt_debug_line);
 		range_tree_walk(rt, dumpdump, NULL);
 	}
 	//VERIFY0(rt->rt_space);
 ```

 and output is similar to:

 22/08/2015 4:27:35.000 PM kernel[0]: ZFS: rt->rt_space is !0 : space 9728
 from metaslab.c:1975
 22/08/2015 4:27:35.000 PM kernel[0]: zfs: 4194304 (1536)
 22/08/2015 4:27:35.000 PM kernel[0]: zfs: 4197888 (1536)
 22/08/2015 4:27:35.000 PM kernel[0]: zfs: 4202496 (1024)
 22/08/2015 4:27:35.000 PM kernel[0]: zfs: 4208640 (512)
 22/08/2015 4:27:35.000 PM kernel[0]: zfs: 4212224 (5120)

 In fact, all of them are from metaslab.c:1975, which corresponds to

 msp->ms_freetree[t] = range_tree_create(NULL, msp,
     &msp->ms_lock);

 The tests run at the time are the pool upgrade tests, so possibly
 related to the upgrade code.


 Has anyone come across this already, is it known anywhere? ZOL #3390 and
 #2947 at least mention it.

 Lund

 https://github.com/openzfsonosx/zfs/issues/361





Re: [OpenZFS Developer] Review Request 219: 5909 ensure that shared snap names don't become too long after promotion

2015-05-27 Thread George Wilson

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.csiden.org/r/219/#review731
---

Ship it!


Would be nice to have a new test for the zfs test suite.

- George Wilson


On May 6, 2015, 12:21 p.m., Andriy Gapon wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.csiden.org/r/219/
 ---
 
 (Updated May 6, 2015, 12:21 p.m.)
 
 
 Review request for OpenZFS Developer Mailing List, Christopher Siden and 
 Matthew Ahrens.
 
 
 Bugs: 5909
 https://www.illumos.org/projects/illumos-gate//issues/5909
 
 
 Repository: illumos-gate
 
 
 Description
 ---
 
 5909 ensure that shared snap names don't become too long after promotion
 
 
 Diffs
 -
 
   usr/src/uts/common/fs/zfs/dsl_dataset.c 
 ef781cbce8ec5e3dfbdd048e459491f8aeced4bf 
 
 Diff: https://reviews.csiden.org/r/219/diff/
 
 
 Testing
 ---
 
 OpenZFS/FreeBSD
 
 
 Thanks,
 
 Andriy Gapon
 




Re: [OpenZFS Developer] Review Request 217: Fix stack overflow and panic on FreeBSD.

2015-05-27 Thread George Wilson

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.csiden.org/r/217/#review728
---

Ship it!


Ship It!

- George Wilson


On May 8, 2015, 3:29 a.m., Gleb Smirnoff wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.csiden.org/r/217/
 ---
 
 (Updated May 8, 2015, 3:29 a.m.)
 
 
 Review request for OpenZFS Developer Mailing List and Matthew Ahrens.
 
 
 Repository: illumos-gate
 
 
 Description
 ---
 
 Do not put zfsvfs_t on the stack. Its size of 7656 bytes consumes too much 
 stack. Allocate it
 temporarily instead.
 
 On FreeBSD the kernel stack size is 16384. Issuing the 'zpool create'
 command builds a kernel stack consisting of at least 36 frames, with
 zfs_create_fs() somewhere in the middle. If the kernel is compiled with
 -O0, the stack will be exhausted and the kernel panics. The default
 compilation option is -O2, and it doesn't panic yet. But putting an extra
 7656 bytes on the stack is still risky. If any of the 36 functions is
 modified to consume a bit more stack, we will panic on the default
 kernel.
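
 A minimal sketch of the pattern described (an assumed shape, not the
 literal patch): allocate the temporary zfsvfs_t from the kernel heap for
 the duration of zfs_create_fs() instead of declaring it on the stack.

 ```
 zfsvfs_t *zfsvfs = kmem_zalloc(sizeof (zfsvfs_t), KM_SLEEP);

 /* ... use zfsvfs to set up the new file system ... */

 kmem_free(zfsvfs, sizeof (zfsvfs_t));
 ```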
 
 
 Diffs
 -
 
   usr/src/uts/common/fs/zfs/zfs_znode.c 
 4664899d13f86f556b71aaca6949448ffd45b0eb 
 
 Diff: https://reviews.csiden.org/r/217/diff/
 
 
 Testing
 ---
 
 With the patch, FreeBSD 11-CURRENT compiled with -O0 doesn't panic on 'zpool 
 create'.
 
 
 Thanks,
 
 Gleb Smirnoff
 




Re: [OpenZFS Developer] Review Request 226: 1778 Assertion failed: rn-rn_nozpool == B_FALSE, file ../common/libzfs_import.c, line 1077, function zpool_open_func

2015-05-27 Thread George Wilson

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.csiden.org/r/226/#review735
---

Ship it!


Ship It!

- George Wilson


On May 27, 2015, 1:51 p.m., Andrew Stormont wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.csiden.org/r/226/
 ---
 
 (Updated May 27, 2015, 1:51 p.m.)
 
 
 Review request for OpenZFS Developer Mailing List, George Wilson and Matthew 
 Ahrens.
 
 
 Bugs: 1778
 https://www.illumos.org/projects/illumos-gate//issues/1778
 
 
 Repository: illumos-gate
 
 
 Description
 ---
 
 Unnecessary assert in libzfs sometimes causes import to blow up.
 
 
 Diffs
 -
 
   usr/src/lib/libzfs/common/libzfs_import.c 
 19f2fbc57e53142863b745cd648eaa4fbaba282c 
 
 Diff: https://reviews.csiden.org/r/226/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 Andrew Stormont
 




Re: [OpenZFS Developer] Review Request 172: 5694 traverse_prefetcher does not prefetch enough

2015-03-09 Thread George Wilson


 On March 9, 2015, 3:36 p.m., Josef 'Jeff' Sipek wrote:
  usr/src/uts/common/fs/zfs/dmu_traverse.c, line 41
  https://reviews.csiden.org/r/172/diff/1/?file=14363#file14363line41
 
  why signed?
  
  Does 50MB have some special meaning?  Or is it just a random value > 12MB?

I left it as signed since that's what pd_bytes_fetched is.

The 50MB value is actually a conservative number and we saw good performance 
gains with our workload (mostly 8KB blocks). Remember that the 12MB used before
is only accurate if your stream is mostly 128KB blocks.


- George


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.csiden.org/r/172/#review566
---


On March 9, 2015, 2:34 p.m., Matthew Ahrens wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.csiden.org/r/172/
 ---
 
 (Updated March 9, 2015, 2:34 p.m.)
 
 
 Review request for OpenZFS Developer Mailing List and Christopher Siden.
 
 
 Bugs: 5694
 https://www.illumos.org/projects/illumos-gate//issues/5694
 
 
 Repository: illumos-gate
 
 
 Description
 ---
 
 5694 traverse_prefetcher does not prefetch enough
 Reviewed by: Matthew Ahrens mahr...@delphix.com
 Reviewed by: Alex Reece a...@delphix.com
 Reviewed by: Christopher Siden christopher.si...@delphix.com
 
 Original author: George Wilson
 
 
 Diffs
 -
 
   usr/src/uts/common/fs/zfs/dmu_traverse.c 
 14eef2e7516c9b5e717ef318321ee137dcd19c9f 
 
 Diff: https://reviews.csiden.org/r/172/diff/
 
 
 Testing
 ---
 
 ztest; zfs test suite
 
 http://jenkins/job/zfs-precommit/1821/
 
 
 Thanks,
 
 Matthew Ahrens
 




Re: [OpenZFS Developer] Review Request 167: 5669 altroot not set in zpool create when specified with -o

2015-02-28 Thread George Wilson

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.csiden.org/r/167/#review543
---

Ship it!


Ship It!

- George Wilson


On Feb. 27, 2015, 7:45 p.m., Xin LI wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.csiden.org/r/167/
 ---
 
 (Updated Feb. 27, 2015, 7:45 p.m.)
 
 
 Review request for OpenZFS Developer Mailing List and Christopher Siden.
 
 
 Bugs: 5669
 https://www.illumos.org/projects/illumos-gate//issues/5669
 
 
 Repository: illumos-gate
 
 
 Description
 ---
 
 Currently, 'altroot' is only set when zpool create sees '-R'. It should
 also be set if -o altroot= is used.
 
 Without this change, e.g. 'zpool create -o altroot=/install_target usr'
 would fail if /usr is not empty.
 
 
 Diffs
 -
 
   usr/src/cmd/zpool/zpool_main.c 8b992efcb4e4c98581e109dc1f983b6b9d2ef104 
 
 Diff: https://reviews.csiden.org/r/167/diff/
 
 
 Testing
 ---
 
 Tested on FreeBSD.
 
 The change can be tested with zpool create -o altroot=/somedir [an existing 
 non-empty directory in /] [some device] on Illumos.
 
 
 Thanks,
 
 Xin LI
 




Re: [OpenZFS Developer] Review Request 163: 5630 stale bonus buffer in recycled dnode_t leads to data corruption

2015-02-22 Thread George Wilson

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.csiden.org/r/163/#review530
---

Ship it!


Ship It!

- George Wilson


On Feb. 21, 2015, 5:34 a.m., Justin Gibbs wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.csiden.org/r/163/
 ---
 
 (Updated Feb. 21, 2015, 5:34 a.m.)
 
 
 Review request for OpenZFS Developer Mailing List and Christopher Siden.
 
 
 Bugs: 5630
 https://www.illumos.org/projects/illumos-gate//issues/5630
 
 
 Repository: illumos-gate
 
 
 Description
 ---
 
 Fix data corruption in ZFS triggered by rapidly destroying
 and creating datasets while listing datasets in another
 context.
 
 dnode_sync_free() resets the in-core dnode_t of a just
 deleted object so it represents an unallocated object.  The
 dnode_t will only be considered to represent a free dnode
 after its last hold is released.  This last hold may be
 dropped after dnode_sync_free() returns.  For data dbufs
 holding the dnode, this delay can be due to asynchronous
 evictions from the arc coincident with the dnode being
 destroyed.  For bonus buffers, this can happen if the object
 type can be destroyed even when held in another context.
 
 The observed data corruption occurred when a dsl_dir was
 destroyed for a recursive dataset destroy, while a zfs
 list operation also held this dsl_dir.  Although
 dnode_sync_free() calls dnode_evict_dbufs(), the hold on
 the dsl_dir's bonus buffer prevented it from being evicted.
 This left the bonus buffer associated with the dnode_t even
 after the last hold on the dnode was released.
 
 Some time later, this dnode_t was reused for a new dsl_dir.
 Instead of getting a new (and zero filled) bonus buffer,
 the contents from the old dsl_dir were returned.  The dsl_dir
 initialization code assumes the bonus buffer is zero filled
 and so only explicitly initializes fields that have non-zero
 initial values.  This allowed the stale data to be incorporated
 into the new dsl_dir and written to the media.
 
 Bonus buffers are only evicted via dnode_evict_dbufs() when
 there are no holds on the bonus buffer, and via
 dnode_buf_pageout/dnode_destroy().  The fix employed here
 evicts the bonus buffer for a freed dnode when the bonus
 buffer's last hold is dropped, but before the bonus buffer's
 dnode hold is released.  This ensures the dnode_t can only
 be reused after the bonus buffer eviction completes.
 
 sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode.c:
 sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dnode.h:
 sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c:
   Add dnode_rele_and_unlock(), which avoids an extra
   lock drop and acquisition in contexts where the
   dn_mtx is already held.
 
 sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c:
   In dbuf_rele_and_unlock(), immediately evict bonus
   buffers associated with freed dnodes.  The data in
   the bonus buffer is no longer relevant and this
   prevents a stale bonus buffer from being associated
   with the dnode_t should the dnode_t be reused prior
   to being destroyed.
 
 In dbuf_rele_and_unlock(), hold the bonus buffer's mutex
 until after DB_DNODE_EXIT().  This prevents another
 thread (e.g. dmu_objset_evict()->dnode_evict_dbufs())
 from evicting the bonus buffer and invalidating the
 dbuf's dnh before or during the DB_DNODE_ENTER/EXIT()
 calls.
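
 For illustration, such a helper would likely pair with the existing
 dnode_rele() like this (an assumed sketch based on the description
 above, not the literal diff):

 ```
 void
 dnode_rele(dnode_t *dn, void *tag)
 {
 	mutex_enter(&dn->dn_mtx);
 	dnode_rele_and_unlock(dn, tag);
 }
 ```

 Callers that already hold dn_mtx can then call dnode_rele_and_unlock()
 directly and skip the extra mutex drop and re-acquisition.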
 
 
 Diffs
 -
 
   usr/src/uts/common/fs/zfs/dnode.c 4e376fc4d920fcc3f3429a359ed394952d116b42 
   usr/src/uts/common/fs/zfs/dbuf.c c735c9694ff905586a26d33d6ffd36cfa41ce4cb 
   usr/src/uts/common/fs/zfs/sys/dnode.h 
 406954a65323b8de8aa28b07073f588bc1805062 
 
 Diff: https://reviews.csiden.org/r/163/diff/
 
 
 Testing
 ---
 
 ZFS test suite on FreeBSD. ztest on illumos.
 
 Continuous loop on FreeBSD of recursively receiving and destroying a tree of 
 18 file systems while running management code that responds to ZFS 
 create/destroy events and enumerates file systems.  This test would fail 
 within 1 hour prior to this fix.  It has run now for 2.5 days without failure.
 
 Additional stress test:
 Thread 1:
 ```
 while true; do
 zpool create foo da1
 zpool export foo
 zpool import foo
 zpool destroy foo
 done
 ```
 
 Thread 2:
 ```
 while true; do
 zpool status foo  /dev/null
 done
 ```
 
 
 Thanks,
 
 Justin Gibbs
 




Re: [OpenZFS Developer] Review Request 155: 5531 NULL pointer dereference in dsl_prop_get_ds()

2015-02-04 Thread George Wilson

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.csiden.org/r/155/#review477
---

Ship it!


Ship It!

- George Wilson


On Feb. 4, 2015, 4:59 a.m., Justin Gibbs wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.csiden.org/r/155/
 ---
 
 (Updated Feb. 4, 2015, 4:59 a.m.)
 
 
 Review request for OpenZFS Developer Mailing List.
 
 
 Bugs: 5531
 https://www.illumos.org/projects/illumos-gate//issues/5531
 
 
 Repository: illumos-gate
 
 
 Description
 ---
 
 Follow on to /r/153.
 
 uts/common/fs/zfs/dsl_dataset.c:
   Close race in dsl_dataset_try_add_ref() where the dsl_dataset_t
   still points to a valid bonus buffer, but that bonus buffer is
   held by a new dsl_dataset_t instance for this dataset.  Allow
   the add_ref operation only if the passed in dsl_dataset_t owns
   its referenced bonus buffer - the dbuf user for the bonus
   buffer is this dsl_dataset_t.
 
 
 Diffs
 -
 
   usr/src/uts/common/fs/zfs/dsl_dataset.c 
 ba9c766c65f64289b4e5e8c799a04ab3fde65e8e 
 
 Diff: https://reviews.csiden.org/r/155/diff/
 
 
 Testing
 ---
 
 ztest on illumos.
 
 
 Thanks,
 
 Justin Gibbs
 




Re: [OpenZFS Developer] Review Request 152: 5527 ::spa_space fails with mdb: couldn't find member dd_phys of type struct zfs`dsl_dir'

2015-01-27 Thread George Wilson

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.csiden.org/r/152/#review461
---

Ship it!


Ship It!

- George Wilson


On Jan. 27, 2015, 5:55 a.m., Justin Gibbs wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.csiden.org/r/152/
 ---
 
 (Updated Jan. 27, 2015, 5:55 a.m.)
 
 
 Review request for OpenZFS Developer Mailing List.
 
 
 Bugs: 5527
 https://www.illumos.org/projects/illumos-gate//issues/5527
 
 
 Repository: illumos-gate
 
 
 Description
 ---
 
 5527 ::spa_space fails with mdb: couldn't find member dd_phys of type struct 
 zfs`dsl_dir'
 
 
 Diffs
 -
 
   usr/src/cmd/mdb/common/modules/zfs/zfs.c 
 225cf3dee18e110149ce6c7180fa50a3928893f9 
 
 Diff: https://reviews.csiden.org/r/152/diff/
 
 
 Testing
 ---
 
 ```
 % mdb /dev/ksyms /dev/kmem
 Loading modules: [ unix genunix specfs mac cpu.generic uppc apix scsi_vhci 
 zfs mpt sd ip hook neti sockfs arp usba stmf stmf_sbd fctl lofs random idm 
 sppp ptm nfs cpc fcip ufs logindmux ]
  > ::walk spa | ::spa_space
 dd_space_towrite = 0M 0M 0M 0M
 dd_phys.dd_used_bytes = 41582M
 dd_phys.dd_compressed_bytes = 41166M
 dd_phys.dd_uncompressed_bytes = 41166M
 ms_allocmap = 0M 0M 0M 0M
 ms_freemap = 0M 0M 0M 0M
 ms_tree = 964M
 last synced avail = 9528M
 current syncing avail = 9528M
 ```
 
 
 Thanks,
 
 Justin Gibbs
 




Re: [OpenZFS Developer] Review Request 147: 5438 zfs_blkptr_verify should continue after zfs_panic_recover

2014-12-16 Thread George Wilson

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.csiden.org/r/147/#review407
---

Ship it!


Ship It!

- George Wilson


On Dec. 16, 2014, 6:39 a.m., Xin LI wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.csiden.org/r/147/
 ---
 
 (Updated Dec. 16, 2014, 6:39 a.m.)
 
 
 Review request for OpenZFS Developer Mailing List and Christopher Siden.
 
 
 Repository: illumos-gate
 
 
 Description
 ---
 
 zfs_blkptr_verify should use 'continue' after zfs_panic_recover or fail more
 gracefully.  Without this, the code would, e.g., test if vd is NULL and then
 access vd->vdev_ops, which defeats the purpose of using zfs_panic_recover.
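
 A sketch of the intended control flow (hypothetical and simplified; the
 message text and variable names are illustrative):

 ```
 if (vd == NULL) {
 	zfs_panic_recover("blkptr at %p DVA %u has invalid VDEV %llu",
 	    bp, i, (u_longlong_t)vdevid);
 	continue;	/* don't dereference vd->vdev_ops below */
 }
 ```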
 
 
 Diffs
 -
 
   usr/src/uts/common/fs/zfs/zio.c d0a42e9af13b7012ac48d4073ebe3520b069c4d5 
 
 Diff: https://reviews.csiden.org/r/147/diff/
 
 
 Testing
 ---
 
 On FreeBSD only.
 
 
 Thanks,
 
 Xin LI
 




Re: [OpenZFS Developer] Review Request 128: 5313 Allow I/Os to be aggregated across ZIO priority classes

2014-12-16 Thread George Wilson

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.csiden.org/r/128/#review408
---

Ship it!


Ship It!

- George Wilson


On Dec. 2, 2014, 12:22 p.m., Andriy Gapon wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.csiden.org/r/128/
 ---
 
 (Updated Dec. 2, 2014, 12:22 p.m.)
 
 
 Review request for OpenZFS Developer Mailing List, Justin Gibbs and George 
 Wilson.
 
 
 Bugs: 5313
 https://www.illumos.org/projects/illumos-gate//issues/5313
 
 
 Repository: illumos-gate
 
 
 Description
 ---
 
 sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h:
   Add two offset/lba based AVL trees to the vdev queue
   object.
 
 sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio.h:
   Add a second AVL node within each ZIO so that vdev_queue.c
   can sort ZIOs by both type and priority.
 
 sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_queue.c:
   Combine reads and writes, irrespective of their priorities,
   into unified, offset-sorted trees.  Selection of the
   ZIO to issue is unchanged, but aggregation now uses the
   unified tree of the appropriate type so that aggregation
   across priority classes is possible.
 
 Original author:  Justin T. Gibbs just...@spectralogic.com
 
 
 Diffs
 -
 
   usr/src/uts/common/fs/zfs/sys/zio.h 
 f6cf259bf71a741cfee8e290be93c0600c5c9240 
   usr/src/uts/common/fs/zfs/sys/vdev_impl.h 
 6d9bcb17d00fd70d1585b9c2218086ca2239aab6 
   usr/src/uts/common/fs/zfs/vdev_queue.c 
 f47c4cd1e26dfa635dcf28e4ea3cd0cc8de4e3d1 
 
 Diff: https://reviews.csiden.org/r/128/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 Andriy Gapon
 




Re: [OpenZFS Developer] Review Request 131: [5056] ZFS deadlock on db_mtx and dn_holds/Various improvements the dmu buf user API.

2014-12-15 Thread George Wilson

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.csiden.org/r/131/#review406
---



usr/src/cmd/truss/Makefile.com
https://reviews.csiden.org/r/131/#comment333

Block shell comments must start with a blank # and end with one too.



usr/src/cmd/zhack/Makefile.com
https://reviews.csiden.org/r/131/#comment334

Block comment is missing starting and ending # line.



usr/src/cmd/zinject/Makefile.com
https://reviews.csiden.org/r/131/#comment335

Again.



usr/src/lib/brand/solaris10/s10_brand/Makefile.com
https://reviews.csiden.org/r/131/#comment336

Again.



usr/src/lib/libzfs/Makefile.com
https://reviews.csiden.org/r/131/#comment337

Again.



usr/src/uts/common/fs/zfs/dbuf.c
https://reviews.csiden.org/r/131/#comment338

Shouldn't you specify the priority? Are there any ramifications of not
specifying min and max alloc values?



usr/src/uts/common/fs/zfs/dmu_objset.c
https://reviews.csiden.org/r/131/#comment339

Was there a problem with calling dsl_dataset_is_snapshot() and being more
explicit about the check? It seems like this could be error prone if sometime
down the road this assumption is changed. Making it explicit seems much safer
unless that logic is broken.



usr/src/uts/common/fs/zfs/dnode.c
https://reviews.csiden.org/r/131/#comment340

A comment here might be useful. Why don't we want special objects in the 
os_dnodes list?



usr/src/uts/common/fs/zfs/dsl_dir.c
https://reviews.csiden.org/r/131/#comment341

winner != NULL



usr/src/uts/common/fs/zfs/sa.c
https://reviews.csiden.org/r/131/#comment343

should be static.



usr/src/uts/common/fs/zfs/sa.c
https://reviews.csiden.org/r/131/#comment342

I'm not sure this is ever used, nuke it (and in sa.h too).



usr/src/uts/common/fs/zfs/spa_misc.c
https://reviews.csiden.org/r/131/#comment344

Update the comment to reflect the new behavior.



usr/src/uts/common/fs/zfs/sys/sa_impl.h
https://reviews.csiden.org/r/131/#comment345

Can you fix the whitespace while you're here?



usr/src/uts/intel/dev/Makefile
https://reviews.csiden.org/r/131/#comment346

Comment block requires start and end lines with just #.



usr/src/uts/intel/stmf_sbd/Makefile
https://reviews.csiden.org/r/131/#comment347

Comment block missing start and end lines.



usr/src/uts/sparc/stmf_sbd/Makefile
https://reviews.csiden.org/r/131/#comment348

Comment block missing start and end lines.


- George Wilson


On Dec. 15, 2014, 4:48 a.m., Justin Gibbs wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.csiden.org/r/131/
 ---
 
 (Updated Dec. 15, 2014, 4:48 a.m.)
 
 
 Review request for OpenZFS Developer Mailing List.
 
 
 Bugs: 5056
 https://www.illumos.org/projects/illumos-gate//issues/5056
 
 
 Repository: illumos-gate
 
 
 Description
 ---
 
 Various improvements the dmu buf user API.
 
 Submitted by: Justin Gibbs just...@spectralogic.com
 Submitted by: Will Andrews wi...@spectralogic.com
 Sponsored by: Spectra Logic Corporation
 
 Collect dmu buf user API support data into a new structure,
 dmu_buf_user_t.  Consumers of this interface must include a
 dmu_buf_user_t as a member of the user data structure that will be
 attached to a dmu buffer.  This reduces the size of dmu_buf_impl_t
 by two pointers.
 
 Queue dmu buf user eviction processing to a taskq.  This prevents
 FreeBSD witness(4) lock-order reversal warnings, potential deadlocks,
 and reduces stack depth.
 
 Convert objset eviction from a synchronous to an asynchronous process
 to accommodate the asynchronous invocation, via dmu buf user eviction,
 of dnode_buf_pageout().
 
 Modify existing users of the dmu buf user API to never access the dbuf to
 which their data was attached after user eviction has occurred.  Accessing
 the dbuf from the callback is no longer safe now that callbacks occur
 without the locks that used to protect them.  Enforce this in ZFS_DEBUG
 kernel builds by clearing the user's pointer to the dbuf, if any, at
 the time of eviction.  Callbacks have also been modified to clear their
 dbuf pointer so most errors are caught even on non ZFS_DEBUG kernels.
 However, this will not catch accesses from other contexts that occur
 between the time of eviction and the processing of the callback.
 
 Clarify programmer intent and improve readability by providing
 specialized functions for the common user data update actions remove
 and replace.
 
 Provide code-comment documentation for each API call.
 
 Perform runtime validation of proper API usage on ZFS_DEBUG kernels.
 
 uts/common/fs/zfs/sys/dbuf.h:
 uts/common/fs/zfs/dbuf.c:
   Add dbuf_verify_user() and call it from the dbuf user API and
   during dbuf eviction processing
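
 The consumer-side pattern this implies looks roughly like the following
 (a sketch; only dmu_buf_user_t comes from the description above, the
 other names are illustrative assumptions):

 ```
 typedef struct my_user_data {
 	dmu_buf_user_t	mud_dbu;	/* embedded, not pointed to */
 	uint64_t	mud_object;	/* ... consumer fields ... */
 } my_user_data_t;

 static void
 my_user_evict(void *dbu)
 {
 	my_user_data_t *mud = dbu;

 	/*
 	 * Eviction now runs asynchronously from a taskq; the dbuf must
 	 * not be touched here.
 	 */
 	kmem_free(mud, sizeof (*mud));
 }
 ```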

Re: [OpenZFS Developer] [zfs] Recover from corrupted space map (illumos #4390)

2014-07-07 Thread George Wilson

On 7/7/14, 3:33 AM, Jan Schmidt via illumos-zfs wrote:
  
On Wed, June 25, 2014 at 16:15 (+0200), Keith Wesolowski Via Illumos-zfs wrote:

On Wed, Jun 25, 2014 at 01:47:54PM +0200, Jan Schmidt via illumos-zfs wrote:


That patch looks somewhat promising, though I have not tried it yet. How did you
decide which of the overlapping space map ranges to drop? From my understanding,
either range might be the one that's currently correct, isn't it?

It's actually worse than that, because there are a lot of different
cases, depending on whether the overlapping ranges are alloc or free,
whether there are overlapping sub-ranges within them, whether they're
partial or complete overlaps, etc.  And then there is the possibility of
subsequent ranges that partially overlap the previous bad ones.  You
didn't mention which form of corruption you're hitting or how severe it
is, so I don't know which cases might apply to you.  zdb is helpful in
getting a handle on that.

I have a different patch (George gets most of the credit, I take most of
the blame) that I used to recover spacemap corruption we had at Joyent
(albeit from a different cause, 4504).  It's intended for one-time use;
you boot it, it fixes the spacemaps by leaking ambiguous regions,
preferring to lose a little space rather than risk later overwriting of
data, and condenses them back out, then you reboot onto normal bits
again.  This covers a lot more cases; I tested many of them, but there
may yet be edge cases that aren't addressed.  I recommend building a
libzpool with this first and trying zdb with that before booting with
the zfs module.

Thanks for the explanation. We recovered our data. Using the most recent illumos
code already helped in importing the pool read-only.


This comes with absolutely no warranty of any kind and should be used
only where dumping the data somewhere else (harder than you might think,
since you can't create snapshots in read-only mode) and recreating the
pool is not an option.  It's on you to understand what it does and why
and to satisfy yourself that it will solve your problem safely before
using it.  The comments might help a little, but you're really on your
own.

See
https://github.com/wesolows/illumos-joyent/commit/dc4d7e06c8e0af213619f0aa517d819172911005

After backing up all data, we applied this patch and the non-readonly pool
import no longer crashed, printing ...

Jul  1 11:00:44 hostname genunix: [ID 882369 kern.warning] WARNING: zfs: freeing
overlapping segments: [fba5d0cee00,fba5d0cfa00) existing segment
[fba5d05e600,fba5d0cf400)
Jul  1 11:02:59 hostname genunix: [ID 882369 kern.warning] WARNING: zfs: freeing
overlapping segments: [12cf8b202400,12cf8b203000) existing segment
[12cf8b1d0c00,12cf8b202a00)

... several times (like 10 times each). After that, a full scrub of the pool
succeeded without any messages.

Do you think it is safe to continue using the repaired pool, or would you still
recommend to recreate it?
If all of the cases were frees then you can continue using the pool and 
just realize that that space has been leaked and will never be 
allocatable. If the amount of space is significant then you may want to 
just recreate the pool.


Thanks,
George


Re: [OpenZFS Developer] zfs zio reordering in zio_vdev_io_start causing panics

2014-05-07 Thread George Wilson

On 5/7/14, 10:55 AM, Steven Hartland wrote:


- Original Message - From: George Wilson 
george.wil...@delphix.com

To: Steven Hartland kill...@multiplay.co.uk; developer@open-zfs.org
Sent: Wednesday, May 07, 2014 3:07 PM
Subject: Re: [OpenZFS Developer] zfs zio reordering in 
zio_vdev_io_start causing panics




On 5/7/14, 4:44 AM, Steven Hartland wrote:

- Original Message - From: George Wilson
I think there is probably another way we can solve this problem but 
I first want to get a better understanding of the corruption. We 
have not integrated the TRIM support upstream and I suspect that's 
the source of most of the problems. Can you confirm that with TRIM
disabled most of the corruption you've seen does not occur? I'm
trying to get context here since we've not seen this type of 
failure elsewhere.


In the case of TRIM, if BIO deletes are disabled, the free I/Os don't get
sent to physical media and instead instantly return
ZIO_PIPELINE_CONTINUE.

static int
vdev_geom_io_start(zio_t *zio)
{
	...
	switch (zio->io_type) {
	case ZIO_TYPE_FREE:
		if (vdev_geom_bio_delete_disable)
			return (ZIO_PIPELINE_CONTINUE);
	}
}


For your simple corruption case could you change the code to do the 
following:


if (vdev_geom_bio_delete_disable) {
	zio_interrupt(zio);
	return (ZIO_PIPELINE_STOP);
}

I believe this is the way it should behave. Let me know how this works
for you. I'm going to change the way the lower consumers work and do
some testing also.


So you're saying no path from vdev_*_io_start(..) that queues an I/O should
return ZIO_PIPELINE_CONTINUE?

If so, it looks like there are a number of locations that need fixing,
at least in FreeBSD head, e.g.
static int
vdev_file_io_start(zio_t *zio)
{
...
	if (!vdev_readable(vd)) {
		zio->io_error = SET_ERROR(ENXIO);
		return (ZIO_PIPELINE_CONTINUE);
	}
-
static int
vdev_mirror_io_start(zio_t *zio)
{
...

	if (zio->io_type == ZIO_TYPE_READ) {
		if ((zio->io_flags & ZIO_FLAG_SCRUB) && !mm->mm_replacing) {
		...
			return (ZIO_PIPELINE_CONTINUE);
		}
...

	return (ZIO_PIPELINE_CONTINUE);
}


Also, should that be asserted to make it clear and testable? e.g.
static int
zio_vdev_io_start(zio_t *zio)
{
	zio_t *sio = zio;
...
	ret = vd->vdev_ops->vdev_op_io_start(zio);
	ASSERT(ret != ZIO_PIPELINE_CONTINUE || sio == zio);

	return (ret);
}

This would also ensure the stack overflow issue can't occur.

   Regards
   Steve
There are a couple of cases where it can work, but I'm going to make it
such that it requires you to always return ZIO_PIPELINE_STOP.
Otherwise it makes it too easy to introduce a programming error. For
example, vdev_mirror_io_start() could return ZIO_PIPELINE_CONTINUE but
vdev_disk_io_start() shouldn't.


I'll try to have a diff ready soon. I will be interested in your test 
results.


- George


Re: [OpenZFS Developer] zfs zio reordering in zio_vdev_io_start causing panics

2014-05-07 Thread George Wilson

On 5/7/14, 11:46 AM, Steven Hartland wrote:
- Original Message - From: George Wilson 
george.wil...@delphix.com


There are a couple of cases where it can work, but I'm going to make
it such that it requires you to always return ZIO_PIPELINE_STOP.
Otherwise it makes it too easy to introduce a programming error.
For example, vdev_mirror_io_start() could return
ZIO_PIPELINE_CONTINUE but vdev_disk_io_start() shouldn't.


I'll try to have a diff ready soon. I will be interested in your test 
results.


I'm going to give this a go but requeuing the request in the task handler
instead of processing it directly sounds like quite a waste.

What's the objection to continuing directly with the updated zio?

   Regards
   Steve
It's a safety issue. If you need to change the zio then you effectively 
need to execute a new pipeline for that zio. If you don't want to hand 
this off to a different taskq then call zio_execute() instead of 
zio_interrupt(). For the lower layers it makes sense to call 
zio_interrupt() since we typically call the completion routines from the 
interrupt taskqs anyway.
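
In other words, any start routine that finishes a zio itself should 
follow this shape (a sketch only; vdev_example_io_start is a made-up 
name):

static int
vdev_example_io_start(zio_t *zio)
{
        if (!vdev_readable(zio->io_vd)) {
                zio->io_error = SET_ERROR(ENXIO);
                /*
                 * The caller's zio may no longer be this zio, so hand
                 * it back to the pipeline instead of returning
                 * CONTINUE: zio_execute() runs it in this thread,
                 * zio_interrupt() defers it to the interrupt taskqs.
                 */
                zio_interrupt(zio);
                return (ZIO_PIPELINE_STOP);
        }
        /* ... otherwise issue the I/O to the device ... */
        return (ZIO_PIPELINE_STOP);
}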


- George


___
developer mailing list
developer@open-zfs.org
http://lists.open-zfs.org/mailman/listinfo/developer


Re: [OpenZFS Developer] zfs zio reordering in zio_vdev_io_start causing panics

2014-05-07 Thread George Wilson

On 5/7/14, 12:50 PM, Steven Hartland wrote:
- Original Message - From: George Wilson 
george.wil...@delphix.com




On 5/7/14, 11:46 AM, Steven Hartland wrote:
- Original Message - From: George Wilson 
george.wil...@delphix.com


There are a couple of cases where it can work but I'm going to make 
it such that it requires you to always return ZIO_PIPELINE_STOP. 
Otherwise it makes it too easy to introduce a programming error. 
For example, vdev_mirror_io_start() could return 
ZIO_PIPELINE_CONTINUE but vdev_disk_io_start() shouldn't.


I'll try to have a diff ready soon. I will be interested in your 
test results.


I'm going to give this a go but requeuing the request in the task handler
instead of processing it directly sounds like quite a waste.

What's the objection to continuing directly with the updated zio?

It's a safety issue. If you need to change the zio then you 
effectively need to execute a new pipeline for that zio. If you don't 
want to hand this off to a different taskq then call zio_execute() 
instead of zio_interrupt(). For the lower layers it makes sense to 
call zio_interrupt() since we typically call the completion routines 
from the interrupt taskqs anyway.


Thanks for the explanation George, always good to get better insight into
the reasons behind things. This is the sort of thing that would be
great to capture in pipeline description comments for future readers.

Initial tests show the TRIM code still trips on the case where errors
return ZIO_PIPELINE_CONTINUE, e.g.
        zio->io_error = SET_ERROR(ENOTSUP);
        return (ZIO_PIPELINE_CONTINUE);

I'm assuming in these cases we should be looking at the following 
instead?

        zio->io_error = SET_ERROR(ENOTSUP);
        zio_interrupt(zio);
        return (ZIO_PIPELINE_STOP);

And the same with:
        switch (zio->io_type) {
        case ZIO_TYPE_IOCTL:
                /* XXPOLICY */
                if (!vdev_readable(vd)) {
                        zio->io_error = SET_ERROR(ENXIO);
                        return (ZIO_PIPELINE_CONTINUE);
                }
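
i.e. presumably becoming something like (my reading of the suggested 
pattern, untested):

        switch (zio->io_type) {
        case ZIO_TYPE_IOCTL:
                /* XXPOLICY */
                if (!vdev_readable(vd)) {
                        zio->io_error = SET_ERROR(ENXIO);
                        /* complete via the interrupt path */
                        zio_interrupt(zio);
                        return (ZIO_PIPELINE_STOP);
                }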


   Regards
   Steve

Steve,

That's correct. I will add some more documentation about this with my 
proposed change.


Thanks,
George

___
developer mailing list
developer@open-zfs.org
http://lists.open-zfs.org/mailman/listinfo/developer


Re: [OpenZFS Developer] zfs zio reordering in zio_vdev_io_start causing panics

2014-05-06 Thread George Wilson

Steven,

I think there is probably another way we can solve this problem but I 
first want to get a better understanding of the corruption. We have not 
integrated the TRIM support upstream and I suspect that's the source of 
most of the problems. Can you confirm that with TRIM disabled most of 
the corruption you've seen does not occur? I'm trying to get context here 
since we've not seen this type of failure elsewhere.


Since I'm not familiar with the TRIM implementation in FreeBSD I was 
wondering if you could explain the scenario that leads to the 
corruption. The fact that the pipeline doesn't allow the zio to change 
mid-pipeline is actually intentional so I don't think we want to make a 
change to allow this to occur. From code inspection it does look like 
the vdev_*_io_start() routines should never return 
ZIO_PIPELINE_CONTINUE. I will look at this more closely but it looks like 
there is a bug there.


Anyway, if you could give me more details to the corruption I would be 
happy to help you design a way that this can be implemented while still 
ensuring that a zio cannot change while the pipeline is still active. 
Thanks for diving into this and I will post more about the bugs that 
look to exist in the vdev_*_io_start() routines.


Thanks,
George

On 4/25/14, 8:53 PM, Steven Hartland wrote:

I've been working on adding IO priority support for TRIM back into
FreeBSD after the import of the new IO scheduling from illumos.

Based on avg's initial work and having got my head around the
requirements of the new scheduler I came up with the attached
zz-zfs-trim-priority.patch.

Most of the time this worked fine but as soon as bio_delete requests
were disabled using the following I started getting panics:
sysctl vfs.zfs.vdev.bio_delete_disable=1

A simple dd is enough to trigger the panic e.g.
dd if=/dev/zero of=/data/random.dd bs=1m count=10240

The wide selection of panics all seemed to indicate queue corruption,
with the main one erroring in vdev_queue_io_to_issue on the line:
zio = avl_first(&vqc->vqc_queued_tree);

After a day of debugging and adding lots of additional validation
checks it became apparent that after removing a zio from vq_active_tree
both vq_active_tree and the associated vqc_queued_tree become corrupt.

By corrupt I mean that avl_numnodes is no longer in sync with a manual
count of the nodes using a tree walk.

In each case the vq_active_tree.avl_numnodes is one less than its actual
number of nodes and vqc_queued_tree.avl_numnodes is one greater than
its actual number of nodes.

After adding queue tracking to zios it turned out that
vdev_queue_pending_remove was trying to remove a zio from vq_active_tree
which wasn't in that tree but was in the write vqc_queued_tree.

As avl_remove doesn't do any validation on the node being present in
the tree it merrily tried to remove it, resulting in nastiness in both
trees.
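
One cheap way to catch this class of bug earlier would be to assert
membership before the removal, along these lines (a debugging sketch;
field names as in the post-4045 vdev_queue.c):

        /*
         * avl_remove() does no validation, so assert the zio really
         * is in the active tree before taking it out; a mismatch
         * here silently corrupts avl_numnodes in both trees.
         */
        ASSERT(avl_find(&vq->vq_active_tree, zio, NULL) == zio);
        avl_remove(&vq->vq_active_tree, zio);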

The cause of this is in zio_vdev_io_start, specifically:
        if ((zio = vdev_queue_io(zio)) == NULL)
                return (ZIO_PIPELINE_STOP);

This can result in a different zio reaching:
        return (vd->vdev_ops->vdev_op_io_start(zio));

When this happens and vdev_op_io_start returns ZIO_PIPELINE_CONTINUE,
e.g. for TRIM requests when bio_delete_disable=1 is set, the calling
zio_execute continues the pipeline for the zio it called
zio_vdev_io_start with, but that zio hasn't been processed and hence
isn't in the vq_active_tree but in one of the vqc_queued_trees.

It's not clear if any other paths can have their vdev_op_io_start
return ZIO_PIPELINE_CONTINUE but it definitely looks that way and
may well explain other panics I've seen in this area when, for example,
disks dropped.

I'm unsure if there's a more elegant fix but allowing pipeline stages
to change the processing zio by passing in a zio_t **ziop instead
of zio_t *zio as in the attached zfs-zio-queue-reorder.patch fixes the
issue.
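
The heart of that patch is just threading the zio pointer back out of
the stage, roughly (a sketch of the approach rather than the exact diff):

static int
zio_vdev_io_start(zio_t **ziop)
{
        zio_t *zio = *ziop;
        ...
        if ((zio = vdev_queue_io(zio)) == NULL)
                return (ZIO_PIPELINE_STOP);
        /* let zio_execute() carry on with the zio we actually issued */
        *ziop = zio;
        ...
        return (vd->vdev_ops->vdev_op_io_start(zio));
}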

Note: Patches are based on FreeBSD 10-RELEASE + some backports from
10-STABLE, mainly r260763: 4045 zfs write throttle & i/o scheduler,
so should apply to 10-STABLE and 11-CURRENT.

   Regards
   Steve




___
developer mailing list
developer@open-zfs.org
http://lists.open-zfs.org/mailman/listinfo/developer


Re: [OpenZFS Developer] permanent error in a pool

2014-02-26 Thread George Wilson
Hmm, 'zpool scrub' should be able to clear that error. Can you tell me 
which options you used with 'zinject' and I'll try reproducing it here.


Thanks,
George

On 2/26/14 4:22 AM, Andriy Gapon wrote:

I did some experimenting and used zinject to introduce some non-correctable data
errors while reading a file. After that I destroyed a filesystem with the file
and all its snapshots, but the error was still reported in zpool status -v
output.  Maybe this is expected, although not very intuitive.  But what is
definitely unexpected to me is that neither zpool clear nor zpool scrub is able
to clear that error.  Looks like it is stuck with the pool until it is
destroyed.

$ zpool status -v
   pool: pond
  state: ONLINE
status: One or more devices has experienced an error resulting in data
 corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
 entire pool from backup.
see: http://illumos.org/msg/ZFS-8000-8A
   scan: scrub repaired 0 in 47m with 0 errors on Wed Feb 26 11:13:30 2014
config:

        NAME                                            STATE     READ WRITE CKSUM
        pond                                            ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/fcf3558b-493b-11de-a8b9-001cc08221ff  ONLINE       0     0     0
            gptid/48782c6e-8fbd-11de-b3e1-00241d20d446  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

 0x1b2d:0xb





___
developer mailing list
developer@open-zfs.org
http://lists.open-zfs.org/mailman/listinfo/developer


Re: [OpenZFS Developer] l2arc_write_buffers scanning efficiency

2014-01-14 Thread George Wilson

LGTM.

- George

On 1/14/14 12:57 PM, Saso Kiselkov wrote:

On 1/14/14, 11:51 AM, Andriy Gapon wrote:

Recently I noticed that on one of our systems l2arc_feed_thread was consuming
a not insignificant amount of CPU time.  I looked at some stats (FreeBSD has local
changes that add a few kstats for L2ARC) and it seems that
the thread was busy scanning and re-scanning buffers that are already in L2ARC.
Buffers were being skipped because l2arc_write_eligible was returning false
because of ab->b_l2hdr != NULL.  As far as I can see the thread was doing a pass
per second and it typically scanned about 2 million buffers per pass.  Typically
walking over a buffer list was aborted due to passed_sz > headroom.

The system in question has a quite large ARC with maximum size of 60GB.  26GB
were actually in use and it seems that most of the buffers were rather small,
hash_elements == 3634055.

Perhaps, there could be some optimization to avoid pointless walking over
millions of buffers in situations like this?
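
For reference, the eligibility test being tripped looks roughly like
this (paraphrased from arc.c of that vintage; check the actual source):

static boolean_t
l2arc_write_eligible(uint64_t spa_guid, arc_buf_hdr_t *ab)
{
        /*
         * A buffer that is already cached on L2ARC (b_l2hdr set) is
         * skipped, so with a warm cache each feed pass re-walks the
         * same millions of headers just to reject them.
         */
        if (ab->b_spa != spa_guid || ab->b_l2hdr != NULL ||
            HDR_IO_IN_PROGRESS(ab) || !HDR_L2CACHE(ab))
                return (B_FALSE);

        return (B_TRUE);
}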

P.S. Because of another local FreeBSD change the feed thread was scanning about
16 times more buffers than it would on illumos, so the issue was more prominent
with the thread consuming about 40% of a core.

That leak should have been fixed... yep found the webrev:
http://cr.illumos.org/~webrev/skiselkov/3995/
Unfortunately, it appears I dropped the ball on this one and forgot to
submit an RTI for it.

Could people please give me a quick ok on the above webrev again? I've
updated it in place. All it really does is move the
l2arc_release_cdata_buf step in front of the mutex_trylock - since the
b_tmp_cdata pointer is thread-local, we don't need to grab locks to
manipulate it. The rest of the changes are just renaming 'abl2' to
'l2hdr' to be consistent across all of arc.c.

Cheers,


___
developer mailing list
developer@open-zfs.org
http://lists.open-zfs.org/mailman/listinfo/developer


Re: [OpenZFS Developer] Metaslab allocation question

2013-12-27 Thread George Wilson

Ilya,

I think there is a bigger problem here that your vdev_mirror() change 
would cover up. The problem is that we should never have a bp with a 
dva that is out of range. So we need to understand how that is occurring.


As for your question about metaslab_allocate_dva(), I'm assuming you're 
referring to this code path:


} else if (d != 0) {
        vd = vdev_lookup_top(spa, DVA_GET_VDEV(&dva[d - 1]));
        mg = vd->vdev_mg->mg_next;

This vdev can't be NULL since the configuration should not have changed 
while we're in the middle of doing allocations.


Even for gang blocks we should never have a NULL vdev. I'm assuming 
you're referring to this logic:


if (vd != NULL) {
        mg = vd->vdev_mg;

        if (flags & METASLAB_HINTBP_AVOID &&
            mg->mg_next != NULL)
                mg = mg->mg_next;
} else {
        mg = mc->mc_rotor;
}

This is meant to handle log devices, which can be removed, since the ZIL 
may have a reference to an old log block on a device that no longer exists.


I think we need to find the root cause of where this out of range dva is 
coming from. If you have old core files that you could share please 
point me at them.


Thanks,
George

On 12/18/13 1:33 PM, Ilya Usvyatsky wrote:
I am looking at a rare but nasty case of corruption in which a block 
pointer has one of its ditto DVAs out of range.

This causes vdev_lookup_top() to return NULL for a vdev pointer.
Going through the callers of vdev_lookup_top(), I noticed that in some 
cases a NULL vdev is handled properly, while in others it is not.
In particular, in the case of an import, ditto blocks are handled by a 
mirror vdev code path that does not have proper handling for NULL and 
would panic the system.


My question, though, is not about the mirror (for which I have a fix 
that I will upstream eventually), but about the metaslab allocator.
In particular, I am looking at metaslab_allocate() and 
metaslab_allocate_dva().
In the latter function a NULL vdev is properly handled in the case of a 
gang block (since a hint may be stale and the device may be gone).
However, that very same function does not handle a NULL vdev in the case 
of a regular block pointer with ditto blocks.
The dilemma I am facing here is that I can either just use the rotor 
(i.e., mimic gang block behavior) or return an error immediately.
In the latter case, the caller, metaslab_allocate(), would handle it 
properly.
I am inclined to go with the second option, but I would greatly 
appreciate an insight on this from someone who is familiar with the 
internal logic and the theory behind the metaslab allocator.
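
Concretely, the second option would amount to something like this in 
metaslab_allocate_dva() (a sketch of the proposed handling, not a 
reviewed fix):

        } else if (d != 0) {
                vd = vdev_lookup_top(spa, DVA_GET_VDEV(&dva[d - 1]));
                if (vd == NULL) {
                        /*
                         * A corrupt bp can hand us a ditto DVA whose
                         * vdev id is out of range; fail this allocation
                         * and let the caller handle the error.
                         */
                        return (SET_ERROR(ENXIO));
                }
                mg = vd->vdev_mg->mg_next;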


Thank you very much,
Ilya.





___
developer mailing list
developer@open-zfs.org
http://lists.open-zfs.org/mailman/listinfo/developer


Re: [OpenZFS Developer] zio parent and zio children relation.

2013-12-17 Thread George Wilson


On 12/17/13 5:21 AM, Gaurav Mahajan wrote:

Hi all,

I am trying to understand the relationship between a root ZIO and its children ZIOs.

This is what I have understood. Please correct me if I'm wrong.

Usually, whenever we want to do a series of I/O operations, like in the 
sync thread, we create a root ZIO with zio_root(). This root ZIO becomes 
the parent of every ZIO that we create while syncing the async data to 
disk (in dbuf_sync_leaf).


All the child ZIOs are issued using zio_nowait().
After issuing all the children ZIOs, at the end we call zio_wait() on 
the root ZIO.


So the question that comes to my mind is: after zio_wait() for the root 
ZIO is over, are we guaranteed that all the children ZIOs are complete?


Yes, the root zio cannot complete until all its children have completed.
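
For illustration, the pattern you describe is (a minimal sketch, details 
elided):

        zio_t *rio = zio_root(spa, NULL, NULL, ZIO_FLAG_MUSTSUCCEED);

        /* each child is created with rio as its parent */
        for (...)
                zio_nowait(zio_write(rio, spa, txg, ...));

        /*
         * Does not return until every child has run its full
         * pipeline, io_done callbacks included.
         */
        (void) zio_wait(rio);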

- George


Complete in the sense that block allocation and the data write are done 
and the io_done callbacks are complete.

I may be wrong with my understanding. Please correct me.

Thanks !!!
Gaurav.





___
developer mailing list
developer@open-zfs.org
http://lists.open-zfs.org/mailman/listinfo/developer


Re: [OpenZFS Developer] priority for TRIM zio

2013-11-30 Thread George Wilson

Andriy,

The fix looks good but I have a couple of questions:

1. Are you sure you want the TRIM priority to be lower than SCRUB?
2. You have zfs_vdev_trim_max_active set to 1 with an /* XXX */ next to 
it. Does this mean you haven't settled on the final value for max active?


Thanks,
George

On 11/30/13 10:13 AM, Andriy Gapon wrote:

[resending with CC fixed]

Matt,

as you most likely know, ZFS/FreeBSD already has the ability to use the TRIM
command with disks that support it.
As the command can be quite slow and it is a maintenance kind of command, it
makes sense to assign a very low priority to it.

I've come up with the following change that introduces a new priority for the
TRIM zios:
http://people.freebsd.org/~avg/zfs-trim-priority.diff
To be honest, this change actually attempts to restore what we already had in
FreeBSD before I merged your write throttle & i/o scheduler performance work.

Could you please review the change?
I am not sure if I correctly translated my intent to the min_active and
max_active values.  I will greatly appreciate your help with these.
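
For context, this is roughly where the values plug into the scheduler
(shape of the change only, not the actual diff; the TRIM case is what
the patch adds):

/* XXX */
uint32_t zfs_vdev_trim_min_active = 1;
uint32_t zfs_vdev_trim_max_active = 1;

static int
vdev_queue_class_min_active(zio_priority_t p)
{
        switch (p) {
        ...
        case ZIO_PRIORITY_TRIM:
                return (zfs_vdev_trim_min_active);
        ...
        }
}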

Thank you!




___
developer mailing list
developer@open-zfs.org
http://lists.open-zfs.org/mailman/listinfo/developer