[zfs-code] ARC cache reference counting

2009-06-02 Thread Mark Maybee
Jeremy Archer wrote:
> I started to look at ref counting to convince myself that the db_buf field in 
> a cached dmu_buf_impl_t object is guaranteed to point at a valid arc_buf_t.
> 
> I have seen a  "deadbeef" crash on a busy system when zfs_write() is 
> pre-pagefaulting in 
> the file's pages.
> 
> The page fault handler eventually winds its way to dbuf_hold_impl(), which 
> manages to find a cached dmu_buf_impl_t record.  This record, however, points 
> to a freed arc_buf_t via its b_data field.  The field is not null, but it 
> points to a freed object, hence the crash upon trying to lock the rwlock of 
> the alleged arc_buf.
> 
> Ref counting  should prevent something like this, correct?

Correct.  If you are running recent bits and have a core file please
file a bug on this.

-Mark



[zfs-code] ARC cache reference counting

2009-06-22 Thread Mark Maybee
Hi Jeremy,

Jeremy Archer wrote:
> Hello,
> 
> I believe the following is true, correct me if it is not:
> 
> If more than one object references a block (e.g. 2 files have the same block 
> open), there must be multiple clones of the arc_buf_t (and associated 
> dmu_buf_impl_t) records present, one for each of the objects.  This is always 
> so, even if the block is not modified, "just in case the block should end up 
> being modified".
> So: if there are 100 files accessing the same block in the same txg, there 
> will be 100 clones of the data, even if none of the files ultimately modifies 
> this block.  Seems a bit wasteful.
> 
> This does not feel like COW to me; rather, "copy always, just in case", at 
> least in the arc/dmu realm.

Correct.  Memory management is not currently "COW".  Note that,
currently, this is not a significant issue for most environments.
The only way to "share" a data block is if the same file is accessed
simultaneously from multiple clones and/or snapshots.  This is rare.
However, with the advent of dedup (coming soon), this will become a
bigger issue.

> I fail to see why the above scenario should not be able to get by with a 
> single, shared, reference-counted record.  A clone would only have to be made 
> of a block if a given file decides to modify the block.  As it is, reference 
> counting is significantly complicated by mixing it with this pre-cloning. 
> 
The "simple solution" you propose is actually quite complicated to
implement.  We are working on it though.

> On to some code comprehension questions:
> 
> It seems to be that the conceptual model of a file in the dmu layer:
> A number of dmu buffers hanging off of a dnode (i.e. the per-dnode list 
> formed via the db_link "list enabler").  Not all blocks of the file are in 
> this list, only the "active" ones.  I take "active" to mean "recently 
> accessed".
> 
> There is a somewhat opaque aspect to dmu, that is missing from the otherwise 
> excellent data structure chart.  I am talking about dirty buffer management.  
>  
> 
> db_data_pending?  db_last_dirty?  db_dirtycnt?  Could someone provide the 
> 10K mile overview on dirty buffers?
> 
Dirty buffers are buffers that have been modified, and so must be
written to stable storage.  Because IO is staged to disk, we have to
manage these buffers in separate lists: a dirty buffer goes onto the
list corresponding to the txg it belongs to.

> 
> The dbuf_states are a bit of a mystery: 
> 
> What is the difference between "DB_READ" and "DB_FILL"? 
> 
> My guess, maybe the data is coming from a different direction into the cache.
> From below: Read from disk (maybe).

Yes.

> From above: Nascent data coming from an application (newly created data?).
> 
Yes.

> I am guessing DB_NOFILL is a short-circuit path to throw obsoleted data away. 
> It would be nice to comment the states (beyond an unexplained state 
> transition diagram).

This is used when we are pre-allocating space.  It is only used in
special circumstances (i.e., creating a swap device).

> 
> ZFS would be  more approachable to newcomers if the code was 
> a bit more commented.  
> I am not talking about copious comments, just every field 
> in the major data structures, and minimum a one-liner per function as to what 
> the function does.
> 
> Yes, given enough perseverance and a lot of time one can figure
> everything out from studying the usage patterns but the pain of this 
> could be lessened.  
> 
> The more people understand ZFS, the stronger it will become.

I agree.  We haven't been as good as we should be about commenting.
You are welcome to submit code updates that improve this. :-)

-Mark



[zfs-code] ARC cache reference counting

2009-06-23 Thread Mark Maybee
Jeremy Archer wrote:
> Thanks for the explanation.
> 
>>   a dirty
>> buffer goes onto the
>> list corresponding to the txg it belongs to.
> 
> Ok. I see that all dirty buffers are put on a per txg list.  
> This is for easy synchronization, makes  sense. 
>   
> The per dmu_buf_impl_t details are a bit fuzzy.
> 
> I see there can be more than one dirty buffer per dmu_buf_impl_t.
> Are these multiple "edited versions" of the same buffer that occurred in the 
> same txg?
> 
If a dbuf is "dirtied" in more than one txg, then it will have multiple
dirty records.  So these are the "edited versions" separated by
transaction group boundaries.

> Looks like the dirty buffer (leaf) records point to an arc_buf_t, which is 
> presumably yet un-cached (until it is synched out).
> 
Correct.  It becomes "cached" when it obtains an "identity", which does
not happen until we write it out.

> db_data_pending points to the most current of these.  Seems that 
> db_data_pending 
> always points to the last (newest) dirty buffer for the dmu buffer.
> 
No, db_data_pending points to the record that is currently being
written.

> At the moment, the purpose of maintaining the per-dbuf dirty list eludes me.
> (Maybe to dispose of these, except for the latest one, which will be written 
> out to disk?)

See my comment above.  We must maintain separate versions or else we get
"future leaks" (data modified in txg n+1 being written out as part of txg n)
and possibly trash our checksum (we must not change the data once the
checksum is calculated).
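
For illustration, a simplified sketch of the per-txg dirty record idea
described above (these are not the actual dbuf.h definitions; the names
are made up):

    /*
     * Simplified sketch only.  Each dbuf keeps a chain of dirty records,
     * one per txg in which it was modified, so the version being synced
     * for txg N is never touched by writes landing in txg N+1 -- this is
     * what prevents "future leaks" and checksum trashing.
     */
    typedef struct example_dirty_record {
            uint64_t                     dr_txg;   /* txg this version belongs to */
            struct example_dirty_record *dr_next;  /* next (older) dirty record */
            void                        *dr_data;  /* the edited data for dr_txg */
    } example_dirty_record_t;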

> 
> Also: what is  BP_IS_HOLE?  I see it has no birth_txg.



[zfs-code] ARC cache reference counting

2009-06-01 Thread Mark Maybee
Jeremy Archer wrote:
> Greets,
> 
> I have read a couple of earlier posts by Jeff and Mark Maybee explaining how 
> Arc reference counting works. 
> These posts did help clarify this piece of code (a bit complex, to say 
> the least).
> I would like to solicit more comments elucidating ARC reference counting.
> 
> The usage pattern I see in arc.c does not seem to be following a simple 
> pattern, like the one we see in VFS.
> The crux of the matter: who are the "users" that own these references?
> 
> Perhaps I am  misreading the code but this is how it looks to me:
> 
> arc_buf_hdr->b_count  keeps track of simultaneous readers of an arc_buf_hdr 
> who called with a non-null callback parameter. The expected scenario is 
> simultaneous access from 2 or more files (which are clones or snapshots).
> 
Correct.

> The reason for the ref count and the buffer cloning: to maintain cache 
> integrity 
> when multiple readers access the same arc cache entry, and one of these 
> modifies the entry and ultimately ends up releasing the arc entry from the 
> cache.
> 
Correct.

> At such time, one of the "users" needs an anonymous entry that can be dirtied 
> and written out to a different spot, while the other needs an unchanged ARC 
> entry.
> 
Correct.

> This was probably expected to be a relatively rare occurrence, and it is not 
> expected that a large number of simultaneous  readers would access the same 
> ARC entry.

Correct.

> I am a bit puzzled why a new ARC entry could not be cloned in arc_release, 
> but 
> perhaps there is a good reason why pre-allocating this data is better than 
> the JIT alternative.
> 
We have already handed out a reference to the data at the point that the
buffer is being released, so we cannot allocate a new block "JIT".

> I notice that prefetch functions (dbuf_prefetch and traverse_prefetcher) call 
> arc_read_nolock without a callback parameter, I wonder if this could create  
> a problem if a prefetch function, and a "regular" read simultaneously access 
> the same ARC cache entry. (Cloning in this case would not happen, so one 
> thread could end up releasing the entry from the cache while the other is 
> messing with it).
> 
No, a "regular" read request on a prefetch buffer will end up adding a
callback record.

> Comments and corrections to my interpretation are cordially solicited.

Hope that helps.

-Mark



[zfs-code] Disk Writes

2009-02-05 Thread Mark Maybee
Ben Rockwood wrote:
> I need some help with clarification.
> 
> My understanding is that there are 2 instances in which ZFS will write
> to disk:
> 1) TXG Sync
> 2) ZIL
> 
> Post-snv_87 a TXG should sync out when the TXG is either over filled or
> hits the timeout of 30 seconds.
> 
> First question is... is there some place I can see what this max TXG
> size is?  If I recall its 1/8th of system memory... but there has to be
> a counter somewhere right?
> 
There is both a memory throttle limit (enforced in arc_memory_throttle)
and a write throughput throttle limit (calculated in dsl_pool_sync(),
enforced in dsl_pool_tempreserve_space()).  The write limit is stored as
the 'dp_write_limit' for each pool.

> I'm unclear on ZIL writes.   I think that they happen independently of
> the normal txg rotation, but I'm not sure.
> 
> So the second question is: do they happen with a TXG sync (expedited) or
> independent of the normal TXG sync flow?
> 
> Finally, I'm unclear on exactly what constitutes a TXG Stall.  I had
> assumed that it indicated TXG's that exceeded the allotted time, but
> after some dtracing I'm uncertain.
> 
I'm not certain what you mean by: "TXG Stall".

> Any help is appreciated.
> 
> benr.
> 



[zfs-code] Disk Writes

2009-02-05 Thread Mark Maybee
Ben Rockwood wrote:
> Mark Maybee wrote:
>> Ben Rockwood wrote:
>>> I need some help with clarification.
>>>
>>> My understanding is that there are 2 instances in which ZFS will write
>>> to disk:
>>> 1) TXG Sync
>>> 2) ZIL
>>>
>>> Post-snv_87 a TXG should sync out when the TXG is either over filled or
>>> hits the timeout of 30 seconds.
>>>
>>> First question is... is there some place I can see what this max TXG
>>> size is?  If I recall its 1/8th of system memory... but there has to be
>>> a counter somewhere right?
>>>
>> There is both a memory throttle limit (enforced in arc_memory_throttle)
>> and a write throughput throttle limit (calculated in dsl_pool_sync(),
>> enforced in dsl_pool_tempreserve_space()).  The write limit is stored as
>> the 'dp_write_limit' for each pool.
> 
> I cooked up the following:
> $ dtrace -qn fbt::dsl_pool_sync:entry'{ printf("Throughput is \t %d\n
> write limit is\t %d\n\n", args[0]->dp_throughput,
> args[0]->dp_write_limit); }'
> Throughput is     883975129
> write limit is    3211748352
> 
> I'm confused with regard to the units and interpretation. 
> 
> For instance, the write limit here is almost 3GB on a system with 4GB of
> RAM.  However, if I read the code right the value here is already
> inflated *6... so the real write limit is actually 510MB right?
> 
The write_limit is independent of the memory size.  It's based purely
on the IO bandwidth available to the pool.  So a write_limit of 3GB
implies that we think we can push 3GB of (inflated) data to the drives
in 5 seconds.  If we take out the inflation, this means we think we
can push about 100MB/s to the pool's drives.
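
For concreteness, a rough back-of-the-envelope using the numbers quoted
above (the factor of 6 is the inflation factor mentioned earlier in the
thread):

    3211748352 bytes (inflated write limit) / 6 ~= 510 MB of real data per sync
    510 MB / 5 s                                ~= 100 MB/s to the pool's drives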

> As for the throughput, I need verification... I think the unit here is
> bytes per second?
> 
Correct.
> 
>>> I'm unclear on ZIL writes.   I think that they happen independently of
>>> the normal txg rotation, but I'm not sure.
>>>
>>> So the second question is: do they happen with a TXG sync (expedited) or
>>> independent of the normal TXG sync flow?
>>>
>>> Finally, I'm unclear on exactly what constitutes a TXG Stall.  I had
>>> assumed that it indicated TXG's that exceeded the allotted time, but
>>> after some dtracing I'm uncertain.
>>>
>> I'm not certain what you mean by: "TXG Stall".
> 
> I refer to the following code, which I'm having some trouble properly
> understanding:
> 
>475 txg_stalled(dsl_pool_t *dp)
> 476 {
> 477 tx_state_t *tx = &dp->dp_tx;
> 478 return (tx->tx_quiesce_txg_waiting > tx->tx_open_txg);
> 479 }
> 
Ah.  A "stall" in this context means that the sync phase is idle,
waiting for the next txg to quiesce; the current "train" is stalled
until the quiesce finishes.
> 
> 
> Ultimately, what this all comes down to is finding a reliable way to
> determine when ZFS is struggling.  I'm currently watching (on pre-87)
> txg sync times and if it exceeds like 4 seconds per txg sync I know
> their is trouble brewing.  I'm considering whether watching either
> txg_stalled or txg_delay may be better ways to flag trouble.
> 
Stalled tends to mean that there is something happening "up top"
preventing things from moving (i.e., a tx not closing).  Delay is
used when we are trying to push more data than the pool can handle,
so that may be what you want to look at.
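
One way to watch for that (a hedged sketch, assuming the fbt provider can
probe txg_delay() on your build) is to count delay calls over time:

    # dtrace -qn 'fbt::txg_delay:entry { @d["txg_delay calls"] = count(); }
        tick-10sec { printa(@d); trunc(@d); }'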

> dp_throughput looks like it also might be a good candidate, although it
> was only added in snv_98, unfortunately, so it doesn't help a lot of my
> existing installs.  Nevertheless, graphing this value could be very
> telling and would be nice to have available as a kstat.
> 
agreed.

> The intended result is to have a reliable means of monitoring (via a
> standard monitoring framework such as Nagios or Zabbix or something) ZFS
> health... and from my studies simply watching traditional values via
> iostat isn't the best method.  If ZIL is either disabled or pushing to
> SLOG, then watching the breathing of TXG sync's should be all thats
> really important to me, at least on the write side... thats my theory
> anyway.  Feel free to flog me. :)
> 
Nope, that makes sense to me.  Ideally, we should either be chugging
along at 5-to-30 second intervals between syncs with no delays (i.e.
a light IO load), or we should be doing consistent 5s syncs with a
few delays seen (max capacity).  If you start seeing lots of delays,
you are probably trying to push too much data.

Note that we are still tuning this code.  Recently we discovered that
we may want to change the throughput calculation (we currently don't
include the dsl_dataset_sync() calls in the calc... we may want to
change that) to include more of the IO "setup" time.

> Thank you very much for your help Mark!
> 
> benr.



[zfs-code] 6551866 deadlock between zfs_write(), zfs_freesp(), and zfs_putapage()

2009-02-17 Thread Mark Maybee
Gack!  Absolutely correct, Jürgen.  I have filed 6806627 to track this.

-Mark

Jürgen Keil wrote:
> It seems there is a bug introduced by the putback for
> 
> author:   Mark Maybee 
> date: Wed Jan 28 11:04:37 2009 -0700 (2 weeks ago)
> files:usr/src/uts/common/fs/zfs/arc.c 
> usr/src/uts/common/fs/zfs/sys/zfs_znode.h
> usr/src/uts/common/fs/zfs/zfs_rlock.c usr/src/uts/common/fs/zfs/zfs_vnops.c
> usr/src/uts/common/fs/zfs/zfs_znode.c
> description:
> 6551866 deadlock between zfs_write(), zfs_freesp(), and zfs_putapage()
> 6504953 zfs_getpage() misunderstands VOP_GETPAGE() interface
> 6702206 ZFS read/writer lock contention throttles sendfile() benchmark
> 6780491 Zone on a ZFS filesystem has poor fork/exec performance
> 6747596 assertion failed: DVA_EQUAL(BP_IDENTITY(&zio->io_bp_orig), 
> BP_IDENTITY(zio->io_bp)));
> 
> zfs_vnops, zfs_putpage()
> 
> 
>   3695         /*
>   3696          * Align this request to the file block size in case we kluster.
>   3697          * XXX - this can result in pretty aggressive locking, which can
>   3698          * impact simultaneous read/write access.  One option might be
>   3699          * to break up long requests (len == 0) into block-by-block
>   3700          * operations to get narrower locking.
>   3701          */
>   3702         blksz = zp->z_blksz;
>   3703         if (ISP2(blksz))
>   3704                 io_off = P2ALIGN_TYPED(off, blksz, u_offset_t);
>   3705         else
>   3706                 io_off = 0;
>   3707         if (len > 0 && ISP2(blksz))
>   3708                 io_len = P2ROUNDUP_TYPED(len + (io_off - off), blksz, size_t);
>   3709         else
>   3710                 io_len = 0;
>   3711
>   3712         if (io_len == 0) {
>   3713                 /*
>   3714                  * Search the entire vp list for pages >= io_off.
>   3715                  */
>   3716                 rl = zfs_range_lock(zp, io_off, UINT64_MAX, RL_WRITER);
>   3717                 error = pvn_vplist_dirty(vp, io_off, zfs_putapage, flags, cr);
>   3718                 goto out;
>   3719         }
>   3720         rl = zfs_range_lock(zp, io_off, io_len, RL_WRITER);
> 
> 
> Line 3708:
> "len + (io_off - off)" looks wrong; this should be
> "len + (off - io_off)".  The P2ALIGN_TYPED() macro at line 3704
> rounds down "off", i.e. io_off <= off.
> 
> 
> Test case:
> 
> /files2/media/osol-0906-106a-global-x86.iso is a file on a zfs filesystem
> 
> # mount -F hsfs /files2/media/osol-0906-106a-global-x86.iso /mnt
> # time mkisofs -r -o /dev/null /mnt
> <<< very slow >>>
> 1.22u 5.05s 59:37.46 0.1%
> 
> On snv_104 the same test completes in 21 seconds.
> 
> It has become 180x slower...
> 
> 
> diff --git a/usr/src/uts/common/fs/zfs/zfs_vnops.c 
> b/usr/src/uts/common/fs/zfs/zfs_vnops.c
> --- a/usr/src/uts/common/fs/zfs/zfs_vnops.c
> +++ b/usr/src/uts/common/fs/zfs/zfs_vnops.c
> @@ -3705,7 +3705,7 @@
>   else
>   io_off = 0;
>   if (len > 0 && ISP2(blksz))
> - io_len = P2ROUNDUP_TYPED(len + (io_off - off), blksz, size_t);
> + io_len = P2ROUNDUP_TYPED(len + (off - io_off), blksz, size_t);
>   else
>   io_len = 0;



[zfs-code] PSARC/2009/204 ZFS user/group quotas & space accounting

2009-04-21 Thread Mark Maybee
Pawel Jakub Dawidek wrote:
> On Sat, Apr 18, 2009 at 06:05:56PM -0700, Matthew.Ahrens at sun.com wrote:
>> Author: Matthew Ahrens 
>> Repository: /hg/onnv/onnv-gate
>> Latest revision: f41cf682d0d3e3cf5c4ec17669b903ae621ef882
>> Total changesets: 1
>> Log message:
>> PSARC/2009/204 ZFS user/group quotas & space accounting
> 
> Great to see this functionality integrated, but if I read the diff
> properly, zfs allow/unallow command will now call Python scripts from
> inside zfs utility? Is that right? This is quite important for me,
> because if that's true this means I've to remove delegations from
> FreeBSD, because we don't have Python in base system.
> 
Yes, this is true: both user quotas and delegations now use python
for their CLI processing.  It's likely that other ZFS CLI subcommands
will go this direction as well.  It's unfortunate that Python isn't
available in the base FreeBSD... is it possible that you could bundle
it with the ZFS distro (or make it a dependency for the install)?

>> 6827260 assertion failed in arc_read(): hdr == pbuf->b_hdr
> 
> I see this assertion being removed from arc.c, can you give more
> details on how the actual bug causing this assertion was fixed?
> (We were seeing this panic on FreeBSD/ZFS.)
> 
We introduced the b_lock around the arc_read_nolock() call in arc_read
and also use this lock in arc_release(), see bug (6732083).  The other
reason you might hit this assertion is because of the assignment to
the local 'hdr' variable before we hold the b_lock in arc_read() -- Matt
removed both the assignment and the assertion.

-Mark



[zfs-code] Can Read and Write Share Same arc_buf_hdr_t

2008-06-05 Thread Mark Maybee
J Duff wrote:
> I'm trying to understand the arc code.
> 
> Can a read zio and a write zio share the same arc_buf_hdr_t?
> 
No.

> If so, do they each have their own arc_buf_t both of which point back to the 
> same arc_buf_hdr_t? In other words, do they each have their own copy of the 
> data (arc_buf_t.b_data), but share the attributes that are contained in the 
> arc_buf_hdr_t?
> 
> Duff



[zfs-code] Peak every 4-5 second

2008-07-22 Thread Mark Maybee
ZFS is designed to "sync" a transaction group about every 5 seconds
under normal work loads.  So your system looks to be operating as
designed.  Is there some specific reason why you need to reduce this
interval?  In general, this is a bad idea, as there is somewhat of a
"fixed overhead" associated with each sync, so increasing the sync
frequency could result in increased IO.

-Mark
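
To see the cadence for yourself, something like the following should print
one line per txg sync (a hedged sketch, assuming the fbt provider can probe
spa_sync() on your build and that its second argument is the txg number):

    # dtrace -qn 'fbt::spa_sync:entry { printf("%Y  spa_sync txg %d\n", walltimestamp, arg1); }'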

Tharindu Rukshan Bamunuarachchi wrote:
> Dear ZFS Gurus,
> 
> We are developing low-latency transaction processing systems for stock 
> exchanges.
> A low-latency, high-performance file system is a critical component of our 
> trading systems.
> 
> We have chosen ZFS as our primary file system,
> but we see periodic disk write peaks every 4-5 seconds.
> 
> Please refer to the first column of the output below (marked in bold).
> The output is generated by our own disk performance measuring tool, 
> DTool (please find attachment).
> 
> Compared to UFS/VxFS, ZFS is performing very well, but we could not 
> minimize the periodic peaks.
> We used the autoup and tune_t_fsflushr flags for UFS tuning.
> 
> Is there any ZFS-specific tuning that will reduce the file system flush 
> interval of ZFS?
> 
> I have tried all the parameters specified in "solarisinternals" and google.com.
> I would like to go for a ZFS code change/recompile if necessary.
> 
> Please advise.
> 
> Cheers
> Tharindu
> 
> 
> 
> cpu4600-100 /tantan >./*DTool -f M -s 1000 -r 1 -i 1 -W*
> System Tick = 100 usecs
> Clock resolution 10
> HR Timer created for 100usecs
> z_FileName = M
> i_Rate = 1
> l_BlockSize = 1000
> i_SyncInterval = 0
> l_TickInterval = 100
> i_TicksPerIO = 1
> i_NumOfIOsPerSlot = 1
> Max (us)| Min (us)  | Avg (us)  | MB/S  | File  
> Freq Distribution
>   336   |  4|  10.5635  |  4.7688   |  M   
> 50(98.55), 200(1.09), 500(0.36), 2000(0.00), 5000(0.00), 1(0.00), 
> 10(0.00), 20(0.00),
>   *1911 * |  4|  10.3152  |  9.4822   |  M   
> 50(98.90), 200(0.77), 500(0.32), 2000(0.01), 5000(0.00), 1(0.00), 
> 10(0.00), 20(0.00),
>   307   |  4|  9.9386   |  9.5324   |  M   
> 50(99.03), 200(0.66), 500(0.31), 2000(0.00), 5000(0.00), 1(0.00), 
> 10(0.00), 20(0.00),
>   331   |  4|  9.9465   |  9.5332   |  M   
> 50(99.04), 200(0.72), 500(0.24), 2000(0.00), 5000(0.00), 1(0.00), 
> 10(0.00), 20(0.00),
>   318   |  4|  10.1241  |  9.5309   |  M   
> 50(99.07), 200(0.66), 500(0.27), 2000(0.00), 5000(0.00), 1(0.00), 
> 10(0.00), 20(0.00),
>   303   |  4|  9.9236   |  9.5296   |  M   
> 50(99.13), 200(0.59), 500(0.28), 2000(0.00), 5000(0.00), 1(0.00), 
> 10(0.00), 20(0.00),
>   560   |  4|  10.2604  |  9.4565   |  M   
> 50(98.82), 200(0.86), 500(0.31), 2000(0.01), 5000(0.00), 1(0.00), 
> 10(0.00), 20(0.00),
>   376   |  4|  9.9975   |  9.5176   |  M   
> 50(99.05), 200(0.63), 500(0.32), 2000(0.00), 5000(0.00), 1(0.00), 
> 10(0.00), 20(0.00),
>   *9783 * |  4|  10.8216  |  9.5301   |  M   
> 50(99.05), 200(0.58), 500(0.36), 2000(0.00), 5000(0.00), 1(0.01), 
> 10(0.00), 20(0.00),
>   332   |  4|  9.9345   |  9.5252   |  M   
> 50(99.06), 200(0.61), 500(0.33), 2000(0.00), 5000(0.00), 1(0.00), 
> 10(0.00), 20(0.00),
>   355   |  4|  9.9906   |  9.5315   |  M   
> 50(99.01), 200(0.69), 500(0.30), 2000(0.00), 5000(0.00), 1(0.00), 
> 10(0.00), 20(0.00),
>   356   |  4|  10.2341  |  9.5207   |  M   
> 50(98.96), 200(0.76), 500(0.28), 2000(0.00), 5000(0.00), 1(0.00), 
> 10(0.00), 20(0.00),
>   320   |  4|  9.8893   |  9.5279   |  M   
> 50(99.10), 200(0.59), 500(0.31), 2000(0.00), 5000(0.00), 1(0.00), 
> 10(0.00), 20(0.00),
>   *10005* |  4|  10.8956  |  9.5258   |  M   
> 50(99.07), 200(0.63), 500(0.29), 2000(0.00), 5000(0.00), 1(0.00), 
> 10(0.01), 20(0.00),
>   308   |  4|  9.8417   |  9.5312   |  M   
> 50(99.07), 200(0.64), 500(0.29), 2000(0.00), 5000(0.00), 1(0.00), 
> 10(0.00), 20(0.00),
> 

[zfs-code] Understanding the ARC - arc_buf_hdr and arc_buf_t

2008-01-16 Thread Mark Maybee
J Duff wrote:
> I'm trying to understand the inner workings of the adaptive replacement cache 
> (arc). I see there are arc_bufs and arc_buf_hdrs. Each arc_buf_hdr points to 
> an arc_buf_t. The arc_buf_t is really one entry in a list of arc_buf_t 
> entries. The multiple entries are accessed through the arc_buf_t's b_next 
> member.
> 
> Why does the arc_buf_hdr point to a list of arc_buf_ts and not just one 
> arc_buf_t, i.e., how and why do multiple arc_buf_ts get inserted into this 
> list?
> 
multiple arc_buf_t entries can be strung off of an arc_buf_hdr_t when
the same block is accessed from multiple file systems (i.e. snapshots
and/or clones).

> What is the relationship between arc_buf_hdr's b_datacnt and b_refcnt? Is 
> b_datacnt the number of arc_buf_ts in the list?

yes

> Does the b_refcnt keep track of all users of all arc_buf_ts in the list?
> 
The refcnt is a count of all *active* users of the buffer.  The refcnt
should never be larger than the datacnt, but it is possible to have a
non-zero datacnt with a zero refcnt.  A buffer cannot be evicted while
it has a non-zero refcnt.
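
A minimal illustrative sketch of the relationship being described (this is
not the real arc.c definition, just the shape of it):

    /* Illustrative only -- not the actual ARC structures. */
    typedef struct example_arc_buf {
            struct example_arc_buf_hdr *b_hdr;   /* shared header (block identity/state) */
            struct example_arc_buf     *b_next;  /* next buf sharing this header */
            void                       *b_data;  /* this consumer's copy of the data */
    } example_arc_buf_t;

    typedef struct example_arc_buf_hdr {
            example_arc_buf_t *b_buf;      /* list of arc_buf_t's, one per consumer */
            uint32_t           b_datacnt;  /* number of arc_buf_t's in the list */
            uint32_t           b_refcnt;   /* active references; never > b_datacnt */
    } example_arc_buf_hdr_t;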

> Thanks for any understanding that anyone can offer.
> 
> Duff



[zfs-code] Design for EAs in dnode.

2008-02-09 Thread Mark Maybee
Ricardo M. Correia wrote:
> Hi Matthew,
> 
> On Thu, 2008-02-07 at 12:48 -0800, Matthew Ahrens wrote:
>> on disk:
>> header:
>> struct sa_phys {
>>  uint16_t sa_numattrs;
>>  struct {
>>  uint16_t sa_type; /* enum sssa_type */
>>  uint16_t sa_length;   /* in sainfo[sa_type] chunks */
>>  } sssa_attr[];
>> };
>>
>> followed by the data in the order specified by the header.  8-byte alignment 
>> will be enforced for all attribute starting offsets.
> 
> This would require repacking when adding an attribute, right? Anyway, 
> maybe that wouldn't be a big problem..
> 
> There's also the question of the attribute size limit.. from 
> conversations I had with Andreas, I got the feeling that the Lustre 
> layout information could be quite big when a file is striped through all 
> OSTs, and I would imagine this would become an even bigger concern in 
> the future if we intend to continue scaling horizontally.
> 
By their very nature, these attributes will have a size limitation.  I
don't really see the point in allowing a size larger than the supported
block size.  64K seems reasonable.  I think that very large values could
possibly be supported using some form of indirection: storing a block
pointer for the value, or storing an object ID for values that span
multiple blocks.

> Anyway, your proposal is interesting, but there's also one thing I would 
> like to add:
> 
> Could we have a special integer value that would essentially mean "this 
> is an unknown, name-value type of attribute", which would be used to 
> store additional, perhaps user-specified attributes?
> In the space of the attribute value, we could store the name of the 
> attribute and the value itself (perhaps with 1 or 2 additional bytes for 
> the name length of the attribute).
> 
> These attributes could have lower priority than the other system 
> attributes (they would be stored at the end) and they would be ignored 
> (but not removed) by any implementation that doesn't understand them.
> 
I'm not sure how storing them at the end makes them lower priority.  I'm
also not sure exactly how these would be used... does lustre need these?
For any file system layer, we would need to add infrastructure to manage
and manipulate these objects, as well as new interfaces to be able to
access these attributes from outside the kernel.

All that aside, there is nothing that would prohibit the definition of
this type of "special system attribute" in the current model.

> I also think that instead of having an additional block pointer (which 
> is huge) in the dnode for "spillage", we should have something like a 
> "uint64_t dn_spillobj" which would be an object id of a "spillage" object.
> An object id is much more space-efficient and, like Andreas mentioned, 
> allows for an unlimited number of attributes/attribute sizes.
> 
I don't quite understand this.  The whole point of these attributes is
to make them fast to access... using an object ID is going to be far
more expensive than a block pointer to access.  Matt's model can also
support unlimited numbers of attributes if we allow the blocks to be
chained.

> I also find **extremely** appealing your idea of making this whole thing 
> a DMU service.
> I think that would make our lives (as in Lustre developers' lives) much, 
> much easier in the long run.. the VFS dependencies in the ZPL code have 
> been kind of a pain, to say the least.. :)
> 
> Thanks,
> Ricardo
> 
> --
>   *Ricardo Manuel Correia*
> Lustre Engineering
> 
> *Sun Microsystems, Inc.*
> Portugal
> Phone +351.214134023 / x58723
> Mobile +351.912590825
> Email Ricardo.M.Correia at Sun.COM 
> 
> 
> 
> 



[zfs-code] Adding to arc_buf_hdr_t

2008-02-28 Thread Mark Maybee
The arc_buf_hdr_t is an in-core-only data structure, so it has no
version impact.  It does not matter where you add an extra field.

However, only add a field to this structure if you really need it...
increasing the size of this struct will likely have a negative
performance impact on the ARC.

-Mark

Darren J Moffat wrote:
> If I add a new field to arc_buf_hdr_t, this has a version impact, right?
> 
> I think I need to add a field to it for dealing with encryption of the
> L2ARC.
> 
> I'm assuming it should be added at the end and any assignment or
> reference to it should be guarded with the usual style of version check
> ?  Is that sufficient ?
> 



[zfs-code] truncate(2) not working properly.

2008-08-16 Thread Mark Maybee
The fix for this bug is currently in test and will be pushed shortly.

-Mark

Pawel Jakub Dawidek wrote:
> On Tue, Jul 22, 2008 at 08:56:31AM -0600, Mark Shellenbaum wrote:
>> Pawel Jakub Dawidek wrote:
>>> On Tue, Jul 22, 2008 at 04:28:45PM +0200, Pawel Jakub Dawidek wrote:
 Hi.

 I just reproduced a problem I was chasing on FreeBSD also on
 OpenSolaris from around 2008.01.

 Simply doing something like this:

write 9k of random data into 'foo' file
truncate 'foo' file to 7k
truncate 'foo' file to 11k
read data between 7k-9k

 There should be all zeros between 7k-9k, but there is previous data.
 It worked fine on older ZFS versions (I'm sure it works on version 6).

 Simple test program:
>>> [...]
>>>
>> Thanks for reporting this.
>>
>> I've opened bug:
>>
>> 6728399 ftruncate(2) broken on ZFS
> 
> I went a bit further trying to find out what's going on. This is the
> change responsible for the problem: 20c04e18c58cfb135a135c0126c7d69f6078e1b5.
> 
> In the dnode_free_range() function you can find this condition:
> 
>   [...]
>   } else if (off > blkid) {
>   /* Freeing past end-of-data */
>   goto out;
>   [...]
> 
> Unfortunately blkid is uninitialized at this point, so it can't be
> right. I tried to change it to 'blksz', but it panics further.
> 
> I hope this helps. This is one of the last issues that holds back
> integration of the most recent ZFS into FreeBSD tree, so we would be
> very grateful if you guys could look into this.
> 
> Thanks in advance!
> 



[zfs-code] Unexpected b_hdr change.

2008-08-16 Thread Mark Maybee
This is a known bug:

6732083 arc_read() panic: rw_exit: lock not held

with a known cause.  The fix you suggest works, but it is rather ugly.  We
are working on a fix now.

-Mark

Pawel Jakub Dawidek wrote:
> On Tue, Jul 29, 2008 at 12:41:16PM +0200, Pawel Jakub Dawidek wrote:
>> Hi.
>>
>> We're testing the most recent ZFS version from OpenSolaris ported to
>> FreeBSD. Kris (CCed) observed strange situation. In function arc_read()
>> he had a panic on an assertion that we try to unlock a lock which is not
>> being held:
>>
>>  rw_enter(&pbuf->b_hdr->b_datalock, RW_READER);
>>
>>  err = arc_read_nolock(pio, spa, bp, done, private, priority,
>>  flags, arc_flags, zb);
>>
>>  rw_exit(&pbuf->b_hdr->b_datalock); <--- THIS ONE
>>
>> The only possiblity was that b_hdr for pbuf was changed somewhere. We
>> diagnozed this further and the b_hdr field is changed in arc_release()
>> function, here:
>>
>>  buf->b_hdr = nhdr;
> [...]
> 
> We have a simple test case to reproduce that and a patch that seems to
> work for us. I don't really understand the code well enough to be able
> to prepare the right fix.
> 
> The patch is here:
> 
>   http://people.freebsd.org/~pjd/patches/arc.patch
> 
> Script to reproduce it:
> 
> #!/bin/sh
> fs=tank/$$
> while true; do
>   zfs clone tank/boom@boom $fs
>   find /$fs > /dev/null
>   zfs destroy $fs
> done
> 
> To use it, you first need to create pool tank and tank/boom dataset,
> then put some files into /tank/boom/ (we had FreeBSD source tree in
> there), then take a snapshot tank/boom@boom and:
> 
>   # cp boom.sh /tank/
>   # cd /tank
>   # ./boom.sh &; ./boom.sh &; ./boom.sh &; ./boom.sh &
> 
> You'll need SMP machine to reproduce that. We tried to reproduce it on
> recent OpenSolaris, but we have only one CPU in there and we weren't
> able to reproduce it. Maybe it is FreeBSD-specific problem, but it
> doesn't look like that. Could you guys at least try to reproduce it on
> some SMP OpenSolaris machine?
> 
> Thanks in advance!
> 



[zfs-code] Increasing dnode size

2007-09-13 Thread Mark Maybee
Andreas,

We have explored the idea of increasing the dnode size in the past
and discovered that a larger dnode size has a significant negative
performance impact on the ZPL (at least with our current caching
and read-ahead policies).  So we don't have any plans to increase
its size generically anytime soon.

However, given that the ZPL isn't the only consumer of datasets,
and that Lustre may benefit from a larger dnode size, it may be
worth investigating the possibility of supporting multiple dnode
sizes within a single pool (this is currently not supported).

Also, note that dnodes already have the notion of "fixed" DMU-
specific data and "variable" application-used data (the bonus
area).  So even in the current code, Lustre has the ability to
use 320 bytes of bonus space however it wants.

-Mark

Andreas Dilger wrote:
> Hello,
> as a brief introduction, I'm one of the developers of Lustre
> (www.lustre.org) at CFS and we are porting over Lustre to use ZFS (well,
> technically just the DMU) for back-end storage of Lustre.  We currently
> use a modified ext3/4 filesystem for the back-end storage (both data and
> metadata) fairly successfully (single filesystems of up to 2PB with up
> to 500 back-end ext3 file stores and getting 50GB/s aggregate throughput
> in some installations).
> 
> Lustre is a fairly heavy user of extended attributes on the metadata target
> (MDT) to record virtual file->object mappings, and we'll also begin using
> EAs more heavily on the object store (OST) in the near future (reverse
> object->file mappings for example).
> 
> One of the performance improvements we developed early on with ext3 is
> moving the EA into the inode to avoid seeking and full block writes for
> small amounts of EA data.  The same could also be done to improve small
> file performance (though we didn't implement that).  For ext3 this meant
> increasing the inode size from 128 bytes to a format-time constant size of
> 256 - 4096 bytes (chosen based on the default Lustre EA size for that fs).
> 
> My understanding from brief conversations with some of the ZFS developers
> is that there are already some plans to enlarge the dnode because
> the dnode bonus buffer is getting close to being full for ZFS.  Are there
> any details of this plan that I could read, or has it been discussed before?
> Due to the generality of the terms I wasn't able to find anything by search.
> I wanted to get the ball rolling on the large dnode discussion (which
> you may have already had internally, I don't know), and start a fast EA
> discussion in a separate thread.
> 
> 
> 
> One of the important design decisions made with the ext3 "large inode" space
> (beyond the end of the regular inode) was that there was a marker in each
> inode which records how much of that space was used for "fixed" fields
> (e.g. nanosecond timestamps, creation time, inode version) at the time the
> inode was last written.  The space beyond "i_extra_isize" is used for
> extended attribute storage.  If an inode is modified and the kernel code
> wants to store additional "fixed" fields in the inode it will push the EAs
> out to external blocks to make room if there isn't enough in-inode space.
> 
> By having i_extra_isize stored in each inode (actually the first 16-bit
> field in large inodes) we are at liberty to add new fields to the inode
> itself without having to do a scan/update operation on existing inodes
> (definitely desirable for ZFS also) and we don't have to waste a lot
> of "reserved" space for potential future expansion or for fields at the
> end that are not being used (e.g. inode version is only useful for NFSv4
> and Lustre).  None of the "extra" fields are critical to correct operation
> by definition, since the code has existed until now without them...
> Conversely, we don't force EAs to start at a fixed offset and then use
> inefficient EA wrapping for small 32- or 64-bit fields.
> 
> We also _discussed_ storing ext3 small file data in an EA on an
> opportunistic basis along with more extent data (ala XFS).  Are there
> plans to allow the dn_blkptr[] array to grow on a per-dnode basis to
> avoid spilling out to an external block for files that are smaller and/or
> have little/no EA data?  Alternately, it would be interesting to store
> file data in the (enlarged) dn_blkptr[] array for small files to avoid
> fragmenting the free space within the dnode.
> 
> 
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
> 



[zfs-code] Increasing dnode size

2007-09-14 Thread Mark Maybee
Andreas Dilger wrote:
> On Sep 13, 2007  15:27 -0600, Mark Maybee wrote:
>> We have explored the idea of increasing the dnode size in the past
>> and discovered that a larger dnode size has a significant negative
>> performance impact on the ZPL (at least with our current caching
>> and read-ahead policies).  So we don't have any plans to increase
>> its size generically anytime soon.
> 
> I'm sure it depends a lot on the workload.  I don't know the details
> of how the ZFS allocators work, so it seems possible they always
> allocate the modified dnode and the corresponding EAs in a contiguous
> chunk initially, but I suspect that keeping this true over the life
> of the dnode would put an added burden on the allocator (to know this)
> or ZPL (to always mark them dirty to force colocation even if not modified).
> 
> I'd also heard that the 48 (or so) bytes that remain in the bonus buffer
> for ZFS are potentially going to be used up soon so there would be a
> desire to have a generic solution to this issue.
> 
You seem to have a line on a lot of internal development details :-).

> One of the reasons the large inode patch made it into the Linux
> kernel quickly was because it made a big difference for Samba
> (in addition to Lustre):
> 
>   http://lwn.net/Articles/112571/
> 
>> However, given that the ZPL isn't the only consumer of datasets,
>> and that Lustre may benefit from a larger dnode size, it may be
>> worth investigating the possibility of supporting multiple dnode
>> sizes within a single pool (this is currently not supported).
> 
> Without knowing the details, it would seem at first glance that
> having variable dnode size would be fairly complex.  Aren't the
> dnodes just stored in a single sparse object and accessed by
> dnode_size * objid?  This does seem desirable from the POV that
> if you have an existing fs with the current dnode size you don't
> want to need a reformat in order to use the larger size.
> 
I was referring here to supporting multiple dnode sizes within a
*pool*, but the size would still remain fixed for a given dataset
(see Bill's mail).  This is a much simpler concept to implement.

>> Also, note that dnodes already have the notion of "fixed" DMU-
>> specific data and "variable" application-used data (the bonus
>> area).  So even in the current code, Lustre has the ability to
>> use 320 bytes of bonus space however it wants.
> 
> That is true, and we discussed this internally, but one of the internal
> requirements we have for DMU usage is that it create an on-disk layout
> that matches ZFS so that it is possible to mount a Lustre filesystem
> via ZFS or ZFS-FUSE (and potentially the reverse in the future).
> This will allow us to do problem diagnosis and also leverage any ZFS
> scanning/verification tools that may be developed.
> 
Ah, interesting, I was not aware of this requirement.  It would not be
difficult to allow the ZPL to work with a larger dnode size (in fact
it's pretty much a no-op as long as the ZPL is not trying to use any of
the extra space in the dnode).

>> Andreas Dilger wrote:
>>> Lustre is a fairly heavy user of extended attributes on the metadata target
>>> (MDT) to record virtual file->object mappings, and we'll also begin using
>>> EAs more heavily on the object store (OST) in the near future (reverse
>>> object->file mappings for example).
>>>
>>> One of the performance improvements we developed early on with ext3 is
>>> moving the EA into the inode to avoid seeking and full block writes for
>>> small amounts of EA data.  The same could also be done to improve small
>>> file performance (though we didn't implement that).  For ext3 this meant
>>> increasing the inode size from 128 bytes to a format-time constant size of
>>> 256 - 4096 bytes (chosen based on the default Lustre EA size for that fs).
>>>
>>> My understanding from brief conversations with some of the ZFS developers
>>> is that there are already some plans to enlarge the dnode because
>>> the dnode bonus buffer is getting close to being full for ZFS.  Are there
>>> any details of this plan that I could read, or has it been discussed 
>>> before?
>>> Due to the generality of the terms I wasn't able to find anything by 
>>> search.
>>> I wanted to get the ball rolling on the large dnode discussion (which
>>> you may have already had internally, I don't know), and start a fast EA
>>> discussion in a separate thread.
>>>
>>>
>>>
>>> One of the important design decisions made wi

[zfs-code] Recent deadlock.

2007-11-07 Thread Mark Maybee
Pawel,

I'm not quite sure I understand why thread #1 below is stalled.  Is
there only a single thread available for IO completion?

-Mark

Pawel Jakub Dawidek wrote:
> Hi.
> 
> I'm observing the following deadlock.
> 
> One thread holds zfsvfs->z_hold_mtx[i] lock and waits for I/O:
> 
> Tracing pid 1188 tid 100114 td 0xc41b1660
> sched_switch(c41b1660,0,1,180,9bbb0333,...) at sched_switch+0x1b6
> mi_switch(1,0,c0666ae9,1bf,c41b1660,...) at mi_switch+0x1ee
> sleepq_switch(c41b1660,0,c0666ae9,21d,c41b1660,...) at sleepq_switch+0xf0
> sleepq_wait(de465f94,de465f7c,c43fc60a,1,0,...) at sleepq_wait+0x60
> _cv_wait(de465f94,de465f7c,c43fc4cd,34f,0,...) at _cv_wait+0x1fc
> zio_wait(de465d68,c43ea3c5,240,4000,0,...) at zio_wait+0x99
> dbuf_read(d3a39594,de465d68,2,c43ee6df,d2e09900,...) at dbuf_read+0x2e7
> dnode_hold_impl(c4bbf000,a075,0,1,c43ec318,...) at dnode_hold_impl+0x15a
> dnode_hold(c4bbf000,a075,0,c43ec318,f899a758,...) at dnode_hold+0x35
> dmu_bonus_hold(c480e540,a075,0,0,f899a7a0,...) at dmu_bonus_hold+0x31
> zfs_zget(c3f4f000,a075,0,f899a89c,1,...) at zfs_zget+0x7a
> zfs_dirent_lock(f899a8a0,c4d18c64,f899a928,f899a89c,6,...) at 
> zfs_dirent_lock+0x619
> zfs_dirlook(c4d18c64,f899a928,f899ac20,0,0,...) at zfs_dirlook+0x272
> zfs_lookup(cbef4414,f899a928,f899ac20,f899ac34,2,...) at zfs_lookup+0x2df
> zfs_freebsd_lookup(f899aa70,c0685c24,cbef4414,cbef4414,f899ac34,...) at 
> zfs_freebsd_lookup+0xdd
> VOP_CACHEDLOOKUP_APV(c4405800,f899aa70,f899ac34,f899ac20,c3f65200,...) at 
> VOP_CACHEDLOOKUP_APV+0xc5
> vfs_cache_lookup(f899aaf4,c066e847,cbef4414,1,cbef4414,...) at 
> vfs_cache_lookup+0xd3
> VOP_LOOKUP_APV(c4405800,f899aaf4,c41b1660,c066d7b8,19e,...) at 
> VOP_LOOKUP_APV+0xe5
> lookup(f899ac0c,c066d7b8,c9,bf,ca5b032c,...) at lookup+0x53e
> namei(f899ac0c,0,c0679402,8d6,c065f266,...) at namei+0x2fe
> kern_rename(c41b1660,2820d040,2820c040,0,f899ad2c,...) at kern_rename+0x4a
> rename(c41b1660,f899acfc,8,c06758bf,c06924e0,...) at rename+0x29
> syscall(f899ad38) at syscall+0x283
> Xint0x80_syscall() at Xint0x80_syscall+0x20
> --- syscall (128, FreeBSD ELF32, rename), eip = 0x280d1cdb, esp = 0xbfbfe7fc, 
> ebp = 0xbfbfea48 ---
> 
> Another thread tries to finish I/O, but can't, because it is trying to
> acquire zfsvfs->z_hold_mtx[i] lock:
> 
> Tracing command spa_zio_intr_2 pid 1117 tid 100020 td 0xc3b38440
> sched_switch(c3b38440,0,1,180,9d2b0e9f,...) at sched_switch+0x1b6
> mi_switch(1,0,c0666ae9,1bf,c3b38440,...) at mi_switch+0x1ee
> sleepq_switch(c3b38440,0,c0666ae9,21d,c3f4f6a4,...) at sleepq_switch+0xf0
> sleepq_wait(c3f4f6a4,0,c43fba6f,3,0,...) at sleepq_wait+0x60
> _sx_xlock_hard(c3f4f6a4,c3b38440,0,c43fb80c,31e,...) at _sx_xlock_hard+0x286
> _sx_xlock(c3f4f6a4,0,c43fb80c,31e,c7af015c,...) at _sx_xlock+0xb8
> zfs_zinactive(d0d00988,0,c4401326,e27,d0d00988,...) at zfs_zinactive+0xa2
> zfs_inactive(c7af015c,c3aa9600,0,c7af015c,f8aa2888,...) at zfs_inactive+0x307
> zfs_freebsd_inactive(f8aa28a0,c06856cc,c7af0230,c7af0230,c7af015c,...) at 
> zfs_freebsd_inactive+0x32
> VOP_INACTIVE_APV(c4405800,f8aa28a0,c066e0c7,8e9,c06b5120,...) at 
> VOP_INACTIVE_APV+0xc5
> vinactive(c7af0230,0,c066e0c7,859,858,...) at vinactive+0xb1
> vrele(c7af015c,c71317d0,f8aa2924,c04d16b0,c71317d0,...) at vrele+0x18f
> zfs_get_done(c5832d8c,c71317d0,292,28d,d2387b70,...) at zfs_get_done+0xad
> dmu_sync_done(d11e56b4,ccc323c0,c6940930,0,0,...) at dmu_sync_done+0x1f0
> arc_write_done(d11e56b4,d4863b2c,e38a2208,4d3,d11d68f0,...) at 
> arc_write_done+0x44a
> zio_done(d11e56b4,c44033e0,f8aa2a38,c434906a,d13c9e00,...) at zio_done+0xb2
> zio_next_stage(d11e56b4,d11e56f8,80,0,f8aa2a68,...) at zio_next_stage+0x236
> zio_assess(d11e56b4,c3b384d8,c069938c,c43fc4cd,d11e58c8,...) at 
> zio_assess+0x843
> zio_next_stage(d11e56b4,c43fc4cd,36d,36a,f8aa2b44,...) at zio_next_stage+0x236
> zio_wait_for_children(d11e56b4,12,d11e58bc,f8aa2b8c,c436cc76,...) at 
> zio_wait_for_children+0x99
> zio_wait_children_done(d11e56b4,c0a575a0,c3b384d8,c0697e04,c0679402,...) at 
> zio_wait_children_done+0x25
> zio_next_stage(d11e56b4,c1074808,0,c0679402,8d6,...) at zio_next_stage+0x236
> zio_vdev_io_assess(d11e56b4,c44033e0,40,cef4def8,c3,...) at 
> zio_vdev_io_assess+0x2a7
> zio_next_stage(d11e56b4,c3b38440,f8aa2c3c,246,c0a116b4,...) at 
> zio_next_stage+0x236
> vdev_mirror_io_done(d11e56b4,f8aa2cf8,c42c94de,d11e56b4,0,...) at 
> vdev_mirror_io_done+0x113
> zio_vdev_io_done(d11e56b4,0,c43e6f2d,33d,c442ec14,...) at 
> zio_vdev_io_done+0x21
> taskq_thread(c442ebf4,f8aa2d38,c065fcb9,31e,c3b37aa0,...) at 
> taskq_thread+0x29e
> fork_exit(c42c9240,c442ebf4,f8aa2d38) at fork_exit+0xb8
> fork_trampoline() at fork_trampoline+0x8
> --- trap 0, eip = 0, esp = 0xf8aa2d70, ebp = 0 ---
> 
> As you can see, the SPA ZIO INTR thread wants to finish I/O requests, calls
> arc_write_done(). All in all we call vn_rele(), which may end up in calling
> zfs_zinactive() if this was the last vnode reference. zfs_zinactive() tries
> to acquire the zfsvfs->z_hold_mtx[i] lock, but can't, because that lock is
> already held by the first thread, which is waiting for the I/O to complete.

[zfs-code] Recent deadlock.

2007-11-07 Thread Mark Maybee
Hmm, seems rather unlikely that these two IOs are related.  Thread 1
is trying to read a dnode in order to extract the znode data from its
bonus buffer.  Thread 2 is completing a dmu_sync() write (so this is
the result of a zil operation).  While it's possible that the dmu_sync()
write may involve reading some of the blocks in the dnode from Thread 1,
this should not result in Thread 1 waiting for anything.

I think vdev_mirror_io_done() often shows up in the stack because the
ditto-ing code leverages that code path.

-Mark

Pawel Jakub Dawidek wrote:
> On Wed, Nov 07, 2007 at 07:01:52AM -0700, Mark Maybee wrote:
>> Pawel,
>>
>> I'm not quite sure I understand why thread #1 below is stalled.  Is
>> there only a single thread available for IO completion?
> 
> There are few, but I belive the thread #2 is trying to complete the very
> I/O request on which thread #1 is waiting.
> Thread #2 can't complete this I/O request, because the lock it needs to
> acquire is held by thread #1.
> 
>>> One thread holds zfsvfs->z_hold_mtx[i] lock and waits for I/O:
>>>
>>> Tracing pid 1188 tid 100114 td 0xc41b1660
> [...]
>>> _cv_wait(de465f94,de465f7c,c43fc4cd,34f,0,...) at _cv_wait+0x1fc
>>> zio_wait(de465d68,c43ea3c5,240,4000,0,...) at zio_wait+0x99
>>> dbuf_read(d3a39594,de465d68,2,c43ee6df,d2e09900,...) at dbuf_read+0x2e7
>>> dnode_hold_impl(c4bbf000,a075,0,1,c43ec318,...) at dnode_hold_impl+0x15a
>>> dnode_hold(c4bbf000,a075,0,c43ec318,f899a758,...) at dnode_hold+0x35
>>> dmu_bonus_hold(c480e540,a075,0,0,f899a7a0,...) at dmu_bonus_hold+0x31
>>> zfs_zget(c3f4f000,a075,0,f899a89c,1,...) at zfs_zget+0x7a
>>> zfs_dirent_lock(f899a8a0,c4d18c64,f899a928,f899a89c,6,...) at 
>>> zfs_dirent_lock+0x619
>>> zfs_dirlook(c4d18c64,f899a928,f899ac20,0,0,...) at zfs_dirlook+0x272
>>> zfs_lookup(cbef4414,f899a928,f899ac20,f899ac34,2,...) at zfs_lookup+0x2df
> [...]
> 
>>> Another thread tries to finish I/O, but can't, because it is trying to
>>> acquire zfsvfs->z_hold_mtx[i] lock:
>>>
>>> Tracing command spa_zio_intr_2 pid 1117 tid 100020 td 0xc3b38440
> [...]
>>> _sx_xlock(c3f4f6a4,0,c43fb80c,31e,c7af015c,...) at _sx_xlock+0xb8
>>> zfs_zinactive(d0d00988,0,c4401326,e27,d0d00988,...) at zfs_zinactive+0xa2
>>> zfs_inactive(c7af015c,c3aa9600,0,c7af015c,f8aa2888,...) at 
>>> zfs_inactive+0x307
> [...]
>>> zfs_get_done(c5832d8c,c71317d0,292,28d,d2387b70,...) at zfs_get_done+0xad
>>> dmu_sync_done(d11e56b4,ccc323c0,c6940930,0,0,...) at dmu_sync_done+0x1f0
>>> arc_write_done(d11e56b4,d4863b2c,e38a2208,4d3,d11d68f0,...) at 
>>> arc_write_done+0x44a
>>> zio_done(d11e56b4,c44033e0,f8aa2a38,c434906a,d13c9e00,...) at zio_done+0xb2
>>> zio_next_stage(d11e56b4,d11e56f8,80,0,f8aa2a68,...) at zio_next_stage+0x236
>>> zio_assess(d11e56b4,c3b384d8,c069938c,c43fc4cd,d11e58c8,...) at 
>>> zio_assess+0x843
>>> zio_next_stage(d11e56b4,c43fc4cd,36d,36a,f8aa2b44,...) at 
>>> zio_next_stage+0x236
>>> zio_wait_for_children(d11e56b4,12,d11e58bc,f8aa2b8c,c436cc76,...) at 
>>> zio_wait_for_children+0x99
>>> zio_wait_children_done(d11e56b4,c0a575a0,c3b384d8,c0697e04,c0679402,...) at 
>>> zio_wait_children_done+0x25
>>> zio_next_stage(d11e56b4,c1074808,0,c0679402,8d6,...) at zio_next_stage+0x236
>>> zio_vdev_io_assess(d11e56b4,c44033e0,40,cef4def8,c3,...) at 
>>> zio_vdev_io_assess+0x2a7
>>> zio_next_stage(d11e56b4,c3b38440,f8aa2c3c,246,c0a116b4,...) at 
>>> zio_next_stage+0x236
>>> vdev_mirror_io_done(d11e56b4,f8aa2cf8,c42c94de,d11e56b4,0,...) at 
>>> vdev_mirror_io_done+0x113
>>> zio_vdev_io_done(d11e56b4,0,c43e6f2d,33d,c442ec14,...) at 
>>> zio_vdev_io_done+0x21
> [...]
> 
> BTW, why do I always see vdev_mirror_io_done() even if I don't use a mirror?
> Here I had a RAIDZ vdev.
> 
> 



[zfs-code] Recent deadlock.

2007-11-07 Thread Mark Maybee
Pawel Jakub Dawidek wrote:
> On Wed, Nov 07, 2007 at 07:41:54AM -0700, Mark Maybee wrote:
>> Hmm, seems rather unlikely that these two IOs are related.  Thread 1
>> is trying to read a dnode in order to extract the znode data from its
>> bonus buffer.  Thread 2 is completing a dmu_sync() write (so this is
>> the result of a zil operation).  While its possible that the dmu_sync()
>> write may involve reading some of the blocks in the dnode from Thread 1,
>> this should not result in Thread 1 waiting for anything.
> 
> Maybe my interpretation is wrong, but moving this VN_RELE() to a separate
> kernel thread for later fixes the deadlock for me.
> 
Right, but I think the problem here is that you are running out of IO
threads.  Do you maintain separate thread sets for each "type" of IO
(i.e. reads, writes, etc.)?  We found, early on, that if we didn't
reserve threads for specific types of IOs, we could run into this
type of deadlock.

-Mark
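
A hedged sketch of the "reserve threads per IO type" idea (thread counts
and names are made up for illustration; this is not the actual SPA taskq
setup):

    #include <sys/taskq.h>
    #include <sys/disp.h>   /* minclsyspri */

    static taskq_t *ex_read_intr_tq, *ex_write_intr_tq;

    static void
    example_create_io_taskqs(void)
    {
            /* Separate completion queues per IO type, so a blocked write
             * completion can never starve the read completions it may
             * depend on. */
            ex_read_intr_tq = taskq_create("ex_read_intr", 8, minclsyspri,
                8, INT_MAX, TASKQ_PREPOPULATE);
            ex_write_intr_tq = taskq_create("ex_write_intr", 8, minclsyspri,
                8, INT_MAX, TASKQ_PREPOPULATE);
    }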



[zfs-code] R/W lock portability issue

2007-11-27 Thread Mark Maybee
Chris Kirby wrote:
> Matthew Ahrens wrote:
>> So, we use RW_LOCK_HELD() to mean, "might this thread hold the lock?"  and 
>> it 
>> is generally only used in assertions.  Eg, some routine should only be 
>> called 
>> with the lock held, so we "ASSERT(RW_LOCK_HELD(lock))".  The fact that 
>> sometimes the implementation of RW_LOCK_HELD() returns TRUE when this thread 
>> does not in fact hold the lock is fine.
>>
>> In this case, we are using it a bit differently.  The current thread must 
>> *not* hold the lock for reader.  We use RW_LOCK_HELD() to determine 
>> definitively if this thread holds the lock for writer or not.  The behavior 
>> we 
>> want is:  if this thread already holds the lock for writer, we don't need to 
>> re-grab it.  Otherwise, we grab the lock for READER.
>>
>> However, now that I go look at the implementation of RW_LOCK_HELD(), it 
>> doesn't do what this code expects; it should be using RW_WRITE_HELD().
> 
> The problem in this case turns out to be a bit harder. 
> dsl_dataset_open_obj can be called with dp_config_rwlock held for
> read or write, or not held at all.  An obvious way to fix this would
> be to have the callers pass an arg saying whether or not the lock is
> held, but that's esp. ugly in this case since there are ~30 callers,
> and at least some of them probably don't grab the lock directly.
> 
It's actually OK if we re-grab a read lock when we already hold it.  This
will just cause the readers count to increment in the lock, and we will
drop it twice, so the count will be correct in the end.  But we don't
want to try to obtain a read lock if we already have a write lock.  So
it works to just use 'RW_WRITE_HELD()' here.

-Mark
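
A minimal sketch of the pattern being suggested (illustrative only;
dp_config_rwlock is the lock from this thread, the surrounding code is
invented):

    /* Only take the read lock if this thread doesn't already hold it as writer. */
    boolean_t need_lock = !RW_WRITE_HELD(&dp->dp_config_rwlock);

    if (need_lock)
            rw_enter(&dp->dp_config_rwlock, RW_READER);
    /* ... open the dataset object ... */
    if (need_lock)
            rw_exit(&dp->dp_config_rwlock);

Note the caveat in the follow-up below: if the thread may already hold the
lock as *reader*, re-entering as reader can still block behind a waiting
writer.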



[zfs-code] R/W lock portability issue

2007-11-27 Thread Mark Maybee
Chris Kirby wrote:
> Mark Maybee wrote:
>> Chris Kirby wrote:
>>
>>> Matthew Ahrens wrote:
>>>
>>>> So, we use RW_LOCK_HELD() to mean, "might this thread hold the 
>>>> lock?"  and it is generally only used in assertions.  Eg, some 
>>>> routine should only be called with the lock held, so we 
>>>> "ASSERT(RW_LOCK_HELD(lock))".  The fact that sometimes the 
>>>> implementation of RW_LOCK_HELD() returns TRUE when this thread does 
>>>> not in fact hold the lock is fine.
>>>>
>>>> In this case, we are using it a bit differently.  The current thread 
>>>> must *not* hold the lock for reader.  We use RW_LOCK_HELD() to 
>>>> determine definitively if this thread holds the lock for writer or 
>>>> not.  The behavior we want is:  if this thread already holds the 
>>>> lock for writer, we don't need to re-grab it.  Otherwise, we grab 
>>>> the lock for READER.
>>>>
>>>> However, now that I go look at the implementation of RW_LOCK_HELD(), 
>>>> it doesn't do what this code expects; it should be using 
>>>> RW_WRITE_HELD().
>>>
>>>
>>> The problem in this case turns out to be a bit harder. 
>>> dsl_dataset_open_obj can be called with dp_config_rwlock held for
>>> read or write, or not held at all.  An obvious way to fix this would
>>> be to have the callers pass an arg saying whether or not the lock is
>>> held, but that's esp. ugly in this case since there are ~30 callers,
>>> and at least some of them probably don't grab the lock directly.
>>>
>> Its actually OK if we re-grab a read lock when we already hold it.  This
>> will just cause the readers count to increment in the lock, and we will
>> drop it twice, so the count will be correct in the end.  But we don't
>> want to try to obtain a read lock if we already have a write lock.  So
>> it works to just use 'RW_WRITE_HELD()' here.
> 
> 
> But doesn't the rw_enter code block if there's a waiting
> writer?  In other words, can't we cause a deadlock
> if we already hold the read lock, there's a waiting writer,
> and we try to grab the read lock again?
> 
Ah right, this isn't the "unfair" version of the rw lock that we use in
the SPA, is it?  Yup, you're right, a waiting writer here could lead to a
deadlock.

-Mark



[zfs-code] mappedwrite().

2007-05-01 Thread Mark Maybee
Yes, it's the same in Solaris.  It's probably more correct to always do
the dmu_write(), as this keeps the page and the file in sync.
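
Roughly, that would mean restructuring the snippet quoted below so the
dmu_write() is no longer conditional on the uiomove() result (a sketch only,
with the error handling elided):

    if ((pp = page_lookup(vp, start, SE_SHARED)) != NULL) {
        caddr_t va;

        rw_exit(&zp->z_map_lock);
        va = ppmapin(pp, PROT_READ | PROT_WRITE, (caddr_t)-1L);
        error = uiomove(va + off, bytes, UIO_WRITE, uio);
        /*
         * Push the page contents to the DMU unconditionally so the
         * page and the on-disk file stay in sync even if uiomove()
         * only partially succeeded.
         */
        dmu_write(zfsvfs->z_os, zp->z_id, woff, bytes, va + off, tx);
        ppmapout(va);
        page_unlock(pp);
    }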

-Mark

Pawel Jakub Dawidek wrote:
> Hi.
> 
> I'm pondering this piece of mappedwrite():
> 
>   if (pp = page_lookup(vp, start, SE_SHARED)) {
>   caddr_t va;
> 
>   rw_exit(&zp->z_map_lock);
>   va = ppmapin(pp, PROT_READ | PROT_WRITE, (caddr_t)-1L);
>   error = uiomove(va+off, bytes, UIO_WRITE, uio);
>   if (error == 0) {
>   dmu_write(zfsvfs->z_os, zp->z_id,
>   woff, bytes, va+off, tx);
>   }
>   ppmapout(va);
>   page_unlock(pp);
>   [...]
> 
> In FreeBSD I call dmu_write() unconditionally, because if uiomove()
> partially succeeds, we will lose the change. If uiomove() fails
> entirely, well, we just write what we got before one more time.
> Is it also the case for OpenSolaris?
> 



[zfs-code] Refactor zfs_zget()

2007-05-09 Thread Mark Maybee
Ricardo Correia wrote:
> On Wednesday 09 May 2007 04:57:53 Ricardo Correia wrote:
>> 2) In the end of zfs_zget(), if the requested object number is not found,
>> it allocates a new znode with that object number. This shouldn't happen in
>> any FUSE operation.
> 
> Apparently, I didn't (and I still don't) fully understand this part of the 
> code, but I still need it after all.
> 
This code is not allocating new objects; rather, it is re-creating vnodes
when they are no longer in memory (e.g., the vnode was cleaned up after
the last reference was dropped, or this is the first time this object is
being referenced).

> So I propose only a very simple change - an added boolean parameter that 
> allows zfs_zget() to return unlinked objects.
> 
Could you file an RFE for this via the opensolaris bug-reporting
interface?

> See the attached patch (and sorry for the mismatched paths WRT the 
> OpenSolaris 
> tree).
> 
> Regards,
> Ricardo Correia
> 
> 
> 
> 
> diff -r 1d5a1b751011 -r 26733593a5d0 
> src/lib/libzfscommon/include/sys/zfs_znode.h
> --- a/src/lib/libzfscommon/include/sys/zfs_znode.hTue May 08 19:10:14 
> 2007 +0100
> +++ b/src/lib/libzfscommon/include/sys/zfs_znode.hWed May 09 21:43:30 
> 2007 +0100
> @@ -255,7 +255,7 @@ extern intzfs_freesp(znode_t *, uint64_
>  extern int   zfs_freesp(znode_t *, uint64_t, uint64_t, int, boolean_t);
>  extern void  zfs_znode_init(void);
>  extern void  zfs_znode_fini(void);
> -extern int   zfs_zget(zfsvfs_t *, uint64_t, znode_t **);
> +extern int   zfs_zget(zfsvfs_t *, uint64_t, znode_t **, boolean_t);
>  extern void  zfs_zinactive(znode_t *);
>  extern void  zfs_znode_delete(znode_t *, dmu_tx_t *);
>  extern void  zfs_znode_free(znode_t *);
> diff -r 1d5a1b751011 -r 26733593a5d0 src/lib/libzpool/zfs_znode.c
> --- a/src/lib/libzpool/zfs_znode.cTue May 08 19:10:14 2007 +0100
> +++ b/src/lib/libzpool/zfs_znode.cWed May 09 21:43:30 2007 +0100
> @@ -315,7 +315,7 @@ zfs_init_fs(zfsvfs_t *zfsvfs, znode_t **
>   for (i = 0; i != ZFS_OBJ_MTX_SZ; i++)
>   mutex_init(&zfsvfs->z_hold_mtx[i], NULL, MUTEX_DEFAULT, NULL);
>  
> - error = zfs_zget(zfsvfs, zfsvfs->z_root, zpp);
> + error = zfs_zget(zfsvfs, zfsvfs->z_root, zpp, B_FALSE);
>   if (error)
>   return (error);
>   ASSERT3U((*zpp)->z_id, ==, zfsvfs->z_root);
> @@ -639,7 +639,7 @@ zfs_mknode(znode_t *dzp, vattr_t *vap, u
>  }
>  
>  int
> -zfs_zget(zfsvfs_t *zfsvfs, uint64_t obj_num, znode_t **zpp)
> +zfs_zget(zfsvfs_t *zfsvfs, uint64_t obj_num, znode_t **zpp, boolean_t 
> zget_unlinked)
>  {
>   dmu_object_info_t doi;
>   dmu_buf_t   *db;
> @@ -674,7 +674,7 @@ zfs_zget(zfsvfs_t *zfsvfs, uint64_t obj_
>   mutex_enter(&zp->z_lock);
>  
>   ASSERT3U(zp->z_id, ==, obj_num);
> - if (zp->z_unlinked) {
> + if (zp->z_unlinked && !zget_unlinked) {
>   dmu_buf_rele(db, NULL);
>   mutex_exit(&zp->z_lock);
>   ZFS_OBJ_HOLD_EXIT(zfsvfs, obj_num);
> diff -r 1d5a1b751011 -r 26733593a5d0 src/zfs-fuse/zfs_acl.c
> --- a/src/zfs-fuse/zfs_acl.c  Tue May 08 19:10:14 2007 +0100
> +++ b/src/zfs-fuse/zfs_acl.c  Wed May 09 21:43:30 2007 +0100
> @@ -1366,7 +1366,7 @@ zfs_zaccess(znode_t *zp, int mode, cred_
>*/
>   if (is_attr) {
>   if ((error = zfs_zget(zp->z_zfsvfs,
> - zp->z_phys->zp_parent, &xzp)) != 0) {
> + zp->z_phys->zp_parent, &xzp, B_FALSE)) != 0){
>   return (error);
>   }
>   check_zp = xzp;
> diff -r 1d5a1b751011 -r 26733593a5d0 src/zfs-fuse/zfs_dir.c
> --- a/src/zfs-fuse/zfs_dir.c  Tue May 08 19:10:14 2007 +0100
> +++ b/src/zfs-fuse/zfs_dir.c  Wed May 09 21:43:30 2007 +0100
> @@ -187,7 +187,7 @@ zfs_dirent_lock(zfs_dirlock_t **dlpp, zn
>   zfs_dirent_unlock(dl);
>   return (EEXIST);
>   }
> - error = zfs_zget(zfsvfs, zoid, zpp);
> + error = zfs_zget(zfsvfs, zoid, zpp, B_FALSE);
>   if (error) {
>   zfs_dirent_unlock(dl);
>   return (error);
> @@ -261,7 +261,7 @@ zfs_dirlook(znode_t *dzp, char *name, vn
>   return (error);
>   }
>   rw_enter(&dzp->z_parent_lock, RW_READER);
> - error = zfs_zget(zfsvfs, dzp->z_phys->zp_parent, &zp);
> + error = zfs_zget(zfsvfs, dzp->z_phys->zp_parent, &zp, B_FALSE);
>   if (error == 0)
>   *vpp = ZTOV(zp);
>   rw_exit(&dzp->z_parent_lock);
> @@ -358,7 +358,7 @@ zfs_unlinked_drain(zfsvfs_t *zfsvfs)
>* We need to re-mark these list entries for deletion,
>* so we pull them back into core and set zp->z_unlinked.
>*/
> - error = zfs_zget(zfsvfs, zap.za_first_in

[zfs-code] Re: ARC deadlock.

2007-05-18 Thread Mark Maybee
Yup, Jürgen is correct.  The problem here is that we are blocked in
arc_data_buf_alloc() while holding a hash_lock.  This is bug 6457639.
One possibility for this specific bug might be to drop the lock before
the allocation and then redo the read lookup (in case there is a race)
with the necessary buffer already in hand.
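
Very roughly, something along these lines (illustrative only -- the re-lookup
helper and the arc_data_buf_free() counterpart are assumptions here, and the
real change would have to integrate with the existing hash lookup path):

    mutex_exit(hash_lock);              /* don't sleep while holding the hash lock */
    data = arc_data_buf_alloc(size);    /* KM_SLEEP allocation; may block */
    mutex_enter(hash_lock);

    if (arc_buf_rediscover(spa, bp)) {  /* placeholder: redo the read lookup */
        /* lost the race; another thread read the block in meanwhile */
        arc_data_buf_free(data, size);  /* assumed counterpart to the alloc */
    } else {
        /* proceed with the read, using the pre-allocated data buffer */
    }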

-Mark

Jürgen Keil wrote:
>> Kris Kennaway  found a deadlock,
>> which I think is not FreeBSD-specific.
>>
>> When we are running low in memory and kmem_alloc(KM_SLEEP) is called,
>> the thread waits for the memory to be reclaimed, right?
> 
>> In such situation the arc_reclaim_thread thread is woken up.
>>
>> Ok. I've two threads waiting for the memory to be freed:
>>
>> First one, and this one is not really problematic:
> ...
>> And second one, which holds
>> arc_buf_t->b_hdr->hash_lock:
> 
> 
> Bug 6457639 might be related, 
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6457639
> 
> In this case I also found the arc deadlocking because of KM_SLEEP
> allocations, while having parts of the buf hash table locked.



[zfs-code] Request for help and advice on cache behaviour

2007-01-10 Thread Mark Maybee
Ricardo Correia wrote:
> Hi,
> 
> I'm not sure how to control the ARC on the ZFS port to FUSE.
> 
> In the alpha1 release, for testing, I simply set the zfs_arc_max and 
> zfs_arc_min variables to 80 MBs and 64 MBs (respectively) to prevent the ARC 
> from growing unboundedly.
> 
> However, I'm having a problem. A simple run of the following script will 
> cause 
> zfs-fuse memory usage to grow almost indefinitely:
> 
> for i in `seq 1 10`;
> do
>  touch /pool/testdir/$i
> done
> 
> The problem seems to be that vnodes are getting allocated and never freed.
> 
> From what I understand, and from what I read in the previous thread about a 
> similar issue that Pawel was having, this is what happens in Solaris (and in 
> zfs-fuse, by extension):
> 
> 1) When VN_RELE() is called and vp->v_count reaches 1, VOP_INACTIVE() is 
> called.
> 2) VOP_INACTIVE() calls zfs_inactive() which calls zfs_zinactive().
> 3) zfs_zinactive() calls dmu_buf_rele()
> 4) ??
> 5) znode_pageout_func() calls zfs_znode_free() which finally frees the vnode.
> 
> As for step 4, Mark Maybee mentioned:
> 
> "Note that the db_immediate_evict == 0 means that you
> will probably *not* see a callback to the pageout function immediately.
> This is the general case.  We hold onto the znode (and related memory)
> until the associated disk blocks are evicted from the cache (arc).  The
> cache is likely to hold onto that data until either:
> - we encounter memory shortage, and so reduce the cache size
> - we read new data into the cache, and evict this data to
>   make space for it."
> 
> So even if I have a "not very big" cache, there can be a lot of alloc'ed 
> vnodes which consume a lot more memory!
> Of course, if the ARC would somehow take that memory in account when checking 
> zfs_arc_max it would be easier to tune it.
> 
It's not just the vnodes; there are also znodes, dnodes, and dbufs to
consider.  The bottom line is that 64 MB of vnode-related ARC data can
tie up 192 MB of other memory.  We are in the process of making the ARC
more aware of these extra overheads.

> So, the better question is: is the ARC even helpful for a FUSE filesystem?
> I mean, the Linux kernel is already caching file data, even for FUSE 
> filesystems.
> 
> Maybe I should try to disable the ARC? Or set it to a very small maximum size?
> 
> Or should I have a thread that monitors memory usage and calls 
> arc_kmem_reclaim() when it reaches a certain point? If this is the case, I 
> don't know how to determine what is considered excessive memory usage, since 
> we could be running in a computer with as little or as much memory as.. well, 
> Linux can handle.
> 
This is done on Solaris.  We have a thread that monitors the free memory
available on the machine, and tries to keep the ARC usage in check.  In
your FUSE implementation you may want to minimally track the size of the
vnode/znode/dnode/dbuf cache usage.  In general, this is just a hard
problem.  You want to keep as much data in the ARC as possible for best
performance, while not impacting applications that need memory.  The
better integration you can get between the ARC and the VM system, the
happier you will be.  In the long term, Sun will likely migrate the
ARC functionality completely into the VM system.
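
In a FUSE port, the minimal version of such a monitor might look something
like this (the threshold hooks and the no-argument arc_kmem_reclaim()
signature are assumptions, not a description of the real interfaces):

    #include <unistd.h>

    /* Placeholder hooks for whatever memory accounting the port uses. */
    extern size_t port_memory_in_use(void);
    extern size_t port_memory_high_water(void);
    extern void arc_kmem_reclaim(void);     /* assumed to take no arguments */

    static void *
    memory_monitor_thread(void *arg)
    {
        (void) arg;
        for (;;) {
            sleep(1);       /* poll interval is arbitrary */
            if (port_memory_in_use() > port_memory_high_water())
                arc_kmem_reclaim();     /* ask the ARC to give memory back */
        }
        return (NULL);
    }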

-Mark



[zfs-code] ARC deadlock.

2007-08-09 Thread Mark Maybee
Sorry Pawel, we have not done much with this bug.  It is not a high
priority bug for us, since we don't see it under Solaris, and there have
been lots of other demands on our time and resources lately.  Have you
tried exploring the fix I suggested?  We are always happy to accept
fixes from the community :-).

-Mark

Pawel Jakub Dawidek wrote:
> On Fri, May 18, 2007 at 08:22:26AM -0600, Mark Maybee wrote:
>> Yup, Jürgen is correct.  The problem here is that we are blocked in
>> arc_data_buf_alloc() while holding a hash_lock.  This is bug 6457639.
>> One possibility, for this specific bug might be to drop the lock before
>> the allocate and then redo the read lookup (in case there is a race)
>> with the necessary buffer already in hand.
> 
> Any updates on this? We don't see those deadlocks when ZIL is disabled
> and we're thinking about disabling ZIL by default for 7.0-RELEASE if we
> won't find fix for this. It is somehow quite easy to triggers for some
> workloads.
> 
>>> Jürgen Keil wrote:
>>>> Kris Kennaway  found a deadlock,
>>>> which I think is not FreeBSD-specific.
>>>>
>>>> When we are running low in memory and kmem_alloc(KM_SLEEP) is called,
>>>> the thread waits for the memory to be reclaimed, right?
>>>> In such situation the arc_reclaim_thread thread is woken up.
>>>>
>>>> Ok. I've two threads waiting for the memory to be freed:
>>>>
>>>> First one, and this one is not really problematic:
>>> ...
>>>> And second one, which holds
>>>> arc_buf_t->b_hdr->hash_lock:
>>>
>>> Bug 6457639 might be related, 
>>> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6457639
>>>
>>> In this case I also found the arc deadlocking because of KM_SLEEP
>>> allocations, while having parts of the buf hash table locked.



[zfs-code] IO error on mount for encrypted dataset

2007-08-14 Thread Mark Maybee
Darren J Moffat wrote:
> Does the ARC get flushed for a dataset when it is unmounted ?

Yes

> What does change when a dataset is unmounted ?
> 
Pretty much everything associated with the dataset should be evicted...
it's possible that some of the meta-data may hang around, I suppose (I
don't remember off-hand).

> The context of the problem is this:
> 
> create a pool,
> provide the pool encryption key,
> create a dataset with encryption turned on,
> put data into that dataset
>   I see it getting encrypted and written to disk by zio_write,
> zfs umount -a
> zfs mount -a
>   I can read the data back - yeah!.
> 
> However if I export the pool and then reimport it the mount fails with 
> an IO error (and yes I did remember to give the pool the key again).
> 
> A little dtracing shows me that a zap_lookup of VERSION for the 
> encrypted dataset is returning EIO, that causes the mount to ultimately 
> fail.

It could be that this meta-data does not get evicted at unmount.

> 
> --
> Darren J Moffat



[zfs-code] Setting zio->io_error during zio_write

2007-08-15 Thread Mark Maybee
Darren J Moffat wrote:
> For an encrypted dataset it is possible that by the time we arrive in 
> zio_write() [ zio_write_encrypt() ] that when we lookup which key is 
> needed to encrypted this data that key isn't available to us.
> 
> Is there some value of zio->io_error I can set that will not result in a 
> panic ? but will put the write in to some state where we can try again 
> later - I guess not just this write but maybe the whole transaction group ?
> 
No, we have no ability to do this.  With George's fix for 6565042, we
will introduce the ability to "hang" the pool on an IO failure... this
may give you what you want.

> I can, and I think I have, fixed the mount time case by returning false 
> from zfs_is_mountable() if the dataset is encrypted and the key needed 
> isn't present.  However the key could go away - without us knowing so in 
> some cases (not really for the pure software case but it can happen with 
> hardware token keys).
> 



[zfs-code] Contents of transaction group?

2007-04-09 Thread Mark Maybee
Atul Vidwansa wrote:
> Hi,
>I have few questions about the way a transaction group is created.
> 
> 1. Is it possible to group transactions related to multiple operations
> in same group? For example, an "rmdir foo" followed by "mkdir bar",
> can these end up in same transaction group?
> 
Yes.

> 2. Is it possible for an operation (say write()) to occupie multiple
> transaction groups?
> 
Yes.  Writes are broken into transactions at block boundaries.  So it
is possible for a large write to span multiple transaction groups.
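
For reference, this is roughly the per-chunk pattern involved (variable setup
and error handling elided; each chunk gets its own transaction, which
dmu_tx_assign() attaches to whatever txg is currently open, so consecutive
chunks of one big write may land in different groups):

    while (resid > 0) {
        uint64_t chunk = MIN(resid, max_blksz); /* block-sized pieces */
        dmu_tx_t *tx = dmu_tx_create(os);

        dmu_tx_hold_write(tx, object, offset, chunk);
        if (dmu_tx_assign(tx, TXG_WAIT) != 0) {
            dmu_tx_abort(tx);
            break;
        }
        dmu_write(os, object, offset, chunk, buf, tx);
        dmu_tx_commit(tx);

        offset += chunk;
        buf += chunk;
        resid -= chunk;
    }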

> 3. Is it possible to know the thread id(s) for every commited txg_id?
> 
No.

-Mark



[zfs-code] Scrubbing a zpool built on LUNs

2007-04-27 Thread Mark Maybee
Mike,

Please post this sort of query to zfs-discuss (rather than zfs-code).
zfs-code is a development discussion forum.

Without any form of replication that ZFS knows about (RAIDZ or mirrors),
there is no way for ZFS to fix up data errors detected in a scrub.
RAID5 LUNs just look like normal device LUNs to ZFS.  Since we always
store meta-data in replicated form, we *will* be able to fix meta-data errors.
A scrub may still be useful to you, however, as it will identify any
data that has checksum errors (that you can then repair by hand if
you have backup copies).

-Mark

Mike wrote:
> I'm building a system with two Apple RAIDs attached. I have hardware RAID5 
> configured so no RAIDZ or RAIDZ2, just a basic zpool pointing at the four 
> LUNs representing the four RAID controllers. For on-going maintenance, will a 
> zpool scrub be of any benefit? From what I've read with this layer of 
> abstration ZFS is only maintaining the metadata and not the actual data on 
> the drives. What's the best practice to keep the data healthy? (scrub, 
> import/export, etc..) Also, where are the replicas of the metadata stored in 
> this configuration? Can that be backed up?
> 
> Thanks.



[zfs-code] Kernel panic on "zpool import"

2006-09-14 Thread Mark Maybee
Daniel Rock wrote:
> Hi,
> 
> I just triggered a kernel panic while trying to import a zpool.
> 
> The disk in the zpool was residing on a Symmetrix and mirrored with SRDF. The 
> host sees both devices though (one writeable device "R1" on one
> Symmetrix box and one write protected device "R2" on another Symmetrix box).
> 
> It seems zfs tries to import the write protected device (instead of the 
> writeable one) and panics.
> 
> On VxVM I usually can just import such a disk group with no problems:
> 1. I could mark the write protected disks as "offline":
>  vxdisk offline cXtYdZ
> 2. Even if I don't mark the write protected disks as offline I am still be 
> able to import the disk group, because VxVM notices the write protected disks 
> by itself.
> 
> Shouldn't have ZFS a similar mechanism?
> 
> The panic only occurs during manual import. On auto-import during bootup the 
> pool is imported just fine.
> 
Yes, ZFS should be able to deal with this... but it obviously doesn't
yet.  This only occurs on an import because we recreate the pool by
probing out the available luns.  We apparently have chosen to use some
of the read-only interfaces.  This does not happen when we work from the
zpool.cache, since we have all the devices we are going to use for the
pool already listed in the cache, and so don't attempt to probe.

This will be at least partially addressed when bug 6470435 is fixed.

> 
> Some info from the crash dump:
> 
> 
>>::status
> 
> debugging crash dump vmcore.0 (64-bit) from wesson
> operating system: 5.10 Generic_118833-22 (sun4us)
> panic message: 
> ZFS: I/O failure (write on  off 0: zio 300081140c0 [L0 unallocated] 
> 4000L/600P DVA[0]=<0:1400:600> DVA[1]=<0:68001400:600> fletcher4 lzjb BE 
> contiguous birth=7 fill=0 cksum=6c81986e6d:63188e
> dump content: kernel pages only
> 
>>$ 
> [...]
> 
> WARNING: /scsi_vhci/ssd at g6006048287460610424356323645 (ssd38):
> Error for Command: write(10)   Error Level: Fatal
> Requested Block: 8236  Error Block: 8236
> Vendor: EMCSerial Number: 1026E000e   
> Sense Key: Write Protected
> ASC: 0x27 (write protected), ASCQ: 0x0, FRU: 0x0
> WARNING: /scsi_vhci/ssd at g6006048x424356323645 (ssd38):
> Error for Command: write(10)   Error Level: Fatal
> Requested Block: 3416108   Error Block: 3416108
> Vendor: EMCSerial Number: 1026E000e   
> Sense Key: Write Protected
> ASC: 0x27 (write protected), ASCQ: 0x0, FRU: 0x0
> [above WARNING pair in total 4 times]
> 
> panic[cpu16]/thread=2a10cc0: 
> ZFS: I/O failure (write on  off 0: zio 300081140c0 [L0 unallocated] 
> 4000L/600P DVA[0]=<0:1400:600> DVA[1]=<0:68001400:600> fletcher4 lzjb BE 
> contiguous birth=7 fill=0 
> cksum=6c81986e6d:63188e9e5462:30ab176f69d04a:110eda2de2c1bde2): error 5
> 
> 
> 02a10740 zfs:zio_done+284 (300081140c0, 0, a8, 7019dbf0, 0, 
> 3000b1d4dc0)
>   %l0-3: 0300080c3200 7019d800 0005 0005
>   %l4-7: 7b712278 0002 0007 0005
> 02a10940 zfs:zio_vdev_io_assess+178 (300081140c0, 8000, 10, 0, 0, 10)
>   %l0-3: 0002 0002  0005
>   %l4-7: 0010 0025f1ac  00344fdd3258
> 02a10a00 genunix:taskq_thread+1a4 (300042dc578, 300042dc520, 50001, 
> 345003f564, 2a10aca, 2a10ac8)
>   %l0-3: 0001 0300042dc548 0300042dc550 0300042dc552
>   %l4-7: 0300063ef068 0002  0300042dc540
> 
> syncing file systems...
> 
> 
>>$c
> 
> vpanic(7b73b3a0, 7019dbf0, 7019dbc0, 7b73a428, 0, 300081140c0)
> zio_done+0x284(300081140c0, 0, a8, 7019dbf0, 0, 3000b1d4dc0)
> zio_vdev_io_assess+0x178(300081140c0, 8000, 10, 0, 0, 10)
> taskq_thread+0x1a4(300042dc578, 300042dc520, 50001, 345003f564, 2a10aca, 
> 2a10ac8)
> thread_start+4(300042dc520, 0, baddcafebaddcafe, baddcafebaddcafe, 
> baddcafebaddcafe, baddcafebaddcafe)



[zfs-code] Re: ASSERT failed dn->dn_nlevels > level (0x0 > 0x0) dbuf.c, line: 1523

2006-09-29 Thread Mark Maybee
Darren,

I looked a bit at your dumps... in both cases, the problem is that the
os_phys block that we read from the disk is garbage:

 > 0x9377b000::print objset_phys_t
{
 os_meta_dnode = {
 dn_type = 0
 dn_indblkshift = 0
 dn_nlevels = 0
 dn_nblkptr = 0
 dn_bonustype = 0
 dn_checksum = 0
 dn_compress = 0
 dn_flags = 0
 dn_datablkszsec = 0x3
 dn_bonuslen = 0
 dn_pad2 = [ 0, 0, 0 ]
 dn_crypt = 0
 dn_maxblkid = 0
 dn_used = 0
 dn_pad3 = [ 0, 0, 0, 0 ]
 dn_blkptr = [
 {
 blk_dva = [
 {
 dva_word = [ 0, 0 ]
 }
...

I checked the actual arc buf this came from, and it looks the same.  So
the buf was successfully read, and it checksummed, but it doesn't have
good data.  This pretty much says that the problem is on the write side.
When we wrote out the root block in dmu_objset_sync(), we must have
written garbage.  I'm not yet quite sure how this happened... perhaps
something is messed up in your write path changes
(arc_write->zio_write->...), but it's not obvious.  I'll investigate some
more when I get a chance.

-Mark

Darren J Moffat wrote:
> I really need some help on this.  Without help the ZFS crypto project is 
>   stalled.
> 
> I've updated my bits to the ON gate as of last night.  The way I
> recreate this is slightly different but the assert is still the same.
> 
> I can create a pool with my bits and export it; when I import it I
> get the dn_levels assert.
> 
> Please I really need help.
> 
> Darren J Moffat wrote:
> 
>> Using the ZFS crypto bits, see [1] for webrev, which are in sync with ON
>>  as of 2006-09-12 (ie they include the BrandZ stuff and the changes
>> that Eric putback on the 12th).
>>
>> [1] http://cr.grommit.com/~darrenm/zfs-crypto/
>>
>>
>> I created a new pool:
>>
>> # zpool create -f tank c0t1d0
>>
>> I then created four new file systems
>>
>> # zfs create -o encryption=aes256 tank/cipher-aes256
>> # zfs create -o encryption=aes128 tank/cipher-aes128
>> # zfs create -o encryption=aes192 tank/cipher-aes192
>> # zfs create -o encryption=off tank/clear
>>
>> I listed the encryption property, then I exported the pool.
>>
>> When I did so the machine panic'd thus:
>>
>>
>> panic[cpu1]/thread=be728880: assertion failed: dn->dn_nlevels >
>> level (0x0 > 0x0), file: ../../common/fs/zfs/dbuf.c, line: 1523
>>
>> fe8000bb4730 genunix:assfail3+b9 ()
>> fe8000bb47d0 zfs:dbuf_hold_impl+329 ()
>> fe8000bb4810 zfs:dbuf_hold+2b ()
>> fe8000bb48a0 zfs:dnode_hold_impl+bd ()
>> fe8000bb48d0 zfs:dnode_hold+2b ()
>> fe8000bb4950 zfs:dmu_buf_hold+45 ()
>> fe8000bb4a20 zfs:zap_lockdir+58 ()
>> fe8000bb4aa0 zfs:zap_lookup+4d ()
>> fe8000bb4b10 zfs:dsl_pool_open+94 ()
>> fe8000bb4bb0 zfs:spa_load+566 ()
>> fe8000bb4c00 zfs:spa_tryimport+90 ()
>> fe8000bb4c50 zfs:zfs_ioc_pool_tryimport+31 ()
>> fe8000bb4cd0 zfs:zfsdev_ioctl+115 ()
>> fe8000bb4d10 genunix:cdev_ioctl+48 ()
>> fe8000bb4d50 specfs:spec_ioctl+86 ()
>> fe8000bb4db0 genunix:fop_ioctl+37 ()
>> fe8000bb4eb0 genunix:ioctl+16b ()
>> fe8000bb4f00 unix:brand_sys_syscall32+2a1 ()
>>
>> For some reason savecore didn't grab the dump so I tried again:
>>
>> This time I only go the first two filesystems created and I got
>> a different panic:
>>
>> panic[cpu1]/thread=fe800036bc80: assertion failed: (((bp)->blk_birth
>> == 0) ? 0 : ((bp)->blk_prop[0]) >> (0)) & ((1ULL << (16)) - 1)) +
>> (1)) << (9))) == db->db_level == 1 ? dn->dn_datablksz :
>> (1dn_indblkshift) (0x200 == 0x4000), file:
>> ../../common/fs/zfs/dbuf.c, line: 2186
>>
>> fe800036b4d0 genunix:assfail3+b9 ()
>> fe800036b820 zfs:dbuf_write_done+920 ()
>> fe800036b880 zfs:arc_write_done+1d3 ()
>> fe800036ba10 zfs:zio_done+2e4 ()
>> fe800036ba40 zfs:zio_next_stage+112 ()
>> fe800036ba90 zfs:zio_wait_for_children+56 ()
>> fe800036bab0 zfs:zio_wait_children_done+20 ()
>> fe800036bae0 zfs:zio_next_stage+112 ()
>> fe800036bb30 zfs:zio_vdev_io_assess+140 ()
>> fe800036bb60 zfs:zio_next_stage+112 ()
>> fe800036bbb0 zfs:vdev_mirror_io_done+377 ()
>> fe800036bbd0 zfs:zio_vdev_io_done+26 ()
>> fe800036bc60 genunix:taskq_thread+1dc ()
>> fe800036bc70 unix:thread_start+8 ()
>>
>> Then on reboot from that panic we see this one:
>>
>> panic[cpu0]/thread=870a0c00: assertion failed: dn->dn_nlevels >
>> level (0x0 > 0x0), file: ../../common/fs/zfs/dbuf.c, line: 1523
>>
>> fe80005806a0 genunix:assfail3+b9 ()
>> fe8000580740 zfs:dbuf_hold_impl+329 ()
>> fe8000580780 zfs:dbuf_hold+2b ()
>> fe8000580810 zfs:dnode_hold_impl+bd ()
>> fe8000580840 zfs:dnode_hold+2b ()
>> fe80005808c0 zfs:dmu_buf_hold+45 ()
>> fe8000580990 zfs:zap_lockdir+58 ()
>> fe8000580a10 zfs:zap_lookup+4d ()
>> fe8000580a8

[zfs-code] Re: ASSERT failed dn->dn_nlevels > level (0x0 > 0x0) dbuf.c, line: 1523

2006-10-04 Thread Mark Maybee
Darren,

It looks like you modeled your ENCRYPT/DECRYPT stages on the
COMPRESS/UNCOMPRESS stages.  This is all well and good... however
your emulation is not quite accurate:  The WRITE_COMPRESS stage
is *always* included in the write pipeline, so the zio_write()
func only adds it to the async_stages variable when compression
is turned on.  The WRITE_ENCRYPT stage, on the other hand, is
explicitly in the ASYNC_STAGES, but *not* always part of the
write pipeline.  In zio_write() you are setting it in the
async_stages variable... which it is already in.  I think you
need to add it to the pipeline rather than the async_stages.
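
In other words, the difference is roughly this (the stage and field names
follow the crypto prototype's conventions as described above, and the exact
mechanics of how zio_write() builds the stage masks are glossed over -- treat
it as an assumption-laden sketch, not the real code):

    /*
     * Redundant: WRITE_ENCRYPT is already listed among the async stages,
     * so setting it there again changes nothing:
     *
     *      async_stages |= 1U << ZIO_STAGE_WRITE_ENCRYPT;
     *
     * Needed instead: since WRITE_ENCRYPT is *not* always part of the
     * write pipeline, it has to be added to the pipeline itself when
     * encryption is enabled for this block.
     */
    zio->io_pipeline |= 1U << ZIO_STAGE_WRITE_ENCRYPT;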

It's still not clear to me how this can result in your problems, but
then I don't yet understand how the SPA io pipeline works in all
circumstances.

-Mark

Mark Maybee wrote:
> Darren,
> 
> I looked a bit at your dumps... in both cases, the problem is that the
> os_phys block that we read from the disk is garbage:
> 
>  > 0x9377b000::print objset_phys_t
> {
>  os_meta_dnode = {
>  dn_type = 0
>  dn_indblkshift = 0
>  dn_nlevels = 0
>  dn_nblkptr = 0
>  dn_bonustype = 0
>  dn_checksum = 0
>  dn_compress = 0
>  dn_flags = 0
>  dn_datablkszsec = 0x3
>  dn_bonuslen = 0
>  dn_pad2 = [ 0, 0, 0 ]
>  dn_crypt = 0
>  dn_maxblkid = 0
>  dn_used = 0
>  dn_pad3 = [ 0, 0, 0, 0 ]
>  dn_blkptr = [
>  {
>  blk_dva = [
>  {
>  dva_word = [ 0, 0 ]
>  }
> ...
> 
> I checked the actual arc buf this came from, and it looks the same.  So
> the buf was successfully read, and it checksummed, but it doesn't have
> good data.  This pretty much says that the problem is on the write side.
> When we wrote out the root block in dmu_objset_sync(), we must have
> written garbage.  I'm not yet quite sure how this happened... perhaps
> something is messed up in your write path changes
> (arc_write->zio_write->...), but its not obvious.  I'll investigate some
> more when I get a chance
> 
> -Mark
> 
> Darren J Moffat wrote:
> 
>> I really need some help on this.  Without help the ZFS crypto project 
>> is   stalled.
>>
>> I've updated my bits to the ON gate as of last night.  The way I
>> recreate this is slightly different but the assert is still the same.
>>
>> I can create a pool with my bits and export it; when I import it I
>> get the dn_levels assert.
>>
>> Please I really need help.
>>
>> Darren J Moffat wrote:
>>
>>> Using the ZFS crypto bits, see [1] for webrev, which are in sync with ON
>>>  as of 2006-09-12 (ie they include the BrandZ stuff and the changes
>>> that Eric putback on the 12th).
>>>
>>> [1] http://cr.grommit.com/~darrenm/zfs-crypto/



[zfs-code] Current zfetch heuristic

2006-10-11 Thread Mark Maybee
Jeremy Teo wrote:
> Heya,
> 
> just a short blurb of what I understand from grokking dmu_zfetch.c
> 
> Basically the current code issues a prefetch (ie. create a new
> prefecth stream) whenever a block (level 0, DB_RF_NOPREFETCH is not
> set) is is read in dbuf_read.
> 
> Since ZFS is multi-threaded, we allow up to 8 such prefetch streams to
> be running concurrently.
> 
> dmu_zfetch calls dmu_zfetch_find, which tries to categorize the access
> pattern based on the current prefetch streams into 1 of 4 patterns:
> 
> 1. Forward sequential access
> 2. Backward sequential access
> 3. Forward sequential strided access (does this mean we skip blocks,
> but forwards?)
> 4. Backward sequential strided access (ditto...)
> 
> The new prefetch stream being created is then optimized for 1 of the 4 
> cases.
> 
> By default, we prefetch blocks in a forward manner: if the access
> pattern doesn't match 1 of the 4 cases, or doesn't always work well
> with a forward access pattern, then we suffer a potential performance
> loss, in terms of wastage of bandwidth and arc_cache space.
> 
> Is this accurate? And what other performance pitfalls/corner cases are
> there currently?
> 
Basically correct... although I am not sure I understand what you
mean by: "we prefetch blocks in a forward manner".  We will prefetch
blocks in a "backward manner" if we detect a reverse stride.  But
regardless of that, if the application does not actually need the
blocks we are prefetching, then yes, we will waste IO bandwidth and
cache space.

-Mark



[zfs-code] ZFS and memory usage.

2006-11-07 Thread Mark Maybee
Pawel Jakub Dawidek wrote:
> ZFS works really stable on FreeBSD, but I'm biggest problem is how to
> control ZFS memory usage. I've no idea how to leash that beast.
> 
> FreeBSD has a backpresure mechanism. I can register my function so it
> will be called when there are memory problems, which I do. I using it
> for ARC layer.
> Even with this in place under heavy load the kernel panics, because
> memory with KM_SLEEP cannot be allocated.
> 
> Here are some statistics of memory usage when the panic occurs:
> 
> zfs_znode_cache: 356 * 11547 = 4110732 bytes
>   zil_lwb_cache: 176 * 43 = 7568 bytes
>   arc_buf_t:  20 * 7060 = 141200 bytes
>   arc_buf_hdr_t: 188 * 7060 = 1327280 bytes
> dnode_t: 756 * 162311 = 122707116 bytes !!
>  dmu_buf_impl_t: 332 * 18649 = 6191468
>   other: 14432256 bytes (regular kmem_alloc())
> 
> There is 1GB of RAM, 320MB is for the kernel. 1/3 if kernel memory is
> configured as ARC's maximum.
> When it panics, debugger statistics show that there is around 2/3 of
> this actually allocated, but probably due memory fragmentation.
> 
> The most important part is dnode_t probably as it looks it doesn't obey
> any limits. Maybe it is a bug in my port and I'm leaking them somehow?
> On the other hand when I unload ZFS kernel module, FreeBSD's kernel
> reports any memory leaks - they exist, but are much, much smaller.
> 
> There is also quite a lot of znodes, which I'd also like to be able to
> free and not sure how. In Solaris vnode's life end in VOP_INACTIVE()
> routine, but znode if kept around. In FreeBSD VOP_INACTIVE() means "puts
> the vnode onto free vnodes list" and when we want to use this vnode for
> different file system VOP_RECLAIM() is called and VOP_RECLAIM() will be
> a good place to free znode as well, if possible.
> 
> Any ideas how to fix it?
> 
The problem is that in ZFS the vnode holds onto more memory than just
the vnode itself.  It's fine to place the vnode on a "free vnodes list"
after a VOP_INACTIVE()... but you need to make sure that you have
"released" the *extra* memory associated with the vnode:
vnode refs a znode (356 bytes + 512 bytes for the phys)
znode refs a dnode (756 bytes + 512-16k for the phys)
So a vnode could be holding up to 17k of data in memory!

I suggest you free up the znode at VOP_INACTIVE() rather than waiting
until VOP_RECLAIM().

-Mark



[zfs-code] ZFS and memory usage.

2006-11-10 Thread Mark Maybee
Pawel Jakub Dawidek wrote:
> On Tue, Nov 07, 2006 at 06:06:48PM -0700, Mark Maybee wrote:
> 
>>The problem is that in ZFS the vnode holds onto more memory than just
>>the vnode itself.  Its fine to place the vnode on a "free vnodes list"
>>after a VOP_INACTIVE()... but you need to make sure the you have
>>"released" the *extra* memory associated with the vnode:
>>  vnode refs a znode (356 bytes + 512 bytes for the phys)
>>  znode refs a dnode (756 bytes + 512-16k for the phys)
>>So a vnode could be holding up to 17k of data in memory!
>>
>>I suggest you free up the znode at VOP_INACTIVE() rather than waiting
>>until VOP_RECLAIM().
> 
> 
> How? From What I see znode is freed via zfs_znode_free() called from
> znode_pageout_func(). I don't think I can just call zfs_znode_free(), as
> I'm quite sure there are some dependent resources I need to release.
> Which function should I used for this?
> 
The callback to the pageout function is triggered by the dmu_buf_rele()
on the buffer containing the phys portion of the znode.  This rele also
triggers the freeing of all the other memory associated with the znode.

In the current Solaris bits, the rele happens in zfs_zinactive().
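
The shape of that arrangement is roughly as follows (the dmu_buf_set_user()
arguments are from memory and may not match the exact signature in those
bits):

    /*
     * When the znode is set up, register the eviction callback on the
     * dbuf that holds its phys data:
     */
    dmu_buf_set_user(db, zp, &zp->z_phys, znode_pageout_func);

    /*
     * Later, in zfs_zinactive(), drop the hold.  The pageout callback
     * (and thus zfs_znode_free()) only runs once the ARC actually
     * evicts the underlying buffer:
     */
    dmu_buf_rele(db, NULL);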

-Mark



[zfs-code] ZFS and memory usage.

2006-11-10 Thread Mark Maybee
Pawel Jakub Dawidek wrote:
> On Fri, Nov 10, 2006 at 06:36:07AM -0700, Mark Maybee wrote:
> 
>>Pawel Jakub Dawidek wrote:
>>
>>>On Tue, Nov 07, 2006 at 06:06:48PM -0700, Mark Maybee wrote:
>>>
>>>>The problem is that in ZFS the vnode holds onto more memory than just
>>>>the vnode itself.  Its fine to place the vnode on a "free vnodes list"
>>>>after a VOP_INACTIVE()... but you need to make sure the you have
>>>>"released" the *extra* memory associated with the vnode:
>>>>vnode refs a znode (356 bytes + 512 bytes for the phys)
>>>>znode refs a dnode (756 bytes + 512-16k for the phys)
>>>>So a vnode could be holding up to 17k of data in memory!
>>>>
>>>>I suggest you free up the znode at VOP_INACTIVE() rather than waiting
>>>>until VOP_RECLAIM().
>>>
>>>How? From What I see znode is freed via zfs_znode_free() called from
>>>znode_pageout_func(). I don't think I can just call zfs_znode_free(), as
>>>I'm quite sure there are some dependent resources I need to release.
>>>Which function should I used for this?
>>
>>The callback to the pageout function is triggered by the dmu_buf_rele()
>>on the buffer containing the phys portion of the znode.  This rele also
>>triggers the freeing of all the other memory associated with the znode.
>>
>>In the current Solaris bits, the rele happens in zfs_zinactive().
> 
> 
> Ok, I also call dmu_buf_rele() from zfs_zinactive(), but I don't see
> pageout beeing called right after zfs_zinactive().
> 
> In dmu_buf_rele() function I've such data (this is on simple
> 'touch /tank/foo'):
> 
>   holds = 1
>   db->db_dirtycnt = 1
>   db->db_level = 0
>   db->db_d.db_immediate_evict = 0
> 
> Is it the same in Solaris on 'touch /tank/foo' for dmu_buf_rele() called
> from zfs_zinactive()?

This looks fine.  Note that the db_immediate_evict == 0 means that you
will probably *not* see a callback to the pageout function immediately.
This is the general case.  We hold onto the znode (and related memory)
until the associated disk blocks are evicted from the cache (arc).  The
cache is likely to hold onto that data until either:
- we encounter memory shortage, and so reduce the cache size
- we read new data into the cache, and evict this data to
  make space for it.

-Mark



[zfs-code] Lock order reversal (harmless?).

2006-11-22 Thread Mark Maybee
Pawel Jakub Dawidek wrote:
> Hi.
> 
> FreeBSD's WITNESS mechanism for detecting lock order reversals reports
> LOR here:
> 
> lock order reversal:
>  1st 0xc3f7738c zfs:dbuf (zfs:dbuf) @ 
> /zoo/pjd/zfstest/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/dnode_sync.c:410
>  2nd 0xc3fefcc0 zfs:zn (zfs:zn) @ 
> /zoo/pjd/zfstest/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c:68
> KDB: stack backtrace:
> db_trace_self_wrapper(c05ee0fe) at db_trace_self_wrapper+0x25
> kdb_backtrace(0,,c065f510,c065f358,c061e82c,...) at kdb_backtrace+0x29
> witness_checkorder(c3fefcc0,9,c3c44a6b,44) at witness_checkorder+0x586
> _sx_xlock(c3fefcc0,c3c44a6b,44,0,c3fefca8,...) at _sx_xlock+0x6e
> znode_pageout_func(c3f77350,c3fefca8,e6172a08,c3bd8005,c3f77350,...) at 
> znode_pageout_func+0x10a
> dbuf_evict_user(c3f77350,0,c3f77848,c3f77848,c3f59c94,...) at 
> dbuf_evict_user+0x4e
> dbuf_clear(c3f77350,c3bea63b,e6172b18,c3bea64c,c3f77350,...) at 
> dbuf_clear+0x31
> dbuf_evict(c3f77350,c0650f70,0,c05ed336,2ab,...) at dbuf_evict+0xe
> dnode_evict_dbufs(c3f59c94,0,2,0,0,...) at dnode_evict_dbufs+0x24c
> dnode_sync_free(c3f59c94,c4075900,c3c1c873,0,c37ef000,...) at 
> dnode_sync_free+0xf5
> dnode_sync(c3f59c94,0,c3f45000,c4075900,0,...) at dnode_sync+0x36a
> dmu_objset_sync_dnodes(c3b47000,c3b4713c,c4075900,21c,c375b54c,...) at 
> dmu_objset_sync_dnodes+0x81
> dmu_objset_sync(c3b47000,c4075900,e6172c50,c3bf1b2c,c397ac00,...) at 
> dmu_objset_sync+0x50
> dsl_dataset_sync(c397ac00,c4075900,0,c397ac00,e6172c54,...) at 
> dsl_dataset_sync+0x14
> dsl_pool_sync(c375b400,6,0,c061e82c,c375b4ac,...) at dsl_pool_sync+0x78
> spa_sync(c37ef000,6,0,7,0,...) at spa_sync+0x2a2
> txg_sync_thread(c375b400,e6172d38) at txg_sync_thread+0x1df
> fork_exit(c3c037e8,c375b400,e6172d38) at fork_exit+0xac
> fork_trampoline() at fork_trampoline+0x8
> --- trap 0x1, eip = 0, esp = 0xe6172d6c, ebp = 0 ---
> 
> Bascially it first recorded that db_mtx is locked before z_lock and the
> backtrace above is from where is it locked in an different order.
> 
Um, I think you have that backwards: normally we obtain the z_lock
first, then obtain the db_mtx; but in the above stack, we are obtaining
the z_lock *after* we hold the db_mtx.

> I think it is harmless, because znode is not visible at this point and
> can't be referenced, which means deadlock is not possible here, right?
> 
Yea, I think this is OK.  The z_lock is being used here to prevent a
race with the force-unmount code in zfs_inactive().  In that code path
we are not going to try to obtain the db_mtx while holding the z_lock.

> If I'm right, sorry for the noice, just wanted to be 100% sure.

No problem.  Thanks for bringing this to our attention.

-Mark



[zfs-code] Lock order reversal (harmless?).

2006-11-22 Thread Mark Maybee
Pawel Jakub Dawidek wrote:
> I had another one, can you analize it?
> 
> lock order reversal:
>  1st 0xc44b9b00 zfs:dbuf (zfs:dbuf) @ 
> /zoo/pjd/zfstest/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/dbuf.c:1644
>  2nd 0xc45be898 zfs:dbufs (zfs:dbufs) @ 
> /zoo/pjd/zfstest/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/dnode_sync.c:357
> KDB: stack backtrace:
> db_trace_self_wrapper(c05ee0fe) at db_trace_self_wrapper+0x25
> kdb_backtrace(0,,c065f538,c065f5b0,c061e82c,...) at kdb_backtrace+0x29
> witness_checkorder(c45be898,9,c4113f3c,165) at witness_checkorder+0x586
> _sx_xlock(c45be898,c4113f3c,165,c41249e0,0,...) at _sx_xlock+0x6e
> dnode_evict_dbufs(c45be730,0,c4406b40,0,c45be730,...) at 
> dnode_evict_dbufs+0x39
> dmu_objset_evict_dbufs(e6258bd0,0,4,c397c000,c40d9918,...) at 
> dmu_objset_evict_dbufs+0x131
> dmu_objset_evict(c397c200,c397c000,c397c600,c397c200,e6258c04,...) at 
> dmu_objset_evict+0xb1
> dsl_dataset_evict(c44b9ac4,c397c200,e6258c1c,c40aed02,c44b9ac4,...) at 
> dsl_dataset_evict+0x4e
> dbuf_evict_user(c44b9ac4,c40ab3ee,1,0,e6258c2c,...) at dbuf_evict_user+0x4e
> dbuf_rele(c44b9ac4,c397c200,e6258c50,c40c7c5c,c397c200,...) at dbuf_rele+0x72
> dsl_dataset_sync(c397c200,c3bfea80,0,c397c200,e6258c54,...) at 
> dsl_dataset_sync+0x44
> dsl_pool_sync(c397c600,148,0,c061e82c,c397c6ac,...) at dsl_pool_sync+0x78
> spa_sync(c37ef000,148,0,149,0,...) at spa_sync+0x2a2
> txg_sync_thread(c397c600,e6258d38) at txg_sync_thread+0x1df
> fork_exit(c40d9918,c397c600,e6258d38) at fork_exit+0xac
> fork_trampoline() at fork_trampoline+0x8
> --- trap 0x1, eip = 0, esp = 0xe6258d6c, ebp = 0 ---
> 
> The backtrace is from where lock order is different than the first
> recorded.
> 
I'm not sure, since your line numbers don't quite match up with mine,
but I think this is a complaint about the dn_dbufs_mtx vs the db_mtx?

If so, in this case it's OK.  The db_mtx we grab in dbuf_rele() is the
"parent" dbuf for the dataset we are evicting from dsl_dataset_evict().
So its not possible that any of the locks we attempt to grab in
dnode_evict_dbufs() could cause a deadlock with it.

-Mark



[zfs-code] Nested Transactions

2006-06-08 Thread Mark Maybee
No, nested transactions are not allowed.

(your sense is correct :-)).

-Mark

Jeremy Teo wrote:
> Are nested transactions allowed/supported by the DMU?
> 
> Namely, if I have function Foo and function Bar that both wrap their
> own operations using a transaction such that Foo and Bar are
> atomic/transactional, can I have a function FooBar that creates a
> transaction, invokes Foo, then Bar, and then commits the transaction?
> Or do I have to rewrite Foo and Bar?
> 
> (I sense the latter)
> 
> Thanks in advance. :)



[zfs-code] zio_push/pop_transform how and when to use it.

2006-06-01 Thread Mark Maybee
Hi Darren,

Sorry about the slow response (from me).  I was on vacation last
week (and am on semi-vacation this week).

I can't answer your question about using the zio transform stuff.
You will have to get Jeff or Bill's attention for that.

As far as the ARC "hook" goes: it doesn't yet exist.  You will
have to add this functionality yourself.  I will be happy to help
you design this if you decide to go down this path.

-Mark

Darren J Moffat wrote:
> Trying again since I didn't see any responses
> 
> 
> So today[0] I believe I have encrypted data (part of Hamlet in case you
> care[1]) in a ZFS file system for the first time without having a panic.[2]
> 
> However  when I read the data back using cat(1) I get garbage.
> Garbage of the correct size that looks very much like it would be the
> ciphertext!
> 
> Now I think this is because I used crypto_encrypt(9f) with the inplace
> encryption.  I believe this means that we now have the ciphertext in the
>   ARC.  Now that might actually be a good thing in some cases since it
> means that if the key goes away (ie user removed it) we don't need to
> flush the cache; it wasn't what I was trying to do at the moment though
> and I actually thought that was going to be one of the harder problems
> to solve :-)
> 
> So I went back and looked at how compression works and noticed that it
> uses zio_push/pop_transform.  I had tried using that before but it just
> resulted in panics so I put it aside for a while.
> 
> So should I be using the zio transform stuff here ?  If so how is it
> supposed to work, who allocates memory etc.
> 
> Assuming that I do want to actually have the data encrypted in the ARC
> is there a hook some where that I can have zio_decrypt_data called or is
> this something I'll need to add myself ?
> 
> Cheers
> 
> 
> [0] It was on 25th May :-)
> 
> [1] Yeah I should use something else since that is already in the
> Solaris source base (and put there by me).
> 
> [2] NOTE: This is still a prototype and is using a hardcoded key because
> this is for ZIO pipeline testing not even beta test.
> 
> -- 
> Darren J Moffat
> 



[zfs-code] Small nits in zfs_fm.c.

2006-12-07 Thread Mark Maybee
Sorry Pawel,

The team is rather slammed with work at the moment, so we may be a bit
slow at getting to things like this.  We certainly appreciate getting
these patches though.

-Mark

Pawel Jakub Dawidek wrote:
> On Mon, Nov 13, 2006 at 03:14:04PM +0100, Pawel Jakub Dawidek wrote:
> 
>>The patch below correct typo, duplicated entry and argument type.
>>(I'm not sure if duplicated entry removal is correct, as this is only
>>based on looking at the code.)
>>
>>  http://people.freebsd.org/~pjd/opensolaris/08.patch
> 
> 
> I see this is still not fixed. Noone is interested?



[zfs-code] Opening a snapshot.

2006-08-27 Thread Mark Maybee
Pawel Jakub Dawidek wrote:

>Hi.
>
>I'm currently working on snapshots and can't understand one thing.
>
>When someone lookups a snapshot directory it gets automatically mounted
>from zfsctl_snapdir_lookup() via domount().
>
>Ok, domount() ends up in zfs_domount(). zfs_domount() wants to open
>dataset through dmu_objset_open().
>dmu_objset_open() fails at first time returning EROFS, so we retry with
>DS_MODE_READONLY, but for me it fails again (EBUSY).
>
>I investigated that this dataset was DS_MODE_PRIMARY open on file
>system mount, so the snapshot cannot DS_MODE_PRIMARY open it again.
>
>What am I missing here or what have I messed up?
>Any ideas?
>  
>

The mode is staying set after the failed EROFS open?  This should have
been cleared when the open failed, so the second attempt can succeed.
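
For reference, the retry being described looks roughly like this (the exact
mode flags and how they combine are assumptions):

    error = dmu_objset_open(osname, DMU_OST_ZFS, DS_MODE_PRIMARY, &os);
    if (error == EROFS) {
        error = dmu_objset_open(osname, DMU_OST_ZFS,
            DS_MODE_PRIMARY | DS_MODE_READONLY, &os);
    }
    /*
     * The EBUSY on the second open suggests the failed first open left
     * the dataset marked as PRIMARY-opened; that state should have been
     * cleared when the first open failed.
     */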

-Mark