Re: [zfs-discuss] dedup and memory/l2arc requirements

2010-04-03 Thread Roy Sigurd Karlsbakk
  I might add some swap I guess.  I will have to try it on another
  machine with more RAM and less pool, and see how the size of the zdb
  image compares to the calculated size of DDT needed.  So long as zdb
  is the same or a little smaller than the DDT it predicts, the tool's
  still useful, just sometimes it will report ``DDT too big but not sure
  by how much'', by coredumping/thrashing instead of finishing.
 
 In my experience, more swap doesn't help break through the 2GB memory
 barrier.  As zdb is an intentionally unsupported tool, methinks recompile
 may be required (or write your own).

I guess this tool might not work too well, then, with 20TiB in 47M files?
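
One thing worth checking first, assuming the 2GB ceiling really is the 32-bit
address space limit: whether the zdb being run is a 32-bit binary, and whether
the build ships a 64-bit one (the amd64 path and the pool name below are
assumptions/placeholders; verify with file(1)):

 % file /usr/sbin/zdb
 % isainfo -kv
 % /usr/sbin/amd64/zdb -S tank    # hypothetical 64-bit copy, if your build has one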

Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. 
It is an elementary imperative for all pedagogues to avoid excessive use of 
idioms of foreign origin. In most cases adequate and relevant synonyms exist 
in Norwegian.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] dedup and memory/l2arc requirements

2010-04-03 Thread Roy Sigurd Karlsbakk
 You can estimate the amount of disk space needed for the deduplication table
 and the expected deduplication ratio by using zdb -S poolname on your
 existing pool.

This is all good, but it doesn't work too well for planning. Is there a rule of 
thumb I can use for a general overview? Say I want 125TB space and I want to 
dedup that for backup use. It'll probably be quite efficient dedup, so long as 
alignment matches. By the way, is there a way to auto-align data for dedup 
in case of backup? Or does zfs do this by itself?

Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Question about large pools

2010-04-03 Thread Roy Sigurd Karlsbakk
Hi all

From http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide I 
read

Avoid creating a RAIDZ, RAIDZ-2, RAIDZ-3, or a mirrored configuration with one 
logical device of 40+ devices. See the sections below for examples of redundant 
configurations.

What do they mean by this? 40+ devices in a single raidz[123] set or 40+ 
devices in a pool regardless of raidz[123] sets?

Best regards

roy
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-03 Thread Casper . Dik


The only way to guarantee consistency in the snapshot is to always
(regardless of ZIL enabled/disabled) give priority for sync writes to get
into the TXG before async writes.

If the OS does give priority for sync writes going into TXG's before async
writes (even with ZIL disabled), then after spontaneous ungraceful reboot,
the latest uberblock is guaranteed to be consistent.

This is what Jeff Bonwick says in the zil synchronicity arc case:

   What I mean is that the barrier semantic is implicit even with no ZIL at 
all.
   In ZFS, if event A happens before event B, and you lose power, then
   what you'll see on disk is either nothing, A, or both A and B.  Never just B.
   It is impossible for us not to have at least barrier semantics.

So there's no chance that a *later* async write will overtake an earlier
sync *or* async write.

Casper


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Question about large pools

2010-04-03 Thread Robert Milkowski

On 02/04/2010 05:45, Roy Sigurd Karlsbakk wrote:

Hi all

 From http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide 
I read

Avoid creating a RAIDZ, RAIDZ-2, RAIDZ-3, or a mirrored configuration with one 
logical device of 40+ devices. See the sections below for examples of redundant 
configurations.

What do they mean by this? 40+ devices in a single raidz[123] set or 40+ 
devices in a pool regardless of raidz[123] sets?

   

It means: try to avoid a single RAID-Z group with 40+ disk drives.
Creating several smaller groups in one pool is perfectly fine.

So for example, on x4540 servers, try to avoid creating a pool with a 
single RAID-Z3 group made of 44 disks; rather, create 4 RAID-Z2 groups, 
each made of 11 disks, all of them in a single pool.
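
For concreteness, a sketch of that layout as a single command (the c#t#d#
names below are placeholders; adjust to your controller layout, and remember
the x4540 also needs disks for the root pool):

 # zpool create tank \
     raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 c1t8d0 c1t9d0 c1t10d0 \
     raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0 c2t8d0 c2t9d0 c2t10d0 \
     raidz2 c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t6d0 c3t7d0 c3t8d0 c3t9d0 c3t10d0 \
     raidz2 c4t0d0 c4t1d0 c4t2d0 c4t3d0 c4t4d0 c4t5d0 c4t6d0 c4t7d0 c4t8d0 c4t9d0 c4t10d0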


--
Robert Milkowski
http://milek.blogspot.com


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] bit-flipping in RAM...

2010-04-03 Thread Orvar Korvar
Have not the ZFS data corruption researchers been in touch with Jeff Bonwick 
and the ZFS team?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] is this pool recoverable?

2010-04-03 Thread Cindy Swearingen

Patrick,

I'm happy that you were able to recover your pool.

Your original zpool status says that this pool was last accessed on
another system, which I believe is what caused the pool to fail,
particularly if it was accessed simultaneously from two systems.

It is important that the cause of the original pool failure is
identified to prevent it from happening again.

This rewind pool recovery is a last-ditch effort and might not recover
all broken pools.

Thanks,

Cindy

On 04/02/10 12:32, Patrick Tiquet wrote:
Thanks, that worked!! 

It needed -Ff 


The pool has been recovered with minimal loss in data.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-03 Thread Neil Perrin

On 04/02/10 08:24, Edward Ned Harvey wrote:

The purpose of the ZIL is to act like a fast log for synchronous
writes.  It allows the system to quickly confirm a synchronous write
request with the minimum amount of work.  



Bob and Casper and some others clearly know a lot here.  But I'm hearing
conflicting information, and don't know what to believe.  Does anyone here
work on ZFS as an actual ZFS developer for Sun/Oracle?  Who can claim "I can
answer this question, I wrote that code, or at least have read it"?
  


I'm one of the ZFS developers. I wrote most of the zil code.
Still, I don't have all the answers. There are a lot of knowledgeable people
on this alias. I usually monitor this alias and sometimes chime in
when there's some misinformation being spread, but sometimes the volume
is just too high.

Since I started this reply there have been 20 new posts on this thread alone!


Questions to answer would be:

Is a ZIL log device used only by sync() and fsync() system calls? 
  


- The intent log (separate device(s) or not) is only used by fsync, 
O_DSYNC, O_SYNC, O_RSYNC.

NFS commits are seen by ZFS as fsyncs.
Note sync(1m) and sync(2s) do not use the intent log. They force 
transaction group (txg) commits on all pools. So ZFS goes beyond the 
requirement for sync(), which only requires that the writes be scheduled, 
not necessarily completed, before returning. The ZFS interpretation is 
rather expensive, but the weaker behavior seemed broken, so we fixed it.


Is it ever used to accelerate async writes?



The zil is not used to accelerate async writes.


Suppose there is an application which sometimes does sync writes, and
sometimes async writes.  In fact, to make it easier, suppose two processes
open two files, one of which always writes asynchronously, and one of which
always writes synchronously.  Suppose the ZIL is disabled.  Is it possible
for writes to be committed to disk out-of-order?  Meaning, can a large block
async write be put into a TXG and committed to disk before a small sync
write to a different file is committed to disk, even though the small sync
write was issued by the application before the large async write?  Remember,
the point is:  ZIL is disabled.  Question is whether the async could
possibly be committed to disk before the sync.
  


Threads can be pre-empted in the OS at any time. So even though thread A
issued W1 before thread B issued W2, the order is not guaranteed to arrive
at ZFS as W1, W2.

Multi-threaded applications have to handle this.

If this was a single thread issuing W1 then W2, then yes, the order is
guaranteed regardless of whether W1 or W2 are synchronous or asynchronous.
Of course, if the system crashes then the async operations might not be
there.



I make the assumption that an uberblock is the term for a TXG after it is
committed to disk.  Correct?
  


- Kind of. The uberblock contains the root of the txg.



At boot time, or zpool import time, what is taken to be the current
filesystem?  The latest uberblock?  Something else?
  


A txg is for the whole pool which can contain many filesystems.
The latest txg defines the current state of the pool and each individual fs.
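
(If you want to see it, the active uberblock, including the txg it points at,
can be dumped read-only with zdb; the exact fields printed vary by build:)

 % zdb -u tank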


My understanding is that enabling a dedicated ZIL device guarantees sync()
and fsync() system calls block until the write has been committed to
nonvolatile storage, and attempts to accelerate by using a physical device
which is faster or more idle than the main storage pool.


Correct (except replace sync() with O_DSYNC, etc).
This also assumes hardware that, for example, correctly handles the
flushing of its caches.



  My understanding
is that this provides two implicit guarantees:  (1) sync writes are always
guaranteed to be committed to disk in order, relevant to other sync writes.
(2) In the event of OS halting or ungraceful shutdown, sync writes committed
to disk are guaranteed to be equal or greater than the async writes that
were taking place at the same time.  That is, if two processes both complete
a write operation at the same time, one in sync mode and the other in async
mode, then it is guaranteed the data on disk will never have the async data
committed before the sync data.
  


The ZIL doesn't make such guarantees. It's the DMU that handles transactions
and their grouping into txgs. It ensures that writes are committed in order
by its transactional nature.

The function of the zil is merely to ensure that synchronous operations are
stable and replayed after a crash/power fail onto the latest txg.


Based on this understanding, if you disable ZIL, then there is no guarantee
about order of writes being committed to disk.  Neither of the above
guarantees is valid anymore.  Sync writes may be completed out of order.
Async writes that supposedly happened after sync writes may be committed to
disk before the sync writes.
  

No, disabling the ZIL does not disable the DMU.
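
(For reference, "ZIL disabled" in this discussion usually means the unsupported
/etc/system tunable below - a sketch of the common approach at the time, not a
recommendation, since it silently drops synchronous semantics for every pool on
the host and requires a reboot:)

 * /etc/system: disable the ZIL globally (testing only)
 set zfs:zil_disable = 1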


Somebody, (Casper?) said it before, and now I'm starting to realize ... This
is also true of the snapshots.  If you 

[zfs-discuss] To slice, or not to slice

2010-04-03 Thread Edward Ned Harvey
Momentarily, I will begin scouring the omniscient interweb for information, but 
I'd like to know a little bit of what people would say here.  The question is 
to slice, or not to slice, disks before using them in a zpool.

One reason to slice comes from recent personal experience.  One disk of a 
mirror dies.  Replaced under contract with an identical disk.  Same model 
number, same firmware.  Yet when it's plugged into the system, for an unknown 
reason, it appears 0.001 Gb smaller than the old disk, and therefore unable to 
attach and un-degrade the mirror.  It seems logical this problem could have 
been avoided if the device added to the pool originally had been a slice 
somewhat smaller than the whole physical device.  Say, a slice of 28G out of 
the 29G physical disk.  Because later when I get the infinitesimally smaller 
disk, I can always slice 28G out of it to use as the mirror device.

There is some question about performance.  Is there any additional overhead 
caused by using a slice instead of the whole physical device?

There is another question about performance.  One of my colleagues said he saw 
some literature on the internet somewhere, saying ZFS behaves differently for 
slices than it does on physical devices, because it doesn't assume it has 
exclusive access to that physical device, and therefore caches or buffers 
differently ... or something like that.

Any other pros/cons people can think of?

And finally, if anyone has experience doing this, and process recommendations?  
That is ... My next task is to go read documentation again, to refresh my 
memory from years ago, about the difference between "format," "partition," 
"label," "fdisk," because those terms don't have the same meaning that they do 
in other OSes...  And I don't know clearly right now, which one(s) I want to 
do, in order to create the large slice of my disks.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] To slice, or not to slice

2010-04-03 Thread Edward Ned Harvey
  One reason to slice comes from recent personal experience. One disk of
  a mirror dies. Replaced under contract with an identical disk. Same
  model number, same firmware. Yet when it's plugged into the system,
  for an unknown reason, it appears 0.001 Gb smaller than the old disk,
  and therefore unable to attach and un-degrade the mirror. It seems
  logical this problem could have been avoided if the device added to
  the pool originally had been a slice somewhat smaller than the whole
  physical device. Say, a slice of 28G out of the 29G physical disk.
  Because later when I get the infinitesimally smaller disk, I can
  always slice 28G out of it to use as the mirror device.
 
 
 What build were you running? That should have been addressed by CR6844090
 that went into build 117.

I'm running solaris, but that's irrelevant.  The storagetek array controller
itself reports the new disk as infinitesimally smaller than the one which I
want to mirror.  Even before the drive is given to the OS, that's the way it
is.  Sun X4275 server.

BTW, I'm still degraded.  Haven't found an answer yet, and am considering
breaking all my mirrors, to create a new pool on the freed disks, and using
partitions in those disks, for the sake of rebuilding my pool using
partitions on all disks.  The aforementioned performance problem is not as
scary to me as running in degraded redundancy.


 it's well documented. ZFS won't attempt to enable the drive's cache
 unless it has the physical device. See
 
 http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
 #Storage_Pools

Nice.  Thank you.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] To slice, or not to slice

2010-04-03 Thread Roy Sigurd Karlsbakk
- Edward Ned Harvey solar...@nedharvey.com wrote:
  What build were you running? That should have been addressed by
  CR6844090 that went into build 117.
 
 I'm running solaris, but that's irrelevant.  The storagetek array
 controller itself reports the new disk as infinitesimally smaller than
 the one which I want to mirror.  Even before the drive is given to the
 OS, that's the way it is.  Sun X4275 server.
 
 BTW, I'm still degraded.  Haven't found an answer yet, and am considering
 breaking all my mirrors, to create a new pool on the freed disks, and
 using partitions in those disks, for the sake of rebuilding my pool using
 partitions on all disks.  The aforementioned performance problem is not
 as scary to me as running in degraded redundancy.

I would return the drive to get a bigger one before doing something as drastic 
as that. There might have been a hiccup in the production line, and that's not 
your fault.

roy
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] To slice, or not to slice

2010-04-03 Thread Edward Ned Harvey
 And finally, if anyone has experience doing this, and process
 recommendations?  That is … My next task is to go read documentation
 again, to refresh my memory from years ago, about the difference
 between “format,” “partition,” “label,” “fdisk,” because those terms
 don’t have the same meaning that they do in other OSes…  And I don’t
 know clearly right now, which one(s) I want to do, in order to create
 the large slice of my disks.
 
 The whole partition vs. slice thing is a bit fuzzy to me, so take this
 with a grain of salt. You can create partitions using fdisk, or slices
 using format. The BIOS and other operating systems (windows, linux,
 etc) will be able to recognize partitions, while they won't be able to
 make sense of slices. If you need to boot from the drive or share it
 with another OS, then partitions are the way to go. If it's exclusive
 to solaris, then you can use slices. You can (but shouldn't) use slices
 and partitions from the same device (eg: c5t0d0s0 and c5t0d0p0).

Oh, I managed to find a really good answer to this question.  Several
sources all say to do precisely the same procedure, and when I did it on a
test system, it worked perfectly.  Simple and easy to repeat.  So I think
this is the gospel method to create the slices, if you're going to create
slices:

http://docs.sun.com/app/docs/doc/806-4073/6jd67r9hu
and
http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide#Replacing.2FRelabeling_the_Root_Pool_Disk
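
For anyone who finds this later, the core of that procedure is just copying a
valid SMI label from the healthy disk to the replacement and then letting ZFS
take over the slice (device names below are examples only, and the docs above
cover the root-pool-specific details):

 # prtvtoc /dev/rdsk/c1t0d0s2 | fmthard -s - /dev/rdsk/c1t1d0s2
 # zpool attach tank c1t0d0s0 c1t1d0s0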


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] To slice, or not to slice

2010-04-03 Thread Roy Sigurd Karlsbakk
 Oh, I managed to find a really good answer to this question.  Several
 sources all say to do precisely the same procedure, and when I did it
 on a test system, it worked perfectly.  Simple and easy to repeat.  So
 I think this is the gospel method to create the slices, if you're going
 to create slices:

Seems like a clumsy workaround for a hardware problem. It will also disable the 
drives' cache, which is not a good idea. Why not just get a new drive?

Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. 
It is an elementary imperative for all pedagogues to avoid excessive use of 
idioms of foreign origin. In most cases adequate and relevant synonyms exist 
in Norwegian.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] To slice, or not to slice

2010-04-03 Thread Edward Ned Harvey
 On Apr 2, 2010, at 2:29 PM, Edward Ned Harvey wrote:
  I've also heard that the risk for unexpected failure of your pool is
  higher if/when you reach 100% capacity.  I've heard that you should
  always create a small ZFS filesystem within a pool, and give it some
  reserved space, along with the filesystem that you actually plan to use
  in your pool.  Anyone care to offer any comments on that?
 
 Define failure in this context?
 
 I am not aware of a data loss failure when near full.  However, all
 file systems will experience performance degradation for write
 operations as they become full.

To tell the truth, I'm not exactly sure.  Because I've never lost any ZFS
pool or filesystem.  I only have it deployed on 3 servers, and only one of
those gets heavy use.  It only filled up once, and it didn't have any
problem.  So I'm only trying to understand the great beyond, that which I
have never known myself.  Learn from other peoples' experience,
preventively.  Yes, I do embrace a lot of voodoo and superstition in doing
sysadmin, but that's just cuz stuff ain't perfect, and I've seen so many
things happen that were supposedly not possible.  (Not talking about ZFS in
that regard...  yet.)  Well, unless you count the issue I'm having right
now, with two identical disks appearing as different sizes...  But I don't
think that's a zfs problem.

I recall some discussion either here or on opensolaris-discuss or
opensolaris-help, where at least one or a few people said they had some sort
of problem or problems, and they were suspicious about the correlation
between it happening, and the disk being full.  I also recall talking to
some random guy at a conference who said something similar.  But it's all
vague.  I really don't know.

And I have nothing concrete.  Hence the post asking for peoples' comments.
Somebody might relate something they experienced less vague than what I
know.
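
For what it's worth, the "small reserved filesystem" trick quoted at the top of
this message is just an ordinary dataset with a reservation, something like
(name and size arbitrary):

 # zfs create tank/reserved
 # zfs set reservation=10G tank/reserved

In an emergency you can hand the space back with
"zfs set reservation=none tank/reserved".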

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] To slice, or not to slice

2010-04-03 Thread Edward Ned Harvey
 I would return the drive to get a bigger one before doing something as
 drastic as that. There might have been a hiccup in the production line,
 and that's not your fault.

Yeah, but I already have 2 of the replacement disks, both doing the same
thing.  One has a firmware newer than my old disk (so originally I thought
that was the cause, and requested another replacement disk).  But then we
got a replacement disk which is identical in every way to the failed disk
... but it still appears smaller for some reason.

So this happened on my SSD.  What's to prevent it from happening on one of
the spindle disks in the future?  Nothing that I know of ...  

So far, the idea of slicing seems to be the only preventive or corrective
measure.  Hence, wondering what pros/cons people would describe, beyond what
I've already thought up myself.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] is this pool recoverable?

2010-04-03 Thread Edward Ned Harvey
 Your original zpool status says that this pool was last accessed on
 another system, which I believe is what caused the pool to fail,
 particularly if it was accessed simultaneously from two systems.

The message last accessed on another system is the normal behavior if the
pool is ungracefully offlined for some reason, and then you boot back up
again on the same system.

I learned that by using a pool on an external disk, and accidentally
knocking out the power cord of the external disk.  The system hung.  I power
cycled, couldn't boot normal.  Had to boot failsafe, and got the above
message while trying to import.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] To slice, or not to slice

2010-04-03 Thread Bob Friesenhahn

On Sat, 3 Apr 2010, Edward Ned Harvey wrote:


I would return the drive to get a bigger one before doing something as
drastic as that. There might have been a hiccup in the production line,
and that's not your fault.


Yeah, but I already have 2 of the replacement disks, both doing the same
thing.  One has a firmware newer than my old disk (so originally I thought
that was the cause, and requested another replacement disk).  But then we
got a replacement disk which is identical in every way to the failed disk
... but it still appears smaller for some reason.

So this happened on my SSD.  What's to prevent it from happening on one of
the spindle disks in the future?  Nothing that I know of ...


Just keep in mind that this has been fixed in OpenSolaris for some 
time, and will surely be fixed in Solaris 10, if not already.  The 
annoying issue is that you probably need to add all of the vdev 
devices using an OS which already has the fix.  I don't know if it can 
repair a slightly overly-large device.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] To slice, or not to slice

2010-04-03 Thread Tim Cook
On Fri, Apr 2, 2010 at 4:05 PM, Edward Ned Harvey
guacam...@nedharvey.com wrote:

  Momentarily, I will begin scouring the omniscient interweb for
 information, but I’d like to know a little bit of what people would say
 here.  The question is to slice, or not to slice, disks before using them in
 a zpool.



 One reason to slice comes from recent personal experience.  One disk of a
 mirror dies.  Replaced under contract with an identical disk.  Same model
 number, same firmware.  Yet when it’s plugged into the system, for an
 unknown reason, it appears 0.001 Gb smaller than the old disk, and therefore
 unable to attach and un-degrade the mirror.  It seems logical this problem
 could have been avoided if the device added to the pool originally had been
 a slice somewhat smaller than the whole physical device.  Say, a slice of
 28G out of the 29G physical disk.  Because later when I get the
 infinitesimally smaller disk, I can always slice 28G out of it to use as the
 mirror device.



 There is some question about performance.  Is there any additional overhead
 caused by using a slice instead of the whole physical device?



 There is another question about performance.  One of my colleagues said he
 saw some literature on the internet somewhere, saying ZFS behaves
 differently for slices than it does on physical devices, because it doesn’t
 assume it has exclusive access to that physical device, and therefore caches
 or buffers differently … or something like that.



 Any other pros/cons people can think of?



 And finally, if anyone has experience doing this, and process
 recommendations?  That is … My next task is to go read documentation again,
 to refresh my memory from years ago, about the difference between “format,”
 “partition,” “label,” “fdisk,” because those terms don’t have the same
 meaning that they do in other OSes…  And I don’t know clearly right now,
 which one(s) I want to do, in order to create the large slice of my disks.


Your experience is exactly why I suggested ZFS start doing some "right
sizing" if you will.  Chop off a bit from the end of any disk so that we're
guaranteed to be able to replace drives from different manufacturers.  The
excuse being "no reason to, Sun drives are always of identical size."  If
your drives did indeed come from Sun, their response is clearly not true.
 Regardless, I guess I still think it should be done.  Figure out what the
greatest variation we've seen from drives that are supposedly of the exact
same size, and chop it off the end of every disk.  I'm betting it's no more
than 1GB, and probably less than that.  When we're talking about a 2TB
drive, I'm willing to give up a gig to be guaranteed I won't have any issues
when it comes time to swap it out.

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] L2ARC Workingset Size

2010-04-03 Thread Tomas Ögren
On 02 April, 2010 - Abdullah Al-Dahlawi sent me these 128K bytes:

 Hi all
 
 I ran a workload that reads & writes within 10 files; each file is 256M, i.e.,
 (10 * 256M = 2.5GB total dataset size).
 
 I have set the ARC max size to 1 GB in the /etc/system file.
 
 In the worst case, let us assume that the whole dataset is hot, meaning my
 workingset size = 2.5GB.
 
 My SSD flash size = 8GB and is being used for L2ARC.
 
 No slog is used in the pool.
 
 My filesystem record size = 8K, meaning 2.5% of 8GB is used for the L2ARC
 directory in ARC, which ultimately means that available ARC is 1024M - 204.8M
 = 819.2M.  (Am I right?)

Seems about right.
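
(The arithmetic behind that 2.5%, assuming roughly 200 bytes of ARC header per
L2ARC record - a figure often quoted on this list; the exact size varies by
build - for an 8GB cache device filled with 8K records:)

 % echo '8 * 2^30 / (8 * 1024) * 200 / 2^20' | bc
 200

i.e. on the order of 200MB of ARC consumed just to index the L2ARC.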

 Now the Question ...
 
 After running the workload for 75 minutes, I have noticed that L2ARC device
 has grown to 6 GB !!!   

No, 6GB of the area has been touched by Copy on Write, not all of it is
in use anymore though.

 What is in L2ARC beyond my 2.5GB workingset?  Something else has been
 added to L2ARC ...

[ snip lots of data ]

This is your last one:

 module: zfs                            instance: 0
 name:   arcstats                       class:    misc
 c                               1073741824
 c_max                           1073741824
 c_min                           134217728
[...]
 l2_size                         2632226304
 l2_write_bytes                  6486009344

Roughly 6GB has been written to the device, and slightly less than 2.5GB
is actually in use.

 p   775528448
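
(Both counters can be pulled straight from kstat if you want to watch them
over time, e.g.:)

 % kstat -p zfs:0:arcstats | grep l2_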

/Tomas
-- 
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-03 Thread Jeroen Roodhart
Hi Al,

 Have you tried the DDRdrive from Christopher George
 cgeo...@ddrdrive.com?
 Looks to me like a much better fit for your application than the F20?
 
 It would not hurt to check it out.  Looks to me like
 you need a product with low *latency* - and a RAM based cache
 would be a much better performer than any solution based solely on
 flash.
 
 Let us know (on the list) how this works out for you.

Well, I did look at it, but at that time there was no Solaris support yet. Right 
now it seems there is only a beta driver? I kind of remember that if you'd want 
reliable fallback to nvram, you'd need a UPS feeding the card. I could be very 
wrong there, but the product documentation isn't very clear on this (at least 
to me ;) ) 

Also, we'd kind of like to have a Sun/Oracle supported option. 

But yeah, on paper it does seem it could be an attractive solution...

With kind regards,

Jeroen
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] L2ARC Workingset Size

2010-04-03 Thread Abdullah Al-Dahlawi
Hi Tomas

Thanks for the clarification. If I understood you right, you mean that 6
GB (including my 2.5GB of files) has been written to the device and still
occupies space on the device!!!

This is fair enough for this case, since most of my files ended up in L2ARC
... Great ...

But this brings two related questions

1. What is really in L2ARC ... is it my old workingset data files that have
been updated but are still in L2ARC? Or something else? Metadata?

2. More importantly, what if my workingset was larger than 2.5GB (say 5GB)?
I guess my L2ARC device would be filled completely before all of my workingset
transferred to the L2ARC device!!!

Thanks 

On Sat, Apr 3, 2010 at 4:31 PM, Tomas Ögren st...@acc.umu.se wrote:

 On 02 April, 2010 - Abdullah Al-Dahlawi sent me these 128K bytes:

  Hi all
 
  I ran a workload that reads & writes within 10 files; each file is 256M,
  ie, (10 * 256M = 2.5GB total Dataset Size).
 
  I have set the ARC max size to 1 GB on etc/system file
 
  In the worse case, let us assume that the whole dataset is hot, meaning
  my workingset size= 2.5GB
 
  My SSD flash size = 8GB and being used for L2ARC
 
  No slog is used in the pool
 
  My File system record size = 8K , meaning 2.5% of 8GB is used for L2ARC
  Directory in ARC. which ultimately mean that available ARC is 1024M -
  204.8M = 819.2M Available ARC  (Am I Right ?)

 Seems about right.

  Now the Question ...
 
  After running the workload for 75 minutes, I have noticed that the L2ARC
  device has grown to 6 GB !!!

 No, 6GB of the area has been touched by Copy on Write, not all of it is
 in use anymore though.

  What is in L2ARC beyond my 2.5GB Workingset ?? something else is has been
  added to L2ARC 

 [ snip lots of data ]

 This is your last one:

  module: zfs                            instance: 0
  name:   arcstats                       class:    misc
  c                               1073741824
  c_max                           1073741824
  c_min                           134217728
 [...]
  l2_size                         2632226304
  l2_write_bytes                  6486009344

 Roughly 6GB has been written to the device, and slightly less than 2.5GB
 is actually in use.

  p   775528448

 /Tomas
 --
 Tomas Ögren, st...@acc.umu.se, 
 http://www.acc.umu.se/~stric/
 |- Student at Computing Science, University of Umeå
 `- Sysadmin at {cs,acc}.umu.se




-- 
Abdullah Al-Dahlawi
PhD Candidate
George Washington University
Department of Electrical & Computer Engineering

Check The Fastest 500 Super Computers Worldwide
http://www.top500.org/list/2009/11/100
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-03 Thread Christopher George
 Well, I did look at it but at that time there was no Solaris support yet.
 Right now it seems there is only a beta driver?

Correct, we just completed functional validation of the OpenSolaris driver.
Our focus has now turned to performance tuning and benchmarking.  We expect
to formally introduce the DDRdrive X1 to the ZFS community later this
quarter.  It is our goal to focus exclusively on the dedicated ZIL device
market going forward.

  I kind of remember that if you'd want reliable fallback to nvram, you'd
  need a UPS feeding the card.

Currently, a dedicated external UPS is required for correct operation.  Based
on community feedback, we will be offering automatic backup/restore prior to
release.  This guarantees the UPS will only be required for 60 secs to
successfully back up the drive contents on a host power or hardware failure.
Dutifully, on the next reboot the restore will occur prior to the OS loading,
for seamless non-volatile operation.

Also, we have heard loud and clear the requests for an internal power option.
It is our intention that the X1 will be the first in a family of products all
dedicated to ZIL acceleration, not only for OpenSolaris but also Solaris 10
and FreeBSD.

  Also, we'd kind of like to have a Sun/Oracle supported option.

Although a much smaller company, we believe our singular focus and absolute
passion for ZFS and the potential of Hybrid Storage Pools will serve our
customers well.

We are actively designing our soon-to-be-available support plans.  Your voice
will be heard; please email me directly at cgeorge at ddrdrive dot com with
requests, comments and/or questions.

Thanks,

Christopher George
Founder/CTO
www.ddrdrive.com
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] To slice, or not to slice

2010-04-03 Thread Robert Milkowski

On 03/04/2010 19:24, Tim Cook wrote:



On Fri, Apr 2, 2010 at 4:05 PM, Edward Ned Harvey 
guacam...@nedharvey.com wrote:


Momentarily, I will begin scouring the omniscient interweb for
information, but I’d like to know a little bit of what people
would say here.  The question is to slice, or not to slice, disks
before using them in a zpool.

One reason to slice comes from recent personal experience.  One
disk of a mirror dies.  Replaced under contract with an identical
disk.  Same model number, same firmware.  Yet when it’s plugged
into the system, for an unknown reason, it appears 0.001 Gb
smaller than the old disk, and therefore unable to attach and
un-degrade the mirror.  It seems logical this problem could have
been avoided if the device added to the pool originally had been a
slice somewhat smaller than the whole physical device.  Say, a
slice of 28G out of the 29G physical disk.  Because later when I
get the infinitesimally smaller disk, I can always slice 28G out
of it to use as the mirror device.

There is some question about performance.  Is there any additional
overhead caused by using a slice instead of the whole physical device?

There is another question about performance.  One of my colleagues
said he saw some literature on the internet somewhere, saying ZFS
behaves differently for slices than it does on physical devices,
because it doesn’t assume it has exclusive access to that physical
device, and therefore caches or buffers differently … or something
like that.

Any other pros/cons people can think of?

And finally, if anyone has experience doing this, and process
recommendations?  That is … My next task is to go read
documentation again, to refresh my memory from years ago, about
the difference between “format,” “partition,” “label,” “fdisk,”
because those terms don’t have the same meaning that they do in
other OSes…  And I don’t know clearly right now, which one(s) I
want to do, in order to create the large slice of my disks.


Your experience is exactly why I suggested ZFS start doing some right 
sizing if you will.  Chop off a bit from the end of any disk so that 
we're guaranteed to be able to replace drives from different 
manufacturers.  The excuse being no reason to, Sun drives are always 
of identical size.  If your drives did indeed come from Sun, their 
response is clearly not true.  Regardless, I guess I still think it 
should be done.  Figure out what the greatest variation we've seen 
from drives that are supposedly of the exact same size, and chop it 
off the end of every disk.  I'm betting it's no more than 1GB, and 
probably less than that.  When we're talking about a 2TB drive, I'm 
willing to give up a gig to be guaranteed I won't have any issues when 
it comes time to swap it out.




That's what OpenSolaris has been doing, more or less, for some time now.

look in the archives of this mailing list for more information.
--
Robert Milkowski
http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-03 Thread Ragnar Sundblad

On 1 apr 2010, at 06.15, Stuart Anderson wrote:

 Assuming you are also using a PCI LSI HBA from Sun that is managed with
 a utility called /opt/StorMan/arcconf and reports itself as the amazingly
 informative model number Sun STK RAID INT what worked for me was to run,
 arcconf delete (to delete the pre-configured volume shipped on the drive)
 arcconf create (to create a new volume)

Just to sort things out (or not? :-): 

I more than agree that this product is highly confusing, but I
don't think there is anything LSI in or about that card. I believe
it is an Adaptec card, developed, manufactured and supported by
Intel for Adaptec, licensed (or something) to StorageTek, and later
included in Sun machines (since Sun bought StorageTek, I suppose).
Now we could add Oracle to this name dropping inferno, if we would
want to.

I am not sure why they (Sun) put those in there, they don't seem
very fast or smart or anything.

/ragge

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-03 Thread Ragnar Sundblad

On 2 apr 2010, at 22.47, Neil Perrin wrote:

 Suppose there is an application which sometimes does sync writes, and
 sometimes async writes.  In fact, to make it easier, suppose two processes
 open two files, one of which always writes asynchronously, and one of which
 always writes synchronously.  Suppose the ZIL is disabled.  Is it possible
 for writes to be committed to disk out-of-order?  Meaning, can a large block
 async write be put into a TXG and committed to disk before a small sync
 write to a different file is committed to disk, even though the small sync
 write was issued by the application before the large async write?  Remember,
 the point is:  ZIL is disabled.  Question is whether the async could
 possibly be committed to disk before the sync.
   
 
 
 Threads can be pre-empted in the OS at any time. So even though thread A 
 issued
 W1 before thread B issued W2, the order is not guaranteed to arrive at ZFS as 
 W1, W2.
 Multi-threaded applications have to handle this.
 
 If this was a single thread issuing W1 then W2 then yes the order is 
 guaranteed
 regardless of whether W1 or W2 are synchronous or asynchronous.
 Of course if the system crashes then the async operations might not be there.

Could you please clarify this last paragraph a little:
Do you mean that this is in the case where you have the ZIL enabled
and the txg for W1 and W2 hasn't been committed, so that upon reboot
the ZIL is replayed, and therefore only the sync writes are
eventually there?

If, let's say, W1 is an async small write, W2 is a sync small write,
W1 arrives to zfs before W2, and W2 arrives before the txg is
committed, will both writes always be in the txg on disk?
If so, it would mean that zfs itself never buffers up async writes into
larger blurbs to write at a later txg, correct?
I take it that ZIL enabled or not does not make any difference here
(we pretend the system did _not_ crash), correct?

Thanks!

/ragge

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] To slice, or not to slice

2010-04-03 Thread Tim Cook
On Sat, Apr 3, 2010 at 6:53 PM, Robert Milkowski mi...@task.gda.pl wrote:

  On 03/04/2010 19:24, Tim Cook wrote:



 On Fri, Apr 2, 2010 at 4:05 PM, Edward Ned Harvey guacam...@nedharvey.com
  wrote:

   Momentarily, I will begin scouring the omniscient interweb for
 information, but I’d like to know a little bit of what people would say
 here.  The question is to slice, or not to slice, disks before using them in
 a zpool.



 One reason to slice comes from recent personal experience.  One disk of a
 mirror dies.  Replaced under contract with an identical disk.  Same model
 number, same firmware.  Yet when it’s plugged into the system, for an
 unknown reason, it appears 0.001 Gb smaller than the old disk, and therefore
 unable to attach and un-degrade the mirror.  It seems logical this problem
 could have been avoided if the device added to the pool originally had been
 a slice somewhat smaller than the whole physical device.  Say, a slice of
 28G out of the 29G physical disk.  Because later when I get the
 infinitesimally smaller disk, I can always slice 28G out of it to use as the
 mirror device.



 There is some question about performance.  Is there any additional
 overhead caused by using a slice instead of the whole physical device?



 There is another question about performance.  One of my colleagues said he
 saw some literature on the internet somewhere, saying ZFS behaves
 differently for slices than it does on physical devices, because it doesn’t
 assume it has exclusive access to that physical device, and therefore caches
 or buffers differently … or something like that.



 Any other pros/cons people can think of?



 And finally, if anyone has experience doing this, and process
 recommendations?  That is … My next task is to go read documentation again,
 to refresh my memory from years ago, about the difference between “format,”
 “partition,” “label,” “fdisk,” because those terms don’t have the same
 meaning that they do in other OSes…  And I don’t know clearly right now,
 which one(s) I want to do, in order to create the large slice of my disks.


  Your experience is exactly why I suggested ZFS start doing some right
 sizing if you will.  Chop off a bit from the end of any disk so that we're
 guaranteed to be able to replace drives from different manufacturers.  The
 excuse being no reason to, Sun drives are always of identical size.  If
 your drives did indeed come from Sun, their response is clearly not true.
  Regardless, I guess I still think it should be done.  Figure out what the
 greatest variation we've seen from drives that are supposedly of the exact
 same size, and chop it off the end of every disk.  I'm betting it's no more
 than 1GB, and probably less than that.  When we're talking about a 2TB
 drive, I'm willing to give up a gig to be guaranteed I won't have any issues
 when it comes time to swap it out.


  that's what open solaris is doing more or less for some time now.

 look in the archives of this mailing list for more information.
 --
 Robert Milkowski
 http://milek.blogspot.com



Since when?  It isn't doing it on any of my drives, build 134, and judging
by the OP's issues, it isn't doing it for him either... I try to follow this
list fairly closely and I've never seen anyone at Sun/Oracle say they were
going to start doing it after I was shot down the first time.

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] To slice, or not to slice

2010-04-03 Thread Tim Cook
On Sat, Apr 3, 2010 at 7:50 PM, Tim Cook t...@cook.ms wrote:



 On Sat, Apr 3, 2010 at 6:53 PM, Robert Milkowski mi...@task.gda.pl wrote:

  On 03/04/2010 19:24, Tim Cook wrote:



 On Fri, Apr 2, 2010 at 4:05 PM, Edward Ned Harvey 
 guacam...@nedharvey.com wrote:

   Momentarily, I will begin scouring the omniscient interweb for
 information, but I’d like to know a little bit of what people would say
 here.  The question is to slice, or not to slice, disks before using them in
 a zpool.



 One reason to slice comes from recent personal experience.  One disk of a
 mirror dies.  Replaced under contract with an identical disk.  Same model
 number, same firmware.  Yet when it’s plugged into the system, for an
 unknown reason, it appears 0.001 Gb smaller than the old disk, and therefore
 unable to attach and un-degrade the mirror.  It seems logical this problem
 could have been avoided if the device added to the pool originally had been
 a slice somewhat smaller than the whole physical device.  Say, a slice of
 28G out of the 29G physical disk.  Because later when I get the
 infinitesimally smaller disk, I can always slice 28G out of it to use as the
 mirror device.



 There is some question about performance.  Is there any additional
 overhead caused by using a slice instead of the whole physical device?



 There is another question about performance.  One of my colleagues said
 he saw some literature on the internet somewhere, saying ZFS behaves
 differently for slices than it does on physical devices, because it doesn’t
 assume it has exclusive access to that physical device, and therefore caches
 or buffers differently … or something like that.



 Any other pros/cons people can think of?



 And finally, if anyone has experience doing this, and process
 recommendations?  That is … My next task is to go read documentation again,
 to refresh my memory from years ago, about the difference between “format,”
 “partition,” “label,” “fdisk,” because those terms don’t have the same
 meaning that they do in other OSes…  And I don’t know clearly right now,
 which one(s) I want to do, in order to create the large slice of my disks.


  Your experience is exactly why I suggested ZFS start doing some right
 sizing if you will.  Chop off a bit from the end of any disk so that we're
 guaranteed to be able to replace drives from different manufacturers.  The
 excuse being no reason to, Sun drives are always of identical size.  If
 your drives did indeed come from Sun, their response is clearly not true.
  Regardless, I guess I still think it should be done.  Figure out what the
 greatest variation we've seen from drives that are supposedly of the exact
 same size, and chop it off the end of every disk.  I'm betting it's no more
 than 1GB, and probably less than that.  When we're talking about a 2TB
 drive, I'm willing to give up a gig to be guaranteed I won't have any issues
 when it comes time to swap it out.


  that's what open solaris is doing more or less for some time now.

 look in the archives of this mailing list for more information.
 --
 Robert Milkowski
 http://milek.blogspot.com



 Since when?  It isn't doing it on any of my drives, build 134, and judging
 by the OP's issues, it isn't doing it for him either... I try to follow this
 list fairly closely and I've never seen anyone at Sun/Oracle say they were
 going to start doing it after I was shot down the first time.

 --Tim



Oh... and after 15 minutes of searching for everything from 'right-sizing'
to 'block reservation' to 'replacement disk smaller size fewer blocks' etc.
etc. I don't see a single thread on it.

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Problems with zfs and a STK RAID INT SAS HBA

2010-04-03 Thread Ragnar Sundblad

Hello,

Maybe this question should be put on another list, but since there
are a lot of people here using all kinds of HBAs, this could be right
anyway;

I have a X4150 running snv_134. It was shipped with a STK RAID INT
adaptec/intel/storagetek/sun SAS HBA.

When running the card in copyback write cache mode, I got horrible
performance (with zfs), much worse than with copyback disabled
(which I believe should mean it does write-through), when tested
with filebench.
This could actually be expected, depending on how good or bad the
the card is, but I am still not sure about what to expect.

It logs some errors, as shown with fmdump -e(V).
It is most often a pci bridge error (I think), about five to ten
times an hour, and occasionally a problem with accessing a
mode page on the disks for enabling/disabling the write cache,
one error for each disk, about every three hours.
I don't believe the two have to be related.

I am not sure if the PCI-PCI bridge is on the RAID board itself
or in the host.

I haven't seen this problem on other more or less identical
machines running sol10.

Is this a known software problem, or do I have faulty hardware?

Thanks!

/ragge

--

% fmdump -e
...
Apr 04 01:21:53.2244 ereport.io.pci.fabric   
Apr 04 01:30:00.6999 ereport.io.pci.fabric   
Apr 04 01:30:23.4647 ereport.io.scsi.cmd.disk.dev.uderr
Apr 04 01:30:23.4651 ereport.io.scsi.cmd.disk.dev.uderr
...
% fmdump -eV
Apr 04 2010 01:21:53.224492765 ereport.io.pci.fabric
nvlist version: 0
class = ereport.io.pci.fabric
ena = 0xd6a00a43be800c01
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = dev
device-path = /p...@0,0/pci8086,2...@4
(end detector)

bdf = 0x20
device_id = 0x25f8
vendor_id = 0x8086
rev_id = 0xb1
dev_type = 0x40
pcie_off = 0x6c
pcix_off = 0x0
aer_off = 0x100
ecc_ver = 0x0
pci_status = 0x10
pci_command = 0x147
pci_bdg_sec_status = 0x0
pci_bdg_ctrl = 0x3
pcie_status = 0x0
pcie_command = 0x2027
pcie_dev_cap = 0xfc1
pcie_adv_ctl = 0x0
pcie_ue_status = 0x0
pcie_ue_mask = 0x10
pcie_ue_sev = 0x62031
pcie_ue_hdr0 = 0x0
pcie_ue_hdr1 = 0x0
pcie_ue_hdr2 = 0x0
pcie_ue_hdr3 = 0x0
pcie_ce_status = 0x0
pcie_ce_mask = 0x0
pcie_rp_status = 0x0
pcie_rp_control = 0x7
pcie_adv_rp_status = 0x0
pcie_adv_rp_command = 0x7
pcie_adv_rp_ce_src_id = 0x0
pcie_adv_rp_ue_src_id = 0x0
remainder = 0x0
severity = 0x1
__ttl = 0x1
__tod = 0x4bb7cd91 0xd617cdd
...
Apr 04 2010 01:30:23.464768275 ereport.io.scsi.cmd.disk.dev.uderr
nvlist version: 0
class = ereport.io.scsi.cmd.disk.dev.uderr
ena = 0xde0cd54f84201c01
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = dev
device-path = /p...@0,0/pci8086,2...@4/pci108e,2...@0/d...@5,0
devid = id1,s...@tsun_stk_raid_intea4b6f24
(end detector)

driver-assessment = fail
op-code = 0x1a
cdb = 0x1a 0x0 0x8 0x0 0x18 0x0
pkt-reason = 0x0
pkt-state = 0x1f
pkt-stats = 0x0
stat-code = 0x0
un-decode-info = sd_get_write_cache_enabled: Mode Sense caching page code mismatch 0

un-decode-value =
__ttl = 0x1
__tod = 0x4bb7cf8f 0x1bb3cd13
...

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] To slice, or not to slice

2010-04-03 Thread Richard Elling
On Apr 3, 2010, at 5:56 PM, Tim Cook wrote:
 
 On Sat, Apr 3, 2010 at 7:50 PM, Tim Cook t...@cook.ms wrote:
 Your experience is exactly why I suggested ZFS start doing some right 
 sizing if you will.  Chop off a bit from the end of any disk so that we're 
 guaranteed to be able to replace drives from different manufacturers.  The 
 excuse being no reason to, Sun drives are always of identical size.  If 
 your drives did indeed come from Sun, their response is clearly not true.  
 Regardless, I guess I still think it should be done.  Figure out what the 
 greatest variation we've seen from drives that are supposedly of the exact 
 same size, and chop it off the end of every disk.  I'm betting it's no more 
 than 1GB, and probably less than that.  When we're talking about a 2TB 
 drive, I'm willing to give up a gig to be guaranteed I won't have any issues 
 when it comes time to swap it out.
 
 
 that's what open solaris is doing more or less for some time now.
 
 look in the archives of this mailing list for more information.
 -- 
 Robert Milkowski
 http://milek.blogspot.com
 
 
 
 Since when?  It isn't doing it on any of my drives, build 134, and judging by 
 the OP's issues, it isn't doing it for him either... I try to follow this 
 list fairly closely and I've never seen anyone at Sun/Oracle say they were 
 going to start doing it after I was shot down the first time.
 
 --Tim
 
 
 Oh... and after 15 minutes of searching for everything from 'right-sizing' to 
 'block reservation' to 'replacement disk smaller size fewer blocks' etc. etc. 
 I don't see a single thread on it.

CR 6844090, zfs should be able to mirror to a smaller disk
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6844090
b117, June 2009
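
(For anyone searching the archives later, the symptom being discussed looks
roughly like the following; the device names are made up and the message text
is paraphrased from memory:)

 % zpool attach tank c1t2d0 c1t3d0
 cannot attach c1t3d0 to c1t2d0: device is too small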
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com 





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] To slice, or not to slice

2010-04-03 Thread Richard Elling
On Apr 2, 2010, at 2:05 PM, Edward Ned Harvey wrote:

 Momentarily, I will begin scouring the omniscient interweb for information, 
 but I’d like to know a little bit of what people would say here.  The 
 question is to slice, or not to slice, disks before using them in a zpool.
  
 One reason to slice comes from recent personal experience.  One disk of a 
 mirror dies.  Replaced under contract with an identical disk.  Same model 
 number, same firmware.  Yet when it’s plugged into the system, for an unknown 
 reason, it appears 0.001 Gb smaller than the old disk, and therefore unable 
 to attach and un-degrade the mirror.  It seems logical this problem could 
 have been avoided if the device added to the pool originally had been a slice 
 somewhat smaller than the whole physical device.  Say, a slice of 28G out of 
 the 29G physical disk.  Because later when I get the infinitesimally smaller 
 disk, I can always slice 28G out of it to use as the mirror device.

If the HBA is configured for RAID mode, then it will reserve some space on disk
for its metadata.  This occurs no matter what type of disk you attach.

 There is some question about performance.  Is there any additional overhead 
 caused by using a slice instead of the whole physical device?

No.

 There is another question about performance.  One of my colleagues said he 
 saw some literature on the internet somewhere, saying ZFS behaves differently 
 for slices than it does on physical devices, because it doesn’t assume it has 
 exclusive access to that physical device, and therefore caches or buffers 
 differently … or something like that.
  
 Any other pros/cons people can think of?

If the disk is only used for ZFS, then it is ok to enable volatile disk write 
caching
if the disk also supports write cache flush requests.

If the disk is shared with UFS, then it is not ok to enable volatile disk write 
caching.
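
(To check or toggle the on-disk volatile write cache by hand, format's expert
mode has a cache menu on most SCSI/SATA targets - a sketch, since the menus
vary by drive and driver:)

 # format -e
 format> cache
 cache> write_cache
 write_cache> display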

 -- richard

 
 And finally, if anyone has experience doing this, and process 
 recommendations?  That is … My next task is to go read documentation again, 
 to refresh my memory from years ago, about the difference between “format,” 
 “partition,” “label,” “fdisk,” because those terms don’t have the same 
 meaning that they do in other OSes…  And I don’t know clearly right now, 
 which one(s) I want to do, in order to create the large slice of my disks.
  
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com 





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] To slice, or not to slice

2010-04-03 Thread Tim Cook
On Sat, Apr 3, 2010 at 9:52 PM, Richard Elling richard.ell...@gmail.com wrote:

 On Apr 3, 2010, at 5:56 PM, Tim Cook wrote:
 
  On Sat, Apr 3, 2010 at 7:50 PM, Tim Cook t...@cook.ms wrote:
  Your experience is exactly why I suggested ZFS start doing some "right
 sizing" if you will.  Chop off a bit from the end of any disk so that we're
 guaranteed to be able to replace drives from different manufacturers.  The
 excuse being "no reason to, Sun drives are always of identical size."  If
 your drives did indeed come from Sun, their response is clearly not true.
  Regardless, I guess I still think it should be done.  Figure out what the
 greatest variation we've seen from drives that are supposedly of the exact
 same size, and chop it off the end of every disk.  I'm betting it's no more
 than 1GB, and probably less than that.  When we're talking about a 2TB
 drive, I'm willing to give up a gig to be guaranteed I won't have any issues
 when it comes time to swap it out.
 
 
   that's what OpenSolaris has been doing, more or less, for some time now.
 
  look in the archives of this mailing list for more information.
  --
  Robert Milkowski
  http://milek.blogspot.com
 
 
 
  Since when?  It isn't doing it on any of my drives, build 134, and
 judging by the OP's issues, it isn't doing it for him either... I try to
 follow this list fairly closely and I've never seen anyone at Sun/Oracle say
 they were going to start doing it after I was shot down the first time.
 
  --Tim
 
 
  Oh... and after 15 minutes of searching for everything from
 'right-sizing' to 'block reservation' to 'replacement disk smaller size
 fewer blocks' etc. etc. I don't see a single thread on it.

 CR 6844090, zfs should be able to mirror to a smaller disk
 http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6844090
  b117, June 2009
  -- richard



Unless the bug description is incomplete, that's talking about adding a
mirror to an existing drive.  Not about replacing a failed drive in an
existing vdev that could be raid-z#.  I'm almost positive I had an issue
post b117 with replacing a failed drive in a raid-z2 vdev.

I'll have to see if I can dig up a system to test the theory on.

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] dedup and memory/l2arc requirements

2010-04-03 Thread Richard Elling
On Apr 1, 2010, at 9:34 PM, Roy Sigurd Karlsbakk wrote:

 You can estimate the amount of disk space needed for the deduplication
 table
 and the expected deduplication ratio by using zdb -S poolname on
 your existing
 pool. 
 
 This is all good, but it doesn't work too well for planning. Is there a rule 
 of thumb I can use for a general overview?

If you know the average record size for your workload, then you can calculate
the average number of records when given the total space.  This should get 
you in the ballpark.
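
For the 125TB case mentioned in this thread, a back-of-the-envelope sketch in
bash/ksh93 arithmetic (the 128K average record size and the ~300 bytes per DDT
entry are assumptions for illustration, not measured values):

# POOL_KB=$((125 * 1024 * 1024 * 1024))   # 125 TiB of data, expressed in KiB
# BLOCKS=$((POOL_KB / 128))               # ~1.05 billion blocks at 128K records
# echo "$((BLOCKS * 300 / 1024 / 1024 / 1024)) GiB"   # roughly 290 GiB of DDT, worst case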

 Say I want 125TB space and I want to dedup that for backup use. It'll 
 probably be quite efficient dedup, so long alignment will match. By the way, 
 is there a way to auto-align data for dedup in case of backup? Or does zfs do 
 this by itself?

ZFS does not change alignment.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com 





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] To slice, or not to slice

2010-04-03 Thread Richard Elling
On Apr 3, 2010, at 8:00 PM, Tim Cook wrote:
 On Sat, Apr 3, 2010 at 9:52 PM, Richard Elling richard.ell...@gmail.com 
 wrote:
 On Apr 3, 2010, at 5:56 PM, Tim Cook wrote:
 
  On Sat, Apr 3, 2010 at 7:50 PM, Tim Cook t...@cook.ms wrote:
   Your experience is exactly why I suggested ZFS start doing some "right 
   sizing" if you will.  Chop off a bit from the end of any disk so that 
   we're guaranteed to be able to replace drives from different 
   manufacturers.  The excuse being "no reason to, Sun drives are always of 
   identical size."  If your drives did indeed come from Sun, their response 
   is clearly not true.  Regardless, I guess I still think it should be done. 
   Figure out what the greatest variation we've seen from drives that are 
  supposedly of the exact same size, and chop it off the end of every disk.  
  I'm betting it's no more than 1GB, and probably less than that.  When 
  we're talking about a 2TB drive, I'm willing to give up a gig to be 
  guaranteed I won't have any issues when it comes time to swap it out.
 
 
   that's what OpenSolaris has been doing, more or less, for some time now.
 
  look in the archives of this mailing list for more information.
  --
  Robert Milkowski
  http://milek.blogspot.com
 
 
 
  Since when?  It isn't doing it on any of my drives, build 134, and judging 
  by the OP's issues, it isn't doing it for him either... I try to follow 
  this list fairly closely and I've never seen anyone at Sun/Oracle say they 
  were going to start doing it after I was shot down the first time.
 
  --Tim
 
 
  Oh... and after 15 minutes of searching for everything from 'right-sizing' 
  to 'block reservation' to 'replacement disk smaller size fewer blocks' etc. 
  etc. I don't see a single thread on it.
 
 CR 6844090, zfs should be able to mirror to a smaller disk
 http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6844090
 b117, June 2009
  -- richard
 
 
 
 Unless the bug description is incomplete, that's talking about adding a 
 mirror to an existing drive.  Not about replacing a failed drive in an 
 existing vdev that could be raid-z#.  I'm almost positive I had an issue post 
 b117 with replacing a failed drive in a raid-z2 vdev.

It is the same code.

That said, I have experimented with various cases and I have not found
prediction of tolerable size difference to be easy.

 I'll have to see if I can dig up a system to test the theory on.

Works fine.

# ramdiskadm -a rd1 10k
/dev/ramdisk/rd1
# ramdiskadm -a rd2 10k
/dev/ramdisk/rd2
# ramdiskadm -a rd3 10k
/dev/ramdisk/rd3
# ramdiskadm -a rd4 99900k
/dev/ramdisk/rd4
# zpool create -o cachefile=none zwimming raidz /dev/ramdisk/rd1 
/dev/ramdisk/rd2 /dev/ramdisk/rd3
# zpool status zwimming
  pool: zwimming
 state: ONLINE
 scrub: none requested
config:

NAME  STATE READ WRITE CKSUM
zwimming  ONLINE   0 0 0
  raidz1-0ONLINE   0 0 0
/dev/ramdisk/rd1  ONLINE   0 0 0
/dev/ramdisk/rd2  ONLINE   0 0 0
/dev/ramdisk/rd3  ONLINE   0 0 0

errors: No known data errors
# zpool replace zwimming /dev/ramdisk/rd3 /dev/ramdisk/rd4
# zpool status zwimming
  pool: zwimming
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Sat Apr  3 20:08:35 2010
config:

NAME  STATE READ WRITE CKSUM
zwimming  ONLINE   0 0 0
  raidz1-0ONLINE   0 0 0
/dev/ramdisk/rd1  ONLINE   0 0 0
/dev/ramdisk/rd2  ONLINE   0 0 0
/dev/ramdisk/rd4  ONLINE   0 0 0  45K resilvered

errors: No known data errors
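
To tear the experiment down afterward (assuming the same ramdisk names as above):

# zpool destroy zwimming
# ramdiskadm -d rd1
# ramdiskadm -d rd2
# ramdiskadm -d rd3
# ramdiskadm -d rd4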


 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com 





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] L2ARC Workingset Size

2010-04-03 Thread Richard Elling
On Apr 1, 2010, at 9:41 PM, Abdullah Al-Dahlawi wrote:

 Hi all
 
 I ran a workload that reads & writes within 10 files; each file is 256M, i.e., 
 (10 * 256M = 2.5GB total dataset size).
 
 I have set the ARC max size to 1 GB in the /etc/system file.
 
 In the worst case, let us assume that the whole dataset is hot, meaning my 
 working set size = 2.5GB.
 
 My SSD flash size = 8GB and it is being used for L2ARC.
 
 No slog is used in the pool.
 
 My file system record size = 8K, meaning 2.5% of 8GB is used for the L2ARC 
 directory in ARC, which ultimately means that the available ARC is 1024M - 204.8M 
 = 819.2M (am I right?)

this is worst case
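
For reference, the arithmetic behind that worst-case figure, taking roughly
200 bytes of ARC header per 8K L2ARC record (an approximation used for
planning, not an exact constant):

# L2ARC=$((8 * 1024 * 1024 * 1024))   # 8 GiB cache device
# echo "$((L2ARC / 8192 * 200 / 1024 / 1024)) MiB"   # ~200 MiB of ARC, worst case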

 Now the Question ...
 
 After running the workload for 75 minutes, I have noticed that the L2ARC device 
 has grown to 6 GB !!!   

You're not interpreting the values properly, see below.

 What is in L2ARC beyond my 2.5GB working set?? Something else has been 
 added to L2ARC. 

ZFS is COW, so modified data is written to disk and the L2ARC.

 Here is a 5-minute interval of zpool iostat: 

[snip]
 Also, a full ZFS kstat dump over a 5-minute interval:

[snip]
 module: zfs                             instance: 0 
 name:   arcstats                        class:    misc
 c   1073741824
 c_max   1073741824

Max ARC size is limited to 1GB

 c_min   134217728
 crtime  28.083178473
 data_size   955407360
 deleted 966956
 demand_data_hits843880
 demand_data_misses  452182
 demand_metadata_hits68572
 demand_metadata_misses  5737
 evict_skip  82548
 hash_chain_max  18
 hash_chains 61732
 hash_collisions 1444874
 hash_elements   329553
 hash_elements_max   329561
 hdr_size46553328
 hits978241
 l2_abort_lowmem 0
 l2_cksum_bad0
 l2_evict_lock_retry 0
 l2_evict_reading0
 l2_feeds4738
 l2_free_on_write184
 l2_hdr_size 17024784

size of L2ARC headers is approximately 17MB

 l2_hits 252839
 l2_io_error 0
 l2_misses   203767
 l2_read_bytes   2071482368
 l2_rw_clash 13
 l2_size 2632226304

currently, there is approximately 2.5GB in the L2ARC

 l2_write_bytes  6486009344

total amount of data written to L2ARC since boot is 6+ GB

 l2_writes_done  4127
 l2_writes_error 0
 l2_writes_hdr_miss  21
 l2_writes_sent  4127
 memory_throttle_count   0
 mfu_ghost_hits  120524
 mfu_hits500516
 misses  468227
 mru_ghost_hits  61398
 mru_hits412112
 mutex_miss  511
 other_size  56325712
 p   775528448
 prefetch_data_hits  50804
 prefetch_data_misses7819
 prefetch_metadata_hits  14985
 prefetch_metadata_misses2489
 recycle_miss13096
 size1073830768

ARC size is 1GB

The best way to understand these in detail is to look at the source which 
is nicely commented. L2ARC design is commented near
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#3590
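
A quick way to watch just these counters over time (standard kstat(1M)
operand syntax; the 10-second interval is arbitrary):

# kstat -p zfs:0:arcstats:size zfs:0:arcstats:l2_size zfs:0:arcstats:l2_hdr_size 10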

 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com 





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-03 Thread Richard Elling
On Apr 3, 2010, at 5:47 PM, Ragnar Sundblad wrote:
 On 2 apr 2010, at 22.47, Neil Perrin wrote:
 
 Suppose there is an application which sometimes does sync writes, and
 sometimes async writes.  In fact, to make it easier, suppose two processes
 open two files, one of which always writes asynchronously, and one of which
 always writes synchronously.  Suppose the ZIL is disabled.  Is it possible
 for writes to be committed to disk out-of-order?  Meaning, can a large block
 async write be put into a TXG and committed to disk before a small sync
 write to a different file is committed to disk, even though the small sync
 write was issued by the application before the large async write?  Remember,
 the point is:  ZIL is disabled.  Question is whether the async could
 possibly be committed to disk before the sync.
 
 
 
 Threads can be pre-empted in the OS at any time. So even though thread A 
 issued
 W1 before thread B issued W2, the order is not guaranteed to arrive at ZFS 
 as W1, W2.
 Multi-threaded applications have to handle this.
 
 If this was a single thread issuing W1 then W2 then yes the order is 
 guaranteed
 regardless of whether W1 or W2 are synchronous or asynchronous.
 Of course if the system crashes then the async operations might not be there.
 
 Could you please clarify this last paragraph a little:
 Do you mean that this is in the case that you have ZIL enabled
  and the txg for W1 and W2 hasn't been committed, so that upon reboot
 the ZIL is replayed, and therefore only the sync writes are
 eventually there?

yes. The ZIL needs to be replayed on import after an unclean shutdown.

  If, let's say, W1 is an async small write, W2 is a sync small write,
  W1 arrives at ZFS before W2, and W2 arrives before the txg is
  committed, will both writes always be in the txg on disk?

yes

  If so, it would mean that ZFS itself never buffers up async writes into
  larger chunks to write in a later txg, correct?

correct

 I take it that ZIL enabled or not does not make any difference here
 (we pretend the system did _not_ crash), correct?

For import following a clean shutdown, there are no transactions in 
the ZIL to apply.

For async-only workloads, there are no transactions in the ZIL to apply.

Do not assume that power outages are the only cause of unclean shutdowns.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss