Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Hi Jeroen,

Have you tried the DDRdrive from Christopher George? It looks to me like a much better fit for your application than the F20, and it would not hurt to check it out. You appear to need a product with low *latency*, and a RAM-based cache will be a much better performer than any solution based solely on flash. Let us know (on the list) how this works out for you.

Regards,

-- Al Hopper
Logical Approach Inc, Plano, TX
a...@logical-approach.com
Voice: 214.233.5089  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] To slice, or not to slice
On Apr 2, 2010, at 2:29 PM, Edward Ned Harvey wrote:
> I've also heard that the risk for unexpected failure of your pool is higher
> if/when you reach 100% capacity. I've heard that you should always create a
> small ZFS filesystem within a pool, and give it some reserved space, along
> with the filesystem that you actually plan to use in your pool. Anyone care
> to offer any comments on that?

How do you define "failure" in this context? I am not aware of a data-loss failure when a pool is nearly full. However, all file systems will experience performance degradation for write operations as they become full.

-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
Re: [zfs-discuss] dedup and memory/l2arc requirements
On Apr 2, 2010, at 2:03 PM, Miles Nordin wrote:
>> "re" == Richard Elling writes:
>
> re> # ptime zdb -S zwimming
> re> Simulated DDT histogram:
> re> refcnt  blocks  LSIZE  PSIZE  DSIZE  blocks  LSIZE  PSIZE  DSIZE
> re> Total    2.63M   277G   218G   225G   3.22M   337G   263G   270G
>
> re> in-core size = 2.63M * 250 = 657.5 MB
>
> Thanks, that is really useful! It'll probably make the difference
> between trying dedup and not, for me.
>
> It is not working for me yet. It got to this point in prstat:
>
>   6754 root 2554M 1439M sleep 60 0 0:03:31 1.9% zdb/106
>
> and then ran out of memory:
>
>   $ pfexec ptime zdb -S tub
>   out of memory -- generating core dump

This is annoying. By default, zdb is compiled as a 32-bit executable, and it can be a memory hog. Compiling it yourself is too painful for most folks :-(

> I might add some swap I guess. I will have to try it on another
> machine with more RAM and less pool, and see how the size of the zdb
> image compares to the calculated size of the DDT needed. So long as zdb
> is the same or a little smaller than the DDT it predicts, the tool's
> still useful; just sometimes it will report ``DDT too big but not sure
> by how much'' by coredumping/thrashing instead of finishing.

In my experience, more swap doesn't help break through the 2GB memory barrier. As zdb is an intentionally unsupported tool, methinks a recompile may be required (or write your own).

-- richard
Re: [zfs-discuss] To slice, or not to slice
On Fri, Apr 2, 2010 at 2:29 PM, Edward Ned Harvey wrote:
> I've also heard that the risk for unexpected failure of your pool is
> higher if/when you reach 100% capacity. I've heard that you should always
> create a small ZFS filesystem within a pool, and give it some reserved
> space, along with the filesystem that you actually plan to use in your
> pool. Anyone care to offer any comments on that?

I think you can just create a dataset with a reservation to avoid the issue. As I understand it, ZFS doesn't automatically set aside a few percent of reserved space the way UFS does.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] To slice, or not to slice
On Fri, Apr 2, 2010 at 2:23 PM, Edward Ned Harvey wrote:
> There is some question about performance. Is there any additional
> overhead caused by using a slice instead of the whole physical device?

ZFS will disable the disk's write cache when it's not working with whole disks, which may reduce performance. You can turn the cache back on, however. I don't remember the exact incantation to do so, but "format -e" springs to mind.

> And finally, if anyone has experience doing this, any process
> recommendations? That is ... My next task is to go read documentation
> again, to refresh my memory from years ago, about the difference between
> "format," "partition," "label," "fdisk," because those terms don't have
> the same meaning that they do in other OSes... And I don't know clearly
> right now, which one(s) I want to do, in order to create the large slice
> of my disks.

The whole partition-vs-slice thing is a bit fuzzy to me, so take this with a grain of salt. You can create partitions using fdisk, or slices using format. The BIOS and other operating systems (Windows, Linux, etc.) will be able to recognize partitions, while they won't be able to make sense of slices. If you need to boot from the drive or share it with another OS, then partitions are the way to go. If it's exclusive to Solaris, then you can use slices. You can (but shouldn't) use slices and partitions from the same device (e.g. c5t0d0s0 and c5t0d0p0).

-B
Re: [zfs-discuss] To slice, or not to slice
On 04/03/10 10:23 AM, Edward Ned Harvey wrote:
> Momentarily, I will begin scouring the omniscient interweb for information,
> but I'd like to know a little bit of what people would say here. The
> question is to slice, or not to slice, disks before using them in a zpool.

Not.

> One reason to slice comes from recent personal experience. One disk of a
> mirror dies. Replaced under contract with an identical disk. Same model
> number, same firmware. Yet when it's plugged into the system, for an
> unknown reason, it appears 0.001 GB smaller than the old disk, and is
> therefore unable to attach and un-degrade the mirror. It seems logical this
> problem could have been avoided if the device added to the pool originally
> had been a slice somewhat smaller than the whole physical device. Say, a
> slice of 28G out of the 29G physical disk. Because later, when I get the
> infinitesimally smaller disk, I can always slice 28G out of it to use as
> the mirror device.

What build were you running? That should have been addressed by CR 6844090, which went into build 117.

> There is some question about performance. Is there any additional overhead
> caused by using a slice instead of the whole physical device? There is
> another question about performance. One of my colleagues said he saw some
> literature on the internet somewhere, saying ZFS behaves differently for
> slices than it does on physical devices, because it doesn't assume it has
> exclusive access to that physical device, and therefore caches or buffers
> differently ... or something like that.

It's well documented. ZFS won't attempt to enable the drive's write cache unless it has the whole physical device. See
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Storage_Pools

-- Ian.
Re: [zfs-discuss] RAID-Z with Permanent errors detected in files
> I guess it will then remain a mystery how this happened, since I'm very
> careful when engaging the commands and I'm sure that I didn't miss the
> "raidz" parameter.

You can know for sure by calling "zpool history".

Robert

-- This message posted from opensolaris.org
Re: [zfs-discuss] To slice, or not to slice
This might be unrelated, but along similar lines ... I've also heard that the risk for unexpected failure of your pool is higher if/when you reach 100% capacity. I've heard that you should always create a small ZFS filesystem within a pool, and give it some reserved space, along with the filesystem that you actually plan to use in your pool. Anyone care to offer any comments on that?

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
Sent: Friday, April 02, 2010 5:23 PM
To: zfs-discuss@opensolaris.org
Subject: [zfs-discuss] To slice, or not to slice

Momentarily, I will begin scouring the omniscient interweb for information, but I'd like to know a little bit of what people would say here. The question is to slice, or not to slice, disks before using them in a zpool.

One reason to slice comes from recent personal experience. One disk of a mirror dies. Replaced under contract with an identical disk. Same model number, same firmware. Yet when it's plugged into the system, for an unknown reason, it appears 0.001 GB smaller than the old disk, and is therefore unable to attach and un-degrade the mirror. It seems logical this problem could have been avoided if the device added to the pool originally had been a slice somewhat smaller than the whole physical device. Say, a slice of 28G out of the 29G physical disk. Because later, when I get the infinitesimally smaller disk, I can always slice 28G out of it to use as the mirror device.

There is some question about performance. Is there any additional overhead caused by using a slice instead of the whole physical device?

There is another question about performance. One of my colleagues said he saw some literature on the internet somewhere, saying ZFS behaves differently for slices than it does on physical devices, because it doesn't assume it has exclusive access to that physical device, and therefore caches or buffers differently ... or something like that.
Any other pros/cons people can think of?

And finally, if anyone has experience doing this, any process recommendations? That is ... My next task is to go read documentation again, to refresh my memory from years ago, about the difference between "format," "partition," "label," "fdisk," because those terms don't have the same meaning that they do in other OSes. And I don't know clearly right now, which one(s) I want to do, in order to create the large slice of my disks.
[zfs-discuss] To slice, or not to slice
Momentarily, I will begin scouring the omniscient interweb for information, but I'd like to know a little bit of what people would say here. The question is to slice, or not to slice, disks before using them in a zpool.

One reason to slice comes from recent personal experience. One disk of a mirror dies. Replaced under contract with an identical disk. Same model number, same firmware. Yet when it's plugged into the system, for an unknown reason, it appears 0.001 GB smaller than the old disk, and is therefore unable to attach and un-degrade the mirror. It seems logical this problem could have been avoided if the device added to the pool originally had been a slice somewhat smaller than the whole physical device. Say, a slice of 28G out of the 29G physical disk. Because later, when I get the infinitesimally smaller disk, I can always slice 28G out of it to use as the mirror device.

There is some question about performance. Is there any additional overhead caused by using a slice instead of the whole physical device?

There is another question about performance. One of my colleagues said he saw some literature on the internet somewhere, saying ZFS behaves differently for slices than it does on physical devices, because it doesn't assume it has exclusive access to that physical device, and therefore caches or buffers differently ... or something like that.

Any other pros/cons people can think of?

And finally, if anyone has experience doing this, any process recommendations? That is ... My next task is to go read documentation again, to refresh my memory from years ago, about the difference between "format," "partition," "label," "fdisk," because those terms don't have the same meaning that they do in other OSes. And I don't know clearly right now, which one(s) I want to do, in order to create the large slice of my disks.
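For anyone taking the 28G-out-of-29G approach above, the sizing arithmetic can be sketched like this (a hypothetical Python sketch, not from the thread; the 512-byte sector size and the 1 GB safety margin are assumptions you would adjust for your own drives):

```python
SECTOR = 512  # bytes per sector; an assumption, typical for these drives

def slice_sectors(disk_bytes, margin_gb=1):
    """Sectors for a slice rounded down to a whole decimal GB, minus a
    safety margin, so a marginally smaller replacement disk still fits."""
    whole_gb = disk_bytes // 10**9           # capacity in whole decimal GB
    usable_gb = max(whole_gb - margin_gb, 0) # leave headroom for variance
    return (usable_gb * 10**9) // SECTOR

# A nominal "29 GB" disk that actually reports 29.1e9 bytes
# yields a 28 GB slice:
print(slice_sectors(29_100_000_000))  # 54687500
```

The same function run against the replacement disk tells you whether an identical slice will fit, regardless of the exact raw capacity.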
Re: [zfs-discuss] dedup and memory/l2arc requirements
> "re" == Richard Elling writes: re> # ptime zdb -S zwimming Simulated DDT histogram: re> refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE re> Total2.63M277G218G225G3.22M337G263G 270G re>in-core size = 2.63M * 250 = 657.5 MB Thanks, that is really useful! It'll probably make the difference between trying dedup and not, for me. It is not working for me yet. It got to this point in prstat: 6754 root 2554M 1439M sleep 600 0:03:31 1.9% zdb/106 and then ran out of memory: $ pfexec ptime zdb -S tub out of memory -- generating core dump I might add some swap I guess. I will have to try it on another machine with more RAM and less pool, and see how the size of the zdb image compares to the calculated size of DDT needed. So long as zdb is the same or a little smaller than the DDT it predicts, the tool's still useful, just sometimes it will report ``DDT too big but not sure by how much'', by coredumping/thrashing instead of finishing. pgprpk9HSdr61.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Fri, Apr 2 at 11:14, Tirso Alonso wrote:
>> If my new replacement SSD with identical part number and firmware is 0.001
>> GB smaller than the original and hence unable to mirror, what's to prevent
>> the same thing from happening to one of my 1TB spindle disk mirrors?
>
> There is a standard for sizes that many manufacturers use (IDEMA LBA1-02):
>
> LBA count = 97696368 + (1953504 * (Desired Capacity in GB - 50.0))
>
> Sizes should match exactly if the manufacturer follows the standard.
>
> See:
> http://opensolaris.org/jive/message.jspa?messageID=393336#393336
> http://www.idema.org/_smartsite/modules/local/data_file/show_file.php?cmd=download&data_file_id=1066

The problem is that the standard only applies to devices that are >= 50GB in size, and the X25 in question is only 32GB.

That being said, I'd be skeptical of either the sourcing of the parts, or else some other configuration feature on the drives (like an HPA or DCO) that is changing the capacity. It's possible one of these is in effect.

--eric

-- Eric D. Mudama
edmud...@mail.bounceswoosh.org
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Fri, Apr 2, 2010 at 10:08 AM, Kyle McDonald wrote:
> On 4/2/2010 8:08 AM, Edward Ned Harvey wrote:
>>> I know it is way after the fact, but I find it best to coerce each
>>> drive down to the whole GB boundary using format (create a Solaris
>>> partition just up to the boundary). Then if you ever get a drive a
>>> little smaller, it still should fit.
>>
>> It seems like it should be unnecessary. It seems like extra work. But
>> based on my present experience, I reached the same conclusion.
>>
>> If my new replacement SSD with identical part number and firmware is
>> 0.001 GB smaller than the original and hence unable to mirror, what's to
>> prevent the same thing from happening to one of my 1TB spindle disk
>> mirrors? Nothing. That's what.
>
> Actually, it's my experience that Sun (and other vendors) do exactly that
> for you when you buy their parts - at least for rotating drives; I have no
> experience with SSDs.
>
> The Sun disk label shipped on all the drives is set up to make the drive
> the standard size for that Sun part number. They have to do this since
> they (for many reasons) have many sources (different vendors, even
> different parts from the same vendor) for the actual disks they use for a
> particular Sun part number.
>
> This isn't new; I believe IBM, EMC, HP, etc. all do it for the same
> reasons. I'm a little surprised that the engineers would suddenly stop
> doing it only on SSDs. But who knows.
>
> -Kyle

If I were forced to ignorantly cast a stone, it would be into Intel's lap (if the SSDs indeed came directly from Sun). Sun's "normal" drive vendors have been in this game for decades and know the expectations. Intel, on the other hand, may not have quite the same QC in place yet.

--Tim
[zfs-discuss] ZFS behavior under limited resources
I am trying to see how ZFS behaves under resource starvation - corner cases in embedded environments. I see some very strange behavior. Any help/explanation would really be appreciated.

My current setup is:
- OpenSolaris 111b (iSCSI seems to be broken in 132 - unable to get multiple connections/multipathing)
- iSCSI storage array capable of:
  20 MB/s random writes @ 4k and 70 MB/s random reads @ 4k
  150 MB/s random writes @ 128k and 180 MB/s random reads @ 128k
  180+ MB/s for sequential reads and writes at both 4k and 128k
- 8 Intel CPUs and 12 GB of RAM (Dell PowerEdge 610)
- ARC size limited to 512MB (hard limit). No L2 cache.

In both tests below, the file system size is about 300 GB. This file system contains a single directory with about 15,000 files totalling 200 GB (so the file system is 2/3 full). The tests are run within the same directory.

Test 1: Random writes @ 4k to 1000 1MB files (1000 threads, 1 per file).

First I observe that the ARC size grows (momentarily) above the 512 MB limit (via kstat and arcstat.pl).
Q: It seems that zfs:zfs_arc_max is not really a hard limit?

I tried setting primarycache to none, metadata, and all. The I/O reported is similar in the NONE and METADATA cases (17 MB/s), while when set to ALL, I/O is 3-4 times less (4-5 MB/s).
Q: Any explanation would be useful.

In this test I observe that backend I/O is on average 132 MB/s for reads and 51 MB/s for writes.
Q: Why is more read than written?

Test 2: Random writes @ 4k to 10,000 1MB files (10,000 threads, 1 per file).
- The ARC size now goes to 1 GB during the entire test (way above the hard limit)
- ::memstat reports that ZFS grew from the original 430 MB to about 1.5 GB
Q: Does the mdb ::memstat report include the ARC?
Q: On the backend I see 170 MB/s reads and 0.5 MB/s writes - what is happening here?

SOME sample output ...
---
> ::memstat
Page Summary            Pages      MB   %Tot
Kernel                 800933    3128    25%
ZFS File Data          394450    1540    13%
Anon                   128909     503     4%
Exec and libs            4172      16     0%
Page cache              14749      57     0%
Free (cachelist)        21884      85     1%
Free (freelist)       1776079    6937    57%
Total                 3141176   12270
Physical              3141175   12270
--
System Memory:
  Physical RAM: 12270 MB
  Free Memory:   6966 MB
  LotsFree:       191 MB

ZFS Tunables (/etc/system):
  set zfs:zfs_prefetch_disable = 1
  set zfs:zfs_arc_max = 0x20000000
  set zfs:zfs_arc_min = 0x10000000

ARC Size:
  Current Size:            669 MB (arcsize)
  Target Size (Adaptive):  512 MB (c)
  Min Size (Hard Limit):   256 MB (zfs_arc_min)
  Max Size (Hard Limit):   512 MB (zfs_arc_max)

ARC Size Breakdown:
  Most Recently Used Cache Size:    6%   32 MB (p)
  Most Frequently Used Cache Size: 93%  480 MB (c-p)

ARC Efficiency:
  Cache Access Total:    47002757
  Cache Hit Ratio:   52% 24657634  [Defined State for buffer]
  Cache Miss Ratio:  47% 22345123  [Undefined State for Buffer]
  REAL Hit Ratio:    52% 24657634  [MRU/MFU Hits Only]
  Data Demand Efficiency:    36%
  Data Prefetch Efficiency:  DISABLED (zfs_prefetch_disable)

  CACHE HITS BY CACHE LIST:
    Anon:                       --%  Counter Rolled.
    Most Recently Used:         13%  3420349 (mru)        [Return Customer]
    Most Frequently Used:       86%  21237285 (mfu)       [Frequent Customer]
    Most Recently Used Ghost:   16%  4057965 (mru_ghost)  [Return Customer Evicted, Now Back]
    Most Frequently Used Ghost: 31%  7837353 (mfu_ghost)  [Frequent Customer Evicted, Now Back]

  CACHE HITS BY DATA TYPE:
    Demand Data:        31%  7793822
    Prefetch Data:       0%  0
    Demand Metadata:    68%  16863812
    Prefetch Metadata:   0%  0

  CACHE MISSES BY DATA TYPE:
    Demand Data:        60%  13573358
    Prefetch Data:       0%  0
    Demand Metadata:    39%  8771406
    Prefetch Metadata:   0%  359
-
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> "enh" == Edward Ned Harvey writes: enh> If you have zpool less than version 19 (when ability to remove enh> log device was introduced) and you have a non-mirrored log enh> device that failed, you had better treat the situation as an enh> emergency. Ed the log device removal support is only good for adding a slog to try it out, then changing your mind and removing the slog (which was not possible before). It doesn't change the reliability situation one bit: pools with dead slogs are not importable. There've been threads on this for a while. It's well-discussed because it's an example of IMHO broken process of ``obviously a critical requirement but not technically part of the original RFE which is already late,'' as well as a dangerous pitfall for ZFS admins. I imagine the process works well in other cases to keep stuff granular enough that it can be prioritized effectively, but in this case it's made the slog feature significantly incomplete for a couple years and put many production systems in a precarious spot, and the whole mess was predicted before the slog feature was integrated. >> The on-disk log (slog or otherwise), if I understand right, can >> actually make the filesystem recover to a crash-INconsistent >> state enh> You're speaking the opposite of common sense. Yeah, I'm doing it on purpose to suggest that just guessing how you feel things ought to work based on vague notions of economy isn't a good idea. enh> If disabling the ZIL makes the system faster *and* less prone enh> to data corruption, please explain why we don't all disable enh> the ZIL? I said complying with fsync can make the system recover to a state not equal to one you might have hypothetically snapshotted in a moment leading up to the crash. Elsewhere I might've said disabling the ZIL does not make the system more prone to data corruption, *iff* you are not an NFS server. 
If you are, disabling the ZIL can lead to lost writes if an NFS server reboots and an NFS client does not, which can definitely cause app-level data corruption. Disabling the ZIL breaks the D requirement of ACID databases, which might screw up apps that replicate or keep databases on several separate servers in sync, and it might lead to lost mail on an MTA. But because, unlike non-COW filesystems, it costs nothing extra for ZFS to preserve write ordering even without fsync(), AIUI you will not get corrupted application-level data by disabling the ZIL. You just get missing data that the app has a right to expect should be there.

The dire warnings written by kernel developers in the wikis of ``don't EVER disable the ZIL'' are totally ridiculous and inappropriate IMO. I think they probably just worked really hard to write the ZIL piece of ZFS, and don't want people telling their brilliant code to fuckoff just because it makes things a little slower. So we get all this ``enterprise'' snobbery and so on.

``Crash consistent'' is a technical term, not a common-sense term, and I may have used it incorrectly:

http://oraclestorageguy.typepad.com/oraclestorageguy/2007/07/why-emc-technol.html

With a system that loses power on which fsync() had been in use, the files getting fsync()'ed will probably recover to more recent versions than the rest of the files, which means the recovered state achieved by yanking the cord couldn't have been emulated by cloning a snapshot and not actually having lost power. However, the app calling fsync() will expect this, so it's not supposed to lead to application-level inconsistency.
If you test your app's recovery ability in just that way, by cloning snapshots of filesystems on which the app is actively writing and then seeing if the app can recover the clone, then you're unfortunately not testing the app quite hard enough if fsync() is involved, so yeah, I guess disabling the ZIL might in theory make incorrectly-written apps less prone to data corruption. Likewise, no testing of the app on ZFS will be aggressive enough to make the app powerfail-proof on a non-COW POSIX system, because ZFS keeps more ordering than the API actually guarantees to the app. I'm repeating myself though. I wish you'd just read my posts with at least paragraph granularity instead of picking out individual sentences and discarding everything that seems too complicated or too awkwardly stated.

I'm basing this all on the ``common sense'' that to do otherwise, fsync() would have to completely ignore its file descriptor argument. It'd have to copy the entire in-memory ZIL to the slog and behave the same as 'lockfs -fa', which I think would perform too badly compared to non-ZFS filesystems' fsync()s, and would lead to emphatic performance advice like ``segregate files that get lots of fsync()s into separate ZFS datasets from files that get high write bandwidth,'' and we don't have advice like that in the blogs/lists/wikis, which makes me think it's not beneficial (the benefit would be dramat
Re: [zfs-discuss] is this pool recoverable?
Thanks, that worked!! It needed "-Ff". The pool has been recovered with minimal loss of data.
Re: [zfs-discuss] dedup and memory/l2arc requirements
On Apr 1, 2010, at 5:39 PM, Roy Sigurd Karlsbakk wrote:
> Hi all
>
> I've been told (on #opensolaris, irc.freenode.net) that opensolaris needs a
> lot of memory and/or l2arc for dedup to function properly. How much memory
> or l2arc should I get for a 12TB zpool (8x2TB in RAIDz2), and then, how
> much for 125TB (after RAIDz2 overhead)? Is there a function into which I
> can plug my recordsize and volume size to get the appropriate numbers?

You can estimate the amount of disk space needed for the deduplication table, and the expected deduplication ratio, by using "zdb -S poolname" on your existing pool. Be patient; for an existing pool with lots of objects, this can take some time to run.

# ptime zdb -S zwimming
Simulated DDT histogram:

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    2.27M    239G    188G    194G    2.27M    239G    188G    194G
     2     327K   34.3G   27.8G   28.1G     698K   73.3G   59.2G   59.9G
     4    30.1K   2.91G   2.10G   2.11G     152K   14.9G   10.6G   10.6G
     8    7.73K    691M    529M    529M    74.5K   6.25G   4.79G   4.80G
    16      673   43.7M   25.8M   25.9M    13.1K    822M    492M    494M
    32      197   12.3M   7.02M   7.03M    7.66K    480M    269M    270M
    64       47   1.27M    626K    626K    3.86K    103M   51.2M   51.2M
   128       22    908K    250K    251K    3.71K    150M   40.3M   40.3M
   256        7    302K     48K   53.7K    2.27K   88.6M   17.3M   19.5M
   512        4    131K   7.50K   7.75K    2.74K    102M   5.62M   5.79M
    2K        1      2K      2K      2K    3.23K   6.47M   6.47M   6.47M
    8K        1    128K      5K      5K    13.9K   1.74G   69.5M   69.5M
 Total    2.63M    277G    218G    225G    3.22M    337G    263G    270G

dedup = 1.20, compress = 1.28, copies = 1.03, dedup * compress / copies = 1.50

real     8:02.391932786
user     1:24.231855093
sys        15.193256108

In this file system, 2.63 million blocks are allocated. The in-core size of a DDT entry is approximately 250 bytes. So the math is pretty simple:

in-core size = 2.63M * 250 = 657.5 MB

If your dedup ratio is 1.0, then this number will scale linearly with size. If the dedup ratio is > 1.0, then this number will not scale linearly; it will be less. So you can use the linear scale as a worst-case approximation.
-- richard
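The rule of thumb above is simple multiplication, and it helps to have it as a one-liner you can rerun with your own block count (a sketch of Richard's estimate; the 250-bytes-per-entry figure is his approximation, not an exact on-disk structure size):

```python
DDT_ENTRY_BYTES = 250  # approximate in-core bytes per DDT entry (estimate)

def ddt_in_core_mb(allocated_blocks):
    """Worst-case in-core DDT size in MB (decimal) for a pool with the
    given number of allocated (unique) blocks."""
    return allocated_blocks * DDT_ENTRY_BYTES / 1e6

# The zwimming example: 2.63M allocated blocks
print(ddt_in_core_mb(2.63e6))  # 657.5
```

Plug in the "Total" allocated-blocks figure from your own "zdb -S" output to size RAM or L2ARC before enabling dedup.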
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> If my new replacement SSD with identical part number and firmware is 0.001
> GB smaller than the original and hence unable to mirror, what's to prevent
> the same thing from happening to one of my 1TB spindle disk mirrors?

There is a standard for sizes that many manufacturers use (IDEMA LBA1-02):

LBA count = 97696368 + (1953504 * (Desired Capacity in GB - 50.0))

Sizes should match exactly if the manufacturer follows the standard.

See:
http://opensolaris.org/jive/message.jspa?messageID=393336#393336
http://www.idema.org/_smartsite/modules/local/data_file/show_file.php?cmd=download&data_file_id=1066
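As a sanity check, the IDEMA formula quoted above can be evaluated directly (a hypothetical sketch; 512-byte LBAs assumed, and as noted elsewhere in the thread, the formula only applies to drives of 50 GB and up):

```python
def idema_lba_count(capacity_gb):
    """LBA count per the IDEMA LBA1-02 formula for a drive of the given
    nominal capacity in decimal GB."""
    if capacity_gb < 50:
        raise ValueError("IDEMA LBA1-02 applies to drives of 50 GB and up")
    return 97696368 + 1953504 * (capacity_gb - 50)

# A nominal 1000 GB (1 TB) drive:
lbas = idema_lba_count(1000)
print(lbas, lbas * 512)  # 1953525168 LBAs, ~1.0 TB in bytes
```

Two conforming "1 TB" drives from different vendors should both report exactly this LBA count, which is why same-size replacement normally just works.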
Re: [zfs-discuss] is this pool recoverable?
On Fri, 2 Apr 2010, Patrick Tiquet wrote:
> I tried booting with b134 to attempt to recover the pool. I attempted with
> one disk of the mirror. Zpool tells me to use -F for import, fails, but
> then tells me to use -f, which also fails and tells me to use -F again.
> Any thoughts?

It looks like it wants you to use both -f and -F at the same time. I don't see that you tried that. Good luck.

Bob

-- Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 02/04/2010 16:04, casper@sun.com wrote:
> sync() is actually *async*, and returning from sync() says nothing about

To clarify - in the case of ZFS, sync() is actually synchronous.

-- Robert Milkowski
http://milek.blogspot.com
Re: [zfs-discuss] is this pool recoverable?
I tried booting with b134 to attempt to recover the pool. I attempted with one disk of the mirror. Zpool tells me to use -F for import, fails, but then tells me to use -f, which also fails and tells me to use -F again. Any thoughts?

j...@opensolaris:~# zpool import
  pool: atomfs
    id: 1344695315736882
 state: FAULTED
status: The pool was last accessed by another system.
action: The pool cannot be imported due to damaged devices or data.
        The pool may be active on another system, but can be imported using
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        atomfs      FAULTED  corrupted data
          mirror-0  FAULTED  corrupted data
            c4t5d0  ONLINE
            c9d0    UNAVAIL  cannot open

j...@opensolaris:~# zpool import -f
  pool: atomfs
    id: 1344695315736882
 state: FAULTED
status: The pool was last accessed by another system.
action: The pool cannot be imported due to damaged devices or data.
        The pool may be active on another system, but can be imported using
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        atomfs      FAULTED  corrupted data
          mirror-0  FAULTED  corrupted data
            c4t5d0  ONLINE
            c9d0    UNAVAIL  cannot open

j...@opensolaris:~# zpool import -f 1344695315736882 newpool
cannot import 'atomfs' as 'newpool': one or more devices is currently
unavailable. Recovery is possible, but will result in some data loss.
Returning the pool to its state as of March 12, 2010 09:08:29 AM PST
should correct the problem. Recovery can be attempted by executing
'zpool import -F atomfs'. A scrub of the pool is strongly recommended
after recovery.

j...@opensolaris:~# zpool import -F atomfs
cannot import 'atomfs': pool may be in use from other system, it was last
accessed by blue (hostid: 0x82aa00) on Fri Mar 12 09:08:29 2010
use '-f' to import anyway

j...@opensolaris:~# zpool status
no pools available

j...@opensolaris:~# zpool import -f 1344695315736882
cannot import 'atomfs': one or more devices is currently unavailable.
Recovery is possible, but will result in some data loss.
Returning the pool to its state as of March 12, 2010 09:08:29 AM PST should correct the problem. Recovery can be attempted by executing 'zpool import -F atomfs'. A scrub of the pool is strongly recommended after recovery. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [install-discuss] Installing Opensolaris without ZFS?
I doubt it. ZFS is meant to be used for large systems, in which memory is not an issue.

Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.

- "ольга крыжановская" wrote:
> Are there plans to reduce the memory usage of ZFS in the near future?
>
> Olga
>
> 2010/4/2 Alan Coopersmith :
> > ольга крыжановская wrote:
> >> Does Opensolaris have an option to install without ZFS, i.e. use UFS
> >> for root like SXCE did?
> >
> > No. beadm & pkg image-update rely on ZFS functionality for the root
> > filesystem.
> >
> > --
> > -Alan Coopersmith- alan.coopersm...@oracle.com
> > Oracle Solaris Platform Engineering: X Window System
>
> --
> , __ ,
> { \/`o;-Olga Kryzhanovska -;o`\/ }
> .'-/`-/ olga.kryzhanov...@gmail.com \-`\-'.
> `'-..-| / Solaris/BSD//C/C++ programmer \ |-..-'`
> /\/\ /\/\
> `--` `--`
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] dedup and memory/l2arc requirements
Hi all

I've been told (on #opensolaris, irc.freenode.net) that OpenSolaris needs a lot of memory and/or L2ARC for dedup to function properly. How much memory or L2ARC should I get for a 12TB zpool (8x2TB in RAIDz2), and then, how much for 125TB (after RAIDz2 overhead)? Is there a function into which I can plug my recordsize and volume size to get the appropriate numbers?

Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
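[Editor's sketch] The rule of thumb used with "zdb -S" elsewhere in this thread is roughly 250 bytes of in-core DDT per unique block. As a hedged sketch of the arithmetic Roy is asking for (the 250-byte figure and the worst-case assumption that every block is unique and full recordsize are approximations, not official ZFS sizing guidance):

```python
def ddt_ram_bytes(pool_bytes, recordsize=128 * 1024, bytes_per_entry=250):
    """Worst-case in-core DDT size: one ~250-byte entry per unique block.

    Assumes every block is unique and written at full recordsize;
    both assumptions are pessimistic, so treat this as an upper bound.
    """
    blocks = pool_bytes // recordsize
    return blocks * bytes_per_entry

tib = 1024 ** 4
print(ddt_ram_bytes(12 * tib) / 1024 ** 3)  # GiB for 12 TiB of data -> 23.4375
```

By this estimate, a 12 TB pool at the default 128 KiB recordsize needs on the order of 23 GiB of RAM/L2ARC in the worst case; pools with larger average block sizes, or real duplication, need less.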
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Fri, Apr 2, 2010 at 8:03 AM, Edward Ned Harvey wrote: >> > Seriously, all disks configured WriteThrough (spindle and SSD disks >> > alike) >> > using the dedicated ZIL SSD device, very noticeably faster than >> > enabling the >> > WriteBack. >> >> What do you get with both SSD ZIL and WriteBack disks enabled? >> >> I mean if you have both why not use both? Then both async and sync IO >> benefits. > > Interesting, but unfortunately false. Soon I'll post the results here. I > just need to package them in a way suitable to give the public, and stick it > on a website. But I'm fighting IT fires for now and haven't had the time > yet. > > Roughly speaking, the following are approximately representative. Of course > it varies based on tweaks of the benchmark and stuff like that. > Stripe 3 mirrors write through: 450-780 IOPS > Stripe 3 mirrors write back: 1030-2130 IOPS > Stripe 3 mirrors write back + SSD ZIL: 1220-2480 IOPS > Stripe 3 mirrors write through + SSD ZIL: 1840-2490 IOPS > > Overall, I would say WriteBack is 2-3 times faster than naked disks. SSD > ZIL is 3-4 times faster than naked disk. And for some reason, having the > WriteBack enabled while you have SSD ZIL actually hurts performance by > approx 10%. You're better off to use the SSD ZIL with disks in Write > Through mode. > > That result is surprising to me. But I have a theory to explain it. When > you have WriteBack enabled, the OS issues a small write, and the HBA > immediately returns to the OS: "Yes, it's on nonvolatile storage." So the > OS quickly gives it another, and another, until the HBA write cache is full. > Now the HBA faces the task of writing all those tiny writes to disk, and the > HBA must simply follow orders, writing a tiny chunk to the sector it said it > would write, and so on. The HBA cannot effectively consolidate the small > writes into a larger sequential block write. 
But if you have the WriteBack > disabled, and you have a SSD for ZIL, then ZFS can log the tiny operation on > SSD, and immediately return to the process: "Yes, it's on nonvolatile > storage." So the application can issue another, and another, and another. > ZFS is smart enough to aggregate all these tiny write operations into a > single larger sequential write before sending it to the spindle disks. Hmm, when you did the write-back test was the ZIL SSD included in the write-back? What I was proposing was write-back only on the disks, and ZIL SSD with no write-back. Not all operations hit the ZIL, so it would still be nice to have the non-ZIL operations return quickly. -Ross ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
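[Editor's sketch] The rough shape of IOPS numbers like the ones Edward quotes can be reproduced with a crude fsync micro-benchmark. This is only an illustration (the parameters are arbitrary, and it is not the benchmark Edward used):

```python
import os
import tempfile
import time

def sync_write_iops(n=200, size=4096):
    """Time n write()+fsync() pairs on a temp file and return IOPS.

    fsync() forces each small write to stable storage (on ZFS this
    is the path the ZIL accelerates), so the result approximates
    small synchronous-write IOPS on whatever backs the temp dir.
    """
    buf = b"\0" * size
    fd, path = tempfile.mkstemp()
    try:
        t0 = time.perf_counter()
        for _ in range(n):
            os.write(fd, buf)
            os.fsync(fd)  # block until the write is durable
        return n / (time.perf_counter() - t0)
    finally:
        os.close(fd)
        os.unlink(path)
```

Running this against a pool with and without a slog (and with the HBA cache toggled) is a quick way to sanity-check results like those above.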
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Apr 2, 2010, at 5:08 AM, Edward Ned Harvey wrote: >> I know it is way after the fact, but I find it best to coerce each >> drive down to the whole GB boundary using format (create Solaris >> partition just up to the boundary). Then if you ever get a drive a >> little smaller it still should fit. > > It seems like it should be unnecessary. It seems like extra work. But > based on my present experience, I reached the same conclusion. > > If my new replacement SSD with identical part number and firmware is 0.001 > Gb smaller than the original and hence unable to mirror, what's to prevent > the same thing from happening to one of my 1TB spindle disk mirrors? > Nothing. That's what. > > I take it back. Me. I am to prevent it from happening. And the technique > to do so is precisely as you've said. First slice every drive to be a > little smaller than actual. Then later if I get a replacement device for > the mirror, that's slightly smaller than the others, I have no reason to > care. However, I believe there are some downsides to letting ZFS manage just a slice rather than an entire drive, but perhaps those do not apply as significantly to SSD devices? Thanks -- Stuart Anderson ander...@ligo.caltech.edu http://www.ligo.caltech.edu/~anderson ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Fri, 2 Apr 2010, Edward Ned Harvey wrote:

> were taking place at the same time. That is, if two processes both complete
> a write operation at the same time, one in sync mode and the other in async
> mode, then it is guaranteed the data on disk will never have the async data
> committed before the sync data.
>
> Based on this understanding, if you disable ZIL, then there is no guarantee
> about order of writes being committed to disk. Neither of the above
> guarantees is valid anymore. Sync writes may be completed out of order.
> Async writes that supposedly happened after sync writes may be committed to
> disk before the sync writes.

You seem to be assuming that Solaris is an incoherent operating system. With ZFS, the filesystem in memory is coherent, and transaction groups are constructed in simple chronological order (capturing combined changes up to that point in time), without regard to SYNC options. The only possible exception to the coherency is for memory mapped files, where the mapped memory is a copy of data (originally) from the ZFS ARC and needs to be reconciled with the ARC if an application has dirtied it. This differs from UFS and the way Solaris worked prior to Solaris 10.

Synchronous writes are not "faster" than asynchronous writes. If you drop heavy and light objects from the same height, they fall at the same rate. This was proven long ago.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Fri, 2 Apr 2010, Edward Ned Harvey wrote:

> So you're saying that while the OS is building txg's to write to disk, the
> OS will never reorder the sequence in which individual write operations get
> ordered into the txg's. That is, an application performing a small sync
> write, followed by a large async write, will never have the second
> operation flushed to disk before the first. Can you support this belief in
> any way?

I am like a "pool" or "tank" of regurgitated zfs knowledge. I simply pay attention when someone who really knows explains something (e.g. Neil Perrin, as Casper referred to) so I can regurgitate it later. I try to do so faithfully. If I had behaved this way in school, I would have been a good student. Sometimes I am wrong, or the design has somewhat changed since the original information was provided.

There are indeed popular filesystems (e.g. Linux EXT4) which write data to disk in a different order than chronologically requested, so it is good that you are paying attention to these issues.

While, in the slog-based recovery scenario, it is possible for a TXG to be generated which lacks async data, this only happens after a system crash; if all of the critical data is written as a sync request, it will be faithfully preserved.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Fri, Apr 2, 2010 at 16:24, Edward Ned Harvey wrote:
>> The purpose of the ZIL is to act like a fast "log" for synchronous
>> writes. It allows the system to quickly confirm a synchronous write
>> request with the minimum amount of work.
>
> Bob and Casper and some others clearly know a lot here. But I'm hearing
> conflicting information, and don't know what to believe. Does anyone here
> work on ZFS as an actual ZFS developer for Sun/Oracle? Can claim "I can
> answer this question, I wrote that code, or at least have read it?"
>
> Questions to answer would be:
>
> Is a ZIL log device used only by sync() and fsync() system calls? Is it
> ever used to accelerate async writes?

sync() will tell the filesystems to flush writes to disk. sync() will not use the ZIL; it will just start a new TXG, and could return before the writes are done. fsync() is what you are interested in.

> Suppose there is an application which sometimes does sync writes, and
> sometimes async writes. In fact, to make it easier, suppose two processes
> open two files, one of which always writes asynchronously, and one of which
> always writes synchronously. Suppose the ZIL is disabled. Is it possible
> for writes to be committed to disk out-of-order? Meaning, can a large block
> async write be put into a TXG and committed to disk before a small sync
> write to a different file is committed to disk, even though the small sync
> write was issued by the application before the large async write? Remember,
> the point is: ZIL is disabled. Question is whether the async could
> possibly be committed to disk before the sync.

Writes from a TXG will not be used until the whole TXG is committed to disk. Everything from a half-written TXG will be ignored after a crash. This means that the order of writes within a TXG is not important. The only way to do a sync write without the ZIL is to start a new TXG after the write. That costs a lot, so we have the ZIL for sync writes.
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
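[Editor's sketch] The sync vs. async distinction being debated can be made concrete from the application side. A minimal, hedged illustration (the file names are arbitrary; os.O_DSYNC is the POSIX flag as exposed by Python on Solaris and Linux; whether the synchronous write actually goes through a ZIL depends on the filesystem underneath):

```python
import os
import tempfile

tmpdir = tempfile.mkdtemp()

# Synchronous handle: O_DSYNC makes every write block until the data
# is on stable storage (on ZFS that is the path served by the ZIL).
sync_fd = os.open(os.path.join(tmpdir, "sync.dat"),
                  os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o644)

# Asynchronous handle: writes return once cached in memory and reach
# disk whenever the next transaction group (or page flush) commits.
async_fd = os.open(os.path.join(tmpdir, "async.dat"),
                   os.O_WRONLY | os.O_CREAT, 0o644)

os.write(sync_fd, b"ledger entry")   # blocks until durable
os.write(async_fd, b"scratch data")  # returns immediately

os.close(sync_fd)
os.close(async_fd)
```

An application mixing both handles is exactly the scenario Edward describes: with the ZIL disabled, both kinds of writes simply ride the next TXG.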
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 4/2/2010 8:08 AM, Edward Ned Harvey wrote:
>> I know it is way after the fact, but I find it best to coerce each
>> drive down to the whole GB boundary using format (create Solaris
>> partition just up to the boundary). Then if you ever get a drive a
>> little smaller it still should fit.
>
> It seems like it should be unnecessary. It seems like extra work. But
> based on my present experience, I reached the same conclusion.
>
> If my new replacement SSD with identical part number and firmware is 0.001
> Gb smaller than the original and hence unable to mirror, what's to prevent
> the same thing from happening to one of my 1TB spindle disk mirrors?
> Nothing. That's what.

Actually, it's my experience that Sun (and other vendors) do exactly that for you when you buy their parts - at least for rotating drives; I have no experience with SSDs. The Sun disk label shipped on all the drives is set up to make the drive the standard size for that Sun part number. They have to do this since they (for many reasons) have many sources (diff. vendors, even diff. parts from the same vendor) for the actual disks they use for a particular Sun part number. This isn't new; I believe IBM, EMC, HP, etc. all do it also for the same reasons. I'm a little surprised that the engineers would suddenly stop doing it only on SSDs. But who knows.

-Kyle

> I take it back. Me. I am to prevent it from happening. And the technique
> to do so is precisely as you've said. First slice every drive to be a
> little smaller than actual. Then later if I get a replacement device for
> the mirror, that's slightly smaller than the others, I have no reason to
> care.
>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
>Questions to answer would be:
>
>Is a ZIL log device used only by sync() and fsync() system calls? Is it
>ever used to accelerate async writes?

There are quite a few "sync" writes, specifically when you mix in the NFS server.

>Suppose there is an application which sometimes does sync writes, and
>sometimes async writes. In fact, to make it easier, suppose two processes
>open two files, one of which always writes asynchronously, and one of which
>always writes synchronously. Suppose the ZIL is disabled. Is it possible
>for writes to be committed to disk out-of-order? Meaning, can a large block
>async write be put into a TXG and committed to disk before a small sync
>write to a different file is committed to disk, even though the small sync
>write was issued by the application before the large async write? Remember,
>the point is: ZIL is disabled. Question is whether the async could
>possibly be committed to disk before the sync.

From what I quoted from the other discussion, it seems that later writes cannot be committed in an earlier TXG than your sync write or other earlier writes.

>I make the assumption that an uberblock is the term for a TXG after it is
>committed to disk. Correct?

The "uberblock" is the "root of all the data". All the data in a ZFS pool is referenced by it; after the txg is in stable storage, then the uberblock is updated.

>At boot time, or "zpool import" time, what is taken to be "the current
>filesystem?" The latest uberblock? Something else?

The current "zpool" and the filesystems as referenced by the last uberblock.

>My understanding is that enabling a dedicated ZIL device guarantees sync()
>and fsync() system calls block until the write has been committed to
>nonvolatile storage, and attempts to accelerate by using a physical device
>which is faster or more idle than the main storage pool. My understanding
>is that this provides two implicit guarantees: (1) sync writes are always
>guaranteed to be committed to disk in order, relevant to other sync writes.
>(2) In the event of OS halting or ungraceful shutdown, sync writes committed
>to disk are guaranteed to be equal or greater than the async writes that
>were taking place at the same time. That is, if two processes both complete
>a write operation at the same time, one in sync mode and the other in async
>mode, then it is guaranteed the data on disk will never have the async data
>committed before the sync data.

sync() is actually *async*, and returning from sync() says nothing about stable storage. After fsync() returns, it signals that all the data is in stable storage (except if you disable the ZIL), or, apparently, in Linux when the write caches for your disks are enabled (the default for PC drives). ZFS doesn't care about the write cache; it makes sure it is flushed. (There's fsync() and open(..., O_DSYNC|O_SYNC).)

>Based on this understanding, if you disable ZIL, then there is no guarantee
>about order of writes being committed to disk. Neither of the above
>guarantees is valid anymore. Sync writes may be completed out of order.
>Async writes that supposedly happened after sync writes may be committed to
>disk before the sync writes.
>
>Somebody, (Casper?) said it before, and now I'm starting to realize ... This
>is also true of the snapshots. If you disable your ZIL, then there is no
>guarantee your snapshots are consistent either. Rolling back doesn't
>necessarily gain you anything.
>
>The only way to guarantee consistency in the snapshot is to always
>(regardless of ZIL enabled/disabled) give priority for sync writes to get
>into the TXG before async writes.
>
>If the OS does give priority for sync writes going into TXG's before async
>writes (even with ZIL disabled), then after spontaneous ungraceful reboot,
>the latest uberblock is guaranteed to be consistent.

I believe that the writes are still ordered, so the consistency you want is actually delivered even without the ZIL enabled.

Casper
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
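[Editor's sketch] Casper's distinction between sync() and fsync() is easy to demonstrate from user space. A hedged illustration (POSIX only requires sync() to schedule the flush, so only the fsync() line is a durability guarantee):

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"payload")

os.sync()     # schedules a flush of all filesystems; POSIX allows it
              # to return before anything has reached stable storage
os.fsync(fd)  # blocks until this file's data is durable

os.close(fd)
```

This is why "the application called sync()" is not, by itself, an argument that the data survived a crash; fsync() (or O_DSYNC) is what carries the guarantee.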
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> The purpose of the ZIL is to act like a fast "log" for synchronous > writes. It allows the system to quickly confirm a synchronous write > request with the minimum amount of work. Bob and Casper and some others clearly know a lot here. But I'm hearing conflicting information, and don't know what to believe. Does anyone here work on ZFS as an actual ZFS developer for Sun/Oracle? Can claim "I can answer this question, I wrote that code, or at least have read it?" Questions to answer would be: Is a ZIL log device used only by sync() and fsync() system calls? Is it ever used to accelerate async writes? Suppose there is an application which sometimes does sync writes, and sometimes async writes. In fact, to make it easier, suppose two processes open two files, one of which always writes asynchronously, and one of which always writes synchronously. Suppose the ZIL is disabled. Is it possible for writes to be committed to disk out-of-order? Meaning, can a large block async write be put into a TXG and committed to disk before a small sync write to a different file is committed to disk, even though the small sync write was issued by the application before the large async write? Remember, the point is: ZIL is disabled. Question is whether the async could possibly be committed to disk before the sync. I make the assumption that an uberblock is the term for a TXG after it is committed to disk. Correct? At boot time, or "zpool import" time, what is taken to be "the current filesystem?" The latest uberblock? Something else? My understanding is that enabling a dedicated ZIL device guarantees sync() and fsync() system calls block until the write has been committed to nonvolatile storage, and attempts to accelerate by using a physical device which is faster or more idle than the main storage pool. My understanding is that this provides two implicit guarantees: (1) sync writes are always guaranteed to be committed to disk in order, relevant to other sync writes. 
(2) In the event of OS halting or ungraceful shutdown, sync writes committed to disk are guaranteed to be equal or greater than the async writes that were taking place at the same time. That is, if two processes both complete a write operation at the same time, one in sync mode and the other in async mode, then it is guaranteed the data on disk will never have the async data committed before the sync data. Based on this understanding, if you disable ZIL, then there is no guarantee about order of writes being committed to disk. Neither of the above guarantees is valid anymore. Sync writes may be completed out of order. Async writes that supposedly happened after sync writes may be committed to disk before the sync writes. Somebody, (Casper?) said it before, and now I'm starting to realize ... This is also true of the snapshots. If you disable your ZIL, then there is no guarantee your snapshots are consistent either. Rolling back doesn't necessarily gain you anything. The only way to guarantee consistency in the snapshot is to always (regardless of ZIL enabled/disabled) give priority for sync writes to get into the TXG before async writes. If the OS does give priority for sync writes going into TXG's before async writes (even with ZIL disabled), then after spontaneous ungraceful reboot, the latest uberblock is guaranteed to be consistent. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> Only a broken application uses sync writes > sometimes, and async writes at other times. Suppose there is a virtual machine, with virtual processes inside it. Some virtual process issues a sync write to the virtual OS, meanwhile another virtual process issues an async write. Then the virtual OS will sometimes issue sync writes and sometimes async writes to the host OS. Are you saying this makes qemu, and vbox, and vmware "broken applications?" ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> >Dude, don't be so arrogant. Acting like you know what I'm talking > about > >better than I do. Face it that you have something to learn here. > > You may say that, but then you post this: Acknowledged. I read something arrogant, and I replied even more arrogant. That was dumb of me. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
>So you're saying that while the OS is building txg's to write to disk, the
>OS will never reorder the sequence in which individual write operations get
>ordered into the txg's. That is, an application performing a small sync
>write, followed by a large async write, will never have the second operation
>flushed to disk before the first. Can you support this belief in any way?

The question is not how the writes are ordered but whether an earlier write can be in a later txg. A transaction group is committed atomically.

In http://arc.opensolaris.org/caselog/PSARC/2010/108/mail I ask a similar question to make sure I understand it correctly, and the answer was ("> = Casper"; the answers are from Neil Perrin):

> Is there a partial order defined for all filesystem operations?

File system operations will be written in order for all settings of the sync flag.

> Specifically, will ZFS guarantee that when fsync()/O_DATA happens on a
> file,

(I assume by O_DATA you meant O_DSYNC.)

> that later transactions will not be in an earlier transaction group?
> (Or is this already the case?)

This is already the case.

So what I assumed was true, but what you made me doubt, was apparently still true: later transactions cannot be committed in an earlier txg.

>If that's true, if there's no increased risk of data corruption, then why
>doesn't everybody just disable their ZIL all the time on every system?

For an application running on the file server, there is no difference. When the system panics, you know that data might be lost. The application also dies. (The snapshot and the last valid uberblock are equally valid.)

But for an application on an NFS client, without the ZIL data will be lost while the NFS client believes the data is written, and it will not try again. With the ZIL, when the NFS server says that data is written, then it is actually on stable storage.

>The reason to have a sync() function in C/C++ is so you can ensure data is
>written to disk before you move on. It's a blocking call, that doesn't
>return until the sync is completed. The only reason you would ever do this
>is if order matters. If you cannot allow the next command to begin until
>after the previous one was completed. Such is the situation with databases
>and sometimes virtual machines.

So the question is: when will your data be invalid?

What happens with the data when the system dies before the fsync() call? What happens with the data when the system dies after the fsync() call? What happens with the data when the system dies after more I/O operations?

With the ZIL disabled, you call fsync() but you may encounter data from before the call to fsync(). That could happen anyway, so I assume you can actually recover from that situation.

Casper
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
>> > http://nfs.sourceforge.net/ >> >> I think B4 is the answer to Casper's question: > >We were talking about ZFS, and under what circumstances data is flushed to >disk, in what way "sync" and "async" writes are handled by the OS, and what >happens if you disable ZIL and lose power to your system. > >We were talking about C/C++ sync and async. Not NFS sync and async. I don't think so. http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg36783.html (This discussion was started, I think, in the context of NFS performance) Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> If you have zpool less than version 19 (when ability to remove log > device > was introduced) and you have a non-mirrored log device that failed, you > had > better treat the situation as an emergency. > Instead, do "man zpool" and look for "zpool > remove." > If it says "supports removing log devices" then you had better use it > to > remove your log device. If it says "only supports removing hotspares > or > cache" then your zpool is lost permanently. I take it back. If you lost your log device on a zpool which is less than version 19, then you *might* have a possible hope if you migrate your disks to a later system. You *might* be able to "zpool import" on a later version of OS. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> ZFS recovers to a crash-consistent state, even without the slog, > meaning it recovers to some state through which the filesystem passed > in the seconds leading up to the crash. This isn't what UFS or XFS > do. > > The on-disk log (slog or otherwise), if I understand right, can > actually make the filesystem recover to a crash-INconsistent state (a You're speaking the opposite of common sense. If disabling the ZIL makes the system faster *and* less prone to data corruption, please explain why we don't all disable the ZIL? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> hello > > i have had this problem this week. our zil ssd died (apt slc ssd 16gb). > because we had no spare drive in stock, we ignored it. > > then we decided to update our nexenta 3 alpha to beta, exported the > pool and made a fresh install to have a clean system and tried to > import the pool. we only got a error message about a missing drive. > > we googled about this and it seems there is no way to acces the pool > !!! > (hope this will be fixed in future) > > we had a backup and the data are not so important, but that could be a > real problem. > you have a valid zfs3 pool and you cannot access your data due to > missing zil. If you have zpool less than version 19 (when ability to remove log device was introduced) and you have a non-mirrored log device that failed, you had better treat the situation as an emergency. Normally you can find your current zpool version by doing "zpool upgrade," but you cannot now if you're in this failure state. Do not attempt "zfs send" or "zfs list" or any other zpool or zfs command. Instead, do "man zpool" and look for "zpool remove." If it says "supports removing log devices" then you had better use it to remove your log device. If it says "only supports removing hotspares or cache" then your zpool is lost permanently. If you are running Solaris, take it as given, you do not have zpool version 19. If you are running Opensolaris, I don't know at which point zpool 19 was introduced. Your only hope is to "zpool remove" the log device. Use tar or cp or something, to try and salvage your data out of there. Your zpool is lost and if it's functional at all right now, it won't stay that way for long. Your system will soon hang, and then you will not be able to import your pool. Ask me how I know. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> > I am envisioning a database, which issues a small sync write, > followed by a > > larger async write. Since the sync write is small, the OS would > prefer to > > defer the write and aggregate into a larger block. So the > possibility of > > the later async write being committed to disk before the older sync > write is > > a real risk. The end result would be inconsistency in my database > file. > > Zfs writes data in transaction groups and each bunch of data which > gets written is bounded by a transaction group. The current state of > the data at the time the TXG starts will be the state of the data once > the TXG completes. If the system spontaneously reboots then it will > restart at the last completed TXG so any residual writes which might > have occured while a TXG write was in progress will be discarded. > Based on this, I think that your ordering concerns (sync writes > getting to disk "faster" than async writes) are unfounded for normal > file I/O. So you're saying that while the OS is building txg's to write to disk, the OS will never reorder the sequence in which individual write operations get ordered into the txg's. That is, an application performing a small sync write, followed by a large async write, will never have the second operation flushed to disk before the first. Can you support this belief in any way? If that's true, if there's no increased risk of data corruption, then why doesn't everybody just disable their ZIL all the time on every system? The reason to have a sync() function in C/C++ is so you can ensure data is written to disk before you move on. It's a blocking call, that doesn't return until the sync is completed. The only reason you would ever do this is if order matters. If you cannot allow the next command to begin until after the previous one was completed. Such is the situation with databases and sometimes virtual machines. 
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> > http://nfs.sourceforge.net/
>
> I think B4 is the answer to Casper's question:

We were talking about ZFS, and under what circumstances data is flushed to disk; in what way "sync" and "async" writes are handled by the OS; and what happens if you disable the ZIL and lose power to your system.

We were talking about C/C++ sync and async, not NFS sync and async. I don't think anything relating to NFS is the answer to Casper's question, or else Casper was simply jumping context by asking it. Don't get me wrong, I have no objection to his question; it's just that the conversation has derailed, and now people are talking about NFS sync/async instead of what happens when a C/C++ application does sync/async writes with the ZIL disabled.
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
When we use one vmod, both machines finish in about 6min45; zilstat maxes out at about 4200 IOPS. Using four vmods it takes about 6min55, and zilstat maxes out at 2200 IOPS.

Can you try 4 concurrent tars to four different ZFS filesystems (same pool)?

-r
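A throwaway harness for the suggested test might look like the sketch below. Paths and the archive are invented here; on the real system each target directory would be a separate ZFS filesystem in the same pool (e.g. tank/fs1 through tank/fs4), and you would watch zilstat while it runs:

```shell
#!/bin/sh
# Sketch: run four tar extractions concurrently, one per target directory.
# Plain temp directories stand in for the four ZFS filesystems.
set -e
WORK=$(mktemp -d)

# Build a small archive to extract (stand-in for the real test data).
mkdir "$WORK/src"
echo hello > "$WORK/src/file.txt"
tar cf "$WORK/test.tar" -C "$WORK" src

# Extract into four targets in parallel.
for i in 1 2 3 4; do
  mkdir "$WORK/fs$i"
  tar xf "$WORK/test.tar" -C "$WORK/fs$i" &
done
wait    # all four extractions run concurrently

ls "$WORK"/fs*/src/file.txt
rm -rf "$WORK"
```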
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> I know it is way after the fact, but I find it best to coerce each
> drive down to a whole-GB boundary using format (create a Solaris
> partition just up to the boundary). Then if you ever get a drive a
> little smaller, it still should fit.

It seems like it should be unnecessary. It seems like extra work. But based on my present experience, I reached the same conclusion. If my new replacement SSD with identical part number and firmware is 0.001 GB smaller than the original, and hence unable to mirror, what's to prevent the same thing from happening to one of my 1TB spindle disk mirrors? Nothing. That's what. I take it back: me. I am to prevent it from happening. And the technique to do so is precisely as you've said. First slice every drive to be a little smaller than actual. Then later, if I get a replacement device for the mirror that's slightly smaller than the others, I have no reason to care.
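For the archives, the workflow being described is roughly the following. This is a hypothetical sketch: the device names (c1t2d0, c1t3d0), pool name, and the 930GB figure are all made up, and the exact slicing steps are done interactively inside format:

```
# Hypothetical sketch: label each disk with a slice a little smaller than
# the raw device, then mirror the slices instead of the whole disks.
format -e c1t2d0     # partition -> modify: size slice 0 to, say, 930GB
format -e c1t3d0     # repeat so both sides of the mirror match
zpool create tank mirror c1t2d0s0 c1t3d0s0
```

A later replacement that is a fraction of a GB smaller than the raw originals still fits the slice.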
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> > Seriously, all disks configured WriteThrough (spindle and SSD disks alike)
> > using the dedicated ZIL SSD device: very noticeably faster than enabling
> > the WriteBack.
>
> What do you get with both SSD ZIL and WriteBack disks enabled?
>
> I mean, if you have both, why not use both? Then both async and sync IO
> benefit.

Interesting, but unfortunately false. Soon I'll post the results here. I just need to package them in a way suitable to give the public, and stick them on a website. But I'm fighting IT fires for now and haven't had the time yet. Roughly speaking, the following are approximately representative. Of course it varies based on tweaks of the benchmark and stuff like that.

Stripe of 3 mirrors, write through: 450-780 IOPS
Stripe of 3 mirrors, write back: 1030-2130 IOPS
Stripe of 3 mirrors, write back + SSD ZIL: 1220-2480 IOPS
Stripe of 3 mirrors, write through + SSD ZIL: 1840-2490 IOPS

Overall, I would say WriteBack is 2-3 times faster than naked disks. SSD ZIL is 3-4 times faster than naked disks. And for some reason, having WriteBack enabled while you have the SSD ZIL actually hurts performance by approx 10%. You're better off using the SSD ZIL with the disks in WriteThrough mode.

That result is surprising to me, but I have a theory to explain it. When you have WriteBack enabled, the OS issues a small write, and the HBA immediately returns to the OS: "Yes, it's on nonvolatile storage." So the OS quickly gives it another, and another, until the HBA write cache is full. Now the HBA faces the task of writing all those tiny writes to disk, and the HBA must simply follow orders, writing each tiny chunk to the sector it said it would write, and so on. The HBA cannot effectively consolidate the small writes into a larger sequential block write. But if you have WriteBack disabled, and you have an SSD for ZIL, then ZFS can log the tiny operation on the SSD and immediately return to the process: "Yes, it's on nonvolatile storage."
So the application can issue another, and another, and another. ZFS is smart enough to aggregate all these tiny write operations into a single larger sequential write before sending it to the spindle disks.

Long story short, the evidence suggests that if you have an SSD ZIL, you're better off without WriteBack on the HBA. And I conjecture the reason is that ZFS can buffer writes better than the HBA can.
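The configuration being recommended, a dedicated slog with the HBA in write-through, is set up with zpool itself. A hypothetical sketch (pool and device names are made up):

```
# Hypothetical sketch: dedicate the SSD as a separate intent log (slog)
# device, leaving the HBA's write cache off for all members.
zpool add tank log c2t0d0
zpool status tank     # the SSD appears under a separate "logs" section
```

From then on, small synchronous writes land on the SSD log while ZFS aggregates them into large sequential writes to the mirrors at TXG commit.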
Re: [zfs-discuss] how can I remove files when the file system is full?
> On opensolaris? Did you try deleting any old BEs?

Don't forget to "zfs destroy rp...@snapshot"

In fact, you might start with destroying snapshots ... if there are any occupying space.
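A sketch of that workflow; the snapshot name here is hypothetical, and you should check what each snapshot is pinning before destroying it:

```
# List snapshots with the space each would free, largest last.
zfs list -t snapshot -o name,used -s used

# Then destroy the ones you can live without (name is made up).
zfs destroy rpool/ROOT/opensolaris@2010-03-01
```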
Re: [zfs-discuss] how can I remove files when the file system is full?
Thanks, Brandon. Now that the issue has gone away, I was able to recover my host.

-Eiji

> On Thu, Apr 1, 2010 at 1:39 PM, Eiji Ota < eiji@oracle.com > wrote:
>
> > Thanks. It worked, but the fs still says it's full. Is this normal, and will
> > I get some space back eventually (if I continue this)?
>
> You may need to destroy some snapshots before the space becomes available.
> "zfs list -t snapshot" will show approximately how much space will be freed
> for each snapshot.
>
> -B
> --
> Brandon High : bh...@freaks.com
Re: [zfs-discuss] how can I remove files when the file system is full?
Thanks. It worked, but the fs still says it's full. Is this normal, and will I get some space back eventually (if I continue this)?

# cat /dev/null >./messages.1
# cat /dev/null >./messages.0
# df -kl
Filesystem                     1K-blocks    Used Available Use% Mounted on
rpool/ROOT/opensolaris           4976123 4976123         0 100% /    <== available space is still 0
swap                            14218704     244  14218460   1% /etc/svc/volatile
/usr/lib/libc/libc_hwcap2.so.1   4976123 4976123         0 100% /lib/libc.so.1
swap                            14218600     140  14218460   1% /tmp
swap                            14218472      12  14218460   1% /var/run

-Eiji

> On Thu, Apr 1, 2010 at 12:46 PM, Eiji Ota < eiji@oracle.com > wrote:
>
> > # cd /var/adm
> > # rm messages.?
> > rm: cannot remove `messages.0': No space left on device
> > rm: cannot remove `messages.1': No space left on device
>
> I think doing "cat /dev/null > /var/adm/messages.1" will work.
>
> -B
> --
> Brandon High : bh...@freaks.com
Re: [zfs-discuss] how can I remove files when the file system is full?
On 04/ 1/10 01:46 PM, Eiji Ota wrote:

> During the IPS upgrade, the file system got full, and now I cannot do anything to recover it.
>
> # df -kl
> Filesystem                     1K-blocks    Used Available Use% Mounted on
> rpool/ROOT/opensolaris           4976642 4976642         0 100% /
> swap                            14217564     244  14217320   1% /etc/svc/volatile
> /usr/lib/libc/libc_hwcap2.so.1   4976642 4976642         0 100% /lib/libc.so.1
> swap                            14217460     140  14217320   1% /tmp
> swap                            14217344      24  14217320   1% /var/run
>
> # cd /var/adm
> # rm messages.?
> rm: cannot remove `messages.0': No space left on device
> rm: cannot remove `messages.1': No space left on device
>
> Likely a similar issue was reported a few years ago:
> http://opensolaris.org/jive/thread.jspa?messageID=241580
>
> However, my system is on snv_133. Is there any way to work around the situation? This is really critical, since after IPS gets the file system full, customers seem unable to recover.
>
> Thanks,
> -Eiji

On opensolaris? Did you try deleting any old BEs?

-tim
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Robert Milkowski writes:

> On 01/04/2010 20:58, Jeroen Roodhart wrote:
> >
> > > I'm happy to see that it is now the default and I hope this will cause the
> > > Linux NFS client implementation to be faster for conforming NFS servers.
> >
> > Interesting thing is that apparently defaults on Solaris and Linux are
> > chosen such that one can't signal the desired behaviour to the other. At
> > least we didn't manage to get a Linux client to asynchronously mount a
> > Solaris (ZFS backed) NFS export...
>
> Which is to be expected, as it is not the NFS client which requests the
> behavior but rather the NFS server.
> Currently on Linux you can export a share as sync (the default) or async,
> while on Solaris you can't currently force an NFS server to start working
> in async mode.

True, and there is an entrenched misconception (not you) that this is a ZFS-specific problem, which it's not. It's really an NFS protocol feature. It can be circumvented using zil_disable, which therefore reinforces the misconception. It's further reinforced by testing an NFS server on disk drives with WCE=1 and a filesystem other than ZFS.

All the fast options cause the NFS client to become inconsistent after a server reboot. Whatever was being done in the moments prior to the server reboot will need to be wiped out by users if they are told that the server did reboot. That's manageable for home use, not for the enterprise.

-r

> --
> Robert Milkowski
> http://milek.blogspot.com
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> On 01/04/2010 20:58, Jeroen Roodhart wrote:
> >
> > > I'm happy to see that it is now the default and I hope this will cause the
> > > Linux NFS client implementation to be faster for conforming NFS servers.
> >
> > Interesting thing is that apparently defaults on Solaris and Linux are
> > chosen such that one can't signal the desired behaviour to the other. At
> > least we didn't manage to get a Linux client to asynchronously mount a
> > Solaris (ZFS backed) NFS export...
>
> Which is to be expected, as it is not the NFS client which requests the
> behavior but rather the NFS server.
> Currently on Linux you can export a share as sync (the default) or async,
> while on Solaris you can't currently force an NFS server to start working
> in async mode.

The other part of the issue is that the Solaris clients have been developed against a "sync" server: the client does more write-behind and continues caching the non-acked data. The Linux client has been developed against an "async" server and has some catching up to do.

Casper
Re: [zfs-discuss] [install-discuss] Installing Opensolaris without ZFS?
[removing all lists except zfs-discuss, as this is really pertinent only there]

ольга крыжановская wrote:
> Are there plans to reduce the memory usage of ZFS in the near future?
>
> Olga
>
> 2010/4/2 Alan Coopersmith:
> > ольга крыжановская wrote:
> > > Does Opensolaris have an option to install without ZFS, i.e. use UFS
> > > for root like SXCE did?
> >
> > No. beadm & pkg image-update rely on ZFS functionality for the root
> > filesystem.
> >
> > --
> > -Alan Coopersmith  alan.coopersm...@oracle.com
> > Oracle Solaris Platform Engineering: X Window System

The vast majority of ZFS memory consumption is for caching, which can be manually reduced if it's impinging on your applications. See the tuning guide for more info: http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide

As pointed out elsewhere, these tuning parameters are generally high-water marks: ZFS will return RAM to the system if it's needed by applications. So, in your original problem, the likelihood is /not/ that ZFS is consuming RAM and not releasing it, but rather that your other apps are overloading the system. That said, there are certain minimum allocations that can't be reduced and must be held in RAM, but they're generally not significant.

UFS's memory usage is really not measurably different from ZFS's, so far as I can measure from a kernel standpoint. It's all the caching that makes ZFS look like a RAM pig. One thing though: taking away all of ZFS's caching hurts performance more than removing all of UFS's file cache, because ZFS stores more than simple data in its file cache (ARC).

Realistically speaking, I can't see running ZFS on a machine with less than 1GB of RAM. I also can't see modifying ZFS to work well in such circumstances, as (a) ZFS isn't targeted at such limited platforms and (b) you'd seriously compromise a major chunk of performance trying to make it fit. These days, 4GB is really more of a minimum for a 64-bit machine/OS in any case.
I certainly would be interested in seeing what a large L2ARC cache would mean for a reduction in RAM footprint; on one hand, having an L2ARC requires ARC (i.e. DRAM) allocations for each entry in the L2ARC, but on the other hand, it would reduce or eliminate storage of the actual data and metadata in DRAM. Anyone up for running tests on a box with, say, 512MB of RAM and a 10GB+ L2ARC (on, say, an SSD)?

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
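For reference, the manual cap mentioned above is applied via /etc/system. A hypothetical fragment, with an illustrative 512MB figure (the value is in bytes; consult the Evil Tuning Guide before using anything like this in production):

```
* Hypothetical /etc/system fragment: cap the ZFS ARC at 512 MB.
* 0x20000000 bytes = 536870912 = 512 MB; figure is illustrative only.
set zfs:zfs_arc_max = 0x20000000
```

A reboot is required for /etc/system changes to take effect.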
Re: [zfs-discuss] [install-discuss] Installing Opensolaris without ZFS?
Are there plans to reduce the memory usage of ZFS in the near future?

Olga

2010/4/2 Alan Coopersmith:
> ольга крыжановская wrote:
> > Does Opensolaris have an option to install without ZFS, i.e. use UFS
> > for root like SXCE did?
>
> No. beadm & pkg image-update rely on ZFS functionality for the root
> filesystem.
>
> --
> -Alan Coopersmith  alan.coopersm...@oracle.com
> Oracle Solaris Platform Engineering: X Window System

--
Olga Kryzhanovska, olga.kryzhanov...@gmail.com
Solaris/BSD//C/C++ programmer
[zfs-discuss] ARC Tail
Greetings all,

Can anyone help me figure out the size of the ARC "tail", i.e., the portion of the ARC that the l2arc feed thread reads from before pages are evicted from the ARC? Is the size of this tail proportional to the total ARC size? To the L2ARC device size? Is it tunable?

Your feedback is highly appreciated.

--
Abdullah Al-Dahlawi
PhD Candidate
George Washington University
Department of Electrical & Computer Engineering

Check The Fastest 500 Super Computers Worldwide
http://www.top500.org/list/2009/11/100
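Not an answer to the sizing question, but the relevant counters can at least be watched while the feed thread runs. A Solaris-only sketch, reading the arcstats kstat (statistic names as published by that kstat):

```
# Sketch: observe ARC size alongside L2ARC size and the DRAM cost of
# L2ARC headers while the l2arc feed thread is active.
kstat -p zfs:0:arcstats:size \
      zfs:0:arcstats:l2_size \
      zfs:0:arcstats:l2_hdr_size
```

Sampling these in a loop while the L2ARC warms gives a rough picture of how the feed keeps pace with eviction.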