Re: [zfs-discuss] Interesting experience with Nexenta - anyone seen it?

2010-05-22 Thread Bob Friesenhahn

On Fri, 21 May 2010, David Dyer-Bennet wrote:


To be comfortable (I don't ask for "know for a certainty"; I'm not sure
that exists outside of faith), I want a claim by the manufacturer and
multiple outside tests in significant journals -- which could be the
blog of somebody I trusted, as well as actual magazines and such.
Ideally, certainly if it's important, I'd then verify the tests myself.


For me, "know for a certainty" means that the feature is clearly 
specified in the formal specification sheet for the product, and the 
vendor has historically published reliable specification sheets. 
This may not be the same as money in the bank, but it is better than 
relying on thoughts from some blog posting.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: [zfs-discuss] Interesting experience with Nexenta - anyone seen it?

2010-05-22 Thread Bob Friesenhahn

On Fri, 21 May 2010, Brandon High wrote:


My understanding is that the controller contains enough cache to
buffer a complete erase block's worth of data, eliminating the
read / erase / write cycle that a partial block write entails.
It's reported to do a copy-on-write, so it doesn't need to do a read
of existing blocks when making changes, which gives it such high IOPS;
even random writes are turned into sequential writes of entire erase
blocks (much like how ZFS works). The generous spare area is used to
ensure that there are always full pages free to write to. (Some
vendors are releasing consumer drives with 60/120/240 GB, using 7%
reserved space rather than the 27% that the original drives ship
with.)


FLASH is useless as working space since it does not behave like RAM, so
every SSD needs to have some RAM for temporary storage of data.  This
COW approach seems nice except that it would appear to inflate 
performance by only considering a specific magic block size and 
alignment.  Other block sizes and alignments would require that 
existing data be read so that the new block content can be 
constructed.  Also, the blazing fast write speed (which depends on 
plenty of already erased blocks) would stop once the spare space in 
the SSD has been consumed.
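
To make the read-modify-write point concrete, here is a deliberately
simplified toy model in Python (illustrative only, not tied to any real
controller; the 128 KB erase-block size is an assumption): a write that
covers a whole erase block can be programmed directly, while a smaller or
misaligned write forces the old contents to be read back and merged first.

ERASE_BLOCK = 128 * 1024   # assumed erase-block size, purely illustrative

class ToyFlash:
    def __init__(self, nblocks):
        self.blocks = [None] * nblocks          # None means "erased"
        self.reads = self.erases = self.programs = 0

    def write(self, idx, data, offset=0):
        old = self.blocks[idx]
        if old is not None and len(data) < ERASE_BLOCK:
            self.reads += 1                     # read back the rest of the block
            data = old[:offset] + data + old[offset + len(data):]
        else:
            data = data.ljust(ERASE_BLOCK, b'\0')
        if old is not None:
            self.erases += 1                    # erase before reprogramming in place
        self.blocks[idx] = data
        self.programs += 1

flash = ToyFlash(4)
flash.write(0, b'x' * ERASE_BLOCK)              # full, aligned block: no read needed
flash.write(0, b'y' * 4096)                     # partial update: read + erase + program
print(flash.reads, flash.erases, flash.programs)   # -> 1 1 2

A controller with a pool of pre-erased spare blocks can redirect the
full-block case somewhere else entirely (the copy-on-write behavior
described above), which is why the blazing speed should indeed taper off
once that spare space is gone.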


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: [zfs-discuss] Interesting experience with Nexenta - anyone seen it?

2010-05-22 Thread Bob Friesenhahn

On Fri, 21 May 2010, Don wrote:

You know- it would probably be sufficient to provide the SSD with 
_just_ a big capacitor bank. If the host lost power it would stop 
writing and if the SSD still had power it would probably use the 
idle time to flush its buffers. Then there would be world peace!


This makes the assumption that an SSD will want to flush its write 
cache as soon as possible rather than just letting it sit there 
waiting for more data.  This is probably not a good assumption.  If 
the OS sends 512 bytes of data but the SSD block size is 4K, it is 
reasonable for the SSD to wait for 3584 more contiguous bytes of data 
before it bothers to write anything.


Writes increase the wear on the flash, and writes require a slow erase
cycle, so it is reasonable for SSDs to buffer as much data in their
write cache as possible before writing anything.  An advanced SSD
could write non-contiguous sectors in an SSD page and then use a sort
of lookup table to know where the sectors actually are.  Regardless,
under slow write conditions, it is definitely valuable to buffer
the data for a while in the hope that more related data will appear,
or the data might even be overwritten.
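
A rough sketch in Python of the buffering-plus-lookup-table idea described
above (illustrative only; the 512-byte sector and 4 KB page sizes are
assumptions): small host writes accumulate until a page's worth is
available, then get packed into one flash page, with a table recording
where each logical sector actually landed.

SECTOR = 512
PAGE = 4096                        # assumed flash page size, purely illustrative

class ToySsd:
    def __init__(self):
        self.pending = {}          # buffered sectors, keyed by LBA
        self.remap = {}            # LBA -> (flash page, slot) lookup table
        self.next_page = 0
        self.pages_programmed = 0

    def host_write(self, lba, data):
        self.pending[lba] = data   # an overwrite just replaces the buffered copy
        if len(self.pending) * SECTOR >= PAGE:
            self.flush()

    def flush(self):
        # Pack whatever is buffered, contiguous or not, into one flash page
        # and remember where each sector went.
        for slot, lba in enumerate(sorted(self.pending)):
            self.remap[lba] = (self.next_page, slot)
        self.next_page += 1
        self.pages_programmed += 1
        self.pending.clear()

ssd = ToySsd()
for lba in (7, 3, 1000, 42, 7, 9, 11, 13, 21):   # scattered LBAs, one overwrite
    ssd.host_write(lba, b'x' * SECTOR)
print(ssd.pages_programmed, len(ssd.pending))    # -> 1 0: nine writes, one page programmed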


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: [zfs-discuss] Interesting experience with Nexenta - anyone seen it?

2010-05-21 Thread Miika Vesti

If you do not care about this NFS problem (or the others) then maybe
you can just disable the ZIL.  It is a matter of working through step
1.  Working through STEP 1 might be ``doesn't affect us.  Disable
ZIL.''  Or it might be ``get slog with supercap''.  STEP 1 will never
be ``plug in OCZ Vertex cheaposlog that ignores cacheflush'' if you
are doing it right.  And Step 2 has nothing to do with anything yet
until we finish STEP 1 and the insane failure cases.


AFAIK OCZ Vertex 2 does not use volatile DRAM cache but non-volatile 
NAND grid. Whether it respects or ignores the cache flush seems irrelevant.


There has been previous discussion about this: 
http://comments.gmane.org/gmane.os.solaris.opensolaris.zfs/35702


I'm pretty sure that all SandForce-based SSDs don't use DRAM as their
cache, but take a hunk of flash to use as scratch space instead. Which
means that they'll be OK for ZIL use.

Also:
http://www.techspot.com/news/37729-ocz-vertex-2-pro-100gb-ssd-review.html

Another benefit of SandForce's architecture is that the SSD keeps 
information on the NAND grid and removes the need for a separate cache 
buffer DRAM module. The result is a faster transaction, albeit at the 
expense of total storage capacity.


So if I interpret them correctly, what they chose to do with the 
current incarnation of the architecture is actually reserve some of the 
primary memory capacity for I/O transaction management.


In plain English, if the system gets interrupted either by power or by 
a crash, when it initializes the next time, it can read from its 
transaction space and resume where it left off. This makes it durable.
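
Conceptually that is ordinary write-ahead logging: record the intent
durably first, apply it second, and on restart replay whatever is left in
the log. A minimal sketch of the pattern in Python (nothing to do with
SandForce's actual internals, purely to illustrate the recovery idea):

import json, os

LOG = 'txn.log'                        # stands in for the "transaction space"

def apply_update(state_file, update):
    state = json.load(open(state_file)) if os.path.exists(state_file) else {}
    state.update(update)
    with open(state_file, 'w') as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())

def commit(state_file, update):
    with open(LOG, 'a') as log:        # 1. make the intent durable
        log.write(json.dumps(update) + '\n')
        log.flush()
        os.fsync(log.fileno())
    apply_update(state_file, update)   # 2. then update the real data

def recover(state_file):
    # After a crash or power loss, replay the log; reapplying is harmless here.
    if os.path.exists(LOG):
        for line in open(LOG):
            apply_update(state_file, json.loads(line))
        os.remove(LOG)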


So, OCZ Vertex 2 seems to be a good choice for ZIL.


Re: [zfs-discuss] Interesting experience with Nexenta - anyone seen it?

2010-05-21 Thread Attila Mravik
 AFAIK OCZ Vertex 2 does not use volatile DRAM cache but non-volatile NAND
 grid. Whether it respects or ignores the cache flush seems irrelevant.

 There has been previous discussion about this:
 http://comments.gmane.org/gmane.os.solaris.opensolaris.zfs/35702

 I'm pretty sure that all SandForce-based SSDs don't use DRAM as their
 cache, but take a hunk of flash to use as scratch space instead. Which
 means that they'll be OK for ZIL use.

 Also:
 http://www.techspot.com/news/37729-ocz-vertex-2-pro-100gb-ssd-review.html

 Another benefit of SandForce's architecture is that the SSD keeps
 information on the NAND grid and removes the need for a separate cache
 buffer DRAM module. The result is a faster transaction, albeit at the
 expense of total storage capacity.

 So if I interpret them correctly, what they chose to do with the current
 incarnation of the architecture is actually reserve some of the primary
 memory capacity for I/O transaction management.

 In plain English, if the system gets interrupted either by power or by a
 crash, when it initializes the next time, it can read from its transaction
 space and resume where it left off. This makes it durable.


Here is a detailed explanation of the SandForce controllers:
http://www.anandtech.com/show/3661/understanding-sandforces-sf1200-sf1500-not-all-drives-are-equal

So the SF-1500 is enterprise class and relies on a supercap, while the
SF-1200 is consumer class and does not.

The SF-1200 firmware on the other hand doesn’t assume the presence of
a large capacitor to keep the controller/NAND powered long enough to
complete all writes in the event of a power failure. As such it does
more frequent check pointing and doesn’t guarantee the write in
progress will complete before it’s acknowledged.

As I understand it, the SF-1200 will ack a sync write only after it
is written to flash, thus reducing write performance.
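
That difference is easy to measure before trusting a drive as a slog. A
rough sketch in Python (the device path and counts are placeholders; point
it at a scratch device or file you can afford to overwrite): a drive that
really commits each O_SYNC write to flash will report far fewer IOPS than
one that acknowledges out of a volatile cache.

import os, time

PATH = '/dev/rdsk/c1t1d0s0'        # placeholder: a scratch device or test file
COUNT = 2000
BUF = b'\0' * 4096

fd = os.open(PATH, os.O_WRONLY | os.O_SYNC)   # each write must be stable when ack'd
start = time.time()
for _ in range(COUNT):
    os.write(fd, BUF)
os.close(fd)
elapsed = time.time() - start
print('%d sync writes in %.2fs: %.0f IOPS' % (COUNT, elapsed, COUNT / elapsed))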

There is an interesting part about firmware: OCZ has an exclusive
firmware in the Vertex 2 series which is based on the SF-1200 but whose
random write IOPS is not capped at 10K, while other vendors and other
SSDs from OCZ using the SF-1200 are capped, unless they sell the drive
with the RC firmware, which is for OEM evaluation and not production
ready but does not contain the IOPS cap.


Re: [zfs-discuss] Interesting experience with Nexenta - anyone seen it?

2010-05-21 Thread Kyle McDonald
SNIP a whole lot of ZIL/SLOG discussion

Hi guys.

yep I know about the ZIL, and SSD Slogs.

While setting Nexenta up it offered to disable the ZIL entirely. For
now I left it on. In the end (hopefully only for specific filesystems,
once that feature is released) I'll end up disabling the ZIL for our
software builds since:

1) The builds are disposable - We only need to save them if they finish,
and we can restart them if needed.
2) The build servers are not on UPS so a power failure is likely to make
the clients lose all state and need to restart anyway.

But this issue I've seen with Nexenta is not due to the ZIL. It runs
until it literally crashes the machine. It's not just slow; it brings
the machine to its knees. I believe it does have something to do with
exhausting memory though. As Erast says, it may be the isp driver (though
I've used that on b130 of SXCE without issues), or who knows what else.

I did download some updates from Nexenta yesterday. I'm going to try to
retest today or tomorrow.

 -Kyle



Re: [zfs-discuss] Interesting experience with Nexenta - anyone seen it?

2010-05-21 Thread Bob Friesenhahn

On Fri, 21 May 2010, Miika Vesti wrote:

AFAIK OCZ Vertex 2 does not use volatile DRAM cache but non-volatile NAND 
grid. Whether it respects or ignores the cache flush seems irrelevant.


There has been previous discussion about this: 
http://comments.gmane.org/gmane.os.solaris.opensolaris.zfs/35702


I'm pretty sure that all SandForce-based SSDs don't use DRAM as their
cache, but take a hunk of flash to use as scratch space instead. Which
means that they'll be OK for ZIL use.

So, OCZ Vertex 2 seems to be a good choice for ZIL.


There seem to be quite a lot of blind assumptions in the above.  The
only good choice for a ZIL device is one you know about for a certainty,
not one based on assumptions from 3rd-party articles and blog postings.
Otherwise it is like assuming that if you jump through an open window
there will be firemen down below to catch you.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: [zfs-discuss] Interesting experience with Nexenta - anyone seen it?

2010-05-21 Thread David Dyer-Bennet

On Fri, May 21, 2010 10:19, Bob Friesenhahn wrote:
 On Fri, 21 May 2010, Miika Vesti wrote:

 AFAIK OCZ Vertex 2 does not use volatile DRAM cache but non-volatile
 NAND
 grid. Whether it respects or ignores the cache flush seems irrelevant.

 There has been previous discussion about this:
 http://comments.gmane.org/gmane.os.solaris.opensolaris.zfs/35702

 I'm pretty sure that all SandForce-based SSDs don't use DRAM as their
 cache, but take a hunk of flash to use as scratch space instead. Which
 means that they'll be OK for ZIL use.

 So, OCZ Vertex 2 seems to be a good choice for ZIL.

 There seem to be quite a lot of blind assumptions in the above.  The
 only good choice for a ZIL device is one you know about for a certainty,
 not one based on assumptions from 3rd-party articles and blog postings.
 Otherwise it is like assuming that if you jump through an open window
 there will be firemen down below to catch you.

Just how DOES one know something for a certainty, anyway?  I've seen LOTS
of people mess up performance testing in ways that gave them very wrong
answers; relying solely on your own testing is as foolish as relying on a
couple of random blog posts.

To be comfortable (I don't ask for "know for a certainty"; I'm not sure
that exists outside of faith), I want a claim by the manufacturer and
multiple outside tests in significant journals -- which could be the
blog of somebody I trusted, as well as actual magazines and such. 
Ideally, certainly if it's important, I'd then verify the tests myself.

There aren't enough hours in the day, so I often get by with less.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info



Re: [zfs-discuss] Interesting experience with Nexenta - anyone seen it?

2010-05-21 Thread Miika Vesti
This is interesting. I thought all Vertex 2 SSDs were good choices for ZIL 
but this does not seem to be the case.


According to http://www.legitreviews.com/article/1208/1/ Vertex 2 LE, 
Vertex 2 Pro and Vertex 2 EX are SF-1500 based but Vertex 2 (without any 
suffix) is SF-1200 based.


Here is the table:
Model          Controller   Max Read   Max Write   IOPS
Vertex 2       SF-1200      270 MB/s   260 MB/s    9500
Vertex 2 LE    SF-1500      270 MB/s   250 MB/s    ?
Vertex 2 Pro   SF-1500      280 MB/s   270 MB/s    19000
Vertex 2 EX    SF-1500      280 MB/s   270 MB/s    25000

On 21.05.2010 17:09, Attila Mravik wrote:

AFAIK OCZ Vertex 2 does not use volatile DRAM cache but non-volatile NAND
grid. Whether it respects or ignores the cache flush seems irrelevant.

There has been previous discussion about this:
http://comments.gmane.org/gmane.os.solaris.opensolaris.zfs/35702

I'm pretty sure that all SandForce-based SSDs don't use DRAM as their
cache, but take a hunk of flash to use as scratch space instead. Which
means that they'll be OK for ZIL use.

Also:
http://www.techspot.com/news/37729-ocz-vertex-2-pro-100gb-ssd-review.html

Another benefit of SandForce's architecture is that the SSD keeps
information on the NAND grid and removes the need for a separate cache
buffer DRAM module. The result is a faster transaction, albeit at the
expense of total storage capacity.

So if I interpret them correctly, what they chose to do with the current
incarnation of the architecture is actually reserve some of the primary
memory capacity for I/O transaction management.

In plain English, if the system gets interrupted either by power or by a
crash, when it initializes the next time, it can read from its transaction
space and resume where it left off. This makes it durable.



Here is a detailed explanation of the SandForce controllers:
http://www.anandtech.com/show/3661/understanding-sandforces-sf1200-sf1500-not-all-drives-are-equal

So the SF-1500 is enterprise class and relies on a supercap, while the
SF-1200 is consumer class and does not.

The SF-1200 firmware on the other hand doesn’t assume the presence of
a large capacitor to keep the controller/NAND powered long enough to
complete all writes in the event of a power failure. As such it does
more frequent check pointing and doesn’t guarantee the write in
progress will complete before it’s acknowledged.

As I understand it, the SF-1200 will ack a sync write only after it
is written to flash, thus reducing write performance.

There is an interesting part about firmware: OCZ has an exclusive
firmware in the Vertex 2 series which is based on the SF-1200 but whose
random write IOPS is not capped at 10K, while other vendors and other
SSDs from OCZ using the SF-1200 are capped, unless they sell the drive
with the RC firmware, which is for OEM evaluation and not production
ready but does not contain the IOPS cap.


Re: [zfs-discuss] Interesting experience with Nexenta - anyone seen it?

2010-05-21 Thread Brandon High
On Thu, May 20, 2010 at 2:23 PM, Miika Vesti miika.ve...@trivore.com wrote:
 I'm pretty sure that all SandForce-based SSDs don't use DRAM as their
 cache, but take a hunk of flash to use as scratch space instead. Which
 means that they'll be OK for ZIL use.

I've read conflicting reports that the controller contains a small
DRAM cache. So while it doesn't rely on an external DRAM cache, it
does have one: http://www.legitreviews.com/article/1299/2/
As we noted, the Vertex 2 doesn't have any cache chips on it; that is
because the SandForce controller itself is said to carry a small cache
inside that is a number of megabytes in size.

 Another benefit of SandForce's architecture is that the SSD keeps
 information on the NAND grid and removes the need for a separate cache
 buffer DRAM module. The result is a faster transaction, albeit at the
 expense of total storage capacity.

Again, conflicting reports indicate otherwise.
http://www.legitreviews.com/article/1299/2/
That adds up to 128GB of storage space, but only 93.1GB of it will be
usable space! The 'hidden' capacity is used for wear leveling, which
is crucial to keeping SSDs running as long as possible.

My understanding is that the controller contains enough cache to
buffer a complete erase block's worth of data, eliminating the
read / erase / write cycle that a partial block write entails.
It's reported to do a copy-on-write, so it doesn't need to do a read
of existing blocks when making changes, which gives it such high IOPS;
even random writes are turned into sequential writes of entire erase
blocks (much like how ZFS works). The generous spare area is used to
ensure that there are always full pages free to write to. (Some
vendors are releasing consumer drives with 60/120/240 GB, using 7%
reserved space rather than the 27% that the original drives ship
with.)
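
The 93.1GB usable figure and the 27% spare-area figure quoted earlier are
consistent with each other if the review's "GB" is really GiB: 128 GiB of
raw NAND behind a drive that exposes 100 decimal GB. A quick sanity check:

GIB = 2 ** 30
GB = 10 ** 9

def spare_fraction(raw_gib, advertised_gb):
    # Fraction of the raw NAND held back for wear leveling and spare pages.
    raw, usable = raw_gib * GIB, advertised_gb * GB
    return (raw - usable) / float(raw)

# 100 GB SandForce drive assumed to be built from 128 GiB of NAND:
print('usable: %.1f GiB' % (100 * GB / float(GIB)))           # ~93.1
print('spare:  %.0f%%' % (100 * spare_fraction(128, 100)))    # ~27%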

With an unexpected power loss, you could still lose any data that's
cached in the controller, or any uncommitted changes that have been
partially written to the NAND.

I hate having to rely on sites like Legit Reviews and Anandtech for
technical data, but there don't seem to be non-fanboy sites doing
comprehensive reviews of the drives ...

-B

-- 
Brandon High : bh...@freaks.com


Re: [zfs-discuss] Interesting experience with Nexenta - anyone seen it?

2010-05-21 Thread Miles Nordin
 dd == David Dyer-Bennet d...@dd-b.net writes:

dd Just how DOES one know something for a certainty, anyway?

science.

Do a test like Lutz did on the X25-M G2.  See list archives, 2010-01-10.
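
For anyone who wants to reproduce that kind of test, the usual shape of it
(a hedged sketch, not Lutz's actual procedure) is: write monotonically
increasing sequence numbers with synchronous semantics, note the last
number acknowledged, cut power mid-run, and after reboot verify that every
acknowledged record survived.

import os, struct, sys

REC = 512                          # one record per 512-byte sector

def writer(dev):
    fd = os.open(dev, os.O_WRONLY | os.O_SYNC)   # each write must be stable when ack'd
    seq = 0
    while True:
        os.write(fd, struct.pack('<Q', seq).ljust(REC, b'\0'))
        print(seq)                 # the last number printed is the last acknowledged write
        sys.stdout.flush()
        seq += 1                   # ...pull the plug on the drive while this runs...

def verify(dev, last_acked):
    # After reboot: every record up to the last acknowledged one must be intact.
    with open(dev, 'rb') as f:
        for n in range(last_acked + 1):
            (val,) = struct.unpack('<Q', f.read(REC)[:8])
            if val != n:
                print('LOST acknowledged write %d' % n)
                return
    print('all %d acknowledged writes survived' % (last_acked + 1))

A device that ignores cache flushes typically shows a gap here: records it
acknowledged just before the power pull are simply gone.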




Re: [zfs-discuss] Interesting experience with Nexenta - anyone seen it?

2010-05-21 Thread Don
 Now, if someone would make a Battery FOB, that gives broken SSD 60
 seconds of power, then we could use the consumer  SSD's in servers
 again with real value instead of CYA value.
You know- it would probably be sufficient to provide the SSD with _just_ a big 
capacitor bank. If the host lost power it would stop writing and if the SSD 
still had power it would probably use the idle time to flush its buffers. Then 
there would be world peace!

Yeah- got a little carried away there. Still this seems like an experiment I'm 
going to have to try on my home server out of curiosity more than anything else 
:)
-- 
This message posted from opensolaris.org


[zfs-discuss] Interesting experience with Nexenta - anyone seen it?

2010-05-20 Thread Kyle McDonald
Hi all,

I recently installed Nexenta Community 3.0.2 on one of my servers:

IBM eSeries X346
2.8Ghz Xeon
12GB DDR2 RAM
1 builtin BGE interface for management
4 port Intel GigE card aggregated for Data
IBM ServeRAID 7k with 256MB BB Cache (isp driver)
  6 RAID0 single drive LUNs (so I can use the cache)
1 18GB LUN for the rpool
5 300GB LUN for the data pool
1 RAIDZ1 pool from the 5 300GB drives.
  4 test filesystems
1 No Dedup, No Compression
1 DeDup, No Compression
1 No DeDup, Compression
1 DeDup, Compression

This is pretty old hardware, so I wasn't expecting miracles, but I
thought I'd give it a shot.
My workload is NFS service to software build servers (cvs checkouts,
untarring files, compiling, etc.). I'm hoping the many CVS checkout trees
will lend themselves to DeDup well, and I know source code should
compress easily.

I set up one client with a single GigE connection, mounted the four file
systems (plus one from the netapp we have here) and proceeded to write a
loop to time both un-tarring the gcc-4.3.3 sources to those 5
filesystems and to 1 local directory, and to rm -rf the sources too.
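
For reference, a loop along these lines is easy to script; the sketch below
is Python with placeholder paths (it is not the actual script used here,
and it uses the tarfile module rather than /usr/bin/tar):

import os, shutil, tarfile, time

TARBALL = '/var/tmp/gcc-4.3.3.tar.gz'          # placeholder paths, adjust to taste
TARGETS = ['/mnt/nodedup-nocomp', '/mnt/dedup-nocomp',
           '/mnt/nodedup-comp',   '/mnt/dedup-comp',
           '/mnt/netapp',         '/var/tmp/local']

for mnt in TARGETS:
    dest = os.path.join(mnt, 'gcc-test')
    os.makedirs(dest)
    t0 = time.time()
    with tarfile.open(TARBALL) as tf:
        tf.extractall(dest)                    # the untar half of the test
    untar = time.time() - t0
    t0 = time.time()
    shutil.rmtree(dest)                        # the rm -rf half of the test
    print('%-22s untar %6.1fs  rm %6.1fs' % (mnt, untar, time.time() - t0))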

The untar took 28 seconds (and the removal 10 seconds) in the local dir;
then, on the first ZFS/NFS filesystem mount, it took basically forever and
hung the Nexenta server. I was watching it go on the web admin page and
it all looked fine for a while, then the client started reporting 'NFS
Server not responding, still trying...' For a while there were also
'NFS Server OK' messages too, and the Web GUI remained responsive.
Eventually the OK messages stopped, and the Web GUI froze.

I went and rebooted the NFS client, thinking that if the requests stopped
the server might catch up, but it never started responding again.

I was only untarring a file. How did this bring the machine down?
I hadn't even gotten to the FS's that had DeDup or Compression turned
on, so those shouldn't have affected things - yet.

Any ideas?

  -Kyle





Re: [zfs-discuss] Interesting experience with Nexenta - anyone seen it?

2010-05-20 Thread Erast

Hi Kyle,

Very likely you hit a driver bug in isp. After the reboot, take a
look at the /var/adm/messages file - anything related might shed some light.


I wouldn't suspect the Intel GigE card - it's a fairly good one and the
driver is very stable.


Also, some upgrades have been posted; make sure the kernel displays 134e
after the reboot into the new upgrade checkpoint. The upgrade command:


nmc$ setup appliance upgrade

On 05/20/2010 08:05 AM, Kyle McDonald wrote:

Hi all,

I recently installed Nexenta Community 3.0.2 on one of my servers:

IBM eSeries X346
2.8Ghz Xeon
12GB DDR2 RAM
1 builtin BGE interface for management
4 port Intel GigE card aggregated for Data
IBM ServeRAID 7k with 256MB BB Cache (isp driver)
  6 RAID0 single drive LUNs (so I can use the cache)
1 18GB LUN for the rpool
5 300GB LUN for the data pool
1 RAIDZ1 pool from the 5 300GB drives.
  4 test filesystems
1 No Dedup, No Compression
1 DeDup, No Compression
1 No DeDup, Compression
1 DeDup, Compression

This is pretty old hardware, so I wasn't expecting miracles, but I
thought I'd give it a shot.
My workload is NFS service to software build servers (cvs checkouts,
untarring files, compiling, etc.). I'm hoping the many CVS checkout trees
will lend themselves to DeDup well, and I know source code should
compress easily.

I set up one client with a single GigE connection, mounted the four file
systems (plus one from the netapp we have here) and proceeded to write a
loop to time both un-tarring the gcc-4.3.3 sources to those 5
filesystems and to 1 local directory, and to rm -rf the sources too.

The untar took 28 seconds (and the removal 10 seconds) in the local dir;
then, on the first ZFS/NFS filesystem mount, it took basically forever and
hung the Nexenta server. I was watching it go on the web admin page and
it all looked fine for a while, then the client started reporting 'NFS
Server not responding, still trying...' For a while there were also
'NFS Server OK' messages too, and the Web GUI remained responsive.
Eventually the OK messages stopped, and the Web GUI froze.

I went and rebooted the NFS client, thinking that if the requests stopped
the server might catch up, but it never started responding again.

I was only untarring a file. How did this bring the machine down?
I hadn't even gotten to the FS's that had DeDup or Compression turned
on, so those shouldn't have affected things - yet.

Any ideas?

   -Kyle





Re: [zfs-discuss] Interesting experience with Nexenta - anyone seen it?

2010-05-20 Thread Travis Tabbal
Disable ZIL and test again. NFS does a lot of sync writes, which kills 
performance. Disabling ZIL (or using the synchronicity option if a build with 
that ever comes out) will prevent that behavior, and should get your NFS 
performance close to local. It's up to you if you want to leave it that way. 
There are reasons not to as well. NFS clients can get corrupted views of the 
filesystem should the server go down before a write flush is completed. ZIL 
prevents that problem. In my case, the clients aren't on a UPS while the server 
is, so it's not an issue. :)
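
If you want to see how much of the pain really is the sync writes on a
given mount before changing anything, a crude comparison is to create a
batch of small files with and without fsync and time both runs. A rough
Python sketch (the target directory is a placeholder; numbers over NFS are
only indicative):

import os, sys, time

def make_files(directory, count, do_fsync):
    t0 = time.time()
    for i in range(count):
        fd = os.open(os.path.join(directory, 'f%05d' % i),
                     os.O_WRONLY | os.O_CREAT, 0o644)
        os.write(fd, b'x' * 8192)
        if do_fsync:
            os.fsync(fd)           # force the data to stable storage before continuing
        os.close(fd)
    return time.time() - t0

target = sys.argv[1]               # e.g. a test directory on the NFS mount
print('no fsync: %5.1fs' % make_files(target, 500, False))
print('fsync:    %5.1fs' % make_files(target, 500, True))
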
-- 
This message posted from opensolaris.org


Re: [zfs-discuss] Interesting experience with Nexenta - anyone seen it?

2010-05-20 Thread Roy Sigurd Karlsbakk
- Travis Tabbal tra...@tabbal.net wrote:

 Disable ZIL and test again. NFS does a lot of sync writes and kills
 performance. Disabling ZIL (or using the synchronicity option if a
 build with that ever comes out) will prevent that behavior, and should
 get your NFS performance close to local. It's up to you if you want to
 leave it that way. There are reasons not to as well. NFS clients can
 get corrupted views of the filesystem should the server go down before
 a write flush is completed. ZIL prevents that problem. In my case, the
 clients aren't on a UPS while the server is, so it's not an issue. :)

Disabling the ZIL is, according to ZFS best practice, NOT recommended. Get an 
SSD for the ZIL instead, preferably mirrored. You won't need a lot; the ZIL 
never uses more than half the RAM size.
 
Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented 
intelligibly. It is an elementary imperative for all pedagogues to avoid 
excessive use of idioms of foreign origin. In most cases adequate and 
relevant synonyms exist in Norwegian.


Re: [zfs-discuss] Interesting experience with Nexenta - anyone seen it?

2010-05-20 Thread David Magda
On Thu, May 20, 2010 13:58, Roy Sigurd Karlsbakk wrote:
 - Travis Tabbal tra...@tabbal.net wrote:

 Disable ZIL and test again. NFS does a lot of sync writes and kills
 performance. Disabling ZIL (or using the synchronicity option if a
 build with that ever comes out) will prevent that behavior, and should
 get your NFS performance close to local. It's up to you if you want to
 leave it that way. There are reasons not to as well. NFS clients can
 get corrupted views of the filesystem should the server go down before
 a write flush is completed. ZIL prevents that problem. In my case, the
 clients aren't on a UPS while the server is, so it's not an issue. :)

 Disabling ZIL is, according to ZFS best practice, NOT recommended. Get
 some SSD for the Zil instead, preferably mirrored. You won't need a lot,
 ZIL never uses more than half the RAM size

Disabling the ZIL is an easy way to TEST whether a ZIL would be helpful.
If things speed up after turning it off, then you'd turn it back on, and
go and purchase an SSD.

There's no sense spending money if it won't fix the problem.


To the OP, see Section 2.7 (Disabling the ZIL (Don't)) of:

   http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide

As mentioned, you do NOT want to run with this in production, but it is a
quick way to check.



Re: [zfs-discuss] Interesting experience with Nexenta - anyone seen it?

2010-05-20 Thread Miles Nordin
 rsk == Roy Sigurd Karlsbakk r...@karlsbakk.net writes:
 dm == David Magda dma...@ee.ryerson.ca writes:
 tt == Travis Tabbal tra...@tabbal.net writes:

   rsk Disabling ZIL is, according to ZFS best practice, NOT
   rsk recommended.

dm As mentioned, you do NOT want to run with this in production,
dm but it is a quick way to check.

REPEAT: I disagree.

Once you associate the disasterizing and dire warnings from the
developer's advice-wiki with the specific problems that ZIL-disabling
causes for real sysadmins rather than abstract notions of ``POSIX'' or
``the application'', a lot more people end up wanting to disable their
ZIL's.

In fact, most of the SSD's sold seem to be relying on exactly the
trick disabled-ZIL ZFS does for much of their high performance, if not
their feasibility within their price bracket period: provide a
guarantee of write ordering without durability, and many applications
are just, poof, happy.

If the SSD's arrange that no writes are reordered across a SYNC CACHE,
but don't bother actually providing durability, end uzarZ will ``OMG
windows fast and no corruption.'' -- ssd sales.

The ``do-not-disable-buy-SSD!!!1!'' advice thus translates to ``buy
one of these broken SSD's, and you will be basically happy.  Almost
everyone is.  When you aren't, we can blame the SSD instead of ZFS.''
All that bottlenecked host-SSD SATA traffic is just CYA and of no
real value (except for kernel panics).


Now, if someone would make a Battery FOB, that gives broken SSD 60
seconds of power, then we could use the consumer crap SSD's in servers
again with real value instead of CYA value.  FOB should work like
this:

== RUNNING ==                  SATA port: pass    power to SSD: on
    -- input power lost ----------------------> POWER-LOST HOLD-DOWN

== POWER-LOST HOLD-DOWN ==     SATA port: block   power to SSD: on
    -- 60 seconds elapsed --------------------> POWER OFF
    -- input power restored ------------------> POWER-RESTORED HOLD-DOWN

== POWER OFF ==                                   power to SSD: off
    -- input power restored ------------------> POWER-RESTORED HOLD-DOWN

== POWER-RESTORED HOLD-DOWN ==                    power to SSD: off
    -- battery recharged ---------------------> RUNNING

The device must know when its battery has gone bad and stick itself in
``power restored hold down'' state.  Knowing when the battery is bad
may require more states to test the battery, but this is the general
idea.
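
The same machine is small enough to sketch directly. A toy model in Python
(purely illustrative, with the battery-health check reduced to a flag):

# Toy model of the proposed battery FOB, following the state diagram above.
class BatteryFob:
    def __init__(self, battery_ok=True):
        self.battery_ok = battery_ok
        self.state = 'RUNNING'                        # SATA: pass, SSD power: on

    def on_event(self, event):
        s = self.state
        if s == 'RUNNING' and event == 'input power lost':
            self.state = 'POWER-LOST HOLD-DOWN'       # SATA: block, SSD power: on
        elif s == 'POWER-LOST HOLD-DOWN' and event == '60 seconds elapsed':
            self.state = 'POWER OFF'                  # SSD power: off
        elif s in ('POWER-LOST HOLD-DOWN', 'POWER OFF') and event == 'input power restored':
            self.state = 'POWER-RESTORED HOLD-DOWN'   # SSD stays off until recharged
        elif s == 'POWER-RESTORED HOLD-DOWN' and event == 'battery recharged':
            # A FOB that knows its battery has gone bad parks itself here forever.
            if self.battery_ok:
                self.state = 'RUNNING'
        return self.state

fob = BatteryFob()
for ev in ('input power lost', '60 seconds elapsed',
           'input power restored', 'battery recharged'):
    print(ev, '->', fob.on_event(ev))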

I think it would be much cheaper to build an SSD with supercap, and
simpler because you can assume the supercap is good forever instead of
testing it.  However because of ``market forces'' the FOB approach
might sell for cheaper because the FOB cannot be tied to the SSD and
used as a way to segment the market.  If there are 2 companies making
only FOB's and not making SSD's, only then competition will work like
people want it to.  Otherwise FOBs will be $1000 or something because
only ``enterprise'' users are smart/dumb enough to demand them.

Normally I would have a problem that the FOB and SSD are separable,
but see, the FOB and SSD can be put together with double-sided tape:
the tape only has to hold for 60 seconds after $event, and there's no
way to separate the two by tripping over a cord.  You can safely move
SSD+FOB from one chassis to another without fearing all is lost if you
jiggle the connection.  I think it's okay overall.

tt This risk is mostly mitigated by UPS backup and auto-shutdown
tt when the UPS detects power loss, correct?

no no it's about cutting off a class of failure cases and constraining
ourselves to relatively sane forms of failure.  We are not haggling
about NO FAILURES EVAR yet.  First, for STEP 1 we isolate the insane
kinds of failure that cost us days or months of data rather than just
a few seconds, the kinds that call for crazy unplannable ad-hoc
recovery methods like `Viktor plz help me' and ``is anyone here a
Postgres data recovery expert?'' and ``is there a way I can invalidate
the batch of billing auth requests I uploaded yesterday so I can rerun
it without double-billing anyone?''  For STEP 1 we make the insane
fail almost impossible through clever software and planning.  A UPS
never never ever qualifies as ``almost impossible''.  

Then, once that's done, we come back for STEP 2 where we try to
minimize the sane failures also, and for step 2 things like UPS might
be useful.  For STEP 2 it makes sense to talk about percent
availability, probability of failure, length of time to recover from
Scenario X.  but in STEP 1 all the failures are insane