Re: [zfs-discuss] file concatenation with ZFS copy-on-write
Hi all,

I wonder if there has been any new development on this matter over the past six months. Today I pondered the idea of a ZFS-aware mv, capable of doing zero read/write of file data when moving files between datasets of one pool. This resembles the (z)cp idea proposed in this thread, and it seems like a trivial job for Sun, who have all the APIs and functional implementations for cloning and dedup as a means to reference the same block from different files. Such an extension to cp should be cheaper than generic dedup and useful for copying any templated file sets. I thought of local zones first, but most people probably init them from packages (though zoneadm says it is copying thousands of files), so /etc/skel might be a better example of the use case - though nearly useless ;)

Jim
Re: [zfs-discuss] Native ZFS for Linux
Peter Jeremy peter.jer...@alcatel-lucent.com wrote:
> On 2010-Jun-11 17:41:38 +0800, Joerg Schilling joerg.schill...@fokus.fraunhofer.de wrote:
>> PP.S.: Did you know that FreeBSD has _included_ the GPLd Reiserfs in the FreeBSD kernel for a while now, and that nobody has complained about this? See e.g.: http://svn.freebsd.org/base/stable/8/sys/gnu/fs/reiserfs/
>
> That is completely irrelevant and somewhat misleading. FreeBSD has never prohibited non-BSD-licensed code in their kernel or userland; however, it has always been optional and, AFAIR, the GENERIC kernel has always defaulted to containing only BSD code. Non-BSD code (whether GPL or CDDL) is carefully segregated (note the 'gnu' in the above URI).

Sorry, but your reply is completely misleading: the people who claim that there is a legal problem with having ZFS in the Linux kernel would of course also claim that Reiserfs cannot be in the FreeBSD kernel.

Jörg

-- 
Jörg Schilling, D-13353 Berlin
EMail: jo...@schily.isdn.cs.tu-berlin.de (home), j...@cs.tu-berlin.de (uni), joerg.schill...@fokus.fraunhofer.de (work)
Blog: http://schily.blogspot.com/
URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
[zfs-discuss] High-Performance ZFS (2000MB/s+)
Hi,

We are currently building a storage box based on OpenSolaris/Nexenta using ZFS. Our hardware specifications are as follows:

- Quad AMD G34 12-core 2.3 GHz (~110 GHz aggregate)
- 10 Crucial RealSSD (6Gb/s)
- 42 WD RAID Ed. 4 2TB disks + 6Gb/s SAS expanders
- LSI2008 SAS (two 4x ports)
- Mellanox InfiniBand 40 Gbit NICs
- 128 GB RAM

This setup gives us about 40TB of storage after mirroring (two disks as spares), 2.5TB of L2ARC and 64GB of ZIL, all fitting into a single 5U box. Both L2ARC and ZIL share the same disks (striped) due to bandwidth requirements. Each SSD has a theoretical performance of 40-50k IOPS in a 4k read/write scenario with a 70/30 distribution.

Now, I know that you should have a mirrored ZIL for safety, but the entire box is synchronized with an active standby at a different site (18km away - round trip of 0.16ms plus equipment latency). So in case the ZIL in site A takes a fall, or the motherboard/disk group dies - we still have safety.

DDT requirements for dedup on 16k blocks should be about 640GB when the main pool is full (capacity).

Without going into details about chipsets and such, do any of you on this list have experience with a similar setup and can share your thoughts, do's and don'ts, and any other information that could help while building and configuring this? What I want to achieve is 2 GB/s+ of NFS traffic against our ESX clusters (also InfiniBand-based), with both dedup and compression enabled in ZFS.

Let's talk moon landings.

Regards,
Arve
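[A rough sanity check of that 640GB estimate, assuming the ~250 bytes per DDT entry quoted later in this thread - the figures are assumptions, and the actual entry size varies by build:

    # entries = usable pool size / average block size
    entries=$(( 40 * 1024 * 1024 * 1024 * 1024 / (16 * 1024) ))   # 40 TB of 16k blocks
    echo "$(( entries * 250 / 1024 / 1024 / 1024 )) GiB of DDT"   # prints ~625

So ~625 GiB, consistent with the "about 640GB" figure above.]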
Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)
On 6/15/2010 4:42 AM, Arve Paalsrud wrote:
> [hardware specification and questions quoted; trimmed]

Given that for the ZIL, random write IOPS is paramount, the RealSSD isn't a good choice. SLC SSDs still spank any MLC device, and random IOPS for something like an Intel X25-E or OCZ Vertex EX are more than twice that of the RealSSD. I don't know where they got the 40k+ IOPS number for the RealSSD (I know it's in the specs, but how did they measure it?), but that's not what others are reporting:
http://benchmarkreviews.com/index.php?option=com_content&task=view&id=454&Itemid=60&limit=1&limitstart=7

Sadly, none of the current crop of SSDs has a capacitor or battery to back up its local (on-SSD) cache, so they're all subject to data loss on a power interruption.

Likewise, random read dominates L2ARC usage. Here, the most cost-effective solutions tend to be MLC-based SSDs with more moderate IOPS performance - the Intel X25-M and OCZ Vertex series are likely much more cost-effective than a RealSSD, especially considering price/performance.

Also, given the limitations of a x4 port connection to the rest of the system, I'd consider using a couple more SAS controllers and fewer expanders. The SSDs together are likely to be able to overwhelm a x4 PCI-E connection, so I'd want at least one dedicated x4 SAS HBA just for them. For the 42 disks, it depends more on what your workload looks like. If it is mostly small or random I/O to the disks, you can get away with fewer HBAs. Large, sequential I/O to the disks is going to require more HBAs. Remember, a modern 7200RPM SATA drive can pump out well over 100MB/s sequential, but well under 10MB/s random. Do the math to see how fast that overwhelms the x4 PCI-E 2.0 connection, which maxes out at about 2GB/s.

I'd go with 2 Intel X25-E 32GB models for the ZIL. Mirror them - striping isn't really going to buy you much here (so far as I can tell). 6Gbit/s SAS is wasted on HDs, so don't bother paying for it if you can avoid doing so.
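[The device layout suggested here would look roughly like this - a sketch with hypothetical device names, not the poster's actual configuration:

    # mirrored slog from two X25-Es; slog redundancy matters
    zpool add tank log mirror c2t0d0 c2t1d0
    # L2ARC cache devices need no redundancy - they only hold copies
    zpool add tank cache c2t2d0 c2t3d0 c2t4d0
]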
Really, I suspect that paying for 6Gb/s SAS isn't worth it at all, as only the read performance of the L2ARC SSDs might possibly exceed 3Gb/s SAS.

I'm going to say something sacrilegious here: 128GB of RAM may be overkill. You have the SSDs for L2ARC - much of which will be the DDT - but, if I'm reading this correctly, even if you switch to the 160GB Intel X25-M, that gives you 8 x 160GB = 1280GB of L2ARC, of which only half is in use by the DDT. The rest is file cache. You'll need lots of RAM if you plan on storing lots of small files in the L2ARC (that is, if your workload is lots of small files): about 200 bytes of RAM are needed for each L2ARC entry. I.e., if you have a 1k average record size, then for 600GB of L2ARC you'll need 600GB / 1kB * 200B = 120GB of RAM. If you have a more manageable 8k record size, then 600GB / 8kB * 200B = 15GB.

-- 
Erik Trimble
Java System Support
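[Erik's RAM arithmetic can be checked in a shell; the 200-bytes-per-L2ARC-entry figure is the assumption here:

    # ARC memory needed to index 600GB of L2ARC at various record sizes
    for rs in 1 8 128; do   # record size in KB
        echo "${rs}K records: $(( 600 * 1024 * 1024 / rs * 200 / 1024 / 1024 )) MB of ARC"
    done
    # prints ~120000 MB, ~15000 MB and ~937 MB respectively
]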
Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)
On 15/06/2010 14:09, Erik Trimble wrote:
> I'm going to say something sacrilegious here: 128GB of RAM may be overkill. You have the SSDs for L2ARC - much of which will be the DDT,

The point of L2ARC is that you start adding L2ARC when you can no longer physically fit (or afford) any more DRAM, so if the OP can afford to put in 128GB of RAM then they should.

-- 
Darren J Moffat
Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)
On 6/15/2010 6:17 AM, Darren J Moffat wrote:
> The point of L2ARC is that you start adding L2ARC when you can no longer physically fit (or afford) any more DRAM, so if the OP can afford to put in 128GB of RAM then they should.

True. I was speaking price/performance. Those 8GB DIMMs are still pretty darned pricey...

-- 
Erik Trimble
Java System Support
Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)
> I'm going to say something sacrilegious here: 128GB of RAM may be overkill. [...] About 200 bytes of RAM are needed for each L2ARC entry. I.e., if you have a 1k average record size, then for 600GB of L2ARC you'll need 600GB / 1kB * 200B = 120GB of RAM. If you have a more manageable 8k record size, then 600GB / 8kB * 200B = 15GB.

Now I'm confused. The first thing I heard was that about 160 bytes are needed per DDT entry. Later, someone else told me 270. Now you say 200. Also, there should be a good way to list the total number of blocks (zdb just crashed after filling memory on my 10TB test box). I tried browsing the source to find the size of the DDT struct, but I got lost. Can someone with an osol development environment please just check sizeof that struct?

Vennlige hilsener / Best regards

roy
-- 
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
-- 
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)
On 6/15/2010 6:40 AM, Roy Sigurd Karlsbakk wrote:
> Now I'm confused. The first thing I heard was that about 160 bytes are needed per DDT entry. Later, someone else told me 270. Now you say 200. [...]

A DDT entry takes up about 250 bytes, regardless of where it is stored. For every normal (i.e. block, metadata, etc. - NOT DDT) L2ARC entry, about 200 bytes have to be stored in main memory (the ARC).

-- 
Erik Trimble
Java System Support
Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)
On Jun 15, 2010, at 6:40 AM, Roy Sigurd Karlsbakk wrote:
> I tried browsing the source to find the size of the DDT struct, but I got lost. Can someone with an osol development environment please just check sizeof that struct?

Why read source when you can read the output of zdb -D? :-)
-- richard

-- 
Richard Elling
rich...@nexenta.com +1-760-896-4422
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/
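[For example - pool name hypothetical, and the output shape shown is only illustrative; more Ds give more detail:

    zdb -D tank     # DDT summary: entry counts plus dedup/compress ratios
    zdb -DD tank    # adds per-table lines reporting bytes per entry,
                    # e.g. "DDT-sha256-zap-duplicate: N entries,
                    #       size X on disk, Y in core"
]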
Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)
On Jun 15, 2010, at 4:42 AM, Arve Paalsrud wrote:
> [hardware specification quoted; trimmed]
> What I want to achieve is 2 GB/s+ of NFS traffic against our ESX clusters (also InfiniBand-based), with both dedup and compression enabled in ZFS.

In general, both dedup and compression gain space by trading off performance. You should take a closer look at snapshots + clones, because they gain performance by trading off systems management. Also, you can't size by ESX server, because ESX works (mostly) as a pass-through of the client VM workload. In your sizing calculations, think of ESX as a fancy network switch.
-- richard

-- 
Richard Elling
rich...@nexenta.com +1-760-896-4422
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/
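[The snapshot + clone route mentioned here looks like this for a golden VM image - dataset names hypothetical; clones are near-instant and share all unmodified blocks:

    zfs snapshot tank/esx/golden@deploy
    zfs clone tank/esx/golden@deploy tank/esx/vm01
    zfs clone tank/esx/golden@deploy tank/esx/vm02
]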
Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)
On Tue, 2010-06-15 at 04:42 -0700, Arve Paalsrud wrote:
> Mellanox InfiniBand 40 Gbit NICs

Just recognize that those NICs are IB only. Solaris currently does not support 10GbE using Mellanox products, even though other operating systems do. (There are folks working on resolving this, but I think we're still a couple of months from seeing the results of that effort.)

> This setup gives us about 40TB of storage after mirroring (two disks as spares), 2.5TB of L2ARC and 64GB of ZIL [...] So in case the ZIL in site A takes a fall, or the motherboard/disk group dies - we still have safety.

I expect that you need more space for L2ARC and a lot less for ZIL. Furthermore, you'd be better served by an even lower-latency/higher-IOPS ZIL. If you're going to spend this kind of cash, I'd recommend at least one or two DDRdrive X1 units or something similar. While not very big, you don't need much to get a huge benefit from the ZIL, and I think the vastly superior IOPS of these units will pay off in the end.

> DDT requirements for dedup on 16k blocks should be about 640GB when the main pool is full (capacity).

Dedup is not always a win, I think. I'd look hard at your data and usage to determine whether to use it.

-- Garrett
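[One way to "look hard at your data" before committing: zdb can simulate dedup on an existing pool without enabling it. It walks the whole pool, so it takes a while; pool name hypothetical:

    zdb -S tank
    # prints a simulated DDT histogram ending in the expected dedup ratio
]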
Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)
On Tue, 2010-06-15 at 07:36 -0700, Richard Elling wrote:
> In general, both dedup and compression gain space by trading off performance.

It depends on the usage. Note that for some uses, compression can be a performance *win*, because CPUs are generally fast enough that the cost of decompression beats the cost of the larger I/Os required to transfer uncompressed data. Of course, that assumes you have CPU cycles to spare.

-- Garrett
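[Compression is a per-dataset property, so this point is easy to test on a representative dataset and measure afterwards - dataset name hypothetical; lzjb was the cheap default codec of this era:

    zfs set compression=lzjb tank/vmdata
    # ...write some representative data, then:
    zfs get compressratio tank/vmdata
]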
Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)
On Tue, Jun 15, 2010 at 3:09 PM, Erik Trimble erik.trim...@oracle.com wrote:
> Given that for the ZIL, random write IOPS is paramount, the RealSSD isn't a good choice. [...] I don't know where they got the 40k+ IOPS number for the RealSSD (I know it's in the specs, but how did they measure it?), but that's not what others are reporting.

See http://www.anandtech.com/show/2944/3 and http://www.crucial.com/pdf/Datasheets-letter_C300_RealSSD_v2-5-10_online.pdf

But I agree that we should look into using the Vertex instead.

> Sadly, none of the current crop of SSDs has a capacitor or battery to back up its local (on-SSD) cache, so they're all subject to data loss on a power interruption.

Noted.

> Likewise, random read dominates L2ARC usage. Here, the most cost-effective solutions tend to be MLC-based SSDs with more moderate IOPS performance - the Intel X25-M and OCZ Vertex series are likely much more cost-effective than a RealSSD, especially considering price/performance.

Our other option is to use two Fusion-io ioDrive Duo SLC/MLC cards, or the SMLC when available (as well as drivers for Solaris) - so the price we're currently talking about is not an issue.

> Also, given the limitations of a x4 port connection to the rest of the system, I'd consider using a couple more SAS controllers and fewer expanders. [...] Do the math to see how fast that overwhelms the x4 PCI-E 2.0 connection, which maxes out at about 2GB/s.

We're talking about 4x SAS 6Gb/s lanes - 4800MB/s per port. See http://www.lsi.com/DistributionSystem/AssetDocument/SCG_LSISAS2008_PB_043009.pdf for specifications of the LSI chip. In other words, it utilizes PCIe 2.0 8x.

> I'd go with 2 Intel X25-E 32GB models for the ZIL. Mirror them - striping isn't really going to buy you much here (so far as I can tell). 6Gbit/s SAS is wasted on HDs, so don't bother paying for it if you can avoid doing so. Really, I suspect that paying for 6Gb/s SAS isn't worth it at all, as only the read performance of the L2ARC SSDs might possibly exceed 3Gb/s SAS.

What about bandwidth in this scenario? Won't the ZIL be limited to the throughput of only one X25-E? The SATA disks operate at 3Gb/s through the SAS expanders, so no 6Gb/s there.

> I'm going to say something sacrilegious here: 128GB of RAM may be overkill. [...]
Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)
On Tue, Jun 15, 2010 at 4:20 PM, Erik Trimble erik.trim...@oracle.com wrote:
> On 6/15/2010 6:57 AM, Arve Paalsrud wrote:
>> What about bandwidth in this scenario? Won't the ZIL be limited to the throughput of only one X25-E? [...]
>
> Yes - though I'm not sure how the slog devices work when there is more than one. I *don't* think they work like the L2ARC devices, which work round-robin. You'd have to ask. If they're doing a true stripe, then I doubt you'll get much more performance, as weird as that sounds. Also, even with a single X25-E, you can service a huge number of IOPS - likely more small IOPS than can be pushed over even an InfiniBand interface. The place where InfiniBand would certainly outpace the X25-E's capacity is large writes, where a single 100MB write would suck up all the X25-E's throughput capability.

But the Intel X25-E is limited to about 200 MB/s of writes, regardless of IOPS. So when throwing a lot of 16k IOPS at it (about 13,000), it will still be limited to 200 MB/s - or about 6-7% of the capacity of a QDR InfiniBand link. So I hereby officially ask: can I have multiple slogs striped to handle higher bandwidth than a single device can - is that supported in ZFS?

- Arve
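[For the record, the syntax distinction being asked about - device names hypothetical; listing log devices without 'mirror' creates separate top-level log vdevs, across which ZFS spreads slog writes:

    zpool add tank log c2t0d0 c2t1d0          # two independent (striped) slogs
    zpool add tank log mirror c2t0d0 c2t1d0   # one mirrored slog
]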
[zfs-discuss] Dedup... still in beta status
Data: 90% of current computers have less than 9 GB of RAM, and less than 5% have SSDs. Take a standard storage computer with a capacity of 4 TB: dedup on, dataset with 32 kb blocks, 2 TB of data in use... you need about 16 GB of memory just for the DDT. But you will not see this until it's too late. That is, we will work with the system and performance will be good... little by little we will see that write performance is dropping... then we will see that the system crashes randomly (when deleting automatic snapshots)... and finally we will see that disabling dedup doesn't solve it.

You may say that dedup has some requirements, and that is true. But what is also true is that even on systems with large amounts of RAM (by the usual standards), everyday operations such as deleting files or destroying datasets/snapshots give us decreased performance, even totally blocking the system, and that is not admissible. So maybe it would be desirable to place dedup in a freeze (beta or development status) until we can get a stable version, so that any necessary changes can be made in the core of ZFS to allow its use without compromising the integrity of the entire system (e.g., making the erasing of blocks multithreaded).

And what can we do if we have a system already contaminated with dedup?

1. Disable snapshots.
2. Create a new dataset without dedup and copy the data to the new dataset.
3. After copying the data, delete the snapshots - first the smaller ones; if there is a bigger snapshot (more than 10 Gb), make a progressive rollback to it (thus the snapshot will use 0 bytes) and then we can delete it.
4. When there are no snapshots left in the dataset, remove all files slowly (in batches).
5. Finally, when there are no files left, destroy the dataset.

If we skip any of these steps (and directly try to delete a snapshot with 95 Gb), the system will crash. If we try to destroy the dataset and the system crashes, the system will crash again on reboot (since the operation will keep trying to complete the erase).

My test system: AMD Athlon X2 5400, 8 Gb RAM, RAIDZ 3 TB, dataset 1.7 Tb, snapshot: 87 Gb. Tested with OSOL b134, EON 0.6, Nexenta Core 3.02 and NexentaStor Enterprise 3.02. All systems freeze when trying to delete snapshots. Finally, with rollback I could delete all snapshots, but when trying to destroy the dataset... the system is still processing the order (after 20 hours...).
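[The recovery sequence above, as a hedged shell sketch - dataset names hypothetical; the ordering (snapshots first, then files in batches, then the dataset) is the poster's, not an official procedure:

    zfs create -o dedup=off tank/clean
    rsync -a /tank/dirty/ /tank/clean/      # re-writes the data without dedup
    zfs rollback -r tank/dirty@big          # progressive rollback empties the big snapshot
    zfs destroy tank/dirty@big              # then delete snapshots, smallest first
    # ...remove the remaining files in batches, then finally:
    zfs destroy tank/dirty
]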
Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)
On Tue, Jun 15, 2010 at 3:33 PM, Erik Trimble erik.trim...@oracle.com wrote:
> True. I was speaking price/performance. Those 8GB DIMMs are still pretty darned pricey...

The motherboard has 32 DIMM slots - using 32 4GB modules makes 128GB quite affordable :)

-Arve
[zfs-discuss] OCZ Devena line of enterprise SSD
OCZ has a new line of enterprise SSDs based on the SandForce 1500 controller. These three have a supercapacitor:

OCZ Deneva Reliability 2.5 MLC SSD: http://www.oczenterprise.com/products/details/ocz-deneva-reliability-2-5-mlc-ssd.html
OCZ Deneva Reliability 2.5 SLC SSD: http://www.oczenterprise.com/products/details/ocz-deneva-reliability-2-5-slc-ssd.html
OCZ Deneva Reliability 2.5 eMLC SSD: http://www.oczenterprise.com/products/details/ocz-deneva-reliability-2-5-emlc-ssd.html

Any thoughts on how well they would work as L2ARC/ZIL drives?

Roger Hernandez
Re: [zfs-discuss] COMSTAR dropouts with dedup enabled
Thanks Brandon,

This system has 24GB of RAM and currently no L2ARC. The total deduplicated data was about 250GB, so I wouldn't have thought I would be out of RAM. I've removed the LUN for the time being, so I can't get the DDT size at the moment. I have some X25-Es to go in as L2ARC and slog, so I'll revisit dedup soon to see if that helps the issue.

-Matt
Re: [zfs-discuss] ZFS Data Loss on system crash/upgrade
Hi Austin,

Not much help, as it turns out. I don't see any evidence that a recovery mechanism, where you might lose a few seconds of data transactions, was triggered. It almost sounds like your file system was rolled back to a previous snapshot, because the data is lost as of a certain date, but I don't see any evidence of a rollback either. I'm stumped at this point, but maybe someone else has ideas.

Is it possible that hardware failures caused the outright removal of all data after a certain date? It doesn't seem possible. You can review how the critical hardware failures were impacting your ZFS pools by reviewing the contents of fmdump -eV. It's a lot of output to sort through, but look for checksum errors and other problems. Still, ongoing checksum errors might result in data corruption, but not the total loss of data after a certain date.

Can you recover your data from your existing snapshots?

Cindy

On 06/14/10 21:44, Austin Rotondo wrote:
> Cindy,
> The log is quite long, so I've attached a text file of the command output. The last command in the log before the system crash was:
> 2010-04-07.20:09:06 zpool scrub zarray1
> The system crashed sometime after 4/20/10, which is the date of the last file I have a record of creating. I looked through the log and didn't see anything unusual, but I'm definitely no expert on the subject.
> Thanks for your help,
> Austin
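[A minimal sketch of sifting that fmdump output - the class name is the standard ZFS checksum ereport class, and the date format follows fmdump(1M):

    fmdump -e -t 01Apr10                    # one-line ereport summaries since April 1
    fmdump -eV -c ereport.fs.zfs.checksum   # full detail, checksum errors only
]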
Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)
From: Garrett D'Amore [mailto:garr...@nexenta.com]
Sent: 15. juni 2010 17:43

> Just recognize that those NICs are IB only. Solaris currently does not support 10GbE using Mellanox products, even though other operating systems do. (There are folks working on resolving this, but I think we're still a couple of months from seeing the results of that effort.)
>
> I expect that you need more space for L2ARC and a lot less for ZIL. Furthermore, you'd be better served by an even lower-latency/higher-IOPS ZIL. If you're going to spend this kind of cash, I'd recommend at least one or two DDRdrive X1 units or something similar. While not very big, you don't need much to get a huge benefit from the ZIL, and I think the vastly superior IOPS of these units will pay off in the end.

What about the ZIL bandwidth in this case? I mean, could I stripe across multiple devices to handle higher throughput? Otherwise I would still be limited to the performance of the unit itself (155 MB/s).

> Dedup is not always a win, I think. I'd look hard at your data and usage to determine whether to use it.

-Arve
[zfs-discuss] zpool export / import discrepancy
Hello All,

I've migrated a JBOD of 16 drives from one server to another. I did a zpool export from the old system and a zpool import on the new system. One thing I did notice: since the drives are on a different controller card, the naming is different (as expected), but the order is also different. I set up the drives as passthrough on the controller card and went through each drive incrementally. I assumed the zpool import would have listed the drives in the order c10t2d0, d1, d2, ... c10t3d7. As shown below, the order the drives were imported in is c10t2d0, d2, d3, d1, then c10t3d0 through d7.

Original zpool setup on the old server:

    # zpool status backup
      pool: backup
     state: ONLINE
    config:
            NAME         STATE     READ WRITE CKSUM
            backup       ONLINE       0     0     0
              raidz2     ONLINE       0     0     0
                c7t1d0   ONLINE       0     0     0
                c7t2d0   ONLINE       0     0     0
                c7t3d0   ONLINE       0     0     0
                c7t4d0   ONLINE       0     0     0
                c7t5d0   ONLINE       0     0     0
                c7t6d0   ONLINE       0     0     0
                c7t7d0   ONLINE       0     0     0
                c7t8d0   ONLINE       0     0     0
                c7t9d0   ONLINE       0     0     0
                c7t10d0  ONLINE       0     0     0
                c7t11d0  ONLINE       0     0     0
                c7t12d0  ONLINE       0     0     0
                c7t13d0  ONLINE       0     0     0
                c7t14d0  ONLINE       0     0     0
                c7t15d0  ONLINE       0     0     0
            spares
              c7t16d0    AVAIL

Imported zpool on the new server:

    # zpool status backup
      pool: backup
     state: ONLINE
    config:
            NAME         STATE     READ WRITE CKSUM
            backup       ONLINE       0     0     0
              raidz2     ONLINE       0     0     0
                c10t2d0  ONLINE       0     0     0
                c10t2d2  ONLINE       0     0     0
                c10t2d3  ONLINE       0     0     0
                c10t2d1  ONLINE       0     0     0
                c10t2d4  ONLINE       0     0     0
                c10t2d5  ONLINE       0     0     0
                c10t2d6  ONLINE       0     0     0
                c10t2d7  ONLINE       0     0     0
                c10t3d0  ONLINE       0     0     0
                c10t3d1  ONLINE       0     0     0
                c10t3d2  ONLINE       0     0     0
                c10t3d3  ONLINE       0     0     0
                c10t3d4  ONLINE       0     0     0
                c10t3d5  ONLINE       0     0     0
                c10t3d6  ONLINE       0     0     0
            spares
              c10t3d7    AVAIL

Is ZFS dependent on the order of the drives? Will this cause any issues down the road?

Thank you all,
Scott
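[Device order should not matter: pool members are identified by the GUIDs stored in their on-disk labels, not by controller path or listing order. You can see this on any member disk (device name taken from the listing above):

    zdb -l /dev/rdsk/c10t2d0s0
    # each of the four labels carries the pool name, pool GUID, and this
    # vdev's guid/top_guid, which is what import actually matches on
]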
Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)
On Tue, 2010-06-15 at 18:33 +0200, Arve Paalsrud wrote:
> What about the ZIL bandwidth in this case? I mean, could I stripe across multiple devices to handle higher throughput? Otherwise I would still be limited to the performance of the unit itself (155 MB/s).

I think so. By the way, I've gotten better performance than that with my driver (not sure about the production driver). I seem to recall about 220 MB/sec. (I was basically driving the PCIe x1 bus to its limit.) This was with large transfers (sized at 64k, IIRC). Shrinking the job size down, I could get up to 150K IOPS with 512-byte jobs. (That high IOP rate is unrealistic for ZFS - for ZFS, the bus bandwidth limitation comes into play long before you start hitting IOPS limitations.) One issue, of course, is that each of these units occupies a PCIe x1 slot.

On another note, if your dataset and usage requirements don't demand strict I/O flush/sync guarantees, you could probably get away without any ZIL at all and just use lots of RAM to get really good performance. (You'd then disable the ZIL on filesystems that don't have this need. This is a very new feature in OpenSolaris.) Of course, you don't want to do this for data sets where loss of the data would be tragic. (But it's ideal for situations such as filesystems used for compiling, etc. - where the data being written can easily be regenerated in the event of a failure.)

-- Garrett
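[The per-dataset control Garrett refers to - integrated into OpenSolaris around build 140; on older builds only the global zil_disable tunable existed:

    zfs set sync=disabled tank/build    # unsafe for data you cannot regenerate
    zfs set sync=standard tank/build    # back to normal POSIX sync semantics
]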
[zfs-discuss] NexentaStor Community edition 3.0.3 released
Hi All,

On behalf of the NexentaStor team, I'm happy to announce the release of NexentaStor Community Edition 3.0.3. This release is the result of the community efforts of Nexenta partners and users.

Changes over 3.0.2 include:
* Many fixes to ON/ZFS backported to b134.
* Multiple bug fixes in the appliance.

With the addition of many new features, NexentaStor CE is the *most complete* and feature-rich gratis unified storage solution today.

Quick Summary of Features
-
* ZFS additions: deduplication (based on OpenSolaris b134).
* Free for up to 12 TB of *used* storage.
* Community edition supports easy upgrades.
* Many new features in the easy-to-use management interface.
* Integrated search.

Grab the ISO from http://www.nexentastor.org/projects/site/wiki/CommunityEdition

If you are a storage solution provider, we invite you to join our growing social network at http://people.nexenta.com.

-- 
Thanks,
Anil Gulecha
Community Leader
http://www.nexentastor.org
Re: [zfs-discuss] OCZ Devena line of enterprise SSD
On Mon, Jun 14, 2010 at 2:07 PM, Roger Hernandez rhvar...@gmail.com wrote:
> OCZ has a new line of enterprise SSDs, based on the SandForce 1500 controller.

The SLC-based drive should be great as a ZIL, and the MLC drives should be a close second. Neither is cost-effective as an L2ARC, since the cache device doesn't require resiliency or high random IOPS. A previous-generation drive (such as the Vertex or X25-M) is probably sufficient.

-B

-- 
Brandon High : bh...@freaks.com
Re: [zfs-discuss] Dedup... still in beta status
On 6/15/2010 9:03 AM, Fco Javier Garcia wrote:
> [dedup stability report quoted; trimmed]

Frankly, dedup isn't practical for anything but enterprise-class machines. It's certainly not practical for desktops or anything remotely low-end. This isn't just a ZFS issue - all the implementations I've seen so far require enterprise-class solutions.

Realistically, I think people are overly enamored with dedup as a feature - I would generally only consider it worthwhile in cases where you get significant savings. And by significant, I'm talking an order of magnitude of space savings. A 2x saving isn't really enough to counteract the downsides, especially when even enterprise disk space is (relatively) cheap.

That all said, ZFS dedup is still definitely beta. There are known severe bugs and performance issues which will take time to fix, as not all of them have obvious solutions. Given current schedules, I predict that it should be production-ready some time in 2011. *When* in 2011, I couldn't hazard... maybe in time to make Solaris 10 Update 12 or so? <grin>

-- 
Erik Trimble
Java System Support
[zfs-discuss] SSDs adequate ZIL devices?
There have been many threads in the past asking about ZIL devices. Most of them end up recommending the Intel X25 as an adequate device. Nevertheless, there is always the warning about them not heeding cache flushes. But what use is a ZIL that ignores cache flushes? If I'm willing to tolerate that (I'm not), I can just as well take a mechanical drive and force ZFS not to issue cache flushes to it. In that case it can easily compete with an SSD in regard to IOPS and bandwidth. In case of a power failure I will likely lose about as many writes as I would with an SSD - a few milliseconds' worth. So why buy an SSD for the ZIL at all?
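[Forcing ZFS not to issue cache flushes, as the poster describes, is a real if dangerous knob - and a global one, affecting every pool on the host:

    # permanently, via /etc/system (takes effect after reboot):
    #   set zfs:zfs_nocacheflush = 1
    # or on a live system, for testing only:
    echo zfs_nocacheflush/W0t1 | mdb -kw
]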
Re: [zfs-discuss] Dedup... still in beta status
> Realistically, I think people are overly enamored with dedup as a feature - I would generally only consider it worthwhile in cases where you get significant savings. And by significant, I'm talking an order of magnitude of space savings. A 2x saving isn't really enough to counteract the downsides, especially when even enterprise disk space is (relatively) cheap.

I think dedup may have its greatest appeal in VDI environments. Think about an environment where 85% of the data that a virtual machine needs is in the ARC or L2ARC - it's like a dream: almost instantaneous response, and you can boot a new machine in a few seconds.

> That all said, ZFS dedup is still definitely beta. There are known severe bugs and performance issues which will take time to fix, as not all of them have obvious solutions. Given current schedules, I predict that it should be production-ready some time in 2011. *When* in 2011, I couldn't hazard... maybe in time to make Solaris 10 Update 12 or so?

Yes... so you can start patching Solaris on Monday... and perhaps it will be finished on Tuesday (but next week).
Re: [zfs-discuss] Dedup... still in beta status
On 6/15/2010 10:52 AM, Erik Trimble wrote:
> [previous message quoted; trimmed]

One thing here - I forgot to say that this is my opinion, based on my observations and conversations on this list, and I in no way speak for Oracle officially, or as a member of the ZFS team (which I'm not).

-- 
Erik Trimble
Java System Support
Re: [zfs-discuss] Dedup... still in beta status
From: Fco Javier Garcia
Sent: Tuesday, June 15, 2010 11:21 AM

> I think dedup may have its greatest appeal in VDI environments. Think about an environment where 85% of the data that a virtual machine needs is in the ARC or L2ARC - it's like a dream: almost instantaneous response, and you can boot a new machine in a few seconds.

Does dedup help in the ARC/L2ARC space? For some reason, I have it in my head that each time a block is requested from storage it is copied into the cache; therefore, if I had 10 VMs requesting the same deduped block, there would be 10 copies of the same block in the ARC/L2ARC.

Geoff
Re: [zfs-discuss] Dedup... still in beta status
> or as a member of the ZFS team (which I'm not).

Then you have to be brutally good with Java.
Re: [zfs-discuss] Dedup... still in beta status
On 6/15/2010 11:49 AM, Geoff Nordli wrote:
> Does dedup help in the ARC/L2ARC space? For some reason, I have it in my head that each time a block is requested from storage it is copied into the cache; therefore, if I had 10 VMs requesting the same deduped block, there would be 10 copies of the same block in the ARC/L2ARC.

No, that's not correct. It's the *same* block, regardless of where it was referenced from. The cached block has no idea where it was referenced from (that's in the metadata). So even if I have 10 VMs requesting access to 10 different files, if those files have been deduped, then any common (i.e. deduped) blocks will be stored only once in the ARC/L2ARC.

-- 
Erik Trimble
Java System Support
Re: [zfs-discuss] Dedup... still in beta status
On 6/15/2010 11:53 AM, Fco Javier Garcia wrote:
> Then you have to be brutally good with Java.

Thanks, but I do get it wrong every so often (hopefully rarely). More importantly, I don't know anything about the internal goings-on of the ZFS team, so I have nothing extra to say about schedules, plans, timing, etc. that everyone else doesn't already know. I can only speculate based on what's been said publicly on those topics. E.g., I wish I knew when certain bugs would be fixed, but I don't have any more visibility into that than the public does.

-- 
Erik Trimble
Java System Support
[zfs-discuss] Complete Linux Noob
I have been researching different types of RAID, and I happened across raidz, and I am blown away. I have been trying to find resources to answer some of my questions, but many of them are either over my head in terms of detail or foreign to me, as I am a Linux noob, and I have to admit I have never even looked at Solaris.

Are the parity drives just that - drives assigned to parity - or is the parity shared over several drives?

I understand that you can build a raidz2 that will have 2 parity disks. So in theory I could lose 2 disks and still rebuild my array, so long as they are not both the parity disks, correct?

I understand that you can have spares assigned to the RAID, so that if a drive fails, it will immediately grab the spare and rebuild the damaged drive's contents. Is this correct?

Now, I cannot find anything on how much space is taken up in a raidz1 or raidz2. If all the drives are the same size, does a raidz2 take up the space of 2 of the drives for parity, or is the space calculation different?

I get that you cannot expand a raidz as you would a normal RAID, by simply slapping on a drive. Instead, it seems that the preferred method is to create a new raidz. Now let's say that I want to add another raidz1 to my system - can I get the OS to present this as one big drive with the space from both raid pools?

How do I share these types of raid pools across the network? Or more specifically, how do I access them from Windows-based systems? Is there any special trick?
Re: [zfs-discuss] Complete Linux Noob
> snip/
> How do I share these types of raid pools across the network? Or more specifically, how do I access them from Windows-based systems? Is there any special trick?

Most of your questions are answered here:
http://hub.opensolaris.org/bin/download/Community+Group+zfs/docs/zfslast.pdf

Vennlige hilsener / Best regards

roy
-- 
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
Re: [zfs-discuss] Native ZFS for Linux
On Tue, 15 Jun 2010, Joerg Schilling wrote:
> Sorry, but your reply is completely misleading: the people who claim that there is a legal problem with having ZFS in the Linux kernel would of course also claim that Reiserfs cannot be in the FreeBSD kernel.

It seems that it is a license violation to link a computer containing GPLed code to the Internet. I think I heard on Usenet or a blog that it was illegal to link GPLed code with non-GPLed code. The Internet itself is obviously a derived work and is therefore subject to the GPL.

Bob
-- 
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] OCZ Devena line of enterprise SSD
Price? I cannot find it. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Native ZFS for Linux
Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote: On Tue, 15 Jun 2010, Joerg Schilling wrote: Sorry but your reply is completely misleading as the people who claim that there is a legal problem with having ZFS in the Linux kernel would of course also claim that Reiserfs cannot be in the FreeBSD kernel. It seems that it is a license violation to link a computer containing GPLed code to the Internet. I think I heard on Usenet or a blog that it was illegal to link GPLed code with non-GPLed code. The Internet itself is obviously a derived work and is therefore subject to the GPL.

This is what e.g. Lawrence Rosen also mentions ;-) BTW: Our preliminary license compatibility information is now on-line: http://www.osscc.net/en/licenses.html#compatibility To switch to German, use the top level at: http://www.osscc.net/en/index.html

Most people may know the Open Source book by Lawrence Rosen (see the link on our web page). I have a new paper on license combinations by my colleague Tom Gordon (a US lawyer) on our server at: http://www.osscc.net/pdf/QualipsoA1D113.pdf

Hope this helps to understand things better. Jörg -- EMail: jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin j...@cs.tu-berlin.de (uni) joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Complete Linux Noob
On Tue, June 15, 2010 14:13, CarlPalmer wrote: I have been researching different types of RAID, and I happened across raidz, and I am blown away. I have been trying to find resources to answer some of my questions, but many of them are either over my head in terms of details, or foreign to me as I am a Linux noob, and I have to admit I have never even looked at Solaris.

Heh; caught another one :-) .

Are the parity drives just that, a drive assigned to parity, or is the parity shared over several drives?

No drives are formally designated for parity; all n drives in the RAIDZ vdev are used together in such a way that you can lose one drive without loss of data, but exactly which bits are data and which bits are parity, and where they are stored, is not something the admin has to think about or know (and in fact cannot know).

I understand that you can build a raidz2 that will have 2 parity disks. So in theory I could lose 2 disks and still rebuild my array so long as they are not both the parity disks, correct?

Any two disks out of a raidz2 vdev can be lost. Lose a third before the recovery completes and your data is toast.

I understand that you can have spares assigned to the RAID, so that if a drive fails, it will immediately grab the spare and rebuild the damaged drive. Is this correct?

Yes, RAIDZ (including z2 and z3) and mirror vdevs will grab a hot spare if one is assigned and needed, and start the resilvering operation immediately.

Now I cannot find anything on how much space is taken up in a raidz1 or raidz2. If all the drives are the same size, does a raidz2 take up the space of 2 of the drives for parity, or is the space calculation different?

That's the right calculation.

I get that you cannot expand a raidz as you would a normal RAID, by simply slapping on a drive. Instead it seems that the preferred method is to create a new raidz. Now let's say that I want to add another raidz1 to my system; can I get the OS to present this as one big drive with the space from both raid pools?

You can't expand a normal RAID, either, anywhere I've ever seen. A pool can contain multiple vdevs. You can add additional vdevs to a pool, and the new space becomes immediately available to the pool, and hence to anything (like a filesystem) drawing from that pool. (The zpool command will attempt to stop you from mixing vdevs of different redundancy in the same pool, but you can force it to let you. Mixing a RAIDZ vdev and a RAIDZ3 vdev in the same pool is a silly thing to do, since you don't control where in the pool any new data goes, and it's likely to be striped across the vdevs in the pool.)

You can also replace all the drives in a vdev, serially (waiting for the resilver to complete at each step before continuing to the next drive), and if the new drives are larger than the old drives, when you've replaced all of them the new space will be usable in that vdev. This is particularly useful with mirrors, where there are only two drives to replace. (Well, actually, ZFS mirrors can have any number of drives. To avoid the risk of loss when upgrading the drives in a mirror, attach the new bigger drive FIRST, wait for the resilver, and THEN detach one of the smaller original drives; repeat for the second drive, and you will never go to a redundancy lower than 2. You can even attach BOTH new disks at once, if you have the slots and controller space, and have a 4-way mirror for a while. Somebody reported configuring ALL the drives in a 'Thumper' as a mirror, a 48-way mirror, just to see if it worked.
It did.)

How do I share these types of raid pools across the network? Or more specifically, how do I access them from Windows-based systems? Is there any special trick?

Nothing special. The in-kernel CIFS server is better than Samba, and supports full NTFS ACLs. I hear it also attaches to AD cleanly, but I haven't done that; I don't run AD at home. -- David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
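To make the mechanics above concrete, here is a minimal sketch of the operations described in this message. The device names (c0t0d0 and friends) and pool/dataset names are placeholders; check zpool(1M) and zfs(1M) on your own system before running anything like this on real disks:

    # create a pool with one raidz2 vdev and a hot spare
    zpool create tank raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 spare c0t5d0

    # grow the pool later by adding a second raidz2 vdev;
    # the new space is available to the pool (and its filesystems) immediately
    zpool add tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0

    # upgrade a two-way mirror to bigger disks without dropping redundancy:
    # attach the new disk first, wait for the resilver, then detach an old one
    zpool create mpool mirror c2t0d0 c2t1d0
    zpool attach mpool c2t0d0 c3t0d0     # temporarily a 3-way mirror
    zpool status mpool                   # wait until the resilver completes
    zpool detach mpool c2t0d0

    # share a filesystem to Windows clients via the in-kernel CIFS server
    zfs create tank/share
    zfs set sharesmb=on tank/share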
Re: [zfs-discuss] zpool export / import discrepancy
On Tue, Jun 15, 2010 at 1:56 PM, Scott Squires ssqui...@gmail.com wrote: Is ZFS dependent on the order of the drives? Will this cause any issue down the road? Thank you all;

No. In your case the logical device names changed, but ZFS still identified the disks correctly, just as they were before. -- Giovanni ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSDs adequate ZIL devices?
On Tue, 15 Jun 2010, Arne Jansen wrote: In case of a power failure I will likely lose about as many writes as I do with SSDs, a few milliseconds. I agree with your concerns, but the data loss may span as much as 30 seconds rather than just a few milliseconds. Using an SSD as the ZIL allows zfs to turn a synchronous write into a normal batched async write which is scheduled for the next TXG. Zfs intentionally postpones writes. Without the SSD, zfs needs to write to an intent log in the main pool (consuming precious IOPS) or write directly to the main pool (consuming precious response latency). Battery-backed RAM in the adaptor card or storage array can do almost as well as the SSD as long as the amount of data does not overrun the limited write cache. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
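For reference, attaching a dedicated slog (or L2ARC) to an existing pool is a one-liner; the pool and device names below are hypothetical:

    # add a single slog device to an existing pool
    zpool add tank log c4t0d0

    # or, more safely, a mirrored slog
    zpool add tank log mirror c4t0d0 c4t1d0

    # an L2ARC (cache) device is added the same way
    zpool add tank cache c4t2d0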
Re: [zfs-discuss] ZFS Data Loss on system crash/upgrade
Cindy, I've attached the results of the fmdump -eV command. They don't tell me anything, but someone more knowledgeable might be able to decipher them. As for snapshots, the earliest one I have is from after the new hardware. It doesn't appear that there are any snapshots from before the crash, and I'm positive I had them enabled through the GUI with Time Slider.

The more research I do into figuring out how to solve the issue, the curiouser and curiouser this issue becomes. I knew ZFS was designed to be fault tolerant, but *nothing* I've read has even hinted that something like this might have a remote possibility of occurring. Before I blew away the previous system install, I took a snapshot of it and dumped it on a spare hard drive I had lying around. My next step might be figuring out how to get that snapshot back on a hard drive to boot again, but that seems very difficult because I stupidly upgraded the ZFS version, so the live CD will not allow me to import/export it. I'll try that as a last-ditch effort, but at this point I'm very curious as to how something like this could have happened, and how one could recover from it.

Thanks again for all your help, Austin -- This message posted from opensolaris.org

TIME                           CLASS
May 28 2010 07:57:25.712193068 ereport.fs.zfs.checksum
nvlist version: 0
        class = ereport.fs.zfs.checksum
        ena = 0xd953c9a23d51
        detector = (embedded nvlist)
                nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0x9cb8636d530e0eab
                vdev = 0x3d14c4fd46fc236d
        (end detector)
        pool = zarray1
        pool_guid = 0x9cb8636d530e0eab
        pool_context = 0
        pool_failmode = wait
        vdev_guid = 0x3d14c4fd46fc236d
        vdev_type = disk
        vdev_path = /dev/dsk/c11d1s0
        vdev_devid = id1,c...@asamsung_hd103sj=s246jdwz407200/a
        parent_guid = 0xfca8d0350c5e7014
        parent_type = mirror
        zio_err = 50
        zio_offset = 0x2e3fe600
        zio_size = 0x400
        zio_objset = 0x1e
        zio_object = 0x36172
        zio_level = 1
        zio_blkid = 0x24
        cksum_expected = 0x679a94912e 0x44a636ab8ab3 0x19330206d17513 0x69869956c85878c
        cksum_actual = 0x4a4c39042c 0x25f430d983f8 0xce6275bab1357 0x34cba59b47c2e80
        cksum_algorithm = fletcher4
        bad_ranges = 0x0 0x400
        bad_ranges_min_gap = 0x8
        bad_range_sets = 0x63d
        bad_range_clears = 0x83a
        bad_set_histogram = 0x20 0xd 0x27 0x19 0x13 0xf 0x11 0x2e 0x15 0xc 0x15 0x23 0x1e 0x1e 0x18 0x2a
            0xd 0x13 0xe 0xe 0x13 0x19 0x18 0x2a 0x1d 0xb 0x10 0x17 0x1d 0xe 0x21 0x26
            0x2c 0x12 0x2b 0x12 0x1e 0x10 0xf 0x2e 0x15 0xb 0x11 0x21 0x1f 0x19 0x19 0x22
            0xd 0x15 0x13 0x13 0x18 0x1b 0x1c 0x2d 0x1e 0x9 0xd 0x1b 0x1e 0xe 0x1b 0x28
        bad_cleared_histogram = 0x17 0x36 0x17 0x26 0x26 0x27 0x25 0xb 0x24 0x3b 0x1e 0x23 0x20 0x1c 0x23 0x9
            0x29 0x2f 0x23 0x27 0x25 0x22 0x1f 0xa 0x18 0x3c 0x2a 0x1c 0x1e 0x27 0x1c 0xa
            0x14 0x30 0x12 0x21 0x19 0x21 0x23 0xf 0x22 0x3d 0x24 0x1c 0x20 0x1f 0x22 0x11
            0x27 0x32 0x26 0x28 0x26 0x20 0x22 0xd 0x1c 0x3e 0x23 0x1f 0x1e 0x28 0x1a 0x8
        __ttl = 0x1
        __tod = 0x4bffafa5 0x2a73342c

May 28 2010 07:58:34.612731735 ereport.fs.zfs.checksum
nvlist version: 0
        class = ereport.fs.zfs.checksum
        ena = 0xda54764ec1b00801
        detector = (embedded nvlist)
                nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0x9cb8636d530e0eab
                vdev = 0x3d14c4fd46fc236d
        (end detector)
        pool = zarray1
        pool_guid = 0x9cb8636d530e0eab
        pool_context = 0
        pool_failmode = wait
        vdev_guid = 0x3d14c4fd46fc236d
        vdev_type = disk
        vdev_path = /dev/dsk/c11d1s0
        vdev_devid = id1,c...@asamsung_hd103sj=s246jdwz407200/a
        parent_guid = 0xfca8d0350c5e7014
        parent_type = mirror
        zio_err = 50
        zio_offset = 0x2c269600
        zio_size = 0x400
        zio_objset = 0x1e
        zio_object = 0x36172
        zio_level = 1
        zio_blkid = 0x14
        cksum_expected = 0x6de977087a 0x4702ac30c13e 0x19a6384bcbc3b5 0x6a7654eca5ac6b8
        cksum_actual = 0xbaddcafe00 0x5dcc54647f00 0x1f82a459c2aa00 0x7f84b11b3fc7f80
        cksum_algorithm = fletcher4
        bad_ranges = 0x0 0x400
        bad_ranges_min_gap = 0x8
        bad_range_sets = 0xd60
        bad_range_clears = 0x33a
        bad_set_histogram = 0x4c 0x33 0x4e 0x4c 0x52 0x4f 0x53 0x0 0x52 0x31 0x0 0x0 0x50 0x0 0x50 0x0
            0x4d 0x35 0x0 0x51 0x4d 0x4e 0x0 0x6a 0x53 0x0 0x4e 0x50 0x4f 0x0 0x50 0x0
            0x4e 0x36 0x53 0x4d 0x4e 0x4f 0x50 0x0 0x4c 0x34 0x0 0x0 0x50 0x0 0x53 0x0
            0x53 0x38 0x0 0x57 0x4e 0x50 0x0 0x74 0x4e 0x0 0x53 0x53 0x54 0x0 0x58 0x0
        bad_cleared_histogram = 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x14 0x0 0x0 0x2e 0x2a 0x0 0x31 0x0 0x10
            0x0 0x0 0x32 0x0 0x0 0x0
Re: [zfs-discuss] SSDs adequate ZIL devices?
So why buy SSD for ZIL at all? For the record, not all SSDs ignore cache flushes. There are at least two SSDs sold today that guarantee synchronous write semantics: the Sun/Oracle LogZilla and the DDRdrive X1. Also, I believe it is more accurate to describe the root cause as not power-protecting on-board volatile caches: the X25-E does implement the ATA FLUSH CACHE command, but it does not have the power protection required to avoid transaction (data) loss. Best regards, Christopher George Founder/CTO www.ddrdrive.com -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)
On 15/06/2010 12:42, Arve Paalsrud arve.paals...@gmail.com wrote: Hi, We are currently building a storage box based on OpenSolaris/Nexenta using ZFS. Our hardware specifications are as follows: Quad AMD G34 12-core 2.3 GHz (~110 GHz) 10 Crucial RealSSD (6Gb/s) 42 WD RAID Ed. 4 2TB disks + 6Gb/s SAS expanders LSI2008SAS (two 4x ports) Mellanox InfiniBand 40 Gbit NICs

I was told that IB support in Nexenta is scheduled to be released in 3.0.4 (beginning of July).

128 GB RAM This setup gives us about 40TB storage after mirror (two disks in spare), 2.5TB L2ARC and 64GB Zil, all fit into a single 5U box. Both L2ARC and Zil shares the same disks (striped) due to bandwidth requirements. Each SSD has a theoretical performance of 40-50k IOPS on 4k read/write scenario with 70/30 distribution. Now, I know that you should have mirrored Zil for safety, but the entire box are synchronized with an active standby on a different site location (18km distance - round trip of 0.16ms + equipment latency). So in case the Zil in Site A takes a fall, or the motherboard/disk group/motherboard dies - we still have safety. DDT requirements for dedupe on 16k blocks should be about 640GB when main pool are full (capacity). Without going into details about chipsets and such, do any of you on this list have any experience with a similar setup and can share with us your thoughts, do's and dont's, and any other information that could be of help while building and configuring this? What I want to achieve is 2 GB/s+ NFS traffic against our ESX clusters (also InfiniBand-based), with both dedupe and compression enabled in ZFS.

As VMware does not currently support NFS over RDMA, you will need to stick with IPoIB, which will suffer from some performance implications inherent to the traditional TCP/IP stack. You could also use iSER or SRP, which are both supported.

Let's talk moon landings. Regards, Arve -- Przem ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
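As a rough sketch of the NFS side only, here is how a datastore for the ESX hosts might be exported; the dataset name and subnet are invented, none of the IPoIB/iSER tuning is shown, and the recordsize choice is just an assumption to match the 16k dedup block size mentioned above:

    zfs create tank/vmstore
    zfs set sharenfs='rw,root=@192.168.10.0/24' tank/vmstore   # export to the ESX subnet
    zfs set recordsize=16k tank/vmstore                        # hypothetical, matching 16k dedup blocks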
Re: [zfs-discuss] SSDs adequate ZIL devices?
On 15/06/2010 23:46, Christopher George cgeo...@ddrdrive.com wrote: So why buy SSD for ZIL at all? For the record, not all SSDs ignore cache flushes. There are at least two SSDs sold today that guarantee synchronous write semantics: the Sun/Oracle LogZilla and the DDRdrive X1. Also, I believe it is more accurate to describe the root cause as not power-protecting on-board volatile caches: the X25-E does implement the ATA FLUSH CACHE command, but it does not have the power protection required to avoid transaction (data) loss. Best regards, Christopher George Founder/CTO www.ddrdrive.com

Often forgotten (most probably due to the price) are the latest Pliant SSDs. -- Przem ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool export / import discrepancy
http://blogs.sun.com/constantin/entry/csi_munich_how_to_save

2010/6/15 Scott Squires ssqui...@gmail.com

Hello All, I've migrated a JBOD of 16 drives from one server to another. I did a zpool export from the old system and a zpool import to the new system. One thing I did notice is since the drives are on a different controller card, the naming is different (as expected) but the order is also different. I set up the drives as passthrough on the controller card and went through each drive incrementally. I assumed the zpool import would have listed the drives in the order of c10t2d0, d1, d2, ... c10t3d7. As shown below, the order the drives were imported in is c10t2d0, d2, d3, d1, c10t3d0 through d7.

Original zpool setup on old server:

zpool status backup
  pool: backup
 state: ONLINE
config:
        NAME         STATE     READ WRITE CKSUM
        backup       ONLINE       0     0     0
          raidz2     ONLINE       0     0     0
            c7t1d0   ONLINE       0     0     0
            c7t2d0   ONLINE       0     0     0
            c7t3d0   ONLINE       0     0     0
            c7t4d0   ONLINE       0     0     0
            c7t5d0   ONLINE       0     0     0
            c7t6d0   ONLINE       0     0     0
            c7t7d0   ONLINE       0     0     0
            c7t8d0   ONLINE       0     0     0
            c7t9d0   ONLINE       0     0     0
            c7t10d0  ONLINE       0     0     0
            c7t11d0  ONLINE       0     0     0
            c7t12d0  ONLINE       0     0     0
            c7t13d0  ONLINE       0     0     0
            c7t14d0  ONLINE       0     0     0
            c7t15d0  ONLINE       0     0     0
        spares
          c7t16d0    AVAIL

Imported zpool on new server:

zpool status backup
  pool: backup
 state: ONLINE
config:
        NAME         STATE     READ WRITE CKSUM
        backup       ONLINE       0     0     0
          raidz2     ONLINE       0     0     0
            c10t2d0  ONLINE       0     0     0
            c10t2d2  ONLINE       0     0     0
            c10t2d3  ONLINE       0     0     0
            c10t2d1  ONLINE       0     0     0
            c10t2d4  ONLINE       0     0     0
            c10t2d5  ONLINE       0     0     0
            c10t2d6  ONLINE       0     0     0
            c10t2d7  ONLINE       0     0     0
            c10t3d0  ONLINE       0     0     0
            c10t3d1  ONLINE       0     0     0
            c10t3d2  ONLINE       0     0     0
            c10t3d3  ONLINE       0     0     0
            c10t3d4  ONLINE       0     0     0
            c10t3d5  ONLINE       0     0     0
            c10t3d6  ONLINE       0     0     0
        spares
          c10t3d7    AVAIL

Is ZFS dependent on the order of the drives? Will this cause any issue down the road? Thank you all; Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Frank Contrepois Coblan srl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
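For the archives, the move itself is just the export/import pair described above (run as root; pool name as in the listing):

    # on the old server
    zpool export backup

    # on the new server: scan for importable pools, then import by name
    zpool import
    zpool import backup

ZFS locates the member disks by the labels written on them rather than by device path, which is why the controller renumbering is harmless.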
Re: [zfs-discuss] Dedup... still in beta status
On 06/15/10 10:52, Erik Trimble wrote: Frankly, dedup isn't practical for anything but enterprise-class machines. It's certainly not practical for desktops or anything remotely low-end.

We're certainly learning a lot about how zfs dedup behaves in practice. I've enabled dedup on two desktops and a home server and so far haven't regretted it on those three systems. However, they each have a more than typical amount of memory (4G and up), a data pool on two or more large-capacity SATA drives, plus an X25-M ssd sliced into a root pool as well as l2arc and slog slices for the data pool (see below [1]).

I tried enabling dedup on a smaller system (with only 1G memory and a single very slow disk), observed serious performance problems, and turned it off pretty quickly.

I think, with current bits, it's not a simple matter of "ok for enterprise, not ok for desktops". With an ssd for either main storage or l2arc, and/or enough memory, and/or a not very demanding workload, it seems to be ok. For one such system, I'm seeing:

# zpool list z
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
z      464G   258G   206G    55%  1.25x  ONLINE  -

# zdb -D z
DDT-sha256-zap-duplicate: 432759 entries, size 304 on disk, 156 in core
DDT-sha256-zap-unique: 1094244 entries, size 298 on disk, 151 in core

dedup = 1.25, compress = 1.44, copies = 1.00, dedup * compress / copies = 1.80

- Bill

[1] To forestall responses of the form "you're nuts for putting a slog on an X25-M", which is off-topic for this thread and being discussed elsewhere: yes, I'm aware of the write-cache issues on power failure on the X25-M. For my purposes, it's a better robustness/performance tradeoff than either zil-on-spinning-rust or zil disabled, because: a) for many potential failure cases on whitebox hardware running bleeding-edge opensolaris bits, the X25-M will not lose power and thus the write cache will stay intact across a crash; b) even if it loses power and loses some writes in flight, it's not likely to lose *everything* since the last txg sync. It's good enough for my personal use. Your mileage will vary. As always, system design involves tradeoffs. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
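For anyone who wants to repeat that measurement, a short sketch follows; the pool and dataset names are placeholders, and the memory estimate simply multiplies the entry counts and in-core sizes quoted above:

    zfs set dedup=on z/data    # dedup is a per-dataset property
    zpool list z               # the DEDUP column is the pool-wide ratio
    zdb -D z                   # DDT entry counts and per-entry sizes

    # rough in-core DDT footprint from the figures quoted above:
    #   432759 * 156 + 1094244 * 151 bytes  =~  222 MiB of ARC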
Re: [zfs-discuss] Dedup... still in beta status
On Jun 15, 2010, at 14:20, Fco Javier Garcia wrote: I think dedup may have its greatest appeal in VDI environments (think about an environment where 85% of the data that the virtual machines need is in ARC or L2ARC... it is like a dream... almost instantaneous response... and you can boot a new machine in a few seconds)...

This may also be accomplished by using snapshots and clones of data sets. At least for OS images: user profiles and documents could be something else entirely.

Another situation that comes to mind is perhaps as the back end to a mail store: if you send out a message with attachments to a lot of people, the attachment blocks could be deduped (and perhaps compressed as well, since base-64 adds 1/3 overhead). ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dedup... still in beta status
On Tue, Jun 15, 2010 at 7:28 PM, David Magda dma...@ee.ryerson.ca wrote: On Jun 15, 2010, at 14:20, Fco Javier Garcia wrote: I think dedup may have its greatest appeal in VDI environments (think about an environment where 85% of the data that the virtual machines need is in ARC or L2ARC... it is like a dream... almost instantaneous response... and you can boot a new machine in a few seconds)... This may also be accomplished by using snapshots and clones of data sets. At least for OS images: user profiles and documents could be something else entirely.

It all depends on the nature of the VDI environment. If the VMs are regenerated on each login, the snapshot + clone mechanism is sufficient. Deduplication is not needed. However, if VMs have a long life and get periodic patches and other software updates, deduplication will be required if you want to remain at somewhat constant storage utilization.

It probably makes a lot of sense to be sure that swap or page files are on a non-dedup dataset. Executables and shared libraries shouldn't be getting paged out to it, and the likelihood that multiple VMs page the same thing to swap or a page file is very small.

Another situation that comes to mind is perhaps as the back end to a mail store: if you send out a message with attachments to a lot of people, the attachment blocks could be deduped (and perhaps compressed as well, since base-64 adds 1/3 overhead).

It all depends on how this is stored. If the attachments are stored as they were in 1990, as part of an mbox format, you will be very unlikely to get the proper block alignment. Even storing the message body (including headers) in the same file as the attachment may not align the attachments, because the mail headers may differ (e.g. different recipients' messages took different paths, some were forwarded, etc.). If the attachments are stored in separate files, or a database format is used that stores attachments separately from the message (with matching database and zfs block sizes), things may work out favorably. However, a system that detaches messages and stores attachments separately might just as well store each attachment in a file named by its SHA256 hash, assuming that file doesn't already exist. If it does exist, it can just increment a reference count. In other words, an intelligent mail system should already dedup. Or at least that is how I would have written it for the last decade or so... -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
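A toy sketch of that last idea, using Solaris digest(1) and invented paths: store each attachment under its SHA256 hash and let the filesystem's hard-link count act as the reference count:

    h=$(digest -a sha256 attach.pdf)                 # content hash of the attachment
    [ -f "/store/$h" ] || cp attach.pdf "/store/$h"  # first sender creates the blob
    ln "/store/$h" /mail/msg-42.attach               # every later message just adds a link
    # the inode's link count is the reference count; unlinking a message's
    # copy decrements it, and the space is freed when the last link goes away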
Re: [zfs-discuss] size of slog device
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Bob Friesenhahn: It is good to keep in mind that only small writes go to the dedicated slog. Large writes go to the main store. A succession of that many small writes (to fill RAM/2) is highly unlikely. Also, the ZIL is not read back unless the system is improperly shut down.

Can anyone verify this? I thought the decision for small vs. large sync writes going to the log vs. the main store was determined by zfs_immediate_write_sz and logbias. logbias was introduced in snv_122, which is zpool 18 or 19. zfs_immediate_write_sz seems to have been around forever (I see comments about it as early as 2006). Then again, I can't seem to find my zfs_immediate_write_sz via either zpool or zfs. Can anybody say which zpool version introduced zfs_immediate_write_sz, or perhaps I'm using the wrong commands to try and see mine?

zpool get all rpool | grep zfs_immediate_write_sz ; zfs get all rpool | grep zfs_immediate_write_sz

I thought, if you didn't explicitly tune these, all sync writes go to the ZIL before the main store. Can't seem to find any way to verify this. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] size of slog device
On Jun 15, 2010, at 8:13 PM, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Bob Friesenhahn It is good to keep in mind that only small writes go to the dedicated slog. Large writes to to main store. A succession of that many small writes (to fill RAM/2) is highly unlikely. Also, that the zil is not read back unless the system is improperly shut down. Can anyone verify this? I thought the decision for small vs large sync writes to go to log vs main store was determined by zfs_immediate_write_sz and logbias. logbias was introduced in snv_122, which is zpool 18 or 19. zfs_immediate_write_sz seems to have been around forever (I see comments about it as early as 2006). Then again, I can't seem to find my zfs_immediate_write_sz, via either zpool or zfs. Can anybody say what version zpool introduced zfs_immediate_write_sz, or perhaps I'm using the wrong commands to try and see mine? zpool get all rpool | grep zfs_immediate_write_sz ; zfs get all rpool | grep zfs_immediate_write_sz It is an int, as in C, not a parameter tunable by zpool or zfs commands. For NFS service, it can be tuned by the client via wsize. I thought, if you didn't explicitly tune these, all sync writes go to ZIL before the main store. Can't seem to find any way to verify this. Cake. All sync writes go to the ZIL. The ZIL may be in the pool or in the separate log device :-) -- richard -- Richard Elling rich...@nexenta.com +1-760-896-4422 ZFS and NexentaStor training, Rotterdam, July 13-15, 2010 http://nexenta-rotterdam.eventbrite.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
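Concretely, since it is a kernel variable rather than a pool or dataset property, the usual ways to inspect or tune it are mdb and /etc/system; the value below is purely illustrative:

    # read the current value from the running kernel (as root)
    echo zfs_immediate_write_sz/D | mdb -k

    # set it persistently for the next boot, via a line in /etc/system:
    set zfs:zfs_immediate_write_sz = 0x8000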
Re: [zfs-discuss] SSDs adequate ZIL devices?
Bob Friesenhahn wrote: On Tue, 15 Jun 2010, Arne Jansen wrote: In case of a power failure I will likely lose about as many writes as I do with SSDs, a few milliseconds. I agree with your concerns, but the data loss may span as much as 30 seconds rather than just a few milliseconds.

Wait, I'm talking about using an SSD for the ZIL vs. using a dedicated hard drive for the ZIL that is configured to ignore cache flushes. Are you saying I can also lose 30 seconds if I use a badly behaving SSD?

Using an SSD as the ZIL allows zfs to turn a synchronous write into a normal batched async write which is scheduled for the next TXG. Zfs intentionally postpones writes. Without the SSD, zfs needs to write to an intent log in the main pool (consuming precious IOPS) or write directly to the main pool (consuming precious response latency). Battery-backed RAM in the adaptor card or storage array can do almost as well as the SSD as long as the amount of data does not overrun the limited write cache. Bob ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] size of slog device
On Jun 15, 2010, at 8:51 PM, Richard Elling wrote I thought, if you didn't explicitly tune these, all sync writes go to ZIL before the main store. Can't seem to find any way to verify this. Cake. All sync writes go to the ZIL. The ZIL may be in the pool or in the separate log device :-) go to may be too confusing. s/go to/are handled by/ -- richard -- Richard Elling rich...@nexenta.com +1-760-896-4422 ZFS and NexentaStor training, Rotterdam, July 13-15, 2010 http://nexenta-rotterdam.eventbrite.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss