Re: [zfs-discuss] New SSD options

2010-05-19 Thread thomas
40k IOPS sounds like best-case, "you'll never see it in the real world" 
marketing to me. There are a few benchmarks if you google, and they all seem to 
indicate the performance is probably within +/- 10% of an Intel X25-E. I would 
personally trust Intel over one of these drives.

Is it even possible to buy a zeus iops anywhere? I haven't been able to find 
one. I get the impression they mostly sell to other vendors like sun? I'd be 
curious what the price is on a 9GB zeus iops is these days?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New SSD options

2010-05-19 Thread sensille
Don wrote:
 
 With that in mind- Is anyone using the new OCZ Vertex 2 SSDs as a ZIL?
 
 They're claiming 50k IOPS (4k aligned writes), 2 million hour MTBF, TRIM 
 support, etc. That's more write IOPS than the ZEUS (40k IOPS, $) but at 
 half the price of an Intel X25-E (3.3k IOPS, $400).
 
 Needless to say I'd love to know if anyone has evaluated these drives to see 
 if they make sense as a ZIL- for example- do they honor cache flush requests? 
 Are those sustained IOPS numbers?

In my understanding, nearly the only relevant number is the number of cache
flushes a drive can handle per second, as this determines my single-thread
performance. Does anyone have an idea what numbers I can expect from an Intel
X25-E or an OCZ Vertex 2?

-Arne
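
A rough way to put a number on this yourself is to time synchronous writes from
a single thread: each O_DSYNC write has to reach stable storage, so the rate you
get approximates how many flush-bound commits per second the device (and the
filesystem above it) can deliver. A minimal sketch, assuming a scratch file on
the pool under test -- the path and write size are placeholders, and ZFS's own
batching means this is only an approximation of the raw device's flush rate:

/* Rough single-threaded sync-write benchmark: every write() on an O_DSYNC
 * descriptor must reach stable storage, so the achievable rate approximates
 * how many cache-flush-bound commits per second the storage can absorb.
 * The test-file path and write size are assumptions; adjust for your setup.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "/pool/testfile"; /* example path */
    const size_t sz = 4096;      /* one 4 KB write per iteration */
    const int iters = 2000;
    char *buf = malloc(sz);
    struct timeval t0, t1;
    int fd, i;

    if (buf == NULL)
        return (1);
    memset(buf, 0xa5, sz);

    fd = open(path, O_WRONLY | O_CREAT | O_DSYNC, 0600);
    if (fd < 0) {
        perror("open");
        return (1);
    }

    gettimeofday(&t0, NULL);
    for (i = 0; i < iters; i++) {
        if (pwrite(fd, buf, sz, 0) != (ssize_t)sz) {
            perror("pwrite");
            return (1);
        }
    }
    gettimeofday(&t1, NULL);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%d sync writes in %.2f s = %.0f writes/s\n", iters, secs, iters / secs);
    (void) close(fd);
    free(buf);
    return (0);
}

Run it once against a pool whose log sits on the candidate SSD and once against
plain disks; the ratio is the interesting part, not the absolute number.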
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New SSD options

2010-05-19 Thread Brandon High
On Tue, May 18, 2010 at 4:28 PM, Don d...@blacksun.org wrote:
 With that in mind- Is anyone using the new OCZ Vertex 2 SSD's as a ZIL?

The current SandForce drives don't have an ultra-capacitor on them, so they
could lose data if the system crashed. Enterprise-class drives based on the
same chipset, which do have an ultra-cap, are supposed to be released any day
now.

 Needless to say I'd love to know if anyone has evaluated these drives to see 
 if they make sense as a ZIL- for example- do they honor cache flush requests? 
 Are those sustained IOPS numbers?

I don't think they do; the chipset was designed to use an ultra-cap to avoid
having to honor flushes. Then again, the X25-E has the same problem.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New SSD options

2010-05-19 Thread Ragnar Sundblad

On 2010-05-19 08.32, sensille wrote:

Don wrote:


With that in mind- Is anyone using the new OCZ Vertex 2 SSD's as a ZIL?

They're claiming 50k IOPS (4k Write- Aligned), 2 million hour MTBF, TRIM 
support, etc. That's more write IOPS than the ZEUS (40k IOPS, $) but at 
half the price of an Intel X25-E (3.3k IOPS, $400).

Needless to say I'd love to know if anyone has evaluated these drives to see if 
they make sense as a ZIL- for example- do they honor cache flush requests? Are 
those sustained IOPS numbers?


In my understanding nearly the only relevant number is the number
of cache flushes a drive can handle per second, as this determines
my single thread performance.
Has anyone an idea what numbers I can expect from an Intel X25-E or
an OCZ Vertex 2?


I don't know about the OCZ Vertex 2, but the Intel X25-E roughly halves its
IOPS number when you disable its write cache (IIRC, it was in the range of
1300-1600 writes/s or so).
Since it ignores the Cache Flush command and it doesn't have any persistent
buffer storage, disabling the write cache is the best you can do.
Note that there were reports of the Intel X25-E losing a write even with the
write cache disabled! Since they still haven't fixed this after more than a
year on the market, I believe it qualifies as a hardly-usable toy. I am very
disappointed; I had hopes for a new class of cheap but usable flash drives.
Maybe some day...

/ragge
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] scsi messages and mpt warning in log - harmless, or indicating a problem?

2010-05-19 Thread Carson Gaspar

Willard Korfhage wrote:

This afternoon, messages like the following started appearing in 
/var/adm/messages:

May 18 13:46:37 fs8 scsi: [ID 365881 kern.info] 
/p...@0,0/pci8086,2...@1/pci15d9,a...@0 (mpt0):
May 18 13:46:37 fs8 Log info 0x3108 received for target 5.
May 18 13:46:37 fs8 scsi_status=0x0, ioc_status=0x804b, scsi_state=0x1
May 18 13:46:38 fs8 scsi: [ID 365881 kern.info] 
/p...@0,0/pci8086,2...@1/pci15d9,a...@0 (mpt0):
May 18 13:46:38 fs8 Log info 0x3108 received for target 5.
May 18 13:46:38 fs8 scsi_status=0x0, ioc_status=0x804b, scsi_state=0x0
May 18 13:46:40 fs8 scsi: [ID 365881 kern.info] 
/p...@0,0/pci8086,2...@1/pci15d9,a...@0 (mpt0):
May 18 13:46:40 fs8 Log info 0x3108 received for target 5.
May 18 13:46:40 fs8 scsi_status=0x0, ioc_status=0x804b, scsi_state=0x0

...

So, is my system in trouble or not?

Particulars of my system:

% uname -a
SunOS fs8 5.11 snv_134 i86pc i386 i86pc


Welcome to the mpt driver / firmware / something bug! I forget whether your
symptoms were indicative of the card not liking the drives (Hitachis in
particular, which I fixed by upgrading to larger Seagates) or an issue with
MSI support (which I fixed by adding "set xpv_psm:xen_support_msi = -1" to
/etc/system, but I was running a Xen-enabled kernel).


I suggest searching the list archives.

--
Carson
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Very serious performance degradation

2010-05-19 Thread Philippe
 mm.. Service time of sd3..5 are waay too high to be good working disks.
 21 writes shouldn't take 1.3 seconds.
 
 Some of your disks are not feeling well, possibly doing block-reallocation
 like mad all the time, or block recovery of some form. Service times should
 be closer to what sd1 and 2 are doing. sd2,3,4 seems to be getting about the
 same amount of read+write, but their service time is 15-20 times higher.
 This will lead to crap performance (and probably broken array in a while).
 
 /Tomas

Hi!

It is strange, because I've checked the SMART data of the 4 disks and
everything seems really OK! (on another hardware/controller, because I needed
Windows to check it). Maybe it's a problem with the SAS/SATA controller?!

One question: if I halt the server and change the order of the disks on the
SATA array, will RAIDZ still detect the array fine?
The idea is to check whether the results (big service times) depend on the
drive positions or on the hard drives themselves!

Thank you!
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Very serious performance degradation

2010-05-19 Thread Philippe
 How full is your filesystem?  Give us the output of
 zfs list
 You might be having a hardware problem, or maybe it's
 extremely full.

Hi Edward,

The _db filesystems have a recordsize of 16K (the others have the default
128K):

NAME               USED  AVAIL  REFER  MOUNTPOINT
zfs_raid          1,02T  1,65T  28,4K  /zfs_raid
zfs_raid/fs1_db   8,89G  1,65T  7,73G  /home/fs1_db
zfs_raid/fs2      2,68G  1,65T  1,73G  /home/fs2
zfs_raid/fs3      3,38G  1,65T  3,12G  /home/fs3
zfs_raid/fs4      10,1G  1,65T  10,0G  /home/fs4
zfs_raid/fs5       517G  1,65T   326G  /home/fs5
zfs_raid/fs6_db   35,1G  1,65T  28,0G  /home/fs6_db
zfs_raid/fs7      9,22G  1,65T  7,67G  /home/fs7
zfs_raid/fs8_db   22,7G  1,65T  21,6G  /home/fs8_db
zfs_raid/fs9       179G  1,65T   108G  /home/fs9
zfs_raid/fs10      115G  1,65T  97,0G  /home/fs10
zfs_raid/fs11_db  28,6G  1,65T  17,3G  /home/fs11_db
zfs_raid/fs12     17,1G  1,65T  4,70G  /home/fs12
zfs_raid/fs13     9,66G  1,65T  6,77G  /home/fs13
zfs_raid/fs14     4,13G  1,65T  3,12G  /home/fs14
zfs_raid/fs15     15,2G  1,65T  9,48G  /home/fs15
zfs_raid/fs16     14,7G  1,65T  6,59G  /home/fs16
zfs_raid/fs17     7,49G  1,65T  5,31G  /home/fs17
zfs_raid/fs18     41,0G  1,65T  21,6G  /home/fs18


 Also, if you have dedup enabled, on a 3TB filesystem,
 you surely want more
 RAM.  I don't know if there's any rule of thumb you
 could follow, but
 offhand I'd say 16G or 32G.  Numbers based on the
 vapor passing around the
 room I'm in right now.

It seems that the dedup property doesn't exist on my system! Are you sure
this capability is supported in the version of ZFS included in OpenSolaris?

Thank you !
Philippe
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Very serious performance degradation

2010-05-19 Thread Ian Collins

On 05/19/10 09:34 PM, Philippe wrote:

Hi !

It is strange because I've checked the SMART data of the 4 disks, and 
everything seems really OK ! (on another hardware/controller, because I needed 
Windows to check it). Maybe it's a problem with the SAS/SATA controller ?!

One question: if I halt the server and change the order of the disks on the
SATA array, will RAIDZ still detect the array fine?


Yes, it will.

--
Ian.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Very serious performance degradation

2010-05-19 Thread Philippe
 it looks like your 'sd5' disk is performing horribly bad and except for the
 horrible performance of 'sd5' (which bottlenecks the I/O), 'sd4' would look
 just as bad.  Regardless, the first step would be to investigate 'sd5'.

Hi Bob!

I've already tried the pool without the sd5 disk (so the pool was running
degraded), but the performance was still the same... So the sd5 disk itself
is not the (only) bottleneck...


 Use 'iostat -xen' to obtain more information,
 including the number of 
 reported errors.

iostat -xen
                            extended device statistics       ---- errors ---
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   9   0   9 c8t0d0
    0.3    0.4   14.1    3.2  0.0  0.0    0.0   17.9   0   0   0   0   0   0 c7t0d0
   65.5    6.6 1234.3   97.6  0.0  1.0    0.0   14.5   0  14   0   0   0   0 c7t2d0
   70.6    6.1 1229.2   97.6  0.0  1.3    0.0   16.3   0  16   0   0   0   0 c7t3d0
   94.0    6.7 2349.2   97.0  0.0  3.6    0.0   36.1   0  23   0   0   0   0 c7t4d0
   80.4   12.1 2306.5   91.3  0.0 16.6    0.0  179.7   0  68   0   0   0   0 c7t5d0

Thanks !
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] inodes in snapshots

2010-05-19 Thread Chris Gerhard
If I create a file in a file system, snapshot the file system, and then
delete the file, is it guaranteed that, while the snapshot exists, no new
file will be created with the same inode number as the deleted file?

--chris
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] inodes in snapshots

2010-05-19 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 
 If I create a file in a file system and then snapshot the file system.
 
 Then delete the file.
 
 Is it guaranteed that while the snapshot exists no new file will be
 created with the same inode number as the deleted file?

I believe the answer to your question is no.  Meaning: yes, an inode number
could be recycled in the present filesystem, even though that inode number
exists for some other object in a snapshot gone by.  AFAIK.

Informed guesses aside, here's what I really have to say:

You must have a special case if you care about this.  Because a snapshot is
treated as a different device, it's allowed for a new inode to be created in
the present filesystem having the same inode number.  Generally speaking,
there's no reason you should care if an inode number gets recycled.
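
The "different device" part is easy to check for yourself, since ZFS exposes
snapshots under the dataset's .zfs/snapshot directory. A minimal sketch -- the
dataset, snapshot name, and file paths below are just examples:

/* Compare the live copy of a file with the copy visible in a snapshot.
 * The snapshot is exposed as a separate device (different st_dev), which
 * is why an inode number alone does not identify "the same file".
 * Both paths below are illustrative; substitute your dataset and snapshot.
 */
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    const char *live = "/tank/home/chris/report.txt";
    const char *snap = "/tank/home/.zfs/snapshot/monday/chris/report.txt";
    struct stat a, b;

    if (stat(live, &a) != 0 || stat(snap, &b) != 0) {
        perror("stat");
        return (1);
    }
    printf("live: dev=%lu ino=%lu\n", (unsigned long)a.st_dev, (unsigned long)a.st_ino);
    printf("snap: dev=%lu ino=%lu\n", (unsigned long)b.st_dev, (unsigned long)b.st_ino);
    printf("same dev:   %s\n", (a.st_dev == b.st_dev) ? "yes" : "no");
    printf("same inode: %s\n", (a.st_ino == b.st_ino) ? "yes" : "no");
    return (0);
}

On a live file and its snapshot copy you should see st_dev differ even when
st_ino matches, which is why the inode number alone can't identify "the same
file" across the live filesystem and a snapshot.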

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS memory recommendations

2010-05-19 Thread Deon Cui
I am currently doing research on how much memory ZFS should have for a storage 
server.

I came across this blog

http://constantin.glez.de/blog/2010/04/ten-ways-easily-improve-oracle-solaris-zfs-filesystem-performance

It recommends that for every TB of storage you have you want 1GB of RAM just 
for the metadata.

Is this really the case, that ZFS metadata consumes so much RAM?
I'm currently building a storage server which will eventually hold up to 20TB
of storage; I can't fit 20GB of RAM on the motherboard!
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Ideal SATA/SAS Controllers for ZFS

2010-05-19 Thread Deon Cui
My work has bought a bunch of IBM servers recently as ESX hosts. They all come
with LSI SAS1068E controllers as standard, which we remove and replace with a
RAID 5 controller.

So I had a bunch of them lying around. We've bought a 16-bay SAS hot-swap case,
and I've put in an AMD X4 955 BE with an ASUS M4A89GTD Pro as the mobo.

In the two 16x PCI-E slots I've put the 1068E controllers I had lying around.
Everything is still being put together and I haven't even installed
OpenSolaris yet, but I'll see if I can get you some numbers on the controllers
when I'm done.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New SSD options

2010-05-19 Thread Don
Well, 40k IOPS is the current claim from ZEUS, and they're the benchmark. They
used to be 17k IOPS. How real any of these numbers are, from any manufacturer,
is a guess.

Given Intel's refusal to honor a cache flush, and their performance problems
with the cache disabled, I don't trust them any more than anyone else right
now.

As for the Vertex drives: if they are within +/-10% of the Intel, they're
still doing it for half of what the Intel drive costs. So it's an option; not
a great option, but still an option.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] inodes in snapshots

2010-05-19 Thread Chris Gerhard
The reason for wanting to know is to try to find versions of a file.

If a file is renamed, then the only way to know that the renamed file is the
same as a file in a snapshot is if the inode numbers match. However, for that
to be reliable it would require that inodes are not reused.

If they can be reused, then when an inode number matches I would also have to
compare the real creation time, which requires looking at the extended
attributes.

--chris
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New SSD options

2010-05-19 Thread Yuri Vorobyev



As for the Vertex drives- if they are within +-10% of the Intel they're still 
doing it for half of what the Intel drive costs- so it's an option- not a great 
option- but still an option.

Yes, but the Intel is SLC, which has much more endurance.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS in campus clusters

2010-05-19 Thread David Magda
On Tue, May 18, 2010 20:45, Edward Ned Harvey wrote:

 The whole point of a log device is to accelerate sync writes, by providing
 nonvolatile storage which is faster than the primary storage.  You're not
 going to get this if any part of the log device is at the other side of a
 WAN.  So either add a mirror of log devices locally and not across the
 WAN, or don't do it at all.

A good example of using distant iSCSI with close-by SSDs:

http://blogs.sun.com/jkshah/entry/zfs_with_cloud_storage_and


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Review: SuperMicro’s SC847 (SC847A) 4U chassis with 36 drive bays

2010-05-19 Thread Eugen Leitl

http://www.natecarlson.com/2010/05/07/review-supermicros-sc847a-4u-chassis-with-36-drive-bays/
 

Review: SuperMicro’s SC847 (SC847A) 4U chassis with 36 drive bays


[Or my quest for the ultimate home-brew storage array.] At my day job, we
use a variety of storage solutions based on the type of data we’re hosting.
Over the last year, we have started to deploy SuperMicro-based hardware with
OpenSolaris and ZFS for storage of some classes of data. The systems we have
built previously have not had any strict performance requirements, and were
built with SuperMicro’s SC846E2 chassis, which supports 24 total SAS/SATA
drives, with an integrated port multiplier in the backplane to support
multipath to SAS drives. We’re building out a new system that we hope to be
able to promote to tier-1 for some “less critical data”, so we wanted better
drive density and more performance. We landed on the relatively new
SuperMicro SC847 chassis, which supports 36 total 3.5″ drives (24 front and
12 rear) in a 4U enclosure. While researching this product, I didn’t find
many reviews and detailed pictures of the chassis, so figured I’d take some
pictures while building the system and post them for the benefit of anyone
else interested in such a solution.

In the systems we’ve built so far, we’ve only deployed SATA drives since
OpenSolaris can still get us decent performance with SSD for read and write
cache. This means that in the 4U cases we’ve used with integrated port
multipliers, we have only used one of the two SFF-8087 connectors on the
backplane; this works fine, but limits the total throughput of all drives in
the system to 4 3gbit/s channels (on this chassis, 6 drives would be on each
3gbit channel.) On our most recent build, we built it with the intention of
using it both for “nearline”-class storage, and as a test platform to see if
we can get the performance we need to store VM images. As part of this
decision, we decided to go with a backplane that supports full throughput to
each drive.

[...]
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New SSD options

2010-05-19 Thread David Magda
On Wed, May 19, 2010 02:09, thomas wrote:

 Is it even possible to buy a zeus iops anywhere? I haven't been able to
 find one. I get the impression they mostly sell to other vendors like sun?
 I'd be curious what the price is on a 9GB zeus iops is these days?

Correct, their Zeus products are only available to OEMs.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS in campus clusters

2010-05-19 Thread Bob Friesenhahn

On Tue, 18 May 2010, Edward Ned Harvey wrote:


Either I'm crazy, or I completely miss what you're asking.  You want to have
one side of a mirror attached locally, and the other side of the mirror
attached ... via iscsi or something ... across the WAN?  Even if you have a
really fast WAN (1Gb or so) your performance is going to be terrible, and I
would be very concerned about reliability.  What happens if a switch reboots
or crashes?  Then suddenly half of the mirror isn't available anymore
(redundancy is degraded on all pairs) and ... Will it be a degraded mirror?
Or will the system just hang, waiting for iscsi IO to timeout?  When it
comes back online, will it intelligently resilver only the parts which have
changed since?  Since the mirror is now broken, and local operations can
happen faster than the WAN can carry them across, will the resilver ever
complete, ever?  I don't know.


This has been accomplished successfully before.  There used to be a 
fellow posting here (from New Zealand I think) who used distributed 
storage just like that.  If the WAN goes away, then zfs writes will 
likely hang for the iSCSI timeout period (likely 3 minutes) and then 
continue normally once iSCSI/zfs decides that the mirror device is not 
available.  When the WAN returns, then zfs will send only the 
missing updates.



The whole point of a log device is to accelerate sync writes, by providing
nonvolatile storage which is faster than the primary storage.  You're not
going to get this if any part of the log device is at the other side of a
WAN.  So either add a mirror of log devices locally and not across the WAN,
or don't do it at all.


This depends on the nature of the WAN.  The WAN latency may still be 
relatively low as compared with drive latency.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS in campus clusters

2010-05-19 Thread John Hoogerdijk
  From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org]
  On Behalf Of John Hoogerdijk
 
  I'm building a campus cluster with identical storage in two locations with
  ZFS mirrors spanning both storage frames. Data will be mirrored using zfs.
  I'm looking for the best way to add log devices to this campus cluster.
 
 Either I'm crazy, or I completely miss what you're asking.  You want to have
 one side of a mirror attached locally, and the other side of the mirror
 attached ... via iscsi or something ... across the WAN?  Even if you have a
 really fast WAN (1Gb or so) your performance is going to be terrible, and I
 would be very concerned about reliability.  What happens if a switch reboots
 or crashes?  Then suddenly half of the mirror isn't available anymore
 (redundancy is degraded on all pairs) and ... Will it be a degraded mirror?
 Or will the system just hang, waiting for iscsi IO to timeout?  When it
 comes back online, will it intelligently resilver only the parts which have
 changed since?  Since the mirror is now broken, and local operations can
 happen faster than the WAN can carry them across, will the resilver ever
 complete, ever?  I don't know.
 
 Anyway, it just doesn't sound like a good idea to me.  It sounds like
 something that was meant for a clustering filesystem of some kind, not
 particularly for ZFS.
 
 If you are adding log devices to this, I have a couple of things to say:
 
 The whole point of a log device is to accelerate sync writes, by providing
 nonvolatile storage which is faster than the primary storage.  You're not
 going to get this if any part of the log device is at the other side of a
 WAN.  So either add a mirror of log devices locally and not across the WAN,
 or don't do it at all.
 
  I am considering building a separate mirrored zpool of Flash disk that
  span the frames, then creating zvols to use as log devices for the
  data zpool.  Will this work?  Any other suggestions?
 
 This also sounds nonsensical to me.  If your primary pool devices are Flash,
 then there's no point to add separate log devices.  Unless you have another
 type of even faster nonvolatile storage.

Both frames are FC connected with Flash devices in the frame.  Latencies are 
additive, so there is benefit to a logging device.  The cluster is a standard 
HA cluster about 10km apart with identical storage in both locations, mirrored 
using ZFS. 

Think about the potential problems if I don't mirror the log devices across the 
WAN.

 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discu
 ss

-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS in campus clusters

2010-05-19 Thread John Hoogerdijk
 On Tue, May 18, 2010 20:45, Edward Ned Harvey wrote:
 
  The whole point of a log device is to accelerate sync writes, by providing
  nonvolatile storage which is faster than the primary storage.  You're not
  going to get this if any part of the log device is at the other side of a
  WAN.  So either add a mirror of log devices locally and not across the
  WAN, or don't do it at all.
 
 A good example of using distant iSCSI with close-by SSDs:
 
 http://blogs.sun.com/jkshah/entry/zfs_with_cloud_storage_and

Good stuff, but doesn't address HA clusters and consistent storage.

 
 

-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS in campus clusters

2010-05-19 Thread Richard Elling
comment below...

On May 19, 2010, at 7:50 AM, John Hoogerdijk wrote:

  From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org]
  On Behalf Of John Hoogerdijk
 
  I'm building a campus cluster with identical storage in two locations with
  ZFS mirrors spanning both storage frames. Data will be mirrored using zfs.
  I'm looking for the best way to add log devices to this campus cluster.
 
  Either I'm crazy, or I completely miss what you're asking.  You want to have
  one side of a mirror attached locally, and the other side of the mirror
  attached ... via iscsi or something ... across the WAN?  Even if you have a
  really fast WAN (1Gb or so) your performance is going to be terrible, and I
  would be very concerned about reliability.  What happens if a switch reboots
  or crashes?  Then suddenly half of the mirror isn't available anymore
  (redundancy is degraded on all pairs) and ... Will it be a degraded mirror?
  Or will the system just hang, waiting for iscsi IO to timeout?  When it
  comes back online, will it intelligently resilver only the parts which have
  changed since?  Since the mirror is now broken, and local operations can
  happen faster than the WAN can carry them across, will the resilver ever
  complete, ever?  I don't know.
 
  Anyway, it just doesn't sound like a good idea to me.  It sounds like
  something that was meant for a clustering filesystem of some kind, not
  particularly for ZFS.
 
  If you are adding log devices to this, I have a couple of things to say:
 
  The whole point of a log device is to accelerate sync writes, by providing
  nonvolatile storage which is faster than the primary storage.  You're not
  going to get this if any part of the log device is at the other side of a
  WAN.  So either add a mirror of log devices locally and not across the WAN,
  or don't do it at all.
 
  I am considering building a separate mirrored zpool of Flash disk that
  span the frames, then creating zvols to use as log devices for the
  data zpool.  Will this work?  Any other suggestions?
 
  This also sounds nonsensical to me.  If your primary pool devices are Flash,
  then there's no point to add separate log devices.  Unless you have another
  type of even faster nonvolatile storage.
 
  Both frames are FC connected with Flash devices in the frame.  Latencies are
  additive, so there is benefit to a logging device.  The cluster is a standard
  HA cluster about 10km apart with identical storage in both locations,
  mirrored using ZFS.

There are quite a few metro clusters in the world today. Many use
traditional mirroring software.  Some use array-based sync replication.
A ZFS-based solution works and behaves similarly.

 Think about the potential problems if I don't mirror the log devices across 
 the WAN.

If you use log devices, mirror them.
 -- richard

-- 
Richard Elling
rich...@nexenta.com   +1-760-896-4422
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS memory recommendations

2010-05-19 Thread Bob Friesenhahn

On Wed, 19 May 2010, Deon Cui wrote:


http://constantin.glez.de/blog/2010/04/ten-ways-easily-improve-oracle-solaris-zfs-filesystem-performance

It recommends that for every TB of storage you have you want 1GB of 
RAM just for the metadata.


Interesting conclusion.


Is this really the case that ZFS metadata consumes so much RAM?
I'm currently building a storage server which will eventually hold 
up to 20TB of storage, I can't fit in 20GB of RAM on the 
motherboard!


Unless you do something like enable dedup (which is still risky to 
use), then there is no rule of thumb that I know of.  ZFS will take 
advantage of available RAM.  You should have at least 1GB of RAM 
available for ZFS to use.  Beyond that, it depends entirely on the 
size of your expected working set.  The size of accessed files, the 
randomness of the access, the number of simultaneous accesses, and the 
maximum number of files per directory all make a difference to how 
much RAM you should have for good performance. If you have 200TB of 
stored data, but only actually access 2GB of it at any one time, then 
the caching requirements are not very high.
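
One practical way to take the guesswork out of this is to watch what the ARC is
actually doing on a machine running a representative workload, rather than
sizing from a per-TB rule. A minimal sketch against the zfs:0:arcstats kstat
(reading the "size" and "c_max" statistics; build with cc -o arcsize arcsize.c
-lkstat):

/* Observe how much RAM the ARC is actually using, instead of guessing.
 * "size" is the current ARC footprint in bytes, "c_max" the configured
 * ceiling.  Solaris/OpenSolaris only (libkstat).
 */
#include <kstat.h>
#include <stdio.h>

int main(void)
{
    kstat_ctl_t *kc = kstat_open();
    kstat_t *ksp;
    kstat_named_t *size, *cmax;

    if (kc == NULL)
        return (1);
    ksp = kstat_lookup(kc, "zfs", 0, "arcstats");
    if (ksp == NULL || kstat_read(kc, ksp, NULL) == -1) {
        (void) kstat_close(kc);
        return (1);
    }
    size = kstat_data_lookup(ksp, "size");
    cmax = kstat_data_lookup(ksp, "c_max");
    if (size != NULL && cmax != NULL) {
        printf("ARC size:  %llu MB\n", (unsigned long long)(size->value.ui64 >> 20));
        printf("ARC c_max: %llu MB\n", (unsigned long long)(cmax->value.ui64 >> 20));
    }
    (void) kstat_close(kc);
    return (0);
}

The command line equivalent is kstat -p zfs:0:arcstats. If "size" sits well
below "c_max" under load, adding RAM isn't buying anything for that working set.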


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zfs mount -a kernel panic

2010-05-19 Thread John Andrunas
Running ZFS on a Nexenta box, I had a mirror get broken and apparently
the metadata is corrupt now.  If I try and mount vol2 it works, but if
I try mount -a or mount vol2/vm2 it instantly kernel panics and
reboots.  Is it possible to recover from this?  I don't care if I lose
the file listed below, but the other data in the volume would be
really nice to get back.  I have scrubbed the volume to no avail.  Any
other thoughts?


zpool status -xv vol2
  pool: vol2
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        vol2        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c3t3d0  ONLINE       0     0     0
            c3t2d0  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

vol2/v...@snap-daily-1-2010-05-06-:/as5/as5-flat.vmdk

-- 
John
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs mount -a kernel panic

2010-05-19 Thread Mark J Musante


Do you have a coredump?  Or a stack trace of the panic?

On Wed, 19 May 2010, John Andrunas wrote:


Running ZFS on a Nexenta box, I had a mirror get broken and apparently
the metadata is corrupt now.  If I try and mount vol2 it works but if
I try and mount -a or mount vol2/vm2 is instantly kernel panics and
reboots.  Is it possible to recover from this?  I don't care if I lose
the file listed below, but the other data in the volume would be
really nice to get back.  I have scrubbed the volume to no avail.  Any
other thoughts.


zpool status -xv vol2
 pool: vol2
state: ONLINE
status: One or more devices has experienced an error resulting in data
   corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
   entire pool from backup.
  see: http://www.sun.com/msg/ZFS-8000-8A
scrub: none requested
config:

   NAMESTATE READ WRITE CKSUM
   vol2ONLINE   0 0 0
 mirror-0  ONLINE   0 0 0
   c3t3d0  ONLINE   0 0 0
   c3t2d0  ONLINE   0 0 0

errors: Permanent errors have been detected in the following files:

   vol2/v...@snap-daily-1-2010-05-06-:/as5/as5-flat.vmdk

--
John
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Regards,
markm
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs mount -a kernel panic

2010-05-19 Thread John Andrunas
Not to my knowledge. How would I go about getting one?  (CC'ing discuss)


On Wed, May 19, 2010 at 8:46 AM, Mark J Musante mark.musa...@oracle.com wrote:

 Do you have a coredump?  Or a stack trace of the panic?

 On Wed, 19 May 2010, John Andrunas wrote:

 Running ZFS on a Nexenta box, I had a mirror get broken and apparently
 the metadata is corrupt now.  If I try and mount vol2 it works but if
 I try and mount -a or mount vol2/vm2 is instantly kernel panics and
 reboots.  Is it possible to recover from this?  I don't care if I lose
 the file listed below, but the other data in the volume would be
 really nice to get back.  I have scrubbed the volume to no avail.  Any
 other thoughts.


 zpool status -xv vol2
  pool: vol2
 state: ONLINE
 status: One or more devices has experienced an error resulting in data
       corruption.  Applications may be affected.
 action: Restore the file in question if possible.  Otherwise restore the
       entire pool from backup.
  see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
 config:

       NAME        STATE     READ WRITE CKSUM
       vol2        ONLINE       0     0     0
         mirror-0  ONLINE       0     0     0
           c3t3d0  ONLINE       0     0     0
           c3t2d0  ONLINE       0     0     0

 errors: Permanent errors have been detected in the following files:

       vol2/v...@snap-daily-1-2010-05-06-:/as5/as5-flat.vmdk

 --
 John
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



 Regards,
 markm




-- 
John
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs mount -a kernel panic

2010-05-19 Thread Michael Schuster

On 19.05.10 17:53, John Andrunas wrote:

Not to my knowledge, how would I go about getting one?  (CC'ing discuss)


man savecore and dumpadm.

Michael



On Wed, May 19, 2010 at 8:46 AM, Mark J Musantemark.musa...@oracle.com  wrote:


Do you have a coredump?  Or a stack trace of the panic?

On Wed, 19 May 2010, John Andrunas wrote:


Running ZFS on a Nexenta box, I had a mirror get broken and apparently
the metadata is corrupt now.  If I try and mount vol2 it works but if
I try and mount -a or mount vol2/vm2 is instantly kernel panics and
reboots.  Is it possible to recover from this?  I don't care if I lose
the file listed below, but the other data in the volume would be
really nice to get back.  I have scrubbed the volume to no avail.  Any
other thoughts.


zpool status -xv vol2
  pool: vol2
state: ONLINE
status: One or more devices has experienced an error resulting in data
   corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
   entire pool from backup.
  see: http://www.sun.com/msg/ZFS-8000-8A
scrub: none requested
config:

   NAMESTATE READ WRITE CKSUM
   vol2ONLINE   0 0 0
 mirror-0  ONLINE   0 0 0
   c3t3d0  ONLINE   0 0 0
   c3t2d0  ONLINE   0 0 0

errors: Permanent errors have been detected in the following files:

   vol2/v...@snap-daily-1-2010-05-06-:/as5/as5-flat.vmdk

--
John
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Regards,
markm








--
michael.schus...@oracle.com http://blogs.sun.com/recursion
Recursion, n.: see 'Recursion'
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] inodes in snapshots

2010-05-19 Thread Nicolas Williams
On Wed, May 19, 2010 at 05:33:05AM -0700, Chris Gerhard wrote:
 The reason for wanting to know is to try and find versions of a file.

No, there's no such guarantee.  The same inode and generation number
pair is extremely unlikely to be re-used, but the inode number itself is
likely to be re-used.

 If a file is renamed then the only way to know that the renamed file
 was the same as a file in a snapshot would be if the inode numbers
 matched. However for that to be reliable it would require the i-nodes
 are not reused.

There's also the crtime (creation time, not to be confused with ctime),
which you can get with ls(1).

  If they are able to be reused then when an inode number matches I
  would also have to compare the real creation time which requires
  looking at the extended attributes.

Right, that's what you'll have to do.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS in campus clusters

2010-05-19 Thread Nicolas Williams
On Wed, May 19, 2010 at 07:50:13AM -0700, John Hoogerdijk wrote:
 Think about the potential problems if I don't mirror the log devices
 across the WAN.

If you don't mirror the log devices, then your disaster recovery semantics
will be that you'll miss any transactions that hadn't been committed to disk
yet at the time of the disaster.  Which means that the log devices' effect is
purely local: for recovery from local power failures (not extending to local
disasters) and for acceleration.

This may or may not be acceptable to you.  If not, then mirror the log
devices.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS memory recommendations

2010-05-19 Thread Roy Sigurd Karlsbakk
- Deon Cui deon@gmail.com skrev:

 I am currently doing research on how much memory ZFS should have for a
 storage server.
 
 I came across this blog
 
 http://constantin.glez.de/blog/2010/04/ten-ways-easily-improve-oracle-solaris-zfs-filesystem-performance
 
 It recommends that for every TB of storage you have you want 1GB of
 RAM just for the metadata.

That figure is for dedup: roughly 150 bytes per block, meaning approx. 1GB per
1TB if all (or most) blocks are 128kB, and way more memory (or L2ARC) if you
have small files.
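
As a back-of-envelope check of that figure (a sketch only; the 150 bytes per
entry is the number quoted above, and real DDT entries plus ARC overhead can be
larger):

/* Rough dedup-table memory estimate per TB of deduped data, for a few
 * average block sizes.  The per-entry cost is the 150-byte figure quoted
 * above; treat the output as an order-of-magnitude guide.
 */
#include <stdio.h>

int main(void)
{
    const double pool_tb = 1.0;                      /* deduped data, in TB */
    const double blocksizes[] = { 128 * 1024, 8 * 1024, 512 };
    const double bytes_per_entry = 150.0;            /* figure quoted above */
    int i;

    for (i = 0; i < 3; i++) {
        double nblocks = pool_tb * 1e12 / blocksizes[i];
        double ddt_gb = nblocks * bytes_per_entry / 1e9;
        printf("avg block %7.0f B -> %.1e blocks -> ~%.1f GB of DDT per TB\n",
            blocksizes[i], nblocks, ddt_gb);
    }
    return (0);
}

The point is the sensitivity to block size: the same terabyte of deduped data
needs roughly 16x more table space at 8K records than at the 128K default.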

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly.
It is an elementary imperative for all pedagogues to avoid excessive use of
idioms of foreign origin. In most cases, adequate and relevant synonyms exist
in Norwegian.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS memory recommendations

2010-05-19 Thread Erik Trimble

Bob Friesenhahn wrote:

On Wed, 19 May 2010, Deon Cui wrote:


http://constantin.glez.de/blog/2010/04/ten-ways-easily-improve-oracle-solaris-zfs-filesystem-performance 



It recommends that for every TB of storage you have you want 1GB of 
RAM just for the metadata.


Interesting conclusion.


Is this really the case that ZFS metadata consumes so much RAM?
I'm currently building a storage server which will eventually hold up 
to 20TB of storage, I can't fit in 20GB of RAM on the motherboard!


Unless you do something like enable dedup (which is still risky to 
use), then there is no rule of thumb that I know of.  ZFS will take 
advantage of available RAM.  You should have at least 1GB of RAM 
available for ZFS to use.  Beyond that, it depends entirely on the 
size of your expected working set.  The size of accessed files, the 
randomness of the access, the number of simultaneous accesses, and the 
maximum number of files per directory all make a difference to how 
much RAM you should have for good performance. If you have 200TB of 
stored data, but only actually access 2GB of it at any one time, then 
the caching requirements are not very high.


Bob


I'd second Bob's notes here - for non-dedup purposes, you need, at a very bare
minimum, 512MB of RAM just for ZFS (Bob's recommendation of 1GB is much
better; I'm quoting a real basement level beyond which you're effectively
crippling ZFS).


The primary RAM consumption determination for pools without dedup is the 
size of your active working set (as Bob mentioned).  It's unrealistic to 
expect to cache /all/ metadata for every file for large pools, and I 
can't really see the worth in it anyhow (you end up with very 
infrequently-used metadata sitting in RAM, which gets evicted for use by 
other things in most cases).  Storing any more metadata than what you 
need for your working set isn't going to bring much performance bonus.  
What you need to have is sufficient RAM to cache your async writes 
(remember, this amount is relatively small in most cases - it's 3 
pending transactions per pool), plus enough RAM to hold all the files (plus
metadata) you expect to use (i.e. read more than once or write to) within
about 5 minutes.


Here's three examples to show the differences (all without dedup):

(1)  100TB system which contains scientific data used in a data-mining 
app.  The system will need to frequently access very large amounts of 
the available data, but seldom writes much.  As it is doing data-mining, 
a specific piece of data is read seldom, though the system needs to read 
large aggregate amounts continuously.  In this case, you're pretty much 
out of luck for caching. You'll need enough RAM to cache your maximum 
write size, and a little bit for read-ahead, but since you're accessing 
the pool almost at random for large amounts of data which aren't 
re-used, caching isn't going to help really at all.  In this case, 1-2GB of
RAM is likely all that really can be used.


(2)  1TB of data are being used for a Virtual Machine disk server. That 
is, the machine exports iSCSI (or FCoE, or NFS, or whatever) volumes for 
use on client hardware to run a VM.  Typically in this case, there are 
lots of effectively random read requests coming in for a bunch of hot 
files (which tend to be OS files in the VM-hosted OSes). There's also 
fairly frequent write requests.   However, the VMs will do a fair amount 
of read-caching of their own, so the amount of read requests is lower 
than one would think. For performance and administrative reasons, it is 
likely that you will want multiple pools, rather than a single large 
pool.  In this case, you need a reasonable amount of write-cache for 
*each* pool, plus enough RAM to cache all of the OS files very often 
used for ALL the VMs.  In this case, dedup would actually really help 
RAM consumption, since it is highly likely that frequently-accessed 
files from multiple VMs are in fact identical, and thus with dedup, 
you'd only need to store one copy in the cache.   In any case, here 
you'd need a few GB for the write caching, plus likely a dozen or more 
GB for read caching, as your working set is moderately large, and 
frequently re-used.


(3)  100TB of data for NFS home directory serving.  Access pattern here 
is likely highly random, with only small amounts of re-used data. 
However, you'll often have non-trivial write sizes. Having a ZIL is 
probably a good idea, but in any case, you'll want a couple of GB (call 
it 3-4) for write caching per pool, and then several dozen MB per active 
user as read cache.  That is, in this case, it's likely that your 
determining factor is not total data size, but the number of 
simultaneous users, since the latter will dictate your frequency of file 
access.



I'd say all of the recommendations/insights on the referenced link are 
good, except for #1.  The base amount of RAM is highly variable based on 
the factors discussed above, and the blanket assumption 

Re: [zfs-discuss] zfs mount -a kernel panic

2010-05-19 Thread John Andrunas
Hmmm... no coredump even though I configured it.

Here is the trace though  I will see what I can do about the coredump

r...@cluster:/export/home/admin# zfs mount vol2/vm2

panic[cpu3]/thread=ff001f45ec60: BAD TRAP: type=e (#pf Page fault)
rp=ff001f45e950 addr=30 occurred in module zfs due to a NULL
pointer dereference

zpool-vol2: #pf Page fault
Bad kernel fault at addr=0x30
pid=1469, pc=0xf795d054, sp=0xff001f45ea48, eflags=0x10296
cr0: 8005003bpg,wp,ne,et,ts,mp,pe cr4: 6f8xmme,fxsr,pge,mce,pae,pse,de
cr2: 30cr3: 500cr8: c

rdi:0 rsi: ff05208b2388 rdx: ff001f45e888
rcx:0  r8:3000900ff  r9: 198f5ff6
rax:0 rbx:  200 rbp: ff001f45ea50
r10: c0130803 r11: ff001f45ec60 r12: ff05208b2388
r13: ff0521fc4000 r14: ff050c0167e0 r15: ff050c0167e8
fsb:0 gsb: ff04eb9b8080  ds:   4b
 es:   4b  fs:0  gs:  1c3
trp:e err:2 rip: f795d054
 cs:   30 rfl:10296 rsp: ff001f45ea48
 ss:   38

ff001f45e830 unix:die+dd ()
ff001f45e940 unix:trap+177b ()
ff001f45e950 unix:cmntrap+e6 ()
ff001f45ea50 zfs:ddt_phys_decref+c ()
ff001f45ea80 zfs:zio_ddt_free+55 ()
ff001f45eab0 zfs:zio_execute+8d ()
ff001f45eb50 genunix:taskq_thread+248 ()
ff001f45eb60 unix:thread_start+8 ()

syncing file systems... done
skipping system dump - no dump device configured
rebooting...


On Wed, May 19, 2010 at 8:55 AM, Michael Schuster
michael.schus...@oracle.com wrote:
 On 19.05.10 17:53, John Andrunas wrote:

 Not to my knowledge, how would I go about getting one?  (CC'ing discuss)

 man savecore and dumpadm.

 Michael


 On Wed, May 19, 2010 at 8:46 AM, Mark J Musantemark.musa...@oracle.com
  wrote:

 Do you have a coredump?  Or a stack trace of the panic?

 On Wed, 19 May 2010, John Andrunas wrote:

 Running ZFS on a Nexenta box, I had a mirror get broken and apparently
 the metadata is corrupt now.  If I try and mount vol2 it works but if
 I try and mount -a or mount vol2/vm2 is instantly kernel panics and
 reboots.  Is it possible to recover from this?  I don't care if I lose
 the file listed below, but the other data in the volume would be
 really nice to get back.  I have scrubbed the volume to no avail.  Any
 other thoughts.


 zpool status -xv vol2
  pool: vol2
 state: ONLINE
 status: One or more devices has experienced an error resulting in data
       corruption.  Applications may be affected.
 action: Restore the file in question if possible.  Otherwise restore the
       entire pool from backup.
  see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
 config:

       NAME        STATE     READ WRITE CKSUM
       vol2        ONLINE       0     0     0
         mirror-0  ONLINE       0     0     0
           c3t3d0  ONLINE       0     0     0
           c3t2d0  ONLINE       0     0     0

 errors: Permanent errors have been detected in the following files:

       vol2/v...@snap-daily-1-2010-05-06-:/as5/as5-flat.vmdk

 --
 John
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



 Regards,
 markm






 --
 michael.schus...@oracle.com     http://blogs.sun.com/recursion
 Recursion, n.: see 'Recursion'




-- 
John
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS memory recommendations

2010-05-19 Thread Miles Nordin
 et == Erik Trimble erik.trim...@oracle.com writes:

et frequently-accessed files from multiple VMs are in fact
et identical, and thus with dedup, you'd only need to store one
et copy in the cache.

Although counterintuitive, I thought this wasn't part of the initial
release.  Maybe I'm wrong altogether, or maybe it got added later?

  http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup#comment-1257191094000



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS memory recommendations

2010-05-19 Thread Erik Trimble

Miles Nordin wrote:

et == Erik Trimble erik.trim...@oracle.com writes:



et frequently-accessed files from multiple VMs are in fact
et identical, and thus with dedup, you'd only need to store one
et copy in the cache.

although counterintuitive I thought this wasn't part of the initial
release.  Maybe I'm wrong altogether or maybe it got added later?

  http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup#comment-1257191094000
  
No, you're reading that blog right - dedup is on a per-pool basis.  What 
I was talking about was inside a single pool.  Without dedup enabled on 
a pool, if I have 2 VM images, both of which are say WinXP, then I'd 
have to cache identical files twice.  With dedup, I'd only have to cache 
those blocks once, even if they were being accessed by both VMs.


So, dedup is both hard on RAM (you need the DDT), and easier (it lowers 
the amount of actual data blocks which have to be stored in cache). 



--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs mount -a kernel panic

2010-05-19 Thread John Andrunas
OK, I got a core dump, what do I do with it now?

It is 1.2G in size.


On Wed, May 19, 2010 at 10:54 AM, John Andrunas j...@andrunas.net wrote:
 Hmmm... no coredump even though I configured it.

 Here is the trace though  I will see what I can do about the coredump

 r...@cluster:/export/home/admin# zfs mount vol2/vm2

 panic[cpu3]/thread=ff001f45ec60: BAD TRAP: type=e (#pf Page fault)
 rp=ff001f45e950 addr=30 occurred in module zfs due to a NULL
 pointer deree

 zpool-vol2: #pf Page fault
 Bad kernel fault at addr=0x30
 pid=1469, pc=0xf795d054, sp=0xff001f45ea48, eflags=0x10296
 cr0: 8005003bpg,wp,ne,et,ts,mp,pe cr4: 6f8xmme,fxsr,pge,mce,pae,pse,de
 cr2: 30cr3: 500cr8: c

        rdi:                0 rsi: ff05208b2388 rdx: ff001f45e888
        rcx:                0  r8:        3000900ff  r9:         198f5ff6
        rax:                0 rbx:              200 rbp: ff001f45ea50
        r10:         c0130803 r11: ff001f45ec60 r12: ff05208b2388
        r13: ff0521fc4000 r14: ff050c0167e0 r15: ff050c0167e8
        fsb:                0 gsb: ff04eb9b8080  ds:               4b
         es:               4b  fs:                0  gs:              1c3
        trp:                e err:                2 rip: f795d054
         cs:               30 rfl:            10296 rsp: ff001f45ea48
         ss:               38

 ff001f45e830 unix:die+dd ()
 ff001f45e940 unix:trap+177b ()
 ff001f45e950 unix:cmntrap+e6 ()
 ff001f45ea50 zfs:ddt_phys_decref+c ()
 ff001f45ea80 zfs:zio_ddt_free+55 ()
 ff001f45eab0 zfs:zio_execute+8d ()
 ff001f45eb50 genunix:taskq_thread+248 ()
 ff001f45eb60 unix:thread_start+8 ()

 syncing file systems... done
 skipping system dump - no dump device configured
 rebooting...


 On Wed, May 19, 2010 at 8:55 AM, Michael Schuster
 michael.schus...@oracle.com wrote:
 On 19.05.10 17:53, John Andrunas wrote:

 Not to my knowledge, how would I go about getting one?  (CC'ing discuss)

 man savecore and dumpadm.

 Michael


 On Wed, May 19, 2010 at 8:46 AM, Mark J Musantemark.musa...@oracle.com
  wrote:

 Do you have a coredump?  Or a stack trace of the panic?

 On Wed, 19 May 2010, John Andrunas wrote:

 Running ZFS on a Nexenta box, I had a mirror get broken and apparently
 the metadata is corrupt now.  If I try and mount vol2 it works but if
 I try and mount -a or mount vol2/vm2 is instantly kernel panics and
 reboots.  Is it possible to recover from this?  I don't care if I lose
 the file listed below, but the other data in the volume would be
 really nice to get back.  I have scrubbed the volume to no avail.  Any
 other thoughts.


 zpool status -xv vol2
  pool: vol2
 state: ONLINE
 status: One or more devices has experienced an error resulting in data
       corruption.  Applications may be affected.
 action: Restore the file in question if possible.  Otherwise restore the
       entire pool from backup.
  see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
 config:

       NAME        STATE     READ WRITE CKSUM
       vol2        ONLINE       0     0     0
         mirror-0  ONLINE       0     0     0
           c3t3d0  ONLINE       0     0     0
           c3t2d0  ONLINE       0     0     0

 errors: Permanent errors have been detected in the following files:

       vol2/v...@snap-daily-1-2010-05-06-:/as5/as5-flat.vmdk

 --
 John
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



 Regards,
 markm






 --
 michael.schus...@oracle.com     http://blogs.sun.com/recursion
 Recursion, n.: see 'Recursion'




 --
 John




-- 
John
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zpool import On Fail Over Server Using Shared SAS zpool Storage But Not Shared cache SSD Devices

2010-05-19 Thread Preston Connors
Hello and good day,

I will have two OpenSolaris snv_134 storage servers both connected to a
SAS chassis with SAS disks used to store zpool data. One storage server
will be the active storage server and the other will be the passive fail
over storage server. Both servers will be able to access the same disks
in the SAS chassis. I am planning on having unique, non-shared, SSD
cache directly connected to each storage node to allow for better
performance when utilizing the cache SSDs.

Would having unique, non-shared, SSD cache directly connected to each
storage node's motherboard actually allow for better performance when
utilizing the cache SSDs? Or would having the SSD cache devices be
shared going through a shared controller yield just the same
performance?

If having unique, non-shared, SSD cache directly connected to each
storage node's motherboard actually yields better performance how would
a zpool import of a zpool utilizing the unique SSD cache devices work on
the passive, fail over storage node when a fail over happened? Would
OpenSolaris/ZFS use these directly connected cache devices automatically
or would we have to add these cache devices into the zpool? If there was
data on the cache devices on the active storage node and say the power
went out and fail over occurred would the data on the cache devices be
lost during a zpool import on the fail over node?

Also, if you would like any other details about this storage environment
to better provide myself and the list with insight to these questions
please just ask!

-- 
Thank you,
Preston Connors
Atlantic.Net

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New SSD options

2010-05-19 Thread Don
Well, the larger size of the Vertex, coupled with its smaller claimed write
amplification, should result in sufficient service life for my needs. Its
claimed MTBF also matches the Intel X25-E's.
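
For anyone who wants to sanity-check the "sufficient service life" argument,
the usual back-of-envelope is capacity x program/erase cycles / write
amplification, divided by the daily write volume. A sketch in which every input
is an assumption for illustration (drive capacity, cycle count, amplification
factor, and write rate are not vendor figures):

/* Rough flash endurance estimate.  All inputs are assumptions: a 100 GB
 * MLC drive, ~10k P/E cycles, write amplification 1.5, and 200 GB of host
 * writes per day as a log device.  Substitute your own numbers.
 */
#include <stdio.h>

int main(void)
{
    const double capacity_gb = 100.0;     /* assumed usable capacity */
    const double pe_cycles = 10000.0;     /* assumed MLC endurance */
    const double write_amp = 1.5;         /* assumed write amplification */
    const double gb_per_day = 200.0;      /* assumed host writes per day */

    double lifetime_gb = capacity_gb * pe_cycles / write_amp;
    double days = lifetime_gb / gb_per_day;

    printf("~%.0f GB of host writes before wear-out, ~%.1f years at %.0f GB/day\n",
        lifetime_gb, days / 365.0, gb_per_day);
    return (0);
}

Plugging in your own slog write rate (zpool iostat will show it) is what
actually decides whether an MLC drive holds up for the job.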
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New SSD options

2010-05-19 Thread Don
 Since it ignores the Cache Flush command and it doesn't have any persistent
 buffer storage, disabling the write cache is the best you can do.

This actually brings up another question I had: what is the risk, beyond a few
seconds of lost writes, if I lose power, there is no capacitor, and the cache
is not disabled?

My ZFS system is shared storage for a large VMWare based QA farm. If I lose 
power then a few seconds of writes are the least of my concerns. All of the QA 
tests will need to be restarted and all of the file systems will need to be 
checked. A few seconds of writes won't make any difference unless it has the 
potential to affect the integrity of the pool itself.

Considering the performance trade-off, I'd happily give up a few seconds worth 
of writes for significantly improved IOPS.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New SSD options

2010-05-19 Thread Richard Elling
On May 19, 2010, at 2:29 PM, Don wrote:

 Since it ignores the Cache Flush command and it doesn't have any persistent
 buffer storage, disabling the write cache is the best you can do.
 
 This actually brings up another question I had: What is the risk, beyond a
 few seconds of lost writes, if I lose power, there is no capacitor, and the
 cache is not disabled?

The data risk is a few moments of data loss. However, if the order of the
uberblock updates is not preserved (which is why the caches are flushed)
then recovery from a reboot may require manual intervention.  The amount
of manual intervention could be significant for builds prior to b128.
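
(As far as I know, the recovery support that arrived around b128 is driven
from zpool import; a minimal sketch, with tank as a placeholder pool name:

zpool import -Fn tank   # dry run: report whether a rewind would succeed
zpool import -F tank    # rewind to the last consistent txg and import

Anything written after that txg is discarded, which is the trade-off being
described above.)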

  My ZFS system is shared storage for a large VMware-based QA farm. If I lose
  power then a few seconds of writes are the least of my concerns. All of the
  QA tests will need to be restarted and all of the file systems will need to
  be checked. A few seconds of writes won't make any difference unless they
  have the potential to affect the integrity of the pool itself.
 
 Considering the performance trade-off, I'd happily give up a few seconds 
 worth of writes for significantly improved IOPS.

Space, dependability, performance: pick two :-)
 -- richard

-- 
Richard Elling
rich...@nexenta.com   +1-760-896-4422
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New SSD options

2010-05-19 Thread Nicolas Williams
On Wed, May 19, 2010 at 02:29:24PM -0700, Don wrote:
 Since it ignores the Cache Flush command and it doesn't have any
 persistent buffer storage, disabling the write cache is the best you
 can do.
 
 This actually brings up another question I had: What is the risk,
 beyond a few seconds of lost writes, if I lose power, there is no
 capacitor, and the cache is not disabled?

You can lose all writes from the last committed transaction (i.e., the
one before the currently open transaction).  (You also lose writes from
the currently open transaction, but that's unavoidable in any system.)

Nowadays the system will let you know at boot time that the last
transaction was not committed properly and you'll have a chance to go
back to the previous transaction.

For me, getting much-better-than-disk performance out of an SSD with
cache disabled is enough to make that SSD worthwhile, provided the price
is right of course.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New SSD options

2010-05-19 Thread Don
You can lose all writes from the last committed transaction (i.e., the
one before the currently open transaction).

And I don't think that bothers me. As long as the array itself doesn't go
belly up, a few seconds of lost transactions are largely irrelevant; all of
the QA virtual machines are going to have to be rolled back to their initial
states anyway.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs mount -a kernel panic

2010-05-19 Thread Lori Alt


First, I suggest you open a bug at https://defect.opensolaris.org/bz and
get a bug number.

Then, name your core dump something like bug.<bugnumber> and upload it
using the instructions here:

http://supportfiles.sun.com/upload

Update the bug once you've uploaded the core and supply the name of the
core file.
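
Concretely, the steps look something like this, assuming savecore wrote the
dump into /var/crash/<hostname> and the bug you file comes back as 6951234
(a made-up number for illustration):

cd /var/crash/cluster    # substitute your hostname
ls                       # expect vmdump.0, or unix.0 plus vmcore.0
mv vmdump.0 bug.6951234
# then upload bug.6951234 through the web form at the URL above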



Lori




On 05/19/10 12:40 PM, John Andrunas wrote:

OK, I got a core dump, what do I do with it now?

It is 1.2G in size.


On Wed, May 19, 2010 at 10:54 AM, John Andrunas j...@andrunas.net  wrote:
   

Hmmm... no coredump even though I configured it.

Here is the trace, though. I will see what I can do about the coredump.

r...@cluster:/export/home/admin# zfs mount vol2/vm2

panic[cpu3]/thread=ff001f45ec60: BAD TRAP: type=e (#pf Page fault)
rp=ff001f45e950 addr=30 occurred in module "zfs" due to a NULL
pointer dereference

zpool-vol2: #pf Page fault
Bad kernel fault at addr=0x30
pid=1469, pc=0xf795d054, sp=0xff001f45ea48, eflags=0x10296
cr0: 8005003b<pg,wp,ne,et,ts,mp,pe>  cr4: 6f8<xmme,fxsr,pge,mce,pae,pse,de>
cr2: 30  cr3: 500  cr8: c

rdi:0 rsi: ff05208b2388 rdx: ff001f45e888
rcx:0  r8:3000900ff  r9: 198f5ff6
rax:0 rbx:  200 rbp: ff001f45ea50
r10: c0130803 r11: ff001f45ec60 r12: ff05208b2388
r13: ff0521fc4000 r14: ff050c0167e0 r15: ff050c0167e8
fsb:0 gsb: ff04eb9b8080  ds:   4b
 es:   4b  fs:0  gs:  1c3
trp:e err:2 rip: f795d054
 cs:   30 rfl:10296 rsp: ff001f45ea48
 ss:   38

ff001f45e830 unix:die+dd ()
ff001f45e940 unix:trap+177b ()
ff001f45e950 unix:cmntrap+e6 ()
ff001f45ea50 zfs:ddt_phys_decref+c ()
ff001f45ea80 zfs:zio_ddt_free+55 ()
ff001f45eab0 zfs:zio_execute+8d ()
ff001f45eb50 genunix:taskq_thread+248 ()
ff001f45eb60 unix:thread_start+8 ()

syncing file systems... done
skipping system dump - no dump device configured
rebooting...


On Wed, May 19, 2010 at 8:55 AM, Michael Schuster
michael.schus...@oracle.com  wrote:
 

On 19.05.10 17:53, John Andrunas wrote:
   

Not to my knowledge, how would I go about getting one?  (CC'ing discuss)
 

man savecore and dumpadm.

Michael
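
For the archive, a minimal sketch of setting that up; c5t0d0s1 is a
placeholder for whatever dedicated dump slice you use:

dumpadm -d /dev/dsk/c5t0d0s1   # point crash dumps at a dedicated device
dumpadm -s /var/crash/cluster  # directory where savecore writes the dump
dumpadm                        # show the resulting configuration
savecore -v                    # after a panic, extract the dump by hand if needed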
   


On Wed, May 19, 2010 at 8:46 AM, Mark J Musante mark.musa...@oracle.com
  wrote:
 

Do you have a coredump?  Or a stack trace of the panic?

On Wed, 19 May 2010, John Andrunas wrote:

   

Running ZFS on a Nexenta box, I had a mirror get broken and apparently
the metadata is corrupt now.  If I try and mount vol2 it works, but if
I try mount -a or mount vol2/vm2 it instantly kernel panics and
reboots.  Is it possible to recover from this?  I don't care if I lose
the file listed below, but the other data in the volume would be
really nice to get back.  I have scrubbed the volume to no avail.  Any
other thoughts?


zpool status -xv vol2
  pool: vol2
state: ONLINE
status: One or more devices has experienced an error resulting in data
   corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
   entire pool from backup.
  see: http://www.sun.com/msg/ZFS-8000-8A
scrub: none requested
config:

   NAMESTATE READ WRITE CKSUM
   vol2ONLINE   0 0 0
 mirror-0  ONLINE   0 0 0
   c3t3d0  ONLINE   0 0 0
   c3t2d0  ONLINE   0 0 0

errors: Permanent errors have been detected in the following files:

   vol2/v...@snap-daily-1-2010-05-06-:/as5/as5-flat.vmdk

--
John
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

 


Regards,
markm

   



 


--
michael.schus...@oracle.com http://blogs.sun.com/recursion
Recursion, n.: see 'Recursion'

   



--
John

 



   


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New SSD options

2010-05-19 Thread Don
You can lose all writes from the last committed transaction (i.e., the
one before the currently open transaction).

I'll pick one: performance :)

Honestly, I wish I had a better grasp of the real-world performance of these
drives. 50k IOPS is nice, and given how likely data duplication is in my
environment, the SandForce controller seems like a win. That said, does
anyone have a good set of real-world performance numbers for these drives
that you can link to?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] vibrations and consumer drives

2010-05-19 Thread David Magda
A recent post on StorageMojo has some interesting numbers on how  
vibrations can affect disks, especially consumer drives:


http://storagemojo.com/2010/05/19/shock-vibe-and-awe/

He mentions a 2005 study that I wasn't aware of. In its conclusion it  
states:


Based on the results of these measurements, it was determined that  
the effects of vibration can be
observed and quantified. Furthermore, it demonstrates that [Consumer  
Storage (CS)] disk drives are more sensitive to the vibration from  
physically coupled adjacent disk drives [than Enterprise-class disk  
drives]. However, even though the CS drives are more sensitive to  
vibration, there was no evidence of data corruption when the  
vibration affected write operations.


https://dtc.umn.edu/publications/reports/2005_08.pdf

Another study gives numbers of a 20% decrease in I/O throughput, a 25%
increase in completion time, and a 25% increase in energy consumption.


Probably not a big deal for home use, but it can certainly add up if  
you've got lots of shelves.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] mpt hotswap procedure

2010-05-19 Thread Russ Price
I'm not having any luck hotswapping a drive attached to my Intel SASUC8I 
(LSI-based) controller. The commands which work for the AMD AHCI ports don't 
work for the LSI. Here's what cfgadm -a reports with all drives installed and 
operational:


Ap_Id  Type Receptacle   Occupant Condition
c4 scsi-sas connectedconfigured   unknown
c4::dsk/c4t0d0 disk connectedconfigured   unknown
c4::dsk/c4t1d0 disk connectedconfigured   unknown
c4::dsk/c4t2d0 disk connectedconfigured   unknown
c4::dsk/c4t3d0 disk connectedconfigured   unknown
c4::dsk/c4t4d0 disk connectedconfigured   unknown
c4::dsk/c4t5d0 disk connectedconfigured   unknown
c4::dsk/c4t6d0 disk connectedconfigured   unknown
c4::dsk/c4t7d0 disk connectedconfigured   unknown
sata0/0::dsk/c5t0d0disk connectedconfigured   ok
sata0/1::dsk/c5t1d0disk connectedconfigured   ok
sata0/2::dsk/c5t2d0disk connectedconfigured   ok
sata0/3::dsk/c5t3d0disk connectedconfigured   ok
sata0/4::dsk/c5t4d0disk connectedconfigured   ok
sata0/5::dsk/c5t5d0disk connectedconfigured   ok

[irrelevant USB entries snipped]

Now, if I yank out a drive on one of the AHCI ports (let's use port 3 as an 
example), I can use:


cfgadm -c connect sata0/3
cfgadm -c configure sata0/3

and bring the new drive online. I have had no luck with the SASUC8I; even though 
I can see messages in the system log that a drive was inserted, the only way 
I've been able to actually use the drive afterwards has been via a reboot.  A 
command like:


cfgadm -c connect c4::dsk/c4t4d0

will be greeted with the message:

cfgadm: Hardware specific failure: operation not supported for SCSI device

Is cfgadm -c connect c4 sufficient, or is there some other incantation I'm 
missing? :)
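
For reference, the sequence I've seen suggested for mpt-attached bays, with
c4t4d0 standing in for the replaced disk and tank as a placeholder pool name
(I haven't verified this on the SASUC8I):

devfsadm -Cv                        # rebuild /dev links and prune stale ones
cfgadm -al                          # check whether the new disk shows up
cfgadm -c configure c4::dsk/c4t4d0  # configure (rather than connect) the target
format                              # confirm the drive is now usable
zpool replace tank c4t4d0           # hand it back to ZFS if it is a pool member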


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Ideal SATA/SAS Controllers for ZFS

2010-05-19 Thread Marc Bevand
Deon Cui deon.cui at gmail.com writes:
 
 So I had a bunch of them lying around. We've bought a 16x SAS hotswap
 case and I've put in an AMD X4 955 BE with an ASUS M4A89GTD Pro as
 the mobo.
 
 In the two 16x PCI-E slots I've put in the 1068E controllers I had
 lying around. Everything is still being put together and I still
 haven't even installed opensolaris yet but I'll see if I can get
 you some numbers on the controllers when I am done.

This is a well-architected config with no bottlenecks on the PCIe
links to the 890GX northbridge or on the HT link to the CPU. If you
run 16 concurrent dd if=/dev/rdsk/c?t?d?p0 of=/dev/null bs=1024k and
assuming your drives can do ~100MB/s sustained reads at the
beginning of the platter, you should see an aggregate throughput
of ~1.6GB/s...
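
A throwaway sketch of that test (c4t0d0 through c4t7d0 are placeholder
names; extend the list to cover all 16 drives):

for d in c4t0d0 c4t1d0 c4t2d0 c4t3d0 c4t4d0 c4t5d0 c4t6d0 c4t7d0; do
  dd if=/dev/rdsk/${d}p0 of=/dev/null bs=1024k count=4096 &
done
wait

iostat -xn 5    # in another terminal: per-disk and aggregate throughput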

-mrb

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss