Re: [zfs-discuss] raidz data loss stories?
> On Dec 21, 2009, at 4:09 PM, Michael Herf wrote:
> > Anyone who's lost data this way: were you doing weekly scrubs, or
> > did you find out about the simultaneous failures after not touching
> > the bits for months?
>
> Scrubbing on a routine basis is good for detecting problems early, but
> it doesn't solve the problem of a double failure during resilver. As
> disks grow huge, the chance of a double failure during resilvering
> increases to the point of real possibility, due to the amount of data,
> the bit error rate of the medium, and the prolonged stress of
> resilvering these monsters.
>
> For drives up to 1TB use nothing less than raidz2; for 1TB+ drives use
> raidz3. Avoid raidz vdevs larger than 7 drives; it is better to have
> multiple vdevs, both for performance and reliability.
>
> With 24-slot 2.5" drive enclosures you can easily create 3 7-drive
> raidz3s, or 4 5-drive raidz2s with a spare for each vdev, or 2 spares
> and 1-2 SSDs. Both options give 12/24 usable disk space. 4 raidz2s
> give more performance; 3 raidz3s give more reliability.
>
> -Ross

Hi Ross,

What about good old raid10? It's a pretty reasonable choice for heavily loaded storage, isn't it? I remember when I migrated from raidz2 to an 8-drive raid10, the application administrators were really happy with the new access speed. (We didn't use striped raidz2 as you are suggesting, though.)

-- Roman

-- This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] raidz data loss stories?
On Dec 21, 2009, at 4:09 PM, Michael Herf wrote:
> Anyone who's lost data this way: were you doing weekly scrubs, or did
> you find out about the simultaneous failures after not touching the
> bits for months?

Scrubbing on a routine basis is good for detecting problems early, but it doesn't solve the problem of a double failure during resilver. As disks grow huge, the chance of a double failure during resilvering increases to the point of real possibility, due to the amount of data, the bit error rate of the medium, and the prolonged stress of resilvering these monsters.

For drives up to 1TB use nothing less than raidz2; for 1TB+ drives use raidz3. Avoid raidz vdevs larger than 7 drives; it is better to have multiple vdevs, both for performance and reliability.

With 24-slot 2.5" drive enclosures you can easily create 3 7-drive raidz3s, or 4 5-drive raidz2s with a spare for each vdev, or 2 spares and 1-2 SSDs. Both options give 12/24 usable disk space. 4 raidz2s give more performance; 3 raidz3s give more reliability.

-Ross
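Ross's "12/24 usable" figure follows directly from the parity counts; a quick arithmetic check (plain POSIX shell, disk counts only, no real pool involved):

```shell
# 3 x 7-disk raidz3 (21 disks) + 3 spares: each vdev stores data on (7 - 3) disks.
echo "3x raidz3: $(( 3 * (7 - 3) )) of 24 disks usable"
# 4 x 5-disk raidz2 (20 disks) + 4 spares: each vdev stores data on (5 - 2) disks.
echo "4x raidz2: $(( 4 * (5 - 2) )) of 24 disks usable"
```

Both layouts come out at 12 data disks of the 24 slots, matching the message above.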
Re: [zfs-discuss] raidz data loss stories?
Hey James,

> Personally, I think mirroring is safer (and 3 way mirroring) than
> raidz/z2/5. All my "boot from zfs" systems have 3 way mirror
> root/usr/var disks (using 9 disks) but all my data partitions are
> 2 way mirrors (usually 8 disks or more and a spare.)

Double-parity (or triple-parity) RAID is certainly more resilient against some failure modes than 2-way mirroring. For example, bit errors arise at a certain rate from disks; in the case of a disk failure in a mirror, it's possible to encounter a bit error such that data is lost.

I recently wrote an article for ACM Queue that examines recent trends in hard drives and makes the case for triple-parity RAID. It's at least peripherally relevant to this conversation:

http://blogs.sun.com/ahl/entry/acm_triple_parity_raid

Adam

--
Adam Leventhal, Fishworks  http://blogs.sun.com/ahl
Re: [zfs-discuss] zpool FAULTED after power outage
I was able to recover! Thank you both for replying, and thank you Victor for the step-by-step. I downloaded dev-129 from the site and booted off it. I first ran:

zpool import -nfF -R /mnt rpool

and the command output indicated that I could go back to the state from when the box rebooted itself. I therefore ran:

zpool import -fF -R /mnt rpool

and everything was good. Thanks again!
Re: [zfs-discuss] CIFS Strange Problem
Sassy, this is the zfs-discuss forum. You might have better luck asking at the cifs-discuss forum:
http://mail.opensolaris.org/mailman/listinfo/cifs-discuss
 -- richard

On Dec 21, 2009, at 2:36 PM, Sassy Natan wrote:
> Hi Group,
>
> I have installed the latest version of OpenSolaris (build 129) on my
> machine. I have configured the DNS, Kerberos, PAM and LDAP clients to
> use my Windows 2003 R2 domain. My Windows domain includes the RFC 2307
> POSIX attributes, so each user has a UID and GID configured.
>
> This was very easy to configure, and I managed to get all my users
> from the Windows domain to log on to the OpenSolaris machine. When
> running "getent passwd username" I get, of course, the user id and
> group id from the AD server.
>
> So now I wanted to use the CIFS server. I installed the services and
> started them, added the machine to the domain, and configured a ZFS
> share. Now I only had to create a rule using idmap so that Windows
> users would be mapped to the Unix accounts. But this seems not to
> work. When checking the mapping I get an error; see below:
>
> # id rona
> uid=10005(rona) gid=1(Domain Users) groups=1(Domain Users)
>
> # getent passwd rona
> rona:x:1:1:rona:/home/rona:/bin/sh
>
> # idmap show -cv rona@
> winname:rona@ -> uid:60001
> Error: Not found
>
> # idmap show -cv r...@domain.local
> winname:r...@domain.local -> uid:60001
> Error: Not found
>
> I ran cifs-gendiag and didn't see any problems. Any ideas?
>
> thanks
> Sassy
Re: [zfs-discuss] zpool FAULTED after power outage
JD Trout wrote:
> Hello, I am running OpenSol 2009.06 and after a power outage
> OpenSolaris will no longer boot past GRUB. Booting from the LiveCD
> shows me the following:
>
> r...@opensolaris:~# zpool import -f rpool
> cannot import 'rpool': I/O error
>
> r...@opensolaris:~# zpool import -f
>   pool: rpool
>     id: 15378657248391821369
>  state: FAULTED
> status: The pool was last accessed by another system.
> action: The pool cannot be imported due to damaged devices or data.
>         The pool may be active on another system, but can be imported
>         using the '-f' flag.
>    see: http://www.sun.com/msg/ZFS-8000-EY
> config:
>
>         rpool       FAULTED  corrupted data
>           c7d0s0    ONLINE
>
> Is all hope lost?

No. Try to get a LiveCD based on build 128 or later (e.g. from www.genunix.org), boot off it, and try to import your rpool this way:

zpool import -nfF -R /mnt rpool

If it reports that it can get back to a good pool state, then do the actual import with:

zpool import -fF -R /mnt rpool

In case the first command cannot rewind to an older state, try adding the -X option:

zpool import -nfFX -R /mnt rpool

and if it says that it can recover your pool with some data loss, and you are OK with that, then do the actual import:

zpool import -fFX -R /mnt rpool

regards,
victor
Re: [zfs-discuss] zpool FAULTED after power outage
On Mon, Dec 21, 2009 at 6:50 PM, JD Trout wrote:
> Hello, I am running OpenSol 2009.06 and after a power outage
> OpenSolaris will no longer boot past GRUB. Booting from the LiveCD
> shows me the following:
>
> r...@opensolaris:~# zpool import -f rpool
> cannot import 'rpool': I/O error
>
> r...@opensolaris:~# zpool import -f
>   pool: rpool
>     id: 15378657248391821369
>  state: FAULTED
> status: The pool was last accessed by another system.
> action: The pool cannot be imported due to damaged devices or data.
>         The pool may be active on another system, but can be imported
>         using the '-f' flag.
>    see: http://www.sun.com/msg/ZFS-8000-EY
> config:
>
>         rpool       FAULTED  corrupted data
>           c7d0s0    ONLINE
>
> Is all hope lost?

No, but you'll need to use a newer version of OpenSolaris to recover it automagically.
http://www.c0t0d0s0.org/archives/6067-PSARC-2009479-zpool-recovery-support.html

--
--Tim
[zfs-discuss] zpool FAULTED after power outage
Hello,

I am running OpenSol 2009.06, and after a power outage OpenSolaris will no longer boot past GRUB. Booting from the LiveCD shows me the following:

r...@opensolaris:~# zpool import -f rpool
cannot import 'rpool': I/O error

r...@opensolaris:~# zpool import -f
  pool: rpool
    id: 15378657248391821369
 state: FAULTED
status: The pool was last accessed by another system.
action: The pool cannot be imported due to damaged devices or data.
        The pool may be active on another system, but can be imported
        using the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        rpool       FAULTED  corrupted data
          c7d0s0    ONLINE

Is all hope lost?
Re: [zfs-discuss] ZFS pool unusable after attempting to destroy a dataset with dedup enabled
I don't mean to sound ungrateful (because I really do appreciate all the help I have received here), but I am really missing the use of my server. Over Christmas, I want to be able to use my laptop (right now, it's acting as a server for some of the things my OpenSolaris server did). This means I will need to get my server back up and running in full working order by then.

All the data that I lost is unimportant data, so I'm not really missing anything there. Again, I do appreciate all the help, but I'm going to "give up" if no solution can be found in the next couple of days. This is simply because I want to be able to use my hardware. What I plan on doing is simply formatting each disk that was part of the bad pool and creating a new one.
Re: [zfs-discuss] DeDup and Compression - Reverse Order?
Kjetil Torgrim Homme wrote:
>> Note also that the compress/encrypt/checksum and the dedup are
>> separate pipeline stages, so while dedup is happening for block N,
>> block N+1 can be getting transformed - so this is designed to take
>> advantage of multiple scheduling units (threads, cpus, cores, etc).
>
> nice. are all of them separate stages, or are compress/encrypt/checksum
> done as one stage?

Originally compress, encrypt and checksum were all separate stages in the zio pipeline; they are now all one stage, ZIO_WRITE_BP_INIT for the write case and ZIO_READ_BP_INIT for the read case.

>> Also if you place a block in an unencrypted dataset that happens to
>> match the ciphertext in an encrypted dataset, they won't dedup either
>> (you need to understand what I've done with the AES CCM/GCM MAC and
>> the zio_cksum_t field in the blkptr_t, and how that is used by dedup,
>> to see why).
>
> wow, I didn't think of that problem. did you get bitten by wrongful
> dedup during testing with image files? :-)

No, I didn't see the problem in reality; I just thought about it as a possible risk that needed to be addressed. Solving it didn't actually require me to do any additional work, because ZFS uses a separate table for each checksum algorithm anyway, and the checksum algorithm for encrypted datasets is listed as sha256+mac, not sha256. It was nice that I didn't have to write more code to solve the problem, but it may not have been that way.

-- Darren J Moffat
Re: [zfs-discuss] DeDup and Compression - Reverse Order?
Daniel Carosone wrote:
> Your parenthetical comments here raise some concerns, or at least
> eyebrows, with me. Hopefully you can lower them again.
>
>> compress, encrypt, checksum, dedup. (and you need to use zdb to get
>> enough info to see the leak - and that means you have access to the
>> raw devices)
>
> An attacker with access to the raw devices is the primary base threat
> model for on-disk encryption, surely? An attacker with access to disk
> traffic, via e.g. iSCSI, who can also deploy dynamic traffic analysis
> in addition to static content analysis, and who also has similarly
> greater opportunities for tampering, is another trickier threat model.
> It seems like entirely wrong thinking (even in parentheses) to dismiss
> an issue as irrelevant because it only applies in the primary threat
> model.

I wasn't dismissing it; I was pointing out that this isn't something an unprivileged end user could easily do. If the risk is unacceptable then dedup shouldn't be enabled. For some use cases the risk is acceptable, and for those use cases we want to allow the use of dedup with encryption.

>> (and the way I have implemented the IV generation for AES CCM/GCM
>> mode ensures that the same plaintext will have the same IV so the
>> ciphertexts will match).
>
> Again, this seems like a cause for concern. Have you effectively
> turned these fancy and carefully designed crypto modes back into ECB,
> albeit at a larger block size (and only within a dataset)?

No, I don't believe I have. The IV generation when doing deduplication is done by calculating an HMAC of the plaintext using a separate per-dataset key (which is also refreshed if 'zfs key -K' is run to rekey the dataset).

> Let's consider copy-on-write semantics: with the above issue an
> attacker can tell which blocks of a file have changed over time, even
> if unchanged blocks have been rewritten.. giving even the static image
> attacker some traffic analysis capability.

So if that is part of your deployment risk model, deduplication is not worth enabling in that case.

> This would be a problem regardless of dedup, for the scenario where
> the attacker can see repeated ciphertext on disk (unless the dedup
> metadata itself is sufficiently encrypted, which I understand it is
> not).

In the case where deduplication is not enabled, the IV generation uses a combination of the txg number, the object and the blockid, which complies with the recommendations for IV generation for both CCM and GCM.

>> (you need to understand what I've done with the AES CCM/GCM MAC
>
> I'd like to, but more to understand what (if any) protection is given
> against replay attacks (above that already provided by the merkle
> hash tree).

What do you mean by a replay attack? What is being replayed, and by whom?

-- Darren J Moffat
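The property Darren describes (IV derived as an HMAC of the block plaintext under a separate per-dataset key) can be illustrated outside ZFS with openssl. This is a sketch of the property only, not ZFS source code; the key strings and block contents are made up:

```shell
# Same plaintext + same per-dataset key -> same IV (so ciphertexts can dedup);
# same plaintext + a different key      -> different IV (no cross-dataset match).
iv1=$(printf 'same block contents' | openssl dgst -sha256 -hmac 'dataset-A-iv-key')
iv2=$(printf 'same block contents' | openssl dgst -sha256 -hmac 'dataset-A-iv-key')
iv3=$(printf 'same block contents' | openssl dgst -sha256 -hmac 'dataset-B-iv-key')
[ "$iv1" = "$iv2" ] && echo "deterministic within a dataset"
[ "$iv1" != "$iv3" ] && echo "distinct across datasets"
```

Deterministic but keyed: an observer without the per-dataset HMAC key cannot predict the IV for a chosen plaintext, which is what separates this from plain ECB-style determinism.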
[zfs-discuss] CIFS Strange Problem
Hi Group,

I have installed the latest version of OpenSolaris (build 129) on my machine. I have configured the DNS, Kerberos, PAM and LDAP clients to use my Windows 2003 R2 domain. My Windows domain includes the RFC 2307 POSIX attributes, so each user has a UID and GID configured.

This was very easy to configure, and I managed to get all my users from the Windows domain to log on to the OpenSolaris machine. When running "getent passwd username" I get, of course, the user id and group id from the AD server.

So now I wanted to use the CIFS server. I installed the services and started them, added the machine to the domain, and configured a ZFS share. Now I only had to create a rule using idmap so that Windows users would be mapped to the Unix accounts. But this seems not to work. When checking the mapping I get an error; see below:

# id rona
uid=10005(rona) gid=1(Domain Users) groups=1(Domain Users)

# getent passwd rona
rona:x:1:1:rona:/home/rona:/bin/sh

# idmap show -cv rona@
winname:rona@ -> uid:60001
Error: Not found

# idmap show -cv r...@domain.local
winname:r...@domain.local -> uid:60001
Error: Not found

I ran cifs-gendiag and didn't see any problems. Any ideas?

thanks
Sassy
Re: [zfs-discuss] raidz data loss stories?
Anyone who's lost data this way: were you doing weekly scrubs, or did you find out about the simultaneous failures after not touching the bits for months?

mike
Re: [zfs-discuss] directory size on compressed file system on Solaris 10
On Dec 21, 2009, at 20:23, Joerg Schilling wrote:
> Matthew Ahrens wrote:
>> Gaëtan Lehmann wrote:
>>> On opensolaris, I use du with the -b option to get the uncompressed
>>> size of a directory:
>>>
>>> r...@opensolaris:~# du -sh /usr/local/
>>> 399M    /usr/local/
>>> r...@opensolaris:~# du -sbh /usr/local/
>>> 915M    /usr/local/
>>>
>>> but on Solaris 10, there is no such option. So what is the best way
>>> to get the uncompressed size of a directory on Solaris 10?
>>
>> Install GNU du on solaris 10? Although the answer will be just as
>> (in)accurate as GNU du on solaris 10.
>
> How about:
>
> find . -type f -ls | awk '{ sum += $7} END {print sum}'

Sounds good:

r...@opensolaris:~# find /usr/local/ -ls | awk '{sum += $7} END {print sum/1024**2}'
914.039

though a little longer to type:

r...@opensolaris:~# echo "du -sbh ." | wc -c
10
r...@opensolaris:~# echo "find . -ls | awk '{sum += $7} END {print sum/1024**2}'" | wc -c
53

Thanks!

Gaëtan

--
Gaëtan Lehmann
Biologie du Développement et de la Reproduction
INRA de Jouy-en-Josas (France)
tel: +33 1 34 65 29 66  fax: 01 34 65 29 09
http://voxel.jouy.inra.fr  http://www.itk.org
http://www.mandriva.org  http://www.bepo.fr
Re: [zfs-discuss] directory size on compressed file system on Solaris 10
On Dec 21, 2009, at 19:28, Matthew Ahrens wrote:
> Gaëtan Lehmann wrote:
>> On opensolaris, I use du with the -b option to get the uncompressed
>> size of a directory, but on Solaris 10, there is no such option. So
>> what is the best way to get the uncompressed size of a directory on
>> Solaris 10?
>
> Install GNU du on solaris 10?

That's an option of course, but I'd prefer something that I can use without installing any extra program.

> Although the answer will be just as (in)accurate as GNU du on solaris
> 10. Note that it reports the compression ratio as 915/399 = 2.29x;
> actual is 2.51x. This could be due to sparse files, or metadata like
> directories, whose "apparent size" (st_size) is not what GNU du
> expects.

At least it gives a not-so-bad estimate :-)

And the compression ratio includes the data in the snapshots, so it may be inaccurate in that case also. The actual compression ratio, given on the last snapshot, is 2.41x.

Gaëtan
Re: [zfs-discuss] directory size on compressed file system on Solaris 10
Matthew Ahrens wrote:
> Gaëtan Lehmann wrote:
>> Hi,
>>
>> On opensolaris, I use du with the -b option to get the uncompressed
>> size of a directory:
>>
>> r...@opensolaris:~# du -sh /usr/local/
>> 399M    /usr/local/
>> r...@opensolaris:~# du -sbh /usr/local/
>> 915M    /usr/local/
>> r...@opensolaris:~# which du
>> /usr/gnu/bin/du
>>
>> but on Solaris 10, there is no such option.
>>
>> So what is the best way to get the uncompressed size of a directory
>> on Solaris 10?
>
> Install GNU du on solaris 10? Although the answer will be just as
> (in)accurate as GNU du on solaris 10.

How about:

find . -type f -ls | awk '{ sum += $7} END {print sum}'

Jörg

--
EMail: jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       j...@cs.tu-berlin.de (uni)
       joerg.schill...@fokus.fraunhofer.de (work)
Blog: http://schily.blogspot.com/
URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
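Joerg's find | awk one-liner can be sanity-checked on a scratch directory; field 7 of `find -ls` output is the file size in bytes (apparent size, i.e. st_size) on both Solaris find and GNU findutils:

```shell
# Create two small files of known size and sum their apparent sizes.
dir=$(mktemp -d)
printf '1234' > "$dir/a"    # 4 bytes
printf '12'   > "$dir/b"    # 2 bytes
find "$dir" -type f -ls | awk '{ sum += $7 } END { print sum }'   # prints 6
rm -rf "$dir"
```

Like GNU `du -b`, this reports apparent sizes, so sparse files and directory metadata will still make it disagree with the zfs compressratio.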
Re: [zfs-discuss] Troubleshooting dedup performance
In case the overhead of calculating SHA256 was the cause, I set the ZFS checksum to SHA256 at the pool level and left it for a number of days. This worked fine. Setting dedup=on immediately crippled performance, and then setting dedup=off fixed things again.

I did notice through zpool iostat that disk I/O increased while dedup was on, although it didn't from the ESXi side. Could it be that the dedup tables don't fit in memory? I don't have a great deal - 3GB. Is there a measure of how large the tables are in bytes, rather than number of entries?

Chris

-----Original Message-----
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Chris Murray
Sent: 16 December 2009 17:19
To: Cyril Plisko; Andrey Kuzmin
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Troubleshooting dedup performance

So if the ZFS checksum is set to fletcher4 at the pool level, and dedup=on, which checksum will it be using? If I attempt to set dedup=fletcher4, I do indeed get this:

cannot set property for 'zp': 'dedup' must be one of 'on | off | verify | sha256[,verify]'

Could it be that my performance troubles are due to the calculation of two different checksums?

Thanks,
Chris

-----Original Message-----
From: cyril.pli...@gmail.com [mailto:cyril.pli...@gmail.com] On Behalf Of Cyril Plisko
Sent: 16 December 2009 17:09
To: Andrey Kuzmin
Cc: Chris Murray; zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Troubleshooting dedup performance

>> I've set dedup to what I believe are the least resource-intensive
>> settings - "checksum=fletcher4" on the pool, & "dedup=on" rather than
>
> I believe checksum=fletcher4 is acceptable in dedup=verify mode only.
> What you're doing is seemingly deduplication with a weak checksum w/o
> verification.

I think fletcher4 use for deduplication purposes was disabled [1] entirely, right before the build 129 cut.

[1] http://hg.genunix.org/onnv-gate.hg/diff/93c7076216f6/usr/src/common/zfs/zfs_prop.c

--
Regards,
Cyril
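On builds with dedup, `zdb -DD <pool>` prints a DDT histogram including entry counts and on-disk/in-core entry sizes, which answers the "how large in bytes" question directly. As a rough feel for the 3GB case, assuming on the order of 250 bytes of RAM per unique block tracked (an assumption for illustration; the real per-entry cost varies by build):

```shell
# How many DDT entries fit in ~3 GiB of RAM at an assumed ~250 bytes each?
entries=$(( 3 * 1024 * 1024 * 1024 / 250 ))
echo "entries: $entries"
# At 128K records, that many unique blocks cover roughly this much data:
echo "covers about $(( entries * 128 / 1024 / 1024 )) GiB of unique 128K blocks"
```

If the deduplicated data holds many more unique blocks than that, DDT lookups spill to disk, which would match the extra disk I/O seen in zpool iostat.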
Re: [zfs-discuss] directory size on compressed file system on Solaris 10
Gaëtan Lehmann wrote:
> Hi,
>
> On opensolaris, I use du with the -b option to get the uncompressed
> size of a directory:
>
> r...@opensolaris:~# du -sh /usr/local/
> 399M    /usr/local/
> r...@opensolaris:~# du -sbh /usr/local/
> 915M    /usr/local/
> r...@opensolaris:~# zfs list -o space,refer,ratio,compress data/local
> NAME        AVAIL  USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD  REFER  RATIO  COMPRESS
> data/local   228G  643M      249M    394M              0          0   394M  2.51x        on
> r...@opensolaris:~# which du
> /usr/gnu/bin/du
>
> but on Solaris 10, there is no such option.
>
> So what is the best way to get the uncompressed size of a directory
> on Solaris 10?

Install GNU du on solaris 10? Although the answer will be just as (in)accurate as GNU du on solaris 10. Note that it reports the compression ratio as 915/399 = 2.29x; actual is 2.51x. This could be due to sparse files, or metadata like directories, whose "apparent size" (st_size) is not what GNU du expects.

Took me a minute to realize you were talking about the space used under a subdirectory, not the space consumed by the directory itself! I guess I'm the only one creating 400MB directories :-)

--matt
[zfs-discuss] Mirror config and installgrub errors
I've just bought a second drive for my home PC and decided to set up a mirror. I ran:

pfexec zpool attach rpool c9d0s0 c13d0s0

waited for the resilver, and tried to install grub on the second disk:

$ pfexec installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c13d0s0
cannot open/stat device /dev/rdsk/c13d0s2

$ pfexec installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c13d0
raw device must be a root slice (not s2)

What am I doing wrong? I believe that the new device is SMI-labeled (I formatted it and set an SMI label), however I don't know how to check it...

$ pfexec prtvtoc /dev/rdsk/c13d0s0
* /dev/rdsk/c13d0s0 partition map
*
* Dimensions:
*     512 bytes/sector
*      63 sectors/track
*     255 tracks/cylinder
*   16065 sectors/cylinder
*   60799 cylinders
*   60797 accessible cylinders
*
* Flags:
*   1: unmountable
*  10: read-only
*
* Unallocated space:
*       First     Sector    Last
*       Sector     Count    Sector
*           0      48195     48194
*
*                          First     Sector    Last
* Partition  Tag  Flags    Sector     Count    Sector  Mount Directory
       0      2    00       48195  976655610  976703804
       8      1    01           0      16065     16064
       9      9    00       16065      32130     48194
[zfs-discuss] ZFS dedup memory usage for DDT
Dear all,

We use an "old" 48TB X4500, aka Thumper, as an iSCSI server based on snv_129. As the machine has only 16GB of RAM, we are wondering whether that is sufficient to hold the bigger part of the DDT in memory without affecting performance by limiting the ARC. Any hints about scaling memory vs. disk space, or the like?

Thanks ahead,
Thomas

--
- GPG fingerprint: B1 EE D2 39 2C 82 26 DA A5 4D E0 50 35 75 9E ED
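A back-of-envelope for the 48TB / 16GB question, under assumptions that should be stated plainly: on the order of 250 bytes of core per DDT entry and a 128 KiB average block size (neither is a measured figure for snv_129; smaller average blocks make it much worse):

```shell
# 48 TiB of unique 128 KiB blocks (all sizes computed in KiB):
blocks=$(( 48 * 1024 * 1024 * 1024 / 128 ))
echo "unique blocks: $blocks"
# At an assumed ~250 bytes of RAM per DDT entry:
echo "approx DDT core size: $(( blocks * 250 / 1024 / 1024 / 1024 )) GiB"
```

On those assumptions a full 48 TiB of unique data would want on the order of 90+ GiB for the whole DDT, so with 16GB of RAM only a fraction of the table stays resident unless the dedup ratio is high; an L2ARC device is one common mitigation.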
[zfs-discuss] EON ZFS Storage 0.59.9 based on snv 129, Deduplication release!
Embedded Operating system/Networking (EON), a RAM-based live ZFS NAS appliance, is released on Genunix! This is the first EON release with inline deduplication features! Many thanks to Genunix.org for download hosting and serving the opensolaris community.

EON deduplication ZFS storage is available in 32- and 64-bit, CIFS and Samba versions:

EON 64-bit x86 CIFS ISO image version 0.59.9 based on snv_129
* eon-0.599-129-64-cifs.iso
* MD5: 8e917a14dbf0c793ad2958bdf8feb24a
* Size: ~93Mb
* Released: Monday 21-December-2009

EON 64-bit x86 Samba ISO image version 0.59.9 based on snv_129
* eon-0.599-129-64-smb.iso
* MD5: 2c38a93036e4367e5cdf8a74605fcbaf
* Size: ~107Mb
* Released: Monday 21-December-2009

EON 32-bit x86 CIFS ISO image version 0.59.9 based on snv_129
* eon-0.599-129-32-cifs.iso
* MD5: 0dcdd754b937f1d6515eba34b6ed2607
* Size: ~59Mb
* Released: Monday 21-December-2009

EON 32-bit x86 Samba ISO image version 0.59.9 based on snv_129
* eon-0.599-129-32-smb.iso
* MD5: c24008516eb4584a64d9239015559ba4
* Size: ~73Mb
* Released: Monday 21-December-2009

EON 64-bit x86 CIFS ISO image version 0.59.9 based on snv_129 (NO HTTPD)
* eon-0.599-129-64-cifs-min.iso
* MD5: 78b0bb116c0e32a48c473ce1b94e604f
* Size: ~87Mb
* Released: Monday 21-December-2009

EON 64-bit x86 Samba ISO image version 0.59.9 based on snv_129 (NO HTTPD)
* eon-0.599-129-64-smb-min.iso
* MD5: 57d93eba9286c4bcc4c00c0154c684de
* Size: ~101Mb
* Released: Monday 21-December-2009

New/Changes/Fixes:
- Deduplication, Deduplication, Deduplication. (Only 1x the storage space was used.)
- The hotplug errors at boot are being worked on. They are safe to ignore.
- Cleaned up minor entries in /mnt/eon0/.exec. Added "rsync --daemon" to start by default.
- EON rebooting at grub (since snv_122) in ESXi, Fusion and various versions of VMware Workstation. This is related to bug 6820576. Workaround: at grub, press e and add "-B disable-pcieb=true" to the end of the kernel line.

http://eonstorage.blogspot.com
http://sites.google.com/site/eonstorage/
Re: [zfs-discuss] raidz data loss stories?
Yes, a coworker lost a second disk during a rebuild of a raid5 and lost all data.

I have not had a failure myself; however, when migrating EqualLogic arrays in and out of pools, I lost a disk on an array. No data loss, but it concerns me, because during the moves you are essentially reading and writing all of the data on the disk. Did I have a latent problem on that particular disk that only exposed itself when doing such a large read/write? What if another disk had failed, and during the rebuild this latent problem was exposed? Trouble, trouble.

They say security is an onion. So is data protection.

Scott
[zfs-discuss] Emulex HBA fails periodically : ZFS, QFS, or the combination??
The question: is there an issue running: ZFS and QFS on the same file server? The details: We have a 2540 raid controller with 4 raidsets. Each raidset presents 2 slices to the OS. one slice (slice 0) from each raidset is a separate qfs filesystems shared among 7 servers running qfs 4.6patch6. One of the above servers has created on zfs pool using the other 4 slices (slice 1 from each of the 4 raidsets) pool: zfs_hpf state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM zfs_hpf ONLINE 0 0 0 c6t600A0B8000495A51081F492C644Dd0 ONLINE 0 0 0 c6t600A0B8000495B1C053148B41F54d0 ONLINE 0 0 0 c6t600A0B8000495B1C053248B42036d0 ONLINE 0 0 0 c6t600A0B8000495B1C05B948CA87A2d0 ONLINE 0 0 0 zpool upgrade This system is currently running ZFS pool version 15. zfs upgrade This system is currently running ZFS filesystem version 4. What we have been seeing is that the emulex HBAs attaching to the 2540 through a sanswitch have been dying periodically, particularly under load. luxadm -e port /devices/p...@1,0/pci1022,7...@2/pci10df,f...@1/f...@0,0:devctl NOT CONNECTED /devices/p...@1,0/pci1022,7...@2/pci10df,f...@1,1/f...@0,0:devctl NOT CONNECTED We need to reboot to get the hba to reconnect. we have never seen this on the other fileservers and are wondering if the HBA is faulty or is there an issue running: ZFS and QFS on the same file server? Has anyone seen this? As an FYI we are running cat /etc/release Solaris 10 10/09 s10x_u8wos_08a X86 Len Zaifman Systems Manager, High Performance Systems The Centre for Computational Biology The Hospital for Sick Children 555 University Ave. Toronto, Ont M5G 1X8 tel: 416-813-5513 email: leona...@sickkids.ca This e-mail may contain confidential, personal and/or health information(information which may be subject to legal restrictions on use, retention and/or disclosure) for the sole use of the intended recipient. Any review or distribution by anyone other than the person for whom it was originally intended is strictly prohibited. 
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
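As an aside for readers hitting the same "NOT CONNECTED" symptom: before resorting to a reboot, it is sometimes worth trying to reinitialize the FC link from the OS. This is only a sketch; the controller ID and device path below are placeholders, not taken from the post above:

```
# List attachment points and their current state.
cfgadm -al

# Try to (re)configure the FC fabric attachment point (controller ID is a placeholder).
cfgadm -c configure c6

# Force a link/loop reinitialization on the HBA port (path is a placeholder).
luxadm -e forcelip /devices/p...@1,0/pci1022,7...@2/pci10df,f...@1/f...@0,0:devctl
```

Whether this helps depends on whether the HBA firmware itself has wedged; if it has, only a reset or reboot will recover it.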
Re: [zfs-discuss] raidz data loss stories?
If you are asking whether anyone has experienced two drive failures simultaneously: the answer is yes. It has happened to me (at home) and to at least one client that I can remember. In both cases, I was able to dd off one of the failed disks (the one with just bad sectors, or fewer bad sectors) and reconstruct the RAID 5 (force it online), then copy data off the array onto new drives. Personally, I think mirroring (and 3-way mirroring) is safer than raidz/raidz2/RAID 5. All my "boot from zfs" systems have 3-way mirrored root/usr/var disks (using 9 disks), but all my data partitions are 2-way mirrors (usually 8 disks or more, and a spare).
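For the record, the "dd off the failed disk" step is usually done with error-tolerant options so that bad sectors are skipped rather than aborting the whole copy. A sketch, with placeholder device names:

```
# Copy the weaker disk onto a fresh one. conv=noerror continues past read
# errors instead of aborting; conv=sync pads short reads with zeros so the
# output stays sector-aligned. Device names are placeholders.
dd if=/dev/rdsk/c1t2d0s2 of=/dev/rdsk/c2t2d0s2 bs=512 conv=noerror,sync
```

A small block size (bs=512) is slow but loses only one sector per read error; a large block size is faster but zero-fills the whole block around each error.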
[zfs-discuss] directory size on compressed file system on Solaris 10
Hi,

On OpenSolaris, I use du with the -b option to get the uncompressed size of a directory:

r...@opensolaris:~# du -sh /usr/local/
399M    /usr/local/
r...@opensolaris:~# du -sbh /usr/local/
915M    /usr/local/
r...@opensolaris:~# zfs list -o space,refer,ratio,compress data/local
NAME        AVAIL  USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD  REFER  RATIO  COMPRESS
data/local   228G  643M      249M    394M              0          0   394M  2.51x        on
r...@opensolaris:~# which du
/usr/gnu/bin/du

But on Solaris 10 there is no such option. So what is the best way to get the uncompressed size of a directory on Solaris 10?

Regards,

Gaëtan

-- 
Gaëtan Lehmann
Biologie du Développement et de la Reproduction
INRA de Jouy-en-Josas (France)
tel: +33 1 34 65 29 66  fax: 01 34 65 29 09
http://voxel.jouy.inra.fr  http://www.itk.org
http://www.mandriva.org  http://www.bepo.fr
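One possible workaround on Solaris 10 (a sketch, not a full substitute for GNU du): since stock /usr/bin/du reports allocated blocks rather than apparent size, you can sum the apparent file sizes yourself with find, ls and awk. Note this counts regular files only and ignores directory overhead:

```shell
# Sum the apparent (uncompressed) sizes of all regular files under a directory.
# Column 5 of "ls -ln" is the file size in bytes; awk totals the column.
dir=/usr/local
find "$dir" -type f -exec ls -ln {} + | awk '{ s += $5 } END { printf("%.0f bytes\n", s) }'
```

The result approximates what GNU `du -sb` reports, modulo hard links (counted once per path here) and per-directory metadata.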
Re: [zfs-discuss] ARC not using all available RAM?
On Mon, 21 Dec 2009, Tristan Ball wrote:

>> Yes, primarily since if there is no more memory immediately available,
>> performance when starting new processes would suck. You need to reserve
>> some working space for processes and short-term requirements.
>
> Why is that a given? There are several systems that steal from cache under
> memory pressure. Earlier versions of Solaris that I've dealt with a little
> managed with quite a bit less than 1G free. On this system, "lotsfree" is
> sitting at 127MB, which seems reasonable, and isn't it "lotsfree" and the
> related variables and page-reclaim logic that maintain that pool of free
> memory for new allocations?

It ain't necessarily so, but any time you need to run "reclaim" logic, CPU time is expended and the CPU caches tend to get thrashed. Without constraints, the cache would expand to the total amount of file data encountered. It is much better to avoid any thrashing.

Bob
-- 
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
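For anyone trying to see these limits in practice, the live ARC numbers (current size, target, and the ceiling the reclaim logic enforces) are exposed as kstats, and the free-memory threshold can be read from the running kernel. These are standard Solaris commands, shown here as a sketch:

```
# ARC sizes, in bytes.
kstat -p zfs:0:arcstats:size     # current ARC size
kstat -p zfs:0:arcstats:c        # current target size
kstat -p zfs:0:arcstats:c_max    # hard ceiling
kstat -p zfs:0:arcstats:c_min    # floor

# Free-memory threshold the pageout scanner maintains, in pages.
echo "lotsfree/D" | mdb -k
```

Comparing `c` against `c_max` over time shows whether the ARC is being actively shrunk by memory pressure or has simply settled where the workload left it.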
Re: [zfs-discuss] SSD strange performance problem, resilvering helps during operation
Mart van Santen wrote:

Hi,

Do the I/O problems go away when only one of the SSDs is attached?

No, the problem stays with only one SSD. The problem is only reduced when resilvering, not gone entirely (maybe because of the resilver overhead).

The resilver is likely masking some underlying problem. :-(

Frankly, I'm betting that your SSDs are wearing out. Resilvering will essentially be one big streaming write, which is optimal for SSDs (even an SLC-based SSD, as you likely have, performs far better when writing large amounts of data at once). NFS (and to a lesser extent iSCSI) is generally a whole lot of random small writes, which are hard on an SSD (especially MLC-based ones, but even SLC ones). The resilvering process is likely turning many of the random writes coming into the system into a large streaming write to the /resilvering/ drive.

Hmm, interesting theory. Next I will execute only a resilver to see if the same happens. I assume that when adding a new disk, even though it's only a slog disk, the whole tank will resilver? If I look at the zpool iostat output I currently see a lot of reads on the separate SATA disks (not on the tank/raidz2 pools), so I assume resilvering takes place there and the SSDs are already synced.

I'm not 100% sure, but replacing a device in a mirrored ZIL should only generate I/O on the other ZIL device, not on the main pool devices.

SSDs are not hard drives. Even high-quality modern ones have /significantly/ lower USE lifespans than an HD - that is, a heavily-used SSD will die well before a HD, but a very-lightly-used SSD will likely outlast a HD. And, in the case of SSDs, writes are far harder on the SSD than reads are.

Isn't about half a year really short for these disks? Sure, we have some I/O, but not that many write operations, about ~80-140 IOPS. Anyway, I will try to get new disks from Sun (we have SLC disks from Sun). Is there any knowledge about the lifetime of SSDs? Maybe in terms of the number of I/O operations?
Regards,

Mart van Santen

That's not enough time for that level of IOPS to wear out the SSDs (which are likely OEM Intel X25-E). Something else is wrong.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
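One way to check whether a slog device is the bottleneck without detaching anything is to watch the per-vdev counters while the load is running. The pool name below is a placeholder:

```
# Per-vdev I/O statistics, refreshed every 5 seconds. A wearing slog SSD
# typically shows low write throughput on the log device while the main
# pool vdevs sit comparatively idle. Pool name is a placeholder.
zpool iostat -v tank 5
```

If the log device's write bandwidth is far below what a healthy SSD should sustain during bursts of synchronous NFS/iSCSI traffic, that points at the SSD rather than the pool.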
Re: [zfs-discuss] SSD strange performance problem, resilvering helps during operation
Hi,

Do the I/O problems go away when only one of the SSDs is attached?

No, the problem stays with only one SSD. The problem is only reduced when resilvering, not gone entirely (maybe because of the resilver overhead).

Frankly, I'm betting that your SSDs are wearing out. Resilvering will essentially be one big streaming write, which is optimal for SSDs (even an SLC-based SSD, as you likely have, performs far better when writing large amounts of data at once). NFS (and to a lesser extent iSCSI) is generally a whole lot of random small writes, which are hard on an SSD (especially MLC-based ones, but even SLC ones). The resilvering process is likely turning many of the random writes coming into the system into a large streaming write to the /resilvering/ drive.

Hmm, interesting theory. Next I will execute only a resilver to see if the same happens. I assume that when adding a new disk, even though it's only a slog disk, the whole tank will resilver? If I look at the zpool iostat output I currently see a lot of reads on the separate SATA disks (not on the tank/raidz2 pools), so I assume resilvering takes place there and the SSDs are already synced.

My guess is that the SSD you are having problems with has reached the end of its useful lifespan, and the I/O problems you are seeing during normal operation are the result of that SSD's problems with committing data. There's no cure for this, other than replacing the SSD with a new one.

SSDs are not hard drives. Even high-quality modern ones have /significantly/ lower USE lifespans than an HD - that is, a heavily-used SSD will die well before a HD, but a very-lightly-used SSD will likely outlast a HD. And, in the case of SSDs, writes are far harder on the SSD than reads are.

Isn't about half a year really short for these disks? Sure, we have some I/O, but not that many write operations, about ~80-140 IOPS. Anyway, I will try to get new disks from Sun (we have SLC disks from Sun).
Is there any knowledge about the lifetime of SSDs? Maybe in terms of the number of I/O operations?

Regards,

Mart van Santen

-- 
Greenhost - Duurzame Hosting
Derde Kostverlorenkade 35
1054 TS Amsterdam
T: 020 489 4349 F: 020 489 2306
KvK: 34187349
Re: [zfs-discuss] SSD strange performance problem, resilvering helps during operation
It might be helpful to contact the SSD vendor, report the issue, and ask whether wearing out within half a year is expected behavior for this model. Further, if you have the option to replace one (or both) SSDs with fresh ones, this could tell for sure whether they are the root cause.

Regards,
Andrey

On Mon, Dec 21, 2009 at 1:18 PM, Erik Trimble wrote:
> Mart van Santen wrote:
>>
>> Hi,
>>
>> We have a X4150 with a J4400 attached, configured with 2x 32GB SSDs in a
>> mirror configuration (ZIL) and 12x 500GB SATA disks. We have been running
>> this setup for over half a year now in production for NFS and iSCSI for a
>> bunch of virtual machines (currently about 100 VMs, mostly Linux, some
>> Windows).
>>
>> Since last week we have performance problems, causing I/O wait in the VMs.
>> Of course we did a big search for networking issues, hanging machines,
>> firewall & traffic tests, but were unable to find any problems. So we had a
>> look into the zpool and dropped one of the mirrored SSDs from the pool (we
>> had some indication the ZIL was not working OK). No success. After adding
>> the disk back, we discovered the I/O wait during the "resilvering" process
>> was OK, or at least much better, again. So last night we did the same,
>> dropped & added the same disk, and yes, again, the I/O wait looked better.
>> This morning, the same story.
>>
>> Because this machine is a production machine, we cannot tolerate too much
>> experimentation. We now know this operation buys us about 4 to 6 hours
>> (the time to resilver), but we didn't have the courage to detach/attach the
>> other SSD yet. We will try only a "resilver", without detach/attach,
>> tonight, to see what happens.
>>
>> Can anybody explain how the detach/attach and resilver process works, and
>> especially whether there is something different during the resilvering in
>> the handling of the SSDs/slog disks?
>>
>>
>> Regards,
>>
>>
>> Mart
>>
> Do the I/O problems go away when only one of the SSDs is attached?
>
> Frankly, I'm betting that your SSDs are wearing out. Resilvering will
> essentially be one big streaming write, which is optimal for SSDs (even an
> SLC-based SSD, as you likely have, performs far better when writing large
> amounts of data at once). NFS (and to a lesser extent iSCSI) is generally a
> whole lot of random small writes, which are hard on an SSD (especially
> MLC-based ones, but even SLC ones). The resilvering process is likely
> turning many of the random writes coming into the system into a large
> streaming write to the /resilvering/ drive.
>
> My guess is that the SSD you are having problems with has reached the end of
> its useful lifespan, and the I/O problems you are seeing during normal
> operation are the result of that SSD's problems with committing data.
> There's no cure for this, other than replacing the SSD with a new one.
>
> SSDs are not hard drives. Even high-quality modern ones have /significantly/
> lower USE lifespans than an HD - that is, a heavily-used SSD will die well
> before a HD, but a very-lightly-used SSD will likely outlast a HD. And, in
> the case of SSDs, writes are far harder on the SSD than reads are.
>
> --
> Erik Trimble
> Java System Support
> Mailstop: usca22-123
> Phone: x17195
> Santa Clara, CA
> Timezone: US/Pacific (GMT-0800)
Re: [zfs-discuss] SSD strange performance problem, resilvering helps during operation
Mart van Santen wrote:

Hi,

We have a X4150 with a J4400 attached, configured with 2x 32GB SSDs in a mirror configuration (ZIL) and 12x 500GB SATA disks. We have been running this setup for over half a year now in production for NFS and iSCSI for a bunch of virtual machines (currently about 100 VMs, mostly Linux, some Windows).

Since last week we have performance problems, causing I/O wait in the VMs. Of course we did a big search for networking issues, hanging machines, firewall & traffic tests, but were unable to find any problems. So we had a look into the zpool and dropped one of the mirrored SSDs from the pool (we had some indication the ZIL was not working OK). No success. After adding the disk back, we discovered the I/O wait during the "resilvering" process was OK, or at least much better, again. So last night we did the same, dropped & added the same disk, and yes, again, the I/O wait looked better. This morning, the same story.

Because this machine is a production machine, we cannot tolerate too much experimentation. We now know this operation buys us about 4 to 6 hours (the time to resilver), but we didn't have the courage to detach/attach the other SSD yet. We will try only a "resilver", without detach/attach, tonight, to see what happens.

Can anybody explain how the detach/attach and resilver process works, and especially whether there is something different during the resilvering in the handling of the SSDs/slog disks?

Regards,

Mart

Do the I/O problems go away when only one of the SSDs is attached?

Frankly, I'm betting that your SSDs are wearing out. Resilvering will essentially be one big streaming write, which is optimal for SSDs (even an SLC-based SSD, as you likely have, performs far better when writing large amounts of data at once). NFS (and to a lesser extent iSCSI) is generally a whole lot of random small writes, which are hard on an SSD (especially MLC-based ones, but even SLC ones).
The resilvering process is likely turning many of the random writes coming into the system into a large streaming write to the /resilvering/ drive.

My guess is that the SSD you are having problems with has reached the end of its useful lifespan, and the I/O problems you are seeing during normal operation are the result of that SSD's problems with committing data. There's no cure for this, other than replacing the SSD with a new one.

SSDs are not hard drives. Even high-quality modern ones have /significantly/ lower USE lifespans than an HD - that is, a heavily-used SSD will die well before a HD, but a very-lightly-used SSD will likely outlast a HD. And, in the case of SSDs, writes are far harder on the SSD than reads are.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
Re: [zfs-discuss] How do I determine dedupe effectiveness?
Brandon High wrote:

> On Sat, Dec 19, 2009 at 8:34 AM, Colin Raven wrote:
>> If snapshots reside within the confines of the pool, are you saying that
>> dedup will also count what's contained inside the snapshots? I'm not sure
>> why, but that thought is vaguely disturbing on some level.
>
> Sure, why not? Let's say you have snapshots enabled on a dataset with 1TB
> of files in it, and then decide to move 500GB to a new dataset for other
> sharing options, or what have you. If dedup didn't count the snapshots
> you'd wind up with 500GB in your original live dataset, an additional 500GB
> in the snapshots, and an additional 500GB in the new dataset. For instance,
> tank/export/samba/backups used to be a directory in
> tank/export/samba/public. Snapshots being used in dedup saved me 700+GB.
>
> tank/export/samba/backups   704G  3.35T   704G  /export/samba/backups
> tank/export/samba/public    816G  3.35T   101G  /export/samba/public

Architecturally, it is madness NOT to store (known) common data within the same logical concept, in this case, a pool. Snapshots need to be retained close to their original parent (as do clones, et al.), and the abstract concept that holds them in ZFS is the pool. Frankly, I'd have a hard time thinking up another structure (abstract or concrete) where it would make sense to store such an item (i.e. snapshots).

Remember that a snapshot is A POINT-IN-TIME PICTURE of the filesystem/volume. No more, no less. As such, it makes logical sense to retain them "close" to their originator. People tend to attach all sorts of other inferences about what snapshots "mean", which is incorrect, both from a conceptual standpoint (a rose is a rose, not a pig, just because you want to call it a pig) and at an implementation level.

As for exactly what is meant by "counting" something inside a snapshot: remember, a snapshot is already a form of dedup - that is, it is nothing more than a list of block pointers to blocks which existed at the time the snapshot was taken.
I'll have to check, but since I believe that the dedup metric counts blocks which have more than one reference to them, having a snapshot currently DOES influence the dedup count. I'm not in front of a sufficiently late-version install to check this; please, would someone check whether taking a snapshot does or does not influence the dedup metric? (It's a simple test: create a pool with 1 dataset, turn on dedup, then copy X amount of data to that dataset. Check the dedup ratio. Then take a snapshot of the dataset, and re-check the dedup ratio.) Conceptually speaking, it would be nice to exclude snapshots when computing the dedup ratio; implementation-wise, I'm not sure how the ratio is really computed, so I can't say whether that is simple or impossible.

>> in fact handy. Hourly...ummm, maybe the same - but Daily/Monthly should
>> reside "elsewhere".
>
> That's what replication to another system via send/recv is for. See
> backups, DR.

Once again, these are concepts that have no bearing on what a snapshot /IS/. What one wants to /do/ with a snapshot is up to the user, but that's not a decision to be made at the architecture level. That's a decision for further up the application abstraction stack.

>> Y'know, that is a GREAT point. Taking this one step further then - does
>> that also imply that there's one "hot spot" physically on a disk that
>> keeps getting read/written to? If so then your point has even greater
>> merit for more reasons...disk wear for starters, and other stuff too, no
>> doubt.
>
> I believe I read that there is a max ref count for blocks, and beyond that
> the data is written out once again. This is for resilience and to avoid hot
> spots.
> -B

Various ZFS metadata blocks are far more "hot" than anything associated with dedup. Brandon is correct in that ZFS will tend to re-write such frequently-WRITTEN blocks (whether meta or real data) after a certain point.
In the dedup case, this is irrelevant, since dedup is READ-only (if you change the block, by definition, it is no longer a dedup of its former "mates"). If anything, dedup blocks are /far/ more likely to end up in the L2ARC (read cache) than a typical block, everything else being equal.

Now, if we can get a defrag utility/feature implemented (possibly after the BP rewrite stuff is committed), it would make sense to put frequently ACCESSED blocks at the highest-performing portions of the underlying media. This of course means that such a utility would have to be informed of the characteristics of the underlying media (SSD, hard drive, RAM disk, etc.) and understand the limitations of each; case in point: for HDs, the highest-performing location is the outer sectors, while for MLC SSDs it is the "least used" ones, and it's irrelevant for solid-state (NVRAM) drives. Honestly, now that I've considered it, I'm thinking that it's not worth any real effort to do this kind of optimization.
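The snapshot/dedup-ratio test described above can be run against a file-backed pool, so no spare disk is needed. This is only a sketch: the pool name, backing file, and source data path are made up, and it assumes a build recent enough to have the dedup property:

```
# Create a throwaway file-backed pool with a dedup-enabled dataset.
mkfile 256m /var/tmp/ddtest.img
zpool create ddpool /var/tmp/ddtest.img
zfs create -o dedup=on ddpool/ds

# Copy some data in and note the ratio.
cp -r /usr/share/man /ddpool/ds/
zpool get dedupratio ddpool

# Take a snapshot and see whether the ratio moves.
zfs snapshot ddpool/ds@t1
zpool get dedupratio ddpool

# Clean up.
zpool destroy ddpool
rm /var/tmp/ddtest.img
```

Comparing the two `dedupratio` readings answers the question directly: if the ratio changes after the snapshot alone, snapshots are being counted.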
[zfs-discuss] SSD strange performance problem, resilvering helps during operation
Hi,

We have a X4150 with a J4400 attached, configured with 2x 32GB SSDs in a mirror configuration (ZIL) and 12x 500GB SATA disks. We have been running this setup for over half a year now in production for NFS and iSCSI for a bunch of virtual machines (currently about 100 VMs, mostly Linux, some Windows).

Since last week we have performance problems, causing I/O wait in the VMs. Of course we did a big search for networking issues, hanging machines, firewall & traffic tests, but were unable to find any problems. So we had a look into the zpool and dropped one of the mirrored SSDs from the pool (we had some indication the ZIL was not working OK). No success. After adding the disk back, we discovered the I/O wait during the "resilvering" process was OK, or at least much better, again. So last night we did the same, dropped & added the same disk, and yes, again, the I/O wait looked better. This morning, the same story.

Because this machine is a production machine, we cannot tolerate too much experimentation. We now know this operation buys us about 4 to 6 hours (the time to resilver), but we didn't have the courage to detach/attach the other SSD yet. We will try only a "resilver", without detach/attach, tonight, to see what happens.

Can anybody explain how the detach/attach and resilver process works, and especially whether there is something different during the resilvering in the handling of the SSDs/slog disks?

Regards,

Mart

-- 
Greenhost - Duurzame Hosting
Derde Kostverlorenkade 35
1054 TS Amsterdam
T: 020 489 4349 F: 020 489 2306
KvK: 34187349
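For reference, the detach/attach cycle described above is, in outline, the following. The pool and device names are placeholders, not the poster's actual devices:

```
# Drop one side of the mirrored slog; the remaining SSD keeps serving the ZIL.
zpool detach tank c3t1d0

# Re-attach it to the surviving slog device, which triggers the resilver.
zpool attach tank c3t0d0 c3t1d0

# Watch resilver progress and pool health.
zpool status -v tank
```

Note that while one side is detached, the slog mirror is running without redundancy, so a failure of the remaining SSD during that window would be losing the only ZIL device.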
Re: [zfs-discuss] FW: ARC not using all available RAM?
On 21 December, 2009 - Tristan Ball sent me these 4,5K bytes:

> Richard Elling wrote:
> >
> > On Dec 20, 2009, at 12:25 PM, Tristan Ball wrote:
> >
> >> I've got an opensolaris snv_118 machine that does nothing except
> >> serve up NFS and ISCSI.
> >>
> >> The machine has 8G of ram, and I've got an 80G SSD as L2ARC.
> >> The ARC on this machine is currently sitting at around 2G, the kernel
> >> is using around 5G, and I've got about 1G free.
...
> What I'm trying to find out is is my ARC relatively small because...
>
> 1) ZFS has decided that that's all it needs (the workload is fairly
> random), and that adding more won't gain me anything.
> 2) The system is using so much ram for tracking the L2ARC, that the ARC
> is being shrunk (we've got an 8K record size)
> 3) There's some other memory pressure on the system that I'm not aware
> of that is periodically chewing up then freeing the ram.
> 4) There's some other memory management feature that's insisting on that
> 1G free.

My bet is on #4 ...
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#arc_reclaim_needed
See line 1956.

I tried some tuning on a pure NFS server (although s10u8) here, and got it to use a bit more of "the last 1GB" out of 8G. I think it was swapfs_minfree that I poked with a sharp stick. No idea if anything else that relies on it could break, but the machine has been fine for a few weeks here now and using more memory for ARC. ;)

/Tomas
-- 
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
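For the archives, a swapfs_minfree tweak of the kind mentioned above is normally made persistent via /etc/system. The value below is an example, not Tomas's actual setting; it is expressed in pages (so 16384 pages is 128 MB with an 8 KB page size, or 64 MB with 4 KB pages):

```
* /etc/system fragment -- lower the memory swapfs keeps in reserve,
* leaving more headroom for the ARC. Value is in pages; choose with
* care and test under load, since other subsystems rely on it.
set swapfs_minfree=16384
```

A reboot is required for /etc/system changes to take effect; the live-kernel poke via mdb -kw that Tomas alludes to takes effect immediately but does not survive a reboot.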
Re: [zfs-discuss] How do I determine dedupe effectiveness?
On Sat, Dec 19, 2009 at 8:34 AM, Colin Raven wrote:
> If snapshots reside within the confines of the pool, are you saying that
> dedup will also count what's contained inside the snapshots? I'm not sure
> why, but that thought is vaguely disturbing on some level.

Sure, why not? Let's say you have snapshots enabled on a dataset with 1TB of files in it, and then decide to move 500GB to a new dataset for other sharing options, or what have you. If dedup didn't count the snapshots you'd wind up with 500GB in your original live dataset, an additional 500GB in the snapshots, and an additional 500GB in the new dataset. For instance, tank/export/samba/backups used to be a directory in tank/export/samba/public. Snapshots being used in dedup saved me 700+GB.

tank/export/samba/backups   704G  3.35T   704G  /export/samba/backups
tank/export/samba/public    816G  3.35T   101G  /export/samba/public

> in fact handy. Hourly...ummm, maybe the same - but Daily/Monthly should
> reside "elsewhere".

That's what replication to another system via send/recv is for. See backups, DR.

> Y'know, that is a GREAT point. Taking this one step further then - does that
> also imply that there's one "hot spot" physically on a disk that keeps
> getting read/written to? if so then your point has even greater merit for
> more reasons...disk wear for starters, and other stuff too, no doubt.

I believe I read that there is a max ref count for blocks, and beyond that the data is written out once again. This is for resilience and to avoid hot spots.

-B
-- 
Brandon High : bh...@freaks.com
Indecision is the key to flexibility.