Re: [zfs-discuss] raidz data loss stories?

2009-12-21 Thread Roman Naumenko
> On Dec 21, 2009, at 4:09 PM, Michael Herf
>  wrote:
> 
> > Anyone who's lost data this way: were you doing
> weekly scrubs, or  
> > did you find out about the simultaneous failures
> after not touching  
> > the bits for months?
> 
> Scrubbing on a routine basis is good for detecting
> problems early, but  
> it doesn't solve the problem of a double failure
> during resilver. As  
> the size of disks become huge the chance of a double
> failure during  
> resilvering increases to the point of real
> possibility. Due to the  
> amount of data, the bit error rates of the medium and
> the prolonged  
> stress of resilvering these monsters.
> 
> For up to 1TB drives use nothing less than raidz2.
> For 1TB+ drives use  
> raidz3. Avoid raidz vdevs larger than 7 drives,
> better to have  
> multiple vdevs both for performance and reliability.
> 
> With 24 2.5" drive enclosures you can easily create 3
> 7 drive raidz3s  
> or 4 5 drive raidz2s with a spare for each vdev, or 2
> spares and 1-2  
> SSD drives. Both options give 12/24 usable disk
> space. 4 raidz2s give  
> more performance, 3 raidz3s gives more reliability.
> 
> -Ross
> 

Hi Ross,

What about good old RAID 10? It's a pretty reasonable choice for heavily loaded 
storage, isn't it?

I remember that when I migrated from raidz2 to an 8-drive RAID 10, the application 
administrators were really happy with the new access speed. (We didn't use 
striped raidz2 vdevs as you are suggesting, though.)
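
(In ZFS terms that pool was simply a stripe of two-way mirror vdevs, created 
roughly like this - the device names here are made up:

  zpool create tank mirror c0t0d0 c0t1d0 mirror c0t2d0 c0t3d0 \
        mirror c0t4d0 c0t5d0 mirror c0t6d0 c0t7d0

Eight drives, four mirror pairs, so half of the raw capacity is usable.)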

--
Roman
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raidz data loss stories?

2009-12-21 Thread Ross Walker

On Dec 21, 2009, at 4:09 PM, Michael Herf  wrote:

Anyone who's lost data this way: were you doing weekly scrubs, or  
did you find out about the simultaneous failures after not touching  
the bits for months?


Scrubbing on a routine basis is good for detecting problems early, but 
it doesn't solve the problem of a double failure during resilver. As 
disks become huge, the chance of a double failure during resilvering 
increases to the point of real possibility, due to the amount of data, 
the bit error rates of the medium, and the prolonged stress of 
resilvering these monsters.
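
(By routine scrubbing I mean nothing fancier than a root crontab entry along 
these lines - the pool name is only an example:

  # scrub the pool "tank" every Sunday at 03:00
  0 3 * * 0 /usr/sbin/zpool scrub tank
)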


For drives up to 1TB, use nothing less than raidz2. For 1TB+ drives, use 
raidz3. Avoid raidz vdevs larger than 7 drives; it is better to have 
multiple vdevs, both for performance and reliability.


With 24 2.5" drive enclosures you can easily create 3 7 drive raidz3s  
or 4 5 drive raidz2s with a spare for each vdev, or 2 spares and 1-2  
SSD drives. Both options give 12/24 usable disk space. 4 raidz2s give  
more performance, 3 raidz3s gives more reliability.
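
For example (device names are made up, and raidz3 needs a reasonably recent 
build), the raidz3 layout with a spare per vdev would be created roughly like 
this; the raidz2 layout is the same idea with four 5-disk raidz2 vdevs:

  zpool create tank \
    raidz3 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 \
    raidz3 c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0 c1t12d0 c1t13d0 \
    raidz3 c1t14d0 c1t15d0 c1t16d0 c1t17d0 c1t18d0 c1t19d0 c1t20d0 \
    spare c1t21d0 c1t22d0 c1t23d0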


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raidz data loss stories?

2009-12-21 Thread Adam Leventhal
Hey James,

> Personally, I think mirroring is safer (and 3 way mirroring) than raidz/z2/5. 
>  All my "boot from zfs" systems have 3 way mirrors root/usr/var disks (using 
> 9 disks) but all my data partitions are 2 way mirrors (usually 8 disks or 
> more and a spare.)

Double-parity (or triple-parity) RAID is certainly more resilient against some 
failure modes than 2-way mirroring. For example, bit errors arise at a 
certain rate from disks, so in the case of a disk failure in a mirror it's 
possible to encounter a bit error such that data is lost.

I recently wrote an article for ACM Queue that examines recent trends in hard 
drives and makes the case for triple-parity RAID. It's at least peripherally 
relevant to this conversation:

  http://blogs.sun.com/ahl/entry/acm_triple_parity_raid

Adam

--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool FAULTED after power outage

2009-12-21 Thread JD Trout
I was able to recover! Thank you both for replying, and thank you Victor for the 
step-by-step.

I downloaded dev-129 from the site and booted off of it.  I first ran:

zpool import -nfF -R /mnt rpool

and the command output said that I could get back to the state from when the box 
rebooted itself. Therefore, I ran: 

zpool import -fF -R /mnt rpool

and everything was good.

Thanks again!
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] CIFS Strange Problem

2009-12-21 Thread Richard Elling

Sassy,
this is the zfs-discuss forum.  You might have better luck asking at the
cifs-discuss forum.
http://mail.opensolaris.org/mailman/listinfo/cifs-discuss
 -- richard

On Dec 21, 2009, at 2:36 PM, Sassy Natan wrote:


Hi Group

I have install the latest version of OpenSolairs (version 129) on my  
machine.
I have configure the DNS, Kerberos, PAM and LDAP client to use my  
Windows 2003R2 domain.


My Windows Domain Include the RFC2307 Posix account, so each user  
has UID, GID configure.
This was very east to configure, and I manage to get all my users  
from the Windows Domain to logon to the opensoalris machine.
When running "getent passwd username"  I getting off course the id  
and group id from the AD server.


So now I wanted to use the CIFS server. So I install the services  
and started them, add the machine to the domain and configure a ZFS  
share.


Now I only add to create rule using the idmap so users from the  
windows will be mapped to the unix account.


But this seems not to work. when checking the mapping I get error:  
see below


#id rona
uid=10005(rona) gid=1(Domain Users) groups=1(Domain Users)

#getent passwd rona
rona:x:1:1:rona:/home/rona:/bin/sh

#idmap show -cv rona@
winname:rona@ -> uid:60001
Error:  Not found

#idmap show -cv r...@domain.local
winname:r...@domain.local -> uid:60001
Error:  Not found

I run the cifs-gendiag and didn't saw any problems
Any idea?


thanks
Sassy
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool FAULTED after power outage

2009-12-21 Thread Victor Latushkin

JD Trout wrote:

Hello,
I am running OpenSol 2009.06 and after a power outage opsensol will no longer 
boot past GRUB.  Booting from the liveCD shows me the following:

r...@opensolaris:~# zpool import -f rpool 
cannot import 'rpool': I/O error


r...@opensolaris:~# zpool import -f
  pool: rpool
id: 15378657248391821369
 state: FAULTED
status: The pool was last accessed by another system.
action: The pool cannot be imported due to damaged devices or data.
The pool may be active on another system, but can be imported using
the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

rpool   FAULTED  corrupted data
  c7d0s0ONLINE


Is all hope lost?


No. Try to get a LiveCD based on build 128 or later (e.g. from 
www.genunix.org), boot off it, and try to import your rpool this way:


zpool import -nfF -R /mnt rpool

If it reports that it can get back to a good pool state, then do the actual 
import with


zpool import -fF -R /mnt rpool

If the first command cannot rewind to an older state, try adding the -X option:

zpool import -nfFX -R /mnt rpool

and if it says that it can recover your pool with some data loss and you 
are OK with that, then do the actual import


zpool import -fFX -R /mnt rpool

regards,
victor
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool FAULTED after power outage

2009-12-21 Thread Tim Cook
On Mon, Dec 21, 2009 at 6:50 PM, JD Trout  wrote:

> Hello,
> I am running OpenSol 2009.06 and after a power outage opsensol will no
> longer boot past GRUB.  Booting from the liveCD shows me the following:
>
> r...@opensolaris:~# zpool import -f rpool
> cannot import 'rpool': I/O error
>
> r...@opensolaris:~# zpool import -f
>  pool: rpool
>id: 15378657248391821369
>  state: FAULTED
> status: The pool was last accessed by another system.
> action: The pool cannot be imported due to damaged devices or data.
>The pool may be active on another system, but can be imported using
>the '-f' flag.
>   see: http://www.sun.com/msg/ZFS-8000-EY
> config:
>
>rpool   FAULTED  corrupted data
>  c7d0s0ONLINE
>
>
> Is all hope lost?
>
>

No, but you'll need to use a newer version of opensolaris to recover it
automagically.
http://www.c0t0d0s0.org/archives/6067-PSARC-2009479-zpool-recovery-support.html

-- 
--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zpool FAULTED after power outage

2009-12-21 Thread JD Trout
Hello,
I am running OpenSol 2009.06 and after a power outage OpenSolaris will no longer 
boot past GRUB.  Booting from the liveCD shows me the following:

r...@opensolaris:~# zpool import -f rpool 
cannot import 'rpool': I/O error

r...@opensolaris:~# zpool import -f
  pool: rpool
id: 15378657248391821369
 state: FAULTED
status: The pool was last accessed by another system.
action: The pool cannot be imported due to damaged devices or data.
The pool may be active on another system, but can be imported using
the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

rpool   FAULTED  corrupted data
  c7d0s0ONLINE


Is all hope lost?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS pool unusable after attempting to destroy a dataset with dedup enabled

2009-12-21 Thread Jack Kielsmeier
I don't mean to sound ungrateful (because I really do appreciate all the help I 
have received here), but I am really missing the use of my server.

Over Christmas, I want to be able to use my laptop (right now, it's acting as a 
server for some of the things my OpenSolaris server did). This means I will 
need to get my server back up and running in full working order by then.

All the data that I lost is unimportant data, so I'm not really missing 
anything there.

Again, I do appreciate all the help, but I'm going to "give up" if no solution 
can be found in the next couple of days. This is simply because I want to be 
able to use my hardware.

What I plan on doing is simply formatting each disk that was part of the bad 
pool and creating a new one.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] DeDup and Compression - Reverse Order?

2009-12-21 Thread Darren J Moffat

Kjetil Torgrim Homme wrote:


Note also that the compress/encrypt/checksum and the dedup are
separate pipeline stages so while dedup is happening for block N block
N+1 can be getting transformed - so this is designed to take advantage
of multiple scheduling units (threads,cpus,cores etc).


nice.  are all of them separate stages, or are compress/encrypt/checksum
done as one stage?


Originally compress, encrypt, and checksum were all separate stages in the 
zio pipeline; they are now all one stage, ZIO_WRITE_BP_INIT for the write 
case and ZIO_READ_BP_INIT for the read case.




Also if you place a block in an unencrypted dataset that happens to
match the ciphertext in an encrypted dataset they won't dedup either
(you need to understand what I've done with the AES CCM/GCM MAC and
the zio_chksum_t field in the blkptr_t and how that is used by dedup
to see why).


wow, I didn't think of that problem.  did you get bitten by wrongful
dedup during testing with image files? :-)


No, I didn't see the problem in reality; I just thought about it as a 
possible risk that needed to be addressed.


Solving it didn't actually require me to do any additional work, because 
ZFS uses a separate table for each checksum algorithm anyway and the 
checksum algorithm for encrypted datasets is listed as sha256+mac, not 
sha256.  It was nice that I didn't have to write more code to solve the 
problem, but it might not have been that way.


--
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] DeDup and Compression - Reverse Order?

2009-12-21 Thread Darren J Moffat

Daniel Carosone wrote:

Your parenthetical comments here raise some concerns, or at least eyebrows, 
with me.  Hopefully you can lower them again.


compress, encrypt, checksum, dedup.




(and you need to use zdb to get enough info to see the
leak - and that means you have access to the raw devices)


An attacker with access to the raw devices is the primary base threat model for 
on-disk encryption, surely?

An attacker with access to disk traffic, via e.g. iSCSI, who can also deploy 
dynamic traffic analysis in addition to static content analysis, and who also 
has similarly greater opportunities for tampering, is another trickier threat 
model.

It seems like entirely wrong thinking (even in parentheses) to dismiss an issue 
as irrelevant because it only applies in the primary threat model.


I wasn't dismissing it; I was pointing out that this isn't something an 
unprivileged end user could easily do.


If the risk is unacceptable then dedup shouldn't be enabled.  For some 
uses cases the risk is acceptable and for those use cases we want to 
allow the use of dedup with encryption.


(and the way I have implemented the IV generation for AES CCM/GCM mode 
ensures that the same plaintext will have the same IV so the ciphertexts 
will match).


Again, this seems like a cause for concern.  Have you effectively turned these fancy and carefully designed crypto modes back into ECB, albeit at a larger block size (and only within a dataset)?  


No I don't believe I have.  The IV generation when doing deduplication 
is done by calculating an HMAC of the plaintext using a separate per 
dataset key (that is also refreshed if 'zfs key -K' is run to rekey the 
dataset).



Let's consider copy-on-write semantics: with the above issue an attacker can 
tell which blocks of a file have changed over time, even if unchanged blocks 
have been rewritten.. giving even the static image attacker some traffic 
analysis capability.


So if that is part of your deployment risk model deduplication is not 
worth enabling in that case.



This would be a problem regardless of dedup, for the scenario where the 
attacker can see repeated ciphertext on disk (unless the dedup metadata itself 
is sufficiently encrypted, which I understand it is not).


In the case where deduplication is not enabled, the IV generation uses a 
combination of the txg number, the object, and the blockid, which complies 
with the recommendations for IV generation for both CCM and GCM.


(you need to understand 
what I've done with the AES CCM/GCM MAC


I'd like to, but more to understand what (if any) protection is given against 
replay attacks (above that already provided by the merkle hash tree).


What do you mean by a replay attack ?  What is being replayed and by whom ?

--
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] CIFS Strange Problem

2009-12-21 Thread Sassy Natan
Hi Group

I have installed the latest version of OpenSolaris (build 129) on my machine.
I have configured the DNS, Kerberos, PAM and LDAP clients to use my Windows 
2003R2 domain.

My Windows domain includes the RFC2307 POSIX attributes, so each user has a UID 
and GID configured.
This was very easy to configure, and I managed to get all my users from the 
Windows domain to log on to the OpenSolaris machine.
When running "getent passwd username" I get, of course, the ID and group ID 
from the AD server.

So now I wanted to use the CIFS server. I installed the services and started 
them, added the machine to the domain, and configured a ZFS share.

Now I only had to create a rule using idmap so that users from Windows would be 
mapped to the Unix accounts.
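
(The rule was added with something like this, with our real domain name in 
place of domain.local:

  idmap add "winuser:rona@domain.local" "unixuser:rona"
)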

But this does not seem to work. When checking the mapping I get an error; see below:

#id rona
uid=10005(rona) gid=1(Domain Users) groups=1(Domain Users)

#getent passwd rona
rona:x:1:1:rona:/home/rona:/bin/sh

#idmap show -cv rona@
winname:rona@ -> uid:60001
Error:  Not found

#idmap show -cv r...@domain.local
winname:r...@domain.local -> uid:60001
Error:  Not found

I ran cifs-gendiag and didn't see any problems.
Any ideas?


thanks
Sassy
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raidz data loss stories?

2009-12-21 Thread Michael Herf
Anyone who's lost data this way: were you doing weekly scrubs, or did you
find out about the simultaneous failures after not touching the bits for
months?

mike
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] directory size on compressed file system on Solaris 10

2009-12-21 Thread Gaëtan Lehmann


Le 21 déc. 09 à 20:23, Joerg Schilling a écrit :


Matthew Ahrens  wrote:


Gaëtan Lehmann wrote:


Hi,

On opensolaris, I use du with the -b option to get the  
uncompressed size

of a directory):

 r...@opensolaris:~# du -sh /usr/local/
 399M/usr/local/
 r...@opensolaris:~# du -sbh /usr/local/
 915M/usr/local/
 r...@opensolaris:~# zfs list -o space,refer,ratio,compress data/ 
local
 NAMEAVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV   
USEDCHILD

REFER  RATIO  COMPRESS
 data/local   228G   643M  249M394M   
0  0

394M  2.51xon
 r...@opensolaris:~# which du
 /usr/gnu/bin/du

but on Solaris 10, there is no such option.

So what is the best way to get the uncompressed size of a  
directory on

Solaris 10?


Install GNU du on solaris 10?  Although the answer will be just as
(in)accurate as GNU du on solaris 10.  Note that it reports the  
compression


How about:

find . -type f -ls | awk '{ sum += $7} END {print sum}'



sounds good

  r...@opensolaris:~# find /usr/local/ -ls | awk '{sum += $7} END  
{print sum/1024**2}'

  914.039

but maybe a little longer to write

  r...@opensolaris:~# echo "du -sbh ." | wc -c
  10
  r...@opensolaris:~# echo "find . -ls | awk '{sum += $7} END {print  
sum/1024**2}'" | wc -c

  53

Thanks!

Gaëtan

--
Gaëtan Lehmann
Biologie du Développement et de la Reproduction
INRA de Jouy-en-Josas (France)
tel: +33 1 34 65 29 66fax: 01 34 65 29 09
http://voxel.jouy.inra.fr  http://www.itk.org
http://www.mandriva.org  http://www.bepo.fr



PGP.sig
Description: Ceci est une signature électronique PGP
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] directory size on compressed file system on Solaris 10

2009-12-21 Thread Gaëtan Lehmann


Le 21 déc. 09 à 19:28, Matthew Ahrens a écrit :


Gaëtan Lehmann wrote:

Hi,
On opensolaris, I use du with the -b option to get the uncompressed  
size of a directory):

 r...@opensolaris:~# du -sh /usr/local/
 399M/usr/local/
 r...@opensolaris:~# du -sbh /usr/local/
 915M/usr/local/
 r...@opensolaris:~# zfs list -o space,refer,ratio,compress data/ 
local
 NAMEAVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV   
USEDCHILD  REFER  RATIO  COMPRESS
 data/local   228G   643M  249M394M  0   
0   394M  2.51xon

 r...@opensolaris:~# which du
 /usr/gnu/bin/du
but on Solaris 10, there is no such option.
So what is the best way to get the uncompressed size of a directory  
on Solaris 10?


Install GNU du on solaris 10?


That's an option of course, but I'd prefer something that I can use 
without installing any extra program.


 Although the answer will be just as (in)accurate as GNU du on  
solaris 10.  Note that it reports the compression ratio as 915/399 =  
2.29x, actual is 2.51x.  This could be due to sparse files, or  
metadata like directories, whose "apparent size" (st_size) is not  
what GNU du expects.


At least it gives a not so bad estimation :-)

And the compression ratio includes the data in the snapshots, so it  
may be inaccurate in that case also.

The actual compression ratio, given on the last snapshot, is 2.41x.

Gaëtan

--
Gaëtan Lehmann
Biologie du Développement et de la Reproduction
INRA de Jouy-en-Josas (France)
tel: +33 1 34 65 29 66fax: 01 34 65 29 09
http://voxel.jouy.inra.fr  http://www.itk.org
http://www.mandriva.org  http://www.bepo.fr



PGP.sig
Description: Ceci est une signature électronique PGP
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] directory size on compressed file system on Solaris 10

2009-12-21 Thread Joerg Schilling
Matthew Ahrens  wrote:

> Gaëtan Lehmann wrote:
> > 
> > Hi,
> > 
> > On opensolaris, I use du with the -b option to get the uncompressed size 
> > of a directory):
> > 
> >   r...@opensolaris:~# du -sh /usr/local/
> >   399M/usr/local/
> >   r...@opensolaris:~# du -sbh /usr/local/
> >   915M/usr/local/
> >   r...@opensolaris:~# zfs list -o space,refer,ratio,compress data/local
> >   NAMEAVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD  
> > REFER  RATIO  COMPRESS
> >   data/local   228G   643M  249M394M  0  0   
> > 394M  2.51xon
> >   r...@opensolaris:~# which du
> >   /usr/gnu/bin/du
> > 
> > but on Solaris 10, there is no such option.
> > 
> > So what is the best way to get the uncompressed size of a directory on 
> > Solaris 10?
>
> Install GNU du on solaris 10?  Although the answer will be just as 
> (in)accurate as GNU du on solaris 10.  Note that it reports the compression 

How about:

find . -type f -ls | awk '{ sum += $7} END {print sum}'

Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Troubleshooting dedup performance

2009-12-21 Thread Chris Murray
In case the overhead of calculating SHA256 was the cause, I set ZFS
checksums to SHA256 at the pool level and left it for a number of days.
This worked fine.

Setting dedup=on immediately crippled performance, and then setting
dedup=off fixed things again. I did notice through a zpool iostat that
disk IO increased while dedup was on, although it didn't from the ESXi
side. Could it be that dedup tables don't fit in memory? I don't have a
great deal - 3GB. Is there a measure of how large the tables are in
bytes, rather than number of entries?
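
(If I understand correctly, something like the following should print the DDT 
entry counts together with their approximate on-disk and in-core sizes, 
assuming the -D/-DD options exist on this build - but I'd appreciate 
confirmation:

  zdb -DD zp
)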

Chris

-Original Message-
From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Chris Murray
Sent: 16 December 2009 17:19
To: Cyril Plisko; Andrey Kuzmin
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Troubleshooting dedup performance

So if the ZFS checksum is set to fletcher4 at the pool level, and
dedup=on, which checksum will it be using?

If I attempt to set dedup=fletcher4, I do indeed get this:

cannot set property for 'zp': 'dedup' must be one of 'on | off | verify
| sha256[,verify]'

Could it be that my performance troubles are due to the calculation of
two different checksums?

Thanks,
Chris

-Original Message-
From: cyril.pli...@gmail.com [mailto:cyril.pli...@gmail.com] On Behalf
Of Cyril Plisko
Sent: 16 December 2009 17:09
To: Andrey Kuzmin
Cc: Chris Murray; zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Troubleshooting dedup performance

>> I've set dedup to what I believe are the least resource-intensive
>> settings - "checksum=fletcher4" on the pool, & "dedup=on" rather than
>
> I believe checksum=fletcher4 is acceptable in dedup=verify mode only.
> What you're doing is seemingly deduplication with weak checksum w/o
> verification.

I think fletcher4 use for the deduplication purposes was disabled [1]
at all, right before build 129 cut.


[1]
http://hg.genunix.org/onnv-gate.hg/diff/93c7076216f6/usr/src/common/zfs/
zfs_prop.c


-- 
Regards,
Cyril

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] directory size on compressed file system on Solaris 10

2009-12-21 Thread Matthew Ahrens

Gaëtan Lehmann wrote:


Hi,

On opensolaris, I use du with the -b option to get the uncompressed size 
of a directory):


  r...@opensolaris:~# du -sh /usr/local/
  399M/usr/local/
  r...@opensolaris:~# du -sbh /usr/local/
  915M/usr/local/
  r...@opensolaris:~# zfs list -o space,refer,ratio,compress data/local
  NAMEAVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD  
REFER  RATIO  COMPRESS
  data/local   228G   643M  249M394M  0  0   
394M  2.51xon

  r...@opensolaris:~# which du
  /usr/gnu/bin/du

but on Solaris 10, there is no such option.

So what is the best way to get the uncompressed size of a directory on 
Solaris 10?


Install GNU du on solaris 10?  Although the answer will be just as 
(in)accurate as GNU du on solaris 10.  Note that it reports the compression 
ratio as 915/399 = 2.29x, actual is 2.51x.  This could be due to sparse 
files, or metadata like directories, whose "apparent size" (st_size) is not 
what GNU du expects.


Took me a minute to realize you were talking about the space used under a 
subdirectory, not the space consumed by the directory itself!  I guess I'm 
the only one creating 400MB directories :-)


--matt
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Mirror config and installgrub errors

2009-12-21 Thread Alexander
I've just bought a second drive for my home PC and decided to make a mirror. I 
ran

 pfexec zpool attach rpool c9d0s0 c13d0s0

waited for the scrub to finish and tried to install GRUB on the second disk:
$ pfexec installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c13d0s0
cannot open/stat device /dev/rdsk/c13d0s2
$ pfexec installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c13d0
raw device must be a root slice (not s2)

What am I doing wrong?  I believe the new device is SMI-labeled (I formatted it 
and set an SMI label); however, I don't know how to check that...

$ pfexec prtvtoc   /dev/rdsk/c13d0s0 
* /dev/rdsk/c13d0s0 partition map
*
* Dimensions:
* 512 bytes/sector
*  63 sectors/track
* 255 tracks/cylinder
*   16065 sectors/cylinder
*   60799 cylinders
*   60797 accessible cylinders
*
* Flags:
*   1: unmountable
*  10: read-only
*
* Unallocated space:
*                          First     Sector     Last
*                          Sector     Count     Sector
*                               0      48195      48194
*
*                          First     Sector     Last
* Partition  Tag  Flags    Sector     Count      Sector  Mount Directory
       0      2    00       48195  976655610  976703804
       8      1    01           0      16065      16064
       9      9    00       16065      32130      48194
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS dedup memory usage for DDT

2009-12-21 Thread Thomas Nau
Dear all.
We use an "old" 48TB 4500 aka Thumper as iSCSI server based on snv_129.
As the machine has only 16GB of RAM we are wondering if it's sufficient
for holding the bigger part of the DDT in memory without affecting
performance by limiting the ARC. Any hints about scaling memory vs. disk
space or the like
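
My own back-of-the-envelope attempt, assuming the often-quoted planning figure
of very roughly 250-320 bytes of core memory per DDT entry (an estimate, not a
measured number) and the default 128K recordsize:

  48 TB / 128 KB blocks       ~= 370 million unique blocks
  370 million * ~300 bytes    ~= on the order of 100 GB of DDT

so if the data were mostly unique, 16GB could only ever cache a fraction of it,
and smaller block sizes (e.g. 8K zvols for iSCSI) would make it much worse.
Does that reasoning sound about right?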

Thanks ahead
Thomas
-- 
-
GPG fingerprint: B1 EE D2 39 2C 82 26 DA  A5 4D E0 50 35 75 9E ED
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] EON ZFS Storage 0.59.9 based on snv 129, Deduplication release!

2009-12-21 Thread Andre Lue
Embedded Operating system/Networking (EON), RAM based live ZFS NAS appliance is 
released on Genunix! This is the first EON release with inline Deduplication 
features! Many thanks to Genunix.org for download hosting and serving the 
opensolaris community.

EON Deduplication ZFS storage is available in 32 and 64-bit, CIFS and Samba 
versions:
tryitEON 64-bit x86 CIFS ISO image version 0.59.9 based on snv_129

* eon-0.599-129-64-cifs.iso
* MD5: 8e917a14dbf0c793ad2958bdf8feb24a
* Size: ~93Mb
* Released: Monday 21-December-2009

tryitEON 64-bit x86 Samba ISO image version 0.59.9 based on snv_129

* eon-0.599-129-64-smb.iso
* MD5: 2c38a93036e4367e5cdf8a74605fcbaf
* Size: ~107Mb
* Released: Monday 21-December-2009

tryitEON 32-bit x86 CIFS ISO image version 0.59.9 based on snv_129

* eon-0.599-129-32-cifs.iso
* MD5: 0dcdd754b937f1d6515eba34b6ed2607
* Size: ~59Mb
* Released: Monday 21-December-2009

tryitEON 32-bit x86 Samba ISO image version 0.59.9 based on snv_129

* eon-0.599-129-32-smb.iso
* MD5: c24008516eb4584a64d9239015559ba4
* Size: ~73Mb
* Released: Monday 21-December-2009

tryitEON 64-bit x86 CIFS ISO image version 0.59.9 based on snv_129 (NO HTTPD)

* eon-0.599-129-64-cifs-min.iso
* MD5: 78b0bb116c0e32a48c473ce1b94e604f
* Size: ~87Mb
* Released: Monday 21-December-2009

tryitEON 64-bit x86 Samba ISO image version 0.59.9 based on snv_129 (NO HTTPD)

* eon-0.599-129-64-smb-min.iso
* MD5: 57d93eba9286c4bcc4c00c0154c684de
* Size: ~101Mb
* Released: Monday 21-December-2009

New/Changes/Fixes:
- Deduplication, Deduplication, Deduplication. (Only 1x the storage space was 
used)
- The hotplug errors at boot are being worked on. They are safe to ignore.
- Cleaned up minor entries in /mnt/eon0/.exec. Added "rsync --daemon" to start 
by default.
- EON rebooting at grub(since snv_122) in ESXi, Fusion and various versions of 
VMware workstation. This is related to bug 6820576. Workaround, at grub press e 
and add on the end of the kernel line "-B disable-pcieb=true" 

http://eonstorage.blogspot.com
http://sites.google.com/site/eonstorage/
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raidz data loss stories?

2009-12-21 Thread Scott Meilicke
Yes, a coworker lost a second disk during a rebuild of a raid5 and lost all 
data. I have not had a failure, however when migrating EqualLogic arrays in and 
out of pools, I lost a disk on an array. No data loss, but it concerns me 
because during the moves, you are essentially reading and writing all of the 
data on the disk. Did I have a latent problem on that particular disk that only 
exposed itself when doing such a large read/write? What if another disk had 
failed, and during the rebuild this latent problem was exposed? Trouble, 
trouble.

They say security is an onion. So is data protection.

Scott
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Emulex HBA fails periodically : ZFS, QFS, or the combination??

2009-12-21 Thread Len Zaifman
The question: is there an issue running ZFS and QFS on the same file server?

The details:

We have a 2540 RAID controller with 4 raidsets. Each raidset presents 2 slices 
to the OS. One slice (slice 0) from each raidset is a separate QFS filesystem, 
shared among 7 servers running QFS 4.6 patch 6.
One of the above servers has created one ZFS pool using the other 4 slices 
(slice 1 from each of the 4 raidsets):
pool: zfs_hpf
 state: ONLINE
 scrub: none requested
config:

NAME STATE READ WRITE CKSUM
zfs_hpf  ONLINE   0 0 0
  c6t600A0B8000495A51081F492C644Dd0  ONLINE   0 0 0
  c6t600A0B8000495B1C053148B41F54d0  ONLINE   0 0 0
  c6t600A0B8000495B1C053248B42036d0  ONLINE   0 0 0
  c6t600A0B8000495B1C05B948CA87A2d0  ONLINE   0 0 0
zpool upgrade
This system is currently running ZFS pool version 15.
zfs upgrade
This system is currently running ZFS filesystem version 4.

What we have been seeing is that the Emulex HBAs attaching to the 2540 through 
a SAN switch have been dying periodically, particularly under load.


luxadm -e port
/devices/p...@1,0/pci1022,7...@2/pci10df,f...@1/f...@0,0:devctl   NOT 
CONNECTED
/devices/p...@1,0/pci1022,7...@2/pci10df,f...@1,1/f...@0,0:devctl NOT 
CONNECTED

We need to reboot to get the hba to reconnect.

We have never seen this on the other file servers, and we are wondering whether 
the HBA is faulty, or whether there is an issue running ZFS and QFS on the same 
file server.

Has anyone seen this?

As an FYI we are running

cat /etc/release
   Solaris 10 10/09 s10x_u8wos_08a X86


Len Zaifman
Systems Manager, High Performance Systems
The Centre for Computational Biology
The Hospital for Sick Children
555 University Ave.
Toronto, Ont M5G 1X8

tel: 416-813-5513
email: leona...@sickkids.ca

This e-mail may contain confidential, personal and/or health 
information(information which may be subject to legal restrictions on use, 
retention and/or disclosure) for the sole use of the intended recipient. Any 
review or distribution by anyone other than the person for whom it was 
originally intended is strictly prohibited. If you have received this e-mail in 
error, please contact the sender and delete all copies.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raidz data loss stories?

2009-12-21 Thread James Risner
If you are asking whether anyone has experienced two drive failures 
simultaneously, the answer is yes.

It has happened to me (at home) and to one client, at least that I can 
remember.  In both cases, I was able to dd off one of the failed disks (with 
just bad sectors or less bad sectors) and reconstruct the raid 5 (force it 
online) to then copy data off the raid onto new drives.

Personally, I think mirroring is safer (and 3 way mirroring) than raidz/z2/5.  
All my "boot from zfs" systems have 3 way mirrors root/usr/var disks (using 9 
disks) but all my data partitions are 2 way mirrors (usually 8 disks or more 
and a spare.)
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] directory size on compressed file system on Solaris 10

2009-12-21 Thread Gaëtan Lehmann


Hi,

On opensolaris, I use du with the -b option to get the uncompressed  
size of a directory):


  r...@opensolaris:~# du -sh /usr/local/
  399M/usr/local/
  r...@opensolaris:~# du -sbh /usr/local/
  915M/usr/local/
  r...@opensolaris:~# zfs list -o space,refer,ratio,compress data/local
  NAME        AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD  REFER  RATIO  COMPRESS
  data/local   228G   643M      249M    394M              0          0   394M  2.51x        on

  r...@opensolaris:~# which du
  /usr/gnu/bin/du

but on Solaris 10, there is no such option.

So what is the best way to get the uncompressed size of a directory on  
Solaris 10?


Regards,

Gaëtan

--
Gaëtan Lehmann
Biologie du Développement et de la Reproduction
INRA de Jouy-en-Josas (France)
tel: +33 1 34 65 29 66fax: 01 34 65 29 09
http://voxel.jouy.inra.fr  http://www.itk.org
http://www.mandriva.org  http://www.bepo.fr



PGP.sig
Description: Ceci est une signature électronique PGP
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ARC not using all available RAM?

2009-12-21 Thread Bob Friesenhahn

On Mon, 21 Dec 2009, Tristan Ball wrote:


Yes, primarily since if there is no more memory immediately available, 
performance when starting new processes would suck.  You need to reserve 
some working space for processes and short term requirements.


Why is that a given? There are several systems that steal from cache under 
memory pressure. Earlier versions of solaris  that I've dealt with a little 
managed with quite a bit less that 1G free. On this system, "lotsfree" is 
sitting at 127mb, which seems reasonable, and isn't it "lotsfree" and the 
related variables and page-reclaim logic that maintain that pool of free 
memory for new allocations?


It ain't necessarily so but any time you need to run "reclaim" logic, 
there is CPU time expended and the CPU caches tend to get thrashed. 
Without constraints, the cache would expand to the total amount of 
file data encountered.  It is much better to avoid any thrashing.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD strange performance problem, resilvering helps during operation

2009-12-21 Thread Erik Trimble

Mart van Santen wrote:

Hi,





Do the I/O problems go away when only one of the SSDs is attached?
No, the problem stays with only one SSD. The problem is only less when 
resilvering, but not totally disappeared (maybe because of the 
resilver overhead).

The resilver is likely masking some underlying problem.  :-(






Frankly, I'm betting that your SSDs are wearing out.   Resilvering 
will essentially be one big streaming write, which is optimal for 
SSDs (even an SLC-based SSD, as you likely have, performs far better 
when writing large amounts of data at once).  NFS (and to a lesser 
extent iSCSI) is generally a whole lot of random small writes, which 
are hard on an SSD (especially MLC-based ones, but even SLC ones).   
The resilvering process is likely turning many of the random writes 
coming in to the system into a large streaming write to the 
/resilvering/ drive.
Hmm, interesting theory. Next I well execute only a resilver to see if 
the same happens. I assume when adding a new disk, even though it's 
only a slog disk, the whole tank will resilver? If I look to the zpool 
iostat currently I see a lot of reads on the separate SATA disks (not 
on the tank/or raidz2 pools), assuming resilvering takes place there 
and the SSD's are already synced.


I'm not 100% sure, but replacing a device in a mirrored ZIL should only 
generate I/O on the other ZIL device, not on the main pool devices.




SSDs are not hard drives. Even high-quality modern ones have 
/significantly/ lower USE lifespans than an HD - that is, a 
heavily-used SSD will die well before a HD, but a very-lightly used 
SSD will likely outlast a HD.  And, in the case of SSDs, writes are 
far harder on the SSD than reads are.





Is about half a year for these disk not really short? Sure, we have 
some I/O, but not that many write operations, about ~80-140 iops, 
anyway, I will try to get new disks from SUN (we have SLC disks from 
Sun). Is there any knowledge about the life time of SSD's? Maybe in 
terms of amount of I/O Operations?


Regards,

Mart van Santen


That's not enough time for that level of IOPS to wear out the SSDs 
(which are likely OEM Intel X25-E).  Something else is wrong.



--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD strange performance problem, resilvering helps during operation

2009-12-21 Thread Mart van Santen

Hi,





Do the I/O problems go away when only one of the SSDs is attached?
No, the problem stays with only one SSD. The problem is only less when 
resilvering, but not totally disappeared (maybe because of the resilver 
overhead).



Frankly, I'm betting that your SSDs are wearing out.   Resilvering 
will essentially be one big streaming write, which is optimal for SSDs 
(even an SLC-based SSD, as you likely have, performs far better when 
writing large amounts of data at once).  NFS (and to a lesser extent 
iSCSI) is generally a whole lot of random small writes, which are hard 
on an SSD (especially MLC-based ones, but even SLC ones).   The 
resilvering process is likely turning many of the random writes coming 
in to the system into a large streaming write to the /resilvering/ drive.
Hmm, interesting theory. Next I will execute only a resilver to see if 
the same happens. I assume that when adding a new disk, even though it's only 
a slog disk, the whole tank will resilver? If I look at the zpool iostat 
currently, I see a lot of reads on the separate SATA disks (not on the 
tank / raidz2 pools), so I assume resilvering takes place there and the 
SSDs are already synced.




My guess is that the SSD you are having problems with has reached the 
end of it's useful lifespan, and the I/O problems you are seeing 
during normal operation are the result of that SSD's problems with 
committing data.   There's no cure for this, other than replacing the 
SSD with a new one.





SSDs are not hard drives. Even high-quality modern ones have 
/significantly/ lower USE lifespans than an HD - that is, a 
heavily-used SSD will die well before a HD, but a very-lightly used 
SSD will likely outlast a HD.  And, in the case of SSDs, writes are 
far harder on the SSD than reads are.





Isn't about half a year for these disks really short? Sure, we have some 
I/O, but not that many write operations, about ~80-140 IOPS. Anyway, I 
will try to get new disks from Sun (we have SLC disks from Sun). Is 
there any knowledge about the lifetime of SSDs? Maybe in terms of the 
amount of I/O operations?


Regards,

Mart van Santen

--
Greenhost - Duurzame Hosting
Derde Kostverlorenkade 35
1054 TS Amsterdam
T: 020 489 4349
F: 020 489 2306
KvK: 34187349

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD strange performance problem, resilvering helps during operation

2009-12-21 Thread Andrey Kuzmin
It might be helpful to contact the SSD vendor, report the issue and
inquire whether wearing out in half a year is expected behavior for this
model. Further, if you have the option to replace one (or both) SSDs
with fresh ones, this could tell for sure whether they are the root cause.

Regards,
Andrey




On Mon, Dec 21, 2009 at 1:18 PM, Erik Trimble  wrote:
> Mart van Santen wrote:
>>
>> Hi,
>>
>> We have a X4150 with a J4400 attached. Configured with 2x32GB SSD's, in
>> mirror configuration (ZIL) and 12x 500GB SATA disks. We are running this
>> setup for over a half year now in production for NFS and iSCSI for a bunch
>> of virtual machines (currently about 100 VM's, Mostly Linux, some Windows)
>>
>> Since last week we have performance problems, cause IO Wait in the VM's.
>> Of course we did a big search in networking issue's, hanging machines,
>> filewall & traffic tests, but were unable to find any problems. So we had a
>> look into the zpool and dropped one of the mirrored SSD's from the pool (we
>> had some indication the ZIL was not working ok). No success. After adding
>> the disk, we  discovered the IO wait during the "resilvering" process was
>> OK, or at least much better, again. So last night we did the same handling,
>> dropped & added the same disk, and yes, again, the IO wait looked better.
>> This morning the same story.
>>
>> Because this machine is a production machine, we cannot tolerate to much
>> experiments. We now know this operation saves us for about 4 to 6 hours
>> (time to resilvering), but we didn't had the courage to detach/attach the
>> other SSD yet. We will try only a "resilver", without detach/attach, this
>> night, to see what happens.
>>
>> Can anybody explain how the detach/attach and resilver process works, and
>> especially if there is something different during the resilvering and the
>> handling of the SSD's/slog disks?
>>
>>
>> Regards,
>>
>>
>> Mart
>>
>>
>>
> Do the I/O problems go away when only one of the SSDs is attached?
>
>
> Frankly, I'm betting that your SSDs are wearing out.   Resilvering will
> essentially be one big streaming write, which is optimal for SSDs (even an
> SLC-based SSD, as you likely have, performs far better when writing large
> amounts of data at once).  NFS (and to a lesser extent iSCSI) is generally a
> whole lot of random small writes, which are hard on an SSD (especially
> MLC-based ones, but even SLC ones).   The resilvering process is likely
> turning many of the random writes coming in to the system into a large
> streaming write to the /resilvering/ drive.
>
> My guess is that the SSD you are having problems with has reached the end of
> it's useful lifespan, and the I/O problems you are seeing during normal
> operation are the result of that SSD's problems with committing data.
> There's no cure for this, other than replacing the SSD with a new one.
>
> SSDs are not hard drives. Even high-quality modern ones have /significantly/
> lower USE lifespans than an HD - that is, a heavily-used SSD will die well
> before a HD, but a very-lightly used SSD will likely outlast a HD.  And, in
> the case of SSDs, writes are far harder on the SSD than reads are.
>
>
> --
> Erik Trimble
> Java System Support
> Mailstop:  usca22-123
> Phone:  x17195
> Santa Clara, CA
> Timezone: US/Pacific (GMT-0800)
>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD strange performance problem, resilvering helps during operation

2009-12-21 Thread Erik Trimble

Mart van Santen wrote:


Hi,

We have a X4150 with a J4400 attached. Configured with 2x32GB SSD's, 
in mirror configuration (ZIL) and 12x 500GB SATA disks. We are running 
this setup for over a half year now in production for NFS and iSCSI 
for a bunch of virtual machines (currently about 100 VM's, Mostly 
Linux, some Windows)


Since last week we have performance problems, cause IO Wait in the 
VM's. Of course we did a big search in networking issue's, hanging 
machines, filewall & traffic tests, but were unable to find any 
problems. So we had a look into the zpool and dropped one of the 
mirrored SSD's from the pool (we had some indication the ZIL was not 
working ok). No success. After adding the disk, we  discovered the IO 
wait during the "resilvering" process was OK, or at least much better, 
again. So last night we did the same handling, dropped & added the 
same disk, and yes, again, the IO wait looked better. This morning the 
same story.


Because this machine is a production machine, we cannot tolerate to 
much experiments. We now know this operation saves us for about 4 to 6 
hours (time to resilvering), but we didn't had the courage to 
detach/attach the other SSD yet. We will try only a "resilver", 
without detach/attach, this night, to see what happens.


Can anybody explain how the detach/attach and resilver process works, 
and especially if there is something different during the resilvering 
and the handling of the SSD's/slog disks?



Regards,


Mart




Do the I/O problems go away when only one of the SSDs is attached?


Frankly, I'm betting that your SSDs are wearing out.   Resilvering will 
essentially be one big streaming write, which is optimal for SSDs (even 
an SLC-based SSD, as you likely have, performs far better when writing 
large amounts of data at once).  NFS (and to a lesser extent iSCSI) is 
generally a whole lot of random small writes, which are hard on an SSD 
(especially MLC-based ones, but even SLC ones).   The resilvering 
process is likely turning many of the random writes coming in to the 
system into a large streaming write to the /resilvering/ drive.


My guess is that the SSD you are having problems with has reached the 
end of it's useful lifespan, and the I/O problems you are seeing during 
normal operation are the result of that SSD's problems with committing 
data.   There's no cure for this, other than replacing the SSD with a 
new one.


SSDs are not hard drives. Even high-quality modern ones have 
/significantly/ lower USE lifespans than an HD - that is, a heavily-used 
SSD will die well before a HD, but a very-lightly used SSD will likely 
outlast a HD.  And, in the case of SSDs, writes are far harder on the 
SSD than reads are.



--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How do I determine dedupe effectiveness?

2009-12-21 Thread Erik Trimble

Brandon High wrote:

On Sat, Dec 19, 2009 at 8:34 AM, Colin Raven  wrote:
  

If snapshots reside within the confines of the pool, are you saying that
dedup will also count what's contained inside the snapshots? I'm not sure
why, but that thought is vaguely disturbing on some level.



Sure, why not? Let's say you have snapshots enabled on a dataset with
1TB of files in it, and then decide to move 500GB to a new dataset for
other sharing options, or what have you.

If dedup didn't count the snapshots you'd wind up with 500GB in your
original live dataset, an additional 500GB in the snapshots, and an
additional 500GB in the new dataset.

For instance, tank/export/samba/backups used to be a directory in
tank/export/samba/public. Snapshots being used in dedup saved me
700+GB.
tank/export/samba/backups704G  3.35T   704G
/export/samba/backups
tank/export/samba/public 816G  3.35T   101G
/export/samba/public

  


Architecturally, it is madness NOT to store (known) common data within 
the same local concept, in this case, a pool.  Snapshots need to be 
retained close to their original parent (as do clones, et al.), and the 
abstract concept that holds them in ZFS is the pool.  Frankly, I'd have 
a hard time thinking of another structure (abstract or concrete) 
where it would make sense to store such an item (i.e. snapshots).


Remember that a snapshot is A POINT-IN-TIME PICTURE of the 
filesystem/volume.  No more, no less. As such, it makes logical sense to 
retain them "close" to their originator. People tend to slap all sorts 
of other inferences onto what snapshots "mean", which is incorrect, 
both from a conceptual standpoint (a rose is a rose, not a pig, just 
because you want to call it a pig) and at an implementation level.



As for exactly what is meant by "counting" something inside a snapshot: 
remember, a snapshot is already a form of dedup - that is, it is nothing 
more than a list of block pointers to blocks which existed at the time 
the snapshot was taken. I'll have to check, but since I believe that the 
dedup metric counts blocks which have more than one reference to 
them, it currently DOES influence the dedup count if you have a 
snapshot.  I'm not in front of a sufficiently late-version install to 
check this; please, would someone check whether taking a snapshot does or 
does not influence the dedup metric.  (It's a simple test - create a 
pool with 1 dataset, turn on dedup, then copy X amount of data to that 
dataset and check the dedup ratio. Then take a snapshot of the dataset and 
re-check the dedup ratio; a command sketch follows below.)  Conceptually 
speaking, it would be nice to exclude snapshots when computing the dedup 
ratio; implementation-wise, I'm not sure how the ratio is really computed, 
so I can't say whether it's simple or impossible.
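
The test above could be run on a throwaway, file-backed pool with something 
along these lines (names and sizes are arbitrary):

  mkfile 512m /var/tmp/ddtest
  zpool create ddpool /var/tmp/ddtest
  zfs create -o dedup=on ddpool/fs
  cp /path/to/some/test/file /ddpool/fs/copy1   # any real data file
  cp /path/to/some/test/file /ddpool/fs/copy2
  zpool get dedupratio ddpool
  zfs snapshot ddpool/fs@snap1
  zpool get dedupratio ddpool                   # compare before/after the snapshot
  zpool destroy ddpool; rm /var/tmp/ddtest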





in fact handy. Hourly...ummm, maybe the same - but Daily/Monthly should
reside "elsewhere".



That's what replication to another system via send/recv is for. See backups, DR.

  
Once again, these are concepts that have no bearing on what a snapshot 
/IS/.  What one wants to /do/ with a snapshot is up to the user, but 
that's not a decision to be made at the architecture level. That's a 
decision for further up the application abstraction stack.




Y'know, that is a GREAT point. Taking this one step further then - does that
also imply that there's one "hot spot" physically on a disk that keeps
getting read/written to? if so then your point has even greater merit for
more reasons...disk wear for starters, and other stuff too, no doubt.



I believe I read that there is a max ref count for blocks, and beyond
that the data is written out once again. This is for resilience and to
avoid hot spots.

-B
  
Various ZFS metadata blocks are far more "hot" than anything associated 
with dedup.  Brandon is correct in that ZFS will tend to re-write such 
frequently-WRITTEN blocks (whether meta or real data) after a certain 
point.  In the dedup case, this is irrelevant, since dedup is READ-only  
(if you change the block, by definition, it is no longer a dedup of its 
former "mates").


If anything, dedup blocks are /far/ more likely to end up in the L2ARC 
(read cache) than a typical block, everything else being equal.   Now, 
if we can get a defrag utility/feature implemented (possibly after the 
BP rewrite stuff is committed), it would make sense to put frequently 
ACCESSED blocks at the highest-performing portions of the underlying 
media.  This of course means that such a utility would have to be 
informed as to the characteristics of the underlying media (SSD, hard 
drive, RAM disk, etc.) and understand each of the limitations therein; 
case in point:  for HDs, the highest-performing location is the outer 
sectors, while for MLC SSDs it is the "least used" ones, and it's 
irrelevant for solid-state (NVRAM) drives.   Honestly, now that I've 
considered it, I'm thinking that it's not worth any real effort to do 
this kind of optimization.

[zfs-discuss] SSD strange performance problem, resilvering helps during operation

2009-12-21 Thread Mart van Santen


Hi,

We have an X4150 with a J4400 attached, configured with 2x 32GB SSDs in a 
mirror configuration (ZIL) and 12x 500GB SATA disks. We have been running this 
setup for over half a year now in production, serving NFS and iSCSI to a 
bunch of virtual machines (currently about 100 VMs, mostly Linux, some 
Windows).


Since last week we have had performance problems, causing IO wait in the VMs. 
Of course we did a big search into networking issues, hanging machines, 
firewall & traffic tests, but were unable to find any problems. So we 
had a look at the zpool and dropped one of the mirrored SSDs from the 
pool (we had some indication the ZIL was not working OK). No success. 
After re-adding the disk, we discovered the IO wait during the 
"resilvering" process was OK, or at least much better, again. So last 
night we did the same thing, dropped & added the same disk, and yes, 
again, the IO wait looked better. This morning, the same story.


Because this machine is a production machine, we cannot tolerate too many 
experiments. We now know this operation saves us for about 4 to 6 hours 
(the time to resilver), but we haven't had the courage to detach/attach 
the other SSD yet. We will try only a "resilver", without a detach/attach, 
tonight, to see what happens.


Can anybody explain how the detach/attach and resilver process works, 
and especially if there is something different during the resilvering 
and the handling of the SSD's/slog disks?



Regards,


Mart



--
Greenhost - Duurzame Hosting
Derde Kostverlorenkade 35
1054 TS Amsterdam
T: 020 489 4349
F: 020 489 2306
KvK: 34187349

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] FW: ARC not using all available RAM?

2009-12-21 Thread Tomas Ögren
On 21 December, 2009 - Tristan Ball sent me these 4,5K bytes:

> Richard Elling wrote:
> >
> > On Dec 20, 2009, at 12:25 PM, Tristan Ball wrote:
> >
> >> I've got an opensolaris snv_118 machine that does nothing except 
> >> serve up NFS and ISCSI.
> >>
> >> The machine has 8G of ram, and I've got an 80G SSD as L2ARC.
> >> The ARC on this machine is currently sitting at around 2G, the kernel
> >> is using around 5G, and I've got about 1G free.
...
> What I'm trying to find out is is my ARC relatively small because...
> 
> 1) ZFS has decided that that's all it needs (the workload is fairly 
> random), and that adding more wont gain me anything..
> 2) The system is using so much ram for tracking the L2ARC, that the ARC 
> is being shrunk (we've got an 8K record size)
> 3) There's some other memory pressure on the system that I'm not aware 
> of that is periodically chewing up then freeing the ram.
> 4) There's some other memory management feature that's insisting on that
> 1G free.

My bet is on #4 ...

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#arc_reclaim_needed

See line 1956 .. I tried some tuning on a pure nfs server (although
s10u8) here, and got it to use a bit more of "the last 1GB" out of 8G..
I think it was swapfs_minfree that I poked with a sharp stick. No idea
if anything else that relies on it could break, but the machine has been
fine for a few weeks here now and using more memory for ARC.. ;)
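
For anyone wanting to try the same, the knob can be set permanently in
/etc/system with something like the lines below (the value is in pages and is
only an illustration, not necessarily what I used; as far as I remember the
default is physmem/8, which on an 8G box is roughly that last 1GB being kept
free):

* let the ARC use more memory by lowering the swap reservation floor
* value is in pages: 16384 pages * 4K = 64 MB
set swapfs_minfree=16384

As said, no idea what else might rely on it, so test before trusting it.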

/Tomas
-- 
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How do I determine dedupe effectiveness?

2009-12-21 Thread Brandon High
On Sat, Dec 19, 2009 at 8:34 AM, Colin Raven  wrote:
> If snapshots reside within the confines of the pool, are you saying that
> dedup will also count what's contained inside the snapshots? I'm not sure
> why, but that thought is vaguely disturbing on some level.

Sure, why not? Let's say you have snapshots enabled on a dataset with
1TB of files in it, and then decide to move 500GB to a new dataset for
other sharing options, or what have you.

If dedup didn't count the snapshots you'd wind up with 500GB in your
original live dataset, an additional 500GB in the snapshots, and an
additional 500GB in the new dataset.

For instance, tank/export/samba/backups used to be a directory in
tank/export/samba/public. Snapshots being used in dedup saved me
700+GB.
tank/export/samba/backups704G  3.35T   704G
/export/samba/backups
tank/export/samba/public 816G  3.35T   101G
/export/samba/public

> in fact handy. Hourly...ummm, maybe the same - but Daily/Monthly should
> reside "elsewhere".

That's what replication to another system via send/recv is for. See backups, DR.

> Y'know, that is a GREAT point. Taking this one step further then - does that
> also imply that there's one "hot spot" physically on a disk that keeps
> getting read/written to? if so then your point has even greater merit for
> more reasons...disk wear for starters, and other stuff too, no doubt.

I believe I read that there is a max ref count for blocks, and beyond
that the data is written out once again. This is for resilience and to
avoid hot spots.

-B

-- 
Brandon High : bh...@freaks.com
Indecision is the key to flexibility.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss