Re: [zfs-discuss] Resolving performance issue w/ deduplication (NexentaStor)

2011-12-30 Thread Hung-Sheng Tsao (laoTsao)
now S11 supports shadow migration, just for this purpose, AFAIK
not sure whether NexentaStor supports shadow migration



Sent from my iPad

On Dec 30, 2011, at 2:03, Ray Van Dolson rvandol...@esri.com wrote:

 On Thu, Dec 29, 2011 at 10:59:04PM -0800, Fajar A. Nugraha wrote:
 On Fri, Dec 30, 2011 at 1:31 PM, Ray Van Dolson rvandol...@esri.com wrote:
 Is there a non-disruptive way to undeduplicate everything and expunge
 the DDT?
 
 AFAIK, no
 
  zfs send/recv and then back perhaps (we have the extra
 space)?
 
 That should work, but it's disruptive :D
 
 Others might provide a better answer, though.
 
 Well, slightly _less_ disruptive perhaps.  We can zfs send to another
 file system on the same system, but different set of disks.  We then
 disable NFS shares on the original, do a final zfs send to sync, then
 share out the new undeduplicated file system with the same name.
 Hopefully the window here is short enough that NFS clients are able to
 recover gracefully.
 
 We'd then wipe out the old zpool, recreate it, and do the reverse to get
 the data back onto it.
 
 Thanks,
 Ray
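For illustration, a minimal sketch of the send/receive sequence Ray describes above, using hypothetical pool and dataset names (tank/data as the source, tank2/data on the separate set of disks) and assuming dedup is left off on the target:

  # keep the copy undeduplicated
  zfs set dedup=off tank2
  # first full copy while the original is still being served
  zfs snapshot tank/data@migrate1
  zfs send tank/data@migrate1 | zfs recv tank2/data
  # stop serving the original, then send only the changes since the first snapshot
  zfs set sharenfs=off tank/data
  zfs snapshot tank/data@migrate2
  zfs send -i tank/data@migrate1 tank/data@migrate2 | zfs recv tank2/data
  # share the rewritten copy
  zfs set sharenfs=on tank2/data

The outage window is limited to the final incremental send; keeping the same share path would also mean moving the mountpoint property over to the new dataset, which is omitted here.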


Re: [zfs-discuss] Resolving performance issue w/ deduplication (NexentaStor)

2011-12-30 Thread Ray Van Dolson
On Fri, Dec 30, 2011 at 05:57:47AM -0800, Hung-Sheng Tsao (laoTsao) wrote:
 now S11 supports shadow migration, just for this purpose, AFAIK
 not sure whether NexentaStor supports shadow migration

It does not appear so (at least, the shadow property is not in
NexentaStor's zfs man page).

Thanks for the pointer.

Ray

 
 On Dec 30, 2011, at 2:03, Ray Van Dolson rvandol...@esri.com wrote:
 
  On Thu, Dec 29, 2011 at 10:59:04PM -0800, Fajar A. Nugraha wrote:
  On Fri, Dec 30, 2011 at 1:31 PM, Ray Van Dolson rvandol...@esri.com 
  wrote:
  Is there a non-disruptive way to undeduplicate everything and expunge
  the DDT?
  
  AFAIK, no
  
   zfs send/recv and then back perhaps (we have the extra
  space)?
  
  That should work, but it's disruptive :D
  
  Others might provide a better answer, though.
  
  Well, slightly _less_ disruptive perhaps.  We can zfs send to another
  file system on the same system, but different set of disks.  We then
  disable NFS shares on the original, do a final zfs send to sync, then
  share out the new undeduplicated file system with the same name.
  Hopefully the window here is short enough that NFS clients are able to
  recover gracefully.
  
  We'd then wipe out the old zpool, recreate it, and do the reverse to get
  the data back onto it.
  
  Thanks,
  Ray


Re: [zfs-discuss] Resolving performance issue w/ deduplication (NexentaStor)

2011-12-30 Thread Richard Elling
On Dec 30, 2011, at 5:57 AM, Hung-Sheng Tsao (laoTsao) wrote:

 now S11 supports shadow migration, just for this purpose, AFAIK
 not sure whether NexentaStor supports shadow migration

The shadow property is closed source. Once you go there, you are locked into 
Oracle.
 -- richard

-- 

ZFS and performance consulting
http://www.RichardElling.com
illumos meetup, Jan 10, 2012, Menlo Park, CA
http://www.meetup.com/illumos-User-Group/events/41665962/ 


Re: [zfs-discuss] Resolving performance issue w/ deduplication (NexentaStor)

2011-12-30 Thread Richard Elling
On Dec 29, 2011, at 10:31 PM, Ray Van Dolson wrote:

 Hi all;
 
 We have a dev box running NexentaStor Community Edition 3.1.1 w/ 24GB
 (we don't run dedupe on production boxes -- and we do pay for Nexenta
 licenses on prd as well) RAM and an 8.5TB pool with deduplication
 enabled (1.9TB or so in use).  Dedupe ratio is only 1.26x.

Yes, this workload is a poor fit for dedup.

 The box has an SLC-based SSD as ZIL and a 300GB MLC SSD as L2ARC.
 
 The box has been performing fairly poorly lately, and we're thinking
 it's due to deduplication:
 
  # echo ::arc | mdb -k | grep arc_meta
  arc_meta_used =  5884 MB
  arc_meta_limit=  5885 MB

This can be tuned. Since you are on the community edition and thus have no 
expectation of support, you can increase this limit yourself. In the future, the
limit will be increased OOB. For now, add something like the following to the
/etc/system file and reboot.

*** Parameter: zfs:zfs_arc_meta_limit
** Description: sets the maximum size of metadata stored in the ARC.
**   Metadata competes with real data for ARC space.
** Release affected: NexentaStor 3.0, 3.1, not needed for 4.0
** Validation: none
** When to change: for metadata-intensive or deduplication workloads
**   having more metadata in the ARC can improve performance.
** Stability: NexentaStor issue #7151 seeks to change the default 
**   value to be larger than 1/4 of arc_max.
** Data type: integer
** Default: 1/4 of arc_max (bytes)
** Range: 1 to arc_max
** Changed by: YOUR_NAME_HERE
** Change date: TODAYS_DATE
**
*set zfs:zfs_arc_meta_limit = 1000
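As a concrete, hypothetical illustration only: on this box arc_max is roughly 23.5GB, so the default 1/4 works out to just under 6GB, matching the arc_meta_limit shown above. Raising the limit to, say, 12GB would look like this (value in bytes; /etc/system changes take effect at the next reboot):

* hypothetical example: raise the ARC metadata limit to 12 GiB (12 * 1073741824 bytes)
set zfs:zfs_arc_meta_limit = 12884901888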


  arc_meta_max  =  5888 MB
 
  # zpool status -D
  ...
  DDT entries 24529444, size 331 on disk, 185 in core
 
 So, not only are we using up all of our metadata cache, but the DDT
 table is taking up a pretty significant chunk of that (over 70%).
 
 ARC sizing is as follows:
 
  p = 15331 MB
  c = 16354 MB
  c_min =  2942 MB
  c_max = 23542 MB
  size  = 16353 MB
 
 I'm not really sure how to determine how many blocks are on this zpool
 (is it the same as the # of DDT entries? -- deduplication has been on
 since pool creation).  If I use a 64KB block size average, I get about
 31 million blocks, but DDT entries are 24 million ….

The zpool status -D output shows the number of blocks.

 zdb -DD and zdb -bb | grep 'bp count' both do not complete (zdb says
 I/O error).  Probably because the pool is in use and is quite busy.

Yes, zdb is not expected to produce correct output for imported pools.

 Without the block count I'm having a hard time determining how much
 memory we _should_ have.  I can only speculate that it's more at this
 point. :)
 
 If I assume 24 million blocks is about accurate (from zpool status -D
 output above), then at 320 bytes per block we're looking at about 7.1GB
 for DDT table size.  

That is the on-disk calculation. Use the in-core number for memory consumption.
RAM needed if DDT is completely in ARC = 4,537,947,140 bytes (+)
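That figure is simply the DDT entry count from the zpool status -D line above multiplied by the in-core entry size, e.g.:

 # echo '24529444 * 185' | bc
 4537947140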

 We do have L2ARC, though I'm not sure how ZFS
 decides what portion of the DDT stays in memory and what can go to
 L2ARC -- if all of it went to L2ARC, then the references to this
 information in arc_meta would be (at 176 bytes * 24 million blocks)
 around 4GB -- which again is a good chunk of arc_meta_max.

Some of the data might already be in L2ARC. But L2ARC access is always
slower than RAM access by a few orders of magnitude.

 Given that our dedupe ratio on this pool is fairly low anyway, I'm
 looking for strategies to back out.  Should we just disable
 deduplication and then maybe bump up arc_meta_max?
 Maybe also increase the overall ARC size (8GB left for the
 system seems higher than we need)?

The arc_size is dynamic, but limited by another bug in Solaris to effectively
7/8 of RAM (fixed in illumos). Since you are unsupported, you can try to add the
following to /etc/system along with the tunable above.

*** Parameter: swapfs_minfree
** Description: sets the minimum space reserved for the rest of the
**   system as swapfs grows. This value is also used to calculate the
**   dynamic upper limit of the ARC size.
** Release affected: NexentaStor 3.0, 3.1, not needed for 4.0
** Validation: none
** When to change: the default setting of physmem/8 caps the ARC to
**   approximately 7/8 of physmem, a value usually much smaller than
**   arc_max. Choosing a lower limit for swapfs_minfree can allow the
**   ARC to grow above 7/8 of physmem.
** Data type: unsigned integer (pages)
** Default: 1/8 of physmem
** Range: clamped at 256MB (65,536 4KB pages) for NexentaStor 4.0
** Changed by: YOUR_NAME_HERE
** Change date: TODAYS_DATE
**
*set swapfs_minfree=65536
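For reference, assuming 4KB pages and the 24GB of RAM in the original post, the suggested setting reserves 256MB instead of the default of roughly 3GB:

 # echo '65536 * 4096' | bc
 268435456
 # echo '24 * 1024^3 / 8' | bc
 3221225472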

 
 Is there a non-disruptive way to undeduplicate everything and expunge
 the DDT?

define disruptive

  zfs send/recv and then back perhaps (we have the extra space)?

Re: [zfs-discuss] Resolving performance issue w/ deduplication (NexentaStor)

2011-12-30 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Ray Van Dolson
 
 The box has been performing fairly poorly lately, and we're thinking
 it's due to deduplication:

Based on everything you said, you are definitely being hurt by dedup.  You
should definitely disable it if possible.  Start by simply disabling it, so
any new stuff will stop using it.
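For example (pool name hypothetical); note that turning the property off only affects newly written blocks, so the existing DDT remains until the old data is rewritten or freed:

 # zfs set dedup=off tank
 # zfs get -r dedup tank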

As for migration ... As mentioned, there are the disruptive ways, etc.
But ...

The only nondisruptive way that I know, may or may not be acceptable to
you...  May be acceptable only for certain subdirs...
Become root
mkdir temp
gtar cpf - somedir | (cd temp ; gtar xpf -)
mv somedir somedir-old && mv temp/somedir somedir && rm -rf somedir-old temp

Obviously, if you do this, you're assuming there are no open filehandles etc
that would prevent you from being able to do this on somedir...  You're
assuming somedir is static for the interim...  It's only nondisruptive if
you can make such assumptions safely.
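One coarse way to sanity-check the open-filehandle assumption is Solaris fuser against the containing file system's mountpoint (path hypothetical); it lists every process with a file open anywhere on that file system, which is stricter than a single subdir requires:

 # fuser -cu /tank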



Re: [zfs-discuss] Resolving performance issue w/ deduplication (NexentaStor)

2011-12-30 Thread Ray Van Dolson
Thanks for your response, Richard.

On Fri, Dec 30, 2011 at 09:52:17AM -0800, Richard Elling wrote:
 On Dec 29, 2011, at 10:31 PM, Ray Van Dolson wrote:
 
  Hi all;
  
  We have a dev box running NexentaStor Community Edition 3.1.1 w/ 24GB
  (we don't run dedupe on production boxes -- and we do pay for Nexenta
  licenses on prd as well) RAM and an 8.5TB pool with deduplication
  enabled (1.9TB or so in use).  Dedupe ratio is only 1.26x.
 
 Yes, this workload is a poor fit for dedup.
 
  The box has an SLC-based SSD as ZIL and a 300GB MLC SSD as L2ARC.
  
  The box has been performing fairly poorly lately, and we're thinking
  it's due to deduplication:
  
   # echo ::arc | mdb -k | grep arc_meta
   arc_meta_used =  5884 MB
   arc_meta_limit=  5885 MB
 
 This can be tuned. Since you are on the community edition and thus have no
 expectation of support, you can increase this limit yourself. In the future, the
 limit will be increased OOB. For now, add something like the following to the
 /etc/system file and reboot.
 
 *** Parameter: zfs:zfs_arc_meta_limit
 ** Description: sets the maximum size of metadata stored in the ARC.
 **   Metadata competes with real data for ARC space.
 ** Release affected: NexentaStor 3.0, 3.1, not needed for 4.0
 ** Validation: none
 ** When to change: for metadata-intensive or deduplication workloads
 **   having more metadata in the ARC can improve performance.
 ** Stability: NexentaStor issue #7151 seeks to change the default 
 **   value to be larger than 1/4 of arc_max.
 ** Data type: integer
 ** Default: 1/4 of arc_max (bytes)
 ** Range: 1 to arc_max
 ** Changed by: YOUR_NAME_HERE
 ** Change date: TODAYS_DATE
 **
 *set zfs:zfs_arc_meta_limit = 1000

If we wanted to do this on a running system, would the following work?

  # echo arc_meta_limit/Z 0x27100 | mdb -kw

(To up arc_meta_limit to 10GB)
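For reference on the hex arithmetic: 10GB is 10,737,418,240 bytes, i.e. 0x280000000 (0x27100 works out to only 160,000), which can be checked with, for example:

  # printf '%x\n' $((10 * 1024 * 1024 * 1024))
  280000000
  # printf '%d\n' 0x27100
  160000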

 
 
   arc_meta_max  =  5888 MB
  
   # zpool status -D
   ...
   DDT entries 24529444, size 331 on disk, 185 in core
  
  So, not only are we using up all of our metadata cache, but the DDT
  table is taking up a pretty significant chunk of that (over 70%).
  
  ARC sizing is as follows:
  
   p = 15331 MB
   c = 16354 MB
   c_min =  2942 MB
   c_max = 23542 MB
   size  = 16353 MB
  
  I'm not really sure how to determine how many blocks are on this zpool
  (is it the same as the # of DDT entries? -- deduplication has been on
  since pool creation).  If I use a 64KB block size average, I get about
  31 million blocks, but DDT entries are 24 million ….
 
 The zpool status -D output shows the number of blocks.
 
  zdb -DD and zdb -bb | grep 'bp count' both do not complete (zdb says
  I/O error).  Probably because the pool is in use and is quite busy.
 
 Yes, zdb is not expected to produce correct output for imported pools.
 
  Without the block count I'm having a hard time determining how much
  memory we _should_ have.  I can only speculate that it's more at this
  point. :)
  
  If I assume 24 million blocks is about accurate (from zpool status -D
  output above), then at 320 bytes per block we're looking at about 7.1GB
  for DDT table size.  
 
 That is the on-disk calculation. Use the in-core number for memory 
 consumption.
   RAM needed if DDT is completely in ARC = 4,537,947,140 bytes (+)
 
  We do have L2ARC, though I'm not sure how ZFS
  decides what portion of the DDT stays in memory and what can go to
  L2ARC -- if all of it went to L2ARC, then the references to this
  information in arc_meta would be (at 176 bytes * 24 million blocks)
  around 4GB -- which again is a good chunk of arc_meta_max.
 
 Some of the data might already be in L2ARC. But L2ARC access is always
 slower than RAM access by a few orders of magnitude.
 
  Given that our dedupe ratio on this pool is fairly low anyway, I'm
  looking for strategies to back out.  Should we just disable
  deduplication and then maybe bump up arc_meta_max?
  Maybe also increase the overall ARC size (8GB left for the
  system seems higher than we need)?
 
 The arc_size is dynamic, but limited by another bug in Solaris to effectively
 7/8 of RAM (fixed in illumos). Since you are unsupported, you can try to add the
 following to /etc/system along with the tunable above.
 
 *** Parameter: swapfs_minfree
 ** Description: sets the minimum space reserved for the rest of the
 **   system as swapfs grows. This value is also used to calculate the
 **   dynamic upper limit of the ARC size.
 ** Release affected: NexentaStor 3.0, 3.1, not needed for 4.0
 ** Validation: none
 ** When to change: the default setting of physmem/8 caps the ARC to
 **   approximately 7/8 of physmem, a value usually much smaller than
 **   arc_max. Choosing a lower limit for swapfs_minfree can allow the
 **   ARC to grow above 7/8 of physmem.
 ** Data