[zfs-discuss] Using a zvol from your rpool as zil for another zpool

2010-07-01 Thread Ray Van Dolson
We have a server with a couple X-25E's and a bunch of larger SATA
disks.

To save space, we want to install Solaris 10 (our install is only about
1.4GB) to the X-25E's and use the remaining space on the SSD's for ZIL
attached to a zpool created from the SATA drives.

Currently we do this by installing the OS using SVM+UFS (to mirror the
OS between the two SSD's) and then using the remaining space on a slice
as ZIL for the larger SATA-based zpool.

However, SVM+UFS is more annoying to work with as far as LiveUpgrade is
concerned.  We'd love to use a ZFS root, but that requires that the
entire SSD be dedicated as an rpool leaving no space for ZIL.  Or does
it?

It appears that we could do a:

  # zfs create -V 24G rpool/zil

On our rpool and then:

  # zpool add satapool log /dev/zvol/dsk/rpool/zil

(I realize 24G is probably far more than a ZIL device will ever need)

As rpool is mirrored, this would also take care of redundancy for the
ZIL as well.

This lets us have a nifty ZFS rpool for simplified LiveUpgrades and a
fast SSD-based ZIL for our SATA zpool as well...
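Presumably we could then sanity-check the setup with something like the
following (just a sketch, using the pool and zvol names from the example above):

  # zpool status satapool              # the zvol should show up under "logs"
  # zpool iostat -v satapool 5         # watch per-vdev activity, incl. the log device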

What are the downsides to doing this?  Will there be a noticeable
performance hit?

I know I've seen this discussed here before, but wasn't able to come up
with the right search terms...

Thanks,
Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup RAM requirements, vs. L2ARC?

2010-07-01 Thread Neil Perrin

On 07/01/10 22:33, Erik Trimble wrote:

On 7/1/2010 9:23 PM, Geoff Nordli wrote:

Hi Erik.

Are you saying the DDT will automatically look to be stored in an 
L2ARC device if one exists in the pool, instead of using ARC?


Or is there some sort of memory pressure point where the DDT gets 
moved from ARC to L2ARC?


Thanks,

Geoff
   


Good question, and I don't know.  My educated guess is the latter 
(initially stored in ARC, then moved to L2ARC as size increases).


Anyone?



The L2ARC just holds blocks that have been evicted from the ARC due
to memory pressure. The DDT is no different from any other object
(e.g. a file). So when looking for a block, ZFS checks first in the ARC, then
the L2ARC, and if neither succeeds it reads from the main pool.
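If you want to see roughly where reads are being satisfied, the standard
arcstats kstats are a reasonable place to look (a sketch; pick whichever
statistics interest you):

  # kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses
  # kstat -p zfs:0:arcstats:l2_hits zfs:0:arcstats:l2_misses
  # kstat -p zfs:0:arcstats:size zfs:0:arcstats:l2_size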

- Anyone.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs - filesystem versus directory

2010-07-01 Thread Malachi de Ælfweald
I created a zpool called 'data' from 7 disks.
I created zfs filesystems on the zpool for each Xen vm

I can choose to recursively snapshot all 'data'
I can choose to snapshot the individual 'directories'

If you use mkdir, I don't believe you can snapshot/restore  at that level
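For example (the per-VM filesystem name below is hypothetical):

  # zfs snapshot -r data@nightly           # recursive snapshot of everything under 'data'
  # zfs snapshot data/vm01@pre-upgrade     # snapshot just one VM's filesystem
  # zfs rollback data/vm01@pre-upgrade     # roll back only that VM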


Malachi de Ælfweald
http://www.google.com/profiles/malachid


On Thu, Jul 1, 2010 at 9:12 PM, Peter Taps  wrote:

> Folks,
>
> While going through a quick tutorial on zfs, I came across a way to create
> zfs filesystem within a filesystem. For example:
>
> # zfs create mytest/peter
>
> where mytest is a zpool filesystem.
>
> When done this way, the new filesystem has the mount point
> /mytest/peter.
>
> When does it make sense to create such a filesystem versus just creating a
> directory?
>
> # mkdir mytest/peter
>
> Thank you in advance for your help.
>
> Regards,
> Peter
> --
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup RAM requirements, vs. L2ARC?

2010-07-01 Thread Erik Trimble

On 7/1/2010 9:23 PM, Geoff Nordli wrote:

Hi Erik.

Are you saying the DDT will automatically look to be stored in an L2ARC device 
if one exists in the pool, instead of using ARC?

Or is there some sort of memory pressure point where the DDT gets moved from 
ARC to L2ARC?

Thanks,

Geoff
   


Good question, and I don't know.  My educated guess is the latter 
(initially stored in ARC, then moved to L2ARC as size increases).


Anyone?

--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup RAM requirements, vs. L2ARC?

2010-07-01 Thread Geoff Nordli
> Actually, I think the rule-of-thumb is 270 bytes/DDT entry.  It's 200
> bytes of ARC for every L2ARC entry.
> 
> DDT doesn't count for this ARC space usage
> 
> E.g.:  I have 1TB of 4k files that are to be deduped, and it turns
> out that I have about a 5:1 dedup ratio. I'd also like to see how much
> ARC usage I eat up with a 160GB L2ARC.
> 
> (1)  How many entries are there in the DDT:
>      1TB of 4k files means there are 2^30 files (about 1 billion).
>      However, at a 5:1 dedup ratio, I'm only actually storing
>      20% of that, so I have about 214 million blocks.
>      Thus, I need a DDT of about 270 * 214 million  =~  58GB in size
> 
> (2)  My L2ARC is 160GB in size, but I'm using 58GB for the DDT.  Thus,
>      I have 102GB free for use as a data cache.
>      102GB / 4k =~ 27 million blocks can be stored in the
>      remaining L2ARC space.
>      However, 26 million files takes up:  200 * 27 million =~
>      5.5GB of space in ARC
>      Thus, I'd better have at least 5.5GB of RAM allocated
>      solely for L2ARC reference pointers, and no other use.
> 

Hi Erik.

Are you saying the DDT will automatically look to be stored in an L2ARC device 
if one exists in the pool, instead of using ARC? 

Or is there some sort of memory pressure point where the DDT gets moved from 
ARC to L2ARC?

Thanks,

Geoff
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zfs - filesystem versus directory

2010-07-01 Thread Peter Taps
Folks,

While going through a quick tutorial on zfs, I came across a way to create zfs 
filesystem within a filesystem. For example:

# zfs create mytest/peter

where mytest is a zpool filesystem.

> When done this way, the new filesystem has the mount point /mytest/peter.

When does it make sense to create such a filesystem versus just creating a 
directory?

# mkdir mytest/peter

Thank you in advance for your help.

Regards,
Peter
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] confused about lun alignment

2010-07-01 Thread Derek Olsen
doh! It turns out the host in question is actually a Solaris 10 update 6 host.
It appears that a Solaris 10 update 8 host actually sets the start sector at
256.
  So to simplify the question: if I'm using ZFS with an EFI label and the full
disk, do I even need to worry about LUN alignment?  I was all excited to do some
benchmarking and now it appears that zfs will just do the right thing (assuming
the right OS update).
  Thanks.  Deet.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool on raw disk. Do I need to format?

2010-07-01 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Peter Taps
> 
> I am learning more about zfs storage. It appears, zfs pool can be
> created on a raw disk. There is no need to create any partitions, etc.
> on the disk. Does this mean there is no need to run "format" on a raw
> disk?

No need to format.  Also, confusingly, the term "format" doesn't mean the
same here as it does in other situations.  "format" is what you use in
solaris to create partitions (slices) and a few other operations.

Without any partitions or slices, the disk will be called something like
c8t1d0.  But with a partition, it might be c8t1d0p0 or c8t1d0s0.

Usually people will do as Cindy said:  "zpool create mypool c8t1d0"
But depending on who you ask, there is possibly some benefit to creating
slices.  Specifically, if you want to mirror or replace a drive with a new
drive that's not precisely the same size.  In later versions of zpool (later
than what's currently available in osol 2009.06 or solaris 10) they have
applied a patch which works around the "slightly different disk size"
problem and eliminates any possible problem with using the whole disk.  So
generally speaking, you're advised to use the whole disk, the easy way, as
Cindy mentioned.  ;-)
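(And if you do want a mirror later, attaching a second disk to the whole-disk
pool is a one-liner; the second device name below is made up:

zpool attach mypool c8t1d0 c8t2d0

ZFS resilvers onto the new disk and the pool becomes a mirror.)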


> I have added a new disk to my system. It shows up as
> /dev/rdsk/c8t1d0s0. Do I need to format it before I convert it to zfs
> storage? Or, can I simply use it as:

Again, drop the "s0" from the above.  The "s0" means the first slice on the
disk.

zpool create mypool c8t1d0


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Help destroying phantom clone (zfs filesystem)

2010-07-01 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Alxen4
> 
> It looks like I have some leftovers of old clones that I cannot delete:
> 
> Clone name is  tank/WinSrv/Latest
> 
> I'm trying:
> 
> zfs destroy -f -R tank/WinSrv/Latest
> cannot unshare 'tank/WinSrv/Latest': path doesn't exist: unshare(1M)
> failed
> 
> Please help me to get rid of this garbage.

This may not be what you're experiencing, but I recently had a similar
experience.

If you "zdb -d poolname" you'll see a bunch of stuff.  If you grep for a
percent % character, I'm not certain what it means, but I know on my
systems, it is a temporary clone that only exists while a "zfs receive"
incremental is in progress.  If the incremental receive is interrupted, the
% is supposed to go away, but if the receive is interrupted due to system
crash, then of course, it remains, and prevents other things from being
destroyed.

As a reproducible thing:  If you are receiving an incremental zfs send, and
you power cycle the system, one of these things will remain and prevent you
from receiving future incrementals.  The solution is to simply "zfs destroy"
the thing which contains the "%" ... and then the regular incremental works
again.
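For example (the dataset name containing "%" below is hypothetical; use
whatever "zdb -d" actually shows on your pool):

zdb -d tank | grep %
zfs destroy tank/WinSrv/Latest/%recv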

But you have a different symptom from what I had.  So your problem might be
different too.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Checksum errors with SSD.

2010-07-01 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Benjamin Grogg
> 
> When I scrub my pool I got a lot of checksum errors :
> 
> NAMESTATE READ WRITE CKSUM
> rpool   DEGRADED 0 0 5
>   c8d0s0DEGRADED 0 071  too many errors
> 
> Any hints?

What's the confusion?  Replace the drive.

If you think it's a false positive (drive is not actually failing) then you
would zpool clear, (or online, or whatever, until the pool looks normal
again) and then scrub.  If the errors come back, it definitely means the
drive is failing.  Or perhaps the sata cable that connects to it, or perhaps
the controller.  But 99% certain the drive.
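Roughly (the replacement device name is made up, and since this is a root pool
you would still need to reinstall the boot blocks afterwards):

zpool clear rpool
zpool scrub rpool
zpool status -v rpool                # see whether the CKSUM count climbs again
zpool replace rpool c8d0s0 c9d0s0    # if it does, swap in a new disk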

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool on raw disk. Do I need to format?

2010-07-01 Thread Peter Taps
Awesome. Thank you, CIndy.

Regards,
Peter
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] confused about lun alignment

2010-07-01 Thread Derek Olsen
Folks.
  My env is Solaris 10 update 8 amd64.   Does LUN alignment matter when I'm 
creating zpool's on disks (lun's) with EFI labels and providing zpool the 
entire disk?

  I recently read some sun/oracle docs and blog posts about adjusting the
starting sector for partition 0 (in format -e) to address lun alignment
problems.   However they never mention whether this matters for ZFS or not.  In
this scenario I have a lun provisioned from an STK 2540 and I put an EFI label
on the disk.  Then I modified partition 0 to start at sector 64 (tried 256 as
well) and saved those changes.  Once I do a zpool create, the first sector gets
set to "34".  

  Here is the format output with the adjusted sector 

partition> pr
Current partition table (original):
Total disk sectors available: 31440861 + 16384 (reserved sectors)

Part      Tag    Flag     First Sector        Size       Last Sector
  0        usr    wm               256     14.99GB          31440860
  1 unassigned    wm                 0           0                 0
  2 unassigned    wm                 0           0                 0
  3 unassigned    wm                 0           0                 0
  4 unassigned    wm                 0           0                 0
  5 unassigned    wm                 0           0                 0
  6 unassigned    wm                 0           0                 0
  7 unassigned    wm                 0           0                 0
  8   reserved    wu          31440861      8.00MB          31457244

now create the zpool

zpool create testsector c5t6006016070B01E0052374B8B4CAFDE11d0

partition table after zpool create


partition> pr
Current partition table (original):
Total disk sectors available: 31440861 + 16384 (reserved sectors)

Part      Tag    Flag     First Sector        Size       Last Sector
  0        usr    wm                34     14.99GB          31440861


  I know i'm missing something here and any tips would be appreciated.
  TIA.  Deet.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)

2010-07-01 Thread Andrew Jones
Victor,

A little more info on the crash, from the messages file is attached here. I 
have also decompressed the dump with savecore to generate unix.0, vmcore.0, and 
vmdump.0.


Jun 30 19:39:10 HL-SAN unix: [ID 836849 kern.notice] 
Jun 30 19:39:10 HL-SAN ^Mpanic[cpu3]/thread=ff0017909c60: 
Jun 30 19:39:10 HL-SAN genunix: [ID 335743 kern.notice] BAD TRAP: type=e (#pf 
Page fault) rp=ff0017909790 addr=0 occurred in module "" due to a 
NULL pointer dereference
Jun 30 19:39:10 HL-SAN unix: [ID 10 kern.notice] 
Jun 30 19:39:10 HL-SAN unix: [ID 839527 kern.notice] sched: 
Jun 30 19:39:10 HL-SAN unix: [ID 753105 kern.notice] #pf Page fault
Jun 30 19:39:10 HL-SAN unix: [ID 532287 kern.notice] Bad kernel fault at 
addr=0x0
Jun 30 19:39:10 HL-SAN unix: [ID 243837 kern.notice] pid=0, pc=0x0, 
sp=0xff0017909880, eflags=0x10002
Jun 30 19:39:10 HL-SAN unix: [ID 211416 kern.notice] cr0: 
8005003b cr4: 6f8
Jun 30 19:39:10 HL-SAN unix: [ID 624947 kern.notice] cr2: 0
Jun 30 19:39:10 HL-SAN unix: [ID 625075 kern.notice] cr3: 336a71000
Jun 30 19:39:10 HL-SAN unix: [ID 625715 kern.notice] cr8: c
Jun 30 19:39:10 HL-SAN unix: [ID 10 kern.notice] 
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice]rdi:  282 
rsi:15809 rdx: ff03edb1e538
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice]rcx:5  
r8:0  r9: ff03eb2d6a00
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice]rax:  202 
rbx:0 rbp: ff0017909880
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice]r10: f80d16d0 
r11:4 r12:0
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice]r13: ff03e21bca40 
r14: ff03e1a0d7e8 r15: ff03e21bcb58
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice]fsb:0 
gsb: ff03e25fa580  ds:   4b
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice] es:   4b  
fs:0  gs:  1c3
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice]trp:e 
err:   10 rip:0
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice] cs:   30 
rfl:10002 rsp: ff0017909880
Jun 30 19:39:10 HL-SAN unix: [ID 266532 kern.notice] ss:   38
Jun 30 19:39:10 HL-SAN unix: [ID 10 kern.notice] 
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909670 
unix:die+dd ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909780 
unix:trap+177b ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909790 
unix:cmntrap+e6 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 802836 kern.notice] ff0017909880 0 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff00179098a0 
unix:debug_enter+38 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff00179098c0 
unix:abort_sequence_enter+35 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909910 
kbtrans:kbtrans_streams_key+102 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909940 
conskbd:conskbdlrput+e7 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff00179099b0 
unix:putnext+21e ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff00179099f0 
kbtrans:kbtrans_queueevent+7c ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909a20 
kbtrans:kbtrans_queuepress+7c ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909a60 
kbtrans:kbtrans_untrans_keypressed_raw+46 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909a90 
kbtrans:kbtrans_processkey+32 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909ae0 
kbtrans:kbtrans_streams_key+175 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909b10 
kb8042:kb8042_process_key+40 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909b50 
kb8042:kb8042_received_byte+109 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909b80 
kb8042:kb8042_intr+6a ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909bb0 
i8042:i8042_intr+c5 ()
Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff0017909c00 
unix:av_dispatch_autovect+7c ()
Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff0017909c40 
unix:dispatch_hardint+33 ()
Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff00183552f0 
unix:switch_sp_and_call+13 ()
Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff0018355340 
unix:do_interrupt+b8 ()
Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff0018355350 
unix:_interrupt+b8 ()
Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff00183554a0 
unix:htable_steal+198 ()
Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff0018355510 
unix:htable_alloc+248 ()
Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff00183555c0 

Re: [zfs-discuss] zpool on raw disk. Do I need to format?

2010-07-01 Thread Cindy Swearingen

Even easier, use the zpool create command to create a pool
on c8t1d0, using the whole disk. Try this:

# zpool create MyData c8t1d0

cs



On 07/01/10 16:01, Peter Taps wrote:

Folks,

I am learning more about zfs storage. It appears, zfs pool can be created on a raw disk. 
There is no need to create any partitions, etc. on the disk. Does this mean there is no 
need to run "format" on a raw disk?

I have added a new disk to my system. It shows up as /dev/rdsk/c8t1d0s0. Do I 
need to format it before I convert it to zfs storage? Or, can I simply use it 
as:

# zfs create MyData /dev/rdsk/c8t1d0s0

Thank you in advance for your help.

Regards,
Peter

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zpool on raw disk. Do I need to format?

2010-07-01 Thread Peter Taps
Folks,

I am learning more about zfs storage. It appears, zfs pool can be created on a 
raw disk. There is no need to create any partitions, etc. on the disk. Does 
this mean there is no need to run "format" on a raw disk?

I have added a new disk to my system. It shows up as /dev/rdsk/c8t1d0s0. Do I 
need to format it before I convert it to zfs storage? Or, can I simply use it 
as:

# zfs create MyData /dev/rdsk/c8t1d0s0

Thank you in advance for your help.

Regards,
Peter
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs destroy hangs machine if snapshot exists- workaround found

2010-07-01 Thread Erik Trimble

On 7/1/2010 12:23 PM, Lo Zio wrote:

Thanks roy, I read a lot around and also was thinking it was a dedup-related problem. 
Although I did not find any indication of how many RAM is enough, and never find 
something saying "Do not use dedup, it will definitely crash your server". I'm 
using a Dell Xeon with 4 Gb of RAM, maybe it is not an uber-server but it works really 
well (when it is not hung, I mean).
Do you have an idea about the optimal config to have 1,5T of available space in 
10 datasets (5 deduped), and 10 rotating snapshots?
Thanks
   


Take a look at the archives for these threads:

Dedup RAM requirements, vs. L2ARC?
http://mail.opensolaris.org/pipermail/zfs-discuss/2010-June/042661.html

Dedup performance hit
http://mail.opensolaris.org/pipermail/zfs-discuss/2010-June/042235.html



4GB of RAM is likely to be *way* too small to run dedup with your setup. 
You almost certainly need a SSD for L2ARC, and probably at least 2x the 
RAM.


The "hangs" you see are likely the Dedup Table being built on-the-fly 
from the datasets, which is massively I/O intensive.


--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs destroy hangs machine if snapshot exists- workaround found

2010-07-01 Thread Roy Sigurd Karlsbakk
- Original Message -
> Thanks roy, I read a lot around and also was thinking it was a
> dedup-related problem. Although I did not find any indication of how
> many RAM is enough, and never find something saying "Do not use dedup,
> it will definitely crash your server". I'm using a Dell Xeon with 4 Gb
> of RAM, maybe it is not an uber-server but it works really well (when
> it is not hung, I mean).
> Do you have an idea about the optimal config to have 1,5T of available
> space in 10 datasets (5 deduped), and 10 rotating snapshots?
> Thanks

Erik Trimble had a post on this today, pasted below. Look through that, but 
seriously, with 4 gigs of RAM and no L2ARC, dedup is of no use

roy

Actually, I think the rule-of-thumb is 270 bytes/DDT entry.  It's 200
bytes of ARC for every L2ARC entry.

DDT doesn't count for this ARC space usage

E.g.:I have 1TB of 4k files that are to be deduped, and it turns
out that I have about a 5:1 dedup ratio. I'd also like to see how much
ARC usage I eat up with a 160GB L2ARC.

(1)How many entries are there in the DDT:
 1TB of 4k files means there are 2^30 files (about 1 billion).
 However, at a 5:1 dedup ratio, I'm only actually storing
20% of that, so I have about 214 million blocks.
 Thus, I need a DDT of about 270 * 214 million  =~  58GB in size

(2)My L2ARC is 160GB in size, but I'm using 58GB for the DDT.  Thus,
I have 102GB free for use as a data cache.
 102GB / 4k =~ 27 million blocks can be stored in the
remaining L2ARC space.
 However, 26 million files takes up:   200 * 27 million =~
5.5GB of space in ARC
 Thus, I'd better have at least 5.5GB of RAM allocated
solely for L2ARC reference pointers, and no other use.
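If you'd rather measure than estimate, zdb can do it for you (the pool name
below is just an example; the simulation pass can take a while on a big pool):

zdb -S tank     # simulate dedup on existing data and print a DDT histogram
zdb -DD tank    # show actual DDT statistics once dedup is enabled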



-- 
Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er 
et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av 
idiomer med fremmed opprinnelse. I de fleste tilfeller eksisterer adekvate og 
relevante synonymer på norsk.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs destroy hangs machine if snapshot exists- workaround found

2010-07-01 Thread Lo Zio
Thanks roy, I read a lot around and also was thinking it was a dedup-related 
problem. Although I did not find any indication of how much RAM is enough, and 
never found anything saying "Do not use dedup, it will definitely crash your 
server". I'm using a Dell Xeon with 4 GB of RAM; maybe it is not an uber-server 
but it works really well (when it is not hung, I mean).
Do you have an idea about the optimal config to have 1.5T of available space in 
10 datasets (5 deduped), and 10 rotating snapshots?
Thanks
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NexentaStor Community edition 3.0.3 released

2010-07-01 Thread Oliver Seidel
Hello,

this may not apply to your machine.  I have two changes to your setup:
* Opensolaris instead of Nexenta
* DL585G1 instead of your DL380G4

Here's my problem: reproducible crash after a certain time (1:30h in my case).

Explanation: the HP machine has enterprise features (ECC RAM) and performs 
scrubbing of the RAM, just as you could scrub ZFS disks; with the 4 AMD dual 
core CPUs, the memory is divided into 4 chunks and when the scrubber hits a 
hole, then the machine crashes without so much as a crashdump

Solution: add the following to /etc/system

set snooping=1
set pcplusmp:apic_panic_on_nmi=1
set cpu_ms.AuthenticAMD.15:ao_scrub_policy = 1
set cpu_ms.AuthenticAMD.15:ao_scrub_rate_dcache = 0
set cpu_ms.AuthenticAMD.15:ao_scrub_rate_l2cache = 0 
set mc-amd:mc_no_attach=1
set disable_memscrub = 1


Best regards,

Oliver
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Mix SAS and SATA drives?

2010-07-01 Thread Roy Sigurd Karlsbakk
- Original Message -
> > As the 15k drives are faster seek-wise (and possibly faster for
> > linear I/O), you may want to separate them into different VDEVs or
> > even pools, but then, it's quite impossible to give a "correct"
> > answer unless knowing what it's going to be used for.
> 
> 
> Mostly database duty.

Then use the 15k drives in a striped mirror - they'll perform quite a bit 
better that way :p
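Something along these lines (device names are placeholders):

zpool create dbpool \
  mirror c1t0d0 c2t0d0 \
  mirror c1t1d0 c2t1d0 \
  mirror c1t2d0 c2t2d0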

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er 
et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av 
idiomer med fremmed opprinnelse. I de fleste tilfeller eksisterer adekvate og 
relevante synonymer på norsk.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Mix SAS and SATA drives?

2010-07-01 Thread Ian D

> As the 15k drives are faster seek-wise (and possibly faster for linear I/O),
> you may want to separate them into different VDEVs or even pools, but then,
> it's quite impossible to give a "correct" answer unless knowing what it's
> going to be used for.

Mostly database duty.

> Also, using 10+ drives in a single VDEV is not really recommended - use fewer
> drives, lose more space, but get a faster system with fewer drives in each
> VDEV. 15x 750GB SATA drives in a single VDEV will give you about 120 IOPS for
> the whole VDEV at most. Using smaller VDEVs will help speeding up things and
> will give you better protection against faulty drives (silent or noisy). Once
> more - see
> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

With that much space we can afford mirroring everything.  We'll put every disk
in pairs across separate JBODs and controllers.

Thanks a lot!
  ___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Mix SAS and SATA drives?

2010-07-01 Thread Roy Sigurd Karlsbakk
- Original Message -
> Another question...
> We're building a ZFS NAS/SAN out of the following JBODs we already
> own:
> 
> 
> 2x 15x 1000GB SATA
> 3x 15x 750GB SATA
> 2x 12x 600GB SAS 15K
> 4x 15x 300GB SAS 15K
> 
> 
> That's a lot of spindles we'd like to benefit from, but our assumption
> is that we should split these in two separate pools, one for SATA
> drives and one for SAS 15K drives. Are we right?

As the 15k drives are faster seek-wise (and possibly faster for linear I/O), 
you may want to separate them into different VDEVs or even pools, but then, 
it's quite impossible to give a "correct" answer unless knowing what it's going 
to be used for.

Also, using 10+ drives in a single VDEV is not really recommended - use fewer 
drives, lose more space, but get a faster system with fewer drives in each 
VDEV. 15x 750GB SATA drives in a single VDEV will give you about 120 IOPS for 
the whole VDEV at most. Using smaller VDEVs will help speeding up things and 
will give you better protection against faulty drives (silent or noisy). Once 
more - see 
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er 
et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av 
idiomer med fremmed opprinnelse. I de fleste tilfeller eksisterer adekvate og 
relevante synonymer på norsk.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Mix SAS and SATA drives?

2010-07-01 Thread Ian D

Sorry for the formatting, that's
2x 15x 1000GB SATA
3x 15x  750GB SATA
2x 12x  600GB SAS 15K
4x 15x  300GB SAS 15K
  ___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Mix SAS and SATA drives?

2010-07-01 Thread Ian D

Another question... We're building a ZFS NAS/SAN out of the following JBODs we
already own:

2x 15x 1000GB SATA
3x 15x  750GB SATA
2x 12x  600GB SAS 15K
4x 15x  300GB SAS 15K

That's a lot of spindles we'd like to benefit from, but our assumption is that
we should split these in two separate pools, one for SATA drives and one for
SAS 15K drives.  Are we right?

Thanks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Expected throughput

2010-07-01 Thread Roy Sigurd Karlsbakk

> > Hi! We've put 28x 750GB SATA drives in a RAIDZ2 pool (a single vdev) and we
> > get about 80MB/s in sequential read or write. We're running local tests on
> > the server itself (no network involved). Is that what we should be
> > expecting? It seems slow to me.
>
> Please read the ZFS best practices guide at
> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
>
> To summarise, putting 28 disks in a single vdev is nothing you would do if you
> want performance. You'll end up with as many IOPS as a single drive can do.
> Split it up into smaller (<10 disk) vdevs and try again. If you need high
> performance, put them in a striped mirror (aka RAID1+0)

A little addition - for 28 drives, I guess I'd choose four vdevs with seven
drives each in raidz2. You'll lose space, but it'll be four times faster, and
four times safer. Better safe than sorry
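Roughly like this (controller/target numbers are placeholders):

zpool create tank \
  raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 \
  raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 \
  raidz2 c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t6d0 \
  raidz2 c4t0d0 c4t1d0 c4t2d0 c4t3d0 c4t4d0 c4t5d0 c4t6d0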

Vennlige hilsener / Best regards 

roy 
-- 
Roy Sigurd Karlsbakk 
(+47) 97542685 
r...@karlsbakk.net 
http://blogg.karlsbakk.net/ 
-- 
I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er 
et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av 
idiomer med fremmed opprinnelse. I de fleste tilfeller eksisterer adekvate og 
relevante synonymer på norsk. 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Supermicro AOC-USAS-L8i

2010-07-01 Thread Roy Sigurd Karlsbakk
> On a slightly different but related topic, anyone have advice on how
> to connect up my drives? I've got room for 20 pool drives in the case.
> I'll have two AOC-USAS-L8i cards along with cables to connect 16 SATA2
> drives. The motherboard has 6 SATA2 connectors plus 2 SATA3
> connectors. I was planning to use the SATA3 connectors for the boot
> drives (hopefully mirrored ZFS). Initially I'll have three 2TB drives
> with the intention of using raidz1 on them. Phase 2 will be three more
> 2TB drives. Phase 3 will be three 1.5TB drives. (I already have most
> of the drives for phase 2 and 3.) I'll probably fill the rest with a
> JBOD pool of assorted drives I have lying around.

I'd try to grab more drives of the same size to be able to do RAIDz2. It's 
quite common that a drive fails and there are 'silent' errors (non-detectable 
on the drive's CRC, but detectable by ZFS) during resilver. This will give you 
data corruption. If you have RAIDz2, ZFS will figure that out
 
> I'm specifically wondering if I'll see better performance by spreading
> the raidz1 array devices across the controllers or if I should keep an
> array to a single controller as much as possible.

This has been debated a lot. If you put a VDEV (or pool) on a single 
controller, if that controller dies, your pool will be unavailable until the 
controller is replaced. If you spread the disks over more controllers, the 
chances are better to allow access to the data until you replace the 
controller. Again, raidz2 will help out there as well
 
Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er 
et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av 
idiomer med fremmed opprinnelse. I de fleste tilfeller eksisterer adekvate og 
relevante synonymer på norsk.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs destroy hangs machine if snapshot exists- workaround found

2010-07-01 Thread Roy Sigurd Karlsbakk
- Original Message -
> I also have this problem, with 134 if I delete big snapshots the
> server hangs only responding to ping.
> I also have the ZVOL issue.
> Any news about having them solved?
> In my case this is a big problem since I'm using osol as a file
> server...

Are you using dedup? If so, that's the reason - dedup isn't ready for 
production yet unless you have sufficient RAM/L2ARC
 
Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er 
et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av 
idiomer med fremmed opprinnelse. I de fleste tilfeller eksisterer adekvate og 
relevante synonymer på norsk.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Expected throughput

2010-07-01 Thread Roy Sigurd Karlsbakk

> Hi! We've put 28x 750GB SATA drives in a RAIDZ2 pool (a single vdev) and we
> get about 80MB/s in sequential read or write. We're running local tests on
> the server itself (no network involved). Is that what we should be expecting?
> It seems slow to me.

Please read the ZFS best practices guide at
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

To summarise, putting 28 disks in a single vdev is nothing you would do if you
want performance. You'll end up with as many IOPS as a single drive can do.
Split it up into smaller (<10 disk) vdevs and try again. If you need high
performance, put them in a striped mirror (aka RAID1+0)

Vennlige hilsener / Best regards 

roy 
-- 
Roy Sigurd Karlsbakk 
(+47) 97542685 
r...@karlsbakk.net 
http://blogg.karlsbakk.net/ 
-- 
I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er 
et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av 
idiomer med fremmed opprinnelse. I de fleste tilfeller eksisterer adekvate og 
relevante synonymer på norsk. 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS on external iSCSI storage

2010-07-01 Thread Roy Sigurd Karlsbakk
> > The best would be to export the drives in JBOD style, one "array" per
> > drive. If you rely on the Promise RAID, you won't be able to
> > recover from "silent" errors. I'm in the process of moving from a
> > NexSAN RAID to a JBOD-like style just because of that (we had data
> > corruption on the RAID, stuff the NexSAN box didn't detect, as the
> > drives didn't detect it). As this NexSAN box can't export true JBOD,
> > I export each pair of drives as a stripe, and use them in VDEVs in
> > RAIDz2s and a spare or two. While not as good as local drives, it'll
> > be better than trusting the 'hardware' RAID on the box (which has
> > proven to be unable to detect silent errors).
> 
> Yup, we just did the same thing with our PAC storage units. Created 48
> Raid 0 stripes with 1 drive each. We originally tried 4 drives per
> Raid 0 stripe but we ran into massive performance hits.

In case someone gets across NexSAN SATABOY boxes, keep in mind that they can 
neither do JBOD nor make more than 14 RAID sets. I ended up exporting three 
drives for each RAID-0 set and then used RAIDz2 across them together with a 
spare.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er 
et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av 
idiomer med fremmed opprinnelse. I de fleste tilfeller eksisterer adekvate og 
relevante synonymer på norsk.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Expected throughput

2010-07-01 Thread Ian D

Hi!  We've put 28x 750GB SATA drives in a RAIDZ2 pool (a single vdev) and we 
get about 80MB/s in sequential read or write. We're running local tests on the 
server itself (no network involved).  Is that what we should be expecting?  It 
seems slow to me.
Thanks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS on external iSCSI storage

2010-07-01 Thread Roy Sigurd Karlsbakk
- Original Message -
> I'm new with ZFS, but I have had good success using it with raw
> physical disks. One of my systems has access to an iSCSI storage
> target. The underlying physical array is in a propreitary disk storage
> device from Promise. So the question is, when building a OpenSolaris
> host to store its data on an external iSCSI device, is there anything
> conceptually wrong with creating a raidz pool from a group of "raw"
> LUNs carved from the iSCSI device?

The best would be to export the drives in JBOD style, one "array" per drive. If 
you rely on the Promise RAID, you won't be able to recover from "silent" 
errors. I'm in the process of moving from a NexSAN RAID to a JBOD-like style 
just because of that (we had data corruption on the RAID, stuff the NexSAN box 
didn't detect, as the drives didn't detect it). As this NexSAN box can't export 
true JBOD, I export each pair of drives as a stripe, and use them in VDEVs in 
RAIDz2s and a spare or two. While not as good as local drives, it'll be better 
than trusting the 'hardware' RAID on the box (which has proven to be unable to 
detect silent errors).
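As a sketch of how that ends up looking (lun1..lun7 stand in for the actual
c#t#d# names of the exported two-drive stripes):

zpool create tank \
  raidz2 lun1 lun2 lun3 lun4 lun5 lun6 \
  spare  lun7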

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er 
et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av 
idiomer med fremmed opprinnelse. I de fleste tilfeller eksisterer adekvate og 
relevante synonymer på norsk.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Help destroying phantom clone (zfs filesystem)

2010-07-01 Thread Alxen4
It looks like I have some leftovers of old clones that I cannot delete:

Clone name is  tank/WinSrv/Latest

I'm trying:

zfs destroy -f -R tank/WinSrv/Latest
cannot unshare 'tank/WinSrv/Latest': path doesn't exist: unshare(1M) failed

Please help me to get rid of this garbage.

Thanks a lot.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Checksum errors with SSD.

2010-07-01 Thread Cindy Swearingen

Hi Benjamin,

I'm not familiar with this disk, but you can see from the fmstat output that
the disk, system-event, and zfs-related diagnostics are working overtime on
something, and it's probably this disk.

You can get further details from fmdump -eV and you will probably
see lots of checksum errors on this disk.

You might review some of the h/w diagnostic recommendations in this wiki:

http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide

I would recommend replacing the disk, soon, or figure out what other
issue might be causing problems for this disk.

Thanks,

Cindy
Benjamin Grogg wrote:

Dear Forum

I use a KINGSTON SNV125-S2/30GB SSD on a ASUS M3A78-CM Motherboard (AMD SB700 
Chipset).
SATA Type (in BIOS) is SATA 
Os : SunOS homesvr 5.11 snv_134 i86pc i386 i86pc


When I scrub my pool I got a lot of checksum errors :

NAMESTATE READ WRITE CKSUM
rpool   DEGRADED 0 0 5
  c8d0s0DEGRADED 0 071  too many errors

zpool clear rpool works after a scrub I have again the same situation.
fmstat looks like this :

module               ev_recv ev_acpt wait   svc_t  %w  %b  open solve  memsz  bufsz
cpumem-retire              0       0  0.0     0.0   0   0     0     0      0      0
disk-transport             0       0  0.0  1541.1   0   0     0     0    32b      0
eft                        1       0  0.0     4.7   0   0     0     0   1.2M      0
ext-event-transport        3       0  0.0     2.1   0   0     0     0      0      0
fabric-xlate               0       0  0.0     0.0   0   0     0     0      0      0
fmd-self-diagnosis         6       0  0.0     0.0   0   0     0     0      0      0
io-retire                  0       0  0.0     0.0   0   0     0     0      0      0
sensor-transport           0       0  0.0    37.3   0   0     0     0    32b      0
snmp-trapgen               3       0  0.0     1.1   0   0     0     0      0      0
sysevent-transport         0       0  0.0  2836.3   0   0     0     0      0      0
syslog-msgs                3       0  0.0     2.7   0   0     0     0      0      0
zfs-diagnosis             91      77  0.0    28.9   0   0     2     1   336b   280b
zfs-retire                10       0  0.0   387.9   0   0     0     0   620b      0

fmadm looks like this :

---------------  ------------------------------------  -------------  --------
TIME             EVENT-ID                              MSG-ID         SEVERITY
---------------  ------------------------------------  -------------  --------
Jun 30 16:37:28  806072e5-7cd6-efc1-c89d-d40bce4adf72  ZFS-8000-GH    Major


Host: homesvr
Platform: System-Product-Name   Chassis_id  : System-Serial-Number
Product_sn  : 


Fault class : fault.fs.zfs.vdev.checksum
Affects : zfs://pool=rpool/vdev=f7dad7554a72b3bc
  faulted but still in service
Problem in  : zfs://pool=rpool/vdev=f7dad7554a72b3bc
  faulted but still in service

In /var/adm/messages I don't have any abnormal issues.
I can put the SSD also on a other SATA-Port but without success.

My other HDD runs smoothly :

NAMESTATE READ WRITE CKSUM
tankONLINE   0 0 0
  mirror-0  ONLINE   0 0 0
c4d1ONLINE   0 0 0
c5d0ONLINE   0 0 0

iostat gives me following :

c4d1 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Model: WDC WD10EVDS-63 Revision:  Serial No:  WD-WCAV592 Size: 1000.20GB <1000202305536 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 0 
c5d0 Soft Errors: 981 Hard Errors: 0 Transport Errors: 981 
Model: Hitachi HDS7210 Revision:  Serial No:   JP2921HQ0 Size: 1000.20GB <1000202305536 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 0 
c8d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Model: KINGSTON SSDNOW Revision:  Serial No: 30PM10I Size: 30.02GB <30016659456 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 


Any hints?
Best regards and many thanks for your help!

Benjamin

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS on external iSCSI storage

2010-07-01 Thread Mark
I'm new with ZFS, but I have had good success using it with raw physical disks. 
One of my systems has access to an iSCSI storage target. The underlying 
physical array is in a propreitary disk storage device from Promise. So the 
question is, when building a OpenSolaris host to store its data on an external 
iSCSI device, is there anything conceptually wrong with creating a raidz pool 
from a group of "raw" LUNs carved from the iSCSI device?

Thanks for your advice.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] optimal ZFS filesystem layout on JBOD

2010-07-01 Thread Marty Scholes
Joachim Worringen wrote:
> Greetings,
> 
> we are running a few databases of currently 200GB
> (growing) in total for data warehousing:
> - new data via INSERTs for (up to) millions of rows
> per day; sometimes with UPDATEs
> - most data in a single table (=> 10 to 100s of
> millions of rows)
> - queries SELECT subsets of this table via an index
> - for effective parallelisation, queries create
> (potentially large) non-temporary tables which are
> deleted at the end of the query => lots of simple
> INSERTs and SELECTs during queries
> - large transactions: they may contain millions of
> INSERTs/UPDATEs
> - running version PostgreSQL 8.4.2
> 
> We are moving all this to a larger system - the
> hardware is available, therefore fixed:
> - Sun X4600 (16 cores, 64GB)
> - external SAS JBOD with 24 2,5" slots: 
>   o 18x SAS 10k 146GB drives
> o  2x SAS 10k 73GB drives
>   o 4x Intel SLC 32GB SATA SSD
> JBOD connected to Adaptec SAS HBA with BBU
> - Internal storage via on-board RAID HBA:
>   o 2x 73GB SAS 10k for OS (RAID1)
> o 2x Intel SLC 32GB SATA SSD for ZIL (RAID1) (?)
> - OS will be Solaris 10 to have ZFS as filesystem
> (and dtrace)
> - 10GigE towards client tier (currently, another
> X4600 with 32cores and 64GB)
> 
> What would be the optimal storage/ZFS layout for
> this? I checked solarisinternals.com and some
> PostgreSQL resources and came to the following
> concept - asking for your comments:
> - run the JBOD without HW-RAID, but let all
> redundancy be done by ZFS for maximum flexibility 
> - create separate ZFS pools for tablespaces (data,
> index, temp) and WAL on separate devices (LUNs):
> - use the 4 SSDs in the JBOD as Level-2 ARC cache
> (can I use a single cache for all pools?) w/o
> redundancy
> - use the 2 SSDs connected to the on-board HBA as
> RAID1 for ZFS ZIL
> 
> Potential issues that I see:
> - the ZFS ZIL will not benefit from a BBU (as it is
> connected to the backplane, driven by the
> onboard-RAID), and might be too small (32GB for ~2TB
> of data with lots of writes)?
> - the pools on the JBOD might have the wrong size for
> the tablespaces - like: using the 2 73GB drives as
> RAID 1 for temp might become too small, but adding a
> 146GB drive might not be a good idea?
> - with 20 spindles, does it make sense at all to use
> dedicated devices for the tabelspaces, or will the
> load be distributed well enough across the spindles
> anyway?
> 
> thanks for any comments & suggestions,
>  
>  Joachim

I'll chime in based on some tuning experience I had under UFS with Pg 7.x 
coupled with some experience with ZFS, but not experience with later Pg on ZFS. 
 Take this with a grain of salt.

Pg loves to push everything to the WAL and then dribble the changes back to the 
datafiles when convenient.  At a checkpoint, all of the changes are flushed in 
bulk to the tablespace.  Since the changes to WAL and disk are synchronous, ZIL 
is used, which I believe translates to all data being written four times under 
ZFS: once to WAL ZIL, then to WAL, then to tablespace ZIL, then to tablespace.

For writes, I would break WAL into its own pool and then put an SSD ZIL mirror 
on that.  It would allow all writes to be nearly instant to WAL and would keep 
the ZIL needs to the size of the WAL, which probably won't exceed the size of 
your SSD.  The ZIL on WAL will especially help with large index updates which 
can cause cascading b-tree splits and result in large amounts of small 
synchronous I/O, bringing Pg to a crawl.  Checkpoints will still slow things 
down when the data is flushed to the tablespace pool, but that will happen with 
coalesced writes, so iops would be less of a concern.

For reads, I would either keep indexes and tables on the same pool and back 
them with as much L2ARC as needed for the working set, or if you lack 
sufficient L2ARC, break the indexes into their own pool and L2ARC those 
instead, because index reads generally are more random and heavily used, at 
least for well tuned queries.  Full table scans for well-vacuumed tables are 
generally sequential in nature, so table iops again are less of a concern.

If you have to break the indexes into their own pool for dedicated SSD L2ARC, 
you might consider adding some smaller or short-stroked 15K drives for L2ARC on 
the table pool.

For geometry, find the redundancy that you need, e.g. +1, +2 or +3, then decide 
which is more important, space or iops.  If L2ARC and ZIL reduce your need for 
iops, then go with RAIDZ[123].  If you still need the iops, pile a bunch of 
[123]-way mirrors together.

Yes, I would avoid HW raid and run pure JBOD and would be tempted to keep temp 
tables on the index or table pool.
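As a concrete sketch of the layout (pool, device and dataset names are all made
up, and the 8k recordsize is simply the usual match for Pg's block size rather
than anything specific to this setup):

zpool create wal mirror c2t0d0 c2t1d0 log mirror c3t0d0 c3t1d0
zpool create pg  mirror c4t0d0 c5t0d0 mirror c4t1d0 c5t1d0 \
                 cache c3t2d0 c3t3d0
zfs create -o recordsize=8k pg/tables
zfs create -o recordsize=8k pg/indexes
zfs create wal/pg_xlog            # WAL can stay at the default recordsize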

Like I said above, take this with a grain of salt and feel free to throw out, 
disagree with or lampoon me for anything that does not resonate with you.

Whatever you do, make sure you stress-test the configuration with 
production-size data and workloads before you deploy it.

Good luck,
Marty
-- 
This message posted from opensolaris.org

[zfs-discuss] Checksum errors with SSD.

2010-07-01 Thread Benjamin Grogg
Dear Forum

I use a KINGSTON SNV125-S2/30GB SSD on a ASUS M3A78-CM Motherboard (AMD SB700 
Chipset).
SATA Type (in BIOS) is SATA 
Os : SunOS homesvr 5.11 snv_134 i86pc i386 i86pc

When I scrub my pool I got a lot of checksum errors :

NAMESTATE READ WRITE CKSUM
rpool   DEGRADED 0 0 5
  c8d0s0DEGRADED 0 071  too many errors

zpool clear rpool works after a scrub I have again the same situation.
fmstat looks like this :

module               ev_recv ev_acpt wait   svc_t  %w  %b  open solve  memsz  bufsz
cpumem-retire              0       0  0.0     0.0   0   0     0     0      0      0
disk-transport             0       0  0.0  1541.1   0   0     0     0    32b      0
eft                        1       0  0.0     4.7   0   0     0     0   1.2M      0
ext-event-transport        3       0  0.0     2.1   0   0     0     0      0      0
fabric-xlate               0       0  0.0     0.0   0   0     0     0      0      0
fmd-self-diagnosis         6       0  0.0     0.0   0   0     0     0      0      0
io-retire                  0       0  0.0     0.0   0   0     0     0      0      0
sensor-transport           0       0  0.0    37.3   0   0     0     0    32b      0
snmp-trapgen               3       0  0.0     1.1   0   0     0     0      0      0
sysevent-transport         0       0  0.0  2836.3   0   0     0     0      0      0
syslog-msgs                3       0  0.0     2.7   0   0     0     0      0      0
zfs-diagnosis             91      77  0.0    28.9   0   0     2     1   336b   280b
zfs-retire                10       0  0.0   387.9   0   0     0     0   620b      0

fmadm looks like this :

---------------  ------------------------------------  -------------  --------
TIME             EVENT-ID                              MSG-ID         SEVERITY
---------------  ------------------------------------  -------------  --------
Jun 30 16:37:28  806072e5-7cd6-efc1-c89d-d40bce4adf72  ZFS-8000-GH    Major

Host: homesvr
Platform: System-Product-Name   Chassis_id  : System-Serial-Number
Product_sn  : 

Fault class : fault.fs.zfs.vdev.checksum
Affects : zfs://pool=rpool/vdev=f7dad7554a72b3bc
  faulted but still in service
Problem in  : zfs://pool=rpool/vdev=f7dad7554a72b3bc
  faulted but still in service

In /var/adm/messages I don't have any abnormal issues.
I can put the SSD also on a other SATA-Port but without success.

My other HDD runs smoothly :

NAMESTATE READ WRITE CKSUM
tankONLINE   0 0 0
  mirror-0  ONLINE   0 0 0
c4d1ONLINE   0 0 0
c5d0ONLINE   0 0 0

iostat gives me following :

c4d1 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Model: WDC WD10EVDS-63 Revision:  Serial No:  WD-WCAV592 Size: 1000.20GB 
<1000202305536 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 0 
c5d0 Soft Errors: 981 Hard Errors: 0 Transport Errors: 981 
Model: Hitachi HDS7210 Revision:  Serial No:   JP2921HQ0 Size: 1000.20GB 
<1000202305536 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 0 
c8d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Model: KINGSTON SSDNOW Revision:  Serial No: 30PM10I Size: 30.02GB 
<30016659456 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 

Any hints?
Best regards and many thanks for your help!

Benjamin
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NexentaStor Community edition 3.0.3 released

2010-07-01 Thread David Magda

On Jul 1, 2010, at 10:39, Pasi Kärkkäinen wrote:


basicly 5-30 seconds after login prompt shows up on the console
the server will reboot due to kernel crash.

the error seems to be about the broadcom nic driver..
Is this a known bug?


Please contact Nexenta via their support infrastructure (web site,  
forums, lists?):


http://www.nexentastor.org/
http://www.nexenta.com/corp/support

Using ZFS- and OpenSolaris-specific lists isn't appropriate.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Announce: zfsdump

2010-07-01 Thread Edward Ned Harvey
> From: Asif Iqbal [mailto:vad...@gmail.com]
> 
> currently to speed up the zfs send| zfs recv I am using mbuffer. It
> moves the data
> lot faster than using netcat (or ssh) as the transport method

Yup, this works because network and disk latency can both be variable.  So
without buffering, your data stream must instantaneously go the speed of
whichever is slower:  The disk or the network.

But when you use buffering, you're able to go as fast as the network at all
times.  You remove the effect of transient disk latency.
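For reference, a typical mbuffer pipeline looks something like this (host name,
port and buffer sizes are arbitrary):

# on the receiving host
mbuffer -s 128k -m 1G -I 9090 | zfs receive tank/backup

# on the sending host
zfs send tank/fs@snap | mbuffer -s 128k -m 1G -O receiver:9090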


> that is why I thought may be transport it like axel does better than
> wget.
> axel let you create multiple pipes, so you get the data multiple times
> faster
> than with wget.

If you're using axel to download something from the internet, the reason
it's faster than wget is because your data stream is competing against all
the other users of the internet, to get something from that server across
some WAN.  Inherently, all the routers and servers on the internet will
treat each data stream fairly (except when explicitly configured to be
unfair.)  So when you axel some file from the internet using multiple
threads, instead of wget'ing with a single thread, you're unfairly hogging
the server and WAN bandwidth between your site and the remote site.  Slowing
down everyone else on the internet who are running with only 1 thread each.

Assuming your zfs send backup is going local, on a LAN, you almost certainly
do not want to do that.

If your zfs send is going across the WAN ... maybe you do want to
multithread the datastream.  But you better ensure it's encrypted.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NexentaStor Community edition 3.0.3 released

2010-07-01 Thread Pasi Kärkkäinen
On Tue, Jun 15, 2010 at 10:57:53PM +0530, Anil Gulecha wrote:
> Hi All,
> 
> On behalf of NexentaStor team, I'm happy to announce the release of
> NexentaStor Community Edition 3.0.3. This release is the result of the
> community efforts of Nexenta Partners and users.
> 
> Changes over 3.0.2 include
> * Many fixes to ON/ZFS backported to b134.
> * Multiple bug fixes in the appliance.
> 
> With the addition of many new features, NexentaStor CE is the *most
> complete*, and feature-rich gratis unified storage solution today.
> 
> Quick Summary of Features
> -
> * ZFS additions: Deduplication (based on OpenSolaris b134).
> * Free for up to 12 TB of *used* storage
> * Community edition supports easy upgrades
> * Many new features in the easy to use management interface.
> * Integrated search
> 
> Grab the iso from
> http://www.nexentastor.org/projects/site/wiki/CommunityEdition
> 
> If you are a storage solution provider, we invite you to join our
> growing social network at http://people.nexenta.com.
> 

Hey,

I tried installing Nexenta 3.0.3 on an old HP DL380G4 server,
and it installed ok, but it crashes all the time.. 

basicly 5-30 seconds after login prompt shows up on the console
the server will reboot due to kernel crash.

the error seems to be about the broadcom nic driver..
Is this a known bug? 

See the screenshots for the kernel error message:

http://pasik.reaktio.net/nexenta/nexenta303-crash02.jpg
http://pasik.reaktio.net/nexenta/nexenta303-crash01.jpg

-- Pasi

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] optimal ZFS filesystem layout on JBOD

2010-07-01 Thread Joachim Worringen
Greetings,

we are running a few databases of currently 200GB (growing) in total for data 
warehousing:
- new data via INSERTs for (up to) millions of rows per day; sometimes with 
UPDATEs
- most data in a single table (=> 10 to 100s of millions of rows)
- queries SELECT subsets of this table via an index
- for effective parallelisation, queries create (potentially large) 
non-temporary tables which are deleted at the end of the query => lots of 
simple INSERTs and SELECTs during queries
- large transactions: they may contain millions of INSERTs/UPDATEs
- running PostgreSQL 8.4.2

We are moving all this to a larger system - the hardware is available, 
therefore fixed:
- Sun X4600 (16 cores, 64GB)
- external SAS JBOD with 24 2.5" slots:
  o 18x SAS 10k 146GB drives
  o  2x SAS 10k  73GB drives
  o  4x Intel SLC 32GB SATA SSD
- JBOD connected to Adaptec SAS HBA with BBU
- Internal storage via on-board RAID HBA:
  o 2x 73GB SAS 10k for OS (RAID1)
  o 2x Intel SLC 32GB SATA SSD for ZIL (RAID1) (?)
- OS will be Solaris 10 to have ZFS as filesystem (and dtrace)
- 10GigE towards client tier (currently, another X4600 with 32cores and 64GB)

What would be the optimal storage/ZFS layout for this? I checked
solarisinternals.com and some PostgreSQL resources and came to the following
concept (a rough command-level sketch follows the list) - asking for your
comments:
- run the JBOD without HW-RAID, but let all redundancy be done by ZFS for 
maximum flexibility 
- create separate ZFS pools for tablespaces (data, index, temp) and WAL on 
separate devices (LUNs):
- use the 4 SSDs in the JBOD as Level-2 ARC cache (can I use a single cache for 
all pools?) w/o redundancy
- use the 2 SSDs connected to the on-board HBA as RAID1 for ZFS ZIL
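
In zpool terms I picture something roughly like this (only a sketch - the
device names are invented, and the exact split of spindles between the pools
is precisely what I'm unsure about):

  # zpool create data  raidz c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0
  # zpool create index raidz c3t6d0 c3t7d0 c3t8d0 c3t9d0
  # zpool create wal   mirror c3t16d0 c3t17d0
  # zpool create temp  mirror c3t18d0 c3t19d0
  # zpool add data  cache c3t20d0 c3t21d0
  # zpool add index cache c3t22d0 c3t23d0
  # zpool add data  log mirror c1t4d0 c1t5d0

The 4 JBOD SSDs are split across pools because cache devices are attached to
a specific pool; the last line would let ZFS mirror the two internal SSDs as
a slog instead of using the onboard RAID1.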

Potential issues that I see:
- the ZFS ZIL will not benefit from a BBU (as it is connected to the backplane,
driven by the onboard RAID), and might be too small (32GB for ~2TB of data
with lots of writes)? See the rough estimate after this list.
- the pools on the JBOD might end up the wrong size for the tablespaces - e.g.
using the two 73GB drives as a mirror for temp might become too small, but
adding a 146GB drive to that mirror might not be a good idea?
- with 20 spindles, does it make sense at all to use dedicated devices for the
tablespaces, or will the load be distributed well enough across the spindles
anyway?
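
A rough sanity check on the slog size, using the rule of thumb that the
separate log only needs to hold a few transaction groups' worth of
synchronous writes (assuming txg syncs roughly every 5-30 seconds):

  10GigE worst case: ~1.25 GB/s * ~10 s between syncs = ~12.5 GB

so purely on size, the 32GB mirror should be more than enough, independent of
the total amount of data - but please correct me if that reasoning is off.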

thanks for any comments & suggestions,
 
  Joachim
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs destroy hangs machine if snapshot exists- workaround found

2010-07-01 Thread Lo Zio
I also have this problem: with build 134, if I delete big snapshots the server
hangs, responding only to ping.
I also have the ZVOL issue.
Any news on whether these have been fixed?
In my case this is a big problem, since I'm using OpenSolaris as a file server...
Thanks
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] dedup accounting anomaly / dedup experiments

2010-07-01 Thread Lutz Schumann
Hello list, 

I wanted to test deduplication a little and did an experiment.

My question was: can I dedupe infinitely, or is there an upper limit?

So for that I did a very basic test:
- I created a ramdisk pool (1GB)
- enabled dedup, and
- wrote zeros to it (in one single file) until an error was returned.

The size of the pool was 1046 MB; I was able to write 62 GB to it before it
returned "no space left on device".  The record size was 128k, so I was able
to write roughly 507,000 blocks to the pool.
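
For reference, the test boiled down to something like this (only a sketch -
the ramdisk, pool and file names are made up):

  # ramdiskadm -a ddtest 1g
  # zpool create ramtest /dev/ramdisk/ddtest
  # zfs set dedup=on ramtest
  # zfs set compression=off ramtest
  # dd if=/dev/zero of=/ramtest/zeros bs=128k
  # zpool list ramtest ; zfs list ramtest ; zdb -DD ramtest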

With this device being full, I see the following:

1) zfs list reports that no space is left (AVAIL=0)
2) zpool reports that the dedup factor was ~507,000x
3) zpool also reports that 8.6 MB of space were allocated in the pool (0% used)

So to me it looks like there is something broken in ZFS accounting with
dedup:

- zpool and zfs free space reporting do not align
- the real deduplication factor was not 507,000 (that would mean I could have
written 507,000 x 1GB = a lot to the pool)
- calculating 1046 MB / 507,000 = 2.1 KB, somehow for each 128k block, 2.1 KB
of data has been written (assuming zfs list is correct). What is this?
Metadata? Meaning that I have approx 1.6% of metadata in ZFS
(1/(128k/2.1k))?

I repeated the same thing for a recordsize of 32k. The funny thing is:
- again, 60 GB could be written before "no space left"
- 31 MB of space were allocated in the pool (zpool list)

The version of the pool is 25.

During the experiment I could nicely see:
- that performance on a ramdisk is CPU bound, doing ~125 MB/sec per core
- that performance scales linearly with added CPU cores (125 MB/s for 1 core,
253 MB/s for 2 cores, 408 MB/s for 4 cores)
- that the upper size of the deduplication table is blocks * ~150 bytes,
independent of the dedup factor (see the rough sizing example after this list)
- that the DDT does not grow for deduplicatable blocks (zdb -D)
- that performance goes down by a factor of ~4 when switching from the
"closest" to the "best fit" allocation policy (as the pool fills, the rate
drops from 250 MB/s to 67 MB/s). I suspect even worse results for spinning
media because of the head movements (>10x slowdown).
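
To put the ~150 bytes per DDT entry into perspective (rough arithmetic based
on my own numbers above, assuming, say, 2 TB of mostly unique 128k blocks):

  2 TB / 128 KB ~= 16 million blocks
  16 million blocks * ~150 bytes ~= 2.4 GB of DDT

so even a few TB of unique data would already mean a multi-GB dedup table.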

Does anyone know why the dedup factor is wrong? Any insight into what has
actually been written (compressed metadata, deduped metadata, etc.) would be
greatly appreciated.

Regards, 
Robert
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Supermicro AOC-USAS-L8i

2010-07-01 Thread Jay Heyl
> I plan on removing the second USAS-L8i and connect
> all 16 drives to the 
> first USAS-L8i when I need more storage capacity. I
> have no doubt that 
> it will work as intended. I will report to the list
> otherwise.

I'm a little late to the party here. First, I'd like to thank those pioneers 
who came before me and found this board works fine. I have two of them on order 
for a server I'm putting together. 

I'm curious about this connecting all 16 drives bit. I read that the controller 
will handle up to 122 drives, but short of a lot of extra, expensive hardware, 
I have no idea how that is accomplished. How are you intending to connect 16 
drives to the two SAS ports on the controller?

On a slightly different but related topic, anyone have advice on how to connect 
up my drives? I've got room for 20 pool drives in the case. I'll have two 
AOC-USAS-L8i cards along with cables to connect 16 SATA2 drives. The 
motherboard has 6 SATA2 connectors plus 2 SATA3 connectors. I was planning to 
use the SATA3 connectors for the boot drives (hopefully mirrored ZFS). 
Initially I'll have three 2TB drives with the intention of using raidz1 on 
them. Phase 2 will be three more 2TB drives. Phase 3 will be three 1.5TB 
drives. (I already have most of the drives for phase 2 and 3.) I'll probably 
fill the rest with a JBOD pool of assorted drives I have lying around.

I'm specifically wondering whether I'll see better performance by spreading
the raidz1 array devices across the controllers, or whether I should keep
each array on a single controller as much as possible.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss