Re: [smartos-discuss] Remove ZIL from zones pool?

2018-01-24 Thread Richard Elling
bonnie++ should  not be considered until after you read 
https://blogs.oracle.com/roch/decoding-bonnie

  -- richard



> On Jan 24, 2018, at 5:33 PM, Sam Nicholson  wrote:
> 
> Well,  I'm wrong.
> 
> I got to thinking about this and decided to test it.  See 
> github.com/SamCN2/zfs-stats
> 
> Turns out that having 2 cache devices hurts.  Mostly not a lot, but sometimes 
> a lot.
> It never helps, as far as the tests I ran could tell.
> 
> Mind, I'm on a small server.  I think, like Jim, I want to get the most out 
> of what I have.
> 
> As expected, mirrored ZILs cost a percent or two.  It's an extra write.  Even 
> with Parallelism, it costs.
> 
> I also explored mirrors, stripes, raidz, raidz2, and mirrored stripes.  I had 
> always thought that mirrored
> stripes were the best.  Raidz2 is pretty darned good, sometimes better.
> 
> Caching helps.  I wouldn't be without it on any read-mostly workload, which 
> is what I have: source repos, data lakes...
> Caching hurts writes sometimes, though, and I won't be using it in the future 
> on transaction pools.
> 
> To me, the only thing left to decide is mirrored ZILs.  I'll probably keep 
> them for transaction pools.
> For data repos, I'll just have one.
> 
> Cheers!
> -sam
> 
>> On Fri, Jan 19, 2018 at 10:49 AM, Sam Nicholson  wrote:
>> I'll throw my $.02 behind you Jim.  Perhaps splitting an SSD is not the best 
>> way to get the most performance out of the SSD.
>> But even a part of an SSD is a huge win.  In fact, separate logs have always 
>> been a win, even back in the pre-ZFS days.  I
>> recall some old Bonnie results from the late '90s that show separating logs 
>> onto partitions *on the same drive* is a good deal.
>> Relative to the bare drive performance, that is.  Not relative to optimal 
>> combinations of fast 2.5 inch 15K SAS for logs.
>> 
>> My standard is to use 2 SSDs and 2 HDDs.  Mirror the HDDs for reliability, 
>> split the SSD into a small slice for a mirrored log,
>> and a spanned cache.  Remember, there is no need to mirror the cache.  
>> Errors there are treated just like misses, they are 
>> read from primary volume.
>> 
>> It looks like this:
>> 
>> # zpool create zones mirror c0t0d0 c0t1d0 log mirror c0t2d0s4 c0t3d0s4 cache c0t2d0s5 c0t3d0s5
>> 
>> Having used format(1m) to partition c0t2 and c0t3 with Solaris labels with 
>> partitions 4 and 5 as the LOG and CACHE parts, respectively.
>> Size of part 4 depends upon your write speed: max write rate * 5 seconds, 
>> rounded up.  I use 8 GB logs for my 10G-connected NFS server.
>> 10 Gbit/s * 5 s = 50 Gbit (round up to 56), divided by 8 bits per byte = 7 GB, 
>> which I round up to 8 GBytes.  The rest is cache.
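
That sizing rule, expressed as a quick shell calculation (just a sketch; LINK_GBPS is a placeholder for your own maximum write rate):

  # a slog needs to absorb roughly 5 seconds of writes at the maximum write rate
  LINK_GBPS=10
  echo "minimum slog size: $(( (LINK_GBPS * 5 + 7) / 8 )) GB (round up for headroom)"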
>> 
>> It's *so* much better than just the mirrored HDD.  But I do see contention 
>> for the SSD under heavy load.  So What!?  
>> My HDDs would have melted by now.
>> 
>> Only caveat is that TRIM support may be affected.  I don't know, I haven't 
>> looked into the behavior.  That's a good Q for another topic.
>> 
>> As to Parted being not there.  I'd gently advise to use native tools.  
>> format(1m) has fdisk, partition, and label within.
>> I'm going to be rebuilding a couple of servers later today.  I'll capture 
>> the format session I use and send it to you, if you like.
>> I use Parted(8) on Linux.  I don't like it, but I use it.  :)
>> 
>> Cheer!
>> -sam
>> 
>>> On Fri, Jan 19, 2018 at 10:07 AM, Jim Wiggs  wrote:
>>> I've been told by quite a few folks that splitting the SSD between log and 
>>> cache is "a bad idea" or "suboptimal" but frankly, I don't buy it.  It may 
>>> just be my personal experience, but for my use cases, I've been operating 
>>> with limited resources and haven't been able to justify the expense of 
>>> having three or more SSDs to do this.  Since I've never needed a ZIL with 
>>> more than 2 GB of space and the smallest SSDs you can buy are more than 10x 
>>> that size, mirroring a pair of SSDs for the ZIL was a huge waste of space.  
>>> I started doing this about 4 years ago when SSDs were much more expensive 
>>> and I couldn't justify that waste, so I'd partition a 1-2 GB slice on each 
>>> SSD and mirror them for my ZIL, and use the remaining space on both SSDs, 
>>> un-mirrored, for cache.  Again, in my experience, this has always resulted 
>>> in better general performance than either adding only log or only cache.
>>> 
>>> YMMV.
>>> 
>>>> On Fri, Jan 19, 2018 at 1:07 AM, Ian Collins wrote:
>>>> On 01/19/2018 08:45 PM, Jim Wiggs wrote:
>>>>> So all is right with the world again.  But I'm still left with one 
>>>>> question: why on Earth is *parted* not included as part of the SmartOS 
>>>>> hypervisor image?  The old Solaris format command is spectacularly 
>>>>> user-unfriendly and always has been. I can't imagine that parted requires 
>>>>> so much additional space that it couldn't be included.  Was there any 
>>>>> particular rationale to not put a better and more user-friendly 
>>>>> partitioning tool into the OS that runs at the top level and manages 

Re: [smartos-discuss] zpool detach mistake

2017-09-27 Thread Richard Elling

> On Sep 27, 2017, at 3:10 AM, Antoine Jacoutot  wrote:
> 
> Hello.
> 
> So I made a mistake today replacing a faulty drive in a RAID10 setup.
> i.e. 4 mirrors in a pool
> I stupidly detached the drive (c7) which means I lost one mirror. 
> Surprisingly,
> it _seems_ everything is still working fine.
> That said I now have this:

What you have shouldn't be the result of a "zpool replace" command.
Can you share the output of "zpool history zones"?
 -- richard
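
For reference, a sketch of the commands in play here (not from the thread; the new-disk names are hypothetical placeholders):

  # full command history of the pool, as requested above
  zpool history zones

  # each stray top-level disk can be protected by attaching a fresh disk to it,
  # turning the single-disk vdev back into a mirror (hypothetical new devices)
  zpool attach zones c8t5000CCA012AD8ED9d0 c9tNEWDISK1d0
  zpool attach zones c7t5000CCA012B1A9F5d0 c9tNEWDISK2d0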

> 
> # zpool status
> pool: zones
> state: ONLINE
> scan: scrub repaired 2.71M in 44h26m with 0 errors on Wed Sep 20 10:55:06 2017
> config:
> 
> NAME   STATE READ WRITE CKSUM
> zones  ONLINE   0 0 0
>   mirror-0 ONLINE   0 0 0
> c1t5000CCA012B1089Dd0  ONLINE   0 0 0
> c2t5000CCA012AEAFC1d0  ONLINE   0 0 0
>   mirror-1 ONLINE   0 0 0
> c3t5000CCA0124510DDd0  ONLINE   0 0 0
> c4t5000CCA012A1FF61d0  ONLINE   0 0 0
>   mirror-2 ONLINE   0 0 0
> c5t5000CCA012B06399d0  ONLINE   0 0 0
> c6t5000CCA012AEF71Dd0  ONLINE   0 0 0
>   c8t5000CCA012AD8ED9d0  ONLINE   0 0 0
>   c7t5000CCA012B1A9F5d0  ONLINE   0 0 0
> 
> errors: No known data errors
> 
> Meaning I no longer have mirror-3 but instead 2 drives participating in the
> RAID0 pool. That's obviously not good in case one of these 2 drives dies. My
> question, although I doubt this is possible, is: is there any way I could
> recreate mirror-3 with c7 and c8 without data loss?
> 
> Thanks!
> 
> --
> Antoine
> 




Re: [smartos-discuss] hardware selection question

2017-08-06 Thread Richard Elling

> On Aug 6, 2017, at 4:01 PM, Steve  wrote:
> 
> I saw a Supermicro with 60 drives in a 4U chassis! They max out at 90 I
> believe.


We use 60 SAS 2.5" drives in 2u chassis with dual, 2-socket + 2 12G SAS HBA 
controllers.
Next week we'll have them on the exhibit floor at Flash Memory Summit in San 
Jose, CA,
visit the Newisys booth.
 -- richard





Re: [smartos-discuss] znapzend (or similar) in a pull configuration

2017-05-23 Thread Richard Elling

> On May 23, 2017, at 6:53 AM, Chris Ferebee  wrote:
> 
> Hi all,
> 
> znapzend is wonderful for snapshots and backups.
> 
> However, for backups of internet-facing zones, I would prefer a "pull" rather 
> than "push" configuration, such that the backup host initiates the connection 
> to the live host, rather than the other way around. That way, the backup host 
> can sit securely behind a NAT firewall, and the live host doesn’t need to 
> have ssh keys etc. giving access to the backup host.

There are, literally, hundreds of ZFS send/receive wrappers and agents running 
around the
internet. The vast majority are push model, but as you note, the pull model is 
superior for 
scale and is much easier to write. Why? Because at the end of the day the 
send/receive is a
one-liner. But to make it work with all of the possible exceptions, you end up 
with hundreds of
lines of code. Since most of the failure modes occur on the receiving side, it 
becomes quite tedious
to build a viable push model. For the pull model, you can deal with the local 
issues prior to
calling send/receive, making it much easier to manage.

zetaback is one such implementation.
https://github.com/omniti-labs/zetaback 
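
As a bare illustration of that one-liner at the core of a pull setup (a sketch; host and dataset names are made up):

  # run on the backup host, which pulls an incremental stream from the live host
  ssh live-host 'zfs send -i tank/www@2017-05-22 tank/www@2017-05-23' \
      | zfs receive -F backup/www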


 — richard

> 
> This just seems more secure on general principles considering ransomware and 
> other threats.
> 
> I don’t see how to implement something along these lines easily with 
> znapzend. Is there an alternative that would support this type of 
> configuration?
> 
> Chris
> 





Re: [smartos-discuss] sysinfo script modification question

2017-04-04 Thread Richard Elling

> On Apr 3, 2017, at 11:37 PM, Dale Ghent  wrote:
> 
> 
> I have SMCI servers that have mangled or all-zero UUIDs as well.

very common with Supermicro gear. You'll also see an occasional bogus 
00010002-0003-0004-0005-000600070008. The sysinfo code in the kernel recognizes 
some of these as bogus and uses a random number for the hostid, which is then 
stored in /etc. For SmartOS that method doesn't work, for obvious reasons. 

A few lives back we changed this, but that code isn't a general-purpose 
solution. It should be easy enough to make a more general solution for modern 
SmartOS.

  -- richard

> 
> By "mangled", SMCI has made the extraordinarily poor choice on several of 
> their X10 platforms to set the first 4 fields to 0 and the last 48 bits to 
> the MAC address of one of the on-board ethernet PHYs, in an apparent "good 
> enough" approach to UUID generation at the factory:
> 
> [daleg@xenon]~$ smbios | grep -i uuid
>   UUID: 00000000-0000-0000-0000-0cc47a09b5f2
> [daleg@xenon]~$ dladm show-phys -m
> LINK     SLOT     ADDRESS            INUSE  CLIENT
> igb1     primary  c:c4:7a:9:b5:f3    no     --
> igb0     primary  c:c4:7a:9:b5:f2    yes    igb0
> igb3     primary  c:c4:7a:9:b5:f5    no     --
> igb2     primary  c:c4:7a:9:b5:f4    no     --
> 
> [daleg@devohat]~$ smbios | grep -i uuid
>   UUID: 00000000-0000-0000-0000-0cc47a7b58d8
> [daleg@devohat]~$ dladm show-phys -m
> LINK     SLOT     ADDRESS             INUSE  CLIENT
> igb0     primary  c:c4:7a:7b:58:d8    yes    igb0
> igb1     primary  c:c4:7a:7b:58:d9    no     --
> ixgbe0   primary  c:c4:7a:7b:5c:be    yes    ixgbe0
> ixgbe1   primary  c:c4:7a:7b:5c:bf    yes    ixgbe1
> 
> How widespread this practice is throughout their product line, I'm not sure. 
> It might work from a practical standpoint insofar as it's a UUID that can be 
> used to identify a particular piece of iron, but it does seem extraordinarily 
> sloppy to not bother with filling out the first 80 bits which comprise the 
> first 4 fields, thus reducing a 128-bit UUID to a 48-bit one. It also means 
> that these really aren't UUIDs in spirit, because one could predict the UUID 
> of a given box based only on observed or even guessed MAC addresses.
> 
> /dale
> 
>> On Apr 4, 2017, at 2:01 AM, Jorge Schrauwen  wrote:
>> 
>> It's usually a bit hit and miss, to be honest. Only one of the machines I 
>> run SmartOS on reports a UUID that is not all zeros.
>> Most of them are SuperMicro too; I guess it is more of an OEM BIOS vendor 
>> specific thing. I think they were all AMI.
>> 
>> 
>> 
>> 
>>> On 2017-04-03 23:42, Robert Mustacchi wrote:
 On 4/3/17 0:22 , 강경원 wrote:
 Hello.
 We are testing SDC with servers that all have the same SMBIOS UUID.
>>> We recommend that you talk to your hardware vendor and have them provide
>>> tooling to fix the server's UUID. If they have the same UUID, they've
>>> not properly implemented the SMBIOS spec (though it's far from the first
>>> time we've heard of this).
 So we tried to modify the image's sysinfo script to test, and after modifying the
 sysinfo, the fake UUID is created successfully and setup works.
 But when we try to reboot the node, the error message below is shown and the
 reboot does not complete.
 The only thing we can do is an IPMI power reset.
 How can we avoid these errors?
 svc.startd: Killing user processes.
 WARNING: Error writing ufs log state
 WARNING: ufs log for /usr changed state to Error
 WARNING: Please umount(1M) /usr and run fsck(1M)
>>> Given what little information we have to work on, I'd suggest you
>>> review
>>> your procedure for building and modifying the live image for how you
>>> updated sysinfo to your custom version. Without knowing what you've
>>> done
>>> or not done or how you've done it, it's hard to suggest actionable
>>> steps
>>> to take.
>>> Robert
>> 
>> 
> 
> 




Re: [smartos-discuss] Zfs dirty data max value question

2017-03-07 Thread Richard Elling

> On Mar 7, 2017, at 2:59 PM, 강경원  wrote:
> 
> Hello 
> I saw that zfs_dirty_data_max is 4GB.
> If we have several VMs and the write requests come in at 4GB/s (32Gbit/s), 
> will the I/O be throttled even though we use enough SSD or NVMe? The ZIL 
> write volume can also be about 4GB.
> Can I know why the default max value is 4GB? Any ideas?

It is a guess.

> SAS3 bandwidth seems to be 4.8GB/s; is it related?

No. 

For such high data rates, you will want to tune it. However, it is not always 
clear 
whether to increase or decrease; there are many factors that impact that 
decision.
It is a good idea to test with the expected workload.
 — richard
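
A sketch of how such tuning is typically inspected and applied on illumos (the value below is only an example, not a recommendation):

  # read the current value, in bytes
  echo 'zfs_dirty_data_max/E' | mdb -k

  # change it on the running kernel until the next reboot (example: 8 GiB)
  echo 'zfs_dirty_data_max/Z 0x200000000' | mdb -kw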

> 
> https://www.delphix.com/blog/delphix-engineering/tuning-openzfs-write-throttle
> 
> Regards,
> Kyungwon Kang.
>  
> 
> 
> 





Re: [smartos-discuss] disk speeds and ramdisks

2017-02-24 Thread Richard Elling
This is what we call an "/etc/system virus".
Unless you understand each of the tunables and what they impact, do yourself a 
favor
and avoid putting them on your system. In some cases they are documented in the
Solaris kernel tunables guide; in other cases they are documented only in 
source code.
 — richard


> On Feb 23, 2017, at 10:47 AM, Humberto Ramirez  wrote:
> 
> Did you re-run the dd tests after tweaking those parameters? 
> 
> 
> 
> On Feb 22, 2017 11:52 AM, "Will Beazley"  > wrote:
> Mille Grazie!
> 
> 
> On 2/21/17 23:12, Artem Penner wrote:
>> Read about these kernel parameters:
>> zfs:zfs_dirty_data_max
>> zfs:zfs_txg_timeout
>> zfs:zfs_dirty_data_sync
>> 
>> They limit your I/O.
>> 
>> Example /etc/system:
>> set ibft_noprobe=1
>> set noexec_user_stack=1
>> set noexec_user_stack_log=1
>> set idle_cpu_no_deep_c=1
>> set idle_cpu_prefer_mwait = 0
>> set hires_tick=1
>> set ip:ip_squeue_fanout=1
>> set pcplusmp:apic_panic_on_nmi=1
>> set apix:apic_panic_on_nmi=1
>> set dump_plat_mincpu=0
>> set dump_bzip2_level=1
>> set dump_metrics_on=1
>> set sata:sata_auto_online=1
>> set sd:sd_max_throttle = 128
>> set sd:sd_io_time=10
>> set rlim_fd_max = 131072
>> set rlim_fd_cur = 65536
>> set ndd:tcp_wscale_always = 1
>> set ndd:tstamp_if_wscale = 1
>> set ndd:tcp_max_buf = 166777216
>> set nfs:nfs_allow_preepoch_time = 1
>> set nfs:nfs3_max_threads = 256
>> set nfs:nfs4_max_threads = 256
>> set nfs:nfs3_nra = 32
>> set nfs:nfs4_nra = 32
>> set nfs:nfs3_bsize = 1048576
>> set nfs:nfs4_bsize = 1048576
>> set nfs3:max_transfer_size = 1048576
>> set nfs4:max_transfer_size = 1048576
>> set nfs:nfs4_async_clusters = 16
>> set rpcmod:svc_default_stksize=0x6000
>> set rpcmod:cotsmaxdupreqs = 4096
>> set rpcmod:maxdupreqs = 4096
>> set rpcmod:clnt_max_conns = 8
>> set maxphys=1048576
>> set zfs:zfs_dirty_data_max = 0x6
>> set zfs:zfs_txg_timeout = 0xc
>> set zfs:zfs_dirty_data_sync = 0x4
>> set zfs:zfs_arc_max = 0x64
>> set zfs:zfs_arc_shrink_shift=12
>> set zfs:l2arc_write_max = 0x640
>> set zfs:l2arc_write_boost = 0xC80
>> set zfs:l2arc_headroom = 12
>> set zfs:l2arc_norw=0
>> 
>> On 22 Feb 2017 at 7:28, "Will Beazley" <will.beaz...@infoassets.com> wrote:
>> Christopher, et al.,
>> 
>> I am trying to get my head around why the performance of ramdisk is so much 
>> poorer than that of HDD-pool+SSD-slog.
>> 
>> /usbkey/test_dir]#  time dd if=/dev/zero of=/tmpfs/testfile bs=64k 
>> count=32768;time dd if=/dev/zero of=/usbkey/test_dir/testfile bs=64k 
>> count=32768
>> 32768+0 records in
>> 32768+0 records out
>> 2147483648  bytes transferred in 2.279053 secs 
>> (942270169 bytes/sec)
>> 
>> real0m2.312s
>> user0m0.021s
>> sys 0m1.062s
>> 32768+0 records in
>> 32768+0 records out
>> 2147483648  bytes transferred in 0.743729 secs 
>> (2887453957 bytes/sec)
>> 
>> real0m0.760s
>> user0m0.016s
>> sys 0m0.652s
>> 
>> I created the ramdisk thus:
>> ramdiskadm -a rd1 3072m
>> ...
>> zfs create -o  mountpoint=/tmpfs -o sync=disabled  ramdsk1/rd1
>> 
>> I've run it many times and although the results vary, the tale is always 
>> the same.
>> 
>> Thank You,
>> Will
>> 
> 
> 
> 
> 





Re: [smartos-discuss] Expanded pool only seeing 2TB of 3TB disks

2017-02-02 Thread Richard Elling

> On Feb 2, 2017, at 7:31 AM, Gareth Howell  wrote:
> 
> Hi
> I’ve just upgraded the two mirrored disks in my zones pool from 1TB disks to 
> 3TB disks by replacing each in turn and then doing zpool replace.
> 
> I then did
> 'zpool scrub'
> and
> 'zpool set autoexpand=on'
> 
> I expected 'zpool get size' to show 3TB, but it only shows 2TB.

This can happen if the drive is SATA and the BIOS 
is set to IDE emulation mode. In such cases, the
driver won't be sd, so in iostat -x it will not show up as sd#.

  -- richard
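
If the drives are in fact recognized at their full size, the usual follow-up looks like this (a sketch; device names are placeholders):

  # with autoexpand=on, nudge each replaced disk to claim the new capacity
  zpool online -e zones c0t0d0
  zpool online -e zones c0t1d0
  zpool list zones        # SIZE should now reflect the 3TB drives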



> 
> Thanks
> 
> Gareth





Re: [smartos-discuss] smartos ssd disk question

2017-01-23 Thread Richard Elling

> On Jan 23, 2017, at 9:54 AM, Gernot Straßer  
> wrote:
> 
> Hi Robert,
> 
> so let me try the other way around then: Do you know a device where the 
> manufacturer explicitly says so?

STEC did. They also said synchronize_cache is a nop.

I’ve been working a lot with similar Toshibas recently. In general, nice drive 
and I don’t
notice any significant impact for enabling or disabling synchronize_cache — a 
good thing.
 — richard

> 
> Regards
> Gernot
> 
> 
> -----Original Message-----
> From: Robert Mustacchi [mailto:r...@joyent.com] 
> Sent: Monday, 23 January 2017 18:45
> To: smartos-discuss@lists.smartos.org
> Subject: Re: AW: [smartos-discuss] smartos ssd disk question
> 
> On 1/23/17 9:36 , Gernot Straßer wrote:
>> Most (if not all) so-called enterprise-class SSDs claim to be power-safe 
>> (being equipped with supercaps to power the drive until the DRAM write cache is 
>> emptied).
>> In case of a power failure no system will be able to send a synchronize 
>> command to the drive, so what sense would the supercap make if that was a 
>> requirement?
>> Does anybody have a suggestion on how to test that (besides pulling the 
>> power cable)?
> 
> Hi Gernot,
> 
> I think you're looking at this from the wrong perspective. For example, ZFS 
> will not treat the write as stable until it receives a synchronize cache 
> command. For some devices it may be that the synchronize cache command is 
> required to get outstanding writes into the state that it will be protected 
> by the supercap. Obviously, this is something that's going to vary from drive 
> to drive. If it's totally fine for these Toshiba's great. If someone wanted 
> to make a chance to illumos that said synchronize cache was unnecessary on 
> those devices, then I'd want the manufacturer to explicitly say so.
> 
> Robert
> 
>> -----Original Message-----
>> From: Robert Mustacchi [mailto:r...@joyent.com]
>> Sent: Monday, 23 January 2017 18:30
>> To: smartos-discuss@lists.smartos.org
>> Subject: Re: [smartos-discuss] smartos ssd disk question
>> 
>> On 1/23/17 9:20 , Youzhong Yang wrote:
>>> it is power safe and we've tested it here.
>>> 
>>> https://toshiba.semicon-storage.com/us/product/storage-products/enterprise-ssd/px02smb-px02smfxxx.html?sug=1
>> 
>> Sure, it does say it's power safe. Are you sure that means you don't need to 
>> issue synchronize cache commands to the device? For some devices, you still 
>> need to issue synchronize cache commands even if they're power safe. If it 
>> works, great. Hopefully that just means synchronize cache commands are a 
>> no-op.
>> 
>> Robert
>> 
>>> On Mon, Jan 23, 2017 at 12:01 PM, Robert Mustacchi  wrote:
>>> 
 On 1/23/17 6:29 , Youzhong Yang wrote:
> Add something like this to /kernel/drv/sd.conf:
> 
> "TOSHIBA PX02SMF020  ", "cache-nonvolatile:true",
> 
> I don't think the sd.conf comes with smartos image has it, so you 
> need to build your own image.
 
 In general, you should _never_ set this value. You have basically 
 told the system that this device is power safe and never requires a 
 synchronize cache command. This is not true for most devices and a 
 poorly timed panic will result in data loss on the one device whose 
 purpose is to protect its data: the slog.
 
 Note, when I generally talk about an SSD being power safe, that does 
 not mean that this can be set to true. The devices generally only 
 guarantee that data is safe after a synchronize cache command.
 
 I don't have as much experience with these Toshiba drives, so it may 
 be that their datasheet tells you something else in this case.
 
 Robert
 
>>> 
>>> 
>> 
>> 
> 
> 




Re: [smartos-discuss] Can not override physical sector disks

2016-09-20 Thread Richard Elling

> On Sep 20, 2016, at 6:48 AM, Humberto Ramirez  wrote:
> 
> Is there a definitive approach / guide / manual / wiki as to how to properly 
> work / replace 4k - 512 - 512e disks? This has been asked before, here and in 
> some other lists and obviously continues to be a source of problems and 
> confusion…
> 

The reason the mismatch is reported is because it causes pain and tears if you 
override.
Do yourself a favor and don’t replace 512n drives with 512e.
 — richard

> 
> On Sep 19, 2016 11:54 PM, "Joshua M. Clulow"  > wrote:
> On 19 September 2016 at 20:12, 郑圆杰  > wrote:
> > I have created an zpool with ashift=9.
> 
> How did you do this?  Just by using disks with native 512 byte
> sectors, or through some other mechanism?
> 
> > Now  a disk is out of service. And I try to replace with a new disk.
> 
> Is the replacement disk a different model from the original disk?
> 
> > Unfortunately, new disk reports that the physical sector size is 4k. Some 
> > error occurs when trying exec command “zfs replace”/ “zfs attach”.
> 
> Do you know if the new disk is an "Advanced Format" disk (aka "512e")?
>  That is: does the new disk present 4KB physical sectors, but provide
> emulation for legacy 512 byte sectors?
> 
> If the new disks are 4K native, I'm afraid you cannot use them in an
> ashift=9 pool.  If the disks _do_ provide an emulated 512 byte logical
> sector size, you might be hitting this bug:
> 
> https://smartos.org/bugview/OS-4718 
> 
> If these _are_ Advanced Format (512e) disks, you might want to try
> this custom patched platform:
> 
> https://us-east.manta.joyent.com/jmc/public/tmp/platform-20160904T224833Z-OS-4718.tgz
>  
> 
> 
> This custom platform image includes an attempted fix for OS-4718 which
> should help.  Source diff for the platform build is here:
> 
> https://gist.github.com/jclulow/ccb00c396c2f6961672494ef2dbdee66 
> 
> 
> Let me know how it goes!
> 
> Cheers.
> 
> --
> Joshua M. Clulow
> UNIX Admin/Developer
> http://blog.sysmgr.org 
> 
> 





Re: [smartos-discuss] zpool replace with a disk which has different physical blocksize

2016-07-19 Thread Richard Elling

> On Jul 19, 2016, at 1:19 PM, smartos-discuss@lists.smartos.org wrote:
> 
> Hi,
> 
> the new model is a 2 TB Seagate NAS HDD ST2000VN000, and according to the report it 
> has 4K as its physical blocksize:
> dkmp_ext.dki_lbsize   = 512
> dkmp_ext.dki_pbsize   = 4096

Friendly advice: you can force this 512e drive into an existing 512n pool. 
However, you will be
unhappy with the performance of the result. You will be happier with a 512e 
device.

> 
> the other models in the ZRAID are a Seagate ST31500341AS (1,5TB) which 
> reports as 512 and seems just to have 512
> dkmp_ext.dki_lbsize   = 512
> dkmp_ext.dki_pbsize   = 512   
> 
> and the same is reported by the other Western Digital 1TB WD10EADS
> 
> I do not think that there is a way that the old ones will work with 4K, but I 
> hope that the new one will work with 512. And I really hope that someone has 
> a clue how to integrate my new disk drive before another one dies…

I think you’ve got this backwards?  512n devices work fine with 4k block sizes. 
512e drives perform poorly with 512 byte block sizes.
 — richard
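
A quick way to check the pool side before attempting the replace (a sketch, not from the thread; substitute your own pool name):

  # allocation shift of the existing top-level vdevs: 9 = 512-byte, 12 = 4K
  zdb -C <poolname> | grep ashift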

> 
> Mat
> 

--

richard.ell...@richardelling.com
+1-760-896-4422





Re: [smartos-discuss] Re: Any possibilities in python OSError?

2016-05-16 Thread Richard Elling

> On May 16, 2016, at 3:40 AM, Fred Liu  wrote:
> 
> It could be the slow read from the NFS path -- "/tool/iqs/run".
> We see the NFSv4 client not working in the LX brand (I have already opened OS-5265).
> The NFSv3 client works, but it has an obvious lag (about 5-10
> seconds) on the very first attempt to mount an NFS path.
> Once the mount is done, there is no obvious lag in subsequent
> NFS-related operations.
> Further, we find that if the SmartOS host where the LX brands live also serves
> the NFS service, the lag won't appear.
> It could be something in the network path.

This is consistent with a name lookup timeout. Double-check DNS, LDAP, and all 
other name sources.

  -- richard
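
A rough way to test that hypothesis from inside the zone (a sketch; the names are placeholders):

  # if these stall for several seconds, name services are the problem,
  # not the NFS I/O path itself
  time getent hosts nfs-server.example.com
  time getent passwd someuser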
> 
> 
> Thanks.
> 
> 2016-05-16 18:07 GMT+08:00 Fred Liu :
>> Hi,
>> 
>> We are running some python scripts in LX brand (CentOS 6.7 20151005).
>> And we get following errors:
>> 
>> File "/tool/iqs/bin/execd.py", line 416, in polljob
>>runfiles = [int(f) for f in os.listdir(RUNDIR) if f.isdigit()]
>> exceptions.OSError: [Errno 4] Interrupted system call: '/tool/iqs/run'
>> 
>> 
>> 
>> They have been running well under normal Linux OS.
>> 
>> Any possibilities?
>> 
>> 
>> Thanks.
>> 
>> Fred
> 
> 




Re: [smartos-discuss] low network bandwidth with 40 vms

2016-05-04 Thread Richard Elling

> On May 4, 2016, at 8:37 AM, Stefan  wrote:
> 
> Dear List,
> 
> I would like to provide you with an update on the issue.  We now use an
> intel 10GbE card (ixgbe) which reduces the %tim in intrstat but there is
> no increase in throughput.

Next steps are as Robert suggests, the USE method.

ixgbe uses 8 rings by default, so it will spread nicely. By contrast, if 
e1000g#0
was consuming 100% of a CPU, then that might be an issue. At this point, I
think you can eliminate CPU utilization by NIC and network saturation as the
bottleneck.

 — richard
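
A sketch of the usual starting points for that next round of USE-method checks:

  intrstat 5                # interrupt time per driver, per CPU
  mpstat 5                  # per-CPU utilization and saturation
  dladm show-link -s -i 5   # per-link traffic counters, to see where the bytes stop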

> 
> With 79 running (but idle) KVM servers there is only 23 Mbps bandwidth
> according to iperf.  The CPU load is rather low:
> 
> device      |  cpu12 %tim   cpu13 %tim
> ------------+---------------------------
> e1000g#0    |      0  0.0       0  0.0
> e1000g#1    |      0  0.0       0  0.0
> ehci#0      |      0  0.0       0  0.0
> ehci#1      |      0  0.0       0  0.0
> ixgbe#0     |   1009  1.4       0  0.0
> ixgbe#1     |      0  0.0       0  0.0
> 
> Any suggestions?
> 
> Kind Regards,
> Stefan
> 

--

richard.ell...@richardelling.com
+1-760-896-4422





Re: [smartos-discuss] low network bandwidth with 40 vms

2016-05-03 Thread Richard Elling

> On May 3, 2016, at 10:16 AM, Stefan  wrote:
> 
> Dear List,
> 
> on our machines with about 40 vservers we observe low network
> throughput.  It seems to scale inversely with the number of vms on the
> respective node:
> 
>    # vms  bw (Mbps)
>    -----  ---------
>       37         60
>       24         94
>       18        174
>       12        608
>        5        933
>        2        934
>        1        935
> 
> The measurements were obtained using iperf from a different physical
> machine. It has been suggested that the slowdown may be due to
> interrupt 51 being clamped to CPU 12:
> 
>   # echo '::interrupts -d' | mdb -k
>   IRQ  Vect IPL Bus  Trg Type  CPU  Share APIC/INT# Driver Name(s)
>   :
>   49   0x40 5   PCI  Edg MSI   8    1     -         mpt#0
>   50   0x60 6   PCI  Edg MSI   11   1     -         e1000g#0
>   51   0x61 6   PCI  Edg MSI   12   1     -         e1000g#1
>   160  0xa0 0        Edg IPI   all  0     -         poke_cpu
>   :
> 
> If this is the cause of the problem we would like to deliver the
> interrupts to all of the cpus.  How do we achieve this with smartos?

To observe CPU usage by driver by processor, use intrstat.
 -- richard
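
For example, with a five-second sampling interval:

  # per-driver interrupt time on each CPU; a single e1000g row pinned near
  # 100% on one CPU would confirm the clamped-interrupt theory
  intrstat 5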

> 
> Kind Regards,
> Stefan
> 




Re: [smartos-discuss] preallocated zVOL / improving SSD performance

2016-04-16 Thread Richard Elling

> On Apr 14, 2016, at 1:56 AM, Dirk Steinberg  wrote:
> 
>> 
>> On 14.04.2016 at 02:39, Richard Elling <richard.ell...@richardelling.com> wrote:
>> 
>> 
>>> On Apr 13, 2016, at 4:40 PM, Daniel Carosone <daniel.caros...@gmail.com> wrote:
>>> 
>>> Yes, agreed and understood. It is a space reservation that ensures some 
>>> number of blocks will never be allocated. 
>>> 
>>> That's not exactly the same as them never being used, due to CoW updates, 
>>> but it's very close. Once the pool is close to full, any writes that don't 
>>> immediately free the original blocks will get denied. 
>>> 
>>> The net effect is the same: a relatively constant number of free blocks for 
>>> the ssd controller to use in its own wear levelling and performance 
>>> management. Overprovisioned storage with lots of spare blocks above 
>>> whatever the device keeps internally already. 
>>> 
>>> At least, it seems so to me. My question, elaborated thus, is: what is the 
>>> difference you see that makes it insufficient? 
>>> 
>>> Oh, are we not issuing TRIM from zfs as space is freed?  
>>> 
>> no
>>> That would explain it. If so, writing zeros into the reserved space 
>>> (without compression, dedup, or snapshots) occasionally will tell the ssd 
>>> controller the blocks are empty. 
>>> 
>>> I feel this is an effective workaround entirely within zfs, without 
>>> resorting to the ugly tricks of multiple partitioning schemes and 
>>> inflexible external allocations we both dislike.
>>> 
>>> 
>> 
>> pedantic question: why not buy good quality SSDs? 
> 
> Hmm, price? My 2TB 850 EVO cost me 530 EUR. 
> How much would a „high quality“ SSD (say from Intel) cost? Maybe 2000 EUR?

EVO is designed and priced for the PeeCee market. Low DWPD. Space-optimized
garbage collection. Lower overprovisioning level. 

PRO version is designed and optimized for more intensive work. Here in the US, 
the 
difference in price is as large as 50%.
 — richard


> 
> Also, availability in certain form factors (M.2) and capacities (I have never 
> seen
> one of those HQ SSDs in 2 TB listed in a shop).
> 
>> In my studies, good quality SSDs with
>> decent overprovisioning perform more consistently than el-cheapos.
> 
> That is certainly true.
> 
>> FWIW, the preponderance of the evidence suggests that wear out is not as 
>> important as age.
>> COW file systems like ZFS are particularly well behaved.
>> https://www.usenix.org/conference/fast16/technical-sessions/presentation/schroeder
>>  
>> <https://www.usenix.org/conference/fast16/technical-sessions/presentation/schroeder>
>> https://users.ece.cmu.edu/~omutlu/pub/flash-memory-failures-in-the-field-at-facebook_sigmetrics15.pdf
>>  
>> <https://users.ece.cmu.edu/~omutlu/pub/flash-memory-failures-in-the-field-at-facebook_sigmetrics15.pdf>
>> 
>>  -- richard
>> 
>>> On 13 Apr 2016 18:27, "Dirk Steinberg" <d...@steinbergnet.net> wrote:
>>> On 13.04.2016 at 09:53, Daniel Carosone <daniel.caros...@gmail.com> wrote:
>>>> What is wrong with a dataset with refreserv set? 
>>>> 
>>> It does not actually reserve any specific blocks on the disk (LBAs for 
>>> SATA) which would 
>>> allow the SSD controller to deduce that a certain part of the SSD is not 
>>> being used.
>>> 
>>> refreservation is purely a (virtual) space accounting method of ZFS.
>>> 
>> 
> 
--

richard.ell...@richardelling.com
+1-760-896-4422








Re: [smartos-discuss] preallocated zVOL / improving SSD performance

2016-04-13 Thread Richard Elling

> On Apr 13, 2016, at 4:40 PM, Daniel Carosone  
> wrote:
> 
> Yes, agreed and understood. It is a space reservation that ensures some 
> number of blocks will never be allocated.
> 
> That's not exactly the same as them never being used, due to CoW updates, but 
> it's very close. Once the pool is close to full, any writes that don't 
> immediately free the original blocks will get denied.
> 
> The net effect is the same: a relatively constant number of free blocks for 
> the ssd controller to use in its own wear levelling and performance 
> management. Overprovisioned storage with lots of spare blocks above whatever 
> the device keeps internally already.
> 
> At least, it seems so to me. My question, elaborated thus, is: what is the 
> difference you see that makes it insufficient?
> 
> Oh, are we not issuing TRIM from zfs as space is freed? 
> 
no
> That would explain it. If so, writing zeros into the reserved space (without 
> compression, dedup, or snapshots) occasionally will tell the ssd controller 
> the blocks are empty.
> 
> I feel this is an effective workaround entirely within zfs, without resorting 
> to the ugly tricks of multiple partitioning schemes and inflexible external 
> allocations we both dislike.
> 
> 

pedantic question: why not buy good quality SSDs? In my studies, good quality 
SSDs with
decent overprovisioning perform more consistently than el-cheapos.

FWIW, the preponderance of the evidence suggests that wear out is not as 
important as age.
COW file systems like ZFS are particularly well behaved.
https://www.usenix.org/conference/fast16/technical-sessions/presentation/schroeder
 

https://users.ece.cmu.edu/~omutlu/pub/flash-memory-failures-in-the-field-at-facebook_sigmetrics15.pdf
 


 -- richard

> On 13 Apr 2016 18:27, "Dirk Steinberg" wrote:
> On 13.04.2016 at 09:53, Daniel Carosone wrote:
>> What is wrong with a dataset with refreserv set?
>> 
> It does not actually reserve any specific blocks on the disk (LBAs for SATA) 
> which would 
> allow the SSD controller to deduce that a certain part of the SSD is not 
> being used.
> 
> refreservation is purely a (virtual) space accounting method of ZFS.
> 
> 





Re: [smartos-discuss] preallocated zVOL / improving SSD performance

2016-04-12 Thread Richard Elling

> On Apr 12, 2016, at 3:21 PM, Dirk Steinberg  wrote:
> 
> Hi,
> 
> in order to improve long-term performance on consumer-grade SSDs,
> I would like to reserve a certain range of LBA addresses on a freshly
> TRIMmed SSD to never be written to. That can be done by slicing the
> disk and leaving one slice of the disk unused. 

I do not believe this is the case. Overprovisioning is managed by changing the
capacity of the disk. The method depends on the drive. For SCSI drives we use 
sg_format.

> 
> OTOH I really like to use whole-disk vdev pools and putting pools
> on slices is unnecessarily complex and error-prone. 

A "whole-disk" vdev is one where slice 0 is set to an size between
LBA 256 and end - size of slice 9. The convenience is that the sysadmin
doesn't have to run format or fmthard.

> Also, should one later on decide that setting aside, say, 30% of the 
> disk capacity as spare was too much, changing this to a smaller 
> number afterwards in a slicing setup is a pain.
> 
> Therefore my idea is to just reserve a certain amount of SSD blocks 
> in the zpool and never use them. They must be nailed to specific
> block addresses but must never be written to.

This is not the same as adjusting the drive's overprovisioning.
 -- richard

> 
> A sparsely provisioned zvol does not do the trick, and neither does
> filling it thickly with zeros (then the blocks would have been written to,
> ignoring for the moment compression etc.).
> 
> What I need is more or less exactly what Joyent implemented for the
> multi_vdev_crash_dump support: just preallocate a range of blocks
> for a zvol. So I tried:
> 
> zfs create -V 50G -o checksum=noparity zones/SPARE
> 
> Looking at „zpool iostat“ I see that nothing much happens at all.
> Also, I can see that no actual blocks are allocated:
> [root@nuc6 ~]# zfs get referenced zones/SPARE
> NAME PROPERTYVALUE  SOURCE
> zones/SPARE  referenced  9K -
> 
> So the magic apparently only happens when you actually
> activate dumping to that zvol:
> 
> [root@nuc6 ~]# dumpadm  -d /dev/zvol/dsk/zones/SPARE
>  Dump content: kernel pages
>   Dump device: /dev/zvol/dsk/zones/SPARE (dedicated)
> 
> In zpool iostat 1 I can see that about 200mb of data is written:
> 
> zones   22.6G   453G  0  0  0  0
> zones   27.3G   449G  0  7.54K  0  18.9M
> zones   49.5G   427G  0  33.6K  0  89.3M
> zones   71.7G   404G  0  34.6K  0  86.2M
> zones   72.6G   403G  0  1.39K  0  3.75M
> zones   72.6G   403G  0  0  0  0
> 
> That must be the allocation metadata only, since this is much less than
> the 50G, but still a noticeable amount of data. And we can actually see
> that the full 50G have been pre-allocated:
> 
> [root@nuc6 ~]# zfs get referenced zones/SPARE
> NAME PROPERTYVALUE  SOURCE
> zones/SPARE  referenced  50.0G  -
> 
> Now I have exactly what I want: a nailed-down allocation of
> 50G of blocks that never have been written to.
> I’d like to keep that zvol in this state indefinitely.
> Only problem: as soon as I change dumpadm to dump
> to another device (or none), this goes away again.
> 
> [root@nuc6 ~]# dumpadm  -d none
> [root@nuc6 ~]# zfs get referenced zones/SPARE
> NAME PROPERTYVALUE  SOURCE
> zones/SPARE  referenced  9K -
> 
> Back to square one! BTW, the amount of data written for the de-allocation
> is much less:
> 
> zones   72.6G   403G  0  0  0  0
> zones   72.6G   403G  0101  0   149K
> zones   22.6G   453G  0529  0  3.03M
> zones   22.6G   453G  0  0  0  0
> 
> So my question is: can I somehow keep the zVOL in the pre-allocated state,
> even when I do not use it as a dump device?
> 
> While we are at it: If I DID use it as dump device, will a de-allocation
> and re-allocation occur on each reboot, or will the allocation remain intact?
> Can I somehow get a list of blocks allocated for the zvol via zdb?
> 
> Thanks.
> 
> Cheers
> Dirk
> 




Re: [smartos-discuss] (U)EFI boot for SmartOS?

2016-04-12 Thread Richard Elling

> On Apr 12, 2016, at 2:41 PM, Dirk Steinberg  wrote:
> 
>> 
>> On 12.04.2016 at 23:30, Dirk Steinberg <d...@steinbergnet.net> wrote:
>> 
>> 
>>> On 12.04.2016 at 23:09, Richard Elling <richard.ell...@richardelling.com> wrote:
>>> 
>>> 
>>>> On Apr 12, 2016, at 1:39 PM, Dirk Steinberg >>> <mailto:d...@steinbergnet.net>> wrote:
>>>> 
>>>> The root file system actually resides on a ram disk, which cannot be used 
>>>> for booting.
>>> 
>>> The RAM disk contains a UFS file system.
>>> 
>>>> 
>>>> If I do not boot from USB or PXE, I like to put the boot files
>>>> (kernel and boot_archive, plus a few GRUB files) onto my zones
>>>> pool. I agree that one could use UFS, but that requires slicing/
>>>> partitioning the disk making things more complex than necessary.
>>>> Using a whole-disk pool is much easier.
>>> 
>>> This is a significant change in SmartOS architecture and IMHO, a 
>>> significantly inferior
>>> approach.
>>> 
>>> At InterModal Data, we have a different approach. We do install on one or 
>>> more 
>>> "boot disks" and keep a grub menu set for locally storing OS images (UFS in 
>>> RAMdisk
>>> image). However, this is not a general-purpose solution. In our world, the 
>>> "zones" pool
>>> is quite small, typically 32G or less. Thus we can easily accommodate "boot 
>>> disks" that
>>> are 64GB or more, though it is very rare for us to see more than 200GB. 
>>> There are a
>>> number of other constraints that impact us, that are not general purpose.
>>> 
>>> So, can you have a boot image area that cohabitates a single disk? Yes, but 
>>> there is
>>> a fair amount of work involved and every step brings you farther away from 
>>> the easy,
>>> scalable method used by default in SmartOS. You'll be better served by 
>>> burning a USB
>>> stick and taking a long lunch.
>>>  — richard
>> 
>> Richard,
>> 
>> thanks for the explanation. I do understand the advantage of booting from 
>> USB, 
>> just that the box I am currently fiddling with is a new, legacy-free Skylake 
>> box
>> with only xHCI, so effectively once SmartOS has booted, there is NO USB 
>> support
>> whatsoever, no keyboard, no USB stick, nada.
>> 
>> I also understand that I can slice a physical disk and have multiple UFS 
>> file systems
>> and potentially even multiple ZFS pools on that disk. I have done all of 
>> this before.
>> All I am saying is that I find it easier to use whole-disk zpools, and for 
>> some time now,
>> the GRUB that ships with SmartOS does support (legacy-)booting off 
>> whole-disk zpools.
>> That is very easy: just use the zones pool and copy the boot files to it.
>> I create a separate zfs dataset for the boot files (zones/smartos) and use 
>> that as the bootfs.
>> 
>> ## enable GRUB boot from whole-disk vdev zones pool
>> mkdir /zones/boot
>> cp -a <…..>/boot/grub /zones/boot
>> ## first save boot ramdisk image without any pools mounted
>> dd bs=1M if=/dev/ramdisk/a of=/tmp/boot_archive
>> fsck -y /tmp/boot_archive
>> zfs create zones/smartos
>> mkdir -p /zones/smartos/platform/i86pc/amd64
>> mv /tmp/boot_archive /zones/smartos/platform/i86pc/amd64/boot_archive
>> print /platform/i86pc/kernel/amd64/unix | cpio -pduvma /zones/smartos
>> ## install GRUB to MBR on whole-disk zones pool
>> installgrub -m -f /zones/boot/grub/stage1 /zones/boot/grub/stage2 
>> /dev/rdsk/c0t0d0s0
>> 
>> Now you only need to add „bootfs zones/smartos“ to your 
>> /zones/boot/grub/menu.lst entries, like so:
>> 
>> sed -i '' -e '/kernel/{x;s:.*:   bootfs zones/smartos:;p;x;}' 
>> /zones/boot/grub/menu.lst
>> 
>> That’s all. You can boot off you whole-disk zones pool now. 
>> Disclaimer: YMMV. Be very careful. The above commands are potentially very 
>> dangerous and could result in data loss. Check your device names.
>> 
>> / Dirk
> 
> I have to add something: better copy the boot_archive and the grub files from 
> the USB stick
> to the zones pool. The hack of getting the boot_archive from /dev/ramdisk/a 
> is from my
> notes of doing a remote, headless install without even having access to a USB 
> stick
> (I use that for installing SmartOS on kimsufi). The conditions there are 
> extreme,
> so one needs to resort to extreme hacks. If you have a USB stick at hand, it's 
> much easier.

There are many paths to the top of the mountain :-)

Using this approach requires grub to understand ZFS and bootfs. This is a 
bit more 
constraining than an approach that uses the bootimg and a more modern version of
USB installation than is present in the prebuilt USB images. After all, once 
you load
the image into RAM and boot from it, you're up.
 -- richard







Re: [smartos-discuss] (U)EFI boot for SmartOS?

2016-04-12 Thread Richard Elling

> On Apr 12, 2016, at 1:39 PM, Dirk Steinberg  wrote:
> 
> The root file system actually resides on a ram disk, which cannot be used for 
> booting.

The RAM disk contains a UFS file system.

> 
> If I do not boot from USB or PXE, I like to put the boot files
> (kernel and boot_archive, plus a few GRUB files) onto my zones
> pool. I agree that one could use UFS, but that requires slicing/
> partitioning the disk making things more complex than necessary.
> Using a whole-disk pool is much easier.

This is a significant change in SmartOS architecture and IMHO, a significantly 
inferior
approach.

At InterModal Data, we have a different approach. We do install on one or more 
"boot disks" and keep a grub menu set for locally storing OS images (UFS in 
RAMdisk
image). However, this is not a general-purpose solution. In our world, the 
"zones" pool
is quite small, typically 32G or less. Thus we can easily accomodate "boot 
disks" that
are 64GB or more, though it is very rare for us to see more than 200GB. There 
are a
number of other constraints that impact us, that are not general purpose.

So, can you have a boot image area that cohabitates a single disk? Yes, but 
there is
a fair amount of work involved and every step brings you farther away from the 
easy,
scalable method used by default in SmartOS. You'll be better served by burning 
a USB
stick and taking a long lunch.
 -- richard






Re: [smartos-discuss] (U)EFI boot for SmartOS?

2016-04-12 Thread Richard Elling

> On Apr 12, 2016, at 11:05 AM, Robert Mustacchi  wrote:
> 
> On 4/12/16 9:13 , Dirk Steinberg wrote:
>> Hi,
>> 
>> as far as I can see, the currently included version of GRUB
>> in SmartOS only supports legacy-booting.
> 
> That is correct.
> 
>> Does an EFI-enabled version of GRUB for SmartOS/Illumos exist
>> or is being worked on?
> 
> Supporting EFI booting is something that's being worked on in the
> broader illumos community.
> 
>> Maybe one could just rip a modern version of GRUB out of
>> a Linux distro and use that, as long as it includes ZFS support…
>> OTOH I seem to remember having read that GRUB2 cannot
>> boot Illumos, but I am not sure about that?

SmartOS root filesystem is UFS, so ZFS boot support is not required of grub.
 -- richard

> 
> I do not expect that to work, because we would still use BIOS based
> queries which would not be answered.
> 
> Robert
> 




Re: [smartos-discuss] Can not install Build20160317T000621Z HPE Proliant DL380 Gen9, 2 CPU (E5-2650 v3)

2016-03-24 Thread Richard Elling

> On Mar 23, 2016, at 8:29 AM, Robert Mustacchi  wrote:
> 
>> On 3/23/16 8:22 , Richard Elling wrote:
>> 
>>> On Mar 23, 2016, at 6:59 AM, Robert Mustacchi  wrote:
>>> 
>>> On 3/23/16 2:00 , Benny Kjellgren wrote:
>>>> Thank you
>>>> 
>>>> "-B pci-reprog=off" solved my problem with SmartOS not booting up.
>>>> 
>>>> And have changed from raidz1 with pair of mirrored disks LV
>>>> to raidz2 with single disk raid 0 LV + one spare
>>> 
>>> I'm glad that you've got this workaround working; however, if either you
>>> or Richard could still boot up the system with -kd without doing the -B
>>> pci-reprog=off and relay where we're panicking, that'd be quite helpful.
>> 
>> Unfortunately, there is no panic. If there were a panic, we'd be well on our 
>> way :-)
> 
> So you're seeing a hard reset of the system then?
> 
> Can you break through bge_attach() and step over parts of it to see
> where we're going awry?

the team made some RCA progress here:
we trigger the reset during mac_start(), when the driver is writing to offset 
0xc (BGE_APE_EVENT) in register-set 2 (APE).  This all happens in 
bge_ape_send_event(). On sut112 the toxic address is 0xff42601fe00c.  
Before the toxic write there are several other read and writes to other things 
in APE, but those accesses all seem to be at higher offsets (0x4000-0x8000).



  -- richard
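
For anyone retracing this, a sketch of the kind of kmdb session Robert asked for (boot with -kd; the commands below are generic kmdb usage, not a transcript):

  ::bp bge`bge_attach   # deferred breakpoint; fires when the bge driver attaches
  :c                    # continue boot until it hits
  ::step over           # then single-step to find the access that resets the box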

> 
>>> 
>>> I'd like to make sure we get to root cause on this so these workarounds
>>> aren't necessary. Unfortunately right now we don't have enough
>>> information to make progress on that.
>> 
>> Agree, and there is precious little public doc on how these things are built.
>> Strange PCI configurations are known to cause problems in the past, leading
>> to unpredictable behaviour. We suspect similar here. We'll map out the PCI
>> fabric as a next step, however that will be model specific.
> 
> Okay, please keep us posted as soon as you have any additional
> information. I'd like to make sure we don't lose track of these and get
> them root caused (along with the x2apic issues).
> 
> Robert




Re: [smartos-discuss] Can not install Build20160317T000621Z HPE Proliant DL380 Gen9, 2 CPU (E5-2650 v3)

2016-03-23 Thread Richard Elling

> On Mar 23, 2016, at 8:29 AM, Robert Mustacchi  wrote:
> 
> On 3/23/16 8:22 , Richard Elling wrote:
>> 
>>> On Mar 23, 2016, at 6:59 AM, Robert Mustacchi  wrote:
>>> 
>>> On 3/23/16 2:00 , Benny Kjellgren wrote:
>>>> Thank you
>>>> 
>>>> "-B pci-reprog=off" solved my problem with SmartOS not booting up.
>>>> 
>>>> And have changed from raidz1 with pair of mirrored disks LV
>>>> to raidz2 with single disk raid 0 LV + one spare
>>> 
>>> I'm glad that you've got this workaround working; however, if either you
>>> or Richard could still boot up the system with -kd without doing the -B
>>> pci-reprog=off and relay where we're panicking, that'd be quite helpful.
>> 
>> Unfortunately, there is no panic. If there were a panic, we'd be well on our 
>> way :-)
> 
> So you're seeing a hard reset of the system then?
> 
> Can you break through bge_attach() and step over parts of it to see
> where we're going awry?

That was our next plan, but we'll use the workaround for now.

> 
>>> 
>>> I'd like to make sure we get to root cause on this so these workarounds
>>> aren't necessary. Unfortunately right now we don't have enough
>>> information to make progress on that.
>> 
>> Agree, and there is precious little public doc on how these things are built.
>> Strange PCI configurations are known to cause problems in the past, leading
>> to unpredictable behaviour. We suspect similar here. We'll map out the PCI
>> fabric as a next step, however that will be model specific.
> 
> Okay, please keep us posted as soon as you have any additional
> information. I'd like to make sure we don't lose track of these and get
> them root caused (along with the x2apic issues).

Agree
 -- richard

> 
> Robert






Re: [smartos-discuss] Can not install Build20160317T000621Z HPE Proliant DL380 Gen9, 2 CPU (E5-2650 v3)

2016-03-23 Thread Richard Elling

> On Mar 23, 2016, at 6:59 AM, Robert Mustacchi  wrote:
> 
> On 3/23/16 2:00 , Benny Kjellgren wrote:
>> Thank you
>> 
>> "-B pci-reprog=off" solved my problem with SmartOS not booting up.
>> 
>> And have changed from raidz1 with pair of mirrored disks LV
>> to raidz2 with single disk raid 0 LV + one spare
> 
> I'm glad that you've got this workaround working; however, if either you
> or Richard could still boot up the system with -kd without doing the -B
> pci-reprog=off and relay where we're panicking, that'd be quite helpful.

Unfortunately, there is no panic. If there were a panic, we'd be well on our 
way :-)

> 
> I'd like to make sure we get to root cause on this so these workarounds
> aren't necessary. Unfortunately right now we don't have enough
> information to make progress on that.

Agree, and there is precious little public doc on how these things are built.
Strange PCI configurations are known to cause problems in the past, leading
to unpredictable behaviour. We suspect similar here. We'll map out the PCI
fabric as a next step, however that will be model specific.
 -- richard





Re: [smartos-discuss] Can not install Build20160317T000621Z HPE Proliant DL380 Gen9, 2 CPU (E5-2650 v3)

2016-03-22 Thread Richard Elling

> On Mar 22, 2016, at 8:10 PM, 许若辰  wrote:
> 
> I found some interesting material, benny
> http://h20564.www2.hpe.com/hpsc/doc/public/display?docId=emr_na-c04633840&DocLang=en&docLocale=en_US
>  
> 
> I don’t have a gen9 now and I’m not able to try it.
> Please have a look and tell me your progress.

Thanks! That appears to make bge2 more friendly. We'll dig into it a little 
deeper
 -- richard

> 
> ---
> Briphant, Ruochen Xu
> xuruoc...@briphant.com 
> 
> 
> 
> 
>> On Mar 23, 2016, at 10:55 AM, 许若辰 wrote:
>> 
>> Hey benny, I have run into this problem before, also on an HP ProLiant Gen9.
>> Unfortunately I failed to solve it, but I'd like to share some experience 
>> with you.
>> 1. For disks, in Smart Array you should use RAID 0 for every single disk so 
>> SmartOS can find them. (Usually we would use HBA mode.)
>> 2. For the single-CPU issue, it has already been solved.
>> 3. For the KCS error, it seems to be harmless and you can ignore it.
>> 4. For installing, yeah, there is a problem with nic bge2 (maybe others too), 
>> and the command “ifconfig bge2 plumb” causes the machine to reboot. You 
>> can try this in the noinstall mode of SmartOS (any release).
>> If you want to install successfully, just hack the shell script to avoid 
>> plumbing the offending nic.
>> But you will not be able to start the OS after installing, maybe because this 
>> command is also executed while the OS starts.
>> Maybe we should hack some kernel code, but I gave up.
>> 
>> I hope these tips can help you. And I hope you are able to find the root 
>> cause of the problem and share with us.
>> I’m busy these days, sorry for replying too late.
>> 
>> Thanks.
>> 
>> 
>> 
>> ---
>> Briphant, Ruochen Xu
>> xuruoc...@briphant.com 
>> 
>> 
>> 
>> 
>>> On Mar 23, 2016, at 1:05 AM, Benny Kjellgren wrote:
>>> 
>>> 
>>> we find that bge2 is toxic. more later today...
>>> 
>>>   -- richard
>>> 
>>> I managed to get the installation/setup to work by modifying the script:
>>> 
>>> for iface in $(dladm show-phys -pmo link | grep bge0 ); do
>>> 
>>> Now it would be nice if we could specify boot nic or nic to ignore.
>>> --
>>> Benny
>> 
> 





Re: [smartos-discuss] Can not install Build20160317T000621Z HPE Proliant DL380 Gen9, 2 CPU (E5-2650 v3)

2016-03-22 Thread Richard Elling
we find that bge2 is toxic. more later today...

  -- richard



> On Mar 22, 2016, at 6:36 AM, Benny Kjellgren  
> wrote:
> 
> Update :
> I trace the reboot to /smartdc/lib/smartos_prompt_config.sh
> for iface in $(dladm show-phys -pmo link); do
> ifconfig $iface plumb 2>/dev/null
> done
> --
> Benny





Re: [smartos-discuss] Can not install Build20160317T000621Z HPE Proliant DL380 Gen9, 2 CPU (E5-2650 v3)

2016-03-21 Thread Richard Elling

> On Mar 21, 2016, at 10:49 AM, Benny Kjellgren  
> wrote:
> 
> Hi Richard
> 
> I can verify that disabling the x2APIC made the NOTICE lines to disappear.
> Also notice that if I set the SmartArray to HBA mode, then the "format" command 
> displays no disks.

Yes. It seems several teams are working on this now. Which Smart Array controller do you 
have?
 -- richard






Re: [smartos-discuss] Can not install Build20160317T000621Z HPE Proliant DL380 Gen9, 2 CPU (E5-2650 v3)

2016-03-21 Thread Richard Elling

> On Mar 21, 2016, at 7:50 AM, Benny Kjellgren  
> wrote:
> 
> Hi,
> 
> I get this on the console :
> 
>   NOTICE: System detected 256 cpus, but only 1 cpu(s) were enabled during 
> boot.
>   NOTICE: Use "boot-ncpus" parameter to enable more CPU(s). See eeprom(1M).
>   WARNING: KCS error: ff
> 
> and then the server reboot again.
> 
> With "noimport=true" I get the login prompt (no reboot)
> I am new to SmartOS. Can you give some hints how to troubleshoot this issue ?

We disabled the x2APIC in the BIOS and that solved the CPU visibility issue.
There are other items still to be verified, but it appears that more than one group
is looking at these, concurrently. How can we improve our sharing of knowledge
and experience?
 -- richard






Re: [smartos-discuss] zfs_arc_max setting for SmartOS

2016-03-19 Thread Richard Elling

> On Mar 16, 2016, at 8:21 AM, Jon Dison  wrote:
> 
> I’m reasonably certain that the VMs are using mostly swap, at least that’s 
> what top says.

Reserving swap is not the same thing as using swap. It is quite common to see 
swap
reservations, especially for KVM zones. How are you measuring swap *usage*?
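
A few standard tools that show actual usage rather than reservations, run from the
global zone (intervals are arbitrary):
# swap -s                  (allocated vs reserved vs available)
# vmstat 5                 (watch sr and pi/po for real paging activity)
# prstat -Z -s rss 5       (per-zone RSS versus what the guests report)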
 -- richard

> In one case, the VM is allocated 2GB and its using 1.2 GB of swap and about 
> 50MB of RAM.
> This box has 64GB of RAM, and a 2TB mirror (of two identical 2TB disks) for 
> zone pool.
> Currently the ZFS ARC is using about 40GB of RAM and the other VMs are using 
> about 12-14 GB of RAM combined.
> It was only when I spun up this new VM that it booted up using seemingly only 
> swap space, the other previously running VMs seem happy on RAM/swap.
> 
>> On Mar 16, 2016, at 11:16 AM, the outsider wrote:
>> 
>> ZFS has a nifty feature that it ALWAYS consumes almost ALL remaining free 
>> memory. 
>> But… it will reduce memory consumption if anything else needs it. 
>>  
>> Are you sure that your VM’s run on SWAP memory? 
>>  
>> How much RAM do you have?
>> how big is your zonepool? 
>> How many HD’s do you use in your zonepool? 
>>  
>>  
>> From: Jon Dison [mailto:jon.di...@gmail.com] 
>> Sent: Wednesday, March 16, 2016 15:26
>> To: SmartOs Discuss
>> Subject: [smartos-discuss] zfs_arc_max setting for SmartOS
>>  
>> There is a surprising lack of information on how to set this parameter as 
>> far as what Google returns.
>>  
>> Can someone tell me the proper way to limit the amount of memory available 
>> to the ZFS ARC?
>> When I spin up new VMs now, they pretty much only use swap, as all available 
>> memory is already tied up in the ARC.
>>  
>> Thanks.
> 





Re: [smartos-discuss] zfs_arc_max setting for SmartOS

2016-03-19 Thread Richard Elling

> On Mar 17, 2016, at 12:15 AM, Matthias Götzke  wrote:
> 
> Actually, even though that might be true (that ZFS will release the RAM), in 
> our experience the zfs cache is often freed very slowly and sometimes can 
> even cause the entire machine to become unresponsive for some time.

ZFS experiences in this vary widely across implementations and distros.
Are you claiming that current SmartOS is so affected, or is your experience 
with other
distros?
 -- richard


> We would very much appreciate a limit on ZFS RAM usage to make this more 
> predictable. The issue is when a KVM tries to reserve RAM on boot and that RAM 
> is in use. ZFS will not actually decrease its usage, as it seems that RAM is not marked 
> as ‘available’ from the perspective of a potential consumer such as KVM. 
> 
> It was often faster for us to reboot the machine to free the RAM.
>  
> Does anybody else have those issues too ?
>  
> Cheers,
> Matthias
> From: the outsider 
> Sent: Wednesday, March 16, 2016 16:17
> To: smartos-discuss@lists.smartos.org 
> 
> Subject: RE: [smartos-discuss] zfs_arc_max setting for SmartOS
>  
> ZFS has a nifty feature that it ALWAYS consumes almost ALL remaining free 
> memory. 
> But… it will reduce memory consumption if anything else needs it. 
>  
> Are you sure that your VM’s run on SWAP memory? 
>  
> How much RAM do you have?
> how big is your zonepool? 
> How many HD’s do you use in your zonepool? 
>  
>  
> From: Jon Dison [mailto:jon.di...@gmail.com] 
> Sent: Wednesday, March 16, 2016 15:26
> To: SmartOs Discuss
> Subject: [smartos-discuss] zfs_arc_max setting for SmartOS
>  
> There is a surprising lack of information on how to set this parameter as far 
> as what Google returns.
>  
> Can someone tell me the proper way to limit the amount of memory available to 
> the ZFS ARC?
> When I spin up new VMs now, they pretty much only use swap, as all available 
> memory is already tied up in the ARC.
>  
> Thanks.
>  
>  





Re: [zfs] [developer] Re: [smartos-discuss] an interesting survey -- the zpool with most disks you have ever built

2016-03-06 Thread Richard Elling

> On Mar 6, 2016, at 9:06 PM, Fred Liu  wrote:
> 
> 
> 
> 2016-03-06 22:49 GMT+08:00 Richard Elling <richard.ell...@richardelling.com>:
> 
>> On Mar 3, 2016, at 8:35 PM, Fred Liu <fred_...@issi.com> wrote:
>> 
>> Hi,
>> 
>> Today when I was reading Jeff's new nuclear weapon -- DSSD D5's CUBIC RAID 
>> introduction,
>> the interesting survey -- the zpool with most disks you have ever built 
>> popped in my brain.
> 
> We test to 2,000 drives. Beyond 2,000 there are some scalability issues that 
> impact failover times.
> We’ve identified these and know what to fix, but need a real customer at this 
> scale to bump it to
> the top of the priority queue.
> 
> [Fred]: Wow! 2000 drives almost need 4~5 whole racks! 
>> 
>> Since ZFS doesn't support nested vdevs, the maximum fault tolerance should be 
>> three (from raidz3).
> 
> Pedantically, it is N, because you can have N-way mirroring.
>  
> [Fred]: Yeah. That is just pedantic. N-way mirroring of every disk works in 
> theory and rarely happens in reality.
> 
>> That is a limitation if you want to build a very huge pool.
> 
> Scaling redundancy by increasing parity improves data loss protection by 
> about 3 orders of 
> magnitude. Adding capacity by striping reduces data loss protection by 1/N. 
> This is why there is
> not much need to go beyond raidz3. However, if you do want to go there, 
> adding raidz4+ is 
> relatively easy.
> 
> [Fred]: I assume you used striped raidz3 vdevs in your storage mesh of 2000 
> drives. If that is true, the probability of 4 failures out of 2000 will not be so low.
> Plus, resilvering takes longer if a single disk has bigger 
> capacity. And further, the cost of over-provisioning spare disks vs raidz4+ 
> will be a worthwhile 
> trade-off when the storage mesh is at the scale of 2000 drives.

Please don't assume, you'll just hurt yourself :-)
For example, do not assume the only option is striping across raidz3 vdevs. 
Clearly, there are many
different options.
 -- richard







Re: [smartos-discuss] an interesting survey -- the zpool with most disks you have ever built

2016-03-06 Thread Richard Elling

> On Mar 3, 2016, at 8:35 PM, Fred Liu  wrote:
> 
> Hi,
> 
> Today when I was reading Jeff's new nuclear weapon -- DSSD D5's CUBIC RAID 
> introduction,
> the interesting survey -- the zpool with most disks you have ever built 
> popped in my brain.

We test to 2,000 drives. Beyond 2,000 there are some scalability issues that 
impact failover times.
We’ve identified these and know what to fix, but need a real customer at this 
scale to bump it to
the top of the priority queue.

> 
> Since ZFS doesn't support nested vdevs, the maximum fault tolerance should be 
> three (from raidz3).

Pedantically, it is N, because you can have N-way mirroring.

> That is a limitation if you want to build a very huge pool.

Scaling redundancy by increasing parity improves data loss protection by about 
3 orders of 
magnitude. Adding capacity by striping reduces data loss protection by 1/N. 
This is why there is
not much need to go beyond raidz3. However, if you do want to go there, adding 
raidz4+ is 
relatively easy.
 — richard


--

richard.ell...@richardelling.com
+1-760-896-4422








Re: [smartos-discuss] Update gone bad, the Aftermath

2016-02-01 Thread Richard Elling

> On Jan 31, 2016, at 2:59 PM, Ian Collins  wrote:
> 
> Richard Elling wrote:
>> 
>>> On Jan 28, 2016, at 1:09 AM, Ian Collins <i...@ianshome.com> wrote:
>>> 
>>> It isn't usually a good idea to put both log and cache on the same SSD. 
>>> Unless you are using your SmartOS box as file server, ditch the cache and 
>>> just use a log device, the whole SSD, not a partition. It also isn't a good 
>>> idea to use an SSD without power fail protection for a log.
>> 
>> FWIW, this is becoming less true as SSDs improve. In the bad old days, some 
>> SSDs were much better
>> at write-intensive workloads (orders of magnitude). Today, there are many 
>> SSDs that do well for both
>> read and write-intensive workloads and the difference between 
>> write-intensities is the amount of OOB
>> overprovisioning. Since you can usually influence overprovisioning yourself, 
>> this reduces the need for
>> separate devices.
> 
> I assume that you would still recommend against using SSDs without power fail 
> protection for log devices?

My requirements in preferred order:
1. honor synchronize cache commands
2. power loss protection

 — richard


--

richard.ell...@richardelling.com
+1-760-896-4422








Re: [smartos-discuss] Update gone bad, the Aftermath

2016-01-31 Thread Richard Elling

> On Jan 28, 2016, at 1:09 AM, Ian Collins  wrote:
> 
> the outsider wrote:
>> 
>> Dear all,
>> 
>> Many thanks to all that helped yesterday and tonight for restoring my server.
>> 
>> Everything works fine now, but I have some unanswered questions. Maybe 
>> someone can shed some light on this.
>> 
>> Since I ran into trouble after updating my USB drive with the latest SmartOS 
>> version some things are unclear at this moment.
>> 
>> Updating SmartOS
>> 
>> 1.What is the recommended update/upgrade strategy for SmartOS?
>> 
>> 2.Where is the right wiki for updating the SmartOS USB or SmartOS OS?
>> 
>> 3.I followed and used this https://github.com/calmh/smartos-platform-upgrade 
>> and it seemed to work ok. But is ok to use it?
>> 
> 
> This has already been correctly answered. I've been using the process from 
> that wiki page for all my upgrades and I've never had any problems.
> 
>> ZPool:
>> 
>> 1.I created a log and cache on a SSD, but used partitions c2d0p2 and c2d0p4 
>> instead of slices. This seems to have caused the biggest error for my Zpool

This is not surprising. Historically, Solaris only “permitted” one Solaris2 
fdisk partition per drive. That partition was
further partitioned using a “slice-like” partition. There is little to no 
testing on using more than one fdisk 
partition.
partition, so you must have done this by 
hand. 

I’m not sure this behaviour is well documented. It is far simpler to document 
that which works, rather than the 
gazillion things that might not work or haven’t been tested.

>> 
>> 2.The SSD was still alive but Zpool could not find the partitions. The SSD 
>> is not attached to the RAID controller but (due to some physical mounting 
>> problems) connected to the onboard SATA controller. Can this have caused the 
>> error?
>> 
> 
> It is not uncommon to use internal SATA controllers for log and cache devices 
> (Sun's ZFS appliances use them for cache devices).

I’d say it is quite common, especially in SmartOS installations.

> 
> It isn't usually a good idea to put both log and cache on the same SSD. 
> Unless you are using your SmartOS box as file server, ditch the cache and 
> just use a log device, the whole SSD, not a partition. It also isn't a good 
> idea to use an SSD without power fail protection for a log.

FWIW, this is becoming less true as SSDs improve. In the bad old days, some 
SSDs were much better
at write-intensive workloads (orders of magnitude). Today, there are many SSDs 
that do well for both
read and write-intensive workloads and the difference between write-intensities 
is the amount of OOB
overprovisioning. Since you can usually influence overprovisioning yourself, 
this reduces the need for
separate devices.
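
One way to do that yourself, sketched under the assumption of a fresh (or
secure-erased) SSD at c1t1d0: label it with format(1m) so the slice you hand to
ZFS covers only part of the device, say ~80%, and leave the remainder
unconfigured:
# zpool add zones log c1t1d0s0     (s0 sized to ~80% of the SSD)
The device name and percentage are only placeholders; the point is that space the
drive has never been asked to store becomes extra working room for the controller.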
 — richard

> 
>> SmartOS
>> 
>> 1.The new SmartOS release I burned to USB was version 20160121T174331Z
>> 
>> 2.During the “no install” procedure it asked for username and pass. The 
>> root/root was not correct. Thanks to someone on IRC I was guided to 
>> http://us-east.manta.joyent.com/Joyent_Dev/public/SmartOS/smartos.html#20160121T174331Z
>>  
>> 
>> 
>> 3.The password for root mentioned on this page was not working. The OS kept 
>> complaining about something and it looked to me that the OS was trying to 
>> verify the password to a central system. But since the network drivers were 
>> not loaded this failed of course. Is this normal or did I do something wrong?
>> 
> 
> You probably did something wrong!
> 
>> 4.My base version was 20151210T194528Z, and I could log in with root/root 
>> during the “noinstall”. I didn’t have to use the special password from the 
>> webpage?
>> 
> 
> No, root/root should be correct.
> 
> --
> Ian.
> 
--

richard.ell...@richardelling.com
+1-760-896-4422





Re: [developer] [smartos-discuss] Is zfs dead man timer tunable?

2016-01-26 Thread Richard Elling

> On Jan 25, 2016, at 11:22 PM, Fred Liu  wrote:
> 
> 
> Yes. I cleaned the whole folder
> 
> Fred 

savecore serves two functions:
1. copy the dump from the dump device to a filesystem: vmdump.# (done at boot by
   the dumpadm SMF service)
2. extract the vmcore.# and unix.# from a vmdump.#

Unless you've dumped again, the old dump should still be on the dump device.
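
A sketch of both steps, assuming the crash directory is /var/crash and the dump is
number 0 (check dumpadm and ls for the real values on your box):
# savecore /var/crash                            (1: pull it off the dump device)
# savecore -f /var/crash/vmdump.0 /var/crash     (2: expand into unix.0 and vmcore.0)
# cd /var/crash && mdb unix.0 vmcore.0           (then ::status, ::stack, etc.)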
 -- richard







[smartos-discuss] Re: [developer] Is it possible to roll back a ZPOOL(which cannot be imported) to its last known good state?

2016-01-25 Thread Richard Elling

> On Jan 25, 2016, at 12:50 PM, Youzhong Yang  wrote:
> 
> Hi all,
> 
> Just wondering if anyone has done similar recovery using txg stuff.

Yes. I've seen it successfully done only once in my entire life -- a special 
case where
one node had no significant writes during the dual import period.

> 
> We have a zpool attached to two hosts physically, ideally at any time only 
> one host imports this zpool. Due to some operational mistake this zpool was 
> corrupted when the two hosts tried to have access to it. Here is the crash 
> stack:
> 
> Jan 25 10:07:17 batfs0346 genunix: [ID 403854 kern.notice] assertion failed: 
> 0 == dmu_bonus_hold(spa->spa_meta_objset, obj, FTAG, &db), file: 
> ../../common/fs/zfs/spa.c, line: 1549
> Jan 25 10:07:17 batfs0346 unix: [ID 10 kern.notice]
> Jan 25 10:07:17 batfs0346 genunix: [ID 802836 kern.notice] ff017495c920 
> fba6b1f8 ()
> Jan 25 10:07:17 batfs0346 genunix: [ID 655072 kern.notice] ff017495c9a0 
> zfs:load_nvlist+e8 ()
> Jan 25 10:07:17 batfs0346 genunix: [ID 655072 kern.notice] ff017495ca90 
> zfs:spa_load_impl+10bb ()
> Jan 25 10:07:17 batfs0346 genunix: [ID 655072 kern.notice] ff017495cb30 
> zfs:spa_load+14e ()
> Jan 25 10:07:17 batfs0346 genunix: [ID 655072 kern.notice] ff017495cb80 
> zfs:spa_tryimport+aa ()
> Jan 25 10:07:17 batfs0346 genunix: [ID 655072 kern.notice] ff017495cbd0 
> zfs:zfs_ioc_pool_tryimport+51 ()
> Jan 25 10:07:17 batfs0346 genunix: [ID 655072 kern.notice] ff017495cc80 
> zfs:zfsdev_ioctl+4a7 ()
> Jan 25 10:07:17 batfs0346 genunix: [ID 655072 kern.notice] ff017495ccc0 
> genunix:cdev_ioctl+39 ()
> Jan 25 10:07:17 batfs0346 genunix: [ID 655072 kern.notice] ff017495cd10 
> specfs:spec_ioctl+60 ()
> Jan 25 10:07:17 batfs0346 genunix: [ID 655072 kern.notice] ff017495cda0 
> genunix:fop_ioctl+55 ()
> Jan 25 10:07:17 batfs0346 genunix: [ID 655072 kern.notice] ff017495cec0 
> genunix:ioctl+9b ()
> Jan 25 10:07:17 batfs0346 genunix: [ID 655072 kern.notice] ff017495cf10 
> unix:brand_sys_sysenter+1c9 ()
> 
> Is it possible to roll back the zpool to its last known good txg? We know 
> when the zpool should be in good state.
> 
> Any suggestion would be very much appreciated. We can build a kernel if 
> needed.

Some tips:
+ make snapshots or dd-like copies of the raw drives, if feasible
+ prevent future damage or unexpected repairs by importing readonly
+ zdb does analysis of pools without changing the on-disk data
+ zdb -F attempts a conservative rewind to previous uberblocks
+ zdb -X attempts an extreme rewind to previous uberblocks automatically
+ zdb -lu shows uberblocks and their txgs
+ zdb -t allows you to check on-disk data structures for specific txgs (from 
zdb -lu)
+ once you find a txg that seems to work with zdb, you can try readonly zpool 
import
using -F, -X, or -T (don't forget readonly)
+ if readonly import works, then you can try to recover the data and later try 
readwrite import
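
A minimal sketch of that read-only rewind import, assuming the pool is named tank
and you want it mounted under /mnt while you copy data off (both are placeholders):
# zpool import -o readonly=on -F -R /mnt tank
# zpool status tank
-X (used together with -F) is the extreme-rewind variant; as noted above, try it
only after the zdb inspection and, if at all possible, only against copies of the
drives.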

Good luck
 -- richard





Re: [smartos-discuss] Is zfs dead man timer tunable?

2016-01-21 Thread Richard Elling
answer far below...

> On Jan 21, 2016, at 8:44 PM, Fred Liu  wrote:
> 
> 
> 
>> -Original Message-
>> From: Richard Elling [mailto:richard.ell...@richardelling.com]
>> Sent: 星期五, 一月 22, 2016 12:02
>> To: smartos-discuss@lists.smartos.org
>> Subject: Re: [smartos-discuss] Is zfs deadm man timer tunable?
>> 
>> 
>>> On Jan 21, 2016, at 4:25 AM, Fred Liu  wrote:
>> 
>> zfs deadman timer is tunable. But if you hit it, you've got problems
>> that tuning the deadman won't help.
>> 
>> The tunable is zfs_deadman_synctime_ms, which is milliseconds.
>> 
>> For example, on a test machine here:
>>  [root@elvis ~]# echo zfs_deadman_synctime_ms/D | mdb -k
>>  zfs_deadman_synctime_ms:
>>  zfs_deadman_synctime_ms:100
>> 
>> FYI, you can check on the state of I/Os in the ZIO pipeline and how
>> long they've been there using the zio_state dcmd. Elvis is not
>> currently busy or broken, but here is an example:
>>  [root@elvis ~]# echo ::zio_state | mdb -k
>>  ADDRESS TYPE  STAGEWAITER
>> TIME_ELAPSED
>>  ff01ada853e0NULL  OPEN --
>>  ff01ada85b10NULL  OPEN --
>> 
>> If you see a large TIME_ELAPSED, you can track down the zio in question
>> for more debugging.
>> -- richard
>> 
> 
> Richard,
> 
> Many thanks! You have always been helpful since I first touched ZFS many 
> years ago.
> I am trying Intel P3600 NVMe SSDs on ZFS. I get several random server reboots 
> every day, and I just captured the 
> panic message "I/O to pool 'zones' appears to be hung" from the console. I suspect 
> it is related to the NVMe driver or
> SSD firmware.
> 
> - SmartOS Live Image v0.147+ build: 20151001T070028Z
> [root@pluto ~]# echo zfs_deadman_synctime_ms/D | mdb -k
> zfs_deadman_synctime_ms:
> zfs_deadman_synctime_ms:100 
> [root@pluto ~]# echo ::zio_state | mdb -k
> ADDRESS TYPE  STAGEWAITER   TIME_ELAPSED
> f08576c15028NULL  OPEN --
> f08576c153a8NULL  OPEN --
> f08576c15728NULL  OPEN --
> f08576c15aa8NULL  OPEN --
> f08576c15e28NULL  OPEN --
> f08576c161a8NULL  OPEN --
> f08576c16528NULL  OPEN --
> f08576c168a8NULL  OPEN --
> f08576c16c28NULL  OPEN --
> f08576c25048NULL  OPEN --
> f08576c253c8NULL  OPEN --
> f08576c25748NULL  OPEN --
> f08576c25ac8NULL  OPEN --
> f08576c25e48NULL  OPEN --
> f08576c261c8NULL  OPEN --
> f08576c26548NULL  OPEN --
> f08576c268c8NULL  OPEN --
> f08576c26c48NULL  OPEN --
> f08576e41028NULL  OPEN --
> f08576e413a8NULL  OPEN --
> f08576e41728NULL  OPEN --
> f08576e428a8NULL  OPEN --
> f08576e45050NULL  OPEN --
> f08576e453d0NULL  OPEN --
> f08576e45750NULL  OPEN --
> f08576e45ad0NULL  OPEN --
> f08576e45e50NULL  OPEN --
> f08576e461d0NULL  OPEN --
> f08576e46550NULL  OPEN --
> f08576e468d0NULL  OPEN --
> f08576e46c50NULL  OPEN --
> f08576e47038NULL  OPEN --
> f08576e473b8NULL  OPEN

Re: [smartos-discuss] Is zfs deadm man timer tunable?

2016-01-21 Thread Richard Elling

> On Jan 21, 2016, at 4:25 AM, Fred Liu  wrote:

zfs deadman timer is tunable. But if you hit it, you've got problems that tuning
the deadman won't help.

The tunable is zfs_deadman_synctime_ms, which is milliseconds.

For example, on a test machine here:
[root@elvis ~]# echo zfs_deadman_synctime_ms/D | mdb -k
zfs_deadman_synctime_ms:
zfs_deadman_synctime_ms:100 

FYI, you can check on the state of I/Os in the ZIO pipeline and how long they've
been there using the zio_state dcmd. Elvis is not currently busy or broken, but 
here
is an example:
[root@elvis ~]# echo ::zio_state | mdb -k
ADDRESS                 TYPE  STAGE    WAITER   TIME_ELAPSED
ff01ada853e0NULL  OPEN --   
 
ff01ada85b10NULL  OPEN --   
 

If you see a large TIME_ELAPSED, you can track down the zio in question for more
debugging.
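
If you suspect a machine is creeping toward the deadman, a crude watch loop around
the same dcmd will catch I/Os that are accumulating elapsed time. This is only a
shell sketch; the interval and the awk filter are arbitrary:
# while :; do echo ::zio_state | mdb -k | awk 'NR > 1 && $NF ~ /^[0-9]/'; sleep 10; done
Any entry that keeps reappearing with a growing TIME_ELAPSED is the zio to chase.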
 -- richard





Re: [smartos-discuss] Filesystem read/write speed in Centos VM

2015-01-24 Thread Richard Elling via smartos-discuss

> On Jan 23, 2015, at 12:45 PM, Greg Zartman via smartos-discuss 
>  wrote:
> 
> A picture says a thousand words, so I thought I'd wrap this discussion up 
> with a KVM windows server disk performance screenshot.  Very good results. 

sweet!
 -- richard

> 
> 
> Greg J. Zartman, P.E.
> President, Principal Engineer
> 
> LEI Engineering & Surveying, LLC
> 2160 Davcor Street SE
> Salem, Oregon  97302
> Office: 541-683-8383 (ext 103)  Cell: 541-521-8449  Fax: 866-232-6790
> www.leiengineering.com 
> 
> SBA Certified HUBZone Contractor
> 
> On Thu, Jan 22, 2015 at 10:59 AM, Greg Zartman  > wrote:
> On Thu, Jan 22, 2015 at 10:47 AM, Kim Culhan  > wrote:
> 
> What method did you use to locate the bad device ?
> 
> I did a:   iostat -xnc 1  in the global zone.  I then went into the Centos 
> KVM in a separate terminal and did a "dd" write to the NFS volume in 
> question.  Watching the global zone terminal iostat output, I monitored what 
> the %b (% busy) output was for each drive.  One of the spindle drives was 
> maxed at 100%.  Sigxcpu (IRC username) suggested that this was the bad drive 
> and I should try detaching it (this was one drive in a mirrored vdev).  I 
> then did a zpool detach zones .  I then went back to my 
> KVM terminal and re-initiated the dd write test.  iostat in the global zone 
> then showed even %b across all devices and my dd write test in the KVM 
> reported a really good write speed to the NFS volume.
> 
> Hope that helps.
> 
> Greg
> 
> 






Re: [smartos-discuss] zfs/zpool versions, send/receive, and Solaris 10

2014-11-14 Thread Richard Elling via smartos-discuss
On Nov 14, 2014, at 3:45 PM, John Thurston via smartos-discuss 
 wrote:

> I tried to "zfs send" a snapshot from my Solaris 10 system to a SmartOS 
> system. The result was:
>> cannot receive: stream has unsupported feature, feature flags = 24
> 
> Which I interpret as telling me my Solaris system is trying to send something 
> unsupported to my SmartOS system. What I can't figure out... is "24" a single 
> flag identifying something, or a bit-wise register denoting two flags set, or 
> something else. Maybe it just means, "I support 'feature flags' and you 
> don't".

This was introduced into the zfs version 5 send stream as part of a bug fix by 
Oracle after they closed the source.
We theorize that we could ignore the bits, but it is not clear how to prove 
that it works -- empirically it might work,
for at least some cases.
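
To see where the two sides stand before deciding how to move the data, the version
information is easy to compare (the pool name is a placeholder):
# zpool get version zones
# zfs get version zones
# zpool upgrade -v | head
If the stream really cannot be received, the usual fallback is file-level copying
(rsync or tar over ssh) rather than send/receive.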
 -- richard

> 
> Regardless of what it means, am I correct that my Solaris zpool being at 
> "version 32" and my SmartOS zpool being at (the last supported) "version 28" 
> is a show stopper?
> -- 
>   Do things because you should, not just because you can.
> 
> John Thurston907-465-8591
> john.thurs...@alaska.gov
> Enterprise Technology Services
> Department of Administration
> State of Alaska
> 
> 





Re: [smartos-discuss] Very slow KVM/Windows Server disk writes

2014-10-03 Thread Richard Elling via smartos-discuss

On Oct 3, 2014, at 2:43 AM, Ian Collins via smartos-discuss 
 wrote:

> Micky via smartos-discuss wrote:
>> Welcome to the SmartOS.
>> You are not alone :)
>> 
>> There's no known fix or workaround to this problem, which is most likely due 
>> to the KVM implementation in illumous.
>> 
> 
> There is a known fix: a properly configure pool.
> 
> Every time I've seen this "issue" crop up on this list the problem has been a 
> pool configuration which hasn't been optimised for synchronous write IOPs.  
> Usually the pool lacks log devices, or worst of all the pool is raidz2 
> without log devices!

This is trivial to test by temporarily disabling the ZIL for your dataset(s).
zfs set sync=disabled poolname/datasetname
measure
to revert
zfs set sync=standard poolname/datasetname

If that test shows a huge performance improvement, then you know to concentrate 
your
efforts on optimizing the pool for sync writes. 
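
If you want to see where the synchronous writes land while you run that test,
zpool iostat can break the numbers out per vdev. A minimal sketch, assuming the
pool is named zones (substitute your own pool name):
# zpool iostat -v zones 1
With sync=standard and a slog present, the write traffic should show up against
the log vdev; with sync=disabled it will not, which makes the before/after
comparison easy to see.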
 -- richard

> 
> Bottom line:  if you want good write performance form a KVM instance without 
> breaking the bank, use a stripe of spinning rust mirrors and a decent SSD or 
> RAM log.
> 
> -- 
> Ian.
> 
> 
> 





Re: [smartos-discuss] Deleting large files blocks all IO

2014-09-26 Thread Richard Elling via smartos-discuss

On Sep 26, 2014, at 9:00 AM, Rajesh Poddar via smartos-discuss 
 wrote:

> When I delete large files (2-3TB) on my smartos system it often blocks all IO 
> operations. The only info I found online pointed to dedup as a possible 
> culprit but I have never enabled dedup on my system. 

If that data is cached in ARC, then you will see a lot of memory activity in 
arcstat, vmstat,
and mpstat.
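
A cheap way to watch that while the delete runs is the ARC kstats plus vmstat;
the 10-second interval below is arbitrary:
# kstat -p zfs:0:arcstats:size zfs:0:arcstats:c 10
# vmstat 10
A rapidly shrinking arcstats:size during the delete is consistent with the
cached-in-ARC case described above.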

> 
> I've tried it both with and without snapshots but that didn't make a 
> difference. When I do zfs list -o space periodically then I see that the 
> deleted space slowly gets reclaimed by ZFS at about 0.1TB every 30s or so and 
> all IO is blocked until all the space freed by the deletion is reclaimed and 
> reflected in the output of the zfs list -o space command.
> 
> Any pointers on how I can go about debugging this issue?

Today, there is no way to throttle the free'ing workload. Upstream there is a 
tunable
being introduced to limit the amount of ZFS free activity.
https://www.illumos.org/issues/5138

Also, for smartos, the free'ing activity does impact the per-zone ZFS I/O 
scheduler.
 -- richard






Re: [smartos-discuss] Strange overheat problem with Supermicro 4U chassis

2014-08-08 Thread Richard Elling via smartos-discuss

On May 24, 2014, at 1:54 AM, Ian Collins via smartos-discuss 
 wrote:

> Ian Collins via smartos-discuss wrote:
>> Ian Collins via smartos-discuss wrote:
>>> Richard Elling via smartos-discuss wrote:
>>>> Firmware rev 0001, 0002, 0004, and a004 work. Firmware rev 0003 is broken
>>>> because someone didn't read or understand the spec. In other words, the
>>>> spec is right and FMA follows the spec, the firmware is wrong.
>>> Just my luck:
>>> 
>>> name='inquiry-revision-id' type=string items=1
>>>  value='0003'
>>> 
>> A firmware update to 0004 does not resolve the problem.  I assume the
>> "bridge" firmware Mark mention in the other thread is still required to
>> reset the thresholds.
>> 
> 
> Has anyone else here had this problem and managed to get a solution form 
> Seagate?  I'm not getting very far and this system build is turning into a 
> bit of a nightmare :(

Yes, the latest firmware we have from Seagate works.
— richard

> 
> I guess I could build my own image with a modified disk-transport.conf (I'm 
> currently testing the system thermals with OmniOS and temp-multiple 
> disabled), but that is a path I'd rather not tread.
> 
> -- 
> Ian.
> 
> 
> 

-- 

ZFS storage and performance consulting at http://www.RichardElling.com










Re: [smartos-discuss] illumos / SmartOS equivalent to net.ipv4.ip_nonlocal_bind

2014-08-06 Thread Richard Elling via smartos-discuss
On Aug 6, 2014, at 8:51 AM, Keith Wesolowski via smartos-discuss 
 wrote:

> On Wed, Aug 06, 2014 at 06:31:24AM -0700, Jon via smartos-discuss wrote:
> 
>> What is the illumos / SmartOS equivalent to Linux's sysctl setting 
>> net.ipv4.ip_nonlocal_bind ?
> 
> There is none.
> 
> For the benefit of those who don't use GNU/Linux and don't know what
> this setting does there, I'm told it's commonly used for IP address
> takeover/failover applications, allowing a process to bind a socket to
> an address that is not configured on any local interface.
> 
> The correct way to handle this -- which would have obviated the need for
> the engineering work they did to create this and works everywhere, had
> they only spent a few minutes researching existing solutions before
> adding yet another sysctl -- is to plumb the interface with the
> address(es) to be taken over, then simply leave it down on all members
> of the cluster except the active one.  When takeover is desired, the
> application or cluster management software simply sets IFF_UP on that
> address.  Binding is permitted when the interface is down, so long as
> the address being bound to is configured.  This approach was used in the
> Fishworks appliances and worked very well there; I assume it is used by
> many other such applications as well.

This technique is commonly used, yes.
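
A minimal sketch of that with illumos ifconfig, assuming an interface named net0
and a service address of 192.0.2.10 (both made up for illustration):
# ifconfig net0 addif 192.0.2.10/24 down    (on every cluster member)
# ifconfig net0:1 up                        (on the member taking over)
addif picks the next free logical interface number, so check ifconfig -a for the
actual name (net0:1 here is only an example) before bringing it up.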
 -- richard





Re: [smartos-discuss] Problem with netatalk, very long response times from getcwd()

2014-08-02 Thread Richard Elling via smartos-discuss

On Aug 2, 2014, at 6:55 AM, Chris Ferebee via smartos-discuss 
 wrote:

> Robert,
> 
> The script you suggested outputs the following, with the server mostly idle 
> and one OS X client spinning in the Finder, doing its thing to cause the 
> flood of getcwd() calls.
> 
> # dtrace -n 'fbt::dnlc_reverse_lookup:entry{ self->t = timestamp; }' -n 
> 'fbt::dnlc_reverse_lookup:return/self->t/{ @[arg1 == 0 ? 0 : 1] = 
> quantize(timestamp - self->t); self->t = 0; }'
> dtrace: description 'fbt::dnlc_reverse_lookup:entry' matched 1 probe
> dtrace: description 'fbt::dnlc_reverse_lookup:return' matched 1 probe
> 
> ^C
> 
>1
>   value  - Distribution - count
> 1048576 | 0
> 2097152 | 240  
> 4194304 | 8
> 8388608 |@636  
>16777216 |@@@  348  
>33554432 | 0
> 
>0
>   value  - Distribution - count
> 8388608 | 0
>16777216 | 971  
>33554432 | 2
>67108864 | 0
> 
> 
> I don't yet understand how this relates to the results from Ralph's afp.d 
> Dtrace script, which shows getcwd() taking either ~ 50 us or 250 ms, with a 
> 3:1 bimodal distribution. Does getcwd() call dnlc_reverse_lookup always, or 
> just sometimes?
> 
> How might the size of the dnlc affect things? I understand it can be adjusted 
> by putting
> 
>   set ncsize=nnn
> 
> into /etc/system, but I have also seen suggestions that it can be dynamically 
> tuned like this:
> 
> # echo dnlc_max_nentries/W0t10485760 | mdb -kw
> # echo ncsize/W0t524 | mdb -kw
> 
> (Is that safe or advisable? Does it work? I'm hesitant to try, but it would 
> be convenient not to reboot the server, not to mention the difficulty of 
> modifying /etc/system on SmartOS.)
> 
> Currently, ncsize is set to 444,719 (SmartOS default?), and dnlc_nentries 
> seems to slowly fluctuate between 200,000 and 300,000.
> 
> # echo ncsize/D | mdb -kw
> ncsize: 444719# and similarly for
> dnlc_nentries: 272139
> dnlc_max_nentries: 889438
> dnlc_nentries_low_water: 440316
> 
> Dnlc cache hits seem to be around 50% per
> 
> # vmstat -s
> 5079608454 total name lookups (cache hits 50%)
> 
> and
> 
> # kstat -n dnlcstats
> module: unixinstance: 0 
> name:   dnlcstats   class:misc
>crtime  8,600642349
>dir_add_abort   0
>dir_add_max 0
>dir_add_no_memory   0
>dir_cached_current  2
>dir_cached_total2
>dir_entries_cached_current  1094
>dir_fini_purge  0
>dir_hits0
>dir_misses  19498
>dir_reclaim_any 0
>dir_reclaim_last0
>dir_remove_entry_fail   0
>dir_remove_space_fail   0
>dir_start_no_memory 0
>dir_update_fail 0
>double_enters   14
>enters  1858063
>hits2546253549
>misses  2531297654
>negative_cache_hits 22475516
>pick_free   0
>pick_heuristic  995275
>pick_last   3997

pick_heuristic and pick_last are the important indicators that your DNLC could be
too small. Ideally, these are 0.

Somewhere around here I've got a dnlcstat... there is one in the dtrace toolkit,
but you can also do one using just kstats instead.
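
A kstat-only approximation, if you just want the hit/miss and pick_* counters on
an interval (the 10 seconds below is arbitrary):
# kstat -p unix:0:dnlcstats:hits unix:0:dnlcstats:misses \
    unix:0:dnlcstats:pick_heuristic unix:0:dnlcstats:pick_last 10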
 -- richard


>purge_all   0
>purge_fs1   0
>purge_total_entries 1786
>purge_vfs   44
>purge_vp67
>snaptime3016218,120526680
> 
> Thanks for insights so far!
> 
> Best,
> Chris
> 
> 
> Am 01.08.2014 um 17:07 schrieb Robert Mustacchi via smartos-discuss 
> :
> 
>> On 07/31/2014 11:15 PM, Chris Ferebee via smartos-discuss wrote:
>>> Hi Robert,
>>> 
>>> The flame graph is here:
>>> 
>>>  
>>> 
>>> with corresponding output from Ralph's Dtrace script here:
>>> 
>>>  
>>> 
>>> The raw kernel Dtrace output of
>>> 
>>>  dtrace -x stackframes=100 -n 'profile-997 /arg0/ { @[st

Re: [smartos-discuss] 10GbE throughput limited to 1Gb/s, why?

2014-07-20 Thread Richard Elling via smartos-discuss
On Jul 19, 2014, at 6:42 AM, Chris Ferebee via smartos-discuss 
 wrote:

> 
> I'm trying to debug a network performance issue.
> 
> I have two servers running SmartOS (20140613T024634Z and 20140501T225642Z), 
> one is a Supermicro dual Xeon E5649 (64 GB RAM) and the other is a dual Xeon 
> E5-2620v2 (128 GB RAM). Each has an Intel X520-DA1 10GbE card, and they are 
> both connected to 10GbE ports on a NetGear GS752TXS switch.
> 
> The switch reports 10GbE links:
> 
> 1/xg49Enable  10G Full10G FullLink Up 
> Enable  151820:0C:C8:46:C8:3E   49  49
> 1/xg50Enable  10G Full10G FullLink Up 
> Enable  151820:0C:C8:46:C8:3E   50  50
> 
> as do both hosts:
> 
> [root@90-e2-ba-00-2a-e2 ~]# dladm show-phys
> LINK  MEDIA   STATE   SPEED   DUPLEX  DEVICE
> igb0  Ethernetdown0   half
> igb0
> igb1  Ethernetdown0   half
> igb1
> ixgbe0Ethernetup  1   full
> ixgbe0
> 
> [root@00-1b-21-bf-e1-b4 ~]# dladm show-phys
> LINK  MEDIA   STATE   SPEED   DUPLEX  DEVICE
> igb0  Ethernetdown0   half
> igb0
> ixgbe0Ethernetup  1   full
> ixgbe0
> igb1  Ethernetdown0   half
> igb1
> 
> Per dladm show-linkprop, maxbw is not set on either of the net0 vnic 
> interfaces.
> 
> And yet, as measured via netcat, throughput is just below 1 Gbit/s:
> 
> [root@90-e2-ba-00-2a-e2 ~]# time cat /zones/test/10gb | nc -v -v -n 
> 192.168.168.5 

It's called "netcat" for a reason, why are you cat'ing into it?
time nc -v -v -n 192.168.168.5 
> Connection to 192.168.168.5  port [tcp/*] succeeded!
> 
> real  1m34.662s
> user  0m11.422s
> sys   1m53.957s
> 
> (In this test, 10gb is a test file that is warm in RAM and transfers via dd 
> to /dev/null at approx. 2.4 GByte/s.)
> 
> What could be causing the slowdown, and how might I go about debugging this?

nc doesn't buffer, so a pipeline of data flowing through cat <-> nc <-> network 
<-> nc <-> ?? 
is susceptible to delays at any stage rippling their latency back to the far 
end. You're better
off testing performance with proper network performance testing tools like 
iperf where such
things are not in the design.
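
For example, something like this, assuming iperf is installed on both machines
(e.g. from pkgsrc via pkgin install iperf; adjust to whatever you actually have):
server# iperf -s
client# iperf -c 192.168.168.5 -t 30 -P 4
If a single stream still tops out near 1 Gbit/s but -P 4 scales up, look at the
per-connection path (TCP window, interrupt placement); if neither scales, look at
the link itself.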

 -- richard


> 
> FTR, disk throughput, while not an issue here, appears to be perfectly 
> reasonable, approx. 900 MB/s read performance.
> 
> Thanks for any pointers!
> 
> Chris
> 
> 
> 
> 

--

richard.ell...@richardelling.com
+1-760-896-4422








Re: [smartos-discuss] ghc ./configure generating bad config.status:

2014-07-12 Thread Richard Elling via smartos-discuss

On Jul 12, 2014, at 2:21 PM, Alain O'Dea via smartos-discuss 
 wrote:

> 
> On Fri 11 Jul 2014 12:33:09 PM UTC, Jonathan Perkin via smartos-discuss
> wrote:
>> * On 2014-07-11 at 13:28 BST, Nahum Shalman via smartos-discuss wrote:
>> 
>>> I've had issues from time caused by the fact that /bin/sh is a
>>> symlink to ksh93 and not bash.
>> 
>> There are also a number of bugs in the ksh93 currently available in
>> illumos.  They are fixed in newer releases, but I don't think anyone
>> wants to touch a ksh93 upgrade with a barge pole.
>> 
>> This is why in pkgsrc I set SHELL to /bin/bash and avoid ksh93
>> completely, it fixes a number of issues I was seeing previously.
>> 
> 
> Thank you gentlemen.  Jonathan, your SHELL env fix worked.  I don't
> understand why it works, but I'll go with it :)

someone, somewhere is doing an equivalency match because /bin -> /usr/bin,
so /usr/bin/bash and /bin/bash are the exact same thing. Should be easy to locate
with grep ;-)
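
A quick way to confirm the symlink and pin the shell for a single build; the paths
are as they normally appear in a SmartOS zone, so treat the exact target as an
assumption:
$ ls -l /bin
$ SHELL=/bin/bash CONFIG_SHELL=/bin/bash ./configure
CONFIG_SHELL is the autoconf-recognized way to pin the configure shell, which tends
to be less surprising than relying on whatever /bin/sh resolves to.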
 -- richard

> 
> My SHELL was /usr/bin/bash.  Just exporting it didn't work, but setting
> SHELL to /bin/bash worked.
> 
> 
> 





Re: [smartos-discuss] Joyents experience with HGST drives and performance

2014-07-07 Thread Richard Elling via smartos-discuss

On Jul 7, 2014, at 11:28 AM, Keith Wesolowski via smartos-discuss 
 wrote:

> On Mon, Jul 07, 2014 at 08:20:43PM +0200, Ibrahim Tachijian wrote:
> 
>> Sounds like a plan Ketih.
>> 
>> And the additional need for a SLOG (like ZeusRam) is def. not required
>> because of my pool being set as sync=disabled ?
> 
> I'm not sure about how sync=disabled affects metadata updates.

It doesn't, metadata is async.

>  You
> might ask on the ZFS list if this really matters to you.  However given
> your streaming-only workloads I would not expect you to have a lot of
> metadata I/O anyway, and you'd probably want logbias=throughput even so.
> So I wouldn't use a slog given what you've described, unless I found a
> specific performance problem in situ that I had isolated to the ZIL.

agree, this is a use case where sync=disabled makes good business sense
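
For completeness, the settings look something like this; the dataset name is a
placeholder, and sync=disabled belongs only on datasets where losing the last few
seconds of writes in a crash is acceptable, as discussed:
# zfs set sync=disabled zones/media
# zfs set logbias=throughput zones/media
# zfs get sync,logbias zones/media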
 -- richard





Re: [smartos-discuss] SmartOS HA

2014-06-12 Thread Richard Elling via smartos-discuss
On Jun 12, 2014, at 1:16 AM, Chris Ferebee via smartos-discuss 
 wrote:

> Speaking of VMware, what ever became of Sun AVS a. k. a. SNDR Remote Mirror? 
> I once set up a HA file server with that years ago, and am told that it's been 
> working well to this day.

The source is out there somewhere. AFAIK, it hasn't been touched in years.

> 
> Has it simply been rolled into a commercial  Oracle offering, or did it turn 
> out to be a fundamentally bad idea?

It always was an offering. At one time Nexenta had it in their product, as a 
plugin called Simple-HA.
It was neither simple, nor HA. 

When SNDR was developed, the max disk size was rapidly approaching 9GB. Trying 
to use it for 1TB is 
very painful, and > 2TB is, IMNSHO, insane.

High-Availability.com has SmartOS support in their RSF-1 product. This is more 
of a traditional
failover HA solution. 

It is always better to implement high availability closer to the application. 
This is the approach
Joyent takes and it is, ultimately, the best solution.
 -- richard





Re: [smartos-discuss] LTS version of datasets

2014-06-05 Thread Richard Elling via smartos-discuss
On Jun 5, 2014, at 1:35 AM, Jonathan Perkin via smartos-discuss 
 wrote:
> 
> It's worth noting at this point that pkgsrc has native support for
> reporting on vulnerable packages.  We have a pkgsrc security team who
> maintain a file containing all known vulnerabilities, and it is
> matched against the packages you have installed.

"worth noting" is a massive understatement! This is a valuable service that
should be trumpeted, especially in these days of OpenSSL-patch-du-jour
 -- richard

>  To use it, run:
> 
>  $ pkg_admin fetch-pkg-vulnerabilities
>  $ pkg_admin audit
> 
> You may find with older images that there are rather a lot of matching
> vulnerabilities!
> 
> Regards,
> 
> -- 
> Jonathan Perkin  -  Joyent, Inc.  -  www.joyent.com
> 





Re: [smartos-discuss] How to send mail from SmartOS command line

2014-06-03 Thread Richard Elling via smartos-discuss
On Jun 3, 2014, at 10:06 AM, Ganapathy S A via smartos-discuss 
 wrote:

> Hi Alain,
> 
> I'm able to ssh/scp to SmartOS (actually a VBox guest on an Ubuntu host) but not 
> the other way around (from SmartOS to Ubuntu)! The following is the error I'm 
> getting. What config might I be missing on the SmartOS side? Or on the Ubuntu side? 

IIRC, for at least some Ubuntu distributions ssh server is not installed.
This would explain a connection refused. Is the openssh-server package 
installed?
Also, use "netstat -a" to see if there is a listener on port 22 (*:ssh)
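
For reference, a quick check of both ends; the package and service details below
are the stock Ubuntu ones, so treat them as assumptions about your guest:
ubuntu$ sudo apt-get install openssh-server
ubuntu$ sudo netstat -lnt | grep :22
smartos# ssh -v ganu@a.b.c.d
The -v output will make it obvious whether the TCP connection is now being
accepted or still refused.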
 -- richard

> 
> ssh ganu@a.b.c.d
> ssh: connect to host a.b.c.d port 22: Connection refused
> 
> 
> Regards,
> Ganapathy
> 
> 
> On 26 May 2014 20:48, Alain O'Dea  wrote:
> 
> On Mon 26 May 2014 03:06:10 PM UTC, Ganapathy S A wrote:
> > Thanks Alain for the advice :-)
> >
> > Regards,
> > Ganapathy
> >
> >
> > On 26 May 2014 19:03, Alain O'Dea  > > wrote:
> >
> >
> > On Sun 25 May 2014 09:40:40 AM UTC, Ganapathy S A via smartos-discuss
> > wrote:
> > > Hello,
> > >
> > > How to I send mail (a log file or source code or whatever but text
> > > only) from SmartOS command line? And, how to copy the command line
> > > text of SmartOS when it is run as a guest in Ubuntu host?
> > >
> > > Regards,
> > > Ganapathy
> >
> > The best way to get content on or off a SmartOS GZ (Global Zone),
> > OS VM
> > (local zone), or KVM (hardware virtual machine) is via SSH.  SSH runs
> > by default on Joyent-provided images so you just need to assign an IP.
> >
> > Adding persistent behavior (like SMTP) in the GZ goes against its
> > intent and design and will likely put you in frustrating, yak-shaving
> > territory.
> >
> > Jonathan Perkin has a very good article explaining the role of the
> > Global Zone:
> > http://www.perkin.org.uk/posts/smartos-and-the-global-zone.html
> >
> > I recommend using SSH and SCP for moving content on and off SmartOS
> > boxes and their guests.
> 
> My pleasure.
> 
> Here is some guidance on the wiki about supported persistent
> configuration of the global zone:
> http://wiki.smartos.org/display/DOC/Persistent+Configuration+for+the+Global+Zone
> 
> Of particular interest is the persistent configuration of SSH pubkey
> authentication which is handy for secure automation.
> 
> Best,
> Alain
> 
> 





Re: [smartos-discuss] DogeOS/SmartOS smf dependencies visualization tool!

2014-05-29 Thread Richard Elling via smartos-discuss
On May 29, 2014, at 10:06 PM, Yu Li via smartos-discuss 
 wrote:

> Hi, all,
> 
> just let you know that I did some work to visualize the SmartOS smf 
> dependencies.
> 
> http://www.dogeos.net/smfgraph/

well done! d3 rules!

You can adjust the charge and linkDistance to get a better spread. 

For a live system, you can also show the current status: online, disabled, 
maintenance, etc.
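
For the live status, plain svcs(1) output is easy to feed into the graph,
for example:

# svcs -a                             # FMRI, state, and start time for every service
# svcs -x                             # explain anything down or in maintenance
# svcs -d svc:/network/ssh:default    # what ssh depends on
# svcs -D svc:/network/ssh:default    # what depends on ssh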
 -- richard

> 
> Though it is actually DogeOS, DogeOS only adds one SMF service on top of
> SmartOS, so most of what you see is SmartOS.
> 
> You can use it to get a feel for what is going on when some SMF dependencies
> really have failed, or just watch it like I do, because it is fun :)
> 
> I hope SmartOS users will find this tool useful :)
> 
> Ciao~
> 


Re: [smartos-discuss] Strange overheat problem with Supermicro 4U chassis

2014-05-20 Thread Richard Elling via smartos-discuss

On May 20, 2014, at 2:50 PM, Roodhart, Jeroen via smartos-discuss 
 wrote:

> I stand corrected 😊 Of course I assumed it was a firmware vs. FMA thing; I
> should have articulated that more precisely.
> These disks (at least the vendor/series) have been the default choice for
> these systems for a couple of generations now. As indicated in the other
> thread (and having read it, I agree it seems to be the same issue), other
> OSes just ignore the firmware as well.

Firmware revs 0001, 0002, 0004, and a004 work. Firmware rev 0003 is broken
because someone didn't read or understand the spec. In other words, the spec
is right and FMA follows the spec; the firmware is wrong.
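
You can check which firmware revision your disks actually report from the GZ;
iostat prints a Vendor/Product/Revision (or Model/Revision) line per device:

# iostat -En | grep -i revision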
 -- richard





Re: [smartos-discuss] Working backplane/enclosure for SES in mass non-Solaris/fishwork-ZFS deployment

2014-05-10 Thread Richard Elling via smartos-discuss
Speaking of lighting LEDs...

On May 9, 2014, at 7:11 PM, Robert Mustacchi via smartos-discuss 
 wrote:

> On 5/9/14 18:01 , Fred Liu via smartos-discuss wrote:
>> I post here because I don't know where the largest non-Solaris/Fishworks-ZFS
>> deployment in the world is. But Joyent provides a cloud service, and the
>> people in their datacenters must handle piles of disks every day. As far as I
>> know, only DataON + Nexenta
>> (http://www.dataonstorage.com/dataon-solutions/nexenta/dsm-30-for-nexentastor.html)
>> have released a disk-management solution for an illumos-ZFS product, and I
>> don't know much about the Linux/FreeBSD side.

DSM uses a wrapper around sg3_utils to check LED status and change.

NexentaStor created its own LED lighting CLI code (I can't recall its name off
the top of my head) that had the annoying habit of lighting LED 0 if you gave it
the wrong input. Later they worked that into their GUI (circa 2011/2012).

For most folks, the lack of documentation for fmtopo means they miss the fact
that you can test or change all of the enumerated LEDs on the system, including
PSUs, chassis locator, etc. Whether this actually lights LEDs is not a question
that is easy to answer, given the plethora of firmware in the world of dubious
quality. This is an area where companies like DataOn provide good value: they 
test their hardware/firmware in an illumos environment.

Usage: /usr/lib/fm/fmd/fmtopo [-bCedpSVx] [-P group.property[=type:value]] [-R 
root] [-l file] [-m method] [-s scheme] [fmri]
-b  walk in sibling-first order (default is child-first)
-C  dump core after completing execution
-d  set debug mode for libtopo modules
-e  display FMRIs as paths using esc/eft notation
-l  load specified topology map file
-m  execute given method
-P  get/set specified properties
-p  display of FMRI protocol properties
-R  set root directory for libtopo plug-ins and other files
-s  display topology for the specified FMRI scheme
-S  display FMRI status (present/usable)
-V  set verbose mode
-x  display a xml formatted topology

-P is the option you're looking for to get/set properties (=uint32:1)
-C is one of those options that makes you think someone was smoking something...
or else they don't know how to use debuggers or dtrace :-P

Warning: the command example you're about to see is butt-ugly

# /usr/lib/fm/fmd/fmtopo -P facility.mode=uint32:1 
hc://:product-id=NEWISYS-NDS-4600-JD:server-id=:chassis-id=500093d00169b000/ses-enclosure=0/bay=60?indicator=ident
TIME UUID
May 10 13:49:32 40bcdaad-3474-41b1-9fba-fa6ea2de5971

hc://:product-id=NEWISYS-NDS-4600-JD:server-id=:chassis-id=500093d00169b000/ses-enclosure=0/bay=60?indicator=ident
  group: facility   version: 1   stability: Private/Private
    mode          uint32    0x1 (ON)

# /usr/lib/fm/fmd/fmtopo -V 
hc://:product-id=NEWISYS-NDS-4600-JD:server-id=:chassis-id=500093d00169b000/ses-enclosure=0/bay=60?indicator=ident
TIME UUID
May 10 13:50:50 63d689fa-af5c-664f-896e-d42d58c3a5d0

hc://:product-id=NEWISYS-NDS-4600-JD:server-id=:chassis-id=500093d00169b000/ses-enclosure=0/bay=60?indicator=ident
  group: protocol   version: 1   stability: Private/Private
resource  fmri  
hc://:product-id=NEWISYS-NDS-4600-JD:server-id=:chassis-id=500093d00169b000/ses-enclosure=0/bay=60?indicator=ident
  group: authority  version: 1   stability: Private/Private
    product-id    string    NEWISYS-NDS-4600-JD
    chassis-id    string    500093d00169b000
    server-id     string
  group: facility   version: 1   stability: Private/Private
    type          uint32    0x1 (LOCATE)
    mode          uint32    0x1 (ON)
  group: ses        version: 1   stability: Private/Private
    node-id       uint64    0x1003e

# /usr/lib/fm/fmd/fmtopo -P facility.mode=uint32:1 
hc://:product-id=NEWISYS-NDS-4600-JD:server-id=:chassis-id=500093d00169b000/ses-enclosure=0/bay=60?indicator=ident
TIME UUID
May 10 13:49:32 40bcdaad-3474-41b1-9fba-fa6ea2de5971

hc://:product-id=NEWISYS-NDS-4600-JD:server-id=:chassis-id=500093d00169b000/ses-enclosure=0/bay=60?indicator=ident
  group: facility   version: 1   stability: Private/Private
    mode          uint32    0x1 (ON)

# /usr/lib/fm/fmd/fmtopo -P facility.mode=uint32:0 
hc://:product-id=NEWISYS-NDS-4600-JD:server-id=:chassis-id=500093d00169b000/ses-enclosure=0/bay=60?indicator=ident
TIME UUID
May 10 13:52:28 b7cf2d8c-c8e3-4d74-ae35-d41c8b543846

hc://:product-id=NEWISYS-NDS-4600-JD:server-id=:chassis-id=500093d00169b000/ses-enclosure=0/bay=60?indicator=ident
  group: facility   version: 1   stability: Private/Private
    mode          uint32    0x0 (OFF)



Joyent added more intelligence into the mix by automat

Re: [smartos-discuss] ZIL recommendations

2014-05-10 Thread Richard Elling via smartos-discuss

On May 9, 2014, at 10:32 AM, Len Weincier via smartos-discuss 
 wrote:

>> > I see from the manufacturing docs that you have a single 50GB device 
>> > (270-022 usually). From reading around it seems to be recommended that 
>> > the slog be mirrored, as it is taking the writes and if one fails there 
>> > is a spare. Do you find having only the one slog device is OK? 
>> 
>> Yes. The slog is not "taking the writes"; it's being used to store the 
>> ZIL. If it fails on a running system, the ZIL will be written to the 
>> main pool; correctness will not be compromised. Of course, there are 
>> lots of ways for a device to fail. If you imagine the possibility that 
>> a device fails by taking 50x longer than normal to complete writes, your 
>> mirror looks a lot less useful (read: not at all). The only certainty 
>> is that buying 2 devices will cost you twice as much. 
> Now that makes sense, the ZIL is the in memory intent log and the slog is the 
> fast device that can store the intent log. If the slog goes away the ZIL is 
> still in memory and will be committed to the pool. The fast slog allows ZFS 
> to confirm writes to the client much faster. 

You're close, but it is not the ZIL in memory. Wikipedia does a reasonable
job explaining intent logs at:
http://en.wikipedia.org/wiki/Intent_log
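
One reason a single slog is less scary than it sounds: it can be added and
removed on a live pool, and if it dies the ZIL simply falls back to the main
pool devices. Device names below are hypothetical:

# zpool add zones log c1t5d0      # attach a single slog
# zpool status zones              # the log vdev shows up below the data vdevs
# zpool remove zones c1t5d0       # detach it again, no downtime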

> 
>> The likelihood of the specific multiple-failure sequences that can lead 
>> to data loss here is extremely low. I personally have never seen it 
>> happen and consider it borderline contrived, much like the meteor strike 
>> and nuclear war scenarios. Given enough machines and enough time, I'm 
>> sure it will eventually happen (and when it does, it will be caused by 
>> operator error), but in the meantime I still have to think about cost. 
>> If your pool is being used to track the beneficial owners of trillions 
>> of dollars of Treasury debt, go ahead and spend the extra money. 
> Agreed now that I have the right mental model for whats happening.
> 
>> 
>> 
>> > Also, what sort of performance change / increase could one expect from 
>> > using an slog device? We have the DCS3700’s. I am trying to 
>> > understand the way the slog works and how much improvement I can 
>> > predictably expect from an SSD slog. 
>> 
>> Any reasonable slog vs. no slog will provide a significant improvement 
>> (order of magnitude reduction in latency or more) for any workload doing 
>> synchronous writes. I have never attempted to come up with a model that 
>> can map the numbers in a particular slog's marketing brochure to 
>> delivered performance; instead, I get evaluation units and typically use 
>> a dozen or so filebench profiles along with DTrace to get a rough 
>> estimate of what delivered (i.e., filesystem) performance will look like 
>> and how the slog itself is behaving. That test is conducted on a pool 
>> with that particular slog and the exact pool layout we would use. 
> Thanks for that pointer to filebench, was messing around with tools and 
> wondering how to get decent load profiles going for testing. So far the SSD 
> slog difference is significant, hard to quantify as you say.

filebench was originally intended to get around the problem with almost
every other benchmark at the time: the load generated is too simple to
model real workloads. For example, it is not possible to use iozone to
model the complex interaction of a database, its log, and indexing because
they are three different workloads all interacting with each other. In 
filebench you can create rather complex state machines and expose the
workload interactions. Hence you can use filebench to test how a slog can
affect a complex application -- see the oltp or /var/mail workloads.
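
A minimal interactive session looks roughly like this (workload names as
shipped with filebench; the target directory is hypothetical):

# filebench
filebench> load oltp
filebench> set $dir=/zones/fbtest
filebench> run 60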

> 
> I have setup 2 systems with the disklayout tool one using defaults and one 
> with raidz2 pools both with a single SSD slog, so far the raidz2 pool looks 
> to be doing the best.

cool. Find what works and go with it.
 -- richard

> 
>> This 
>> is not as good as observing a production workload, but given the breadth 
>> of what our customers do, it's all that's practical. If you have just a 
>> single machine, and know what your workload is, you can do better. As 
>> always, watch for things you can't explain and make sure you understand 
>> them before accepting the results. 
> 
> Thanks for the thoughtful replies, always valuable 
> 


Re: [smartos-discuss] ZIL recommendations

2014-05-08 Thread Richard Elling via smartos-discuss

On May 8, 2014, at 7:51 AM, Ibrahim Tachijian via smartos-discuss 
 wrote:

> Thanks for that list Keith (http://eng.joyent.com/manufacturing). For 
> everyone who hasn't read it yet I do recommend you to do so. 
> I'm guessing joyent has gone through a huge amount of different vendors 
> before deciding on these documented part numbers.
> 
> What I notice, though, is that Joyent doesn't seem to use any L2ARC SSDs. Is 
> this true? And if so, why?
> My guess is that all their systems always have enough ram to make any L2ARC 
> negligible and therefore unnecessary.
> 
> Please correct me if I am wrong, and especially if Joyent does use L2ARC 
> SSDs. 

I can't speak for Joyent, but the age when L2ARC was invented was also an age
where RAM was very expensive and limited. Think along the lines of "a big
machine has 16 GB." Today, 16 GB of ECC memory fits in one slot and costs less
than $200. For caches, the return on investment is marginal, which means
doubling the cache size doesn't double performance. For today's systems,
putting in 256 GB of RAM and not purchasing an L2ARC is a very sensible
approach for many workloads. Higher-end filers have 1-2 TB of RAM, for
reference.
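
You can see how big the ARC already gets on a given box before spending money
on an L2ARC; the kstats are enough:

# kstat -p zfs:0:arcstats:size     # current ARC size, in bytes
# kstat -p zfs:0:arcstats:c_max    # ceiling the ARC is allowed to grow to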
 -- richard

--

richard.ell...@richardelling.com
+1-760-896-4422








Re: [smartos-discuss] How to know where a drive is located based on ID

2014-04-11 Thread Richard Elling

On Apr 11, 2014, at 9:52 PM, Greg Zartman  
wrote:

> I started to ask this question once before, but didn't follow up.
> 
> I have a zpool that looks like this:
> 
> [root@LEI-SmartOS ~]# zpool status
>   pool: zones
>  state: ONLINE
>   scan: none requested
> config:
> 
>         NAME                       STATE     READ WRITE CKSUM
>         zones                      ONLINE       0     0     0
>           mirror-0                 ONLINE       0     0     0
>             c0t5000C5000E323437d0  ONLINE       0     0     0
>             c0t5000C5001172FC5Dd0  ONLINE       0     0     0
>           mirror-1                 ONLINE       0     0     0
>             c0t50014EE0ABEC1063d0  ONLINE       0     0     0
>             c0t50014EE0AC22159Fd0  ONLINE       0     0     0
>           mirror-2                 ONLINE       0     0     0
>             c0t5000C5000DDE3D5Ad0  ONLINE       0     0     0
>             c0t5000C50011F082E2d0  ONLINE       0     0     0
>           mirror-3                 ONLINE       0     0     0
>             c0t5000C5000DDE814Bd0  ONLINE       0     0     0
>             c0t5000C5001EC0600Fd0  ONLINE       0     0     0
> 
> 
> These drives are sitting in a Supermicro 2U 12-bay chassis with a 
> BPN-SAS-825TQ SAS/SATA backplane.

That backplane uses an American Megatrends MegaRAC MG 9072 SES controller.
The limitation of that controller is that the SES-2 interfaces are via SGPIO
or I2C from the HBA, so you'll need to make sure those connections are in
place. According to SMC, you'll need to connect the SGPIO (aka sideband)
headers to the HBA. I'm not familiar enough with the IBM HBA to know how to
connect to it. Without those connections, you won't be able to manage the
locator LEDs.
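
In the meantime, a low-tech way to map those c0t5000...d0 names to bays: the
target portion of the name is the drive's WWN, which is usually printed on the
drive label, and the serial numbers are visible from the GZ (diskinfo ships
with the SmartOS platform image, if memory serves):

# iostat -En      # vendor, product, revision, and serial number per device
# diskinfo        # one-line summary per disk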

Joyent has added some HBA-oriented LED code for failures. Perhaps someone from
Joyent can add more info.
-- richard

> 
> My SAS/SATA controllers are 2 x IBM SERVERAID M1015 crossflashed to IT mode 
> (i.e., pass-through).
> 
> I've wired the IBM M1015s to the backplane in what I believe is the correct 
> order, so the drive bay numbers on the backplane match the drive order on the 
> RAID card.  However, I'm not entirely sure how to verify this.
> 
> Question:  What is the best way for me to know where a failed drive is 
> located in the chassis?  Perhaps the backplane/RAID card are smart enough to 
> light up the red error LED when a drive fails, but I'm not certain of this.  
> 
> Thanks,
> 
> Greg
> 
> 

--

richard.ell...@richardelling.com
+1-760-896-4422








Re: [smartos-discuss] Memory paging

2014-04-09 Thread Richard Elling
On Apr 8, 2014, at 12:22 PM, Rajesh Poddar  wrote:

> I have a system with 64gb of ram that I'm running smartos on. It has 5 
> Windows Server 2012 KVM VMs each with 8GB of memory. There is also a smartos 
> zone with 32GB max_physical_memory that runs a very lightweight file server 
> that i wrote in Go. The file server typically consumes about 3-4GB of memory. 
> Every so often (like once in 10 minutes or so) the fileserver massively slows 
> down. I noticed that normally the RSS entry for the fileserver in the output 
> of prstat stays stable. However, these periods of slowdown perfectly coincide 
> with the RSS values decreasing. This led me to conclude that memory allocated 
> to the fileserver is being paged out. Furthermore the fraction of misses as 
> shown in the output of arcstat also massively increases during these periods.

The hit/miss ratios or fractions are not very interesting; the magnitude is.
50% hits at 2 accesses/sec is very different from 50% hits at 200,000
accesses/sec. A more meaningful way to describe this is: 150k accesses/sec
with a 47% hit rate, or better yet:

demand
  data
    hits (dhit)
    misses (dmis)
  metadata
    hits (mhit)
    misses (mmis)
prefetch
  data
    hits (phit)
    misses (pmis)
  metadata
    hits (? missing from smartos' arcstat ?)
    misses (? missing from smartos' arcstat ?)
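
arcstat can print most of those columns directly, for example (field names as
in the illumos arcstat; check the fields available on your platform image
with -v):

# arcstat -f time,read,hits,miss,hit%,dhit,dmis,mhit,mmis,phit,pmis 1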


> Therefore, my tentative conclusion is that memory is being paged out from the 
> fileserver and being given to the ARC.

The ARC never takes memory from other consumers; it only takes from the free
list.

> 
> This is the output of memstat
> 
> Kernel              3612239      14110   22%
> ZFS File Data       1474454       5759    9%
> Anon               11110378      43399   66%

The anon pages are the only pages that can be swapped out. For a system like
this, you will want to make sure your swap devices are at least this big,
otherwise the swap memory is reserved from real memory -- a 2x hit.
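
To compare the anon reservation against what is actually configured:

# swap -l     # physical swap devices and free blocks on each
# swap -s     # total allocated, reserved, and available virtual swap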
 -- richard

> Exec and libs          2429          9    0%
> Page cache            14309         55    0%
> Free (cachelist)      11307         44    0%
> Free (freelist)      535660       2092    3%
> 
> Any suggestions on how I can confirm that this is indeed what's going on and 
> how I can fix it? One solution that would work for me would be to dedicate 
> say 4-6GB of RAM to the fileserver zone but I can't seem to find an option in 
> vmadm to do that.
> 
> Thanks,
> Rajesh
> 

--

richard.ell...@richardelling.com
+1-760-896-4422








Re: [smartos-discuss] ZFS continually report data written for MTBF calculations

2014-04-07 Thread Richard Elling

On Apr 7, 2014, at 10:20 AM, Jakob Borg  wrote:

> 2014-04-07 18:13 GMT+02:00 Evan Rowley :
> >
> > I'd like to measure the total amount of data written over time to the 
> > physical devices that make up zpool zils, l2arcs, and vdevs. Each one of 
> > these physical devices has a projected Mean Time Before Failure (MTBF), 
> > often measured in writes or data written to the device, which I'd like to 
> > compare with the total number of writes to the device.
> >  
> > The "zpool iostat" command can do this for a given period of time, but 
> > unless I'm mistaken, it can't do this continually for an unpredictable 
> > amount of time.
> >  
> I'm imagining that something like this can be done using dtrace, but I'm 
> not completely certain that is the best way to tackle the problem.
> >  
> > Are there any other ways to do this? (before i go re-inventing the wheel)
> 
> Many SSDs keep SMART-accessible counters for data read and written;

Wow, that is revisionist history :-)
SCSI devices reported this info in the read/write logs long before SMART (circa 
2004) existed.
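
For the original question -- tracking data written continually -- one crude
but workable approach is to poll the counter on an interval and log it with a
timestamp (device path hypothetical; SAS shown here, the SATA equivalent being
the Host_Writes SMART attribute shown below):

# while :; do date; sg_logs -p 2 /dev/rdsk/c0tXd0 | grep 'Total bytes processed'; sleep 3600; done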

> 
> [root@anto ~]# /opt/smartmontools/sbin/smartctl -d sat,12  -A 
> /dev/rdsk/c3t1d0p0
> ...
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
> ...
> 241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       355603
> 242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       64535
> 
> Haven't seen the same on any spinning disks...

Try something like:
sg_logs -a /dev/rdsk/c0t5000C500476C5B37d0
SEAGATE   ST3300657SS   0008
Supported log pages  (spc-2) [0x0]:
0x00Supported log pages
0x02Error counters (write)
0x03Error counters (read)
0x05Error counters (verify)
0x06Non-medium errors
0x0dTemperature
0x10Self-test results
0x15Background scan results (sbc-3)
0x18Protocol specific port
0x37Cache (Seagate), Miscellaneous (Hitachi)
0x38[unknown vendor specific page code]
0x3eFactory (Seagate/Hitachi)
Write error counter page  (spc-3) [0x2]
  Errors corrected with possible delays = 0
  Total rewrites or rereads = 0
  Total errors corrected = 0
  Total times correction algorithm processed = 0
  Total bytes processed = 60158093824
  Total uncorrected errors = 0

 look for zeros in the error counters, especially errors corrected with 
possible delays

Read error counter page  (spc-3) [0x3]
  Errors corrected without substantial delay = 258780
  Errors corrected with possible delays = 0
  Total rewrites or rereads = 0
  Total errors corrected = 258780
  Total times correction algorithm processed = 258780
  Total bytes processed = 22475677696
  Total uncorrected errors = 0
Verify error counter page  (spc-3) [0x5]
  Errors corrected without substantial delay = 0
  Errors corrected with possible delays = 0
  Total rewrites or rereads = 0
  Total errors corrected = 0
  Total times correction algorithm processed = 0
  Total bytes processed = 0
  Total uncorrected errors = 0

 rarely do we see verify enabled... too slow

Non-medium error page  (spc-2) [0x6]
  Non-medium error count = 1

 rare to discover what these actually reveal... usually need to get access 
to private data

Temperature page  (spc-3) [0xd]
  Current temperature = 39 C
  Reference temperature = 68 C
Self-test results page  (spc-3) [0x10]
Background scan results page (sbc-3) [0x15]
  Status parameters:
Accumulated power on minutes: 542377 [h:m  9039:37]

 here is the POH, useful for MTBF-related reliability calculations

Status: background scan enabled, none active (waiting for BMS interval 
timer to expire)
Number of background scans performed: 139
Background medium scan progress: 0.00%
Number of background medium scans performed: 1203
Protocol Specific port page for SAS SSP  (sas-2) [0x18]
relative target port id = 1
  generation code = 0
  number of phys = 1
  phy identifier = 0
attached device type: expander device
attached reason: SMP phy control function
reason: power on
negotiated logical link rate: 6 Gbps
attached initiator port: ssp=0 stp=0 smp=0
attached target port: ssp=0 stp=0 smp=1
SAS address = 0x5000c500476c5b35
attached SAS address = 0x50030480009b7a3f
attached phy identifier = 21
Invalid DWORD count = 0
Running disparity error count = 0
Loss of DWORD synchronization = 0
Phy reset problem = 0

 zeros here are good, too... no running disparity errors means good cabling

Phy event descriptors:
 Invalid word count: 0
 Running disparity error count: 0
 Loss of dword synchronization count: 0
 Phy reset problem count: 0
relative target port id = 2
  generation code = 0
  number of phys = 1
  phy identifier = 1
attached device type: no device attached
attached reason: unknown
reason: unk

Re: [smartos-discuss] Setting up mass storage server

2014-03-08 Thread Richard Elling
On Mar 6, 2014, at 12:42 PM, Alex Adriaanse  wrote:

> On Mar 6, 2014, at 10:06 AM, Keith Wesolowski  
> wrote:
>> The way we do this for our Manta storage nodes is to have a single
>> RAIDZ2 pool with 3 vdevs (the topology ends up being 3x(9+2) with 2
>> spares and a slog in a 36-disk chassis).  This works well, performs
>> well, and has excellent durability.  If you're curious about our HW
>> config, see github.com/joyent/manufacturing.
> 
> Thanks, your email is incredibly helpful. As far as the raidz2 vdevs are 
> concerned, I keep reading on the Internet that for raidz2 you should have an 
> even number of disks. Is this valid advice?

No.

> If so, what’s the logic behind this (or if it’s not valid, why do people keep 
> making this recommendation)?

Because they don't understand how ZFS works and keep trying to force their
preconceived notions of RAID-5/RAID-6 on ZFS.
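
For the record, an 11-wide raidz2 (the 9+2 layout mentioned above) is created
like any other vdev; nothing in the syntax or the allocator cares whether the
width comes out even (disk names hypothetical):

# zpool create tank \
    raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
           c1t6d0 c1t7d0 c1t8d0 c1t9d0 c1t10d0 \
    spare c1t11d0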
 -- richard

--

richard.ell...@richardelling.com
+1-760-896-4422








Re: [smartos-discuss] Network setup

2014-03-08 Thread Richard Elling

On Mar 6, 2014, at 2:49 PM, tono  wrote:

> Hello,
> 
> I'm looking for advice about redundant NIC setup, specifically avoiding
> single point of failure at the switch. I did see the wiki page on link
> aggregations. We use non-stacking switches, which means that all members
> would need to connect to the same switch.
> 
> Solaris' DLMP fits the bill but it was introduced only in 11.1.
> 
> It seems that IPMP could work but not in the global zone (or at least
> not in a way that would benefit the local zones). Every non-global zone
> would need its own IPMP setup, which is a bit messy & also wasteful in
> terms of probe traffic.

By default there is no probe traffic for IPMP. Link health is based on the 
state of the link, not an ICMP ping. This means that if your switch or network
fails, but the link does not fail, then it will not be detected by IPMP. Not by
coincidence, this is how link aggregation works, too.

If you actually want to test that your network is working (as opposed to the
switches) then you'll need to configure probes.
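
A rough sketch with ipadm (interface names and addresses hypothetical); it is
the test addresses on the underlying interfaces that turn on probe-based
failure detection:

# ipadm create-ipmp ipmp0
# ipadm add-ipmp -i net0 -i net1 ipmp0
# ipadm create-addr -T static -a 192.0.2.10/24 ipmp0/v4     # data address
# ipadm create-addr -T static -a 192.0.2.11/24 net0/test    # test address
# ipadm create-addr -T static -a 192.0.2.12/24 net1/test    # test address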
 -- richard

> 
> Am I missing anything or are these our options? Is there anything else
> in the pipe?
> 
> Also, how to best separate interfaces from a security point of view? I
> think it would be safest to reserve the management network for LOMs.
> Would there be any additional benefit to separating the global zone from
> the local ones? I mean, if a zone is already compromised, would a
> separate NIC or VLAN add any real security? On a dual NIC server, it
> wouldn't be a small sacrifice. 
> 
> Thanks!
> 
> 

--

richard.ell...@richardelling.com
+1-760-896-4422








Re: [smartos-discuss] zfs_nocacheflush and Enterprise SSDs with Power Loss Protection (Intel S3500)

2014-03-08 Thread Richard Elling
Hi Nick,

On Mar 7, 2014, at 12:21 PM, Nick  wrote:

> On Tue, Mar 4, 2014 at 4:32 PM, Keith Wesolowski 
>  wrote:
> It's also been my experience that most applications don't actually open
> files O_DSYNC and proceed to do tiny writes to them.  I've observed
> MySQL, for instance, doing lots of 87-byte writes to its log file, but
> it opens it without O_DSYNC and instead issues fsync(3c) calls when
> required for transactional consistency.  That pattern allows the OS to
> aggregate small I/O and also to do writes before they have to be
> committed if the device is otherwise idle.  This pattern usually results
> in much better apparent performance than writing O_DSYNC.  Again,
> whether this is relevant depends on your application.
> 
> > - The slowness of the Seagate 600 Pro SSD really surprised me. This is one
> > of the fastest drives on the market, and you can see that with the cache
> > flushing disabled it outpaces the Intel SSD. It is supposed to be an
> > "Enterprise" drive with power protection, so I thought it could ignore the
> > flush commands, but that is clearly not the case. When doing 4K sync
> > writes, it seems to be somewhere around 50 times slower than the Intel
> drive and it is actually slower than a mechanical hard drive, which really 
> > shocked me.
> 
> It's an interesting result, for sure.  Ask Seagate.  FWIW, I tested
> their Pulsar.2 SAS drive for slog use and found it pretty good, very
> similar in fact to the STEC Mach16 SLC that we use today.  I don't think
> that's the same device, though, and we never seriously considered either
> for primary storage.
> 
> If you want to get further into this, you can look at using DTrace to
> break down the slowness by attributes of the SCSI packets being sent by
> the HBA.  That might yield some interesting data; i.e., is it a
> particular size, a particular LBA range, a particular SCSI command, etc.
> If it's always SYNCHRONIZE CACHE that's slow, does it matter what
> commands were issued prior to that?
> 
> 
> Thanks a lot Keith, I really appreciate your help. I've been benchmarking 
> recently with mysql and sysbench as that is closer to our actual workload. 
> I'm finding that, as you thought, with more threads going on, the Intel S3500 
> SSD's performance isn't hurt too much by adding cache flushing. Disabling 
> cache flushing still speeds things up, but by less than 2x, so that seems 
> decent. The Seagate on the other hand...

It so happens I've been looking at Seagate SSD performance today, too :-)

> The performance hit on the Seagate 600 Pro is 3x to 10x depending on how many 
> threads is run. I ran dtrace as you suggested to look at the SCSI commands 
> and the SYNCHRONIZE CACHE commands are taking 1us for the "fast" ones, 
> and I saw some get up around 50us, so it would seem that this drive is 
> only capable of about 100 cache flushes per second, at most, which really is 
> limiting its performance. This compares to the Intel S3500 taking a little 
> over 100us for a SYNCHRONIZE CACHE command.

I haven't done these measurements yet, but 100us is barely tolerable; more
than 1ms becomes intolerable for slogs.
 -- richard

> 
> One question I have for you or whoever can answer this, is: is there 
> something special about MySQL's innodb_flush_log_at_trx_commit with ZFS? In 
> the case of the Seagate drive, setting it to 2 obviously speeds up things 
> significantly as the flush is only performed once per second, but my 
> understanding is that you put the last second of transactions at risk by 
> using any setting but 1, and I want to make sure there's nothing in ZFS to 
> somehow mitigate this...I can't think of any reason that you could set it to 
> 2 and be ACID safe, but I ask because all the default configurations for 
> SmartOS's/Joyent's machines (Standard64, Percona, etc) have it 
> innodb_flush_log_at_trx_commit = 2 by default, and I would be surprised that 
> they would default to a not-safe setting. But perhaps I'm just not 
> understanding something. Thanks,
> 
> Nick
> 

--

richard.ell...@richardelling.com
+1-760-896-4422








Re: [smartos-discuss] zfs_nocacheflush and Enterprise SSDs with Power Loss Protection (Intel S3500)

2014-03-03 Thread Richard Elling
[also agree with Keith's comments]

On Mar 3, 2014, at 2:35 PM, Nick  wrote:

> I've been testing write performance with SSDs and have a few observations and 
> questions.
> 
> My focus has been on testing random write performance with 4K blocks, which 
> is a commonly benchmarked item.

Do you actually see 4K random writes to the disk?  [bait placed, waiting for 
the hook to set]
Hint: iosnoop -Dast

> The problem with 4K blocks and SSDs is that all SSDs internally use a much 
> larger "erase block". It seems to be commonly 256K to 2MB in size, depending 
> on the drive, and the way an SSD commonly functions is that if I write 100 4K 
> blocks it will keep them in its internal RAM until it gets enough to fill an 
> entire erase block, then it performs the actual write to the flash memory. So 
> whenever you see a benchmark online they are operating the SSD in this 
> fashion.

Nit: you should qualify this as "Flash SSDs" -- other SSDs can behave 
differently.

> 
> People running ZFS are generally concerned about data security, and this 
> would be a problem: if the power dies you will lose any data that is in the 
> SSD's RAM but hasn't been written to flash.

Yes, very much so.

> 
> Most newer SSDs seem to honor a sync and actually flush their memory to flash 
> (although some older ones will lie about a sync).

All should, or they violate the spec.

> 
> The issue, when benchmarking 4K blocks for example, is that if I tell the SSD 
> to write 4K of data and then do a sync, it needs to read an entire erase 
> block (256K+) then modify it, and write it, so all of a sudden it has written 
> much more data than it would have if written to in async mode.
> 
> The first SSD I tested was an Intel S3500 Enterprise SSD. This SSD has "power 
> loss protection" according to Intel, and has capacitors built in so that 
> sudden power loss would not cause you to lose data. In theory such a drive 
> should be able to ignore the sync calls and can safely buffer the data in 
> memory, since it is able to write it to flash in the event of a power loss.
> 
> I did some benchmarks of 4K random writes, where a sync was called after each 
> one (basically worst-case) and I found that the S3500 performed about twice 
> as fast when zfs_nocacheflush = 1, to disable ZFS's sending flush commands to 
> the drive.

You can also disable the SYNCHRONIZE_CACHE on a per-disk-model basis.
This allows you to have both volatile and nonvolatile disk caches in the same
system. See sd(7d); the code shows a "cache-nonvolatile" option, unfortunately
not documented in the man page :-(
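
A sketch of both knobs -- the per-model sd.conf entry (vendor/product string
made up; it has to match the inquiry strings exactly, padding and all) versus
the global /etc/system hammer:

    # /kernel/drv/sd.conf
    sd-config-list = "ATA     ExampleSSD Model", "cache-nonvolatile:true";

    * /etc/system -- applies to every pool and every device
    set zfs:zfs_nocacheflush = 1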

> 
> So my first question is: Should I be operating the S3500 with 
> zfs_nocacheflush = 1? My theory was that this would do nothing for this 
> drive, as it should never need to be told to flush, but is it for some reason 
> forcing a flush even though it doesn't need to? My understanding for 
> zfs_nocacheflush is that this should be safe for this drive, since it has the 
> power loss protection. Am I incorrect in this?

Ask Intel.

> 
> Another interesting data point is that I switched the Intel S3500 out for a 
> Seagate 600 Pro drive. The "Pro" version of this drive is labeled an 
> "Enterprise" drive and advertises power loss protection, and teardowns 
> confirm a capacitor bank.
> 
> With zfs_nocacheflush = 1, this drive is very fast, even faster than the 
> Intel S3500. But when allowed to send the flush cache commands after every 4K 
> sync write, it becomes incredibly slow. Like 1/20th the speed or less, and 
> really not much faster than a conventional hard drive. Much slower than the 
> S3500 when both have zfs_nocacheflush = 0.

bummer. Ask Seagate.

> 
> Now, theoretically this drive should be able to have zfs_nocacheflush = 1 as 
> well, because it has the capacitor bank power loss protection.
> 
> I'm wondering if others have explored this issue and are running with 
> zfs_nocacheflush? It seems to me that this is a very overlooked part of SSDs 
> as every benchmark I can find assumes you are writing in async mode, which is 
> often not the case for databases. Thanks,

Yes. But we do it on a per-device basis. For example, for ZeusRAM,
SYNCHRONIZE_CACHE is a NOP, so we're better off not bothering to send it.
 -- richard

--

richard.ell...@richardelling.com
+1-760-896-4422





