[zfs-discuss] zfs receive slowness - lots of systime spent in genunix`list_next ?
Hi All,

We are currently doing a zfs send/recv with mbuffer to send incremental changes across, and it seems to be running quite slowly, with zfs receive the apparent bottleneck. The process itself seems to be using almost 100% of a single CPU in sys time. Wondering if anyone has any ideas whether this is normal, or if it is just going to run forever and never finish...

Details: two machines connected via Gigabit Ethernet on the same LAN.

Sending server:

  zfs send -i 20111201_1 data@20111205_1 | mbuffer -s 128k -m 1G -O tdp03r-int:9090

Receiving server:

  mbuffer -s 128k -m 1G -I 9090 | zfs receive -vF tank/db/data

mbuffer showing:

  in @ 256 KiB/s, out @ 256 KiB/s, 306 GiB total, buffer 100% ful

My debug:

DTraceToolkit hotkernel reports:

  zfs`lzjb_decompress               10   0.0%
  unix`page_nextn                   31   0.0%
  genunix`fsflush_do_pages          37   0.0%
  zfs`dbuf_free_range              183   0.1%
  genunix`list_next               5822   3.7%
  unix`mach_cpu_idle            150261  96.1%

Top shows:

    PID USERNAME NLWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
  22945 root        1  60    0   13M 3004K cpu/6  144:21  3.79% zfs
    550 root       28  59    0   39M   22M sleep   10:19  0.06% fmd

I'd say the 3.7% here looks so low only because hotkernel reports aggregate CPU usage rather than per-CPU usage. mpstat seems to show the real story. mpstat 1 shows output much like this each second:

  CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
  00 00 329 108 830 1730 00 0 0 100
  10 00 1001 940 2310 00 0 0 100
  20 00320 280510 00 0 0 100
  30 00180 110000 00 0 0 100
  40 00166 100200 00 0 0 100
  50 00 6020000 00 0 0 100
  60 00 2000000 00 0 0 100
  70 00 9040000160 0 0 100
  80 00 6030000 00 3 0 97
  90 00 3100000 00 0 0 100
  100 00222 350110 00 89 0 11
  110 00 2000000 00 0 0 100
  120 00 3020100 20 0 0 100
  130 00 2000000 00 0 0 100
  140 0024 1760020610 0 0 100
  150 00140 240010 20 0 0 100
  160 00 2000000 00 0 0 100
  170 0010280050780 1 0 99
  180 00 2000000 00 0 0 100
  190 00 5120000100 0 0 100
  200 00 2000000 00 0 0 100
  210 00 9240000 40 0 0 100
  220 00 4000000 00 0 0 100
  230 00 2000000 00 0 0 100

(The last four columns of each row are usr, sys, wt and idl -- note CPU 10 at 89% sys and 11% idle while the rest are essentially idle.)

So I'm led to believe that zfs receive is spending almost 100% of a single CPU's time doing a lot of genunix`list_next ...

Any ideas what is going on here?

Best Regards,
--
Lachlan Mulcahy
Senior DBA, Marin Software Inc., San Francisco, USA
AU Mobile: +61 458 448 721   US Mobile: +1 (415) 867 2839   Office: +1 (415) 671 6080
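For reference, the hotkernel numbers above come from sampling kernel PCs. If the DTraceToolkit isn't installed, roughly the same view can be had with a one-liner like the following (a sketch -- the sampling rate and the 30-second run time are arbitrary choices, and the output formatting differs from the real script):

  dtrace -n 'profile-1001 /arg0/ { @[func(arg0)] = count(); } tick-30s { exit(0); }'

This samples each CPU 1001 times per second, counts only samples taken while the CPU was in the kernel (arg0 is the kernel PC), and prints a per-function sample count when it exits.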
Re: [zfs-discuss] zfs receive slowness - lots of systime spent in genunix`list_next ?
On Mon, 5 Dec 2011, Lachlan Mulcahy wrote:

> genunix`list_next               5822   3.7%
> unix`mach_cpu_idle            150261  96.1%

Rather idle.

> Top shows:
>
>    PID USERNAME NLWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
>  22945 root        1  60    0   13M 3004K cpu/6  144:21  3.79% zfs
>    550 root       28  59    0   39M   22M sleep   10:19  0.06% fmd

Having 'fmd' (fault monitor daemon) show up in top at all is rather ominous, since it implies that faults are actively being reported. You might want to look into what is making it active.

> So I'm led to believe that zfs receive is spending almost 100% of a
> single CPU's time doing a lot of genunix`list_next ...

Or maybe it is only doing 3.7% of this. There seems to be a whole lot of nothing going on.

> Any ideas what is going on here?

Definitely check if low-level faults are being reported to fmd.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] zfs receive slowness - lots of systime spent in genunix`list_next ?
On 12/05/11 10:47, Lachlan Mulcahy wrote:

> zfs`lzjb_decompress               10   0.0%
> unix`page_nextn                   31   0.0%
> genunix`fsflush_do_pages          37   0.0%
> zfs`dbuf_free_range              183   0.1%
> genunix`list_next               5822   3.7%
> unix`mach_cpu_idle            150261  96.1%

your best bet in a situation like this -- where there's a lot of cpu time spent in a generic routine -- is to use an alternate profiling method that shows complete stack traces rather than just the top function on the stack. often the names of functions two or three or four deep in the stack will point at what's really responsible.

something as simple as:

  dtrace -n 'profile-1001 { @[stack()] = count(); }'

(let it run for a bit, then interrupt it) should show who's calling list_next() so much.

- Bill
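A slightly extended variant of the same one-liner (a sketch; the 30-second run time and the cutoff of 20 stacks are arbitrary) that only samples while a CPU is executing in the kernel and keeps just the hottest stacks:

  dtrace -n 'profile-1001 /arg0/ { @[stack()] = count(); } tick-30s { trunc(@, 20); exit(0); }'

The /arg0/ predicate restricts samples to kernel mode, so idle/user samples don't drown out the stacks that contain list_next().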
Re: [zfs-discuss] zfs receive slowness - lots of systime spent in genunix`list_next ?
Hi Bob,

On Mon, Dec 5, 2011 at 11:19 AM, Bob Friesenhahn <bfrie...@simple.dallas.tx.us> wrote:

> On Mon, 5 Dec 2011, Lachlan Mulcahy wrote:
>> genunix`list_next               5822   3.7%
>> unix`mach_cpu_idle            150261  96.1%
>
> Rather idle.

Actually this is a bit misleading, since it is averaged over all the cores in the system. This system has 24 cores, so roughly 90% of a single core spent in system time shows up as about 3.75%:

  90 / 24 = 3.75

Similarly, 23 of the 24 cores are close to 100% idle.

>> Top shows:
>>
>>    PID USERNAME NLWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
>>  22945 root        1  60    0   13M 3004K cpu/6  144:21  3.79% zfs
>>    550 root       28  59    0   39M   22M sleep   10:19  0.06% fmd
>
> Having 'fmd' (fault monitor daemon) show up in top at all is rather
> ominous, since it implies that faults are actively being reported. You
> might want to look into what is making it active.

Nothing in /var/adm/messages -- these are the last two days' entries:

  Dec  4 18:46:20 mslvstdp03r sshd[20926]: [ID 800047 auth.crit] fatal: Read from socket failed: Connection reset by peer
  Dec  5 01:40:07 mslvstdp03r sshd[21808]: [ID 800047 auth.crit] fatal: Read from socket failed: Connection reset by peer

Anything else you suggest I check for faults? (Though I doubt it is the issue, I'm happy to be thorough.)

>> So I'm led to believe that zfs receive is spending almost 100% of a
>> single CPU's time doing a lot of genunix`list_next ...
>
> Or maybe it is only doing 3.7% of this. There seems to be a whole lot of
> nothing going on.

See the math above -- mpstat also shows a single CPU at around 90% system time, and since top reports zfs as the only active process, the circumstantial evidence is fairly indicative.

>> Any ideas what is going on here?
>
> Definitely check if low-level faults are being reported to fmd.

Are there any logs other than /var/adm/messages I should check? (Apologies, I'm quite new to (Open)Solaris.)

Regards,
--
Lachlan Mulcahy
Senior DBA, Marin Software Inc., San Francisco, USA
AU Mobile: +61 458 448 721   US Mobile: +1 (415) 867 2839   Office: +1 (415) 671 6080
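To make the per-CPU picture explicit without dividing by 24 by hand, a kernel profile keyed by CPU id would show the hot core and the functions it is burning time in (a sketch; 997 Hz is just a conventional prime sampling rate):

  dtrace -n 'profile-997 /arg0/ { @[cpu, func(arg0)] = count(); }'

Let it run for ten seconds or so and interrupt it; if the hotkernel reading is right, the heavily loaded CPU should dominate the output with genunix`list_next.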
Re: [zfs-discuss] zfs receive slowness - lots of systime spent in genunix`list_next ?
On Mon, 5 Dec 2011, Lachlan Mulcahy wrote:

> Anything else you suggest I check for faults? (Though I doubt it is the
> issue, I'm happy to be thorough.)

Try running 'fmdump -ef' and see if new low-level fault events are coming in during the zfs receive.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
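A few related FMA checks that go along with that suggestion (standard Solaris commands, listed here only as a supplement):

  fmadm faulty     # any faults fmd has actually diagnosed
  fmstat           # per-module event and error counters inside fmd
  fmdump -ef       # follow new error telemetry (ereports) as it arrives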
Re: [zfs-discuss] zfs receive slowness - lots of systime spent in genunix`list_next ?
Hi Bob,

On Mon, Dec 5, 2011 at 12:31 PM, Bob Friesenhahn <bfrie...@simple.dallas.tx.us> wrote:

> On Mon, 5 Dec 2011, Lachlan Mulcahy wrote:
>> Anything else you suggest I check for faults? (Though I doubt it is the
>> issue, I'm happy to be thorough.)
>
> Try running 'fmdump -ef' and see if new low-level fault events are coming
> in during the zfs receive.

Just a bunch of what I would guess are unrelated USB errors?

  Dec 05 20:58:09.0246 ereport.io.usb.epse
  Dec 05 20:58:50.9207 ereport.io.usb.epse
  Dec 05 20:59:36.0242 ereport.io.usb.epse
  Dec 05 20:59:39.0230 ereport.io.usb.epse
  Dec 05 21:00:21.0223 ereport.io.usb.epse
  Dec 05 21:01:06.0215 ereport.io.usb.epse
  Dec 05 21:01:09.0314 ereport.io.usb.epse
  Dec 05 21:01:50.9213 ereport.io.usb.epse
  Dec 05 21:02:36.0299 ereport.io.usb.epse
  Dec 05 21:02:39.0298 ereport.io.usb.epse

Regards,
--
Lachlan Mulcahy
Senior DBA, Marin Software Inc., San Francisco, USA
AU Mobile: +61 458 448 721   US Mobile: +1 (415) 867 2839   Office: +1 (415) 671 6080
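If anyone wants the detail behind those events, the full payloads can be pulled out by event class (a sketch; -c simply filters on the class string shown above):

  fmdump -eV -c ereport.io.usb.epse | tail -60

The verbose output includes the detector (device path) for each ereport, which should settle whether they really are unrelated to the pool's disks.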
Re: [zfs-discuss] LSI 3GB HBA SAS Errors (and other misc)
Here's LSIUTIL after swapping to a 6GB backplane and dual 9211-8i cards on a fresh boot. Much better. :)

Adapter Phy 0:  Link Up, No Errors
Adapter Phy 1:  Link Up, No Errors
Adapter Phy 2:  Link Up, No Errors
Adapter Phy 3:  Link Up, No Errors
Adapter Phy 4:  Link Down, No Errors
Adapter Phy 5:  Link Down, No Errors
Adapter Phy 6:  Link Down, No Errors
Adapter Phy 7:  Link Down, No Errors

Expander (Handle 0009) Phy 0:  Link Down, No Errors
Expander (Handle 0009) Phy 1:  Link Down, No Errors
Expander (Handle 0009) Phy 2:  Link Down, No Errors
Expander (Handle 0009) Phy 3:  Link Down, No Errors
Expander (Handle 0009) Phy 4:  Link Up
  Invalid DWord Count 8, Running Disparity Error Count 7, Loss of DWord Synch Count 2, Phy Reset Problem Count 0
Expander (Handle 0009) Phy 5:  Link Up
  Invalid DWord Count 8, Running Disparity Error Count 6, Loss of DWord Synch Count 2, Phy Reset Problem Count 0
Expander (Handle 0009) Phy 6:  Link Up
  Invalid DWord Count 8, Running Disparity Error Count 5, Loss of DWord Synch Count 2, Phy Reset Problem Count 0
Expander (Handle 0009) Phy 7:  Link Up
  Invalid DWord Count 8, Running Disparity Error Count 7, Loss of DWord Synch Count 2, Phy Reset Problem Count 0
Expander (Handle 0009) Phy 8:  Link Up
  Invalid DWord Count 8, Running Disparity Error Count 7, Loss of DWord Synch Count 2, Phy Reset Problem Count 0
Expander (Handle 0009) Phy 9:  Link Up
  Invalid DWord Count 8, Running Disparity Error Count 4, Loss of DWord Synch Count 2, Phy Reset Problem Count 0
Expander (Handle 0009) Phy 10:  Link Up
  Invalid DWord Count 8, Running Disparity Error Count 6, Loss of DWord Synch Count 2, Phy Reset Problem Count 0
Expander (Handle 0009) Phy 11:  Link Up
  Invalid DWord Count 8, Running Disparity Error Count 6, Loss of DWord Synch Count 2, Phy Reset Problem Count 0
Expander (Handle 0009) Phy 12:  Link Down, No Errors
Expander (Handle 0009) Phy 13:  Link Down, No Errors
Expander (Handle 0009) Phy 14:  Link Down, No Errors
Expander (Handle 0009) Phy 15:  Link Down, No Errors
Expander (Handle 0009) Phy 16:  Link Down, No Errors
Expander (Handle 0009) Phy 17:  Link Down, No Errors
Expander (Handle 0009) Phy 18:  Link Up, No Errors
Expander (Handle 0009) Phy 19:  Link Up, No Errors
Expander (Handle 0009) Phy 20:  Link Down, No Errors
Expander (Handle 0009) Phy 21:  Link Down, No Errors
Expander (Handle 0009) Phy 22:  Link Down, No Errors
Expander (Handle 0009) Phy 23:  Link Down, No Errors
Expander (Handle 0009) Phy 24:  Link Down, No Errors
Expander (Handle 0009) Phy 25:  Link Down, No Errors
Expander (Handle 0009) Phy 26:  Link Down, No Errors
Expander (Handle 0009) Phy 27:  Link Down, No Errors
Expander (Handle 0009) Phy 28:  Link Up, No Errors
Expander (Handle 0009) Phy 29:  Link Up, No Errors
Expander (Handle 0009) Phy 30:  Link Up, No Errors
Expander (Handle 0009) Phy 31:  Link Up, No Errors
Expander (Handle 0009) Phy 32:  Link Up, No Errors
Expander (Handle 0009) Phy 33:  Link Up, No Errors
Expander (Handle 0009) Phy 34:  Link Down, No Errors
Expander (Handle 0009) Phy 35:  Link Down, No Errors
Expander (Handle 0009) Phy 36:  Link Up, No Errors
Expander (Handle 0009) Phy 37:  Link Down, No Errors
Re: [zfs-discuss] LSI 3GB HBA SAS Errors (and other misc)
Whoops. Make that 9211-4i cards. :) Still promising.
Re: [zfs-discuss] Scrub found error in metadata:0x0, is that always fatal? No checksum errors now...
Well, I have an intermediate data point. One scrub run completed without finding any newer errors (besides one at the pool level and two at the raidz2 level). Zpool clear alone did not fix it, meaning that the pool's metadata:0x0 was still reported as problematic, but a second attempt at zpool clear did clear the errors from zpool status.

Before running zdb as asked by other commenters, I decided to rescrub. At some point between 200GB and 1.7TB scanned, the errors returned to the stats. So, in contrast with Nigel's optimistic theory that metadata is extra-redundant anyway and should be easily fixable, it seems that I do still have the problem. It does not show itself in practice as of yet, but it is found by scrub ;)

After a few days to complete the current scrub, I plan to run zdb as asked by Steve. If anyone else has some theories, suggestions or requests to dig up more clues - bring them on! ;)

2011-12-02 20:08, Nigel W wrote:
> On Fri, Dec 2, 2011 at 02:58, Jim Klimov <jimkli...@cos.ru> wrote:
>> My question still stands: is it possible to recover from this error or
>> somehow safely ignore it? ;) I mean, without backing up data and
>> recreating the pool?
>>
>> If the problem is in metadata but presumably the pool still works, then
>> this particular metadata is either not critical or redundant, and somehow
>> can be forged and replaced by valid metadata. Is this a rightful path of
>> thought? Are there any tools to remake such a metadata block?
>>
>> Again, I did not try to export/reimport the pool yet, except for that
>> time 3 days ago when the machine hung, was reset, imported the pool and
>> continued the scrub automatically... I think it is now too late to do an
>> export and a rollback import, too...
>
> Unfortunately I cannot provide you with a direct answer as I have only
> been a user of ZFS for about a year and in that time only encountered
> this once.
>
> Anecdotally, at work I had something similar happen to a Nexenta Core 3.0
> (b134) box three days ago (seemingly caused by a hang and eventual panic
> as a result of attempting to add a drive that is having read failures to
> the pool). When the box came back up, zfs reported an error in
> metadata:0x0. We scrubbed the tank (~400GB used) and, like in your case,
> the checksum error didn't clear. We ran a scrub again and it seems that
> the second scrub did clear the metadata error.
>
> I don't know if that means it will work that way for everyone, every
> time, or not. But considering that the pool and the data on it appear to
> be fine (just not having any replicas until we get the bad disk replaced)
> and that all metadata is supposed to have copies+1 copies (with an
> apparent max of 3 copies [1]) on the pool at all times, I can't see why
> this error shouldn't be cleared by a scrub.
>
> [1] http://blogs.oracle.com/relling/entry/zfs_copies_and_data_protection

//Jim
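For anyone following along, the sequence being discussed is roughly this (a sketch; the pool name is a placeholder, and whether a rescrub actually clears metadata:0x0 clearly varies case by case):

  zpool status -v mypool     # check whether the metadata:0x0 entry is still listed
  zpool scrub mypool         # rescrub
  zpool status -v mypool     # wait for the scrub to finish, then re-check
  zpool clear mypool         # then try clearing the error counters again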
Re: [zfs-discuss] zfs receive slowness - lots of systime spent in genunix`list_next ?
Hi All,

Just a follow up - it seems like whatever it was doing it eventually got done with, and the speed picked back up again. The send/recv finally finished -- I guess I could do with a little patience :)

Lachlan

On Mon, Dec 5, 2011 at 10:47 AM, Lachlan Mulcahy <lmulc...@marinsoftware.com> wrote:

> Hi All,
>
> We are currently doing a zfs send/recv with mbuffer to send incremental
> changes across, and it seems to be running quite slowly, with zfs receive
> the apparent bottleneck. The process itself seems to be using almost 100%
> of a single CPU in sys time.
>
> [... rest of the original post snipped; see the first message in this
> thread ...]

--
Lachlan Mulcahy
Senior DBA, Marin Software Inc., San Francisco, USA
AU Mobile: +61 458 448 721   US Mobile: +1 (415) 867 2839   Office: +1 (415) 671 6080
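For anyone who hits the same thing later, two easy ways to tell whether a quiet-looking receive is still making progress (a sketch; the pool name and PID below are just the ones from the messages above):

  zpool iostat tank 60       # per-interval write bandwidth on the receiving pool
  prstat -mLp 22945          # microstate accounting for the zfs receive process (watch the SYS column)

If the pool is still taking writes and the receive thread is still accumulating SYS time, it is working through something (here, whatever was driving all the list_next calls) rather than being hung.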
Re: [zfs-discuss] weird bug with Seagate 3TB USB3 drive
In message <4ebbfb5...@oracle.com>, Cindy Swearingen writes:

> CR 7102272: ZFS storage pool created on a 3 TB USB 3.0 device has device
> label problems
>
> Let us know if this is still a problem in the OS11 FCS release.

I finally got upgraded from Solaris 11 Express SRU 12 to S11 FCS. Solaris 11 11/11 still spews the "I/O request is not aligned with 4096 disk sector size" warnings, but zpool(1M) create's label persists and I can export and import between systems.

John
groenv...@acm.org
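A quick way to see what allocation alignment the pool itself ended up with is to read the vdev ashift out of the pool configuration or a device label (a sketch; the pool name and device path are placeholders, and ashift=12 means 4 KB-aligned allocations while ashift=9 means 512-byte):

  zdb -C mypool | grep ashift
  zdb -l /dev/dsk/c0t0d0s0 | grep ashift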
[zfs-discuss] How to get NFS to work with ZFS?
I am going to share a dir and its subdirs through NFS to virtual hosts, including XEN (CentOS/NetBSD) and ESXi, but it failed. These are the steps I took:

On Solaris 11:

  zfs create tank/iso
  zfs create tank/iso/linux
  zfs create tank/iso/windows
  share -F nfs -o rw,nosuid,root=VM-host1:VM-host2 /tank/iso
  chmod -R 777 /tank/iso

On CentOS:

  mkdir /home/iso
  mount -t nfs -o rw,nosuid solaris11:/tank/iso /home/iso
  echo newfile > /home/iso/newfile.txt           (succeeds)
  echo newfile > /home/iso/linux/newfile.txt     (fails with "permission denied")

Then I checked the dir on Solaris 11:

  ls -al /tank/iso
  drwxrwxrwx   5 root   root    8 Dec  5 13:04 .
  drwxr-xr-x   4 root   root    4 Dec  2 22:45 ..
  drwxrwxrwx   2 root   root    2 Dec  2 16:54 bsd
  drwxrwxrwx   2 root   root    2 Dec  2 16:54 linux
  -rw-r--r--   1 nobody nobody  8 Dec  5 12:57 newfile.txt
  drwxrwxrwx   2 root   root    2 Dec  2 16:54 windows

and on CentOS:

  ls -al /home/iso
  drwxr-xr-x+  2 root      root      2 Dec  2 16:54 bsd
  drwxr-xr-x+  2 root      root      2 Dec  2 16:54 linux
  -rw-r--r--+  1 nfsnobody nfsnobody 8 Dec  5 12:57 newfile.txt
  drwxr-xr-x+  2 root      root      2 Dec  2 16:54 windows

I have a couple of questions:

1. Why is the owner of newfile.txt nfsnobody on CentOS, but nobody on Solaris?
2. Why do the subdirs not have write access, and how do I get it?
3. What does the "+" mean?
4. Do I need to remount a shared dir after changing file access on Solaris (the NFS server)?
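One detail worth noting: tank/iso/linux and tank/iso/windows were created as separate ZFS datasets, so they are separate filesystems on the server, and sharing /tank/iso by itself does not export them -- over NFSv3 the client ends up looking at the underlying mountpoint directories in the parent filesystem rather than at the child datasets, which would explain why they appear as non-writable drwxr-xr-x on the client even though the server shows 777. A sketch (untested) of sharing and mounting each child dataset explicitly, reusing the exact options from the post:

  # on Solaris 11 (server) -- share each child dataset as well as the parent
  share -F nfs -o rw,nosuid,root=VM-host1:VM-host2 /tank/iso/linux
  share -F nfs -o rw,nosuid,root=VM-host1:VM-host2 /tank/iso/windows

  # on CentOS (client) -- mount each child filesystem on its own mountpoint
  mount -t nfs -o rw,nosuid solaris11:/tank/iso/linux /home/iso/linux
  mount -t nfs -o rw,nosuid solaris11:/tank/iso/windows /home/iso/windows

(The "+" in the CentOS listing simply means the files carry an ACL in addition to the basic mode bits.)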