*** This bug is a duplicate of bug 1632462 *** https://bugs.launchpad.net/bugs/1632462
------- Comment From cdead...@us.ibm.com 2016-10-28 05:44 EDT------- cde00 (cdead...@us.ibm.com) deleted native attachment sosreport-iap03.144150-20160726151858.tar.xz on 2016-10-28 04:33:22 cde00 (cdead...@us.ibm.com) added native attachment /tmp/AIXOS05964145 /sosreport-iap03.144150-20160726151858.tar.xz on 2016-10-28 04:33:26 -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1632466 Title: STC860:Tuleta-L:KVM:iap03:HTX logs miscompares on multiple adapters on ubuntu 16.04 guest. Status in linux package in Ubuntu: New Bug description: == Comment: #0 - Application Cdeadmin <cdead...@us.ibm.com> - 2016-07-26 11:20:20 == == Comment: #1 - Application Cdeadmin <cdead...@us.ibm.com> - 2016-07-26 11:20:22 == ==== State: Open by: panico on 25 July 2016 18:02:09 ==== Contact Information: ==================== Defect Originator: Michael Panico Defect Originator pan...@us.ibm.com Backup: Deepti Umarani Backup ST ID: Deepti S Umarani/India/IBM System Info: ============ Machine Type:............8284-22A Card Type:...............FSP2_P8LE Current Boot Side:.......T Next Boot Side:..........T PT_Swap:.................1 Current Side Driver:.....fips860/b0713a_1630.860 Ubuntu KVM Host: root@iaos1:/tmp# uname -a Linux iaos1 4.4.0-30-generic #49-Ubuntu SMP Fri Jul 1 10:00:36 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux root@iaos1:/tmp# lsb_release -a No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu Yakkety Yak (development branch) Release: 16.10 Codename: yakkety Ubuntu Guest: [htx@iap03] [1m/usr/lpp/htx/bin# [0m uname -a Linux iap03 4.4.0-31-generic #50-Ubuntu SMP Wed Jul 13 00:05:18 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux [htx@iap03] [1m/usr/lpp/htx/bin# [0m lsb_release -a No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu 16.04.1 LTS Release: 16.04 Codename: xenial HTX level on ubuntu 16.04 guest: [htx@iap03] [1m/usr/lpp/htx/bin# [0m dpkg -l | grep htxubuntu ii htxubuntu 402 ppc64el Problem Description: ==================== Running networks (via bridging) on three different guests (other linux, ubuntu 16.10 and ubuntu 16.04) the ubuntu 16.04 guest logged a miscompare on the austin adapter and the shinerT adapter. The shinerT ran for 24 hours and the austin for 72 hours before logging miscompares. From /tmp/htxerr: [htx@iap03] [1m/usr/lpp/htx/bin# [0m cat /tmp/htxerr /dev/enp0s6 Jul 23 14:29:00 2016 err=00e30000 sev=4 hxecom R(102.1.2.20.50854) connected to W(102.1.1.20.39073) Miscompare with packet 0, data size = 51062. miscompare_count=0 pakLen = 51082 Miscompare at displacement (decimal) = 23876 wbuf = ffffffffffffffffffffffffffffffffffffffff rbuf = 55555555555555555555555555555555ffffffff For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531 Consecutive prior good reads= 70739865, good bytes read= 2868570843930 /dev/enp0s6 Jul 23 14:29:00 2016 err=00d80000 sev=4 hxecom Shutting down testing due to error flag - 1 R(102.1.2.20.50854) connected to W(102.1.1.20.39073) For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531 Consecutive prior good reads= 0, good bytes read= 0 /dev/enp0s7 Jul 23 14:29:00 2016 err=00fb0068 sev=1 hxecom W(102.1.1.20.39073) connected to R(102.1.2.20.50854) Wrote 878 characters, expected 963 Write failed - Connection reset by peer. For stanza 1, TCP connection, bufmin=3, bufmax=2869, bufinc=94 Consecutive prior good writes= 25459432, good bytes written= 35283290616 htx_messages Jul 23 14:29:01 2016 err=00000000 sev=1 hxssup hxecom HE program for enp0s6 terminated by exit(0) call. htx_messages Jul 23 16:29:07 2016 err=00000000 sev=4 hang_monitor hxecom for enp0s7 is HUNG! Max run time (set in mdt) = 7200 seconds. Current elasped time = 7206 seconds. htx_messages Jul 23 18:29:07 2016 err=00000000 sev=4 hang_monitor hxecom for enp0s7 is HUNG! Max run time (set in mdt) = 7200 seconds. Current elasped time = 14406 seconds. htx_messages Jul 23 20:29:07 2016 err=00000000 sev=4 hang_monitor hxecom for enp0s7 is HUNG! Max run time (set in mdt) = 7200 seconds. Current elasped time = 21606 seconds. htx_messages Jul 23 22:29:08 2016 err=00000000 sev=4 hang_monitor hxecom for enp0s7 is HUNG! Max run time (set in mdt) = 7200 seconds. Current elasped time = 28807 seconds. htx_messages Jul 24 00:29:08 2016 err=00000000 sev=4 hang_monitor hxecom for enp0s7 is HUNG! Max run time (set in mdt) = 7200 seconds. Current elasped time = 36007 seconds. /dev/enp0s4 Jul 25 04:56:13 2016 err=00e30000 sev=4 hxecom R(103.1.2.20.36665) connected to W(103.1.1.20.40459) Miscompare with packet 0, data size = 30000. miscompare_count=0 pakLen = 30020 Miscompare at displacement (decimal) = 17508 wbuf = ffffffffffffffffffffffffffffffffffffffff rbuf = e7f0e11400000000333039373600ffffffffffff For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531 Consecutive prior good reads= 42699991, good bytes read= 1731422119820 /dev/enp0s4 Jul 25 04:56:13 2016 err=00d80000 sev=4 hxecom Shutting down testing due to error flag - 1 R(103.1.2.20.36665) connected to W(103.1.1.20.40459) For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531 Consecutive prior good reads= 0, good bytes read= 0 /dev/enp0s5 Jul 25 04:56:13 2016 err=00fb0068 sev=1 hxecom W(103.1.1.20.40459) connected to R(103.1.2.20.36665) Wrote 13877 characters, expected 40551 Write failed - Connection reset by peer. For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531 Consecutive prior good writes= 42700201, good bytes written= 1731430540751 htx_messages Jul 25 04:56:14 2016 err=00000000 sev=1 hxssup hxecom HE program for enp0s4 terminated by exit(0) call. /dev/enp0s5 Jul 25 04:56:13 2016 err=00d80000 sev=4 hxecom Shutting down testing due to error flag - 10 W(103.1.1.20.40459) connected to R(103.1.2.20.36665) For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531 Consecutive prior good writes= 0, good bytes written= 0 The bridges in the guest xml: <interface type='bridge'> <mac address='52:54:00:ab:08:11'/> <source bridge='brenP3p5s0f0'/> <model type='virtio'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/> </interface> <interface type='bridge'> <mac address='52:54:00:96:cb:70'/> <source bridge='brenP3p5s0f2'/> <model type='virtio'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/> </interface> <interface type='bridge'> <mac address='52:54:00:26:6f:ec'/> <source bridge='brenP3p5s0f3'/> <model type='virtio'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/> </interface> <interface type='bridge'> <mac address='52:54:00:a8:b0:09'/> <source bridge='brenP1p3s0f0'/> <model type='virtio'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/> </interface> <interface type='bridge'> <mac address='52:54:00:a1:e6:4d'/> <source bridge='brenP1p3s0f1'/> <model type='virtio'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/> </interface> <interface type='bridge'> <mac address='52:54:00:b5:27:2a'/> <source bridge='brenp1s0f0'/> <model type='virtio'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/> </interface> <interface type='bridge'> <mac address='52:54:00:be:da:6c'/> <source bridge='brenp1s0f1'/> <model type='virtio'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/> </interface> There is also another defect opened (SW360179) on a glacier park logging a miscompare on another ubuntu 16.04 guest, but on a ubuntu 16.04 KVM host. ==== State: Open by: avkuttan on 26 July 2016 04:49:05 ==== Summary : ========= device : /dev/enp0s6 Miscompare offset (decimal) : 23876 Transfer size (decimal) : 51082 Miscompare length (decimal) : 16 Reader with IP (102.1.2.20.50854) is connected to Writer with (102.1.1.20.39073) on enp0s6. This miscompare shows 16 bytes of unexpected data in the read buffer. Analysis details: ================ /tmp/htxerr: ------------ /dev/enp0s6 Jul 23 14:29:00 2016 err=00e30000 sev=4 hxecom <<=== device (/dev/enp0s6) R(102.1.2.20.50854) connected to W(102.1.1.20.39073) <<=== TCP Connection details : Writer(102.1.1.20.39073) and Reader (102.1.2.20.50854) Miscompare with packet 0, data size = 51062. miscompare_count=0 pakLen = 51082 <<=== Transfer size (51062 + 20 for header) Miscompare at displacement (decimal) = 23876 <<=== Miscompare offset (23876) wbuf = ffffffffffffffffffffffffffffffffffffffff rbuf = 55555555555555555555555555555555ffffffff For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531 Consecutive prior good reads= 70739865, good bytes read= 2868570843930 /dev/enp0s6 Jul 23 14:29:00 2016 err=00d80000 sev=4 hxecom Shutting down testing due to error flag - 1 R(102.1.2.20.50854) connected to W(102.1.1.20.39073) For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531 Consecutive prior good reads= 0, good bytes read= 0 Expected data (write buffer) at miscompare location: ---------------------------------------------------- 00005c80: 89d3 bb05 0000 0000 3233 3638 3000 ffff ........23680... 00005c90: 0100 2f64 6576 2f65 6e70 3073 3700 0000 ../dev/enp0s7... 00005ca0: 0030 3030 3031 3432 3934 00ff ffff ffff .000014294...... 00005cb0: ffff ffff ffff ffff ffff ffff ffff ffff ................ 00005cc0: ffff ffff ffff ffff ffff ffff ffff ffff ................ 00005cd0: ffff ffff ffff ffff ffff ffff ffff ffff ................ 00005ce0: ffff ffff ffff ffff ffff ffff ffff ffff ................ 00005cf0: ffff ffff ffff ffff ffff ffff ffff ffff ................ 00005d00: 89d3 bb05 0000 0000 3233 3830 3800 ffff ........23808... 00005d10: 0100 2f64 6576 2f65 6e70 3073 3700 0000 ../dev/enp0s7... 00005d20: 0030 3030 3031 3432 3934 00ff ffff ffff .000014294...... 00005d30: ffff ffff ffff ffff ffff ffff ffff ffff ................ 00005d40: ffff ffff ffff ffff ffff ffff ffff ffff ................ <<=== start of miscompare 00005d50: ffff ffff ffff ffff ffff ffff ffff ffff ................ <<=== end of miscompare 00005d60: ffff ffff ffff ffff ffff ffff ffff ffff ................ 00005d70: ffff ffff ffff ffff ffff ffff ffff ffff ................ 00005d80: 89d3 bb05 0000 0000 3233 3933 3600 ffff ........23936... 00005d90: 0100 2f64 6576 2f65 6e70 3073 3700 0000 ../dev/enp0s7... 00005da0: 0030 3030 3031 3432 3934 00ff ffff ffff .000014294...... 00005db0: ffff ffff ffff ffff ffff ffff ffff ffff ................ 00005dc0: ffff ffff ffff ffff ffff ffff ffff ffff ................ 00005dd0: ffff ffff ffff ffff ffff ffff ffff ffff ................ 00005de0: ffff ffff ffff ffff ffff ffff ffff ffff ................ 00005df0: ffff ffff ffff ffff ffff ffff ffff ffff ................ Actual data (read buffer) at miscompare location: ------------------------------------------------- 00005c80: 89d3 bb05 0000 0000 3233 3638 3000 ffff ........23680... 00005c90: 0100 2f64 6576 2f65 6e70 3073 3700 0000 ../dev/enp0s7... 00005ca0: 0030 3030 3031 3432 3934 00ff ffff ffff .000014294...... 00005cb0: ffff ffff ffff ffff ffff ffff ffff ffff ................ 00005cc0: ffff ffff ffff ffff ffff ffff ffff ffff ................ 00005cd0: ffff ffff ffff ffff ffff ffff ffff ffff ................ 00005ce0: ffff ffff ffff ffff ffff ffff ffff ffff ................ 00005cf0: ffff ffff ffff ffff ffff ffff ffff ffff ................ 00005d00: 89d3 bb05 0000 0000 3233 3830 3800 ffff ........23808... 00005d10: 0100 2f64 6576 2f65 6e70 3073 3700 0000 ../dev/enp0s7... 00005d20: 0030 3030 3031 3432 3934 00ff ffff ffff .000014294...... 00005d30: ffff ffff ffff ffff ffff ffff ffff ffff ................ 00005d40: ffff ffff 5555 5555 5555 5555 5555 5555 ....UUUUUUUUUUUU <<=== start of miscompare 00005d50: 5555 5555 ffff ffff ffff ffff ffff ffff UUUU............ <<=== end of miscompare 00005d60: ffff ffff ffff ffff ffff ffff ffff ffff ................ 00005d70: ffff ffff ffff ffff ffff ffff ffff ffff ................ 00005d80: 89d3 bb05 0000 0000 3233 3933 3600 ffff ........23936... 00005d90: 0100 2f64 6576 2f65 6e70 3073 3700 0000 ../dev/enp0s7... 00005da0: 0030 3030 3031 3432 3934 00ff ffff ffff .000014294...... 00005db0: ffff ffff ffff ffff ffff ffff ffff ffff ................ 00005dc0: ffff ffff ffff ffff ffff ffff ffff ffff ................ 00005dd0: ffff ffff ffff ffff ffff ffff ffff ffff ................ 00005de0: ffff ffff ffff ffff ffff ffff ffff ffff ................ 00005df0: ffff ffff ffff ffff ffff ffff ffff ffff ................ Miscompare side wise: ==================== 00005d30: ffff ffff ffff ffff ffff ffff ffff ffff .......... 00005d30: ffff ffff ffff ffff ffff ffff ffff ffff .......... 00005d40: ffff ffff ffff ffff ffff ffff ffff ffff .......... | 00005d40: ffff ffff 5555 5555 5555 5555 5555 5555 ....UUUUUU 00005d50: ffff ffff ffff ffff ffff ffff ffff ffff .......... | 00005d50: 5555 5555 ffff ffff ffff ffff ffff ffff UUUU...... 00005d60: ffff ffff ffff ffff ffff ffff ffff ffff .......... 00005d60: ffff ffff ffff ffff ffff ffff ffff ffff .......... Miscompare started at = 0x00005d45 Miscompare ended at = 0x00005d55 Miscompare length = x00005d55 - 0x00005d45 = 0x10 = 16 (decimal) Buffer files are dumped into readable form as: /tmp/htxenp0s6.wbuf_1 ==>> /tmp/wbuf_1_dump /tmp/htxenp0s6.rbuf_1 ==>> /tmp/rbuf_1_dump Transfer this defect to Linux team for further analysis. ==== State: Open by: kalok on 26 July 2016 06:32:03 ==== transferred to ppclinux_nonconfidential_BZilla using CMVC. updates not visible in BQ. ==== State: Open by: panico on 26 July 2016 10:11:29 ==== The same host logged another two miscompares on an ubuntu 16.10 guest (iap02.aus.stglabs.ibm.com (9.3.242.19)) on the austin adapter in slot C10 and the glacier park adapter in slot C7. [htx@iap02] [1m/usr/lpp/htx/bin# [0m cat /tmp/htxerr /dev/enp0s10 Jul 26 01:24:25 2016 err=00e30000 sev=4 hxecom R(101.1.2.19.33643) connected to W(101.1.1.19.40555) Miscompare with packet 0, data size = 51062. miscompare_count=0 pakLen = 51082 Miscompare at displacement (decimal) = 35954 wbuf = aaaaaaaaaaaaaaaaaaaaaaaaaaaa6b3af7020000 rbuf = fffffffffffffffffffffffffffffffff7020000 For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531 Consecutive prior good reads= 32932090, good bytes read= 1335345881380 /dev/enp0s10 Jul 26 01:24:25 2016 err=00d80000 sev=4 hxecom Shutting down testing due to error flag - 1 R(101.1.2.19.33643) connected to W(101.1.1.19.40555) For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531 Consecutive prior good reads= 0, good bytes read= 0 htx_messages Jul 26 01:24:26 2016 err=00000000 sev=1 hxssup hxecom HE program for enp0s10 terminated by exit(0) call. /dev/enp0s11 Jul 26 01:24:26 2016 err=00fb0068 sev=1 hxecom W(101.1.1.19.40555) connected to R(101.1.2.19.33643) Wrote 1980 characters, expected 51082 Write failed - Connection reset by peer. For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531 Consecutive prior good writes= 32932264, good bytes written= 1335354769648 /dev/enp0s11 Jul 26 01:24:26 2016 err=00d80000 sev=4 hxecom Shutting down testing due to error flag - 10 W(101.1.1.19.40555) connected to R(101.1.2.19.33643) For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531 Consecutive prior good writes= 0, good bytes written= 0 /dev/enp0s11 Jul 26 01:24:26 2016 err=002a0068 sev=4 hxecom SetUpConnect: Error Connecting to (9.3.242.19,41525) - Connection reset by peer. /dev/enp0s11 Jul 26 01:24:39 2016 err=002a006f sev=4 hxecom SetUpConnect: Error Connecting to (9.3.242.19,41525) - Connection refused. /dev/enp0s4 Jul 26 01:25:14 2016 err=00e30000 sev=4 hxecom R(103.1.2.19.43035) connected to W(103.1.1.19.53023) Miscompare with packet 0, data size = 51062. miscompare_count=0 pakLen = 51082 Miscompare at displacement (decimal) = 16166 wbuf = 3433313400aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa rbuf = 0200000000323232373200aaaa01002faaaaaaaa For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531 Consecutive prior good reads= 12475849, good bytes read= 505864438618 /dev/enp0s4 Jul 26 01:25:14 2016 err=00d80000 sev=4 hxecom Shutting down testing due to error flag - 1 R(103.1.2.19.43035) connected to W(103.1.1.19.53023) For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531 Consecutive prior good reads= 0, good bytes read= 0 htx_messages Jul 26 01:25:14 2016 err=00000000 sev=1 hxssup hxecom HE program for enp0s4 terminated by exit(0) call. /dev/enp0s5 Jul 26 01:25:14 2016 err=00fb0068 sev=1 hxecom W(103.1.1.19.53023) connected to R(103.1.2.19.43035) Wrote 10254 characters, expected 51082 Write failed - Connection reset by peer. For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531 Consecutive prior good writes= 12475989, good bytes written= 505871590098 /dev/enp0s5 Jul 26 01:25:14 2016 err=00d80000 sev=4 hxecom Shutting down testing due to error flag - 10 W(103.1.1.19.53023) connected to R(103.1.2.19.43035) For stanza 0, TCP connection, bufmin=30000, bufmax=65000, bufinc=10531 Consecutive prior good writes= 0, good bytes written= 0 /dev/enp0s5 Jul 26 01:25:14 2016 err=002a0068 sev=4 hxecom SetUpConnect: Error Connecting to (9.3.242.19,50744) - Connection reset by peer. /dev/enp0s5 Jul 26 01:25:26 2016 err=002a006f sev=4 hxecom SetUpConnect: Error Connecting to (9.3.242.19,50744) - Connection refused. == Comment: #84 - Dwip N. Banerjee <dnban...@us.ibm.com> - 2016-10-10 11:43:26 == This is going to be a long post (my apologies). But it has, I believe, the root cause analysis for this issue, based on what I have discovered till now. As has been mentioned this issue has been worked on, on a 24X7 including myself. First I want to correct something I said in comment 61. I did the analysis as mentioned there. But when looking at the data on the receiver - which had captured the miscompare packet - I had misconstrued the position of the error. I had mistakenly picked the one at the end - and failed to notice the actual miscompare (the "XXXX...X" in a see of "YYYYYY"). That had mistakenly pushed me towards considering an additional problem (of possible reordering of packets as well). This caused a proliferation of possibilities to consider - on which I spent quite a bit of time before I went back to drawing board again to revalidate. I tracked the packets down through every concerned module and realized that there was no reordering. I then went back to the old data and discovered the missed sequences. So that at least narrowed down the problem to one (a nasty one albeit) - a case of data modification with a specific signature. the last 16 bytes of a page were getting corrupted. This was visible when transmitting the packet. The visibility/detection signature: -- always on the receive side -- always on the tun output and bridge on sender -- sometimes on the vhost transmit -- sometimes (rare) on the guest transmit. But from the signature, there is no clue as to whether the corruption is on - receive side - send side OR if it is on: - virtio/guest side - vhost/host side While my hunch was that the problem was likely on the receive side (because it was so close to the edge of a page, it almost seemed like someone had mistakenly read past their page), it has been very hard to narrow down or to validate. I had put in my own logs on the host and guest to capture the descriptor flows in order to be able to validate the access of the addresses and if the suspect address (the one NEXT to the corrupted one, in the next page) was being accessed - especially be receivers. As it turned out sometimes they were indeed accessed by receivers and also by transmitters. I had to put timestamps to sort them out. One had to consider the possibility of races in the virtio/vhost communication -- memory barrier issues -- race possibilities -- cases where one side prematurely accessed descriptors before they were advertised by the producer -- descriptor corruption -- other scenarios. I did observe the following interesting points: - typically the virtio queues were running at close to full (especially for the slower Austin adapter) - the vhost receive queues for the MLX adapters were empty i.e. there were lots of skbs lying on the tun receive queues, waiting to be delivered to the guest. The above does bring into the picture the possibility of possible barrier/race causing premature exposure and leading to the corruption seen. I did consider various scenarios in this realm, both for send and receives on the guest/host but my logging didn't validate any of the scenarios. There is no point in cataloging them now - suffice it to say that none of them panned out. Eventually, I decided to focus on the receive side since it appeared to be the likely place for the data modification based on the signatures that were seen. I instrumented the receive side especially looking for the signatures we have observed. This was in addition to the logs I had encoded in the host/guest. This is when I started observing similar packet modification signatures as has been observed for this problem. I dumped the packet before and after looking for modification signatures like we are seeing. IN EVERY CASE I SAW A DATA CORRUPTION in the transmit path, I saw the SAME SIGNATURE WAS FLAGGED in the receive path on tun_recvmsg. Comparing timestamps, the modification observed on the tun receive path was always before (some 30/40ms to 100/200 ms before). Also the change happened during the receive: -- before the receive when the packet was dumped it had original data -- after the receive, it was a corrupted packet -- and if I let it go through it will then hit the miscompare on the transmit side The signature captured in this receive intercept was similar to the miscompare... An example below: --------------------------------------------- Previously (last two lines) ffffffffffffffffffffffffffffffff <END PAGE - 16> ffffffffffffffffffffffffffffffff <END PAGE> current (last two lines) ffffffffffffffffffffffffffffffff <END PAGE - 16> cccccccccccccccccccccccccccccccc <END PAGE> cccccccccccccccccccccccccccccccc <START new PAGE> --------------------------------------------- When I let it run thru, this scenario will trigger a miscompare in the transmit path like we have been seeing all along. Thereafter, I put some instrumentation in the code to determine what was causing this. After going through a few this is the general sequence, which I am illustrating with an example below. Sequence leading to the corruption --------------------------------- A large packet is read in. Two iovecs assigned for this: 0. Buffers are posted via vhost (the sizes are important... esp the first one ... here they are 0x7700 and 0x10000). In every instance I observed, the first one had a buffer size of 7700. c000001fb485dd88: 00003ffb5ad90000 0000000000007700 ...Z.?...w...... c000001fb485dd98: 00003fff40cd0000 0000000000010000 ...@.?.......... 1.struct iovec { iov_base = 0x3ffb5ad90000, iov_len = 30464 } 2.<header> read from skb1 3.several GSO segmnents read in from the fragments of skb1 4.Once over, it reads the next skb2 from fraglist 5.In this skb, there is no linear data, 6.It reads the fragments from skb2. 7.After four complete fragments, the fifth one can not be accommodated in this iovec. This is where it gets interesting. 8. The iov_offset is at 716e when it issues a read for 0x5a8 bytes (one MSS/1448 bytes). But it can fit only 0x592 bytes. c000000001995a30: 00000000000005a8 000000000000716e ^^^iov_offset^^^ c000000001995a40: 0000000000000592 0000000000000000 ................ c000000001995a50: 00000000000005a8 c000001ec7937842 ........Bx...... ^^^^from^^^^^^ c000000001995a60: 00003ffb5ad9716e 0000000000000592 ^^^to/userspc^^^ ^^^amount read^^ The iov has only 7700-716e=0x592 space left, so it can not fit one full GSO segment. Now from is advanced: c000001ec7937242+0x592=c000001ec79377d4 to/userspace_buf is advanced: 716e+592=7700 i.e. end of iov. So next iovec is invoked. struct iovec { iov_base = 0x3fff40cd0000, iov_len = 65536 } Now we see the next read (into new iov) c000000001995aa0: 00000000000005a8 0000000000007700 .........w...... ^^^offset: first iov done c000000001995ab0: 0000000000000016 0000000000000026 ........&....... ^^amount to read ^^^left=returned by copy_to_user c000000001995ac0: 00000000000005a8 c000001ec7937dd4 .........}...... ^^amount wanted ^^^from^^^^^^^^ c000000001995ad0: 00003fff40cd0000 fffffffffffffff0 ...@.?.......... ^^^to^^^^^^^^^ ^^^len=-10!!!!!^^^ So we wanted to read 0x16 bytes, but __copy_to_user_inatomic returned 0x26 bytes as the amount still remaining to be read!!!! __copy_to_user_inatomic(00003fff40cd0000, c000001ec7937dd4, 0x16) = 0x26 THIS IS THE PROBLEM. Now our iov_len becomes negative = 0x16-0x26=fffffffffffffff0 We then apply this negative offset and modify the 16 buytes before the start of this page on the next read. Here is the next read: c000000001995b50: fffffffffffffff0 0000000000000026 ........&....... ^^^offset^^^^^^ ^^^ bytes to write c000000001995b60: 0000000000000000 00000000000005a8 ................ ^^^total wanted - from the initial c000000001995b70: c000001ec7937dc4 00003fff40ccfff0 .}.........@.?.. ^^^from^^^^^^^^ ^^^^to- previous page c000000001995b80: 0000000000000026 0000000000000016 &............... ^^^iov_len=amount read note that both from and to has regressed!!! from=c000001ec7937dd4+fffffffffffffff0=c000001ec7937dc4 to=0x3fff40cd0000+fffffffffffffff0=00003fff40ccfff0 i.e. we are re-reading 16 bytes from from and we are reading into the last 16 bytes of the previous page our offset is -16(dec) (will read into i.e. overwrite the the previous 16 bytes After this the reads normalize and behave the way they ought to. But those 16 bytes were completely messed up (16 re-read and 16 read into wrong place) - which were obviously wrong/unnecessary. For the instances, I observed, the following were common: - read into first iov of size 0x7700 - next iov starts on a new page - GSO segments read properly till truncation due to first iov running out of space. Last read=0x592. So 0x16 bytes remaining. At this point the "from" address seems to end in "d4" - not sure if that is just a coincidence. - next iov is start of page - __copy_to_user_inatomic(from, to, 0x16)=0x26 -- CULPRIT!!! - Now our len, offset will become negative -16(decimal) and we read into the last 16 bytes of the previous page - after this everything normalizes. Other than the 16 bytes of mayhem everything is fine! I investigated the call chain. Here is the call chain in invoking the copy_from_user routine... Call Chain ========== tun_put_user --> skb_copy_datagram_iter --> copy_to_iter --> copy_page_to_iter --> copy_page_to_iter_iovec -->__copy_to_user_inatomic (using atomic kmaps) --> __copy_tofrom_user ****************************************** __copy_to_user_inatomic pretty much passes everything to __copy_tofrom_user So that is likely where the focus has to be. The __copy_tofrom_user is in Linux/arch/powerpc/lib/copyuser_64.S ----------------------------------------------------------------- _GLOBAL_TOC(__copy_tofrom_user) BEGIN_FTR_SECTION nop FTR_SECTION_ELSE b __copy_tofrom_user_power7 ALT_FTR_SECTION_END_IFCLR(CPU_FTR_VMX_COPY) _GLOBAL(__copy_tofrom_user_base) /* first check for a whole page copy on a page boundary */ cmpldi cr1,r5,16 ... ****************************************** Note that NOT all of the above corruption will be detected!! In our case, if the page corrupted is one of the pages being CURRENTLY transmitted (on the vhost send queue) we will detect it. But the queue size is pretty small (256) - so probability is quite low. Or if we have to hold on to packets due to the possibility of TCP retransmission, they will be long-lived and will increase the probability as well. I did observe one TCP retransmission in one of the instances but I wasn't actively seeking those occurrences. The corrupted packets will be delivered to the guest - I did see a packet of size 0x16 on the guest in one of the crashes - which seemed bizarre (how can that be generated?) but now makes sense. It was just thrown away. I let the code run with some checks in copy_page_to_iter_iovec to capture the anomalies in __copy_tofrom_user. The instances are pretty rare and generally conform to the following pattern: 0x26=__copy_tofrom_user(to, from, 0x16) However, it is not always 26/16. This is confirmed/validated by prior observations of the corruption being less than 16 bytes as well (any where from 4 bytes to 16 bytes - 16 being most common). [ 2365.918381] error in copying ... buf=00003fff1e7e0000 from=c000001f6e03cd6c left=26 copy=16 [ 8789.412510] error in copying ... buf=00003ffeef060000 from=c000001f6ecb99f6 left=26 copy=16 [11930.325914] error in copying ... buf=00003fff23f40000 from=c000001f53041fd4 left=26 copy=16 [15400.086279] error in copying ... buf=00003fff165b0000 from=c000001f5380ab66 left=26 copy=16 [22717.969918] error in copying ... buf=00003fff199a0000 from=c000001f947a6606 left=24 copy=14 <<<<<-----NOTE [27274.429858] error in copying ... buf=00003fff14120000 from=c000001f677c31d4 left=26 copy=16 [32103.463712] error in copying ... buf=00003ffc272e0000 from=c000001f25e7e68c left=26 copy=16 The size of the read along with the new user page (and possibly even the from address offset) may have something to do with this (just conjecture again). I believe the solution to the problem lies in figuring out why __copy_tofrom_user is returning a number greater than the read as "left". As mentioned, the code appears to be in linux/arch/powerpc/lib/copyuser_64.S While it is possible to put a check in copy_page_to_iter_iovec to test for a case where "left" is greater than "copy", the correct approach (I believe) is to determine why it is behaving in the way observed. Interestingly, "left" is expected to be between 0 and "copy" - no check for negative or greater than copy. So clearly the function is expected behave in a tight range and so the community may want to keep it that way (just conjecture). This issue doesn't seem unique to POWER. It seems Sparc observed something similar as well. In linux 4.8, in copy_page_to_iter_iovec() the calls to the atomic kmaps are enveloped with (IS_ENABLED(CONFIG_HIGHMEM) additionally. The reasoning given is efficiency. I don't believe it has any impact in our scenario (both Ubuntu 16.04 and PowerKVM don't seem to have it set) Similar Issues ------------- I did a little rudimentary research regarding any similar issues reported in the past. The following URLs report similar relevant issues: 1 https://lkml.org/lkml/2016/7/8/775 2 http://www.spinics.net/lists/stable/msg94139.html 3 https://lkml.org/lkml/2016/8/3/181 4 https://patchwork.kernel.org/patch/9260733/ Note that all of them are quite recent. Case 1 appears very interesting. I was wondering if we are having any page fault related issues for the new user page (we will be reading into) especially for a small read. (Case 2 is just a little usage case). Cases 3 and 4 seem to refer to similar issues where the corruption was narrowed down to the commit which had recoded the module to use copy_page_to_iter(). I don't believe that the root cause was found (but I didn't put too much effort into digging up the history because in our case we have a smoking gun already, I believe). While working on this, I observed another bug in tun_put_user() although it didn't have any impact in our case (I verified) tun_put_user calls skb_copy_datagram_iter(skb, vlan_offset, iter, skb->len - vlan_offset); but never checks the return code (which can be EFAULT). In that case, it will blithely copy junk to the vhost (caller's) buffers. That though appears to be a secondary issue for this problem. I believe though, that we need to figure out how __copy_tofrom_user is returning the anomalous data described above. While I can go through the assembly and make some progress, I believe there are experts here who can probably rapidly figure out the issue here from what has been described above. == Comment: #85 - Dwip N. Banerjee <dnban...@us.ibm.com> - 2016-10-10 13:03:24 == In the meanwhile, I will try the foillowing: -- disable atomic kmaps -- immediately switch to non-atomic kmaps in case left > copy == Comment: #86 - Pradeep Satyanarayana <prad...@us.ibm.com> - 2016-10-10 15:12:51 == Paul, Can you please take a look at Dwip's excellent analysis and let us know your thoughts? == Comment: #87 - Dwip N. Banerjee <dnban...@us.ibm.com> - 2016-10-11 00:53:47 == I wanted to add a couple of updates and a note... My effort to try to maintain the performance profile by keeping the atomic kmaps and switching over to the non-atomic mode in case of the weird read failed. It was fairly quick. The corruption was visible within half an hour. So I tried to switch off the atomic completely by essentially following the code from 4.8 (using a CONFIG_HIGHMEM check although its stated introduction purpose was different). It's still running , after almost 12 hours (on this machine it reproduces fairly quickly). David also did the the same thing but by commenting out the atomic portion. I think that is probably still running as well. This may provide a helpful data point. Finally, I wanted to mention that __copy_tofrom_user() is also called in the transmit path via __copy_from_user_inatomic(). The call sequence is: handle_tx->tun_sendmsg->tun_get_user->skb_copy_datagram_from_iter->copy_page_from_iter->copy_page_from_iter_iovec->__copy_from_user_inatomic->__copy_tofrom_user This is done for the non-zerocopy path. By default it is off but the vhost code is self-tuning w.r.t zero-copy buffers even when it zerocopy mode set, i.e. beyond a threshold of buffers still in use (ubufs yet to be released) the code will switch to a copy mode. That brings into the picture another invocation sequence of __copy_tofrom_user() (with the src/dst reversed w.r.t. kernel/user space). So that will be a further possible source of mayhem... == Comment: #88 - Paul Mackerras <p...@au1.ibm.com> - 2016-10-11 07:09:38 == Dwip, thank you for the excellent analysis. I dug into __copy_tofrom_user today to find out how it could return more than the number of bytes it was asked to copy. It turns out that there is a bug in __copy_tofrom_user_base which has been there since 2002! There is one store instruction in that function for which the exception handling goes to the wrong place, resulting in the return value being 16 greater than it should be. The fix is very simple and I will post it to the upstream mailing list and attach it to this bug. == Comment: #89 - Paul Mackerras <p...@au1.ibm.com> - 2016-10-11 07:21:52 == To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1632466/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp