Apologies for my bug reporting style having turned into something like
personal blog postings...  I'm distressed about this bug.  I'm worried
that a dozen production machines that are currently running Debian
stable with a similar IPv6 + IPsec configuration will be affected once
stretch is released.  Therefore I'm trying my best to learn the tools
and diagnose the bug.  Any tips would be greatly appreciated.

On Wed, Nov 25 2015, Gerald Turner wrote:
> On Wed, Nov 25 2015, Gerald Turner wrote:
>> I suppose I'll restart bisection at last 'bad' and let the kernels
>> run for a day before issueing 'git bisect good'.
>
> I'm in the process of doing this, may take a week.

I took a week to re-perform the bisection, this time booting twice and
waiting for a day of uptime before issuing 'git bisect good'.
Nevertheless the result was exactly the same as the replay I copied two
emails back.  Nothing gained.

I then scrutinized the backtrace disassembly (three emails back).  The
panic occurs at the return from the inline function rt6_get_cookie,
declared in ip6_fib.h.  This function was introduced during 4.2 with
merge c1a34035:

  commit c1a34035506d3a7ad62403125d59c86b763c477d
  Merge: 01b6961 d52d399
  Author: David S. Miller <da...@davemloft.net>
  Date:   Mon May 25 13:25:35 2015 -0400

    Merge branch 'ipv6_route_sharing'

  commit d52d3997f843ffefaa8d8462790ffcaca6c74192
  Author: Martin KaFai Lau <ka...@fb.com>
  Date:   Fri May 22 20:56:06 2015 -0700

    ipv6: Create percpu rt6_info

  commit 83a09abd1a8badbbb715f928d07c65ac47709c47
  Author: Martin KaFai Lau <ka...@fb.com>
  Date:   Fri May 22 20:56:05 2015 -0700

    ipv6: Break up ip6_rt_copy()

  commit 8d0b94afdca84598912347e61defa846a0988d04
  Author: Martin KaFai Lau <ka...@fb.com>
  Date:   Fri May 22 20:56:04 2015 -0700

    ipv6: Keep track of DST_NOCACHE routes in case of iface down/unregister

  commit 3da59bd94583d1239e4fbdee452265a160b9cd71
  Author: Martin KaFai Lau <ka...@fb.com>
  Date:   Fri May 22 20:56:03 2015 -0700

    ipv6: Create RTF_CACHE clone when FLOWI_FLAG_KNOWN_NH is set

  commit 48e8aa6e3137692d38f20e8bfff100e408c6bc53
  Author: Martin KaFai Lau <ka...@fb.com>
  Date:   Fri May 22 20:56:02 2015 -0700

    ipv6: Set FLOWI_FLAG_KNOWN_NH at flowi6_flags

  commit b197df4f0f3782782e9ea8996e91b65ae33e8dd9
  Author: Martin KaFai Lau <ka...@fb.com>
  Date:   Fri May 22 20:56:01 2015 -0700

    ipv6: Add rt6_get_cookie() function

  commit 45e4fd26683c9a5f88600d91b08a484f7f09226a
  Author: Martin KaFai Lau <ka...@fb.com>
  Date:   Fri May 22 20:56:00 2015 -0700

    ipv6: Only create RTF_CACHE routes after encountering pmtu exception

  commit 8b9df2657704dd313333a79497dde429f9190caa
  Author: Martin KaFai Lau <ka...@fb.com>
  Date:   Fri May 22 20:55:59 2015 -0700

    ipv6: Combine rt6_alloc_cow and rt6_alloc_clone

  commit 2647a9b07032c5a95ddee1fcb65d95bddbc6b7f9
  Author: Martin KaFai Lau <ka...@fb.com>
  Date:   Fri May 22 20:55:58 2015 -0700

    ipv6: Remove external dependency on rt6i_gateway and RTF_ANYCAST

  commit fd0273d7939f2ce3247f6aac5f6b9a0135d4cd39
  Author: Martin KaFai Lau <ka...@fb.com>
  Date:   Fri May 22 20:55:57 2015 -0700

    ipv6: Remove external dependency on rt6i_dst and rt6i_src

  commit 286c2349f6665c3e67f464a5faa14a0e28be4842
  Author: Martin KaFai Lau <ka...@fb.com>
  Date:   Fri May 22 20:55:56 2015 -0700

    ipv6: Clean up ipv6_select_ident() and ip6_fragment()


The following is all conjecture, but evidently with this merge the IPv6
routing cache gained some optimization, is now using per-CPU structures,
and has relegated PMTU updates to a slower path.  My IPv6 + IPsec
environments have had their share of PMTU problems in the past: two of
the three sites are behind 6in4 tunnels, all three sites have differing
MTUs, and I used to get stalls, even on interactive SSH traffic, due to
PMTU cache eviction/re-discovery.

Also the crash occurs immediately after boot (or after login on the
desktop system), and I'm using systemd, which is highly concurrent, so
maybe this is a race with the per-CPU change?

Also the "Merge: 01b6961 d52d399" line is vaguely interesting (to me
anyway, because I'm a git newbie) because commit 01b6961 happens to be
the same exotic driver as the _first bad commit_ from my bisect runs.
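
For readers as new to git as I am: the "Merge:" line gives the
abbreviated hashes of the merge's two parents.  The first (01b6961) is
the tip of the branch that was merged into, and the second (d52d399) is
the tip of the branch that was merged.  A minimal sketch for listing
what a merge brought in follows; merge_commits is a hypothetical helper
name, not an existing git command.

```shell
# List the commits a merge brought in: everything reachable from the
# second parent (the merged branch) but not from the first parent
# (the branch that was merged into).
merge_commits() {
    git log --oneline "$1^1..$1^2"
}

# Intended use inside a kernel checkout (not run here):
#   merge_commits c1a34035
```

Run against c1a34035, this should print the ipv6_route_sharing commits
listed above, assuming the checkout contains that merge.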

Therefore I think I'm onto something...

I spent some time trying to build 4.2.6 with these commits reverted.
Unfortunately a few commits that came later modify lines from this
merge, so simply running 'git revert -m 1 c1a340355' is not possible.
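
A rough sketch of how the later commits that get in the way of such a
revert can be found: list everything after the merge that touches the
files the merged branch changed.  The helper name is hypothetical, and
using the tag v4.2.6 as the tip assumes a checkout of the stable tree.

```shell
# List commits made after a given merge that touch any file the merged
# branch changed; these are the candidates for revert conflicts.
later_commits_touching_merge() {
    merge=$1   # the merge commit, e.g. c1a34035
    tip=$2     # the tree being built, e.g. v4.2.6 (assumed tag name)
    # Files changed by the merged branch relative to the first parent.
    files=$(git diff --name-only "$merge^1" "$merge")
    # Word splitting of $files is intentional (assumes no spaces in paths).
    git log --oneline "$merge..$tip" -- $files
}

# Intended use inside a kernel checkout (not run here):
#   later_commits_touching_merge c1a34035 v4.2.6
```

Not every commit it prints necessarily conflicts with the revert, since
a commit can touch the same file without touching the same lines.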

I eventually built a 4.2.6 kernel with the following commits reverted:

  git revert 9c7370a1 # ipv6: Fix a potential deadlock when creating pcpu rt
  git revert a73e4195 # ipv6: Add rt6_make_pcpu_route
  git revert ad706862 # ipv6: Remove un-used argument from ip6_dst_alloc
  git revert 87775312 # net-ipv6: Delete an unnecessary check before the function call "free_percpu"
  git revert d52d3997 # ipv6: Create percpu rt6_info

Sadly this too crashed, however at least it was a different crash!

[   45.751104] BUG: unable to handle kernel NULL pointer dereference at           (null)
[   45.751127] IP: [<ffffffff815526a7>] _raw_spin_lock_bh+0x17/0x30
[   45.751144] PGD 0
[   45.751151] Oops: 0002 [#1] SMP
[   45.751159] Modules linked in: xfrm4_mode_transport ccm xfrm6_mode_tunnel 
xfrm4_mode_tunnel deflate ebtable_filter ebtables ip6table_filter ip6_tables 
iptable_filter ip_tables x_tables tun sit ip_tunnel rfcomm twofish_generic 
twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common 
serpent_avx2 serpent_avx_x86_64 serpent_sse2_x86_64 serpent_generic 
blowfish_generic blowfish_x86_64 blowfish_common cast5_avx_x86_64 cast5_generic 
cast_common seqiv crypto_null ctr ecb des_generic cbc camellia_generic 
camellia_aesni_avx2 camellia_aesni_avx_x86_64 camellia_x86_64 xts xcbc 
sha512_ssse3 sha512_generic md4 algif_hash xfrm_user xfrm4_tunnel tunnel4 
ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo bnep binfmt_misc nls_utf8 
nls_cp437 vfat fat ext4 mbcache jbd2 x86_pkg_temp_thermal intel_powerclamp 
intel_rapl
[   45.751357]  eeepc_wmi iosf_mbi snd_hda_codec_realtek asus_wmi iTCO_wdt 
sparse_keymap snd_hda_codec_hdmi iTCO_vendor_support snd_hda_codec_generic 
coretemp kvm_intel snd_hda_intel kvm btusb psmouse btrtl snd_hda_codec btbcm 
btintel snd_hda_core bluetooth mei_me serio_raw lpc_ich efivars pcspkr 
snd_hwdep sg mei mfd_core dw_dmac rfkill 8250_fintek crc16 i2c_i801 
dw_dmac_core snd_soc_rt5640 snd_soc_rl6231 snd_soc_core snd_compress acpi_pad 
snd_pcm snd_timer snd soundcore regmap_i2c tpm_infineon tpm_tis 
i2c_designware_platform shpchp battery i2c_designware_core evdev tpm 
snd_soc_sst_acpi processor cuse fuse parport_pc ppdev lp parport efivarfs 
autofs4 btrfs xor raid6_pq algif_skcipher af_alg hid_generic usbhid dm_crypt 
dm_mod sd_mod crct10dif_pclmul crc32_pclmul crc32c_intel jitterentropy_rng 
sha256_ssse3
[   45.751558]  sha256_generic hmac ahci libahci drbg libata ansi_cprng i915 
i2c_algo_bit xhci_pci ehci_pci mxm_wmi xhci_hcd ehci_hcd drm_kms_helper e1000e 
ptp aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd usbcore 
scsi_mod pps_core drm usb_common fan thermal sdhci_acpi sdhci video mmc_core 
thermal_sys wmi i2c_hid hid button
[   45.751648] CPU: 2 PID: 564 Comm: kworker/2:2 Not tainted 4.2.6-gt+ #1
[   45.751662] Hardware name: ASUS All Series/Z97-AR, BIOS 1304 07/11/2014
[   45.751679] Workqueue: events dst_gc_task
[   45.751688] task: ffff8808171dee80 ti: ffff88080ea48000 task.ti: ffff88080ea48000
[   45.751705] RIP: 0010:[<ffffffff815526a7>]  [<ffffffff815526a7>] _raw_spin_lock_bh+0x17/0x30
[   45.751724] RSP: 0018:ffff88080ea4bcf0  EFLAGS: 00010246
[   45.751736] RAX: 0000000000000000 RBX: ffff8807ec1397c0 RCX: 0000000000000020
[   45.751751] RDX: 0000000000000001 RSI: ffffffff81672c21 RDI: 0000000000000000
[   45.751766] RBP: ffff8807ec139900 R08: ffffffff81acb9c8 R09: ffff88083fa9254c
[   45.751781] R10: 0000000000000653 R11: 00000000000003ed R12: 0000000000000000
[   45.751796] R13: 0000000000000000 R14: ffff880035fdee40 R15: 0000000000000080
[   45.751811] FS:  0000000000000000(0000) GS:ffff88083fa80000(0000) knlGS:0000000000000000
[   45.751828] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   45.751840] CR2: 0000000000000000 CR3: 0000000001c0c000 CR4: 00000000001406e0
[   45.751855] Stack:
[   45.751860]  ffffffff8150b43f 0000000200000000 ffff8807ec1397c0 0000000000000000
[   45.751878]  0000000000000001 0000000000000001 ffffffff81466bea ffff88080ea4bd28
[   45.751896]  ffff8807ec1397c0 ffff88081732ac40 ffffffff81466d25 000000000000016b
[   45.751915] Call Trace:
[   45.751921]  [<ffffffff8150b43f>] ? ip6_dst_destroy+0x3f/0xa0
[   45.751935]  [<ffffffff81466bea>] ? dst_destroy+0x2a/0xc0
[   45.751948]  [<ffffffff81466d25>] ? dst_gc_task+0xa5/0x210
[   45.751962]  [<ffffffff8101c633>] ? native_sched_clock+0x23/0x80
[   45.751975]  [<ffffffff8101c695>] ? sched_clock+0x5/0x10
[   45.751988]  [<ffffffff810a4134>] ? pick_next_task_fair+0x594/0x8d0
[   45.752003]  [<ffffffff8101263b>] ? __switch_to+0x1cb/0x560
[   45.752016]  [<ffffffff81084c0f>] ? process_one_work+0x19f/0x3d0
[   45.752029]  [<ffffffff81084e8d>] ? worker_thread+0x4d/0x450
[   45.752042]  [<ffffffff8154ec4d>] ? __schedule+0x2bd/0x8c0
[   45.752054]  [<ffffffff81084e40>] ? process_one_work+0x3d0/0x3d0
[   45.752068]  [<ffffffff8108ac81>] ? kthread+0xc1/0xe0
[   45.752080]  [<ffffffff8108abc0>] ? kthread_create_on_node+0x170/0x170
[   45.752094]  [<ffffffff8155301f>] ? ret_from_fork+0x3f/0x70
[   45.752106]  [<ffffffff8108abc0>] ? kthread_create_on_node+0x170/0x170
[   45.752120] Code: 01 00 00 00 c3 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 65 81 05 b0 92 ab 7e 00 02 00 00 31 c0 ba 01 00 00 00 <f0> 0f b1 17 85 c0 75 02 f3 c3 89 c6 e8 c8 c4 b5 ff 66 90 c3 0f
[   45.752199] RIP  [<ffffffff815526a7>] _raw_spin_lock_bh+0x17/0x30
[   45.752214]  RSP <ffff88080ea4bcf0>
[   45.752221] CR2: 0000000000000000
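
One way to map the symbol+offset entries in a trace like the one above
back to source lines is gdb's "list *(symbol+offset)".  This is only a
sketch: trace_to_gdb is a hypothetical helper, and it assumes a vmlinux
built from the same tree with debug info and a grep that supports -o
(GNU and BSD grep do).

```shell
# Turn call-trace entries such as "ip6_dst_destroy+0x3f/0xa0" into gdb
# commands that print the matching source lines.
trace_to_gdb() {
    # Pull out "symbol+0xoffset/" and drop the trailing slash; the
    # "/0xsize" part of each entry is not needed by gdb.
    grep -o '[A-Za-z_][A-Za-z0-9_.]*+0x[0-9a-f]*/' |
        sed 's|\(.*\)/$|list *(\1)|'
}

# Intended use (not run here; oops.txt and vmlinux are assumed paths):
#   trace_to_gdb < oops.txt > cmds.gdb
#   gdb -batch -x cmds.gdb vmlinux
```

For this oops the first trace entry, ip6_dst_destroy+0x3f, is the
return address of the call into _raw_spin_lock_bh, so it is probably
the most useful one to look up.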

I'm lost.  If cleanly reverting a few commits (no conflicts) can break a
kernel locking mechanism, then I'm afraid this endeavor is futile.

-- 
Gerald Turner <gtur...@unzane.com>        Encrypted mail preferred!
OpenPGP: 4096R / CA89 B27A 30FA 66C5 1B80  3858 EC94 2276 FDB8 716D
