On Thu, 2013-01-17 at 10:59 -0800, Zahid Chowdhury wrote:
> Hi Vyacheslav,
>   Thanks for your responses/help. My responses are below with "ZC>".
> 
> Zahid
> 
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Vyacheslav Dubeyko
> Sent: Tuesday, January 15, 2013 10:29 PM
> To: Zahid Chowdhury
> Cc: [email protected]
> Subject: Re: Kernel panic in nilfs.
> 
> Hi Zahid,
> 
> On Tue, 2013-01-15 at 14:36 -0800, Zahid Chowdhury wrote:
> > Hello,
> >   I am running CentOS 5.5 (kernel 2.6.18-194.17.4.el5). I used the
> > CentOS distribution with the nilfs kernel module 2.0.22 to statically build
> > nilfs into the kernel (that's why I renamed 2.6.18-194.17.4.el5 to
> > 2.6.18-194.17.4.el5SSI_NILFS). I have enabled netconsole as the box is
> > mostly headless - the kernel panic messages below came in through
> > netconsole. The garbage collection daemon is nilfs-utils 2.1.0. The
> > processor is an Intel(R) Atom(TM) CPU D510, dual core with 2 contexts. The
> > SSD is an industrial-grade Apacer 16GB SLC. At the time the kernel panicked
> > there were many (> 100) soft real-time processes with nice levels of -19
> > running (the cleanerd runs at a +19 nice level, as we have found that it
> > otherwise disturbs the soft real-time processes). These soft real-time
> > processes are also memory and CPU hogs (less than a few % idle even with
> > all the cores/contexts), such that less than a few K of memory is
> > available at any time (we will be fixing the apps, but nilfs still should
> > not panic the kernel). We do allow overcommit, and all processes are at
> > the normal oom_adj value of 0 except for critical processes like syslogd,
> > klogd, sshd, crond, nilfs_cleanerd, ifplugd, and dbus-daemon. Btw, we did
> > much testing and no kernel panics occurred over weeks until I oom_adj'ed
> > the critical processes just today.
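> > 
> > Roughly how we set this up (a sketch - the exact values and the list of
> > daemons come from our init scripts; -17 is the OOM_DISABLE value on
> > this kernel):
> > 
> >   renice 19 -p $(pidof nilfs_cleanerd)
> >   for d in syslogd klogd sshd crond nilfs_cleanerd ifplugd dbus-daemon; do
> >       for p in $(pidof $d); do echo -17 > /proc/$p/oom_adj; done
> >   done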
> > 
> > 
> > Has anybody seen the kernel panic messages I see below? Is there any fix 
> > for this in a Centos 5.5 kernel? Would upgrading to a newer nilfs module 
> > clear up this panic? Would upgrading to a newer kernel clear up this panic? 
> > Upgrading cleanerd? Any other suggestions/questions are very welcome. 
> > Thanks all.
> > 
> 
> First of all, I think it makes sense to try upgrading the kernel and
> nilfs-utils. We need to understand whether your issue can be reproduced
> with the current state of the NILFS2 code.
> 
> ZC> Actually, some of our apps cannot run on newer kernels, so we may not
> ZC> hit this panic situation in that scenario.
> 
> 
> Secondly, what value of vm.min_free_kbytes do you have on your system?
> Do you have any error messages about page allocation failures in your
> system log?
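> 
> For example (assuming the default syslog location):
> 
>   sysctl vm.min_free_kbytes
>   grep -i "page allocation failure" /var/log/messages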
> 
> 
> ZC> We have min_free_kbytes at the CentOS 5.5 default of 3831. That is very
> ZC> low for the pool of reserved page frames, so I will be bumping it to
> ZC> 32K or 64K. We also have vm.lowmem_reserve_ratio set to 32; I am hoping
> ZC> to change it to a more protective value like 9 instead of 32 (a lower
> ZC> divisor reserves more low memory).
> ZC> On previous runs of this load test we did have page allocation failures
> ZC> in the apps, and oom_kill ran and nuked processes - in the kernel panic
> ZC> run we had no page allocation failures or OOM kills reported, which is
> ZC> why I am worried. I will be setting reboot-on-panic, but it is scary
> ZC> when I see no messages and suddenly things panic - though a load test
> ZC> was running when the panic happened. Any other thoughts/suggestions are
> ZC> very welcome.
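> ZC> 
> ZC> Concretely, something like this (values are a first guess, not tuned
> ZC> yet; lowmem_reserve_ratio takes one entry per zone on this box):
> ZC> 
> ZC>   sysctl -w vm.min_free_kbytes=65536
> ZC>   sysctl -w vm.lowmem_reserve_ratio="256 256 9"
> ZC>   sysctl -w kernel.panic=30    # reboot 30 seconds after a panic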
> 
> 
> Thirdly, I don't currently understand how to try to reproduce your
> issue. Could you describe in more detail which filesystem operations
> took place before the issue occurred? Do you have any NILFS2-related
> error messages in your system log from before the kernel panic?
> 
> ZC> We have a 90/10 read-to-write ratio (sqlite), as most writes were moved
> ZC> to a memory filesystem because writes to NILFS create a 5:1 ratio of
> ZC> dat-file size to real space usage - do you know if this has been fixed
> ZC> in a newer release of the kernel module, and/or has the gc daemon been
> ZC> cleaned up? Also, the gc daemon uses most of the CPU bandwidth in a
> ZC> large-dat-file situation. There are no nilfs error messages in syslogd
> ZC> via klogd. I'm unsure if you can reproduce it. Maybe download a CentOS
> ZC> 5.5 distro and compile in the nilfs module; I think the error messages
> ZC> on compile are easily fixable - the issue I had was with the Red Hat
> ZC> signing method for their modules - please see the CentOS web site for
> ZC> ways to deal with this. Then run CPU & memory hogs with no ulimit
> ZC> protection (remember CentOS ships with overcommit on). The hogs should
> ZC> do mostly reads. Cleanerd should be ioniced to the lowest priority and
> ZC> reniced to the lowest level. That should do it. Let me know if you have
> ZC> any problems.
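> ZC> 
> ZC> As a rough sketch of the load (the device, paths and sizes here are
> ZC> illustrative, not our exact setup):
> ZC> 
> ZC>   mount -t nilfs2 /dev/sda2 /mnt/nilfs
> ZC>   ionice -c 3 -p $(pidof nilfs_cleanerd)
> ZC>   renice 19 -p $(pidof nilfs_cleanerd)
> ZC>   # ~100 high-priority, read-mostly hogs, no ulimit caps
> ZC>   for i in $(seq 1 100); do
> ZC>       nice -n -19 sh -c 'while :; do
> ZC>           dd if=/mnt/nilfs/db.sqlite of=/dev/null bs=1M 2>/dev/null
> ZC>       done' &
> ZC>   done
> ZC>   # plus a few memory hogs to push the box near OOM
> ZC>   for i in $(seq 1 4); do
> ZC>       perl -e '$x = "x" x (256 * 1024 * 1024); sleep 3600' &
> ZC>   done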
> ZC> Regards.
> 

Thank you for the additional details.

Currently, I am fixing two bugs in NILFS2 (a flush kernel thread issue
and an issue with bad btree node messages). I suspect these bugs could
be the cause of your issue, so you can check whether it is still
reproducible after they are fixed. If you can still reproduce your
issue after those changes, then I will investigate it more deeply.

With the best regards,
Vyacheslav Dubeyko.

> Thanks,
> Vyacheslav Dubeyko.
> 
> > 
> > Zahid
> > 
> > P.S.: Panic flow over netconsole into syslogd - sorry for so many lines, 
> > alas Solaris syslogd seems to wrap early:
> > 
> > Jan 15 12:22:38  ------------[ cut here ]------------
> > Jan 15 12:22:38  kernel BUG at fs/nilfs2/page.c:317!
> > Jan 15 12:22:38  invalid opcode: 0000 [#1] SMP
> > Jan 15 12:22:38  last sysfs file: /devices/pci0000:00/0000:00:1c.0/0000:02:00.0/irq
> > Jan 15 12:22:38  Modules linked in: netconsole autofs4 dme1737 hwmon_vid hidp l2cap bluetooth sunrpc bridge ip_nat_ftp ip_conntrack_ftp ip_conntrack_netbios_ns iptable_mangle iptable_filter ipt_MASQUERADE xt_tcpudp iptable_nat ip_nat ip_conntrack nfnetlink ip_tables x_tables loop dm_mirror dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi ac lp snd_hda_intel snd_seq_dummy sg snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc parport_pc e1000e pcspkr snd_hwdep serio_raw parport i2c_i801 i2c_core snd soundcore dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache usb_storage ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
> > Jan 15 12:22:38  CPU:    0
> > Jan 15 12:22:38  EIP:    0060:[<c04c078b>]    Not tainted VLI
> > Jan 15 12:22:38  EFLAGS: 00010246   (2.6.18-194.17.4.el5SSI_NILFS #1)
> > Jan 15 12:22:38  EIP is at nilfs_copy_page+0x29/0x198
> > Jan 15 12:22:38  eax: 80010029   ebx: c1329100   ecx: 00000000   edx: c135de00
> > Jan 15 12:22:38  esi: 00000000   edi: f6df3f30   ebp: f6df3cf4   esp: f7a14ca8
> > Jan 15 12:22:38  ds: 007b   es: 007b   ss: 0068
> > Jan 15 12:22:38  Process nilfs_cleanerd (pid: 1653, ti=f7a14000 task=f79c4000 task.ti=f7a14000)
> > Jan 15 12:22:38  Stack: ec2e8000 e0461000 c135de00 c1585d00 f6df3f30 c0458ba8 c135de00 c1329100
> > Jan 15 12:22:38         f6df3f30 f6df3cf4 c04c0ff2 00001f8e 00000005 00001f7c 0000000e 00000000
> > Jan 15 12:22:38         c1407240 c12b5ac0 c152afe0 c1462ae0 c1408c20 c135de00 c11fdda0 c1503320
> > Jan 15 12:22:38  Call Trace:
> > Jan 15 12:22:38   [<c0458ba8>] find_lock_page+0x1a/0x7e
> > Jan 15 12:22:38   [<c04c0ff2>] nilfs_copy_back_pages+0xbb/0x1e7
> > Jan 15 12:22:38   [<c04d2f3b>] nilfs_commit_gcdat_inode+0x83/0xa8
> > Jan 15 12:22:38   [<c04cc0de>] nilfs_segctor_complete_write+0x1dd/0x301
> > Jan 15 12:22:38   [<c04cd337>] nilfs_segctor_do_construct+0x1011/0x1384
> > Jan 15 12:22:38   [<c045dbea>] __set_page_dirty_nobuffers+0xb0/0xd3
> > Jan 15 12:22:38   [<c04c17f3>] nilfs_mdt_mark_block_dirty+0x41/0x47
> > Jan 15 12:22:38   [<c04cd8c1>] nilfs_segctor_construct+0x82/0x261
> > Jan 15 12:22:38   [<c04ceada>] nilfs_clean_segments+0xa9/0x1c4
> > Jan 15 12:22:38   [<c04d26e2>] nilfs_ioctl+0x444/0x57d
> > Jan 15 12:22:38   [<c0465900>] free_pgd_range+0x108/0x190
> > Jan 15 12:22:38   [<c04d229e>] nilfs_ioctl+0x0/0x57d
> > Jan 15 12:22:38   [<c048620d>] do_ioctl+0x1c/0x5d
> > Jan 15 12:22:38   [<c04867a1>] vfs_ioctl+0x47b/0x4d3
> > Jan 15 12:22:38   [<c041eef6>] enqueue_task+0x29/0x39
> > Jan 15 12:22:38   [<c0486841>] sys_ioctl+0x48/0x5f
> > Jan 15 12:22:38   [<c0404f17>] syscall_call+0x7/0xb
> > Jan 15 12:22:38   =======================
> > Jan 15 12:22:38  Code: 00 c3 55 57 56 89 ce 53 89 c3 83 ec 18 89 54 24 08 8b 00 f6 c4 10 74 08 0f 0b 3b 01 22 1b 66 c0 8b 54 24 08 8b 02 f6 c4 08 75 08 <0f> 0b 3d 01 22 1b 66 c0 8b 03 8b 7c 24 08 f6 c4 08 8b 6f 0c 75
> > Jan 15 12:22:38  EIP: [<c04c078b>] nilfs_copy_page+0x29/0x198 SS:ESP 0068:f7a14ca8
> > Jan 15 12:22:38  Kernel panic - not syncing: Fatal exception