Hi all,

What can cause a client to receive an "o2iblnd no resources" message from an OSS?

---------------------------------------------------------------------------
Feb 1 15:24:24 node-5-8 kernel: LustreError: 1893:0:(o2iblnd_cb.c:2448:kiblnd_rejected()) [EMAIL PROTECTED] rejected: o2iblnd no resources
---------------------------------------------------------------------------
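As far as I can tell, this reject reason is sent back by the remote peer when it cannot allocate the resources for a new connection, so the OSS side seemed like the place to look. For reference, this is roughly how I searched the OSS logs (the /var/log/messages path is just the default on my RHEL4 boxes):

---------------------------------------------------------------------------
oss# grep -E "allocation failure|kiblnd|o2iblnd" /var/log/messages | tail -20
---------------------------------------------------------------------------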
I suspect an out-of-memory problem, and indeed the OSS logs are filled with the following:

---------------------------------------------------------------------------
ib_cm/3: page allocation failure. order:4, mode:0xd0
Call Trace:
  <ffffffff8015c847>{__alloc_pages+777}
  <ffffffff801727e9>{alloc_page_interleave+61}
  <ffffffff8015c8e0>{__get_free_pages+11}
  <ffffffff8015facd>{kmem_getpages+36}
  <ffffffff80160262>{cache_alloc_refill+609}
  <ffffffff8015ff30>{__kmalloc+123}
  <ffffffffa014ee75>{:ib_mthca:mthca_alloc_qp_common+668}
  <ffffffffa014f42d>{:ib_mthca:mthca_alloc_qp+178}
  <ffffffffa0153e3a>{:ib_mthca:mthca_create_qp+311}
  <ffffffffa00d5b1b>{:ib_core:ib_create_qp+20}
  <ffffffffa021a5f9>{:rdma_cm:rdma_create_qp+43}
  <ffffffff8024b7b5>{dma_pool_free+245}
  <ffffffffa014b257>{:ib_mthca:mthca_init_cq+1073}
  <ffffffffa01540cf>{:ib_mthca:mthca_create_cq+282}
  <ffffffff801727e9>{alloc_page_interleave+61}
  <ffffffffa0400c10>{:ko2iblnd:kiblnd_cq_completion+0}
  <ffffffffa0400d50>{:ko2iblnd:kiblnd_cq_event+0}
  <ffffffffa00d5cc1>{:ib_core:ib_create_cq+33}
  <ffffffffa03f56bd>{:ko2iblnd:kiblnd_create_conn+3565}
  <ffffffffa0276f38>{:libcfs:cfs_alloc+40}
  <ffffffffa03fe457>{:ko2iblnd:kiblnd_passive_connect+2215}
  <ffffffffa00d8595>{:ib_core:ib_find_cached_gid+244}
  <ffffffffa021a278>{:rdma_cm:cma_acquire_dev+293}
  <ffffffffa03ff540>{:ko2iblnd:kiblnd_cm_callback+64}
  <ffffffffa03ff500>{:ko2iblnd:kiblnd_cm_callback+0}
  <ffffffffa021b19a>{:rdma_cm:cma_req_handler+863}
  <ffffffff801e8427>{alloc_layer+67}
  <ffffffff801e8645>{idr_get_new_above_int+423}
  <ffffffffa00fa0ab>{:ib_cm:cm_process_work+101}
  <ffffffffa00faa57>{:ib_cm:cm_req_handler+2398}
  <ffffffffa00fae3c>{:ib_cm:cm_work_handler+0}
  <ffffffffa00fae6a>{:ib_cm:cm_work_handler+46}
  <ffffffff80146fca>{worker_thread+419}
  <ffffffff80133566>{default_wake_function+0}
  <ffffffff801335b7>{__wake_up_common+67}
  <ffffffff80133566>{default_wake_function+0}
  <ffffffff8014ad18>{keventd_create_kthread+0}
  <ffffffff80146e27>{worker_thread+0}
  <ffffffff8014ad18>{keventd_create_kthread+0}
  <ffffffff8014acef>{kthread+200}
  <ffffffff80110de3>{child_rip+8}
  <ffffffff8014ad18>{keventd_create_kthread+0}
  <ffffffff8014ac27>{kthread+0}
  <ffffffff80110ddb>{child_rip+0}
Mem-info:
Node 0 DMA per-cpu:
cpu 0 hot: low 2, high 6, batch 1
cpu 0 cold: low 0, high 2, batch 1
cpu 1 hot: low 2, high 6, batch 1
cpu 1 cold: low 0, high 2, batch 1
cpu 2 hot: low 2, high 6, batch 1
cpu 2 cold: low 0, high 2, batch 1
cpu 3 hot: low 2, high 6, batch 1
cpu 3 cold: low 0, high 2, batch 1
Node 0 Normal per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
cpu 2 hot: low 32, high 96, batch 16
cpu 2 cold: low 0, high 32, batch 16
cpu 3 hot: low 32, high 96, batch 16
cpu 3 cold: low 0, high 32, batch 16
Node 0 HighMem per-cpu: empty
Free pages: 35336kB (0kB HighMem)
Active:534156 inactive:127091 dirty:1072 writeback:0 unstable:0 free:8834 slab:146612 mapped:26222 pagetables:1035
Node 0 DMA free:9832kB min:52kB low:64kB high:76kB active:0kB inactive:0kB present:16384kB pages_scanned:37 all_unreclaimable? yes
protections[]: 0 510200 510200
Node 0 Normal free:25504kB min:16328kB low:20408kB high:24492kB active:2136624kB inactive:508364kB present:4964352kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 0
Node 0 HighMem free:0kB min:128kB low:160kB high:192kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 0
Node 0 DMA: 2*4kB 2*8kB 1*16kB 0*32kB 1*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 0*2048kB 2*4096kB = 9832kB
Node 0 Normal: 1284*4kB 2290*8kB 126*16kB 1*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 25504kB
Node 0 HighMem: empty
Swap cache: add 111, delete 111, find 23/36, race 0+0
Free swap: 4096360kB
1245184 pages of RAM
235840 reserved pages
659867 pages shared
0 pages swap cached
---------------------------------------------------------------------------
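If I read the buddy allocator dump correctly, an order:4 allocation needs 2^4 = 16 contiguous pages, i.e. 64kB with 4kB pages, and the "Node 0 Normal" line above shows nothing free above the 32kB bucket. So this looks more like memory fragmentation than plain exhaustion to me. To keep an eye on it I've been dumping the largest free block per zone with a quick awk snippet over /proc/buddyinfo (assuming, as on my 2.6.9 kernel, that fields 5 and up are the free-block counts for orders 0 through 10):

---------------------------------------------------------------------------
oss# awk '{ largest = "none"
            for (i = 5; i <= NF; i++)   # field i = count of free blocks
                if ($i > 0)             # of order i-5 (4kB * 2^order each)
                    largest = 4 * 2^(i-5) "kB"
            print $1, $2, $3, $4, "largest free block:", largest
          }' /proc/buddyinfo
---------------------------------------------------------------------------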
IB links are up and working on both the client and the OSS:

---------------------------------------------------------------------------
client# ibstatus
Infiniband device 'mthca0' port 1 status:
        default gid:     fe80:0000:0000:0000:0005:ad00:0008:af71
        base lid:        0x83
        sm lid:          0x130
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)

oss# ibstatus
Infiniband device 'mthca0' port 1 status:
        default gid:     fe80:0000:0000:0000:0005:ad00:0008:cb11
        base lid:        0x126
        sm lid:          0x130
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)
---------------------------------------------------------------------------

And the Subnet Manager doesn't report any unusual errors or runaway counters (I use OFED 1.2, kernel 2.6.9-55.0.9.EL_lustre.1.6.4.1smp).

What I don't really get is that most clients can access files on this OSS with no issue, and besides, my limited understanding of the kernel memory mechanisms leads me to believe that this OSS is not out of memory:

---------------------------------------------------------------------------
# cat /proc/meminfo
MemTotal:        4037380 kB
MemFree:           31688 kB
Buffers:         1333536 kB
Cached:          1231900 kB
SwapCached:            0 kB
Active:          2138948 kB
Inactive:         507720 kB
HighTotal:             0 kB
HighFree:              0 kB
LowTotal:        4037380 kB
LowFree:           31688 kB
SwapTotal:       4096564 kB
SwapFree:        4096360 kB
Dirty:              6868 kB
Writeback:             0 kB
Mapped:           106984 kB
Slab:             588200 kB
CommitLimit:     6115252 kB
Committed_AS:     860508 kB
PageTables:         4304 kB
VmallocTotal:  536870911 kB
VmallocUsed:      274788 kB
VmallocChunk:  536596091 kB
HugePages_Total:       0
HugePages_Free:        0
Hugepagesize:       2048 kB
---------------------------------------------------------------------------

This only appeared recently, after several weeks of continuous use of the filesystem without any problems. Is there anything like a memory leak somewhere?

Any help diagnosing the problem would be greatly appreciated. Thanks!

--
Kilian

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss