Hi all,

What can cause a client to receive an "o2iblnd no resources" message 
from an OSS?
---------------------------------------------------------------------------
Feb  1 15:24:24 node-5-8 kernel: LustreError: 
1893:0:(o2iblnd_cb.c:2448:kiblnd_rejected()) [EMAIL PROTECTED] rejected: 
o2iblnd no resources
---------------------------------------------------------------------------

I suspect an out-of-memory problem, and indeed the OSS logs are filled
up with the following:
---------------------------------------------------------------------------
ib_cm/3: page allocation failure. order:4, mode:0xd0

Call Trace:<ffffffff8015c847>{__alloc_pages+777} 
<ffffffff801727e9>{alloc_page_interleave+61}
       <ffffffff8015c8e0>{__get_free_pages+11} 
<ffffffff8015facd>{kmem_getpages+36}
       <ffffffff80160262>{cache_alloc_refill+609} 
<ffffffff8015ff30>{__kmalloc+123}
       <ffffffffa014ee75>{:ib_mthca:mthca_alloc_qp_common+668}
       <ffffffffa014f42d>{:ib_mthca:mthca_alloc_qp+178} 
<ffffffffa0153e3a>{:ib_mthca:mthca_create_qp+311}
       <ffffffffa00d5b1b>{:ib_core:ib_create_qp+20} 
<ffffffffa021a5f9>{:rdma_cm:rdma_create_qp+43}
       <ffffffff8024b7b5>{dma_pool_free+245} 
<ffffffffa014b257>{:ib_mthca:mthca_init_cq+1073}
       <ffffffffa01540cf>{:ib_mthca:mthca_create_cq+282} 
<ffffffff801727e9>{alloc_page_interleave+61}
       <ffffffffa0400c10>{:ko2iblnd:kiblnd_cq_completion+0}
       <ffffffffa0400d50>{:ko2iblnd:kiblnd_cq_event+0} 
<ffffffffa00d5cc1>{:ib_core:ib_create_cq+33}
       <ffffffffa03f56bd>{:ko2iblnd:kiblnd_create_conn+3565}
       <ffffffffa0276f38>{:libcfs:cfs_alloc+40} 
<ffffffffa03fe457>{:ko2iblnd:kiblnd_passive_connect+2215}
       <ffffffffa00d8595>{:ib_core:ib_find_cached_gid+244}
       <ffffffffa021a278>{:rdma_cm:cma_acquire_dev+293} 
<ffffffffa03ff540>{:ko2iblnd:kiblnd_cm_callback+64}
       <ffffffffa03ff500>{:ko2iblnd:kiblnd_cm_callback+0}
       <ffffffffa021b19a>{:rdma_cm:cma_req_handler+863} 
<ffffffff801e8427>{alloc_layer+67}
       <ffffffff801e8645>{idr_get_new_above_int+423} 
<ffffffffa00fa0ab>{:ib_cm:cm_process_work+101}
       <ffffffffa00faa57>{:ib_cm:cm_req_handler+2398} 
<ffffffffa00fae3c>{:ib_cm:cm_work_handler+0}
       <ffffffffa00fae6a>{:ib_cm:cm_work_handler+46} 
<ffffffff80146fca>{worker_thread+419}
       <ffffffff80133566>{default_wake_function+0} 
<ffffffff801335b7>{__wake_up_common+67}
       <ffffffff80133566>{default_wake_function+0} 
<ffffffff8014ad18>{keventd_create_kthread+0}
       <ffffffff80146e27>{worker_thread+0} 
<ffffffff8014ad18>{keventd_create_kthread+0}
       <ffffffff8014acef>{kthread+200} <ffffffff80110de3>{child_rip+8}
       <ffffffff8014ad18>{keventd_create_kthread+0} 
<ffffffff8014ac27>{kthread+0}
       <ffffffff80110ddb>{child_rip+0}
Mem-info:
Node 0 DMA per-cpu:
cpu 0 hot: low 2, high 6, batch 1
cpu 0 cold: low 0, high 2, batch 1
cpu 1 hot: low 2, high 6, batch 1
cpu 1 cold: low 0, high 2, batch 1
cpu 2 hot: low 2, high 6, batch 1
cpu 2 cold: low 0, high 2, batch 1
cpu 3 hot: low 2, high 6, batch 1
cpu 3 cold: low 0, high 2, batch 1
Node 0 Normal per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
cpu 2 hot: low 32, high 96, batch 16
cpu 2 cold: low 0, high 32, batch 16
cpu 3 hot: low 32, high 96, batch 16
cpu 3 cold: low 0, high 32, batch 16
Node 0 HighMem per-cpu: empty

Free pages:       35336kB (0kB HighMem)
Active:534156 inactive:127091 dirty:1072 writeback:0 unstable:0 free:8834 
slab:146612 mapped:26222 pagetables:1035
Node 0 DMA free:9832kB min:52kB low:64kB high:76kB active:0kB inactive:0kB 
present:16384kB pages_scanned:37 all_unreclaimable? yes
protections[]: 0 510200 510200
Node 0 Normal free:25504kB min:16328kB low:20408kB high:24492kB 
active:2136624kB inactive:508364kB present:4964352kB pages_scanned:0 
all_unreclaimable? no
protections[]: 0 0 0
Node 0 HighMem free:0kB min:128kB low:160kB high:192kB active:0kB inactive:0kB 
present:0kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 0
Node 0 DMA: 2*4kB 2*8kB 1*16kB 0*32kB 1*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 
0*2048kB 2*4096kB = 9832kB
Node 0 Normal: 1284*4kB 2290*8kB 126*16kB 1*32kB 0*64kB 0*128kB 0*256kB 0*512kB 
0*1024kB 0*2048kB 0*4096kB = 25504kB
Node 0 HighMem: empty
Swap cache: add 111, delete 111, find 23/36, race 0+0
Free swap:       4096360kB
1245184 pages of RAM
235840 reserved pages
659867 pages shared
0 pages swap cached
---------------------------------------------------------------------------
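
One way to read the per-zone free lists in the log above is to count the free
blocks large enough for the failed request. The sketch below is only an
illustration: it assumes 4 kB pages (so order:4 means 16 contiguous pages,
i.e. 64 kB), and the helper names free_blocks and blocks_at_least are made up
for this example. It says nothing about which zones the allocator is actually
allowed to use for this request.

```python
import re

PAGE_KB = 4  # assumption: 4 kB pages, as on x86_64


def free_blocks(line):
    """Parse the 'COUNT*SIZEkB' terms of a free-list line into {size_kb: count}."""
    return {int(size): int(count)
            for count, size in re.findall(r"(\d+)\*(\d+)kB", line)}


def blocks_at_least(line, order):
    """Count free blocks big enough to hold one allocation of the given order."""
    need_kb = PAGE_KB << order  # order 4 -> 64 kB
    return sum(n for size, n in free_blocks(line).items() if size >= need_kb)


# Lines copied from the log above (re-joined after mail wrapping).
normal = ("Node 0 Normal: 1284*4kB 2290*8kB 126*16kB 1*32kB 0*64kB 0*128kB "
          "0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 25504kB")
dma = ("Node 0 DMA: 2*4kB 2*8kB 1*16kB 0*32kB 1*64kB 0*128kB 0*256kB "
       "1*512kB 1*1024kB 0*2048kB 2*4096kB = 9832kB")

print(blocks_at_least(normal, 4))  # 0
print(blocks_at_least(dma, 4))     # 5
```

With these figures the Normal zone has about 25 MB free but no free block of
64 kB or larger, which would be consistent with fragmentation rather than a
plain shortage of memory.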

IB links are up and working on both the client and the OSS:
---------------------------------------------------------------------------
client# ibstatus
Infiniband device 'mthca0' port 1 status:
        default gid:     fe80:0000:0000:0000:0005:ad00:0008:af71
        base lid:        0x83
        sm lid:          0x130
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)
oss# ibstatus
Infiniband device 'mthca0' port 1 status:
        default gid:     fe80:0000:0000:0000:0005:ad00:0008:cb11
        base lid:        0x126
        sm lid:          0x130
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)
---------------------------------------------------------------------------
And the Subnet Manager doesn't report any unusual errors or runaway 
counters (I use OFED 1.2, kernel 2.6.9-55.0.9.EL_lustre.1.6.4.1smp).

What I don't really get is that most clients can access files on this
OSS with no issue. Besides, my limited understanding of the kernel's
memory mechanisms leads me to believe that this OSS is not out of 
memory:
---------------------------------------------------------------------------
# cat /proc/meminfo
MemTotal:      4037380 kB
MemFree:         31688 kB
Buffers:       1333536 kB
Cached:        1231900 kB
SwapCached:          0 kB
Active:        2138948 kB
Inactive:       507720 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:      4037380 kB
LowFree:         31688 kB
SwapTotal:     4096564 kB
SwapFree:      4096360 kB
Dirty:            6868 kB
Writeback:           0 kB
Mapped:         106984 kB
Slab:           588200 kB
CommitLimit:   6115252 kB
Committed_AS:   860508 kB
PageTables:       4304 kB
VmallocTotal: 536870911 kB
VmallocUsed:    274788 kB
VmallocChunk: 536596091 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     2048 kB
---------------------------------------------------------------------------
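
To put a number on "not out of memory", here is a minimal sketch that adds up
MemFree, Buffers and Cached from the output above. Treating Buffers and Cached
as reclaimable is an assumption and gives a rough upper bound at best; the
parse_meminfo helper is hypothetical, written for this example only.

```python
def parse_meminfo(text):
    """Return {field: value} for the 'Name:   value [kB]' lines of /proc/meminfo."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


# Values copied from the /proc/meminfo output above.
sample = """\
MemTotal:      4037380 kB
MemFree:         31688 kB
Buffers:       1333536 kB
Cached:        1231900 kB
Slab:           588200 kB
"""

info = parse_meminfo(sample)
# Assumption: buffers and page cache can be reclaimed under memory pressure.
easy = info["MemFree"] + info["Buffers"] + info["Cached"]
print(easy, "kB free or held in buffers/page cache")  # 2597124 kB
```

On a live system the same function can be fed open("/proc/meminfo").read()
instead of the sample string.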

This only appeared recently, after several weeks of continuous use of the 
filesystem without any problem. Could there be a memory leak 
somewhere? Any help diagnosing the problem would be greatly appreciated.

Thanks!
-- 
Kilian
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
