[lustre-discuss] MDS cases with ldlm_flock_deadlock error

2024-05-05 Thread Lixin Liu via lustre-discuss
Hi,

Starting from yesterday, we have been seeing frequent MDS crashes, all of them showing 
ldlm_flock_deadlock.
Servers are running Lustre 2.15.4 on AlmaLinux 8.9; the MDT and MGT are on ldiskfs and the OSTs are on ZFS.
Clients are mostly CentOS 7.9 with the Lustre 2.15.4 client.
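
(Since ldlm_flock_deadlock() is only exercised for flock requests, one quick check is which clients actually mount with the flock option; a rough sketch, assuming standard Lustre client mounts:)

# on a client: the options column shows flock, localflock or noflock
$ grep ' lustre ' /proc/mounts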

In one of these crashes we captured a complete coredump, in case someone wants to take a look.
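
(For anyone opening the dump, a minimal sketch of inspecting it with the crash utility; this assumes the matching lustre kernel debuginfo is installed and the vmcore sits in the default kdump location, so treat the paths as placeholders:)

$ crash /usr/lib/debug/lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/vmlinux \
        /var/crash/<dump-dir>/vmcore
crash> bt                 # back-trace of the panicking mdt01_003 thread
crash> ps | grep mdt      # other MDT service threads at the time of the LBUG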

Thanks,

Lixin.

[15817.464501] LustreError: 22687:0:(ldlm_flock.c:230:ldlm_flock_deadlock()) 
ASSERTION( req != lock ) failed:
[15817.474247] LustreError: 22687:0:(ldlm_flock.c:230:ldlm_flock_deadlock()) 
LBUG
[15817.481497] Pid: 22687, comm: mdt01_003 4.18.0-513.9.1.el8_lustre.x86_64 #1 
SMP Sat Dec 23 05:23:32 UTC 2023
[15817.491318] Call Trace TBD:
[15817.494137] [<0>] libcfs_call_trace+0x6f/0xa0 [libcfs]
[15817.499297] [<0>] lbug_with_loc+0x3f/0x70 [libcfs]
[15817.504097] [<0>] ldlm_flock_deadlock.isra.10+0x1fb/0x240 [ptlrpc]
[15817.510398] [<0>] ldlm_process_flock_lock+0x289/0x1f90 [ptlrpc]
[15817.516402] [<0>] ldlm_lock_enqueue+0x2a5/0xaa0 [ptlrpc]
[15817.521813] [<0>] ldlm_handle_enqueue0+0x634/0x1520 [ptlrpc]
[15817.527562] [<0>] tgt_enqueue+0xa4/0x220 [ptlrpc]
[15817.532368] [<0>] tgt_request_handle+0xccd/0x1a20 [ptlrpc]
[15817.537949] [<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
[15817.544311] [<0>] ptlrpc_main+0xbec/0x1530 [ptlrpc]
[15817.549294] [<0>] kthread+0x134/0x150
[15817.552966] [<0>] ret_from_fork+0x1f/0x40
[15817.556980] Kernel panic - not syncing: LBUG
[15817.561248] CPU: 23 PID: 22687 Comm: mdt01_003 Kdump: loaded Tainted: G  
 OE- -  - 4.18.0-513.9.1.el8_lustre.x86_64 #1
[15817.573669] Hardware name: Dell Inc. PowerEdge R640/0CRT1G, BIOS 2.19.1 
06/04/2023
[15817.581235] Call Trace:
[15817.583687]  dump_stack+0x41/0x60
[15817.587007]  panic+0xe7/0x2ac
[15817.589979]  ? ret_from_fork+0x1f/0x40
[15817.593733]  lbug_with_loc.cold.8+0x18/0x18 [libcfs]
[15817.598714]  ldlm_flock_deadlock.isra.10+0x1fb/0x240 [ptlrpc]
[15817.604557]  ldlm_process_flock_lock+0x289/0x1f90 [ptlrpc]
[15817.610121]  ? lustre_msg_get_flags+0x2a/0x90 [ptlrpc]
[15817.615346]  ? lustre_msg_add_version+0x21/0xa0 [ptlrpc]
[15817.620745]  ldlm_lock_enqueue+0x2a5/0xaa0 [ptlrpc]
[15817.625702]  ldlm_handle_enqueue0+0x634/0x1520 [ptlrpc]
[15817.631007]  tgt_enqueue+0xa4/0x220 [ptlrpc]
[15817.635365]  tgt_request_handle+0xccd/0x1a20 [ptlrpc]
[15817.640503]  ? ptlrpc_nrs_req_get_nolock0+0xff/0x1f0 [ptlrpc]
[15817.646337]  ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
[15817.652256]  ptlrpc_main+0xbec/0x1530 [ptlrpc]
[15817.656791]  ? ptlrpc_wait_event+0x590/0x590 [ptlrpc]
[15817.661928]  kthread+0x134/0x150
[15817.665161]  ? set_kthread_struct+0x50/0x50
[15817.669346]  ret_from_fork+0x1f/0x40


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] MDS crashes, lustre version 2.15.3

2023-11-29 Thread Lixin Liu via lustre-discuss
Hi Aurelien,

Thanks, I guess we will have to rebuild our own 2.15.x server. I see that other crashes 
produce a different dump, usually like these:

[36664.403408] BUG: unable to handle kernel NULL pointer dereference at 

[36664.411237] PGD 0 P4D 0
[36664.413776] Oops:  [#1] SMP PTI
[36664.417268] CPU: 28 PID: 11101 Comm: qmt_reba_cedar_ Kdump: loaded Tainted: 
G  IOE- -  - 4.18.0-477.10.1.el8_lustre.x86_64 #1
[36664.430293] Hardware name: Dell Inc. PowerEdge R640/0CRT1G, BIOS 2.19.1 
06/04/2023
[36664.437860] RIP: 0010:qmt_id_lock_cb+0x69/0x100 [lquota]
[36664.443199] Code: 48 8b 53 20 8b 4a 0c 85 c9 74 78 89 c1 48 8b 42 18 83 78 
10 02 75 0a 83 e1 01 b8 01 00 00 00 74 17 48 63 44 24 04 48 c1 e0 04 <48> 03 45 
00 f6 40 08 0c 0f 95 c0 0f b6 c0 48 8b 4c 24 08 65 48 33
[36664.461942] RSP: 0018:aa2e303f3df0 EFLAGS: 00010246
[36664.467169] RAX:  RBX: 98722c74b700 RCX: 
[36664.474301] RDX: 9880415ce660 RSI: 0010 RDI: 9881240b5c64
[36664.481435] RBP:  R08:  R09: 0004
[36664.488566] R10: 0010 R11: f000 R12: 98722c74b700
[36664.495697] R13: 9875fc07a320 R14: 9878444d3d10 R15: 9878444d3cc0
[36664.502832] FS:  () GS:987f20f8() 
knlGS:
[36664.510917] CS:  0010 DS:  ES:  CR0: 80050033
[36664.516664] CR2:  CR3: 002065a10004 CR4: 007706e0
[36664.523794] DR0:  DR1:  DR2: 
[36664.530927] DR3:  DR6: fffe0ff0 DR7: 0400
[36664.538058] PKRU: 5554
[36664.540772] Call Trace:
[36664.543231]  ? cfs_cdebug_show.part.3.constprop.23+0x20/0x20 [lquota]
[36664.549699]  qmt_glimpse_lock.isra.20+0x1e7/0xfa0 [lquota]
[36664.555204]  qmt_reba_thread+0x5cd/0x9b0 [lquota]
[36664.559927]  ? qmt_glimpse_lock.isra.20+0xfa0/0xfa0 [lquota]
[36664.565602]  kthread+0x134/0x150
[36664.568834]  ? set_kthread_struct+0x50/0x50
[36664.573021]  ret_from_fork+0x1f/0x40
[36664.576603] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) 
mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) mbcache jbd2 lustre(OE) 
lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ksocklnd(OE) ko2iblnd(OE) 
ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) 8021q garp mrp stp llc dell_rbu 
vfat fat dm_round_robin dm_multipath rpcrdma sunrpc rdma_ucm ib_srpt ib_isert 
iscsi_target_mod target_core_mod ib_iser libiscsi opa_vnic scsi_transport_iscsi 
ib_umad rdma_cm ib_ipoib iw_cm ib_cm intel_rapl_msr intel_rapl_common 
isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp 
coretemp kvm_intel dell_smbios iTCO_wdt iTCO_vendor_support wmi_bmof 
dell_wmi_descriptor dcdbas kvm ipmi_ssif irqbypass crct10dif_pclmul hfi1 
mgag200 crc32_pclmul drm_shmem_helper ghash_clmulni_intel rdmavt qla2xxx 
drm_kms_helper rapl ib_uverbs nvme_fc intel_cstate syscopyarea nvme_fabrics 
sysfillrect sysimgblt nvme_core intel_uncore fb_sys_fops pcspkr acpi_ipmi 
ib_core scsi_transport_fc igb
[36664.576699]  drm ipmi_si i2c_algo_bit mei_me dca ipmi_devintf mei i2c_i801 
lpc_ich wmi ipmi_msghandler acpi_power_meter xfs libcrc32c sd_mod t10_pi sg 
ahci libahci crc32c_intel libata megaraid_sas dm_mirror dm_region_hash dm_log 
dm_mod
[36664.684758] CR2: 

Is this also related to the same bug?

Thanks,

Lixin.

From: Aurelien Degremont 
Date: Wednesday, November 29, 2023 at 8:31 AM
To: lustre-discuss , Lixin Liu 
Subject: RE: MDS crashes, lustre version 2.15.3

You are likely hitting the bug https://jira.whamcloud.com/browse/LU-15207, 
which is fixed in the (not yet released) 2.16.0.
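
(For anyone who ends up back-porting the fix onto 2.15.x rather than waiting for 2.16.0, the usual rebuild flow is roughly the following; the change to cherry-pick comes from the LU-15207 ticket, so the refs/changes path below is only a placeholder:)

$ git clone git://git.whamcloud.com/fs/lustre-release.git && cd lustre-release
$ git checkout 2.15.3
$ git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/NN/NNNNN/N && \
  git cherry-pick FETCH_HEAD                  # placeholder ref; take it from the ticket
$ sh autogen.sh
$ ./configure --enable-server --with-linux=/usr/src/kernels/$(uname -r)
$ make rpms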

Aurélien

From: lustre-discuss  on behalf of 
Lixin Liu via lustre-discuss 
Sent: Wednesday, 29 November 2023 17:18
To: lustre-discuss 
Subject: [lustre-discuss] MDS crashes, lustre version 2.15.3

Hi,

We built our 2.15.3 environment a few months ago. The MDT is using ldiskfs and the OSTs are using ZFS.
The system seemed to perform well at the beginning, but recently we have been seeing frequent MDS crashes.
The vmcore-dmesg.txt shows the following:

[26056.031259] LustreError: 69513:0:(hash.c:1469:cfs_hash_for_each_tight()) 
ASSERTION( !cfs_hash_is_rehashing(hs) ) failed:
[26056.043494] LustreError: 69513:0:(hash.c:1469:cfs_hash_for_each_tight()) LBUG
[26056.051460] Pid: 69513, comm: lquota_wb_cedar 
4.18.0-477.10.1.el8_lustre.x86_64 #1 SMP Tue Jun 20 00:12:13 UTC 2023
[26056.063099] Call Trace TBD:
[26056.066221] [<0>] libcfs_call_trace+0x6f/0xa0 [libcfs]
[26056.071970] [<0>] lbug_with_loc+0x3f/0x70 [libcfs]
[26056.077322] [<0>] cfs_hash_for_each_tight+0x301/0x310 [libcfs]
[26056.083839] [<0>] qsd_start_reint_thread+0x561/0xcc0 [lquota]
[26056.090265] [<0>] qsd_upd_thread+0xd43/0x1040 [lquota]
[26056.096008] [<0>] kthread+0x13

[lustre-discuss] Random drop off OST from clients

2023-10-05 Thread Lixin Liu
 Hi,

Recently, we have frequently seen OSTs being randomly dropped by some client nodes.

We have 4 Lustre filesystems with 126 OSTs in total. All clients are running the 2.15.3 
client on CentOS 7.
Servers are CentOS 7 with Lustre 2.12.8 (3 filesystems) and AlmaLinux 8.8 with 2.15.3. 
Failures can happen
with both server versions. LNet is using the OPA interface.

One example of the failure looks like this:

# lctl dl | grep ' IN '
126 IN osc cedar_sc-OST000a-osc-980c76944800 
52e66575-6443-4be9-a7ce-348b526a0836 4

In syslog, we see

Oct  4 23:24:30 cedar5 kernel: LustreError: 11-0: 
cedar_sc-OST000a-osc-980c76944800: operation ldlm_enqueue to node 
172.19.128.33@o2ib failed: rc = -107
Oct  4 23:24:30 cedar5 kernel: Lustre: cedar_sc-OST000a-osc-980c76944800: 
Connection to cedar_sc-OST000a (at 172.19.128.33@o2ib) was lost; in progress 
operations using this service will wait for recovery to complete
Oct  4 23:24:30 cedar5 kernel: LustreError: 
5195:0:(osc_request.c:1037:osc_init_grant()) 
cedar_sc-OST000a-osc-980c76944800: granted 3407872 but already consumed 
519700480
Oct  4 23:24:30 cedar5 kernel: LustreError: 167-0: 
cedar_sc-OST000a-osc-980c76944800: This client was evicted by 
cedar_sc-OST000a; in progress operations using this service will fail.
Oct  4 23:24:31 cedar5 kernel: LustreError: 
62880:0:(ldlm_resource.c:1126:ldlm_resource_complain()) 
cedar_sc-OST000a-osc-980c76944800: namespace resource 
[0x73fbbe2:0x0:0x0].0x0 (97fe127e3080) refcount nonzero (1) after lock 
cleanup; forcing cleanup.
Oct  4 23:24:31 cedar5 kernel: LustreError: 
5218:0:(osc_request.c:711:osc_announce_cached()) 
cedar_sc-OST000a-osc-980c76944800: dirty 131074 > system dirty_max 131072
Oct  4 23:24:36 cedar5 kernel: LustreError: 
5209:0:(osc_request.c:711:osc_announce_cached()) 
cedar_sc-OST000a-osc-980c76944800: dirty 131074 > system dirty_max 131072
Oct  4 23:24:47 cedar5 kernel: LustreError: 
5220:0:(osc_request.c:711:osc_announce_cached()) 
cedar_sc-OST000a-osc-980c76944800: dirty 131072 > system dirty_max 131072
Oct  4 23:25:36 cedar5 kernel: LustreError: 
5242:0:(osc_request.c:711:osc_announce_cached()) 
cedar_sc-OST000a-osc-980c76944800: dirty 131074 > system dirty_max 131072


This one in particular is a 2.15.3 server. Once this happens, the only way to recover 
appears to be rebooting the client, after which the issue goes away.
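
(Before rebooting, it may be worth trying to kick the stuck import by hand; a sketch, not verified on this setup, using the device index from the lctl dl output above:)

$ lctl get_param osc.cedar_sc-OST000a-osc-*.import    # current import/connection state
$ sudo lctl --device 126 recover                      # device index from 'lctl dl'
$ sudo lctl --device 126 activate
$ lctl dl | grep ' IN '                               # re-check whether it is still inactive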

Any ideas where we should check?

Thank you very much.

Lixin.



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Cannot move files to another directory

2020-12-11 Thread Lixin Liu
I tried to send this yesterday, but it did not seem to get through. Trying again now.

Hi,

We moved our MDT and MGT to a new storage device (DDN SFA200NV) this week. Everything
appears to work, but there is a very strange problem: we cannot "mv" a file into another directory.
This affects both old and new data. Here is an example:

$ mkdir testdir

$ dd if=/dev/zero of=testfile bs=1024k count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB, 10 MiB) copied, 0.292493 s, 35.8 MB/s

$ cp testfile testdir/
$ mv testfile testdir/testfile.new
mv: cannot move 'testfile' to 'testdir/testfile.new': No such file or directory

$ chmod 666 testfile

$ ls -l
total 10244
drwxr-x--- 2 liu liu 4096 Dec 10 08:12 testdir
-rw-rw-rw- 1 liu liu 10485760 Dec 10 08:11 testfile

$ rm testfile
$

"mv" a file in the same directory works.

I have an open case with DDN about this issue, but would like to know if 
someone here
has any suggestions.

Thanks,

Lixin Liu
Simon Fraser University

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Robinhood changelog errors

2020-05-30 Thread Lixin Liu
After setting changelog_mask and restarting Robinhood, the problem cleared.
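
(For the record, the mask and the registered changelog readers are ordinary lctl parameters on the MDS; roughly, with the MDT name left as a placeholder and the mask values only as an example:)

# on the MDS
$ lctl get_param mdd.<fsname>-MDT0000.changelog_users    # Robinhood's reader id, e.g. cl1
$ lctl get_param mdd.<fsname>-MDT0000.changelog_mask
$ lctl set_param mdd.<fsname>-MDT0000.changelog_mask="MARK CREAT UNLNK RENME RNMTO CLOSE SATTR"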

Lixin.

On 2020-05-29, 11:06 PM, "lustre-discuss on behalf of Lixin Liu" 
 wrote:

 I am getting an error every second in the robinhood log:

2020/05/29 22:10:37 [38025/3] ChangeLog | Error in llapi_changelog_recv(): 
-2: No such file or directory
2020/05/29 22:10:38 [38025/3] ChangeLog | Error in llapi_changelog_recv(): 
-2: No such file or directory
2020/05/29 22:10:39 [38025/3] ChangeLog | Error in llapi_changelog_recv(): 
-2: No such file or directory
2020/05/29 22:10:40 [38025/3] ChangeLog | Error in llapi_changelog_recv(): 
-2: No such file or directory
2020/05/29 22:10:41 [38025/3] ChangeLog | Error in llapi_changelog_recv(): 
-2: No such file or directory

These started in early April.

Is there something I can do to determine the cause?

Thanks,

    Lixin Liu
Simon Fraser University

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Robinhood changelog errors

2020-05-30 Thread Lixin Liu
 I am getting an error every second in the robinhood log:

2020/05/29 22:10:37 [38025/3] ChangeLog | Error in llapi_changelog_recv(): -2: 
No such file or directory
2020/05/29 22:10:38 [38025/3] ChangeLog | Error in llapi_changelog_recv(): -2: 
No such file or directory
2020/05/29 22:10:39 [38025/3] ChangeLog | Error in llapi_changelog_recv(): -2: 
No such file or directory
2020/05/29 22:10:40 [38025/3] ChangeLog | Error in llapi_changelog_recv(): -2: 
No such file or directory
2020/05/29 22:10:41 [38025/3] ChangeLog | Error in llapi_changelog_recv(): -2: 
No such file or directory

These started in early April.

Is there something I can do to determine the cause?

Thanks,

Lixin Liu
Simon Fraser University

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] strange errors on Lustre servers

2018-08-12 Thread Lixin Liu
Hi Zeeshan,

Thanks for the hint.

OPA works fine, but then I found that someone had brought up a misconfigured node with a 
conflicting IP address on its OPA interface. Fixing that solved the problem.
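
(In case anyone hits the same thing, duplicate address detection on the IPoIB interface plus an LNet ping of the suspect NID usually exposes it quickly; a sketch, assuming ib0 is the OPA interface:)

$ arping -D -c 3 -I ib0 172.19.142.119    # DAD mode; answers from two different MACs mean a conflict
$ lctl ping 172.19.142.119@o2ib           # LNet-level reachability of the NID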

Lixin.


From: Zeeshan Ali Shah 
Date: Saturday, August 11, 2018 at 11:20 PM
To: Lixin Liu 
Cc: "lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] strange errors on Lustre servers

What is the output of opainfo?
Sent from my iPhone

On 12 Aug 2018, at 04:04, Lixin Liu <l...@sfu.ca> wrote:
Hi,

I am getting these errors on all our MDS and OSS servers (Lustre 2.10.1):

Aug 11 11:45:52 ndc-oss5b kernel: LNet: 
24727:0:(o2iblnd_cb.c:2410:kiblnd_passive_connect()) Conn stale 
172.19.142.119@o2ib version 12/12 incarnation 1533927051163335/1533998625080752
Aug 11 11:55:52 ndc-oss5b kernel: LNet: 
105990:0:(o2iblnd_cb.c:2410:kiblnd_passive_connect()) Conn stale 
172.19.142.119@o2ib version 12/12 incarnation 1533927051163335/1533998625080752
Aug 11 12:05:52 ndc-oss5b kernel: LNet: 
105990:0:(o2iblnd_cb.c:2410:kiblnd_passive_connect()) Conn stale 
172.19.142.119@o2ib version 12/12 incarnation 1533927051163335/1533998625080752
Aug 11 12:15:52 ndc-oss5b kernel: LNet: 
105990:0:(o2iblnd_cb.c:2410:kiblnd_passive_connect()) Conn stale 
172.19.142.119@o2ib version 12/12 incarnation 1533927051163335/1533998625080752
Aug 11 12:25:52 ndc-oss5b kernel: LNet: 
105990:0:(o2iblnd_cb.c:2410:kiblnd_passive_connect()) Conn stale 
172.19.142.119@o2ib version 12/12 incarnation 1533927051163335/1533998625080752

This is a new node we brought online recently. Is it an indication that there is a problem with 
the OPA interface on this node? This machine has an 8160F CPU (OPA interface on chip).

Thanks,

Lixin Liu
High Performance Computing
Simon Fraser University

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] strange errors on Lustre servers

2018-08-11 Thread Lixin Liu
Hi,

I am getting these errors on all our MDS and OSS servers (Lustre 2.10.1):

Aug 11 11:45:52 ndc-oss5b kernel: LNet: 
24727:0:(o2iblnd_cb.c:2410:kiblnd_passive_connect()) Conn stale 
172.19.142.119@o2ib version 12/12 incarnation 1533927051163335/1533998625080752
Aug 11 11:55:52 ndc-oss5b kernel: LNet: 
105990:0:(o2iblnd_cb.c:2410:kiblnd_passive_connect()) Conn stale 
172.19.142.119@o2ib version 12/12 incarnation 1533927051163335/1533998625080752
Aug 11 12:05:52 ndc-oss5b kernel: LNet: 
105990:0:(o2iblnd_cb.c:2410:kiblnd_passive_connect()) Conn stale 
172.19.142.119@o2ib version 12/12 incarnation 1533927051163335/1533998625080752
Aug 11 12:15:52 ndc-oss5b kernel: LNet: 
105990:0:(o2iblnd_cb.c:2410:kiblnd_passive_connect()) Conn stale 
172.19.142.119@o2ib version 12/12 incarnation 1533927051163335/1533998625080752
Aug 11 12:25:52 ndc-oss5b kernel: LNet: 
105990:0:(o2iblnd_cb.c:2410:kiblnd_passive_connect()) Conn stale 
172.19.142.119@o2ib version 12/12 incarnation 1533927051163335/1533998625080752

This is a new node we brought online recently. Is it an indication that there is a problem with 
the OPA interface on this node? This machine has an 8160F CPU (OPA interface on chip).

Thanks,

Lixin Liu
High Performance Computing
Simon Fraser University

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] BAD CHECKSUM

2017-12-09 Thread Lixin Liu
Hi Andreas,

I have seen very similar errors in our 2.10.1 environment. Same errors from different clients to 
different OSS servers and OSTs. Our network is OPA and we are using the latest driver and 
firmware for all HFIs and switches (10.6).

Thanks,

Lixin Liu
Compute Canada

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
"Dilger, Andreas" <andreas.dil...@intel.com>
Date: Saturday, December 9, 2017 at 9:07 PM
To: Hans Henrik Happe <ha...@nbi.dk>
Cc: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] BAD CHECKSUM

Based on the messages on the client, this isn’t related to mmap() or writes done by the client, 
since the data has the same checksum from before it was sent and after it got the checksum error 
returned from the server. That means the pages did not change on the client.

Possible causes include the client network card, server network card, memory, or possibly the 
OFED driver?  It could of course be something in Lustre/LNet, though we haven’t had any reports 
of anything similar.

When the checksum code was first written, it was motivated by a faulty Ethernet NIC that had TCP 
checksum offload, but bad onboard cache, and the data was corrupted when copied onto the NIC but 
the TCP checksum was computed on the bad data and the checksum was “correct” when received by 
the server, so it didn’t cause TCP resends.

Are you seeing this on multiple servers?  The client log only shows one server, while the server 
log shows multiple clients.  If it is only happening on one server it might point to hardware.

Did you also upgrade the kernel and OFED at the same time as Lustre? You could try building 
Lustre 2.10.1 on the old 2.9.0 kernel and OFED to see if that works properly.

Cheers, Andreas


On Dec 9, 2017, at 11:09, Hans Henrik Happe <ha...@nbi.dk> wrote:



On 09-12-2017 18:57, Hans Henrik Happe wrote:


On 07-12-2017 21:36, Dilger, Andreas wrote:

On Dec 7, 2017, at 10:37, Hans Henrik Happe <ha...@nbi.dk> wrote:

Hi,

Can an application cause BAD CHECKSUM errors in Lustre logs by somehow overwriting memory 
while it is being DMA'ed to the network?

After upgrading to 2.10.1 on the server side we started seeing this from a user's application 
(MPI I/O). Both 2.9.0 and 2.10.1 clients emit these errors. We have not yet established whether 
the application is doing things correctly.

If applications are using mmap IO it is possible for the page to become inconsistent after the 
checksum has been computed.  However, mmap IO is normally detected by the client and no message 
should be printed.

There isn't anything that the application needs to do, since the client will resend the data if 
there is a checksum error, but the resends do slow down the IO.  If the inconsistency is on the 
client, there is no cause for concern (though it would be good to figure out the root cause).

It would be interesting to see what the exact error message is, since that will say whether the 
data became inconsistent on the client, or over the network.  If the inconsistency is over the 
network or on the server, then that may point to hardware issues.
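
(On the client side, the checksum settings in play are visible as lctl parameters; a quick sketch, parameter names as on 2.10.x clients:)

$ lctl get_param osc.*.checksums        # 1 = wire checksums enabled
$ lctl get_param osc.*.checksum_type    # algorithm in use, e.g. crc32c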

I've attached logs from a server and a client.


There was a cut n' paste error in the first set of files. This should be
better.

Looks like something goes wrong over the network.

Cheers,
Hans Henrik






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org