Re: [Lustre-discuss] 1.8.x server for el6
Peter, Andreas,

Thank you both for your assessment. I decided to proceed conservatively with 1.8.7 on el5. I'll take the opportunity to get more comfortable with 2.x on el6 in a test system on the earlier server nodes.

Best, Michael

On January 5, 2012, at 12:12 , Peter Jones wrote:
> inline
>
> On 12-01-05 9:21 AM, Andreas Dilger wrote:
>> On 2012-01-05, at 3:29 AM, Michael Sternberg wrote:
>>> I am wondering about the release matrix for Lustre vs. RHEL/CentOS.
>>> [..]
>>> It appears that the current recommended production release is Lustre-1.8,
>>> and, since there is no server for el6 I take that as a strong hint to stick
>>> to el5.
>> For new deployments the recommended version is 2.1.0 with RHEL6.1. We are
>> starting work on a 2.1.1 maintenance release for the spring.
> While it is not often that I would disagree with Andreas, I would say
> that the answer on this point depends upon your timing. Right now, if
> stability is your primary driver (and it sounds like it is) then I would
> recommend 1.8.7-wc1. The early feedback from 2.1 is very encouraging,
> but I think that we need a little more production feedback before we
> could confidently assert that 2.1.x is the default option.
>>
>>> Is there some more information public? I was planning to upgrade a cluster
>>> to el6. Ideally, I'd like to have servers and user nodes all on the same
>>> major OS release. However, stability will trump that.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
[Lustre-discuss] 1.8.x server for el6
I am wondering about the release matrix for Lustre vs. RHEL/CentOS.

==
                   lustre.org         whamcloud
Release            el5      el6       el5      el6
--
1.8.7 client       yes      no        yes      yes
1.8.7 server       yes      no        yes      no
2.0                (discontinued)     n/a
2.1 client         n/a                yes      yes
2.1 server         n/a                yes      yes
==

http://downloads.lustre.org/public/lustre/v1.8/lustre_1.8.7/
http://downloads.whamcloud.com/public/lustre/lustre-1.8.7-wc1/el6/
http://downloads.whamcloud.com/public/lustre/lustre-2.1.0/

It appears that the current recommended production release is Lustre-1.8, and, since there is no server for el6, I take that as a strong hint to stick to el5.

Is there some more information public? I was planning to upgrade a cluster to el6. Ideally, I'd like to have servers and user nodes all on the same major OS release. However, stability will trump that.

Best, Michael
Re: [Lustre-discuss] Lustre-2.1, CentOS-6.2: unknown param quota_type=ug
Phew! OK, I found a manual bug on this already open:

http://jira.whamcloud.com/browse/LUDOC-26

Thank you very much!

Michael

On Jan 5, 2012, at 3:10 , Johann Lombardi wrote:
> In lustre 2.x, this must be mdd.quota_type (instead of mdt.quota_type). A
> patch was landed to master some time ago (will be available in 2.2) to
> interpret mdt.quota_type as mdd.quota_type transparently, see
> http://review.whamcloud.com/#change,354.
>
> Cheers,
> Johann
>
> On Thu, Jan 05, 2012 at 01:35:14AM -0600, Michael Sternberg wrote:
>> I cannot get quota to work on a clean new file system with Lustre-2.1 on
>> CentOS-6.2.
>>
>> # tunefs.lustre --param mdt.quota_type=ug /dev/mapper/mdt1
>> # mount /mnt/mdt1
>> mount.lustre: mount /dev/mapper/mdt1 at /mnt/mdt1 failed: Function not
>> implemented
>>
>> This logs:
>>
>> LustreError: 3957:0:(obd_config.c:1140:class_process_proc_param())
>> carbonfs-MDT: unknown param quota_type=ug
>>
>> Am I missing some component or parameter syntax? I went by
>> http://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.html#dbdoclet.50438217_31982
>> .
>>
>> I actually got "Function not implemented" already with mkfs.lustre --param
>> mdt.quota_type=ug2 ... I decided to defer quota setup until later because I
>> had a note to this effect from a much earlier install. So, I got everything
>> up and running without quota, MGS MDS on same node but different LUNs, 4 OSS
>> with external journals, client.
>>
>> The system for all nodes (servers and client) is:
>>
>> - CentOS-6.2, but with the wc1 kernel for CentOS-6.1 =
>>   2.6.32-131.6.1.el6_lustre.g65156ed.x86_64
>>
>>   (Is this asking for trouble? Things worked under 1.8.x with
>>   sticking to a previous minor-version OS kernel.)
>>
>> - Failover for the MGS/MDS pair and 2 OSS pairs.
>>
>> I had been using quota with 1.8.6 on CentOS-5.6 without problems. The 2.1 fs
>> is on new hardware to migrate to.
>>
>> Best, Michael
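[Editor's note: Johann's fix amounts to a one-token rename of the parameter prefix, from the 1.8-era "mdt" to the 2.x metadata component "mdd". A minimal illustrative sketch of the transparent aliasing that the later patch (http://review.whamcloud.com/#change,354) performs; the function name and logic here are this note's own, not Lustre source code.]

```python
def normalize_quota_param(param: str) -> str:
    """Rewrite the legacy 1.8-style 'mdt.quota_type' prefix to the
    2.x component name 'mdd.quota_type'; leave other params alone.
    (Sketch only -- mimics the behavior Johann describes for 2.2.)"""
    prefix, _, rest = param.partition(".")
    if prefix == "mdt" and rest.startswith("quota_type"):
        return "mdd." + rest
    return param

print(normalize_quota_param("mdt.quota_type=ug"))  # mdd.quota_type=ug
print(normalize_quota_param("ost.quota_type=ug"))  # ost.quota_type=ug
```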
[Lustre-discuss] Lustre-2.1, CentOS-6.2: unknown param quota_type=ug
Hello,

I cannot get quota to work on a clean new file system with Lustre-2.1 on CentOS-6.2.

# tunefs.lustre --param mdt.quota_type=ug /dev/mapper/mdt1
# mount /mnt/mdt1
mount.lustre: mount /dev/mapper/mdt1 at /mnt/mdt1 failed: Function not implemented

This logs:

LustreError: 3957:0:(obd_config.c:1140:class_process_proc_param()) carbonfs-MDT: unknown param quota_type=ug

Am I missing some component or parameter syntax? I went by http://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.html#dbdoclet.50438217_31982 .

I actually got "Function not implemented" already with mkfs.lustre --param mdt.quota_type=ug2 ... I decided to defer quota setup until later because I had a note to this effect from a much earlier install. So, I got everything up and running without quota, MGS MDS on same node but different LUNs, 4 OSS with external journals, client.

The system for all nodes (servers and client) is:

- CentOS-6.2, but with the wc1 kernel for CentOS-6.1 = 2.6.32-131.6.1.el6_lustre.g65156ed.x86_64

  (Is this asking for trouble? Things worked under 1.8.x with sticking to a previous minor-version OS kernel.)

- Failover for the MGS/MDS pair and 2 OSS pairs.

I had been using quota with 1.8.6 on CentOS-5.6 without problems. The 2.1 fs is on new hardware to migrate to.

Best, Michael

[root@mds03 ~]# rpm -qa | grep lustre
kernel-2.6.32-131.6.1.el6_lustre.g65156ed.x86_64
lustre-ldiskfs-3.3.0-2.6.32_131.6.1.el6_lustre.g65156ed.x86_64_g9d71fe8.x86_64
lustre-modules-2.1.0-2.6.32_131.6.1.el6_lustre.g65156ed.x86_64_g9d71fe8.x86_64
lustre-2.1.0-2.6.32_131.6.1.el6_lustre.g65156ed.x86_64_g9d71fe8.x86_64

[root@mds03 ~]# lsmod | grep quota
lquota    198183  5 mdd
ptlrpc   1070449  13 cmm,mdt,mdd,mds,mgs,mgc,lustre,lov,osc,lquota,mdc,fid,fld
obdclass  804693  29 cmm,osd_ldiskfs,mdt,mdd,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,osc,lquota,mdc,fid,fld,ptlrpc
lvfs       28053  18 cmm,osd_ldiskfs,mdt,mdd,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,osc,lquota,mdc,fid,fld,ptlrpc,obdclass
libcfs    338203  21 cmm,osd_ldiskfs,mdt,mdd,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,osc,lquota,mdc,fid,fld,ksocklnd,ko2iblnd,ptlrpc,obdclass,lnet,lvfs

[root@oss03 ~]# lsmod | grep quota
lquota    198183  5 obdfilter
ptlrpc   1070449  10 obdfilter,ost,mgc,lustre,lov,osc,lquota,mdc,fid,fld
obdclass  804693  24 obdfilter,fsfilt_ldiskfs,ost,mgc,lustre,lov,osc,lquota,mdc,fid,fld,ptlrpc
lvfs       28053  14 obdfilter,fsfilt_ldiskfs,ost,mgc,lustre,lov,osc,lquota,mdc,fid,fld,ptlrpc,obdclass
libcfs    338203  17 obdfilter,fsfilt_ldiskfs,ost,mgc,lustre,lov,osc,lquota,mdc,fid,fld,ptlrpc,obdclass,lvfs,ksocklnd,ko2iblnd,lnet

Syslog says upon the "mount /mnt/mdt1":

Jan 5 01:04:05 mds03 kernel: LDISKFS-fs warning (device dm-1): ldiskfs_fill_super: extents feature not enabled on this filesystem, use tune2fs.
Jan 5 01:04:05 mds03 kernel: LDISKFS-fs (dm-1): warning: maximal mount count reached, running e2fsck is recommended
Jan 5 01:04:05 mds03 kernel: LDISKFS-fs (dm-1): mounted filesystem with ordered data mode
Jan 5 01:04:05 mds03 kernel: LDISKFS-fs warning (device dm-1): ldiskfs_fill_super: extents feature not enabled on this filesystem, use tune2fs.
Jan 5 01:04:05 mds03 kernel: LDISKFS-fs (dm-1): warning: maximal mount count reached, running e2fsck is recommended
Jan 5 01:04:05 mds03 kernel: LDISKFS-fs (dm-1): mounted filesystem with ordered data mode
Jan 5 01:04:05 mds03 kernel: Lustre: Enabling ACL
Jan 5 01:04:05 mds03 kernel: Lustre: carbonfs-MDT: used disk, loading
Jan 5 01:04:05 mds03 kernel: LustreError: 3957:0:(obd_config.c:1140:class_process_proc_param()) carbonfs-MDT: unknown param quota_type=ug
Jan 5 01:04:05 mds03 kernel: LustreError: 3957:0:(obd_config.c:1361:class_config_llog_handler()) Err -38 on cfg command:
Jan 5 01:04:05 mds03 kernel: Lustre: cmd=cf00f 0:carbonfs-MDT 1:mdt.quota_type=ug
Jan 5 01:04:05 mds03 kernel: LustreError: 15c-8: MGC172.17.120.3@o2ib: The configuration from log 'carbonfs-MDT' failed (-38). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Jan 5 01:04:05 mds03 kernel: LustreError: 3901:0:(obd_mount.c:1192:server_start_targets()) failed to start server carbonfs-MDT: -38
Jan 5 01:04:05 mds03 kernel: LustreError: 3901:0:(obd_mount.c:1719:server_fill_super()) Unable to start targets: -38
Jan 5 01:04:05 mds03 kernel: Lustre: Failing over carbonfs-MDT
Jan 5 01:04:05 mds03 kernel: Lustre: Skipped 2 previous similar messages
Jan 5 01:04:05 mds03 kernel: Lustre: 3901:0:(quota_master.c:793:close_quota_files()) quota[0] is off already
Jan 5 01:04:05 mds03 kernel: Lustre: 3901:0:(quota_master.c:793:close_quota_files()) Skipped 1
[Lustre-discuss] Solved: panic on jbd:journal_dirty_metadata
Hello Wojciech,

Confirmed - I built and installed the patch as well, and the problem hasn't occurred again here either - thank you!

For reference, I'm using the released kernel and e2fsprogs rpm plus three rebuilt rpms. The patch only affects obdfilter.ko in lustre-modules. "nm /lib/modules/2.6.18-164.11.1.el5_lustre.1.8.3/kernel/fs/lustre/obdfilter.ko" produced identical output before and after the patch, which I found reassuring.

# rpm -qa | grep -e e2fs -e lustre | sort
e2fsprogs-1.41.10.sun2-0redhat
kernel-2.6.18-164.11.1.el5_lustre.1.8.3
lustre-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3_
lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.3_
lustre-modules-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3_

With best regards, Michael

On Jul 25, 2010, at 18:08 , Wojciech Turek wrote:
> Hi Michael,
>
> Our OSTs were also nearly full when the problem occurred; after installing
> the patch we didn't have a single occurrence of that problem.
>
> Cheers
>
> Wojciech
>
> On 24 July 2010 17:06, Michael Sternberg wrote:
> Wojciech,
>
> Thank you very much for your pointer. Perhaps the fact that the OSTs are
> nearly full contributes(?). I also see higher usage.
>
> In any case, I'll attempt compilation with the patch applied.
>
> With best regards,
> Michael
>
> On Jul 22, 2010, at 9:16 , Wojciech Turek wrote:
> > Hi Michael,
> >
> > This looks like the problem we had some time ago after upgrading to 1.8.3
> >
> > https://bugzilla.lustre.org/show_bug.cgi?id=22889
> >
> > Best regards
> > Wojciech
> >
> > On 20 July 2010 00:00, Michael Sternberg wrote:
> > Hello,
> >
> > I use OSSs with external journal partitions and since lustre-1.8.1 about
> > one to two times a week I get frustrating panics on OSSs as follows:
> >
> > :libcfs:cfs_alloc ...
> > :lvfs:lprocfs_counter_add ...
> > ...
> >
> > RIP [] :jbd:journal_dirty_metadata+0x7f/0x1e3
> > RSP
> > <0>Kernel panic - not syncing: Fatal exception
Re: [Lustre-discuss] panic on jbd:journal_dirty_metadata
Wojciech,

Thank you very much for your pointer. Perhaps the fact that the OSTs are nearly full contributes(?). I also see higher usage.

In any case, I'll attempt compilation with the patch applied.

With best regards, Michael

On Jul 22, 2010, at 9:16 , Wojciech Turek wrote:
> Hi Michael,
>
> This looks like the problem we had some time ago after upgrading to 1.8.3
>
> https://bugzilla.lustre.org/show_bug.cgi?id=22889
>
> Best regards
> Wojciech
>
> On 20 July 2010 00:00, Michael Sternberg wrote:
> Hello,
>
> I use OSSs with external journal partitions and since lustre-1.8.1 about one
> to two times a week I get frustrating panics on OSSs as follows:
>
> :libcfs:cfs_alloc ...
> :lvfs:lprocfs_counter_add ...
> ...
>
> RIP [] :jbd:journal_dirty_metadata+0x7f/0x1e3
> RSP
> <0>Kernel panic - not syncing: Fatal exception
Re: [Lustre-discuss] panic on jbd:journal_dirty_metadata
Hello,

I use OSSs with external journal partitions, and since lustre-1.8.1 about one to two times a week I get frustrating panics on OSSs as follows:

:libcfs:cfs_alloc ...
:lvfs:lprocfs_counter_add ...
...
RIP [] :jbd:journal_dirty_metadata+0x7f/0x1e3
RSP
<0>Kernel panic - not syncing: Fatal exception

(full graphical screenshot attached, hoping it passes through)

Clients sometimes report:

Message from syslogd@ at Mon Jul 19 04:11:46 2010 ...
login4 kernel: journal commit I/O error

I have recently updated to 1.8.3, where I e2fsck'd and re-initialized the external journals, but still get those panics.

I use 2 OSS with heartbeat failover; each one "owns" and normally serves 2 OSTs (4 OSTs total), coming from 4 LUNs on a RAID unit with dual controllers. All OSTs use ldiskfs (pre-ext4 proper, if I understand correctly) with the journals located on partitions of 2 further LUNs. I use a variant of the script at bug 20807 to account for different device numbers of the external journals on the two OSSs.

Failover usually works, eventually, after a load peak of up to 100 on the OSS taking over, and messages about hung threads (see below).

Is there anything I could do besides giving up on external journals? My data stores are RAID1, and the journal disks are a single pair of disks also in RAID1.

I had difficulties locating further information on googling "journal_dirty_metadata" pertaining to lustre/ldiskfs specifically. There are old discussions at:

https://bugzilla.redhat.com/show_bug.cgi?id=183119 (2007/2008 - kernel 2.4.7)
http://oss.oracle.com/pipermail/ocfs2-users/2010-January/004113.html (ahem)

With best regards, Michael

[r...@mds01 ~]# cat /proc/fs/lustre/version
lustre: 1.8.3
kernel: patchless_client
build: 1.8.3-20100409182943-PRISTINE-2.6.18-164.11.1.el5_lustre.1.8.3

[r...@mds01 ~]# uname -r
2.6.18-164.11.1.el5_lustre.1.8.3

[r...@mds01 ~]# lctl dl
0 UP mgs MGS MGS 725
1 UP mgc mgc172.17.12...@o2ib 9642cdcd-4955-ca05-4e85-8a9f6d10c027 5
2 UP mdt MDS MDS_uuid 3
7 UP lov sandbox-mdtlov sandbox-mdtlov_UUID 4
8 UP mds sandbox-MDT sandbox-MDT_UUID 719
9 UP osc sandbox-OST-osc sandbox-mdtlov_UUID 5
10 UP osc sandbox-OST0001-osc sandbox-mdtlov_UUID 5

[r...@mds01 ~]# ssh mds02 lctl dl
0 UP mgc mgc172.17.12...@o2ib 12c07f8c-f1e7-f739-9983-3c3aa2ec492a 5
1 UP mdt MDS MDS_uuid 3
2 UP lov carbonfs-mdtlov carbonfs-mdtlov_UUID 4
3 UP mds carbonfs-MDT carbonfs-MDT_UUID 719
4 UP osc carbonfs-OST0001-osc carbonfs-mdtlov_UUID 5
5 UP osc carbonfs-OST-osc carbonfs-mdtlov_UUID 5

[r...@oss01 ~]# lctl dl
0 UP mgc mgc172.17.12...@o2ib a89ba7f9-2a12-ff77-0321-151a1addf043 5
1 UP ost OSS OSS_uuid 3
2 UP obdfilter sandbox-OST sandbox-OST_UUID 721
3 UP obdfilter carbonfs-OST carbonfs-OST_UUID 721
4 UP obdfilter sandbox-OST0001 sandbox-OST0001_UUID 721
5 UP obdfilter carbonfs-OST0001 carbonfs-OST0001_UUID 721

[client]# lctl dl
0 UP mgc mgc172.17.12...@o2ib dadd88bf-fbad-d933-b02a-a539fd8abfea 5
1 UP lov sandbox-clilov-8101da93c400 30094b3a-b246-e667-ef8a-f6690e4d051c 4
2 UP mdc sandbox-MDT-mdc-8101da93c400 30094b3a-b246-e667-ef8a-f6690e4d051c 5
3 UP osc sandbox-OST-osc-8101da93c400 30094b3a-b246-e667-ef8a-f6690e4d051c 5
4 UP osc sandbox-OST0001-osc-8101da93c400 30094b3a-b246-e667-ef8a-f6690e4d051c 5
5 UP lov carbonfs-clilov-81041dc0f800 4aaa2ddd-eee8-564f-ea48-b9c36a428eb9 4
6 UP mdc carbonfs-MDT-mdc-81041dc0f800 4aaa2ddd-eee8-564f-ea48-b9c36a428eb9 5
7 UP osc carbonfs-OST0001-osc-81041dc0f800 4aaa2ddd-eee8-564f-ea48-b9c36a428eb9 5
8 UP osc carbonfs-OST-osc-81041dc0f800 4aaa2ddd-eee8-564f-ea48-b9c36a428eb9 5

-- "thread hung" messages: --

Jul 19 04:01:05 oss01 kernel: Lustre: carbonfs-OST0001: Recovery period over after 1:05, of 359 clients 358 recovered and 0 were evicted.
Jul 19 04:01:05 oss01 kernel: Lustre: carbonfs-OST0001: sending delayed replies to recovered clients
Jul 19 04:01:05 oss01 kernel: Lustre: carbonfs-OST0001: received MDS connection from 172.17.12...@o2ib
Jul 19 04:03:36 oss01 kernel: LustreError: 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 172.17.0@o2ib ns: filter-carbonfs-OST0001_UUID lock: 810178b8d600/0x80a4d28c4aff67ec lrc: 3/0,0 mode: PW/PW res: 152472/0 rrc: 5 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x10020 remote: 0x4701b34288b619f7 expref: 12 pid: 6806 timeout 4356934523
Jul 19 04:04:16 oss01 kernel: Lustre: 6757:0:(ldlm_lib.c:804:target_handle_connect()) carbonfs-OST0001: exp 8101ecf94e00 already connecting
Jul 19 04:04:16 oss01 kernel: Lustre: 6757:0:(ldlm_lib.c:804:target_handle_connect()) Skipped 38 previous similar messages
Jul 19 0
Re: [Lustre-discuss] A Failed client soft lockup one OSS
PS: The syslog snippet I posted is slightly out of order; I merged from login2 and oss01 and did a simple sort(1).

Michael
Re: [Lustre-discuss] A Failed client soft lockup one OSS
+1 on this one, in my case using lustre-1.8.2 on RHEL-5.4 over o2ib, with patchless clients. My OSS complains about hung service threads:

Service thread pid 16590 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
. . .
Service thread pid 16590 completed after 2403.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).

In this case, n337 (172.16.1.87) went dead, and another client (login2) suffered. The solution was to reboot n337, the initial hung client.

NB: I take the overload warning at its word. The OST is one giant RAID5, which I have scheduled to split into several RAID 1+0 sets next week.

regards, Michael

On Mar 26, 2010, at 2:29 , Lu Wang wrote:
> We find a bug on Lustre 1.8.1.1. Sometimes one client's death may cause a
> soft lockup on an OSS. The affected OSS may reach high CPU system usage, and
> then become unreachable through "lctl ping" now and then.

Mar 13 18:45:13 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1329698739270352 sent from carbonfs-OST to NID 172.16.1...@tcp 0s ago has failed due to network error (7s prior to deadline).
Mar 13 18:45:13 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Mar 13 18:45:13 oss01 kernel: LustreError: 16590:0:(events.c:66:request_out_callback()) Skipped 59 previous similar messages
Mar 13 18:45:13 oss01 kernel: LustreError: 16590:0:(events.c:66:request_out_callback()) @@@ type 4, status -113 r...@810126fc x1329698739270352/t0 o106->@NET_0x2ac100157_UUID:15/16 lens 296/424 e 0 to 1 dl 1268527520 ref 2 fl Rpc:/0/0 rc 0/0
Mar 13 18:45:13 oss01 kernel: LustreError: 16590:0:(lib-move.c:2436:LNetPut()) Error sending PUT to 12345-172.16.1...@tcp: -113
Mar 13 18:45:13 oss01 kernel: LustreError: 16590:0:(lib-move.c:2436:LNetPut()) Skipped 57 previous similar messages
Mar 13 18:45:13 oss01 kernel: r...@810126fc x1329698739270352/t0 o106->@NET_0x2ac100157_UUID:15/16 lens 296/424 e 0 to 1 dl 1268527520 ref 1 fl Rpc:/0/0 rc 0/0
Mar 13 18:45:22 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1329698739270352 sent from carbonfs-OST to NID 172.16.1...@tcp 0s ago has failed due to network error (7s prior to deadline).
Mar 13 18:45:22 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) Skipped 454325 previous similar messages
Mar 13 18:45:22 oss01 kernel: r...@810126fc x1329698739270352/t0 o106->@NET_0x2ac100157_UUID:15/16 lens 296/424 e 0 to 1 dl 1268527529 ref 1 fl Rpc:/2/0 rc 0/0
Mar 13 18:45:29 login2 -- MARK --
Mar 13 18:45:41 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1329698739270352 sent from carbonfs-OST to NID 172.16.1...@tcp 0s ago has failed due to network error (7s prior to deadline).
Mar 13 18:45:41 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) Skipped 903192 previous similar messages
Mar 13 18:45:41 oss01 kernel: r...@810126fc x1329698739270352/t0 o106->@NET_0x2ac100157_UUID:15/16 lens 296/424 e 0 to 1 dl 1268527548 ref 1 fl Rpc:/2/0 rc 0/0
Mar 13 18:46:18 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1329698739270352 sent from carbonfs-OST to NID 172.16.1...@tcp 0s ago has failed due to network error (7s prior to deadline).
Mar 13 18:46:18 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) Skipped 1804452 previous similar messages
Mar 13 18:46:18 oss01 kernel: r...@810126fc x1329698739270352/t0 o106->@NET_0x2ac100157_UUID:15/16 lens 296/424 e 0 to 1 dl 1268527585 ref 1 fl Rpc:/2/0 rc 0/0
Mar 13 18:46:28 oss01 kernel: LustreError: 16590:0:(events.c:66:request_out_callback()) Skipped 3611781 previous similar messages
Mar 13 18:46:28 oss01 kernel: LustreError: 16590:0:(events.c:66:request_out_callback()) @@@ type 4, status -113 r...@810126fc x1329698739270352/t0 o106->@NET_0x2ac100157_UUID:15/16 lens 296/424 e 0 to 1 dl 1268527595 ref 2 fl Rpc:/2/0 rc 0/0
Mar 13 18:46:28 oss01 kernel: LustreError: 16590:0:(lib-move.c:2436:LNetPut()) Error sending PUT to 12345-172.16.1...@tcp: -113
Mar 13 18:46:28 oss01 kernel: LustreError: 16590:0:(lib-move.c:2436:LNetPut()) Skipped 3611782 previous similar messages
Mar 13 18:47:27 oss01 kernel: LustreError: 138-a: carbonfs-OST: A client on nid 172.16.1...@tcp was evicted due to a lock blocking callback to 172.16.1...@tcp timed out: rc -107
Mar 13 18:47:33 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1329698739270352 sent from carbonfs-OST to NID 172.16.1...@tcp 0s ago has failed due to network error (7s prior to deadline).
Mar 13 1
Re: [Lustre-discuss] concurrent open() fails sporadically
On Oct 28, 2009, at 16:40 , David Singleton wrote:
> Michael Sternberg wrote:
>> I'm seeing open() failures when attempting concurrent access in a
>> lustre fs.
>>
>> The following Fortran program fails sporadically when run under
>> mpirun, even on the same host. Note that there is no MPI statement;
>> the mpirun simply keeps the startup times very close together:
>
> See https://bugzilla.lustre.org/show_bug.cgi?id=17545

Thanks - was preparing to file one when a search for "concurrent open" didn't return a hit. Good to see: "Landed on 1.8.2".

Michael
Re: [Lustre-discuss] concurrent open() fails sporadically
On Oct 28, 2009, at 15:47 , Brian J. Murrell wrote:
> On Wed, 2009-10-28 at 15:38 -0500, Michael Sternberg wrote:
>> I'm seeing open() failures when attempting concurrent access in a
>> lustre fs.
>> [..]
>> A C version never failed (thus far):
>
> This might be indicative. Maybe not. Fortran might just be exposing a
> race condition that the C version is not.
> [..]
> What would be ideal is an strace of the fortran program failing so that
> we can see what the system calls did.

Great suggestion! Turns out the file in question has mode 0440, but since the open() is not otherwise specified, Fortran first tries to open read-write, and only then read-only. I'm using:

mpirun -np 2 bash -c 'strace -tt ./a.out 2> strace7-$$.err' > strace7.out

Here's a failure case where the first process fails and the second succeeds. The difference is that in the first process the initial open(.., O_RDWR) returns with ENOENT (fatal) vs. EACCES (will retry). If the timestamps can be trusted, the failing open() comes 0.1 ms *after* the succeeding PID's open(.., O_RDONLY).

$ tail -n 15 strace7*
==> strace7-10831.err <==
17:27:42.630621 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
17:27:42.630686 fstat(0, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
17:27:42.630753 open("test.dat", O_RDWR) = -1 ENOENT (No such file or directory)
17:27:42.631044 write(2, "At line ", 8At line ) = 8
17:27:42.63 write(2, "2", 12)= 1
17:27:42.631171 write(2, " of file ", 9 of file ) = 9
17:27:42.631248 write(2, "test.f", 6test.f) = 6
17:27:42.631322 write(2, "\n", 1 ) = 1
17:27:42.631385 write(2, "Fortran runtime error: ", 23Fortran runtime error: ) = 23
17:27:42.631443 write(2, "No such file or directory", 25No such file or directory) = 25
17:27:42.631500 write(2, "\n", 1 ) = 1
17:27:42.631563 close(0) = 0
17:27:42.631615 exit_group(2) = ?

==> strace7-10832.err <==
17:27:42.629790 fstat(2, {st_mode=S_IFREG|0664, st_size=5542, ...}) = 0
17:27:42.629984 ioctl(2, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fff8a624490) = -1 ENOTTY (Inappropriate ioctl for device)
17:27:42.630076 stat("test.dat", {st_mode=S_IFREG|0440, st_size=805891, ...}) = 0
17:27:42.630163 fstat(2, {st_mode=S_IFREG|0664, st_size=5813, ...}) = 0
17:27:42.630235 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 3), ...}) = 0
17:27:42.630299 fstat(0, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 3), ...}) = 0
17:27:42.630364 open("test.dat", O_RDWR) = -1 EACCES (Permission denied)
17:27:42.630648 open("test.dat", O_RDONLY) = 3
17:27:42.630921 fstat(3, {st_mode=S_IFREG|0440, st_size=805891, ...}) = 0
17:27:42.630998 ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fff8a623240) = -1 ENOTTY (Inappropriate ioctl for device)
17:27:42.631055 close(3) = 0
17:27:42.631133 write(1, " OK", 3) = 3
17:27:42.631193 write(1, "\n", 1) = 1
17:27:42.631252 close(0) = 0
17:27:42.631331 exit_group(0) = ?

A workaround for my user is to either "chmod u+w datafile" or, more cleanly, be explicit in the Fortran open() by saying ACTION='READ'.

With best regards, Michael
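[Editor's note: the retry behavior visible in the strace above (open O_RDWR first, fall back to O_RDONLY on EACCES, fail hard on ENOENT) can be sketched as follows. This is an illustrative Python model, not gfortran source; the helper names are this note's own, and a stub opener stands in for a 0440-mode file so the two error paths are explicit.]

```python
import errno

def fortran_style_open(path, opener):
    """Mimic a Fortran runtime opening STATUS='OLD' with no ACTION:
    try read-write first; on EACCES retry read-only.
    ENOENT propagates, which is the sporadic failure seen on Lustre."""
    try:
        return opener(path, "r+")
    except PermissionError:  # EACCES -> retry read-only
        return opener(path, "r")

def stub_opener(path, mode):
    """Stand-in for the filesystem: a file with mode 0440,
    so any read-write open is denied."""
    if mode == "r+":
        raise PermissionError(errno.EACCES, "Permission denied", path)
    return (path, mode)

print(fortran_style_open("test.dat", stub_opener))  # ('test.dat', 'r')
```

The Lustre bug made the first, doomed open(O_RDWR) occasionally return ENOENT instead of EACCES under concurrency, so the fallback never ran; specifying ACTION='READ' skips the read-write attempt entirely.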
[Lustre-discuss] concurrent open() fails sporadically
Greetings,

I'm seeing open() failures when attempting concurrent access in a lustre fs.

The following Fortran program fails sporadically when run under mpirun, even on the same host. Note that there is no MPI statement; the mpirun simply keeps the startup times very close together:

$ cat test.f
      program test
      open(1, file = 'test.dat', status = 'old')
      close(1)
      write(*,*) "OK"
      end
$ gfortran test.f
$ mpirun -np 8 a.out
 OK
 OK
 OK
 OK
 OK
 OK
 OK
 OK
$ mpirun -np 8 a.out
 OK
 OK
 OK
 OK
 OK
 OK
At line 2 of file test.f
Fortran runtime error: No such file or directory
 OK

The "status = 'old'" seems to be the trigger. A C version never failed (thus far):

$ cat test.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
main () {
    if (fopen("test.dat", "r") == NULL) {
        perror("test.dat");
    } else {
        char hostname[20];
        gethostname(hostname, 20);
        printf("%s: OK\n", hostname);
    }
}

I run 2.6.18-92.1.17.el5_lustre.1.6.7.1smp on RHEL-5.3. The error shows up with both gfortran-4.1.2 20080704 (Red Hat 4.1.2-44) and Intel Fortran 10.1 20090817. The data file size is some 800K. Nothing from lustre shows up in syslog on the clients or servers.

The error is quite unexpected for such a basic operation. Where should I look for parameters to tweak?

I have mounted on the client:

mds01...@o2ib:mds02...@o2ib:/sandbox on /sandbox type lustre (rw)

on the MDS:

/dev/dm-2 on /mnt/mdt-sandbox type lustre (rw)

and OSS:

/dev/dm-2 on /mnt/ost0-sandbox type lustre (rw)

The MGS/MDS sit on the same disk, /dev/dm-1 (which also serves /home).

With best regards, Michael
Re: [Lustre-discuss] autofs parser tantrum
Hello Daire,

Thank you for your note. I ultimately decided to use static mounts; will see ...

I've been very happy with the 1.6.x releases, even more so because tcp nids now work for me (I use a custom tcp port, and never got clients to connect previously.)

Regards, Michael

On Apr 27, 2009, at 5:58 , Daire Byrne wrote:
> We had to patch the autofs package to add "@" as a special
> character. I've never tried the failover MDS mount case but I'm sure
> it will work fine as long as autofs knows to ignore the '@' and ':'
> correctly and pass that on to mount.
>
> Daire
[Lustre-discuss] autofs parser tantrum
Hi,

I'm struggling with the autofs(5) parser (on RHEL-5.2), which does not understand the lustre mount spec. This is not a lustre problem, but I wonder if someone has used automount successfully to mount a lustre file system.

My /etc/auto.master is:

/-    /etc/auto.direct

and the direct map is (/etc/auto.direct):

/sandbox    -fstype=lustre    md...@tcp:md...@tcp:/sandbox

Upon a mount request I get the following in the autofs log, showing how the parser mangles the mount spec:

lookup_mount: lookup(file): looking up /sandbox
lookup_mount: lookup(file): /sandbox -> -rw md...@tcp:md...@tcp:/sandbox
parse_mount: parse(sun): expanded entry: -rw md...@tcp:md...@tcp//sandbox
                                                           ^^

If I leave out the failover server, the parser complains of "invalid location":

lookup_mount: lookup(file): looking up /sandbox
lookup_mount: lookup(file): /sandbox -> -fstype=lustre md...@tcp:/sandbox
parse_mount: parse(sun): expanded entry: -fstype=lustre md...@tcp:/sandbox
parse_mount: parse(sun): gathered options: fstype=lustre
parse(sun): invalid location md...@tcp:/sandbox

What's a workable way to automount lustre, or is this misguided?

The reason I want to use automount is that the client is the management node of an HPC cluster. It needs the mount only occasionally, such as for creating home directories and debugging. Since the node is critical for the cluster (there are actually two, in high availability), I'd like to shield it (them) from any lustre recovery outages.

Regards, Michael
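[Editor's note: the static-mount alternative that the poster eventually settled on can be expressed as an /etc/fstab line. This is a sketch: the hostnames follow the (obfuscated) placeholders used in this thread, and should be replaced with real NIDs; `_netdev` defers the mount until networking is up.]

```
# /etc/fstab -- static Lustre client mount with a failover MGS NID
# (hostnames are illustrative; substitute your own MGS NIDs)
mds01@tcp:mds02@tcp:/sandbox  /sandbox  lustre  defaults,_netdev  0 0
```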
Re: [Lustre-discuss] NAME field in lustre-client rpms
On Apr 23, 2009, at 12:44 , Brian J. Murrell wrote:
> On Thu, 2009-04-23 at 12:18 -0500, Michael Sternberg wrote:
>> Still, how could one serve both the patched and patchless lustre rpms
>> out of the same yum repository?
>
> See bug 13908.
> https://bugzilla.lustre.org/show_bug.cgi?id=13908
>
> --- Comment #37 From Brian Murrell 2009-04-22 10:57:38 ---
> (From update of attachment 22951)
> Landed to lustre-build HEAD which means landed for lustre releases/branches
> 1.6.8, 1.8.1 and 2.0 (HEAD).
>
> --- Comment #38 From Brian Murrell 2009-04-22 10:57:58 ---
> Landed, closing.

Very nice - fix is coming already - thank you!

Michael
[Lustre-discuss] NAME field in lustre-client rpms
I was trying to install a patchless client of lustre-1.6.7.1 on RHEL-5.3 from a local yum repository and noticed an oddity with the rpm metadata: the NAME fields of the "lustre-client" rpms are the same as in the patched-kernel rpms. I looked back at all the versions I've used (1.6.x) and saw that this has always been the case, so apparently that's by design, e.g.:

lustre-1.6.7.1-2.6.18_92.1.17.el5_lustre.1.6.7.1smp.x86_64.rpm
lustre-modules-1.6.7.1-2.6.18_92.1.17.el5_lustre.1.6.7.1smp.x86_64.rpm

vs.

lustre-client-1.6.7.1-2.6.18_92.1.17.el5_lustre.1.6.7.1smp.x86_64.rpm
lustre-client-modules-1.6.7.1-2.6.18_92.1.17.el5_lustre.1.6.7.1smp.x86_64.rpm

# rpm -qip lustre-[1mc]*rpm | grep Name
Name: lustre            Relocations: (not relocatable)
Name: lustre            Relocations: (not relocatable)
Name: lustre-modules    Relocations: (not relocatable)
Name: lustre-modules    Relocations: (not relocatable)

This means that the patched and patchless rpms collide in a yum repository, and only one will be available for installation (the last by collation, I believe).

This is the first time I've tried to use the patchless client. However, I've put it on the back burner (since there is no support yet for the RHEL-5.3 kernel) and pulled it out of my yum repo. Still, how could one serve both the patched and patchless lustre rpms out of the same yum repository?

Regards, Michael
Re: [Lustre-discuss] Upgrade-Procedure
On Jan 6, 2009, at 11:21 , Thomas Roth wrote:
> I've just upgraded a 1 MDT - 2 OST - 2 Clients test cluster from Lustre version 1.6.5.1 to 1.6.6.
> However, I did not follow the manual
> (http://manual.lustre.org/manual/LustreManual16_HTML/UpgradingLustre.html#50548855_pgfId-1289726):
> I did not use the tunefs.lustre command on MGS/MDT or OSTs.
> My data seem to have survived; the quota had to be switched back on with lfs quotacheck.
>
> My question: is this treatment with tunefs.lustre (in particular with the somewhat scary "writeconf" parameter) actually necessary at all?

I also did not perform this step. I assumed that since the manual at docs.sun.com wasn't updated, there were no specific steps required.

http://www.sun.com/software/products/lustre/get.jsp links (eventually) to http://docs.sun.com/app/docs/doc/820-3681, holding Lustre_1.6_man_v1.14 (September 2008). But lustre.org (linked by Thomas) holds the newer Lustre_1.6_man_v1.15 (November 2008). However, the example therein is still the previous one, as it refers to kernel 2.6.18-8.1.14.el5_lustre.1.6.4.2smp, i.e. lustre 1.6.4.

Hmm - could *not running* tunefs.lustre have been the cause of my hiccup with lov_objids that I mentioned last week? As I wrote then, my data was also still intact (at least it was before I opened write access - ahem.)

Regards, Michael
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
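For reference, the manual's step under discussion regenerates the Lustre configuration logs on each target. A sketch of the procedure (device names are placeholders; these commands only apply on actual Lustre servers, so treat this as illustrative rather than verified here):

```
# With clients unmounted and all servers stopped:
tunefs.lustre --writeconf /dev/mdtdev    # on the combined MGS/MDT
tunefs.lustre --writeconf /dev/ostdev    # on each OST
# Restart order: MGS/MDT first, then OSTs, then clients.
```

The "scary" part is that --writeconf erases the old configuration logs before regenerating them, so any permanent settings made with lctl conf_param must be reapplied afterwards.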
[Lustre-discuss] Solved: incorrect num_osts; LBUG
I wrote:

On Jan 2, 2009, at 19:24 , Michael Sternberg wrote:
> While performing an LFSCK after the upgrade to lustre-1.6.6, the MDS claims to have 512 OSTs [while there is only 1]:
>
> MDS: num_osts = 512

Well, I found an interesting section "How to fix bad LAST_ID on an OST" in the operations manual, Appendix D, which pointed more or less in the right direction.

Briefly, /lov_objid within the MDS ldiskfs is constructed from /O/0/LAST_ID in all the OSTs' ldiskfs. In my case, with only one OST, the first and only entry agrees between these two files, and the rest of lov_objid was padded with NULs to 4KB (block size?). I do not understand what caused this, but, given prior e2fsck output indicating length 8, I decided to simply chop off the NULs, which indeed made e2fsck happy. lfsck on a client found a handful of empty orphans, but a backup on the client side still matches the contents of the rest of the fs (phew!).

Regards, Michael
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
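The "chop off the NULs" fix can be sketched on a synthetic file (the real file is /lov_objid inside the ldiskfs-mounted MDT; the file name and contents below are stand-ins). Each OST contributes one 8-byte entry, so with a single OST the file should end up 8 bytes long:

```shell
set -e
# Build a stand-in lov_objid in the current directory:
printf 'OBJID001' > lov_objid   # stand-in for the single 8-byte objid entry
truncate -s 4096 lov_objid      # NUL-padded to a 4 KB block, as observed on the MDS
# Keep only num_osts * 8 bytes (num_osts = 1 here), discarding the NUL padding:
truncate -s $((1 * 8)) lov_objid
stat -c '%s' lov_objid          # prints 8
```

On the real MDT one would of course work with the device unmounted and take a backup of the file first.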
[Lustre-discuss] incorrect num_osts; LBUG
While performing an LFSCK after the upgrade to lustre-1.6.6, the MDS claims to have 512 OSTs:

MDS: num_osts = 512

I am in the middle of an update, in a currently asymmetric heartbeat config, with a passive MDS and OSS on lustre-1.6.5.1, and the currently active MDS and OSS (where num_osts comes out wrong) on lustre-1.6.6. The rest of the e2fsck output looks perfectly normal and in line with previous runs.

I run two lustre file systems, /home and /sandbox, off the same MGS, and e2fsck for BOTH file systems reports 512 OSTs. Running e2fsck on the 1.6.5.1 servers gives the same blown-up OST list. Does this mean my MGS is trashed?

In context:
= = = =
# e2fsck -n -v --mdsdb /tmp/mdsdb-sandbox /dev/dm-2; date
e2fsck 1.40.11.sun1 (17-June-2008)
device /dev/dm-2 mounted by lustre per /proc/fs/lustre/mds/sandbox-MDT/mntdev
Warning! /dev/dm-2 is mounted.
Warning: skipping journal recovery because doing a read-only filesystem check.
sandbox-MDT has been mounted 47 times without being checked, check forced.
Pass 1: Checking inodes, blocks, and sizes
MDS: ost_idx 0 max_id 1192499
MDS: ost_idx 1 max_id 0
MDS: ost_idx 2 max_id 0
MDS: ost_idx 3 max_id 0
MDS: ost_idx 4 max_id 0
...
MDS: ost_idx 509 max_id 0
MDS: ost_idx 510 max_id 0
MDS: ost_idx 511 max_id 0
MDS: got 4096 bytes = 512 entries in lov_objids
MDS: max_files = 36420
MDS: num_osts = 512
mds info db file written
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Pass 6: Acquiring information for lfsck
MDS: max_files = 36420
MDS: num_osts = 512
MDS: 'sandbox-MDT_UUID' mdt idx 0: compat 0x4 rocomp 0x1 incomp 0x4
36420 inodes used (0.05%)
3 non-contiguous inodes (0.0%)
# of inodes with ind/dind/tind blocks: 1/0/0
9283672 blocks used (12.67%)
0 bad blocks
1 large file
32592 regular files
3818 directories
0 character device files
0 block device files
0 fifos
1 link
1 symbolic link (1 fast symbolic link)
0 sockets
36412 files
= = = =

I did an LFSCK just prior to the upgrade, which showed num_osts = 1. This is on RHEL-5.2 with the latest e2fsprogs:

# e2fsck -V
e2fsck 1.40.11.sun1 (17-June-2008)
Using EXT2FS Library version 1.40.11.sun1, 17-June-2008

I run two MDS and OSS each in heartbeat; I encountered an LBUG during a heartbeat hiccup (output appended).
Kernels for the updated and active MDS and OSS, and some clients, are 2.6.18-92.1.10.el5_lustre.1.6.6smp; the passive MDS and OSS, and remaining clients, are 2.6.18-53.1.14.el5_lustre.1.6.5.1smp.

Regards, Michael

Jan 2 15:41:20 mds02 kernel: LustreError: 14500:0:(mgs_handler.c:194:mgs_setup()) ASSERTION(!lvfs_check_rdonly(lvfs_sbdev(mnt->mnt_sb))) failed
Jan 2 15:41:20 mds02 kernel: LustreError: 14500:0:(tracefile.c:450:libcfs_assertion_failed()) LBUG
Jan 2 15:41:20 mds02 kernel: Lustre: 14500:0:(linux-debug.c:185:libcfs_debug_dumpstack()) showing stack for process 14500
Jan 2 15:41:20 mds02 kernel: mount.lustre R running task 0 14500 14499 (NOTLB)
Jan 2 15:41:20 mds02 kernel: 0040 0020 8102105a5568 88966410
Jan 2 15:41:20 mds02 kernel: 888d3b80 810200b242c0
Jan 2 15:41:20 mds02 kernel: 8102105a5598 80143a09
Jan 2 15:41:20 mds02 kernel: Call Trace:
Jan 2 15:41:20 mds02 kernel: [] dump_trace+0x211/0x23a
Jan 2 15:41:20 mds02 kernel: [] :ptlrpc:lprocfs_rd_pool_state+0x0/0x200
Jan 2 15:41:20 mds02 kernel: [] :obdclass:lprocfs_wr_atomic+0x0/0x60
Jan 2 15:41:20 mds02 kernel: [] show_trace+0x34/0x47
Jan 2 15:41:20 mds02 kernel: [] _show_stack+0xdb/0xea
Jan 2 15:41:20 mds02 kernel: [] :libcfs:lbug_with_loc+0x7a/0xc0
Jan 2 15:41:20 mds02 kernel: [] :libcfs:libcfs_assertion_failed+0x54/0x60
Jan 2 15:41:20 mds02 kernel: [] :mgs:mgs_setup+0x301/0x800
Jan 2 15:41:20 mds02 kernel: [] :obdclass:class_setup+0x942/0xc70
Jan 2 15:41:20 mds02 kernel: [] :obdclass:class_process_config+0x14bd/0x19e0
Jan 2 15:41:20 mds02 kernel: [] :obdclass:do_lcfg+0x924/0xb20
Jan 2 15:41:20 mds02 kernel: [] :obdclass:lustre_start_simple+0x130/0x1d0
Jan 2 15:41:20 mds02 kernel: [] :obdclass:server_start_mgs+0x223/0x320
Jan 2 15:41:20 mds02 OpenSM[14448]: Entering MASTER state
Jan 2 15:41:20 mds02 kernel: [] :obdclass:server_fill_super+0x18a2/0x1fb0
Jan 2 15:41:21 mds02 kernel: [] :sunrpc:rpc_call_sync+0x9e/0xa8
Jan 2 15:41:21 mds02 kernel: [] recalc_sigpending+0xe/0x25
Jan 2 15:41:21 mds02 attrd: [14276]: info: main: Starting mainloop... Jan 2 15:
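The --mdsdb run shown above is the first stage of the distributed lfsck workflow. For context, the full sequence per the 1.6 manual looks roughly like this (device names and database paths are placeholders; the commands require actual Lustre servers and a client mount, so this is only a sketch):

```
# On the MDS: dump MDS state into a database (read-only check)
e2fsck -n -v --mdsdb /tmp/mdsdb /dev/mdsdev
# On each OSS: check the OST against the MDS database
e2fsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb0 /dev/ostdev
# On a client: correlate the databases with the mounted filesystem
lfsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb0 /mnt/lustre
```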
Re: [Lustre-discuss] Interoperability between lustre 1.4.11 server and lustre 1.6 client
Thanks for the confirmation.

I have a similar issue, but with different versions *between servers*: is it acceptable to run the MDS/MGS with lustre-1.6.5.1 but the OSSs with lustre-1.6.4.3? An MDS update to 1.6.5.1 and RHEL-5.2 went fine, but the same on the OSS fails. So, I am considering a rollback of the OSS to 1.6.4.3 and RHEL-5.1, which is working. The clients are also a mix of the two versions mentioned.

Best regards, Michael

On Apr 28, 2008, at 2:53 , Peter Kjellstrom wrote:
> On Wednesday 23 April 2008, ashok bharat bayana wrote:
>> hi!
>> Is there an interoperability between lustre 1.4.11 server and lustre 1.6 patchless client? i mean can we make lustre 1.4.11 server talk to lustre 1.6 patchless client? I want help in this regard..
>
> FWIW, we've been running 1.6 clients with 1.4 servers without problems for a while. Note though that we don't run patchless clients. Here are the exact versions we currently use:
> client : lustre-1.6.4.3-2.6.9_67.0.4.EL_lustre.1.6.4.3smp
> servers: lustre-1.4.11.1-2.6.9_55.0.9.EL_lustre.1.4.11.1smp
>
> /Peter
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] OSS: bad header in inode - invalid magic
On Jul 2, 2008, at 9:26 , Brian J. Murrell wrote:
> On Wed, 2008-07-02 at 09:01 -0500, Michael Sternberg wrote:
>> Ah - I tried, but ran into a symbol error:
>>
>> [EMAIL PROTECTED] ~]# wget http://downloads.lustre.org/public/tools/e2fsprogs/latest/e2fsprogs-1.40.7.sun3-0redhat.rhel5.x86_64.rpm
>> [EMAIL PROTECTED] ~]# rpm -Fvh ./e2fsprogs-1.40.7.sun3-0redhat.rhel5.x86_64.rpm
>> [EMAIL PROTECTED] ~]# e2fsck -fp /dev/dm-3
>> e2fsck: symbol lookup error: e2fsck: undefined symbol: ext2fs_mmp_update
>
> Ahhh. This isn't good. Can you file a bug in our bugzilla about that?

Done: https://bugzilla.lustre.org/show_bug.cgi?id=16265

Update to CentOS-5.2 in progress ...

Michael
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] OSS: bad header in inode - invalid magic
Hi,

On Jul 2, 2008, at 8:35 , Brian J. Murrell wrote:
> On Tue, 2008-07-01 at 19:52 -0500, Michael Sternberg wrote:
>> Jul 1 15:43:58 oss01 kernel: LDISKFS-fs error (device dm-3): ldiskfs_ext_find_extent: bad header in inode #405012501: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
>
> I would suggest making sure you have the *latest* e2fsprogs from Sun and running an e2fsck on that volume. You may have to use flags to force a thorough check. Check the manpage.

Ah - I tried, but ran into a symbol error:

[EMAIL PROTECTED] ~]# wget http://downloads.lustre.org/public/tools/e2fsprogs/latest/e2fsprogs-1.40.7.sun3-0redhat.rhel5.x86_64.rpm
[EMAIL PROTECTED] ~]# rpm -Fvh ./e2fsprogs-1.40.7.sun3-0redhat.rhel5.x86_64.rpm
[EMAIL PROTECTED] ~]# e2fsck -fp /dev/dm-3
e2fsck: symbol lookup error: e2fsck: undefined symbol: ext2fs_mmp_update

This is on RHEL5.1.

>> Message from syslogd@ at Tue Jul 1 16:00:02 2008 ...
>> login1 kernel: LustreError: 5612:0:(ptlrpcd.c:72:ptlrpcd_wake()) ASSERTION(pc != NULL) failed
>> Message from syslogd@ at Tue Jul 1 16:00:02 2008 ...
>> login1 kernel: LustreError: 5612:0:(tracefile.c:431:libcfs_assertion_failed()) LBUG
>
> This looks like bug 13888, fixed in 1.6.5.

OK, will go forward with 1.6.5.1, hoping the OFED-1.3 etc. issues are solved.

Thank you very much, Michael
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
[Lustre-discuss] OSS: bad header in inode - invalid magic
Hi,

I repeatedly encounter "invalid magic" in one particular inode of one of my OSS volumes (1 of 4, each 5 TB), with the consequence of lustre remounting R/O. I run 2.6.18-53.1.13.el5_lustre.1.6.4.3smp on RHEL5.1 on a cluster with approx. 150 client nodes.

The error appears on the OSS as:

Jul 1 15:43:58 oss01 kernel: LDISKFS-fs error (device dm-3): ldiskfs_ext_find_extent: bad header in inode #405012501: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
Jul 1 15:43:58 oss01 kernel: Remounting filesystem read-only
Jul 1 15:43:58 oss01 kernel: LDISKFS-fs error (device dm-3): ldiskfs_ext_find_extent: bad header in inode #405012501: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
Jul 1 15:43:58 oss01 kernel: LustreError: 25462:0:(fsfilt-ldiskfs.c:417:fsfilt_ldiskfs_brw_start()) can't get handle for 45 credits: rc = -30
Jul 1 15:43:58 oss01 kernel: LustreError: 25462:0:(fsfilt-ldiskfs.c:417:fsfilt_ldiskfs_brw_start()) Skipped 6 previous similar messages
Jul 1 15:43:58 oss01 kernel: LustreError: 25462:0:(filter_io_26.c:705:filter_commitrw_write()) error starting transaction: rc = -30
Jul 1 15:43:58 oss01 kernel: LustreError: 19569:0:(filter_io_26.c:705:filter_commitrw_write()) error starting transaction: rc = -30
[... many repeats]

Three login nodes signaled, about 10 .. 15 minutes apart, the same wall(8) message:

Message from syslogd@ at Tue Jul 1 16:00:02 2008 ...
login1 kernel: LustreError: 5612:0:(ptlrpcd.c:72:ptlrpcd_wake()) ASSERTION(pc != NULL) failed
Message from syslogd@ at Tue Jul 1 16:00:02 2008 ...
login1 kernel: LustreError: 5612:0:(tracefile.c:431:libcfs_assertion_failed()) LBUG

Twice in the past, I followed this recovery procedure from the Manual and the Wiki:

http://wiki.lustre.org/index.php?title=Fsck_Support#Using_e2fsck_on_a_backing_filesystem%7Cusing
  Using e2fsck on a backing filesystem -- nice walk-through
http://manual.lustre.org/manual/LustreManual16_HTML/Failover.html#50446391_pgfId-1287654
  8.4.1 Starting/Stopping a Resource [i.e., simply umounting the device on the OSS - is this correct?]
http://manual.lustre.org/manual/LustreManual16_HTML/LustreInstallation.html#50446385_43530
  4.2.1.5 Stopping a Server

In other words:
- umount the OSS
- perform fsck on the block device
- remount the OSS

So, last time I did:

[EMAIL PROTECTED] ~]# umount /mnt/ost2
[EMAIL PROTECTED] ~]# e2fsck -fp /dev/dm-3
lustre-OST0002: recovering journal
lustre-OST0002: ext3 recovery flag is clear, but journal has data.
lustre-OST0002: Run journal anyway
lustre-OST0002: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY. (i.e., without -a or -p options)
[EMAIL PROTECTED] ~]# mount -t ldiskfs /dev/dm-3 /mnt/ost2
[EMAIL PROTECTED] ~]# umount /mnt/ost2
[EMAIL PROTECTED] ~]# e2fsck -fp /dev/dm-3
lustre-OST0002: 342355/427253760 files (4.2% non-contiguous), 139324997/1708984375 blocks

To my surprise, there were no errors.

I did the same today after the error above, but left out the "-p" flag; still, fsck did not find an error (except the journal replay??):

[EMAIL PROTECTED] ~]# e2fsck -f /dev/dm-3
e2fsck 1.40.4.cfs1 (31-Dec-2007)
lustre-OST0002: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
lustre-OST0002: * FILE SYSTEM WAS MODIFIED *
lustre-OST0002: 343702/427253760 files (4.4% non-contiguous), 137003893/1708984375 blocks
[EMAIL PROTECTED] ~]#

I haven't mounted back yet for fear this would stall the system again in a couple of days.

How can I locate the "bad" inode - should I try? Is this an inode of the lustre FS or the underlying ext3 on the OST? Are there version dependencies of e2fsck with lustre? I am running lustre-1.6.4.3 and e2fsck-1.40.4.

I would appreciate any pointers. Thank you for your attention and help.

Michael
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] o2ib module prevents shutdown
Hello Wojciech,

Sorry for the delayed response; lustre_rmmod worked in a manual test to remove modules after ib0 was down. I have yet to try this as part of the init.d shutdown scripts; an alternate solution with a script didn't quite work.

Thanks for the hint!

Regards, Michael

On Apr 15, 2008, at 12:33 , Wojciech Turek wrote:
> Hi,
>
> This usually happens when you try to remove IB card drivers before stopping the lustre network. What I do is, after a clean umount, I run the lustre_rmmod script, which removes all lustre modules and stops the lustre network. Then you can safely remove the IB card driver and nothing should get stuck.
>
> Cheers,
>
> Wojciech
>
> On 15 Apr 2008, at 18:22, Michael Sternberg wrote:
>> On Apr 15, 2008, at 12:15, Brian J. Murrell wrote:
>>> On Tue, 2008-04-15 at 12:07 -0500, Michael Sternberg wrote:
>>>> Hello,
>>>>
>>>> Not sure if this is the right forum: I'm encountering difficulties with o2ib which prevents an LNET shutdown from proceeding:
>>>>
>>>> Unloading OpenIB kernel modules:NET: Unregistered protocol family 27
>>>> Failed to unload rdma_cm
>>>> Failed to unload rdma_cm
>>>> Failed to unload ib_cm
>>>> Failed to unload ib_sa
>>>> LustreError: 131-3: Received notification of device removal
>>>> Please shutdown LNET to allow this to proceed
>>>>
>>>> This happens on server and client nodes alike. We run RHEL5.1 and OFED 1.2, kernel 2.6.18-53.1.13.el5_lustre.1.6.4.3smp from CFS/Sun.
>>>>
>>>> I narrowed it down to module ko2iblnd, which I attempt to remove first (added to PRE_UNLOAD_MODULES in /etc/init.d/openibd), but it doesn't work. Strangely, in "lsmod" the use count of the module is one, but I don't see where it's used.
>>>
>>> To ask what might sound like a stupid question, but you do have all of your lustre filesystems unmounted before you try to unload ko2iblnd, yes? Can you show us what's in /proc/mounts when you try to unload ko2iblnd but it shows a refcount > 0?
>>
>> No problem with the question - anything that helps:
>>
>> # cat /proc/mounts
>> rootfs / rootfs rw 0 0
>> /dev/root / ext3 rw,data=ordered 0 0
>> /dev /dev tmpfs rw 0 0
>> /proc /proc proc rw 0 0
>> /sys /sys sysfs rw 0 0
>> /proc/bus/usb /proc/bus/usb usbfs rw 0 0
>> devpts /dev/pts devpts rw 0 0
>> /dev/sda1 /boot ext3 rw,data=ordered 0 0
>> tmpfs /dev/shm tmpfs rw 0 0
>> none /proc/sys/fs/binfmt_misc binfmt_misc rw 0 0
>> sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
>> 172.16.100.3:/drbd/exports/opt /opt nfs rw,vers=3,rsize=8192,wsize=8192,hard,intr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=172.16.100.3 0 0
>> /etc/auto.misc /misc autofs rw,fd=6,pgrp=3689,timeout=300,minproto=5,maxproto=5,indirect 0 0
>> -hosts /net autofs rw,fd=11,pgrp=3689,timeout=300,minproto=5,maxproto=5,indirect 0 0
>>
>> This was even after:
>>
>> # ifconfig ib0 down
>>
>> I also have:
>>
>> # grep lnet /etc/modprobe.conf
>> options lnet networks="o2ib0,tcp0(eth0)" accept_port=6988
>>
>> (the accept_port spec doesn't work either on a tcp-only node, but that's a separate issue, or so I believe.)
>>
>> Regards, Michael
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
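The ordering Wojciech describes could be wired into a shutdown script along these lines (a hypothetical sketch, not from the thread; it assumes the standard lustre_rmmod helper shipped with Lustre and would only run on an actual node):

```
#!/bin/sh
# Hypothetical fragment run before "openibd stop": unmount Lustre,
# then unload the Lustre/LNET modules so the IB drivers can be removed.
umount -a -t lustre || exit 1   # fail early if a client mount is still busy
lustre_rmmod                    # unloads lustre, ptlrpc, ko2iblnd, lnet, ...
```

With Lustre's LNET modules gone, ko2iblnd no longer pins the IB stack, so the subsequent OFED unloads should proceed cleanly.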
Re: [Lustre-discuss] Download Problems with lustre 1.6.4.3
I had success the other day at Sun's official download site - looks like the issues it had are fixed.

Regards, Michael

On Apr 16, 2008, at 14:28 , Frank Leers wrote:
> ...OK, so should there be anything here? The directory seems empty, or possibly the files have incorrect permissions set, or we changed our minds :-/
>
> http://downloads.lustre.org/public/lustre/v1.6/
>
> -frank
>
> On Thu, 2008-03-20 at 05:46 -0400, Oleg Drokin wrote:
>> Hello!
>>
>> On Mar 20, 2008, at 4:55 AM, Patrick Winnertz wrote:
>>> I've tried just some minutes ago to download the sourcecode of lustre from [1], but it failed with this error:
>>
>> Please try http://downloads.lustre.org
>> This is the new downloads site that went into commission just yesterday, so no wide announcements were made yet, it seems.
>>
>> Bye, Oleg
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] o2ib module prevents shutdown
On Apr 15, 2008, at 12:15, Brian J. Murrell wrote:
> On Tue, 2008-04-15 at 12:07 -0500, Michael Sternberg wrote:
>> Hello,
>>
>> Not sure if this is the right forum: I'm encountering difficulties with o2ib which prevents an LNET shutdown from proceeding:
>>
>> Unloading OpenIB kernel modules:NET: Unregistered protocol family 27
>> Failed to unload rdma_cm
>> Failed to unload rdma_cm
>> Failed to unload ib_cm
>> Failed to unload ib_sa
>> LustreError: 131-3: Received notification of device removal
>> Please shutdown LNET to allow this to proceed
>>
>> This happens on server and client nodes alike. We run RHEL5.1 and OFED 1.2, kernel 2.6.18-53.1.13.el5_lustre.1.6.4.3smp from CFS/Sun.
>>
>> I narrowed it down to module ko2iblnd, which I attempt to remove first (added to PRE_UNLOAD_MODULES in /etc/init.d/openibd), but it doesn't work. Strangely, in "lsmod" the use count of the module is one, but I don't see where it's used.
>
> To ask what might sound like a stupid question, but you do have all of your lustre filesystems unmounted before you try to unload ko2iblnd, yes? Can you show us what's in /proc/mounts when you try to unload ko2iblnd but it shows a refcount > 0?

No problem with the question - anything that helps:

# cat /proc/mounts
rootfs / rootfs rw 0 0
/dev/root / ext3 rw,data=ordered 0 0
/dev /dev tmpfs rw 0 0
/proc /proc proc rw 0 0
/sys /sys sysfs rw 0 0
/proc/bus/usb /proc/bus/usb usbfs rw 0 0
devpts /dev/pts devpts rw 0 0
/dev/sda1 /boot ext3 rw,data=ordered 0 0
tmpfs /dev/shm tmpfs rw 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
172.16.100.3:/drbd/exports/opt /opt nfs rw,vers=3,rsize=8192,wsize=8192,hard,intr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=172.16.100.3 0 0
/etc/auto.misc /misc autofs rw,fd=6,pgrp=3689,timeout=300,minproto=5,maxproto=5,indirect 0 0
-hosts /net autofs rw,fd=11,pgrp=3689,timeout=300,minproto=5,maxproto=5,indirect 0 0

This was even after:

# ifconfig ib0 down

I also have:

# grep lnet /etc/modprobe.conf
options lnet networks="o2ib0,tcp0(eth0)" accept_port=6988

(the accept_port spec doesn't work either on a tcp-only node, but that's a separate issue, or so I believe.)

Regards, Michael
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
[Lustre-discuss] o2ib module prevents shutdown
Hello,

Not sure if this is the right forum: I'm encountering difficulties with o2ib which prevents an LNET shutdown from proceeding:

Unloading OpenIB kernel modules:NET: Unregistered protocol family 27
Failed to unload rdma_cm
Failed to unload rdma_cm
Failed to unload ib_cm
Failed to unload ib_sa
LustreError: 131-3: Received notification of device removal
Please shutdown LNET to allow this to proceed

This happens on server and client nodes alike. We run RHEL5.1 and OFED 1.2, kernel 2.6.18-53.1.13.el5_lustre.1.6.4.3smp from CFS/Sun.

I narrowed it down to module ko2iblnd, which I attempt to remove first (added to PRE_UNLOAD_MODULES in /etc/init.d/openibd), but it doesn't work. Strangely, in "lsmod" the use count of the module is one, but I don't see where it's used.

# umount /mnt/lustre
# ifconfig ib0 down
# modprobe -r ko2iblnd
FATAL: Module ko2iblnd is in use.
# lsmod | grep ko2
ko2iblnd 143136 1
lnet 258088 5 lustre,ksocklnd,ko2iblnd,ptlrpc,obdclass
libcfs 189784 12 osc,mgc,lustre,lov,lquota,mdc,ksocklnd,ko2iblnd,ptlrpc,obdclass,lnet,lvfs
rdma_cm 65940 4 ko2iblnd,ib_iser,rdma_ucm,ib_sdp
ib_core 88576 16 ko2iblnd,ib_iser,rdma_ucm,ib_ucm,ib_srp,ib_sdp,rdma_cm,ib_cm,iw_cm,ib_local_sa,ib_ipoib,ib_sa,ib_uverbs,ib_umad,ib_mthca,ib_mad

I'd be grateful for any hints.

Regards, Michael
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Download Problems with lustre 1.6.4.3
> On Thu, Mar 20, 2008 at 2:46 AM, Oleg Drokin sun.com> wrote:
>> On Mar 20, 2008, at 4:55 AM, Patrick Winnertz wrote:
>>> I've tried just some minutes ago to download the sourcecode of lustre from [1], but it failed with this error:
>>
>> Please try http://downloads.lustre.org
>> This is the new downloads site that went into commission just yesterday, so no wide announcements were made yet, it seems.
>
> I've been attempting to convince sun that their download website sucks and they need to change it. I could paste the entire email thread, but that would be cruel, to you guys.

I had the *very same* experience, receiving from SDLC talk about firewall settings etc., culminating in an off-topic and canned referral to the Java development community -- very frustrating.

I found the RPMs I was seeking at the download site Oleg mentioned. Feels like arriving at an oasis of sanity. -- Thank you very much!

Regards, Michael
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss