Re: [Lustre-discuss] exceedingly slow lstats
On 1/21/12 6:35 AM, John White wrote:

Well, I was reading the strace wrong anyway:

  lstat(../403/a323, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 <0.134326>
  getxattr(../403/a323, system.posix_acl_access, 0x0, 0) = -1 EOPNOTSUPP (Operation not supported) <0.18>
  lstat(../403/a330, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 <0.158898>
  getxattr(../403/a330, system.posix_acl_access, 0x0, 0) = -1 EOPNOTSUPP (Operation not supported) <0.19>
  lstat(../403/a331, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 <0.239466>
  getxattr(../403/a331, system.posix_acl_access, 0x0, 0) = -1 EOPNOTSUPP (Operation not supported) <0.12>
  lstat(../403/a332, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 <0.130146>
  getxattr(../403/a332, system.posix_acl_access, 0x0, 0) = -1 EOPNOTSUPP (Operation not supported) <0.12>

The getxattr takes an incredibly short amount of time; it's the lstat itself that's taking 0.1+ seconds.

Which Lustre version are they using? For further investigation, you can collect a detailed Lustre log on some client while rerunning the test you did before, to check which operation(s) slow down the process.

Cheers, Nasf

John White
HPC Systems Engineer
(510) 486-7307
One Cyclotron Rd, MS: 50C-3396
Lawrence Berkeley National Lab
Berkeley, CA 94720

On Jan 20, 2012, at 2:28 PM, Mark Hahn wrote:

I'm sorry, I'm not quite understanding what you're asking. I don't have ACLs specifically enabled anywhere (and would expect the default is disabled).

I guess what I was suggesting is that you could try a simple experiment: mount a client requesting the acl mount option. (I don't know whether the mount -o remount,acl ... trick will work with Lustre.) If the problem goes away, you're done.

You also mentioned OSS load - I don't see how that could be related, since OSSs are not involved in metadata operations like lstat or getxattr. (Though depending on the Lustre version, they could be involved in fetching the actual size of files, which is especially salient on striped files...)
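strace -T prints the time spent inside each syscall in angle brackets, as in the trace above. A throwaway sketch for picking the slow calls out of such output; the sample lines are adapted from the trace, and the getxattr timing here is invented purely for illustration:

```python
import re

# Two sample `strace -T` lines; the getxattr duration is made up
# for illustration, only the lstat duration comes from the thread.
SAMPLE = """\
lstat(../403/a323, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 <0.134326>
getxattr(../403/a323, system.posix_acl_access, 0x0, 0) = -1 EOPNOTSUPP (Operation not supported) <0.000018>
"""

def slow_calls(trace, threshold=0.1):
    """Return (syscall, seconds) pairs whose in-kernel time exceeds threshold."""
    slow = []
    for line in trace.splitlines():
        # syscall name at start, `<seconds>` (from strace -T) at end of line
        m = re.match(r'(\w+)\(.*<([\d.]+)>$', line)
        if m and float(m.group(2)) > threshold:
            slow.append((m.group(1), float(m.group(2))))
    return slow

print(slow_calls(SAMPLE))
```

Run against a full trace, this immediately shows whether the time is going into lstat, getxattr, or something else.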
John White
HPC Systems Engineer
(510) 486-7307
One Cyclotron Rd, MS: 50C-3396
Lawrence Berkeley National Lab
Berkeley, CA 94720

On Jan 20, 2012, at 12:49 PM, Mark Hahn wrote:

  0.916908 getxattr(/global/scratch/jwhite/backuptest/highcount/3/a5, system.posix_acl_access, 0x0, 0) = -1 EOPNOTSUPP (Operation not supported)

are your clients mounting with the acl option, and is acl perhaps missing on the mds mount?

--
operator may differ from spokesperson. h...@mcmaster.ca

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
[Lustre-discuss] Follow-up: China Lustre Workshop 2011
Follow-up materials:
http://lustrechina.csp.escience.cn/dct/page/1
http://www.youku.com/playlist_show/id_16689032.html
http://lustrechina.csp.escience.cn/dct/page/65600

Original Message
Subject: China Lustre Workshop 2011
Date: Wed, 07 Sep 2011 11:08:34 +0800
From: Fan Yong yong@whamcloud.com
To: lustre-discuss@lists.lustre.org, lustre-commun...@lists.lustre.org

This is the invitation for China Lustre Workshop 2011. For English users, please click here: http://lustrechina.csp.escience.cn/dct/page/65595

Workshop home page: http://lustrechina.csp.escience.cn/

On the June 2011 Top 500 list, 8 of the Top 10 systems and about 70% of the Top 100 use Lustre. The workshop is organized by IHEP (http://www.ihep.ac.cn/) and Whamcloud (http://www.whamcloud.com/); topics include the Lustre roadmap.

Details: http://lustrechina.csp.escience.cn/dct/page/65554
Registration: http://lustrechina.csp.escience.cn/dct/page/65558
Survey: http://lustrechina.ihep.ac.cn:8080/enableq/q.php?qname=lustresurvey

Workshop date: November 26, 2011
Contacts: 138-1075-8632, lu.w...@ihep.ac.cn; 138-1064-7466, yong@whamcloud.com

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
[Lustre-discuss] Reminder for China Lustre Workshop 2011
Workshop date: November 26, 2011
http://lustrechina.csp.escience.cn/dct/page/65597
http://lustrechina.csp.escience.cn/dct/page/65592

This is the invitation for China Lustre Workshop 2011. For English users, please click here: http://lustrechina.csp.escience.cn/dct/page/65595

Workshop home page: http://lustrechina.csp.escience.cn/
Organized by IHEP (http://www.ihep.ac.cn/) and Whamcloud (http://www.whamcloud.com/); topics include the Lustre roadmap.

Details: http://lustrechina.csp.escience.cn/dct/page/65554
Registration: http://lustrechina.csp.escience.cn/dct/page/65558
Survey: http://lustrechina.ihep.ac.cn:8080/enableq/q.php?qname=lustresurvey

Contacts: 138-1075-8632, lu.w...@ihep.ac.cn; 138-1064-7466, yong@whamcloud.com

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
[Lustre-discuss] Reminder for China Lustre Workshop 2011
Key dates: October 15, 2011; November 10, 2011; November 20, 2011.

This is the invitation for China Lustre Workshop 2011. For English users, please click here: http://lustrechina.csp.escience.cn/dct/page/65595

Workshop home page: http://lustrechina.csp.escience.cn/
Organized by IHEP (http://www.ihep.ac.cn/) and Whamcloud (http://www.whamcloud.com/); topics include the Lustre roadmap.

Details: http://lustrechina.csp.escience.cn/dct/page/65554
Registration: http://lustrechina.csp.escience.cn/dct/page/65558
Survey: http://lustrechina.ihep.ac.cn:8080/enableq/q.php?qname=lustresurvey

Workshop date: November 26, 2011
Contacts: 138-1075-8632, lu.w...@ihep.ac.cn; 138-1064-7466, yong@whamcloud.com

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
[Lustre-discuss] Reminder for China Lustre Workshop 2011
This is the reminder for China Lustre Workshop 2011.

Key dates: September 30, 2011; November 10, 2011; November 20, 2011.

This is the invitation for China Lustre Workshop 2011. For English users, please click here: http://lustrechina.csp.escience.cn/dct/page/65595

Workshop home page: http://lustrechina.csp.escience.cn/
Organized by IHEP (http://www.ihep.ac.cn/) and Whamcloud (http://www.whamcloud.com/); topics include the Lustre roadmap.

Details: http://lustrechina.csp.escience.cn/dct/page/65554
Registration: http://lustrechina.csp.escience.cn/dct/page/65558
Survey: http://lustrechina.ihep.ac.cn:8080/enableq/q.php?qname=lustresurvey

Workshop date: November 26, 2011
Contacts: 138-1075-8632, lu.w...@ihep.ac.cn; 138-1064-7466, yong@whamcloud.com

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
[Lustre-discuss] China Lustre Workshop 2011
This is the invitation for China Lustre Workshop 2011. For English users, please click here: http://lustrechina.csp.escience.cn/dct/page/65595

Workshop home page: http://lustrechina.csp.escience.cn/

On the June 2011 Top 500 list, 8 of the Top 10 systems and about 70% of the Top 100 use Lustre. The workshop is organized by IHEP (http://www.ihep.ac.cn/) and Whamcloud (http://www.whamcloud.com/); topics include the Lustre roadmap.

Details: http://lustrechina.csp.escience.cn/dct/page/65554
Registration: http://lustrechina.csp.escience.cn/dct/page/65558
Survey: http://lustrechina.ihep.ac.cn:8080/enableq/q.php?qname=lustresurvey

Workshop date: November 26, 2011
Contacts: 138-1075-8632, lu.w...@ihep.ac.cn; 138-1064-7466, yong@whamcloud.com

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] RHEL54 / Lustre 2.0.0.1 mds_getxattr -95 errors
I think you met the issues I mentioned in http://jira.whamcloud.com/browse/ORNL-3. I have made a patch for it, which is now in testing.

-- Nasf

On 1/18/11 10:39 PM, CHU, STEPHEN H (ATTSI) wrote:

Hi all, I recently loaded the following on my testbed to try out Lustre 2.0:

* One MDS -- RHEL 5.4, Lustre 2.0
  mkfs.lustre --fsname=lufs --reformat --mgs --mdt --param lov.stripesize=25M --param lov.stripecount=1 /dev/sda2
  /etc/fstab: /dev/sda2 /lustre1-mgs-mds lustre rw,noauto,_netdev 0 0
* One OSS -- RHEL 5.4, Lustre 2.0
  mkfs.lustre --fsname=lufs --reformat --ost --mgsnode=10.103.34.42@o2ib0 /dev/sdb
  mkfs.lustre --fsname=lufs --reformat --ost --mgsnode=10.103.34.42@o2ib0 /dev/sdc
  /etc/fstab: /dev/sdb /lustre1/ost1 lustre rw,noauto,_netdev 0 0
              /dev/sdc /lustre1/ost2 lustre rw,noauto,_netdev 0 0
* One Client -- RHEL 5.4, Lustre 2.0
  /etc/fstab: 10.103.34.42@o2ib0:/lufs /lustre1_fifo lustre rw,noauto,_netdev 0 0

/lustre1-mgs-mds and /lustre1/ost1,2 mounted OK on the MDS and OSS. /lustre1_fifo mounted OK on the client. So far so good. On the client, I cd'd to / and ran ls -l. The following messages immediately showed up:

* On the client:
  Jan 18 14:04:19 bg8mo33sn kernel: LustreError: 11-0: an error occurred while communicating with 10.103.34.42@o2ib. The mds_getxattr operation failed with -95
* On the MDS:
  Jan 18 14:04:19 bg8mo33lm kernel: LustreError: 20056:0:(ldlm_lib.c:2123:target_send_reply_msg()) @@@ processing error (-95) req@81031427e050 x1357944885151537/t0(0) o49-af4d23bf-2d42-6e4d-6afc-353425e513af@NET_0x50a672229_UUID:0/0 lens 448/328 e 0 to 0 dl 1295359465 ref 1 fl Interpret:// rc -95/-1

All IB links between the nodes are alive and seeing each other with no problem.

The client also NFS-exports /lustre1_fifo via:
  /etc/exports: /lustre1_fifo testhost1(rw,sync,no_root_squash)
After testhost1 mounted /lustre1_fifo and attempted to mkdir test under it, the same error messages from above showed up on the MDS and the client.
I understand -95 as "Operation not supported on transport endpoint", but what does it mean here in this context? The same MDS/OSS/client setup ran fine on RHEL 5.3 and Lustre 1.8.1.1 with no errors. I'd appreciate any help/insight. Thanks.

Steve

Stephen Chu
ATT Labs CSO
C5-3C03
200 Laurel Ave
Middletown, NJ
(732) 420-0588
stephen...@att.com

This e-mail and any files transmitted with it are ATT property, are confidential, and are intended solely for the use of the individual or entity to whom this email is addressed. If you are not one of the named recipient(s) or otherwise have reason to believe that you have received this message in error, please notify the sender and delete this message immediately from your computer. Any other use, retention, dissemination, forwarding, printing, or copying of this e-mail is strictly prohibited.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
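For reference, the -95 in these logs is a negated Linux errno; a quick check (Linux errno numbering assumed) confirms it maps to EOPNOTSUPP, the same error the client-side getxattr reports:

```python
import errno
import os

# -95 in the Lustre/kernel logs is a negated Linux errno value.
code = 95
assert errno.EOPNOTSUPP == code
print(os.strerror(code))  # the local libc's text for errno 95
```

So "mds_getxattr operation failed with -95" and the EOPNOTSUPP seen from getxattr on the client are the same condition reported in two places.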
Re: [Lustre-discuss] RHEL54 / Lustre 2.0.0.1 mds_getxattr -95 errors
I think you met the issues I mentioned in http://jira.whamcloud.com/browse/ORNL-3. I have made a patch for it, which is now in testing. Until the patch lands, you can mount the MDT with -o acl or -o noacl explicitly to avoid these confusing messages.

-- Nasf

On 1/18/11 10:39 PM, CHU, STEPHEN H (ATTSI) wrote:

Hi all, I recently loaded the following on my testbed to try out Lustre 2.0:

* One MDS -- RHEL 5.4, Lustre 2.0
  mkfs.lustre --fsname=lufs --reformat --mgs --mdt --param lov.stripesize=25M --param lov.stripecount=1 /dev/sda2
  /etc/fstab: /dev/sda2 /lustre1-mgs-mds lustre rw,noauto,_netdev 0 0
* One OSS -- RHEL 5.4, Lustre 2.0
  mkfs.lustre --fsname=lufs --reformat --ost --mgsnode=10.103.34.42@o2ib0 /dev/sdb
  mkfs.lustre --fsname=lufs --reformat --ost --mgsnode=10.103.34.42@o2ib0 /dev/sdc
  /etc/fstab: /dev/sdb /lustre1/ost1 lustre rw,noauto,_netdev 0 0
              /dev/sdc /lustre1/ost2 lustre rw,noauto,_netdev 0 0
* One Client -- RHEL 5.4, Lustre 2.0
  /etc/fstab: 10.103.34.42@o2ib0:/lufs /lustre1_fifo lustre rw,noauto,_netdev 0 0

/lustre1-mgs-mds and /lustre1/ost1,2 mounted OK on the MDS and OSS. /lustre1_fifo mounted OK on the client. So far so good. On the client, I cd'd to / and ran ls -l. The following messages immediately showed up:

* On the client:
  Jan 18 14:04:19 bg8mo33sn kernel: LustreError: 11-0: an error occurred while communicating with 10.103.34.42@o2ib. The mds_getxattr operation failed with -95
* On the MDS:
  Jan 18 14:04:19 bg8mo33lm kernel: LustreError: 20056:0:(ldlm_lib.c:2123:target_send_reply_msg()) @@@ processing error (-95) req@81031427e050 x1357944885151537/t0(0) o49-af4d23bf-2d42-6e4d-6afc-353425e513af@NET_0x50a672229_UUID:0/0 lens 448/328 e 0 to 0 dl 1295359465 ref 1 fl Interpret:// rc -95/-1

All IB links between the nodes are alive and seeing each other with no problem.
The client also NFS-exports /lustre1_fifo via:
  /etc/exports: /lustre1_fifo testhost1(rw,sync,no_root_squash)
After testhost1 mounted /lustre1_fifo and attempted to mkdir test under it, the same error messages from above showed up on the MDS and the client.

I understand -95 as "Operation not supported on transport endpoint", but what does it mean here in this context? The same MDS/OSS/client setup ran fine on RHEL 5.3 and Lustre 1.8.1.1 with no errors. I'd appreciate any help/insight. Thanks.

Steve

Stephen Chu
ATT Labs CSO
C5-3C03
200 Laurel Ave
Middletown, NJ
(732) 420-0588
stephen...@att.com

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
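A client-side probe related to the symptom in this thread: the errno that getxattr returns distinguishes "no ACL stored on this file" from "ACL xattrs not supported on this mount". A hedged sketch for a Linux client; the path is only an example:

```python
import errno
import os

def acl_xattr_state(path):
    """Probe system.posix_acl_access on `path`.

    Returns "present" if the ACL xattr exists, "absent" if the filesystem
    accepts the call but no ACL is set (ENODATA), and "unsupported" when
    the mount rejects the call outright (ENOTSUP/EOPNOTSUPP), as in the
    errors discussed in this thread.
    """
    try:
        os.getxattr(path, "system.posix_acl_access")
        return "present"
    except OSError as e:
        if e.errno == errno.ENODATA:
            return "absent"
        if e.errno in (errno.ENOTSUP, errno.EOPNOTSUPP):
            return "unsupported"
        raise

print(acl_xattr_state("/tmp"))
```

On a mount where ACLs are disabled, every file reports "unsupported"; that matches a mount-option problem rather than anything wrong with the individual files.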
Re: [Lustre-discuss] Lustre Quotas
On 12/7/10 4:35 AM, Mark Nelson wrote:

Hi Guys,

Several years ago there was a thread discussing some of the problems with Lustre quotas and what kinds of things might be done to move forward. I was wondering if/how things have improved since then? Anyone have any thoughts/experiences they would be willing to share? Here's the thread from 2008: http://lists.lustre.org/pipermail/lustre-devel/2008-May/002451.html

As far as I know, the progress is as follows:

* Changes required to quotas because of architecture changes *
#1: Supporting quotas on HEAD (no CMD) - done, released in lustre-2.0.
#2: Supporting quotas with CMD - design only, not implemented yet.
#3: Supporting quotas with DMU - seems to be in progress.

* Shortcomings of the current quota implementation *
Unfortunately, these known quota issues in Lustre have not been overcome yet.

Cheers, -- Nasf

Thanks, Mark

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Patchless kernel support?
On 11/25/10 11:55 PM, Ashley Pittman wrote: On 25 Nov 2010, at 15:37, Fan Yong wrote: On 11/25/10 8:18 PM, Ashley Pittman wrote:

Picking up from something that was said at SC last week, I believe it was Andreas who mentioned the possibility of patchless kernel support. This is something that would be immensely useful to us for a variety of reasons. Has there been any recent work investigating how much effort would be involved in implementing this, and what's the feeling on whether it could be done through changes to Lustre only, or would it be a case of submitting a number of patches upstream?

I have done some initial investigation of the patchless server recently. I am not sure whether you can access the following link or not: http://jira.whamcloud.com/browse/IT-13

No I can't, my account doesn't have the permissions required to read that.

Sorry, it was an internal task (IT) originally, and it has been converted to the LU type; any registered user can access it now. http://jira.whamcloud.com/browse/LU-20

Cheers, -- Nasf

The Oracle bug 21524 is what I think I was looking for, but the more information I have on this subject the better.

Ashley.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] 1.8 quotas
On 10/23/10 4:39 PM, Bernd Schubert wrote:

Hello Jason,

please note that it is also possible to enable quotas using lctl, and that would not be visible using tunefs.lustre. I think the only real option to check whether quotas are enabled is to check if the quota files exist. For an online filesystem, 'debugfs -c /dev/device' is probably the safest way (there is also a 'secret' way to bind-mount the underlying ldiskfs to another directory, but I only use that for test filesystems and never in production, as I have not verified the kernel code path yet). Either way, you should check for lquota files, such as:

  r...@rhel5-nfs@phys-oss0:~# mount -t ldiskfs /dev/mapper/ost_demofs_2 /mnt
  r...@rhel5-nfs@phys-oss0:~# ll /mnt
  [...]
  -rw-r--r-- 1 root root  7168 Oct 23 09:48 lquota_v2.group
  -rw-r--r-- 1 root root 71680 Oct 23 09:48 lquota_v2.user

(Of course, you should check this on those OSTs which have reported the slow quota messages.)

In fact, once you have performed the command 'lfs quotacheck -ug $MNT' on any client, the two files you mentioned will be created on each OST and MDT; even if you perform 'lfs quotaoff -ug $MNT' later, those two files will not be removed. So you cannot tell whether quota is on or off for your system just from whether those two files exist. (If quota is off for your system, then lquota_chkquota(), called in filter_commitrw_write(), will be bypassed directly.)

Since you want to disable quota on your system, why not perform 'lfs quotaoff -ug $MNT' on a client directly? That command can be performed even if quota is already off, without any harm. If you want to make sure whether quota is off, you can try 'lfs quota -u $UID $MNT' on a client; if quota is off, it will report "user quotas are not enabled".

- Nasf

I just poked around a bit in the code, and above the fsfilt_check_slow() check there is also a loop that calls filter_range_is_mapped().
Now this function calls fs_bmap(), and when that eventually goes down to ext3, it might get a bit slow if another thread modifies that file (check out linux/fs/inode.c):

  /*
   * bmap() is special. It gets used by applications such as lilo and by
   * the swapper to find the on-disk block of a specific piece of data.
   *
   * Naturally, this is dangerous if the block concerned is still in the
   * journal. If somebody makes a swapfile on an ext3 data-journaling
   * filesystem and enables swap, then they may get a nasty shock when the
   * data getting swapped to that swapfile suddenly gets overwritten by
   * the original zero's written out previously to the journal and
   * awaiting writeback in the kernel's buffer cache.
   *
   * So, if we see any bmap calls here on a modified, data-journaled file,
   * take extra steps to flush any blocks which might be in the cache.
   */

I don't know, though, whether it can happen that several threads write to the same file. But if it happens, it gets slow. I wonder if a possible swap file is worth the effort here... In fact, the reason filter_range_is_mapped() is called certainly does not require a journal flush in that loop. I will check myself next week whether journal flushes are ever made because of that, and open a Lustre bugzilla then. Avoiding all of that should not be difficult.

Cheers, Bernd

On Saturday, October 23, 2010, Jason Hill wrote:

Kevin/Dave/(and Dave from DDN): Thanks for your replies. From tunefs.lustre --dryrun it is very apparent that we are not running quotas. Thanks for your assistance.

That message, from lustre/obdfilter/filter_io_26.c, is the result of the thread taking 35 seconds from when it entered filter_commitrw_write() until after it called lquota_chkquota() to check the quota. However, it is certainly plausible that the thread was delayed by something other than quotas, such as an allocation (e.g., it could have been stuck in filter_iobuf_get).
Kevin

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] 1.8 quotas
On 10/22/10 9:37 PM, Jason Hill wrote:

Folks,

Not having had to deal with quotas on our scratch filesystems in the past, I'm puzzled about why we're seeing messages like the following:

  Oct 22 09:29:00 widow-oss3c2 kernel: kernel: Lustre: widow3-OST00b1: slow quota init 35s due to heavy IO load

We're (I think) not doing quotas. I've run through the 1.8 manual and it's unclear to me how to detect whether Lustre is in fact calculating quotas under the covers. I'm extremely hesitant to run lfs quotacheck - we've only got 400T and 30-ish million files in the filesystem currently, but there are 372 OSTs on 48 OSS nodes - and I'm concerned about the amount of time it would take, as I've heard the initial lfs quotacheck command can take quite a while to build the mapping.

So, the question is - if we see messages like "slow quota init", are quotas being calculated in the background? And as a follow-up - how do we turn them off?

No. I think you were misled by the message "slow quota init 35s due to heavy IO load", which does not mean quota is being recalculated (or initially calculated) in the background. In fact, that message is printed before an obdfilter write; at that point, the OST tries to acquire enough quota for the write operation. It checks locally whether the remaining quota for the uid/gid (of this OST object) is enough; if not, the quota slave on this OST acquires more quota from the quota master on the MDS. This process may take a long time on a heavily loaded system, especially when the remaining quota on the quota master (MDS) is also very limited. The message you saw just reflects that. There is no good way to disable these messages as long as quota is set on this uid/gid.

Cheers, Nasf

Thanks,

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
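Nasf's description of the acquire path - the slave checks its local quota first and only goes to the master on the MDS when it runs short - can be condensed into a toy model. All class names and numbers here are invented for illustration; in the real system the acquire step is an RPC, not a method call, and that RPC is what stalls under load:

```python
class QuotaMaster:
    """Toy MDS-side quota master handing out quota to OST slaves."""
    def __init__(self, limit):
        self.remaining = limit

    def acquire(self, amount):
        # Grant as much as is left, up to the requested amount.
        grant = min(self.remaining, amount)
        self.remaining -= grant
        return grant


class QuotaSlave:
    """Toy OST-side quota slave mirroring the check-local-then-acquire flow."""
    def __init__(self, master):
        self.master = master
        self.local = 0  # quota already granted to this OST

    def check(self, write_size):
        # Fast path: local quota covers the write, no trip to the master.
        if self.local >= write_size:
            self.local -= write_size
            return True
        # Slow path: acquire more from the master (the step that can take
        # long under load, producing the "slow quota init" message).
        self.local += self.master.acquire(write_size - self.local)
        if self.local >= write_size:
            self.local -= write_size
            return True
        return False  # over quota


master = QuotaMaster(limit=100)
slave = QuotaSlave(master)
print(slave.check(60), slave.check(60))  # second write exceeds the limit
```

The second check fails only because the master has run out, which is exactly the "remaining quota on the quota master is also very limited" case Nasf describes.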
Re: [Lustre-discuss] need help debuggin an access permission problem
In fact, the issue occurs when the MDS does the upcall (by default handled by the user-space l_getgroups) for the user/group information related to an RPC: one UID per upcall, and all the supplementary groups of that UID (up to sysconf(_SC_NGROUPS_MAX) of them) are returned. The process is the same whether a single user or many users are involved. If an improper (LDAP) configuration for some user(s) caused the failure, you have to verify the users one by one.

Cheers, Nasf

On 9/24/10 9:58 PM, Tina Friedrich wrote:

Actually, what I hit was that one of the LDAP servers private to the MDS erroneously had a size limit set where the others are unlimited. They're round-robined, which is why I was seeing an intermittent effect. So not a client issue; the clients would not have used this server for their lookups. Which is why I'm puzzled as to how this works, and trying to understand it a bit better; to my understanding, this should not affect lookups on single users, only 'bulk' transfers of data, at least as I understand it?

Tina

On 24/09/10 12:35, Daniel Kobras wrote:

Hi!

On Fri, Sep 24, 2010 at 09:18:15AM +0100, Tina Friedrich wrote:

Cheers Andreas. I had actually found that, but there doesn't seem to be that much documentation about it. Or I didn't find it :) Plus it appeared to find the users that were problematic whenever I tried it, so I wondered if that is all there is, or if there's some other mechanism I could test for.

Mind that access to cached files is no longer authorized by the MDS, but by the client itself. I wouldn't call it documentation, but http://wiki.lustre.org/images/b/ba/Tuesday_lustre_automotive.pdf has an illustration of why this is a problem when name services become out of sync between the MDS and Lustre clients (slides 23/24). Sounds like you hit a very similar issue.

Regards, Daniel.
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
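What l_getgroups does on the MDS - resolve one UID to its full supplementary group list through NSS (and hence LDAP, when so configured) - is essentially the getgrouplist(3) library call. A minimal sketch of the same lookup against the local account database (the account used is just whichever user runs the script):

```python
import os
import pwd

# Resolve the full supplementary group list for one user - the same
# per-UID lookup l_getgroups performs via NSS/LDAP on the MDS.
me = pwd.getpwuid(os.getuid())
groups = os.getgrouplist(me.pw_name, me.pw_gid)

# Even for a single user, the whole group list comes back in one query,
# which is how a server-side LDAP size limit can break a per-user lookup.
print(me.pw_name, sorted(groups))
```

This is also a quick way to compare what the MDS host's name service actually returns for a user against what the clients see.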
Re: [Lustre-discuss] need help debuggin an access permission problem
On 9/23/10 10:03 PM, Tina Friedrich wrote:

Hi, thanks for the answer. I found it in the meantime; one of our LDAP servers had a wrong size limit entry. The logs I had of course already looked at - they didn't yield much in terms of why, only what (as in, I could see it was permission errors, but they don't really tell you why you are getting them; there weren't any log entries that hinted at 'size limit exceeded' or anything).

Still - could someone point me to the bit in the documentation that best describes how the MDS queries that sort of information (group/passwd info, I mean)? Or how best to test that its mechanisms are working? For example, in this case, I always thought one would only hit the size limit when doing a bulk 'transfer' of data, not a lookup on one user - plus I could do these sorts of lookups fine on all machines involved (against all LDAP servers).

The section about the User/Group Cache Upcall may be helpful for you. For lustre-1.8.x, it is chapter 28.1; for lustre-2.0.x, it is chapter 29.1. Good luck!

Cheers, Nasf

Tina

On 23/09/10 11:20, Ashley Pittman wrote:

On 23 Sep 2010, at 10:46, Tina Friedrich wrote:

Hello List,

I'm after debugging hints... I have a couple of users that intermittently get I/O errors when trying to ls a directory (as in, within half an hour: works - doesn't work - works...). Users/groups are kept in LDAP; as far as I can see/check, the LDAP information is consistent everywhere (i.e. no replication failure or anything). I am trying to figure out what is going on here/where this is going wrong. Can someone give me a hint on how to debug this? Specifically, how does the MDS look up this sort of information? Could there be a 'list too long' type of error involved, something like that?

Could you give an indication as to the number of files in the directory concerned?
What is the full ls command issued (allowing for shell aliases), and in the cases where it does work, is there a large variation in the time it takes? In terms of debugging it, I'd say the log files for the client in question and the MDS would be the most likely place to start.

Ashley.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] kernel: BUG: soft lockup - CPU stuck for 10s! with lustre 1.8.4
On 9/20/10 5:36 PM, Temple Jason wrote:

It appears that turning off statahead does indeed avoid the soft lockup bug. But this seems to me to be a workaround, and not a solution. Is statahead not useful for performance gains? I am not comfortable making my users' jobs waste more cpu time because I have to implement a workaround instead of a fix. Is there one in the works?

Nasf - does your patch solve the bug, or does it just disable statahead by default?

Statahead is used to optimize the performance of traversing large directories in Lustre. My patch does not disable statahead; it tries to fix the soft lockup issues, and it also contains some other fixes to improve system stability and performance a bit. I think the patch is worth trying.

Cheers, Nasf

Thanks, Jason

-Original Message-
From: lustre-discuss-boun...@lists.lustre.org [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of paciu...@gmail.com
Sent: Saturday, 18 September 2010 08:13
To: rr...@whamcloud.com; peter.x.jo...@oracle.com
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] kernel: BUG: soft lockup - CPU stuck for 10s! with lustre 1.8.4

No, I have disabled the statahead cache to avoid the problem.

-Original Message-
From: Robert Read rr...@whamcloud.com
Date: Sat Sep 18 04:42:18 GMT 2010
To: Peter Jones peter.x.jo...@oracle.com
CC: lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] kernel: BUG: soft lockup - CPU stuck for 10s! with lustre 1.8.4

Hi Peter,

Perhaps the link got mangled by your mail client? (It does have some seemingly unusual characters for an URL.) My interpretation of Gabriele's reply is that the problem occurred even with statahead disabled, so in that case this patch might be worth trying.

robert

On Sep 17, 2010, at 10:18, Peter Jones wrote:

The URL does not work for me, but if it is a statahead issue then surely turning statahead off would be a simple workaround to avoid having to apply a patch.
Fan Yong wrote:

On 9/14/10 8:55 PM, Gabriele Paciucci wrote: I have the same problem, I put statahead_max to 0 !!!

In fact, I have made a patch for statahead related issues (including this one) against lustre-1.8, which is under inspection: http://review.whamcloud.com/#change,2 If possible, you can try that patch.

Cheers, -- Nasf

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] kernel: BUG: soft lockup - CPU stuck for 10s! with lustre 1.8.4
On 9/14/10 8:55 PM, Gabriele Paciucci wrote: I have the same problem, I put statahead_max to 0 !!!

In fact, I have made a patch for statahead related issues (including this one) against lustre-1.8, which is under inspection: http://review.whamcloud.com/#change,2 If possible, you can try that patch.

Cheers, -- Nasf

On 09/14/2010 11:55 AM, Fan Yong wrote: On 9/14/10 5:21 PM, Temple Jason wrote:

Hello,

I have recently upgraded my lustre filesystem from 1.8.3 to 1.8.4. The first day we brought our system online with the new version, we started seeing clients getting stuck in this soft lockup loop. The load shoots up over 120, and eventually the node becomes unusable and requires a hard reset. I've seen loops like this on the server side in previous lustre versions, but to have it happen on the client is completely new. Here is a bit of what I see in the logs:

  Sep 13 21:11:39 wn122 kernel: LustreError: 27016:0:(statahead.c:289:ll_sai_entry_fini()) ASSERTION(sa_is_stopped(sai)) failed

The soft lockup is caused by this ASSERTION: when it triggers, the thread is holding a spin_lock, which leaves the other threads busy-waiting on that spin_lock.
  Sep 13 21:11:39 wn122 kernel: LustreError: 27016:0:(statahead.c:289:ll_sai_entry_fini()) LBUG
  Sep 13 21:11:39 wn122 kernel: Pid: 27016, comm: athena.py
  Sep 13 21:11:39 wn122 kernel:
  Sep 13 21:11:39 wn122 kernel: Call Trace:
  Sep 13 21:11:39 wn122 kernel: [885cf6a1] libcfs_debug_dumpstack+0x51/0x60 [libcfs]
  Sep 13 21:11:39 wn122 kernel: [885cfbda] lbug_with_loc+0x7a/0xd0 [libcfs]
  Sep 13 21:11:39 wn122 kernel: [885d7f00] tracefile_init+0x0/0x110 [libcfs]
  Sep 13 21:11:39 wn122 kernel: [889247e9] ll_statahead_exit+0x409/0x500 [lustre]
  Sep 13 21:11:39 wn122 kernel: [8008a4b4] default_wake_function+0x0/0xe
  Sep 13 21:11:39 wn122 kernel: [888d2f2e] ll_intent_drop_lock+0x8e/0xb0 [lustre]
  Sep 13 21:11:39 wn122 kernel: [8891c2bb] ll_lookup_it+0x30b/0x7c0 [lustre]
  Sep 13 21:11:39 wn122 kernel: [888e250d] __ll_inode_revalidate_it+0x5bd/0x650 [lustre]
  Sep 13 21:11:39 wn122 kernel: [886f34f4] ldlm_lock_add_to_lru+0x74/0xe0 [ptlrpc]
  Sep 13 21:11:39 wn122 kernel: [88919431] ll_convert_intent+0xb1/0x170 [lustre]
  Sep 13 21:11:39 wn122 kernel: [8891d4c7] ll_lookup_nd+0x207/0x400 [lustre]
  Sep 13 21:11:39 wn122 kernel: [8002229b] d_alloc+0x174/0x1a9
  Sep 13 21:11:39 wn122 kernel: [8000cc40] do_lookup+0xe5/0x1e6
  Sep 13 21:11:39 wn122 kernel: [80009fce] __link_path_walk+0xa01/0xf42
  Sep 13 21:11:39 wn122 kernel: [8000e803] link_path_walk+0x5c/0xe5
  Sep 13 21:11:39 wn122 kernel: [80034cd4] vfs_readdir+0x94/0xa9
  Sep 13 21:11:39 wn122 kernel: [800f4fa9] compat_sys_getdents+0xaf/0xbd
  Sep 13 21:11:39 wn122 kernel: [8000c9df] do_path_lookup+0x270/0x2e8
  Sep 13 21:11:39 wn122 kernel: [800123f7] getname+0x15b/0x1c1
  Sep 13 21:11:39 wn122 kernel: [80023318] __user_walk_fd+0x37/0x4c
  Sep 13 21:11:39 wn122 kernel: [800320b8] sys_faccessat+0xe4/0x18d
  Sep 13 21:11:39 wn122 kernel: [80034cd4] vfs_readdir+0x94/0xa9
  Sep 13 21:11:39 wn122 kernel: [800f4fa9] compat_sys_getdents+0xaf/0xbd
  Sep 13 21:11:39 wn122 kernel: [8006149b] sysenter_do_call+0x1b/0x67
  Sep 13 21:11:39 wn122 kernel: [8002c205] dummy_inode_permission+0x0/0x3
  Sep 13 21:11:39 wn122 kernel:
  Sep 13 21:11:39 wn122 kernel: LustreError: dumping log to /tmp/lustre-log.1284405099.27016
  Sep 13 21:11:44 wn122 dhclient: DHCPREQUEST on eth0 to 148.187.67.113 port 67
  Sep 13 21:11:49 wn122 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [ptlrpcd:31817]
  Sep 13 21:11:49 wn122 kernel: CPU 3:
  Sep 13 21:11:49 wn122 kernel: Modules linked in: mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) nfs fscache nfs_acl lockd sunrpc bonding(U) ip_conntrack_netbios_ns ipt_REJECT xt_tcpudp xt_state iptable_filter iptable_nat ip_nat ip_conntrack nfnetlink iptable_mangle ip_tables x_tables rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ipv6 xfrm_nalgo crypto_api ib_uverbs(U) ib_umad(U) mlx4_vnic(U) ib_sa(U) mlx4_ib(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev sg i2c_i801 i2c_core e1000e shpchp mlx4_core(U) pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
  Sep 13 21:11:49 wn122 kernel: Pid: 31817, comm: ptlrpcd Tainted: G 2.6.18-128.7.1.el5 #1
  Sep 13 21:11:49 wn122 kernel: RIP: 0010:[80064cb7] [80064cb7] .text.lock.spinlock+0x5/0x30
  Sep 13 21:11:49 wn122 kernel: RSP: 0018:8101ec177cb8
Re: [Lustre-discuss] kernel: BUG: soft lockup - CPU stuck for 10s! with lustre 1.8.4
On 9/18/10 1:18 AM, Peter Jones wrote: The URL does not work for me, but if it is a statahead issue then surely turning statahead off would be a simple workaround to avoid having to apply a patch. Strange, it is a public link; I can access it without logging in. Anyway, after inspection, I will attach it to Lustre Bugzilla for further review/verification. Cheers, Nasf Fan Yong wrote: On 9/14/10 8:55 PM, Gabriele Paciucci wrote: I have the same problem, I set statahead_max to 0 !!! In fact, I have made a patch for statahead-related issues (including this one) against lustre-1.8, which is under review: http://review.whamcloud.com/#change,2 If possible, you can try this patch. Cheers, -- Nasf ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
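For reference, the workaround mentioned in this thread (disabling statahead on the affected clients) is usually done along these lines on a 1.8 client. This is a hedged sketch: the tunable name `llite.*.statahead_max` is the standard one, but exact paths can vary by Lustre version.

```shell
# Disable the statahead feature on this client (0 = off).
lctl set_param llite.*.statahead_max=0

# Equivalent, writing the /proc tunable directly (a loop avoids an
# ambiguous redirect when more than one Lustre mount is present):
for f in /proc/fs/lustre/llite/*/statahead_max; do
    echo 0 > "$f"
done

# Verify the new setting:
lctl get_param llite.*.statahead_max
```

The change takes effect immediately and needs no remount, but it is not persistent across reboots, so it would need to be reapplied (e.g. from an init script) until a patched Lustre is installed.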
Re: [Lustre-discuss] ll_glimpse_size() causing intermittent EIO
On 9/17/10 2:43 AM, Fraser McCrossan wrote: We're seeing a problem recently with a 1.8.4 server and both 1.8.2 and 1.8.4 clients. We get intermittent log entries on the client like this: LustreError: 32493:0:(file.c:995:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO Occasionally we get rc -4 instead of -5. The client code fails with an I/O error, as you'd expect. We can't find matching log entries on the server. Trying the same file on the same or different clients again a few minutes later works fine. We're seeing this on lots of different clients, intermittently and fairly rarely, but it obviously tends to crash user jobs. Any ideas what could be causing it? Have you compared your issue with Lustre bug 22622? https://bugzilla.lustre.org/show_bug.cgi?id=22622 It looks somewhat similar, but I am not sure. Good luck. Cheers, Nasf
Re: [Lustre-discuss] 2.0.0.1 server with 1.8.x clients
Honestly, we do not recommend the mixed mode you mentioned unless you have to use it for some special reason. There are some known interoperability issues between lustre-1.8 and lustre-2.0; some of them are in this list: https://bugzilla.lustre.org/buglist.cgi?quicksearch=1.8%3C-%3E2.0+interop As far as I know, most interoperability issues were fixed after lustre-1.8.3, so lustre-1.8.x with x >= 4 may work for your configuration. Cheers, Nasf On 9/15/10 2:48 PM, Christos Theodosiou wrote: Hi, we plan to use Lustre 2.0.0.1 servers with some 1.8.x and 2.0.0.1 clients. Are there any known issues that discourage this configuration? Regards Christos
Re: [Lustre-discuss] IOR performance - Need help
On 9/14/10 5:57 PM, satish patil wrote: Hello, Recently we installed 6 OSS pairs with 8 OSTs per pair, 48 OSTs in total. Each OST is 3.7 TB; in all it is a 177 TB file system. The Lustre version installed is 1.8.1.1, and the clients are currently based on RHEL 5U2 with 1.6.x. When running the individual OST tests from a performance perspective we are able to get around 17.5 GB/s. Our target is to cross 10 GB/s write performance using a single file (without the -F option), avoiding client-side cache. I have reached a maximum of 7.5 GB/s for write performance, but cannot go beyond that. I tried using a stripe count of 48 for a single file along with the default stripe size of 1MB, but was not able to cross 10 GB/s. Can you give a detailed description of your system topology? We have seen customers with larger theoretical bandwidth but worse real performance, because of unexpectedly poor back-end storage performance under parallel load. For I/O performance testing, full striping may not be the best choice: using single-stripe files, and spreading these relatively small files evenly across all OSTs, may give better results. The command line used for running IOR is as follows: /opt/intel/mpi/bin64/mpirun --totalnum=96 --file=$PBS_NODEFILE --rsh=/usr/bin/ssh -1 --ordered --verbose -l -machinefile $PBS_NODEFILE -np 96 /newScratch/IOR/src/C/IOR.mpiio -a MPIIO -b 22G -C -i 3 -k -t 1m -w -r -R -W -x -N 96 -o /newScratch/hp.stripeC48/IOR.dat We used lustre_config to create the file system. On the other hand, Lustre provides basic I/O performance utilities (under lustre-iokit). You can use them step by step to measure the performance of the basic elements (back-end storage, obdfilter, and network), which can help you locate where the performance issues are. Cheers, Nasf Appreciate your help. Regards SP
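To make the striping advice concrete, here is a hedged sketch of the two layouts with `lfs setstripe`. The directory paths are examples, and on 1.8-era clients the stripe-size flag is `-s` (newer releases use `-S`).

```shell
# Fully striped shared file: every OST serves the single file
# (this is the stripe-count-48 configuration already being tested).
mkdir -p /newScratch/ior-shared
lfs setstripe -c -1 -s 1M /newScratch/ior-shared

# Suggested alternative: one stripe per file; with IOR's -F
# (file-per-process) mode the files spread evenly over the OSTs.
mkdir -p /newScratch/ior-fpp
lfs setstripe -c 1 -s 1M /newScratch/ior-fpp

# Inspect the resulting layout of a directory or file:
lfs getstripe /newScratch/ior-fpp
```

Files inherit the striping of the directory they are created in, so setting these defaults once per test directory is enough; no per-file `setstripe` calls are needed in the job itself.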
Re: [Lustre-discuss] IOR performance - Need help
On 9/14/10 7:20 PM, satish patil wrote: Thanks for your feedback. The back-end storage is P2000 G3, SAS-based, on an 8 Gbps SAN, using 450GB 15K drives. It is the client's requirement to have this performance with a single file using all OSTs. It is quite necessary to verify that the raw system (without Lustre) can achieve the parallel I/O performance of more than 10 GB/s that you expect. A real test result is more convincing than any nominal parallel I/O figure, especially for SAN-based storage. Cheers, Nasf Regards SP --- On Tue, 9/14/10, Fan Yong <yong@whamcloud.com> wrote: From: Fan Yong <yong@whamcloud.com> Subject: Re: [Lustre-discuss] IOR performance - Need help To: lustre-discuss@lists.lustre.org Date: Tuesday, September 14, 2010, 4:24 PM On 9/14/10 5:57 PM, satish patil wrote: Hello, Recently we installed 6 OSS pairs with 8 OSTs per pair, 48 OSTs in total. Each OST is 3.7 TB; in all it is a 177 TB file system. The Lustre version installed is 1.8.1.1, and the clients are currently based on RHEL 5U2 with 1.6.x. When running the individual OST tests from a performance perspective we are able to get around 17.5 GB/s. Our target is to cross 10 GB/s write performance using a single file (without the -F option), avoiding client-side cache. I have reached a maximum of 7.5 GB/s for write performance, but cannot go beyond that. I tried using a stripe count of 48 for a single file along with the default stripe size of 1MB, but was not able to cross 10 GB/s. Can you give a detailed description of your system topology? We have seen customers with larger theoretical bandwidth but worse real performance, because of unexpectedly poor back-end storage performance under parallel load. For I/O performance testing, full striping may not be the best choice: using single-stripe files, and spreading these relatively small files evenly across all OSTs, may give better results.
The command line used for running IOR is as follows: /opt/intel/mpi/bin64/mpirun --totalnum=96 --file=$PBS_NODEFILE --rsh=/usr/bin/ssh -1 --ordered --verbose -l -machinefile $PBS_NODEFILE -np 96 /newScratch/IOR/src/C/IOR.mpiio -a MPIIO -b 22G -C -i 3 -k -t 1m -w -r -R -W -x -N 96 -o /newScratch/hp.stripeC48/IOR.dat We used lustre_config to create the file system. On the other hand, Lustre provides basic I/O performance utilities (under lustre-iokit). You can use them step by step to measure the performance of the basic elements (back-end storage, obdfilter, and network), which can help you locate where the performance issues are. Cheers, Nasf Appreciate your help. Regards SP
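One rough way to run the raw (non-Lustre) bandwidth check suggested above is to drive every OST's backing device concurrently with direct I/O and sum the per-device rates. This is only a sketch under stated assumptions: the device names are hypothetical, the OSTs must be offline, and the write test destroys data on those devices. The sgpdd-survey tool in lustre-iokit automates a more thorough version of the same measurement.

```shell
#!/bin/sh
# Rough raw-bandwidth check: write to every backing device in
# parallel with O_DIRECT, bypassing the page cache.
# WARNING: destructive -- only run against devices whose data is
# disposable, with the OSTs unmounted.
DEVICES="/dev/mapper/ost0 /dev/mapper/ost1"   # hypothetical names

for dev in $DEVICES; do
    dd if=/dev/zero of="$dev" bs=1M count=4096 oflag=direct &
done
wait
# Each dd reports its own MB/s on exit; the aggregate across all
# devices should comfortably exceed the 10 GB/s target before the
# Lustre layer is suspected.
```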
Re: [Lustre-discuss] Adding OST to online Lustre with quota
On 7/7/10 10:48 PM, Andreas Dilger wrote: On 2010-07-06, at 10:10, Fan Yong wrote: On 7/5/10 4:21 PM, Gizo Nanava wrote: we wonder whether it is possible to add OSTs to Lustre with quota support without taking it offline? We tried to do this but all quota information was lost. Despite the fact that the OST was formatted with quota support, we receive this error message: Lustre: 3743:0:(lproc_quota.c:447:lprocfs_quota_wr_type()) lustrefs-OST0016: quotaon failed because quota files don't exist, please run quotacheck firstly The message suggests running the quotacheck command again, but maybe there is a faster solution. Quotacheck is the normal process, but it may be slow. The basic idea is to generate lquota_v2.user / lquota_v2.group without running quotacheck on the target Lustre system; there are other similar ways to do that. To be honest, I don't think this is a desirable solution. It should be possible to automatically create these quota files the first time that a new OST is mounted, since we know at that point that the filesystem is empty and there will be no quota usage for any user on the OST. Right, that is the expected solution, but it is not supported yet. The above is just a hack based on the existing mechanism. Regards, -- Fan Yong Cheers, Andreas -- Andreas Dilger Lustre Technical Lead Oracle Corporation Canada Inc. -- Fan Yong Lustre Group
Re: [Lustre-discuss] Adding OST to online Lustre with quota
On 7/5/10 4:21 PM, Gizo Nanava wrote: Hello, we wonder whether it is possible to add OSTs to Lustre with quota support without taking it offline? We tried to do this but all quota information was lost. Despite the fact that the OST was formatted with quota support, we receive this error message: Lustre: 3743:0:(lproc_quota.c:447:lprocfs_quota_wr_type()) lustrefs-OST0016: quotaon failed because quota files don't exist, please run quotacheck firstly The message suggests running the quotacheck command again, but maybe there is a faster solution. Quotacheck is the normal process, but it may be slow. If you do not want to perform a quotacheck, which affects all servers, there may be one possible solution (if the OST to be added is a clean one): build another small Lustre system locally and perform quotacheck on it; then unmount one of its OSTs and remount it as ldiskfs (and mount the target OST to be added as ldiskfs also), copy lquota_v2.user / lquota_v2.group to the target OST, and remount the target OST as Lustre. The basic idea is to generate lquota_v2.user / lquota_v2.group without running quotacheck on the target Lustre system; there are other similar ways to do that. Just try it; in any case it will not destroy the current Lustre system. -- Fan Yong Could someone please tell us the proper procedure, or point us to the proper documentation? Thank you for any help. Best wishes, Gizo Nanava
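The hack described above can be sketched roughly as follows. All device paths and mount points here are hypothetical examples, and the small "donor" Lustre system is assumed to have already had quotacheck run on it.

```shell
# 1. On the donor system, take one clean, quota-checked OST down.
umount /mnt/donor-ost

# 2. Remount it as plain ldiskfs to expose the quota files.
mount -t ldiskfs /dev/donor_ost_dev /mnt/src

# 3. Mount the new OST to be added as ldiskfs as well.
mount -t ldiskfs /dev/new_ost_dev /mnt/dst

# 4. Copy the generated quota files onto the new OST.
cp /mnt/src/lquota_v2.user  /mnt/dst/
cp /mnt/src/lquota_v2.group /mnt/dst/

# 5. Unmount both and bring the new OST up as a Lustre target.
umount /mnt/src
umount /mnt/dst
mount -t lustre /dev/new_ost_dev /mnt/new-ost
```

Since the new OST is empty, the copied quota files simply record zero usage, which matches reality; that is why this shortcut can stand in for a full quotacheck in this specific case.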
Re: [Lustre-discuss] Incorrect user's Quota
Lu Wang wrote: Hi Yong, We are running lustre-1.6.5-2.6.9_55.EL.cernsmp, 32-bit (on the clients), and 2.6.9-67.0.22.EL_lustre.1.6.6smp, 64-bit (on the servers). Each user has a 20GB quota for file size; we have not set a file-number quota.

# lfs quota -u /besfs2
Disk quotas for user (uid 23034):
Filesystem           kbytes  quota   limit  grace  files  quota  limit  grace
/besfs2                   4   2000    2010             1      0      0
besfs2-MDT_UUID           4         131072             1             0
besfs2-OST_UUID           0         131072
besfs2-OST0001_UUID       0         131072
besfs2-OST0002_UUID       0         131072
besfs2-OST0003_UUID       0         131072
besfs2-OST0004_UUID       0         131072
besfs2-OST0005_UUID       0         131072
besfs2-OST0006_UUID       0         131072

I am sorry I do not have the error screenshot, which showed that a certain user had not reached the file-size quota while a quota-exceeded error was triggered. One possible reason is that there are some open-delete operations (files deleted before they are really closed), which cause the disk space and quota not to be released until the files are actually closed. On the other hand, Lustre quota is distributed, and is not as accurate as a local filesystem's; that may cause a quota-exceeded error when usage is near, but has not yet hit, the limit. If your failure was not caused by the above reasons, I suggest you upgrade to Lustre 1.6.7 or 1.8.1; some other users have reported similar issues before, and we have fixed some quota-related issues in those releases. -- Fan Yong -- Lu Wang 2009-08-28 - From: Fan Yong Sent: 2009-08-28 17:12:09 To: Lu Wang Cc: Subject: Re: [Lustre-discuss] Incorrect user's Quota Lu Wang wrote: Dear list, We are running a small Lustre with 2 OSSes (one OSS shares a server with the MDS) for users' home directories. We have experienced incorrect user quotas several times: certain users got quota-exceeded errors when their usage of the disk space was only half of their quota. We see this kind of error on the MDS from time to time: Aug 24 12:25:20 beshome01 kernel: Lustre: Skipped 3 previous similar messages Aug 24 12:51:29 beshome01 kernel: LustreError: 28467:0:(quota_master.c:507:mds_quota_adjust()) mds adjust qunit failed!
(opc:4 rc:-122) Aug 24 12:52:06 beshome01 kernel: LustreError: 26005:0:(quota_master.c:507:mds_quota_adjust()) mds adjust qunit failed! (opc:4 rc:-122) Does this error have a relationship with the incorrect-quota problem? How can we avoid this situation? Which version of Lustre are you using? Is the quota exceeded for blocks or files? What is the limit for your users' quota? Providing detailed information can help localize the issue. -- Fan Yong Best Regards Lu Wang -- Computing Center IHEP Office: Computing Center, 123 Beijing 100049, China Email: lu.w...@ihep.ac.cn --
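If the open-delete explanation above is suspected, one hedged way to check is to look on the clients for files that have been unlinked but are still held open (the mount point here is just the example path from the thread):

```shell
# List open files whose on-disk link count is zero, i.e. files
# deleted while a process still holds them open -- these continue
# to hold disk space and quota until they are actually closed.
lsof +L1 /besfs2
```

The space and quota are released once the owning processes close the files or exit, so killing or restarting the listed jobs would reconcile the reported usage.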