[Lustre-discuss] Client Cannot Mount File System
MDS/OSSs: 1.8.8-wc1_2.6.18_308.4.1.el5_gbc88c4c  Client: 1.8.9-wc1_2.6.32_358.23.2.el6

One (out of hundreds) of our clients has been unable to mount our Lustre file system. We could find no host or network issues. Attempts to mount yielded the following on the client:

mount -t lustre -o localflock 10.13.68.1@o2ib:10.13.68.2@o2ib:/lfs /lfs/scratch
mount.lustre: mount 10.13.68.1@o2ib:10.13.68.2@o2ib:/lfs at /lfs/scratch failed: Interrupted system call
Error: Failed to mount 10.13.68.1@o2ib:10.13.68.2@o2ib:/lfs

with the following syslog messages:

Jun 10 15:21:05 r15a-s40 kernel: Lustre: 1269:0:(o2iblnd_cb.c:1813:kiblnd_close_conn_locked()) Closing conn to 10.13.79.252@o2ib2: error 0(waiting)
Jun 10 15:21:05 r15a-s40 kernel: LustreError: 166-1: MGC10.13.68.1@o2ib: Connection to service MGS via nid 10.13.68.1@o2ib was lost; in progress operations using this service will fail.
Jun 10 15:21:05 r15a-s40 kernel: LustreError: 15c-8: MGC10.13.68.1@o2ib: The configuration from log 'lfs-client' failed (-4). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Jun 10 15:21:05 r15a-s40 kernel: LustreError: 4012:0:(llite_lib.c:1099:ll_fill_super()) Unable to process log: -4
Jun 10 15:21:05 r15a-s40 kernel: LustreError: 4012:0:(lov_obd.c:1012:lov_cleanup()) lov tgt 1 not cleaned! deathrow=0, lovrc=1
Jun 10 15:21:05 r15a-s40 kernel: LustreError: 4012:0:(lov_obd.c:1012:lov_cleanup()) Skipped 5 previous similar messages
Jun 10 15:21:05 r15a-s40 kernel: LustreError: 4012:0:(lov_obd.c:1012:lov_cleanup()) lov tgt 13 not cleaned! deathrow=1, lovrc=1
Jun 10 15:21:05 r15a-s40 kernel: LustreError: 4012:0:(mdc_request.c:1500:mdc_precleanup()) client import never connected
Jun 10 15:21:05 r15a-s40 kernel: Lustre: MGC10.13.68.1@o2ib: Reactivating import
Jun 10 15:21:05 r15a-s40 kernel: Lustre: MGC10.13.68.1@o2ib: Connection restored to service MGS using nid 10.13.68.1@o2ib.
Jun 10 15:21:05 r15a-s40 kernel: Lustre: client lfs-client(88061e105c00) umount complete
Jun 10 15:21:05 r15a-s40 kernel: LustreError: 4012:0:(obd_mount.c:2067:lustre_fill_super()) Unable to mount (-4)

Nothing noteworthy on the MDS. After reconfiguring the client with a new IPoIB IP (and hence, NID), it was able to mount with no problems and is working fine. Additionally, the MDS was rebooted at least once during the time that the client in question was unable to mount, so whatever state was blocking the mount survived the reboot - presumably on the MDT.

I'm particularly curious about the ll_fill_super message. To what log is it referring? Anyone seen this before and have an idea what we need to clear on the MDS/MDT to allow this client to successfully mount the file system again?

Thanks, Charlie Taylor UF Research Computing
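('lfs-client' here is the client configuration llog that the MGC fetches from the MGS at mount time, and -4 is -EINTR. For reference, a minimal LNET-level sanity check of the kind implied above - a hedged sketch, with the NIDs taken from the post:)

# On the client: can we reach the MGS NIDs at the LNET level?
lctl ping 10.13.68.1@o2ib
lctl ping 10.13.68.2@o2ib
# What NIDs does this client itself believe it has?
lctl list_nids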
Re: [Lustre-discuss] Thread might be hung, Heavy IO Load messages
You may also want to check and, if necessary, limit the lru_size on your clients. I believe there are guidelines in the ops manual. We have ~750 clients and limit ours to 600 per OST. That, combined with setting zone_reclaim_mode=0, should make a big difference. Regards, Charlie Taylor UF HPC Center

On Feb 1, 2012, at 2:04 PM, Carlos Thomaz wrote:

> Hi David, You may be facing the same issue discussed on previous threads, which is the issue regarding zone_reclaim_mode. Take a look at the previous thread where myself and Kevin replied to Vijesh Ek. If you don't have access to the previous emails, look at your kernel settings for zone reclaim:
>
> cat /proc/sys/vm/zone_reclaim_mode
>
> It should be set to 0. Also, look at the number of Lustre OSS service threads. It may be set too high... Rgds. Carlos.
>
> -- Carlos Thomaz | HPC Systems Architect Mobile: +1 (303) 519-0578 ctho...@ddn.com | Skype ID: carlosthomaz DataDirect Networks, Inc. 9960 Federal Dr., Ste 100 Colorado Springs, CO 80921 ddn.com http://www.ddn.com/ | Twitter: @ddn_limitless http://twitter.com/ddn_limitless | 1.800.TERABYTE
>
> On 2/1/12 11:57 AM, David Noriega tsk...@my.utsa.edu wrote:
>> indicates the system was overloaded (too many service threads, or

Charles A. Taylor, Ph.D. Associate Director, UF HPC Center (352) 392-4036
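(Spelled out, the settings described above - the same commands appear later in this digest:)

# On each client: cap the LDLM lock LRU at 600 per OST and per MDC namespace
lctl set_param ldlm.namespaces.*osc*.lru_size=600
lctl set_param ldlm.namespaces.*mdc*.lru_size=600
# On NUMA servers: disable zone reclaim (add vm.zone_reclaim_mode=0 to /etc/sysctl.conf to persist)
echo 0 > /proc/sys/vm/zone_reclaim_mode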
[Lustre-discuss] Inactive Service Threads
Lustre Version: lustre-1.8.6-wc1_2.6.18_238.12.1.el5_lustre.1.8.6.x86_64

OSS Configuration:
-- Dual E5620 Processors (8 2.4 GHz cores)
-- 24 GB RAM
-- 8 OSTs (two per controller)
-- 4 x Adaptec 51245
-- 2 x RAID-6 LUNs per controller
-- 7200 RPM Hitachi Drives (SATA HUA722020ALA330)
-- 128 KB Stripe Size
-- 512 KB RPCs

We've tested the configuration extensively and know that we can sustain 2.4 GB/sec to the OSSs for large-block sequential I/O for long periods of time with no issues. The problem comes in production under more typical client workloads, where we see far too many of the messages below - even when the load on the servers is not all that great (8 - 10). Five minutes for an IOP to complete seems like a long time. Seems like we must be either hitting a bug or running out of some resource (locks?). Iostat tends to show fairly typical service, queue, and wait times, which further suggests that there is more going on here than just busy disks.

We have about 600 clients with the following settings...

lctl set_param ldlm.namespaces.*osc*.lru_size=600
lctl set_param ldlm.namespaces.*mdc*.lru_size=600
max_rpcs_in_flight=32
max_pages_per_rpc=128

Note that we have tuned down the number of ost threads from the default to 96, but it has had little impact. If we are to believe the messages, we should probably reduce the thread count further, but it feels like something else is wrong. Perhaps someone else has encountered this or can see an obvious problem in our setup. Any ideas or suggestions are welcome. Charlie Taylor UF HPC Center

Dec 26 15:05:10 hpcoss8 kernel: Lustre: Service thread pid 26929 was inactive for 320.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Dec 26 15:05:10 hpcoss8 kernel: Pid: 26929, comm: ll_ost_io_36
Dec 26 15:05:10 hpcoss8 kernel: Call Trace:
Dec 26 15:05:10 hpcoss8 kernel: [80047144] try_to_wake_up+0x472/0x484
Dec 26 15:05:10 hpcoss8 kernel: [8008c871] __wake_up_common+0x3e/0x68
Dec 26 15:05:10 hpcoss8 kernel: [8028882e] __down_trylock+0x39/0x4e
Dec 26 15:05:10 hpcoss8 kernel: [8006472d] __down_failed_trylock+0x35/0x3a
Dec 26 15:05:10 hpcoss8 kernel: [800646b9] __down_failed+0x35/0x3a
Dec 26 15:05:10 hpcoss8 kernel: [88b491e6] .text.lock.ldlm_resource+0x7d/0x87 [ptlrpc]
Dec 26 15:05:10 hpcoss8 kernel: [88b6c337] ldlm_pools_shrink+0x247/0x2f0 [ptlrpc]
Dec 26 15:05:10 hpcoss8 kernel: [80064604] __down_read+0x12/0x92
Dec 26 15:05:10 hpcoss8 kernel: [8002231e] __up_read+0x19/0x7f
Dec 26 15:05:10 hpcoss8 kernel: [8003f6c0] shrink_slab+0x60/0x153
Dec 26 15:05:10 hpcoss8 kernel: [800cdd0a] zone_reclaim+0x235/0x2cd
Dec 26 15:05:10 hpcoss8 kernel: [800ca13d] __rmqueue+0x44/0xc7
Dec 26 15:05:10 hpcoss8 kernel: [8000a919] get_page_from_freelist+0xbf/0x43a
Dec 26 15:05:10 hpcoss8 kernel: [8000f41a] __alloc_pages+0x78/0x308
Dec 26 15:05:10 hpcoss8 kernel: [80025d41] find_or_create_page+0x32/0x72
Dec 26 15:05:10 hpcoss8 kernel: [88e694e5] filter_get_page+0x35/0x70 [obdfilter]
Dec 26 15:05:10 hpcoss8 kernel: [88e6b72a] filter_preprw+0x14da/0x1e00 [obdfilter]
Dec 26 15:05:10 hpcoss8 kernel: [88a41a54] kiblnd_init_tx_msg+0x154/0x1d0 [ko2iblnd]
Dec 26 15:05:10 hpcoss8 kernel: [88ad4dc0] class_handle2object+0xe0/0x170 [obdclass]
Dec 26 15:05:11 hpcoss8 kernel: [88a49f2d] kiblnd_send+0x86d/0x8b0 [ko2iblnd]
Dec 26 15:05:11 hpcoss8 kernel: [88e1600c] ost_brw_write+0xf9c/0x2480 [ost]
Dec 26 15:05:11 hpcoss8 kernel: [889fe111] LNetMDBind+0x301/0x450 [lnet]
Dec 26 15:05:11 hpcoss8 kernel: [88b88c65] lustre_msg_set_limit+0x35/0xf0 [ptlrpc]
Dec 26 15:05:11 hpcoss8 kernel: [88b7eac8] ptlrpc_send_reply+0x5e8/0x600 [ptlrpc]
Dec 26 15:05:11 hpcoss8 kernel: [88b82fe5] lustre_msg_get_version+0x35/0xf0 [ptlrpc]
Dec 26 15:05:11 hpcoss8 kernel: [88b82ef5] lustre_msg_get_opc+0x35/0xf0 [ptlrpc]
Dec 26 15:05:11 hpcoss8 kernel: [88b830a8] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
Dec 26 15:05:11 hpcoss8 kernel: [88e1a09e] ost_handle+0x2bae/0x55b0 [ost]
Dec 26 15:05:11 hpcoss8 kernel: [80153e70] __next_cpu+0x19/0x28
Dec 26 15:05:11 hpcoss8 kernel: [8008dc31] dequeue_task+0x18/0x37
Dec 26 15:05:11 hpcoss8 kernel: [88b926d9] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
Dec 26 15:05:11 hpcoss8 kernel: [88b92e35] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
Dec 26 15:05:11 hpcoss8 kernel: [8008e435] default_wake_function+0x0/0xe
Dec 26 15:05:11 hpcoss8 kernel: [88b93dc6] ptlrpc_main+0xf66/0x1120 [ptlrpc]
Dec 26 15:05:11 hpcoss8 kernel: [8005dfb1] child_rip+0xa/0x11
Dec 26 15:05:11 hpcoss8 kernel: [88b92e60]
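(Note the zone_reclaim -> shrink_slab -> ldlm_pools_shrink path in the trace above, which is what the zone_reclaim_mode advice in the previous post addresses. If the reduced OST thread count also needs to survive a reboot, the 1.8-era module parameter looks like this - a sketch, assuming the stock ost module:)

# /etc/modprobe.conf on each OSS
options ost oss_num_threads=96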
Re: [Lustre-discuss] RAID cards - what works well with Lustre?
We use Adaptec 51245s and 51645s with:

1. max_hw_sectors_kb=512
2. RAID5 4+1 or RAID6 4+2
3. RAID chunk size = 128 KB

So each 1 MB Lustre RPC results in two 4-way, striped writes with no read-modify-write penalty. We can further improve write performance by matching the max_pages_per_rpc (per OST on the client side), i.e. the max RPC size, to the max_hw_sectors_kb setting for the block devices. In this case max_pages_per_rpc=128 instead of the default 256, at which point you have one RAID-stripe write per RPC.

If you put your OSTs atop LVs (LVM2) as we do, you will want to take the additional step of making sure your LVs are aligned as well:

pvcreate --dataalignment 1024S /dev/sd$driveChar

You need a fairly new version of LVM2 that supports the --dataalignment option. We are using lvm2-2.02.56-8.el5_5.6.x86_64.

Note that we attempted to increase the max_hw_sectors_kb for the block devices (RAID LDs) to 1024, but in order to do so we needed to change the adaptec driver (aacraid) kernel parameter to acbsize=8192, which we found to be unstable. For our adaptec drivers we use...

options aacraid cache=7 msi=2 expose_physicals=-1 acbsize=4096

Note that most of the information above was the result of testing and tuning performed here by Craig Prescott. We now have close to a PB of such storage in production here at the UF HPC Center. We used Areca cards at first but found them to be a bit too flaky for our needs. The Adaptecs seem to have some infant mortality issues - we RMA about 10% to 12% of newly purchased cards - but if they make it past initial burn-in testing, they tend to be pretty reliable. Regards, Charlie Taylor UF HPC Center

On Jul 5, 2011, at 12:33 PM, Daire Byrne wrote:

> Hi, I have been testing some LSI 9260 RAID cards for use with Lustre v1.8.6 but have found that the megaraid_sas driver is not really able to facilitate the 1MB full stripe IOs that Lustre likes. This topic has also come up recently in the following two email threads:
>
> http://groups.google.com/group/lustre-discuss-list/browse_thread/thread/65a1fdc312b0eccb#
> http://groups.google.com/group/lustre-discuss-list/browse_thread/thread/fcf39d85b7e945ab
>
> I was able to up the max_hw_sectors_kb to 1024 by setting the max_sectors megaraid_sas module option but found that the IOs were still being pretty fragmented:
>
> disk I/O size     ios  % cum % |    ios  % cum %
> 4K:              3060  0     0 |   2611  0     0
> 8K:              3261  0     0 |   2664  0     0
> 16K:             6408  0     1 |   5296  0     1
> 32K:            13025  1     2 |  10692  1     2
> 64K:            48397  4     6 |  26417  2     4
> 128K:           50166  4    10 |  42218  4     9
> 256K:          113124  9    20 |  86516  8    17
> 512K:          677242 57    78 | 448231 45    63
> 1M:            254195 21   100 | 355804 36   100
>
> So next I looked at the sg_tablesize and found it was being set to 80 by the driver (which queries the firmware). I tried to hack the driver and increase this value, but bad things happened, so it looks like it is a genuine hardware limit with these cards. The overall throughput isn't exactly terrible because the RAID write-back cache does a reasonable job, but I suspect it could be better, e.g.
> ost 3 sz 201326592K rsz 1024K obj 192 thr 192 write 1100.52 [ 231.75, 529.96] read  940.26 [ 275.70, 357.60]
> ost 3 sz 201326592K rsz 1024K obj 192 thr 384 write 1112.19 [ 184.80, 546.43] read 1169.20 [ 337.63, 462.52]
> ost 3 sz 201326592K rsz 1024K obj 192 thr 768 write 1217.79 [ 219.77, 665.32] read 1532.47 [ 403.58, 552.43]
> ost 3 sz 201326592K rsz 1024K obj 384 thr 384 write  920.87 [ 171.82, 466.77] read  901.03 [ 257.73, 372.87]
> ost 3 sz 201326592K rsz 1024K obj 384 thr 768 write 1058.11 [ 166.83, 681.25] read 1309.63 [ 346.64, 484.51]
>
> All of this brings me to my main question - what internal cards have people here used which work well with Lustre? 3ware, Areca or other models of LSI? Cheers, Daire
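(Pulling the tuning recipe from the reply at the top of this message into one place - a hedged sketch; sdX is a placeholder for a RAID LD:)

# Client side: match the RPC size to the 512 KB full stripe (128 pages x 4 KB = 512 KB)
lctl set_param osc.*.max_pages_per_rpc=128
# Server side: confirm the hardware ceiling and set the effective request size
cat /sys/block/sdX/queue/max_hw_sectors_kb       # 512 with aacraid acbsize=4096
echo 512 > /sys/block/sdX/queue/max_sectors_kb
# Align LVM data to the full-stripe boundary (1024 sectors x 512 B = 512 KB)
pvcreate --dataalignment 1024S /dev/sdX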
[Lustre-discuss] quota_chk_acq_common()
We enabled quotas on two new file systems and are now seeing lots of the following in our logs...

Lustre: 31473:0:(quota_interface.c:460:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0): 2 Time(s)

Looking at the code, it is clearly going through the loop at least 10 times; however, rc is always zero when the message is printed, so the acquire() call is apparently succeeding on the 10th try. However, if I'm reading the code correctly, for that to happen the thread has already waited at least 45s (cumulatively), which is a long time to us. It seems like such a long wait would cause other complaints, but we aren't seeing anything obvious. Is this normal? Are others seeing the same messages? Is there some tuning we should be doing? Note that one of the file systems is 1.8.5 while the other is 2.0. Thx, Charlie Taylor UF HPC Center
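(A hedged sketch of what one might inspect first - the user name and mount point are placeholders, and the quota_bunit_sz proc names are from the 1.8-era interface, so verify against your version. Larger unit sizes mean each acquire grabs more space from the quota master and fewer round-trips are needed:)

lfs quota -u someuser /lustre                # current usage vs. limits
lctl get_param mds.*.quota_bunit_sz          # on the MDS (the quota master)
lctl get_param obdfilter.*.quota_bunit_sz    # on each OSS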
[Lustre-discuss] Lustre HA Experiences
We are dipping our toes into the waters of Lustre HA using pacemaker. We have 16 7.2 TB OSTs across 4 OSSs (4 OSTs each). The four OSSs are broken out into two dual-active pairs running Lustre 1.8.5. Mostly, the water is fine, but we've encountered a few surprises.

1. An 8-client iozone write test in which we write 64 files of 1.7 TB each seems to go well - until the end, at which point iozone seems to finish successfully and begins its cleanup. That is to say, it starts to remove all 64 large files. At this point, the ll_ost threads go bananas - consuming all available CPU cycles on all 8 cores of each server. This seems to block the corosync totem exchange long enough to initiate a stonith request.

2. We have found that re-mounting the OSTs, either via the HA agent or manually, often can take a *very* long time - on the order of four or five minutes. We have not figured out why yet. An strace of the mount process has not yielded much. The mount seems to just be waiting for something, but we can't tell what.

We are starting to adjust our HA parameters to compensate for these observations, but we hate to do this in a vacuum and wonder if others have also observed these behaviors and what, if anything, was done to compensate/correct? Regards, Charlie Taylor UF HPC Center
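(One knob relevant to item 1 is the corosync totem timeout - a sketch of the sort of adjustment described, with illustrative values that are not from the original post:)

# /etc/corosync/corosync.conf
totem {
    version: 2
    # allow 20s of silence before declaring token loss (the default is on
    # the order of 1s, easily exceeded when ll_ost threads saturate the CPUs)
    token: 20000
    token_retransmits_before_loss_const: 10
}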
[Lustre-discuss] MDS can't recover OSTs
We had a RAID array barf this morning, resulting in some OST corruption which appeared to be successfully repaired with a combination of fsck and ll_recover_lost_found_objs. The OSTs mounted OK, but the MDS can't seem to recover its connection to two of the OSTs, as we are seeing a continuing stream of the following in the MDS syslog:

Apr 28 11:37:54 crnmds kernel: Lustre: 31983:0:(recover.c:67:ptlrpc_initiate_recovery()) crn-OST0013_UUID: starting recovery
Apr 28 11:37:54 crnmds kernel: Lustre: 31983:0:(import.c:608:ptlrpc_connect_import()) 810117426000 crn-OST0013_UUID: changing import state from DISCONN to CONNECTING
Apr 28 11:37:54 crnmds kernel: Lustre: 31983:0:(import.c:470:import_select_connection()) crn-OST0013-osc: connect to NID 10.13.24.92@o2ib last attempt 22689204132
Apr 28 11:37:54 crnmds kernel: Lustre: 31983:0:(import.c:544:import_select_connection()) crn-OST0013-osc: import 810117426000 using connection 10.13.24.92@o2ib/10.13.24.92@o2ib
Apr 28 11:37:54 crnmds kernel: Lustre: 31982:0:(import.c:1091:ptlrpc_connect_interpret()) 810117426000 crn-OST0013_UUID: changing import state from CONNECTING to DISCONN
Apr 28 11:37:54 crnmds kernel: Lustre: 31982:0:(import.c:1137:ptlrpc_connect_interpret()) recovery of crn-OST0013_UUID on 10.13.24.92@o2ib failed (-16)
Apr 28 11:37:54 crnmds kernel: Lustre: 31982:0:(import.c:1091:ptlrpc_connect_interpret()) 81012e50d000 crn-OST0007_UUID: changing import state from CONNECTING to DISCONN
Apr 28 11:37:54 crnmds kernel: Lustre: 31982:0:(import.c:1137:ptlrpc_connect_interpret()) recovery of crn-OST0007_UUID on 10.13.24.91@o2ib failed (-16)

It seems that we never see an 'oscc recovery finished' message on crnmds for OST0007 or OST0013. We have not seen this problem before, so we are trying to figure out how to get the MDT reconnected to these two OSTs. Anyone else been through this before? Thanks, Charlie Taylor UF HPC Center
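(-16 is -EBUSY: the OSTs are refusing the MDS connection, typically because they are still inside their own client-recovery window. A hedged sketch of things one might try - the device numbers come from lctl dl and are placeholders:)

# On the OSS: check whether the OST is stuck waiting for clients that will never return
cat /proc/fs/lustre/obdfilter/crn-OST0013/recovery_status
lctl --device N abort_recovery      # N = the obdfilter device number from lctl dl
# On the MDS: nudge the OSC import for that OST
lctl --device M deactivate          # M = the crn-OST0013-osc device number
lctl --device M activate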
Re: [Lustre-discuss] how to reuse OST indices (EADDRINUSE)
On Dec 21, 2010, at 12:39 PM, Andreas Dilger wrote:

> It's unfortunate that you didn't see the thread from a few weeks ago that discussed this exact topic of OST replacement.

Agreed. :(

> It should get a section in the manual I think.

Agreed.

> This file is at /O/0/LAST_ID (capital 'o' then zero) and should be copied for OSTs you haven't replaced yet, along with the other files. It can be recreated with a binary editor from the value on the MDS (lctl get_param osc.*.prealloc_next_id) for the 6 OSTs that have already been replaced. Search the list or bugzilla for LAST_ID for a detailed procedure.

This seems to do the trick. Thank you! One important clarification, though... on the MDS, should we be getting the value of prealloc_next_id or prealloc_last_id? Section 23.3.9 of the 2.0 Ops manual, "How to fix a Bad LAST_ID on an OST," seems to use prealloc_last_id. Which should we be using? Thank you again, Charlie Taylor UF HPC Center
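(A hedged sketch of the "binary editor" step Andreas describes - the OST index, ldiskfs mount point, and the value 123456 are all placeholders; take the real value from the MDS and double-check against the list/bugzilla procedure he mentions:)

# On the MDS: read the next preallocated object id for the replaced OST
lctl get_param osc.*OST0006*.prealloc_next_id
# On the OSS, with the OST mounted as ldiskfs at /mnt/ost, write it as an
# unsigned 64-bit little-endian value:
python -c 'import struct,sys; sys.stdout.write(struct.pack("<Q", 123456))' > /mnt/ost/O/0/LAST_ID
# Verify:
od -t u8 /mnt/ost/O/0/LAST_ID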
Re: [Lustre-discuss] controlling which eth interface lustre uses
On Oct 21, 2010, at 9:51 AM, Brock Palen wrote:

> On Oct 21, 2010, at 9:48 AM, Joe Landman wrote:
>> On 10/21/2010 09:37 AM, Brock Palen wrote:
>>> We recently added a new oss, it has 1 1Gb interface and 1 10Gb interface. The 10Gb interface is eth4 10.164.0.166. The 1Gb interface is eth0 10.164.0.10.
>> They look like they are on the same subnet if you are using /24 ...
> You are correct. Both interfaces are on the same subnet:
>
> [r...@oss4-gb ~]# route
> Kernel IP routing table
> Destination   Gateway      Genmask         Flags Metric Ref Use Iface
> 10.164.0.0    *            255.255.248.0   U     0      0   0   eth0
> 10.164.0.0    *            255.255.248.0   U     0      0   0   eth4
> 169.254.0.0   *            255.255.0.0     U     0      0   0   eth4
> default       10.164.0.1   0.0.0.0         UG    0      0   0   eth0
>
> There is no way to mask the lustre service away from the 1Gb interface?

We struggle with this as well but have not found a way to enforce it. You would think that Lustre would honor the NID for incoming *and* outgoing traffic, but apparently the standard Linux routing table determines the outbound path and LNET is out of the picture. Thus, you end up having to assign separate subnets, shut down your eth0 (in this case) interface, or use static routes to fine-tune the routing decisions (where possible). We wish that the outgoing decision could be made on the basis of the *NID*, but that might be too intrusive with regard to the Linux kernel's network stack, so I can understand, somewhat, why it is not that way. Still, it is somewhat counter-intuitive to go through all the trouble of having the LNET layer and assigning NIDs only to have them disregarded for outbound traffic. Perhaps there is a way around this that we don't know about. Regards, Charlie Taylor UF HPC Center
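(Binding the NID to the fast interface is straightforward; the catch, as described above, is that the kernel routing table must agree, so eth4 really wants its own subnet. A sketch, interface name from the post:)

# /etc/modprobe.conf on the OSS - advertise a NID only on the 10 GbE interface
options lnet networks="tcp0(eth4)"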
[Lustre-discuss] Lustre 1.8.4 Patched Kernel Build
After installing the kernel source...

rpm -Uvh kernel-2.6.18-194.3.1.0.1.el5.src.rpm
rpmbuild -bp kernel-2.6.spec

installing the lustre source...

lustre-source-1.8.4-2.6.18_194.3.1.0.1.el5_lustre.1.8.4.x86_64.rpm

and patching the kernel...

ln -s /usr/src/lustre-1.8.4/ldiskfs/kernel_patches/patches .
ln -s /usr/src/lustre-1.8.4/ldiskfs/kernel_patches/series/ldiskfs-2.6-rhel5.series series
quilt push -av

we attempt to build the kernel and get:

CC fs/compat_ioctl.o
In file included from include/linux/ext3_jbd.h:20, from fs/compat_ioctl.c:50:
include/linux/ext3_fs.h: In function 'ext3_new_blocks':
include/linux/ext3_fs.h:1057: error: 'EXT2_MOUNT_MBALLOC' undeclared (first use in this function)
include/linux/ext3_fs.h:1057: error: (Each undeclared identifier is reported only once
include/linux/ext3_fs.h:1057: error: for each function it appears in.)

This seems easy enough to fix, but doing so just results in more of the same (and worse) down the road. Shouldn't this just work? Is there a problem with the source RPMs? Charlie Taylor UF HPC Center
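(One plausible culprit, hedged - from our reading of the 1.8 build layout, the series under ldiskfs/kernel_patches is meant for patching the ext3 sources during the ldiskfs build, not the kernel tree itself. The kernel patch series lives under lustre/kernel_patches, e.g.:)

ln -s /usr/src/lustre-1.8.4/lustre/kernel_patches/patches .
ln -s /usr/src/lustre-1.8.4/lustre/kernel_patches/series/2.6-rhel5.series series
quilt push -av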
Re: [Lustre-discuss] two problems
On Jun 3, 2010, at 6:17 PM, Andreas Dilger wrote:

> On 2010-06-03, at 06:23, Stefano Elmopi wrote:
>> surely my action was to test environment, in a production environment, I would have placed all the files before deleting the server OST1.
> The main problem here is that you have completely erased all knowledge of the failed OST while there are still files in the filesystem using it (i.e. using lctl --writeconf). If the OST had simply failed and been marked inactive (which is what is normally done in such situations) it would still be possible to delete the files. The problem being seen on the MDT now is simply one that cannot happen in any normal failure scenario.

I'm sure I'm speaking out of turn, but our recent experience contradicts this. We lost an OST and marked it as inactive and *could not* remove the files until we actually replaced the OST with another (using the same index). Once we did that and reactivated the OST, we could delete the files, which didn't really exist other than on the MDT. It was kind of annoying. Our intent was not to replace the OST, but it became such a hassle for us and our users (recursive file operations would often encounter the missing files and error out) that we did so just to be able to remove the files that had been on the failed OST. Regards, Charlie Taylor UF HPC Center
Re: [Lustre-discuss] Lost Files - How to remove from MDT
On Apr 18, 2010, at 11:46 AM, Bernd Schubert wrote:

> You don't need to take the filesystem offline for lfsck.

You sure about that? Looking at http://wiki.lustre.org/manual/LustreManual18_HTML/LustreRecovery.html#50598012_37365, step 1 says "Stop the Lustre File System."

> Also, I have rewritten large parts of lfsck and also fixed the parallelization code. I need to review all patches again and probably also make a hg or git repository out of it. Unfortunately, I always have more tasks to do than I manage to do... But given the fact that I fixed several bugs and added safety checks, I think my version actually is better than upstream. Let me know if you are interested and I can put a tar ball of e2fsprogs-sun-ddn on my home page.

Sure, we can try it, but it seems to me that by the time you generate the OST data and run lfsck against the MDT, much could change on a file system being used by 600+ active clients. Regards, Charlie
Re: [Lustre-discuss] Lost Files - How to remove from MDT
On Apr 18, 2010, at 1:14 PM, Andreas Dilger wrote:

> On 2010-04-18, at 07:16, Charles Taylor wrote:
>> On Apr 18, 2010, at 9:35 AM, Miguel Afonso Oliveira wrote:
>>> You are going to have to use unlink with something like this: for file in lost_files unlink $file
>> Nope. That's really no different than rm and produces the same result...
>> unlink /scratch/crn/bwang/NCS/1O5P/1o5p_wat.prmtop
>> unlink: cannot unlink `/scratch/crn/bwang/NCS/1O5P/1o5p_wat.prmtop': Invalid argument
> This surprises me, since that is the answer I was going to give also. Did you also verify that after this message is posted, the file isn't actually unlinked? I suspect that the file name was unlinked, but an error is returned from destroying the OST object, which is fine since the OST is dead and gone anyway.

Nope. They are still there following the "Invalid argument" error. It seems that before we deactivated the OST we could remove the files, but got an error message; once the OST was deactivated, we get the error message and the file (err, its metadata) remains.

> What error messages are posted on the console log (dmesg/syslog)?

Lots of the following, but there is a find running as well, so I don't think it is necessarily from the rm command:

Lustre: 4286:0:(lov_pack.c:67:lov_dump_lmm_v1()) stripe_size 1048576, stripe_count 1
Lustre: 4286:0:(lov_pack.c:76:lov_dump_lmm_v1()) stripe 0 idx 17 subobj 0x0/0x3dbe6b
Lustre: 4286:0:(lov_pack.c:64:lov_dump_lmm_v1()) objid 0x38f59c8, magic 0x0bd10bd0, pattern 0x1
Lustre: 4286:0:(lov_pack.c:67:lov_dump_lmm_v1()) stripe_size 1048576, stripe_count 1
Lustre: 4286:0:(lov_pack.c:76:lov_dump_lmm_v1()) stripe 0 idx 17 subobj 0x0/0x3dc0fa

Charlie
[Lustre-discuss] Lost Files - How to remove from MDT
We lost an OST several months ago and could not recover it. We decided to deactivate it until we bring some new storage online and can just rebuild the entire file system. However, now, the MDT still knows about all the files that were on the lost OST, and this results in things like "invalid argument" and "?? ?" entries in directory listings. The files cannot be removed by standard commands. We end up doing something like:

mv Dir Tmp
cp -r Tmp Dir        (this produces lots of 'cp: cannot stat ...' for the missing files)
mv Tmp /lost+found   (this moves all the missing file names more or less out of the way)

Is there some way to remove these files from the MDT - as though they never existed - without reformatting the entire file system? Thanks, Charlie Taylor UF HPC Center
Re: [Lustre-discuss] Lost Files - How to remove from MDT
On Apr 18, 2010, at 9:38 AM, Brian J. Murrell wrote:

> On Sun, 2010-04-18 at 09:30 -0400, Charles Taylor wrote:
>> Is there some way to remove these files from the MDT - as though they never existed - without reformatting the entire file system?
> lfsck is the documented, supported method.

Yes, but we attempted that at one time with a smaller file system (for a different reason). After letting it run for over a day, we estimated that it would have taken seven to ten days to finish. That just wasn't practical for us at the time and still isn't. This file system would probably take a couple of weeks to lfsck. I'm sorry to say we can't take the file system offline for that long. We may just have to leave it as is until we put some new storage in place and can migrate the good data off it. I just thought I'd ask. Thanks for the reply though, Charlie Taylor UF HPC Center
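(For reference, the documented lfsck sequence being weighed here looks roughly like the following - a sketch after the 1.8 manual's procedure; device names and database paths are placeholders:)

# Build the MDS database from the MDT device:
e2fsck -n -v --mdsdb /tmp/mdsdb /dev/mdtdev
# Build one database per OST, referencing the MDS database:
e2fsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb1 /dev/ostdev1
# Run lfsck from a client with the filesystem mounted:
lfsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb1 /tmp/ostdb2 /lustre/mount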
Re: [Lustre-discuss] Lost Files - How to remove from MDT
On Apr 18, 2010, at 9:35 AM, Miguel Afonso Oliveira wrote:

> Hi, You are going to have to use unlink with something like this: for file in lost_files unlink $file

Nope. That's really no different than rm and produces the same result...

unlink /scratch/crn/bwang/NCS/1O5P/1o5p_wat.prmtop
unlink: cannot unlink `/scratch/crn/bwang/NCS/1O5P/1o5p_wat.prmtop': Invalid argument

Thanks for the suggestion though, Charlie Taylor UF HPC Center

> Cheers, Miguel Afonso Oliveira
> P.S.: To build a list of all your lost files you can do a rsync with the dry-run flag.
Re: [Lustre-discuss] Lost Files - How to remove from MDT
On Apr 18, 2010, at 10:47 AM, Miguel Afonso Oliveira wrote:

> Hi again, Sorry, I forgot to mention this only works if the offending OST still exists. If at this time you can no longer re-include the OST where these files existed, then you can still create a new one with the same index and then you can unlink. MAO

Ok, thanks. We may go ahead and try that. Charlie
Re: [Lustre-discuss] Lost Files - How to remove from MDT
While I'm thinking about it, that brings up an interesting question. All the OSTs for this file system were originally formatted under 1.6.3. We have since upgraded to 1.8.x. If we reformat the missing OST with the same index under 1.8.2 and add it back into the file system (sans its data), should we expect trouble? We were reluctant to do so since we doubt that this is a tested scenario, but perhaps we are being overly paranoid. Should it be OK to mix OSTs formatted under different versions (1.6 vs 1.8) of Lustre? Seems like it should be OK, but you can't test everything and this seems like a bit of an outlier. Regards, Charlie Taylor UF HPC Center

On Apr 18, 2010, at 10:47 AM, Miguel Afonso Oliveira wrote:

> Hi again, Sorry I forgot to mention this only works if the offending OST still exists. If at this time you can no longer re-include the OST where these files existed, then you can still create a new one with the same index and then you can unlink. MAO
Re: [Lustre-discuss] bad write checksums
On Jul 24, 2009, at 10:33 AM, Craig Prescott wrote:

> Hi; We've been testing some 1.8.0.1 patchless clients (RHEL5.3, x86_64, RPMs from the Sun download page) with out 1.6.4.2 servers.

Just to clarify the typo... That should have been "with our 1.6.4.2 servers." We are running 1.8.0.1 patchless clients with 1.6.4.2 on the MGS/MDS and OSSs and getting the messages Craig refers to below. ct

> The OSS nodes started logging these LustreErrors from the 1.8.0.1 clients:
>
> LustreError: 7302:0:(ost_handler.c:1157:ost_brw_write()) client csum 8448447f, original server csum 66fb7cff, server csum now 66fb7cff
> LustreError: 7302:0:(ost_handler.c:1157:ost_brw_write()) Skipped 1 previous similar message
> LustreError: 7391:0:(ost_handler.c:1095:ost_brw_write()) client csum 9d8c7d6a, server csum 2cfdcb47
> LustreError: 168-f: ufhpc-OST0004: BAD WRITE CHECKSUM: changed in transit before arrival at OST from 12345-10.13.28...@tcp inum 38470778/1485322248 object 67094039/0 extent [0-1023]
>
> Is this a known issue with running 1.8.0.1 clients against 1.6.4.2 servers? We aren't seeing these messages in relation to our 1.6 clients. Looking through the Lustre bugzilla, I see bug 18296, which discusses these messages, but it was logged against Lustre version 1.6.6. Cheers, Craig
Re: [Lustre-discuss] 1.6.4.2 -> 1.8.0.1 Upgrade Question
Thanks Brian, but I'm still a little unsure of the tunefs step in the upgrade. Is it actually necessary as part of the MDS upgrade? Is it safe, and do we do the same when upgrading the OSSs, i.e. do we have to run tunefs.lustre on each OST or just the MDT? Thanks, Charlie Taylor UF HPC Center

On Jul 23, 2009, at 9:47 AM, Brian J. Murrell wrote:

> Charles, I have opened bug 20246 to have that section of the manual reviewed. Thanx for pointing that out. b.

We are about to upgrade our standalone MGS/MDS (no failover) from 1.6.4.2 to 1.8.0.1. I'm a little confused by section 13.2.4 of the Lustre 1.8 Operations Manual. What is the purpose of the

mdt1# tunefs.lustre --mgs --mdt --fsname=testfs /dev/sda1

command? I assume it is writing this information to the MDT (/dev/sda1), but wasn't that information already put there when the file system was created under 1.6.4.2? Has the format changed between the two versions? Why is the tunefs.lustre step necessary? I'm concerned about overwriting *anything* on the MDT and rendering our file system unusable. I just want to be sure we understand what we are doing. The paragraph labeled "Description" in section 32.2 (describing tunefs.lustre) did not exactly give me a warm fuzzy. BTW, we already have a number of 1.8.0.1 clients running against the 1.6.4.2 servers. Working great so far! Thanks, Charlie Taylor UF HPC Center
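(Relevant to the worry about overwriting anything: tunefs.lustre has a dry-run mode that prints the current and pending parameters without writing to the device - device name from the manual's example:)

tunefs.lustre --dryrun /dev/sda1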
Re: [Lustre-discuss] Question on upgrading Lustre 1.6.6 -> 1.8.0
On May 17, 2009, at 7:10 AM, Daire Byrne wrote:

> I think the v1.8 manual is still referring to the upgrade of Lustre v1.4 -> v1.6. If you are upgrading from v1.6 to v1.8 then you should only need to install the newer packages and reboot. You may need to do a tune2fs if you want to enable newer features, but I'm not 100% sure of that.

Wow, I hope that's not the case. I know documentation is a pain, but *no* documentation is better than wrong, misleading, or old documentation. We are getting ready to upgrade and were planning to go by this procedure. If it is not for 1.6.x to 1.8.x, then that would be good to know. Note that for clients, we can just reload modules and reboot. That works very nicely. Haven't done anything on the server side yet. We are proceeding with caution and may wait for 1.8.1. Charlie Taylor UF HPC Center

> Daire

thhsieh thhs...@piano.rcas.sinica.edu.tw wrote:

Dear All, I have read the description of the Lustre Operation Guide for version 1.8, but I am still not very sure about the exact procedures to upgrade from version 1.6.6 to version 1.8.0. Now I try to write up a plan of upgrading. Please give me your kind comments on my procedures. :)

In our system, we have three Lustre filesystems (all version 1.6.6, for all the MGS, MDT, OST, and clients), which are configured as follows:

1. fsname=chome
   MGS: qa1:/dev/sda5
   MDT: qa1:/dev/sda5 (i.e., exactly the same disk partition as the MGS)
   OST: qaX:/dev/sdaX (distributed in several OST nodes)

2. fsname=cwork
   MGS: qa1:/dev/sda5 (shared with that of chome)
   MDT: qa1:/dev/sda6
   OST: qaY:/dev/sdaY (distributed in several OST nodes)

3. fsname=cwork1
   MGS: qa1:/dev/sda5 (shared with that of chome)
   MDT: qa1:/dev/sda7
   OST: qaZ:/dev/sdaZ (distributed in several OST nodes)

We do not have failover configurations in any of the filesystems. I am planning to shut down all the Lustre filesystems, then perform the upgrading, and finally start them up. I guess that would be simpler. The exact procedures I am going to follow are:

1. For each of the Lustre filesystems, perform the following shutdown procedure (chome should be the last one to shut down, since it shares the MDT and MGS in the same partition):
   - umount all clients
   - umount all OSTs
   - umount the MDT

2. Install the new Lustre 1.8 software and modules and reboot all the nodes. Then I will upgrade chome first, then cwork, and finally cwork1.

3. Upgrade the MGS and MDT for chome:
   qa1# tunefs.lustre --mgs --mdt --fsname=chome /dev/sda5

4. Upgrade the OSTs for chome:
   qaX# tunefs.lustre --ost --fsname=chome --mgsnode=qa1 /dev/sdaX

   Up to this point the chome part should be ready, I guess.

5. Now the MDT for cwork. The manual says that we should copy the MDT and client startup logs from the MDT to the MGS, so I guess that I should:
   - Mount the MGS as ldiskfs: qa1# mount -t ldiskfs /dev/sda5 /mnt
   - Run the script lustre_up14 on the MDT of the cwork partition: qa1# lustre_up14 /dev/sda6 cwork
     Then I will get the following files: /tmp/logs/cwork-client and /tmp/logs/cwork-MDT
   - Copy these log files to /mnt/CONFIGS/
   - Umount the MGS: qa1# umount /mnt
   - Upgrade the MDT: qa1# tunefs.lustre --mdt --nomgs --fsname=cwork --mgsnode=qa1 /dev/sda6

6. Now the OSTs for cwork:
   qaY# tunefs.lustre --ost --fsname=cwork --mgsnode=qa1 /dev/sdaY

   Up to now the filesystem cwork should be ready.

7. For the MDT and OSTs of cwork1, we can follow the same procedures as steps 5 and 6.

8. Start up the new Lustre filesystems.

   For chome:
   qa1# mount -t lustre /dev/sda5 /cfs/chome_mdt
   qaX# mount -t lustre /dev/sdaX /cfs/chome_ostX
   mount the clients

   For cwork:
   qa1# mount -t lustre /dev/sda6 /cfs/cwork_mdt
   qaY# mount -t lustre /dev/sdaY /cfs/cwork_ostY
   mount the clients

   For cwork1:
   qa1# mount -t lustre /dev/sda7 /cfs/cwork1_mdt
   qaZ# mount -t lustre /dev/sdaZ /cfs/cwork1_ostZ
   mount the clients

Please kindly give me your comments. Thanks very much. Best Regards, T.H.Hsieh
[Lustre-discuss] Download
Hmmm. I tried to download Lustre 1.8.0 two days ago (shortly after the announcement). I tried to use my existing Sun account info that I established when we downloaded 1.6.x. The download site accepted my user name and password but denied me access, saying that my account required manual review. A day later, yesterday, they sent me an email saying they needed my full legal name and that the initial "A" was not sufficient, or some other nonsense. So it is now two days later and I still can't download what is *supposed* to be freely available open-source software.

For the record, I don't mind registering for the download. I *do* mind the security screening. I'm not opening a bank account here. I just want to download some bits with an account I used in the past. Fortunately, I could just fake another account and get the software. My question is simply: why do you want to hassle your faithful this way? We are not long-time Lustre users (about 1.5 years now), but we have touted the benefits of Lustre to anyone and everyone who would listen and have contributed to the adoption of Lustre at several other sites. I'll just add that the multi-tiered download site is also an unnecessary hassle and puts off your users. Stop the insanity, Charlie Taylor UF HPC Center
Re: [Lustre-discuss] (no subject)
On May 11, 2009, at 8:07 PM, Andreas Dilger wrote:

> On May 11, 2009 14:38 -0700, Hayes, Robert N wrote:
>> We will test the mem=12G suggestion. Before attempting the 1.8.0 client, can you confirm that a 1.8 client should work with a 1.6 server without causing any more complications?
> Yes, the 1.8.x clients are interoperable with 1.6.x servers. If you are worried about testing this out during live system time, then you can wait for an outage window to test the 1.8 client in isolation. There is nothing to do on the server, and just RPM upgrade/downgrades on the client.

And it's a beautiful thing. :) Charlie Taylor UF HPC Center
[Lustre-discuss] LUG 2009
Wow, sorry to waste the BW here, but I'm confused. Are you really having the meeting at the Cavallo Point Lodge but the Advanced User Seminar (on the 15th) at The Lodge at Sonoma Renaissance Resort and Spa? Is that a misprint, or are the meeting and the Seminar at two different places? Thanks, Charlie Taylor UF HPC Center
Re: [Lustre-discuss] Recovery without end
I'm going to pipe in here. We too use a very large (1000) timeout value. We have two separate Lustre file systems. One of them consists of two rather beefy OSSs with 12 OSTs each (FalconIII FC-SATA RAID). The other consists of 8 OSSs with 3 OSTs each (Xyratex 4900FC). We have about 500 clients and support both tcp and o2ib NIDs. We run Lustre 1.6.4.2 on a patched 2.6.18-8.1.14 CentOS/RH kernel. It has worked *very* well for us for over a year now - very few problems, with very good performance under very heavy loads.

We've tried setting our timeout to lower values but settled on the 1000 value (despite the long recovery periods) because if we don't, our Lustre connectivity starts to break down and our mounts come and go with errors like "transport endpoint failure" or "transport endpoint not connected" or some such (it's been a while now). File system access comes and goes randomly on nodes. We tried many tunings and looked for other sources of problems (underlying network issues). Ultimately, the only thing we found that fixed this was to extend the timeout value. I know you will be tempted to tell us that our network must be flaky, but it simply is not.

We'd love to understand why we need such a large timeout value and why, if we don't use a large value, we see these transport endpoint failures. However, after spending several days trying to understand and resolve the issue, we finally just accepted the long timeout as a suitable workaround. I wonder if there are others who have silently done the same. We'll be upgrading to 1.6.6 or 1.6.7 in the not-too-distant future. Maybe then we'll be able to do away with the long timeout value, but until then we need it. :( Just my two cents, Charlie Taylor UF HPC Center

On Feb 25, 2009, at 11:03 AM, Brian J. Murrell wrote:

> On Wed, 2009-02-25 at 16:09 +0100, Thomas Roth wrote:
>> Our /proc/sys/lustre/timeout is 1000
> That's way too high. Long recoveries are exactly the reason you don't want this number to be huge.
>> - there has been some debate on this large value here, but most other installations will not run in a network environment with a setup as crazy as ours.
> What's so crazy about your setup? Unless your network is very flaky and/or you have not tuned your OSSes properly, there should be no need for such a high timeout, and if there is, you need to address the problems requiring it.
>> Putting the timeout to 100 immediately results in "Transport endpoint" errors; impossible to run Lustre like this.
> 300 is the max that we recommend, and we have very large production clusters that use such values successfully.
>> Since this is a 1.6.5.1 system, I activated the adaptive timeouts - and put them to equally large values:
>>
>> /sys/module/ptlrpc/parameters/at_max = 6000
>> /sys/module/ptlrpc/parameters/at_history = 6000
>> /sys/module/ptlrpc/parameters/at_early_margin = 50
>> /sys/module/ptlrpc/parameters/at_extra = 30
> This is likely not good as well. I will let somebody more knowledgeable about AT comment in detail though. It's a new feature and not getting wide use at all yet, so the real-world experience is still low. b.
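(For reference, the knob under discussion - the first form is the live setting on a node, the second persists it from the MGS in 1.6; the fsname is a placeholder:)

echo 1000 > /proc/sys/lustre/timeout       # on servers and clients
lctl conf_param testfs.sys.timeout=1000    # set permanently via the MGS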
Re: [Lustre-discuss] Another server question.
On Feb 4, 2009, at 10:39 AM, Robert Minvielle wrote:

> I still can not seem to get this OST to come online. The clients are still exhibiting the same behaviour as before. Is there any way to get the OST to go into active by force? I ran an ext3 check on it using the Sun-modded e2fsprogs and it returns:
>
> e2fsck 1.40.11.sun1 (17-June-2008)
> datafs-OST0001: recovering journal
> datafs-OST0001: clean, 472/25608192 files, 1862944/102410358 blocks
>
> Yet, I still get:
>
> cd /proc/fs/lustre; find . -name *recov* -exec cat {} \;
> status: INACTIVE
>
> On the MGS, it seems to show as active...
>
> [r...@l1storage1 ~]# cat /proc/fs/lustre/lov/datafs-mdtlov/target_obd
> 0: datafs-OST_UUID ACTIVE
> 1: datafs-OST0001_UUID ACTIVE
> 4: datafs-OST0004_UUID ACTIVE
> 5: datafs-OST0005_UUID ACTIVE
> 6: datafs-OST0006_UUID ACTIVE

We've seen OSTs come up as INACTIVE before. We are not sure why it happens. Sometimes it will transition into RECOVERY if you remount it (umount, mount). Sometimes you may find that the OST is mounted read-only and you can force it back to read-write with mount (as in mount -o rw,remount device). Sometimes, if you wait, it will transition to ACTIVE on its own (perhaps passing through RECOVERY first, I don't know). We've intentionally and unintentionally experienced all three. I think Brian and/or Andreas have already mentioned the remount route. Don't worry though. Lustre really does work. This sounds like normal tooth-cutting. You'll be OK. :) Charlie Taylor UF HPC Center
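(A sketch of the checks described above - sdX is a placeholder for the OST block device, and the OST name is taken from the post:)

grep lustre /proc/mounts        # is the OST flagged ro (read-only)?
mount -o rw,remount /dev/sdX    # if so, force it back to read-write
cat /proc/fs/lustre/obdfilter/datafs-OST0001/recovery_status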
Re: [Lustre-discuss] Another server question.
On Feb 4, 2009, at 4:33 AM, Andreas Dilger wrote:

> On Feb 03, 2009 12:21 -0500, Charles Taylor wrote:
>> In our experience, despite what has been said and what we have read, if we lose or take down a single OSS, our clients lose access (I/O seems blocked) to the file system until that OSS is back up and has completed recovery. That's just our experience and it has been very consistent. We've never seen otherwise, though we would like to. :)
> To be clear - a client process will wait indefinitely until an OST is back alive, unless either the process is killed (this should be possible after the Lustre recovery timeout is exceeded, 100s by default), or the OST is explicitly marked inactive on the clients:
>
> lctl --device {failed OSC device on client} deactivate
>
> After the OSC is marked inactive, then all IO to that OST should immediately return with -EIO, and not hang.

Thanks Andreas, I think that clears things up and will help us understand what to expect going forward.

> If you have experiences other than this it is a bug. If this isn't explained in the documentation it is a documentation bug.

If that is spelled out clearly in the documentation, I missed it (certainly possible). I hope I indicated that this business has never been a show-stopper for us. Typically, if we lose an OSS or OST, our top priority is getting it back in service. As you indicate, most clients wait and resume when recovery is complete, and this is usually fine with us. In fact, it's awesome, and users understand it since it is akin to what they were used to w/ NFS - back in the day. We love you, man! :) Charlie Taylor UF HPC Center
Re: [Lustre-discuss] Another server question.
On Feb 3, 2009, at 11:42 AM, Brian J. Murrell wrote:

>> I down one of the servers (normal shutdown, not the MGD of course). OK, so the clients seem to be frozen in regards to the lustre.
> Only if they want to access objects (files, or file stripes) on that server that you shut down, yes.

In our experience, despite what has been said and what we have read, if we lose or take down a single OSS, our clients lose access (I/O seems blocked) to the file system until that OSS is back up and has completed recovery. That's just our experience and it has been very consistent. We've never seen otherwise, though we would like to. :)

> Many here have noted that it should be ok, with the exception of files that were stored on the downed server,

Again, not in our experience. We are currently running 1.6.4.2 and have never seen this work. Losing a single OSS renders the file system pretty much unusable until the OSS has recovered. We could be doing something wrong, I suppose, but I'm not sure what.

>> but that does not seem to be the case here. That is not my main concern however, the real question is, I bring the server back up; check its ID by issuing lctl dl; I check the MGS by a cat /proc/fs/lustre/devices and see the ID in there as UP. OK, so it all seems well again, but the client is still (somewhat) stuck.

You have to wait for recovery to complete. You can check the recovery status on the OSSs and MGS/MDS by:

cd /proc/fs/lustre; find . -name *recov* -exec cat {} \;

Once all the OSSs/MGS show recovery COMPLETE, clients will be able to access the file system again. We've been running three separate Lustre file systems for over a year now and are *very* happy with it. There are a few things that we still don't understand, and this is one of them. We wish that when an OSS went down, we only lost access to files/objects on *that* OSS but, again, that has not been our experience. Still, we've kissed a lot of distributed/parallel file system frogs. We'll take Lustre, hands down. Charlie Taylor UF HPC Center
Re: [Lustre-discuss] Another server question.
On Feb 3, 2009, at 12:28 PM, Brian J. Murrell wrote:

> On Tue, 2009-02-03 at 12:21 -0500, Charles Taylor wrote:
>> In our experience, despite what has been said and what we have read, if we lose or take down a single OSS, our clients lose access (I/O seems blocked) to the file system until that OSS is back up and has completed recovery.
> That is likely the real-world result of taking down an OSS, indeed. But that is more likely simply due to the random distribution of files/stripes around your filesystem, and that it won't take long for all active clients to eventually want something from that missing OSS.

That could certainly be the case.

>> Again, not in our experience.
> Have you actually tested your theory in a controlled environment where you could be sure that clients that got hung up have never tried to access an OST on the missing OSS?

No, we've never set out to prove that it works or doesn't. We are not complaining though - just saying that for us, the practical ramification of an OSS going down is that the file system will be unusable until the OSS is back in service and recovery is complete.

> If so, and you are still finding that clients that don't touch the downed OSS are getting hung up, please, by all means, file a bug.

Will do. We'll be upgrading to 1.6.6 pretty soon, and perhaps we'll do some more extensive testing then. Regards, Charlie
[Lustre-discuss] lvbo_init failed
We are getting lots of these (always for the same resource) on one of our OSSs:

LustreError: 22308:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 5820180: rc -2: 1 Time(s)
LustreError: 5:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 5820180: rc -2: 1 Time(s)
LustreError: 22277:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 5820180: rc -2: 2 Time(s)
LustreError: 22274:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 5820180: rc -2: 3 Time(s)
LustreError: 22204:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 5820180: rc -2: 1 Time(s)
LustreError: 22193:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 5820180: rc -2: 2 Time(s)
LustreError: 22253:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 5820180: rc -2: 1 Time(s)
LustreError: 22200:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 5820180: rc -2: 2 Time(s)
LustreError: 22264:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 5820180: rc -2: 1 Time(s)

We've tried to track down the object with lfs find, but no joy so far. I'm not even sure that is the right approach. We found a bug pertaining to this in the Lustre bugzilla, but it looks like it was resolved, so I'm not sure that's the issue either. Anyone else run into this before? Is there something we can do to stop it? We are running 1.6.4.2 on CentOS 4.5 with an updated kernel on the OSSs (Linux hpcio7.ufhpc 2.6.18-8.1.14.el5.L-1642 #1 SMP Mon Feb 18 13:24:27 EST 2008 x86_64 x86_64 x86_64 GNU/Linux). This file system has been in production for about six months - first time we've seen this. Charlie Taylor UF HPC Center
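(rc -2 is -ENOENT: the lock's backing object is missing on that OST. A hedged sketch of the lfs find approach mentioned above - the OST UUID and mount point are placeholders; 5820180 is the resource/object id from the messages:)

# From a client: list files with a stripe on the affected OST and grep
# their stripe objids (shown in decimal by lfs getstripe) for the resource id
lfs find --obd ufhpc-OST0007_UUID /lustre | while read f; do
    lfs getstripe "$f" | grep -qw 5820180 && echo "$f"
done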
Re: [Lustre-discuss] Lustre 1.6.5 install problem
Lustre doesn't know where your IB module symbols are. When you configured Lustre (in the build sense) you pointed it to a patched kernel tree. In that directory is a Module.symvers file devoid of IB module symbols. You should also have a Module.symvers in your /usr/src/ofa_kernel directory (assuming you built OFED as well). So...

cat /usr/src/ofa_kernel/Module.symvers >> <patched_kernel_dir>/Module.symvers

and run make install again and it should be happy. For a 2.6.9 kernel, you probably need OFED 1.2.

Charlie Taylor UF HPC Center

On Jun 18, 2008, at 5:55 AM, Johnlya wrote:

Install steps are:

rpm -Uvh --nodeps e2fsprogs-devel-1.40.7.sun3-0redhat.x86_64.rpm
rpm -Uvh e2fsprogs-1.40.7.sun3-0redhat.x86_64.rpm
cd ../PyXML/
tar -zxvf PyXML-0.8.4.tar.gz
cd PyXML-0.8.4
python setup.py build
python setup.py install
cd ../../Expect
rpm -ivh expect-5.42.1-1.src.rpm
cd ../1.6.5/
rpm -ivh kernel-lustre-smp-2.6.9-67.0.7.EL_lustre.1.6.5.x86_64.rpm
rpm -ivh lustre-ldiskfs-3.0.4-2.6.9_67.0.7.EL_lustre.1.6.5smp.x86_64.rpm
rpm -ivh lustre-modules-1.6.5-2.6.9_67.0.7.EL_lustre.1.6.5smp.x86_64.rpm

When installing lustre-modules, it displays warnings:

WARNING: /lib/modules/2.6.9-67.0.7.EL_lustre.1.6.5smp/kernel/net/lustre/ko2iblnd.ko needs unknown symbol ib_create_cq
WARNING: /lib/modules/2.6.9-67.0.7.EL_lustre.1.6.5smp/kernel/net/lustre/ko2iblnd.ko needs unknown symbol rdma_resolve_addr
WARNING: /lib/modules/2.6.9-67.0.7.EL_lustre.1.6.5smp/kernel/net/lustre/ko2iblnd.ko needs unknown symbol ib_dereg_mr
WARNING: /lib/modules/2.6.9-67.0.7.EL_lustre.1.6.5smp/kernel/net/lustre/ko2iblnd.ko needs unknown symbol rdma_reject
WARNING: /lib/modules/2.6.9-67.0.7.EL_lustre.1.6.5smp/kernel/net/lustre/ko2iblnd.ko needs unknown symbol rdma_disconnect
WARNING: /lib/modules/2.6.9-67.0.7.EL_lustre.1.6.5smp/kernel/net/lustre/ko2iblnd.ko needs unknown symbol rdma_resolve_route
WARNING: /lib/modules/2.6.9-67.0.7.EL_lustre.1.6.5smp/kernel/net/lustre/ko2iblnd.ko needs unknown symbol rdma_bind_addr
WARNING: /lib/modules/2.6.9-67.0.7.EL_lustre.1.6.5smp/kernel/net/lustre/ko2iblnd.ko needs unknown symbol rdma_create_qp
WARNING: /lib/modules/2.6.9-67.0.7.EL_lustre.1.6.5smp/kernel/net/lustre/ko2iblnd.ko needs unknown symbol ib_destroy_cq
WARNING: /lib/modules/2.6.9-67.0.7.EL_lustre.1.6.5smp/kernel/net/lustre/ko2iblnd.ko needs unknown symbol rdma_create_id
WARNING: /lib/modules/2.6.9-67.0.7.EL_lustre.1.6.5smp/kernel/net/lustre/ko2iblnd.ko needs unknown symbol rdma_listen
WARNING: /lib/modules/2.6.9-67.0.7.EL_lustre.1.6.5smp/kernel/net/lustre/ko2iblnd.ko needs unknown symbol rdma_destroy_qp
WARNING: /lib/modules/2.6.9-67.0.7.EL_lustre.1.6.5smp/kernel/net/lustre/ko2iblnd.ko needs unknown symbol ib_get_dma_mr
WARNING: /lib/modules/2.6.9-67.0.7.EL_lustre.1.6.5smp/kernel/net/lustre/ko2iblnd.ko needs unknown symbol ib_alloc_pd
WARNING: /lib/modules/2.6.9-67.0.7.EL_lustre.1.6.5smp/kernel/net/lustre/ko2iblnd.ko needs unknown symbol rdma_connect
WARNING: /lib/modules/2.6.9-67.0.7.EL_lustre.1.6.5smp/kernel/net/lustre/ko2iblnd.ko needs unknown symbol ib_modify_qp
WARNING: /lib/modules/2.6.9-67.0.7.EL_lustre.1.6.5smp/kernel/net/lustre/ko2iblnd.ko needs unknown symbol rdma_destroy_id
WARNING: /lib/modules/2.6.9-67.0.7.EL_lustre.1.6.5smp/kernel/net/lustre/ko2iblnd.ko needs unknown symbol rdma_accept
WARNING: /lib/modules/2.6.9-67.0.7.EL_lustre.1.6.5smp/kernel/net/lustre/ko2iblnd.ko needs unknown symbol ib_dealloc_pd

Please tell me why?
Thank you ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
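A quick way to confirm the Module.symvers fix took, with the kernel tree path as a placeholder: each symbol named in the warnings should now appear in the patched tree's Module.symvers, and depmod should run clean.

    # The IB symbols should resolve after the concatenation (path is an assumption):
    grep -c 'ib_create_cq\|rdma_connect' /path/to/patched-kernel/Module.symvers
    depmod -a 2.6.9-67.0.7.EL_lustre.1.6.5smp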
Re: [Lustre-discuss] MDS Problems
On Jun 13, 2008, at 5:46 PM, Andreas Dilger wrote:

On Jun 13, 2008 16:03 -0400, Charles Taylor wrote:
We have been running the config below on three different Lustre file systems since early January and, for the most part, things have been pretty stable. We are now experiencing frequent hangs on some clients - particularly our interactive login nodes. All processes get blocked behind Lustre I/O requests. When this happens there are *no* messages in either dmesg or syslog on the clients. They seem unaware of a problem.

This is likely due to client statahead problems. Please disable this with

echo 0 > /proc/fs/lustre/llite/*/statahead_max

on the clients. This should also be fixed in 1.6.5.

This seems to have done the trick. Odd, though, that we've been running this way for several months and it didn't seem to be an issue until now. We saw the discussions of this go by at one point and we should have just taken care of it then, whether we were seeing it or not. Thanks for reminding us of it.

1. A ton of lustre-log.M.N files get dumped into /tmp in a short period of time. Most of them appear to be full of garbage and unprintable characters rather than thread stack traces. Many of them are also zero length.

The lustre-log files are not stack traces. They are dumped Lustre debug logs.

Got it.

We have been adjusting lru_size on the clients but so far it has made no difference. We have "options mds mds_num_threads=512" and our system timeout is 1000 (sure, go ahead and flame me, but if we don't do that we get tons of endpoint transport failures on the clients and no, there are no connectivity issues). :) We are open to suggestion and wondering if we should update the MDSs to 1.6.5. Can we do that safely without also upgrading the clients and OSTs?

In general the MDS and OSS nodes should run the same level of software, as that is what we test, but there isn't a hard requirement for it.

Would it be reasonable, then, to upgrade the MDSs and OSSs but leave the clients at 1.6.4.2, or is that asking for trouble? I think this comes up a lot and I'm pretty sure people have said they do it successfully. I'm just wondering if it is a *design* goal that is architected in or just something that happens to work most of the time.

Thanks again,

Charlie Taylor UF HPC Center ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
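One practical note on Andreas's command: if more than one Lustre file system is mounted, the glob matches several entries and a bare shell redirect will balk, so a loop is the safer way to apply it everywhere (a sketch; re-run it from rc.local or similar if you want it to survive remounts):

    for f in /proc/fs/lustre/llite/*/statahead_max; do
        echo 0 > $f
    done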
[Lustre-discuss] MDS Problems
We have been running the config below on three different Lustre file systems since early January and, for the most part, things have been pretty stable. We are now experiencing frequent hangs on some clients - particularly our interactive login nodes. All processes get blocked behind Lustre I/O requests. When this happens there are *no* messages in either dmesg or syslog on the clients. They seem unaware of a problem. However, on the MDS we see the following...

1. A ton of lustre-log.M.N files get dumped into /tmp in a short period of time. Most of them appear to be full of garbage and unprintable characters rather than thread stack traces. Many of them are also zero length.
2. Lots of dmesg output similar to that appended (see below).
3. Pretty much the same in syslog.
4. The frequency/period of these events seems to be consistent with the timeouts associated with the following messages...

Lustre: 4534:0:(watchdog.c:312:lcw_update_time()) Expired watchdog for pid 4534 disabled after 499.9514s

We have been adjusting lru_size on the clients but so far it has made no difference. We have "options mds mds_num_threads=512" and our system timeout is 1000 (sure, go ahead and flame me, but if we don't do that we get tons of endpoint transport failures on the clients and no, there are no connectivity issues). :) We are open to suggestion and wondering if we should update the MDSs to 1.6.5. Can we do that safely without also upgrading the clients and OSTs?

Our config is as below.

Thanks,

Charlie Taylor UF HPC Center

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =

Config:
Lustre 1.6.4.2
CentOS 5.0
Kernel 2.6.18-8.1.14
OFED 1.2
320 o2ib clients
80 tcp clients

dmesg output:

LustreError: dumping log to /tmp/lustre-log.1213385571.4240
ll_mdt_20 S 81023d2c9700 0 4070 1 4071 4069 (L-TLB)
81023d2c9700 81023d2c9630 81023d2c9630 000a 81023fa09100 8101438ca7a0 0002bb03681325fb 8f21 81023fa092e8 0001 8866eb51
Call Trace:
[8866eb51] :ptlrpc:ldlm_run_cp_ast_work+0x161/0x1f0
[88686da0] :ptlrpc:ldlm_expired_completion_wait+0x0/0x250
[800611f7] schedule_timeout+0x8a/0xad
[80092c5e] process_timeout+0x0/0x5
[886881fd] :ptlrpc:ldlm_completion_ast+0x35d/0x6a0
[886705e9] :ptlrpc:ldlm_lock_enqueue+0x559/0x5c0
[80086a74] default_wake_function+0x0/0xe
[8866ce6a] :ptlrpc:ldlm_lock_addref_internal_nolock+0x3a/0x90
[88684bb0] :ptlrpc:ldlm_blocking_ast+0x0/0x2d0
[88685e24] :ptlrpc:ldlm_cli_enqueue_local+0x454/0x510
[888d5e87] :mds:mds_fid2locked_dentry+0x1d7/0x2a0
[88687ea0] :ptlrpc:ldlm_completion_ast+0x0/0x6a0
[888d6647] :mds:mds_getattr_lock+0x6f7/0xc70
[885b11c4] :ksocklnd:ksocknal_alloc_tx+0x1c4/0x270
[888d7191] :mds:mds_intent_policy+0x5d1/0xbe0
[8854dca7] :lnet:lnet_prep_send+0x67/0xb0
[88673786] :ptlrpc:ldlm_resource_putref+0x1b6/0x3b0
[88670183] :ptlrpc:ldlm_lock_enqueue+0xf3/0x5c0
[8866dbbd] :ptlrpc:ldlm_lock_create+0x98d/0x9c0
[88690660] :ptlrpc:ldlm_server_completion_ast+0x0/0x570
[8868cda0] :ptlrpc:ldlm_handle_enqueue+0xd90/0x1410
[88690bd0] :ptlrpc:ldlm_server_blocking_ast+0x0/0x690
[888e0cad] :mds:mds_handle+0x46dd/0x58ff
[8860bcb2] :obdclass:class_handle2object+0xd2/0x160
[886a7280] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90
[886a4e35] :ptlrpc:lustre_swab_buf+0xc5/0xf0
[886aca8b] :ptlrpc:ptlrpc_server_handle_request+0xb0b/0x1270
[800608e8] thread_return+0x0/0xea
[8006b165] do_gettimeofday+0x50/0x92
[8851b056] :libcfs:lcw_update_time+0x16/0x100
[8003cee3] lock_timer_base+0x1b/0x3c
[886af4cc] :ptlrpc:ptlrpc_main+0x7dc/0x950
[80086a74] default_wake_function+0x0/0xe
[8005be25] child_rip+0xa/0x11
[886aecf0] :ptlrpc:ptlrpc_main+0x0/0x950
[8005be1b] child_rip+0x0/0x11
LustreError: dumping log to /tmp/lustre-log.1213385600.4070
Lustre: 4338:0:(ldlm_lib.c:519:target_handle_reconnect()) hpcdata-MDT: a9f365a3-8746-6ed5-e45e-cd292891ece2 reconnecting
Lustre: 4338:0:(ldlm_lib.c:519:target_handle_reconnect()) Skipped 6 previous similar messages
Lustre: 4338:0:(ldlm_lib.c:747:target_handle_connect()) hpcdata-MDT: refuse reconnection from [EMAIL PROTECTED]@o2ib to 0x8101cdbd6000; still busy with 2 active RPCs
Lustre: 4338:0:(ldlm_lib.c:747:target_handle_connect()) Skipped 6 previous similar messages
ll_mdt_501 S 81023ca51700 0 4551 1 4552 4550 (L-TLB) 81023ca51700
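For reference, on 1.6.x the client-side lru_size knob mentioned above lives under /proc; a sketch of capping it on a client (600 is only an example value borrowed from later advice on this list, not a recommendation for this site):

    # Cap the lock LRU in every namespace (OSC and MDC alike):
    for ns in /proc/fs/lustre/ldlm/namespaces/*; do
        echo 600 > $ns/lru_size
    done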
Re: [Lustre-discuss] MDS crash and the Dilger Procedure
Paging Dr. Dilger, paging Dr. Dilger. Dr. Dilger, you are needed in the emergency room. :)

On Jun 5, 2008, at 5:28 PM, Jakob Goldbach wrote:

Hi, I just had to go through the Dilger procedure after the MDS crashed when mounting the MDT. The system is running fine now - glad that I just learned about this procedure. Trace attached.

Stay tuned for more episodes of... "As the MDT Mounts."

ct ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
[Lustre-discuss] Lustre Mount Crashing
We lost our MDS/MGS to a power failure yesterday evening. Just to be safe, we ran e2fsck on the combined MDT/MGT and there were only a couple of minor complaints about HTREE issues that it fixed. The MDT/MGT now fsck's cleanly. The problem is that, despite the clean e2fsck, the MGS is crashing in the Lustre mount code when attempting to mount the MDT.

It is a scratch file system, so it is not backed up. Still, it is a pain to lose the data. I'm assuming this is not normal, and there is not much in the manual about doing anything more than e2fsck, but I want to ask if anyone else has seen something like this before and might have some additional suggestions before I trash the data and reformat the file system.

Thanks,

Charlie Taylor UF HPC Center ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
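Before reformatting, a couple of read-only checks are cheap; a sketch with the device path as a placeholder (neither command writes to the target):

    tunefs.lustre --print /dev/mdtdev   # dump the Lustre config as the mount code will see it
    e2fsck -fn /dev/mdtdev              # -f: full check, -n: report-only, no repairs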
Re: [Lustre-discuss] Lustre Mount Crashing
On Jun 2, 2008, at 11:49 AM, Dennis Nelson wrote:

Unless you are getting some kind of kernel panic, that stack trace should be in the syslog.

No, it is going down hard in a kernel panic. All of the stack trace I can see at the moment looks like the following (scribbled by hand, so forgive me for leaving off the addresses and offsets):

:libcfs:cfs_alloc
:obdclass:lustre_init_lsi
:obdclass:lustre_fill_super
:obdclass:lustre_fill_super
set_anon_super
set_anon_super
:obdclass:lustre_fill_super
get_sb_nodev
vfs_kern_mount
do_kern_mount
do_mount
__handle_mm_fault
__up_read
do_page_fault
zone_statistics
__alloc_pages
sys_mount
system_call
RIP ... resched_task

I wish I could get the whole trace to you. We might try to get kdump on there, but my luck with kdump has been mixed. It seems to work with some chipsets and not with others. Anyway, we may just be out of luck. I just hate to give up too easily, because it seems like everything is solid yet we crash on or just after the mount. This is on an MDS that has been running without a problem for 5 months (Lustre 1.6.4.2).

uname -a
Linux hpcmds 2.6.18-8.1.14.el5.L-1642 #2 SMP Thu Feb 21 15:42:14 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

I don't know if that trace is a lot of help to you since it is not complete (which is why I didn't post it initially), but maybe there is something there of use.

Regards,

Charlie Taylor UF HPC Center ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
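If kdump does get tried on that RHEL5-era MDS, the usual recipe is roughly the following (the reservation size is a typical value, not something tested on this hardware):

    # Append to the kernel line in /etc/grub.conf, then reboot:
    #   crashkernel=128M@16M
    chkconfig kdump on
    service kdump start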
Re: [Lustre-discuss] Cannot send after transport endpoint shutdown (-108)
Sure, we will provide you with more details of our installation, but let me first say that, if recollection serves, we did not pull that number out of a hat. I believe there is a formula in one of the Lustre tuning manuals for calculating the recommended timeout value. I'll have to take a moment to go back and find it. Anyway, if you use that formula for our cluster, the recommended timeout value, I think, comes out to be *much* larger than 1000. Later this morning, we will go back and find that formula and share with the list how we came up with our timeout. Perhaps you can show us where we are going wrong.

One more comment: we just brought up our second large Lustre file system. It is 80+ TB served by 24 OSTs on two (pretty beefy) OSSs. We just achieved over 2 GB/sec of sustained (large-block, sequential) I/O from an aggregate of 20 clients. Our design target was 1.0 GB/sec/OSS and we hit that pretty comfortably.

That said, when we first mounted the new (1.6.4.2) file system across all 400 nodes in our cluster, we immediately started getting transport endpoint failures and evictions. We looked rather intensively for network/fabric problems (we have both o2ib and tcp nids) and could find none. All of our MPI apps are/were running just fine. The only way we could get rid of the evictions and transport endpoint failures was by increasing the timeout. Also, we knew to do this based on our experience with our first Lustre file system (1.6.3 + patches), where we had to do the same thing.

Like I said, a little bit later Craig or I will post more details about our implementation. If we are doing something wrong with regard to this timeout business, I would love to know what it is.

Thanks,

Charlie Taylor UF HPC Center

On Mar 4, 2008, at 4:04 PM, Brian J. Murrell wrote:

On Tue, 2008-03-04 at 15:55 -0500, Aaron S. Knister wrote:
I think I tried that before and it didn't help, but I will try it again. Thanks for the suggestion.

Just so you guys know, 1000 seconds for the obd_timeout is very, very large! As you could probably guess, we have some very, very big Lustre installations, and to the best of my knowledge none of them are using anywhere near that. AFAIK (and perhaps a Sun engineer with closer experience with some of these very large clusters might correct me) the largest value that the largest clusters are using is in the neighbourhood of 300s. There has to be some other problem at play here that you need 1000s. Can you both please report your Lustre and kernel versions? I know you said latest, Aaron, but some version numbers might be more solid to go on.

b. ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Cannot send after transport endpoint shutdown (-108)
Well, go figure. We are running...

Lustre: 1.6.4.2 on clients and servers
Kernel: 2.6.18-8.1.14.el5 Lustre (clients and servers)
Platform: x86_64 (Opteron 275s, mostly)
Interconnect: IB, Ethernet
IB Stack: OFED 1.2

We already posted our procedure for patching the kernel, building OFED, and building Lustre, so I don't think I'll go into that again. Like I said, we just brought a new file system online. Everything looked fine at first with just a few clients mounted. Once we mounted all 408 (or so), we started getting all kinds of transport endpoint failures and the MGSs and OSTs were evicting clients left and right. We looked for network problems and could not find any of any substance. Once we increased the obd/lustre system timeout setting as previously discussed, the errors vanished. This was consistent with our experience with 1.6.3 as well. That file system has been online since early December. Both file systems appear to be working well.

I'm not sure what to make of it. Perhaps we are just masking another problem. Perhaps there are some other, related values that need to be tuned. We've done the best we could, but I'm sure there is still much about Lustre we don't know. We'll try to get someone out to the next class, but until then we're on our own, so to speak.

Charlie Taylor UF HPC Center

Just so you guys know, 1000 seconds for the obd_timeout is very, very large! As you could probably guess, we have some very, very big Lustre installations, and to the best of my knowledge none of them are using anywhere near that. AFAIK (and perhaps a Sun engineer with closer experience with some of these very large clusters might correct me) the largest value that the largest clusters are using is in the neighbourhood of 300s. There has to be some other problem at play here that you need 1000s.

I can confirm that at a recent large installation with several thousand clients, the default of 100 is in effect.

___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Cannot send after transport endpoint shutdown (-108)
We've seen this before as well. Our experience is that the obd_timeout is far too small for large clusters (ours is 400+ nodes) and the only way we avoid these errors is by setting it to 1000, which seems high to us but appears to work and puts an end to the transport endpoint shutdowns.

On the MDS:

lctl conf_param srn.sys.timeout=1000

You may have to do this on the OSSs as well unless you restart the OSSs, but I could be wrong on that. You should check it everywhere with...

cat /proc/sys/lustre/timeout

On Mar 4, 2008, at 3:31 PM, Aaron S. Knister wrote:

This morning I've had both my infiniband and tcp Lustre clients hiccup. They are evicted from the server, presumably as a result of their high load and consequent timeouts. My question is: why don't the clients re-connect? The infiniband and tcp clients both give the following message when I type df: "Cannot send after transport endpoint shutdown (-108)". I've been battling with this on and off now for a few months. I've upgraded my infiniband switch firmware, and all the clients and servers are running the latest version of Lustre and the Lustre-patched kernel. Any ideas?

-Aaron ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
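If pdsh is available, verifying that the timeout actually propagated to every node is a one-liner (the -a "all hosts" selection is an assumption about your pdsh setup):

    pdsh -a 'cat /proc/sys/lustre/timeout' | dshbak -c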
Re: [Lustre-discuss] Lustre Bug 13917
Sure enough, we updated enough clients with the 1.6.4.2 bits to accommodate our 512-way job and it now fires right up. That sure read like a server-side fix to us, but thanks for setting us straight. Thanks for clarifying the upgrade questions as well.

Regards,

Charlie Taylor UF HPC Center

On Feb 20, 2008, at 9:27 AM, Brian J. Murrell wrote:

On Sun, 2008-02-17 at 08:04 -0500, Charles Taylor wrote:
We are running lustre 1.6.3 with some patches we applied by hand with a patched 2.6.18-8.1.14 kernel on both the clients and servers. We think we are now hitting lustre bug 13197

Do you mean 13917?

and can no longer operate without a fix. We could apply the 13197 patch to 1.6.3 and keep going as we are, but we would like to start moving to 1.6.4.2.

Good idea.

Would it be insane to update to 1.6.4.2 on the MDS and OSSs while continuing to run 1.6.3 on the clients

If you want to fix 13917, yes. 13917 is a client-side fix.

or is 1.6.4.2 similar enough to interoperate with 1.6.3 clients?

Our interoperability commitment provides that 1.6.3 and 1.6.4.2 will inter-operate; however, upgrading the OSS and MDS only will not fix 13917.

b. ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] ldlm_enqueue operation failures
:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 207 previous similar messages
Feb 19 08:03:03 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 08:03:03 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) Skipped 203 previous similar messages
Feb 19 08:07:30 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) [EMAIL PROTECTED] x6548994/t0 o101->?@?:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 08:07:30 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 205 previous similar messages
Feb 19 08:13:05 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 08:13:05 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) Skipped 207 previous similar messages
Feb 19 08:17:33 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) [EMAIL PROTECTED] x680167/t0 o101->?@?:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 08:17:33 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 209 previous similar messages
Feb 19 08:23:07 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 08:23:07 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) Skipped 205 previous similar messages

On Feb 19, 2008, at 12:15 AM, Oleg Drokin wrote:

Hello!

On Feb 18, 2008, at 5:13 PM, Charles Taylor wrote:
Feb 18 15:32:47 r5b-s42 kernel: LustreError: 11-0: an error occurred while communicating with [EMAIL PROTECTED] The mds_close operation failed with -116
Feb 18 15:32:47 r5b-s42 kernel: LustreError: Skipped 3 previous similar messages
Feb 18 15:32:47 r5b-s42 kernel: LustreError: 7828:0:(file.c:97:ll_close_inode_openhandle()) inode 17243099 mdc close failed: rc = -116
Feb 18 15:32:47 r5b-s42 kernel: LustreError: 7828:0:(file.c:97:ll_close_inode_openhandle()) Skipped 1 previous similar messages

These mean the client was evicted (and later successfully reconnected) after opening the file successfully. We need all the failure/eviction info since the job started to make any meaningful progress, because as of now I have no idea why the clients were evicted.

Bye, Oleg ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] ldlm_enqueue operation failures
One more thing worth mentioning: we have no more "callback" or "watchdog timer expired" messages, so 1.6.4.2 seems to have fixed that. So it just seems like if 512 threads try to open the same file at roughly the same time, we are running out of some resource on the MDS or OSSs that keeps Lustre from satisfying the request.

Charlie

On Feb 19, 2008, at 8:45 AM, Charles Taylor wrote:

Yes, I understand. Right now we are just trying to isolate our problems so that we don't provide information that is not related to the issue.

Just to recap: we were running pretty well with our patched 1.6.3 implementation. However, we could not start a 512-way job in which each thread tries to open a single copy of the same file. Inevitably, one or more threads would get a "can not open file" error and call mpi_abort(), even though the file is there and many other threads open it successfully. We thought we were hitting lustre bug 13197, which is supposed to be fixed in 1.6.4.2, so we upgraded our MGS/MDS and OSSs to 1.6.4.2. We have *not* upgraded the clients (400+ of them) and were hoping to avoid that for the moment.

The upgrade seemed to go well and the file system is accessible on all the clients. However, our 512-way application still cannot run. We tried modifying the app so that each thread opens its own copy of the input file (i.e. file.in.rank) and duplicated the input file 512 times. This allowed the job to start, but it eventually failed anyway with another "can not open file" error.

ERROR (proc. 00410) - cannot open file: ./skews_ms2p0.mixt.cva_00411_5.3E-04

This seems to clearly indicate a problem with Lustre and/or our implementation.

On a perhaps separate note (perhaps not), since the upgrade yesterday we are seeing the messages below every ten minutes. Perhaps we need to shut down and impose some sanity on all this, but in reality this is the only job that is having trouble (out of hundreds, sometimes thousands) and the file system seems to be operating just fine otherwise.

Any insight is appreciated at this point. We've put a lot of effort into Lustre and would like to stick with it, but right now it looks like it can't scale to a 512-way job.
Thanks for the help,

Charlie

Feb 19 07:07:09 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 202 previous similar messages
Feb 19 07:12:41 hpcmds kernel: LustreError: 6056:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 07:12:41 hpcmds kernel: LustreError: 6056:0:(mgs_handler.c:515:mgs_handle()) Skipped 201 previous similar messages
Feb 19 07:17:12 hpcmds kernel: LustreError: 7162:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) [EMAIL PROTECTED] x36818597/t0 o101->?@?:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 07:17:12 hpcmds kernel: LustreError: 7162:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 207 previous similar messages
Feb 19 07:22:42 hpcmds kernel: LustreError: 6056:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 07:22:42 hpcmds kernel: LustreError: 6056:0:(mgs_handler.c:515:mgs_handle()) Skipped 209 previous similar messages
Feb 19 07:27:16 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) [EMAIL PROTECTED] x679809/t0 o101->?@?:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 07:27:16 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 207 previous similar messages
Feb 19 07:32:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 07:32:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) Skipped 205 previous similar messages
Feb 19 07:37:16 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) [EMAIL PROTECTED] x140057135/t0 o101->?@?:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 07:37:16 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 201 previous similar messages
Feb 19 07:42:52 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 07:42:52 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) Skipped 205 previous similar messages
Feb 19 07:47:17 hpcmds kernel: LustreError: 7162:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) [EMAIL PROTECTED] x5243687/t0 o101->?@?:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 07:47:17 hpcmds kernel: LustreError: 7162:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 207 previous similar messages
Feb 19 07:52:59 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 07
[Lustre-discuss] ldlm_enqueue operation failures
FWIW, we got our MGS/MDS and OSSs upgraded to 1.6.4.2 and they seem to be fine. The clients are still running 1.6.3.

Unfortunately, the upgrade did not resolve our issue. One of our users has an MPI app where every thread opens the same input file (actually several in succession). Although we have run this job successfully before on up to 512 procs, it is not working now. Lustre seems to be locking up when all the threads go after the same file (to open) and we see things such as...

Feb 18 15:42:11 r3b-s16 kernel: LustreError: 11-0: an error occurred while communicating with [EMAIL PROTECTED] The ldlm_enqueue operation failed with -107
Feb 18 15:42:11 r3b-s16 kernel: LustreError: Skipped 21 previous similar messages
Feb 18 15:52:51 r3b-s16 kernel: LustreError: 11-0: an error occurred while communicating with [EMAIL PROTECTED] The ldlm_enqueue operation failed with -107
Feb 18 15:52:51 r3b-s16 kernel: LustreError: Skipped 19 previous similar messages

[EMAIL PROTECTED] is our MDS. We have 512 ll_mdt threads (the max). The actual error in the code on some of the threads will be that the file was not found (even though it was clearly there), and this only happens after about an 8-minute timeout. Note that we have the file system mounted with the -o flock option. Is this part of the problem, or are we hitting yet another bug?

Thanks,

Charlie Taylor UF HPC Center ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
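For context, the flock behavior is chosen at mount time on each client; a sketch with placeholder NID and fsname (-o flock gives cluster-coherent locks at some performance cost, -o localflock enforces locks only within a single client):

    mount -t lustre -o flock 10.1.1.1@o2ib:/ufhpc /ufhpc
    # or, client-local-only semantics:
    mount -t lustre -o localflock 10.1.1.1@o2ib:/ufhpc /ufhpc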
Re: [Lustre-discuss] ldlm_enqueue operation failures
Well, the log on the MDS at the time of the failure looks like...

Feb 18 15:25:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 18 15:25:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) Skipped 263 previous similar messages
Feb 18 15:29:25 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) [EMAIL PROTECTED] x1602651/t0 o101->?@?:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 18 15:29:25 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 427 previous similar messages
Feb 18 15:31:28 hpcmds kernel: LustreError: 7150:0:(mds_open.c:1474:mds_close()) @@@ no handle for file close ino 43116025: cookie 0x1938027bf9d67349 [EMAIL PROTECTED] x1789/t0 o35->beb7df79-6127-c0ca-9d36-2a96817a77a9@:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
Feb 18 15:31:28 hpcmds kernel: LustreError: 7150:0:(mds_open.c:1474:mds_close()) Skipped 161 previous similar messages
Feb 18 15:33:17 hpcmds kernel: LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) ### lock callback timer expired: evicting client 2bdea9d4-43c3-a0b0-2822-[EMAIL PROTECTED] nid [EMAIL PROTECTED] ns: mds-ufhpc-MDT_UUID lock: 810053d3f100/0x688cfbc7df2ef487 lrc: 1/0,0 mode: CR/CR res: 21878337/3424633214 bits 0x3 rrc: 582 type: IBT flags: 430 remote: 0x95c1d2685c2c76d9 expref: 21 pid 6090
Feb 18 15:33:17 hpcmds kernel: LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) Skipped 3 previous similar messages
Feb 18 15:33:17 hpcmds kernel: LustreError: 6265:0:(ldlm_lockd.c:962:ldlm_handle_enqueue()) ### lock on destroyed export 8101096ec000 ns: mds-ufhpc-MDT_UUID lock: 810225fe12c0/0x688cfbc7df2ef505 lrc: 2/0,0 mode: CR/CR res: 21878337/3424633214 bits 0x3 rrc: 579 type: IBT flags: 430 remote: 0x95c1d2685c2c76e0 expref: 6 pid 6265
Feb 18 15:33:17 hpcmds kernel: LustreError: 6265:0:(ldlm_lockd.c:962:ldlm_handle_enqueue()) Skipped 3 previous similar messages
Feb 18 15:33:17 hpcmds kernel: Lustre: 6061:0:(mds_reint.c:127:mds_finish_transno()) commit transaction for disconnected client 2bdea9d4-43c3-a0b0-2822-c49ecfe6e044: rc 0

We don't have any watchdog timeouts associated with the event, so I don't have any tracebacks from those. On one of the clients we have...

Feb 18 15:33:17 r1b-s23 kernel: LustreError: 11-0: an error occurred while communicating with [EMAIL PROTECTED] The ldlm_enqueue operation failed with -107
Feb 18 15:33:17 r1b-s23 kernel: LustreError: Skipped 2 previous similar messages
Feb 18 15:33:17 r1b-s23 kernel: Lustre: ufhpc-MDT-mdc-81012d370800: Connection to service ufhpc-MDT via nid [EMAIL PROTECTED] was lost; in progress operations using this service will wait for recovery to complete.
Feb 18 15:33:17 r1b-s23 kernel: Lustre: Skipped 2 previous similar messages
Feb 18 15:33:17 r1b-s23 kernel: LustreError: 167-0: This client was evicted by ufhpc-MDT; in progress operations using this service will fail.
Feb 18 15:33:17 r1b-s23 kernel: LustreError: Skipped 2 previous similar messages
Feb 18 15:33:17 r1b-s23 kernel: LustreError: 12004:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -5
Feb 18 15:33:17 r1b-s23 kernel: LustreError: 12004:0:(mdc_locks.c:423:mdc_finish_enqueue()) Skipped 3 previous similar messages
Feb 18 15:33:17 r1b-s23 kernel: Lustre: ufhpc-MDT-mdc-81012d370800: Connection restored to service ufhpc-MDT using nid [EMAIL PROTECTED]
Feb 18 15:33:17 r1b-s23 kernel: Lustre: Skipped 2 previous similar messages

ct

On Feb 18, 2008, at 4:42 PM, Oleg Drokin wrote:

Hello!

On Feb 18, 2008, at 4:29 PM, Charles Taylor wrote:
Unfortunately, the upgrade did not resolve our issue. One of our users has an mpi app where every thread opens the same input file (actually several in succession). Although we have run this job successfully before on up to 512 procs, it is not working now. Lustre seems to be locking up when all the threads go after the same file (to open) and we see things such as ...

Can you upload the full log, from the start of the problematic job to the end, somewhere? Also, when the first watchdog timeouts hit, it would be nice if you could do sysrq-t on the MDS too to get traces of all threads (you need a big dmesg buffer for them to fit, or use a serial console). Does the job use flocks/fcntl locks at all? If not, then don't worry about mounting with -o flock.

Bye, Oleg ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] ldlm_enqueue operation failures
Well, yes. But the evictions are the result of the job trying to start. Absent that, there are no evictions. A bunch of threads trying to open the same file should not cause the clients to be evicted. That's an odd way of dealing with concurrency. :)

Charlie

On Feb 18, 2008, at 4:57 PM, Oleg Drokin wrote:

Hello!

On Feb 18, 2008, at 4:55 PM, Charles Taylor wrote:
Feb 18 15:25:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 18 15:25:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) Skipped 263 previous similar messages
Feb 18 15:29:25 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) [EMAIL PROTECTED] x1602651/t0 o101->?@?:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 18 15:29:25 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 427 previous similar messages
Feb 18 15:31:28 hpcmds kernel: LustreError: 7150:0:(mds_open.c:1474:mds_close()) @@@ no handle for file close ino 43116025: cookie 0x1938027bf9d67349 [EMAIL PROTECTED] x1789/t0 o35->beb7df79-6127-c0ca-9d36-2a96817a77a9@:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
Feb 18 15:31:28 hpcmds kernel: LustreError: 7150:0:(mds_open.c:1474:mds_close()) Skipped 161 previous similar messages
Feb 18 15:33:17 hpcmds kernel: LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) ### lock callback timer expired: evicting client 2bdea9d4-43c3-a0b0-2822-[EMAIL PROTECTED] nid [EMAIL PROTECTED] ns: mds-ufhpc-MDT_UUID lock: 810053d3f100/0x688cfbc7df2ef487 lrc: 1/0,0 mode: CR/CR res: 21878337/3424633214 bits 0x3 rrc: 582 type: IBT flags: 430 remote: 0x95c1d2685c2c76d9 expref: 21 pid 6090
Feb 18 15:33:17 hpcmds kernel: LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) Skipped 3 previous similar messages
Feb 18 15:33:17 hpcmds kernel: LustreError: 6265:0:(ldlm_lockd.c:962:ldlm_handle_enqueue()) ### lock on destroyed export 8101096ec000 ns: mds-ufhpc-MDT_UUID lock: 810225fe12c0/0x688cfbc7df2ef505 lrc: 2/0,0 mode: CR/CR res: 21878337/3424633214 bits 0x3 rrc: 579 type: IBT flags: 430 remote: 0x95c1d2685c2c76e0 expref: 6 pid 6265
Feb 18 15:33:17 hpcmds kernel: LustreError: 6265:0:(ldlm_lockd.c:962:ldlm_handle_enqueue()) Skipped 3 previous similar messages
Feb 18 15:33:17 hpcmds kernel: Lustre: 6061:0:(mds_reint.c:127:mds_finish_transno()) commit transaction for disconnected client 2bdea9d4-43c3-a0b0-2822-c49ecfe6e044: rc 0

This looks like the middle of an eviction storm; by this point the MDS and MGS have already evicted tons of clients for unknown reasons (which should be in the log before those messages).

Bye, Oleg ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] ldlm_enqueue operation failures
We also see these on some of the clients...

Feb 18 15:32:47 r5b-s42 kernel: LustreError: 11-0: an error occurred while communicating with [EMAIL PROTECTED] The mds_close operation failed with -116
Feb 18 15:32:47 r5b-s42 kernel: LustreError: Skipped 3 previous similar messages
Feb 18 15:32:47 r5b-s42 kernel: LustreError: 7828:0:(file.c:97:ll_close_inode_openhandle()) inode 17243099 mdc close failed: rc = -116
Feb 18 15:32:47 r5b-s42 kernel: LustreError: 7828:0:(file.c:97:ll_close_inode_openhandle()) Skipped 1 previous similar message

I'm assuming some of the threads succeed in opening the file. When one fails, it calls mpi_abort(), at which point all those threads that successfully opened the file then try to close it. Apparently they can't close the file at that point either. I'm guessing, of course, but it seems plausible.

ct

On Feb 18, 2008, at 4:57 PM, Oleg Drokin wrote:

Hello!

On Feb 18, 2008, at 4:55 PM, Charles Taylor wrote:
Feb 18 15:25:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 18 15:25:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) Skipped 263 previous similar messages
Feb 18 15:29:25 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) [EMAIL PROTECTED] x1602651/t0 o101->?@?:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 18 15:29:25 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 427 previous similar messages
Feb 18 15:31:28 hpcmds kernel: LustreError: 7150:0:(mds_open.c:1474:mds_close()) @@@ no handle for file close ino 43116025: cookie 0x1938027bf9d67349 [EMAIL PROTECTED] x1789/t0 o35->beb7df79-6127-c0ca-9d36-2a96817a77a9@:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
Feb 18 15:31:28 hpcmds kernel: LustreError: 7150:0:(mds_open.c:1474:mds_close()) Skipped 161 previous similar messages
Feb 18 15:33:17 hpcmds kernel: LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) ### lock callback timer expired: evicting client 2bdea9d4-43c3-a0b0-2822-[EMAIL PROTECTED] nid [EMAIL PROTECTED] ns: mds-ufhpc-MDT_UUID lock: 810053d3f100/0x688cfbc7df2ef487 lrc: 1/0,0 mode: CR/CR res: 21878337/3424633214 bits 0x3 rrc: 582 type: IBT flags: 430 remote: 0x95c1d2685c2c76d9 expref: 21 pid 6090
Feb 18 15:33:17 hpcmds kernel: LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) Skipped 3 previous similar messages
Feb 18 15:33:17 hpcmds kernel: LustreError: 6265:0:(ldlm_lockd.c:962:ldlm_handle_enqueue()) ### lock on destroyed export 8101096ec000 ns: mds-ufhpc-MDT_UUID lock: 810225fe12c0/0x688cfbc7df2ef505 lrc: 2/0,0 mode: CR/CR res: 21878337/3424633214 bits 0x3 rrc: 579 type: IBT flags: 430 remote: 0x95c1d2685c2c76e0 expref: 6 pid 6265
Feb 18 15:33:17 hpcmds kernel: LustreError: 6265:0:(ldlm_lockd.c:962:ldlm_handle_enqueue()) Skipped 3 previous similar messages
Feb 18 15:33:17 hpcmds kernel: Lustre: 6061:0:(mds_reint.c:127:mds_finish_transno()) commit transaction for disconnected client 2bdea9d4-43c3-a0b0-2822-c49ecfe6e044: rc 0

This looks like the middle of an eviction storm; by this point the MDS and MGS have already evicted tons of clients for unknown reasons (which should be in the log before those messages).

Bye, Oleg ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
[Lustre-discuss] ext3-unlink-race.patch
When trying to build 1.6.4.2 from a clean source tree, I get the following after a successful configure.

cd linux-stage
quilt push -a -q
Applying patch patches/ext3-wantedi-2.6-rhel4.patch
Applying patch patches/iopen-2.6.18-rhel5.patch
Applying patch patches/ext3-map_inode_page-2.6.18.patch
Applying patch patches/export-ext3-2.6-rhel4.patch
Applying patch patches/ext3-include-fixes-2.6-rhel4.patch
Applying patch patches/ext3-extents-2.6.18-vanilla.patch
Applying patch patches/ext3-mballoc3-core.patch
Applying patch patches/ext3-mballoc3-2.6.18.patch
Applying patch patches/ext3-nlinks-2.6.9.patch
Applying patch patches/ext3-ialloc-2.6.patch
Applying patch patches/ext3-remove-cond_resched-calls-2.6.12.patch
Applying patch patches/ext3-filterdata-sles10.patch
Applying patch patches/ext3-uninit-2.6.18.patch
Applying patch patches/ext3-nanosecond-2.6.18-vanilla.patch
Applying patch patches/ext3-inode-version-2.6.18-vanilla.patch
Applying patch patches/ext3-mmp-2.6.18-vanilla.patch
Applying patch patches/ext3-unlink-race.patch
1 out of 1 hunk FAILED
Patch patches/ext3-unlink-race.patch does not apply (enforce with -f)

Looking at the patch and the source, I can see why the hunk failed. I don't see a place to put it either. I'm going to just remove this patch file from the list, but if anyone has encountered this, or knows why this patch doesn't apply successfully and wants to save me from impending disaster, I'd be happy to hear from you. :)

Thanks,

Charlie Taylor UF HPC Center ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
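For anyone else hitting this, a sketch of "removing it from the list": the list in question is the quilt series file that configure selects under lustre/kernel_patches/series/ (the exact file name depends on the target kernel, so the one below is an assumption); deleting the patch's line and re-pushing skips it.

    sed -i '/ext3-unlink-race.patch/d' lustre/kernel_patches/series/2.6.18-vanilla.series
    cd linux-stage && quilt pop -a && quilt push -a -q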
Re: [Lustre-discuss] 1.6.3 - 1.6.4.2 upgrade
Turns out that the client upgrade worked just fine. I had not noticed that the ko2iblnd module was not in place. I'm still wondering if I need to do anything special with regard to upgrading the MGS/MDS and OSSs. I'm hoping to just dump the software in place and reboot with live clients. Seems kind of risky, but hey, the docs say you can do it for 1.4 -> 1.6, so going from 1.6.3 -> 1.6.4.2 ought to be a no-brainer, right? :)

Charlie Taylor UF HPC Center

On Feb 17, 2008, at 3:44 PM, Charles Taylor wrote:

Just updated a single client from 1.6.3 to 1.6.4.2. The documentation seems to indicate that an upgraded client should still be able to mount the file system from a non-upgraded MGS/MDT. The documentation appears to be referring to a 1.4 to 1.6 upgrade, but I made the leap that similar things would apply to 1.6.3 -> 1.6.4.2. As I said, the servers are still running 1.6.3 and have not been touched. The client is upgraded to 1.6.4.2. When I try to mount the file system I get...

Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)

I've double-checked the first two, but I have no idea what the third item refers to. The docs talk about using tunefs to manually copy client config logs when upgrading an MGS/MDS, but they seem to indicate that the only issue on the client one needs to worry about is the form of the mount command. In going from 1.6.3 to 1.6.4.2, that should not be an issue.

Have I missed a step? Do I need to do something to tell the MGS/MDS that the client has been upgraded? Is there newer documentation for going from 1.6.3 to 1.6.4.2? I was hoping that I could just upgrade the software on the MGS/MDS and OSS (in that order) and restart. Is that not the case?

Thanks,

charlie taylor uf hpc center ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
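On the "copied client log" question, a read-only look at what the MGS currently holds can be had with tunefs.lustre (the device path is a placeholder; --print does not modify anything):

    tunefs.lustre --print /dev/mgs-mdt-device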