[Lustre-discuss] Searching for a speaker about lustre
Hi,

together with a friend of mine I wanted to deliver a talk about Lustre at the Open Source Data Center Conference (OSDC). See: http://www.netways.de/en/osdc/osdc_2010

Due to a health problem my friend cannot attend the conference, and now I am looking for a replacement. Is there anyone on the list who would like to see the town of Nuremberg and deliver a nice talk about Lustre in June? We could split the talk into two parts if desired.

Greetings,

--
Dr. Michael Schwartzkopff
MultiNET Services GmbH
Bretonischer Ring 7; 85630 Grasbrunn; Germany
Tel: +49 - 89 - 45 69 11 0
Fax: +49 - 89 - 45 69 11 21
Mobile: +49 - 174 - 343 28 75
Mail: mi...@multinet.de
Web: www.multinet.de
Sitz der Gesellschaft: 85630 Grasbrunn
Registergericht: Amtsgericht München HRB 114375
Geschäftsführer: Günter Jurgeneit, Hubert Martens
PGP Fingerprint: F919 3919 FF12 ED5A 2801 DEA6 AA77 57A4 EDD8 979B
Skype: misch42
Re: [Lustre-discuss] Future of LusterFS?
On Thursday, 22 April 2010 08:33:14, Janne Aho wrote:
> Hi,
>
> Today we have a storage system based on NFS, but we are really concerned
> about redundancy and are on the brink of taking the step to a cluster file
> system such as GlusterFS. We have got suggestions that Lustre would
> be the best option for us, but at the same time those who
> "recommended" Lustre have said that Oracle has pulled the plug and put
> the resources into OCFS2.
> If using Lustre in a production environment, it would be good to know
> that it won't be discontinued.
>
> Will there be a long term future for Lustre?
> Or should we be looking for something else as a long term solution?
>
> Thanks in advance for your reply to my somewhat cloudy question.

Hi,

for me Lustre is a very good option. But you could also consider a system composed of:

- corosync for the cluster communication,
- pacemaker as the cluster resource manager,
- DRBD for the replication of data between the nodes of the cluster, and
- NFS, or OCFS2, or GFS, or ...

Especially the NFS option provides you with a highly available NFS server on a real cluster stack, all managed by pacemaker.

Greetings,

--
Dr. Michael Schwartzkopff
MultiNET Services GmbH
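[A minimal crm shell sketch of the NFS-on-DRBD stack described above. Everything here is illustrative, not from the original post: the DRBD resource name r0, the mount point /srv/nfs, the service IP 192.168.1.100 and the init script name nfs-kernel-server all have to be adapted to the actual setup.]

    # DRBD as a master/slave resource
    primitive p_drbd_nfs ocf:linbit:drbd \
        params drbd_resource="r0" \
        op monitor interval="29s" role="Master" \
        op monitor interval="31s" role="Slave"
    ms ms_drbd_nfs p_drbd_nfs \
        meta master-max="1" clone-max="2" notify="true"

    # Filesystem on top of DRBD, the NFS server and a service IP
    primitive p_fs_nfs ocf:heartbeat:Filesystem \
        params device="/dev/drbd0" directory="/srv/nfs" fstype="ext3"
    primitive p_nfsserver lsb:nfs-kernel-server
    primitive p_ip_nfs ocf:heartbeat:IPaddr2 \
        params ip="192.168.1.100" cidr_netmask="24"
    group g_nfs p_fs_nfs p_nfsserver p_ip_nfs

    # The NFS group must run where DRBD is Master, and only after promotion
    colocation col_nfs_on_drbd inf: g_nfs ms_drbd_nfs:Master
    order o_drbd_before_nfs inf: ms_drbd_nfs:promote g_nfs:start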
Re: [Lustre-discuss] Filesystem monitoring in Heartbeat
On Thursday, 21 January 2010 23:09:37, Bernd Schubert wrote:
> On Thursday 21 January 2010, Adam Gandelman wrote:
(...)
> I guess you want to use the pacemaker agent I posted into this bugzilla:
>
> https://bugzilla.lustre.org/show_bug.cgi?id=20807

Hello,

how far along is the development of the agent? Is it more or less finished? Is it publishable?

Greetings,

--
Dr. Michael Schwartzkopff
MultiNET Services GmbH
Re: [Lustre-discuss] What HA Software to use with Lustre
On Friday, 15 January 2010 07:30:13, you wrote:
> > An introduction to pacemaker can be found at:
> > http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/index.html
>
> I wish I had been aware of the "crm" CLI before trying to take the XML way
> according to the link above:
>
> http://www.clusterlabs.org/doc/crm_cli.html
>
> Cheers,
> Li Wei

We are working on documentation describing how to set up Lustre together with Pacemaker. As soon as it is finished, it will show up in the wiki.

Greetings,

--
Dr. Michael Schwartzkopff
MultiNET Services GmbH
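[Until that documentation appears, a minimal sketch of what one Lustre target looks like in the crm shell, assuming a hypothetical MDT device /dev/mapper/mdt mounted at /mnt/mdt; device, mount point and timeouts are examples only.]

    # One Lustre target as a pacemaker resource, using the generic
    # Filesystem agent with fstype "lustre". Timeouts are generous
    # because Lustre recovery can take several minutes.
    primitive resMDT ocf:heartbeat:Filesystem \
        params device="/dev/mapper/mdt" directory="/mnt/mdt" fstype="lustre" \
        op start timeout="300s" \
        op stop timeout="300s" \
        op monitor interval="120s" timeout="120s"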
Re: [Lustre-discuss] What HA Software to use with Lustre
On Friday, 15 January 2010 00:48:53, Jagga Soorma wrote:
> Hi Guys,
>
> I am setting up our new Lustre environment and was wondering what is the
> recommended (stable) HA clustering software to use with the MDS and OSS
> failover. Any input would be greatly appreciated.
>
> Thanks,
> -J

The docs describe heartbeat, but that software is not recommended any more, neither heartbeat version 1 nor heartbeat version 2. Instead, the projects openais and pacemaker have replaced the functionality of heartbeat. For the new projects please see www.clusterlabs.org

An introduction to pacemaker can be found at:
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/index.html

Greetings,
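[As a quick sanity check once openais and pacemaker are installed: the service and package names below are only examples and vary by distribution, but crm_mon is part of pacemaker.]

    # Start the cluster stack (on RHEL/CentOS 5 the init script is
    # typically called "openais"; adjust to your distribution).
    service openais start

    # One-shot overview of nodes and resources, run on any cluster node.
    crm_mon -1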
[Lustre-discuss] Implementing MMP correctly
Hi,

I am trying to understand how to implement MMP correctly in a Lustre failover cluster. As far as I understand it, MMP protects a filesystem from being mounted by different nodes (OSSes) of a failover cluster at the same time. So far so good.

If a node was shut down uncleanly, it will still occupy its filesystems via MMP and thus prevent a clean failover to another node. Now I want to implement a clean failover in the Filesystem resource agent of pacemaker. Is there a good way to solve the problem with MMP? Possible solutions are:

- Disable the MMP feature in a cluster altogether, since the resource manager takes care that the same resource is only mounted once in the cluster.
- Do a "tunefs -O ^mmp <device>" and a "tunefs -O mmp <device>" before every mount of a resource?
- Do a "sleep 10" before mounting a resource? But the manual says "the file system mount requires additional time if the file system was not cleanly unmounted."
- Check via MMP whether the file system is in use by another OSS and wait a little bit longer? How do I do this?

Please mail me any ideas. Thanks.

--
Dr. Michael Schwartzkopff
MultiNET Services GmbH
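[One way to approximate the last option is a simple retry loop around the mount call, which could be wrapped around (or patched into) the resource agent's start action. This is only a sketch under the assumption that a mount failure caused by MMP is worth retrying after a short delay; the device, mount point and retry count are made-up examples.]

    #!/bin/sh
    # Retry the Lustre mount a few times so that the MMP interval on the
    # previous owner of the device can expire before we give up.
    DEVICE=/dev/mapper/ost0000      # example device
    MOUNTPOINT=/mnt/ost0000         # example mount point
    RETRIES=6

    i=0
    while [ "$i" -lt "$RETRIES" ]; do
        if mount -t lustre "$DEVICE" "$MOUNTPOINT"; then
            exit 0                  # mounted successfully
        fi
        i=$((i + 1))
        sleep 10                    # give MMP time to see the device as idle
    done
    echo "giving up: $DEVICE still appears to be in use (MMP?)" >&2
    exit 1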
Re: [Lustre-discuss] how to define 60 failnodes
On Monday, 9 November 2009 16:36:15, Bernd Schubert wrote:
> On Monday 09 November 2009, Brian J. Murrell wrote:
> > Theoretically. I had discussed this briefly with another engineer a
> > while ago and IIRC, the result of the discussion was that there was
> > nothing inherent in the configuration logic that would prevent one from
> > having more than two ("primary" and "failover") OSSes providing service
> > to an OST. Two nodes per OST is how just about everyone that wants
> > failover configures Lustre.
>
> Not everyone ;) And especially it doesn't make sense to have a 2 node
> failover scheme with pacemaker:
>
> https://bugzilla.lustre.org/show_bug.cgi?id=20964

The problem is that pacemaker does not know anything about the applications it clusters. Pacemaker is made to provide high availability for ANY service, not only for a cluster filesystem.

So if you want to pin some resources (e.g. FS1) to a particular node, you have to add a location constraint. But this contradicts the logic of pacemaker a little bit: why should a resource run on this node if all nodes are equal?

Basically I had the same problem with my Lustre cluster, and I used the following solution: add colocation constraints so that the filesystems prefer not to run on the same node (see the sketch below).

And theoretically, with openais as the cluster stack, the number of nodes is no longer limited to 16 as it was with heartbeat. You can build larger clusters.

Greetings,

--
Dr. Michael Schwartzkopff
MultiNET Services GmbH
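[A sketch of such constraints in crm shell syntax. The resource names resFS1/resFS2, the node names oss1/oss2 and the scores are invented for illustration.]

    # Mild preferences: each filesystem likes "its" node ...
    location locFS1 resFS1 100: oss1
    location locFS2 resFS2 100: oss2

    # ... and a strong (but not mandatory) anti-colocation keeps the two
    # filesystems apart as long as more than one node is available.
    colocation colFS1FS2 -1000: resFS1 resFS2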
Re: [Lustre-discuss] Understanding of MMP
On Monday, 19 October 2009 20:42:19, you wrote:
> On Monday 19 October 2009, Andreas Dilger wrote:
> > On 19-Oct-09, at 08:46, Michael Schwartzkopff wrote:
> > > perhaps I have a problem understanding multiple mount protection
> > > MMP. I have a cluster. When a failover happens sometimes I get the log entry:
> > >
> > > Oct 19 15:16:08 sososd7 kernel: LDISKFS-fs warning (device dm-2):
> > > ldiskfs_multi_mount_protect: Device is already active on another node.
> > > Oct 19 15:16:08 sososd7 kernel: LDISKFS-fs warning (device dm-2):
> > > ldiskfs_multi_mount_protect: MMP failure info: last update time: 1255958168,
> > > last update node: sososd3, last update device: dm-2
> > >
> > > Does the second line mean that my node (sososd7) tried to mount /dev/dm-2 but
> > > MMP prevented it from doing so because the last update from the old node
> > > (sososd3) was too recent?
> >
> > The update time stored in the MMP block is purely for informational
> > purposes. It actually uses a sequence counter that has nothing to do
> > with the system clock on either of the nodes (since they may not be in
> > sync).
> >
> > What that message actually means is that sososd7 tried to mount the
> > filesystem on dm-2 (which likely has another "LVM" name that the kernel
> > doesn't know anything about) but the MMP block on the disk was modified
> > by sososd3 AFTER sososd7 first looked at it.
>
> Probably bug #19566. Michael, which Lustre version exactly do you use?
>
> Thanks,
> Bernd

I am running version 1.8.1.1, which was published last week. Is the fix included there, or only in 1.8.2?

Greetings,

--
Dr. Michael Schwartzkopff
MultiNET Services GmbH
[Lustre-discuss] Understanding of MMP
Hi,

perhaps I have a problem understanding multiple mount protection (MMP). I have a cluster. When a failover happens I sometimes get the log entry:

Oct 19 15:16:08 sososd7 kernel: LDISKFS-fs warning (device dm-2): ldiskfs_multi_mount_protect: Device is already active on another node.
Oct 19 15:16:08 sososd7 kernel: LDISKFS-fs warning (device dm-2): ldiskfs_multi_mount_protect: MMP failure info: last update time: 1255958168, last update node: sososd3, last update device: dm-2

Does the second line mean that my node (sososd7) tried to mount /dev/dm-2 but MMP prevented it from doing so because the last update from the old node (sososd3) was too recent?

From the manuals I found an MMP time of 10 seconds. Is it correct that after the umount the next node cannot mount the same filesystem within 10 seconds? So the solution would be to wait for 10 seconds before mounting the resource on the next node. Is this correct?

Thanks.

--
Dr. Michael Schwartzkopff
MultiNET Services GmbH
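[To see whether the mmp feature is actually enabled on the device, and to inspect the current MMP block, something like the following should work; /dev/dm-2 is the device from the log above, and the debugfs request is only available if your e2fsprogs build has MMP support.]

    # List the filesystem features; "mmp" should appear if the feature is on.
    dumpe2fs -h /dev/dm-2 | grep -i features

    # Dump the MMP block (node name, device, sequence number), if supported.
    debugfs -c -R dump_mmp /dev/dm-2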
[Lustre-discuss] Problem re-mounting Lustre on an other node
Hi,

we have a Lustre 1.8 cluster with openais and pacemaker as the cluster manager. When I migrate one Lustre resource from one node to another node I get an error. Stopping Lustre on one node is no problem, but the node where Lustre should start says:

Oct 14 09:54:28 sososd6 kernel: kjournald starting. Commit interval 5 seconds
Oct 14 09:54:28 sososd6 kernel: LDISKFS FS on dm-4, internal journal
Oct 14 09:54:28 sososd6 kernel: LDISKFS-fs: recovery complete.
Oct 14 09:54:28 sososd6 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Oct 14 09:54:28 sososd6 multipathd: dm-4: umount map (uevent)
Oct 14 09:54:39 sososd6 kernel: kjournald starting. Commit interval 5 seconds
Oct 14 09:54:39 sososd6 kernel: LDISKFS FS on dm-4, internal journal
Oct 14 09:54:39 sososd6 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Oct 14 09:54:39 sososd6 kernel: LDISKFS-fs: file extents enabled
Oct 14 09:54:39 sososd6 kernel: LDISKFS-fs: mballoc enabled
Oct 14 09:54:39 sososd6 kernel: Lustre: mgc134.171.16@tcp: Reactivating import
Oct 14 09:54:45 sososd6 kernel: LustreError: 137-5: UUID 'segfs-OST_UUID' is not available for connect (no target)
Oct 14 09:54:45 sososd6 kernel: LustreError: Skipped 3 previous similar messages
Oct 14 09:54:45 sososd6 kernel: LustreError: 31334:0:(ldlm_lib.c:1850:target_send_reply_msg()) @@@ processing error (-19) r...@810225fcb800 x334514011/t0 o8->@:0/0 lens 368/0 e 0 to 0 dl 1255506985 ref 1 fl Interpret:/0/0 rc -19/0
Oct 14 09:54:45 sososd6 kernel: LustreError: 31334:0:(ldlm_lib.c:1850:target_send_reply_msg()) Skipped 3 previous similar messages

These log messages continue until the cluster software times out and reports an error for the resource. Any help understanding these logs? Thanks.

--
Dr. Michael Schwartzkopff
MultiNET Services GmbH
Re: [Lustre-discuss] Setup mail cluster
On Monday, 12 October 2009 15:54:04, Vadym wrote:
> Hello
> I'm doing a schema of a mail service, so I have only one question:
> Can Lustre provide me a fully automatic failover solution?

No. See the Lustre manual for this. You need a cluster solution for this. The manual is *hopelessly* outdated at this point. Do NOT use heartbeat any more. Use pacemaker as the cluster manager. See www.clusterlabs.org. When I find some time I want to write a HOWTO about setting up a Lustre cluster with pacemaker and OpenAIS.

> I plan to use standard servers with 1GE links for storage. I need as
> automatic a solution as possible.
> E.g. RAID5 functionality: when one or more storage nodes go down, user data
> is still accessible. So if I have 100TB of disk storage I can serve 50TB of
> data in failover mode with no downtime. Can you provide me more
> information?

Use a bonded device for the cluster interconnect! It is safer!

Use DRBD for replication of the data if you use direct-attached storage. DRBD can operate on top of LVM, so you can have that functionality as well. Perhaps you can try clustered LVM; it has nice features. Or just use ZFS, which offers all this.

--
Dr. Michael Schwartzkopff
MultiNET Services GmbH
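[A minimal DRBD resource definition for such a replicated storage pair, as a sketch only: the host names store1/store2, the LVM volume /dev/vg0/lv_mail and the IP addresses are invented for illustration.]

    # /etc/drbd.conf (DRBD 8.x style)
    resource r_mail {
      protocol C;                      # synchronous replication
      on store1 {
        device    /dev/drbd0;
        disk      /dev/vg0/lv_mail;    # LVM logical volume as backing store
        address   10.0.0.1:7788;
        meta-disk internal;
      }
      on store2 {
        device    /dev/drbd0;
        disk      /dev/vg0/lv_mail;
        address   10.0.0.2:7788;
        meta-disk internal;
      }
    }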
Re: [Lustre-discuss] Question about sleeping processes
On Tuesday, 6 October 2009 17:08:44, Brian J. Murrell wrote:
> On Tue, 2009-10-06 at 17:01 +0200, Michael Schwartzkopff wrote:
> > Here is some additional information from the logs. Any ideas about that?
> >
> > Oct 5 10:26:43 sosmds2 kernel: LustreError: 30617:0:
> > (pack_generic.c:655:lustre_shrink_reply_v2()) ASSERTION(msg->lm_bufcount > segment) failed
>
> Here's the failed assertion.
>
> > Oct 5 10:26:43 sosmds2 kernel: LustreError: 30617:0:
> > (pack_generic.c:655:lustre_shrink_reply_v2()) LBUG
>
> Which always leads to an LBUG, which is what is putting the thread to
> sleep.
>
> Any time you see an LBUG in a server log file, you need to reboot the
> server.
>
> So now you need to take that ASSERTION message to our bugzilla and see
> if you can find a bug for it already, and if not, file a new one, please.
>
> Cheers,
> b.

Thanks for your fast reply. I think #20020 is the one we hit. Waiting for a solution.

Greetings,

--
Dr. Michael Schwartzkopff
MultiNET Services GmbH
Re: [Lustre-discuss] Question about sleeping processes
On Tuesday, 6 October 2009 16:22:08, Brian J. Murrell wrote:
> On Tue, 2009-10-06 at 12:48 +0200, Michael Schwartzkopff wrote:
> > Hi,
>
> Hi,
>
> > my system load shows that quite a number of processes are waiting.
>
> Blocked. I guess the word waiting is similar.
>
> > My questions are:
> > What causes the problem?
>
> In this case, the thread has lbugged previously.
>
> If you look in syslog for the node with these processes you should find
> entries with LBUG and/or ASSERTION messages. These are the defects that
> are causing the processes to get blocked (uninterruptible sleep)
(...)

Here is some additional information from the logs. Any ideas about that?

Oct 5 10:26:43 sosmds2 kernel: LustreError: 30617:0:(pack_generic.c:655:lustre_shrink_reply_v2()) ASSERTION(msg->lm_bufcount > segment) failed
Oct 5 10:26:43 sosmds2 kernel: LustreError: 30617:0:(pack_generic.c:655:lustre_shrink_reply_v2()) LBUG
Oct 5 10:26:43 sosmds2 kernel: Lustre: 30617:0:(linux-debug.c:264:libcfs_debug_dumpstack()) showing stack for process 30617
Oct 5 10:26:43 sosmds2 kernel: ll_mdt_47 R running task 0 30617 1 30618 30616 (L-TLB)
Oct 5 10:26:43 sosmds2 kernel: 0001 000714a28100 0001
Oct 5 10:26:43 sosmds2 kernel: 0001 0086 0012 8102212dfe88
Oct 5 10:26:43 sosmds2 kernel: 0001 802f6aa0
Oct 5 10:26:43 sosmds2 kernel: Call Trace:
Oct 5 10:26:43 sosmds2 kernel: [] autoremove_wake_function+0x9/0x2e
Oct 5 10:26:43 sosmds2 kernel: [] __wake_up_common+0x3e/0x68
Oct 5 10:26:43 sosmds2 kernel: [] __wake_up_common+0x3e/0x68
Oct 5 10:26:43 sosmds2 kernel: [] vprintk+0x2cb/0x317
Oct 5 10:26:43 sosmds2 kernel: [] kallsyms_lookup+0xc2/0x17b
Oct 5 10:26:43 sosmds2 last message repeated 3 times
Oct 5 10:26:43 sosmds2 kernel: [] printk_address+0x9f/0xab
Oct 5 10:26:43 sosmds2 kernel: [] printk+0x8/0xbd
Oct 5 10:26:43 sosmds2 kernel: [] printk+0x52/0xbd
Oct 5 10:26:43 sosmds2 kernel: [] module_text_address+0x33/0x3c
Oct 5 10:26:43 sosmds2 kernel: [] kernel_text_address+0x1a/0x26
Oct 5 10:26:43 sosmds2 kernel: [] dump_trace+0x211/0x23a
Oct 5 10:26:43 sosmds2 kernel: [] show_trace+0x34/0x47
Oct 5 10:26:43 sosmds2 kernel: [] _show_stack+0xdb/0xea
Oct 5 10:26:43 sosmds2 kernel: [] :libcfs:lbug_with_loc+0x7a/0xd0
Oct 5 10:26:43 sosmds2 kernel: [] :libcfs:tracefile_init+0x0/0x110
Oct 5 10:26:43 sosmds2 kernel: [] :ptlrpc:lustre_shrink_reply_v2+0xa8/0x240
Oct 5 10:26:43 sosmds2 kernel: [] :mds:mds_getattr_lock+0xc59/0xce0
Oct 5 10:26:43 sosmds2 kernel: [] :ptlrpc:lustre_msg_add_version+0x34/0x110
Oct 5 10:26:43 sosmds2 kernel: [] :lnet:lnet_ni_send+0x93/0xd0
Oct 5 10:26:43 sosmds2 kernel: [] :lnet:lnet_send+0x973/0x9a0
Oct 5 10:26:43 sosmds2 kernel: [] cache_alloc_refill+0x106/0x186
Oct 5 10:26:43 sosmds2 kernel: [] :mds:fixup_handle_for_resent_req+0x5a/0x2c0
Oct 5 10:26:43 sosmds2 kernel: [] :mds:mds_intent_policy+0x636/0xc10
Oct 5 10:26:43 sosmds2 kernel: [] :ptlrpc:ldlm_resource_putref+0x1b6/0x3a0
Oct 5 10:26:43 sosmds2 kernel: [] :ptlrpc:ldlm_lock_enqueue+0x186/0xb30
Oct 5 10:26:43 sosmds2 kernel: [] :ptlrpc:ldlm_export_lock_get+0x6f/0xe0
Oct 5 10:26:43 sosmds2 kernel: [] :obdclass:lustre_hash_add+0x218/0x2e0
Oct 5 10:26:43 sosmds2 kernel: [] :ptlrpc:ldlm_server_blocking_ast+0x0/0x83d
Oct 5 10:26:43 sosmds2 kernel: [] :ptlrpc:ldlm_handle_enqueue+0xc19/0x1210
Oct 5 10:26:43 sosmds2 kernel: [] :mds:mds_handle+0x4080/0x4cb0
Oct 5 10:26:43 sosmds2 kernel: [] __next_cpu+0x19/0x28
Oct 5 10:26:43 sosmds2 kernel: [] find_busiest_group+0x20d/0x621
Oct 5 10:26:43 sosmds2 kernel: [] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Oct 5 10:26:43 sosmds2 kernel: [] enqueue_task+0x41/0x56
Oct 5 10:26:43 sosmds2 kernel: [] :ptlrpc:ptlrpc_check_req+0x1d/0x110
Oct 5 10:26:43 sosmds2 kernel: [] :ptlrpc:ptlrpc_server_handle_request+0xa97/0x1160
Oct 5 10:26:43 sosmds2 kernel: [] thread_return+0x62/0xfe
Oct 5 10:26:43 sosmds2 kernel: [] __wake_up_common+0x3e/0x68
Oct 5 10:26:43 sosmds2 kernel: [] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Oct 5 10:26:43 sosmds2 kernel: [] default_wake_function+0x0/0xe
Oct 5 10:26:43 sosmds2 kernel: [] audit_syscall_exit+0x327/0x342
Oct 5 10:26:43 sosmds2 kernel: [] child_rip+0xa/0x11
Oct 5 10:26:43 sosmds2 kernel: [] :ptlrpc:ptlrpc_main+0x0/0x13e0
Oct 5 10:26:43 sosmds2 kernel: [] child_rip+0x0/0x11
Oct 5 10:26:43 sosmds2 kernel:
Oct 5 10:26:43 sosmds2 kernel: LustreError: dumping log to /tmp/lustre-log.1254731203.30617

--
Dr. Michael Schwartzkopff
MultiNET Services GmbH
[Lustre-discuss] Question about sleeping processes
Hi,

my system load shows that quite a number of processes are waiting. ps shows me the same number of processes in state D (uninterruptible sleep). All of these processes are ll_mdt_NN, where NN is a decimal number. In the logs I find the entry below.

My questions are:
What causes the problem?
Can I kill the "hanging" processes?

System: Lustre 1.8.1 on RHEL 5.3

Thanks for any hints.

---

Oct 5 10:28:03 sosmds2 kernel: Lustre: 0:0:(watchdog.c:181:lcw_cb()) Watchdog triggered for pid 28402: it was inactive for 200.00s
Oct 5 10:28:03 sosmds2 kernel: ll_mdt_35 D 81000100c980 0 28402 1 28403 28388 (L-TLB)
Oct 5 10:28:03 sosmds2 kernel: 81041c723810 0046 7fff
Oct 5 10:28:03 sosmds2 kernel: 81041c7237d0 0001 81022f3e60c0 81022f12e080
Oct 5 10:28:03 sosmds2 kernel: 000177b2feff847c 14df 81022f3e62a8 0001028f
Oct 5 10:28:03 sosmds2 kernel: Call Trace:
Oct 5 10:28:03 sosmds2 kernel: [] default_wake_function+0x0/0xe
Oct 5 10:28:03 sosmds2 kernel: [] :libcfs:lbug_with_loc+0xc6/0xd0
Oct 5 10:28:03 sosmds2 kernel: [] :libcfs:tracefile_init+0x0/0x110
Oct 5 10:28:03 sosmds2 kernel: [] :ptlrpc:lustre_shrink_reply_v2+0xa8/0x240
Oct 5 10:28:03 sosmds2 kernel: [] :mds:mds_getattr_lock+0xc59/0xce0
Oct 5 10:28:03 sosmds2 kernel: [] :ptlrpc:lustre_msg_add_version+0x34/0x110
Oct 5 10:28:03 sosmds2 kernel: [] :lnet:lnet_ni_send+0x93/0xd0
Oct 5 10:28:03 sosmds2 kernel: [] :lnet:lnet_send+0x973/0x9a0
Oct 5 10:28:03 sosmds2 kernel: [] :mds:fixup_handle_for_resent_req+0x5a/0x2c0
Oct 5 10:28:03 sosmds2 kernel: [] :mds:mds_intent_policy+0x636/0xc10
Oct 5 10:28:03 sosmds2 kernel: [] :ptlrpc:ldlm_resource_putref+0x1b6/0x3a0
Oct 5 10:28:03 sosmds2 kernel: [] :ptlrpc:ldlm_lock_enqueue+0x186/0xb30
Oct 5 10:28:03 sosmds2 kernel: [] :ptlrpc:ldlm_export_lock_get+0x6f/0xe0
Oct 5 10:28:03 sosmds2 kernel: [] :obdclass:lustre_hash_add+0x218/0x2e0
Oct 5 10:28:03 sosmds2 kernel: [] :ptlrpc:ldlm_server_blocking_ast+0x0/0x83d
Oct 5 10:28:03 sosmds2 kernel: [] :ptlrpc:ldlm_handle_enqueue+0xc19/0x1210
Oct 5 10:28:03 sosmds2 kernel: [] :mds:mds_handle+0x4080/0x4cb0
Oct 5 10:28:03 sosmds2 kernel: [] :lvfs:lprocfs_counter_sub+0x57/0x90
Oct 5 10:28:03 sosmds2 kernel: [] __next_cpu+0x19/0x28
Oct 5 10:28:03 sosmds2 kernel: [] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Oct 5 10:28:03 sosmds2 kernel: [] enqueue_task+0x41/0x56
Oct 5 10:28:03 sosmds2 kernel: [] :ptlrpc:ptlrpc_check_req+0x1d/0x110
Oct 5 10:28:03 sosmds2 kernel: [] :ptlrpc:ptlrpc_server_handle_request+0xa97/0x1160
Oct 5 10:28:03 sosmds2 kernel: [] lock_timer_base+0x1b/0x3c
Oct 5 10:28:03 sosmds2 kernel: [] __wake_up_common+0x3e/0x68
Oct 5 10:28:03 sosmds2 kernel: [] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Oct 5 10:28:03 sosmds2 kernel: [] default_wake_function+0x0/0xe
Oct 5 10:28:03 sosmds2 kernel: [] audit_syscall_exit+0x327/0x342
Oct 5 10:28:03 sosmds2 kernel: [] child_rip+0xa/0x11
Oct 5 10:28:03 sosmds2 kernel: [] :ptlrpc:ptlrpc_main+0x0/0x13e0
Oct 5 10:28:03 sosmds2 kernel: [] child_rip+0x0/0x11

--
Dr. Michael Schwartzkopff
MultiNET Services GmbH