[Lustre-discuss] Question about sleeping processes

2009-10-06 Thread Michael Schwartzkopff
Hi,

my system load shows that quite a number of processes are waiting. ps shows me
the same number of processes in state D (uninterruptible sleep). All of these
processes are ll_mdt_NN, where NN is a decimal number.
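
A quick way to list these blocked threads (assuming the standard procps tools
on RHEL5) is, for example:

  # list all processes currently in uninterruptible sleep (state D)
  ps -eo pid,stat,comm | awk '$2 ~ /^D/'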

In the logs I find the following entry (see log below).

My questions are:
What causes the problem?
Can I kill the "hanging" processes?

System: Lustre 1.8.1 on RHEL5.3

Thanks for any hints.

---

Oct  5 10:28:03 sosmds2 kernel: Lustre: 0:0:(watchdog.c:181:lcw_cb()) Watchdog 
triggered for pid 28402: it was inactive for 200.00s
Oct  5 10:28:03 sosmds2 kernel: ll_mdt_35 D 81000100c980 0 28402
  
1 28403 28388 (L-TLB)
Oct  5 10:28:03 sosmds2 kernel:  81041c723810 0046 
 7fff
Oct  5 10:28:03 sosmds2 kernel:  81041c7237d0 0001 
81022f3e60c0 81022f12e080
Oct  5 10:28:03 sosmds2 kernel:  000177b2feff847c 14df 
81022f3e62a8 0001028f
Oct  5 10:28:03 sosmds2 kernel: Call Trace:
Oct  5 10:28:03 sosmds2 kernel:  [] 
default_wake_function+0x0/0xe
Oct  5 10:28:03 sosmds2 kernel:  [] 
:libcfs:lbug_with_loc+0xc6/0xd0
Oct  5 10:28:03 sosmds2 kernel:  [] 
:libcfs:tracefile_init+0x0/0x110
Oct  5 10:28:03 sosmds2 kernel:  [] 
:ptlrpc:lustre_shrink_reply_v2+0xa8/0x240
Oct  5 10:28:03 sosmds2 kernel:  [] 
:mds:mds_getattr_lock+0xc59/0xce0
Oct  5 10:28:03 sosmds2 kernel:  [] 
:ptlrpc:lustre_msg_add_version+0x34/0x110
Oct  5 10:28:03 sosmds2 kernel:  [] 
:lnet:lnet_ni_send+0x93/0xd0
Oct  5 10:28:03 sosmds2 kernel:  [] 
:lnet:lnet_send+0x973/0x9a0
Oct  5 10:28:03 sosmds2 kernel:  [] 
:mds:fixup_handle_for_resent_req+0x5a/0x2c0
Oct  5 10:28:03 sosmds2 kernel:  [] 
:mds:mds_intent_policy+0x636/0xc10
Oct  5 10:28:03 sosmds2 kernel:  [] 
:ptlrpc:ldlm_resource_putref+0x1b6/0x3a0
Oct  5 10:28:03 sosmds2 kernel:  [] 
:ptlrpc:ldlm_lock_enqueue+0x186/0xb30
Oct  5 10:28:03 sosmds2 kernel:  [] 
:ptlrpc:ldlm_export_lock_get+0x6f/0xe0
Oct  5 10:28:03 sosmds2 kernel:  [] 
:obdclass:lustre_hash_add+0x218/0x2e0
Oct  5 10:28:03 sosmds2 kernel:  [] 
:ptlrpc:ldlm_server_blocking_ast+0x0/0x83d
Oct  5 10:28:03 sosmds2 kernel:  [] 
:ptlrpc:ldlm_handle_enqueue+0xc19/0x1210
Oct  5 10:28:03 sosmds2 kernel:  [] 
:mds:mds_handle+0x4080/0x4cb0
Oct  5 10:28:03 sosmds2 kernel:  [] 
:lvfs:lprocfs_counter_sub+0x57/0x90
Oct  5 10:28:03 sosmds2 kernel:  [] __next_cpu+0x19/0x28
Oct  5 10:28:03 sosmds2 kernel:  [] 
:ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Oct  5 10:28:03 sosmds2 kernel:  [] enqueue_task+0x41/0x56
Oct  5 10:28:03 sosmds2 kernel:  [] 
:ptlrpc:ptlrpc_check_req+0x1d/0x110
Oct  5 10:28:03 sosmds2 kernel:  [] 
:ptlrpc:ptlrpc_server_handle_request+0xa97/0x1160
Oct  5 10:28:03 sosmds2 kernel:  [] lock_timer_base+0x1b/0x3c
Oct  5 10:28:03 sosmds2 kernel:  [] __wake_up_common+0x3e/0x68
Oct  5 10:28:03 sosmds2 kernel:  [] 
:ptlrpc:ptlrpc_main+0x1218/0x13e0
Oct  5 10:28:03 sosmds2 kernel:  [] 
default_wake_function+0x0/0xe
Oct  5 10:28:03 sosmds2 kernel:  [] 
audit_syscall_exit+0x327/0x342
Oct  5 10:28:03 sosmds2 kernel:  [] child_rip+0xa/0x11
Oct  5 10:28:03 sosmds2 kernel:  [] 
:ptlrpc:ptlrpc_main+0x0/0x13e0
Oct  5 10:28:03 sosmds2 kernel:  [] child_rip+0x0/0x11


-- 
Dr. Michael Schwartzkopff
MultiNET Services GmbH
Addresse: Bretonischer Ring 7; 85630 Grasbrunn; Germany
Tel: +49 - 89 - 45 69 11 0
Fax: +49 - 89 - 45 69 11 21
mob: +49 - 174 - 343 28 75

mail: mi...@multinet.de
web: www.multinet.de

Sitz der Gesellschaft: 85630 Grasbrunn
Registergericht: Amtsgericht München HRB 114375
Geschäftsführer: Günter Jurgeneit, Hubert Martens

---

PGP Fingerprint: F919 3919 FF12 ED5A 2801 DEA6 AA77 57A4 EDD8 979B
Skype: misch42


Re: [Lustre-discuss] Question about sleeping processes

2009-10-06 Thread Michael Schwartzkopff
On Tuesday, 6 October 2009 at 16:22:08, Brian J. Murrell wrote:
> On Tue, 2009-10-06 at 12:48 +0200, Michael Schwartzkopff wrote:
> > Hi,
>
> Hi,
>
> > my system load shows that quite a number of processes are waiting.
>
> Blocked.  I guess the word waiting is similar.
>
> > My questions are:
> > What causes the problem?
>
> In this case, the thread has lbugged previously.
>
> If you look in syslog for node with these processes you should find
> entries with LBUG and/or ASSERTION messages.  These are the defects that
> are causing the processes to get blocked (uninteruptable sleep)
(...)
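
For reference, finding these on the server (assuming syslog writes to
/var/log/messages on RHEL5) is just something like:

  # search the system log for Lustre assertion failures and LBUGs
  grep -E 'LBUG|ASSERTION' /var/log/messages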

Here is some additional information from the logs. Any ideas about that?

Oct  5 10:26:43 sosmds2 kernel: LustreError: 30617:0:
(pack_generic.c:655:lustre_shrink_reply_v2()) ASSERTION(msg->lm_bufcount > 
segment) failed
Oct  5 10:26:43 sosmds2 kernel: LustreError: 30617:0:
(pack_generic.c:655:lustre_shrink_reply_v2()) LBUG
Oct  5 10:26:43 sosmds2 kernel: Lustre: 30617:0:(linux-
debug.c:264:libcfs_debug_dumpstack()) showing stack for process 30617
Oct  5 10:26:43 sosmds2 kernel: ll_mdt_47 R  running task   0 30617 
 
1 30618 30616 (L-TLB)
Oct  5 10:26:43 sosmds2 kernel:   0001 
000714a28100 0001
Oct  5 10:26:43 sosmds2 kernel:  0001 0086 
0012 8102212dfe88
Oct  5 10:26:43 sosmds2 kernel:  0001  
802f6aa0 
Oct  5 10:26:43 sosmds2 kernel: Call Trace:
Oct  5 10:26:43 sosmds2 kernel:  [] 
autoremove_wake_function+0x9/0x2e
Oct  5 10:26:43 sosmds2 kernel:  [] __wake_up_common+0x3e/0x68
Oct  5 10:26:43 sosmds2 kernel:  [] __wake_up_common+0x3e/0x68
Oct  5 10:26:43 sosmds2 kernel:  [] vprintk+0x2cb/0x317
Oct  5 10:26:43 sosmds2 kernel:  [] kallsyms_lookup+0xc2/0x17b
Oct  5 10:26:43 sosmds2 last message repeated 3 times
Oct  5 10:26:43 sosmds2 kernel:  [] printk_address+0x9f/0xab
Oct  5 10:26:43 sosmds2 kernel:  [] printk+0x8/0xbd
Oct  5 10:26:43 sosmds2 kernel:  [] printk+0x52/0xbd
Oct  5 10:26:43 sosmds2 kernel:  [] 
module_text_address+0x33/0x3c
Oct  5 10:26:43 sosmds2 kernel:  [] 
kernel_text_address+0x1a/0x26
Oct  5 10:26:43 sosmds2 kernel:  [] dump_trace+0x211/0x23a
Oct  5 10:26:43 sosmds2 kernel:  [] show_trace+0x34/0x47
Oct  5 10:26:43 sosmds2 kernel:  [] _show_stack+0xdb/0xea
Oct  5 10:26:43 sosmds2 kernel:  [] 
:libcfs:lbug_with_loc+0x7a/0xd0
Oct  5 10:26:43 sosmds2 kernel:  [] 
:libcfs:tracefile_init+0x0/0x110
Oct  5 10:26:43 sosmds2 kernel:  [] 
:ptlrpc:lustre_shrink_reply_v2+0xa8/0x240
Oct  5 10:26:43 sosmds2 kernel:  [] 
:mds:mds_getattr_lock+0xc59/0xce0
Oct  5 10:26:43 sosmds2 kernel:  [] 
:ptlrpc:lustre_msg_add_version+0x34/0x110
Oct  5 10:26:43 sosmds2 kernel:  [] 
:lnet:lnet_ni_send+0x93/0xd0
Oct  5 10:26:43 sosmds2 kernel:  [] 
:lnet:lnet_send+0x973/0x9a0
Oct  5 10:26:43 sosmds2 kernel:  [] 
cache_alloc_refill+0x106/0x186
Oct  5 10:26:43 sosmds2 kernel:  [] 
:mds:fixup_handle_for_resent_req+0x5a/0x2c0
Oct  5 10:26:43 sosmds2 kernel:  [] 
:mds:mds_intent_policy+0x636/0xc10
Oct  5 10:26:43 sosmds2 kernel:  [] 
:ptlrpc:ldlm_resource_putref+0x1b6/0x3a0
Oct  5 10:26:43 sosmds2 kernel:  [] 
:ptlrpc:ldlm_lock_enqueue+0x186/0xb30
Oct  5 10:26:43 sosmds2 kernel:  [] 
:ptlrpc:ldlm_export_lock_get+0x6f/0xe0
Oct  5 10:26:43 sosmds2 kernel:  [] 
:obdclass:lustre_hash_add+0x218/0x2e0
Oct  5 10:26:43 sosmds2 kernel:  [] 
:ptlrpc:ldlm_server_blocking_ast+0x0/0x83d
Oct  5 10:26:43 sosmds2 kernel:  [] 
:ptlrpc:ldlm_handle_enqueue+0xc19/0x1210
Oct  5 10:26:43 sosmds2 kernel:  [] 
:mds:mds_handle+0x4080/0x4cb0
Oct  5 10:26:43 sosmds2 kernel:  [] __next_cpu+0x19/0x28
Oct  5 10:26:43 sosmds2 kernel:  [] 
find_busiest_group+0x20d/0x621
Oct  5 10:26:43 sosmds2 kernel:  [] 
:ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Oct  5 10:26:43 sosmds2 kernel:  [] enqueue_task+0x41/0x56
Oct  5 10:26:43 sosmds2 kernel:  [] 
:ptlrpc:ptlrpc_check_req+0x1d/0x110
Oct  5 10:26:43 sosmds2 kernel:  [] 
:ptlrpc:ptlrpc_server_handle_request+0xa97/0x1160
Oct  5 10:26:43 sosmds2 kernel:  [] thread_return+0x62/0xfe
Oct  5 10:26:43 sosmds2 kernel:  [] __wake_up_common+0x3e/0x68
Oct  5 10:26:43 sosmds2 kernel:  [] 
:ptlrpc:ptlrpc_main+0x1218/0x13e0
Oct  5 10:26:43 sosmds2 kernel:  [] 
default_wake_function+0x0/0xe
Oct  5 10:26:43 sosmds2 kernel:  [] 
audit_syscall_exit+0x327/0x342
Oct  5 10:26:43 sosmds2 kernel:  [] child_rip+0xa/0x11
Oct  5 10:26:43 sosmds2 kernel:  [] 
:ptlrpc:ptlrpc_main+0x0/0x13e0
Oct  5 10:26:43 sosmds2 kernel:  [] child_rip+0x0/0x11
Oct  5 10:26:43 sosmds2 kernel:
Oct  5 10:26:43 sosmds2 kernel: LustreError: dumping log to /tmp/lustre-
log.1254731203.30617



Re: [Lustre-discuss] Question about sleeping processes

2009-10-06 Thread Michael Schwartzkopff
On Tuesday, 6 October 2009 at 17:08:44, Brian J. Murrell wrote:
> On Tue, 2009-10-06 at 17:01 +0200, Michael Schwartzkopff wrote:
> > Here is some additional from the logs. Any ideas about that?
> >
> > Oct  5 10:26:43 sosmds2 kernel: LustreError: 30617:0:
> > (pack_generic.c:655:lustre_shrink_reply_v2()) ASSERTION(msg->lm_bufcount
> > > segment) failed
>
> Here's the failed assertion.
>
> > Oct  5 10:26:43 sosmds2 kernel: LustreError: 30617:0:
> > (pack_generic.c:655:lustre_shrink_reply_v2()) LBUG
>
> Which always leads to an LBUG which is what is putting the thread to
> sleep.
>
> Any time you see an LBUG in a server log file, you need to reboot the
> server.
>
> So now you need to take that ASSERTION message to our bugzilla and see
> if you can find a bug for already, and if not, file a new one, please.
>
> Cheers,
> b.

Thanks for your fast reply. I think bug #20020 is the one we hit.
We are waiting for a solution.
Greetings,


Re: [Lustre-discuss] Setup mail cluster

2009-10-12 Thread Michael Schwartzkopff
On Monday, 12 October 2009 at 15:54:04, Vadym wrote:
> Hello
> I'm do a schema of mail service so I have only one question:
> Can Lustre provide me full automatic failover solution?

No. See the Lustre manual for this; you need a separate cluster solution for it.
The manual is *hopelessly* outdated at this point. Do NOT use heartbeat any
more. Use pacemaker as the cluster manager. See www.clusterlabs.org.

When I find some time I want to write a HOWTO about setting up a Lustre cluster
with pacemaker and OpenAIS.

> I plan to use for storage the standard servers with 1GE links. I need
> automatic solution as possible.
> E.g. RAID5 functionality, when one or more storage node down user data
> still accessible. So if I have 100TB of disk storage I can serve 50TB of
> data in failover mode with no downtime. Can you provide me more
> information?

Use a bond device for the cluster interconnect! It is safer.

Use DRBD for replication of the data if you use direct-attached storage.

DRBD can operate on top of LVM, so you can have that functionality as well.
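
A minimal sketch of putting DRBD on top of LVM (the volume group, volume name,
size and DRBD resource name here are only example values):

  # create a logical volume that will back the DRBD device
  lvcreate -L 500G -n lv_mail vg_data
  # after defining a matching resource "r0" in drbd.conf on both nodes:
  drbdadm create-md r0
  drbdadm up r0
  # on the node that should become primary first:
  drbdadm -- --overwrite-data-of-peer primary r0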

Perhaps you can also try clustered LVM; it has some nice features.

Or just use ZFS, which offers all this.



[Lustre-discuss] Problem re-mounting Lustre on an other node

2009-10-14 Thread Michael Schwartzkopff
Hi,

we have a Lustre 1.8 cluster with openais and pacemaker as the cluster
manager. When I migrate one Lustre resource from one node to another node I
get an error. Stopping Lustre on one node is no problem, but the node where
Lustre should start says:

Oct 14 09:54:28 sososd6 kernel: kjournald starting.  Commit interval 5 seconds
Oct 14 09:54:28 sososd6 kernel: LDISKFS FS on dm-4, internal journal
Oct 14 09:54:28 sososd6 kernel: LDISKFS-fs: recovery complete.
Oct 14 09:54:28 sososd6 kernel: LDISKFS-fs: mounted filesystem with ordered 
data mode.
Oct 14 09:54:28 sososd6 multipathd: dm-4: umount map (uevent)
Oct 14 09:54:39 sososd6 kernel: kjournald starting.  Commit interval 5 seconds
Oct 14 09:54:39 sososd6 kernel: LDISKFS FS on dm-4, internal journal
Oct 14 09:54:39 sososd6 kernel: LDISKFS-fs: mounted filesystem with ordered 
data mode.
Oct 14 09:54:39 sososd6 kernel: LDISKFS-fs: file extents enabled
Oct 14 09:54:39 sososd6 kernel: LDISKFS-fs: mballoc enabled
Oct 14 09:54:39 sososd6 kernel: Lustre: mgc134.171.16@tcp: Reactivating 
import
Oct 14 09:54:45 sososd6 kernel: LustreError: 137-5: UUID 'segfs-OST_UUID' 
is not available  for connect (no target)
Oct 14 09:54:45 sososd6 kernel: LustreError: Skipped 3 previous similar 
messages
Oct 14 09:54:45 sososd6 kernel: LustreError: 31334:0:
(ldlm_lib.c:1850:target_send_reply_msg()) @@@ processing error (-19)  
r...@810225fcb800 x334514011/t0 o8->@:0/0 lens 368/0 e 0 to 0 dl 
1255506985 ref 1 fl Interpret:/0/0 rc -19/0
Oct 14 09:54:45 sososd6 kernel: LustreError: 31334:0:
(ldlm_lib.c:1850:target_send_reply_msg()) Skipped 3 previous similar messages

These logs continue until the cluster software times out and the resource
reports the error. Any help understanding these logs? Thanks.



[Lustre-discuss] Understanding of MMP

2009-10-19 Thread Michael Schwartzkopff
Hi,

perhaps I have a problem understanding multiple mount protection (MMP). I have
a cluster. When a failover happens I sometimes get the log entry:

Oct 19 15:16:08 sososd7 kernel: LDISKFS-fs warning (device dm-2): 
ldiskfs_multi_mount_protect: Device is already active on another node.
Oct 19 15:16:08 sososd7 kernel: LDISKFS-fs warning (device dm-2): 
ldiskfs_multi_mount_protect: MMP failure info: last update time: 1255958168, 
last update node: sososd3, last update device: dm-2

Does the second line mean that my node (sososd7) tried to mount /dev/dm-2 but 
MMP prevented it from doing so because the last update from the old node 
(sososd3) was too recent?

From the manuals I found an MMP time of 109 seconds. Is it correct that after
the umount the next node cannot mount the same filesystem within 10 seconds?

So the solution would be to wait for 10 seconds before mounting the resource
on the next node. Is this correct?
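
For reference, one way to look at the on-disk MMP information, assuming the
e2fsprogs shipped with Lustre exposes the MMP fields through dumpe2fs, would be:

  # print the superblock header and filter for the MMP fields (device is an example)
  dumpe2fs -h /dev/dm-2 2>/dev/null | grep -i mmp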

Thanks.



Re: [Lustre-discuss] Understanding of MMP

2009-10-19 Thread Michael Schwartzkopff
On Monday, 19 October 2009 at 20:42:19, you wrote:
> On Monday 19 October 2009, Andreas Dilger wrote:
> > On 19-Oct-09, at 08:46, Michael Schwartzkopff wrote:
> > > perhaps I have a problem understanding multiple mount protection
> > > MMP. I have a
> > > cluster. When a failover happens sometimes I get the log entry:
> > >
> > > Oct 19 15:16:08 sososd7 kernel: LDISKFS-fs warning (device dm-2):
> > > ldiskfs_multi_mount_protect: Device is already active on another node.
> > > Oct 19 15:16:08 sososd7 kernel: LDISKFS-fs warning (device dm-2):
> > > ldiskfs_multi_mount_protect: MMP failure info: last update time:
> > > 1255958168,
> > > last update node: sososd3, last update device: dm-2
> > >
> > > Does the second line mean that my node (sososd7) tried to mount /dev/
> > > dm-2 but
> > > MMP prevented it from doing so because the last update from the old
> > > node
> > > (sososd3) was too recent?
> >
> > The update time stored in the MMP block is purely for informational
> > purposes.  It actually uses a sequence counter that has nothing to do
> > with the system clock on either of the nodes (since they may not be in
> > sync).
> >
> > What that message actually means is that sososd7 tried to mount the
> > filesystem on dm-2 (which likely has another "LVM" name that the kernel
> > doesn't know anything about) but the MMP block on the disk was modified
> > by sososd3 AFTER sososd7 first looked at it.
>
> Probably, bug#19566. Michael, which Lustre version do you exactly use?
>
>
> Thanks,
> Bernd

I am running version 1.8.1.1, which was published last week. Is the fix
included there, or only in 1.8.2?

Greetings,


Re: [Lustre-discuss] how to define 60 failnodes

2009-11-09 Thread Michael Schwartzkopff
On Monday, 9 November 2009 at 16:36:15, Bernd Schubert wrote:
> On Monday 09 November 2009, Brian J. Murrell wrote:
> > Theoretically.  I had discussed this briefly with another engineer a
> > while ago and IIRC, the result of the discussion was that there was
> > nothing inherent in the configuration logic that would prevent one from
> > having more than two ("primary" and "failover") OSSes providing service
> > to an OST.  Two nodes per OST is how just about everyone that wants
> > failover configures Lustre.
>
> Not everyone ;) And especially it doesn't make sense to have a 2 node
> failover scheme with pacemaker:
>
> https://bugzilla.lustre.org/show_bug.cgi?id=20964

The problem is that pacemaker does not understand the applications it clusters.
Pacemaker is made to provide high availability for ANY service, not only for a
cluster FS.

So if you want to pin a resource (e.g. FS1) to a particular node, you have to
add a location constraint. But this contradicts the logic of pacemaker a
little bit: why should a resource run on this node if all nodes are equal?

Basically I had the same problem with my Lustre cluster, and I used the
following solution:

- add colocation constraints so that the filesystems prefer not to run on the
same node (see the sketch below).
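
A minimal sketch with the crm shell (resource and node names are placeholders):

  # keep two filesystem resources apart where possible (negative colocation score)
  crm configure colocation col_fs1_fs2 -100: resFS1 resFS2
  # optionally give one of them a preferred "home" node
  crm configure location loc_fs1_home resFS1 50: oss1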

And theoretically, with openais as the cluster stack, the number of nodes is no
longer limited to 16 as it was with heartbeat. You can build larger clusters.

Greetings,



[Lustre-discuss] Implementing MMP correctly

2009-12-22 Thread Michael Schwartzkopff
Hi,

I am trying to understand how to implement MMP correctly in a Lustre failover
cluster.

As far as I understand, MMP protects the same filesystem from being mounted by
different nodes (OSSes) of a failover cluster. So far so good.

If a node was shut down uncleanly, it will still occupy its filesystems via MMP
and thus prevent a clean failover to another node. Now I want to implement a
clean failover in the Filesystem resource agent of pacemaker. Is there a good
way to solve the problem with MMP? Possible solutions are:

- Disable the MMP feature in the cluster entirely, since the resource manager
ensures that the same resource is only mounted once in the cluster.

- Do a "tunefs -O ^mmp " and a "tunefs -O mmp " before every 
mounting of a resource?

- Do a "sleep 10" before mounting a resource? But the manual says "the file 
system mount require additional time if the file system was not cleanly 
unmounted."

- Check whether the file system is still in use by another OSS via MMP and
wait a little bit longer? How do I do this?
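
One rough, untested sketch along the lines of this last option, simply retrying
the mount until the MMP delay has expired (device and mount point are
placeholders):

  # retry the Lustre mount a few times so an expiring MMP window does not fail the start
  for i in 1 2 3; do
      mount -t lustre /dev/mapper/ost1 /mnt/ost1 && break
      sleep 15
  done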

Please mail me any ideas. Thanks.



Re: [Lustre-discuss] What HA Software to use with Lustre

2010-01-14 Thread Michael Schwartzkopff
On Friday, 15 January 2010 at 00:48:53, Jagga Soorma wrote:
> Hi Guys,
>
> I am setting up our new Lustre environment and was wondering what is the
> recommended (stable) HA clustering software to use with the MDS and OSS
> failover.  Any input would be greatly appreciated.
>
> Thanks,
> -J

The docs describe heartbeat, but that software is not recommended any more,
neither heartbeat version 1 nor heartbeat version 2. Instead, the openais and
pacemaker projects have replaced the functionality of heartbeat. For the new
projects please see
www.clusterlabs.org

An introduction to pacemaker can be found at:
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/index.html
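
As a rough illustration, a single Lustre target under pacemaker could look
something like this with the stock Filesystem agent (device, directory and
timeouts are only example values):

  # one OST managed by pacemaker via the generic Filesystem resource agent
  crm configure primitive resOST1 ocf:heartbeat:Filesystem \
      params device=/dev/mapper/ost1 directory=/mnt/ost1 fstype=lustre \
      op monitor interval=120s timeout=60s \
      op start timeout=300s op stop timeout=300s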

Greetings,


Re: [Lustre-discuss] What HA Software to use with Lustre

2010-01-15 Thread Michael Schwartzkopff
On Friday, 15 January 2010 at 07:30:13, you wrote:
> > A introduction into pacemaker can be found at:
> > http://www.clusterlabs.org/doc/en-
> > US/Pacemaker/1.0/html/Pacemaker_Explained/index.html
>
> I wish I were aware of the "crm" CLI before trying to take the XML way
> according the link above:
>
>   http://www.clusterlabs.org/doc/crm_cli.html
>
> Cheers,
> Li Wei

We are working on documentation on how to set up Lustre together with
pacemaker. As soon as we are finished, it will show up in the wiki.

Greetings,



Re: [Lustre-discuss] Filesystem monitoring in Heartbeat

2010-01-22 Thread Michael Schwartzkopff
On Thursday, 21 January 2010 at 23:09:37, Bernd Schubert wrote:
> On Thursday 21 January 2010, Adam Gandelman wrote:
(...)
> I guess you want to use the pacemaker agent I posted into this bugzilla:
>
> https://bugzilla.lustre.org/show_bug.cgi?id=20807

Hello,

How far did you get with the development of the agent? Is it somewhat finished?
Publishable?

Greetings,



Re: [Lustre-discuss] Future of LusterFS?

2010-04-21 Thread Michael Schwartzkopff
On Thursday, 22 April 2010 at 08:33:14, Janne Aho wrote:
> Hi,
>
> Today we have a storage system based on NFS, but we are really concerned
> about redundancy and are at the brink to take the step to a cluster file
> system as glusterfs, but we have got suggestions on that lusterfs would
> have been the best option for us, but at the same time those who
> "recommended" lusterfs has said that Oracle has pulled the plug and put
> the resources into OCFS2.
> If using lusterfs in a production environment, it would be good to know
> that it won't be discontinued.
>
> Will there be a long term future for lusterfs?
> Or should we be looking for something else for a long term solution?
>
> Thanks in advance for your reply for my a bit cloudy question.

Hi,

For me, Lustre is a very good option.

But you could also consider a system composed of:
- corosync for the cluster communication
- pacemaker as the cluster resource manager
- DRBD for the replication of data between the nodes of a cluster

and

- NFS
or
- OCFS2 or GFS or ...

Especially the NFS option provides you with a highly available NFS server on a
real cluster stack, all managed by pacemaker (a rough sketch follows below).
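
A very rough, untested sketch of how the NFS variant could be wired up with the
crm shell (the DRBD resource "r0", device, directory, IP and all resource names
are placeholders):

  # DRBD as a master/slave resource
  crm configure primitive resDRBD ocf:linbit:drbd params drbd_resource=r0 \
      op monitor interval=30s role=Slave op monitor interval=20s role=Master
  crm configure ms msDRBD resDRBD meta master-max=1 clone-max=2 notify=true
  # the exported filesystem and the service IP, grouped together
  crm configure primitive resFS ocf:heartbeat:Filesystem \
      params device=/dev/drbd0 directory=/srv/nfs fstype=ext3
  crm configure primitive resIP ocf:heartbeat:IPaddr2 params ip=192.168.1.100
  crm configure group grpNFS resFS resIP
  # run the group only where DRBD is master, and only after it has been promoted
  crm configure colocation col_nfs_on_drbd inf: grpNFS msDRBD:Master
  crm configure order ord_drbd_before_nfs inf: msDRBD:promote grpNFS:start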

Greetings,


[Lustre-discuss] Searching for a speaker about lustre

2010-04-29 Thread Michael Schwartzkopff
Hi,

together with a friend of mine I wanted to deliver a talk about Lustre at the
Open Source in Data Centers Conference. See:
http://www.netways.de/en/osdc/osdc_2010

Due to a health problem my friend cannot attend the conference, and now I am
looking for a replacement. Is there anyone on the list who wants to see the
town of Nürnberg and deliver a nice talk about Lustre in June? We could split
the talk into two parts if desired.

Greetings,
 