Re: [Lustre-discuss] Multi-Role/Tasking MDS/OSS Hosts

2010-09-20 Thread Kevin Van Maren
Bernd Schubert wrote:
> Hello Cory,
>
> On 09/17/2010 11:31 PM, Cory Spitz wrote:
>   
>> Hi, Bernd.
>>
>> On 09/17/2010 02:48 PM, Bernd Schubert wrote:
>> 
>>> On Friday, September 17, 2010, Andreas Dilger wrote:
>>>> On 2010-09-17, at 12:42, Jonathan B. Horen wrote:
>>>>> We're trying to architect a Lustre setup for our group, and want to
>>>>> leverage our available resources. In doing so, we've come to consider
>>>>> multi-purposing several hosts, so that they'll function simultaneously
>>>>> as MDS & OSS.
>>>>
>>>> You can't do this and expect recovery to work in a robust manner.  The
>>>> reason is that the MDS is a client of the OSS, and if they are both on the
>>>> same node that crashes, the OSS will wait for the MDS "client" to
>>>> reconnect and will time out recovery of the real clients.
>>>
>>> Well, that is some kind of design problem. Even on separate nodes it can
>>> easily happen that both MDS and OSS fail, for example due to a power outage
>>> of the storage rack. In my experience, situations like that happen
>>> frequently...
>>>
>>>   
>> I think that just argues that the MDS should be on a separate UPS.
>> 

Or dual-redundant UPS devices driving all "critical infrastructure".
Redundant power supplies are the norm for server-class hardware, and they
should be cabled to different circuits (which each need to be sized to
sustain the maximum power).

> Well, there is not only a single reason. The next hardware issue is that
> an IB switch might fail.

Sure, but that's also easy to address (in theory): put OSS nodes on
different leaf switches than MDS nodes, and put the failover pairs on
different switches as well.

In practice, IB switches probably do not fail often enough to worry about
recovery glitches, especially if they have redundant power, but I certainly
recommend that failover partners be on different switch chips so that in
case of a failure it is still possible to get the system up.

I would also recommend using bonded network interfaces to avoid
cable-failure issues (i.e., connect both OSS nodes to both of the leaf
switches, rather than one to each), but there are some outstanding issues
with Lustre on IB bonding (patches in bugzilla), and of course multipath to
disk (loss of connectivity to disk was mentioned at LUG as one of the
biggest causes of Lustre issues).  In general it is easier to have redundant
cables than to ensure your HA package properly monitors cable status and
does a failover when required.

> And we have also seen cascading Lustre failures: it starts with an LBUG on
> the OSS, which triggers another problem on the MDS...
>   
Yes, that's why bugs are fixed.  panic_on_lbug may help stop the problem
before it spreads, depending on the issue.

> Also, for us this will actually become a real problem which cannot be
> easily solved, so this issue will become a DDN priority.
>
>
> Cheers,
> Bernd
>
> --
> Bernd Schubert
> DataDirect Networks
>
>   



Re: [Lustre-discuss] Multi-Role/Tasking MDS/OSS Hosts

2010-09-17 Thread Robert Read
Hi,

On Sep 17, 2010, at 12:48 , Bernd Schubert wrote:

> On Friday, September 17, 2010, Andreas Dilger wrote:
>> On 2010-09-17, at 12:42, Jonathan B. Horen wrote:
>>> We're trying to architect a Lustre setup for our group, and want to
>>> leverage our available resources. In doing so, we've come to consider
>>> multi-purposing several hosts, so that they'll function simultaneously
>>> as MDS & OSS.
>> 
>> You can't do this and expect recovery to work in a robust manner.  The
>> reason is that the MDS is a client of the OSS, and if they are both on the
>> same node that crashes, the OSS will wait for the MDS "client" to
>> reconnect and will time out recovery of the real clients.
> 
> Well, that is some kind of design problem. Even on separate nodes it can
> easily happen that both MDS and OSS fail, for example due to a power outage
> of the storage rack. In my experience, situations like that happen
> frequently...
> 
> I think some kind of pre-connection would be required, where a client can
> tell a server that it was rebooted and that the server should not wait any
> longer for it. Actually, it shouldn't be that difficult, as different
> connection flags already exist. So if the client contacts a server and asks
> for an initial connection, the server could check for that NID and then
> immediately abort recovery for that client.

This is an interesting idea, but NID is not ideal as this wouldn't be compatible
with multiple mounts on the same node.  Not very useful in production, perhaps,
but very useful for testing.

Another option would be to hash the mount point pathname (and some other
data, such as the NID) and use this as the client uuid.  Then the client
uuid would be persistent across reboots and the server would rely on flags
to detect if this was a reconnect or a new connection after a reboot or
remount.
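
For illustration, a rough Python sketch of deriving such a persistent client
uuid; the hash choice, formatting, and example NID are assumptions for the
sake of the sketch, not existing Lustre code:

import hashlib

def persistent_client_uuid(mount_point: str, nid: str) -> str:
    """Derive a client uuid that survives reboots/remounts of the same
    mount point on the same node (sketch of the idea discussed above)."""
    # Hash the mount point pathname together with the NID so that two
    # mounts on the same node (different paths) still get distinct uuids.
    digest = hashlib.sha1(f"{nid}:{mount_point}".encode()).hexdigest()
    # Format the first 32 hex digits as a conventional-looking uuid string.
    return "-".join([digest[0:8], digest[8:12], digest[12:16],
                     digest[16:20], digest[20:32]])

# Same inputs always give the same uuid, so after a reboot or remount the
# server could recognize this as the "same" client rather than a new one.
print(persistent_client_uuid("/mnt/lustre", "192.168.1.10@o2ib"))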

robert



Re: [Lustre-discuss] Multi-Role/Tasking MDS/OSS Hosts

2010-09-17 Thread Robert Read
Hi,

On Sep 17, 2010, at 14:49 , Bernd Schubert wrote:

> Hello Cory,
> 
> On 09/17/2010 11:31 PM, Cory Spitz wrote:
>> Hi, Bernd.
>> 
>> On 09/17/2010 02:48 PM, Bernd Schubert wrote:
>>> On Friday, September 17, 2010, Andreas Dilger wrote:
>>>> On 2010-09-17, at 12:42, Jonathan B. Horen wrote:
>>>>> We're trying to architect a Lustre setup for our group, and want to
>>>>> leverage our available resources. In doing so, we've come to consider
>>>>> multi-purposing several hosts, so that they'll function simultaneously
>>>>> as MDS & OSS.
>>>>
>>>> You can't do this and expect recovery to work in a robust manner.  The
>>>> reason is that the MDS is a client of the OSS, and if they are both on the
>>>> same node that crashes, the OSS will wait for the MDS "client" to
>>>> reconnect and will time out recovery of the real clients.
>>> 
>>> Well, that is some kind of design problem. Even on separate nodes it can
>>> easily happen that both MDS and OSS fail, for example due to a power outage
>>> of the storage rack. In my experience, situations like that happen
>>> frequently...
>>> 
>> 
>> I think that just argues that the MDS should be on a separate UPS.
> 
> Well, there is not only a single reason. The next hardware issue is that an
> IB switch might fail. And we have also seen cascading Lustre failures: it
> starts with an LBUG on the OSS, which triggers another problem on the MDS...
> Also, for us this will actually become a real problem which cannot be easily
> solved, so this issue will become a DDN priority.

There is always a possibility that multiple failures will occur, and this
possibility can be reduced depending on one's resources. The point here is
simply that a configuration with an MDS and OSS on the same node will
guarantee multiple failures and aborted OSS recovery when that node fails.

cheers,
robert



Re: [Lustre-discuss] Multi-Role/Tasking MDS/OSS Hosts

2010-09-17 Thread Bernd Schubert
Hello Cory,

On 09/17/2010 11:31 PM, Cory Spitz wrote:
> Hi, Bernd.
> 
> On 09/17/2010 02:48 PM, Bernd Schubert wrote:
>> On Friday, September 17, 2010, Andreas Dilger wrote:
>>> On 2010-09-17, at 12:42, Jonathan B. Horen wrote:
>>>> We're trying to architect a Lustre setup for our group, and want to
>>>> leverage our available resources. In doing so, we've come to consider
>>>> multi-purposing several hosts, so that they'll function simultaneously
>>>> as MDS & OSS.
>>>
>>> You can't do this and expect recovery to work in a robust manner.  The
>>> reason is that the MDS is a client of the OSS, and if they are both on the
>>> same node that crashes, the OSS will wait for the MDS "client" to
>>> reconnect and will time out recovery of the real clients.
>>
>> Well, that is some kind of design problem. Even on separate nodes it can
>> easily happen that both MDS and OSS fail, for example due to a power outage
>> of the storage rack. In my experience, situations like that happen
>> frequently...
>>
> 
> I think that just argues that the MDS should be on a separate UPS.

Well, there is not only a single reason. The next hardware issue is that an
IB switch might fail. And we have also seen cascading Lustre failures: it
starts with an LBUG on the OSS, which triggers another problem on the MDS...
Also, for us this will actually become a real problem which cannot be easily
solved, so this issue will become a DDN priority.


Cheers,
Bernd

--
Bernd Schubert
DataDirect Networks





Re: [Lustre-discuss] Multi-Role/Tasking MDS/OSS Hosts

2010-09-17 Thread Cory Spitz
Hi, Bernd.

On 09/17/2010 02:48 PM, Bernd Schubert wrote:
> On Friday, September 17, 2010, Andreas Dilger wrote:
>> On 2010-09-17, at 12:42, Jonathan B. Horen wrote:
>>> We're trying to architect a Lustre setup for our group, and want to
>>> leverage our available resources. In doing so, we've come to consider
>>> multi-purposing several hosts, so that they'll function simultaneously
>>> as MDS & OSS.
>>
>> You can't do this and expect recovery to work in a robust manner.  The
>> reason is that the MDS is a client of the OSS, and if they are both on the
>> same node that crashes, the OSS will wait for the MDS "client" to
>> reconnect and will time out recovery of the real clients.
> 
> Well, that is some kind of design problem. Even on separate nodes it can
> easily happen that both MDS and OSS fail, for example due to a power outage
> of the storage rack. In my experience, situations like that happen
> frequently...
> 

I think that just argues that the MDS should be on a separate UPS.

> I think some kind of pre-connection would be required, where a client can
> tell a server that it was rebooted and that the server should not wait any
> longer for it. Actually, it shouldn't be that difficult, as different
> connection flags already exist. So if the client contacts a server and asks
> for an initial connection, the server could check for that NID and then
> immediately abort recovery for that client.
> 
> 
> Cheers,
> Bernd
> 
> 


Re: [Lustre-discuss] Multi-Role/Tasking MDS/OSS Hosts

2010-09-17 Thread Bernd Schubert
On Friday, September 17, 2010, Andreas Dilger wrote:
> On 2010-09-17, at 12:42, Jonathan B. Horen wrote:
> > We're trying to architect a Lustre setup for our group, and want to
> > leverage our available resources. In doing so, we've come to consider
> > multi-purposing several hosts, so that they'll function simultaneously
> > as MDS & OSS.
> 
> You can't do this and expect recovery to work in a robust manner.  The
> reason is that the MDS is a client of the OSS, and if they are both on the
> same node that crashes, the OSS will wait for the MDS "client" to
> reconnect and will time out recovery of the real clients.

Well, that is some kind of design problem. Even on separate nodes it can
easily happen that both MDS and OSS fail, for example due to a power outage
of the storage rack. In my experience, situations like that happen
frequently...

I think some kind of pre-connection would be required, where a client can
tell a server that it was rebooted and that the server should not wait any
longer for it. Actually, it shouldn't be that difficult, as different
connection flags already exist. So if the client contacts a server and asks
for an initial connection, the server could check for that NID and then
immediately abort recovery for that client.
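
To make the idea concrete, a rough Python sketch of the proposed server-side
behaviour; the class, flag name, and NIDs below are made up for illustration
and are not actual Lustre internals:

# Hypothetical sketch: when a client asks for an *initial* connection, any
# stale recovery state for that NID is dropped instead of making the other
# clients wait for its replay.

class RecoveryManager:
    def __init__(self):
        # NIDs of clients the server is still waiting for during recovery.
        self.pending_nids = set()

    def handle_connect(self, nid: str, initial_connection: bool) -> None:
        if initial_connection and nid in self.pending_nids:
            # The client says it has rebooted, so it can never replay its
            # old requests: abort recovery for it immediately.
            self.pending_nids.discard(nid)
            print(f"aborting recovery for rebooted client {nid}")
        elif not initial_connection:
            # A normal reconnect: the client will take part in replay.
            print(f"client {nid} reconnected for recovery")

mgr = RecoveryManager()
mgr.pending_nids.add("10.0.0.5@o2ib")
mgr.handle_connect("10.0.0.5@o2ib", initial_connection=True)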


Cheers,
Bernd


-- 
Bernd Schubert
DataDirect Networks


Re: [Lustre-discuss] Multi-Role/Tasking MDS/OSS Hosts

2010-09-17 Thread Andreas Dilger
On 2010-09-17, at 13:19, Brian J. Murrell wrote:
> On Fri, 2010-09-17 at 11:10 -0800, Jonathan B. Horen wrote:
>> Yes, but... how, then, am I to view SAN storage devices?
> 
> I'm not sure.  Whatever you can do to present the devices to Linux as
> block devices.  Lustre OSTs and MDTs are Linux block devices.

I think the more informative answer is that OST is the Lustre name for the
ext4-like filesystem on a Linux block device (regardless of what the
underlying storage is).  The OSS is the Lustre name for the node to which
these block devices are attached.

>> Am I correct in thinking that these SAN storage devices would be networked 
>> to one-or-more OSSes?
> 
> They need to make themselves available as block device(s).

Yes, the SAN devices need to be attached to at least one OSS, but preferably 
two OSSes to provide high availability.  We recommend against connecting the 
SAN devices to all of the OSS/MDS nodes, because this increases the 
configuration complexity and risk of an administrative error, and provides no 
benefit.

>> Did I misunderstand that RHEL5 sports Lustre-support already in the kernel? 
> 
> Yes, I'm afraid you did.  You will need to either use one of our pre-built
> kernels (we currently build for RHEL5, OEL5, SLES10 and SLES11) or patch
> your own kernel with the patches.

The patches are needed on the Lustre server kernel, but are not needed on the 
client.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.



Re: [Lustre-discuss] Multi-Role/Tasking MDS/OSS Hosts

2010-09-17 Thread Andreas Dilger
On 2010-09-17, at 12:42, Jonathan B. Horen wrote:
> We're trying to architect a Lustre setup for our group, and want to leverage 
> our available resources. In doing so, we've come to consider multi-purposing 
> several hosts, so that they'll function simultaneously as MDS & OSS.

You can't do this and expect recovery to work in a robust manner.  The reason 
is that the MDS is a client of the OSS, and if they are both on the same node 
that crashes, the OSS will wait for the MDS "client" to reconnect and will time 
out recovery of the real clients.


Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.



Re: [Lustre-discuss] Multi-Role/Tasking MDS/OSS Hosts

2010-09-17 Thread Brian J. Murrell
On Fri, 2010-09-17 at 11:10 -0800, Jonathan B. Horen wrote:
> Thanks very much!

NP.

> Yes, but... how, then, am I to view SAN storage devices?

I'm not sure.  Whatever you can do to present the devices to Linux as
block devices.  Lustre OSTs and MDTs are Linux block devices.

> they're already in RAID-6 arrays, with PVs, VGs, and LVs,

An LV (LVM, if that's what you are referring to) is a block device and can
be used as an OST or MDT.
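
For example, formatting an existing LV as an OST might look something like
the following sketch (shown here in Python as a dry run; the device path,
fsname, and MGS NID are placeholders, and the mkfs.lustre options should be
checked against your Lustre version before doing this for real):

import subprocess

# Hypothetical example: turn an existing LVM logical volume into an OST.
cmd = [
    "mkfs.lustre",
    "--ost",                      # format this block device as an OST
    "--fsname=lustre1",           # filesystem name (placeholder)
    "--mgsnode=mds1@o2ib",        # NID of the MGS/MDS node (placeholder)
    "/dev/vg_ost/lv_ost0",        # the LVM logical volume (placeholder)
]
print(" ".join(cmd))              # dry run: only show the command
# subprocess.run(cmd, check=True)  # uncomment to actually format the device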

> Am I correct in thinking that these SAN storage devices would be networked
> to one-or-more OSSes?

They need to make themselves available as block device(s).

> Did I misunderstand that RHEL5 sports Lustre-support already in the kernel? 

Yes, I'm afraid you did.  You will need to either use one of our pre-built
kernels (we currently build for RHEL5, OEL5, SLES10 and SLES11) or patch
your own kernel with the patches.

b.





Re: [Lustre-discuss] Multi-Role/Tasking MDS/OSS Hosts

2010-09-17 Thread Jonathan B. Horen
Thanks very much!

On Fri, Sep 17, 2010 at 10:52 AM, Brian J. Murrell  wrote:

> On Fri, 2010-09-17 at 10:42 -0800, Jonathan B. Horen wrote:
> >
> > Background: Our OSTs
>
> OSSes.  OSTs are the disks that an OSS provides object service with.
>

Yes, but... how, then, am I to view SAN storage devices? Sure, the disks are
the OSTs, but these aren't JBODs hooked-up to a host's SCSI/SATA/SAS
backplane... they're already in RAID-6 arrays, with PVs, VGs, and LVs,
holding real user data, which are managed by the NexSan/FalconStor software
(on top of a Linux OS).

Am I correct in thinking that these SAN storage devices would be networked
to one-or-more OSSes?

Admittedly, I find it somewhat confusing.


> > Primary MDS would be a 72-cpu IBM x3950m2, which would
> > also be an OSS.
>
> MDS and OSS on the same node is an unsupported configuration due to the
> fact that if it fails you will have a "double failure" and recovery
> cannot be performed.
>
> > Secondary MDS would be a 2-cpu Penguin Computing Altus-1300,
> > which would also be an OSS.
>
> Ditto.
>
> > Are there basic conflicts-of-interest, and/or known/potential "gotchas"
> in
> > utilizing hosts in such multi-purpose roles?
>
> OSSes and MDSes require a kernel patched for Lustre.  So you'd need to
> be able to either replace the kernel on those existing machines or patch
> the source you built it from.
>

Did I misunderstand that RHEL5 sports Lustre-support already in the kernel?


Re: [Lustre-discuss] Multi-Role/Tasking MDS/OSS Hosts

2010-09-17 Thread Brian J. Murrell
On Fri, 2010-09-17 at 10:42 -0800, Jonathan B. Horen wrote: 
> 
> Background: Our OSTs

OSSes.  OSTs are the disks that an OSS provides object service with.

> Primary MDS would be a 72-cpu IBM x3950m2, which would
> also be an OSS.

MDS and OSS on the same node is an unsupported configuration due to the
fact that if it fails you will have a "double failure" and recovery
cannot be performed.

> Secondary MDS would be a 2-cpu Penguin Computing Altus-1300,
> which would also be an OSS.

Ditto.

> Are there basic conflicts-of-interest, and/or known/potential "gotchas" in
> utilizing hosts in such multi-purpose roles? 

OSSes and MDSes require a kernel patched for Lustre.  So you'd need to
be able to either replace the kernel on those existing machines or patch
the source you built it from.

Generally speaking, the performance you get will of course be limited by how
many resources the Lustre services on those shared nodes can actually
obtain.  Our usual recommendation is to dedicate OSS
and MDS nodes for this reason, but there is no hard rule that you must
provide dedicated nodes, so long as everything else on the nodes can
live with the patched Lustre kernel.  If you are already patching those
kernels for something else, you could run into conflicts trying to patch
them for Lustre.

b.





[Lustre-discuss] Multi-Role/Tasking MDS/OSS Hosts

2010-09-17 Thread Jonathan B. Horen
We're trying to architect a Lustre setup for our group, and want to leverage
our available resources. In doing so, we've come to consider multi-purposing
several hosts, so that they'll function simultaneously as MDS & OSS.

Background: Our OSTs are a NexSan SATAboy (10T), a NexSan SATAbeast (30T), a
FalconStor NSS650 (32T), and a FalconStor NSS620 (32T) -- all have multiple
iSCSI interfaces. Primary MDS would be a 72-cpu IBM x3950m2, which would
also be an OSS. Secondary MDS would be a 2-cpu Penguin Computing Altus-1300,
which would also be an OSS. A 2-cpu Dell PowerEdge 1425 would be our third
OSS. The IBM x3950m2 also functions as a heavily-used compute cluster (70
dedicated cpus, which could/would be reduced by the number of cpus to be
dedicated to MDS and OSS needs). We have most of the infrastructure already
in-place for Infiniband networking.

Are there basic conflicts-of-interest, and/or known/potential "gotchas" in
utilizing hosts in such multi-purpose roles?

-- 
JONATHAN B. HOREN
Systems Administrator
UAF Life Science Informatics
Center for Research Services
(907) 474-2742
jbho...@alaska.edu
http://biotech.inbre.alaska.edu