Yeah, since this is our first experience using failover with lustre, we are just doing manual failover for now. But we may implement the corosync/pacemaker/stonith setup in the future.
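(For reference, "manual failover" with a ZFS backend is essentially an export/import of the pool plus a remount of the target. A minimal sketch for the first OST, using the pool and target names from the config quoted below; the mount-point path is only an example:)

# On the server giving up the target (if it is still reachable):
umount /lustre/hpfs-fsl-OST0000
zpool export oss00-0

# On its failover partner:
zpool import oss00-0        # add -f if the old owner went down without exporting
mount -t lustre oss00-0/ost-fsl /lustre/hpfs-fsl-OST0000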
On Feb 10, 2017, at 11:57 PM, Jeff Johnson <jeff.john...@aeoncomputing.com> wrote:

You're also leaving out the corosync/pacemaker/stonith configuration. That is, unless you are doing manual export/import of pools.

On Fri, Feb 10, 2017 at 9:03 PM, Vicker, Darby (JSC-EG311) <darby.vicke...@nasa.gov> wrote:

Sure. Our hardware is very similar to this: https://www.supermicro.com/solutions/Lustre.cfm

We are using twin servers instead of two single-chassis servers as shown there, but functionally this is the same; we can just fit more into a single rack with the twin servers. We are using a single JBOD per twin server, as shown in one of the configurations on the above page, and ZFS as the backend. All servers are dual-homed on both Ethernet and IB. The combined MGS/MDS is at the 10.148.0.30 address for IB and X.X.98.30 for Ethernet; the secondary MGS/MDS is on the .31 address for both networks. With the combined MGS/MDS, they both fail over together. This did require a patch from LU-8397 to get the MGS failover to work properly, so we are using 2.9.0 with the LU-8397 patch and are compiling our own server rpms. But that is pretty simple with ZFS since you don't need a patched kernel.

The lustre formatting and configuration bits are below. I'm leaving out the ZFS pool creation but I think you get the idea. I hope that helps.

Darby

if [[ $HOSTNAME == *mds* ]] ; then
    mkfs.lustre \
        --fsname=hpfs-fsl \
        --backfstype=zfs \
        --reformat \
        --verbose \
        --mgs --mdt --index=0 \
        --servicenode=${LUSTRE_LOCAL_TCP_IP}@tcp0,${LUSTRE_LOCAL_IB_IP}@o2ib0 \
        --servicenode=${LUSTRE_PEER_TCP_IP}@tcp0,${LUSTRE_PEER_IB_IP}@o2ib0 \
        metadata/meta-fsl
elif [[ $HOSTNAME == *oss* ]] ; then
    num=`hostname --short | sed 's/hpfs-fsl-//' | sed 's/oss//'`
    num=`printf '%g' $num`
    mkfs.lustre \
        --mgsnode=X.X.98.30@tcp0,10.148.0.30@o2ib0 \
        --mgsnode=X.X.98.31@tcp0,10.148.0.31@o2ib0 \
        --fsname=hpfs-fsl \
        --backfstype=zfs \
        --reformat \
        --verbose \
        --ost --index=$num \
        --servicenode=${LUSTRE_LOCAL_TCP_IP}@tcp0,${LUSTRE_LOCAL_IB_IP}@o2ib0 \
        --servicenode=${LUSTRE_PEER_TCP_IP}@tcp0,${LUSTRE_PEER_IB_IP}@o2ib0 \
        $pool/ost-fsl
fi

/etc/ldev.conf:

#local          foreign/-       label             [md|zfs:]device-path   [journal-path]/-  [raidtab]
hpfs-fsl-mds0   hpfs-fsl-mds1   hpfs-fsl-MDT0000  zfs:metadata/meta-fsl
hpfs-fsl-oss00  hpfs-fsl-oss01  hpfs-fsl-OST0000  zfs:oss00-0/ost-fsl
hpfs-fsl-oss01  hpfs-fsl-oss00  hpfs-fsl-OST0001  zfs:oss01-0/ost-fsl
hpfs-fsl-oss02  hpfs-fsl-oss03  hpfs-fsl-OST0002  zfs:oss02-0/ost-fsl
hpfs-fsl-oss03  hpfs-fsl-oss02  hpfs-fsl-OST0003  zfs:oss03-0/ost-fsl
hpfs-fsl-oss04  hpfs-fsl-oss05  hpfs-fsl-OST0004  zfs:oss04-0/ost-fsl
hpfs-fsl-oss05  hpfs-fsl-oss04  hpfs-fsl-OST0005  zfs:oss05-0/ost-fsl
hpfs-fsl-oss06  hpfs-fsl-oss07  hpfs-fsl-OST0006  zfs:oss06-0/ost-fsl
hpfs-fsl-oss07  hpfs-fsl-oss06  hpfs-fsl-OST0007  zfs:oss07-0/ost-fsl
hpfs-fsl-oss08  hpfs-fsl-oss09  hpfs-fsl-OST0008  zfs:oss08-0/ost-fsl
hpfs-fsl-oss09  hpfs-fsl-oss08  hpfs-fsl-OST0009  zfs:oss09-0/ost-fsl
hpfs-fsl-oss10  hpfs-fsl-oss11  hpfs-fsl-OST000a  zfs:oss10-0/ost-fsl
hpfs-fsl-oss11  hpfs-fsl-oss10  hpfs-fsl-OST000b  zfs:oss11-0/ost-fsl

/etc/modprobe.d/lustre.conf:

options lnet networks=tcp0(enp4s0),o2ib0(ib1)
options ko2iblnd map_on_demand=32
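(The ZFS pool creation is omitted above; purely as an illustration, pools with the names used in the config could be created along these lines. The disk names, vdev layouts and properties here are made up rather than the actual hardware:)

# Mirrored metadata pool on the MDS pair; cachefile=none keeps the nodes from
# auto-importing the pool at boot, which is what you want for a failover pool.
zpool create -o ashift=12 -o cachefile=none metadata mirror sda sdb

# One data pool per OSS (oss00-0, oss01-0, ...), e.g. a single raidz2 vdev.
zpool create -o ashift=12 -o cachefile=none oss00-0 raidz2 sdc sdd sde sdf sdg sdh sdi sdj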
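(And for the corosync/pacemaker/stonith layer mentioned above, which this site is not using yet: the usual pattern is one resource for the zpool and one for the Lustre mount, colocated and ordered, plus fencing. A rough sketch only, assuming pcs with the ocf:heartbeat:ZFS and ocf:heartbeat:Filesystem resource agents; resource names, mount points and fencing options are illustrative and vary by distro and version:)

# Import/export of the pool and the Lustre mount, as an ordered, colocated pair
pcs resource create oss00-pool ocf:heartbeat:ZFS pool=oss00-0 op monitor interval=30s
pcs resource create oss00-ost ocf:heartbeat:Filesystem \
    device=oss00-0/ost-fsl directory=/lustre/hpfs-fsl-OST0000 fstype=lustre \
    op monitor interval=60s
pcs constraint colocation add oss00-ost with oss00-pool INFINITY
pcs constraint order start oss00-pool then oss00-ost

# Fencing is what makes automatic failover safe (no double import of a pool);
# for example IPMI-based STONITH -- exact options depend on the fence-agents version.
pcs stonith create fence-oss00 fence_ipmilan ip=<bmc-address> username=<user> password=<pass> \
    pcmk_host_list=hpfs-fsl-oss00
pcs property set stonith-enabled=true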
"lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>" <lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>> Subject: Re: [lustre-discuss] design to enable kernel updates Darby, Do you mind if I inquire about the setup for your lustre systems? I'm trying to understand how the MGS/MGT is setup for high availability. I understand with OSTs and MDTs where all I really need is to have the failnode set when I do the mkfs.lustre However, as I understand it, you have to use something like pacemaker and drbd to deal with the MGS/MGT. Is this how you approached it? Brian Andrus On 2/6/2017 12:58 PM, Vicker, Darby (JSC-EG311) wrote: > Agreed. We are just about to go into production on our next LFS with the > setup described. We had to get past a bug in the MGS failover for > dual-homed servers but as of last week that is done and everything is > working great (see "MGS failover problem" thread on this mailing list from > this month and last). We are in the process of syncing our existing LFS > to this new one and I've failed over/rebooted/upgraded the new LFS servers > many times now to make sure we can do this in practice when the new LFS goes > into production. Its working beautifully. > > Many thanks to the lustre developers for their continued efforts. We have > been using and have been fans of lustre for quite some time now and it > just keeps getting better. > > -----Original Message----- > From: lustre-discuss > <lustre-discuss-boun...@lists.lustre.org<mailto:lustre-discuss-boun...@lists.lustre.org>> > on behalf of Ben Evans <bev...@cray.com<mailto:bev...@cray.com>> > Date: Monday, February 6, 2017 at 2:22 PM > To: Brian Andrus <toomuc...@gmail.com<mailto:toomuc...@gmail.com>>, > "lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>" > <lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>> > Subject: Re: [lustre-discuss] design to enable kernel updates > > It's certainly possible. When I've done that sort of thing, you upgrade > the OS on all the servers first, boot half of them (the A side) to the new > image, all the targets will fail over to the B servers. Once the A side > is up, reboot the B half to the new OS. Finally, do a failback to the > "normal" running state. > > At least when I've done it, you'll want to do the failovers manually so > the HA infrastructure doesn't surprise you for any reason. > > -Ben > > On 2/6/17, 2:54 PM, "lustre-discuss on behalf of Brian Andrus" > <lustre-discuss-boun...@lists.lustre.org<mailto:lustre-discuss-boun...@lists.lustre.org> > on behalf of toomuc...@gmail.com<mailto:toomuc...@gmail.com>> > wrote: > >> All, >> >> I have been contemplating how lustre could be configured such that I >> could update the kernel on each server without downtime. >> >> It seems this is _almost_ possible when you have a san system so you >> have failover for OSTs and MDTs. BUT the MGS/MGT seems to be the >> problematic one, since rebooting that seems cause downtime that cannot >> be avoided. >> >> If you have a system where the disks are physically part of the OSS >> hardware, you are out of luck. The hypothetical scenario I am using is >> if someone had a VM that was a qcow image on a lustre mount (basically >> an active, open file being read/written to continuously). How could >> lustre be built to ensure anyone on the VM would not notice a kernel >> upgrade to the underlying lustre servers. >> >> >> Could such a setup be done? 
> On 2/6/17, 2:54 PM, "lustre-discuss on behalf of Brian Andrus" <lustre-discuss-boun...@lists.lustre.org on behalf of toomuc...@gmail.com> wrote:
>
>> All,
>>
>> I have been contemplating how lustre could be configured such that I
>> could update the kernel on each server without downtime.
>>
>> It seems this is _almost_ possible when you have a SAN system, so you
>> have failover for OSTs and MDTs. BUT the MGS/MGT seems to be the
>> problematic one, since rebooting that seems to cause downtime that cannot
>> be avoided.
>>
>> If you have a system where the disks are physically part of the OSS
>> hardware, you are out of luck. The hypothetical scenario I am using is
>> a VM whose qcow image is on a lustre mount (basically an active, open
>> file being read from and written to continuously). How could lustre be
>> built to ensure anyone on the VM would not notice a kernel upgrade to
>> the underlying lustre servers?
>>
>> Could such a setup be done? It seems that would be a better use case for
>> something like GPFS or Gluster, but being a die-hard lustre enthusiast,
>> I want to at least show it could be done.
>>
>> Thanks in advance,
>>
>> Brian Andrus

--
------------------------------
Jeff Johnson
Co-Founder
Aeon Computing

jeff.john...@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001   f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite D - San Diego, CA 92117
High-Performance Computing / Lustre Filesystems / Scale-out Storage
_______________________________________________ lustre-discuss mailing list lustre-discuss@lists.lustre.org http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org