Re: [gpfsug-discuss] Problems with remote mount via routed IB

2018-03-13 Thread Zachary Mance
Hi Jan,

I am NOT using the pre-populated cache that Mellanox refers to in its
documentation. After chatting with support, I don't believe that's
necessary anymore (I didn't get a straight answer out of them).

For the subnet prefix, make sure to use one from the range
0xfec0-0xfec0001f.
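
That prefix goes into the subnet manager configuration of each fabric. As a
minimal sketch, assuming OpenSM with its default /etc/opensm/opensm.conf
(the values below are just placeholders; pick yours from the range above):

    # opensm.conf on IB fabric A (placeholder value)
    subnet_prefix 0xfec0000000000000

    # opensm.conf on IB fabric B (placeholder value, must differ from fabric A)
    subnet_prefix 0xfec0000000000001

Restart the subnet manager after changing it so the new prefix takes effect.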

---
Zach Mance  zma...@ucar.edu  (303) 497-1883

HPC Data Infrastructure Group / CISL / NCAR
---

On Tue, Mar 13, 2018 at 9:24 AM, Jan Erik Sundermann  wrote:

> Hello Zachary
>
> We are currently changing our setup to have IP over IB on all machines to
> be able to enable verbsRdmaCm.
>
> According to Mellanox (https://community.mellanox.com/docs/DOC-2384)
> ibacm requires pre-populated caches to be distributed to all end hosts with
> the mapping of IP to the routable GIDs (of both IB subnets). Was this also
> required in your successful deployment?
>
> Best
> Jan Erik
>
>
>
> On 03/12/2018 11:10 PM, Zachary Mance wrote:
>
>> Since I am testing out remote mounting with EDR IB routers, I'll add to
>> the discussion.
>>
>> In my lab environment I was seeing the same rdma connections being
>> established and then disconnected shortly after. The remote filesystem
>> would eventually mount on the clients, but it took quite a while
>> (~2mins). Even after mounting, accessing files or any metadata operations
>> would take a while to execute, but eventually it happened.
>>
>> After enabling verbsRdmaCm, everything mounted just fine and in a timely
>> manner. Spectrum Scale was using the librdmacm.so library.
>>
>> I would first double check that you have both clusters able to talk to
>> each other on their IPoIB address, then make sure you enable verbsRdmaCm on
>> both clusters.
>>
>>
>> 
>> ---
>> Zach Mance zma...@ucar.edu  (303) 497-1883
>> HPC Data Infrastructure Group / CISL / NCAR
>> 
>> ---
>>
>> On Thu, Mar 1, 2018 at 1:41 AM, John Hearns wrote:
>>
>> In reply to Stuart,
>> our setup is entirely Infiniband. We boot and install over IB, and
>> rely heavily on IP over Infiniband.
>>
>> As for users being 'confused' due to multiple IPs, I would
>> appreciate some more depth on that one.
>> Sure, all batch systems are sensitive to hostnames (as I know to my
>> cost!) but once you get that straightened out why should users care?
>> I am not being aggressive, just keen to find out more.
>>
>>
>>
>> -Original Message-
>> From: gpfsug-discuss-boun...@spectrumscale.org
>> [mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of
>> Stuart Barkley
>> Sent: Wednesday, February 28, 2018 6:50 PM
>> To: gpfsug main discussion list
>> Subject: Re: [gpfsug-discuss] Problems with remote mount via routed IB
>>
>> The problem with CM is that it seems to require configuring IP over
>> Infiniband.
>>
>> I'm rather strongly opposed to IP over IB.  We did run IPoIB years
>> ago, but pulled it out of our environment as adding unneeded
>> complexity.  It requires provisioning IP addresses across the
>> Infiniband infrastructure and possibly adding routers to other
>> portions of the IP infrastructure.  It was also confusing some users
>> due to multiple IPs on the compute infrastructure.
>>
>> We have recently been in discussions with a vendor about their
>> support for GPFS over IB and they kept directing us to using CM
>> (which still didn't work).  CM wasn't necessary once we found out
>> about the actual problem (we needed the undocumented
>> verbsRdmaUseGidIndexZero configuration option among other things due
>> to their use of SR-IOV based virtual IB interfaces).
>>
>> We don't use routed Infiniband and it might be that CM and IPoIB is
>> required for IB routing, but I doubt it.  It sounds like the OP is
>> keeping IB and IP infrastructure separate.
>>
>> Stuart Barkley
>>
>> On Mon, 26 Feb 2018 at 14:16 -, Aaron Knister wrote:
>>
>>  > Date: Mon, 26 Feb 2018 14:16:34
>>  > From: Aaron Knister
>>  > Reply-To: gpfsug main discussion list
>>  > To: gpfsug-discuss@spectrumscale.org
>> 
>>  > Subject: Re: [gpfsug-discuss] Problems with remote mount via
>> routed IB

Re: [gpfsug-discuss] Preferred NSD

2018-03-13 Thread Alex Chekholko
Hi Lukas,

I would like to discourage you from building a large distributed clustered
filesystem made of many unreliable components.  You will need to
overprovision your interconnect and will also spend a lot of time in
"healing" or "degraded" state.

It is typically cheaper to centralize the storage into a subset of nodes
and configure those to be more highly available.  E.g. of your 60 nodes,
take 8 and put all the storage into those and make that a dedicated GPFS
cluster with no compute jobs on those nodes.  Again, you'll still need
really beefy and reliable interconnect to make this work.
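
As a very rough sketch of what that could look like on the GPFS side (node
names, devices, and replication settings below are invented for illustration,
not a recommendation for your hardware):

    # NSD stanza file listing only the dedicated storage nodes as NSD servers
    %nsd: nsd=nsd001 device=/dev/nvme0n1 servers=storage01 usage=dataAndMetadata failureGroup=1
    %nsd: nsd=nsd002 device=/dev/nvme0n1 servers=storage02 usage=dataAndMetadata failureGroup=2
    # ...one or more stanzas per storage node...

    mmcrnsd -F scratch.stanza
    mmcrfs scratch -F scratch.stanza -m 2 -r 2   # two copies of metadata and data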

Stepping back; what is the actual problem you're trying to solve?  I have
certainly been in that situation before, where the problem is more like: "I
have a fixed hardware configuration that I can't change, and I want to try
to shoehorn a parallel filesystem onto that."

I would recommend looking closer at your actual workloads.  If this is a
"scratch" filesystem and file access is mostly from one node at a time,
it's not very useful to make two additional copies of that data on other
nodes, and it will only slow you down.

Regards,
Alex

On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek 
wrote:

> On Tue, Mar 13, 2018 at 10:37:43AM +, John Hearns wrote:
> > Lukas,
> > It looks like you are proposing a setup which uses your compute servers
> as storage servers also?
>
> yes, exactly. I would like to utilise the NVMe SSDs that are in every compute
> server. Using them as a shared scratch area with GPFS is one of the
> options.
>
> >
> >   *   I'm thinking about the following setup:
> > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected
> >
> > There is nothing wrong with this concept, for instance see
> > https://www.beegfs.io/wiki/BeeOND
> >
> > I have an NVMe filesystem which uses 60 drives, but there are 10 servers.
> > You should look at "failure zones" also.
>
> so you still need dedicated storage servers, and the local SSDs are used only
> for caching? Do I understand correctly?
>
> >
> > From: gpfsug-discuss-boun...@spectrumscale.org [mailto:gpfsug-discuss-
> boun...@spectrumscale.org] On Behalf Of Knister, Aaron S.
> (GSFC-606.2)[COMPUTER SCIENCE CORP]
> > Sent: Monday, March 12, 2018 4:14 PM
> > To: gpfsug main discussion list 
> > Subject: Re: [gpfsug-discuss] Preferred NSD
> >
> > Hi Lukas,
> >
> > Check out FPO mode. That mimics Hadoop's data placement features. You
> can have up to 3 replicas of both data and metadata, but the downside, as you
> say, is that the wrong node failures will take your cluster down.
> >
> > You might want to check out something like Excelero's NVMesh (note: not
> an endorsement since I can't give such things) which can create logical
> volumes across all your NVMe drives. The product has erasure coding on
> their roadmap. I'm not sure if they've released that feature yet but in
> theory it will give better fault tolerance *and* you'll get more efficient
> usage of your SSDs.
> >
> > I'm sure there are other ways to skin this cat too.
> >
> > -Aaron
> >
> >
> >
> > On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek wrote:
> > Hello,
> >
> > I'm thinking about the following setup:
> > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected
> >
> > I would like to set up a shared scratch area using GPFS and those NVMe
> > SSDs, with each SSD as one NSD.
> >
> > I don't think 5 or more data/metadata replicas are practical here. On the
> > other hand, multiple node failures are really to be expected.
> >
> > Is there a way to arrange that the local NSD is strongly preferred to store
> > data? I.e. so that a node failure most probably does not result in
> > unavailable data for the other nodes?
> >
> > Or is there any other recommendation/solution to build shared scratch with
> > GPFS in such a setup? (Including "do not do it.")
> >
> > --
> > Lukáš Hejtmánek
> > ___
> > gpfsug-discuss mailing list
> > gpfsug-discuss at spectrumscale.org
> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss

Re: [gpfsug-discuss] Problems with remote mount via routed IB

2018-03-13 Thread Jan Erik Sundermann

Hello Zachary

We are currently changing our setup to have IP over IB on all machines 
to be able to enable verbsRdmaCm.


According to Mellanox (https://community.mellanox.com/docs/DOC-2384) 
ibacm requires pre-populated caches to be distributed to all end hosts 
with the mapping of IP to the routable GIDs (of both IB subnets). Was 
this also required in your successful deployment?


Best
Jan Erik



On 03/12/2018 11:10 PM, Zachary Mance wrote:
Since I am testing out remote mounting with EDR IB routers, I'll add to 
the discussion.


In my lab environment I was seeing the same rdma connections being 
established and then disconnected shortly after. The remote filesystem 
would eventually mount on the clients, but it took quite a while 
(~2mins). Even after mounting, accessing files or any metadata 
operations would take a while to execute, but eventually it happened.


After enabling verbsRdmaCm, everything mounted just fine and in a timely 
manner. Spectrum Scale was using the librdmacm.so library.


I would first double check that you have both clusters able to talk to 
each other on their IPoIB address, then make sure you enable verbsRdmaCm 
on both clusters.
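
Roughly, the sequence I mean is something like this (host names and addresses
are placeholders, and a daemon restart is needed for the setting to take
effect):

    # 1. check IPoIB reachability between a node in each cluster
    ping -c 3 <ipoib-address-of-remote-cluster-node>

    # 2. enable RDMA CM on BOTH clusters, then restart GPFS
    mmchconfig verbsRdmaCm=enable
    mmshutdown -a && mmstartup -a

    # 3. confirm the setting
    mmlsconfig verbsRdmaCm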



---
Zach Mance zma...@ucar.edu  (303) 497-1883
HPC Data Infrastructure Group / CISL / NCAR
--- 



On Thu, Mar 1, 2018 at 1:41 AM, John Hearns wrote:


In reply to Stuart,
our setup is entirely Infiniband. We boot and install over IB, and
rely heavily on IP over Infiniband.

As for users being 'confused' due to multiple IPs, I would
appreciate some more depth on that one.
Sure, all batch systems are sensitive to hostnames (as I know to my
cost!) but once you get that straightened out why should users care?
I am not being aggressive, just keen to find out more.



-Original Message-
From: gpfsug-discuss-boun...@spectrumscale.org
[mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of
Stuart Barkley
Sent: Wednesday, February 28, 2018 6:50 PM
To: gpfsug main discussion list <gpfsug-discuss@spectrumscale.org>
Subject: Re: [gpfsug-discuss] Problems with remote mount via routed IB

The problem with CM is that it seems to require configuring IP over
Infiniband.

I'm rather strongly opposed to IP over IB.  We did run IPoIB years
ago, but pulled it out of our environment as adding unneeded
complexity.  It requires provisioning IP addresses across the
Infiniband infrastructure and possibly adding routers to other
portions of the IP infrastructure.  It was also confusing some users
due to multiple IPs on the compute infrastructure.

We have recently been in discussions with a vendor about their
support for GPFS over IB and they kept directing us to using CM
(which still didn't work).  CM wasn't necessary once we found out
about the actual problem (we needed the undocumented
verbsRdmaUseGidIndexZero configuration option among other things due
to their use of SR-IOV based virtual IB interfaces).
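
It is set via mmchconfig like other configuration options; a sketch of the
general shape is below, but since the option is undocumented the exact value
syntax here is an assumption worth confirming with support:

    # assumption: boolean yes/no syntax like other verbs-related options
    mmchconfig verbsRdmaUseGidIndexZero=yes
    mmlsconfig verbsRdmaUseGidIndexZero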

We don't use routed Infiniband and it might be that CM and IPoIB is
required for IB routing, but I doubt it.  It sounds like the OP is
keeping IB and IP infrastructure separate.

Stuart Barkley

On Mon, 26 Feb 2018 at 14:16 -, Aaron Knister wrote:

 > Date: Mon, 26 Feb 2018 14:16:34
 > From: Aaron Knister <aaron.s.knis...@nasa.gov>
 > Reply-To: gpfsug main discussion list <gpfsug-discuss@spectrumscale.org>
 > To: gpfsug-discuss@spectrumscale.org

 > Subject: Re: [gpfsug-discuss] Problems with remote mount via
routed IB
 >
 > Hi Jan Erik,
 >
 > It was my understanding that the IB hardware router required RDMA
CM to work.
 > By default GPFS doesn't use the RDMA Connection Manager but it can be
 > enabled (e.g. verbsRdmaCm=enable). I think this requires a restart on
 > clients/servers (in both clusters) to take effect. Maybe someone else
 > on the list can comment in more detail-- I've been told folks have
 > successfully deployed IB routers with GPFS.
 >
 > -Aaron
 >
 > On 2/26/18 11:38 AM, Sundermann, Jan Erik (SCC) wrote:
 > >
 > > Dear all
 > >
 > > we are currently trying to remote mount a file system in a routed
 > > Infiniband test setup and face problems with dropped RDMA
 > > connections. The setup is the
 > > following:
 > >
 > > - Spectrum Scale Cluster 1 is setup on four servers which are
 > > connected to the same infiniband network. Addit

[gpfsug-discuss] SSUG USA Spring Meeting - Registration and call for speakers is now open!

2018-03-13 Thread Oesterlin, Robert
The registration for the Spring meeting of the SSUG-USA is now open. You can 
register here:

https://www.eventbrite.com/e/spectrum-scale-gpfs-user-group-us-spring-2018-meeting-tickets-43662759489

DATE AND TIME
Wed, May 16, 2018, 9:00 AM –
Thu, May 17, 2018, 5:00 PM EDT

LOCATION
IBM Cambridge Innovation Center
One Rogers Street
Cambridge, MA 02142-1203

Please note that we have limited meeting space so please register only if 
you’re sure you can attend. Detailed agenda will be published in the coming 
weeks. If you are interested in presenting, please contact me. I do have 
several speakers lined up already, but we can use a few more.


Bob Oesterlin
Sr Principal Storage Engineer, Nuance

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Problems with remote mount via routed IB

2018-03-13 Thread Jan Erik Sundermann

Hi John

We are trying to route Infiniband traffic; the IP traffic is routed separately.
The two clusters we are trying to connect are configured differently, one with
IP over IB, the other with dedicated ethernet adapters.


Jan Erik



On 02/27/2018 10:17 AM, John Hearns wrote:

Jan Erik,
Can you clarify whether you are routing IP traffic between the two Infiniband
networks, or routing Infiniband traffic itself?


If I can be of help: I manage an Infiniband network which connects to other IP
networks using Mellanox VPI gateways, which proxy ARP between IB and Ethernet.
But I am not running GPFS traffic over these.



-Original Message-
From: gpfsug-discuss-boun...@spectrumscale.org 
[mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of Sundermann, Jan 
Erik (SCC)
Sent: Monday, February 26, 2018 5:39 PM
To: gpfsug-discuss@spectrumscale.org
Subject: [gpfsug-discuss] Problems with remote mount via routed IB


Dear all

we are currently trying to remote mount a file system in a routed Infiniband 
test setup and face problems with dropped RDMA connections. The setup is the 
following:

- Spectrum Scale Cluster 1 is setup on four servers which are connected to the 
same infiniband network. Additionally they are connected to a fast ethernet 
providing ip communication in the network 192.168.11.0/24.

- Spectrum Scale Cluster 2 is setup on four additional servers which are 
connected to a second infiniband network. These servers have IPs on their IB 
interfaces in the network 192.168.12.0/24.

- IP is routed between 192.168.11.0/24 and 192.168.12.0/24 on a dedicated 
machine.

- We have a dedicated IB hardware router connected to both IB subnets.


We tested that the routing, both IP and IB, is working between the two clusters
without problems, and that RDMA is working fine for internal communication
inside both cluster 1 and cluster 2.
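
The checks were roughly of the following kind (the exact commands here are a
sketch rather than what we ran; note that the -R option of the perftest tools
uses librdmacm and therefore needs working IPoIB on both ends):

    # IP routing between the two networks, e.g. from a cluster 1 node
    ping -c 3 192.168.12.5

    # raw RDMA across the IB router using the perftest tools
    ib_send_bw -R                  # on the cluster 2 node (server side)
    ib_send_bw -R 192.168.12.5     # on the cluster 1 node (client side)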

When trying to remote mount a file system from cluster 1 in cluster 2, RDMA 
communication is not working as expected. Instead we see error messages on the 
remote host (cluster 2):


2018-02-23_13:48:47.037+0100: [I] VERBS RDMA connecting to 192.168.11.4 
(iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 2
2018-02-23_13:48:49.890+0100: [I] VERBS RDMA connected to 192.168.11.4 
(iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 2
2018-02-23_13:48:53.138+0100: [E] VERBS RDMA closed connection to 192.168.11.1 
(iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 
index 3
2018-02-23_13:48:53.854+0100: [I] VERBS RDMA connecting to 192.168.11.1 
(iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 3
2018-02-23_13:48:54.954+0100: [E] VERBS RDMA closed connection to 192.168.11.3 
(iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 
index 1
2018-02-23_13:48:55.601+0100: [I] VERBS RDMA connected to 192.168.11.1 
(iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3
2018-02-23_13:48:57.775+0100: [I] VERBS RDMA connecting to 192.168.11.3 
(iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 1
2018-02-23_13:48:59.557+0100: [I] VERBS RDMA connected to 192.168.11.3 
(iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 1
2018-02-23_13:48:59.876+0100: [E] VERBS RDMA closed connection to 192.168.11.2 
(iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 
index 0
2018-02-23_13:49:02.020+0100: [I] VERBS RDMA connecting to 192.168.11.2 
(iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 0
2018-02-23_13:49:03.477+0100: [I] VERBS RDMA connected to 192.168.11.2 
(iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 0
2018-02-23_13:49:05.119+0100: [E] VERBS RDMA closed connection to 192.168.11.4 
(iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 
index 2
2018-02-23_13:49:06.191+0100: [I] VERBS RDMA connecting to 192.168.11.4 
(iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 2
2018-02-23_13:49:06.548+0100: [I] VERBS RDMA connected to 192.168.11.4 
(iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 2
2018-02-23_13:49:11.578+0100: [E] VERBS RDMA closed connection to 192.168.11.1 
(iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 
index 3
2018-02-23_13:49:11.937+0100: [I] VERBS RDMA connecting to 192.168.11.1 
(iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 3
2018-02-23_13:49:11.939+0100: [I] VERBS RDMA connected to 192.168.11.1 
(iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3


and in the cluster with the file system (cluster 1):

2018-02-23_13:47:36.112+0100: [E] VERBS RDMA rdma read error 
IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in 
gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129
2018-02-23_13:47:36.112+0100: [E] VERBS RDMA closed c

Re: [gpfsug-discuss] Preferred NSD

2018-03-13 Thread Lukas Hejtmanek
On Tue, Mar 13, 2018 at 10:37:43AM +, John Hearns wrote:
> Lukas,
> It looks like you are proposing a setup which uses your compute servers as 
> storage servers also?

yes, exactly. I would like to utilise the NVMe SSDs that are in every compute
server. Using them as a shared scratch area with GPFS is one of the options.
 
> 
>   *   I'm thinking about the following setup:
> ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected
> 
> There is nothing wrong with this concept, for instance see
> https://www.beegfs.io/wiki/BeeOND
> 
> I have an NVMe filesystem which uses 60 drives, but there are 10 servers.
> You should look at "failure zones" also.

so you still need dedicated storage servers, and the local SSDs are used only
for caching? Do I understand correctly?
 
> 
> From: gpfsug-discuss-boun...@spectrumscale.org 
> [mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of Knister, Aaron 
> S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
> Sent: Monday, March 12, 2018 4:14 PM
> To: gpfsug main discussion list 
> Subject: Re: [gpfsug-discuss] Preferred NSD
> 
> Hi Lukas,
> 
> Check out FPO mode. That mimics Hadoop's data placement features. You can 
> have up to 3 replicas of both data and metadata, but the downside, as you say, 
> is that the wrong node failures will take your cluster down.
> 
> You might want to check out something like Excelero's NVMesh (note: not an 
> endorsement since I can't give such things) which can create logical volumes 
> across all your NVMe drives. The product has erasure coding on their roadmap. 
> I'm not sure if they've released that feature yet but in theory it will give 
> better fault tolerance *and* you'll get more efficient usage of your SSDs.
> 
> I'm sure there are other ways to skin this cat too.
> 
> -Aaron
> 
> 
> 
> On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek 
> <xhejt...@ics.muni.cz> wrote:
> Hello,
> 
> I'm thinking about the following setup:
> ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected
> 
> I would like to set up a shared scratch area using GPFS and those NVMe SSDs,
> with each SSD as one NSD.
> 
> I don't think 5 or more data/metadata replicas are practical here. On the
> other hand, multiple node failures are really to be expected.
> 
> Is there a way to arrange that the local NSD is strongly preferred to store
> data? I.e. so that a node failure most probably does not result in unavailable
> data for the other nodes?
> 
> Or is there any other recommendation/solution to build shared scratch with
> GPFS in such a setup? (Including "do not do it.")
> 
> --
> Lukáš Hejtmánek
> ___
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss

> ___
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss


-- 
Lukáš Hejtmánek

Linux Administrator only because
  Full Time Multitasking Ninja 
  is not an official job title
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Preferred NSD

2018-03-13 Thread John Hearns
Lukas,
It looks like you are proposing a setup which uses your compute servers as 
storage servers also?


  *   I'm thinking about the following setup:
~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected

There is nothing wrong with this concept, for instance see
https://www.beegfs.io/wiki/BeeOND

I have an NVMe filesystem which uses 60 drives, but there are 10 servers.
You should look at "failure zones" also.
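
In GPFS terms the closest analogue to a failure zone is probably the failure
group you assign to each NSD. A minimal sketch, with invented node and device
names:

    # put each server's local NVMe into its own failure group so that the
    # replicas of a block never end up on the same server
    %nsd: nsd=node01_nvme0 device=/dev/nvme0n1 servers=node01 usage=dataAndMetadata failureGroup=101
    %nsd: nsd=node02_nvme0 device=/dev/nvme0n1 servers=node02 usage=dataAndMetadata failureGroup=102

With two or three replicas configured at file system creation time, GPFS then
keeps each replica in a different failure group.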


From: gpfsug-discuss-boun...@spectrumscale.org 
[mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of Knister, Aaron 
S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
Sent: Monday, March 12, 2018 4:14 PM
To: gpfsug main discussion list 
Subject: Re: [gpfsug-discuss] Preferred NSD

Hi Lukas,

Check out FPO mode. That mimics Hadoop's data placement features. You can have 
up to 3 replicas of both data and metadata, but the downside, as you say, is 
that the wrong node failures will take your cluster down.
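
A minimal sketch of the FPO-style settings being referred to, as a storage pool
stanza for the file system creation step (the pool name, block size and other
values here are placeholders, not tested settings):

    %pool: pool=fpodata blockSize=1M layoutMap=cluster allowWriteAffinity=yes writeAffinityDepth=1 blockGroupFactor=128

allowWriteAffinity=yes is what makes GPFS prefer the NSD local to the writing
node, which also touches on the "local NSD strongly preferred" question below.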

You might want to check out something like Excelero's NVMesh (note: not an 
endorsement since I can't give such things) which can create logical volumes 
across all your NVMe drives. The product has erasure coding on their roadmap. 
I'm not sure if they've released that feature yet but in theory it will give 
better fault tolerance *and* you'll get more efficient usage of your SSDs.

I'm sure there are other ways to skin this cat too.

-Aaron



On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek 
<xhejt...@ics.muni.cz> wrote:
Hello,

I'm thinking about the following setup:
~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected

I would like to set up a shared scratch area using GPFS and those NVMe SSDs,
with each SSD as one NSD.

I don't think 5 or more data/metadata replicas are practical here. On the
other hand, multiple node failures are really to be expected.

Is there a way to arrange that the local NSD is strongly preferred to store
data? I.e. so that a node failure most probably does not result in unavailable
data for the other nodes?

Or is there any other recommendation/solution to build shared scratch with
GPFS in such a setup? (Including "do not do it.")

--
Lukáš Hejtmánek
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss