[gpfsug-discuss] ces - change preferred host for an IP without actually moving the IP

2019-02-12 Thread Billich Heinrich Rainer (PSI)
Hello,

Can I change the preferred server for a ces address without actually moving the 
IP?

In my case the IP already moved to the new server due to a failure on a second 
server. Now I would like the IP to stay there even when the other server becomes 
active again; to start with, I only want to move a test address. But ‘mmces 
address move’ refuses to run because the address is already on the server I want 
to make the preferred one.

I also couldn’t find where this address assignment is stored; I searched the 
files available from CCR.
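
In case it helps others, what I have looked at so far is roughly the following 
(mmccr is an internal, unsupported command, so treat this purely as a sketch; 
the file names differ by release):

  mmccr flist                                 # list the files kept in CCR
  mmccr fget <filename> /tmp/ccr.<filename>   # pull one out for inspection
  mmces address list                          # the supported view of the current assignments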

Thank you,

Heiner
--
Paul Scherrer Institut
Heiner Billich
System Engineer Scientific Computing
Science IT / High Performance Computing
WHGA/106
Forschungsstrasse 111
5232 Villigen PSI
Switzerland

Phone +41 56 310 36 02
heiner.bill...@psi.ch
https://www.psi.ch


___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


[gpfsug-discuss] Token manager - how to monitor performance?

2019-01-31 Thread Billich Heinrich Rainer (PSI)
Hello,
Sorry for coming up with this never-ending story. I know that token management 
is mainly autoconfigured and even the placement of token manager nodes is no 
longer under user control in all cases. Still I would like to monitor this 
component to see whether we are close to some limit like memory or RPC rate, 
especially as we’ll make some major changes to our setup soon.
I would like to monitor the performance of our token manager nodes to get 
warned _before_ we get performance issues. Any advice is welcome.
Ideally I would like to collect some numbers and pass them on to InfluxDB or 
similar. I didn’t find anything in perfmon/zimon that seemed to match. I could 
imagine that numbers like “number of active tokens” and “number of token 
operations” per manager would be helpful, or “# of RPC calls per second”. And 
maybe “number of open files”, “number of token operations” and “number of 
tokens” for clients. And maybe some percentage of used token memory … and cache 
hit ratio …
This would also help with tuning: for example, if a client does very many token 
operations or RPC calls, maybe I should increase maxFilesToCache.
The above is just to illustrate; as token management is complicated, the really 
valuable metrics may be different.
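
To illustrate what I have in mind for the InfluxDB side, a minimal sketch; the 
write endpoint and the measurement/field names are my own assumptions, and the 
grep only relies on the ‘Token Domain …’ lines that ‘mmdiag --tokenmgr’ prints 
on our release:

  HOST=$(hostname -s)
  INFLUX_URL="http://influxdb.example.com:8086/write?db=gpfs"   # assumed endpoint

  # number of token domains reported by mmdiag on this node
  domains=$(/usr/lpp/mmfs/bin/mmdiag --tokenmgr | grep -c 'Token Domain')

  # InfluxDB 1.x line protocol, nanosecond timestamp
  curl -s -XPOST "$INFLUX_URL" --data-binary \
    "gpfs_tokenmgr,node=$HOST domains=${domains:-0} $(date +%s%N)"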
Or am I too anxious and should wait and see instead?
cheers,
Heiner

Heiner Billich
--
Paul Scherrer Institut
Heiner Billich
System Engineer Scientific Computing
Science IT / High Performance Computing
WHGA/106
Forschungsstrasse 111
5232 Villigen PSI
Switzerland

Phone +41 56 310 36 02
heiner.bill...@psi.ch
https://www.psi.ch




___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] [NFS-Ganesha-Support] does ganesha deny access for unknown UIDs?

2019-01-25 Thread Billich Heinrich Rainer (PSI)
Hello Daniel,

thank you. The clients do NFS v3 mounts, hence idmap is no option - as far as I 
know it is only used in NFS v4 to map between uid/gid and names? And for a 
process to switch to a certain uid/gid one does not, in general, need a matching 
passwd entry? I see that with ACLs you get issues as they use names, and you 
can't do a server-side group membership lookup, and there may be more subtle 
issues.
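
A quick way to check the plain uid/gid switch on a Linux box, assuming uid/gid 
54321 have no passwd/group entry, is something like this; I would expect the 
first command to succeed and the second to fail, since initgroups() needs an 
account name:

  setpriv --reuid 54321 --regid 54321 --clear-groups id   # numeric switch, no passwd entry needed
  setpriv --reuid 54321 --regid 54321 --init-groups id    # group initialization needs an account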

Anyway, I'll create the needed accounts on the server. By the way: we had the 
same issue with NetApp filers, and it took a while to find the configuration 
option that allows 'unknown' uid/gid to access an NFS v3 export.

I'll try to reproduce this on a test system with increased logging to see what 
exactly goes wrong, and maybe ask later for a configuration option in ganesha to 
switch to a behaviour more similar to kernel-nfs.

Many client systems at my site are legacy and run various operating systems, 
hence a complete switch to NFS v4 is unlikely to happen soon.

cheers,

Heiner 
--
Paul Scherrer Institut
Heiner Billich   
System Engineer Scientific Computing
Science IT / High Performance Computing 
WHGA/106  
Forschungsstrasse 111
5232 Villigen PSI
Switzerland
 
Phone +41 56 310 36 02
heiner.bill...@psi.ch 
https://www.psi.ch
 
 

On 24/01/19 16:35, "Daniel Gryniewicz"  wrote:

Hi.

For local operating FSALs (like GPFS and VFS), the way Ganesha makes 
sure that a UID/GID combo has the correct permissions for an operation 
is to set the UID/GID of the thread to the one in the operation, then 
perform the actual operation.  This way, the kernel and the underlying 
filesystem perform atomic permission checking on the op.  This 
setuid/setgid will fail, of course, if the local system doesn't have 
that UID/GID to set to.

The solution for this is to use NFS idmap to map the remote ID to a 
local one. This includes the ability to map unknown IDs to some local ID.

    Daniel
    
    On 1/24/19 9:29 AM, Billich Heinrich Rainer (PSI) wrote:
> Hello,
> 
> a local account on a nfs client couldn’t write to a ganesha nfs export 
> even with directory permissions 777. The solution was to create the 
> account on the ganesha servers, too.
> 
> Please can you confirm that this is the intended behaviour? is there an 
> option to change this and to map unknown accounts to nobody instead? We 
> often have embedded Linux appliances or similar as nfs clients which 
> need to place some data on the nfs exports  using uid/gid of local 
accounts.
> 
> We manage gids on the server side and allow NFS v3 client access only.
> 
> I crosspost this to ganesha support and to the gpfsug mailing list.
> 
> Thank you,
> 
> Heiner Billich
> 
> ganesha version: 2.5.3-ibm028.00.el7.x86_64



___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


[gpfsug-discuss] does ganesha deny access for unknown UIDs?

2019-01-24 Thread Billich Heinrich Rainer (PSI)
Hello,

a local account on a nfs client couldn’t write to a ganesha nfs export even 
with directory permissions 777. The solution was to create the account on the 
ganesha servers, too.

Please can you confirm that this is the intended behaviour? Is there an option 
to change this and to map unknown accounts to nobody instead? We often have 
embedded Linux appliances or similar as nfs clients which need to place some 
data on the nfs exports  using uid/gid of local accounts.

We manage gids on the server side and allow NFS v3 client access only.

I crosspost this to ganesha support and to the gpfsug mailing list.

Thank you,

Heiner Billich

ganesha version: 2.5.3-ibm028.00.el7.x86_64

the ganesha config

CacheInode
{
fd_hwmark_percent=60;
fd_lwmark_percent=20;
fd_limit_percent=90;
lru_run_interval=90;
entries_hwmark=150;
}
NFS_Core_Param
{
clustered=TRUE;
rpc_max_connections=1;
heartbeat_freq=0;
mnt_port=33247;
nb_worker=256;
nfs_port=2049;
nfs_protocols=3,4;
nlm_port=33245;
rquota_port=33246;
rquota_port=33246;
short_file_handle=FALSE;
mount_path_pseudo=true;
}
GPFS
{
fsal_grace=FALSE;
fsal_trace=TRUE;
}
NFSv4
{
delegations=FALSE;
domainname=virtual1.com;
grace_period=60;
lease_lifetime=60;
}
Export_Defaults
{
access_type=none;
anonymous_gid=-2;
anonymous_uid=-2;
manage_gids=TRUE;
nfs_commit=FALSE;
privilegedport=FALSE;
protocols=3,4;
sectype=sys;
squash=root_squash;
transports=TCP;
}

one export

# === START / id=206 nclients=3 ===
EXPORT {
Attr_Expiration_Time=60;
Delegations=none;
Export_id=206;
Filesystem_id=42.206;
MaxOffsetRead=18446744073709551615;
MaxOffsetWrite=18446744073709551615;
MaxRead=1048576;
MaxWrite=1048576;
Path="/";
PrefRead=1048576;
PrefReaddir=1048576;
PrefWrite=1048576;
Pseudo="/";
Tag="";
UseCookieVerifier=false;
FSAL {
Name=GPFS;
}
CLIENT {
# === /X12SA ===
Access_Type=RW;
Anonymous_gid=-2;
Anonymous_uid=-2;
Clients=X.Y.A.B/24;
Delegations=none;
Manage_Gids=TRUE;
NFS_Commit=FALSE;
PrivilegedPort=FALSE;
Protocols=3;
SecType=SYS;
Squash=Root;
Transports=TCP;
}
….
--
Paul Scherrer Institut
Heiner Billich
System Engineer Scientific Computing
Science IT / High Performance Computing
WHGA/106
Forschungsstrasse 111
5232 Villigen PSI
Switzerland

Phone +41 56 310 36 02
heiner.bill...@psi.ch
https://www.psi.ch


___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


[gpfsug-discuss] CES - suspend a node and don't start smb/nfs at mmstartup/boot

2018-11-14 Thread Billich Heinrich Rainer (PSI)
Hello,

how can I prevent smb, ctdb, nfs (and object) from starting when I reboot the 
node or restart gpfs on a suspended CES node? Being able to do this would make 
updates much easier.

With

 # mmces node suspend --stop

I can move all IPs to other CES nodes and stop all CES services, which also 
releases the ces-shared-root directory and allows me to unmount the underlying 
filesystem. But after a reboot/restart only the IPs stay on the other nodes; the 
CES services start up again. Sometimes I would very much prefer the services to 
stay down as long as the node is suspended, and to keep the node out of the CES 
cluster as much as possible.
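
A workaround sketch might be to push the services back down right after gpfs 
comes up on the still-suspended node; the node name is a placeholder, and 
whether this is the intended way of doing it I don't know:

  mmces node list                      # confirm the node is still suspended
  mmces service stop NFS -N ces-node-1
  mmces service stop SMB -N ces-node-1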

I did not try rough things like just renaming smbd, this seems likely to create 
unwanted issues.

Thank you,

Cheers,

Heiner Billich
--
Paul Scherrer Institut
Heiner Billich
System Engineer Scientific Computing
Science IT / High Performance Computing
WHGA/106
Forschungsstrasse 111
5232 Villigen PSI
Switzerland

Phone +41 56 310 36 02
heiner.bill...@psi.ch
https://www.psi.ch



From:  on behalf of Madhu Konidena 

Reply-To: gpfsug main discussion list 
Date: Sunday 11 November 2018 at 22:06
To: gpfsug main discussion list 
Subject: Re: [gpfsug-discuss] If you're attending KubeCon'18

I will be there at both. Please stop by our booth at SC18 for a quick chat.

Madhu Konidena
ma...@corehive.com



On Nov 10, 2018, at 3:37 PM, Jon Bernard <jonbern...@gmail.com> wrote:
Hi Vasily,
I will be at Kubecon with colleagues from Tower Research Capital (and at SC). 
We have a few hundred nodes across several Kubernetes clusters, most of them 
mounting Scale from the host.
Jon
On Fri, Oct 26, 2018, 5:58 PM Vasily Tarasov <vtara...@us.ibm.com> wrote:
Folks,

Please let me know if anyone is attending KubeCon'18 in Seattle this December 
(via private e-mail). We will be there and would like to meet in person with 
people that already use or consider using Kubernetes/Swarm/Mesos with Scale. 
The goal is to share experiences, problems, visions.

P.S. If you are not attending KubeCon, but are interested in the topic, shoot me 
an e-mail anyway.

Best,
--
Vasily Tarasov, Research Staff Member, Storage Systems Research, IBM Research - Almaden
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


[gpfsug-discuss] CES - samba - how can I disable shadow_copy2, i.e. snapshots

2018-11-09 Thread Billich Heinrich Rainer (PSI)
Hello,

we run CES with smbd on a filesystem _without_ snapshots. I would like to 
completely remove the shadow_copy2 vfs object in samba which exposes the 
snapshots to windows clients: 

We don't offer snapshots as a service to clients, and if I create a snapshot I 
don't want it to be exposed to clients. I'm also not sure how many additional 
directory traversals this vfs object causes; shadow_copy2 has to search for the 
snapshot directories again and again, just to learn that there are no snapshots 
available.

Now the file samba_registry.def (/usr/lpp/mmfs/share/samba/samba_registry.def) 
doesn't allow changing the settings for shadow_copy2 in samba's configuration.

Hm, is it o.k. to edit samba_registry.def? That's probably not what IBM 
intended. But with mmsnapdir I can change the name of the snapshot directories, 
which would require me to edit the locked settings, too, so it seems a bit 
restrictive.

I didn’t search all the documentation; if there is an option to disable 
shadow_copy2 with some command I would be happy to learn about it.

Any comments or ideas are welcome. Also if you think I should just create a 
bogus .snapdirs at root level to get rid of the error messages and that's it, 
please let me know.

We run Scale 5.0.1-1 on RHEL7 x86_64. We will upgrade to 5.0.2-1 soon, but I 
didn’t check that version yet.

Cheers

Heiner Billich

What I would like to change in samba's configuration:


52c52
<   vfs objects = syncops gpfs fileid time_audit
---
>   vfs objects = shadow_copy2 syncops gpfs fileid time_audit
72a73,76
>   shadow:snapdir = .snapshots
>   shadow:fixinodes = yes
>   shadow:snapdirseverywhere = yes
>   shadow:sort = desc
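
If editing the registry-backed configuration directly is acceptable, something 
like the following might do it - a sketch only, very likely unsupported, and the 
change may be rejected or reverted because the option is locked via 
samba_registry.def; the path of the bundled net binary is an assumption:

  /usr/lpp/mmfs/bin/net conf list | grep 'vfs objects'
  /usr/lpp/mmfs/bin/net conf setparm global 'vfs objects' 'syncops gpfs fileid time_audit'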
--
Paul Scherrer Institut
Heiner Billich   
System Engineer Scientific Computing
Science IT / High Performance Computing 
WHGA/106  
Forschungsstrasse 111
5232 Villigen PSI
Switzerland
 
Phone +41 56 310 36 02
heiner.bill...@psi.ch 
https://www.psi.ch
 
 


___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] control which hosts become token manager

2018-07-27 Thread Billich Heinrich Rainer (PSI)
Hello,

So probably I was wrong from the beginning –  please can somebody clarify: In a 
multicluster environment with all storage and filesystem hosted by a single 
cluster all token managers will reside in this central cluster?

Or are there also token managers in the storage-less clusters which just mount? 
These managers wouldn’t be accessible by all nodes which access the file system, 
hence I doubt this is the case.

Still it would be nice to know how to influence the token manager placement and 
how to exclude certain machines. And the output of ‘mmdiag --tokenmgr’ indicates 
that there _are_ token managers in the remote-mounting cluster – confusing.

I would greatly appreciate it if somebody could sort this out. A pointer to the 
relevant documentation would also be welcome.
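
For reference, what I base this on (run on a node of the mounting cluster and on 
a node of the home cluster):

  mmlsmgr             # file system managers and the cluster manager
  mmdiag --tokenmgr   # token domains and the token server list as seen from this node
  mmlscluster         # node designations (quorum, manager, client) in the local cluster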

Thank you & Kind regards,

Heiner

--
Paul Scherrer Institut
Science IT
Heiner Billich
WHGA 106
CH 5232  Villigen PSI
056 310 36 02
https://www.psi.ch


From:  on behalf of "Billich Heinrich 
Rainer (PSI)" 
Reply-To: gpfsug main discussion list 
Date: Friday 27 July 2018 at 17:33
To: gpfsug main discussion list 
Subject: Re: [gpfsug-discuss] control which hosts become token manager

Thank you,

The cluster was freshly set up and the VM node was never designated as manager; 
it was created as quorum-client. What I didn’t mention but probably should have: 
this is a multicluster mount, the cluster has no storage of its own. Hence the 
filesystem managers are on the home cluster, according to mmlsmgr. Hm, probably 
more complicated than I initially thought. Still I would expect that for 
file access that is restricted to this cluster all token management is handled 
inside the cluster, too? And I don’t want the weakest node to participate.

Kind regards,

Heiner

--
Paul Scherrer Institut
Science IT
Heiner Billich
WHGA 106
CH 5232  Villigen PSI
056 310 36 02
https://www.psi.ch


From:  on behalf of Bryan Banister 

Reply-To: gpfsug main discussion list 
Date: Tuesday 24 July 2018 at 23:12
To: gpfsug main discussion list 
Subject: Re: [gpfsug-discuss] control which hosts become token manager

Agree with Peter here.  And if the file system and workload are of significant 
size then isolating the token manager to a dedicated node is definitely best 
practice.

Unfortunately there isn’t a way to specify a preferred manager per FS… (Bryan 
starts typing up a new RFE…).

Cheers,
-Bryan

From: gpfsug-discuss-boun...@spectrumscale.org 
 On Behalf Of Peter Childs
Sent: Tuesday, July 24, 2018 2:29 PM
To: gpfsug main discussion list 
Subject: Re: [gpfsug-discuss] control which hosts become token manager

Note: External Email


What does mmlsmgr show?

Your config looks fine.

I suspect you need to do a

mmchmgr perf node-1.psi.ch<http://node-1.psi.ch>
mmchmgr tiered node-2.psi.ch<http://node-2.psi.ch>

It looks like the node was set up as a manager and was demoted to just quorum, 
but since it's still currently the manager it needs to be told to stop.

From experience it's also worth having different file system managers on 
different nodes, if at all possible.

But that's just a guess without seeing the output of mmlsmgr.


Peter Childs
Research Storage
ITS Research and Teaching Support
Queen Mary, University of London


 Billich Heinrich Rainer (PSI) wrote 
Hello,

I want to control which nodes can become token manager. In detail I run a 
virtual machine as quorum node. I don’t want this machine to become a token 
manager - it has no access to Infiniband and only very limited memory.

What I see is that ‘mmdiag --tokenmgr’ lists the machine as an active token 
manager. The machine has role ‘quorum-client’. This doesn’t seem sufficient to 
exclude it.

Is there any way to tell spectrum scale to exclude this single machine with 
role quorum-client?

I run 5.0.1-1.

Sorry if this is a faq, I did search  quite a bit before I wrote to the list.

Thank you,

Heiner Billich


[root@node-2 ~]# mmlscluster

GPFS cluster information

  GPFS cluster name: node.psi.ch
  GPFS cluster id:   5389874024582403895
  GPFS UID domain:   node.psi.ch
  Remote shell command:  /usr/bin/ssh
  Remote file copy command:  /usr/bin/scp
  Repository type:   CCR

Node  Daemon node name    IP address  Admin node name     Designation

   1   node-1.psi.ch   a.b.95.31  node-1.psi.ch   quorum-manager
   2   node-2.psi.ch   a.b.95.32  node-2.psi.ch   quorum-manager
   3   node-quorum.psi.ch  a.b.95.30  node-quorum.psi.ch  quorum
   <<<< VIRTUAL MACHINE >>>>>>>>>

[root@node-2 ~]# mmdiag --tokenmgr

=== mmdiag: tokenmgr ===
  Token Domain perf
There are 3 active token servers in this domain.
Server list:
  a.b.95.120
  a.b.95.121
  a.b.95.122   <<<< VIRTUAL MACHINE >>>>>>>>>

Re: [gpfsug-discuss] Power9 / GPFS

2018-07-27 Thread Billich Heinrich Rainer (PSI)
Hello

If you don’t need the installer, maybe just extract the RPMs; this bypasses 
Java. For x86_64 I use commands like the ones below; it shouldn’t be much 
different on Power.

#!/bin/bash
# extract the embedded RPM payload from the self-extracting installer
TARFILE="$1"

# the installer records the line offset of the embedded tgz in PGM_BEGIN_TGZ=<n>
START=$(grep -a -m 1 '^PGM_BEGIN_TGZ=' "$TARFILE" | cut -d= -f2)
echo "extract RPMs from $TARFILE with START=$START"
tail -n +"$START" "$TARFILE" | tar --wildcards -xvzf - '*.rpm' '*/repodata/*'
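
Saved as, say, extract-rpms.sh (the name is arbitrary), usage would be:

  bash extract-rpms.sh Spectrum_Scale_Data_Management-5.0.1.1-ppc64LE-Linux-install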

Kind regards,

Heiner

--
Paul Scherrer Institut

From:  on behalf of Simon Thompson 

Reply-To: gpfsug main discussion list 
Date: Friday 27 July 2018 at 17:19
To: "gpfsug-discuss@spectrumscale.org" 
Subject: [gpfsug-discuss] Power9 / GPFS

I feel like I must be doing something stupid here but …

We’re trying to install GPFS onto some Power 9 AI systems we’ve just got…

So from Fix central, we download 
“Spectrum_Scale_Data_Management-5.0.1.1-ppc64LE-Linux-install”, however we are 
failing to unpack the file:

./Spectrum_Scale_Data_Management-5.0.1.1-ppc64LE-Linux-install --dir 5.0.1.1 
--text-only

Extracting License Acceptance Process Tool to 5.0.1.1 ...
tail -n +620 ./Spectrum_Scale_Data_Management-5.0.1.1-ppc64LE-Linux-install | 
tar -C 5.0.1.1 -xvz --exclude=installer --exclude=*_rpms --exclude=*rpm  
--exclude=*tgz --exclude=*deb 1> /dev/null

Installing JRE ...

If directory 5.0.1.1 has been created or was previously created during another 
extraction,
.rpm, .deb, and repository related files in it (if there were) will be removed 
to avoid conflicts with the ones being extracted.

tail -n +620 ./Spectrum_Scale_Data_Management-5.0.1.1-ppc64LE-Linux-install | 
tar -C 5.0.1.1 --wildcards -xvz  ibm-java*tgz 1> /dev/null
tar -C 5.0.1.1/ -xzf 5.0.1.1/ibm-java*tgz

Invoking License Acceptance Process Tool ...
5.0.1.1/ibm-java-ppc64le-80/jre/bin/java -cp 5.0.1.1/LAP_HOME/LAPApp.jar 
com.ibm.lex.lapapp.LAP -l 5.0.1.1/LA_HOME -m 5.0.1.1 -s 5.0.1.1  -text_only
Unhandled exception
Type=Segmentation error vmState=0x
J9Generic_Signal_Number=0004 Signal_Number=000b Error_Value= 
Signal_Code=0001
Handler1=7FFFB194FC80 Handler2=7FFFB176EA40
R0=7FFFB176A0E8 R1=7FFFB23AC5D0 R2=7FFFB2737400 R3=
R4=7FFFB17D2AA4 R5=0006 R6= R7=7FFFAC12A3C0


This looks like the java runtime is failing during the license approval status.

First off, can someone confirm that 
“Spectrum_Scale_Data_Management-5.0.1.1-ppc64LE-Linux-install” is indeed the 
correct package we are downloading for Power9, and then any tips on how to 
extract the packages.

These systems are running the IBM factory shipped install of RedHat 7.5.

Thanks

Simon


___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] control which hosts become token manager

2018-07-27 Thread Billich Heinrich Rainer (PSI)
Thank you,

The cluster was freshly set up and the VM node was never designated as manager; 
it was created as quorum-client. What I didn’t mention but probably should have: 
this is a multicluster mount, the cluster has no storage of its own. Hence the 
filesystem managers are on the home cluster, according to mmlsmgr. Hm, probably 
more complicated than I initially thought. Still I would expect that for 
file access that is restricted to this cluster all token management is handled 
inside the cluster, too? And I don’t want the weakest node to participate.

Kind regards,

Heiner

--
Paul Scherrer Institut
Science IT
Heiner Billich
WHGA 106
CH 5232  Villigen PSI
056 310 36 02
https://www.psi.ch


From:  on behalf of Bryan Banister 

Reply-To: gpfsug main discussion list 
Date: Tuesday 24 July 2018 at 23:12
To: gpfsug main discussion list 
Subject: Re: [gpfsug-discuss] control which hosts become token manager

Agree with Peter here.  And if the file system and workload are of significant 
size then isolating the token manager to a dedicated node is definitely best 
practice.

Unfortunately there isn’t a way to specify a preferred manager per FS… (Bryan 
starts typing up a new RFE…).

Cheers,
-Bryan

From: gpfsug-discuss-boun...@spectrumscale.org 
 On Behalf Of Peter Childs
Sent: Tuesday, July 24, 2018 2:29 PM
To: gpfsug main discussion list 
Subject: Re: [gpfsug-discuss] control which hosts become token manager

Note: External Email


What does mmlsmgr show?

Your config looks fine.

I suspect you need to do a

mmchmgr perf node-1.psi.ch<http://node-1.psi.ch>
mmchmgr tiered node-2.psi.ch<http://node-2.psi.ch>

It looks like the node was set up as a manager and was demoted to just quorum, 
but since it's still currently the manager it needs to be told to stop.

From experience it's also worth having different file system managers on 
different nodes, if at all possible.

But that's just a guess without seeing the output of mmlsmgr.


Peter Childs
Research Storage
ITS Research and Teaching Support
Queen Mary, University of London


 Billich Heinrich Rainer (PSI) wrote 
Hello,

I want to control which nodes can become token manager. In detail I run a 
virtual machine as quorum node. I don’t want this machine to become a token 
manager - it has no access to Infiniband and only very limited memory.

What I see is that ‘mmdiag --tokenmgr’ lists the machine as an active token 
manager. The machine has role ‘quorum-client’. This doesn’t seem sufficient to 
exclude it.

Is there any way to tell spectrum scale to exclude this single machine with 
role quorum-client?

I run 5.0.1-1.

Sorry if this is a faq, I did search  quite a bit before I wrote to the list.

Thank you,

Heiner Billich


[root@node-2 ~]# mmlscluster

GPFS cluster information

  GPFS cluster name: node.psi.ch
  GPFS cluster id:   5389874024582403895
  GPFS UID domain:   node.psi.ch
  Remote shell command:  /usr/bin/ssh
  Remote file copy command:  /usr/bin/scp
  Repository type:   CCR

Node  Daemon node name    IP address  Admin node name     Designation

   1   node-1.psi.ch   a.b.95.31  node-1.psi.ch   quorum-manager
   2   node-2.psi.ch   a.b.95.32  node-2.psi.ch   quorum-manager
   3   node-quorum.psi.ch  a.b.95.30  node-quorum.psi.ch  quorum
   <<<< VIRTUAL MACHINE >>>>>>>>>

[root@node-2 ~]# mmdiag --tokenmgr

=== mmdiag: tokenmgr ===
  Token Domain perf
There are 3 active token servers in this domain.
Server list:
  a.b.95.120
  a.b.95.121
  a.b.95.122<<<< VIRTUAL MACHINE >>>>>>>>>
  Token Domain tiered
There are 3 active token servers in this domain.
Server list:
  a.b.95.120
  a.b.95.121
  a.b.95.122   <<<< VIRTUAL MACHINE >>>>>>>>>

--
Paul Scherrer Institut
Science IT
Heiner Billich
WHGA 106
CH 5232  Villigen PSI
056 310 36 02
https://www.psi.ch





[gpfsug-discuss] control which hosts become token manager

2018-07-24 Thread Billich Heinrich Rainer (PSI)
Hello,

I want to control which nodes can become token manager. In detail I run a 
virtual machine as quorum node. I don’t want this machine to become a token 
manager - it has no access to Infiniband and only very limited memory.

What I see is that ‘mmdiag --tokenmgr’ lists the machine as an active token 
manager. The machine has role ‘quorum-client’. This doesn’t seem sufficient to 
exclude it.

Is there any way to tell spectrum scale to exclude this single machine with 
role quorum-client?

I run 5.0.1-1.

Sorry if this is a faq, I did search  quite a bit before I wrote to the list.

Thank you,

Heiner Billich


[root@node-2 ~]# mmlscluster

GPFS cluster information

  GPFS cluster name: node.psi.ch
  GPFS cluster id:   5389874024582403895
  GPFS UID domain:   node.psi.ch
  Remote shell command:  /usr/bin/ssh
  Remote file copy command:  /usr/bin/scp
  Repository type:   CCR

Node  Daemon node name    IP address  Admin node name     Designation

   1   node-1.psi.ch   a.b.95.31  node-1.psi.ch   quorum-manager
   2   node-2.psi.ch   a.b.95.32  node-2.psi.ch   quorum-manager
   3   node-quorum.psi.ch  a.b.95.30  node-quorum.psi.ch  quorum
    <<<< VIRTUAL MACHINE >>>>>>>>>

[root@node-2 ~]# mmdiag --tokenmgr

=== mmdiag: tokenmgr ===
  Token Domain perf
There are 3 active token servers in this domain.
Server list:
  a.b.95.120
  a.b.95.121
  a.b.95.122   <<<< VIRTUAL MACHINE >>>>>>>>>
  Token Domain tiered
There are 3 active token servers in this domain.
Server list:
  a.b.95.120
  a.b.95.121
  a.b.95.122   <<<< VIRTUAL MACHINE >>>>>>>>>

--
Paul Scherrer Institut
Science IT
Heiner Billich
WHGA 106
CH 5232  Villigen PSI
056 310 36 02
https://www.psi.ch

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] /sbin/rmmod mmfs26 hangs on mmshutdown

2018-07-12 Thread Billich Heinrich Rainer (PSI)


Hello Sven,

The machine has

maxFilesToCache 204800   (2M)

it will become a CES node, hence the higher-than-default value. It’s just a 
3-node cluster with a remote cluster mount and no activity (yet). But all three 
nodes are listed as token servers by ‘mmdiag --tokenmgr’.

Top showed 100% idle on core 55. This matches the kernel messages about rmmod 
being stuck on core 55.
I didn’t see a dominating thread/process, but many kernel threads showed 30-40% 
CPU; in sum they used about 50% of all CPU available.

This time mmshutdown did return and left the module loaded, next mmstartup 
tried to remove the ‘old’ module and got stuck :-(

I append two links to screenshots

Thank you,

Heiner

https://pasteboard.co/Hu86DKf.png
https://pasteboard.co/Hu86rg4.png

If the links don’t work  I can post the images to the list.

Kernel messages:

[  857.791050] CPU: 55 PID: 16429 Comm: rmmod Tainted: GW  OEL 
   3.10.0-693.17.1.el7.x86_64 #1
[  857.842265] Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS 
P89 01/22/2018
[  857.884938] task: 883ffafe8fd0 ti: 88342af3 task.ti: 
88342af3
[  857.924120] RIP: 0010:[]  [] 
compound_unlock_irqrestore+0xe/0x20
[  857.970708] RSP: 0018:88342af33d38  EFLAGS: 0246
[  857.999742] RAX:  RBX: 88207ffda068 RCX: 00e5
[  858.037165] RDX: 0246 RSI: 0246 RDI: 0246
[  858.074416] RBP: 88342af33d38 R08:  R09: 
[  858.111519] R10: 88207ffcfac0 R11: ea00fff40280 R12: 0200
[  858.148421] R13: 0001fff40280 R14: 8118cd84 R15: 88342af33ce8
[  858.185845] FS:  7fc797d1e740() GS:883fff0c() 
knlGS:
[  858.227062] CS:  0010 DS:  ES:  CR0: 80050033
[  858.257819] CR2: 004116d0 CR3: 003fc2ec CR4: 001607e0
[  858.295143] DR0:  DR1:  DR2: 
[  858.332145] DR3:  DR6: fffe0ff0 DR7: 0400
[  858.369097] Call Trace:
[  858.384829]  [] put_compound_page+0x149/0x174
[  858.416176]  [] put_page+0x45/0x50
[  858.443185]  [] cxiReleaseAndForgetPages+0xda/0x220 
[mmfslinux]
[  858.481751]  [] ? cxiDeallocPageList+0xbd/0x110 [mmfslinux]
[  858.518206]  [] cxiDeallocPageList+0x45/0x110 [mmfslinux]
[  858.554438]  [] ? _raw_spin_lock+0x10/0x30
[  858.585522]  [] cxiFreeSharedMemory+0x12a/0x130 [mmfslinux]
[  858.622670]  [] kxFreeAllSharedMemory+0xe2/0x160 [mmfs26]
[  858.659246]  [] mmfs+0xc85/0xca0 [mmfs26]
[  858.689379]  [] gpfs_clean+0x26/0x30 [mmfslinux]
[  858.722330]  [] cleanup_module+0x25/0x30 [mmfs26]
[  858.755431]  [] SyS_delete_module+0x19b/0x300
[  858.786882]  [] system_call_fastpath+0x16/0x1b
[  858.818776] Code: 89 ca 44 89 c1 4c 8d 43 10 e8 6f 2b ff ff 89 c2 48 89 13 
5b 5d c3 0f 1f 80 00 00 00 00 55 48 89 e5 f0 80 67 03 fe 48 89 f7 57 9d <0f> 1f 
44 00 00 5d c3 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
[  859.068528] hrtimer: interrupt took 2877171 ns
[  870.517924] INFO: rcu_sched self-detected stall on CPU { 55}  (t=240003 
jiffies g=18437 c=18436 q=194992)
[  870.577882] Task dump for CPU 55:
[  870.602837] rmmod   R  running task0 16429  16374 0x0008
[  870.645206] Call Trace:
[  870.666388][] sched_show_task+0xa8/0x110
[  870.704271]  [] dump_cpu_task+0x39/0x70
[  870.738421]  [] rcu_dump_cpu_stacks+0x90/0xd0
[  870.775339]  [] rcu_check_callbacks+0x442/0x730
[  870.812353]  [] ? tick_sched_do_timer+0x50/0x50
[  870.848875]  [] update_process_times+0x46/0x80
[  870.884847]  [] tick_sched_handle+0x30/0x70
[  870.919740]  [] tick_sched_timer+0x39/0x80
[  870.953660]  [] __hrtimer_run_queues+0xd4/0x260
[  870.989276]  [] hrtimer_interrupt+0xaf/0x1d0
[  871.023481]  [] local_apic_timer_interrupt+0x35/0x60
[  871.061233]  [] smp_apic_timer_interrupt+0x3d/0x50
[  871.097838]  [] apic_timer_interrupt+0x232/0x240
[  871.133232][] ? put_page_testzero+0x8/0x15
[  871.170089]  [] put_compound_page+0x151/0x174
[  871.204221]  [] put_page+0x45/0x50
[  871.234554]  [] cxiReleaseAndForgetPages+0xda/0x220 
[mmfslinux]
[  871.275763]  [] ? cxiDeallocPageList+0xbd/0x110 [mmfslinux]
[  871.316987]  [] cxiDeallocPageList+0x45/0x110 [mmfslinux]
[  871.356886]  [] ? _raw_spin_lock+0x10/0x30
[  871.389455]  [] cxiFreeSharedMemory+0x12a/0x130 [mmfslinux]
[  871.429784]  [] kxFreeAllSharedMemory+0xe2/0x160 [mmfs26]
[  871.468753]  [] mmfs+0xc85/0xca0 [mmfs26]
[  871.501196]  [] gpfs_clean+0x26/0x30 [mmfslinux]
[  871.536562]  [] cleanup_module+0x25/0x30 [mmfs26]
[  871.572110]  [] SyS_delete_module+0x19b/0x300
[  871.606048]  [] system_call_fastpath+0x16/0x1b

--
Paul Scherrer Institut
Science IT
Heiner Billich
WHGA 106
CH 5232  Villigen PSI
056 310 36 02
https://www.psi.ch


From:  on behalf of Sven Oehme 

Reply-To: gpfsug main discussion list 
Date: Thursday 12 July 2018 at 15:42
To: gpfsug main discussion list 
Subject: Re: [gpf

Re: [gpfsug-discuss] /sbin/rmmod mmfs26 hangs on mmshutdown

2018-07-12 Thread Billich Heinrich Rainer (PSI)
056 310 36 02
https://www.psi.ch


From:  on behalf of Sven Oehme 

Reply-To: gpfsug main discussion list 
Date: Wednesday 11 July 2018 at 15:47
To: gpfsug main discussion list 
Subject: Re: [gpfsug-discuss] /sbin/rmmod mmfs26 hangs on mmshutdown

Hi,

what does numactl -H report ?

also check if this is set to yes :

root@fab3a:~# mmlsconfig numaMemoryInterleave
numaMemoryInterleave yes

Sven

On Wed, Jul 11, 2018 at 6:40 AM Billich Heinrich Rainer (PSI) 
<heiner.bill...@psi.ch> wrote:
Hello,

I have two nodes which hang on ‘mmshutdown’; specifically, the command 
‘/sbin/rmmod mmfs26’ hangs. I get kernel messages which I append below. I 
wonder if this looks familiar to somebody? Is it a known bug? I can avoid the 
issue if I reduce pagepool from 128G to 64G.

Running ‘systemctl stop gpfs’ shows the same issue. It forcefully terminates 
after a while, but ‘rmmod’ stays stuck.

Two functions, cxiReleaseAndForgetPages and put_page, seem to be involved; the 
first is part of GPFS, the second a kernel call.

The servers have 256G memory  and 72 (virtual) cores each.
I run 5.0.1-1 on RHEL7.4  with kernel 3.10.0-693.17.1.el7.x86_64.

I can try to switch back to 5.0.0

Thank you & kind regards,

Heiner



Jul 11 14:12:04 node-1.x.y mmremote[1641]: Unloading module mmfs26
Jul 11 14:12:04 node-1.x.y mmsysmon[2440]: [E] Event raised: The Spectrum Scale 
service process not running on this node. Normal operation cannot be done
Jul 11 14:12:04 node-1.x.y mmsysmon[2440]: [I] Event raised: The Spectrum Scale 
service process is running
Jul 11 14:12:04 node-1.x.y mmsysmon[2440]: [E] Event raised: The node is not 
able to form a quorum with the other available nodes.
Jul 11 14:12:38 node-1.x.y sshd[2826]: Connection closed by xxx port 52814 
[preauth]

Jul 11 14:12:41 node-1.x.y kernel: NMI watchdog: BUG: soft lockup - CPU#28 
stuck for 23s! [rmmod:2695]

Jul 11 14:12:41 node-1.x.y kernel: Modules linked in: mmfs26(OE-) mmfslinux(OE) 
tracedev(OE) tcp_diag inet_diag rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) 
ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_fpga_tools(OE) 
mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) mlx4_ib(OE) ib_core(OE) vfat 
fat ext4 sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi 
mbcache jbd2 kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw 
gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support ipmi_ssif 
pcc_cpufreq hpilo ipmi_si sg hpwdt pcspkr i2c_i801 lpc_ich ipmi_devintf wmi 
ioatdma shpchp ipmi_msghandler acpi_power_meter binfmt_misc nfsd auth_rpcgss 
nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod crc_t10dif 
crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect
Jul 11 14:12:41 node-1.x.y kernel:  sysimgblt fb_sys_fops ttm ixgbe 
mlx4_core(OE) crct10dif_pclmul mdio mlx_compat(OE) crct10dif_common drm ptp 
crc32c_intel devlink hpsa pps_core i2c_core scsi_transport_sas dca dm_mirror 
dm_region_hash dm_log dm_mod [last unloaded: tracedev]
Jul 11 14:12:41 node-1.x.y kernel: CPU: 28 PID: 2695 Comm: rmmod Tainted: G 
   W  OEL    3.10.0-693.17.1.el7.x86_64 #1
Jul 11 14:12:41 node-1.x.y kernel: Hardware name: HP ProLiant DL380 
Gen9/ProLiant DL380 Gen9, BIOS P89 01/22/2018
Jul 11 14:12:41 node-1.x.y kernel: task: 8808c4814f10 ti: 881619778000 
task.ti: 881619778000
Jul 11 14:12:41 node-1.x.y kernel: RIP: 0010:[]  
[] put_compound_page+0xc3/0x174
Jul 11 14:12:41 node-1.x.y kernel: RSP: 0018:88161977bd50  EFLAGS: 0246
Jul 11 14:12:41 node-1.x.y kernel: RAX: 0283 RBX: fae3d201 
RCX: 0284
Jul 11 14:12:41 node-1.x.y kernel: RDX: 0283 RSI: 0246 
RDI: ea003d478000
Jul 11 14:12:41 node-1.x.y kernel: RBP: 88161977bd68 R08: 881ffae3d1e0 
R09: 000180800059
Jul 11 14:12:41 node-1.x.y kernel: R10: fae3d201 R11: ea007feb8f40 
R12: fae3d201
Jul 11 14:12:41 node-1.x.y kernel: R13: 88161977bd40 R14:  
R15: 88161977bd40
Jul 11 14:12:41 node-1.x.y kernel: FS:  7f81a1db0740() 
GS:883ffee8() knlGS:
Jul 11 14:12:41 node-1.x.y kernel: CS:  0010 DS:  ES:  CR0: 
80050033
Jul 11 14:12:41 node-1.x.y kernel: CR2: 7fa96e38f980 CR3: 000c36b2c000 
CR4: 001607e0
Jul 11 14:12:41 node-1.x.y kernel: DR0:  DR1:  
DR2: 
Jul 11 14:12:41 node-1.x.y kernel: DR3:  DR6: fffe0ff0 
DR7: 0400

Jul 11 14:12:41 node-1.x.y kernel: Call Trace:
Jul 11 14:12:41 node-1.x.y kernel:  [] put_page+0x45/0x50
Jul 11 14:12:41 node-1.x.y kernel:  [] 
cxiReleaseAndForgetPages+0xb2/0x1c0 [mmfslinux]
Jul 11 14:12:41 node-1.x.y kernel:  [] 
cxiDeallocPageList+0x45/0x110 [mmfslinux]
Jul 11 14:12:41 node-1.x.y kernel:  [] ? 
kmem_cache_free+0x1e2/0x200
Jul 11 14:12:41 node-1.x.y kernel:  [] 
cxiFreeSharedMemory+0x12a/0x130 [mmfslinux]
Jul 11 14:12:41 node-1.

[gpfsug-discuss] /sbin/rmmod mmfs26 hangs on mmshutdown

2018-07-11 Thread Billich Heinrich Rainer (PSI)
Hello,

I have two nodes which hang on ‘mmshutdown’; specifically, the command 
‘/sbin/rmmod mmfs26’ hangs. I get kernel messages which I append below. I 
wonder if this looks familiar to somebody? Is it a known bug? I can avoid the 
issue if I reduce pagepool from 128G to 64G.
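
For reference, lowering the pagepool would be applied roughly like this; the 
node names are placeholders taken from the logs, and the new value only takes 
effect when the daemon restarts on those nodes:

  mmchconfig pagepool=64G -N node-1.x.y,node-2.x.y
  mmshutdown -N node-1.x.y && mmstartup -N node-1.x.y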

Running ‘systemctl stop gpfs’ shows the same issue. It forcefully terminates 
after a while, but ‘rmmod’ stays stuck.

Two functions, cxiReleaseAndForgetPages and put_page, seem to be involved; the 
first is part of GPFS, the second a kernel call.

The servers have 256G memory  and 72 (virtual) cores each.
I run 5.0.1-1 on RHEL7.4  with kernel 3.10.0-693.17.1.el7.x86_64.

I can try to switch back to 5.0.0

Thank you & kind regards,

Heiner



Jul 11 14:12:04 node-1.x.y mmremote[1641]: Unloading module mmfs26
Jul 11 14:12:04 node-1.x.y mmsysmon[2440]: [E] Event raised: The Spectrum Scale 
service process not running on this node. Normal operation cannot be done
Jul 11 14:12:04 node-1.x.y mmsysmon[2440]: [I] Event raised: The Spectrum Scale 
service process is running
Jul 11 14:12:04 node-1.x.y mmsysmon[2440]: [E] Event raised: The node is not 
able to form a quorum with the other available nodes.
Jul 11 14:12:38 node-1.x.y sshd[2826]: Connection closed by xxx port 52814 
[preauth]

Jul 11 14:12:41 node-1.x.y kernel: NMI watchdog: BUG: soft lockup - CPU#28 
stuck for 23s! [rmmod:2695]

Jul 11 14:12:41 node-1.x.y kernel: Modules linked in: mmfs26(OE-) mmfslinux(OE) 
tracedev(OE) tcp_diag inet_diag rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) 
ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_fpga_tools(OE) 
mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) mlx4_ib(OE) ib_core(OE) vfat 
fat ext4 sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi 
mbcache jbd2 kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw 
gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support ipmi_ssif 
pcc_cpufreq hpilo ipmi_si sg hpwdt pcspkr i2c_i801 lpc_ich ipmi_devintf wmi 
ioatdma shpchp ipmi_msghandler acpi_power_meter binfmt_misc nfsd auth_rpcgss 
nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod crc_t10dif 
crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect
Jul 11 14:12:41 node-1.x.y kernel:  sysimgblt fb_sys_fops ttm ixgbe 
mlx4_core(OE) crct10dif_pclmul mdio mlx_compat(OE) crct10dif_common drm ptp 
crc32c_intel devlink hpsa pps_core i2c_core scsi_transport_sas dca dm_mirror 
dm_region_hash dm_log dm_mod [last unloaded: tracedev]
Jul 11 14:12:41 node-1.x.y kernel: CPU: 28 PID: 2695 Comm: rmmod Tainted: G 
   W  OEL    3.10.0-693.17.1.el7.x86_64 #1
Jul 11 14:12:41 node-1.x.y kernel: Hardware name: HP ProLiant DL380 
Gen9/ProLiant DL380 Gen9, BIOS P89 01/22/2018
Jul 11 14:12:41 node-1.x.y kernel: task: 8808c4814f10 ti: 881619778000 
task.ti: 881619778000
Jul 11 14:12:41 node-1.x.y kernel: RIP: 0010:[]  
[] put_compound_page+0xc3/0x174
Jul 11 14:12:41 node-1.x.y kernel: RSP: 0018:88161977bd50  EFLAGS: 0246
Jul 11 14:12:41 node-1.x.y kernel: RAX: 0283 RBX: fae3d201 
RCX: 0284
Jul 11 14:12:41 node-1.x.y kernel: RDX: 0283 RSI: 0246 
RDI: ea003d478000
Jul 11 14:12:41 node-1.x.y kernel: RBP: 88161977bd68 R08: 881ffae3d1e0 
R09: 000180800059
Jul 11 14:12:41 node-1.x.y kernel: R10: fae3d201 R11: ea007feb8f40 
R12: fae3d201
Jul 11 14:12:41 node-1.x.y kernel: R13: 88161977bd40 R14:  
R15: 88161977bd40
Jul 11 14:12:41 node-1.x.y kernel: FS:  7f81a1db0740() 
GS:883ffee8() knlGS:
Jul 11 14:12:41 node-1.x.y kernel: CS:  0010 DS:  ES:  CR0: 
80050033
Jul 11 14:12:41 node-1.x.y kernel: CR2: 7fa96e38f980 CR3: 000c36b2c000 
CR4: 001607e0
Jul 11 14:12:41 node-1.x.y kernel: DR0:  DR1:  
DR2: 
Jul 11 14:12:41 node-1.x.y kernel: DR3:  DR6: fffe0ff0 
DR7: 0400

Jul 11 14:12:41 node-1.x.y kernel: Call Trace:
Jul 11 14:12:41 node-1.x.y kernel:  [] put_page+0x45/0x50
Jul 11 14:12:41 node-1.x.y kernel:  [] 
cxiReleaseAndForgetPages+0xb2/0x1c0 [mmfslinux]
Jul 11 14:12:41 node-1.x.y kernel:  [] 
cxiDeallocPageList+0x45/0x110 [mmfslinux]
Jul 11 14:12:41 node-1.x.y kernel:  [] ? 
kmem_cache_free+0x1e2/0x200
Jul 11 14:12:41 node-1.x.y kernel:  [] 
cxiFreeSharedMemory+0x12a/0x130 [mmfslinux]
Jul 11 14:12:41 node-1.x.y kernel:  [] 
kxFreeAllSharedMemory+0xe2/0x160 [mmfs26]
Jul 11 14:12:41 node-1.x.y kernel:  [] mmfs+0xc85/0xca0 
[mmfs26]
Jul 11 14:12:41 node-1.x.y kernel:  [] gpfs_clean+0x26/0x30 
[mmfslinux]
Jul 11 14:12:41 node-1.x.y kernel:  [] 
cleanup_module+0x25/0x30 [mmfs26]
Jul 11 14:12:41 node-1.x.y kernel:  [] 
SyS_delete_module+0x19b/0x300
Jul 11 14:12:41 node-1.x.y kernel:  [] 
system_call_fastpath+0x16/0x1b
Jul 11 14:12:41 node-1.x.y kernel: Code: d1 00 00 00 4c 89 e7 e8 3a ff ff ff e9 
c4 00 00 00 4c 39 e3 7

[gpfsug-discuss] -o syncnfs has no effect?

2018-07-05 Thread Billich Heinrich Rainer (PSI)
Hello,

I try to mount a fs with "-o syncnfs" as we'll export it with CES/Protocols. 
But I never see the mount option displayed when I do

 # mount | grep fs-name

This is a remote cluster mount, we'll run the Protocol nodes in  a separate 
cluster. On the home cluster I see the option 'nfssync' in the output of 
'mount'.

My conclusion is that the mount option "syncnfs" has no effect on remote 
cluster mounts, which seems a bit strange.

Please can someone clarify this? What is the impact on protocol nodes 
exporting remote cluster mounts? Is there any chance of data corruption? Or are 
some mount options implicitly inherited from the home cluster? I've read 
'syncnfs' is the default on Linux, but I would like to know for sure.
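
For reference, what I check (the filesystem name is a placeholder):

  grep fsname /proc/mounts   # the options the kernel actually applied on this node
  mmlsfs fsname -o           # 'Additional mount options' set with mmchfs -o, on the owning cluster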

Funny enough I can pass arbitrary options with 

  # mmmount  -o some-garbage

which are silently ignored.

 I did 'mmchfs -o syncnfs' on the home cluster and the syncnfs option is 
present in /etc/fstab on the remote cluster. I did not remount on all nodes.

Thank you, I'll appreciate any hints or replies.

Heiner

Versions:
Remote cluster 5.0.1 on RHEL7.4 (mounts the fs and runs protocol nodes)
Home cluster 4.2.3-8 on RHEL6 (exports the fs, owns the storage)
Filesystem: 17.00 (4.2.3.0)
All Linux x86_64 with Spectrum Scale Standard Edition
--
Paul Scherrer Institut
Science IT
Heiner Billich
WHGA 106
CH 5232  Villigen PSI
056 310 36 02
https://www.psi.ch
 

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


[gpfsug-discuss] mmhealth with 4.2.3-5 gives many false alarms ib_rdma_nic_unrecognized

2018-01-09 Thread Billich Heinrich Rainer (PSI)
Hello,

I just upgraded to 4.2.3-5 and now see many failures ‘ib_rdma_nic_unrecognized’ 
in mmhealth,  like


Component        Status     Status Change        Reasons
------------------------------------------------------------------------------
NETWORK          DEGRADED   2018-01-06 15:57:21  ib_rdma_nic_unrecognized(mlx4_0/1)
  mlx4_0/1       FAILED     2018-01-06 15:57:21  ib_rdma_nic_unrecognized


I didn’t see these messages with 4.2.3-4. The relevant lines in 
/usr/lpp/mmfs/lib/mmsysmon/NetworkService.py changed between -4 and -5.

What seems to happen: I have Mellanox VPI cards with one port Infiniband and 
one port Ethernet, and mmhealth complains about the Ethernet port. Hmm – I 
specified only the active Infiniband ports in verbsPorts; I don’t see why 
mmhealth cares about any other ports when it checks RDMA.

So probably a bug; I’ll open a PMR unless somebody points me to a different 
solution. I tried, but I can’t hide this event in mmhealth.
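
For reference, what I look at (nothing exotic):

  mmhealth node show NETWORK   # per-component details and the raised events
  mmhealth node eventlog       # history of the events
  mmlsconfig verbsPorts        # confirms that only the Infiniband ports are configured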

Cheers,
Heiner

--
Paul Scherrer Institut
Science IT
Heiner Billich
WHGA 106
CH 5232  Villigen PSI
056 310 36 02
https://www.psi.ch
 




___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] slow startup of AFM flush to home

2017-10-16 Thread Billich Heinrich Rainer (PSI)
Hello Scott,

Thank you. I did set afmFlushThreadDelay = 1 and got a much faster startup. 
Setting it to 0 didn’t improve things further. I’m not sure how much we’ll need 
this in production, when most of the time the queue is full, but for 
benchmarking during setup it helps a lot. (We run 4.2.3-4 on RHEL7.)
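
Roughly what I did on the gateway node; the node name is a placeholder and the 
parameter is described as experimental, so check whether it applies to your 
release and whether a daemon restart is needed:

  mmchconfig afmFlushThreadDelay=1 -N gw-node-1
  mmlsconfig afmFlushThreadDelay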


Kind regards,

Heiner

Scott Fadden wrote:

When an AFM gateway is flushing data to the target (home) it starts flushing 
with a few threads (I don't remember the number) and ramps up to 
afmNumFlushThreads. How quickly this ramp-up occurs is controlled by 
afmFlushThreadDelay. The default is 5 seconds, so flushing only adds threads 
once every 5 seconds. 
This was an experimental parameter, so your mileage may vary.
 
 
Scott Fadden
Spectrum Scale - Technical Marketing
Phone: (503) 880-5833
sfad...@us.ibm.com
http://www.ibm.com/systems/storage/spectrum/scale
 
 
- Original message -
From: "Billich Heinrich Rainer (PSI)" 
Sent by: gpfsug-discuss-boun...@spectrumscale.org
To: "gpfsug-discuss@spectrumscale.org" 
Cc:
Subject: [gpfsug-discuss] AFM: Slow startup of flush from cache to home
Date: Fri, Oct 13, 2017 10:16 AM
 
Hello,



Running an AFM IW cache  we noticed that AFM starts the flushing of data from 
cache to home rather slow, say at 20MB/s, and only slowly increases to several 
100MB/s after a few minutes. As soon as the pending queue gets no longer filled 
the data rate drops, again.



I assume that this is a good behavior for WAN traffic where you don’t want to 
use the full bandwidth from the beginning but only if really needed. For our 
local setup with dedicated links I would prefer a much more aggressive behavior 
to get data transferred asap to home.



Am I right, does AFM implement such a ‘slow startup’, and is there a way to 
change this behavior? We did increase afmNumFlushThreads  to 128. Currently we 
measure with many small files (1MB). For large files the behavior is different, 
we get a stable data rate from the beginning, but I did not yet try with a 
continuous write on the cache to see whether I see an increase after a while, 
too.



Thank you,



Heiner Billich
--
Paul Scherrer Institut
Science IT
Heiner Billich
WHGA 106
CH 5232  Villigen PSI
056 310 36 02
https://www.psi.ch



___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


[gpfsug-discuss] AFM: Slow startup of flush from cache to home

2017-10-13 Thread Billich Heinrich Rainer (PSI)
Hello,

Running an AFM IW cache we noticed that AFM starts the flushing of data from 
cache to home rather slowly, say at 20MB/s, and only slowly increases to several 
100MB/s after a few minutes. As soon as the pending queue is no longer being 
filled, the data rate drops again.

I assume that this is a good behavior for WAN traffic where you don’t want to 
use the full bandwidth from the beginning but only if really needed. For our 
local setup with dedicated links I would prefer a much more aggressive behavior 
to get data transferred asap to home.

Am I right, does AFM implement such a ‘slow startup’, and is there a way to 
change this behavior? We did increase afmNumFlushThreads  to 128. Currently we 
measure with many small files (1MB). For large files the behavior is different, 
we get a stable data rate from the beginning, but I did not yet try with a 
continuous write on the cache to see whether I see an increase after a while, 
too.

Thank you,

Heiner Billich

--
Paul Scherrer Institut
Science IT
Heiner Billich
WHGA 106
CH 5232  Villigen PSI
056 310 36 02
https://www.psi.ch
 


___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


[gpfsug-discuss] AFM - prefetch of many small files - tuning - storage latency required to increase max socket buffer size ...

2017-10-04 Thread Billich Heinrich Rainer (PSI)
Hello,

A while ago I asked the list for advice on how to tune AFM to speed up the 
prefetch of small files (~1MB). In the meantime we got some results which I 
want to share.

We had to increase the maximum socket buffer sizes to very high values of 
40-80MB. Consider that we use IP over Infiniband and the 
bandwidth-delay product is about 5MB (1-10us latency). 
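
The knobs involved are the usual sysctls; the numbers below are only examples in 
the range we ended up with, not a recommendation:

  sysctl -w net.core.rmem_max=83886080
  sysctl -w net.core.wmem_max=83886080
  sysctl -w net.ipv4.tcp_rmem='4096 87380 83886080'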

How do we explain this? The reads on the NFS server have a latency of about 
20ms - this is the physics of spinning disks. Hence a single client can issue up 
to 50 requests/s, and each request is 1MB. To get 1GB/s we need 20 clients in 
parallel, so at all times we have about 20 requests pending. It looks like the 
server allocates the socket buffer space before it asks for the data, hence it 
allocates/blocks about 20MB at all times. Surprisingly it’s storage latency and 
not network latency that required us to increase the max. socket buffer size. 

For large files prefetch works and reduces the latency of reads drastically and 
no special tuning is required.

We tested with kernel-nfs and GPFS 4.2.3 on RHEL7. Whether ganesha shows a 
similar pattern would be interesting to know. Once we had fixed the NFS issues, 
AFM did show a nice parallel prefetch up to ~1GB/s with 1MB-sized files without 
any tuning. Still much below the 4GB/s measured with iperf between the two nodes ….

Kind regards,

Heiner

--
Paul Scherrer Institut
Science IT
Heiner Billich
WHGA 106
CH 5232  Villigen PSI
056 310 36 02
https://www.psi.ch
 


___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Use AFM for migration of many small files

2017-09-06 Thread Billich Heinrich Rainer (PSI)
Hello Venkateswara, Edward,

Thank you for the comments on how to speed up AFM prefetch with small files. We 
run 4.2.2-3 and the AFM mode is RO and we have just a single gateway, i.e. no 
parallel reads for large files. We will try to increase the value of 
afmNumFlushThreads. It wasn’t clear to me that these threads do read from home, 
too - at least for prefetch. First I will try a plain NFS mount and see how 
parallel reads of many small files  scale the throughput. Next I will try AFM 
prefetch. I don’t do nice benchmarking, just watching dstat output. We prefetch 
100’000 files in one bunch, so there is ample time to observe. 
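
Something like this is what I have in mind for the plain-NFS test (the path and 
the degree of parallelism are placeholders):

  # read every file once with 20 readers in parallel, watch throughput with dstat
  find /mnt/nfs-home/testdir -type f -print0 | \
      xargs -0 -P 20 -I{} dd if={} of=/dev/null bs=1M status=none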

The basic issue is that we get just about 45MB/s for a sequential read of many 
thousands of files with 1MB per file on the home cluster, i.e. we read one file 
at a time before we switch to the next. This is no surprise: each read takes 
about 20ms to complete, so at most we get 50 reads of 1MB per second. We’ve seen 
this on classical RAID storage and on DSS/ESS systems. It’s likely just the 
physics of spinning disks and the fact that we do one read at a time and don’t 
allow any parallelism - we wait for one or two I/Os to single disks to complete 
before we continue. With larger files prefetch jumps in and fires many reads in 
parallel … To get 1’000MB/s I need to do 1’000 reads/s and need to have ~20 
reads in progress in parallel all the time … we’ll see how close we get to 
1’000MB/s with ‘many small files’.

Kind regards,

Heiner
--
Paul Scherrer Institut
Science IT
Heiner Billich
WHGA 106
CH 5232  Villigen PSI
056 310 36 02
https://www.psi.ch
 

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


[gpfsug-discuss] Use AFM for migration of many small files

2017-09-04 Thread Billich Heinrich Rainer (PSI)
Hello,

We use AFM prefetch to migrate data between two clusters (using NFS). This 
works fine with large files, say 1+GB. But we have millions of smaller files,  
about 1MB each. Here I see just ~150MB/s – compare this to the 1000+MB/s we get 
for larger files.

I assume that we would need more parallelism: does prefetch pull just one file 
at a time? Each file needs some or many metadata operations plus a single or 
just a few reads and writes; doing this sequentially adds up all the latencies 
of NFS+GPFS. This is my explanation. With larger files, GPFS prefetch on home 
will help.

Please can anybody comment: Is this right, does AFM prefetch handle one file at 
a time in a sequential manner? And is there any way to change this behavior? Or 
am I wrong and I need to look elsewhere to get better performance for prefetch 
of many smaller files?

We will migrate several filesets in parallel, but still, with individual 
filesets up to 350TB in size, 150MB/s isn’t fun. Also, just about 150 files per 
second looks poor.

The setup is quite new, hence there may be other places to look at. 
It’s all RHEL7 and Spectrum Scale 4.2.2-3 on the AFM cache.

Thank you,

Heiner
--,
Paul Scherrer Institut
Science IT
Heiner Billich
WHGA 106
CH 5232  Villigen PSI
056 310 36 02
https://www.psi.ch
 


___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


[gpfsug-discuss] does AFM support NFS via RDMA

2017-07-11 Thread Billich Heinrich Rainer (PSI)
Hello,

We run AFM using NFS as transport between home and cache. Using 
IP-over-Infiniband we see a throughput between 1 and 2 GB/s. This is not bad, 
but far from what a native IB link provides (6GB/s). Does AFM’s NFS client on 
the gateway nodes support NFS over RDMA? I would like to try. Or should we try 
to tune NFS and the IP stack? I wonder if anybody got throughput above 2 GB/s 
using IPoIB and FDR between two nodes.
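
On the IPoIB side, the obvious knobs are connected mode and the large IPoIB MTU 
(a sketch only; the interface name is a placeholder):

  cat /sys/class/net/ib0/mode              # datagram or connected
  echo connected > /sys/class/net/ib0/mode
  ip link set ib0 mtu 65520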

We can’t use a native gpfs multicluster mount – this links home and cache much 
too strongly: if home fails, the cache will unmount the cache fileset; this is 
what I take from the manuals. 

We run spectrum scale 4.2.2/4.2.3 on Redhat 7.

Thank you,

Heiner Billich

--
Paul Scherrer Institut
Heiner Billich
WHGA 106
CH 5232  Villigen
056 310 36 02
 


___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


[gpfsug-discuss] AFM - how to update directories with deleted files during prefetch

2017-06-30 Thread Billich Heinrich Rainer (PSI)
Hello

I have a short question about AFM prefetch and some more remarks regarding AFM 
and its use for data migration. I understand that many of you have done this 
for very large amounts of data and numbers of files. I would welcome any input, 
comments or remarks. Sorry if this is a bit too long for a mailing list.

Short:
How can I tell an AFM cache to update a directory when I do prefetch? I know 
about ‘find .’ or ‘ls -lsR’, but this really is no option for us as it takes too 
long. Mostly I want to update the directories to make the AFM cache aware of 
file deletions on home. On home I can use a policy run to find all directories 
which changed since the last update and pass them to prefetch on the AFM cache.

I know that I can find some workaround based on the directory list, like an ‘ls 
–lsa’ just for those directories, but this doesn’t sound very efficient. And 
depending on cache effects and timeout settings it may work or not (o.k. – it 
will work most time).

We do regular file deletions and will accumulated millions of deleted files on 
cache over time if we don’t update the directories to make AFM cache aware of 
the deletion.

Background:
We will use AFM to migrate data on filesets to another cluster. We have to do 
this several times in the next few months, hence I want to get a reliable and 
easy to use procedure. The old system is home, the new system is a read-only 
AFM cache. We want to use ‘mmafmctl prefetch’ to move the data. Home will be in 
use while we run the migration. Once almost all data is moved we do a (short) 
break for a last sync and make the read-only AFM cache a ‘normal’ fileset. 
During the break I want  to use policy runs and prefetch only and no time 
consuming ‘ls –lsr’ or ‘find .’ I don’t want to use this metadata intensive 
posix operation during operation, either.

More general:
AFM can be used for data migration. But I don’t see how to use it efficiently. 
How to do incremental transfers, how to ensure that the we really have 
identical copies before we switch and how to keep the switch time short , i.e. 
the time when both old and new aren’t accessible for clients,

Wish – maybe an RFE?
I can use policy runs to collect all changed items on home since the last 
update. I wish that I can pass this list to afm prefetch to do all updates on 
AFM cache, too. Same as backup tools use the list to do incremental backups.

And a tool to create policy lists of home and cache and to compare the lists 
would be nice, too. As you do this during the break/switch it should be fast 
and reliable and leave no doubts.

Kind regards,

Heiner

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss