[gpfsug-discuss] Fw: CES node slow to respond

2017-03-24 Thread IBM Spectrum Scale
Caching of file descriptors can be disabled with "Cache_FDs = FALSE;" in the cacheinode{} block. Regards, The Spectrum Scale (GPFS) team -- If you feel that your question can benefit
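
For illustration, a minimal sketch of how that setting could look in a Ganesha configuration file; the file path and exact block placement here are assumptions and vary by release, so treat this as a sketch rather than the shipped CES configuration:

    # hypothetical excerpt, e.g. /etc/ganesha/ganesha.conf (path is an assumption)
    CACHEINODE {
        Cache_FDs = FALSE;   # stop ganesha from caching open file descriptors
    }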

Re: [gpfsug-discuss] CES node slow to respond

2017-03-24 Thread IBM Spectrum Scale
Forwarding Malahal's reply. The maximum number of open files for the ganesha process is set internally (and written to the /etc/sysconfig/ganesha file as the NOFILE parameter) based on the MFTC (maxFilesToCache) GPFS parameter. Regards, The Spectrum Scale (GPFS) team
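
A hedged way to see that relationship on a CES node, assuming the file layout described above (the derived NOFILE value is computed internally and will differ per configuration):

    /usr/lpp/mmfs/bin/mmlsconfig maxFilesToCache   # GPFS-side setting that drives the limit
    grep NOFILE /etc/sysconfig/ganesha             # value written out for the ganesha service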

Re: [gpfsug-discuss] CES node slow to respond

2017-03-24 Thread Matt Weil
On 3/24/17 12:17 PM, IBM Spectrum Scale wrote: Hi Bryan, Making sure Malahal's reply was received by the user group. >> Then we noticed that the CES host had 5.4 million files open This is technically not possible with ganesha alone. A process can only open 1 million files on a RHEL distro.
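
One hedged way to sanity-check numbers like these on a node, assuming the NFS daemon runs as ganesha.nfsd and procfs is available:

    pid=$(pidof ganesha.nfsd)
    grep 'open files' /proc/$pid/limits   # per-process limit on open file descriptors
    ls /proc/$pid/fd | wc -l              # file descriptors the daemon actually holds right now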

Re: [gpfsug-discuss] CES node slow to respond

2017-03-24 Thread Matt Weil
Also running version 4.2.2.2. On 3/24/17 2:57 PM, Matt Weil wrote: On 3/24/17 1:13 PM, Bryan Banister wrote: Hi Vipul, Hmm… interesting. We have dedicated systems running CES and nothing else, so the only thing opening files on GPFS is ganesha. IBM Support recommended we massively

Re: [gpfsug-discuss] CES node slow to respond

2017-03-24 Thread Matt Weil
On 3/24/17 1:13 PM, Bryan Banister wrote: Hi Vipul, Hmm… interesting. We have dedicated systems running CES and nothing else, so the only thing opening files on GPFS is ganesha. IBM Support recommended we massively increase the maxFilesToCache to fix the performance issues we were having.

Re: [gpfsug-discuss] CES node slow to respond

2017-03-24 Thread Bryan Banister
Thanks Sven! We recently upgraded to 4.2.2 and will see about lowering the maxFilesToCache to something more appropriate. We’re not offering NFS access as a performance solution… but it can’t come to a crawl either! Your help is greatly appreciated as always, -Bryan From:
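
A sketch of how such a change might be applied to the protocol nodes only; the target value and the cesNodes node class are assumptions, so pick a value suited to the workload and expect the change to take effect when GPFS is recycled on those nodes:

    /usr/lpp/mmfs/bin/mmlsconfig maxFilesToCache                     # current setting
    /usr/lpp/mmfs/bin/mmchconfig maxFilesToCache=100000 -N cesNodes  # example value, not a recommendation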

Re: [gpfsug-discuss] CES node slow to respond

2017-03-24 Thread Sven Oehme
Changes in the ganesha management code were made in April 2016 to reduce the need for a high maxFilesToCache value: the ganesha daemon adjusts its allowed file cache by reading the maxFilesToCache value and then reducing its allowed NOFILE value. The code shipped with the 4.2.2 release. You want a high

Re: [gpfsug-discuss] CES node slow to respond

2017-03-24 Thread Bryan Banister
Hi Vipul, Hmm... interesting. We have dedicated systems running CES and nothing else, so the only thing opening files on GPFS is ganesha. IBM Support recommended we massively increase the maxFilesToCache to fix the performance issues we were having. I could try to reproduce the problem to

Re: [gpfsug-discuss] strange waiters + filesystem deadlock

2017-03-24 Thread Aaron Knister
I believe it was created with -n 5000. Here's the exact command that was used: /usr/lpp/mmfs/bin/mmcrfs dnb03 -F ./disc_mmcrnsd_dnb03.lst -T /gpfsm/dnb03 -j cluster -B 1M -n 5000 -N 20M -r1 -R2 -m2 -M2 -A no -Q yes -v yes -i 512 --metadata-block-size=256K -L 8388608 -Aaron On 3/24/17 2:05
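
For anyone wanting to confirm that value on an existing filesystem, a hedged example using the device name from the command above:

    /usr/lpp/mmfs/bin/mmlsfs dnb03 -n   # estimated number of nodes the filesystem was created for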

Re: [gpfsug-discuss] strange waiters + filesystem deadlock

2017-03-24 Thread Sven Oehme
Was this filesystem created with -n 5000, or was that changed later with mmchfs? Please send the mmlsconfig/mmlscluster output to me at oeh...@us.ibm.com On Fri, Mar 24, 2017 at 10:58 AM Aaron Knister wrote: > I feel a little awkward about posting lists of IP's

Re: [gpfsug-discuss] strange waiters + filesystem deadlock

2017-03-24 Thread Aaron Knister
It's large, I do know that much. I'll defer to one of our other storage admins. Jordan, do you have that number handy? -Aaron On 3/24/17 2:03 PM, Fosburgh,Jonathan wrote: 7PB filesystem and only 28 million inodes in use? What is your average file size? Our large filesystem is 7.5P (currently

Re: [gpfsug-discuss] strange waiters + filesystem deadlock

2017-03-24 Thread Fosburgh,Jonathan
7PB filesystem and only 28 million inodes in use? What is your average file size? Our large filesystem is 7.5P (currently 71% used) with over 1 billion inodes in use. -- Jonathan Fosburgh Principal Application Systems Analyst Storage Team IT Operations

Re: [gpfsug-discuss] strange waiters + filesystem deadlock

2017-03-24 Thread Aaron Knister
I feel a little awkward about posting lists of IPs and hostnames on the mailing list (even though they're all internal) but I'm happy to send them to you directly. I've attached both an lsfs and an mmdf output of the fs in question here since that may be useful for others to see. Just a note

Re: [gpfsug-discuss] strange waiters + filesystem deadlock

2017-03-24 Thread Aaron Knister
Thanks Bob, Jonathan. We're running GPFS 4.1.1.10 and no HSM/LTFSEE. I'm currently gathering, as requested, a snap from all nodes (with traces). With 3500 nodes this ought to be entertaining. -Aaron On 3/24/17 12:50 PM, Oesterlin, Robert wrote: Hi Aaron Yes, I have seen this several times
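
For reference, a sketch of that kind of collection; the gpfs.snap options shown are assumptions and vary by release, so check the documentation before running it across 3500 nodes:

    /usr/lpp/mmfs/bin/gpfs.snap -N all -d /tmp/snapout   # node class and output directory are assumptions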

Re: [gpfsug-discuss] strange waiters + filesystem deadlock

2017-03-24 Thread Sven Oehme
OK, that seems a different problem than I was thinking. Can you send the output of mmlscluster, mmlsconfig, and mmlsfs all? Also, are you getting close to the fill grade on inodes or capacity on any of the filesystems? Sven On Fri, Mar 24, 2017 at 10:34 AM Aaron Knister wrote:
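
A hedged way to check both per filesystem, using the mount point and device name mentioned elsewhere in the thread:

    df -h /gpfsm/dnb03             # capacity as seen by the OS
    df -i /gpfsm/dnb03             # inode usage as seen by the OS
    /usr/lpp/mmfs/bin/mmdf dnb03   # GPFS view of free space and inode counts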

Re: [gpfsug-discuss] strange waiters + filesystem deadlock

2017-03-24 Thread Sven Oehme
You must be on SLES, as this segfaults only on SLES to my knowledge :-) I am looking for an NSD or manager node in your cluster that runs at 100% CPU usage. Do you have zimon deployed to look at CPU utilization across your nodes? Sven On Fri, Mar 24, 2017 at 10:08 AM Aaron Knister

Re: [gpfsug-discuss] CES node slow to respond

2017-03-24 Thread IBM Spectrum Scale
Hi Bryan, Making sure Malahal's reply was received by the user group. >> Then we noticed that the CES host had 5.4 million files open This is technically not possible with ganesha alone. A process can only open 1 million files on a RHEL distro. Either we have leaks in the kernel or some other

Re: [gpfsug-discuss] strange waiters + filesystem deadlock

2017-03-24 Thread Aaron Knister
Hi Sven, Which NSD server should I run top on, the fs manager? If so, the CPU load is about 155%. I'm working on perf top but not off to a great start... # perf top PerfTop:1095 irqs/sec kernel:61.9% exact: 0.0% [1000Hz cycles], (all, 28 CPUs)

Re: [gpfsug-discuss] strange waiters + filesystem deadlock

2017-03-24 Thread Sven Oehme
While this is happening, run top and see if there is very high CPU utilization at this time on the NSD server. If there is, run perf top (you might need to install the perf command) and see if the top CPU contender is a spinlock. If so, send a screenshot of perf top as I may know what that is and
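
A minimal sketch of that check on an NSD server, assuming perf is installed from the distro repositories:

    top -b -n 1 | head -n 20   # capture overall CPU utilization
    perf top                   # look for a spinlock symbol at the top of the sample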

Re: [gpfsug-discuss] strange waiters + filesystem deadlock

2017-03-24 Thread Oesterlin, Robert
Hi Aaron Yes, I have seen this several times over the last 6 months. I opened at least one PMR on it and they never could track it down. I did some snap dumps but without some traces, they did not have enough. I ended up getting out of it by selectively rebooting some of my NSD servers. My

Re: [gpfsug-discuss] strange waiters + filesystem deadlock

2017-03-24 Thread Fosburgh,Jonathan
This is one of the more annoying long waiter problems. We've seen it several times and I'm not sure they all had the same cause. What version of GPFS? Do you have anything like Tivoli HSM or LTFSEE? -- Jonathan Fosburgh Principal Application Systems Analyst Storage Team IT Operations

[gpfsug-discuss] strange waiters + filesystem deadlock

2017-03-24 Thread Aaron Knister
Since yesterday morning we've noticed some deadlocks on one of our filesystems that seem to be triggered by writing to it. The waiters on the clients look like this: 0x19450B0 ( 6730) waiting 2063.294589599 seconds, SyncHandlerThread: on ThCond 0x1802585CB10 (0xC9002585CB10)
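
For anyone following along, a hedged example of getting a live view of such waiters on a node (option availability depends on the GPFS release):

    /usr/lpp/mmfs/bin/mmdiag --waiters    # current long-running waiters on this node
    /usr/lpp/mmfs/bin/mmdiag --deadlock   # deadlock-related details, where supported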