Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-22 Thread Raymond Wan
Hi Will, On 23/2/2019 1:50 AM, Will Dennis wrote: For one of my groups, on the GPU servers in their cluster, I have provided a RAID-0 md array of multi-TB SSDs (for I/O speed) mounted on a given path ("/mnt/local" for historical reasons) that they can use for local scratch space. Their

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-22 Thread Christopher Samuel
On 2/22/19 3:54 PM, Aaron Jackson wrote: Happy to answer any questions about our setup. If folks are interested in a mailing list where this discussion would be decidedly on-topic then I'm happy to add people to the Beowulf list where there's a lot of other folks with expertise in this

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-22 Thread Aaron Jackson
Hi Will, I look after the GPU cluster in our vision lab. We have a similar setup - we work from a single ZFS file server. We have two pools: /db, which is about 40TB of spinning SAS built out of two raidz vdevs, with 16TB of L2ARC (across 4 SSDs). This reduces the size of ARC quite
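
A minimal sketch of a pool layout along these lines (device names and vdev widths are illustrative, not the actual hardware described above):

  # Two raidz vdevs of spinning SAS disks, plus SSDs attached as L2ARC cache devices
  zpool create db \
    raidz sdb sdc sdd sde sdf \
    raidz sdg sdh sdi sdj sdk \
    cache nvme0n1 nvme1n1 nvme2n1 nvme3n1

  # See how much of the read traffic the cache devices are absorbing
  zpool iostat -v db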

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-22 Thread Matthew BETTINGER
We stuck Avere between Isilon and a cluster to get us over the hump until the next budget cycle ... then we replaced it with Spectrum Scale for mid-level storage. Still use Lustre, of course, as scratch. On 2/22/19, 12:24 PM, "slurm-users on behalf of Will Dennis" wrote: (replies inline)

Re: [slurm-users] New Bright Cluster Slurm issue for AD users

2019-02-22 Thread Yugendra Guvvala
Just to close the loop on this: this was not a Slurm issue, it was more of an AD configuration issue. AD integration needs to be set up on all nodes of the cluster so that Slurm knows the user IDs. I had trouble with missing sssd DB folders and with the sssd.conf file not having the appropriate permissions, so look out for those. You
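
As a point of reference, sssd refuses to start unless its config file is root-owned with restrictive permissions; a quick sanity check along those lines (the user name below is just a placeholder):

  # sssd.conf must be owned by root and mode 0600, or sssd will not start
  chown root:root /etc/sssd/sssd.conf
  chmod 0600 /etc/sssd/sssd.conf
  systemctl restart sssd

  # Confirm each compute node resolves AD users to the same UIDs
  id some_ad_user
  getent passwd some_ad_user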

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-22 Thread Will Dennis
Yes, we've thought about using FS-Cache, but it doesn't help on the first read-in, and the cache eviction may affect subsequent read attempts... (different people are using different data sets, and the cache will probably not hold all of them at the same time...) On Friday, February 22, 2019

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-22 Thread Mark Hahn
applications) and are presently consuming the data via NFS mounts (both groups have 10G ethernet interconnects between the Slurm nodes and the NFS servers.) They are both now complaining of "too-long loading times" for the how about just using cachefs (backed by a local filesystem on ssd)?
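
On Linux this is typically FS-Cache backed by cachefilesd, with the cache directory placed on the local SSD; a rough sketch, with paths and the export name made up for illustration:

  # Put the FS-Cache backing store on the local SSD filesystem
  apt-get install cachefilesd        # or: yum install cachefilesd
  # then point "dir" in /etc/cachefilesd.conf at a directory on the SSD, e.g. /mnt/local/fscache
  systemctl enable --now cachefilesd

  # Mount the NFS export with the fsc option so repeated reads hit the local cache
  mount -t nfs -o fsc nfsserver:/export/datasets /mnt/datasets

As Will notes in his reply above, this only helps after the first read-in and is still subject to eviction once the combined working sets outgrow the cache.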

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-22 Thread Will Dennis
(replies inline) On Friday, February 22, 2019 1:03 PM, Alex Chekholko said: >Hi Will, > >If your bottleneck is now your network, you may want to upgrade the network. >Then the disks will become your bottleneck :) > Via network bandwidth analysis, it's not really network that's the problem...

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-22 Thread Alex Chekholko
Hi Will, You have bumped into the old adage: "HPC is just about moving the bottlenecks around". If your bottleneck is now your network, you may want to upgrade the network. Then the disks will become your bottleneck :) For GPU training-type jobs that load the same set of data over and over

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-22 Thread Will Dennis
Thanks for the reply, Ray. For one of my groups, on the GPU servers in their cluster, I have provided a RAID-0 md array of multi-TB SSDs (for I/O speed) mounted on a given path ("/mnt/local" for historical reasons) that they can use for local scratch space. Their other servers in the cluster
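
A sketch of that sort of striped scratch volume (device names and filesystem choice are illustrative):

  # Stripe four local SSDs into a single md RAID-0 device for scratch
  mdadm --create /dev/md0 --level=0 --raid-devices=4 \
        /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
  mkfs.xfs /dev/md0
  mount /dev/md0 /mnt/local

  # RAID-0 gives no redundancy, so anything under /mnt/local is strictly scratch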

[slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-22 Thread Will Dennis
Hi folks, Not directly Slurm-related, but... We have a couple of research groups that have large data sets they are processing via Slurm jobs (deep-learning applications) and are presently consuming the data via NFS mounts (both groups have 10G ethernet interconnects between the Slurm nodes

Re: [slurm-users] SLURM docs: HTML title should be same as page title

2019-02-22 Thread Will Dennis
Yes! I always have E_WAYTOOMANY tabs open in my Chrome browser, and when using the "TooManyTabs" plugin and searching for "Slurm" I see a whole bunch of "Slurm Workload Manager" entries, then have to guess which one is which page... -Original Message- From: slurm-users

Re: [slurm-users] SLURM docs: HTML title should be same as page title

2019-02-22 Thread Prentice Bisbal
On 2/22/19 9:53 AM, Patrice Peterson wrote: Hello, it's a little inconvenient that the title tag of all SLURM doc pages only says "Slurm Workload Manager". I usually have tabs to many SLURM doc pages open and it's difficult to differentiate between them all. Would it be possible to change the

Re: [slurm-users] pam_slurm_adopt with pbis-open pam modules

2019-02-22 Thread Prentice Bisbal
On 2/22/19 12:54 AM, Chris Samuel wrote: On Thursday, 21 February 2019 8:20:36 AM PST נדב טולדו wrote: Yeah I have; before I installed pbis and introduced lsass.so, the slurm module worked well. Is there any way to debug? I am seeing in syslog that the slurm module is adopting into the job
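
For context, pam_slurm_adopt is hooked in via an account line in the SSH PAM stack, and the Slurm documentation recommends it sit last in that stack, i.e. after whatever account module the directory-service package (pbis/lsass or sssd) installs; a simplified sketch with the control flags reduced to "required":

  # /etc/pam.d/sshd -- account stack, order matters
  account  required  pam_nologin.so
  account  required  pam_lsass.so        # pbis/AD account check (module name per the pbis packaging)
  account  required  pam_slurm_adopt.so  # keep last so it can adopt the ssh session into the user's job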

[slurm-users] SLURM docs: HTML title should be same as page title

2019-02-22 Thread Patrice Peterson
Hello, it's a little inconvenient that the title tag of all SLURM doc pages only says "Slurm Workload Manager". I usually have tabs to many SLURM doc pages open and it's difficult to differentiate between them all. Would it be possible to change the tab title to the page title for the doc

Re: [slurm-users] Only one socket for SLURM

2019-02-22 Thread Carlos Fenoy
Hello Gestió, You can have a look at: https://slurm.schedmd.com/core_spec.html and the "CoreSpecCount" or "CpuSpecList" options in the slurm.conf file. Regards, Carlos On Fri, Feb 22, 2019 at 6:50 AM Chris Samuel wrote: > On Thursday, 21 February 2019 1:00:52 PM PST Sam Hawarden wrote: > > > Linux
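
A minimal sketch of what that looks like in slurm.conf (node name and CPU counts are made up):

  # Keep some cores away from Slurm jobs (here, one socket's worth on a 2x8-core node)
  NodeName=node01 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 CoreSpecCount=8
  # (CpuSpecList can be used instead, to name the exact logical CPU IDs to reserve)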