Re: [lustre-discuss] Command line tool to monitor Lustre I/O ?

2018-12-21 Thread Christopher Johnston
I use the exact same setup along with Kapacitor for alerting (alerta as the
dashboard).  We have created dozens of panels in Grafana that are very
useful for troubleshooting bottlenecks with the OSS nodes, disks, as well
as the clients.

Can't go wrong with it, I feel; it's easy to set up and fun to make graphs :-)



On Thu, Dec 20, 2018 at 12:15 PM Alexander I Kulyavtsev 
wrote:

> 1) cerebro + ltop still work.
>
>
> 2) telegraf + influxdb (collector, time series DB). Telegraf has
> input plugins for lustre ("lustre2"), zfs, and many others. Grafana to
> plot live data from the DB. Also, influxDB integrates with Prometheus.
>
> Basically, each component can feed data to different output types through
> plugins, or take data from multiple types of sources, so you can use
> different combinations for your monitoring stack.
>
>
> For the simplest tool, you may take a look at whether telegraf from the
> influxdb stack has a suitable output plugin (see influxdata on GitHub).
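>
> For example, newer telegraf builds can emit a starting config with just
> those two plugins enabled (flag names may differ between telegraf versions,
> and the paths here are only examples):
>
> # telegraf --input-filter lustre2 --output-filter influxdb config > /etc/telegraf/telegraf.conf
> # systemctl restart telegraf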
>
>
> Alex.
> --
> *From:* lustre-discuss  on
> behalf of Laifer, Roland (SCC) 
> *Sent:* Thursday, December 20, 2018 8:04:55 AM
> *To:* lustre-discuss@lists.lustre.org
> *Subject:* [lustre-discuss] Command line tool to monitor Lustre I/O ?
>
> Dear Lustre administrators,
>
> what is a good command line tool to monitor current Lustre metadata and
> throughput operations on the local client or server? Until now we used
> collectl, but it no longer works with Lustre 2.10.
>
> Some background about collectl: The Lustre support of collectl was
> removed many years ago but up to Lustre 2.7 it was still possible to
> monitor metadata and throughput operations on clients. In addition,
> there were plugins which also worked for the server side, see
>
> http://wiki.lustre.org/Collectl
> However, it seems that there was no update for these plugins to adapt
> them for Lustre 2.10.
>
> Regards,
>   Roland
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Quota Reporting (for all users and/or gruops)

2018-12-20 Thread Christopher Johnston
I have been using Robinhood plus some scripts I wrote to run daily checks and
send reports to my users, etc.  You may be able to accomplish something
similar there.
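
A minimal sketch of that kind of per-user report (the mount point, UID cutoff
and mail recipient are placeholders you would adjust):

#!/bin/bash
# nightly report: one 'lfs quota' summary line per local account
FS=/mnt/lustre                                   # placeholder mount point
{
    for user in $(getent passwd | awk -F: '$3 >= 1000 {print $1}'); do
        lfs quota -q -u "$user" "$FS"
    done
} | mail -s "Lustre quota report for $FS" root   # placeholder recipient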

On Thu, Dec 20, 2018 at 9:49 AM Paul Edmon  wrote:

> I'm not aware of one, but I too would love to either learn of a tool to do
> this or advocate for Lustre to add it.
>
>
> -Paul Edmon-
>
>
> On 12/20/18 9:41 AM, Jason Williams wrote:
>
> It is entirely possible that this already exists, but my google-foo is not
> what it used to be.  However, I've searched around the internet and it
seems as though it doesn't really exist.  There are handfuls of now-defunct
or unmaintained projects out there, but nothing that seems to report all
> of the user and/or group quotas.
>
>
> Does anyone know of a good quota reporting tool that can give quota
> information in the same way as a 'repquota -u' or 'repquota -g' would?
>
>
> --
> Jason Williams
> Assistant Director
> Systems and Data Center Operations.
> Maryland Advanced Research Computing Center (MARCC)
> Johns Hopkins University
> jas...@jhu.edu
>
>
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lnet can't bind to 988

2018-06-08 Thread Christopher Johnston
Do you have the lnet service enabled?
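
If not, something like this is usually enough to check (a sketch, assuming the
lnet.service unit shipped with the 2.10 packages):

# systemctl status lnet
# systemctl enable lnet
# systemctl start lnet
# lctl list_nids

If the port really is stuck, it may also be worth checking whether an earlier
LNet/ksocklnd module is still loaded (lsmod | grep -E 'ksocklnd|lnet') and, if
nothing is mounted, unloading everything with lustre_rmmod before retrying.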

On Fri, Jun 8, 2018, 11:17 AM Michael Di Domenico 
wrote:

> i'm having trouble with 2.10.4 clients running on rhel 7.5 kernel 862.3.2
>
> at times when the box boots lustre won't mount; lnet bails out and
> complains about port 988 being in use
>
> however, when i run netstat or lsof commands, i cannot find port 988
> listed against anything
>
> is there some way to trace deeper to see what lnet is really complaining
> about
>
> usually rebooting the box fixes the issue, but this seems a little
> mysterious
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] how to get statistics from lustre clients ?

2018-01-02 Thread Christopher Johnston
I use the telegraf plugin for MDS/OSS/OST stats.  I know it's not the client
side that you were after, but I find the metrics to be very useful and they
display nicely in Grafana.

https://github.com/influxdata/telegraf/tree/master/plugins/inputs/lustre2
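
For the client side itself, the raw counters are exposed through lctl (a rough
pointer; exact parameter names can vary a little between releases):

# lctl get_param llite.*.stats      # per-mount read/write byte and call counts
# lctl get_param osc.*.rpc_stats    # histogram of pages per RPC (I/O sizes)
# lctl get_param osc.*.import       # which OST/OSS each OSC is connected to

Note there is no per-operation timestamp in those files; they are cumulative
counters.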

On Tue, Jan 2, 2018 at 8:45 AM, Ben Evans  wrote:

> You can get breakdowns of client->OST reads and writes, and combine them
> into OSS-level info.
>
> There is currently no timestamp on them; all the stats files are cumulative
> since you last cleared them.  You can get around this by reading the files
> regularly, noting the time, and computing the diffs since the last read.
>
> There are a few programs out there that do this sort of thing for you
> already, collectl and collectd come to mind.
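>
> For a quick manual version of that diffing approach, something along these
> lines works on an OSS (the 10-second interval is arbitrary):
>
> # lctl get_param obdfilter.*.stats > /tmp/stats.1
> # sleep 10
> # lctl get_param obdfilter.*.stats > /tmp/stats.2
> # diff /tmp/stats.1 /tmp/stats.2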
>
> -Ben Evans
>
> From: lustre-discuss  on behalf
> of "Black.S" 
> Date: Saturday, December 30, 2017 at 9:18 AM
> To: "lustre-discuss@lists.lustre.org" 
> Subject: [lustre-discuss] how to get statistics from lustre clients ?
>
> Maybe somebody knows this, or can point me to where I should search.
>
> I want to get, from each Lustre client:
>
>    - how much the client reads and writes
>
>    - which block sizes are used
>
>    - which OSS it communicates with
>
>    - a timestamp for each operation (i.e., the time when a read/write starts
>      on the client)
>
> Can I get that from (or on) the Linux node running the Lustre client?
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] 2.10.1 client fails to mount on 2.9 backend

2017-11-21 Thread Christopher Johnston
Yes, the MGS is a TCP-based setup (not IB). If I recall correctly, 'lctl ping'
looked OK, but I can test that later this morning and provide feedback.
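
For reference, the check is just the following, using the MGS NID from the
failed mount below:

# lctl ping 172.30.69.90@tcp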

On Fri, Nov 17, 2017 at 4:48 PM Chris Horn <ho...@cray.com> wrote:

> Is the MGS actually on tcp or is it on o2ib? Can you “lctl ping” the MGS
> LNet nid from the client where you’re trying to mount?
>
>
>
> Chris Horn
>
>
>
> *From: *lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on
> behalf of Christopher Johnston <chjoh...@gmail.com>
> *Date: *Friday, November 17, 2017 at 3:17 PM
> *To: *lustre-discuss <lustre-discuss@lists.lustre.org>
> *Subject: *[lustre-discuss] 2.10.1 client fails to mount on 2.9 backend
>
>
>
> Just tested the new 2.10.1 client against one of my fileservers where I
> have 2.9 running.  It works with 2.10.0 but not with 2.10.1; is this expected?
>
>
>
> # mount -t lustre 172.30.69.90:/qstore /mnt
>
> mount.lustre: mount 172.30.69.90:/qstore at /mnt failed: No such file or
> directory
>
> Is the MGS specification correct?
>
> Is the filesystem name correct?
>
> If upgrading, is the copied client log valid? (see upgrade docs)
>
>
>
> # dmesg -e | tail -4
>
> [Nov17 16:15] LustreError: 22344:0:(ldlm_lib.c:483:client_obd_setup())
> can't add initial connection
>
> [  +0.009423] LustreError: 22344:0:(obd_config.c:608:class_setup()) setup
> MGC172.30.69.90@tcp failed (-2)
>
> [  +0.009755] LustreError: 22344:0:(obd_mount.c:203:lustre_start_simple())
> MGC172.30.69.90@tcp setup error -2
>
> [  +0.010193] LustreError: 22344:0:(obd_mount.c:1505:lustre_fill_super())
> Unable to mount  (-2)
>
>
>
> -Chris
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] 2.10.1 client fails to mount on 2.9 backend

2017-11-17 Thread Christopher Johnston
Just tested the new 2.10.1 client against one of my fileservers where I
have 2.9 running.  It works with 2.10.0 but not with 2.10.1; is this expected?

# mount -t lustre 172.30.69.90:/qstore /mnt
mount.lustre: mount 172.30.69.90:/qstore at /mnt failed: No such file or
directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)

# dmesg -e | tail -4
[Nov17 16:15] LustreError: 22344:0:(ldlm_lib.c:483:client_obd_setup())
can't add initial connection
[  +0.009423] LustreError: 22344:0:(obd_config.c:608:class_setup()) setup
MGC172.30.69.90@tcp failed (-2)
[  +0.009755] LustreError: 22344:0:(obd_mount.c:203:lustre_start_simple())
MGC172.30.69.90@tcp setup error -2
[  +0.010193] LustreError: 22344:0:(obd_mount.c:1505:lustre_fill_super())
Unable to mount  (-2)

-Chris
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Best way to run serverside 2.8 w. MOFED 4.1 on Centos 7.2

2017-08-18 Thread Christopher Johnston
Get coffee somewhere in between 

On Aug 18, 2017 1:08 PM, "Jeff Johnson" 
wrote:

> John,
>
> You can rebuild 2.8 against MOFED. 1) Install MOFED version of choice. 2)
> Pull down the 2.8 Lustre source and configure with
> '--with-o2ib=/usr/src/ofa_kernel/default'. 3) `make rpms` 4) Install. 5)
> Profit.
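>
> In command form that is roughly (the source directory name is only an
> example; the MOFED path matches a default install):
>
> # cd lustre-2.8.0/
> # ./configure --with-o2ib=/usr/src/ofa_kernel/default
> # make rpms
> # yum install lustre-*.rpm      # from wherever 'make rpms' left the packages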
>
> --Jeff
>
> On Fri, Aug 18, 2017 at 9:41 AM, john casu 
> wrote:
>
>> I have an existing 2.8 install that broke when we added MOFED into the
>> mix.
>>
>> Nothing I do wrt installing 2.8 rpms works to fix this, and I get a
>> couple of missing symbols when I install lustre-modules:
>> depmod: WARNING: /lib/modules/3.10.0-327.3.1.el7_lustre.x86_64/extra/kernel/net/lustre/ko2iblnd.ko needs unknown symbol ib_query_device
>> depmod: WARNING: /lib/modules/3.10.0-327.3.1.el7_lustre.x86_64/extra/kernel/net/lustre/ko2iblnd.ko needs unknown symbol ib_alloc_pd
>>
>> I'm assuming the issue is that lustre 2.8 is built using the standard
>> Centos 7.2 infiniband drivers.
>>
>> I can't move to CentOS 7.3 at this time.  Is there any way to get 2.8 up
>> and running with MOFED without rebuilding the lustre rpms?
>>
>> If I have to rebuild, it'd probably be easier to go to 2.10 (and zfs
>> 0.7.1). Is that a correct assumption?
>> Or will the 2.10 rpms work on CentOS 7.2?
>>
>> thanks,
>> -john c
>
>
>
> --
> --
> Jeff Johnson
> Co-Founder
> Aeon Computing
>
> jeff.john...@aeoncomputing.com
> www.aeoncomputing.com
> t: 858-412-3810 x1001   f: 858-412-3845
> m: 619-204-9061
>
> 4170 Morena Boulevard, Suite D - San Diego, CA 92117
>
> High-Performance Computing / Lustre Filesystems / Scale-out Storage
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] rebooting nodes

2017-08-10 Thread Christopher Johnston
On my systems that use standard ethernet (I'm in the cloud), 2.9 reboots with
no issues that I can see.  I did have issues with the lnet driver not being
able to grab the port on boot-up, so I backported the lnet systemd unit file
from 2.10 to get around that.
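
A sketch of that arrangement (the NID and fsname in the fstab line are
placeholders):

# systemctl enable lnet

/etc/fstab:
10.0.0.1@tcp:/fsname  /mnt/lustre  lustre  defaults,_netdev  0 0

The _netdev flag tells systemd it is a network mount, so it gets unmounted
before the network (and lnet) is torn down at shutdown.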

On Thu, Aug 10, 2017 at 9:44 AM, Ben Evans  wrote:

> Are the Infiniband drivers disappearing first?  I know that used to be an
> issue.
>
> -Ben
>
> On 8/10/17, 8:59 AM, "lustre-discuss on behalf of Michael Di Domenico"
>  mdidomeni...@gmail.com> wrote:
>
> >does anyone else have issues issuing 'reboot' while having a lustre
> >mount?
> >
> >we're running v2.9 clients on our workstations, but when a user goes
> >to reboot the machine (from the gui) the system stalls under systemd
> >while i presume it's attempting to unmount the filesystem.
> >
> >what i see on the console is; systemd kicks in and starts unmounting
> >all the nfs shares we have, works fine.  but then it gets to lustre
> >and starts throwing connection errors on the console.  it's almost as
> >if systemd raced itself stopping lustre, whereby lnet got yanked out
> >from under the mount before the unmount actually finished.
> >
> >after five minutes or so, it looks like systemd threw in the towel and
> >gave up trying to unmount, but the system is stuck still trying to
> >execute more shutdown tasks.
> >
> >when we mount lustre on the workstations, i have a script that figures
> >some stuff out, issues a service lnet start, and then issues a mount
> >command.  this all works fine, but i'm not sure if that's why systemd
> >can't figure out what to do correctly.
> >
> >and since this is during a shutdown phase, debugging this is
> >difficult.  any thoughts?
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre 2.10.0 mmap() Issues

2017-08-10 Thread Christopher Johnston
Sure can Peter, will do that later this morning.

On Thu, Aug 10, 2017 at 8:58 AM, Jones, Peter A <peter.a.jo...@intel.com>
wrote:

> Christopher
>
> Could you please open a JIRA ticket about this?
>
> Thanks
>
> Peter
>
> On 8/8/17, 8:58 AM, "lustre-discuss on behalf of Christopher Johnston" <
> lustre-discuss-boun...@lists.lustre.org on behalf of chjoh...@gmail.com>
> wrote:
>
> At my company we use mmap() exclusively for accessing our data on Lustre.
> For starters we are seeing some very weird (or maybe expected) poor random
> read/write performance for these types of access patterns.  I decided to
> give Lustre 2.10.0 a try with ZFS 0.7.0 as the backend instead of ldiskfs and
> after compiling and building the RPMs, the filesystem mounted up just
> fine.  I then started doing some iozone runs to test the stability of the
> filesystem, and although iozone does complete its benchmark, I am seeing a lot
> of stack traces coming out of various kernel threads.  Note I am only
> seeing this when using mmap().  We also ran our application just to
> verify.  I am going to try with an ldiskfs format as well to see if
> this changes anything.
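>
> (For reference, iozone exercises mmap() when run with its -B flag; the path
> below is only an example.)
>
> # iozone -a -B -f /mnt/lustre/iozone.tmp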
>
> My ZFS settings are modest, with 50% of memory allocated to the ARC:
>
> options zfs zfs_arc_max=3921674240 zfs_prefetch_disable=1
> recordsize=1M
> compression=on
> dedupe=off
> xattr=sa
> dnodesize=auto
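>
> (For clarity: the first line above is a module parameter line belonging in
> /etc/modprobe.d/zfs.conf, while the rest are per-dataset properties; a
> sketch of applying them, with "ostpool/ost0" as a placeholder dataset and
> the property spelled "dedup":)
>
> # cat /etc/modprobe.d/zfs.conf
> options zfs zfs_arc_max=3921674240 zfs_prefetch_disable=1
> # zfs set recordsize=1M ostpool/ost0
> # zfs set compression=on ostpool/ost0
> # zfs set dedup=off ostpool/ost0
> # zfs set xattr=sa ostpool/ost0
> # zfs set dnodesize=auto ostpool/ost0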
>
>
> Below is the output from the stack trace:
>
> Aug  8 09:38:04 dev-gc01-oss001 kernel: BUG: Bad page state: 87 messages
> suppressed
> Aug  8 09:38:04 dev-gc01-oss001 kernel: BUG: Bad page state in process
> socknal_sd00_01  pfn:1cbac1
> Aug  8 09:38:04 dev-gc01-oss001 kernel: page:ea00072eb040 count:0
> mapcount:-1 mapping:  (null) index:0x0
> Aug  8 09:38:04 dev-gc01-oss001 kernel: page flags: 0x2f8000(tail)
> Aug  8 09:38:04 dev-gc01-oss001 kernel: page dumped because: nonzero
> mapcount
> Aug  8 09:38:04 dev-gc01-oss001 kernel: Modules linked in: 8021q garp mrp
> stp llc osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE)
> fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE)
> iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul
> glue_helper ablk_helper cryptd ppdev sg i2c_piix4 parport_pc i2c_core
> parport pcspkr nfsd nfs_acl lockd grace binfmt_misc auth_rpcgss sunrpc
> ip_tables xfs libcrc32c zfs(POE) zunicode(POE) zavl(POE) icp(POE)
> zcommon(POE) znvpair(POE) spl(OE) zlib_deflate sd_mod crc_t10dif
> crct10dif_generic virtio_net virtio_scsi crct10dif_pclmul crct10dif_common
> crc32c_intel serio_raw virtio_pci virtio_ring virtio
> Aug  8 09:38:04 dev-gc01-oss001 kernel: CPU: 0 PID: 2558 Comm:
> socknal_sd00_01 Tainted: PB  OE  
> 3.10.0-514.26.2.el7.x86_64 #1
> Aug  8 09:38:04 dev-gc01-oss001 kernel: Hardware name: Google Google
> Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> Aug  8 09:38:04 dev-gc01-oss001 kernel: ea00072eb040 005e265e
> 8800b879f5a8 81687133
> Aug  8 09:38:04 dev-gc01-oss001 kernel: 8800b879f5d0 81682368
> ea00072eb040 
> Aug  8 09:38:04 dev-gc01-oss001 kernel: 000f 8800b879f618
> 8118946d fff0fe00
> Aug  8 09:38:04 dev-gc01-oss001 kernel: Call Trace:
> Aug  8 09:38:04 dev-gc01-oss001 kernel: []
> dump_stack+0x19/0x1b
> Aug  8 09:38:05 dev-gc01-oss001 kernel: []
> bad_page.part.75+0xdf/0xfc
> Aug  8 09:38:05 dev-gc01-oss001 kernel: []
> free_pages_prepare+0x16d/0x190
> Aug  8 09:38:05 dev-gc01-oss001 kernel: []
> __free_pages_ok+0x19/0xd0
> Aug  8 09:38:05 dev-gc01-oss001 kernel: []
> free_compound_page+0x1b/0x20
> Aug  8 09:38:05 dev-gc01-oss001 kernel: []
> __put_compound_page+0x1f/0x22
> Aug  8 09:38:05 dev-gc01-oss001 kernel: []
> put_compound_page+0x16f/0x17d
> Aug  8 09:38:05 dev-gc01-oss001 kernel: []
> put_page+0x4c/0x60
> Aug  8 09:38:05 dev-gc01-oss001 kernel: []
> skb_release_data+0x8f/0x140
> Aug  8 09:38:05 dev-gc01-oss001 kernel: []
> skb_release_all+0x24/0x30
> Aug  8 09:38:05 dev-gc01-oss001 kernel: []
> consume_skb+0x2c/0x80
> Aug  8 09:38:05 dev-gc01-oss001 kernel: []
> __dev_kfree_skb_any+0x3d/0x50
> Aug  8 09:38:05 dev-gc01-oss001 kernel: []
> free_old_xmit_skbs.isra.32+0x6b/0xc0 [virtio_net]
> Aug  8 09:38:05 dev-gc01-oss001 kernel: []
> start_xmit+0x5f/0x4f0 [virtio_net]
> Aug  8 09:38:05 dev-gc01-oss001 kernel: []
> dev_hard_start_xmit+0x171/0x3b0
> Aug  8 09:38:05 dev-gc01-oss001 kernel: []
> sch_direct_xmit+0x104/0x200
> Aug  8 09:38:05 dev-gc01-oss001 kernel: []
> __dev_queue_xmit+0x23c/0x570
> Aug  8 09:38:05 dev-gc01-oss001 kernel: []
> dev_queue_xmit+0x10/0x20
> Aug  8 09:38:05