Re: [lustre-discuss] nodes crash during ior test

2017-08-07 Thread Brian Andrus

Had another where the client rebooted.

Here is the full dmesg from that:


[181902.731655] BUG: unable to handle kernel NULL pointer dereference at   (null)
[181902.731710] IP: [] _raw_spin_unlock+0xa/0x30
[181902.731749] PGD 0
[181902.731766] Oops: 0002 [#1] SMP
[181902.731788] Modules linked in: osc(OE) mgc(OE) lustre(OE) lmv(OE) fld(OE) mdc(OE) fid(OE) lov(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) nfsv3 nfs fscache sfc mtd vfat fat intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ipmi_devintf ipmi_si iTCO_wdt iTCO_vendor_support sb_edac pcspkr ipmi_msghandler sg edac_core shpchp ioatdma wmi acpi_power_meter mei_me mei i2c_i801 lpc_ich nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod sr_mod crc_t10dif cdrom crct10dif_generic crct10dif_pclmul crct10dif_common crc32c_intel mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops i2c_algo_bit ttm ahci libahci drm ixgbe libata i2c_core mdio ptp pps_core
[181902.732221]  dca fjes dm_mirror dm_region_hash dm_log dm_mod [last unloaded: mtd]
[181902.732260] CPU: 14 PID: 18830 Comm: socknal_sd01_05 Tainted: G   OE   3.10.0-514.26.2.el7.x86_64 #1
[181902.732309] Hardware name: NEC Express5800/R120f-1M [N8100-2210F]/MS-S0901, BIOS 5.0.8022 06/22/2015
[181902.732351] task: 881031e72f10 ti: 88102ae4 task.ti: 88102ae4
[181902.732392] RIP: 0010:[]  [] _raw_spin_unlock+0xa/0x30
[181902.732434] RSP: 0018:88102ae43c38  EFLAGS: 00010206
[181902.732459] RAX: 88103dd7ef70 RBX: 88103dd7eec0 RCX:
[181902.732492] RDX: d173 RSI: 68b8 RDI:
[181902.732524] RBP: 88102ae43c50 R08: 105d41fb R09: 5610
[181902.732557] R10: 7000 R11:  R12: 000345c0
[181902.732589] R13: 88102a70b200 R14: 88203d1eb674 R15: 8818286b9810
[181902.732622] FS:  () GS:88203f20() knlGS:
[181902.732658] CS:  0010 DS:  ES:  CR0: 80050033
[181902.732685] CR2:  CR3: 019be000 CR4: 001407e0
[181902.732717] DR0:  DR1:  DR2:
[181902.732750] DR3:  DR6: 0ff0 DR7: 0400
[181902.732783] Stack:
[181902.732795]  a076d8b6 88103261d200 88203d1eb600 88102ae43c90
[181902.732834]  a07f4a91 8818286b9800 88103261d200 0001
[181902.732873]  8820357155c0  88103261d210 88102ae43cc0
[181902.732911] Call Trace:
[181902.732945]  [] ? cfs_percpt_unlock+0x36/0xc0 [libcfs]
[181902.732995]  [] lnet_return_tx_credits_locked+0x211/0x480 [lnet]
[181902.733037]  [] lnet_msg_decommit+0xd0/0x6c0 [lnet]
[181902.733073]  [] lnet_finalize+0x1e9/0x690 [lnet]
[181902.733110]  [] ksocknal_tx_done+0x85/0x1c0 [ksocklnd]
[181902.733145]  [] ksocknal_handle_zcack+0x137/0x1e0 [ksocklnd]
[181902.733181]  [] ksocknal_process_receive+0x3a1/0xd90 [ksocklnd]
[181902.733219]  [] ksocknal_scheduler+0xee/0x670 [ksocklnd]
[181902.733255]  [] ? wake_up_atomic_t+0x30/0x30
[181902.733286]  [] ? ksocknal_recv+0x2a0/0x2a0 [ksocklnd]
[181902.733318]  [] kthread+0xcf/0xe0
[181902.733344]  [] ? kthread_create_on_node+0x140/0x140
[181902.733377]  [] ret_from_fork+0x58/0x90
[181902.734492]  [] ? kthread_create_on_node+0x140/0x140
[181902.735590] Code: 90 8d 8a 00 00 02 00 89 d0 f0 0f b1 0f 39 d0 75 ea b8 01 00 00 00 5d c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 0f 1f 44 00 00 <66> 83 07 02 c3 90 8b 37 f0 66 83 07 02 f6 47 02 01 74 f1 55 48
[181902.737942] RIP  [] _raw_spin_unlock+0xa/0x30
[181902.739079]  RSP 
[181902.740189] CR2: 


Brian Andrus


On 8/7/2017 7:17 AM, Brian Andrus wrote:


There were actually several:

On an OSS:

[447314.138709] BUG: unable to handle kernel NULL pointer dereference 
at 0020
[543262.189674] BUG: unable to handle kernel NULL pointer dereference 
at   (null)
[16397.115830] BUG: unable to handle kernel NULL pointer dereference 
at   (null)



On 2 separate clients:

[65404.590906] BUG: unable to handle kernel NULL pointer dereference 
at   (null)
[72095.972732] BUG: unable to handle kernel paging request at 
002029b0e000

Re: [lustre-discuss] lustre client 2.9 cannot mount 2.10.0 OSTs

2017-08-07 Thread Riccardo Veraldi
I figured out the problem: it was a wrong setting on the client side.

On 8/7/17 8:09 PM, Riccardo Veraldi wrote:
> it is like if my /etc/modprobe.d/lustre.conf gets completely ignored
> when lnet module is loaded
>
> On 8/7/17 7:05 PM, Cowe, Malcolm J wrote:
>> Lustre file system names cannot exceed 8 characters in length, but 
>> “scratch12” is 9 characters. Try changing the fsname to a smaller string. 
>> You can do this with tunefs.lustre on all the storage targets, but I can’t 
>> remember if you need to use --erase-params and recreate all the options. 
>> Alternatively, reformat.
>>
>> Malcolm.
>>
>> On 8/8/17, 11:30 am, "lustre-discuss on behalf of Riccardo Veraldi" 
>> > riccardo.vera...@cnaf.infn.it> wrote:
>>
>> trying to debug more this problem looks like tcp port 9888 is closed on
>> the MDS.
>> this is weird. lnet module is running. There is no firewall and OSSs and
>> MDS are on the same subnet.
>> but I Cannot connect to port 9888.
>> There is anything which changed in Lustre 2.10.0 related to lnet and TCP
>> ports that I need to take care of in the configuration ?
>> 
>> On 8/7/17 6:13 PM, Riccardo Veraldi wrote:
>> > Hello,
>> >
>> > I have a new Lustre cluster based on Lustre 2.10.0/ZFS 0.7.0 on Centos 
>> 7.3
>> > Lustre FS creation went smooth.
>> > When I tryed then to mount from the clients, Lustre is not able to 
>> mount
>> > any of the OSTs.
>> > It stops at MGS/MDT level.
>> >
>> > this is from the client side:
>> >
>> > mount.lustre: mount 192.168..48.254@tcp2:/scratch12 at
>> > /reg/data/scratch12 failed: Invalid argument
>> > This may have multiple causes.
>> > Is 'scratch12' the correct filesystem name?
>> > Are the mount options correct?
>> > Check the syslog for more info.
>> >
>> > Aug  7 17:58:53 psana1510 kernel: [285130.463377] LustreError:
>> > 29240:0:(mgc_request.c:335:config_log_add()) logname scratch12-client 
>> is
>> > too long
>> > Aug  7 17:58:53 psana1510 kernel: [285130.463772] Lustre:
>> > :0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has
>> > failed due to network error: [sent 1502153933/real 1502153933] 
>> > req@88203d75ec00 x1574823717093632/t0(0)
>> > o250->MGC192.168.48.254@tcp2@192.168.48.254@tcp2:26/25 lens 520/544 e 0
>> > to 1 dl 1502153938 ref 1 fl Rpc:eXN/0/ rc 0/-1
>> > Aug  7 17:58:53 psana1510 kernel: [285130.469156] LustreError: 15b-f:
>> > MGC192.168.48.254@tcp2: The configuration from log
>> > 'scratch12-client'failed from the MGS (-22).  Make sure this client and
>> > the MGS are running compatible versions of Lustre.
>> > Aug  7 17:58:53 psana1510 kernel: [285130.472072] Lustre: Unmounted
>> > scratch12-client
>> > Aug  7 17:58:53 psana1510 kernel: [285130.473827] LustreError:
>> > 29240:0:(obd_mount.c:1505:lustre_fill_super()) Unable to mount  (-22)
>> >
>> > from the MDS side there is nothing in syslog. So I tried to engage 
>> tcpdump:
>> >
>> > 17:58:53.745610 IP psana1510.pcdsn.1023 >
>> > psanamds12.pcdsn.cyborg-systems: Flags [S], seq 1356843681, win 29200,
>> > options [mss 1460,sackOK,TS val 284847388 ecr 0,nop,wscale 7], length 0
>> > 17:58:53.745644 IP psanamds12.pcdsn.cyborg-systems >
>> > psana1510.pcdsn.1023: Flags [R.], seq 0, ack 1356843682, win 0, length >> 0
>> > 17:58:58.757421 ARP, Request who-has psanamds12.pcdsn tell
>> > psana1510.pcdsn, length 46
>> > 17:58:58.757441 ARP, Reply psanamds12.pcdsn is-at 00:1a:4a:16:01:56 
>> (oui
>> > Unknown), length 28
>> >
>> > OSS, nothing in the log file or in tcpdump
>> >
>> > lustre client is 2.9 and the server 2.10.0
>> >
>> > I have no firewall running and no SElinux
>> >
>> > this never happened to me before. I am usually running older lustre
>> > versions on clients but I never had this problem before.
>> > Any hint ?
>> >
>> > thank you very much
>> >
>> > Rick
>> >
>> >
>> >
>> > ___
>> > lustre-discuss mailing list
>> > lustre-discuss@lists.lustre.org
>> > http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>> 
>> 

Re: [lustre-discuss] lustre client 2.9 cannot mount 2.10.0 OSTs

2017-08-07 Thread Riccardo Veraldi
It is as if my /etc/modprobe.d/lustre.conf were completely ignored
when the lnet module is loaded.

On 8/7/17 7:05 PM, Cowe, Malcolm J wrote:
> Lustre file system names cannot exceed 8 characters in length, but 
> “scratch12” is 9 characters. Try changing the fsname to a smaller string. You 
> can do this with tunefs.lustre on all the storage targets, but I can’t 
> remember if you need to use --erase-params and recreate all the options. 
> Alternatively, reformat.
>
> Malcolm.
>
> On 8/8/17, 11:30 am, "lustre-discuss on behalf of Riccardo Veraldi" 
>  riccardo.vera...@cnaf.infn.it> wrote:
>
> trying to debug more this problem looks like tcp port 9888 is closed on
> the MDS.
> this is weird. lnet module is running. There is no firewall and OSSs and
> MDS are on the same subnet.
> but I Cannot connect to port 9888.
> There is anything which changed in Lustre 2.10.0 related to lnet and TCP
> ports that I need to take care of in the configuration ?
> 
> On 8/7/17 6:13 PM, Riccardo Veraldi wrote:
> > Hello,
> >
> > I have a new Lustre cluster based on Lustre 2.10.0/ZFS 0.7.0 on Centos 
> 7.3
> > Lustre FS creation went smooth.
> > When I tryed then to mount from the clients, Lustre is not able to mount
> > any of the OSTs.
> > It stops at MGS/MDT level.
> >
> > this is from the client side:
> >
> > mount.lustre: mount 192.168..48.254@tcp2:/scratch12 at
> > /reg/data/scratch12 failed: Invalid argument
> > This may have multiple causes.
> > Is 'scratch12' the correct filesystem name?
> > Are the mount options correct?
> > Check the syslog for more info.
> >
> > Aug  7 17:58:53 psana1510 kernel: [285130.463377] LustreError:
> > 29240:0:(mgc_request.c:335:config_log_add()) logname scratch12-client is
> > too long
> > Aug  7 17:58:53 psana1510 kernel: [285130.463772] Lustre:
> > :0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has
> > failed due to network error: [sent 1502153933/real 1502153933] 
> > req@88203d75ec00 x1574823717093632/t0(0)
> > o250->MGC192.168.48.254@tcp2@192.168.48.254@tcp2:26/25 lens 520/544 e 0
> > to 1 dl 1502153938 ref 1 fl Rpc:eXN/0/ rc 0/-1
> > Aug  7 17:58:53 psana1510 kernel: [285130.469156] LustreError: 15b-f:
> > MGC192.168.48.254@tcp2: The configuration from log
> > 'scratch12-client'failed from the MGS (-22).  Make sure this client and
> > the MGS are running compatible versions of Lustre.
> > Aug  7 17:58:53 psana1510 kernel: [285130.472072] Lustre: Unmounted
> > scratch12-client
> > Aug  7 17:58:53 psana1510 kernel: [285130.473827] LustreError:
> > 29240:0:(obd_mount.c:1505:lustre_fill_super()) Unable to mount  (-22)
> >
> > from the MDS side there is nothing in syslog. So I tried to engage 
> tcpdump:
> >
> > 17:58:53.745610 IP psana1510.pcdsn.1023 >
> > psanamds12.pcdsn.cyborg-systems: Flags [S], seq 1356843681, win 29200,
> > options [mss 1460,sackOK,TS val 284847388 ecr 0,nop,wscale 7], length 0
> > 17:58:53.745644 IP psanamds12.pcdsn.cyborg-systems >
> > psana1510.pcdsn.1023: Flags [R.], seq 0, ack 1356843682, win 0, length 0
> > 17:58:58.757421 ARP, Request who-has psanamds12.pcdsn tell
> > psana1510.pcdsn, length 46
> > 17:58:58.757441 ARP, Reply psanamds12.pcdsn is-at 00:1a:4a:16:01:56 (oui
> > Unknown), length 28
> >
> > OSS, nothing in the log file or in tcpdump
> >
> > lustre client is 2.9 and the server 2.10.0
> >
> > I have no firewall running and no SElinux
> >
> > this never happened to me before. I am usually running older lustre
> > versions on clients but I never had this problem before.
> > Any hint ?
> >
> > thank you very much
> >
> > Rick
> >
> >
> >


Re: [lustre-discuss] lustre client 2.9 cannot mount 2.10.0 OSTs

2017-08-07 Thread Riccardo Veraldi
Thanks, yes, I noticed that issue and changed the name. I also rebuilt
the FS, but now it does not work for another, unknown reason:


Aug  7 19:05:38 psana1510 kernel: [289134.511260] Lustre:
:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has
failed due to network error: [sent 1502157938/real 1502157938] 
req@88203e699800 x1574823717862480/t0(0)
o250->MGC192.168.48.254@tcp@192.168.48.254@tcp:26/25 lens 520/544 e 0 to
1 dl 1502157943 ref 1 fl Rpc:eXN/0/ rc 0/-1
Aug  7 19:05:38 psana1510 kernel: [289134.515771] Lustre:
:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 1 previous
similar message
Aug  7 19:05:44 psana1510 kernel: [289140.510450] LustreError:
5566:0:(mgc_request.c:251:do_config_log_add()) MGC192.168.48.254@tcp:
failed processing log, type 1: rc = -5
Aug  7 19:06:14 psana1510 kernel: [289170.512640] LustreError: 15c-8:
MGC192.168.48.254@tcp: The configuration from log 'scrtch12-client'
failed (-5). This may be the result of communication errors between this
node and the MGS, a bad configuration, or other errors. See the syslog
for more information.
Aug  7 19:06:14 psana1510 kernel: [289170.517479] Lustre: Unmounted
scrtch12-client
Aug  7 19:06:14 psana1510 kernel: [289170.519294] LustreError:
5566:0:(obd_mount.c:1505:lustre_fill_super()) Unable to mount  (-5)

TCP port 9888 on the MDS is still closed.

I have seen that lnetctl is missing from my rpm. I rebuilt the rpm from
the SPEC file, but lnetctl is still missing.
Could that be the reason for the LNET problem?

I noticed that libyaml was missing on my build server, and I remember
there was a Lustre bug related to building and packaging the Lustre
rpms when libyaml is missing.

My question is whether lnetctl is needed to have Lustre's LNET
configured at startup.
This is my lustre.conf in modprobe.d:

options lnet network=tcp2(eth0)
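As an aside, the LNet kernel module parameter is spelled `networks` (plural); a singular `network=` option is not a recognized parameter and would typically be ignored (usually with an "unknown parameter" note in dmesg), which would match the symptom of the file appearing to have no effect. A corrected fragment, assuming eth0 is the intended interface, would look like:

```
options lnet networks=tcp2(eth0)
```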

Also, in my previous Lustre 2.9 and 2.8 rpm builds lnetctl is missing,
yet LNET gets configured without trouble when the lnet module is loaded.

Rick



On 8/7/17 7:05 PM, Cowe, Malcolm J wrote:
> Lustre file system names cannot exceed 8 characters in length, but 
> “scratch12” is 9 characters. Try changing the fsname to a smaller string. You 
> can do this with tunefs.lustre on all the storage targets, but I can’t 
> remember if you need to use --erase-params and recreate all the options. 
> Alternatively, reformat.
>
> Malcolm.
>
> On 8/8/17, 11:30 am, "lustre-discuss on behalf of Riccardo Veraldi" 
>  riccardo.vera...@cnaf.infn.it> wrote:
>
> trying to debug more this problem looks like tcp port 9888 is closed on
> the MDS.
> this is weird. lnet module is running. There is no firewall and OSSs and
> MDS are on the same subnet.
> but I Cannot connect to port 9888.
> There is anything which changed in Lustre 2.10.0 related to lnet and TCP
> ports that I need to take care of in the configuration ?
> 
> On 8/7/17 6:13 PM, Riccardo Veraldi wrote:
> > Hello,
> >
> > I have a new Lustre cluster based on Lustre 2.10.0/ZFS 0.7.0 on Centos 
> 7.3
> > Lustre FS creation went smooth.
> > When I tryed then to mount from the clients, Lustre is not able to mount
> > any of the OSTs.
> > It stops at MGS/MDT level.
> >
> > this is from the client side:
> >
> > mount.lustre: mount 192.168..48.254@tcp2:/scratch12 at
> > /reg/data/scratch12 failed: Invalid argument
> > This may have multiple causes.
> > Is 'scratch12' the correct filesystem name?
> > Are the mount options correct?
> > Check the syslog for more info.
> >
> > Aug  7 17:58:53 psana1510 kernel: [285130.463377] LustreError:
> > 29240:0:(mgc_request.c:335:config_log_add()) logname scratch12-client is
> > too long
> > Aug  7 17:58:53 psana1510 kernel: [285130.463772] Lustre:
> > :0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has
> > failed due to network error: [sent 1502153933/real 1502153933] 
> > req@88203d75ec00 x1574823717093632/t0(0)
> > o250->MGC192.168.48.254@tcp2@192.168.48.254@tcp2:26/25 lens 520/544 e 0
> > to 1 dl 1502153938 ref 1 fl Rpc:eXN/0/ rc 0/-1
> > Aug  7 17:58:53 psana1510 kernel: [285130.469156] LustreError: 15b-f:
> > MGC192.168.48.254@tcp2: The configuration from log
> > 'scratch12-client'failed from the MGS (-22).  Make sure this client and
> > the MGS are running compatible versions of Lustre.
> > Aug  7 17:58:53 psana1510 kernel: [285130.472072] Lustre: Unmounted
> > scratch12-client
> > Aug  7 17:58:53 psana1510 kernel: [285130.473827] LustreError:
> > 29240:0:(obd_mount.c:1505:lustre_fill_super()) Unable to mount  (-22)
> >
> > from the MDS side there is nothing in syslog. So I tried to engage 
> tcpdump:
> >
> > 17:58:53.745610 IP psana1510.pcdsn.1023 >
> > psanamds12.pcdsn.cyborg-systems: Flags [S], seq 1356843681, win 29200,
> > options [mss 1460,sackOK,TS val 284847388 ecr 0,nop,wscale 7], length 0

Re: [lustre-discuss] nodes crash during ior test

2017-08-07 Thread Cowe, Malcolm J
I’ve created a Benchmarking process outline and tools overview here:

http://wiki.lustre.org/Category:Benchmarking

This has been recently updated and is based on notes I’ve maintained at Intel 
over the years.

Malcolm Cowe
High Performance Data Division

Intel Corporation | www.intel.com


From: lustre-discuss  on behalf of 
Alexander I Kulyavtsev 
Date: Tuesday, 8 August 2017 at 3:30 am
To: "E.S. Rosenberg" 
Cc: Lustre discussion 
Subject: Re: [lustre-discuss] nodes crash during ior test

The Lustre wiki has sidebars on Testing and Monitoring; you could start a Benchmarking one.

There was a Benchmarking Working Group in OpenSFS.
wiki:   http://wiki.opensfs.org/Benchmarking_Working_Group
mail list:  http://lists.opensfs.org/listinfo.cgi/openbenchmark-opensfs.org

It is actually a question for the list: what is the preferred location for a KB on 
Lustre benchmarking, lustre.org or 
opensfs.org?
IMHO, the KB belongs on lustre.org and the BWG minutes (if the group re-engages) on 
opensfs.org.

Alex.


On Aug 7, 2017, at 7:56 AM, E.S. Rosenberg 
mailto:esr+lus...@mail.hebrew.edu>> wrote:

OT:
Can we create a wiki page or some other form of knowledge pooling on 
benchmarking lustre?
Right now I'm using slides from 2009 as my source which may not be ideal...

http://wiki.lustre.org/images/4/40/Wednesday_shpc-2009-benchmarking.pdf




Re: [lustre-discuss] lustre client 2.9 cannot mount 2.10.0 OSTs

2017-08-07 Thread Cowe, Malcolm J
Lustre file system names cannot exceed 8 characters in length, but “scratch12” 
is 9 characters. Try changing the fsname to a smaller string. You can do this 
with tunefs.lustre on all the storage targets, but I can’t remember if you need 
to use --erase-params and recreate all the options. Alternatively, reformat.

Malcolm.
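As a quick illustration of the limit described above, here is a hypothetical helper (not part of any Lustre tooling) for checking a proposed fsname before formatting; the exact allowed character set is an assumption:

```python
# Lustre fsnames must be 1-8 characters; a longer name makes derived
# configuration log names such as "<fsname>-client" exceed their limit,
# which is the error reported in this thread.
def valid_fsname(name: str) -> bool:
    allowed = set("abcdefghijklmnopqrstuvwxyz"
                  "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                  "0123456789_-")
    return 1 <= len(name) <= 8 and all(c in allowed for c in name)

print(valid_fsname("scratch12"))  # 9 characters -> False
print(valid_fsname("scratch1"))   # 8 characters -> True
```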

On 8/8/17, 11:30 am, "lustre-discuss on behalf of Riccardo Veraldi" 
 wrote:

trying to debug more this problem looks like tcp port 9888 is closed on
the MDS.
this is weird. lnet module is running. There is no firewall and OSSs and
MDS are on the same subnet.
but I Cannot connect to port 9888.
There is anything which changed in Lustre 2.10.0 related to lnet and TCP
ports that I need to take care of in the configuration ?

On 8/7/17 6:13 PM, Riccardo Veraldi wrote:
> Hello,
>
> I have a new Lustre cluster based on Lustre 2.10.0/ZFS 0.7.0 on Centos 7.3
> Lustre FS creation went smooth.
> When I tryed then to mount from the clients, Lustre is not able to mount
> any of the OSTs.
> It stops at MGS/MDT level.
>
> this is from the client side:
>
> mount.lustre: mount 192.168..48.254@tcp2:/scratch12 at
> /reg/data/scratch12 failed: Invalid argument
> This may have multiple causes.
> Is 'scratch12' the correct filesystem name?
> Are the mount options correct?
> Check the syslog for more info.
>
> Aug  7 17:58:53 psana1510 kernel: [285130.463377] LustreError:
> 29240:0:(mgc_request.c:335:config_log_add()) logname scratch12-client is
> too long
> Aug  7 17:58:53 psana1510 kernel: [285130.463772] Lustre:
> :0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has
> failed due to network error: [sent 1502153933/real 1502153933] 
> req@88203d75ec00 x1574823717093632/t0(0)
> o250->MGC192.168.48.254@tcp2@192.168.48.254@tcp2:26/25 lens 520/544 e 0
> to 1 dl 1502153938 ref 1 fl Rpc:eXN/0/ rc 0/-1
> Aug  7 17:58:53 psana1510 kernel: [285130.469156] LustreError: 15b-f:
> MGC192.168.48.254@tcp2: The configuration from log
> 'scratch12-client'failed from the MGS (-22).  Make sure this client and
> the MGS are running compatible versions of Lustre.
> Aug  7 17:58:53 psana1510 kernel: [285130.472072] Lustre: Unmounted
> scratch12-client
> Aug  7 17:58:53 psana1510 kernel: [285130.473827] LustreError:
> 29240:0:(obd_mount.c:1505:lustre_fill_super()) Unable to mount  (-22)
>
> from the MDS side there is nothing in syslog. So I tried to engage 
tcpdump:
>
> 17:58:53.745610 IP psana1510.pcdsn.1023 >
> psanamds12.pcdsn.cyborg-systems: Flags [S], seq 1356843681, win 29200,
> options [mss 1460,sackOK,TS val 284847388 ecr 0,nop,wscale 7], length 0
> 17:58:53.745644 IP psanamds12.pcdsn.cyborg-systems >
> psana1510.pcdsn.1023: Flags [R.], seq 0, ack 1356843682, win 0, length 0
> 17:58:58.757421 ARP, Request who-has psanamds12.pcdsn tell
> psana1510.pcdsn, length 46
> 17:58:58.757441 ARP, Reply psanamds12.pcdsn is-at 00:1a:4a:16:01:56 (oui
> Unknown), length 28
>
> OSS, nothing in the log file or in tcpdump
>
> lustre client is 2.9 and the server 2.10.0
>
> I have no firewall running and no SElinux
>
> this never happened to me before. I am usually running older lustre
> versions on clients but I never had this problem before.
> Any hint ?
>
> thank you very much
>
> Rick
>
>
>


Re: [lustre-discuss] lustre client 2.9 cannot mount 2.10.0 OSTs

2017-08-07 Thread Riccardo Veraldi
Trying to debug this problem further, it looks like TCP port 9888 is
closed on the MDS.
This is weird: the lnet module is running, there is no firewall, and
the OSSs and MDS are on the same subnet, but I cannot connect to port 9888.
Did anything change in Lustre 2.10.0 related to LNET and TCP ports that
I need to take care of in the configuration?
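A minimal TCP probe (standard library only) can confirm from a client whether the acceptor port on the MDS is reachable; the host and port below come from the report above, but note that the default LNet acceptor port is 988 unless a custom accept_port module option was set:

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g., from a client node (addresses taken from this thread):
#   port_open("192.168.48.254", 988)
```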

On 8/7/17 6:13 PM, Riccardo Veraldi wrote:
> Hello,
>
> I have a new Lustre cluster based on Lustre 2.10.0/ZFS 0.7.0 on Centos 7.3
> Lustre FS creation went smooth.
> When I tryed then to mount from the clients, Lustre is not able to mount
> any of the OSTs.
> It stops at MGS/MDT level.
>
> this is from the client side:
>
> mount.lustre: mount 192.168..48.254@tcp2:/scratch12 at
> /reg/data/scratch12 failed: Invalid argument
> This may have multiple causes.
> Is 'scratch12' the correct filesystem name?
> Are the mount options correct?
> Check the syslog for more info.
>
> Aug  7 17:58:53 psana1510 kernel: [285130.463377] LustreError:
> 29240:0:(mgc_request.c:335:config_log_add()) logname scratch12-client is
> too long
> Aug  7 17:58:53 psana1510 kernel: [285130.463772] Lustre:
> :0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has
> failed due to network error: [sent 1502153933/real 1502153933] 
> req@88203d75ec00 x1574823717093632/t0(0)
> o250->MGC192.168.48.254@tcp2@192.168.48.254@tcp2:26/25 lens 520/544 e 0
> to 1 dl 1502153938 ref 1 fl Rpc:eXN/0/ rc 0/-1
> Aug  7 17:58:53 psana1510 kernel: [285130.469156] LustreError: 15b-f:
> MGC192.168.48.254@tcp2: The configuration from log
> 'scratch12-client'failed from the MGS (-22).  Make sure this client and
> the MGS are running compatible versions of Lustre.
> Aug  7 17:58:53 psana1510 kernel: [285130.472072] Lustre: Unmounted
> scratch12-client
> Aug  7 17:58:53 psana1510 kernel: [285130.473827] LustreError:
> 29240:0:(obd_mount.c:1505:lustre_fill_super()) Unable to mount  (-22)
>
> from the MDS side there is nothing in syslog. So I tried to engage tcpdump:
>
> 17:58:53.745610 IP psana1510.pcdsn.1023 >
> psanamds12.pcdsn.cyborg-systems: Flags [S], seq 1356843681, win 29200,
> options [mss 1460,sackOK,TS val 284847388 ecr 0,nop,wscale 7], length 0
> 17:58:53.745644 IP psanamds12.pcdsn.cyborg-systems >
> psana1510.pcdsn.1023: Flags [R.], seq 0, ack 1356843682, win 0, length 0
> 17:58:58.757421 ARP, Request who-has psanamds12.pcdsn tell
> psana1510.pcdsn, length 46
> 17:58:58.757441 ARP, Reply psanamds12.pcdsn is-at 00:1a:4a:16:01:56 (oui
> Unknown), length 28
>
> OSS, nothing in the log file or in tcpdump
>
> lustre client is 2.9 and the server 2.10.0
>
> I have no firewall running and no SElinux
>
> this never happened to me before. I am usually running older lustre
> versions on clients but I never had this problem before.
> Any hint ?
>
> thank you very much
>
> Rick
>
>
>


[lustre-discuss] lustre client 2.9 cannot mount 2.10.0 OSTs

2017-08-07 Thread Riccardo Veraldi
Hello,

I have a new Lustre cluster based on Lustre 2.10.0/ZFS 0.7.0 on CentOS 7.3.
Lustre FS creation went smooth.
When I then tried to mount from the clients, Lustre is not able to mount
any of the OSTs.
It stops at MGS/MDT level.

this is from the client side:

mount.lustre: mount 192.168..48.254@tcp2:/scratch12 at
/reg/data/scratch12 failed: Invalid argument
This may have multiple causes.
Is 'scratch12' the correct filesystem name?
Are the mount options correct?
Check the syslog for more info.

Aug  7 17:58:53 psana1510 kernel: [285130.463377] LustreError:
29240:0:(mgc_request.c:335:config_log_add()) logname scratch12-client is
too long
Aug  7 17:58:53 psana1510 kernel: [285130.463772] Lustre:
:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has
failed due to network error: [sent 1502153933/real 1502153933] 
req@88203d75ec00 x1574823717093632/t0(0)
o250->MGC192.168.48.254@tcp2@192.168.48.254@tcp2:26/25 lens 520/544 e 0
to 1 dl 1502153938 ref 1 fl Rpc:eXN/0/ rc 0/-1
Aug  7 17:58:53 psana1510 kernel: [285130.469156] LustreError: 15b-f:
MGC192.168.48.254@tcp2: The configuration from log
'scratch12-client'failed from the MGS (-22).  Make sure this client and
the MGS are running compatible versions of Lustre.
Aug  7 17:58:53 psana1510 kernel: [285130.472072] Lustre: Unmounted
scratch12-client
Aug  7 17:58:53 psana1510 kernel: [285130.473827] LustreError:
29240:0:(obd_mount.c:1505:lustre_fill_super()) Unable to mount  (-22)

from the MDS side there is nothing in syslog. So I tried to engage tcpdump:

17:58:53.745610 IP psana1510.pcdsn.1023 >
psanamds12.pcdsn.cyborg-systems: Flags [S], seq 1356843681, win 29200,
options [mss 1460,sackOK,TS val 284847388 ecr 0,nop,wscale 7], length 0
17:58:53.745644 IP psanamds12.pcdsn.cyborg-systems >
psana1510.pcdsn.1023: Flags [R.], seq 0, ack 1356843682, win 0, length 0
17:58:58.757421 ARP, Request who-has psanamds12.pcdsn tell
psana1510.pcdsn, length 46
17:58:58.757441 ARP, Reply psanamds12.pcdsn is-at 00:1a:4a:16:01:56 (oui
Unknown), length 28

OSS, nothing in the log file or in tcpdump

lustre client is 2.9 and the server 2.10.0

I have no firewall running and no SELinux.

This has never happened to me before; I usually run older Lustre
versions on clients and never had this problem.
Any hints?

thank you very much

Rick





Re: [lustre-discuss] nodes crash during ior test

2017-08-07 Thread Alexander I Kulyavtsev
The Lustre wiki has sidebars on Testing and Monitoring; you could start a Benchmarking one.

There was a Benchmarking Working Group in OpenSFS.
wiki:  http://wiki.opensfs.org/Benchmarking_Working_Group
mail list: http://lists.opensfs.org/listinfo.cgi/openbenchmark-opensfs.org

It is actually a question for the list: what is the preferred location for a KB on 
Lustre benchmarking, lustre.org or 
opensfs.org?
IMHO, the KB belongs on lustre.org and the BWG minutes (if the group re-engages) on 
opensfs.org.

Alex.


On Aug 7, 2017, at 7:56 AM, E.S. Rosenberg 
mailto:esr+lus...@mail.hebrew.edu>> wrote:

OT:
Can we create a wiki page or some other form of knowledge pooling on 
benchmarking lustre?

Right now I'm using slides from 2009 as my source which may not be ideal...

http://wiki.lustre.org/images/4/40/Wednesday_shpc-2009-benchmarking.pdf



Re: [lustre-discuss] PFL error

2017-08-07 Thread Jones, Peter A
Perhaps this is the same as LU-9825?




On 8/7/17, 6:42 AM, "lustre-discuss on behalf of Vicker, Darby (JSC-EG311)" 
 
wrote:

>Hello,
>
>We've upgraded to 2.10 and I've been playing with progressive file layouts.  
>To begin, I'm just setting a test directory to use the following PFL.  
>
>lfs setstripe \
>   -E 4M   -c 1 -S 1M -i -1 \
>   -E 256M -c 4 -S 1M -i -1 \
>   -E -1   -c 8 -S 4M -i -1 .
>
>I then created some files in the different ranges.  
>
>dd if=/dev/zero of=1m.dat   bs=1M count=1
>dd if=/dev/zero of=5m.dat   bs=1M count=5
>dd if=/dev/zero of=300m.dat bs=1M count=300
>
>When I look at "lfs getstripe" for those files, I can see them hitting the 
>different tiers as they should.  However, I'm seeing this in the lustre logs 
>whenever I create a file in my PFL test directory.  
>
>Aug  4 14:12:14 hpfs-fsl-mds0 kernel: LustreError: 
>20096:0:(mdt_lvb.c:163:mdt_lvbo_fill()) hpfs-fsl-MDT: expected 416 actual 
>344.
>Aug  4 14:12:14 hpfs-fsl-mds0 kernel: LustreError: 
>20096:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 2 previous similar messages
>
>I've searched JIRA but didn't see anything related to this.  Is this 
>concerning?  I can create a new JIRA ticket if that helps.  
>
>Darby
>
>
>
>
>
>
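For what it's worth, the mapping from file size to PFL component in the `lfs setstripe` layout quoted above can be sketched as follows (a plain illustration of the extent boundaries, not Lustre code):

```python
MIB = 1 << 20
# (extent end in bytes, stripe count, stripe size), per the layout:
#   -E 4M -c 1 -S 1M,  -E 256M -c 4 -S 1M,  -E -1 -c 8 -S 4M
# The last component ("-E -1") extends to end of file, modeled as None.
components = [
    (4 * MIB, 1, 1 * MIB),
    (256 * MIB, 4, 1 * MIB),
    (None, 8, 4 * MIB),
]

def last_component(size_bytes: int) -> int:
    """Index of the component holding the last byte of a file."""
    for i, (end, _count, _ssize) in enumerate(components):
        if end is None or size_bytes <= end:
            return i
    return len(components) - 1

print(last_component(1 * MIB))    # 1m.dat   -> 0
print(last_component(5 * MIB))    # 5m.dat   -> 1
print(last_component(300 * MIB))  # 300m.dat -> 2
```

This matches the observation in the quoted message that the three dd test files hit the three different tiers.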


Re: [lustre-discuss] nodes crash during ior test

2017-08-07 Thread Brian Andrus

There were actually several:

On an OSS:

[447314.138709] BUG: unable to handle kernel NULL pointer dereference at 0020
[543262.189674] BUG: unable to handle kernel NULL pointer dereference at (null)
[16397.115830] BUG: unable to handle kernel NULL pointer dereference at (null)



On 2 separate clients:

[65404.590906] BUG: unable to handle kernel NULL pointer dereference at (null)
[72095.972732] BUG: unable to handle kernel paging request at 002029b0e000


Brian Andrus



On 8/4/2017 10:49 AM, Patrick Farrell wrote:


Brian,


What is the actual crash?  Null pointer, failed assertion/LBUG...?  
Probably just a few more lines back in the log would show that.



Also, Lustre 2.10 has been released, you might benefit from switching 
to that.  There are almost certainly more bugs in this pre-2.10 
development version you're running than in the release.



- Patrick


*From:* lustre-discuss  on 
behalf of Brian Andrus 

*Sent:* Friday, August 4, 2017 12:12:59 PM
*To:* lustre-discuss@lists.lustre.org
*Subject:* [lustre-discuss] nodes crash during ior test
All,

I am trying to run some ior benchmarking on a small system.

It only has 2 OSSes.
I have been having some trouble where one of the clients will reboot and
do a crash dump somewhat arbitrarily. The runs will work most of the
time, but every 5 or so times, a client reboots and it is not always the
same client.

The call trace seems to point to lnet:


[72095.973865] Call Trace:
[72095.973892]  [] ? cfs_percpt_unlock+0x36/0xc0 [libcfs]
[72095.973936]  [] lnet_return_tx_credits_locked+0x211/0x480 [lnet]
[72095.973973]  [] lnet_msg_decommit+0xd0/0x6c0 [lnet]
[72095.974006]  [] lnet_finalize+0x1e9/0x690 [lnet]
[72095.974037]  [] ksocknal_tx_done+0x85/0x1c0 [ksocklnd]
[72095.974068]  [] ksocknal_handle_zcack+0x137/0x1e0 [ksocklnd]
[72095.974101]  [] ksocknal_process_receive+0x3a1/0xd90 [ksocklnd]
[72095.974134]  [] ksocknal_scheduler+0xee/0x670 [ksocklnd]
[72095.974165]  [] ? wake_up_atomic_t+0x30/0x30
[72095.974193]  [] ? ksocknal_recv+0x2a0/0x2a0 [ksocklnd]
[72095.974222]  [] kthread+0xcf/0xe0
[72095.974244]  [] ? kthread_create_on_node+0x140/0x140
[72095.974272]  [] ret_from_fork+0x58/0x90
[72095.974296]  [] ? kthread_create_on_node+0x140/0x140

I am currently using lustre 2.9.59_15_g107b2cb built for kmod

Is there something I can do to track this down and hopefully remedy it?

Brian Andrus



Re: [lustre-discuss] Lustre 2.10 on RHEL6.x?

2017-08-07 Thread Jeff Johnson
I'm going to be testing an upgrade of a filled 2.9/0.6.5.7/CentOS6.x LFS to
2.10/0.7/CentOS6.9. I will report back results to the mailing list when it
is completed.

--Jeff

On Mon, Aug 7, 2017 at 06:50 E.S. Rosenberg 
wrote:

> We created a test system that was installed with CentOS 6.x and Lustre 2.8
> filled with some data and subsequently reinstalled with CentOS 7.x and
> Lustre 2.9
>
> Everything seems to have gone fine but I am actually curious if anyone
> else did this pretty invasive upgrade? (Hoping to upgrade in the
> not-too-distant future, maybe even directly to 2.10)
>
> Thanks,
> Eli
>
> On Mon, Aug 7, 2017 at 4:46 PM, Jones, Peter A 
> wrote:
>
>> Correct – RHEL 6.x support appeared for the last time in the community
>> 2.8 release. However, there has been some interest in seeing some kind of
>> support for RHEL 6.x in the 2.10 LTS releases so I think it likely that at
>> least support for clients will be reintroduced in a future 2.10.x
>> maintenance release.
>>
>> On 8/7/17, 6:34 AM, "lustre-discuss on behalf of E.S. Rosenberg" <
>> lustre-discuss-boun...@lists.lustre.org on behalf of
>> esr+lus...@mail.hebrew.edu> wrote:
>>
>> If I'm not mistaken they haven't provided RPMs for RHEL6.x since 2.9...
>> HTH,
>> Eli
>>
>> On Mon, Aug 7, 2017 at 4:33 PM, Steve Barnet 
>> wrote:
>>
>>> Hey all,
>>>
>>>   I am looking to upgrade from lustre 2.8 to 2.10. I see that
>>> there are no pre-built RPMs for 2.10 on RHEL6.x families.
>>>
>>> Did I miss them, or will I need to build from source (or
>>> upgrade to Centos 7)?
>>>
>>> Thanks much!
>>>
>>> Best,
>>>
>>> ---Steve
>>>
>>>
>>
>>
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
-- 
--
Jeff Johnson
Co-Founder
Aeon Computing

jeff.john...@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001   f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite D - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage


Re: [lustre-discuss] Lustre 2.10 on RHEL6.x?

2017-08-07 Thread E.S. Rosenberg
We created a test system that was installed with CentOS 6.x and Lustre 2.8
filled with some data and subsequently reinstalled with CentOS 7.x and
Lustre 2.9

Everything seems to have gone fine but I am actually curious if anyone else
did this pretty invasive upgrade? (Hoping to upgrade in the not-too-distant
future, maybe even directly to 2.10)

Thanks,
Eli

On Mon, Aug 7, 2017 at 4:46 PM, Jones, Peter A 
wrote:

> Correct – RHEL 6.x support appeared for the last time in the community 2.8
> release. However, there has been some interest in seeing some kind of
> support for RHEL 6.x in the 2.10 LTS releases so I think it likely that at
> least support for clients will be reintroduced in a future 2.10.x
> maintenance release.
>
> On 8/7/17, 6:34 AM, "lustre-discuss on behalf of E.S. Rosenberg" <
> lustre-discuss-boun...@lists.lustre.org on behalf of
> esr+lus...@mail.hebrew.edu> wrote:
>
> If I'm not mistaken they haven't provided RPMs for RHEL6.x since 2.9...
> HTH,
> Eli
>
> On Mon, Aug 7, 2017 at 4:33 PM, Steve Barnet 
> wrote:
>
>> Hey all,
>>
>>   I am looking to upgrade from lustre 2.8 to 2.10. I see that
>> there are no pre-built RPMs for 2.10 on RHEL6.x families.
>>
>> Did I miss them, or will I need to build from source (or
>> upgrade to Centos 7)?
>>
>> Thanks much!
>>
>> Best,
>>
>> ---Steve
>>
>>
>
>


Re: [lustre-discuss] Lustre 2.10 on RHEL6.x?

2017-08-07 Thread Jones, Peter A
Correct – RHEL 6.x support appeared for the last time in the community 2.8 
release. However, there has been some interest in seeing some kind of support 
for RHEL 6.x in the 2.10 LTS releases so I think it likely that at least 
support for clients will be reintroduced in a future 2.10.x maintenance release.

On 8/7/17, 6:34 AM, "lustre-discuss on behalf of E.S. Rosenberg" 
mailto:lustre-discuss-boun...@lists.lustre.org>
 on behalf of esr+lus...@mail.hebrew.edu> 
wrote:

If I'm not mistaken they haven't provided RPMs for RHEL6.x since 2.9...
HTH,
Eli

On Mon, Aug 7, 2017 at 4:33 PM, Steve Barnet 
mailto:bar...@icecube.wisc.edu>> wrote:
Hey all,

  I am looking to upgrade from lustre 2.8 to 2.10. I see that
there are no pre-built RPMs for 2.10 on RHEL6.x families.

Did I miss them, or will I need to build from source (or
upgrade to Centos 7)?

Thanks much!

Best,

---Steve




[lustre-discuss] PFL error

2017-08-07 Thread Vicker, Darby (JSC-EG311)
Hello,

We've upgraded to 2.10 and I've been playing with progressive file layouts.  To 
begin, I'm just setting a test directory to use the following PFL.  

lfs setstripe \
   -E 4M   -c 1 -S 1M -i -1 \
   -E 256M -c 4 -S 1M -i -1 \
   -E -1   -c 8 -S 4M -i -1 .

I then created some files in the different ranges.  

dd if=/dev/zero of=1m.dat   bs=1M count=1
dd if=/dev/zero of=5m.dat   bs=1M count=5
dd if=/dev/zero of=300m.dat bs=1M count=300

When I look at "lfs getstripe" for those files, I can see them hitting the 
different tiers as they should.  However, I'm seeing this in the lustre logs 
whenever I create a file in my PFL test directory.  

Aug  4 14:12:14 hpfs-fsl-mds0 kernel: LustreError: 
20096:0:(mdt_lvb.c:163:mdt_lvbo_fill()) hpfs-fsl-MDT: expected 416 actual 
344.
Aug  4 14:12:14 hpfs-fsl-mds0 kernel: LustreError: 
20096:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 2 previous similar messages

I've searched JIRA but didn't see anything related to this.  Is this 
concerning?  I can create a new JIRA ticket if that helps.  

Darby








Re: [lustre-discuss] Lustre 2.10 on RHEL6.x?

2017-08-07 Thread E.S. Rosenberg
If I'm not mistaken they haven't provided RPMs for RHEL6.x since 2.9...
HTH,
Eli

On Mon, Aug 7, 2017 at 4:33 PM, Steve Barnet 
wrote:

> Hey all,
>
>   I am looking to upgrade from lustre 2.8 to 2.10. I see that
> there are no pre-built RPMs for 2.10 on RHEL6.x families.
>
> Did I miss them, or will I need to build from source (or
> upgrade to Centos 7)?
>
> Thanks much!
>
> Best,
>
> ---Steve
>
>


[lustre-discuss] Lustre 2.10 on RHEL6.x?

2017-08-07 Thread Steve Barnet

Hey all,

  I am looking to upgrade from lustre 2.8 to 2.10. I see that
there are no pre-built RPMs for 2.10 on RHEL6.x families.

Did I miss them, or will I need to build from source (or
upgrade to Centos 7)?

Thanks much!

Best,

---Steve



Re: [lustre-discuss] nodes crash during ior test

2017-08-07 Thread Jones, Peter A
I do apologize. This was my error – I seem to have sent it to lustre-devel 
twice when I intended to send it to both lustre-devel and lustre-discuss.

On 8/7/17, 5:56 AM, "lustre-discuss on behalf of E.S. Rosenberg" 
mailto:lustre-discuss-boun...@lists.lustre.org>
 on behalf of esr+lus...@mail.hebrew.edu> 
wrote:

Did I miss the release announcement or was 2.10 never announced on this list?


Re: [lustre-discuss] nodes crash during ior test

2017-08-07 Thread E.S. Rosenberg
OT:
Can we create a wiki page or some other form of knowledge pooling on
benchmarking lustre?

Right now I'm using slides from 2009 as my source which may not be ideal...

http://wiki.lustre.org/images/4/40/Wednesday_shpc-2009-benchmarking.pdf

OT2:
Did I miss the release announcement or was 2.10 never announced on this
list?

Thanks!
Eli

On Fri, Aug 4, 2017 at 8:49 PM, Patrick Farrell  wrote:

> Brian,
>
> What is the actual crash?  Null pointer, failed assertion/LBUG...?
> Probably just a few more lines back in the log would show that.
>
>
> Also, Lustre 2.10 has been released, you might benefit from switching to
> that.  There are almost certainly more bugs in this pre-2.10 development
> version you're running than in the release.
>
>
> - Patrick
> --
> *From:* lustre-discuss  on
> behalf of Brian Andrus 
> *Sent:* Friday, August 4, 2017 12:12:59 PM
> *To:* lustre-discuss@lists.lustre.org
> *Subject:* [lustre-discuss] nodes crash during ior test
>
> All,
>
> I am trying to run some ior benchmarking on a small system.
>
> It only has 2 OSSes.
> I have been having some trouble where one of the clients will reboot and
> do a crash dump somewhat arbitrarily. The runs will work most of the
> time, but every 5 or so times, a client reboots and it is not always the
> same client.
>
> The call trace seems to point to lnet:
>
>
> [72095.973865] Call Trace:
> [72095.973892]  [] ? cfs_percpt_unlock+0x36/0xc0 [libcfs]
> [72095.973936]  []
> lnet_return_tx_credits_locked+0x211/0x480 [lnet]
> [72095.973973]  [] lnet_msg_decommit+0xd0/0x6c0 [lnet]
> [72095.974006]  [] lnet_finalize+0x1e9/0x690 [lnet]
> [72095.974037]  [] ksocknal_tx_done+0x85/0x1c0 [ksocklnd]
> [72095.974068]  [] ksocknal_handle_zcack+0x137/0x1e0
> [ksocklnd]
> [72095.974101]  []
> ksocknal_process_receive+0x3a1/0xd90 [ksocklnd]
> [72095.974134]  [] ksocknal_scheduler+0xee/0x670
> [ksocklnd]
> [72095.974165]  [] ? wake_up_atomic_t+0x30/0x30
> [72095.974193]  [] ? ksocknal_recv+0x2a0/0x2a0 [ksocklnd]
> [72095.974222]  [] kthread+0xcf/0xe0
> [72095.974244]  [] ? kthread_create_on_node+0x140/0x140
> [72095.974272]  [] ret_from_fork+0x58/0x90
> [72095.974296]  [] ? kthread_create_on_node+0x140/0x140
>
> I am currently using lustre 2.9.59_15_g107b2cb built for kmod
>
> Is there something I can do to track this down and hopefully remedy it?
>
> Brian Andrus
>