Re: [lustre-discuss] ZFS atime is it required?

2020-10-29 Thread Andreas Dilger
On Oct 23, 2020, at 14:03, Kumar, Amit <ahku...@mail.smu.edu> wrote:

Dear All,

Quick question: can I get away with setting “zfs set atime=off 
on_all_my_volumes_mgt_mdt_and_osts”? I ask because it is noted as a 
performance-boosting tip, on the assumption that the filesystem (Lustre) 
handles all access times itself.

You don't really need atime enabled on the OSTs, but I also don't think 
"atime=off" will make any difference.  That is a VFS/ZPL level option, and 
Lustre osd-zfs doesn't use any of the ZPL code, but rather handles atime 
internally.

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud








Re: [lustre-discuss] Hidden QoS in Lustre ?

2020-10-08 Thread Andreas Dilger
On Oct 8, 2020, at 10:37 AM, Tung-Han Hsieh  
wrote:
> 
> Dear All,
> 
> In the past months, we have encountered several episodes of Lustre I/O
> abnormally slowing down. It is quite mysterious, since there seems to be no
> problem with the network hardware, nor with Lustre itself: there is no error
> message at all on the MDT/OST/client sides.
> 
> Recently we probably found a way to reproduce it, and now have some
> suspicions. We found that if we continuously perform I/O on a client
> without stopping, then after some time threshold (probably more than 24
> hours), the additional file I/O bandwidth of that client shrinks
> dramatically.
> 
> Our configuration is the following:
> - One MDT and one OST server, based on ZFS + Lustre-2.12.4.
> - The OST is served by a RAID 5 system with 15 SAS hard disks.
> - Some clients connect to MDT/OST through Infiniband, some through
>  gigabit ethernet.
> 
> Our test was focused on the clients using infiniband, which is described
> in the following:
> 
> We have a huge (several TB) amount of data stored in the Lustre file
> system to be transferred to outside network. In order not to exhaust
> the network bandwidth of our institute, we transfer the data with limited
> bandwidth via the following command:
> 
> rsync -av --bwlimit=1000  ://
> 
> That is, the transfer rate is 1 MB per second, which is relatively
> low. The client reads the data from Lustre through InfiniBand. So during
> data transmission, presumably there is no problem doing other data I/O
> on the same client. On average, copying a 600 MB file from one directory
> to another directory (both in the same Lustre file system) took about
> 1.0 - 2.0 secs, even while the rsync process was still working.
> 
> But after about 24 hours of continuously sending data via rsync, the
> additional I/O on the same client shrank dramatically. When this happens,
> it takes more than 1 minute to copy a 600 MB file from one place to another
> (both in the same Lustre file system) while rsync is still running.
> 
> Then we stopped the rsync process and waited for a while (about one
> hour). The I/O performance of copying that 600 MB file returned to normal.
> 
> Based on this observation, we suspect that there may be a
> hidden QoS mechanism built into Lustre. When a process occupies the I/O
> bandwidth for a long time and exceeds some limit, does Lustre automatically
> shrink the I/O bandwidth for all processes running on the same client?
> 
> I am not against such a QoS design, if it does exist. But the amount of
> shrinking seems to be too large for InfiniBand (QDR and above). So
> I further suspect that this may be due to our system being a mix of
> clients in which some have InfiniBand and some do not.
> 
> Could anyone help fix this problem? Any suggestions would be very much
> appreciated.

There is no "hidden QOS", unless it is so well hidden that I don't know
about it.

You could investigate several different things to isolate the problem:
- try with a 2.13.56 client to see if the problem is already fixed
- check if the client is using a lot of CPU when it becomes slow
- run strace on your copy process to see which syscalls are slow
- check memory/slab usage
- enable Lustre debug=-1 and dump the kernel debug log to see where
  the process is taking a long time to complete a request

It is definitely possible that there is some kind of problem, since this
is not a very common workload to be continuously writing to the same file
descriptor for over a day.  You'll have to do the investigation on your
system to isolate the source of the problem.
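
A rough sketch of the checks suggested above (the file paths are placeholders,
and the lctl parameters are the standard debug ones; adjust to your system):

    # watch CPU and memory/slab usage on the client while the copy is slow
    top -b -n 1 | head -20
    slabtop -o | head -20

    # trace which syscalls the copy spends its time in
    strace -T -tt -o /tmp/cp.strace cp /lustre/src/bigfile /lustre/dst/bigfile

    # enable full Lustre debugging, reproduce the slow copy, then dump the log
    lctl set_param debug=-1
    lctl dk /tmp/lustre-debug.log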

Cheers, Andreas









Re: [lustre-discuss] Lustre optimize for sparse data files?

2020-09-09 Thread Andreas Dilger
On Sep 8, 2020, at 9:13 PM, Tung-Han Hsieh  
wrote:
> 
> I would like to ask whether the Lustre file system has implemented
> optimization for large sparse data files.
> 
> For example, for a 3GB data file with more than 80% of its bytes zero, can
> the Lustre file system optimize the storage so it does not actually take
> the whole 3GB of disk space?

Could you please explain your usage further?  Lustre definitely has
support for sparse files - if they are written by an application with
"seek" or by multiple threads in parallel, then only the blocks that
are written will use space on the OST.

For ldiskfs the block size is 4KB.  For ZFS the OST block size is up
to 1MB, if the file size is 1MB or larger.  That is why compression
on ZFS can help reduce space usage on the OST, because it can effectively
compress the 1MB blocks that are nearly full of zeroes, if your sparse
writes are smaller than the blocksize.

If you are *copying* a sparse file, that depends on the tool that is
doing the copy.  For example, "cp --sparse=always" will generate a
sparse file.  We are also working on adding SEEK_HOLE and SEEK_DATA,
which will help tools to copy sparse files.
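
As a small illustration of the above (assuming a Lustre mount at /lustre; the
file name is just an example):

    # write 10MB of real data starting at a 3060MB offset, leaving a large hole
    dd if=/dev/zero of=/lustre/sparsefile bs=1M count=10 seek=3060

    ls -lh /lustre/sparsefile    # apparent size: ~3GB
    du -h  /lustre/sparsefile    # space actually allocated: ~10MB

    # preserve the hole when copying
    cp --sparse=always /lustre/sparsefile /lustre/sparsefile.copy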

Cheers, Andreas

> I know that some file systems (e.g., ZFS) have this function. If Lustre
> does not have it, is there a roadmap to implement it in the future ?
> 
> Thanks for your reply in advance.
> 
> Best Regards,
> 
> T.H.Hsieh


Cheers, Andreas









Re: [lustre-discuss] Lustre 2.12 routing with MR and discovery off

2020-08-30 Thread Andreas Dilger
On Aug 26, 2020, at 4:37 PM, Faaland, Olaf P.  wrote:
> 
> Does Lustre 2.12 require that routes for every intermediate network are 
> defined, on every node on a path?
> 
> For example, given this Lustre network, where:
>  A-D are nodes and 1-6 are addresses
>  network tcp2 has only routers, no clients and no servers
> 
> A(1) -tcp1- (2)B(3) -tcp2- (4)C(5) -tcp3- (6)D
> 
> And configured routes:
> 
> A: options lnet routes="tcp3 2@tcp1"
> B: options lnet routes="tcp3 4@tcp2"
> C: options lnet routes="tcp1 3@tcp2"
> D: options lnet routes="tcp1 5@tcp3"
> 
> With Lustre <= 2.10 we configured only these routes.  The only nodes that 
> need to know tcp2 exist are attached to it, and so there are no routes to 
> tcp2 defined anywhere.
> 
> It looks to me like Lustre 2.12 attempts to send error notifications back to 
> the original sender, and so nodes A and D may end up receiving messages from 
> nids on tcp2.  This then requires nodes A and D to have routes to tcp2 
> defined, so they can reply to the messages.

This is interesting.  I'm not an LNet expert, but it seems strange to me that
nodes other than "B" and "C" should care about the state of connections within
@tcp2 if they are not endpoints.  They should never be sending messages directly
to those nodes, and the LNet routers B/C knowing which connections/peers are
working should be enough for them to make routing decisions for A and D.

Cheers, Andreas









Re: [lustre-discuss] Lustre 2.12 routing with MR and discovery off

2020-08-29 Thread Andreas Dilger
On Aug 26, 2020, at 16:37, Faaland, Olaf P. <faala...@llnl.gov> wrote:

Does Lustre 2.12 require that routes for every intermediate network are 
defined, on every node on a path?

For example, given this Lustre network, where:
 A-D are nodes and 1-6 are addresses
 network tcp2 has only routers, no clients and no servers

A(1) -tcp1- (2)B(3) -tcp2- (4)C(5) -tcp3- (6)D

And configured routes:

A: options lnet routes="tcp3 2@tcp1"
B: options lnet routes="tcp3 4@tcp2"
C: options lnet routes="tcp1 3@tcp2"
D: options lnet routes="tcp1 5@tcp3"

With Lustre <= 2.10 we configured only these routes.  The only nodes that need 
to know tcp2 exist are attached to it, and so there are no routes to tcp2 
defined anywhere.

It looks to me like Lustre 2.12 attempts to send error notifications back to 
the original sender, and so nodes A and D may end up receiving messages from 
nids on tcp2.  This then requires nodes A and D to have routes to tcp2 defined, 
so they can reply to the messages.

Interesting.  I'm no LNet expert, but it seems strange to me that nodes other 
than B and C should care about the state of connections within @tcp2 if they 
are not endpoints themselves. A and D should never be sending messages directly 
to those nodes, and the LNet routers B/C knowing which connections/peers in 
@tcp2 are working should be enough for them to make routing decisions for A and 
D.

If B/C nodes are themselves unable to communicate with their peers, then _that_ 
should be sent back to A/D to indicate they cannot route packets to the target 
NID, but I wouldn't think A/D should get information about @tcp2 themselves?

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud








Re: [lustre-discuss] some clients dmesg filled up with "dirty page discard"

2020-08-29 Thread Andreas Dilger
On Aug 25, 2020, at 17:42, 肖正刚 <guru.nov...@gmail.com> wrote:

No, on the OSS we found that only the client which reported "dirty page discard" was
being evicted.
We hit this again last night, and on the OSS we can see logs like:
"
[Tue Aug 25 23:40:12 2020] LustreError: 
14278:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired 
after 100s: evicting client at 10.10.3.223@o2ib  ns: 
filter-public1-OST_UUID lock: 9f1f91cba880/0x3fcc67dad1c65842 lrc: 
3/0,0 mode: PR/PR res: [0xde2db83:0x0:0x0].0x0 rrc: 3 type: EXT 
[0->18446744073709551615] (req 0->270335) flags: 0x6400020020 nid: 
10.10.3.223@o2ib remote: 0xd713b7b417045252 expref: 7081 pid: 25923 timeout: 
21386699 lvb_type: 0

It isn't clear what the question is here.  The "dirty page discard" message 
means that unwritten data from the client was discarded because the client was 
evicted and the lock covering this data was revoked by the server because the 
client was not responsive.

Also, we ran lfsck on all servers; the result is

There is no need for LFSCK in this case.  The file data was not written, but a 
client eviction does not result in the filesystem becoming inconsistent.

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud








Re: [lustre-discuss] import_set_state_nolock() with binary args in lctl debug_file output?

2020-08-24 Thread Andreas Dilger
On Aug 14, 2020, at 3:44 PM, Sternberg, Michael G.  wrote:
> 
> 
> In lctl debug_file output, for import_set_state_nolock(), I sometimes see 
> binary arguments (sample snippet at end of post), and figure that's not a 
> good sign. How can I get to the bottom of this?
> 
> The only direct reference I could find is LU-7339 from 2015, which got no 
> replies.

> import_set_state_nolock()) 8ae6833d6000 
> 0<8A>00o<8A>: changing import state 
> from FULL to RECOVER

According to the comments in that ticket, this is a case of the debug message
interpreting the structure incorrectly on the server, since the message is
normally printed on the client:

        if (imp->imp_state != LUSTRE_IMP_NEW) {
                CDEBUG(D_HA, "%p %s: changing import state from %s to %s\n",
                       imp, obd2cli_tgt(imp->imp_obd),
                       ptlrpc_import_state_name(imp->imp_state),
                       ptlrpc_import_state_name(state));
        }

This message could probably be changed from using obd2cli_tgt() to using
imp->imp_obd->obd_name instead.  That said, this is probably a red herring
and totally unrelated to your problem, but a patch against LU-7339 would be
welcome in any case.
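
A rough sketch of that suggested change (untested, just swapping the name
lookup as described above):

        if (imp->imp_state != LUSTRE_IMP_NEW) {
                /* use the plain obd_name instead of obd2cli_tgt(), which is
                 * only meaningful on the client side */
                CDEBUG(D_HA, "%p %s: changing import state from %s to %s\n",
                       imp, imp->imp_obd->obd_name,
                       ptlrpc_import_state_name(imp->imp_state),
                       ptlrpc_import_state_name(state));
        }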


Cheers, Andreas









Re: [lustre-discuss] Can not mount ZFS-based device

2020-08-04 Thread Andreas Dilger
On Aug 4, 2020, at 4:57 AM, yangshengwang2011  wrote:
> 
> Hi,
> 
> 
> I cannot mount a ZFS-based device when installing the Lustre servers.
> 
> 
> Information in syslog is,
> 
> # kernel:osd_zfs: Unknown symbol zfs_refcount_add (0)
> 
> # kernel:LustreError:158-c: Can't load module 'osd-zfs'
> 
> 
> And the message when insmod osd_zfs.ko  manually is:
> 
> # kernel:osd_zfs: Unknown symbol zfs_refcount_add (0)

It looks like Lustre was not built against the version of ZFS that
was installed on the system.  You need to re-run autogen.sh,
configure, then rebuild the Lustre modules against the version
of ZFS installed on the node.
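
Roughly, that rebuild looks like the following (a sketch only; point configure
at the kernel and ZFS 0.8.1 sources/headers actually installed on the node):

    sh ./autogen.sh
    ./configure --with-linux=/usr/src/kernels/$(uname -r) --with-zfs
    make rpms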

Cheers, Andreas

> 
> 
> 
> System information:
> 
> distribution name  | centos 7.6
> 
> Linux kernel   | 3.10.0-957.27.2_el7
> 
> Architecture   | x86_64
> 
> Lustre version   | 2.12.5
> 
> ZFS version   | 0.8.1
> 
> 
> 
> What should i do for this situation?
> 
> 
> 
> Sincerely  thanks!
> 









Re: [lustre-discuss] How to convert fids like /O/0/d19/115 to actual data ?

2020-07-20 Thread Andreas Dilger
On Jul 9, 2020, at 3:52 AM, Zeeshan Ali Shah  wrote:
> 
> Dear All ,
> On zfs based lustre we are getting following
> pool: ost2-xag
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
> corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise restore the
> entire pool from backup.
>see: http://zfsonlinux.org/msg/ZFS-8000-8A
>   scan: resilvered 43.5G in 0h2m with 8 errors on Thu Jul  9 12:00:08 2020
> config:
> 
> NAME                          STATE     READ WRITE CKSUM
> ost2-xag                      ONLINE       0     0    24
>   raidz2-0                    ONLINE       0     0    48
>     spare-0                   ONLINE       0     0     0
>       wwn-0x5000c500a7c94003  ONLINE       0     0     0
>       wwn-0x5000c500a7c970ef  ONLINE       0     0     0
>     wwn-0x5000c500a7c94343    ONLINE       0     0     0
>     replacing-2               ONLINE       0     0     0
>       wwn-0x5000c500a7d02fff  ONLINE       0     0     0
>       wwn-0x5000c500a7c9460b  ONLINE       0     0     0
>     wwn-0x5000c500a7c94fcf    ONLINE       0     0     0
>     spare-4                   ONLINE       0     0     0
>       wwn-0x5000c500a7c952db  ONLINE       0     0     0
>       wwn-0x5000c500a7c968bb  ONLINE       0     0     0
>     wwn-0x5000c500a7c95553    ONLINE       0     0     0
>     wwn-0x5000c500a7c95ba3    ONLINE       0     0     0
>     wwn-0x5000c500a7c96547    ONLINE       0     0     0
>     wwn-0x5000c500a7c967ff    ONLINE       0     0     1
> spares
>   wwn-0x5000c500a7c968bb      INUSE     currently in use
>   wwn-0x5000c500a7c970ef      INUSE     currently in use
> 
> errors: Permanent errors have been detected in the following files:
> 
> ost2-xag/ost16:/O/0/d19/115
> 
> In above how to Convert it in to actual file/directory name  ?

There is a tool "ll_decode_filter_fid" that can read the xattr from
this object to find the MDT parent inode FID, and then you can run
"lfs fid2path /mnt $FID" to determine the pathname.  With ZFS, you
need to mount the underlying OST target directly as type ZFS and
run ll_decode_filter_fid on the above pathname.

For ldiskfs, it is possible to run something like:

   debugfs -c -R 'stat /O/0/d19/115' /dev/XXX

without having to unmount the underlying filesystem.  However, I don't
think that a similar operation is possible with ZDB - that would need
some xattr-specific decoding to be added to that tool.

There is a ticket open so that "lfs fid2path" can directly map OST FIDs
to a pathname, https://jira.whamcloud.com/browse/LU-13527 but it has
not been implemented yet.
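
A rough sketch of the ZFS workflow described above (the mountpoints and the
example FID are placeholders; the dataset and object name come from the zpool
output above):

    # mount the OST dataset directly as ZFS (read-only is safest)
    mount -t zfs -o ro ost2-xag/ost16 /mnt/ost_zfs

    # read the parent (MDT) FID stored in the object's xattr
    ll_decode_filter_fid /mnt/ost_zfs/O/0/d19/115
    # ... prints something like: parent=[0x200000401:0x1234:0x0]

    # resolve that FID to a pathname on a regular Lustre client mount
    lfs fid2path /mnt/lustre [0x200000401:0x1234:0x0]

    umount /mnt/ost_zfs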


Cheers, Andreas









Re: [lustre-discuss] MGS+MDT migration to a new storage using LVM tools

2020-07-19 Thread Andreas Dilger
On Jul 19, 2020, at 12:41 AM, David Cohen  wrote:
> 
> Hi,
> We have a combined MGS+MDT and I'm looking for a migration to new storage 
> with a minimal disruption to the running jobs on the cluster.
> 
> Can anyone find problems in the scenario below and/or suggest another 
> solution?
> I would appreciate also "no problems" replies to reassure the scenario before 
> I proceed.
> 
> Current configuration:
> The mdt is a logical volume in a lustre_pool VG on a /dev/mapper/MDT0001 PV

I've been running Lustre on LVM at home for many years, and have done pvmove
of the underlying storage to new devices without any problems.

> Migration plan:
> Add /dev/mapper/MDT0002 new disk (multipath)

I would really recommend that you *not* use MDT0002 as the name of the PV.
This is very confusing because the MDT itself (at the Lustre level) is
almost certainly named "<fsname>-MDT0000", and if you ever add new MDTs to
this filesystem it will be confusing as to which *Lustre* MDT is on which
underlying PV.  Instead, I'd take the opportunity to name this PV to
match the actual Lustre MDT target name.

> extend the VG:
> pvcreate /dev/mapper/MDT0002
> vgextend  lustre_pool /dev/mapper/MDT0002
> mirror the mdt to the new disk:
> lvconvert -m 1 /dev/lustre_pool/TECH_MDT /dev/mapper/MDT0002

I typically just use "pvmove", but doing this by adding a mirror and then
splitting it off is probably safer.  That would still leave you with a full
copy of the MDT on the original PV if something happened in the middle.

> wait the mirrored disk to sync:
> lvs -o+devices
> when it's fully synced unmount the MDT, remove the old disk from the mirror:
> lvconvert -m 0 /dev/lustre_pool/TECH_MDT /dev/mapper/MDT0001
> and remove the old disk from the pool:
> vgreduce lustre_pool /dev/mapper/MDT0001
> pvremove /dev/mapper/MDT0001
> remount the MDT and let the clients few minutes to recover the connection.

In my experience with pvmove, there is no need to do anything with the clients,
as long as you are not also moving the MDT to a new server, since the LVM/DM
operations are totally transparent to both the Lustre server and client.

After my pvmove (your "lvconvert -m 0"), I would just vgreduce the old PV from
the VG, and then leave it in the system (internal HDD) until the next time I
needed to shut down the server.  If you have hot-plug capability for the PVs,
then you don't even need to wait for that.
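
For reference, the pvmove-based variant mentioned above looks roughly like
this (device names as in the original mail; sketch only):

    pvcreate /dev/mapper/MDT0002
    vgextend lustre_pool /dev/mapper/MDT0002
    pvmove /dev/mapper/MDT0001 /dev/mapper/MDT0002   # online, MDT stays mounted
    vgreduce lustre_pool /dev/mapper/MDT0001
    pvremove /dev/mapper/MDT0001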

Cheers, Andreas









Re: [lustre-discuss] systemd lnet/rdma conflict

2020-07-17 Thread Andreas Dilger
Rick,
would you be able to put this in the form of a patch against 
lustre/scripts/systemd/lnet.service so that this is working
well for everyone?  You could use LU-9673 for this.
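
As a sketch, the change suggested further down in this thread would add a
single line to the [Unit] section of lnet.service (untested):

    [Unit]
    Description=lnet management
    Requires=network-online.target
    After=network-online.target rdma.service
    Wants=rdma.service
    BindsTo=rdma.service
    ConditionPathExists=!/proc/sys/lnet/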



> On Jul 16, 2020, at 2:34 PM, Mohr Jr, Richard Frank  wrote:
>> On Jul 16, 2020, at 2:46 PM, Christopher Benjamin Coffey 
>>  wrote:
>> 
>> 
>> I'm trying to get lustre , and rdma setup on an el8 system. I can't get 
>> systemd to get the two services: lnet, and rdma shutdown correctly without 
>> hanging the system. I've tried many things in the rdma.service, and 
>> lnet.service files to get them to work correctly but still the issue exists. 
>> Here are my service files below. Anyone know how to fix this?
> 
> Yup, ran into the same thing.  See suggestion below.
> 
>> 
>> -
>> [Unit]
>> Description=lnet management
>> 
>> Requires=network-online.target
>> After=network-online.target rdma.service
>> Wants=rdma.service
>> 
>> ConditionPathExists=!/proc/sys/lnet/
>> 
>> [Service]
>> Type=oneshot
>> RemainAfterExit=true
>> ExecStart=/sbin/modprobe lnet
>> ExecStart=/usr/sbin/lnetctl lnet configure
>> ExecStart=/usr/sbin/lnetctl import /etc/lnet.conf
>> ExecStop=/usr/sbin/lnetctl lnet unconfigure
>> ExecStop=/usr/sbin/lustre_rmmod
>> TimeoutStopSec=30
>> 
>> [Install]
>> WantedBy=multi-user.target
> 
> 
> Try  adding “BindsTo=rdma.service” to the lnet service file.  This should 
> force the lnet service to be stopped if the rdma service is ever stopped.
> 
> —
> Rick Mohr
> Senior HPC System Administrator
> Joint Institute for Computational Sciences
> University of Tennessee
> 
> 
> 
> 









Re: [lustre-discuss] Is there aceiling of lustre filesystem a client can mount

2020-07-17 Thread Andreas Dilger
On Jul 15, 2020, at 8:39 PM, 肖正刚  wrote:
> 
>   Hi, Jongwoo &  Andreas
> 
> Sorry for the ambiguous description.
> What I want to know is the number of lustre filesystems that a client can 
> mount at the same time.

The number of filesystems a client can mount depends on how much RAM it has.
I don't think anyone has done any measurement of this kind before, but there
are several production sites that have 10 or more Lustre mounts on the client.

Cheers, Andreas

> 
> Message: 1
> Date: Wed, 15 Jul 2020 14:29:10 +0800
> From: 肖正刚
> To: lustre-discuss@lists.lustre.org
> Subject: [lustre-discuss] Is there aceiling of lustre filesystem a
> client can mount
> 
> Hi, all
> Is there a ceiling on the number of Lustre filesystems that can be mounted in a cluster?
> If so, what is the number?
> If not, how many is reasonable?
> Can mounting multiple filesystems affect the stability of each file
> system or cause other problems?
> 
> Thanks!
> 
> 
> Message: 3
> Date: Wed, 15 Jul 2020 23:45:57 +0900
> From: Jongwoo Han 
> To: 肖正刚
> Cc: lustre-discuss 
> Subject: Re: [lustre-discuss] Is there aceiling of lustre filesystem a
> client can mount
> 
> I think your question is ambiguous.
> 
> What ceiling do you mean? Total storage capacity? number of disks? number
> of clients? number of filesystems?
> 
> Please be more clear about it.
> 
> Regards,
> Jongwoo Han
> 
> On Jul 15, 2020 at 3:29 PM, 肖正刚 wrote:
> 
> > Hi, all
> > Is there a ceiling for a Lustre filesystem that can be mounted in a
> > cluster?
> > If so, what's the number?
> > If not, how much is proper?
> > Can mounting multiple filesystems affect the stability of each file
> > system or cause other problems?
> 


Cheers, Andreas









Re: [lustre-discuss] Is there aceiling of lustre filesystem a client can mount

2020-07-15 Thread Andreas Dilger
On Jul 15, 2020, at 12:29 AM, 肖正刚  wrote:
> 
> Hi, all
> Is there a ceiling for a Lustre filesystem that can be mounted in a cluster?
> If so, what's the number?
> If not, how much is proper?
> Can mounting multiple filesystems affect the stability of each file system 
> or cause other problems?

Depending on what limits you are looking for, you may find this link useful:

https://build.whamcloud.com/job/lustre-manual//lastSuccessfulBuild/artifact/lustre_manual.xhtml#settinguplustresystem.tab2

For capacity and performance, the upper limits are probably "higher than what 
you have money for". :-)

Cheers, Andreas









Re: [lustre-discuss] Can we re-index the lustre-discuss archive DB?

2020-07-15 Thread Andreas Dilger
On Jul 15, 2020, at 6:07 PM, Cameron Harr  wrote:
> 
> To the person with the power,
> 
> I've been trying to search the lustre-discuss 
> (http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/) archives but 
> it seems only old (<= 2013 perhaps) messages are searchable with the "Search" 
> box. Is it possible to re-index the searchable DB to include recent/current 
> messages?

Cameron, it looks like there is a full archive at:

https://marc.info/?l=lustre-discuss

Cheers, Andreas









Re: [lustre-discuss] Yahoo OpenID not working to log into gerrit.

2020-07-05 Thread Andreas Dilger
Hi Arshad,
I'm at least able to login to Gerrit, but I'm not using Yahoo for
the authentication.  Is it possible that Yahoo discontinued the
OpenID login?  That previously happened with Gmail accounts, which
is why Gerrit no longer allows authentication with Gmail OpenID.


> On Jul 4, 2020, at 11:17 PM, arshad hussain  wrote:
> 
> Hi,
> 
> I use yahoo OpenID to log into gerrit. Since yesterday I am unable to
> log into gerrit. When I click “Sign in with yahoo! ID” -> It complains
> that the “Provider is not supported, or incorrectly  entered”.
> 
> Is anybody else facing this issue or it’s just me. I am confident it
> was working just a couple of days ago.
> 
> I have not tampered or changed any setting on my side as far as I
> could remember. Anything changed on the whamcloud side ?
> 
> Thanks
> Arshad


Cheers, Andreas









Re: [lustre-discuss] mlx4 and mxl5 mix environment

2020-07-03 Thread Andreas Dilger
There is a "Request Account" link at the top of every page: 
http://wiki.lustre.org/Special:RequestAccount



On Jul 1, 2020, at 08:39, Ms. Megan Larko <dobsonu...@gmail.com> wrote:

Awesome, thanks!   Unfortunately the password reset site is not finding my UID. 
  Maybe I never had access to the Lustre wiki.  (I have so many accounts that 
sometimes my head spins.)   I'm still willing to help.  Is there a request 
password site?

Cheers,
megan

On Fri, Jun 26, 2020 at 8:54 PM Spitz, Cory James <cory.sp...@hpe.com> wrote:
Megan,

You wrote:
PS. [I am willing to add/contribute to the 
http://wiki.lustre.org/Infiniband_Configuration_Howto but I think my account 
for wiki editing has expired (at least the one I thought I had did not work).

Thank you for your offer!  Did you try 
http://wiki.lustre.org/Special:PasswordReset?  If that didn’t work then I think 
that you could email lustre@lists.opensfs.org.

-Cory



On 6/24/20, 3:33 PM, "lustre-discuss on behalf of Ms. Megan Larko" 
<lustre-discuss-boun...@lists.lustre.org on behalf of dobsonu...@gmail.com> wrote:

On 22 Jun 2020 "guru.novice" wrote:
Hi, all
We set up a cluster using the mlx4 and mlx5 drivers mixed, and all things go well.
Later I found something in the wiki,
http://wiki.lustre.org/Infiniband_Configuration_Howto and
http://lists.onebuilding.org/pipermail/lustre-devel-lustre.org/2016-May/003842.html,
which was last edited in 2016.
So do I need to change the lnet configuration described in this page?
Or has the problem been resolved in a new version (like 2.12.x)?
And where can I find more details?

Any suggestions would be appreciated.
Thanks!

Hello guru.novice,
Lustre 2.12.x has some nice LNet configuration abilities.  The old 
/etc/modprobe.d/ config files have been superseded by /etc/lnet.conf.   An 
install of Lustre 2.12.x provides a sample of this file (with the lines 
commented out).  Our experience has shown that not all lines are necessary; 
edit to suit.

Lustre 2.12.x has Multi-Rail (MR) on by default, so Lustre will attempt to 
automatically find active and viable LNet paths to use.  This should have no 
issue with your mlx4/5 mix environment; we have some mixed IB and eth that 
work. To explicitly use MR one may set "Multi-Rail: true" in the "peer" NID 
section of the /etc/lnet.conf file.  But that was not necessary for us.  We 
used a simple /etc/lnet.conf for MR systems:
File stub: /etc/lnet.conf
net:
    - net type: o2ib0
      local NI(s):
        - interfaces:
              0: ib0
    - net type: o2ib777
      local NI(s):
        - interfaces:
              0: ib0:1
This allowed LNet to use any NID o2ib0 and o2ib777.

Whatever is placed in the /etc/lnet.conf file is loaded into the kernel modules 
used via the Lustre starting mechanism (CentOS uses /usr/lib/systemd/system).  
Because we are choosing _not_ to use MR on a different box, we explicitly 
defined the available routes in /etc/lnet.conf using the lines:
route:
    - net: tcp
      gateway: 10.10.10.101@o2ib1
    - net: tcp
      gateway: 10.10.10.102@o2ib
And so on up to 10.10.10.116@o2ib

In CentOS 7, the /usr/lib/systemd/system/lnet.service file is reproduced below.
(details: lustre-2.12.4-1 with Mellanox OFED version 4.7-1.0.0.1 and kernel
3.10.0-957.27.2.el7)
File lnet.service:
[Unit]
Description=lnet management
Requires=network-online.target
After=network-online.target openibd.service rdma.service opa.service
ConditionPathExists=!/proc/sys/lnet/

[Service]
Type=oneshot
RemainAfterExit=true
ExecStart=/sbin/modprobe lnet
ExecStart=/usr/sbin/lnetctl lnet configure
ExecStart=/usr/sbin/lnetctl set discovery 0   <--Do NOT use this line if you 
want MR function
ExecStart=/usr/sbin/lnetctl import /etc/lnet.conf  <--The file with router, 
credit and similar info
ExecStart=/usr/sbin/lnetctl peer add --nid 10.10.10.[101-116]@o2ib1 
--non_mr  <--Omit --non_mr if you want to use MR
ExecStop=/usr/sbin/lustre_rmmod ptlrpc
ExecStop=/usr/sbin/lnetctl lnet unconfigure
ExecStop=/usr/sbin/lustre_rmmod libcfs ldiskfs

[Install]
WantedBy=multi-user.target

I hope this info can help you in the right direction.

Cheers,
megan
PS. [I am willing to add/contribute to the 
http://wiki.lustre.org/Infiniband_Configuration_Howto but I think my account 
for wiki editing has expired (at least the one I thought I had did not work).
Our site had issues with Multi-Rail "not socially distancing appropriately" 
from other LNet networks so in our particular case we disabled MR.  (An 
entirely different experience.) ]

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud







Re: [lustre-discuss] Questions about LU-13645

2020-07-03 Thread Andreas Dilger
On Jul 2, 2020, at 11:17, Chad DeWitt <ccdew...@uncc.edu> wrote:

Good afternoon, All

I hope everyone is doing well.

I had a few questions concerning LU-13645 - Various data corruptions possible 
in lustre [https://jira.whamcloud.com/browse/LU-13645].

We are looking at deploying Lustre 2.12.5 and while browsing JIRA, this 
particular issue caught my attention. I read the issue, however, I can't 
determine if it is severe enough to prevent upgrading.

Is anyone familiar with this issue and its severity/prevalence? Has anyone else 
upgraded to 2.12.5?

Chad,
from what I can see in that ticket, this problem (or problems) largely affect 
specific features that are not used by default - lockahead and SEL, and 
possibly DoM.

The lockahead feature is only used with specific versions of MPI ADIO 
libraries, so it is unlikely that you are using it.  Similarly, SEL is new and 
doesn't even exist in the 2.12 release.  DoM is in the 2.12 release, but 
from reading the ticket, it sounds like this issue is very difficult to 
reproduce, so either you are already (not) seeing it, or (more likely) you 
aren't using DoM at all (you would know if you are), so upgrading to 2.12.5 
doesn't affect this either way.

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud








Re: [lustre-discuss] Permission denied on lfs getstripe

2020-07-02 Thread Andreas Dilger
Chris,
this looks like a bug that "lfs getstripe -M" is not using supplementary 
groups, or similar.  You wrote that the directory has GID=130817, so this is 
not the primary GID of the user accessing it, so it must depend on the 
supplementary group permissions to access it.  The "regular" ls access *is* 
using the supplementary GID to allow access, and when the directory is cached 
on the client then "lfs getstripe -M" is getting this information out of the 
client-side cache (where the client VFS is locally checking the GID for access 
permission).

I suspect this hasn't really been an issue in the past because few users use 
"lfs getstripe -M", and most of those are root or are accessing their own 
files/directories, so do not need a supplementary group to access this 
information.  It also seems (but isn't shown) that the directory does not have 
world-read permission?  What does "stat" on this directory show?

Could you please file a ticket in Jira with the details so that this issue can 
be tracked.  I don't know how easy/hard it will be to fix this, since this 
information is obtained via ioctl(), and we don't necessarily want non-owners 
of files to be able to call every ioctl on the file/directory.

Note, it is recommended to use "lfs getdirstripe -m" (or "--mdt-index") 
instead of "-M" to get the MDT index, since the "-M" option is 
deprecated.  This would imply you are running a Lustre 2.10 client?  The "-m" 
option is already available in 2.10, and "-M" will print a warning in 2.12 and 
later.
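
For example (path from the report above):

    lfs getdirstripe -m /projects/naris/pcm_110819/NARIS_TechBreak2050_missingDPV/StageC_RT5min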

I tested this on master and was not able to reproduce the problem.  If I set 
the directory mode=0640 I got permission denied for directories that I didn't 
have supplementary group access on, but it worked on the first try (after 
flushing all client locks and dropping all caches).  That means the problem 
seems to already be fixed in master, and possibly 2.12 also.

Cheers, Andreas

On Jul 2, 2020, at 10:26, Chang, Christopher <christopher.ch...@nrel.gov> wrote:

Hi Andreas,

   It doesn’t appear to be this issue. I verified the client “id” and server 
“l_getidentity -d” views before and after issuing an “ls” as the user to get 
getstripe working, and there’s no change.

Client:
el3:~> id
uid=131364(***) gid=131364(***) 
groups=131364(***),130033(globus-access),130774(eagle-users),130808(ewer),130817(naris),131016(esp-wps-inputs),131178(lex-access),131237(naermpcm),249837(aces),249945(hpcapps),249996(n-apps)
 context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023

el3:~> lfs getstripe -M 
/projects/naris/pcm_110819/NARIS_TechBreak2050_missingDPV/StageC_RT5min
error opening 
/projects/naris/pcm_110819/NARIS_TechBreak2050_missingDPV/StageC_RT5min: 
Permission denied (13)
…
el3:~> ls 
/projects/naris/pcm_110819/NARIS_TechBreak2050_missingDPV/StageC_RT5min
~Model ( c_RT5min_...

el3:~> lfs getstripe -M 
/projects/naris/pcm_110819/NARIS_TechBreak2050_missingDPV/StageC_RT5min
1

el3:~> id
uid=131364(***) gid=131364(***) 
groups=131364(***),130033(globus-access),130774(eagle-users),130808(ewer),130817(naris),131016(esp-wps-inputs),131178(lex-access),131237(naermpcm),249837(aces),249945(hpcapps),249996(n-apps)
 context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023

Server:
[root@mds02 ~]# l_getidentity -d 131364
uid=131364 gid=131364,130808,130817,131016,131237,249837,249945,249996
permissions:
  nid perm
(client does an ls)
[root@mds02 ~]# l_getidentity -d 131364
uid=131364 gid=131364,130808,130817,131016,131237,249837,249945,249996
permissions:
  nid perm

The relevant gid for the target directory is 130817. I verified that all 3 of 
our MDSs had the same view before and after the “ls”.

Thanks; Chris

From: Andreas Dilger <adil...@whamcloud.com>
Date: Sunday, June 28, 2020 at 5:11 PM
To: Christopher Chang <christopher.ch...@nrel.gov>
Cc: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>,
    "Kaiser, Timothy" <timothy.kai...@nrel.gov>
Subject: Re: [lustre-discuss] Permission denied on lfs getstripe

On Jun 26, 2020, at 10:45, Chang, Christopher <christopher.ch...@nrel.gov> wrote:

Hi,

   We’re running into an error with a particular directory. It is weird because 
it can be resolved in an unexpected way, but only for a time.
The error manifests as:

el3:out> lfs getstripe -M 
/projects/naris/pcm_110819/NARIS_TechBreak2050_missingDPV/StageC_RT5min
error opening 
/projects/naris/pcm_110819/NARIS_TechBreak2050_missingDPV/StageC_RT5min: 
Permission denied (13)
llapi_semantic_traverse: Failed to open 
'/projects/naris/pcm_110819/NARIS_TechBreak2050_missingDPV/StageC_RT5min': 
Permission denied (13)
error: getstripe failed for 
/projects/naris/pcm_110819/NARIS_TechBreak205

Re: [lustre-discuss] Permission denied on lfs getstripe

2020-06-28 Thread Andreas Dilger
On Jun 26, 2020, at 10:45, Chang, Christopher <christopher.ch...@nrel.gov> wrote:

Hi,

   We’re running into an error with a particular directory. It is weird because 
it can be resolved in an unexpected way, but only for a time.
The error manifests as:

el3:out> lfs getstripe -M 
/projects/naris/pcm_110819/NARIS_TechBreak2050_missingDPV/StageC_RT5min
error opening 
/projects/naris/pcm_110819/NARIS_TechBreak2050_missingDPV/StageC_RT5min: 
Permission denied (13)
llapi_semantic_traverse: Failed to open 
'/projects/naris/pcm_110819/NARIS_TechBreak2050_missingDPV/StageC_RT5min': 
Permission denied (13)
error: getstripe failed for 
/projects/naris/pcm_110819/NARIS_TechBreak2050_missingDPV/StageC_RT5min.

The temporary resolution is:
el3:out> ls 
/projects/naris/pcm_110819/NARIS_TechBreak2050_missingDPV/StageC_RT5min
~Model ( c_RT5min_TechBreak2050_092P_OLd000_001 ) Log.txt  Model 
c_RT5min_TechBreak2050_092P_OLd000_033 Solution.h5   Model 
c_RT5min_TechBreak2050_092P_OLd000_062 Solution.h5
…

Then
el3:out> lfs getstripe -M 
/projects/naris/pcm_110819/NARIS_TechBreak2050_missingDPV/StageC_RT5min
1
el3:out>

It looks like the user might only have supplementary group access to this file? 
 You could check on the client by running "id" to list the primary user ID and 
supplementary groups, then "ls -ln" on the file to see what group it is owned 
by.

If that is the case, it would indicate that the MDS /etc/group (or other source 
of supplementary group information, like NIS or LDAP, via /etc/nsswitch.conf) 
is not up-to-date with what is on the clients, or you have 
mdt.*.identity_upcall=NONE on the MDS instead of =l_getidentity.  You can test 
what l_getidentity on the MDS thinks the supplementary groups are for a 
particular user by running "l_getidentity -d <uid>" to compare with what "id" 
returns on the client.

Cheers, Andreas


However, the getstripe command will only continue to work for about 10 minutes, 
then it goes back to the permission denied errors.
It only happens with a selection of files or directories, so we were thinking 
it might be connected to a particular OSS or MDT, but not sure what to look for.

I am not the Lustre admin, so please forgive incomplete information. If folks 
can request specific command output, preferably from user space, that would 
accelerate my ability to answer questions. If something needs to get run while 
logged into a particular Lustre component (MDT, OSS, etc.), please do not 
hesitate to assume that I don’t know that.

We’re running Lustre 2.10.7 provided by DDN on CentOS 7.4. All help 
appreciated, thanks!

Chris

--
Christopher H. Chang, Ph.D.
Computational Scientist
National Renewable Energy Laboratory
15013 Denver West Pkwy., MS ESIF301
Golden, CO 80401



Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud








Re: [lustre-discuss] mlx4 and mxl5 mix environment

2020-06-22 Thread Andreas Dilger
On Jun 22, 2020, at 2:13 AM, 肖正刚  wrote:
> We set up a cluster using the mlx4 and mlx5 drivers mixed, and all things go well.
> Later I find something in wiki 
> http://wiki.lustre.org/Infiniband_Configuration_Howto and 
> http://lists.onebuilding.org/pipermail/lustre-devel-lustre.org/2016-May/003842.html
>  which was last edited on 2016.
> So do i need to change lnet configuration described in this page ?

One of the benefits of being a wiki page is that you can also update it
yourself, after registering for an account.

> Or the problem has been resolved in new version (like 2.12.x) ?
> Anymore where can i find more details ?
> 
> Any suggestions would be appreciated.
> Thanks!

Cheers, Andreas









Re: [lustre-discuss] Client 2.12.5 on 4.18.0-193.6.3.el8_2.x86_64 does not load

2020-06-22 Thread Andreas Dilger
On Jun 22, 2020, at 6:02 AM, Torsten Harenberg 
 wrote:
> 
> Dear all,
> 
> due to the attacks on HPC centers, we were advised to update the kernels
> to the newest version available.
> 
> It seems that the Lustre 2.12.5 client does not load on the very recent
> CentOS 8 kernel anymore:
> 
> [root@arc6 yum.repos.d]# rpm -qa | grep lustre
> lustre-client-2.12.5-1.el8.x86_64
> lustre-client-dkms-2.12.5-1.el8.noarch
> 
> [root@arc6 yum.repos.d]# uname -a
> Linux arc6.pleiades.uni-wuppertal.de 4.18.0-193.6.3.el8_2.x86_64 #1 SMP
> Wed Jun 10 11:09:32 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
> 
> [root@arc6 yum.repos.d]# modprobe -a lustre
> modprobe: ERROR: could not insert 'lustre': Invalid argument
> 
> [root@arc6 yum.repos.d]# dmesg | tail
> [ 1157.583204] obdclass: disagrees about version of symbol
> cfs_crypto_hash_init
> [ 1157.585079] obdclass: Unknown symbol cfs_crypto_hash_init (err -22)

This doesn't seem like a problem with the RHEL8 kernel, but rather a
mismatch between your kernel modules (libcfs and obdclass).  Maybe
you didn't remove the old modules before trying to install the new ones?
The "lustre_rmmod" command should do this for you.

Cheers, Andreas

> [ 1157.586776] obdclass: disagrees about version of symbol
> cfs_crypto_hash_update_page
> [ 1157.588807] obdclass: Unknown symbol cfs_crypto_hash_update_page (err
> -22)
> [ 1456.283749] obdclass: disagrees about version of symbol
> cfs_crypto_hash_final
> [ 1456.287874] obdclass: Unknown symbol cfs_crypto_hash_final (err -22)
> [ 1456.290118] obdclass: disagrees about version of symbol
> cfs_crypto_hash_init
> [ 1456.292143] obdclass: Unknown symbol cfs_crypto_hash_init (err -22)
> [ 1456.294009] obdclass: disagrees about version of symbol
> cfs_crypto_hash_update_page
> [ 1456.296121] obdclass: Unknown symbol cfs_crypto_hash_update_page (err
> -22)
> 
> 
> Complete de- and re-install does not help.
> 
> Is that known? Any advice other than downgrading the kernel again?
> 
> Thanks and kind regards
> 
>  Torsten
> 
> 
> --
> Dr. Torsten Harenberg harenb...@physik.uni-wuppertal.de
> Bergische Universitaet
> Fakultät 4 - Physik   Tel.: +49 (0)202 439-3521
> Gaussstr. 20  Fax : +49 (0)202 439-2811
> 42097 Wuppertal
> 


Cheers, Andreas









Re: [lustre-discuss] Do old clients ever go away?

2020-06-17 Thread Andreas Dilger
On Jun 5, 2020, at 08:37, William D. Colburn <wcolb...@nrao.edu> wrote:

I was looking in /proc/fs/lustre/mgs/MGS/exports/, and I see ip
addresses in there that don't go anywhere anymore.  I'm pretty sure they
are gone so long that they predate the uptime of the mds.  Does a lost
client linger forever, or am I just wrong about when the machines went
offline in relation to the uptime of the MDS?

These files in the exports directory are just old stats, which are kept after 
client disconnect because it is otherwise difficult to track if clients are 
only connecting for a short time.  This doesn't mean that these clients are 
actively connected or part of the filesystem.

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud








Re: [lustre-discuss] compiling Lustre from source

2020-06-17 Thread Andreas Dilger
There are several pages for this:
https://wiki.whamcloud.com/display/PUB/Building+Lustre+from+Source

If you are building only the client, then this is easier than also building 
(and optionally patching) the kernel.
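
For a client-only build against MOFED, the configure step looks roughly like
this (the MOFED source path shown is a typical install location; adjust it to
wherever MOFED 4.7 placed its kernel sources on your system):

    sh ./autogen.sh
    ./configure --disable-server --with-o2ib=/usr/src/ofa_kernel/default
    make rpms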

On Jun 8, 2020, at 13:39, Peeples, Heath <hea...@hpc.msstate.edu> wrote:

I am needing to compile the Lustre 2.12.4 source with MOFED 4.7.  Is there any 
step by step documentation for doing this?  Thanks for the help.

Heath

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud








Re: [lustre-discuss] how to map RPC rate to bandwidth/IOPS?

2020-06-09 Thread Andreas Dilger
On Jun 2, 2020, at 02:30, 肖正刚 <guru.nov...@gmail.com> wrote:

Hi all,
we use TBF policy(details: 
https://jira.whamcloud.com/secure/attachment/14201/Lustre%20NRS%20TBF%20documentation%200.1.pdf)
 to limit rpcrate coming from clients; but I do not know how to mapping of 
rpcrate to bandwidth or iops.
For example:
if I set a client's rpcrate=10,how much bandwith or iops the client can get  in 
theory?

Currently, the TBF policies only deal with RPCs.  For most systems today, you 
are probably using 4MB RPC size (osc.*.max_pages_per_rpc=1024), so if you set 
rpcrate=10 the clients will be able to get at most 40MB/s (assuming 
applications do relatively linear IO).  If applications have small random IOPS 
then rpcrate=10 may get up to 256 4KB writes per RPC, or about 2560 IOPS = 
10MB/s.
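
A quick way to sanity-check those numbers on a client (the parameter name is
the one used above; the arithmetic just restates the illustration from this
reply):

    lctl get_param osc.*.max_pages_per_rpc   # e.g. 1024 pages = 4MB per RPC

    # streaming bound:  10 RPC/s * 4MB/RPC          = 40MB/s
    # small-IO bound:   10 RPC/s * 256 * 4KB writes = 2560 IOPS ~= 10MB/s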

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud








Re: [lustre-discuss] patchless server vs. patched server

2020-06-05 Thread Andreas Dilger
Pascal,
Thanks for taking the time to update the wiki, every contribution helps.

Note that the need for patched kernels for project quotas will also go
away with newer kernels, but Red Hat just couldn't make that feature work
with the RHEL7 kernel without breaking the ABI.

Cheers, Andreas

On Jun 2, 2020, at 06:51, Pascal Suter  wrote:



Hi George

that used to be the case before 2.10.1, but since 2.10.1 even ldiskfs 
does not require a patch anymore. I have actually updated from a patched 2.10.3 
to a patchless 2.12.4, and I am using ldiskfs for my MDTs and ZFS for the OSTs.

but I think I just found out why both versions are still being packaged: 
while I was looking for a link to quote regarding ldiskfs now working without a 
patch, I actually found the announcement of 2.10.1 at 
http://lustre.org/lustre-2-10-1-released/ which states

   "Patchless server build for ldiskfs is now routinely provided. Note that the 
patched kernel version must still be used to make use of project quotas"

And here is the document that my question was based upon:

http://wiki.lustre.org/Installing_the_Lustre_Software

it states:

"Note: With the release of Lustre version 2.10.1, it is possible to use 
patchless kernels for Lustre servers running LDISKFS. The patchless LDISKFS 
server distribution does not include a Linux kernel. Instead, patchless servers 
will use the kernel distributed with the operating system."

and here is a LUDOC issue regarding documenting this in the official lustre 
documentation:

https://jira.whamcloud.com/browse/LUDOC-435

(amazing what you can find once you know what to look for ;))

I have applied for a lustre.org wiki account to add this missing piece of 
information, which should help people choose whether they want to use the 
patched or patchless kernel. Luckily I'm not using the project quota feature ;)

cheers

Pascal



On 6/2/20 1:50 PM, George Melikov wrote:
IIRC "patchless server" can only serve ZFS based backends.
So, if you really need ldiskfs, you're stuck with a patched kernel for now.

27.05.2020, 18:41, "Pascal Suter" 
:

Hi all

i am currently upgrading a lustre 2.10.3 to 2.12.4 on CentOS 7.7 and I
am unsure if I should use the patchless or patched server version. what
is the advantage of still using the patched server version over using
the patchless variant? From an linux sysadmin point of view I prefer to
use an unpatched kernel and it would seem unnecessary to still maintain
a patched variant if they both worked the same in the end.

regards

Pascal





Sincerely,
George Melikov



[lustre-discuss] Fwd: IO500 ISC20 Call for Submission

2020-05-22 Thread Andreas Dilger
> From: committee--- via IO-500 
> Subject: [IO-500] IO500 ISC20 Call for Submission
> Date: May 22, 2020 at 1:48:26 PM MDT
> To: io-...@vi4io.org
> Reply-To: commit...@io500.org
> 
> Deadline: 08 June 2020 AoE
> 
> The IO500  is now accepting and encouraging submissions 
> for the upcoming 6th IO500 list. Once again, we are also accepting 
> submissions to the 10 Node Challenge to encourage the submission of small 
> scale results. The new ranked lists will be announced via live-stream at a 
> virtual session. We hope to see many new results.
> 
> The benchmark suite is designed to be easy to run and the community has 
> multiple active support channels to help with any questions. Please note that 
> submissions of all sizes are welcome; the site has customizable sorting so it 
> is possible to submit on a small system and still get a very good per-client 
> score for example. Additionally, the list is about much more than just the 
> raw rank; all submissions help the community by collecting and publishing a 
> wider corpus of data. More details below.
> 
> Following the success of the Top500 in collecting and analyzing historical 
> trends in supercomputer technology and evolution, the IO500 
>  was created in 2017, published its first list at SC17, 
> and has grown exponentially since then. The need for such an initiative has 
> long been known within High-Performance Computing; however, defining 
> appropriate benchmarks had long been challenging. Despite this challenge, the 
> community, after long and spirited discussion, finally reached consensus on a 
> suite of benchmarks and a metric for resolving the scores into a single 
> ranking.
> 
> The multi-fold goals of the benchmark suite are as follows:
> 
> Maximizing simplicity in running the benchmark suite
> Encouraging optimization and documentation of tuning parameters for 
> performance
> Allowing submitters to highlight their “hero run” performance numbers
> Forcing submitters to simultaneously report performance for challenging IO 
> patterns.
> Specifically, the benchmark suite includes a hero-run of both IOR and mdtest 
> configured however possible to maximize performance and establish an 
> upper-bound for performance. It also includes an IOR and mdtest run with 
> highly constrained parameters forcing a difficult usage pattern in an attempt 
> to determine a lower-bound. Finally, it includes a namespace search as this 
> has been determined to be a highly sought-after feature in HPC storage 
> systems that has historically not been well-measured. Submitters are 
> encouraged to share their tuning insights for publication.
> 
> The goals of the community are also multi-fold:
> 
> Gather historical data for the sake of analysis and to aid predictions of 
> storage futures
> Collect tuning data to share valuable performance optimizations across the 
> community
> Encourage vendors and designers to optimize for workloads beyond “hero runs”
> Establish bounded expectations for users, procurers, and administrators
> 10 Node I/O Challenge
> 
> The 10 Node Challenge is conducted using the regular IO500 benchmark, 
> however, with the rule that exactly 10 client nodes must be used to run the 
> benchmark. You may use any shared storage with, e.g., any number of servers. 
> When submitting for the IO500 list, you can opt-in for “Participate in the 10 
> compute node challenge only”, then we will not include the results into the 
> ranked list. Other 10-node node submissions will be included in the full list 
> and in the ranked list. We will announce the result in a separate derived 
> list and in the full list but not on the ranked IO500 list at 
> https://io500.org/ .
> 
> This information and rules for ISC20 submissions are available here: 
> https://www.vi4io.org/io500/rules/submission 
> 
> Thanks,
> 
> The IO500 Committee
> 


Cheers, Andreas









Re: [lustre-discuss] NFS Client Attributes caching - equivalent feature/config in Lustre

2020-05-20 Thread Andreas Dilger
I just found this old email in my spam folder...

On Apr 21, 2020, at 14:54, Pinkesh Valdria <pinkesh.vald...@oracle.com> wrote:

Does Lustre have mount options to mimic the NFS mount option behavior listed
below?

I know that in most cases Lustre would perform much better than NFS and can scale 
to support a lot of clients in parallel.  I have a use case where there are 
only a few clients accessing the filesystem, the files are really small but 
number in the millions, and the files are very infrequently updated.  The files 
are stored on an NFS server and mounted on the clients with the mount options 
below, which results in caching of file attributes/metadata on the client and 
thus reduces the number of metadata calls and delivers better performance.

NFS mount options
type nfs (rw,nolock,nocto,actimeo=900,nfsvers=3,proto=tcp)

Lustre will already cache file attributes and data on the client, since it is 
totally coherent, and doesn't depend on random timeouts like NFS to decide 
whether the client should cache the data or not.

A custom proprietary application which compiles (via the make command) some of 
these files takes 20-24 seconds to run.  The same command, when run on the same 
files stored in a BeeGFS parallel filesystem, takes 80-90 seconds (4x slower), 
mainly because there is no client caching in BeeGFS and the client has to make a 
lot more metadata calls than NFS with cached file attributes.

Question
I already tried BeeGFS, and I am asking this question to determine whether Lustre 
performance would be better than NFS for very small file workloads (50 byte, 
200 byte, 2KB files) with 5 million files spread across nested directories.  
Does Lustre have mount options to mimic the NFS mount option behavior listed 
below? Or is there some optional feature in Lustre to achieve this caching 
behavior?

Yes, Lustre will already/always have the desired caching behavior by default, 
no settings needed.  Some tuning might be needed if the working set is so large 
(10,000s of files) that the locks protecting the data are cancelled because of 
the sheer volume of data or because the files are unused for a long time (i.e. 
over 1h).
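
As a rough sketch of what such tuning could look like (these are standard client-side lock LRU tunables, but the values here are only illustrative assumptions, not recommendations):

lctl get_param ldlm.namespaces.*.lru_size        # current per-target lock LRU limit
lctl get_param ldlm.namespaces.*.lru_max_age     # how long idle locks are cached
lctl set_param ldlm.namespaces.*.lru_size=0      # 0 = size the LRU dynamically
lctl set_param ldlm.namespaces.*.lru_max_age=7200000   # assumed to be in milliseconds on recent releases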

Since Lustre can be downloaded for free, you could always give your application 
workload a test, to see what the performance is.

For very small files, you might want to consider to use Data-on-MDT (DoM) by 
running "lfs setstripe -E 64k -L mdt -E64M -c 1 -Eeof -c -1 $dir" on the test 
directory (or on the root directory of the filesystem) to have it store these 
tiny files directly on the MDT.  You would in that case need enough free space 
on the MDT to hold all of the files.
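
For illustration only (the mount point and directory names below are placeholders), the inherited layout and the remaining MDT space can be checked after setting such a default:

lfs getstripe -d /mnt/lustre/testdir      # show the default layout new files will inherit
lfs df -h /mnt/lustre | grep MDT          # confirm the MDT has capacity for the DoM components
lfs df -i /mnt/lustre | grep MDT          # and enough free inodes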

Cheers, Andreas




https://linux.die.net/man/5/nfs

ac / noac
Selects whether the client may cache file attributes. If neither option is 
specified (or if ac is specified), the client caches file attributes.

For my custom applications, caching of file attributes is fine (no negative impact) and it helps to improve the performance of NFS.


actimeo=n

Using actimeo sets all of acregmin, acregmax, acdirmin, and acdirmax to the 
same value. If this option is not specified, the NFS client uses the defaults 
for each of these options listed above.

For my applications, it's okay to cache file attributes/metadata for a few minutes (e.g. 5 mins) by setting this value. It reduces the number of metadata calls made to the server; especially with filesystems storing lots of small files, that is a huge performance penalty which can be avoided.

nolock

When mounting servers that do not support the NLM protocol, or when mounting an 
NFS server through a firewall that blocks the NLM service port, specify the 
nolock mount option. Specifying the nolock option may also be advised to improve 
the performance of a proprietary application which runs on a single client and 
uses file locks extensively.


Appreciate any guidance.

Thanks,
pinkesh valdria
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] confused about mdt space

2020-04-02 Thread Andreas Dilger
On Apr 1, 2020, at 08:59, Mohr Jr, Richard Frank 
mailto:rm...@utk.edu>> wrote:
On Apr 1, 2020, at 10:07 AM, Mohr Jr, Richard Frank 
mailto:rm...@utk.edu>> wrote:

On Apr 1, 2020, at 3:55 AM, 肖正刚 
mailto:guru.nov...@gmail.com>> wrote:

For  " the recent lustre versions use a 1KB inode size by default and the 
default format options create 1 inodes for every 2.5 KB of MDT space" :
I checked the inode size is 1KB and  in my online systems,  as you said , about 
40~41% of mdt disk space consumed by inodes.
but from the manual I found the default "inode ratio" is 2K, so where the 
additional 0.5KB comes from ?


I was basing this on info I found in this email thread:

http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2018-February/015302.html

I think the 2K/inode ratio listed in the manual may be incorrect (or perhaps 
someone can point out my mistake).

Looking at the lustre source (utils/libmount_utils_ldiskfs.c), the default 
behavior is determined by this line:

bytes_per_inode = inode_size + 1536

So the default bytes per inode is 1K (inode size) + 1.5K = 2.5K

Right.  In 2.10 the ldiskfs inode size was increased from 512 bytes to 1024 
bytes in order to fit the more complex PFL layouts.  That increased the total 
amount of space per inode from 2048 bytes to 2560 bytes.  I guess the manual 
needs to be updated.
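
A quick way to confirm these numbers on an existing ldiskfs MDT (the device name is a placeholder) is to read them back from the superblock and divide:

dumpe2fs -h /dev/mdt0 | egrep -i 'inode size|inode count|block count'
# bytes-per-inode ratio ≈ (Block count × 4096) / Inode count, assuming 4 KiB blocks
# with a 1024-byte inode size, the default works out to 1024 + 1536 = 2560 bytes per inode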

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] OST recovery

2020-03-31 Thread Andreas Dilger

On Mar 29, 2020, at 20:04, Gong-Do Hwang 
mailto:grover.hw...@gmail.com>> wrote:

Thanks Andreas,

I ran  "mkfs.lustre --ost --reformat --fsname lfs_home --index 6 --mgsnode 
10.10.0.10@o2ib --servicenode 10.10.0.13@o2ib  --failnode 10.10.0.14@o2ib  
/dev/mapper/mpathx", and at that time /dev/mapper/mpathx was mounted and served 
as an OST under FS lfs. And the FS lfs ran well until I unmounted /dev/mapper/mpathx in order to restart the mgt/mgs.

The issue here is that the "--reformat" option will override the checks if a 
filesystem already exists on the device.  That should not normally be used.


 And after I re-mounted the ost I got the msg "mount.lustre FATAL: failed to 
write local files: Invalid argument
mount.lustre: mount /dev/mapper/mpathx at /lfs/ost8 failed: Invalid argument"  
and the tunefs.lustre --dryrun /dev/mapper/mpathx output is "tunefs.lustre 
--dryrun /dev/mapper/mpathx
checking for existing Lustre data: found
Reading CONFIGS/mountdata

   Read previous values:
Target: lfs-OST0008

This shows that the device was previously part of the "lfs" filesystem at index 
8.  While it is possible to change the filesystem name, the OST index should 
never change, so there is no tool for this.

Two things need to be done.  You can rewrite the filesystem label with "e2label 
/dev/mapper/mpathx lfs-OST0008".  Then you need to rebuild the 
"CONFIGS/mountdata" file.

The easiest way to generate a new mountdata file would be to run "mkfs.lustre" 
with the same options as the original OST on a temporary device (e.g. loopback 
device) but add in the "--replace" option so that the OST doesn't try to add 
itself to the filesystem as a new OST.  Then mount the temporary and original 
OSTs as type ldiskfs and copy the file CONFIGS/mountdata from temp to original 
OST to replace the broken one (probably a good idea to make a backup first).
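
A sketch of that procedure (device names, NIDs and the loopback file are placeholders; the fsname, index and other options must match the original OST, here lfs-OST0008):

dd if=/dev/zero of=/tmp/tmp_ost.img bs=1M count=2048
losetup /dev/loop0 /tmp/tmp_ost.img
mkfs.lustre --ost --fsname lfs --index 8 --replace --mgsnode 10.10.0.10@o2ib /dev/loop0
mkdir -p /mnt/tmp_ost /mnt/orig_ost
mount -t ldiskfs /dev/loop0 /mnt/tmp_ost
mount -t ldiskfs /dev/mapper/mpathx /mnt/orig_ost
cp /mnt/orig_ost/CONFIGS/mountdata /root/mountdata.broken.bak    # keep a backup of the broken file
cp /mnt/tmp_ost/CONFIGS/mountdata /mnt/orig_ost/CONFIGS/mountdata
umount /mnt/tmp_ost /mnt/orig_ost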

Hopefully with these two changes you can mount your OST again.

Cheers, Andreas

Index:  6
Lustre FS:  lfs_home
Mount type: ldiskfs
Flags:  0x1042
  (OST update no_primnode )
Persistent mount opts: ,errors=remount-ro
Parameters: mgsnode=10.10.0.10@o2ib  
failover.node=10.10.0.13@o2ib:10.10.0.14@o2ib


   Permanent disk data:
Target: lfs_home-OST0006
Index:  6
Lustre FS:  lfs_home
Mount type: ldiskfs
Flags:  0x1042
  (OST update no_primnode )
Persistent mount opts: ,errors=remount-ro
Parameters: mgsnode=10.10.0.10@o2ib  
failover.node=10.10.0.13@o2ib:10.10.0.14@o2ib"

I guess I actually ran the mkfs command twice, so the Lustre FS name in the previous values became lfs_home (originally it was lfs).

I tried to mount the partition use backup superblocks and all of them are 
empty. But from the dumpe2fs info,
"Inode count:  41943040
Block count:  10737418240
Reserved block count: 536870912
Free blocks:  1459812475
Free inodes:  39708575"
seems there is still data on it.

The backup superblocks are for the underlying ext4/ldiskfs filesystem, so are 
not really related to this problem.


So my problem is, if the data on the partition is still intact, is there any 
way I can rebuild the file index? And is there any way I can rewrite the 
CONFIGS/mountdata back to its original values?
Sorry for the lengthy messages and really appreciate your help!

Best Regards,

Grover

On Mon, Mar 30, 2020 at 7:14 AM Andreas Dilger 
mailto:adil...@whamcloud.com>> wrote:
It would be useful if you provided the actual error messages, so we can see 
where the problem is.

What command did you run on the OST?

Does the OST still show that it has data in it (e.g. "df" or "dumpe2fs -h" 
shows lots of used blocks)?

On Mar 25, 2020, at 10:05, Gong-Do Hwang 
mailto:grover.hw...@gmail.com>> wrote:

Dear Lustre,

Months ago, when I tried to add a new disk to my new Lustre FS, I accidentally targeted mkfs.lustre at a then-mounted OST partition of another Lustre FS. Weirdly enough the command went through, and without paying attention to it, I unmounted the partition months later and couldn't mount it back; then I realized the mkfs.lustre command had actually taken effect.

But my old lustre FS worked well through these months, so I guess the data in 
that OST is still there. But now the permanent CONFIG/mountdata is the new one, 
and I can still see my old config in the previous value.

My question is is there any way I can write back the old CONFIG/mountdata and 
still keep all my files in that OST?

I am using Lustre 2.13.0 for my mgs/mdt/ost

Thanks for your help and I really appreciate it!

Grover

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] CFP: First International CHAOSS International Workshop

2020-03-31 Thread Andreas Dilger
# First International Workshop on Challenges and Opportunities of HPC Storage 
Systems (CHAOSS)

The workshop is aimed at researchers, developers of scientific applications, 
engineers and everyone interested in the evolution of HPC storage systems. As 
the developments of computing power, storage and network technologies continue 
to diverge, the performance gap between them widens. This trend, combined with 
the growing data volumes, results in I/O and storage bottlenecks that become 
increasingly serious especially for large-scale HPC storage systems. The 
hierarchy of different storage technologies to ease this situation leads to a 
complex environment which will become even more challenging for future exascale 
systems.

This workshop is a venue for papers exploring topics related to data 
organization and management along with the impacts of multi-tier memory and 
storage for optimizing application throughput. It will take place at the 
Euro-Par 2020 conference in Warsaw, Poland on either August 24 or 25, 2020. 
More information is available at: 
https://wr.informatik.uni-hamburg.de/events/2020/chaoss

## Important Dates

Paper Submission:May 8, 2020
Notification to Authors: June 30, 2020
Registration:July 10, 2020
Camera-Ready Deadline
(Informal Proceedings):  July 10, 2020
Workshop Dates:  August 24 or 25, 2020
Camera-Ready Deadline:   September 11, 2020

## Submission Guidelines

Papers should not exceed 12 pages (including title, text, figures, appendices 
and references). Papers of less than 10 pages will be considered as short 
papers that can be presented at the conference but will not be published in the 
proceedings.

Papers must be formatted according to Springer's LNCS guidelines available at 
https://www.springer.com/gp/computer-science/lncs/conference-proceedings-guidelines.
 Accepted papers will be published in a separate LNCS workshop volume after the 
conference. One author of each accepted paper is required to register for the 
workshop and present the paper.

Submissions will be submitted and managed via EasyChair at: 
https://easychair.org/conferences/?conf=europar2020workshop

## Topics of Interest

Submissions may be more hands-on than research papers and we therefore 
explicitly encourage submissions in the early stages of research. Topics of 
interest include, but are not limited to:

- Kernel and user space file/storage systems
- Parallel and distributed file/storage systems
- Data management approaches for heterogeneous storage systems
- Management of self-describing data formats
- Metadata management
- Approaches using query and database interfaces
- Hybrid solutions using file systems and databases
- Optimized indexing techniques
- Data organizations to support online workflows
- Domain-specific data management solutions
- Related experiences from users: what worked, what didn't?

## Program Committee

- Gabriel Antoniu (INRIA)
- Konstantinos Chasapis (DDN)
- Andreas Dilger (Whamcloud/DDN)
- Kira Duwe (UHH)
- Wolfgang Frings (JSC)
- Elsa Gonsiororowski (LLNL)
- Anthony Kougkas (IIT)
- Michael Kuhn (UHH)
- Margaret Lawson (UIUC SNL)
- Jay Lofstead (SNL)
- Johann Lombardi (Intel)
- Jakob Lüttgau (DKRZ)
- Anna Queralt (BSC)
- Yue Zhu (FSU)

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] OST recovery

2020-03-29 Thread Andreas Dilger
It would be useful if you provided the actual error messages, so we can see 
where the problem is.

What command did you run on the OST?

Does the OST still show that it has data in it (e.g. "df" or "dumpe2fs -h" 
shows lots of used blocks)?

On Mar 25, 2020, at 10:05, Gong-Do Hwang 
mailto:grover.hw...@gmail.com>> wrote:

Dear Lustre,

Months ago, when I tried to add a new disk to my new Lustre FS, I accidentally targeted mkfs.lustre at a then-mounted OST partition of another Lustre FS. Weirdly enough the command went through, and without paying attention to it, I unmounted the partition months later and couldn't mount it back; then I realized the mkfs.lustre command had actually taken effect.

But my old lustre FS worked well through these months, so I guess the data in 
that OST is still there. But now the permanent CONFIG/mountdata is the new one, 
and I can still see my old config in the previous value.

My question is is there any way I can write back the old CONFIG/mountdata and 
still keep all my files in that OST?

I am using Lustre 2.13.0 for my mgs/mdt/ost

Thanks for your help and I really appreciate it!

Grover
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] DNE2 settings are not propagated?

2020-03-20 Thread Andreas Dilger
On Mar 20, 2020, at 13:02, Mannthey, Keith 
mailto:keith.mannt...@intel.com>> wrote:

I am doing some work with Lustre 2.13 (Downloaded from Whamcloud) and DNE 2 
there are 6 MDTs.


"
[root@colts52-clx01 for-test]# lfs getdirstripe TEST
lmv_stripe_count: 6 lmv_stripe_offset: 1 lmv_hash_type: fnv_1a_64
mdtidx   FID[seq:oid:ver]
1   [0x2c400:0x5:0x0]
5   [0x30403:0x5:0x0]
3   [0x34403:0x5:0x0]
0   [0x20403:0x5:0x0]
4   [0x24402:0x5:0x0]
2   [0x28401:0x5:0x0]
[root@colts52-clx01 for-test]# mkdir TEST/now
[root@colts52-clx01 for-test]# lfs getdirstripe TEST/now/
lmv_stripe_count: 0 lmv_stripe_offset: 2 lmv_hash_type: none
"

Directory striping settings are not being propagated.  Is this expected?

Yes, this is working as designed.  It isn't necessarily good for all directories to be striped, as that adds overhead without necessarily improving performance.

The current recommendation is that DNE should be used for e.g. a top-level 
directory to distribute files and subdirectories across MDTs, or in the case of 
large directories with millions of files.
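
As an illustration (directory names are placeholders), striping can still be requested explicitly where it is wanted, either for one directory or as an inherited default:

lfs mkdir -c 6 /mnt/lustre/TEST/bigdir       # create a single directory striped over 6 MDTs
lfs setdirstripe -D -c 6 /mnt/lustre/TEST    # assumed: -D sets a default layout for new subdirectories
lfs getdirstripe /mnt/lustre/TEST/bigdir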

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] "no space on device"

2020-03-19 Thread Andreas Dilger
On Mar 19, 2020, at 12:56, Lana Deere 
mailto:lana.de...@gmail.com>> wrote:

The MDT shows 6% of storage in use and 9% of inodes in use.  The OST, however, 
shows 46% of storage and 100% of inodes in use (12 free).  (There is only one 
OST on this particular file system.)   I suppose if a lot of files are deleted, 
the system will recover, but then I'm not sure why I couldn't create any files 
after deleting a few.

Is there any way to increase the number of inodes on the OST without losing the 
data currently on the filesystem?

That depends on the storage that the OST is on.  You can use "resize2fs" on the 
OST if the underlying storage is LVM or similar that can be resized.  Increasing 
the OST size adds inodes proportional to the added capacity.
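
A sketch of that path, assuming the OST sits on an LVM logical volume (all device names and sizes are placeholders), with the target unmounted during the resize:

umount /mnt/ost0
lvextend -L +2T /dev/vg_ost/ost0
resize2fs /dev/vg_ost/ost0
mount -t lustre /dev/vg_ost/ost0 /mnt/ost0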

The more common option for adding capacity to Lustre is to add another OST.  
Based on your earlier comments, you should probably double the inodes per unit 
space (reduce the bytes per inode ratio like "mkfs.lustre --mkfsoptions='-i 
131072' ..." or similar) compared to the first OST.  You can work out the 
average bytes per inode on the OST based on the (used OST capacity / used 
inodes).
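
For example (the numbers are made up for illustration), the current average can be read straight from the existing OST:

lfs df /mnt/lustre      # used 1K-blocks per OST
lfs df -i /mnt/lustre   # used inodes per OST
# average bytes per inode ≈ (used 1K-blocks × 1024) / used inodes
# e.g. 4,000,000,000 KB used / 30,000,000 inodes ≈ 136,000 bytes per inode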

Cheers, Andreas


.. Lana (lana.de...@gmail.com<mailto:lana.de...@gmail.com>)




On Thu, Mar 19, 2020 at 2:13 PM Degremont, Aurelien 
mailto:degre...@amazon.com>> wrote:
Hi Lana,

Lustre dispatches the data across several servers, MDTs and OSTs. It is likely 
that one of this OST is full.
To see the usage per sub-component, you should check:

lfs df -h
lfs df -ih

See if this reports one OSTs or MDT is full.

Aurélien

De : lustre-discuss 
mailto:lustre-discuss-boun...@lists.lustre.org>>
 au nom de Lana Deere mailto:lana.de...@gmail.com>>
Date : jeudi 19 mars 2020 à 19:08
À : "lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>" 
mailto:lustre-discuss@lists.lustre.org>>
Objet : [EXTERNAL] [lustre-discuss] "no space on device"
I have a Lustre 2.12 setup running on CentOS 7.  It has been working fine for 
some months but last night one of my users tried to untar a large file, which 
(among other things) created a single directory containing several million 
subdirectories.  At that point the untar failed, reporting "no space on 
device".  All attempts to create a file on this Lustre system now produce the 
same error message, but "df" and "df -i" indicate there is plenty of space and 
inodes left.  I checked the mount point on the metadata node and it appears to 
have plenty of space left too.

I can list directories and view files on this filesystem.  I can delete files 
or directories on it.  But even after removing a few files and a directory I 
cannot create a new file.

If anyone can offer some help here it would be appreciated.

.. Lana (lana.de...@gmail.com<mailto:lana.de...@gmail.com>)

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] old Lustre 2.8.0 panic'ing continously

2020-03-13 Thread Andreas Dilger
One thing to check if you are not seeing any benefit from running e2fsck, is to
make sure you are using the latest e2fsprogs-1.45.2.wc1.

You could also try upgrading the server to Lustre 2.10.8.

Based on the kernel version, it looks like RHEL 6.7, which should still work with 2.10 (the previous LTS branch), which has a lot more fixes than 2.8.0.

Cheers, Andreas

On Mar 5, 2020, at 00:48, Torsten Harenberg 
mailto:harenb...@physik.uni-wuppertal.de>> 
wrote:

Dear all,

I know it's dared to ask for help for such an old system.

We still run a CentOS 6 based Lustre 2.8.0 system
(kernel-2.6.32-573.12.1.el6_lustre.x86_64,
lustre-2.8.0-2.6.32_573.12.1.el6_lustre.x86_64.x86_64).

It's out of warranty and about to be replaced. The approval process for
the new grant took over a year and we're currently preparing an EU wide
tender, all of that takes and took much more time than we expected.

The problem is:

one OSS server is always running into a kernel panic. It seems that this
kernel panic is related to one of the OSS mount points. If we mount the
LUNs of that server (all data is on a 3par SAN) to a different server,
this one is panic'ing, too.

We always run file system checks after such a panic but these show only
minor issues that you would expect after a crashed machine like

[QUOTA WARNING] Usage inconsistent for ID 2901:actual (757747712, 217)
!= expected (664182784, 215)

We would love to avoid an upgrade to CentOS 7 with these old machines,
but these crashes happen really often meanwhile and yesterday it
panic'ed after only 30mins.

Now we're running out of ideas.

If anyone has an idea how we could identify the source of the problem,
we would really appreciate it.

Kind regards

 Torsten


--
Dr. Torsten Harenberg 
harenb...@physik.uni-wuppertal.de<mailto:harenb...@physik.uni-wuppertal.de>
Bergische Universitaet
Fakultät 4 - Physik   Tel.: +49 (0)202 439-3521
Gaussstr. 20  Fax : +49 (0)202 439-2811
42097 Wuppertal

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] old Lustre 2.8.0 panic'ing continously

2020-03-09 Thread Andreas Dilger


On Mar 5, 2020, at 09:11, Mohr Jr, Richard Frank 
mailto:rm...@utk.edu>> wrote:



On Mar 5, 2020, at 2:48 AM, Torsten Harenberg 
mailto:harenb...@physik.uni-wuppertal.de>> 
wrote:

[QUOTA WARNING] Usage inconsistent for ID 2901:actual (757747712, 217)
!= expected (664182784, 215)

I assume you are running ldiskfs as the backend?  If so, have you tried 
regenerating the quota info for the OST?  I believe the command is “tune2fs -O 
^quota ” to clear quotas and then “tune2fs -O quota 
” to reenable/regenerate them.  I don’t know if that would 
work, but it might be worth a shot.

Just to clarify, the "tune2fs -O ^quota; tune2fs -O quota" trick is not really 
the best way to do this, even though this is widely circulated.

It would be better to run a full e2fsck, since that not only rebuilds the quota 
tables, but also ensures that the values going into the quota tables are 
correct.  Since the time taken by "tune2fs -O quota" is almost the same as 
running e2fsck, it is better to do it the right way.
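
A minimal sketch of the recommended approach (the device name is a placeholder; the target must be unmounted first, or failed over):

umount /mnt/ost0
e2fsck -f /dev/mapper/ost0    # full pass: verifies block/inode accounting and rebuilds the quota tables
mount -t lustre /dev/mapper/ost0 /mnt/ost0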

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] project quota totals incorrect

2020-02-27 Thread Andreas Dilger
On Feb 27, 2020, at 11:59, Peeples, Heath 
mailto:hea...@hpc.msstate.edu>> wrote:

We have enabled project quotas, but the totals are way off from du as can be 
seen below:

$ du -sh /work/project
1.5T /work/project

$ lfs quota -h -p 17115 /work/project
Disk quotas for prj 17115 (pid 17115):
Filesystem used quota limit grace files quota limit grace
/work/project 1.848G 47.31T 49.8T - 11294 0 0 -

I did use the -s option when creating.

Is every file in /work/project part of projectid 17115, or are there multiple 
projects under that directory?
What does "lfs find /work/project | wc -l" report for the file count?  
According to the lfs quota output
it is only tracking 11294 files for projid 17115.  You could try "lfs find 
/work/project ! -projid 17115" to
see if any of the files are not accounted under the project.
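
As an illustrative follow-up (the path and project ID are taken from the example above; the exact flags are assumptions to verify against your lfs version), project membership and the inherit flag can also be checked and, if needed, reapplied recursively:

lfs project -c -r /work/project             # report files whose projid or inherit flag doesn't match
lfs project -p 17115 -r -s /work/project    # recursively set projid 17115 plus the inherit flag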

If this doesn't explain the discrepancy, you could verify that the "project" 
feature is on (dumpe2fs -h),
and/or try running e2fsck on all the OSTs/MDTs to ensure that the files are 
properly accounted.

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre as VM backend

2020-02-24 Thread Andreas Dilger
On Feb 19, 2020, at 07:17, Riccardo Veraldi 
mailto:riccardo.vera...@cnaf.infn.it>> wrote:

Hello,
I wanted to ask if anybody is using lustre as a FS backend for virtual 
machines. I am thinking to environments like Openstack, or oVirt
where VM are inside a single qcow2 file basically using libvirt to access the 
underlying filesystem where VMs are stored.
Anyone is using Lustre for this and if so any best practice for the specific 
utilization of Lustre in this environment (libvirt storage backend) ?

Riccardo,
I don't have any specific examples of this with containers, but AFAIK ANU/BOM 
in Australia presented about using Lustre for root filesystem images for their 
client nodes, which is similar.  I think you'd want to use fast flash-based 
storage for the VM images, since they will typically be updated with random 4KB 
IOPS from the guests.  Lustre 2.12 has much better flash performance than with 
2.10 (there have been a few presentations about this recently).

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre 2.12.3 client can't mount filesystem

2020-02-12 Thread Andreas Dilger
Can you please try 2.12.4, it was just released yesterday and has a number of 
fixes.

On Feb 12, 2020, at 13:36, Kevin M. Hildebrand 
mailto:ke...@umd.edu>> wrote:

I just updated some of my clients to RHEL 7.7, Lustre 2.12.3, MOFED 4.7.
Server version is 2.10.8.

I'm now getting errors mounting the filesystem on the client.  In fact, I can't 
even do an 'lctl ping' to any of the servers without getting an I/O error.

Debug logs show this message when I attempt an lctl ping:
0800:0002:0.0:1581538955.090767:0:20471:0:(o2iblnd.c:941:kiblnd_create_conn())
 Can't create QP: -12, send_wr: 32634, recv_wr: 254, send_sge: 2, recv_sge: 1

# lctl list_nids
10.11.80.65@o2ib3
# lctl ping 10.11.80.50@o2ib3
failed to ping 10.11.80.50@o2ib3: Input/output error

Interestingly, if I do an 'lctl ping' to the client _from_ the server, the ping 
succeeds, and from that point on pings from client _to_ server work fine until 
the client is rebooted or lnet is reloaded.

ko2iblnd parameters match on clients and servers, namely:
options ko2iblnd peer_credits=128 peer_credits_hiw=64 credits=1024 
concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 
fmr_flush_trigger=512 fmr_cache=1

Anyone have any thoughts?

Thanks,
Kevin

--
Kevin Hildebrand
University of Maryland
Division of IT
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] pcc?

2020-02-12 Thread Andreas Dilger
The vast majority of the feature is implemented on the client, but a few 
changes had to
be made on the server for correct operation.

On Jan 16, 2020, at 10:16, Michael Di Domenico 
mailto:mdidomeni...@gmail.com>> wrote:

does pcc that's coming out in lustre 2.13 require both the client and
the server to run 2.13 or can just a client running 2.13 utilize pcc?
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Slow release of inodes on OST

2020-02-08 Thread Andreas Dilger
I suspect that having a totally idle system may actually slow things down, 
since the destroys may be waiting for transaction commits. 

That said, it probably makes sense to increase the default values for these 
tunables a bit. Could you please file an LU ticket with the details. Ideally 
with some metrics for different values of the parameters so we don't overload 
the OSS nodes with RPCs needlessly. 

Cheers, Andreas

> On Feb 8, 2020, at 01:47, Åke Sandgren  wrote:
> 
> The filesystems are completely idle during this. It's a test setup where
> I'm running io500 and doing nothing else.
> 
> I set
> osp.rsos-OST-osc-MDT.max_rpcs_in_flight=512
> osp.rsos-OST-osc-MDT.max_rpcs_in_progress=32768
> which severely reduced my waiting time between runs.
> The in_progress being the one that actually affected things.
> 
>> On 2/8/20 4:50 AM, Andreas Dilger wrote:
>> I haven't looked at that code recently, but I suspect that it is waiting
>> for journal commits to complete
>> every 5s before sending another batch of destroys?  Is the filesystem
>> otherwise idle or something?
>> 
>> 
>>> On Feb 7, 2020, at 02:34, Åke Sandgren >> <mailto:ake.sandg...@hpc2n.umu.se>> wrote:
>>> 
>>> Looking at the osp.*.sync* values I see
>>> osp.rsos-OST-osc-MDT.sync_changes=14174002
>>> osp.rsos-OST-osc-MDT.sync_in_flight=0
>>> osp.rsos-OST-osc-MDT.sync_in_progress=4096
>>> osp.rsos-OST-osc-MDT.destroys_in_flight=14178098
>>> 
>>> And it takes 10 sec between changes of those values.
>>> 
>>> So is there any other tunable I can tweak on either OSS or MDS side?
>>> 
>>> On 2/6/20 6:58 AM, Andreas Dilger wrote:
>>>> On Feb 4, 2020, at 07:23, Åke Sandgren >>> <mailto:ake.sandg...@hpc2n.umu.se>
>>>> <mailto:ake.sandg...@hpc2n.umu.se>> wrote:
>>>>> 
>>>>> When I create a large number of files on an OST and then remove them,
>>>>> the used inode count on the OST decreases very slowly, it takes several
>>>>> hours for it to go from 3M to the correct ~10k.
>>>>> 
>>>>> (I'm running the io500 test suite)
>>>>> 
>>>>> Is there something I can do to make it release them faster?
>>>>> Right now it has gone from 3M to 1.5M in 6 hours, (lfs df -i).
>>>> 
>>>> It this the object count or the file count?  Are you possibly using a
>>>> lot of
>>>> stripes on the files being deleted that is multiplying the work needed?
>>>> 
>>>>> These are SSD based OST's in case it matters.
>>>> 
>>>> The MDS controls the destroy of the OST objects, so there is a rate
>>>> limit, but ~700/s seems low to me, especially for SSD OSTs.
>>>> 
>>>> You could check "lctl get_param osp.*.sync*" on the MDS to see how
>>>> many destroys are pending.  Also, increasing osp.*.max_rpcs_in_flight
>>>> on the MDS might speed this up?  It should default to 32 per OST on
>>>> the MDS vs. default 8 for clients
>>>> 
>>>> Cheers, Andreas
>>>> --
>>>> Andreas Dilger
>>>> Principal Lustre Architect
>>>> Whamcloud
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> -- 
>>> Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
>>> Internet: a...@hpc2n.umu.se <mailto:a...@hpc2n.umu.se>   Phone: +46 90
>>> 7866134 Fax: +46 90-580 14
>>> Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
>>> <http://www.hpc2n.umu.se/>
>>> ___
>>> lustre-discuss mailing list
>>> lustre-discuss@lists.lustre.org <mailto:lustre-discuss@lists.lustre.org>
>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>> 
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Principal Lustre Architect
>> Whamcloud
>> 
>> 
>> 
>> 
>> 
>> 
> 
> -- 
> Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
> Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
> Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Slow release of inodes on OST

2020-02-07 Thread Andreas Dilger
I haven't looked at that code recently, but I suspect that it is waiting for 
journal commits to complete
every 5s before sending another batch of destroys?  Is the filesystem otherwise 
idle or something?


On Feb 7, 2020, at 02:34, Åke Sandgren 
mailto:ake.sandg...@hpc2n.umu.se>> wrote:

Looking at the osp.*.sync* values I see
osp.rsos-OST-osc-MDT.sync_changes=14174002
osp.rsos-OST-osc-MDT.sync_in_flight=0
osp.rsos-OST-osc-MDT.sync_in_progress=4096
osp.rsos-OST-osc-MDT.destroys_in_flight=14178098

And it takes 10 sec between changes of those values.

So is there any other tunable I can tweak on either OSS or MDS side?

On 2/6/20 6:58 AM, Andreas Dilger wrote:
On Feb 4, 2020, at 07:23, Åke Sandgren 
mailto:ake.sandg...@hpc2n.umu.se>
<mailto:ake.sandg...@hpc2n.umu.se>> wrote:

When I create a large number of files on an OST and then remove them,
the used inode count on the OST decreases very slowly, it takes several
hours for it to go from 3M to the correct ~10k.

(I'm running the io500 test suite)

Is there something I can do to make it release them faster?
Right now it has gone from 3M to 1.5M in 6 hours, (lfs df -i).

It this the object count or the file count?  Are you possibly using a lot of
stripes on the files being deleted that is multiplying the work needed?

These are SSD based OST's in case it matters.

The MDS controls the destroy of the OST objects, so there is a rate
limit, but ~700/s seems low to me, especially for SSD OSTs.

You could check "lctl get_param osp.*.sync*" on the MDS to see how
many destroys are pending.  Also, increasing osp.*.max_rpcs_in_flight
on the MDS might speed this up?  It should default to 32 per OST on
the MDS vs. default 8 for clients

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud







--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se<mailto:a...@hpc2n.umu.se>   Phone: +46 90 7866134 
Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se<http://www.hpc2n.umu.se/>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Slow release of inodes on OST

2020-02-05 Thread Andreas Dilger
On Feb 4, 2020, at 07:23, Åke Sandgren 
mailto:ake.sandg...@hpc2n.umu.se>> wrote:

When I create a large number of files on an OST and then remove them,
the used inode count on the OST decreases very slowly, it takes several
hours for it to go from 3M to the correct ~10k.

(I'm running the io500 test suite)

Is there something I can do to make it release them faster?
Right now it has gone from 3M to 1.5M in 6 hours, (lfs df -i).

It this the object count or the file count?  Are you possibly using a lot of
stripes on the files being deleted that is multiplying the work needed?

These are SSD based OST's in case it matters.

The MDS controls the destroy of the OST objects, so there is a rate
limit, but ~700/s seems low to me, especially for SSD OSTs.

You could check "lctl get_param osp.*.sync*" on the MDS to see how
many destroys are pending.  Also, increasing osp.*.max_rpcs_in_flight
on the MDS might speed this up?  It should default to 32 per OST on
the MDS vs. default 8 for clients
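
A hedged sketch of checking and bumping those on the MDS (the values are examples only, not recommendations; watch OSS load when raising them):

lctl get_param osp.*.sync_changes osp.*.sync_in_flight osp.*.destroys_in_flight
lctl get_param osp.*.max_rpcs_in_flight
lctl set_param osp.*.max_rpcs_in_flight=64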

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Nodemap, ssk and mutiple fileset from one client

2020-01-26 Thread Andreas Dilger
The main reason is that clients are identified for the nodemap by their NID, 
and optionally verified by crypto (Kerberos or SSK).  That makes it difficult 
to separate two different mounts from the same client.

It would potentially be possible to have the primary client identification be 
done by the crypto key, which is passed at mount time, but I don't think anyone 
is planning to work on this feature. You would of course be welcome to submit a 
patch if this is important to you.

Cheers, Andreas

On Jan 26, 2020, at 14:54, Hans Henrik Happe  wrote:

Thanks for the input. WRT one LNet per fileset, is there some technical reason for this design?

Cheers,
Hans Henrik

On 06.01.2020 09.41, Moreno Diego (ID SIS) wrote:
I’m not sure about the SSK limitations but I know for sure that you can have 
multiple filesets belonging to the same filesystem on a client. As you already 
said, you’ll basically need to have one LNET per fileset (o2ib0, o2ib1, o2ib2), 
then mount each fileset with the option ‘-o network=’.
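
A sketch of such a mount (the NID, fsname and mount point are placeholders):

mount -t lustre -o network=o2ib1 10.0.0.1@o2ib1:/lustre /mnt/fileset1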

I gave a talk on our setup during last LAD (https://bit.ly/35oaPl7), slide 24 
contains a few details on this. It’s for a routed configuration but we also had 
it working without LNET routers.

Diego


From: lustre-discuss 

 on behalf of Jeremy Filizetti 

Date: Tuesday, 31 December 2019 at 04:22
To: Hans Henrik Happe 
Cc: "lustre-discuss@lists.lustre.org" 

Subject: Re: [lustre-discuss] Nodemap, ssk and mutiple fileset from one client

It doesn't look like this would be possible due to nodemap or SSK limitations.  
As you pointed out, nodemap must associate a NID with a single nodemap.  SSK 
was intentionally tied to nodemap by design.  It does a lookup on the nodemap 
of a NID to verify it matches what is found in the server key.  I think even if 
you used multiple NIDs for a client like o2ib(ib0),o2ib1(ib0) you would still 
run into issues due to LNet, but I'm not certain on that.

Jeremy

On Mon, Dec 30, 2019 at 9:30 PM Hans Henrik Happe 
mailto:ha...@nbi.dk>> wrote:
Hi,

Is it possible to have one client mount multiple fileset's with
different ssk keys.

Basically, we would just like to hand out a key to clients that should
be allowed to mount a specific fileset (subdir). First, it looks like
the nodemap must contain the client NID for it to be able to mount. The
key is not enough. Secondly, nodemaps are not allowed hold the same
NIDs, so it seems impossible to have multiple ssk protected filesets
mounted from one client, unless multiple NIDs are used?

Example: For nodes A and B and filesets f0 (key0) and f1 (key1).

A: Should be allowed to mount f0 (key0).
B: Should be allowed to mount f0 (key0) and f1 (key1).

Cheers,
Hans Henrik
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Read performance bad, telepathy in Lustre

2020-01-23 Thread Andreas Dilger
Thomas,
If you are positive that the two sets of clients are not reading files on each other's OSTs, I don't think there is anything at the Lustre level that communicates between OSSes to balance traffic or anything like that.

One possibility is congestion control at the network level, possibly at the 
switch?

Cheers, Andreas

On Jan 23, 2020, at 08:01, Thomas Roth mailto:t.r...@gsi.de>> 
wrote:

Hi all,

Lustre 2.10.6, 45 OSS with 7 OSTs each on ZFS 0.7.9, 3 MDTs (ldiskfs), clients 
2.10 and 2.12. Infiniband network, Mellanox FDR w half bisectional bandwidth.

A sample of ~250.000 files, stripe count 1, average size 100 MB. is read with 
dd, output > /dev/null.

The location of the files has been recorded, from this we have drawn up 
separate file lists for each OSS.


In the first run, one client reads the files on one OSS and gets a read 
performance X, e.g. 2 GB/s.

In the second run, this setup is simply multiplied by 10 or 40: Client 1 still 
reads from OSS 1, Client 2 works with the files on OSS2, client 3 with OSS 3, 
...

With only 12 pairs of this kind we see 2 or 3 pairs whose performance drops to < 
500 MB/s. The other pairs keep the read rate as seen before. Once they have 
finished, the remaining 2 -3 pairs jump back to original performance.

When the runs are repeated, the affected OSS are not the same as before.

This should exclude effects of bad hardware: servers, disks, cables, switches.

Since this behaviour is reproducible, the effects of interactions with other 
jobs/users can also be excluded.




By now I am able to reproduce the behavior on a test system, same 
configuration, with just 2 client-OSS pairs, nobody else on there.

56 parallel dd processes on client 1, reading files on server 1: 440 MB/s
56 parallel dd processes on client 2, reading files on server 2: 1.6 GB/s

Then kill all processes on client 2. Client 1 continues, rising to 1.1 GB/s


These processes are not even visible on the MDS of this system, and from all I 
understand the metadata server should be the only connecting element between 
the two pairs?
How do they know about each other? Who or what tells the client-1/server-1 pair to keep it low while client-2 is working on server-2?

Curioser and curioser,
Thomas




--

Thomas Roth
Department: Informationstechnologie
Location: SB3 2.291
Phone: +49-6159-71 1453  Fax: +49-6159-71 2986


GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de<http://www.gsi.de>

Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung:
Professor Dr. Paolo Giubellino, Jörg Blaurock
Chairman of the Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
State Secretary / Staatssekretär Dr. Volkmar Dietz

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] [Lwg] SC19 bof slides

2020-01-16 Thread Andreas Dilger
   Hello,

   Are the slides presented during the bof in Denver available 
somewhere?

   I'm specifically looking for one of the sites presentation that gave 
some feedback about DoM.

   Regards

   Thomas Hamel
   ___
   lustre-discuss mailing list
   
lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>
   
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org




   ___
   lwg mailing list
   l...@lists.opensfs.org<mailto:l...@lists.opensfs.org>
   http://lists.opensfs.org/listinfo.cgi/lwg-opensfs.org


___
lwg mailing list
l...@lists.opensfs.org<mailto:l...@lists.opensfs.org>
http://lists.opensfs.org/listinfo.cgi/lwg-opensfs.org

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre and Optane

2020-01-13 Thread Andreas Dilger
The most likely candidate for effectively using Optane/NVRAM would be
via Persistent Client Cache (PCC), which allows client-local storage to
be part of the Lustre namespace.  Files can be cached on a local NVRAM
device (managed by a local filesystem like ext4, or possibly something
more experimental like NOVA for better performance) and then migrated
into the cache.

Once the file is in PCC, it can be accessed via the local filesystem
operations, including DAX, for very low-latency operations.  See the
presentation from LAD'19 for details:

https://www.eofs.eu/_media/events/lad19/07_li_xi-nvram_pcc.pdf

It should be noted that in Lustre 2.13, files in PCC are NOT resident in the
main filesystem, so if the client node goes offline then the files will not be
accessible until the client node is restarted.  For some workloads this is OK
(e.g. files being generated locally with high IOPS that are occasionally
needed on other clients), but not for others.  We will be improving PCC to
use FLR to mirror a copy into the client cache and still keep a copy in the
main filesystem, but that is not available yet.

Cheers, Andreas

On Jan 13, 2020, at 10:03, Dave Holland 
mailto:d...@sanger.ac.uk>> wrote:

I haven't been to LUG or LAD recently, so I'm a bit out of the loop, but
how much use is Optane finding in the Lustre world?

The main obstacle I see is that it's server-local, so building a
resilient/failover-capable system isn't straightforward.

Thanks for any observations.

Cheers,
Dave
--
** Dave Holland ** Systems Support -- Informatics Systems Group **
** 01223 496923 **Wellcome Sanger Institute, Hinxton, UK**


Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lnet Self Test

2019-11-27 Thread Andreas Dilger
The first thing to note is that lst reports results in binary units
(MiB/s) while iperf reports results in decimal units (Gbps).  If you do the
conversion you get 2055.31 MiB/s = 2155 MB/s.
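
Worked out explicitly: 2055.31 MiB/s × 1.048576 ≈ 2155 MB/s ≈ 17.2 Gbps, versus the ~2750 MB/s (22 Gbps) reported by iperf3, i.e. roughly 78% of the raw TCP figure.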

The other thing to check is the CPU usage. For TCP the CPU usage can
be high. You should try RoCE+o2iblnd instead.

Cheers, Andreas

On Nov 26, 2019, at 21:26, Pinkesh Valdria 
mailto:pinkesh.vald...@oracle.com>> wrote:

Hello All,

I created a new Lustre cluster on CentOS7.6 and I am running 
lnet_selftest_wrapper.sh to measure throughput on the network.  The nodes are 
connected to each other using 25Gbps ethernet, so theoretical max is 25 Gbps * 
125 = 3125 MB/s.Using iperf3,  I get 22Gbps (2750 MB/s) between the nodes.


[root@lustre-client-2 ~]# for c in 1 2 4 8 12 16 20 24 ;  do echo $c ; 
ST=lst-output-$(date +%Y-%m-%d-%H:%M:%S)  CN=$c  SZ=1M  TM=30 BRW=write 
CKSUM=simple LFROM="10.0.3.7@tcp1" LTO="10.0.3.6@tcp1" 
/root/lnet_selftest_wrapper.sh; done ;

When I run lnet_selftest_wrapper.sh (from the Lustre wiki) between 2 nodes, I get a max of 2055.31 MiB/s. Is that expected at the LNet level? Or can I further tune the network and OS kernel (the tuning I applied is below) to get better throughput?



Result Snippet from lnet_selftest_wrapper.sh

[LNet Rates of lfrom]
[R] Avg: 4112 RPC/s Min: 4112 RPC/s Max: 4112 RPC/s
[W] Avg: 4112 RPC/s Min: 4112 RPC/s Max: 4112 RPC/s
[LNet Bandwidth of lfrom]
[R] Avg: 0.31 MiB/s Min: 0.31 MiB/s Max: 0.31 MiB/s
[W] Avg: 2055.30  MiB/s Min: 2055.30  MiB/s Max: 2055.30  MiB/s
[LNet Rates of lto]
[R] Avg: 4136 RPC/s Min: 4136 RPC/s Max: 4136 RPC/s
[W] Avg: 4136 RPC/s Min: 4136 RPC/s Max: 4136 RPC/s
[LNet Bandwidth of lto]
[R] Avg: 2055.31  MiB/s Min: 2055.31  MiB/s Max: 2055.31  MiB/s
[W] Avg: 0.32 MiB/s Min: 0.32 MiB/s Max: 0.32 MiB/s


Tuning applied:
Ethernet NICs:

ip link set dev ens3 mtu 9000

ethtool -G ens3 rx 2047 tx 2047 rx-jumbo 8191


less /etc/sysctl.conf
net.core.wmem_max=16777216
net.core.rmem_max=16777216
net.core.wmem_default=16777216
net.core.rmem_default=16777216
net.core.optmem_max=16777216
net.core.netdev_max_backlog=27000
kernel.sysrq=1
kernel.shmmax=18446744073692774399
net.core.somaxconn=8192
net.ipv4.tcp_adv_win_scale=2
net.ipv4.tcp_low_latency=1
net.ipv4.tcp_rmem = 212992 87380 16777216
net.ipv4.tcp_sack = 1
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_wmem = 212992 65536 16777216
vm.min_free_kbytes = 65536
net.ipv4.tcp_congestion_control = cubic
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_congestion_control = htcp
net.ipv4.tcp_no_metrics_save = 0



echo "#
# tuned configuration
#
[main]
summary=Broadly applicable tuning that provides excellent performance across a 
variety of common server workloads

[disk]
devices=!dm-*, !sda1, !sda2, !sda3
readahead=>4096

[cpu]
force_latency=1
governor=performance
energy_perf_bias=performance
min_perf_pct=100
[vm]
transparent_huge_pages=never
[sysctl]
kernel.sched_min_granularity_ns = 1000
kernel.sched_wakeup_granularity_ns = 1500
vm.dirty_ratio = 30
vm.dirty_background_ratio = 10
vm.swappiness=30
" > lustre-performance/tuned.conf

tuned-adm profile lustre-performance


Thanks,
Pinkesh Valdria

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] one ost down

2019-11-15 Thread Andreas Dilger
If the HDD has enough bad sectors that it is reporting errors to user space 
then it means that all of the remapping sectors are already consumed, and it will typically continue to have more errors in the future. It should be replaced 
rather than continuing to be used. 

I would agree with Marek that making a "dd" copy is safest, as it will avoid 
further errors appearing, and allows you to restore from the backup if 
something goes wrong with your repair attempt. 

Cheers, Andreas

> On Nov 15, 2019, at 05:56, Marek Magryś  wrote:
> 
> Hi Einar,
> 
> You can run ddrescue from the OSS server to create a block-level
> backup of the OST. ddrescue should create a fairly decent backup from
> what is available on the block device. I would suggest to run such
> backup before you play with clearing the bad blocks, in worst case you
> will be able to restore the OST to the current state.
> 
> Cheers,
> Marek
> 
>  Original Message 
>> Hi Einar,
>> 
>>  
>> 
>> As for the OST in bad shape, if you have not cleared the bad blocks on
>> the storage system you’ll keep having IO errors when your server tries
>> to access these blocks, that’s kind of a protection mechanism and lots
>> of IO errors might give you many issues. The procedure to clean them up
>> is a bit of storage and filesystem surgery. I would suggest, this high
>> level view plan:
>> 
>>  
>> 
>>  * Obtain the bad blocks from the storage system (offset, size, etc…)
>>  * Map them to filesystem blocks: watch out, the storage system speaks
>>probably and for old systems about 512bytes blocks and the
>>filesystem blocks are 4KB, so you need to map storage blocks to
>>filesystem blocks
>>  * Clear the bad blocks on the storage system, each storage system has
>>their own commands to clear those. You’ll probably no longer have IO
>>errors accessing these sectors after clearing the bad blocks
>>  * Optional, zero the bad storage blocks with dd (and just these bad
>>blocks of course) to ignore the “trash” there might be on these blocks
>>  * Find out with debugfs which files are affected
>>  * Run e2fsck on the device
>> 
>>  
>> 
>> As I said, surgery, so if you really care about what you have on that
>> device try to do a block level backup before… But the minimum for sure
>> is that you need to clear the bad blocks, otherwise you get IO access
>> error on the device.
>> 
>>  
>> 
>> Regards,
>> 
>>  
>> 
>> Diego
>> 
>>  
>> 
>>  
>> 
>> *From: *lustre-discuss  on
>> behalf of Einar Næss Jensen 
>> *Date: *Friday, 15 November 2019 at 10:01
>> *To: *"lustre-discuss@lists.lustre.org" 
>> *Subject: *[lustre-discuss] one ost down
>> 
>>  
>> 
>>  
>> 
>> Hello dear lustre community.
>> 
>>  
>> 
>> We have a lustre file system, where one ost is having problems.
>> 
>> The underlying diskarray, an old sfa10k from DDN (without support), have
>> one raidset with ca 1300 bad blocks. The bad blocks came about when one
>> disk in the raid failed while another drive in other raidset was rebuilding.
>> 
>>  
>> 
>> Now.
>> 
>> The ost is offline, and the file system seems useable for new files,
>> while old files on the corresponding ost is generating lots of kernel
>> messages on the OSS.
>> 
>> Quotainformation is not available though.
>> 
>>  
>> 
>> Questions: 
>> 
>> May I assume that for new files, everything is fine, since they are not
>> using the inactive device anyway?
>> 
>> I tried to run e2fschk on the ost unmounted, while jobs were still
>> running on the filesystem, and for a few minutes it seemd like this was
>> working, as the filesystem seemed to come back complete afterwards.
>> After a few minutes the ost failed again, though.
>> 
>>  
>> 
>> Any pointers on how to rebuild/fix the ost and get it back is very much
>> appreciated. 
>> 
>>  
>> 
>> Also how to regenerate the quotainformation, which is currently
>> unavailable would help. With or without the troublesome OST.
>> 
>>  
>> 
>>  
>> 
>>  
>> 
>> Best Regards
>> 
>> Einar Næss Jensen (on flight to Denver)
>> 
>>  
>> 
>>  
>> 
>> -- 
>> Einar Næss Jensen
>> NTNU HPC Section
>> Norwegian University of Science and Technoloy
>> Address: Høgskoleringen 7i
>>  N-7491 Trondheim, NORWAY
>> tlf: +47 90990249
>> email:   einar.nass.jen...@ntnu.no
>> 
>> 
>> ___
>> lustre-discuss mailing list
>> lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] changing inode size on MDT

2019-11-11 Thread Andreas Dilger
You can check the ashift of the zpool via "zpool get all | grep ashift".  If 
this is different, it will make a huge difference in space usage. There are a 
number of ZFS articles that discuss this, it isn't specific to Lustre.

Also, RAID-Z2 is going to have much more space overhead for the MDT than 
mirroring, because the MDT is almost entirely small blocks.  Normally the MDT 
is using mirrored VDEVs.

The reason is that RAID-Z2 has two parity sectors per data stripe vs. a single 
extra mirror per data block, so if all data blocks are 4KB that would double 
the parity overhead vs. mirroring. Secondly, depending on the geometry, RAID-Z2 
needs padding sectors to align the variable RAID-Z stripes, which mirrors do 
not.
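
A concrete example, assuming ashift=12 (4 KiB sectors): a single 4 KiB block on RAID-Z2 occupies one data sector plus two parity sectors, i.e. 12 KiB on disk (3x), while on a 2-way mirror it occupies 8 KiB (2x). Only with larger records does RAID-Z2's relative parity overhead drop below that of mirroring.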

For large files/blocks RAID-Z2 is better, but that isn't the workload on the 
MDT unless you are storing DoM files there (eg. 64KB or larger).

Cheers, Andreas

On Nov 11, 2019, at 13:48, Hebenstreit, Michael 
mailto:michael.hebenstr...@intel.com>> wrote:

Recordsize/ashift: in both cases default values were used (but on different versions of Lustre). How can I check the actual recordsize/ashift values in order to compare them?

zpool mirroring is quite different though – bad drive is a simple raidz2:

  raidz2-0  ONLINE   0 0 0
sdd ONLINE   0 0 0
….

errors: No known data errors

the good drive uses 10 mirrors:

NAME STATE READ WRITE CKSUM
mdt  ONLINE   0 0 0
  mirror-0   ONLINE   0 0 0
sdd  ONLINE   0 0 0
sde  ONLINE   0 0 0
  mirror-1   ONLINE   0 0 0
sdf  ONLINE   0 0 0
sdg  ONLINE   0 0 0
  mirror-2   ONLINE   0 0 0
sdh  ONLINE   0 0 0
sdi  ONLINE   0 0 0
  mirror-3   ONLINE   0 0 0
sdj  ONLINE   0 0 0
sdk  ONLINE   0 0 0
  mirror-4   ONLINE   0 0 0
sdl  ONLINE   0 0 0
sdm  ONLINE   0 0 0
  mirror-5   ONLINE   0 0 0
sdn  ONLINE   0 0 0
sdo  ONLINE   0 0 0
  mirror-6   ONLINE   0 0 0
sdp  ONLINE   0 0 0
sdq  ONLINE   0 0 0
  mirror-7   ONLINE   0 0 0
sdr  ONLINE   0 0 0
sds  ONLINE   0 0 0
  mirror-8   ONLINE   0 0 0
sdt  ONLINE   0 0 0
sdu  ONLINE   0 0 0
  mirror-9   ONLINE   0 0 0
sdv  ONLINE   0 0 0
sdw  ONLINE   0 0 0
  mirror-10  ONLINE   0 0 0
sdx  ONLINE   0 0 0
sdy  ONLINE   0 0 0

thanks
Michael

From: Andreas Dilger mailto:adil...@whamcloud.com>>
Sent: Monday, November 11, 2019 14:42
To: Hebenstreit, Michael 
mailto:michael.hebenstr...@intel.com>>
Cc: Mohr Jr, Richard Frank mailto:rm...@utk.edu>>; 
lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] changing inode size on MDT

There isn't really enough information to make any kind of real analysis.


My guess would be that you are using a larger ZFS recordsize or ashift on the 
new filesystem, or the RAID config is different?

Cheers, Andreas

On Nov 7, 2019, at 08:45, Hebenstreit, Michael 
mailto:michael.hebenstr...@intel.com>> wrote:
So we went ahead and used the FS – using rsync to duplicate the existing FS. 
The inodes available on the NEW mdt (which is almost twice the size of the 
second mdt) are dropping rapidly and are now LESS than on the smaller mdt (even 
though the sync is only 90% complete). Both FS are running almost identical 
Lustre 2.10. I cannot say anymore which ZFS version was used to format the good 
FS.

Any ideas why those 2 MDTs behave so differently?

old GOOD FS:
# df -i
mgt/mgt   81718714   205   817185091% /lfs/lfsarc02/mgt
mdt/mdt  458995000 130510339  328484661   29% /lfs/lfsarc02/mdt
# df -h
mgt/mgt  427G  7.0M  427G   1% /lfs/lfsarc02/mgt
mdt/mdt  4.6T  1.4T  3.3T  29% /lfs/lfsarc02/mdt
# rpm -q -a | grep zfs
libzfs2-0.7.9-1.el7.x86_64
lustre-osd-zfs-mount-2.10.4-1.el7.x86_64
lustre-zfs-dkms-2.10.4-1.el7.noarch
zfs-0.7.9-1.el7.x86_64
zfs-dkms-0.7.9-1.el7.noarch

new BAD FS
# df -ih
mgt/mgt83M   169   83M1% /lfs/lfsarc01/mgt
mdt/mdt   297M  122M  175M   42% /lfs/lfsarc01/mdt
# df -h
mgt/mgt  427G  5.8M  427G   1% /lfs/lfsarc01/mgt
mdt/mdt  8.2T  3.4T  4.9T  41% /lfs/lfsarc01/mdt
# rpm -q -a | grep

Re: [lustre-discuss] changing inode size on MDT

2019-11-11 Thread Andreas Dilger
There isn't really enough information to make any kind of real analysis.

My guess would be that you are using a larger ZFS recordsize or ashift on the 
new filesystem, or the RAID config is different?

Cheers, Andreas

On Nov 7, 2019, at 08:45, Hebenstreit, Michael 
mailto:michael.hebenstr...@intel.com>> wrote:

So we went ahead and used the FS – using rsync to duplicate the existing FS. 
The inodes available on the NEW mdt (which is almost twice the size of the 
second mdt) are dropping rapidly and are now LESS than on the smaller mdt (even 
though the sync is only 90% complete). Both FS are running almost identical 
Lustre 2.10. I cannot say anymore which ZFS version was used to format the good 
FS.

Any ideas why those 2 MDTs behave so differently?

old GOOD FS:
# df -i
mgt/mgt   81718714        205   81718509    1% /lfs/lfsarc02/mgt
mdt/mdt  458995000 130510339  328484661   29% /lfs/lfsarc02/mdt
# df -h
mgt/mgt  427G  7.0M  427G   1% /lfs/lfsarc02/mgt
mdt/mdt  4.6T  1.4T  3.3T  29% /lfs/lfsarc02/mdt
# rpm -q -a | grep zfs
libzfs2-0.7.9-1.el7.x86_64
lustre-osd-zfs-mount-2.10.4-1.el7.x86_64
lustre-zfs-dkms-2.10.4-1.el7.noarch
zfs-0.7.9-1.el7.x86_64
zfs-dkms-0.7.9-1.el7.noarch

new BAD FS
# df -ih
mgt/mgt        83M    169    83M    1% /lfs/lfsarc01/mgt
mdt/mdt   297M  122M  175M   42% /lfs/lfsarc01/mdt
# df -h
mgt/mgt  427G  5.8M  427G   1% /lfs/lfsarc01/mgt
mdt/mdt  8.2T  3.4T  4.9T  41% /lfs/lfsarc01/mdt
# rpm -q -a | grep zfs
libzfs2-0.7.9-1.el7.x86_64
lustre-osd-zfs-mount-2.10.8-1.el7.x86_64
lustre-zfs-dkms-2.10.8-1.el7.noarch
zfs-0.7.9-1.el7.x86_64
zfs-dkms-0.7.9-1.el7.noarch

From: Andreas Dilger mailto:adil...@whamcloud.com>>
Sent: Thursday, October 03, 2019 20:38
To: Hebenstreit, Michael 
mailto:michael.hebenstr...@intel.com>>
Cc: Mohr Jr, Richard Frank mailto:rm...@utk.edu>>; 
lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] changing inode size on MDT

On Oct 3, 2019, at 20:09, Hebenstreit, Michael 
mailto:michael.hebenstr...@intel.com>> wrote:

So bottom line – don’t change the default values, it won’t get better?

Like I wrote previously, there *are* no default/tunable values to change for 
ZFS.  The tunables are only for ldiskfs, which statically allocates everything, 
but it will cause problems if you guessed incorrectly at the instant you format 
the filesystem.

The number reported by raw ZFS and by Lustre-on-ZFS is just an estimate, and 
you will (essentially) run out of inodes once you run out of space on the MDT 
or all OSTs.  And I didn't say "it won't get better", actually I said the 
estimate _will_ get better once you actually start using the filesystem.

If the (my estimate) 2-3B inodes on the MDT is insufficient, you can always add 
another (presumably mirrored) VDEV to the MDT, or add a new MDT to the 
filesystem to increase the number of inodes available.
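
A minimal sketch of the first option (hypothetical device names, assuming the
MDT pool is simply called "mdt" as elsewhere in this thread):

# zpool add mdt mirror /dev/sdy /dev/sdz    # grow the MDT pool with one more mirror VDEV
# zpool list mdt                            # the extra free space raises the inode estimate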

Cheers, Andreas



From: Andreas Dilger mailto:adil...@whamcloud.com>>
Sent: Thursday, October 03, 2019 19:38
To: Hebenstreit, Michael 
mailto:michael.hebenstr...@intel.com>>
Cc: Mohr Jr, Richard Frank mailto:rm...@utk.edu>>; 
lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] changing inode size on MDT

On Oct 3, 2019, at 05:03, Hebenstreit, Michael 
mailto:michael.hebenstr...@intel.com>> wrote:

So you are saying on a zfs based Lustre there is no way to increase the number 
of available inodes? I have 8TB MDT with roughly 17G inodes

[root@elfsa1m1 ~]# df -h
Filesystem   Size  Used Avail Use% Mounted on
mdt  8.3T  256K  8.3T   1% /mdt

[root@elfsa1m1 ~]# df -i
Filesystem   Inodes  IUsed   IFree IUse% Mounted on
mdt     17678817874            6  17678817868    1% /mdt

For ZFS the only way to increase inodes on the *MDT* is to increase the size of 
the MDT, though more on that below.  Note that the "number of inodes" reported 
by ZFS is an estimate based on the currently-allocated blocks and inodes (i.e. 
bytes_per_inode_ratio = bytes_used / inodes_used, total inode estimate = 
bytes_free / inode_ratio + inodes_used), which becomes more accurate as the MDT 
becomes more full.  With 17B inodes on an 8TB MDT that is a bytes-per-inode 
ratio of 497, which is unrealistically low for Lustre since the MDT always 
stores multiple xattrs on each inode.  Note that the filesystem only has 6 
inodes allocated, so the ZFS total inodes estimate is unrealistically high and 
will get better as more inodes are allocated in the filesystem.
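
As a quick sanity check of that ratio (plain shell arithmetic, using the
numbers quoted above):

# echo $(( 8 * 1024**4 / 17678817874 ))     # 8TiB / 17.7B inodes ~= 497 bytes per inode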

Formating under Lustre 2.10.8

mkfs.lustre --mdt --backfstype=zfs --fsname=lfsarc01 --index=0 
--mgsnid="36.101.92.22@tcp<mailto:36.101.92.22@tcp>" --reformat mdt/mdt

this translates to only 948M inodes on the Lustre FS.

[root@elfsa1m1 ~]# df -i
Filesystem   Inodes  IUsed 

Re: [lustre-discuss] Lustre client/server versions/compatibility

2019-10-20 Thread Andreas Dilger
Cory,
I think the protocol change for object deletion was done as part of the OSD API 
change (to add ZFS support) in 2.4, and LU-5814 doesn't seem to be related to 
that.  Or maybe the removal of the interoperability support for pre-2.4 clients 
was removed as part of that change (which is marked landed in 2.8.0)?

In any case, it is entirely possible that there is some change that can cause 
interop problems, so definitely upgrading is preferred.

Cheers, Andreas



On Oct 20, 2019, at 14:05, Cory Spitz 
mailto:spitz...@cray.com>> wrote:

Hi, William.

You should plan to update your 2.4.x and 2.5.x clients.  There is at least one 
known compatibility problem with clients that old.  The code associated with 
LU-5814 changed the protocol for deleting objects.  You’ll find that files 
deleted from those old clients will leak objects on the OSTs.  Sure, you can 
live with it and lfsck can clean things up for you, but because of this issue 
those old clients don’t seem to be used anywhere in production along with 
modern servers and there may be other problems to worry about.

-Cory

--


On 10/17/19, 12:12 AM, "lustre-discuss on behalf of Andreas Dilger" 
mailto:lustre-discuss-boun...@lists.lustre.org>
 on behalf of adil...@whamcloud.com<mailto:adil...@whamcloud.com>> wrote:

On Oct 17, 2019, at 00:34, William D. Colburn 
mailto:wcolb...@nrao.edu>> wrote:

We are moving closer to migrating our lustre 2.5.5 servers to lustre
2.12.2.  But we still have a collection of clients.  We have a vague bit
of folklore, handed down word of mouth from the village elder, that if
we upgrade the server to 2.12.2 that the clients all have to be upgraded
to 2.7.something.  I've tried google, but all the searches I've tried
for compatibility all lead to information on what versions of lustre
run on what versions of linux, and not what versions of clients interact
with what versions of the server.

Is there a compatibility matrix for clients versus servers?

In theory, there is a very long window of compatibility between Lustre versions
between the client and server (much less leeway is allowed between servers).

Since each client is independent (they never communicate with each other, only
with the servers) this also provides a good deal of flexibility, unlike e.g. 
shared
storage filesystem like GPFS or GlusterFS or peer-peer filesystems.

The main reason there is not a large compatibility matrix is that this takes a 
lot
of effort to test various versions.  For releases we always test release N 
against
N-1 and the most recent LTS release (e.g. 2.12.3 is being tested against 2.11.0
and 2.10.8, and 2.13 is tested against 2.12) for both client and server 
versions.
This is already 4 sets of interop tests (new client/old server, old client/new 
server
for both), plus disk format upgrade tests, plus several different distros 
(RHEL7/8,
Ubuntu) and CPU architectures (x86, PPC64, Arm)...

So, for software we release for free we can't do exhaustive testing of all the 
older
release versions as well.  While client and server interop _should_ work between
all versions, there are occasional bugs, so we only report compatibility with 
versions
that we actually tested.  As shown with your list of client versions below 
(2.12.2 clients
with 2.5.5 servers), there is a pretty wide range of compatibility.  Each 
client and
server negotiate protocol compatibility at connection time, so if new features 
are
added they are only active if both systems support them. We are only just 
removing
some Lustre 1.8 and 2.1 and kernel 2.6.32 compatibility code from the 2.13 
release,
which were first released about 8 years ago.

If you wanted, you could create a wiki.lustre.org<http://wiki.lustre.org/> page 
that included the various
versions that are actually tested, and allow users to fill in the blanks?  I've 
heard
several reports on the list of sites using 2.10.x or 2.12.x clients with 2.5.x 
servers,
but not so much the other way around.


Alternatively, does someone just know what version of the clients we
need to have everywhere for a 2.12.2 server?

Our clients include:
 2.4.3
 2.5.5
 2.10.1
 2.10.2
 2.10.4
 2.10.5
 2.10.6
 2.10.7
 2.10.8
 2.12.2

My recommendation would be to upgrade the 2.4 and 2.5 clients to 2.10.8 and then
you could run 2.12.2 servers.  If you really need to run older clients because 
of old
applications, you could try using 2.7.x on those clients and 2.10.8 servers.  
You might
be able to use 2.7.x clients with 2.12 servers, but that isn't tested for those 
releases.
Alternately, run old apps in VMs/containers and use a newer kernel+Lustre 
underneath.

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre client/server versions/compatibility

2019-10-16 Thread Andreas Dilger
On Oct 17, 2019, at 00:34, William D. Colburn 
mailto:wcolb...@nrao.edu>> wrote:

We are moving closer to migrating our lustre 2.5.5 servers to lustre
2.12.2.  But we still have a collection of clients.  We have a vague bit
of folklore, handed down word of mouth from the village elder, that if
we upgrade the server to 2.12.2 that the clients all have to be upgraded
to 2.7.something.  I've tried google, but all the searches I've tried
for compatibility all lead to information on what versions of lustre
run on what versions of linux, and not what versions of clients interact
with what versions of the server.

Is there a compatibility matrix for clients versus servers?

In theory, there is a very long window of compatibility between Lustre versions
between the client and server (much less leeway is allowed between servers).

Since each client is independent (they never communicate with each other, only
with the servers) this also provides a good deal of flexibility, unlike e.g. 
shared
storage filesystem like GPFS or GlusterFS or peer-peer filesystems.

The main reason there is not a large compatibility matrix is that this takes a 
lot
of effort to test various versions.  For releases we always test release N 
against
N-1 and the most recent LTS release (e.g. 2.12.3 is being tested against 2.11.0
and 2.10.8, and 2.13 is tested against 2.12) for both client and server 
versions.
This is already 4 sets of interop tests (new client/old server, old client/new 
server
for both), plus disk format upgrade tests, plus several different distros 
(RHEL7/8,
Ubuntu) and CPU architectures (x86, PPC64, Arm)...

So, for software we release for free we can't do exhaustive testing of all the 
older
release versions as well.  While client and server interop _should_ work between
all versions, there are occasional bugs, so we only report compatibility with 
versions
that we actually tested.  As shown with your list of client versions below 
(2.12.2 clients
with 2.5.5 servers), there is a pretty wide range of compatibility.  Each 
client and
server negotiate protocol compatibility at connection time, so if new features 
are
added they are only active if both systems support them. We are only just 
removing
some Lustre 1.8 and 2.1 and kernel 2.6.32 compatibility code from the 2.13 
release,
which were first released about 8 years ago.

If you wanted, you could create a wiki.lustre.org<http://wiki.lustre.org> page 
that included the various
versions that are actually tested, and allow users to fill in the blanks?  I've 
heard
several reports on the list of sites using 2.10.x or 2.12.x clients with 2.5.x 
servers,
but not so much the other way around.

Alternatively, does someone just know what version of the clients we
need to have everywhere for a 2.12.2 server?

Our clients include:
 2.4.3
 2.5.5
 2.10.1
 2.10.2
 2.10.4
 2.10.5
 2.10.6
 2.10.7
 2.10.8
 2.12.2

My recommendation would be to upgrade the 2.4 and 2.5 clients to 2.10.8 and then
you could run 2.12.2 servers.  If you really need to run older clients because 
of old
applications, you could try using 2.7.x on those clients and 2.10.8 servers.  
You might
be able to use 2.7.x clients with 2.12 servers, but that isn't tested for those 
releases.
Alternately, run old apps in VMs/containers and use a newer kernel+Lustre 
underneath.

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Limit to number of OSS?

2019-10-10 Thread Andreas Dilger
On Oct 10, 2019, at 11:20, Michael Di Domenico 
mailto:mdidomeni...@gmail.com>> wrote:

On Mon, Oct 7, 2019 at 6:33 PM Andreas Dilger 
mailto:adil...@whamcloud.com>> wrote:

With socklnd there are 3 TCP connections per client-server pair.
For IB there is no such connection limit that I'm aware of.

just out of morbid curiosity, can very briefly explain the
connectivity differences between TCP/IB.  Does IB use the same 3
connections as TCP?  If not, is that why the connectivity limit
doesn't exist with IB or is there some other overriding design
principal in IB that allows lustre to push past TCP?  Not that any of
this has any relevance to anything i do, i'm just curious.

i'd love to have 2000 OSS's and 20k clients, but sadly i do not... :(

This is a fundamental difference between TCP and IB.  TCP needs a persistent
connection between peers (socket) to manage state, and the (very ancient) IP
protocol on which TCP is built has a limit of 65536 connections on a single 
node.
When computers had 1-2MB of RAM that was more than enough...

IB does not have this limitation, though it does consume some memory for each
peer that it is communicating with.  o2iblnd can establish multiple 
connections
to a single peer to get better bandwidth, and this is important for OPA 
performance,
but is not critical for IB networks.
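
For reference, the number of connections o2iblnd opens per peer is controlled
by a module parameter; a hedged example (the file name is arbitrary, and this
is the same option the OPA-tuned ko2iblnd.conf quoted elsewhere in this
archive uses):

# echo "options ko2iblnd conns_per_peer=4" > /etc/modprobe.d/ko2iblnd-local.conf
# cat /sys/module/ko2iblnd/parameters/conns_per_peer    # after the module is (re)loaded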

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Limit to number of OSS?

2019-10-07 Thread Andreas Dilger
Whether there are problems with a large number of OSS and/or MDS nodes depends 
on whether you are using TCP or IB networking.

With socklnd there are 3 TCP connections per client-server pair (bulk read, 
bulk write, and small message) so the maximum you could have would be around 
(65536 - 1024)/3 = 21500 (or likely fewer) clients or servers, unless you also 
configured LNet routers in between (which would allow more clients, but not 
more servers).  That isn't a limitation for most deployments, but at least one 
known limitation.  For IB there is no such connection limit that I'm aware of.

There are likely other factors such as memory consumption per target, but I 
don't think that would be the first thing to cause problems on modern systems 
with hundreds of GB of RAM.

Cheers, Andreas

On Oct 4, 2019, at 01:45, Degremont, Aurelien 
mailto:degre...@amazon.com>> wrote:

Thanks for this info. But actually I was really looking at the number of OSS, 
not OSTs :)
This is really more about how Lustre client nodes and the MDT will cope with a very large 
number of OSSes.

De : Andreas Dilger mailto:adil...@whamcloud.com>>
Date : vendredi 4 octobre 2019 à 04:54
À : "Degremont, Aurelien" mailto:degre...@amazon.com>>
Objet : Re: [lustre-discuss] Limit to number of OSS?

On Oct 3, 2019, at 07:55, Degremont, Aurelien 
mailto:degre...@amazon.com>> wrote:

Hello all!

This doc from the wiki says "Lustre can support up to 2000 OSS per file system" 
(http://wiki.lustre.org/Lustre_Server_Requirements_Guidelines).

I'm a bit surprised by this statement. Does somebody have information about the 
upper limit to the number of OSSes?
Or what could be the scaling limiter for this number of OSSes? Network limit? 
Memory consumption? Other?

That's likely a combination of a bit of confusion and a bit of safety on the 
part of Intel writing that document.

The Lustre Operations Manual writes:
Although a single file can only be striped over 2000 objects, Lustre file 
systems can have thousands of OSTs. The I/O bandwidth to access a single file 
is the aggregated I/O bandwidth to the objects in a file, which can be as much 
as a bandwidth of up to 2000 servers. On systems with more than 2000 OSTs, 
clients can do I/O using multiple files to utilize the full file system 
bandwidth.
I think PNNL once tested up to 4000 OSTs, and I think the compile-time limit 
is/was 8000 OSTs (maybe it was made dynamic, I don't recall offhand), but the 
current code could _probably_ handle up to 65000 OSTs without significant 
problems.  Beyond that, there is the 16-bit OST index limit in the filesystem 
device labels and the __u16 lov_user_md_v1->lmm_stripe_offset to specify the 
starting OST index for "lfs setstripe", but that could be overcome with some 
changes.
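
For reference, that starting index is the one exposed through the normal
striping interface; a hedged example (path, index and count are made up):

# lfs setstripe -i 1234 -c 4 /mnt/lustre/dir/newfile    # start allocation on OST index 1234
# lfs getstripe /mnt/lustre/dir/newfile                 # shows the objects actually chosen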

Given OSTs are starting to approach 1PB with large drives and 
declustered-parity RAID, this would get us in the range 8-65EB, which is over 
2^64 bytes (16EB), so I don't think it is an immediate concern.  Let me know if 
you have any trouble with a 9000-OST filesystem... :-)

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre rpm install creating a file that breaks lustre

2019-10-04 Thread Andreas Dilger
The "alias ko2iblnd-opa" line in ko2iblnd.conf doesn't do anything unless OPA 
(or older QLogic) interfaces are detected on your system via the 
/usr/sbin/ko2iblnd-probe script.

This indirection is used to allow better default module parameters between OPA 
and MLX devices, which can't easily be determined inside the kernel, and would 
be hard to change after the fact.  In theory, if there were substantially 
better ko2iblnd module parameters for new MLX or RoCE devices, then this same 
mechanism could be used to do similar interface-specific tunings in userspace.

You could run that script with the last line commented out (or replace 
"exec->echo" on the last line) to see what it is doing.

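A hedged way to do that in one go (assuming the script's last line is the
"exec" described above and that a plain POSIX shell is fine):

# sed 's/^exec /echo /' /usr/sbin/ko2iblnd-probe | sh
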
Cheers, Andreas

On Oct 2, 2019, at 12:55, Kurt Strosahl 
mailto:stros...@jlab.org>> wrote:



Good Afternoon,



While getting lustre 2.10.8 running on a RHEL 7.7 system I found that the 
RPM install was putting a file in /etc/modprobe.d that was preventing lnet from 
starting properly.



the file is ko2iblnd.conf, which contains the following...



alias ko2iblnd-opa ko2iblnd
options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 
concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 
fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4



install ko2iblnd /usr/sbin/ko2iblnd-probe



Our system is running infiniband, not omnipath.  So I'm not sure why this file 
is being put in place.  Removing the file allows lnet to start properly.

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Limit to number of OSS?

2019-10-03 Thread Andreas Dilger
On Oct 3, 2019, at 07:55, Degremont, Aurelien 
mailto:degre...@amazon.com>> wrote:

Hello all!

This doc from the wiki says "Lustre can support up to 2000 OSS per file system" 
(http://wiki.lustre.org/Lustre_Server_Requirements_Guidelines).

I'm a bit surprised by this statement. Does somebody have information about the 
upper limit to the number of OSSes?
Or what could be the scaling limiter for this number of OSSes? Network limit? 
Memory consumption? Other?

That's likely a combination of a bit of confusion and a bit of safety on the 
part of Intel writing that document.

The Lustre Operations Manual writes:

Although a single file can only be striped over 2000 objects, Lustre file 
systems can have thousands of OSTs. The I/O bandwidth to access a single file 
is the aggregated I/O bandwidth to the objects in a file, which can be as much 
as a bandwidth of up to 2000 servers. On systems with more than 2000 OSTs, 
clients can do I/O using multiple files to utilize the full file system 
bandwidth.

I think PNNL once tested up to 4000 OSTs, and I think the compile-time limit 
is/was 8000 OSTs (maybe it was made dynamic, I don't recall offhand), but the 
current code could _probably_ handle up to 65000 OSTs without significant 
problems.  Beyond that, there is the 16-bit OST index limit in the filesystem 
device labels and the __u16 lov_user_md_v1->lmm_stripe_offset to specify the 
starting OST index for "lfs setstripe", but that could be overcome with some 
changes.

Given OSTs are starting to approach 1PB with large drives and 
declustered-parity RAID, this would get us in the range 8-65EB, which is over 
2^64 bytes (16EB), so I don't think it is an immediate concern.  Let me know if 
you have any trouble with a 9000-OST filesystem... :-)

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] changing inode size on MDT

2019-10-03 Thread Andreas Dilger
On Oct 3, 2019, at 20:09, Hebenstreit, Michael 
mailto:michael.hebenstr...@intel.com>> wrote:

So bottom line – don’t change the default values, it won’t get better?

Like I wrote previously, there *are* no default/tunable values to change for 
ZFS.  The tunables are only for ldiskfs, which statically allocates everything, 
but it will cause problems if you guessed incorrectly at the instant you format 
the filesystem.

The number reported by raw ZFS and by Lustre-on-ZFS is just an estimate, and 
you will (essentially) run out of inodes once you run out of space on the MDT 
or all OSTs.  And I didn't say "it won't get better", actually I said the 
estimate _will_ get better once you actually start using the filesystem.

If the (my estimate) 2-3B inodes on the MDT is insufficient, you can always add 
another (presumably mirrored) VDEV to the MDT, or add a new MDT to the 
filesystem to increase the number of inodes available.

Cheers, Andreas


From: Andreas Dilger mailto:adil...@whamcloud.com>>
Sent: Thursday, October 03, 2019 19:38
To: Hebenstreit, Michael 
mailto:michael.hebenstr...@intel.com>>
Cc: Mohr Jr, Richard Frank mailto:rm...@utk.edu>>; 
lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] changing inode size on MDT

On Oct 3, 2019, at 05:03, Hebenstreit, Michael 
mailto:michael.hebenstr...@intel.com>> wrote:

So you are saying on a zfs based Lustre there is no way to increase the number 
of available inodes? I have 8TB MDT with roughly 17G inodes

[root@elfsa1m1 ~]# df -h
Filesystem   Size  Used Avail Use% Mounted on
mdt  8.3T  256K  8.3T   1% /mdt

[root@elfsa1m1 ~]# df -i
Filesystem   Inodes  IUsed   IFree IUse% Mounted on
mdt     17678817874            6  17678817868    1% /mdt

For ZFS the only way to increase inodes on the *MDT* is to increase the size of 
the MDT, though more on that below.  Note that the "number of inodes" reported 
by ZFS is an estimate based on the currently-allocated blocks and inodes (i.e. 
bytes_per_inode_ratio = bytes_used / inodes_used, total inode estimate = 
bytes_free / inode_ratio + inodes_used), which becomes more accurate as the MDT 
becomes more full.  With 17B inodes on an 8TB MDT that is a bytes-per-inode 
ratio of 497, which is unrealistically low for Lustre since the MDT always 
stores multiple xattrs on each inode.  Note that the filesystem only has 6 
inodes allocated, so the ZFS total inodes estimate is unrealistically high and 
will get better as more inodes are allocated in the filesystem.

Formating under Lustre 2.10.8

mkfs.lustre --mdt --backfstype=zfs --fsname=lfsarc01 --index=0 
--mgsnid="36.101.92.22@tcp<mailto:36.101.92.22@tcp>" --reformat mdt/mdt

this translates to only 948M inodes on the Lustre FS.

[root@elfsa1m1 ~]# df -i
Filesystem   Inodes  IUsed   IFree IUse% Mounted on
mdt     17678817874            6  17678817868    1% /mdt
mdt/mdt   948016092          263    948015829    1% /lfs/lfsarc01/mdt

[root@elfsa1m1 ~]# df -h
Filesystem   Size  Used Avail Use% Mounted on
mdt  8.3T  256K  8.3T   1% /mdt
mdt/mdt  8.2T   24M  8.2T   1% /lfs/lfsarc01/mdt

and there is no reasonable option to provide more file entries except for 
adding another MDT?

The Lustre statfs code will weight in some initial estimates for the 
bytes-per-inode ratio when computing the total inode estimate for the 
filesystem.  When the filesystem is nearly empty, as is the case here, then 
those initial estimates will dominate, but once you've allocated a few thousand 
inodes in the filesystem the actual values will dominate and you will have a 
much more accurate number for the total inode count.  This will probably be 
more in the range of 2B-4B inodes in the end, unless you also use Data-on-MDT 
(Lustre 2.11 and later) to store small files directly on the MDT.

You've also excluded the OST lines from the above output?  For the Lustre 
filesystem you (typically) also need at least one OST inode (object) for each 
file in the filesystem, possibly more than one, so "df" of the Lustre 
filesystem may also be limited by the number of inodes reported by the OSTs 
(which may themselves depend on the average bytes-per-inode for files stored on 
the OST).  If you use Data-on-MDT and only have small files, then no OST 
object is needed for small files, but you consume correspondingly more space on 
the MDT.

Cheers, Andreas


From: Andreas Dilger mailto:adil...@whamcloud.com>>
Sent: Wednesday, October 02, 2019 18:49
To: Hebenstreit, Michael 
mailto:michael.hebenstr...@intel.com>>
Cc: Mohr Jr, Richard Frank mailto:rm...@utk.edu>>; 
lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] changing inode size on MDT

There are several confusing/misleading comments on this thread that need

Re: [lustre-discuss] changing inode size on MDT

2019-10-03 Thread Andreas Dilger
On Oct 3, 2019, at 05:03, Hebenstreit, Michael 
mailto:michael.hebenstr...@intel.com>> wrote:

So you are saying on a zfs based Lustre there is no way to increase the number 
of available inodes? I have 8TB MDT with roughly 17G inodes

[root@elfsa1m1 ~]# df -h
Filesystem   Size  Used Avail Use% Mounted on
mdt  8.3T  256K  8.3T   1% /mdt

[root@elfsa1m1 ~]# df -i
Filesystem   Inodes  IUsed   IFree IUse% Mounted on
mdt     17678817874            6  17678817868    1% /mdt

For ZFS the only way to increase inodes on the *MDT* is to increase the size of 
the MDT, though more on that below.  Note that the "number of inodes" reported 
by ZFS is an estimate based on the currently-allocated blocks and inodes (i.e. 
bytes_per_inode_ratio = bytes_used / inodes_used, total inode estimate = 
bytes_free / inode_ratio + inodes_used), which becomes more accurate as the MDT 
becomes more full.  With 17B inodes on an 8TB MDT that is a bytes-per-inode 
ratio of 497, which is unrealistically low for Lustre since the MDT always 
stores multiple xattrs on each inode.  Note that the filesystem only has 6 
inodes allocated, so the ZFS total inodes estimate is unrealistically high and 
will get better as more inodes are allocated in the filesystem.

Formating under Lustre 2.10.8

mkfs.lustre --mdt --backfstype=zfs --fsname=lfsarc01 --index=0 
--mgsnid="36.101.92.22@tcp" --reformat mdt/mdt

this translates to only 948M inodes on the Lustre FS.

[root@elfsa1m1 ~]# df -i
Filesystem   Inodes  IUsed   IFree IUse% Mounted on
mdt     17678817874            6  17678817868    1% /mdt
mdt/mdt   948016092          263    948015829    1% /lfs/lfsarc01/mdt

[root@elfsa1m1 ~]# df -h
Filesystem   Size  Used Avail Use% Mounted on
mdt  8.3T  256K  8.3T   1% /mdt
mdt/mdt  8.2T   24M  8.2T   1% /lfs/lfsarc01/mdt

and there is no reasonable option to provide more file entries except for 
adding another MDT?

The Lustre statfs code will weight in some initial estimates for the 
bytes-per-inode ratio when computing the total inode estimate for the 
filesystem.  When the filesystem is nearly empty, as is the case here, then 
those initial estimates will dominate, but once you've allocated a few thousand 
inodes in the filesystem the actual values will dominate and you will have a 
much more accurate number for the total inode count.  This will probably be 
more in the range of 2B-4B inodes in the end, unless you also use Data-on-MDT 
(Lustre 2.11 and later) to store small files directly on the MDT.

You've also excluded the OST lines from the above output?  For the Lustre 
filesystem you (typically) also need at least one OST inode (object) for each 
file in the filesystem, possibly more than one, so "df" of the Lustre 
filesystem may also be limited by the number of inodes reported by the OSTs 
(which may themselves depend on the average bytes-per-inode for files stored on 
the OST).  If you use Data-on-MDT and only have small files, then no OST 
object is needed for small files, but you consume correspondingly more space on 
the MDT.

Cheers, Andreas


From: Andreas Dilger mailto:adil...@whamcloud.com>>
Sent: Wednesday, October 02, 2019 18:49
To: Hebenstreit, Michael 
mailto:michael.hebenstr...@intel.com>>
Cc: Mohr Jr, Richard Frank mailto:rm...@utk.edu>>; 
lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] changing inode size on MDT

There are several confusing/misleading comments on this thread that need to be 
clarified...

On Oct 2, 2019, at 13:45, Hebenstreit, Michael 
mailto:michael.hebenstr...@intel.com>> wrote:

http://wiki.lustre.org/Lustre_Tuning#Number_of_Inodes_for_MDS

Note that I've updated this page to reflect current defaults.  The Lustre 
Operations Manual has a much better description of these parameters.


and I'd like to use --mkfsoptions='-i 1024' to have more inodes in the MDT. We 
already run out of inodes on that FS (probably due to an ZFS bug in early IEEL 
version) - so I'd like to increase #inodes if possible.

The "-i 1024" option (bytes-per-inode ratio) is only needed for ldiskfs since 
it statically allocates the inodes at mkfs time, it is not relevant for ZFS 
since ZFS dynamically allocates inodes and blocks as needed.

On Oct 2, 2019, at 14:00, Colin Faber 
mailto:cfa...@gmail.com>> wrote:
With 1K inodes you won't have space to accommodate new features, IIRC the 
current minimal limit on modern lustre is 2K now. If you're running out of MDT 
space you might consider DNE and multiple MDT's to accommodate that larger name 
space.

To clarify, since Lustre 2.10 any new ldiskfs MDT will allocate 1024 bytes for 
the inode itself (-I 1024).  That allows enough space *within* the inode to 
efficiently store xattrs for more complex layouts (PFL, FLR, DoM).  If xattrs 
do not fit inside the inode itself then they will be stored in an external 4KB 
inode block.

Re: [lustre-discuss] changing inode size on MDT

2019-10-02 Thread Andreas Dilger
There are several confusing/misleading comments on this thread that need to be 
clarified...

On Oct 2, 2019, at 13:45, Hebenstreit, Michael 
mailto:michael.hebenstr...@intel.com>> wrote:

http://wiki.lustre.org/Lustre_Tuning#Number_of_Inodes_for_MDS

Note that I've updated this page to reflect current defaults.  The Lustre 
Operations Manual has a much better description of these parameters.


and I'd like to use --mkfsoptions='-i 1024' to have more inodes in the MDT. We 
already run out of inodes on that FS (probably due to an ZFS bug in early IEEL 
version) - so I'd like to increase #inodes if possible.

The "-i 1024" option (bytes-per-inode ratio) is only needed for ldiskfs since 
it statically allocates the inodes at mkfs time, it is not relevant for ZFS 
since ZFS dynamically allocates inodes and blocks as needed.

On Oct 2, 2019, at 14:00, Colin Faber 
mailto:cfa...@gmail.com>> wrote:
With 1K inodes you won't have space to accommodate new features, IIRC the 
current minimal limit on modern lustre is 2K now. If you're running out of MDT 
space you might consider DNE and multiple MDT's to accommodate that larger name 
space.

To clarify, since Lustre 2.10 any new ldiskfs MDT will allocate 1024 bytes for 
the inode itself (-I 1024).  That allows enough space *within* the inode to 
efficiently store xattrs for more complex layouts (PFL, FLR, DoM).  If xattrs 
do not fit inside the inode itself then they will be stored in an external 4KB 
inode block.

The MDT is formatted with a bytes-per-inode *ratio* of 2.5KB, which means 
(approximately) one inode will be created for every 2.5kB of the total MDT 
size.  That 2.5KB of space includes the 1KB for the inode itself, plus space 
for a directory entry (or multiple if hard-linked), extra xattrs, the journal 
(up to 4GB for large MDTs), Lustre recovery logs, ChangeLogs, etc.  Each 
directory inode will have at least one 4KB block allocated.

So, it is _possible_ to reduce the inode *ratio* below 2.5KB if you know what 
you are doing (e.g. 2KB/inode or 1.5KB/inode, this can be an arbitrary number 
of bytes, it doesn't have to be an even multiple of anything) but it definitely 
isn't possible to have 1KB inode size and 1KB per inode ratio, as there 
wouldn't be *any* space left for directories, log files, journal, etc.
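
A hedged example of what such a format command could look like for an ldiskfs
MDT (fsname, index, MGS NID and device are illustrative, and 2048 is just a
sample ratio -- pick one that leaves room for the journal, logs and
directories as described above):

# mkfs.lustre --mdt --fsname=testfs --index=0 --mgsnode=mgs@tcp \
    --mkfsoptions="-i 2048 -I 1024" /dev/sdX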

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Do not recreate OST objects on OST replacement

2019-09-12 Thread Andreas Dilger


On Sep 10, 2019, at 10:23, Degremont, Aurelien 
mailto:degre...@amazon.com>> wrote:

Hello

When an OST dies and you have no choice but to replace it with a freshly 
formatted one (using mkfs --replace), Lustre runs a resynchronization 
mechanism between the MDT and the OST.
The MDT will send the last object ID it knows for this OST and the OST will 
compare this value with its own counter (0 for a freshly formatted OST).
If the difference is greater than 100,000 objects, it will recreate only the 
last 10,000, if not, it will recreate all the missing objects.

I would like it to avoid recreating any objects. The missing ones are lost and 
just start recreating new ones. Is there a way to achieve that?

It isn't currently possible to completely avoid recreating these objects.  
Normally it isn't a huge problem, given the size of normal OSTs.  This is done 
to ensure that if the MDS has previously allocated those objects there will be 
objects available for the clients to write to them. LFSCK can be used to clean 
up these orphan objects if they are not in use.
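
A hedged example of that cleanup, run on the MDS (the fsname is illustrative):

# lctl lfsck_start -M testfs-MDT0000 -t layout -o      # -o/--orphan handles orphan OST objects
# lctl get_param mdd.testfs-MDT0000.lfsck_layout       # watch the status/repaired counters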


Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] find xdev?

2019-09-11 Thread Andreas Dilger


On Sep 11, 2019, at 10:06, Michael Di Domenico 
mailto:mdidomeni...@gmail.com>> wrote:

On Tue, Sep 10, 2019 at 5:48 PM Andreas Dilger 
mailto:adil...@whamcloud.com>> wrote:

I don't think "lfs find -xdev" has ever been a priority for Lustre, since it 
is rare for Lustre filesystems to be
mounted in a nested manner.  Since people already run multiple "lfs find" tasks 
in parallel on different
clients to get better performance, it isn't hard to run separate tasks from the 
top-level mountpoint of
different filesystems.  What is the use case for this?

doesn't xdev keep find from crossing mount points, not necessarily
only in a nested manner but also if there's a link to a directory in a
different filesystem.  i believe 'find' without -xdev will follow and
descend.  but this predicates that my understanding is sound (which it
probably isn't).  i generally add -xdev to my finds as a habit to keep
from scanning nfs volumes.

Yes, -xdev is to avoid crossing mountpoints, but like I wrote it is rare to 
have nested Lustre
mountpoints, so this wouldn't really be useful in most cases.  The find tree 
walking does
*not* follow symlinks into the target directory, only mountpoints.


along the same vein, can anyone state whether there's any actual
performance gain walking the filesystem using find vs lfs find?

For "find" vs. "lfs find" performance, this depends heavily on what the search 
parameters are.  If just
the filename, they will be the same.  If it includes some MDT-specific 
attributes (e.g. uid, gid) then
"lfs find" can be significantly faster (e.g. 3-5x).  If it uses file size, 
then they will be about the same
unless there are other MDT-only parameters, or once LSOM support is landed 
(hopefully 2.13).

okay, that's what i thought or recalled correctly from hearing
somewhere else.  in my particular instance i was just using 'find
-type f' and didn't see any appreciable difference in scanning speed
between the two

For Lustre, ext4, and most other filesystems, the file type is also stored in 
the directory entry, so that
"find" can determine the type without a "stat".  That is safe since the file 
type cannot be changed after
the file is created.

In a scan like "(lfs) find -name '*foo*' -type f" it only needs to read the 
directory entries (including the
file type) and process each entry.  There is nothing that "lfs find" can 
optimize.  With mode, uid, gid, and
*some* timestamp queries, "lfs find" can fetch only the MDS attributes and skip 
any OST RPCs for
that file if the (non)match can be decided without the OST attributes.
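
A hedged illustration of the kind of query where this matters (the path and
IDs are made up):

# lfs find /mnt/lustre -type f -uid 1000 -gid 1000     # answerable from MDT attributes alone
# find /mnt/lustre -type f -uid 1000 -gid 1000         # same result, but stat()s every entry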

Once the Lazy Size-on-MDT (LSOM) integration is finished 
(https://review.whamcloud.com/35167) it
will be possible for "lfs find --lazy" to use *only* attributes from the MDS 
(size, timestamps) to speed
up scanning and avoid OST RPC overhead.

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Replacing ldiskfs MDT with larger disk

2019-07-31 Thread Andreas Dilger
Just to clarify, when I referred to "file level backup/restore", I was 
referring to the MDT ldiskfs filesystem, not the whole Lustre filesystem (which 
would be _much_ too large for most sites).  The various backup/restore methods 
are documented in the Lustre Operations Manual.

Cheers, Andreas

> On Jul 31, 2019, at 15:10, Jesse Stroik  wrote:
> 
> This is excellent information, Andreas.
> 
> Presently we do file level backups to the live file system and they take over 
> 24 hours, so they're done continuously. For that timeframe to wrok, we'd need 
> to be able to back up and recover the MDT to the new MDT with the file system 
> online.
> 
> Given that resizing the file system will proportionately increase the inodes 
> (I didn't realize that), dd to a logical volume may be a reasonable option 
> for us. The dd would be faster enough that we could weather the downtime.
> 
> PFL and FLR aren't features they're planning for the file system and it may 
> be replaced next year so I suspect they'll opt for the DNE method.
> 
> Thanks again,
> Jesse Stroik
> 
> On 7/31/19 3:11 PM, Andreas Dilger wrote:
>> Normally the easy answer would be that a "dd" copy of the MDT device from 
>> your HDDs to a larger SSD LUN, then resize2fs to increase the filesystem 
>> size would also increase the number of inodes proportionately to the LUN 
>> size.
>> However, since you are *not* using 1024-byte inode size, only 512-byte inode 
>> size + 512-bytes space for other things (ie. 1024 bytes-per-inode ratio), 
>> I'd suggest a file-level MDT backup/restore to a newly-formatted MDT because 
>> newer features like PFL and FLR need more space in the inode itself. The 
>> benefit of this approach is that you keep a full backup of the MDT on the 
>> HDDs in case of problems.  Note that after backup/restore the LFSCK OI Scrub 
>> will run for some time (maybe an hour or two, depending on size), which will 
>> result in slowdown. That would likely be compensated by faster SSD storage.
>> If you go the DNE route, then migrate some of the namespace to the new MDT, 
>> you definitely still need to keep MDT.  However, you could combine these 
>> approaches and still copy MDT to new flash storage instead of keeping 
>> the HDDs around forever.  I'd again recommend a file-level MDT 
>> backup/restore to a newly-formatted MDT to get the newer format options.
>> Cheers, Andreas
>>> On Jul 31, 2019, at 13:50, Jesse Stroik  wrote:
>>> 
>>> Hi everyone,
>>> 
>>> One of our lustre file systems outgrew its MDT and the original scope of 
>>> its operation. This one is still running ldiskfs on the MDT. Here's our 
>>> setup and restrictions:
>>> 
>>> - centos 6 / lustre 2.8
>>> - ldiskfs MDT
>>> - minimal downtime allowed, but the FS can be read-only for a while.
>>> 
>>> The MDT itself, set up with -i 1024, needs both more space and available 
>>> inodes. Its purpose changed in scope and we'd now like the performance 
>>> benefits of getting off of spinning media as well.
>>> 
>>> We need a new file system instead of expanding the existing ldiskfs 
>>> because we need more inodes.
>>> 
>>> I think my options are (1) a file level backup and recovery or direct copy 
>>> onto the new file system or (2) add a new MDT to the system and assign all 
>>> directories under the root to it, then lfs_migrate everything on the file 
>>> system thereafter.
>>> 
>>> Is there a disadvantage to the DNE approach other than the fact that we 
>>> have to keep the original spinning-disk MDT around to service the root of 
>>> the FS?
>>> 
>>> If we had to do option 1, we'd want to remount the current MDT read only 
>>> and continue using it while we were preparing new MDT. When I searched, I 
>>> couldn't find anything that seemed definitive about ensuring no changes to 
>>> an ldiskfs MDT during operation and I don't want to assume i can simply 
>>> remount it read only.
>>> 
>>> Thanks,
>>> Jesse Stroik
>>> 
>>> _______
>>> lustre-discuss mailing list
>>> lustre-discuss@lists.lustre.org
>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>> ___
>> lustre-discuss mailing list
>> lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Replacing ldiskfs MDT with larger disk

2019-07-31 Thread Andreas Dilger
Normally the easy answer would be that a "dd" copy of the MDT device from your 
HDDs to a larger SSD LUN, then resize2fs to increase the filesystem size would 
also increase the number of inodes proportionately to the LUN size.

However, since you are *not* using 1024-byte inode size, only 512-byte inode 
size + 512-bytes space for other things (ie. 1024 bytes-per-inode ratio), I'd 
suggest a file-level MDT backup/restore to a newly-formatted MDT because newer 
features like PFL and FLR need more space in the inode itself. The benefit of 
this approach is that you keep a full backup of the MDT on the HDDs in case of 
problems.  Note that after backup/restore the LFSCK OI Scrub will run for some 
time (maybe an hour or two, depending on size), which will result in slowdown. 
That would likely be compensated by faster SSD storage. 

If you go the DNE route, then migrate some of the namespace to the new MDT, you 
definitely still need to keep MDT.  However, you could combine these 
approaches and still copy MDT to new flash storage instead of keeping the 
HDDs around forever.  I'd again recommend a file-level MDT backup/restore to a 
newly-formatted MDT to get the newer format options. 
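
A heavily abbreviated sketch of that file-level procedure (device and paths
are made up, and it assumes a tar new enough to support --xattrs; older
systems would use the getfattr/setfattr variant from the manual):

# mount -t ldiskfs /dev/old_mdt /mnt/mdt_old
# cd /mnt/mdt_old
# tar czf /backup/mdt.tgz --xattrs --xattrs-include="trusted.*" --sparse .

and then the reverse (tar xzpf with the same --xattrs options) onto the
newly-formatted MDT, followed by the OI Scrub/LFSCK pass mentioned above.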

Cheers, Andreas

> On Jul 31, 2019, at 13:50, Jesse Stroik  wrote:
> 
> Hi everyone,
> 
> One of our lustre file systems outgrew its MDT and the original scope of its 
> operation. This one is still running ldiskfs on the MDT. Here's our setup and 
> restrictions:
> 
> - centos 6 / lustre 2.8
> - ldiskfs MDT
> - minimal downtime allowed, but the FS can be read-only for a while.
> 
> The MDT itself, set up with -i 1024, needs both more space and available 
> inodes. Its purpose changed in scope and we'd now like the performance 
> benefits of getting off of spinning media as well.
> 
> We need a new file system instead of expanding the existing ldiskfs because 
> we need more inodes.
> 
> I think my options are (1) a file level backup and recovery or direct copy 
> onto the new file system or (2) add a new MDT to the system and assign all 
> directories under the root to it, then lfs_migrate everything on the file 
> system thereafter.
> 
> Is there a disadvantage to the DNE approach other than the fact that we have 
> to keep the original spinning-disk MDT around to service the root of the FS?
> 
> If we had to do option 1, we'd want to remount the current MDT read only and 
> continue using it while we were preparing new MDT. When I searched, I 
> couldn't find anything that seemed definitive about ensuring no changes to an 
> ldiskfs MDT during operation and I don't want to assume i can simply remount 
> it read only.
> 
> Thanks,
> Jesse Stroik
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] mdt: unhealthy - healthy

2019-07-29 Thread Andreas Dilger
On Jul 26, 2019, at 04:28, Thomas Roth mailto:t.r...@gsi.de>> 
wrote:

Hi all,

this morning one of our MDT went 'unhealthy',

Jul 26 10:15:13 lxmds20 kernel: LustreError: 
9510:0:(service.c:3285:ptlrpc_svcpt_health_check())
mdt: unhealthy - request has been waiting 1017s

However, somewhat later,

lxmds20:~# cat /sys/fs/lustre/health_check
healthy

and all Lustre operations seem to be good, too.

This means that some RPC has been stuck, but if the RPC eventually completes 
then there is no reason for the MDS to be "unhealthy" anymore.

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Error in lfsck: "NOT IMPLEMETED YET"

2019-07-22 Thread Andreas Dilger
If you are trying to delete MDT then that is definitely not implemented 
yet...

Cheers, Andreas

On Jul 22, 2019, at 16:08, João Carlos Mendes Luís 
mailto:jo...@corp.globo.com>> wrote:


Hi,

I'm running some lab tests with lustre 2.12.2 in Oracle Linux Server 
release 7.6.  Last test I did was about migration and MDT splitting.  I started 
with a MGS+MDS node, and two OSS nodes, and one of the tests was to create two 
more MDSs and migrate data between then, until, after some time, I could delete 
the original MDS.  But something happened in the middle and the servers 
panicked/rebooted.

I am now in what appears to be an lfsck bug.  After many other tests, I run 
lfsck_start, and after some time get this message on the nodes:

MGS/MDS0:

[Mon Jul 22 17:42:25 2019] LustreError: 
24107:0:(osd_index.c:1872:osd_index_it_get()) NOT IMPLEMETED YET (move to 
0x24810200)

OSS1/MDS1

[Mon Jul 22 17:40:29 2019] LustreError: 
31558:0:(osd_index.c:1872:osd_index_it_get()) NOT IMPLEMETED YET (move to 
0xa41300c00200)

OST2/MDS2

[Mon Jul 22 17:40:32 2019] LustreError: 
8935:0:(osd_index.c:1872:osd_index_it_get()) NOT IMPLEMETED YET (move to 
0xa0130300)


And for current lfsck status, I run lctl get_param *.*.lfsck* | grep -E 
'status|\.lfsck_lay|\.lfsck_name'

MGS/MDS0:

mdd.mirror01-MDT.lfsck_layout=
status: completed
mdd.mirror01-MDT.lfsck_namespace=
status: partial

OSS1/MDS1

mdd.mirror01-MDT0001.lfsck_layout=
status: completed
mdd.mirror01-MDT0001.lfsck_namespace=
status: partial
obdfilter.mirror01-OST0065.lfsck_layout=
status: completed


OST2/MDS2

mdd.mirror01-MDT0002.lfsck_layout=
status: completed
mdd.mirror01-MDT0002.lfsck_namespace=
status: partial
obdfilter.mirror01-OST0066.lfsck_layout=
status: completed

Is this a known bug?  How do I fix these "partial" lfsck runs?

Thanks for any help,

Jonny



João Carlos Mendes Luís
Senior DevOps Engineer
jo...@corp.globo.com
+55-21-2483-6893
+55-21-99218-1222


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre v2.12.3 Availability

2019-07-12 Thread Andreas Dilger
Also, the b2_12 branch is only receiving patches that have already been tested 
on master, so should be relatively stable.  Doing advanced testing of this 
branch before the 2.12.3 release is always welcome.

On Jul 12, 2019, at 14:59, Peter Jones  wrote:
> 
> Our current thinking is that it will be sooner rather than later in the 
> quarter. You can follow the progress at 
> https://git.whamcloud.com/?p=fs/lustre-release.git;a=shortlog;h=refs/heads/b2_12
>  and we'll announce on this list when it's ready.
>  
> On Friday, July 12, 2019 at 11:39 AM "Tauferner, Andrew T" 
>  wrote:
>>  
>> What is the outlook for v2.12.3 availability?  The release roadmap shows 
>> something around Q3 ’19.  I’d like a more definitive target if possible.  
>> Thanks.

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] A question about lctl lfsck

2019-07-04 Thread Andreas Dilger
You can use "lctl dk" to dump the kernel debug log on the MDS/OSS nodes, and 
grep for the LFSCK messages, but if there are lots of messages the kernel logs 
would not be enough to hold them all.

Another option is to enable "lctl set_param printk=+lfsck" on the MDS and OSS 
and have it print repair messages to the console, and use syslog filtering 
rules to put those messages into their own log file.
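
A hedged sketch of that second option (the file names are arbitrary):

# lctl set_param printk=+lfsck                # on the MDS and OSS nodes
# cat /etc/rsyslog.d/lfsck.conf
:msg, contains, "lfsck" /var/log/lustre-lfsck.log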

> On Jul 3, 2019, at 14:15, Kurt Strosahl  wrote:
> 
> Good Afternoon,
> 
> Hopefully a simple question... If I run lctl lfsck_start is there a place 
> where I can get a list of what it did?

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre metrics

2019-06-26 Thread Andreas Dilger
The first question would be how/where you are collecting those metrics (i.e. 
with some monitoring tool), and what version of Lustre you are running.

> On Jun 26, 2019, at 05:44, Joe Ritz  wrote:
> 
> Hi all.  I am new to Lustre and I have a few questions about the metrics it 
> keeps.  I currently am running Lustre and getting metrics back.  The issue I 
> have is that some of the metrics are reporting negative numbers.  For 
> instance, bytes written is reporting less than zero.  I am not sure how that 
> is possible.  Any thoughts?
> 
> Also, is there any documentation that gives a description of each metric and 
> what it represents?
> 
> Thanks,
> 
> Joe
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] a question about max_create_count

2019-06-20 Thread Andreas Dilger
Kurt,
To clarify a bit further, the max_create_count is the maximum number of objects 
that the MDS might precreate in a batch on an OST if it has a high create rate. 
 They are not necessarily _simultaneous_ creates.

In the recent IO-500 run, Cambridge sustained 1.78M creates/sec for 5 minutes 
(on 24 MDTs + 24*11 OSTs, mind you), so this is not a very high limit.
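
For reference, the tunable lives under the MDS's osp devices; a hedged example
(fsname and indices are illustrative):

# lctl get_param osp.*.max_create_count                              # on the MDS
# lctl set_param osp.testfs-OST0003-osc-MDT0000.max_create_count=0   # stop new objects on OST0003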

On Jun 20, 2019, at 10:09, Degremont, Aurelien  wrote:
> 
> Hello Kurt,
>  
> max_create_count is a tunable per OST. It controls how many object 
> pre-creations could be done on each OST, by the MDT. If you set this to 0 for 
> one OST, no additional object will be allocated there and no new file will 
> have file stripes on it.
>  
> With the appropriate hardware, Lustre can create much more than 20,000 files 
> simultaneously.
>  
> Aurélien
>  
> Kurt Strosahl  wrote on 2019-06-19 at 22:43:
>>  
>> Good Afternoon,
>>  
>> I'm in the process of testing a new lustre file system at 2.12, and in 
>> going through the documentation I saw that to stop writes from going to the 
>> system we now set the max_create_count to 0 on the mdt.
>>  
>>I looked at that value on my system and the default seems to be 20000, am 
>> I correct in thinking that this is the maximum number of simultaneous 
>> creates that can happen on an OST?
>>  
> 

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Unable to compile client in Debian

2019-06-11 Thread Andreas Dilger
These problems may relate to using GCC8, which has a lot of new checks
that cause old code to break.  You could try installing an older GCC,
but that may not create modules compatible with your kernel.

The GCC8 issues are being fixed on master, so you might want to try this
again, but at the same time, master is still undergoing a lot of landings
for the 2.13 release and not what I'd recommend for using in production
until the final 2.13.0 release is finished.

If you are using Debian, I'd suggest to download the Ubuntu 18.04 LTS
kernel, which is 4.15 based.  This is built on a regular basis.

Cheers, Andreas

On May 28, 2019, at 14:38, Alejandro Sierra  wrote:
> 
> Hello:
> 
> I want to be able to see my lustre 2.10.5 disk in my cluster, but
> every attempt to compile the client code failed.  If I try the same
> version as the server (a CentOS 7 server) it fails because of the
> -Werror flag with trivial code impossible to correct, like
> 
> util/parser.c:198:13: error: comparison between pointer and zero
> character constant [-Werror=pointer-compare]
>   if (*next == '\0') {
> ^~
> util/parser.c:198:7: note: did you mean to dereference the pointer?
>   if (*next == '\0') {
>   ^
> util/parser.c: In function ‘Parser_list_commands’:
> util/parser.c:575:25: error: ‘%2d’ directive output may be truncated
> writing between 2 and 10 bytes into a region of size 4
> [-Werror=format-truncation=]
>snprintf(fmt, 6, "%%-%2ds", char_max - len);
> ^~~
> 
> I even tried to erase all Werror flags from the configure and automake
> files but that only made it worse because of the incompatible automake
> versions.
> 
> Then I tried the current release cloned with git and I get errors like these
> 
> lustre-release/lnet/lnet/lib-socket.c:76:8: error: implicit
> declaration of function ‘kernel_sock_ioctl’; did you mean
> ‘lnet_sock_ioctl’? [-Werror=implicit-function-declaration]
>   rc = kernel_sock_ioctl(sock, cmd, arg);
> 
> lustre-release/lustre/llite/namei.c:977:15: error: ‘FILE_CREATED’
> undeclared (first use in this function); did you mean ‘FLD_CREATE’?
>*opened |= FILE_CREATED;
>   ^~~~
>   FLD_CREATE
> 
> lustre-release/lustre/llite/namei.c:988:10: error: too many arguments
> to function ‘finish_open’
> rc = finish_open(file, dentry, NULL, opened);
>  ^~~
> 
> I can't use the available precompiled versions because I need modules
> for my kernel version 4.19.0.
> 
> What can I do?  Any help will be highly appreciated.
> 
> Alejandro A. Sierra
> National Earth Observation Laboratory, Mexico
> http://www.lanot.unam.mx/
> _______
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre snapshots

2019-05-31 Thread Andreas Dilger
Yes, from the Lustre point of view, the snapshots are currently totally 
separate filesystems.

You would need to generate the automount map from "lctl snapshot_list".  
However, if you mounted a client on the MGS temporarily after creating the 
snapshot you could write the automount map into e.g. 
"/mnt/lustre/.snapshot/automountmap" so that it would immediately be visible to 
all clients mounting the filesystem, and immediately unmount the filesystem 
again.  Of course you could use some other mechanism to distribute it as well.
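A rough sketch of generating such a map on the MGS (the fsname, MGS NID,
output path, and the field names parsed out of the snapshot list are
assumptions and may differ between versions):

  mgs# lctl snapshot_list -F lustrefs |
         awk '/^snapshot_name:/   { name = $2 }
              /^snapshot_fsname:/ { print name, "-fstype=lustre,ro", "mgs@o2ib:/" $2 }' \
         > /mnt/lustre/.snapshot/automountmap

The important part is mapping each human-readable snapshot name to the
internal snapshot fsname (the hex value) that clients actually mount.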

Cheers, Andreas

On May 30, 2019, at 08:26, Hans Henrik Happe  wrote:
> 
> It's only the mapping from the snapshot name to hex value used for
> mounting I'm asking about. It's explained in the doc (30.3.3. Mounting a
> Snapshot).
> 
> The client cannot call 'lctl snapshot_list'. Guess the main fs and its
> snapshot fs' are so separated that putting the mappings in the client
> /proc structures of the main fs would become ugly.
> 
> We will just communicate client mount name through another channel.
> 
> Cheers,
> Hans Henrik
> 
> On 30/05/2019 10.05, Andreas Dilger wrote:
>> On May 30, 2019, at 01:50, Hans Henrik Happe  wrote:
>>> 
>>> Hi,
>>> 
>>> I've tested snapshots and they work as expected.
>>> 
>>> However, I'm wondering how the clients should mount without knowing the
>>> mount names of snapshots. As I see it there are two possibilities:
>>> 
>>> 1. Clients get ssh (limited) access to MGS (Don't want that).
>>> 2. The names are communicated through another channel. Perhaps, written
>>> to a file on the head Lustre filesystem or just directly to all clients
>>> that need snapshot mounting through ssh.
>>> 
>>> If there isn't a better way, I think number two is the way to go.
>> 
>> You could use automount to mount the snapshots on the clients, when they are 
>> needed.  The automount map could be created automatically from the snapshot 
>> list.
>> 
>> Probably it makes the most sense to limit snapshot access to a subset of 
>> nodes, such as user login nodes, so that users do not try to compute from 
>> the snapshot filesystems directly.
>> 
>>> Guess the limited length of Lustre fs names is preventing the use of the
>>> snapshots names directly?
>> 
>> If you rotate the snapshots like Apple Time Machine, you could use generic 
>> snapshot names like "last_month", "last_week", "yesterday", "6h_ago" and 
>> such and not have to update the automount map.  The filesystem names could 
>> be mostly irrelevant if the snapshot mountpoints are chosen properly, like 
>> "$MOUNT/.snapshot/last_month/" or similar.
>> 
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Principal Lustre Architect
>> Whamcloud
>> 

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre snapshots

2019-05-30 Thread Andreas Dilger
On May 30, 2019, at 01:50, Hans Henrik Happe  wrote:
> 
> Hi,
> 
> I've tested snapshots and they work as expected.
> 
> However, I'm wondering how the clients should mount without knowing the
> mount names of snapshots. As I see it there are two possibilities:
> 
> 1. Clients get ssh (limited) access to MGS (Don't want that).
> 2. The names are communicated through another channel. Perhaps, written
> to a file on the head Lustre filesystem or just directly to all clients
> that need snapshot mounting through ssh.
> 
> If there isn't a better way, I think number two is the way to go.

You could use automount to mount the snapshots on the clients, when they are 
needed.  The automount map could be created automatically from the snapshot 
list.

Probably it makes the most sense to limit snapshot access to a subset of nodes, 
such as user login nodes, so that users do not try to compute from the snapshot 
filesystems directly.

> Guess the limited length of Lustre fs names is preventing the use of the
> snapshots names directly?

If you rotate the snapshots like Apple Time Machine, you could use generic 
snapshot names like "last_month", "last_week", "yesterday", "6h_ago" and such 
and not have to update the automount map.  The filesystem names could be mostly 
irrelevant if the snapshot mountpoints are chosen properly, like 
"$MOUNT/.snapshot/last_month/" or similar.

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre under autofs

2019-05-28 Thread Andreas Dilger
On May 27, 2019, at 02:05, Youssef Eldakar  wrote:
> 
> In our Rocks cluster, we added entries to /etc/auto.share to conveniently 
> mount our Lustre file systems on all nodes:
> 
> lfs01 -fstype=lustre 192.168.230.238@o2ib:/lustrefs
> lfs02 -fstype=lustre 192.168.230.239@o2ib:/lustrefs
> 
> On login nodes, the lfs01 or lfs02 mount points would occasionally give "No 
> such file or directory," and attempting to statically mount them while at 
> that state would give "File exists." Rebooting the client node makes the 
> problem go away temporarily.
> 
> Are we perhaps missing some option in the auto.share entries above to prevent 
> that behavior from reoccurring?
> 
> Thanks in advance for any pointers.

What version of Lustre are you using?  I haven't heard of such problems, but 
I'm not sure how many users use automount either.  I know _some_ use it, since 
we had a bug open about it a few years ago, but haven't heard much since.

If you are using an old version (2.7.x or earlier), I'd recommend to upgrade at 
least the clients to a newer version (2.10.8, or maybe 2.12.2), and try with 
that.  If you are running a newer version, please file a ticket in Jira with 
details (/var/log/messages, Lustre kernel debug log when there is a problem).

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lfs migrate of hard links

2019-05-27 Thread Andreas Dilger
You may be able to just copy the lfs_migrate script from a 2.12.x client, I 
don't think it has any dependencies on kernel or lfs features in the 2.12 
release.

I would of course recommend that you test it is working properly on some test 
hard-linked files before running it on the whole filesystem.
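For example, a quick sanity check along those lines (paths are only
illustrative) could be:

  client# dd if=/dev/zero of=/mnt/lustre/hl_test bs=1M count=16
  client# ln /mnt/lustre/hl_test /mnt/lustre/hl_test.link
  client# ls -i /mnt/lustre/hl_test /mnt/lustre/hl_test.link   # note the inode numbers
  client# lfs_migrate -y /mnt/lustre/hl_test
  client# ls -i /mnt/lustre/hl_test /mnt/lustre/hl_test.link   # should still match
  client# md5sum /mnt/lustre/hl_test /mnt/lustre/hl_test.link  # data unchanged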

On May 27, 2019, at 19:02, Scott Wood  wrote:
> 
> Hi folks,
> 
> I am in the process of draining a dozen OSTs in a 2.10.3 environment (all 
> servers and clients) for replacement and it turns out they have many files 
> with multiple hard links to them.  We can't leave these files behind, as we 
> need to replace the OSTs, but we can't let them " be split into multiple 
> separate files" by lfs_migrate as that would break things.
> 
> From my reading of the changelogs and the new lfs_migrate script, it seems 
> that these are now handled more elegantly.  If I were to upgrade the lustre 
> client to 2.12.1 on a dozen clients, would the new client side lfs_migrate in 
> 2.12.1 work with 2.10.3 servers, or will I need a system wide outage to 
> upgrade all clients and servers?
> 
> Cheers!
> Scott

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] FLR mirroring on 2.12.1-1

2019-05-22 Thread Andreas Dilger
This is definitely not the intended behavior.  Could you please file a ticket 
in Jira.

Cheers, Andreas

On May 20, 2019, at 07:20, John Doe  wrote:
> 
> It turns out that the read eventually finished and was 1/10th of the 
> performance that I was expecting. 
> 
> As ost idx 1 is unavailable, the client read has to timeout on ost idx 1 and 
> then will read from ost idx 7. This happens for each 1MB block, as I am using 
> that as the block size. 
> 
> Is there a tunable to avoid this issue?
> 
> lfs check osts also takes about 30 seconds as it times out on the unavailable 
> OST. 
> 
> Due to this issue, I am virtually unable to use the mirroring feature. 
> 
> 
> On Sun, May 19, 2019 at 4:27 PM John Doe  wrote:
>> After mirroring a file , when one mirror is down, any reads from a client 
>> just hangs. Both server and client are running latest 2.12.1-1. Client waits 
>> for ost idx 1 to come back online.  I am only unmounting ost idx1 not ost 
>> idx 7.
>> 
>> Has anyone tried this feature? 
>> 
>> Thanks,
>> John.
>> 
>> lfs getstripe mirror10
>> mirror10
>>   lcm_layout_gen:5
>>   lcm_mirror_count:  2
>>   lcm_entry_count:   2
>> lcme_id: 65537
>> lcme_mirror_id:  1
>> lcme_flags:  init
>> lcme_extent.e_start: 0
>> lcme_extent.e_end:   EOF
>>   lmm_stripe_count:  1
>>   lmm_stripe_size:   1048576
>>   lmm_pattern:   raid0
>>   lmm_layout_gen:0
>>   lmm_stripe_offset: 1
>>   lmm_pool:  01
>>   lmm_objects:
>>   - 0: { l_ost_idx: 1, l_fid: [0x10001:0x280a8:0x0] }
>> 
>> lcme_id: 131074
>> lcme_mirror_id:  2
>> lcme_flags:  init
>> lcme_extent.e_start: 0
>> lcme_extent.e_end:   EOF
>>   lmm_stripe_count:  1
>>   lmm_stripe_size:   1048576
>>   lmm_pattern:   raid0
>>   lmm_layout_gen:0
>>   lmm_stripe_offset: 7
>>   lmm_pool:  02
>>   lmm_objects:
>>   - 0: { l_ost_idx: 7, l_fid: [0x10007:0x28066:0x0] }
> 

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre kernel module compatibility

2019-05-22 Thread Andreas Dilger
On May 21, 2019, at 10:15, Tauferner, Andrew T  
wrote:
> 
> Is it okay to run Lustre 2.12.x kernel modules on clients that are part of a 
> 2.10.x Lustre installation?  What I mean is that the servers and the clients 
> have installed version 2.10.x RPMs.  For what it’s worth, this seems to work 
> okay.

Lustre releases are tested against the next higher/lower release for 
clients/servers, as well as the previous LTS release.  In this case, 2.10 and 
2.12 are expected to work together without issues.  While it is possible for a 
wider range of versions to interoperate (e.g. some people have tested/used 2.5 
and 2.10 together), we do not test this due to limits on our testing resources.

> The alternative is to keep using the 2.6.99 vintage Lustre kernel modules 
> that come with kernel.org kernels with the 2.10.x software.

The in-kernel version is not really tested and is not recommended for use.

> I’d really like to get all Lustre code to the 2.12 version but I’m not sure 
> that I can convince our cluster administrator of that.  Thanks.

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Unable to mount client with 56 MDSes and beyond

2019-05-22 Thread Andreas Dilger
Scott, if you haven't already done so, it is probably best to file a ticket in 
Jira with the details.  Please include the client syslog/dmesg as well as a 
Lustre debug log ("lctl dk /tmp/debug") so that the problem can be isolated.

During DNE development we tested with up to 128 MDTs in AWS, but haven't tested 
that many MDTs in some time.

Cheers, Andreas

On May 8, 2019, at 12:28, White, Scott F  wrote:
> 
> We’ve been testing DNE Phase II and tried scaling the number of MDSes (one MDT 
> each for all of our tests) very high, but when we did that, we couldn’t mount 
> the filesystem on a client.  After trial and error, we discovered that we 
> were unable to mount the filesystem when there were 56 MDSes. 55 MDSes 
> mounted without issue, and it appears any number below that will mount. This 
> failure at 56 MDSes was replicable across different nodes being used for the 
> MDSes, all of which were tested with working configurations, so it doesn’t 
> seem to be a bad server.
>  
> Here’s the error info we saw in dmesg on the client:
>  
> LustreError: 28880:0:(obd_config.c:559:class_setup()) setup 
> lustre-MDT0037-mdc-95923d31b000 failed (-16)
> LustreError: 28880:0:(obd_config.c:1836:class_config_llog_handler()) 
> MGCx.x.x.x@o2ib: cfg command failed: rc = -16
> Lustre:cmd=cf003 0:lustre-MDT0037-mdc  1:lustre-MDT0037_UUID  
> 2:x.x.x.x@o2ib
> LustreError: 15c-8: MGCx.x.x.x@o2ib: The configuration from log 
> 'lustre-client' failed (-16). This may be the result of communication errors 
> between this node and the MGS, a bad configuration, or other errors. See the 
> syslog for more information.
> LustreError: 28858:0:(obd_config.c:610:class_cleanup()) Device 58 not setup
> Lustre: Unmounted lustre-client
> LustreError: 28858:0:(obd_mount.c:1608:lustre_fill_super()) Unable to mount  
> (-16)
>  
> OS: CentOS 7.6.1810 
> Kernel: 3.10.0-957.5.1.el7.x86_64
> Lustre: 2.12.1
> Network card: Qlogic InfiniPath_QLE7340
>  
> Other things to note for completeness’ sake: this happened with both ldiskfs 
> and zfs backfstypes, and these tests were using files in memory as the 
> backing devices.
>  
> Is there something I’m missing as to why more than 56 MDSes won’t mount?
>  
> Thanks,
> Scott White
> Scientist, HPC 
> Los Alamos National Laboratory
>  
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Setting infinite grace period with soft quotas

2019-05-09 Thread Andreas Dilger
Cameron, could you please file an LU ticket for that so it isn't lost. Thanks. 

Cheers, Andreas

> On May 9, 2019, at 08:55, Harr, Cameron  wrote:
> 
> Thanks Andreas. This is somewhat interesting in that I don't have such a 
> man page on the systems and 'man lfs' doesn't have such a statement. 
> But, certainly, 2^48 sec. would be sufficient! Our workaround was just 
> to use a -t XXXw to specify a large number of weeks (largest granularity 
> I could see) for the grace period, but having a '-1' shortcut would be 
> preferable.
> 
>> On 5/8/19 4:43 PM, Andreas Dilger wrote:
>> I do see in the lfs setquota usage message and lfs-setquota.1 man page:
>> 
>> "The maximum quota grace time is 2^48 - 1 seconds."
>> 
>> That's about 9M years, so it should probably be long enough?  It might
>> make sense to map "-1" internally to "(1 << 48) - 1" to make this easier.
>> 
>>> On May 8, 2019, at 17:18, Harr, Cameron  wrote:
>>> I had tested first and couldn't find a way to do so, so I was curious if
>>> there was some undocumented way. I'm proceeding with, "No, there's not a
>>> way."
>>> 
>>>> On 5/6/19 12:52 PM, Andreas Dilger wrote:
>>>>> On Apr 11, 2019, at 11:02, Harr, Cameron  wrote:
>>>>> We're exploring an idea where we keep soft quotas enabled so that users
>>>>> will be notified they're nearing their hard quotas (via in-house
>>>>> scripts), but users don't like that the soft quota becomes a hard block
>>>>> after the grace period. I can understand their rationale as well that
>>>>> they should be able to write up to their hard quota always.
>>>>> 
>>>>> Is there a way to set the grace period as unlimited (e.g. lfs setquota
>>>>> -t -1 ...)?
>>>> Judging by the lack of response, I don't think anyone has tried this, but
>>>> it also seems like something that could be tested quite easily?
>>>> 
>>>> Cheers, Andreas
>>>> --
>>>> Andreas Dilger
>>>> Principal Lustre Architect
>>>> Whamcloud
>>>> 
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Principal Lustre Architect
>> Whamcloud
>> 
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Setting infinite grace period with soft quotas

2019-05-08 Thread Andreas Dilger
I do see in the lfs setquota usage message and lfs-setquota.1 man page:

"The maximum quota grace time is 2^48 - 1 seconds."

That's about 9M years, so it should probably be long enough?  It might
make sense to map "-1" internally to "(1 << 48) - 1" to make this easier.
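In the meantime, a very long grace period is effectively unlimited; a sketch
of that workaround (the values and mount point are examples):

  # lfs setquota -t -u --block-grace 520w --inode-grace 520w /mnt/lustre
  # lfs quota -t -u /mnt/lustre      # verify the new grace times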

On May 8, 2019, at 17:18, Harr, Cameron  wrote:
> 
> I had tested first and couldn't find a way to do so, so I was curious if 
> there was some undocumented way. I'm proceeding with, "No, there's not a 
> way."
> 
> On 5/6/19 12:52 PM, Andreas Dilger wrote:
>> On Apr 11, 2019, at 11:02, Harr, Cameron  wrote:
>>> We're exploring an idea where we keep soft quotas enabled so that users
>>> will be notified they're nearing their hard quotas (via in-house
>>> scripts), but users don't like that the soft quota becomes a hard block
>>> after the grace period. I can understand their rationale as well that
>>> they should be able to write up to their hard quota always.
>>> 
>>> Is there a way to set the grace period as unlimited (e.g. lfs setquota
>>> -t -1 ...)?
>> Judging by the lack of response, I don't think anyone has tried this, but
>> it also seems like something that could be tested quite easily?
>> 
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Principal Lustre Architect
>> Whamcloud
>> 

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] 2.10 <-> 2.12 interoperability?

2019-05-07 Thread Andreas Dilger
On May 3, 2019, at 15:35, Hans Henrik Happe  wrote:
> 
> On 03/05/2019 22.41, Andreas Dilger wrote:
>> On May 3, 2019, at 14:33, Patrick Farrell  wrote:
>>> 
>>> Thomas,
>>> 
>>> As a general rule, Lustre only supports mixing versions on servers for 
>>> rolling upgrades.
>>> 
>>> - Patrick
>> 
>> And only then between maintenance versions of the same release (e.g. 2.10.6
>> and 2.10.7).  If you are upgrading, say, 2.7.21 to 2.10.6 then you would need
>> to fail over half of the targets, upgrade half of the servers, fail back (at
>> which point all targets would be running on the same new version), upgrade 
>> the
>> other half of the servers, and then restore normal operation.
>> 
>> There is also a backport of the LU-11507 patch for 2.10 that could be used
>> instead of upgrading just one server to 2.12.
>> 
>> Cheers, Andreas
> 
> I think the documentation is quite clear:
> 
> http://doc.lustre.org/lustre_manual.xhtml#upgradinglustre
> 
> An upgrade path for major releases on the servers would be nice, though.
> Wonder if this could be done with a mode where clients flush all they
> got and are put into a blocking mode. I guess the hard part would be to
> re-negotiate all the state after the upgrade, which is hard enough for
> regular replays.

At one time there was some discussion and initial design of a
feature "Simplified Interoperability", which would allow servers to
signal the clients to pause IO operations and go into a "holding state"
while an upgrade is being done.

This has unfortunately not been an implementation priority for anyone
to date, but maybe it can be raised as an issue during the OpenSFS member
survey.

Cheers, Andreas

>>> On Wednesday, April 24, 2019 3:54:09 AM, Thomas Roth  wrote:
>>>> 
>>>> Hi all,
>>>> 
>>>> OS=CentOS 7.5
>>>> Lustre 2.10.6
>>>> 
>>>> One of the OSS (one OST only) was upgraded to zfs 0.7.13, and LU-11507 
>>>> forced an upgrade of Lustre to 2.12
>>>> 
>>>> Mounts, reconnects, recovers, but then is unusable, and the MDS reports:
>>>> 
>>>> Lustre: 13650:0:(mdt_handler.c:5350:mdt_connect_internal()) test-MDT: 
>>>> client
>>>> test-MDT-lwp-OST0002_UUID does not support ibits lock, either very old 
>>>> or an invalid client: flags
>>>> 0x204140104320
>>>> 
>>>> 
>>>> So far I have not found any hints that these versions would not cooperate, 
>>>> or that I should have set a
>>>> certain parameter.
>>>> LU-10175 indicates that the ibits have some connection to data-on-mdt 
>>>> which we don't use.
>>>> 
>>>> Any suggestions?
>>>> 
>>>> 
>>>> Regards,
>>>> Thomas
>> --
>> Andreas Dilger
>> Principal Lustre Architect
>> Whamcloud
>> 
>> ___
>> lustre-discuss mailing list
>> lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Limit client side caching?

2019-05-07 Thread Andreas Dilger
On Apr 22, 2019, at 10:35, Nehring, Shane R [LAS]  wrote:
> 
> Hello
> 
> Is there a way to limit the client side caches to a specific size?
> We're occasionally seeing panics due to failed allocations in low
> available memory conditions in our environment. We limit user
> allocations with cgroups through slurm, but it would be handy to
> know exactly how much ram to reserve for system use.

You can limit the size of the client cache via the "max_dirty_mb" and 
"max_cached_mb" tunables.

There is an aggregate "max_dirty_mb" for the whole client, which is the total 
amount of unwritten/in-flight dirty data, and a separate "osc.*.max_dirty_mb" 
tunable (per-OST value) if you want to tune the maximum amount of unwritten 
data differently for multiple client mountpoints.

The "llite.*.max_cached_mb" is the amount of cached (dirty+clean) cached data 
for the filesystem.  By default this is 3/4 of RAM.
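For example (the sizes below are placeholders, not recommendations), both
limits can be set on a client with:

  client# lctl set_param llite.*.max_cached_mb=16384   # total cached (clean+dirty) data
  client# lctl set_param osc.*.max_dirty_mb=512        # unwritten dirty data per OSC
  client# lctl get_param llite.*.max_cached_mb osc.*.max_dirty_mb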

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Setting infinite grace period with soft quotas

2019-05-06 Thread Andreas Dilger
On Apr 11, 2019, at 11:02, Harr, Cameron  wrote:
> 
> We're exploring an idea where we keep soft quotas enabled so that users 
> will be notified they're nearing their hard quotas (via in-house 
> scripts), but users don't like that the soft quota becomes a hard block 
> after the grace period. I can understand their rationale as well that 
> they should be able to write up to their hard quota always.
> 
> Is there a way to set the grace period as unlimited (e.g. lfs setquota 
> -t -1 ...)?

Judging by the lack of response, I don't think anyone has tried this, but
it also seems like something that could be tested quite easily?

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] inotify

2019-05-06 Thread Andreas Dilger
On Apr 23, 2019, at 11:16, Faaland, Olaf P.  wrote:
> 
> Hi,
> 
>> Does inotify work on Lustre and if so, are there any caveats (performance, 
>> functionality, or otherwise)? 

AFAIK, inotify does *not* work correctly on Lustre, because it is purely a
local VFS mechanism.  That means it _might_ work for a small number of files
that are accessed/cached locally on the client, but definitely the client
would not be notified of changes that are happening on other clients.

Hooking inotify into the Lustre Changelog mechanism might be possible, so
that _select_ clients which need full inotify functionality could be added
as changelog consumers, and Changelog records would be mapped to inotify
events, but I think there would be a very significant overhead if a large
number of clients were all trying to be notified of every event in the whole
filesystem...
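For reference, a minimal sketch of a changelog consumer (the fsname/MDT name
and reader id are examples) looks like:

  mds#    lctl --device lustre-MDT0000 changelog_register      # prints e.g. cl1
  client# lfs changelog lustre-MDT0000                         # dump pending records
  client# lfs changelog_clear lustre-MDT0000 cl1 0             # acknowledge consumed records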

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] stat

2019-05-06 Thread Andreas Dilger
It would be useful to add an llapi_ function for this. In connection with LSOM 
the client will also be able to get the approximate file size, once 
https://jira.whamcloud.com/browse/LU-11367 is landed.

Cheers, Andreas

On May 1, 2019, at 09:35, Nathaniel Clark  wrote:

You should look at the IOC_MDC_GETFILEINFO ioctl.  "Example" usage can
be found in
lustre-release/lustre/utils/liblustreapi.c::get_lmd_info_fd().

There currently isn't a convenient liblustre API call for it.  It
doesn't poll the OSTs, it just grabs info from the MDT, so it will do
fewer RPCs.

On 4/24/19 12:05 PM, Harms, Kevin wrote:
  Does Lustre provide an optimized stat("filename", ...) that requires fewer 
RPCs than fd=open("filename", ...); stat(fd); ? If so, are there any 
descriptions of this optimization?

thanks,
kevin
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

--
Nathaniel Clark
Senior Software Engineer
Whamcloud / DDN

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] 2.10 <-> 2.12 interoperability?

2019-05-03 Thread Andreas Dilger
On May 3, 2019, at 14:33, Patrick Farrell  wrote:
> 
> Thomas,
> 
> As a general rule, Lustre only supports mixing versions on servers for 
> rolling upgrades.
> 
> - Patrick

And only then between maintenance versions of the same release (e.g. 2.10.6
and 2.10.7).  If you are upgrading, say, 2.7.21 to 2.10.6 then you would need
to fail over half of the targets, upgrade half of the servers, fail back (at
which point all targets would be running on the same new version), upgrade the
other half of the servers, and then restore normal operation.

There is also a backport of the LU-11507 patch for 2.10 that could be used
instead of upgrading just one server to 2.12.

Cheers, Andreas

> On Wednesday, April 24, 2019 3:54:09 AM, Thomas Roth  wrote:
>>  
>> Hi all,
>> 
>> OS=CentOS 7.5
>> Lustre 2.10.6
>> 
>> One of the OSS (one OST only) was upgraded to zfs 0.7.13, and LU-11507 
>> forced an upgrade of Lustre to 2.12
>> 
>> Mounts, reconnects, recovers, but then is unusable, and the MDS reports:
>> 
>> Lustre: 13650:0:(mdt_handler.c:5350:mdt_connect_internal()) test-MDT: 
>> client
>> test-MDT-lwp-OST0002_UUID does not support ibits lock, either very old 
>> or an invalid client: flags
>> 0x204140104320
>> 
>> 
>> So far I have not found any hints that these versions would not cooperate, 
>> or that I should have set a
>> certain parameter.
>> LU-10175 indicates that the ibits have some connection to data-on-mdt which 
>> we don't use.
>> 
>> Any suggestions?
>> 
>> 
>> Regards,
>> Thomas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] State of arm client?

2019-04-25 Thread Andreas Dilger
The older Pi boards are also 64-bit CPUs, but the problem is that Raspbian is 
only compiled with 32-bit kernels.  I was recently testing this, and for 
Raspbian you will need at least the tip of b2_12, or 2.12.1 in order to compile.

I compiled 2.12.1-rc on my 32-bit Raspbian. This mostly works, but there is a 
bug in readdir() handling for large dirs, which means that it will return the 
same entries over and over again, but eventually completes (possibly minutes 
later).  This is caused by userspace calling readdir64() on a 32-bit kernel and 
Lustre is confused with its 32-bit compatibility.

I haven't had any chance to look into this in more detail, especially since I 
don't think anyone else is running 32-bit clients. So, in summary, Raspbian is 
_almost_ usable, but using a 64-bit kernel is just a lot easier.

Cheers, Andreas

On Apr 25, 2019, at 15:46, Patrick Farrell  wrote:

Also, you’ll need (I think?) fairly new Pis - Lustre only supports ARM64 and 
older ones were 32 bit.

- Patrick

From: lustre-discuss  on behalf of Peter Jones 
Sent: Wednesday, April 24, 2019 11:08:38 PM
To: Andrew Elwell; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] State of arm client?

Andrew

You will need to use 2.12.x (and 2.12.1 is in final release testing so would be 
a good bet if you can wait a short while)

Peter

From: lustre-discuss  on behalf of Andrew Elwell 
Date: Wednesday, April 24, 2019 at 7:31 PM
To: "lustre-discuss@lists.lustre.org" 
Subject: [lustre-discuss] State of arm client?

Hi folks,

I remember seeing a press release by DDN/Whamcloud last November that they were 
going to support ARM, but can anyone point me to the current state of client?

I'd like to deploy it onto a raspberry pi cluster (only 4-5 nodes) ideally on 
raspbian for demo / training purposes. (Yes I know it won't *quite* be 
infiniband performance, but as it's hitting a VM based set of lustre servers, 
that's the least of my worries). Ideally 2.10.x, but I'd take a 2.12 client if 
it can talk to 2.10.x servers

Many thanks

Andrew

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] PFL not working on 2.10 client

2019-04-23 Thread Andreas Dilger
Rick,
Does this still fail with 2.10.1 or a later client?  It may just be a bug in 
"lfs" or the client, not an interop problem per-se. If it doesn't fail with a 
newer client then it probably isn't worthwhile to track down. 

If you _really_ need to get this working with the 2.10.0 client you could try 
the "lfs" binary (and possibly liblustreapi.so) from a newer client. Otherwise, 
upgrading the kernel modules to 2.10.0+patch is not much different than 
upgrading to 2.10.6 or 2.10.7 to fix this problem. 

Cheers, Andreas

On Apr 22, 2019, at 15:15, Mohr Jr, Richard Frank  wrote:
> 
> 
> I was trying to play around with some PFL layout today, and I ran into an 
> issue.  I have a file system running Lustre 2.10.6 and a client with 2.10.0 
> installed.  I created a PFL with this command:
> 
> [rfmohr@sip-login1 rfmohr]$ lfs setstripe -E 4M -c 2 -E 100M -c 4 comp_file
> 
> It did not return any errors, so I tried to query the layout:
> 
> [rfmohr@sip-login1 rfmohr]$ lfs getstripe comp_file
> comp_file has no stripe info
> 
> And if I write any data to it, I end up with a file that uses the system’s 
> default stripe count:
> 
> [rfmohr@sip-login1 rfmohr]$ dd if=/dev/zero of=comp_file bs=1M count=50
> 50+0 records in
> 50+0 records out
> 52428800 bytes (52 MB) copied, 0.0825892 s, 635 MB/s
> 
> [rfmohr@sip-login1 rfmohr]$ lfs getstripe comp_file
> comp_file
> lmm_stripe_count:  1
> lmm_stripe_size:   1048576
> lmm_pattern:   1
> lmm_layout_gen:0
> lmm_stripe_offset: 3
>obdidx objid objid group
> 3265665  0x40dc1 0
> 
> I could not find a JIRA ticket that looked similar to this. Is this a known 
> bug?  Or some odd interop issue?  When I tried the command on another file 
> system that uses 2.10.3 on the servers and clients, I got the expected 
> behavior:
> 
> -bash-4.2$ lfs setstripe -E 4M -c 2 -E 64M -c 4 comp_file
> 
> -bash-4.2$ lfs getstripe comp_file
> comp_file
>  lcm_layout_gen:  2
>  lcm_entry_count: 2
>lcme_id: 1
>lcme_flags:  init
>lcme_extent.e_start: 0
>lcme_extent.e_end:   4194304
>  lmm_stripe_count:  2
>  lmm_stripe_size:   1048576
>  lmm_pattern:   1
>  lmm_layout_gen:0
>  lmm_stripe_offset: 6
>  lmm_objects:
>  - 0: { l_ost_idx: 6, l_fid: [0x10006:0x8f84d:0x0] }
>  - 1: { l_ost_idx: 7, l_fid: [0x10007:0x8f72d:0x0] }
> 
>lcme_id: 2
>lcme_flags:  0
>lcme_extent.e_start: 4194304
>lcme_extent.e_end:   67108864
>  lmm_stripe_count:  4
>  lmm_stripe_size:   1048576
>  lmm_pattern:   1
>  lmm_layout_gen:65535
>  lmm_stripe_offset: -1
> 
> --
> Rick Mohr
> Senior HPC System Administrator
> National Institute for Computational Sciences
> http://www.nics.tennessee.edu
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] inodes not adding up

2019-04-18 Thread Andreas Dilger
On Apr 15, 2019, at 12:56, Mohr Jr, Richard Frank (Rick Mohr)  
wrote:
> 
> 
>> On Apr 13, 2019, at 4:57 AM, Youssef Eldakar  
>> wrote:
>> 
>> For one Lustre filesystem, inode count in the summary is notably less than 
>> what the individual OST inode counts would add up to:
> 
> The first thing to understand is that every Lustre file will consume one 
> inode on the MDT, and this inode uses attributes to store information about 
> which OSTs the file is striped over.  Then for each file stripe, there will 
> also be an inode consumed on the corresponding OSTs.  For example, a file 
> with stripe_count=4 will consume one inode on the MDT and four inodes on OSTs 
> (one inode on each OST the file is striped over).
> 
>> # lfs df -i /lfs01
>> UUID                      Inodes     IUsed       IFree IUse% Mounted on
>> lustrefs-MDT_UUID     2402287616  46560885  2355726731    2% /share/lfs01[MDT:0]
>> lustrefs-OST0001_UUID   24117248  22883788     1233460   95% /share/lfs01[OST:1]
>> lustrefs-OST0003_UUID   24117248  22903308     1213940   95% /share/lfs01[OST:3]
>> lustrefs-OST0004_UUID   24117248  22895442     1221806   95% /share/lfs01[OST:4]
>> lustrefs-OST0006_UUID   24117248  22890201     1227047   95% /share/lfs01[OST:6]
>> 
>> filesystem_summary:     51457138  46560885     4896253   90% /share/lfs01
> 
> On this file system, there are already 46,560,885 files which also consume 
> the same number of inodes on the MDT (so IUsed=46560885).  However, even 
> though the MDT has over 2 billion inodes free, every file created in the 
> future will use at least one inode on an OST.  If you add up all the free 
> inodes on all the OSTs, you get 4896253.  So at best, there is only space for 
> 4,896,253 more files.  That is why IFree=4896253.  Then, Inodes = IUsed + 
> IFree = 46,560,885 + 4,896,253 = 51,457,138.
> 
>> On another filesystem, this is not the case:
>> 
>> # lfs df -i /lfs02
>> UUID                      Inodes     IUsed       IFree IUse% Mounted on
>> lustrefs-MDT_UUID     1288503296  19222318  1269280978    1% /share/lfs02[MDT:0]
>> lustrefs-OST0001_UUID   24117248   5942156    18175092   25% /share/lfs02[OST:1]
>> lustrefs-OST0002_UUID   24117248   5816469    18300779   24% /share/lfs02[OST:2]
>> lustrefs-OST0003_UUID   24117248   5982962    18134286   25% /share/lfs02[OST:3]
>> 
>> filesystem_summary:     73832475  19222318    54610157   26% /share/lfs02
> 
> Again, there are already 19,222,318 files on the file system, so 
> IUsed=19222318.   All the OSTs together only have 18,175,092 + 18,300,779 + 
> 18,134,286 = 54,610,157 inodes available, so IFree=54610157.  And Inodes = 
> IUsed + IFree = 73832475.

Thanks to Rick for the good explanation here.  One thing to add is that it
appears that the /lfs01 filesystem has a default stripe_count=2, since there
are 46560885 inodes used on MDT and 91572739 total objects used on the four
OSTs, and 91572739/46560885 = 1.96 OST objects per MDT inode.

If you have a large number of small files, you don't need a high stripe count.

Cheers, Andreas
---
Andreas Dilger
Principal Lustre Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lfsck repair quota

2019-04-17 Thread Andreas Dilger
> 
> =
> Fernando Pérez
> Institut de Ciències del Mar (CMIMA-CSIC)
> Departament Oceanografía Física i Tecnològica
> Passeig Marítim de la Barceloneta,37-49
> 08003 Barcelona
> Phone:  (+34) 93 230 96 35
> =
> 
>> El 16 abr 2019, a las 20:43, Martin Hecht  escribió:
>> 
>> Are there a lot of inodes moved to lost+found by the fsck, which contribute 
>> to the occupied quota now?
>> 
>> - Original Message -
>> From: Fernando Pérez 
>> To: lustre-discuss@lists.lustre.org
>> Sent: Tue, 16 Apr 2019 16:24:13 +0200 (CEST)
>> Subject: Re: [lustre-discuss] lfsck repair quota
>> 
>> Thank you Rick.
>> 
>> I followed these steps for the ldiskfs OSTs and MDT, but the quotas for all 
>> users are more corrupted than before.
>> 
>> I tried to run e2fsck on the ldiskfs OSTs and MDT, but the problem was that the MDT 
>> e2fsck ran very slowly (10 inodes per second for more than 100 million 
>> inodes).
>> 
>> According to the lustre wiki I thought that the lfsck could repair corrupted 
>> quotas:
>> 
>> http://wiki.lustre.org/Lustre_Quota_Troubleshooting
>> 
>> Regards.
>> 
>> 
>> Fernando Pérez
>> Institut de Ciències del Mar (CSIC)
>> Departament Oceanografía Física i Tecnològica
>> Passeig Marítim de la Barceloneta,37-49
>> 08003 Barcelona
>> Phone:  (+34) 93 230 96 35
>> 
>> 
>>> El 16 abr 2019, a las 15:34, Mohr Jr, Richard Frank (Rick Mohr) 
>>>  escribió:
>>> 
>>> 
>>>> On Apr 15, 2019, at 10:54 AM, Fernando Perez  wrote:
>>>> 
>>>> Could anyone confirm that the correct way to repair wrong quotas in a 
>>>> ldiskfs mdt is lctl lfsck_start -t layout -A?
>>> 
>>> As far as I know, lfsck doesn’t repair quota info. It only fixes internal 
>>> consistency within Lustre.
>>> 
>>> Whenever I have had to repair quotas, I just follow the procedure you did 
>>> (unmount everything, run “tune2fs -O ^quota ”, run “tune2fs -O quota 
>>> ”, and then remount).  But all my systems used ldiskfs, so I don’t 
>>> know if the ZFS OSTs introduce any sort of complication.  (Actually, I am 
>>> not even sure if/how you can regenerate quota info for ZFS.)
>>> 
>>> --
>>> Rick Mohr
>>> Senior HPC System Administrator
>>> National Institute for Computational Sciences
>>> http://www.nics.tennessee.edu
>>> 
>> 
> 

Cheers, Andreas
---
Andreas Dilger
CTO Whamcloud




___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lfsck repair quota

2019-04-16 Thread Andreas Dilger
On Apr 16, 2019, at 07:34, Mohr Jr, Richard Frank (Rick Mohr)  
wrote:
> 
>> On Apr 15, 2019, at 10:54 AM, Fernando Perez  wrote:
>> 
>> Could anyone confirm that the correct way to repair wrong quotas in a 
>> ldiskfs mdt is lctl lfsck_start -t layout -A?
> 
> As far as I know, lfsck doesn’t repair quota info. It only fixes internal 
> consistency within Lustre.
> 
> Whenever I have had to repair quotas, I just follow the procedure you did 
> (unmount everything, run “tune2fs -O ^quota ”, run “tune2fs -O quota 
> ”, and then remount).  But all my systems used ldiskfs, so I don’t know 
> if the ZFS OSTs introduce any sort of complication.  (Actually, I am not even 
> sure if/how you can regenerate quota info for ZFS.)

Note that using "tune2fs -O ^quota" to repair the quota accounting is a bit of a 
"cargo cult" behaviour.

Running "e2fsck -fp" is the proper way to repair the quota files, since it not 
only recalculates the quota accounting using the same code as "tune2fs -O 
quota" does, but it also ensures that the files themselves are valid in the 
first place.  They should take about the same time, except in the case your 
filesystem is corrupted, in which case you'd want e2fsck to repair the 
filesystem anyway.
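A sketch of that procedure for one target (the device and mount point names
are examples):

  # umount /mnt/lustre-mdt0
  # e2fsck -fp /dev/mapper/mdt0     # checks the filesystem and rebuilds the quota accounting
  # mount -t lustre /dev/mapper/mdt0 /mnt/lustre-mdt0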

Cheers, Andreas
---
Andreas Dilger
Principal Lustre Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lfsck repair quota

2019-04-16 Thread Andreas Dilger
On Apr 16, 2019, at 09:22, Mohr Jr, Richard Frank (Rick Mohr)  
wrote:
> 
>> On Apr 16, 2019, at 10:24 AM, Fernando Pérez  wrote:
>> 
>> According to the lustre wiki I thought that the lfsck could repair corrupted 
>> quotas:
>> 
>> http://wiki.lustre.org/Lustre_Quota_Troubleshooting
> 
> Keep in mind that page is a few years old, but I assume they were referring 
> to LFSCK Phase 2 
> (http://wiki.lustre.org/LFSCK_Phase_2_-_MDT-OST_Consistency_Solution_Architecture)
>  which maintains consistency between the MDTs and OSTs.  One of the things 
> that lfsck will do is make sure that the ownership info for a file’s MDT 
> object matches the ownership of the OST objects for that file.  This is 
> necessary to ensure that quota information reported by Lustre is accurate, 
> but I don’t believe it is meant to fix any corruption in the quota files 
> themselves.

Correct, LFSCK will repair the OST object ownership, but does not *directly* 
repair quota
files. That said, if an object is owned by the wrong owner for some reason, 
changing the
ownership _should_ transfer the quota usage properly to the new owner, so in 
that regard
the quota usage should be indirectly repaired by an LFSCK run.

Cheers, Andreas
---
Andreas Dilger
Principal Lustre Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] LNET Conf Advise and Rearchitecting

2019-04-04 Thread Andreas Dilger
> 
> Thanks for your help in advance!
> 
> -Paul Edmon-
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
---
Andreas Dilger
Principal Lustre Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] EINVAL error when writing to a PFL file (lustre 2.12.0)

2019-03-29 Thread Andreas Dilger
Thomas,
there _are_ potential use cases for having incomplete layouts now and in 
the future:
- limiting the size of files (e.g. logs), so that they don't exceed a set limit
- partial HSM restore or FLR mirrors for very large files on fast/local storage

The append problem is something we are aware of and will hopefully have a fix 
for soon (patch https://review.whamcloud.com/34425 "LU-9341 llite: Do not lock 
the world for append" if you want to test it and provide some feedback).

Another thought I just had while re-reading LU-9341 is whether it would be 
better to have the MDS always create files opened with O_APPEND with 
stripe_count=1?  There is no write parallelism for O_APPEND files, so having 
multiple stripes doesn't help the writer.  Because the writer also always locks 
the whole file [0,EOF] then there is no read-write parallelism either, so 
creating only a single file stripe simplifies things significantly with no real 
loss.

Minor complications include ensuring the file is created in the right OST pool 
(e.g. use the pool from the first component of the original layout of a PFL 
file), what to do if appending to a PFL+DoM file (instantiate a 1-stripe 
component to EOF?), and what to do for size-limited PFL files (keep current 
behaviour and the (hopefully 1-2) existing components?).

Cheers, Andreas

On Feb 22, 2019, at 10:09, LEIBOVICI Thomas  wrote:
> 
> Hello Patrick,
> 
> Thank you for the quick reply.
> No, I have no particular use-case in mind, I'm just playing around with PFL.
> 
> If this is currently not properly supported, a quick fix could be to prevent 
> the user from creating such incomplete layouts?
> 
> Regards,
> Thomas
> 
> On 2/22/19 5:33 PM, Patrick Farrell wrote:
>> Thomas,
>> 
>> This is expected, but it's also something we'd like to fix - See LU-9341.
>> 
>> Basically, append tries to instantiate the layout from 0 to infinity, and it 
>> fails because your layout is incomplete (ie doesn't go to infinity).
>> 
>> May I ask why you're creating a file with an incomplete layout?  Do you have 
>> a use case in mind?
>> 
>> - Patrick
>> From: lustre-discuss  on behalf of 
>> LEIBOVICI Thomas 
>> Sent: Friday, February 22, 2019 10:27:48 AM
>> To: lustre-discuss@lists.lustre.org
>> Subject: [lustre-discuss] EINVAL error when writing to a PFL file (lustre 
>> 2.12.0)
>>  
>> Hello,
>> 
>> Is it expected to get an error when appending a PFL file made of 2 
>> regions [0 - 1M] and [1M to 6M]
>> even if writing in this range?
>> 
>> I get an error when appending it, even when writting in the very first 
>> bytes:
>> 
>> [root@vm0]# lfs setstripe  -E 1M -c 1 -E 6M -c 2 /mnt/lustre/m_fou3
>> 
>> [root@vm0]# lfs getstripe /mnt/lustre/m_fou3
>> /mnt/lustre/m_fou3
>>lcm_layout_gen:2
>>lcm_mirror_count:  1
>>lcm_entry_count:   2
>>  lcme_id: 1
>>  lcme_mirror_id:  0
>>  lcme_flags:  init
>>  lcme_extent.e_start: 0
>>  lcme_extent.e_end:   1048576
>>lmm_stripe_count:  1
>>lmm_stripe_size:   1048576
>>lmm_pattern:   raid0
>>lmm_layout_gen:0
>>lmm_stripe_offset: 3
>>lmm_objects:
>>- 0: { l_ost_idx: 3, l_fid: [0x10003:0x9cf:0x0] }
>> 
>>  lcme_id: 2
>>  lcme_mirror_id:  0
>>  lcme_flags:  0
>>  lcme_extent.e_start: 1048576
>>  lcme_extent.e_end:   6291456
>>lmm_stripe_count:  2
>>lmm_stripe_size:   1048576
>>lmm_pattern:   raid0
>>lmm_layout_gen:0
>>lmm_stripe_offset: -1
>> 
>> [root@vm0]# stat -c %s /mnt/lustre/m_fou3
>> 14
>> 
>> * append fails:
>> 
>> [root@vm0]# echo qsdkjqslkdjkj >> /mnt/lustre/m_fou3
>> bash: echo: write error: Invalid argument
>> 
>> # strace indicates that write() gets the error:
>> 
>> write(1, "qsdkjqslkdjkj\n", 14) = -1 EINVAL (Invalid argument)
>> 
>> * no error in case of an open/truncate:
>> 
>> [root@vm0]# echo qsdkjqslkdjkj > /mnt/lustre/m_fou3
>> 
>> OK
>> 
>> Is it expected or should I open a ticket?
>> 
>> Thomas
>> 
>> ___
>> lustre-discuss mailing list
>> lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
---
Andreas Dilger
CTO Whamcloud




___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] How often Log file get Updated

2019-03-25 Thread Andreas Dilger
On Mar 24, 2019, at 00:42, Masudul Hasan Masud Bhuiyan 
 wrote:
> 
> Hi, 
> I am trying to find how often the log in /proc/fs/lustre gets updated. I was 
> creating some files on a particular OST and observing the stats file for that 
> OST. But I noticed that although the file is being created properly, the "write_bytes" field 
> in the stats file is not changing. I created a ~25GB file but the total size 
> was still the same. That's why I was wondering how often the log file 
> gets updated?

This is not a log file, but rather a real-time display of the status, so if it 
is not being updated, then there are a few options that are possible:

- you are looking into the wrong stats file (e.g. different OST), as there are 
many different ones
- there is a bug in the code that prevents the "write_bytes" counter from being 
updated.
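As a concrete check (the target name is an example), the per-OST counters
live on the OSS serving that OST, e.g.:

  oss# lctl get_param obdfilter.lustre-OST0000.stats | grep write_bytes

so make sure you are reading the stats file on the right server and for the
OST that actually holds the file's objects ("lfs getstripe <file>" on a
client shows which OSTs those are).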

What version of Lustre are you running?

Cheers, Andreas
---
Andreas Dilger
Principal Lustre Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Disaster recover files from ZFS OSTs

2019-03-24 Thread Andreas Dilger
On Feb 16, 2019, at 16:05, Hans Henrik Happe  wrote:
> 
> Moving a system away from Gluster to Lustre there is one feature we
> miss. With Gluster the name space and data can easily be found on the
> underlying filesystems. While we never needed it with Lustre, it would
> be nice to have it as a last resort. Lustre has been rock solid, while
> we have needed it plenty on Gluster.
> 
> Looking at the output of 'getstripe' I figured out how to find files
> using the objids mod 128 (that was how many dX dirs I found). Easy
> with stripe-count=1, probably also with higher counts.
> 
> So given a backup of MDTs we should be able to do some poking around. We
> could also do a database of getstripe info. Perhaps robinhood could help.
> 
> Is there some formal description of Lustre object layout? It seems
> simple but I'm wondering if there are pitfalls. PFL seems to be pretty
> well described.
> 
> Perhaps there are already tools for this, that we have missed?

As you have already seen, the MDT is essentially a directory tree of all
the files in the filesystem, and the OST objects listed in the LOV EA
contain the file data.  If there are multiple stripes, the data is written round-
robin across the objects in stripe_size chunks.  For PFL, each component
is a regular file layout.
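As a small illustration of the mapping (the file name is an example, and the
output is modelled on the getstripe format shown elsewhere in this list; the
modulus matches the 128 "dN" bucket directories you found on your ZFS OSTs,
while ldiskfs OSTs use 32):

  client# lfs getstripe /mnt/lustre/somefile
  #       obdidx     objid     objid     group
  #            3    265665   0x40dc1         0
  client# printf 'OST0003 bucket: d%d\n' $((0x40dc1 % 128))

i.e. the object id from the layout, modulo the number of bucket directories,
selects the subdirectory holding that object on the OST's backend filesystem.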

That said, I wouldn't suggest that you do filesystem recovery by rebuilding
files from the raw objects in this manner.  It would be a lot easier to make
a backup of the MDT in case of problems and use that to recover if the MDT
ever failed completely.

> Side note:
> 
> Looking at how Lustre puts object files in large buckets, I was wondering
> if this ZFS issue could become an issue:
> 
> Guess these buckets are rarely listed?

All of the object lookups in these directories are done by name (object ID),
so they are not listed per se.  Even so, this issue as described doesn't
really apply to Lustre since it always uses FatZAPs and not MicroZAPs.  The
other issue described in the ticket is that performance declines when a
directory gets very full and is then emptied.  This isn't really a problem
as the object directories are continually used.

Cheers, Andreas
---
Andreas Dilger
Principal Lustre Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Data migration from one OST to anther

2019-03-10 Thread Andreas Dilger
Note that the "max_create_count=0" feature only works with newer versions 
of Lustre - 2.10 and later.

It is recommended to upgrade to a newer release than 2.5 in any case. 
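On a 2.10+ client the same drain can also be driven with lfs find piped into
lfs_migrate; a sketch, reusing the fsname, OST UUID, and mount point from the
example below (the MDT wildcard is an assumption):

  mds#    lctl set_param osp.chome-OST0028*.max_create_count=0
  client# lfs find /work --obd chome-OST0028_UUID -type f | lfs_migrate -y
  mds#    lctl set_param osp.chome-OST0028*.max_create_count=20000

On 2.5 the tunable only exists under the osc proc path used below and, as
noted above, it does not work reliably there.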

Cheers, Andreas

> On Mar 5, 2019, at 10:33, Tung-Han Hsieh  
> wrote:
> 
> Dear All,
> 
> We have found the answer. Starting from Lustre-2.4, the OST will stop
> any update actions if we deactivate it. Hence during data migration, if
> we deactivate the OST chome-OST0028_UUID, and copy data out via:
> 
>cp -a  .tmp
>mv .tmp 
> 
> The "junk" still remains in chome-OST0028_UUID, unless we restart the
> MDT. Restarting the MDT will clean out the junk residing on the previously
> deactivated OSTs.
> 
> Another way to perform the data migration for chome-OST0028_UUID is:
> 
> root@mds# echo 0 > 
> /opt/lustre/fs/osc/chome-OST0028-osc-MDT/max_create_count
> 
> Thus the OST is still active, but just not creating new objects. So doing
> data migration we can see its space continuously released.
> 
> But here we encounter another problem. In our Lustre file system we have
> 41 OSTs, in which 8 OSTs are full and we want to do data migration. So
> we blocked these OSTs from creating new objects. But during the data
> migration, suddenly the whole Lustre file system hangs, and the MDS
> server has a lot of the following dmesg messages:
> 
> ---
> [960570.287161] Lustre: chome-OST001a-osc-MDT: slow creates, 
> last=[0x1001a:0x3ef241:0x0], next=[0x1001a:0x3ef241:0x0], reserved=0, 
> syn_changes=0, syn_rpc_in_progress=0, status=0
> [960570.287244] Lustre: Skipped 2 previous similar messages
> ---
> 
> where chome-OST001a-osc-MDT is one of the blocked OSTs. It looks like
> the MDT still wants to store data on the blocked OSTs. But since they
> are blocked, the whole file system hangs.
> 
> Could anyone give us suggestions how to solve it ?
> 
> Best Regards,
> 
> T.H.Hsieh
> 
>> On Sun, Mar 03, 2019 at 06:00:17PM +0800, Tung-Han Hsieh wrote:
>> Dear All,
>> 
>> We have a problem of data migration from one OST two another.
>> 
>> We have installed Lustre-2.5.3 on the MDS and OSS servers, and Lustre-2.8
>> on the clients. We want to migrate some data from one OST to another in
>> order to re-balance the data occupation among OSTs. In the beginning we
>> follow the old method (i.e., method found in Lustre-1.8.X manuals) for
>> the data migration. Suppose we have two OSTs:
>> 
>> root@client# /opt/lustre/bin/lfs df
>> UUID   1K-blocksUsed   Available Use% Mounted on
>> chome-OST0028_UUID7692938224  724670914855450156  99% /work[OST:40]
>> chome-OST002a_UUID   14640306852  7094037956  6813847024  51% /work[OST:42]
>> 
>> and we want to migrate data from chome-OST0028_UUID to chome-OST002a_UUID.
>> Our procedures are:
>> 
>> 1. We deactivate chome-OST0028_UUID:
>>   root@mds# echo 0 > /opt/lustre/fs/osc/chome-OST0028-osc-MDT/active
>> 
>> 2. We find all files located in chome-OST0028_UUID:
>>   root@client# /opt/lustre/bin/lfs find --obd chome-OST0028_UUID /work > list
>> 
>> 3. In each file listed in the file "list", we did:
>> 
>>cp -a  .tmp
>>mv .tmp 
>> 
>> During the migration, we really saw more and more data being written into
>> chome-OST002a_UUID. But we did not see any disk space released in 
>> chome-OST0028_UUID.
>> In Lustre-1.8.X, doing it this way we did see that chome-OST002a_UUID had
>> more data coming in, and chome-OST0028_UUID had more and more free space.
>> 
>> It looks like the data files referenced by the MDT have been copied to
>> chome-OST002a_UUID, but the junk still remains in chome-OST0028_UUID.
>> Even though we activate chome-OST0028_UUID after migration, the situation
>> is still the same:
>> 
>> root@mds# echo 1 > /opt/lustre/fs/osc/chome-OST0028-osc-MDT/active
>> 
>> Is there any way to cure this problem ?
>> 
>> 
>> Thanks very much.
>> 
>> T.H.Hsieh
>> ___
>> lustre-discuss mailing list
>> lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre Monitoring metrics

2019-02-24 Thread Andreas Dilger
Probably for a new user it doesn't make sense to use the Lustre stats in /proc 
directly. There are a number of different tools that present these stats in a 
more useful manner, such as IML (GUI Web front end), LMT, lltop, etc.

Cheers, Andreas

On Feb 24, 2019, at 02:09, Masudul Hasan Masud Bhuiyan  wrote:

Hi,

I am very new to Lustre and I am trying to get an idea of the Lustre 
monitoring system. I have collected the stats from "/proc/fs/lustre". 
Unfortunately I couldn't understand what these data actually mean. I 
have tried to go through the Lustre documentation and user guide to get an 
understanding of the data, but had no luck.

I will be grateful if you can help me to understand these log files. This is a 
sample of stats for the metadata server, but I am not sure how to interpret these 
stats. What do "mds_getattr"/mds_get_root/mds_quotactl mean?

Fri Feb 8 00:52:13 2019
snapshot_time 1549615933.091769380 secs.nsecs
req_waittime 540323978 samples [usec] 58 689541437 1767362743217 
12488011030933825609
req_active 540323988 samples [reqs] 1 142007 60970174330 3247243705183378
mds_getattr 168893 samples [usec] 82 165621 390916699 1522734684023
mds_close 110379799 samples [usec] 64 1231540 17796322782 62011789734542
mds_readpage 1236125 samples [usec] 245 10402461 10527744335 1995909318950381
mds_connect 6 samples [usec] 186 868 1900 969558
mds_get_root 1 samples [usec] 85 85 85 7225
mds_statfs 795609 samples [usec] 67 1109562 226674777 4390097174469
mds_sync 6769 samples [usec] 361624 689541437 80448645395 7168190473271866793
mds_quotactl 1449 samples [usec] 69 18051 551878 2621695252
mds_getxattr 1886 samples [usec] 93 13792 419449 525089597
mds_hsm_state_set 234556 samples [usec] 77 893155 96605623 1595746930715
ldlm_cancel 115821035 samples [usec] 58 5627198 1149359370420 
1270415775256810056
obd_ping 23932 samples [usec] 87 34096 7061304 6296645618
seq_query 38 samples [usec] 87 767 6885 1688653
fld_query 3 samples [usec] 84 370 789 256181

Sincerely yours,
Masudul Hasan Masud

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Suspended jobs and rebooting lustre servers

2019-02-22 Thread Andreas Dilger
This is not really correct.

Lustre clients can handle the addition of OSTs to a running filesystem. The MGS 
will register the new OSTs, and the clients will be notified by the MGS that 
the OSTs have been added, so no need to unmount the clients during this process.
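
As a rough sketch (fsname, index, NID, and device are placeholders for whatever
your system uses), adding an OST to a live filesystem amounts to:

   # On the new OSS: format the target with the next unused index
   mkfs.lustre --ost --fsname=testfs --index=42 --mgsnode=mgs@o2ib /dev/new_ost_dev

   # Mounting it registers it with the MGS; mounted clients are notified
   # and start using it without being remounted.
   mount -t lustre /dev/new_ost_dev /mnt/ost42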


Cheers, Andreas

On Feb 21, 2019, at 19:23, Raj 
mailto:rajgau...@gmail.com>> wrote:

Hello Raj,
It’s best and safest to unmount all the clients and then do the upgrade. 
Your FS is getting more OSTs and the configuration of the existing ones is 
changing, so your clients need to pick up the new layout by remounting.
You also mentioned client eviction: during an eviction the client has to 
drop its dirty pages, and all the open file descriptors in the FS will be gone.

On Thu, Feb 21, 2019 at 12:25 PM Raj Ayyampalayam 
mailto:ans...@gmail.com>> wrote:
What can I expect to happen to the jobs that are suspended during the file 
system restart?
Will the processes holding an open file handle die when I unsuspend them after 
the filesystem restart?

Thanks!
-Raj


On Thu, Feb 21, 2019 at 12:52 PM Colin Faber 
mailto:cfa...@gmail.com>> wrote:
Ah yes,

If you're adding to an existing OSS, then you will need to reconfigure the file 
system, which requires a writeconf event.
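
Roughly, a writeconf looks like the following (a sketch only; device paths are
placeholders and the Operations Manual has the authoritative steps):

   # Unmount clients, then all targets, and regenerate the config logs
   # on each target:
   tunefs.lustre --writeconf /dev/mgt_dev
   tunefs.lustre --writeconf /dev/mdt_dev
   tunefs.lustre --writeconf /dev/ost0_dev    # ...and so on for every OST

   # Remount in order: MGS first, then MDT, then OSTs, then clients.
   mount -t lustre /dev/mgt_dev /mnt/mgt
   mount -t lustre /dev/mdt_dev /mnt/mdt
   mount -t lustre /dev/ost0_dev /mnt/ost0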

On Thu, Feb 21, 2019 at 10:00 AM Raj Ayyampalayam 
mailto:ans...@gmail.com>> wrote:
The new OSTs will be added to the existing file system (the OSS nodes are 
already part of the filesystem), so I will have to re-configure the current HA 
resource configuration to tell it about the 4 new OSTs.
Our exascaler's HA monitors the individual OSTs, and I need to re-configure the 
HA on the existing filesystem.

Our vendor support has confirmed that we would have to restart the filesystem 
if we want to regenerate the HA configs to include the new OST's.

Thanks,
-Raj


On Thu, Feb 21, 2019 at 11:23 AM Colin Faber 
mailto:cfa...@gmail.com>> wrote:
It seems to me that steps may still be missing?

You're going to rack/stack and provision the OSS nodes with new OSTs.

Then you're going to introduce failover options somewhere? On the new OSTs? The 
existing system? etc.?

If you're introducing failover with the new OSTs and leaving the existing 
system in place, you should be able to accomplish this without bringing the 
system offline.

If you're going to be introducing failover to your existing system, then you 
will need to reconfigure the file system to accommodate the new failover 
settings (failover nodes, etc.).
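
For example, a sketch only (NIDs and device path are placeholders, and a
writeconf is assumed to be needed to propagate the change):

   # Declare both OSS nodes as service nodes for an existing OST
   tunefs.lustre --writeconf --servicenode=oss1@o2ib --servicenode=oss2@o2ib /dev/ost0_dev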

-cf


On Thu, Feb 21, 2019 at 9:13 AM Raj Ayyampalayam 
mailto:ans...@gmail.com>> wrote:
Our upgrade strategy is as follows:

1) Load all disks into the storage array.
2) Create RAID pools and virtual disks.
3) Create the Lustre targets using the mkfs.lustre command. (I still have to 
figure out all the parameters used on the existing OSTs — see the sketch after 
this list.)
4) Create mount points on all OSSs.
5) Mount the lustre OSTs.
6) Maybe rebalance the filesystem.
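
For step 3, a minimal sketch (device names, fsname, index, and NIDs are
placeholders): the parameters of an existing OST can be read back and reused
when formatting the new ones.

   # Show how an existing OST was formatted/tuned, without changing anything
   tunefs.lustre --dryrun /dev/existing_ost_dev

   # Format a new OST to match, with the next free index
   mkfs.lustre --ost --fsname=testfs --index=44 --mgsnode=mds1@o2ib --mgsnode=mds2@o2ib /dev/new_ost_dev
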
My understanding is that the above can be done without bringing the filesystem 
down. I want to create the HA configuration (corosync and pacemaker) for the 
new OSTs. This step requires the filesystem to be down. I want to know what 
would happen to the suspended processes across the cluster when I bring the 
filesystem down to re-generate the HA configs.

Thanks,
-Raj

On Thu, Feb 21, 2019 at 12:59 AM Colin Faber 
mailto:cfa...@gmail.com>> wrote:
Can you provide more details on your upgrade strategy? In some cases expanding 
your storage shouldn't impact client / job activity at all.

On Wed, Feb 20, 2019, 11:09 AM Raj Ayyampalayam 
mailto:ans...@gmail.com>> wrote:
Hello,

We are planning on expanding our storage by adding more OSTs to our lustre file 
system. It looks like it would be easier to expand if we bring the filesystem 
down and perform the necessary operations. We are planning to suspend all the 
jobs running on the cluster. We originally planned to add new OSTs to the live 
filesystem.

We are trying to determine the potential impact to the suspended jobs if we 
bring down the filesystem for the upgrade.
One of the questions we have is what would happen to the suspended processes 
that hold an open file handle in the lustre file system when the filesystem is 
brought down for the upgrade?
Will they recover from the client eviction?

We do have vendor support and have engaged them. I wanted to ask the community 
and get some feedback.

Thanks,
-Raj
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Re: [lustre-discuss] Migrate MGS to ZFS

2019-02-19 Thread Andreas Dilger
PS: it is always a good idea to make a backup of your MDT, since it is 
relatively small compared to the rest of the filesystem. A full-device "dd" 
copy doesn't take too long and is the most accurate backup for ldiskfs. 
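
For example (device and destination are placeholders; the MDT should be
unmounted while the copy runs):

   # Block-level image of the unmounted ldiskfs MDT
   dd if=/dev/mdt_dev of=/backup/mdt-$(date +%Y%m%d).img bs=4M conv=sparse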

Cheers, Andreas

> On Feb 19, 2019, at 19:31, Andreas Dilger  wrote:
> 
> Yes, it is possible to migrate the MGS files to another device as you 
> propose. I don't think there is any particular difference if you move it to a 
> separate ldiskfs or ZFS target. 
> 
> One caveat is that we don't test combined ZFS and ldiskfs targets on the same 
> node, though in theory it would work. 
> 
> Migrating the MDT from ldiskfs to ZFS is also possible with newer versions of 
> Lustre (2.12 for sure, I don't recall if it is in 2.10 or not).  You need to 
> follow a special process to do this, please see the Lustre Operations Manual 
> for details. 
> 
> Cheers, Andreas
> 
>> On Feb 19, 2019, at 17:48, Fernando Pérez  wrote:
>> 
>> Dear lustre experts.
>> 
>> What is the best way to migrate a MGS device to ZFS? Copy the 
>> CONFIGS/filesystem_name-* files from the old ldiskfs device to the new ZFS 
>> MGS device?
>> 
>> Currently we have a combined MDT/MGT under ldiskfs with lustre 2.10.4. 
>> 
>> We want to upgrade to lustre 2.12.0 and then separate the combined MDT/MGT 
>> and migrate MDT and MGT to separate ZFS devices. 
>> 
>> Regards.
>> =
>> Fernando Pérez
>> Institut de Ciències del Mar (CMIMA-CSIC)
>> Departament Oceanografía Física i Tecnològica
>> Passeig Marítim de la Barceloneta,37-49
>> 08003 Barcelona
>> Phone:  (+34) 93 230 96 35
>> =
>> 
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Migrate MGS to ZFS

2019-02-19 Thread Andreas Dilger
Yes, it is possible to migrate the MGS files to another device as you propose. 
I don't think there is any particular difference if you move it to a separate 
ldiskfs or ZFS target. 

One caveat is that we don't test combined ZFS and ldiskfs targets on the same 
node, though in theory it would work. 

Migrating the MDT from ldiskfs to ZFS is also possible with newer versions of 
Lustre (2.12 for sure, I don't recall if it is in 2.10 or not).  You need to 
follow a special process to do this, please see the Lustre Operations Manual 
for details. 
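
For what it's worth, one possible shape of the split, as a sketch only (pool
name, device, and NID are placeholders, and this assumes regenerating the
configuration logs with a writeconf rather than copying the CONFIGS files; the
Operations Manual is the authority here):

   # Create and mount a standalone ZFS-backed MGT on the new device
   mkfs.lustre --mgs --backfstype=zfs mgspool/mgt /dev/sdX
   mount -t lustre mgspool/mgt /mnt/mgt

   # Then point the MDT and OSTs at the new MGS and regenerate the logs
   # (the old combined MDT/MGT may also need its MGS role removed; see the manual)
   tunefs.lustre --writeconf --mgsnode=newmgs@tcp /dev/mdt_dev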

Cheers, Andreas

> On Feb 19, 2019, at 17:48, Fernando Pérez  wrote:
> 
> Dear lustre experts.
> 
> What is the best way to migrate a MGS device to ZFS? Copy the 
> CONFIGS/filesystem_name-* files from the old ldiskfs device to the new ZFS 
> MGS device?
> 
> Currently we have a combined MDT/MGT under ldiskfs with lustre 2.10.4. 
> 
> We want to upgrade to lustre 2.12.0 and then separate the combined MDT/MGT 
> and migrate MDT and MGT to separate ZFS devices. 
> 
> Regards.
> =
> Fernando Pérez
> Institut de Ciències del Mar (CMIMA-CSIC)
> Departament Oceanografía Física i Tecnològica
> Passeig Marítim de la Barceloneta,37-49
> 08003 Barcelona
> Phone:  (+34) 93 230 96 35
> =
> 
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

