Re: [lustre-discuss] [Samba] Odd "File exists" behavior when copy-pasting many files to an SMB exported Lustre FS
Hi Jeremy!

Am 23.09.22 um 18:42 schrieb Jeremy Allison:
> In practice this has never been an issue. It's only an issue
> now due to the strange EA behavior. Once we have a vfs_lustre
> module in place it will go back to not being an issue :-).

The root of the problem seems to be an asymmetry in how Samba maps filesystem EAs to Windows EAs and back. For SMB_INFO_SET_EA the mapping is

    foo (Windows) ---> user.foo (filesystem)

whereas SMB_INFO_QUERY_ALL_EAS maps as

    user.foo (filesystem) ---> foo (Windows)
    other.bar (filesystem) ---> other.bar (Windows)

This means a) the Windows side cannot distinguish between other.bar and user.other.bar, and b) a copy operation via Samba turns other.bar into user.other.bar (leading to the issue at hand, because the user.* EA namespace is off by default in Lustre). So rather than adding a VFS module filtering some EAs, why not just make the mapping symmetric, ie.

diff --git a/source3/smbd/smb2_trans2.c b/source3/smbd/smb2_trans2.c
index 95cecce96e1..31a7a04a72c 100644
--- a/source3/smbd/smb2_trans2.c
+++ b/source3/smbd/smb2_trans2.c
@@ -454,7 +454,7 @@ static NTSTATUS get_ea_list_from_fsp(TALLOC_CTX *mem_ctx,
 		struct ea_list *listp;
 		fstring dos_ea_name;
 
-		if (strnequal(names[i], "system.", 7)
+		if (!strnequal(names[i], "user.", 5)
 		    || samba_private_attr_name(names[i]))
 			continue;

What am I missing?

Kind regards, Daniel

-- 
Daniel Kobras
Principal Architect
Puzzle ITC Deutschland
+49 7071 14316 0
www.puzzle-itc.de

-- 
Puzzle ITC Deutschland GmbH
Sitz der Gesellschaft: Eisenbahnstraße 1, 72072 Tübingen
Eingetragen am Amtsgericht Stuttgart HRB 765802
Geschäftsführer: Lukas Kallies, Daniel Kobras, Mark Pröhl

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
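The asymmetry is easy to model in a few lines (a toy sketch, not Samba's actual implementation; the function names are made up):

```python
# Toy model of the EA name mapping described above -- not Samba's code.

def set_ea_name(windows_name: str) -> str:
    """SMB_INFO_SET_EA: Windows EA name -> filesystem xattr name."""
    return "user." + windows_name

def query_ea_name(fs_name: str) -> str:
    """SMB_INFO_QUERY_ALL_EAS: filesystem xattr name -> Windows EA name."""
    # Only the "user." prefix is stripped; other namespaces pass through.
    return fs_name[len("user."):] if fs_name.startswith("user.") else fs_name

# a) Windows cannot tell these two filesystem xattrs apart:
assert query_ea_name("other.bar") == query_ea_name("user.other.bar")

# b) a copy (query on the source, set on the target) mangles the name:
copied = set_ea_name(query_ea_name("other.bar"))
print(copied)  # user.other.bar
```

Under this model, a round trip through Samba can only preserve names in the user.* namespace, which is exactly what the proposed patch makes explicit on the query side.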
Re: [lustre-discuss] [Samba] Odd "File exists" behavior when copy-pasting many files to an SMB exported Lustre FS
Hi Micha, hi Jeremy!

Am 22.09.22 um 11:21 schrieb Michael Weiser:
> [from the docs]:
>> If a client(s) will be mounted on several file systems, add the following
>> line to /etc/xattr.conf file to avoid problems when files are moved
>> between the file systems: "lustre.* skip"
>
> What exactly were those problems hinted at in the documentation?
> Is the visibility of the lustre.lov attribute for unprivileged users actually
> needed for anything?
> Can exposing it to unprivileged users be switched off Lustre-side?

In the lustre.lov xattr, Lustre exposes layout information (ie. how content is distributed across servers) to regular users. In some cases, it's also possible to set the desired layout through this interface. Layout information depends on the innards of the specific fs setup and should not be retained when moving files between different filesystems, hence the hint in the docs.

Jeremy wrote:
>> Lustre really should not be exposing EA's to callers if
>> it doesn't actually support EA's.

Just to clarify, as Micha's original post had it hidden in parentheses: Lustre does support xattrs in general, and it does support the 'user.*' namespace. It's just that the latter needs to be enabled explicitly with the 'user_xattr' mount option. By default, access to 'user.*' xattrs is rejected with EOPNOTSUPP.

>> No. Lustre is returning "fictional" EA's that
>> cannot be set. Linux filesystems that don't have
>> EA's don't do that.

On a Lustre system, the lustre.lov xattr can be set alright without receiving an error. But that's not what Samba does here. Instead, it tries to copy any user-readable xattr, and prepends a 'user.' prefix to any name unless it's already there. This only works if the filesystem has been mounted with the 'user_xattr' option. So for a working setup, it boils down to either turning off 'ea support' in Samba (on by default since 4.9), or turning on 'user_xattr' in the filesystem.
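For reference, the two workarounds might look like this (share name, mount point, fsname and MGS NID are illustrative):

```
# Option 1: disable EA support for the exported share in smb.conf
[lustrefs]
        path = /mnt/lustre
        ea support = no

# Option 2: enable user.* xattrs when mounting the Lustre client, e.g.
#   mount -t lustre -o user_xattr mgs@tcp0:/fsname /mnt/lustre
```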
Kind regards, Daniel
Re: [lustre-discuss] [Samba] Odd "File exists" behavior when copy-pasting many files to an SMB exported Lustre FS
Hi Michael!

Am 21.09.22 um 16:54 schrieb Michael Weiser via lustre-discuss:
> That leaves the question, where that extended attribute user.lustre.lov is
> coming from. It appears that Lustre itself exposes a number of extended
> attributes for every file which reflect internal housekeeping data:
>
> $ getfattr -m - blarf
> # file: blarf
> lustre.lov
> trusted.link
> trusted.lma
> trusted.lov

Try adding

    lustre.* skip

to /etc/xattr.conf (cf. <https://doc.lustre.org/lustre_manual.xhtml#lustre_configure_multiple_fs>). I haven't tested this yet, but Samba's EA handling seems to be libattr-based, so the above tweak should be sufficient to keep smbd off of these fs-specific attributes.

Kind regards, Daniel
Re: [lustre-discuss] MDS using D3710 DAS
Hi!

Am 12.02.21 um 06:14 schrieb Sid Young:
> Is anyone using a HPe D3710 with two HPe DL380/385 servers in a MDS HA
> Configuration? If so, is your D3710 presenting LV's to both servers at
> the same time AND are you using PCS with the Lustre PCS Resources?
>
> I've just received new kit and cannot get disks to present to the MDS
> servers at the same time. :(

It's been a while, but I've seen this once with an enclosure that had been equipped with single- instead of dual-ported drives by mistake. Might be worthwhile to double-check the specs for your system.

Kind regards, Daniel
Re: [lustre-discuss] rsync not appropriate for lustre
Hi!

Am 23.01.20 um 16:33 schrieb Bernd Melchers:
> we are copying large data sets within our lustre filesystem and between
> lustre and an external nfs server. In both cases the performance is
> unexpectedly low, and the reason seems to be that rsync is reading and
> writing in 32 kB blocks, whereas our lustre would be happier with 4 MB
> blocks. rsync has a --block-size=SIZE parameter, but this adjusts only the
> checksum block size (and the maximum is 131072), not the I/O block size.
> Is there a solution to accelerate rsync?

Assuming sequential access to sufficiently large files, most reads issued by a Lustre client with default settings should already be RPC-sized, regardless of the size used in the read() syscall. Maybe double-check your readahead settings?

Kind regards, Daniel
Re: [lustre-discuss] How to remove a pool
Hi!

Am 04.03.19 um 17:56 schrieb Bernd Melchers:
> ... or is there a way to un-setstripe directories from pools?

You should be able to remove the pool association with

    lfs setstripe -p none

Kind regards, Daniel
Re: [lustre-discuss] Disable identity_upcall and ACL
Hi Aurélien!

Am 09.01.19 um 14:30 schrieb Degremont, Aurelien:
> The secondary group thing was ok to me. I got this idea even if there are
> some weird results during my tests. Looks like you can overwrite MDT checks
> if user and group are properly defined on the client node.

Cache effects? In a talk I gave a decade ago, I described a problem with authorization due to inconsistencies between client and MDT, depending on whether metadata was in the client cache or not (see p. 23 of http://wiki.lustre.org/images/b/ba/Tuesday_lustre_automotive.pdf -- you really managed to challenge my memory ;-) I faintly remember Andreas commenting that the MDT was always supposed to be authoritative, even for cached content, and that the observed behaviour was a bug. Indeed, other than in those prehistoric versions, I'm not aware of any inconsistencies in authorization due to cache effects.

> ACL is really the thing I was interested in. Who is validating the ACLs? MDT,
> client or both? Do you think ACLs could be properly applied if users/groups
> are only defined on the client side and identity_upcall is disabled on the
> MDT side?

Posix ACLs use numeric uids and gids, just like ordinary permission bits. Evaluation is supposed to happen on the MDT for both. If you can do without secondary groups, there's no need for user and group databases on the MDT--numeric ids will work fine. (Unless you use Kerberos, which will typically require user names for proper id mapping.)

Kind regards, Daniel
Re: [lustre-discuss] Disable identity_upcall and ACL
Hi Aurélien!

Am 09.01.19 um 11:48 schrieb Degremont, Aurelien:
> When disabling identity_upcall on a MDT, you get this message in system
> logs:
>
> lustre-MDT: disable "identity_upcall" with ACL enabled maybe cause
> unexpected "EACCESS"
>
> I'm trying to understand what could be a scenario that shows this problem?
> What is the implication, or rather, how does identity_upcall work?

Without an identity_upcall, all Lustre users effectively lose their secondary group memberships. These are not passed in the RPCs, but evaluated on the MDS instead. The default l_getidentity receives a numeric uid, queries NSS to obtain the corresponding account's list of gids, and passes the list back to the kernel.

As a test scenario, just try to access a file or directory from an account that only has access permissions via one of its secondary groups. (The log message is a bit misleading--you don't actually need to use ACLs, ordinary group permissions are sufficient.)

Kind regards, Daniel
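The effect can be illustrated with a toy model of the MDS-side group check (a simplification for illustration, not actual Lustre code):

```python
# Toy model: group-permission check as seen by the MDS.
# The RPC carries uid and primary gid; secondary gids come from
# l_getidentity. With identity_upcall disabled, that list is empty.

def group_access(file_gid: int, primary_gid: int, secondary_gids: list) -> bool:
    """True if group permissions would grant access."""
    return file_gid == primary_gid or file_gid in secondary_gids

# Account: primary gid 100, secondary member of group 2000.
# Directory: owned by group 2000, accessible via group permissions only.
print(group_access(2000, 100, [2000, 3000]))  # True:  upcall supplies the gids
print(group_access(2000, 100, []))            # False: the unexpected EACCES
```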
Re: [lustre-discuss] Command line tool to monitor Lustre I/O ?
Hi Roland!

> Am 20.12.2018 um 15:04 schrieb Laifer, Roland (SCC):
>
> what is a good command line tool to monitor current Lustre metadata and
> throughput operations on the local client or server? Up to now we had
> used collectl but this no longer works for Lustre 2.10.

The Lustre exporter (https://github.com/HewlettPackard/lustre_exporter) for Prometheus copes well with 2.10. Calling it a command-line tool is a bit of a stretch (hey, there's curl after all!), but it can certainly step in for collectl's non-interactive mode of operation.

Kind regards, Daniel
Re: [lustre-discuss] projects and project quota
Hi!

Am 10.12.18 um 17:38 schrieb Thomas Roth:
> Why do I get an error when enabling project quota?
>
> The system ("hebe") is running Lustre 2.10.5 and zfs 0.7.9, but it was
> installed and formatted with 2.10.2 and zfs 0.7.1 (and LU-7991 was resolved
> November 17, fix version is 2.10.3).

LU-7991 was resolved in 2.10.x, but the corresponding changes in ZoL (https://github.com/zfsonlinux/zfs/pull/6290) have only landed in master and will be released with 0.8. As far as I'm aware, project quota is not functional with zfs 0.7.x.

Kind regards, Daniel
Re: [lustre-discuss] new lustre setup, questions about the mgt and mdt.
Hi!

> Am 23.10.2018 um 19:36 schrieb Kurt Strosahl:
>
> 1) I've seen it said that ZFS is a good choice for the lustre mdt, as long
> as it is zfs 0.7.x. Has anyone had that experience?
> 2) What about the MGT, that seems like it wouldn't have any issues with
> being on zfs.

We've been running several all-ZFS backed systems over the last few years. Way back in the 2.5 days, there once was a ZFS-specific issue with ACLs that we actually hit in production. Other than that, the backend FS never caused any problems--or they were insignificant enough to slip my mind.

> The system I presently have in mind is two servers hooked up to a jbod,
> with the MDT and the MGT being separate zpools on the jbod. That way one
> head could run the MGT and one could run the MDT, while allowing both heads
> to serve as backups for each other.

MGS data is small and mostly static, and in my experience the extra load from the service can be neglected as well. So overall you're probably better off if you don't waste drives on a separate pool just for the MGS. Just add them to the MDT pool instead, and enjoy the extra oomph. Or consider splitting the drives into two pools for an active/active configuration with one MDT on each node instead. Depending on how your workload spreads across multiple MDTs, this can be beneficial even for a relatively small number of disks, because you effectively double your cache size.

Kind regards, Daniel
Re: [lustre-discuss] Lustre Client in a container
Hi David!

Do you require both systems to be available as native Lustre filesystems on all clients? Otherwise, reexporting one of the systems via NFS during the migration phase will keep all data available but decouple the version interdependence between servers and clients. In this situation, it's probably the least experimental option.

Kind regards, Daniel

> Am 31.12.2017 um 09:50 schrieb David Cohen:
>
> Patrick,
> Thanks for your response. I'm looking for a way to migrate from a 1.8.9
> system to 2.10.2, stable enough to run the several weeks or more that it
> might take.
>
> David
>
> On Sun, Dec 31, 2017 at 12:12 AM, Patrick Farrell wrote:
> David,
>
> I have no direct experience trying this, but I would imagine not - Lustre
> is a kernel module (actually a set of kernel modules), so unless the
> container tech you're using allows loading multiple different versions of
> *kernel modules*, this is likely impossible. My limited understanding of
> container tech on Linux suggests that this would be impossible: containers
> allow userspace separation, but there is only one kernel/set of
> modules/drivers.
>
> I don't know of any way to run multiple client versions on the same node.
>
> The other question is *why* do you want to run multiple client versions on
> one node...? Clients are usually interoperable across a pretty generous
> set of server versions.
>
> - Patrick
>
> From: lustre-discuss on behalf of David Cohen
> Sent: Saturday, December 30, 2017 11:45:15 AM
> To: lustre-discuss@lists.lustre.org
> Subject: [lustre-discuss] Lustre Client in a container
>
> Hi,
> Is it possible to run Lustre client in a container?
> The goal is to run two different client versions on the same node, can it
> be done?
>
> David
Re: [lustre-discuss] lfs_migrate rsync vs. lfs migrate and layout swap
Hi!

> Am 20.11.2017 um 00:01 schrieb Dilger, Andreas:
>
> It would be interesting to strace your rsync vs. "lfs migrate" read/write
> patterns so that the copy method of "lfs migrate" can be improved to match
> rsync. Since they are both userspace copy actions they should be about the
> same performance. It may be that "lfs migrate" is using O_DIRECT to
> minimize client cache pollution (I don't have the code handy to check
> right now). In the future we could use "copyfile()" to avoid this as well.

lfs migrate indeed uses O_DIRECT for reading the source file. A few tests on a system running 2.10.1 yielded a 10x higher throughput with a modified lfs migrate that simply dropped the O_DIRECT flag. I've filed https://jira.hpdd.intel.com/browse/LU-10278 about it. (A simple patch to make O_DIRECT optional is ready, but I still need to charm the gods of the firewall to let me push to Gerrit.)

Kind regards, Daniel
Re: [lustre-discuss] Dependency errors with Lustre 2.10.1 packages
Hi!

> [root@lustre-ost03 ~]# yum install lustre
> Loaded plugins: fastestmirror, versionlock
> Loading mirror speeds from cached hostfile
> Resolving Dependencies
> --> Running transaction check
> ---> Package lustre.x86_64 0:2.10.1-1.el7 will be installed
> --> Processing Dependency: kmod-lustre = 2.10.1 for package: lustre-2.10.1-1.el7.x86_64
> --> Processing Dependency: lustre-osd for package: lustre-2.10.1-1.el7.x86_64
> --> Processing Dependency: lustre-osd-mount for package: lustre-2.10.1-1.el7.x86_64
> --> Processing Dependency: libyaml-0.so.2()(64bit) for package: lustre-2.10.1-1.el7.x86_64
> --> Running transaction check
> ---> Package kmod-lustre.x86_64 0:2.10.1-1.el7 will be installed
> ---> Package kmod-lustre-osd-ldiskfs.x86_64 0:2.10.1-1.el7 will be installed
> --> Processing Dependency: ldiskfsprogs >= 1.42.7.wc1 for package: kmod-lustre-osd-ldiskfs-2.10.1-1.el7.x86_64
> ---> Package libyaml.x86_64 0:0.1.4-11.el7_0 will be installed
> ---> Package lustre-osd-ldiskfs-mount.x86_64 0:2.10.1-1.el7 will be installed
> --> Finished Dependency Resolution
> Error: Package: kmod-lustre-osd-ldiskfs-2.10.1-1.el7.x86_64 (lustre)
>        Requires: ldiskfsprogs >= 1.42.7.wc1
> You could try using --skip-broken to work around the problem
>
> I've checked the repos and don't see a package for ldiskfsprogs at all.
> Does anybody know how to resolve this?

It's a bit convoluted, because no package is actually named ldiskfsprogs. Instead, the e2fsprogs packages available from https://downloads.hpdd.intel.com/public/e2fsprogs/latest/el7/RPMS/x86_64/ have a "Provides" entry for ldiskfsprogs. Installing them should resolve the dependency problem.

Kind regards, Daniel
Re: [lustre-discuss] Large file read performance degradation from multiple OST's
Hi!

On Wed, May 24, 2017, Vicker, Darby (JSC-EG311) wrote:
> I tried a 2.8 client mounting the 2.9 servers and that showed the expected
> behavior: increasing performance with increasing OSTs. Two things:
>
> 1. Any pointers to compiling a 2.8 client on recent RHEL 7 kernels would be
> helpful. I had to boot into an older kernel to get the above test done.
>
> 2. Any help to find the problem in the 2.9 code base would also be helpful.

The fast_read feature has been added to the client-side read path between 2.8 and 2.9. So for starters, it might be worthwhile to re-run the test on a 2.9 client with fast_read=0?

Kind regards, Daniel
Re: [Lustre-discuss] Object index
Hi Jason!

Am 24.07.2012 um 20:18 schrieb Rappleye, Jason (ARC-TN)[Computer Sciences Corporation]:
> On 7/24/12 11:10 AM, "Daniel Kobras" wrote:
>> lfs fid2path (on any client) should do what you're looking for.
>
> Unfortunately that doesn't work for files created prior to Lustre 2.0, or
> files with components of their path created before Lustre 2.0. The link
> EA is missing from the MDT inode of such files, which is what fid2path
> appears to use. This was a real bummer for us, and I'd love for someone to
> tell me that I'm wrong. Please?

Pre-2.0, you can extract the inode number of the parent object on the MDT from the object's trusted.fid EA, eg. with ll_decode_filter_fid. On the MDT, you can map the inode number to a filesystem path with ncheck in debugfs or find -inum.

Regards, Daniel.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Object index
Hi Aurelien!

Am 24.07.2012 um 14:04 schrieb DEGREMONT Aurelien:
> I'm trying to analyze an OST with a few thousand objects and find where
> they belong. Mounting this OST with ldiskfs and using ll_decode_filter_fid
> tells me that:
>
> - Most of these objects do not have a fid EA back pointer. Does that mean
> they are not used?

Is this the troglodyte type of OST that started its life in times of prehistoric versions of Lustre? We see this on old files that were created in the early ages of Lustre 1.6, before the trusted.fid EA was introduced. Other than that, these objects could have been preallocated, but never actually used. Do these objects contain any data at all (blockcount != 0)?

> - Some of them have good results, and the man page says that
> "For objects with MDT parent sequence numbers above 0x2, this
> indicates that the FID needs to be mapped via the
> MDT Object Index (OI) file on the MDT".
> How do I do this mapping? I found some iam utilities, but they do not seem
> to be ok, and I'm afraid the IAM userspace code has been deactivated.

lfs fid2path (on any client) should do what you're looking for.

> How can I know if those files could be removed without risk?
> I previously checked that "lfs find" did not find any other files with
> objects on this specific OST I'm working on.

From my experience, a small amount of object leakage is not too uncommon on real-world systems, so if lfs find doesn't show up any objects anymore, most likely you're good to take this OST down. (Hey, and you can double-check with rbh-report --dump-ost, of course! ;-)

Regards, Daniel.
Re: [Lustre-discuss] very slow directory access (lstat taking >20s to return)
Hi Frederik!

On Fri, Oct 29, 2010 at 09:40:33AM +0100, Frederik Ferner wrote:
> Doing a 'strace -T -e file ls -n' on one directory with about 750 files,
> while users were seeing the hanging ls, showed lstat calls taking
> seconds, up to 23s.

The (l)stat() calls determine the exact size of all files in the displayed directory. This means that each OST needs to revoke client write locks for all these files, ie. client-side write caches for all files in the directory are flushed before the (l)stat() returns. This can easily take several seconds if there is heavy write activity on the files.

Regards, Daniel.
Re: [Lustre-discuss] system disk with external journals for OSTs formatted
On Wed, Oct 27, 2010 at 08:46:41PM +0800, Andreas Dilger wrote:
> I don't know what these errors are, possibly trying to write into the
> broken journal device? The rest of the filesystem errors are very minor.
> You should probably delete the journal device via "tune2fs -O
> ^has_journal", run a full "e2fsck -f" and then recreate the journal with
> "tune2fs -j size=400".

On a filesystem with errors, you'll have to use "tune2fs -f -O ^has_journal" to force removal of the journal. At least that's what the man page says. When I had to pull such a stunt a while ago, tune2fs refused to remove the journal even when given the force flag, though. It might work as documented now. Otherwise, my workaround was to retrieve the journal UUID from the main filesystem with tune2fs -l, then create a new external journal with this UUID (mke2fs -O journal_dev -U ...). At this point I was able to run e2fsck on the main filesystem to get back to a clean state. Finally, I could remove and add back the journal with normal tune2fs calls to get it properly linked to the filesystem.

Regards, Daniel.
Re: [Lustre-discuss] need help debuggin an access permission problem
Hi!

On Fri, Sep 24, 2010 at 09:18:15AM +0100, Tina Friedrich wrote:
> Cheers Andreas. I had actually found that, but there doesn't seem to be
> that much documentation about it. Or I didn't find it :) Plus it
> appeared to find the users that were problematic whenever I tried it, so
> I wondered if that is all there is, or if there's some other mechanism I
> could test for.

Mind that access to cached files is no longer authorized by the MDS, but by the client itself. I wouldn't call it documentation, but http://wiki.lustre.org/images/b/ba/Tuesday_lustre_automotive.pdf has an illustration of why this is a problem when name services become out of sync between MDS and Lustre clients (slides 23/24). Sounds like you hit a very similar issue.

Regards, Daniel.
Re: [Lustre-discuss] error e2fsck run for lfsck
Hi Thomas!

On Fri, Sep 10, 2010 at 08:16:57PM +0200, Thomas Roth wrote:
> on a 1.8.4 test system, I tried to prepare for lfsck and got an error from
> e2fsck:
>
> mds:~# e2fsck -n -v --mdsdb /tmp/mdsdb /dev/sdb2
> e2fsck 1.41.10.sun2 (24-Feb-2010)
> lustre-MDT lustre database creation, check forced.
> Pass 1: Checking inodes, blocks, and sizes
> MDS: ost_idx 0 max_id 190010
> MDS: ost_idx 1 max_id 179671
> MDS: ost_idx 2 max_id 43436
> MDS: ost_idx 3 max_id 97654
> MDS: ost_idx 4 max_id 78973
> MDS: got 40 bytes = 5 entries in lov_objids
> MDS: max_files = 15090
> MDS: num_osts = 5
> mds info db file written
> error: only handle v1 LOV EAs, not 0bd30bd0
> e2fsck: aborted

This looks like https://bugzilla.lustre.org/show_bug.cgi?id=21704: currently you cannot fsck an MDT filesystem that uses pools.

Regards, Daniel.
Re: [Lustre-discuss] mkfs.lustre and failover question
On Wed, Aug 18, 2010 at 10:16:36AM -0500, David Noriega wrote:
> I've read through the 'More Complicated Configurations' section in the
> manual and it says as part of setting up failover with
> two (active/passive) MDS/MGS and two OSSs (active/active) to use the
> following:
>
> mkfs.lustre --fsname=lustre --ost --failnode=192.168.5@tcp0
> --mgsnode=192.168.5@tcp0,192.168.5@tcp0
> /dev/lustre-ost1-dg1/lv1

A Lustre MGS can have more than one network address (NID). Different NIDs of the same server are separated by commas. Here, however, you want to configure NIDs for different servers. Those have to be separated with a colon instead. Alternatively, you can just use the --mgsnode option twice with different arguments. In your case, try

mkfs.lustre --fsname=lustre --ost --failnode=192.168.5@tcp0 \
    --mgsnode=192.168.5@tcp0 --mgsnode=192.168.5@tcp0 \
    /dev/lustre-ost1-dg1/lv1

Regards, Daniel.
Re: [Lustre-discuss] lfs --obd discrepancy to lctl dl (1.8.3)
Hi!

On Wed, Aug 11, 2010 at 09:53:01AM +0200, Heiko Schröter wrote:
> According to "lfs find" the stripe should be on scia-OST0017_UUID. "lfs
> getstripe" reports to have it on obdidx 23, which is scia-OST0014 according
> to lctl dl. Which one is true?

lctl dl returns a list of all Lustre components running on the local machine, prefixed by a sequential number. That number depends on the order in which components have been started on the machine, and has nothing to do with obdidx. Your file therefore resides on OST 23 (== 0x17).

Regards, Daniel.
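Both outputs refer to the same index, just in different bases: lfs getstripe prints obdidx in decimal, while the OST UUID embeds it in hex. A quick sanity check:

```shell
# decimal obdidx 23 corresponds to the hex suffix in scia-OST0017_UUID
printf 'OST%04x\n' 23   # prints OST0017
printf '%d\n' 0x17      # prints 23
```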
Re: [Lustre-discuss] NFS Export Issues
On Fri, Jul 16, 2010 at 05:23:16PM -0700, William Olson wrote:
> [r...@lustreclient mnt]# strace -f -p 15964
(...)
> lstat("/mnt/lustre_mail_fs", 0x7fff4bd4b2b0) = -1 EACCES (Permission denied)

When it comes to inexplicable permission problems, have you checked if SELinux is turned off on the NFS server?

Regards, Daniel.
Re: [Lustre-discuss] Lustre file access / locking
Hi!

On Thu, Jul 01, 2010 at 12:35:32PM +0200, Arne Wiebalck wrote:
> I got some basic questions concerning how file access in Lustre
> works: from what I understand, Lustre clients lock file
> ranges with the server before executing I/O operations. What
> does a client get on its next I/O call in case the server needs
> to revoke the lock (the server can do that, right?): does
> the client try to re-acquire the lock, block and then time
> out (in case the server does not grant the lock again to that
> client)?

The locks protect the client-side caches, not the actual I/O. If a lock is dequeued, the client has to resynchronize (the affected parts of) its cache with the server.

Regards, Daniel.
Re: [Lustre-discuss] lnet infiniband config
Hi!

On Tue, Jun 22, 2010 at 04:19:08PM +0200, Thomas Roth wrote:
> I'm getting my feet wet in the infiniband lake and of course I run into
> some problems.
> It would seem I got the compilation part of sles11 kernel 2.6.27 +
> Lustre 1.8.3 + ofed 1.4.2 right, because it allows me to see and use the
> infiniband fabric, and because ko2iblnd loads without any complaints.
>
> In /etc/modprobe.d/lustre (this is a Debian system, hence this subdir of
> modprobe configs), I have
>
> options ip2nets="o2ib0 192.168.0.[1-5]"

If this is a verbatim copy from the config file, then you're lacking the name of the module, ie. 'options lnet ip2nets=...'. Maybe also double-check with 'modprobe -c' that the options get passed on as intended.

> I load lnet and do 'lctl network up', but then 'lctl list_nids' will
> invariably give me only
>
> 192.168@tcp
>
> no matter how I twist the modprobe config (ip2nets="o2ib", network="o2ib",
> network="o2ib(ib0)", etc.)
>
> This is true as long as I have ib0 configured with the IP 192.168.0.1
> Once I unconfigure it, I get, quite expectedly,
>
> LNET configure error 100: Network is down

So ib0 is the only network interface in the system? In this case, I could imagine that ksocklnd gets loaded unconditionally, always grabs the first interface it can get hold of, and just doesn't leave any IB interface for ko2iblnd when it eventually gets loaded. This is just a shot in the dark, but you could check by manually loading the modules via insmod.

Regards, Daniel.
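A minimal /etc/modprobe.d/lustre along those lines might read (interface name and address range taken from the post; treat this as a sketch, not a verified config):

```
# Note the module name 'lnet' in front of the option:
options lnet ip2nets="o2ib0(ib0) 192.168.0.[1-5]"
# or, with a static network list instead of ip2nets matching:
# options lnet networks="o2ib0(ib0)"
```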
Re: [Lustre-discuss] Clients getting incorrect network information for one of two MDT servers (active/passive)
Hi! On Fri, May 21, 2010 at 11:54:56AM -0400, McKee, Shawn wrote: > Parameters: > mgsnode=10.10.1@tcp,192.41.230@tcp1,141.211.101@tcp2 > failover.node=10.10.1...@tcp,192.41.230...@tcp1 > > Notice there is no reference to 192.41.230...@tcp anywhere here. Lustre MDS and OSS nodes register themselves with the MGS when they are started (mounted) for the first time. In particular, the then-current list of network ids is recorded and sent off to the MGS, from where it is propagated to all clients. This information sticks and will not be updated automatically, even if the configuration on the server changes. From your description, it sounds like you initially started up the MDS with an incorrect LNET config (and probably fixed it in the meantime, but the MGS and thus the clients won't know). Check with "lctl list_nids" on your first MDS that you're content with the current configuration, then follow the procedure to change a server nid ("writeconf procedure") that is documented in the manual, and you should get both server nodes operational again. Regards, Daniel. ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
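In outline, the writeconf procedure from the manual looks roughly like the following sketch (device names are placeholders; consult the manual section for your Lustre version before running this, since all clients and targets must be unmounted first):

```shell
# 1. Unmount all clients, then all OSTs, then the MDT/MGS.
# 2. Regenerate the configuration logs on every target:
tunefs.lustre --writeconf /dev/mdtdev    # on the MDS
tunefs.lustre --writeconf /dev/ostdev    # on each OSS, for each OST
# 3. Remount in order: MGS/MDT first, then the OSTs, then the clients.
#    Each target re-registers with the MGS using its current NID list.
```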
Re: [Lustre-discuss] programmatic access to parameters
On Thu, Mar 25, 2010 at 10:07:03AM -0700, burlen wrote: > > I don't know what your constraints are, but should note that this sort > > of information (number of OSTs) can be obtained rather trivially from > > any lustre client via shell prompt, to wit: > True, but parsing the output of a c "system" call is something I hoped > to avoid. It might not be portable and might be fragile over time. > > This also gets at my motivation for asking for a header with the Lustre > limits, if I hard code something down the road the limits change, and we > are suddenly shooting ourselves in the foot. Instead of teaching the application some filesystem intrinsics, you could also teach the queueing system about your application's output behaviour and let it set up an adequately configured working directory. GridEngine allows running queue-specific prolog scripts for this purpose; other systems certainly offer similar features. Regards, Daniel. ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
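As an illustration (the stripe count, directory layout, and exact GridEngine environment variables are assumptions for this sketch), such a prolog script might simply pre-create a suitably striped job directory:

```shell
#!/bin/sh
# GridEngine queue prolog: prepare a job scratch directory with a
# stripe count matched to this queue's typical output pattern.
JOBDIR="/lustre/scratch/$JOB_ID"
mkdir -p "$JOBDIR"
lfs setstripe -c 20 "$JOBDIR"    # files created below inherit 20 stripes
chown "$SGE_O_LOGNAME" "$JOBDIR"
```

The application then just writes into its working directory and never needs to know the filesystem's striping limits.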
Re: [Lustre-discuss] Slow du
Hi! On Wed, Feb 03, 2010 at 04:36:43PM -0500, Larry Brown wrote: > We have a cluster set up with 60 nodes. We have a folder with the > striping count set to 20. We loaded up two folders with a large number > of small files. I know the bigger the file the better the more Lustre > shines but we have multiple things to test and this is the first. > According to "lfs df -h" there is a total disk use of 76.3G. I ran "du > -sh" at the top level of the folders and am now at 45 minutes run time > without a result yet. Top only shows du taking between 3 and 5% cpu > time and 1% of memory. > > Does anyone know what causes this? Shouldn't the server be able to > examine the MDS to sum the space used? At worst wouldn't the first > object on the stripe return the total file size if it isn't kept on the > MDS? In Lustre 1.8, the MDS doesn't know the accurate size of a file. This feature is planned for Lustre 2.x. Size information therefore currently needs to be obtained from the OSTs, and as files can contain holes, there's no way of knowing the end of the file until all objects have been examined--even if for a small file most of them turn out to be empty. Furthermore, unless you set a stripe count hint, or otherwise overrode the default inode size when creating the MDT, the stripe information won't fit into the inode and will therefore require an extra seek for each file. Regards, Daniel. ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Client Mount Error Messages
Hi! On Mon, Dec 28, 2009 at 12:00:05PM -0500, CHU, STEPHEN H (ATTSI) wrote: > I have a question regarding a few error messages presented after a client has > mounted the File System. The FS mounted ok and is useable but the > LusterErrors do not look normal. The client does not have IB connectivity to > the MDS/OSS but uses "tcpX" to access the MDSs/OSSs. MDSs and OSSs are > inter-connected with IB. The following are the configurations: > > MDS1: (...) > ·MGS/MDS Parameters: lov.stripesize=25M lov.stripecount=1 > failover.node=10.103.34@o2ib,10.103.34@o2ib1 failover.node tells the clients where to connect when the primary server is unavailable. You need to list all NIDs of the failover node, or your 10GE-only nodes won't be able to connect. The Lustre manual has a section on "Changing the Address of a Failover Node" that might help you along. Regards, Daniel. ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
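To illustrate (NIDs and the device name are placeholders, not the poster's actual addresses), the failover parameter on the target would need to list every NID the failover node is reachable on, including a tcp NID for the Ethernet-only clients:

```shell
# list ALL NIDs of the failover server, IB and tcp alike,
# so that clients without IB can still find it
tunefs.lustre \
  --param failover.node=10.0.1.2@o2ib,10.0.1.2@o2ib1,192.0.2.2@tcp0 \
  /dev/mdtdev
```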
Re: [Lustre-discuss] WARNING: short read while accessing file >4GB on 32-bit client
Hi! On Thu, Dec 10, 2009 at 02:36:07PM +0100, Johann Lombardi wrote: > Actually, we are also considering adding such a page to the lustre wiki. > In addition to lustre mailing lists, you can also monitor bugs added > to the corruption tracker (bugzilla ticket #19512). That's a good hint, thanks! Unfortunately, the bug doesn't seem to be open for public access. Is that intentional? Regards, Daniel. ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] ACL question
Hi! On Mon, Oct 26, 2009 at 05:35:36PM -0600, Robert LeBlanc wrote: > Well, my little trick isn't working right now. I'm not sure how to debug > this. With Lustre, the MDS authorizes access when a client first touches a certain file. Once it's cached, the client handles authorization itself. If you experience erratic behaviour, this points to a difference in either nameservice or ACL configuration between MDS and client. Are ACLs turned on on the MDT? Does user2 show up as a member of group1 on the MDS? Is the Lustre group upcall configured? Regards, Daniel. ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] How to find a file by object id
Hi! On Wed, Oct 21, 2009 at 11:58:43AM +0100, Wojciech Turek wrote: > I apologize if this question was answered earlier but I can not find it in > the mailing list. > I have an object ID and I would like to find file that this object is part > of. I tried to use lfs find but I can not seem to find right combination of > options. On a Lustre client, I'm not aware of any better method than using a combination of lfs find and lfs getstripe to look for matching object IDs. On the servers, Andreas Dilger has described a more efficient way using debugfs in http://lists.lustre.org/pipermail/lustre-discuss/2009-July/011080.html > Also is there a simple way to list all the files and their object IDs? I'm not sure whether it counts as "simple", but this is what I'm using: lfs find . -type f -print0 | while IFS='' read -d '' i; do echo "$i"; lfs getstripe -q "$i"; done Regards, Daniel. ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
[Lustre-discuss] Expected number of clients during MDS reconnect.
Hi! When initiating an MDS failover on one of our systems, we see the new active MDS expecting more clients to recover than were actually connected before. # cat /proc/fs/lustre/mds/lustrefs-MDT/recovery_status status: COMPLETE recovery_start: 1255622509 recovery_duration: 300 delayed_clients: 0/651 completed_clients: 260/651 replayed_requests: 4 last_transno: 4112137365 Where 260 is indeed the correct number of active clients. # ls -d1 /proc/fs/lustre/mds/lustrefs-MDT/exports/*...@* | wc -l 260 # cat /proc/fs/lustre/mds/lustrefs-MDT/num_exports 261 Not sure what causes the off-by-one between num_exports and the number of entries in the exports subdirectory, but the difference doesn't look severe. I do wonder about the expected number of 651 clients, though. When recovery has finished on the MDS, Lustre correctly evicts those surplus clients, it seems, as the syslog reports Lustre: lustrefs-MDT: Recovery period over after 5:00, of 651 clients 260 recovered and 391 were evicted. but still the MDT apparently keeps note of them and expects them back during the next recovery cycle. Which means that currently we always have to wait the full recovery timespan even though all active clients have reconnected already. We've seen this behaviour with MDSes running 1.6.7.2 and 1.8.1, clients run a mixture of versions between 1.6.6 and 1.8.1. During the lifetime of the system, we've only decommissioned a small number of systems running Lustre clients, so the difference between current and expected client numbers must have developed by some other means. Does anyone know how the MDT calculates the number of expected clients? Is there a way to make Lustre dump a list of nids of the surplus clients it evicts after the recovery phase? And above all, is there a way to convince the MDT about the true number of clients (preferably one that doesn't involve the writeconf dance ;-)? Regards, Daniel. 
___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] NFS vs Lustre
Hi! On Mon, Aug 31, 2009 at 04:34:58PM -0400, Brian J. Murrell wrote: > On Mon, 2009-08-31 at 21:56 +0200, Daniel Kobras wrote: > > Lustre's > > standard config follows Posix and allows dirty client-side caches after > > close(). Performance improves as a result, of course, but in case something > > goes wrong on the net or the server, users potentially lose data just like > > on > > any local Posix filesystem. > > I don't think this is true. This is something that I am only > peripherally knowledgeable about and I am sure somebody like Andreas or > Johann can correct me if/where I go wrong... > > You are right that there is an opportunity for a client to write to an > OST and get it's write(2) call returned before data goes to physical > disk. But Lustre clients know that, and therefore they keep the state > needed to replay that write(2) to the server until the server sends back > a commit callback. The commit callback is what tells the client that > the data actually went to physical media and that it can now purge any > state required to replay that transaction. Lustre can recover from certain error conditions just fine, of course, but still it cannot recover gracefully from others. Think double failures or, more likely, connectivity problems to a subset of hosts. For instance, if, say, an Ethernet switch goes down for a few minutes with IB still available, all Ethernet-connected clients will get evicted. Users won't necessarily notice that there was a problem, but they've just potentially lost data. VBR makes the data loss less likely in this case, but the possibility is still there. I'd suspect you'll always be able to construct similar corner cases as long as the networked filesystem allows dirty caches after close(). Regards, Daniel. ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] NFS vs Lustre
Hi! On Sun, Aug 30, 2009 at 04:12:11PM -0500, Nicolas Williams wrote: > NFSv4 can't handle O_APPEND, and has those close-to-open semantics. > Those are the two large departures from POSIX in NFSv4. Along these lines, it's probably worth mentioning commit-on-close as well, an area where NFS (v3 and v4, optionally relaxed when using write delegations) is more strict than Posix. This is to make sure that NFS still has the possibility to notify the user about errors when trying to save their data. Lustre's standard config follows Posix and allows dirty client-side caches after close(). Performance improves as a result, of course, but in case something goes wrong on the net or the server, users potentially lose data just like on any local Posix filesystem. The difference being that users tend to notice when their local machine crashes. It's much easier to miss a remote server or a switch going down, and hence suffer from silent data loss. (Admins will typically notice, eg. via eviction messages in the logs, but have a hard time telling which files had been affected.) The solution is to fsync() all valuable data on a Posix filesystem, but that's not necessarily within reach for an average end user. Regards, Daniel. ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
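A minimal sketch of the fsync-before-close pattern mentioned above (path and helper name are illustrative): by calling fsync() before close(), any server- or network-side write error is reported to the application while it still has the file open, instead of being lost silently.

```python
import os

def durable_write(path, data):
    """Write data and force it to stable storage before close()."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)   # I/O errors surface here, not after close()
    finally:
        os.close(fd)

durable_write("/tmp/demo_durable.txt", b"valuable data\n")
```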
Re: [Lustre-discuss] NFS vs Lustre
Hi! On Sat, Aug 29, 2009 at 11:56:40AM -0600, Lee Ward wrote: > NFS4 addresses those by: > > 1) Introducing state. Can do full POSIX now without the lock servers. > Lots of resiliency mechanisms introduced to offset the downside of this, > too. NFS4 implementations are able to handle Posix advisory locks, but unlike Lustre, they don't support full Posix filesystem semantics. For example, NFS4 still follows the traditional NFS close-to-open cache consistency model whereas with Lustre, individual write()s are atomic and become immediately visible to all clients. Regards, Daniel. ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] InfiniBand QoS with Lustre ko2iblnd.
Hi Sébastien! On Wed, Jun 24, 2009 at 09:46:19AM +0200, Sébastien Buisson wrote: > > - if the 'service id' information was stored on the MGS on a file system > > basis, one could imagine to retrieve it at mount time on the clients. > > The 'service id' information stored on the MGS could consist in a port > > space and a port id. Thus it would be possible to affect different > > service ports to the various connections initiated by the client, > > depending on the target file system. > > What do you think? Would you say this is feasible, or can you see major > > issues with this proposal? > > > > The peer's port information could be stored in the kib_peer_t structure. > That way, it would be possible to make clients connect to servers which > listen on different ports. > What do you think? Why do you want to distinguish the two filesystems solely by service id rather than, say, service id + port guids of the respective Lustre servers? You'll need a full QoS policy file instead of the simplified syntax, and configuration needs to be adapted on hardware changes, but this still looks simpler to me than modifying the wire protocol. Regards, Daniel. ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] x4540 (thor) panic
Hi! On Thu, Jun 11, 2009 at 10:08:30AM -0400, Brock Palen wrote: > I attached a screen shot of the dump from console, its incomplete > though. > Should I just update the server to 1.6.7.2 ? Just strange that the > two x4500 had no issues, but the x4540 does The X4500 comes with Intel NICs while the X4540 is equipped with NVidia chips. The Linux driver for the latter reportedly isn't well tuned. Bumping parameter max_interrupt_work of the forcedeth module to 100 cured stability problems for us. That's on a RHEL5-based system, though. Not sure how the driver in RHEL5 fares, but it might still be worth a try. Regards, Daniel. ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
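For reference, the workaround is a one-line module option (file name under /etc/modprobe.d/ is arbitrary; 100 is the value that worked for us as described above):

```shell
# /etc/modprobe.d/forcedeth.conf
options forcedeth max_interrupt_work=100
```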
Re: [Lustre-discuss] InfiniBand QoS with Lustre ko2iblnd.
Hi! On Mon, May 18, 2009 at 01:34:03PM -0700, Jim Garlick wrote: > Hi, I don't know much about this stuff, but our IB guys did use QoS > to help us when we found LNET was falling apart when we brought up > our first 1K node cluster based on quad socket, quad core opterons, > and ran MPI collective stress tests on all cores. > > Here are some notes they put together - see the "QoS Policy file" section. Great summary, thanks for sharing! Seems like qos-ulp is a rather recent OpenSM-specific feature, and the SMs in our switches apparently don't offer a similar SID-to-SL mapping either, but it certainly got me a leap further. Thanks, Daniel. ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Lustre attribute caching?
Hi! On Mon, Apr 27, 2009 at 01:29:20AM -0600, Andreas Dilger wrote: > Lustre will cache negative dentries as long as the directory is not > modified. If the directory has a new file created or an existing > one deleted then the negative dentries, along with the rest of the > readdir data is removed from all client caches, though the individual > inodes can still live in the client cache as they have their own locks. Interesting. Thanks for the info! This makes me wonder how Lustre copes with temporary network outages in this case. If the client drops off the net, the server cannot invalidate the caches. Does the client notice the communication problem, or could it briefly serve stale data? Regards, Daniel. ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
[Lustre-discuss] InfiniBand QoS with Lustre ko2iblnd.
Hi! Does anyone know how to use QoS with Lustre's o2ib LND? The Voltaire IB LND made it possible to #define a service level, but I couldn't find a similar facility in o2ib. Is there a different way to apply QoS rules? Thanks, Daniel. ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] NUMA IO and Lustre
Hi! On Tue, May 12, 2009 at 05:49:26PM +0200, Sébastien Buisson wrote: > Your workaround sounds very interesting, because it shows that the > network settings of an OST cannot be fixed by any configuration > directive: this is done in an opaque way at mkfs.lustre time. To clarify: It's the initial mount rather than mkfs. The process as far as I understood works as follows (and hopefully someone will correct me if I got it wrong): * OST is formatted. * OST is mounted for the first time. + OST queries the current list of nids on its OSS. + OST sends its list of nids off to the MGS. * MGS registers the new OST (and its nids). * MGS advertises the new OST (and its nids) to all (present and future) clients. It's important to note that MGS registration is a one-off process that cannot be changed or redone later on, unless you wipe the complete Lustre configuration from all servers using the infamous --writeconf method and restart all Lustre clients to remove any stale cached info. (Which of course implies a downtime of the complete filesystem.) Once the registration is done, you can change a server's LNET configuration, but the MGS won't care, and the clients will never get notified of it. Regards, Daniel. ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
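In shell terms, the one-off registration happens on the first mount (fsname, NID, and device below are placeholders), which is why the LNET configuration must already be correct at that point:

```shell
mkfs.lustre --ost --fsname=lustrefs --mgsnode=mgs_nid@o2ib /dev/ostdev
# First mount: the OST picks up the OSS's current NID list and
# registers it permanently with the MGS.
mount -t lustre /dev/ostdev /mnt/ost0
```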
Re: [Lustre-discuss] NUMA IO and Lustre
Hi! On Tue, May 12, 2009 at 04:07:51PM +0200, Sébastien Buisson wrote: > Concretely, we would like to know if it is possible in Lustre to bind an > OST to a specific network interface, so that this OST is only reached > through this interface (thus avoiding the NUMA IO factor in our case) ? > For instance, we would like to have 4 OSTs attached to ib0 and the 4 > other OSTs attached to ib1. As I recently learned the hard way, the network settings of an OST are fixed when the OST connects to the MGS for the first time. Hence, you could ifup ib0, ifdown ib1, start LNET and fire up the first set of OSTs, then down LNET again, ifdown ib0, ifup ib1, restart LNET and start up the rest of the OSTs. Subsequently, you should be able to run with both IB nids enabled, but clients should still only know about a single IB nid for each OST. As for a less hackish solution, wouldn't it be cleaner to just run, say, two Xen domUs on the machine and map the HBAs/HCAs as appropriate? Regards, Daniel. ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
[Lustre-discuss] Lustre attribute caching?
Hi! I wonder about Lustre's client-side caching strategy for file attributes and directory entries, and whether it suffers from the same problems as NFS in this regard. I seem to be unable to find this question answered on the web, so I'd be happy for any hints. Specifically, I'm interested in caching of negative entries. With /sharedfs shared via NFS, attribute caching might prevent clientA from seeing the newly created file in the final stat call of this test: On clientA: # rm -f /sharedfs/test; ls -l /sharedfs/test; ssh clientB "touch /sharedfs/test"; stat /sharedfs/test How does Lustre behave in this case? Are newly created files immediately visible on all clients, or will they cache (negative) attributes and dentries for some time? Does Posix mandate specific behaviour in this regard? Thanks for any insight, Daniel. ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Registering OST with failover MGS.
On Wed, Feb 04, 2009 at 11:07:42AM -0700, Kevin Van Maren wrote: > Brian J. Murrell wrote: > > On Wed, 2009-02-04 at 18:25 +0100, Daniel Kobras wrote: > >> I've tried that--as well as a comma- or colon-separated list of NIDs--, > >> but neither options appears to work: As soon as I try to mount the OST > >> partition with the MGS active on the secondary node, the mount fails, > >> and error messages indicate that only the primary MGS node is queried. > >> All systems are running Lustre version 1.6.6, by the way. Correction: tunefs.lustre --mgsnode=mgs1 --mgsnode=mgs2 ... works as soon as I don't typo the NIDs. tunefs.lustre --mgsnode=mgs1:mgs2 doesn't. > > Well, it should work as described. Please try to find the bugzilla bug > > referencing a problem with this and see if it applies. If not, please > > file a new bug. > > Probably referring to bug 15912, which was for "mkfs" issues, where the > mgs option had to be > specified twice to indicate two different servers. mkfs was fixed in 1.6.6, but there still seems to be a similar issue with tunefs.lustre. I've just filed #18438 for this now. Thanks, Daniel. ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
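So the working form repeats the option, once per server of the MGS failover pair (NIDs and device below are placeholders):

```shell
# works: one --mgsnode per server of the failover pair
tunefs.lustre --mgsnode=mgs1_nid@tcp --mgsnode=mgs2_nid@tcp /dev/ostdev

# does not work in 1.6.6 (see bug #18438):
# tunefs.lustre --mgsnode=mgs1_nid@tcp:mgs2_nid@tcp /dev/ostdev
```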
Re: [Lustre-discuss] Registering OST with failover MGS.
Hi! On Wed, Feb 04, 2009 at 11:50:58AM -0500, Brian J. Murrell wrote: > On Wed, 2009-02-04 at 17:21 +0100, Daniel Kobras wrote: > > Lustre clients allow to mount from either the primary MGS or a failover > > server using a colon-separated syntax like > > > > primary.mgs.example.com:failover.mgs.example.com:/fsname > > > > Is there a similar way to configure an MGS failover pair into an OST? > > tunefs.lustre --mgsnode seems to accept only a single MGS. > > $ man tunefs.lustre > ... > --mgsnode=nid,... >Set the NID(s) of the MGS node, required for all targets other >than the MGS. > > Notice the plurality of "NID(s)"? That means you can use the multiple > NID syntax. Sure, this tells me how to specify multiple NIDs of a single server, but I'd like to configure the NIDs of two distinct servers in a failover pair. > You can also specify --mgsnode=nid multiple times. I've tried that--as well as a comma- or colon-separated list of NIDs--, but neither option appears to work: As soon as I try to mount the OST partition with the MGS active on the secondary node, the mount fails, and error messages indicate that only the primary MGS node is queried. All systems are running Lustre version 1.6.6, by the way. Regards, Daniel. ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
[Lustre-discuss] Registering OST with failover MGS.
Hi! Lustre clients allow to mount from either the primary MGS or a failover server using a colon-separated syntax like primary.mgs.example.com:failover.mgs.example.com:/fsname Is there a similar way to configure an MGS failover pair into an OST? tunefs.lustre --mgsnode seems to accept only a single MGS. Regards, Daniel. ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Lustre failover of MGS
Kevin Van Maren <[EMAIL PROTECTED]> writes: > mkfs.lustre --reformat [EMAIL PROTECTED] --mgs /dev/sdb Is --failnode evaluated for the MGS? We seem to do fine without it as any client requires explicit configuration of the MGS failnode anyway. Or is it possible to override this configuration with the value set on the MGS? > mkfs.lustre --reformat --fsname bananafs [EMAIL PROTECTED] [EMAIL PROTECTED] [EMAIL PROTECTED] --mdt /dev/sdc Regards, Daniel. ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
[Lustre-discuss] Early replies from pre-1.6.5 servers?
Hi! While debugging connection problems on a Lustre client running 1.6.5.1 on RHEL5, I discovered early replies in the client debug output. Adaptive timeouts are disabled on the client, and our server infrastructure is running stock 1.6.4.3 (RHEL5), still. As far as I understood, early replies should only occur post 1.6.5 when adaptive timeouts are active. Did I get this wrong, and early replies are perfectly valid even in our setup? Or is the 1.6.5.1 client misinterpreting message headers? I'm attaching a debug trace for a single XID on the client and the MDS/MGS. Thanks, Daniel. ==> client ([EMAIL PROTECTED], Lustre 1.6.5.1) <== [EMAIL PROTECTED] # lctl debug_file /tmp/lustre-debug.client.log | grep 266023 0100:0010:2:1217244344.471204:0:9991:0:(client.c:1784:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc ll_cfg_requeue:38cc9155-5e64-1d01-bbf2-8b621120e1b0:9991:x266023:[EMAIL PROTECTED]:101 0100:0200:2:1217244344.471250:0:9991:0:(niobuf.c:540:ptl_send_rpc()) Setup reply buffer: 368 bytes, xid 266023, portal 25 0100:0040:2:1217244344.471269:0:9991:0:(niobuf.c:561:ptl_send_rpc()) @@@ send flg=0 [EMAIL PROTECTED] x266023/t0 o101->[EMAIL PROTECTED]@o2ib_0:26/25 lens 232/368 e 0 to 100 dl 121724 ref 2 fl Rpc:/0/0 rc 0/0 0100:0200:2:1217244344.471299:0:9991:0:(niobuf.c:70:ptl_send_buf()) Sending 232 bytes to portal 26, xid 266023, offset 0 0100:0200:2:1217244344.471379:0:9991:0:(client.c:1871:ptlrpc_queue_wait()) @@@ -- sleeping for 10 ticks [EMAIL PROTECTED] x266023/t0 o101->[EMAIL PROTECTED]@o2ib_0:26/25 lens 232/368 e 0 to 100 dl 121724 ref 2 fl Rpc:/0/0 rc 0/0 0100:0200:2:1217244344.471399:0:9991:0:(client.c:771:ptlrpc_check_reply()) @@@ rc = 0 for [EMAIL PROTECTED] x266023/t0 o101->[EMAIL PROTECTED]@o2ib_0:26/25 lens 232/368 e 0 to 100 dl 121724 ref 2 fl Rpc:/0/0 rc 0/0 0100:0200:2:1217244344.471416:0:9991:0:(client.c:771:ptlrpc_check_reply()) @@@ rc = 0 for [EMAIL PROTECTED] x266023/t0 o101->[EMAIL PROTECTED]@o2ib_0:26/25 lens 232/368 e 0 to 100 dl 
121724 ref 2 fl Rpc:/0/0 rc 0/0 0100:0200:1:1217244344.471440:0:3049:0:(events.c:55:request_out_callback()) @@@ type 4, status 0 [EMAIL PROTECTED] x266023/t0 o101->[EMAIL PROTECTED]@o2ib_0:26/25 lens 232/368 e 0 to 100 dl 121724 ref 2 fl Rpc:/0/0 rc 0/0 0100:0040:1:1217244344.471458:0:3049:0:(client.c:1526:__ptlrpc_req_finished()) @@@ refcount now 1 [EMAIL PROTECTED] x266023/t0 o101->[EMAIL PROTECTED]@o2ib_0:26/25 lens 232/368 e 0 to 100 dl 121724 ref 2 fl Rpc:/0/0 rc 0/0 0100:0200:1:1217244344.471565:0:3049:0:(events.c:84:reply_in_callback()) @@@ type 1, status 0 [EMAIL PROTECTED] x266023/t0 o101->[EMAIL PROTECTED]@o2ib_0:26/25 lens 232/368 e 0 to 100 dl 121724 ref 1 fl Rpc:/0/0 rc 0/0 0100:1000:1:1217244344.471578:0:3049:0:(events.c:112:reply_in_callback()) @@@ Early reply received: mlen=240 offset=0 replen=368 replied=0 unlinked=0 [EMAIL PROTECTED] x266023/t0 o101->[EMAIL PROTECTED]@o2ib_0:26/25 lens 232/368 e 0 to 100 dl 121724 ref 1 fl Rpc:/0/0 rc 0/0 0100:0200:2:1217244344.471652:0:9991:0:(client.c:771:ptlrpc_check_reply()) @@@ rc = 0 for [EMAIL PROTECTED] x266023/t0 o101->[EMAIL PROTECTED]@o2ib_0:26/25 lens 232/368 e 1 to 100 dl 121724 ref 1 fl Rpc:/0/0 rc 0/0 0100:0200:2:1217244344.471668:0:9991:0:(client.c:771:ptlrpc_check_reply()) @@@ rc = 0 for [EMAIL PROTECTED] x266023/t0 o101->[EMAIL PROTECTED]@o2ib_0:26/25 lens 232/368 e 1 to 100 dl 121724 ref 1 fl Rpc:/0/0 rc 0/0 0100:0100:2:121724.471354:0:9991:0:(client.c:1198:ptlrpc_expire_one_request()) @@@ timeout (sent at 1217244344, 100s ago) [EMAIL PROTECTED] x266023/t0 o101->[EMAIL PROTECTED]@o2ib_0:26/25 lens 232/368 e 1 to 100 dl 121724 ref 1 fl Rpc:/0/0 rc 0/0 0100:02000400:2:121724.471376:0:9991:0:(client.c:1206:ptlrpc_expire_one_request()) Request x266023 sent from [EMAIL PROTECTED] to NID [EMAIL PROTECTED] 100s ago has timed out (limit 100s). 
0100:0200:2:121724.471845:0:9991:0:(events.c:84:reply_in_callback()) @@@ type 5, status 0 [EMAIL PROTECTED] x266023/t0 o101->[EMAIL PROTECTED]@o2ib_0:26/25 lens 232/368 e 1 to 100 dl 121724 ref 1 fl Rpc:X/0/0 rc 0/0 0100:0010:2:121724.471859:0:9991:0:(events.c:102:reply_in_callback()) @@@ unlink [EMAIL PROTECTED] x266023/t0 o101->[EMAIL PROTECTED]@o2ib_0:26/25 lens 232/368 e 1 to 100 dl 121724 ref 1 fl Rpc:X/0/0 rc 0/0 0100:0010:2:121724.471970:0:9991:0:(client.c:2091:ptlrpc_abort_inflight()) @@@ inflight [EMAIL PROTECTED] x266023/t0 o101->[EMAIL PROTECTED]@o2ib_0:26/25 lens 232/368 e 1 to 100 dl 121724 ref 1 fl Rpc:X/0/0 rc 0/0 0100:0200:2:121724.472014:0:9991:0:(client.c:771:ptlrpc_check_reply()) @@@ rc = 1 for [EMAIL PROTECTED] x266023/t0 o101->[EMAIL PROTECTED]@o2ib_0:26/25 lens 232/368 e 1 to 100 dl 121724 ref 1 fl Rpc:E