Re: [lustre-discuss] [Samba] Odd "File exists" behavior when copy-pasting many files to an SMB exported Lustre FS

2022-09-23 Thread Daniel Kobras
Hi Jeremy!

On 23.09.22 at 18:42, Jeremy Allison wrote:
> In practice this has never been an issue. It's only an issue
> now due to the strange EA behavior. Once we have a vfs_lustre
> module in place it will go back to not being an issue :-).

The root of the problem seems to be an asymmetry in how Samba maps
filesystem to Windows EAs and back. For SMB_INFO_SET_EA it's

foo (Windows) ---> user.foo (filesystem)

whereas SMB_INFO_QUERY_ALL_EAS maps as

user.foo  (filesystem) ---> foo (Windows)
other.bar (filesystem) ---> other.bar (Windows)

This means

a) the Windows side cannot distinguish between other.bar and
   user.other.bar, and
b) a copy operation via Samba turns other.bar into user.other.bar
   (leading to the issue at hand because the user.* EA namespace is off
   by default in Lustre).

So rather than adding a VFS module filtering some EAs, why not
just make the mapping symmetric, ie.

diff --git a/source3/smbd/smb2_trans2.c b/source3/smbd/smb2_trans2.c
index 95cecce96e1..31a7a04a72c 100644
--- a/source3/smbd/smb2_trans2.c
+++ b/source3/smbd/smb2_trans2.c
@@ -454,7 +454,7 @@ static NTSTATUS get_ea_list_from_fsp(TALLOC_CTX *mem_ctx,
         struct ea_list *listp;
         fstring dos_ea_name;
 
-        if (strnequal(names[i], "system.", 7)
+        if (!strnequal(names[i], "user.", 5)
             || samba_private_attr_name(names[i]))
             continue;


What am I missing?

Kind regards,

Daniel
-- 
Daniel Kobras
Principal Architect
Puzzle ITC Deutschland
+49 7071 14316 0
www.puzzle-itc.de

-- 
Puzzle ITC Deutschland GmbH
Sitz der Gesellschaft: Eisenbahnstraße 1, 72072 
Tübingen

Eingetragen am Amtsgericht Stuttgart HRB 765802
Geschäftsführer: 
Lukas Kallies, Daniel Kobras, Mark Pröhl

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] [Samba] Odd "File exists" behavior when copy-pasting many files to an SMB exported Lustre FS

2022-09-22 Thread Daniel Kobras
Hi Micha, hi Jeremy!

On 22.09.22 at 11:21, Michael Weiser wrote:
> [from the docs]:
>> If a client(s) will be mounted on several file systems, add the following 
>> line to /etc/xattr.conf file to avoid problems when files are moved 
>> between the file systems: lustre.* skip"
> 
> What exactly were those problems hinted at in the documentation?
> Is the visibility of the lustre.lov attribute for unprivileged users actually 
> needed for anything?
> Can exposing it to unprivileged users be switched off Lustre-side?

In the lustre.lov xattr, Lustre exposes layout information (ie. content
distribution across servers) to regular users. In some cases, it's also
possible to set the desired layout through this interface. Layout
information depends on the innards of the specific fs setup, and should
not be retained when moving files between different filesystems, hence
the hint in the docs.

> Jeremy wrote:
> 
>> Lutre really should not be exposing EA's to callers if
>> it doesn't actually support EA's.

Just to clarify, as Micha's original post had it hidden in parentheses:
Lustre does support xattrs in general, and it does support the 'user.*'
namespace. It's just that the latter needs to be enabled explicitly with
the 'user_xattr' mount option. By default, access to 'user.*' xattrs is
rejected with EOPNOTSUPP.

>> No. Lustre is returning "fictional" EA's that
>> cannot be set. Linux filesystems that don't have
>> EA's don't do that.

On a Lustre system, the lustre.lov xattr can indeed be set without
receiving an error. But that's not what Samba does here. Instead, it
tries to copy any user-readable xattr, and prepends a 'user.' prefix to
any name unless it's already there. This only works if the filesystem
has been mounted with the 'user_xattr' option.

So for a working setup, it boils down to either turning off 'ea support'
in Samba (on by default since 4.9), or turning on 'user_xattr' in the
filesystem.
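
For illustration, the two remedies would look roughly like this (device,
fsname, mount point and share name below are placeholders, untested as
written):

  # variant 1: enable the user.* namespace on the Lustre clients
  mount -t lustre -o user_xattr mgsnode@tcp:/fsname /mnt/lustre
  setfattr -n user.test -v 1 /mnt/lustre/somefile   # should now succeed

  # variant 2: switch off EA support for the affected share in smb.conf
  #   [share]
  #   ea support = no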

Kind regards,

Daniel
-- 
Daniel Kobras
Principal Architect
Puzzle ITC Deutschland
+49 7071 14316 0
www.puzzle-itc.de

-- 
Puzzle ITC Deutschland GmbH
Sitz der Gesellschaft: Eisenbahnstraße 1, 72072 
Tübingen

Eingetragen am Amtsgericht Stuttgart HRB 765802
Geschäftsführer: 
Lukas Kallies, Daniel Kobras, Mark Pröhl

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] [Samba] Odd "File exists" behavior when copy-pasting many files to an SMB exported Lustre FS

2022-09-21 Thread Daniel Kobras
Hi Michael!

On 21.09.22 at 16:54, Michael Weiser via lustre-discuss wrote:
> That leaves the question, where that extended attribute user.lustre.lov is 
> coming from. It appears that Lustre itself exposes a number of extended 
> attributes for every file which reflect internal housekeeping data:
> 
> $ getfattr -m - blarf
> # file: blarf
> lustre.lov
> trusted.link
> trusted.lma
> trusted.lov

Try adding

  lustre.* skip

to /etc/xattr.conf (cf.
<https://doc.lustre.org/lustre_manual.xhtml#lustre_configure_multiple_fs>).
Haven't tested yet, but Samba's EA handling seems to be libattr-based,
so the above tweak should be sufficient to keep smbd off of these
fs-specific attributes.

Kind regards,

Daniel
-- 
Daniel Kobras
Principal Architect
Puzzle ITC Deutschland
+49 7071 14316 0
www.puzzle-itc.de

-- 
Puzzle ITC Deutschland GmbH
Sitz der Gesellschaft: Eisenbahnstraße 1, 72072 
Tübingen

Eingetragen am Amtsgericht Stuttgart HRB 765802
Geschäftsführer: 
Lukas Kallies, Daniel Kobras, Mark Pröhl

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] MDS using D3710 DAS

2021-02-12 Thread Daniel Kobras
Hi!

On 12.02.21 at 06:14, Sid Young wrote:
> Is anyone using a HPe D3710 with two HPeDL380/385 servers in a MDS HA
> Configuration? If so, is your D3710 presenting LV's to both servers at
> the same time AND are you using PCS with the Lustre PCS Resources?
> 
> I've just received new kit and cannot get disk to present to the MDS
> servers at the same time. :(

It's been a while, but I've seen this once with an enclosure that had
been equipped with single- instead of dual-ported drives by mistake.
Might be worthwhile to double-check the specs for your system.

Kind regards,

Daniel
-- 
Daniel Kobras
Principal Architect
Puzzle ITC Deutschland
+49 7071 14316 0
www.puzzle-itc.de

-- 
Puzzle ITC Deutschland GmbH
Sitz der Gesellschaft: Eisenbahnstraße 1, 72072 
Tübingen

Eingetragen am Amtsgericht Stuttgart HRB 765802
Geschäftsführer: 
Lukas Kallies, Daniel Kobras, Mark Pröhl

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] rsync not appropriate for lustre

2020-01-24 Thread Daniel Kobras
Hi!

On 23.01.20 at 16:33, Bernd Melchers wrote:
> we are copying large data sets within our lustre filesystem and between
> lustre and an external nfs server. In both cases the performance is
> unexpected low and the reason seems that rsync is reading and writing in
> 32 kB Blocks, whereas our lustre would be more happy with 4 MB
> Blocks.
> rsync has an --block-size=SIZE Parameter but this adjusts only the
> checksum block size (and the maximum is 131072), not the i/o block size.
> Is there a solution to accelerate rsync? 

Assuming sequential access to sufficiently large files, most reads
issued by a Lustre client with default settings should already be
RPC-sized, regardless of the size used in the read() syscall. Maybe
double-check your readahead settings?
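
For instance, the relevant client-side tunables can be inspected and
adjusted with lctl (parameter names as in recent 2.x clients, the value is
just an example):

  lctl get_param llite.*.max_read_ahead_mb llite.*.max_read_ahead_per_file_mb
  lctl set_param llite.*.max_read_ahead_per_file_mb=256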

Kind regards,

Daniel
-- 
Daniel Kobras
Principal Architect
Puzzle ITC Deutschland
+49 7071 14316 0
www.puzzle-itc.de

-- 
Puzzle ITC Deutschland GmbH
Sitz der Gesellschaft: Jurastr. 27/1, 72072 
Tübingen

Eingetragen am Amtsgericht Stuttgart HRB 765802
Geschäftsführer: 
Lukas Kallies, Daniel Kobras, Mark Pröhl

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] How to remove a pool

2019-03-04 Thread Daniel Kobras
Hi!

On 04.03.19 at 17:56, Bernd Melchers wrote:
> ... or is there a way to un-setstripe directories from pools?

You should be able to remove the pool association with

  lfs setstripe -p none 
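
e.g. (directory path is a placeholder, and assuming your lfs version
accepts 'none' here and supports the --pool flag):

  lfs setstripe -p none /lustre/projdir
  lfs getstripe --pool /lustre/projdir   # should no longer print a pool name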

Kind regards,

Daniel
-- 
Daniel Kobras
Principal Architect
Puzzle ITC Deutschland
+49 7071 14316 0
www.puzzle-itc.de

-- 
Puzzle ITC Deutschland GmbH
Sitz der Gesellschaft: Jurastr. 27/1, 72072 
Tübingen

Eingetragen am Amtsgericht Stuttgart HRB 765802
Geschäftsführer: 
Lukas Kallies, Daniel Kobras, Mark Pröhl

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Disable identity_upcall and ACL

2019-01-09 Thread Daniel Kobras
Hi Aurélien!

On 09.01.19 at 14:30, Degremont, Aurelien wrote:
> The secondary group thing was ok to me. I got this idea even if there is some 
> weird results during my tests. Looks like you can overwrite MDT checks if 
> user and group is properly defined on client node. Cache effects?

In a talk I gave a decade ago, I described a problem with authorization
due to inconsistencies between client and MDT, depending on whether
metadata was in the client cache or not (see p. 23 of
http://wiki.lustre.org/images/b/ba/Tuesday_lustre_automotive.pdf -- you
really managed to challenge my memory ;-) I faintly remember Andreas
commenting that the MDT was always supposed to be authoritative, even
for cached content, and the experienced behaviour was a bug. Indeed,
other than those prehistoric versions, I'm not aware of any
inconsistencies in authorization due to cache effects.

> ACL is really the thing I was interested in. Who is validating the ACLs? MDT, 
> client or both? Do you think ACL could be properly applied if user/groups are 
> only defined on client side and identity_upcall is disabled on MDT side?

Posix ACLs use numeric uids and gids, just like ordinary permission
bits. Evaluation is supposed to happen on the MDT for both. If you can
do without secondary groups, there's no need for user and group
databases on the MDT--numeric ids will work fine. (Unless you use
Kerberos, which will typically require user names for proper id mapping.)
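
A quick way to see this from a client (uid 1001 is an arbitrary numeric
example, the path is a placeholder):

  setfacl -m u:1001:rw- /mnt/lustre/shared/data
  getfacl --numeric /mnt/lustre/shared/data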

Kind regards,

Daniel
-- 
Daniel Kobras
Principal Architect
Puzzle ITC Deutschland
+49 7071 14316 0
www.puzzle-itc.de

-- 
Puzzle ITC Deutschland GmbH
Sitz der Gesellschaft:  Jurastr. 27/1, 72072 
Tübingen
Eingetragen am Amtsgericht Stuttgart HRB 765802
Geschäftsführer: 
Lukas Kallies, Daniel Kobras, Mark Pröhl

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Disable identity_upcall and ACL

2019-01-09 Thread Daniel Kobras
Hi Aurélien!

On 09.01.19 at 11:48, Degremont, Aurelien wrote:
> When disabling identity_upcall on a MDT, you get this message in system
> logs:
> 
>   lustre-MDT: disable "identity_upcall" with ACL enabled maybe cause
> unexpected "EACCESS"
> 
> I’m trying to understand what could be a scenario that shows this problem?
> What is the implication, or rather, how identity_upcall works?

Without an identity_upcall, all Lustre users effectively lose their
secondary group memberships. These are not passed in the RPCs, but
evaluated on the MDS instead. The default l_getidentity receives a
numeric uid, queries NSS to obtain the corresponding account's list of
gids, and passes the list back to the kernel. As a test scenario, just
try to access a file or directory from an account that only has access
permissions via one of its secondary groups. (The log message is a bit
misleading--you don't actually need to use ACLs, ordinary group
permissions are sufficient.)
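
A minimal reproducer along those lines (user, group and paths are
placeholders):

  # as root on a client: directory only accessible via group 'projgrp'
  mkdir /mnt/lustre/projdir
  chgrp projgrp /mnt/lustre/projdir
  chmod 770 /mnt/lustre/projdir
  # as a user whose *secondary* group is projgrp:
  ls /mnt/lustre/projdir   # expected to fail once identity_upcall is NONE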

Kind regards,

Daniel
-- 
Daniel Kobras
Principal Architect
Puzzle ITC Deutschland
+49 7071 14316 0
www.puzzle-itc.de

-- 
Puzzle ITC Deutschland GmbH
Sitz der Gesellschaft:  Jurastr. 27/1, 72072 
Tübingen
Eingetragen am Amtsgericht Stuttgart HRB 765802
Geschäftsführer: 
Lukas Kallies, Daniel Kobras, Mark Pröhl

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Command line tool to monitor Lustre I/O ?

2018-12-20 Thread Daniel Kobras
Hi Roland!

> On 20.12.2018 at 15:04, Laifer, Roland (SCC) wrote:
> 
> what is a good command line tool to monitor current Lustre metadata and
> throughput operations on the local client or server? Up to now we had
> used collectl but this no longer works for Lustre 2.10.

The Lustre exporter (https://github.com/HewlettPackard/lustre_exporter) for 
Prometheus copes well with 2.10. Calling it a command-line tool is a bit of a 
stretch (hey, there’s curl after all!), but it can certainly step in for 
collectl’s non-interactive mode of operation.
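
As a quick command-line check once the exporter runs on a server, something
like this should do (hostname and the default port 9169 are assumptions on
my side):

  curl -s http://oss01:9169/metrics | grep '^lustre_'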

Kind regards,

Daniel
-- 
Daniel Kobras
Principal Architect
Puzzle ITC Deutschland
https://www.puzzle-itc.de

-- 
Puzzle ITC Deutschland GmbH
Sitz der Gesellschaft:  Jurastr. 27/1, 72072 
Tübingen
Eingetragen am Amtsgericht Stuttgart HRB 765802
Geschäftsführer: 
Lukas Kallies
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] projects and project quota

2018-12-10 Thread Daniel Kobras
Hi!

On 10.12.18 at 17:38, Thomas Roth wrote:
> Why do I get an error when enabling project quota?
> 
> The system ("hebe") is running Lustre 2.10.5 and zfs 0.7.9, but it was 
> installed and formatted with
> 2.10.2 and zfs 0.7.1 (and LU-7991 was resolved November 17, fix version is 
> 2.10.3).

LU-7991 was resolved in 2.10.x, but the corresponding changes in ZoL
(https://github.com/zfsonlinux/zfs/pull/6290) have only landed in
master, and will be released with 0.8. As far as I'm aware, project quota
is not functional with zfs 0.7.x.
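
Once you are on 0.8, the feature flag can be checked and enabled per pool,
roughly like this (pool name is a placeholder):

  zpool get feature@project_quota mdtpool
  zpool upgrade mdtpool   # enables newly supported feature flags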

Kind regards,

Daniel
-- 
Daniel Kobras
Principal Architect
Puzzle ITC Deutschland
+49 7071 14316 0
www.puzzle-itc.de

-- 
Puzzle ITC Deutschland GmbH
Sitz der Gesellschaft:  Jurastr. 27/1, 72072 
Tübingen
Eingetragen am Amtsgericht Stuttgart HRB 765802
Geschäftsführer: 
Lukas Kallies
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] new lustre setup, questions about the mgt and mdt.

2018-10-23 Thread Daniel Kobras
Hi!

> On 23.10.2018 at 19:36, Kurt Strosahl wrote:
> 
> 1) I've seen it said that ZFS is a good choice for the lustre mdt, as long as 
> it is zfs 0.7.x.  Has anyone had that experience?
> 2) What about the MGT, that seems like it wouldn't have any issues with being 
> on zfs.

We’ve been running several all-ZFS backed systems over the last few years. Way 
back in the 2.5 days, there once was a ZFS-specific issue with ACLs that we’ve 
actually hit in production. Other than that, the backend FS never caused any 
problems—or they were insignificant enough to slip my mind.

> The system I presently have in mind is two servers hooked up to a jbod, with 
> the MDT and the MGT being separate zpools on the jbod.  That way one head to 
> run the mgt and one could run the mdt, while allowing both heads to serve as 
> the backup for each other.

MGS data is small, mostly static, and in my experience, the extra load from the 
service can be neglected as well. So overall you’re probably better off if you 
don’t waste drives on a separate pool just for the MGS. Just add them to the 
MDT pool instead, and enjoy the extra oomph. Or consider splitting the drives 
into two pools for an active/active configuration with one MDT on each node 
instead. Depending on how your workload spreads across multiple MDTs, this can 
be beneficial even for a relatively small number of disks because you 
effectively double your cache size.
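
A rough sketch of the combined MGS+MDT variant (devices, pool name and
fsname are placeholders, not a tested recipe):

  zpool create mdt0pool mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd
  mkfs.lustre --fsname=lustre0 --mgs --mdt --index=0 --backfstype=zfs \
      mdt0pool/mdt0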

Kind regards,

Daniel  
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre Client in a container

2017-12-31 Thread Daniel Kobras
Hi David!

Do you require both systems to be available as native Lustre filesystems on all 
clients? Otherwise, reexporting one of the systems via NFS during the migration 
phase will keep all data available but decouple the version interdependence 
between servers and clients. In this situation, it’s probably the least 
experimental option.

Kind regards,

Daniel

> On 31.12.2017 at 09:50, David Cohen wrote:
> 
> Patrick,
> Thanks for your response.
> I'm looking for a way to migrate from a 1.8.9 system to 2.10.2, stable enough to 
> run for the several weeks or more that it might take.
> 
> 
> David
> 
> On Sun, Dec 31, 2017 at 12:12 AM, Patrick Farrell  wrote:
> David,
> 
> I have no direct experience trying this, but I would imagine not - Lustre is 
> a kernel module (actually a set of kernel modules), so unless the container 
> tech you're using allows loading multiple different versions of *kernel 
> modules*, this is likely impossible.  My limited understanding of container 
> tech on Linux suggests that this would be impossible, containers allow 
> userspace separation but there is only one kernel/set of modules/drivers.
> 
> I don't know of any way to run multiple client versions on the same node.
> 
> The other question is *why* do you want to run multiple client versions on 
> one node...?  Clients are usually interoperable across a pretty generous set 
> of server versions.
> 
> - Patrick
> 
> 
> From: lustre-discuss  on behalf of 
> David Cohen 
> Sent: Saturday, December 30, 2017 11:45:15 AM
> To: lustre-discuss@lists.lustre.org
> Subject: [lustre-discuss] Lustre Client in a container
>  
> Hi,
> Is it possible to run Lustre client in a container?
> The goal is to run two different client version on the same node, can it be 
> done?
> 
> David
> 
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lfs_migrate rsync vs. lfs migrate and layout swap

2017-11-25 Thread Daniel Kobras
Hi!


> On 20.11.2017 at 00:01, Dilger, Andreas wrote:
> 
> It would be interesting to strace your rsync vs. "lfs migrate" read/write 
> patterns so that the copy method of "lfs migrate" can be improved to match 
> rsync. Since they are both userspace copy actions they should be about the 
> same performance. It may be that "lfs migrate" is using O_DIRECT to minimize 
> client cache pollution (I don't have the code handy to check right now).  In 
> the future we could use "copyfile()" to avoid this as well. 

lfs migrate indeed uses O_DIRECT for reading the source file. A few tests on a 
system running 2.10.1 yielded a 10x higher throughput with a modified lfs 
migrate that simply dropped the O_DIRECT flag. I’ve filed 
https://jira.hpdd.intel.com/browse/LU-10278 about it. (A simple patch to make 
O_DIRECT optional is ready, but I still need to charm the gods of the firewall 
to let me push to Gerrit.)

Kind regards,

Daniel
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Dependency errors with Lustre 2.10.1 packages

2017-11-15 Thread Daniel Kobras
Hi!

> [root@lustre-ost03 ~]# yum install lustre
> Loaded plugins: fastestmirror, versionlock
> Loading mirror speeds from cached hostfile
> Resolving Dependencies
> --> Running transaction check
> ---> Package lustre.x86_64 0:2.10.1-1.el7 will be installed
> --> Processing Dependency: kmod-lustre = 2.10.1 for package: 
> lustre-2.10.1-1.el7.x86_64
> --> Processing Dependency: lustre-osd for package: lustre-2.10.1-1.el7.x86_64
> --> Processing Dependency: lustre-osd-mount for package: 
> lustre-2.10.1-1.el7.x86_64
> --> Processing Dependency: libyaml-0.so.2()(64bit) for package: 
> lustre-2.10.1-1.el7.x86_64
> --> Running transaction check
> ---> Package kmod-lustre.x86_64 0:2.10.1-1.el7 will be installed
> ---> Package kmod-lustre-osd-ldiskfs.x86_64 0:2.10.1-1.el7 will be installed
> --> Processing Dependency: ldiskfsprogs >= 1.42.7.wc1 for package: 
> kmod-lustre-osd-ldiskfs-2.10.1-1.el7.x86_64
> ---> Package libyaml.x86_64 0:0.1.4-11.el7_0 will be installed
> ---> Package lustre-osd-ldiskfs-mount.x86_64 0:2.10.1-1.el7 will be installed
> --> Finished Dependency Resolution
> Error: Package: kmod-lustre-osd-ldiskfs-2.10.1-1.el7.x86_64 (lustre)
>Requires: ldiskfsprogs >= 1.42.7.wc1
>  You could try using --skip-broken to work around the problem
> 
> I've checked the repos and don't see a package for ldiskfsprogs at all. 
> Does anybody know how to resolve this?

It’s a bit convoluted because no package is actually named ldiskfsprogs. 
Instead, the e2fsprogs packages available from 
https://downloads.hpdd.intel.com/public/e2fsprogs/latest/el7/RPMS/x86_64/ have 
a „Provides“ entry for ldiskfsprogs. Installing them should resolve the 
dependency problem.
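
For example (the repo file is a sketch; the baseurl may need to point at
wherever the repodata actually lives):

  # /etc/yum.repos.d/lustre-e2fsprogs.repo
  [lustre-e2fsprogs]
  name=e2fsprogs for Lustre (el7)
  baseurl=https://downloads.hpdd.intel.com/public/e2fsprogs/latest/el7/
  gpgcheck=0

  yum clean metadata && yum install e2fsprogs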

Kind regards,

Daniel 
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Large file read performance degradation from multiple OST's

2017-05-26 Thread Daniel Kobras
Hi!

On Wed, May 24, 2017 at 04:04:00PM +, Vicker, Darby (JSC-EG311) wrote:
> I tried a 2.8 client mounting the 2.9 servers and that showed the expected 
> behavior: increasing performance with increasing OST's.  Two things:
> 
> 1.  Any pointers to compiling a 2.8 client on recent RHEL 7 kernels would be 
> helpful.  I had to boot into an older kernel to get the above test done.
> 
> 2. Any help to find the problem in the 2.9 code base would also be helpful.

The fast_read feature has been added to the client-side read path between 2.8 
and 2.9. So for starters it might
be worthwhile to re-run the test on a 2.9 client with fast_read=0?
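
i.e. on the 2.9 client something like:

  lctl set_param llite.*.fast_read=0
  # ...repeat the read test, then restore the default:
  lctl set_param llite.*.fast_read=1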

Kind regards,

Daniel
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [Lustre-discuss] Object index

2012-07-24 Thread Daniel Kobras
Hi Jason!

On 24.07.2012 at 20:18, Rappleye, Jason (ARC-TN) [Computer Sciences
Corporation] wrote:
> On 7/24/12 11:10 AM, "Daniel Kobras"  wrote:
>> lfs fid2path (on any client) should do what you're looking for.
> 
> Unfortunately that doesn't work for files created prior to Lustre 2.0, or
> files with components of their path created before Lustre 2.0. The link
> EA is missing from the MDT inode of such files, which is what fid2path
> appears to use. This was a real bummer for us, and I'd love for someone to
> tell me that I'm wrong. Please?

Pre-2.0, you can extract the inode number of the parent object on the MDT from 
the object's trusted.fid EA, eg. with ll_decode_filter_fid. On the MDT, you can 
map the inode number to a filesystem path with ncheck in debugfs or find -inum.
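
Roughly like this (object and device paths are placeholders, with the
OST/MDT mounted as type ldiskfs where needed):

  # on the OSS, for a given object file:
  ll_decode_filter_fid /mnt/ost0/O/0/d*/1234      # prints the parent inode
  # on the MDS, map that inode number back to a path:
  debugfs -c -R 'ncheck 123456' /dev/mdtdev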

Regards,

Daniel.
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Object index

2012-07-24 Thread Daniel Kobras
Hi Aurelien!

On 24.07.2012 at 14:04, DEGREMONT Aurelien wrote:

> I'm trying to analyze an OST with few thousands of object and find where they 
> belong to.
> Mounting this OST with ldiskfs and using ll_decode_filter_fid tells me that.
> 
> -Most of these object do not have a fid EA back pointer. Does that means they 
> are not used?

Is this the troglodyte type of OST that started its life in times of 
prehistoric versions of Lustre? We see this on old files that were created in 
the early ages of Lustre 1.6, before the trusted.fid EA was introduced. Other 
than that, these objects could have been preallocated, but never actually used. 
Do these objects contain any data at all (blockcount != 0)?

> -Some of them have good results, and the man page says that
> "For objects with MDT parent sequence numbers above 0x2, this 
> indicates that the FID needs to be mapped via the 
> MDT Object Index (OI) file on the MDT".
> How do I do this mapping? I found some iam utilities but they do not seems to 
> be ok, and I'm afraid IAM userspace code 
> has been deactivated.

lfs fid2path (on any client) should do what you're looking for.

> How can I know if those files could be removed without risk.
> I previously checked that "lfs find" did not find any other files with object 
> on this specific OST I'm working on.

From my experience, a small amount of object leakage is not too uncommon on
real-world systems, so if lfs find doesn't show up any objects anymore, most
likely you're good to take this OST down. (Hey, and you can double-check with
rbh-report --dump-ost, of course! ;-)
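
For completeness, the two checks I had in mind look roughly like this on a
client (mount point, FID and OST name are placeholders):

  lfs fid2path /mnt/lustre [0x200000400:0x1234:0x0]
  lfs find /mnt/lustre --obd lustre-OST0007_UUID -type f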

Regards,

Daniel.
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] very slow directory access (lstat taking >20s to return)

2010-10-29 Thread Daniel Kobras
Hi Frederik!

On Fri, Oct 29, 2010 at 09:40:33AM +0100, Frederik Ferner wrote:
> Doing a 'strace -T -e file ls -n' on one directory with about 750 files, 
> while users were seeing the hanging ls, showed lstat calls taking 
> seconds, up to 23s.

The (l)stat() calls determine the exact size of all files in the displayed
directory. This means that each OSTs needs to revoke client write locks for all
these files, ie. client-side write caches for all files in the directory are
flushed before the (l)stat() returns. This can easily take several seconds if
there is heavy write activity on the file.

Regards,

Daniel.
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] system disk with external journals for OSTs formatted

2010-10-27 Thread Daniel Kobras
On Wed, Oct 27, 2010 at 08:46:41PM +0800, Andreas Dilger wrote:
> I don't know what these errors are, possibly trying to write into the broken
> journal device?  The rest of the fileystem errors are very minor.  You should
> probably delete the journal device via "tune2fs -O ^has_journal", run a full
> "e2fsck -f" and then recreate the journal with "tune2fs -j size=400".

On a filesystem with errors, you'll have to use "tune2fs -f -O ^has_journal" to
force removal of the journal. At least that's what the man page says. When I
once had to pull such a stunt a while ago, tune2fs refused to remove the
journal even when given the force flag, though. It might work as documented
now. Otherwise, my workaround was to retrieve the journal UUID from the
main filesystem with tune2fs -l, then create a new external journal with
this UUID (mke2fs -O journal_dev -U  ...). At this point I was able to
run e2fsck on the main filesystem to get back to a clean state. Finally, I
could remove and add back the journal with normal tune2fs calls to get it
properly linked back to the filesystem.
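
In shell terms, the workaround was roughly the following (device names and
the UUID are placeholders, and this is from memory, so treat it as a sketch):

  tune2fs -l /dev/ostdev | grep 'Journal UUID'
  mke2fs -O journal_dev -U <journal-uuid> /dev/journaldev
  e2fsck -f /dev/ostdev
  tune2fs -f -O ^has_journal /dev/ostdev
  tune2fs -j -J device=/dev/journaldev /dev/ostdev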

Regards,

Daniel.
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] need help debuggin an access permission problem

2010-09-24 Thread Daniel Kobras
Hi!

On Fri, Sep 24, 2010 at 09:18:15AM +0100, Tina Friedrich wrote:
> Cheers Andreas. I had actually found that, but there doesn't seem to be 
> that much documentation about it. Or I didn't find it :) Plus it 
> appeared to find the users that were problematic whenever I tried it, so 
> I wondered if that is all there is, or if there's some other mechanism I 
> could test for.

Mind that access to cached files is no longer authorized by the MDS, but by the
client itself. I wouldn't call it documentation, but
http://wiki.lustre.org/images/b/ba/Tuesday_lustre_automotive.pdf has an
illustration of why this is a problem when nameservices become out of sync
between MDS and Lustre clients (slides 23/24). Sounds like you hit a very
similar issue.

Regards,

Daniel.
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] error e2fsck run for lfsck

2010-09-10 Thread Daniel Kobras
Hi Thomas!

On Fri, Sep 10, 2010 at 08:16:57PM +0200, Thomas Roth wrote:
> on a 1.8.4 test system, I tried prepare for lfsck and got an error from
> e2fsck:
> 
> mds:~# e2fsck -n -v --mdsdb /tmp/mdsdb /dev/sdb2
> e2fsck 1.41.10.sun2 (24-Feb-2010)
> lustre-MDT lustre database creation, check forced.
> Pass 1: Checking inodes, blocks, and sizes
> MDS: ost_idx 0 max_id 190010
> MDS: ost_idx 1 max_id 179671
> MDS: ost_idx 2 max_id 43436
> MDS: ost_idx 3 max_id 97654
> MDS: ost_idx 4 max_id 78973
> MDS: got 40 bytes = 5 entries in lov_objids
> MDS: max_files = 15090
> MDS: num_osts = 5
> mds info db file written
> error: only handle v1 LOV EAs, not 0bd30bd0
> e2fsck: aborted

This looks like https://bugzilla.lustre.org/show_bug.cgi?id=21704: Currently
you cannot fsck an MDT filesystem that uses pools.

Regards,

Daniel.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] mkfs.lustre and failover question

2010-08-18 Thread Daniel Kobras
On Wed, Aug 18, 2010 at 10:16:36AM -0500, David Noriega wrote:
> I've read through the 'More Complicated Configurations' section in the
> manual and it says as part of setting up failover with
> two(active/passive) MDS/MGS and two OSSs(active/active) to use the
> following:
> 
> mkfs.lustre --fsname=lustre --ost --failnode=192.168.5@tcp0
> --mgsnode=192.168.5@tcp0,192.168.5@tcp0
> /dev/lustre-ost1-dg1/lv1

A Lustre-MGS can have more than one network address (NID). Different NIDs of
the same server are separated by commas. Here, you want to configure NIDs for
different servers. Those either have to be separated with a colon, or
alternatively you can just use the --mgsnode option twice with different
arguments. In your case, try

mkfs.lustre --fsname=lustre --ost --failnode=192.168.5@tcp0 
--mgsnode=192.168.5@tcp0 --mgsnode=192.168.5@tcp0 
/dev/lustre-ost1-dg1/lv1

Regards,

Daniel.
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] lfs --obd discrepancy to lctl dl (1.8.3)

2010-08-11 Thread Daniel Kobras
Hi!

On Wed, Aug 11, 2010 at 09:53:01AM +0200, Heiko Schröter wrote:
> According to "lfs find" the stripe should be on scia-OST0017_UUID. "lfs
> gestripe" reports to have it on obdidx 23 , which is scia-OST0014 according
> to lctl dl.  Which one is true ?

lctl dl returns a list of all Lustre components running on the local machine,
prefixed by a sequential number. It depends on the order in which components
have been started on the machine, and has nothing to do with obdidx. Your
file therefore resides on OST 23 (== 0x17), ie. scia-OST0017.
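
The decimal/hex mix-up is easy to check on the shell:

  printf 'obdidx 23 == OST%04x\n' 23   # prints OST0017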

Regards,

Daniel.
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] NFS Export Issues

2010-07-21 Thread Daniel Kobras
On Fri, Jul 16, 2010 at 05:23:16PM -0700, William Olson wrote:
> [r...@lustreclient mnt]# strace -f -p 15964
(...)
> lstat("/mnt/lustre_mail_fs", 0x7fff4bd4b2b0) = -1 EACCES (Permission denied)

When it comes to inexplicable permission problems, have you checked if
SELinux is turned off on the NFS server?
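
e.g. a quick check on the NFS server:

  getenforce     # should print Permissive or Disabled
  setenforce 0   # temporarily switch to permissive for testing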

Regards,

Daniel.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre file access / locking

2010-07-01 Thread Daniel Kobras
Hi!

On Thu, Jul 01, 2010 at 12:35:32PM +0200, Arne Wiebalck wrote:
> I got some basic questions concerning how file access in Lustre
> works: from what I understand, Lustre clients lock file
> ranges with the server before executing I/O operations. What
> does a client get on its next I/O call in case the server needs
> to revoke the lock (the server can do that, right?): does
> the client try to re-acquire the lock, block and then time
> out (in case the server does not grant the lock again to that
> client)?

The locks protect the client-side caches, not the actual I/O. If a lock is
dequeued, the client has to resynchronize (affected parts of) its cache with
the server.

Regards,

Daniel.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] lnet infiniband config

2010-06-24 Thread Daniel Kobras
Hi!

On Tue, Jun 22, 2010 at 04:19:08PM +0200, Thomas Roth wrote:
> I'm getting my feet wet in the infiniband lake and of course I run into
> some problems.
> It would seem I got the compilation part of sles11 kernel 2.6.27 +
> Lustre 1.8.3 + ofed 1.4.2 right, because it allows me to see and use the
> infiniband fabric, and because ko2iblnd loads without any complaints.
> 
> In /etc/modprobe.d/lustre (this is a Debian system, hence this subdir of
> modprobe-configs), I have
> > options ip2nets="o2ib0 192.168.0.[1-5]"

If this is a verbatim copy from the config file, then you're lacking the name
of the module, ie. 'options lnet ip2nets=...'. Maybe also double-check with
'modprobe -c' that options get passed on as intended.
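
i.e. the line should read something like this, keeping your interface
pattern (and then verify with modprobe -c):

  # /etc/modprobe.d/lustre
  options lnet ip2nets="o2ib0 192.168.0.[1-5]"

  modprobe -c | grep lnet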

> I load lnet and do 'lctl network up', but then 'lctl list_nids' will
> invariably give me only
> > 192.168@tcp
> no matter how I twist the modprobe-config (ip2nets="o2ib",
> network="o2ib", network="o2ib(ib0), etc.)
> 
> This is true as long as I have ib0 configured with the IP 192.168.0.1
> Once I unconfigure it, I get, quite expectedly,
> LNET configure error 100: Network is down

So ib0 is the only network interface in the system? In this case, I could
imagine that ksocklnd gets loaded unconditionally, always grabs the first
interface it can get hold of, and just doesn't leave any IB interface for
ko2iblnd when it eventually gets loaded. This is just a shot in the dark, but
you could check by manually loading modules via insmod.

Regards,

Daniel.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Clients getting incorrect network information for one of two MDT servers (active/passive)

2010-05-21 Thread Daniel Kobras
Hi!

On Fri, May 21, 2010 at 11:54:56AM -0400, McKee, Shawn wrote:
> Parameters: 
> mgsnode=10.10.1@tcp,192.41.230@tcp1,141.211.101@tcp2 
> failover.node=10.10.1...@tcp,192.41.230...@tcp1
> 
> Notice there is no reference to 192.41.230...@tcp anywhere here.   

Lustre MDS and OSS nodes register themselves with the MGS when they are started
(mounted) for the first time. In particular, the then-current list of network
ids is recorded and sent off to the MGS, from where it is propagated to all
clients. This information sticks and will not be updated automatically, even if
the configuration on the server changes. From your description, it sounds like
you initially started up the MDS with an incorrect LNET config (and probably
fixed it in the meantime, but the MGS and thus the clients won't know). Check
with "lctl list_nids" on your first MDS that you're content with the current
configuration, then follow the procedure to change a server nid ("writeconf
procedure") that is documented in the manual, and you should get both server
nodes operational again.
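
In rough outline, the writeconf procedure looks like this (please follow
the manual for the exact sequence on your version):

  # unmount all clients, then the MDT, then the OSTs
  tunefs.lustre --writeconf /dev/<mdt-device>   # on the MDS
  tunefs.lustre --writeconf /dev/<ost-device>   # on each OSS, per OST
  # remount the MGS/MDT first, then the OSTs, then the clients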

Regards,

Daniel.
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] programmatic access to parameters

2010-03-25 Thread Daniel Kobras
On Thu, Mar 25, 2010 at 10:07:03AM -0700, burlen wrote:
> > I don't know what your constraints are, but should note that this sort
> > of information (number of OSTs) can be obtained rather trivially from 
> > any lustre client via shell prompt, to wit: 
> True, but parsing the output of a c "system" call is something I hoped 
> to avoid. It might not be portable and might be fragile over time.
> 
> This also gets at my motivation for asking for a header with the Lustre 
> limits, if I hard code something down the road the limits change, and we 
> are suddenly shooting ourselves in the foot.

Instead of teaching the application some filesystem intrinsics, you could
also teach the queueing system about your application's output behaviour
and let it set up an adequately configured working directory. GridEngine
allows you to run queue-specific prolog scripts for this purpose; other systems
certainly offer similar features.
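
As a sketch, a GridEngine queue prolog could be as small as this (the
stripe count and the SGE_O_WORKDIR variable are just examples, adjust to
your setup):

  #!/bin/sh
  # pre-set a wider default striping for files the job creates in its workdir
  lfs setstripe -c 8 "$SGE_O_WORKDIR"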

Regards,

Daniel.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Slow du

2010-02-03 Thread Daniel Kobras
Hi!

On Wed, Feb 03, 2010 at 04:36:43PM -0500, Larry Brown wrote:
> We have a cluster set up with 60 nodes.  We have a folder with the
> striping count set to 20.  We loaded up two folders with a large number
> of small files.  I know the bigger the file the better the more Lustre
> shines but we have multiple things to test and this is the first.
> According to "lfs df -h" there is a total disk use of 76.3G.  I ran "du
> -sh" at the top level of the folders and am now at 45 minutes run time
> without a result yet.  Top only shows du taking between 3 and 5% cpu
> time and 1% of memory.
> 
> Does anyone know what causes this?  Shouldn't the server be able to
> examine the MDS to sum the space used?  At worst wouldn't the first
> object on the stripe return the total file size if it isn't kept on the
> MDS?

In Lustre 1.8, the MDS doesn't know the accurate size of a file. This feature
is planned for Lustre 2.x. Size information therefore currently needs to be
obtained from the OSTs, and as files can contain holes, there's no way of
knowing the end of the file until all objects have been examined--even if
for a small file most of them turn out to be empty.

Furthermore, unless you set a stripe count hint, or otherwise overrode the
default inode size when creating the MDT, the stripe information won't fit
the inode and will therefore require an extra seek for each file.

Regards,

Daniel.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Client Mount Error Messages

2009-12-28 Thread Daniel Kobras
Hi!

On Mon, Dec 28, 2009 at 12:00:05PM -0500, CHU, STEPHEN H (ATTSI) wrote:
> I have a question regarding a few error messages presented after a client has
> mounted the File System. The FS mounted ok and is useable but the
> LusterErrors do not look normal. The client does not have IB connectivity to
> the MDS/OSS but uses "tcpX" to access the MDSs/OSSs. MDSs and OSSs are
> inter-connected with IB. The following are the configurations:
>
> MDS1:
(...)
> ·MGS/MDS Parameters: lov.stripesize=25M lov.stripecount=1
>   failover.node=10.103.34@o2ib,10.103.34@o2ib1

failover.node tells the clients where to connect when the primary server is
unavailable. You need to list all NIDs of the failover node, or your 10GE-only
nodes won't be able to connect. The Lustre manual has a section on "Changing
the Address of a Failover Node" that might help you along.

Regards,

Daniel.
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] WARNING: short read while accessing file >4GB on 32-bit client

2009-12-10 Thread Daniel Kobras
Hi!

On Thu, Dec 10, 2009 at 02:36:07PM +0100, Johann Lombardi wrote:
> Actually, we are also considering adding such a page to the lustre wiki.
> In addition to lustre mailing lists, you can also monitor bugs added
> to the corruption tracker (bugzilla ticket #19512).

That's a good hint, thanks! Unfortunately, the bug doesn't seem to be open for
public access. Is it intentional?

Regards,

Daniel.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] ACL question

2009-10-26 Thread Daniel Kobras
Hi!

On Mon, Oct 26, 2009 at 05:35:36PM -0600, Robert LeBlanc wrote:
> Well, my little trick isn't working right now. I'm not sure how to debug
> this.

With Lustre, the MDS authorizes access when a client first touches a certain
file. Once it's cached, the client handles authorization itself. If you
experience erratic behaviour, this points to a difference in either nameservice
or ACL configuration between MDS and client. Are ACLs turned on on the MDT?
Does user2 show up as a member of group1 on the MDS? Is the Lustre group upcall
configured?

Regards,

Daniel.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] How to find a file by object id

2009-10-21 Thread Daniel Kobras
Hi!

On Wed, Oct 21, 2009 at 11:58:43AM +0100, Wojciech Turek wrote:
> I apologize if this question was answered earlier but I can not find it in
> the mailing list.
> I have an object ID and I would like to find file that this object is part
> of. I tried to use lfs find but I can not seem to find right combination of
> options.

On a Lustre client, I'm not aware of any better method than using a combination
of lfs find and lfs getstripe to look for matching object IDs.  On the servers,
Andreas Dilger has described a more efficient way using debugfs in 
http://lists.lustre.org/pipermail/lustre-discuss/2009-July/011080.html

> Also is there a simple way to list all the files and their object IDs?

I'm not sure whether it counts as "simple", but this is what I'm using:

lfs find . -type f -print0 | while IFS='' read -d '' i; do echo "$i"; lfs getstripe -q "$i"; done

Regards,

Daniel.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Expected number of clients during MDS reconnect.

2009-10-15 Thread Daniel Kobras
Hi!

When initiating an MDS failover on one of our systems, we see the new active
MDS expecting more clients to recover than were actually connected before.

# cat /proc/fs/lustre/mds/lustrefs-MDT/recovery_status 
status: COMPLETE
recovery_start: 1255622509
recovery_duration: 300
delayed_clients: 0/651
completed_clients: 260/651
replayed_requests: 4
last_transno: 4112137365

Where 260 is indeed the correct number of active clients.

# ls -d1 /proc/fs/lustre/mds/lustrefs-MDT/exports/*...@* | wc -l
260
# cat /proc/fs/lustre/mds/lustrefs-MDT/num_exports 
261

Not sure what causes the off-by-one between num_exports and the number of
entries in the exports subdirectory, but the difference doesn't look severe.  I
do wonder about the expected number of 651 clients, though. When recovery has
finished on the MDS, Lustre correctly evicts those surplus clients, it seems,
as the syslog reports

Lustre: lustrefs-MDT: Recovery period over after 5:00, of 651
clients 260 recovered and 391 were evicted.

but still the MDT apparently keeps note of them and expects them back during
the next recovery cycle. Which means that currently we always have to wait the
full recovery timespan even though all active clients have reconnected already.
We've seen this behaviour with MDSes running 1.6.7.2 and 1.8.1, clients run a
mixture of versions between 1.6.6 and 1.8.1. During the lifetime of the system,
we've only decommissioned a small number of systems running Lustre clients, so
the difference between current and expected client numbers must have developped
by some other means.

Does anyone know how the MDT calculates the number of expected clients?
Is there a way to make Lustre dump a list of nids of the surplus clients it
evicts after the recovery phase?
And above all, is there a way to convince the MDT about the true number of
clients (preferably one that doesn't involve the writeconf dance ;-)?

Regards,

Daniel.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] NFS vs Lustre

2009-08-31 Thread Daniel Kobras
Hi!

On Mon, Aug 31, 2009 at 04:34:58PM -0400, Brian J. Murrell wrote:
> On Mon, 2009-08-31 at 21:56 +0200, Daniel Kobras wrote:
> > Lustre's
> > standard config follows Posix and allows dirty client-side caches after
> > close(). Performance improves as a result, of course, but in case something
> > goes wrong on the net or the server, users potentially lose data just like 
> > on
> > any local Posix filesystem.
> 
> I don't think this is true.  This is something that I am only
> peripherally knowledgeable about and I am sure somebody like Andreas or
> Johann can correct me if/where I go wrong...
> 
> You are right that there is an opportunity for a client to write to an
> OST and get it's write(2) call returned before data goes to physical
> disk.  But Lustre clients know that, and therefore they keep the state
> needed to replay that write(2) to the server until the server sends back
> a commit callback.  The commit callback is what tells the client that
> the data actually went to physical media and that it can now purge any
> state required to replay that transaction.

Lustre can recover from certain error conditions just fine, of course, but
still it cannot recover gracefully from others. Think double failures or, more
likely, connectivity problems to a subset of hosts. For instance, if, say, an
Ethernet switch goes down for a few minutes with IB still available, all
Ethernet-connected clients will get evicted. Users won't necessarily notice
that there was a problem, but they've just potentially lost data. VBR makes the
data loss less likely in this case, but the possibility is still there. I'd
suspect you'll always be able to construct similar corner cases as long as the
networked filesystem allows dirty caches after close().

Regards,

Daniel.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] NFS vs Lustre

2009-08-31 Thread Daniel Kobras
Hi!

On Sun, Aug 30, 2009 at 04:12:11PM -0500, Nicolas Williams wrote:
> NFSv4 can't handle O_APPEND, and has those close-to-open semantics.
> Those are the two large departures from POSIX in NFSv4.

Along these lines, it's probably worth mentioning commit-on-close as well, an
area where NFS (v3 and v4, optionally relaxed when using write delegations) is
more strict than Posix. This is to make sure that NFS still has the possibility
to notify the user about errors when trying to save their data. Lustre's
standard config follows Posix and allows dirty client-side caches after
close(). Performance improves as a result, of course, but in case something
goes wrong on the net or the server, users potentially lose data just like on
any local Posix filesystem. The difference being that users tend to notice when
their local machine crashes. It's much easier to miss a remote server or a
switch going down, and hence suffer from silent data loss. (Admins will
typically notice, eg. via eviction messages in the logs, but have a hard time
telling whicht files had been affected.) The solution is to fsync() all
valuable data on a Posix filesystem, but that's not necessarily within reach
for an average end user.

Regards,

Daniel.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] NFS vs Lustre

2009-08-30 Thread Daniel Kobras
Hi!

On Sat, Aug 29, 2009 at 11:56:40AM -0600, Lee Ward wrote:
> NFS4 addresses those by:
> 
> 1) Introducing state. Can do full POSIX now without the lock servers.
> Lots of resiliency mechanisms introduced to offset the downside of this,
> too.

NFS4 implementations are able to handle Posix advisory locks, but unlike
Lustre, they don't support full Posix filesystem semantics. For example, NFS4
still follows the traditional NFS close-to-open cache consistency model whereas
with Lustre, individual write()s are atomic and become immediately visible to
all clients.

Regards,

Daniel.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] InfiniBand QoS with Lustre ko2iblnd.

2009-06-24 Thread Daniel Kobras
Hi Sébastien!

On Wed, Jun 24, 2009 at 09:46:19AM +0200, Sébastien Buisson wrote:
> > - if the 'service id' information was stored on the MGS on a file system 
> > basis, one could imagine to retrieve it at mount time on the clients. 
> > The 'service id' information stored on the MGS could consist in a port 
> > space and a port id. Thus it would be possible to affect different 
> > service ports to the various connections initiated by the client, 
> > depending on the target file system.
> > What do you think? Would you say this is feasible, or can you see major 
> > issues with this proposal?
> > 
> 
> The peer's port information could be stored in the kib_peer_t structure. 
> That way, it would be possible to make clients connect to servers which 
> listen on different ports.
> What do you think?

Why do you want to distinguish the two filesystems solely by service id
rather than, say, service id + port guids of the respective Lustre
servers? You'll need a full QoS policy file instead of the simplified
syntax, and configuration needs to be adapted on hardware changes, but
this still looks simpler to me than modifying the wire protocol.

Regards,

Daniel.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] x4540 (thor) panic

2009-06-15 Thread Daniel Kobras
Hi!

On Thu, Jun 11, 2009 at 10:08:30AM -0400, Brock Palen wrote:
> I attached a screen shot of the dump from console,  its incomplete  
> though.
> Should I just update the server to 1.6.7.2 ?  Just strange that the  
> two x4500 had no issues, but the x4540 does

The X4500 comes with Intel NICs while the X4540 is equipped with NVidia
chips. The Linux driver for the latter reportedly isn't well tuned.
Bumping parameter max_interrupt_work of the forcedeth module to 100
cured stability problems for us. That's on a RHEL5-based system, though.
Not sure how the driver in RHEL5 fares, but it might still be worth a try.
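
For reference, the setting we used (module option name as for the forcedeth
driver of that era):

  # /etc/modprobe.conf (or a file under /etc/modprobe.d/)
  options forcedeth max_interrupt_work=100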

Regards,

Daniel.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] InfiniBand QoS with Lustre ko2iblnd.

2009-05-19 Thread Daniel Kobras
Hi!

On Mon, May 18, 2009 at 01:34:03PM -0700, Jim Garlick wrote:
> Hi, I don't know much about this stuff, but our IB guys did use QoS
> to help us when we found LNET was falling apart when we brought up
> our first 1K node cluster based on quad socket, quad core opterons,
> and ran MPI collective stress tests on all cores.
> 
> Here are some notes they put together - see the "QoS Policy file" section.

Great summary, thanks for sharing! Seems like qos-ulp is a rather recent
OpenSM-specific feature, and the SMs in our switches apparently don't
offer a similar SID-to-SL mapping, either. , but it certainly got me a
leap further.

Thanks,

Daniel.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre attribute caching?

2009-05-18 Thread Daniel Kobras
Hi!

On Mon, Apr 27, 2009 at 01:29:20AM -0600, Andreas Dilger wrote:
> Lustre will cache negative dentries as long as the directory is not
> modified.  If the directory has a new file created or an existing
> one deleted then the negative dentries, along with the rest of the
> readdir data is removed from all client caches, though the individual
> inodes can still live in the client cache as they have their own locks.

Interesting. Thanks for the info! This makes me wonder how Lustre copes
with temporary network outages in this case. If the client drops off the
net, the server cannot invalidate the caches. Does the client notice the
communication problem, or could it briefly serve stale data?

Regards,

Daniel.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] InfiniBand QoS with Lustre ko2iblnd.

2009-05-18 Thread Daniel Kobras
Hi!

Does anyone know how to use QoS with Lustre's o2ib LND? The Voltaire IB
LND allowed to #define a service level, but I couldn't find a similar
facility in o2ib. Is there a different way to apply QoS rules?

Thanks,

Daniel.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] NUMA IO and Lustre

2009-05-12 Thread Daniel Kobras
Hi!

On Tue, May 12, 2009 at 05:49:26PM +0200, Sébastien Buisson wrote:
> Your workaround sounds very interesting, because it shows that the 
> network settings of an OST cannot be fixed by any configuration 
> directive: this is done in an opaque way at mkfs.lustre time.

To clarify: it's the initial mount rather than mkfs. The process, as far
as I understand it, works as follows (and hopefully someone will correct me
if I got it wrong):

* OST is formatted.
* OST is mounted for the first time.
  + OST queries the current list of nids on its OSS.
  + OST sends its list of nids off to the MGS.
* MGS registers new OST (and its nids).
* MGS advertises the new OST (and its nids) to all (present and future) clients.

It's important to note that MGS registration is a one-off process that
cannot be changed or redone later on, unless you wipe the complete
Lustre configuration from all servers using the infamous --writeconf
method and restart all Lustre clients to remove any stale cached info.
(Which of course implies a downtime of the complete filesystem.) Once
the registration is done, you can change a server's LNET configuration,
but the MGS won't care, and the clients will never get notified of it.

Regards,

Daniel.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] NUMA IO and Lustre

2009-05-12 Thread Daniel Kobras
Hi!

On Tue, May 12, 2009 at 04:07:51PM +0200, Sébastien Buisson wrote:
> Concretely, we would like to know if it is possible in Lustre to bind an 
> OST to a specific network interface, so that this OST is only reached 
> through this interface (thus avoiding the NUMA IO factor in our case) ? 
> For instance, we would like to have 4 OSTs attached to ib0 and the 4 
> other OSTs attached to ib1.

As I recently learned the hard way, the network settings of an OST are
fixed when the OST connects to the MGS for the first time. Hence, you
could ifup ib0, ifdown ib1, start LNET and fire up the first set of
OSTs, then down LNET again, ifdown ib0, ifup ib1, restart LNET and start
up the rest of the OSTs. Subsequently, you should be able to run with
both IB nids enabled, but clients should still only know about a single
IB nid for each OSTs. As for a less hackish solution, wouldn't it be
cleaner to just run, say, two Xen domUs on the machine and map the
HBAs/HCAs as appropriate?

Regards,

Daniel.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Lustre attribute caching?

2009-04-24 Thread Daniel Kobras
Hi!

I wonder about Lustre's client-side caching strategy for file attributes
and directory entries, and whether it suffers from the same problems as
NFS in this regard. I seem to be unable to find this question answered
on the web, so I'd be happy for any hints.

Specifically, I'm interested in caching of negative entries. With
/sharedfs shared via NFS, attribute caching might prevent clientA from
seeing the newly created file in the final stat call of this test:

On clientA:

# rm -f /sharedfs/test; ls -l /sharedfs/test; ssh clientB "touch 
/sharedfs/test"; stat /sharedfs/test

How does Lustre behave in this case? Are newly created files immediately
visible on all clients, or will they cache (negative) attributes and
dentries for some time? Does Posix mandate specific behaviour in this
regard?

Thanks for any insight,

Daniel.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Registering OST with failover MGS.

2009-02-04 Thread Daniel Kobras
On Wed, Feb 04, 2009 at 11:07:42AM -0700, Kevin Van Maren wrote:
> Brian J. Murrell wrote:
> > On Wed, 2009-02-04 at 18:25 +0100, Daniel Kobras wrote:
> >> I've tried that--as well as a comma- or colon-separated list of NIDs--,
> >> but neither options appears to work: As soon as I try to mount the OST
> >> partition with the MGS active on the secondary node, the mount fails,
> >> and error messages indicate that only the primary MGS node is queried.
> >> All systems are running Lustre version 1.6.6, by the way.

Correction: tunefs.lustre --mgsnode=mgs1 --mgsnode=mgs2 ...
works as soon as I don't typo the NIDs. tunefs.lustre
--mgsnode=mgs1:mgs2 doesn't.
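
Spelled out with placeholder NIDs and a placeholder device, the working
variant is

tunefs.lustre --mgsnode=192.168.10.1@tcp0 --mgsnode=192.168.10.2@tcp0 /dev/ostdev

i.e. one --mgsnode option per MGS failover node.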

> > Well, it should work as described.  Please try to find the bugzilla bug
> > referencing a problem with this and see if it applies.  If not, please
> > file a new bug.
> 
> Probably referring to bug 15912, which was for "mkfs" issues, where the 
> mgs option had to be
> specified twice to indicate two different servers.

mkfs was fixed in 1.6.6, but there still seems to be a similar issue
with tunefs.lustre. I've just filed #18438 for this now.

Thanks,

Daniel.



Re: [Lustre-discuss] Registering OST with failover MGS.

2009-02-04 Thread Daniel Kobras
Hi!

On Wed, Feb 04, 2009 at 11:50:58AM -0500, Brian J. Murrell wrote:
> On Wed, 2009-02-04 at 17:21 +0100, Daniel Kobras wrote:
> > Lustre clients allow to mount from either the primary MGS or a failover
> > server using a colon-separated syntax like
> > 
> > primary.mgs.example.com:failover.mgs.example.com:/fsname
> > 
> > Is there a similar way to configure an MGS failover pair into an OST?
> > tunefs.lustre --mgsnode seems to accept only a single MGS.
> 
> $ man tunefs.lustre
> ...
> --mgsnode=nid,...
>Set  the  NID(s) of the MGS node, required for all targets other
>than the MGS.
> 
> Notice the plurality of "NID(s)"?  That means you can use the multiple
> NID syntax.

Sure, this tells me how to specify multiple NIDs of a single server, but
I'd like to configure the NIDs of two distinct servers in a failover
pair.

> You can also specify --mgsnode=nid multiple times.

I've tried that--as well as a comma- or colon-separated list of NIDs--but
neither option appears to work: as soon as I try to mount the OST
partition with the MGS active on the secondary node, the mount fails,
and the error messages indicate that only the primary MGS node is queried.
All systems are running Lustre version 1.6.6, by the way.

Regards,

Daniel.


[Lustre-discuss] Registering OST with failover MGS.

2009-02-04 Thread Daniel Kobras
Hi!

Lustre clients allow mounting from either the primary MGS or a failover
server using a colon-separated syntax like

primary.mgs.example.com:failover.mgs.example.com:/fsname

Is there a similar way to configure an MGS failover pair into an OST?
tunefs.lustre --mgsnode seems to accept only a single MGS.
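
(For completeness, the client-side command that works today is something
like

mount -t lustre primary.mgs.example.com:failover.mgs.example.com:/fsname /mnt/fsname

with an arbitrary mount point; what I'm after is the server-side
equivalent when formatting or tuning an OST.)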

Regards,

Daniel.



Re: [Lustre-discuss] Lustre failover of MGS

2008-11-27 Thread Daniel Kobras
Kevin Van Maren <[EMAIL PROTECTED]> writes:
> mkfs.lustre --reformat [EMAIL PROTECTED] --mgs /dev/sdb

Is --failnode evaluated for the MGS? We seem to do fine without it, as any client
requires explicit configuration of the MGS failnode anyway. Or is it possible
to override this client-side configuration with the value set on the MGS?

> mkfs.lustre --reformat --fsname bananafs [EMAIL PROTECTED]
[EMAIL PROTECTED] [EMAIL PROTECTED] --mdt /dev/sdc
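
(Spelled out with made-up NIDs, and guessing --failnode/--mgsnode for the
options that got eaten above, that pair of commands would look roughly
like

mkfs.lustre --reformat --failnode=10.0.0.2@tcp0 --mgs /dev/sdb
mkfs.lustre --reformat --fsname bananafs --failnode=10.0.0.4@tcp0 \
    --mgsnode=10.0.0.1@tcp0 --mgsnode=10.0.0.2@tcp0 --mdt /dev/sdc

purely for illustration.)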

Regards,

Daniel.





[Lustre-discuss] Early replies from pre-1.6.5 servers?

2008-07-28 Thread Daniel Kobras
Hi!

While debugging connection problems on a Lustre client running 1.6.5.1
on RHEL5, I discovered early replies in the client debug output.
Adaptive timeouts are disabled on the client, and our server
infrastructure is still running stock 1.6.4.3 (RHEL5). As far as I
understood, early replies should only occur post-1.6.5 when adaptive
timeouts are active. Did I get this wrong, and are early replies
perfectly valid even in our setup? Or is the 1.6.5.1 client
misinterpreting the message headers?

I'm attaching a debug trace for a single XID on the client and the
MDS/MGS.

Thanks,

Daniel.

==> client ([EMAIL PROTECTED], Lustre 1.6.5.1) <==

[EMAIL PROTECTED] # lctl debug_file /tmp/lustre-debug.client.log | grep 266023
0100:0010:2:1217244344.471204:0:9991:0:(client.c:1784:ptlrpc_queue_wait())
 Sending RPC pname:cluuid:pid:xid:nid:opc 
ll_cfg_requeue:38cc9155-5e64-1d01-bbf2-8b621120e1b0:9991:x266023:[EMAIL 
PROTECTED]:101
0100:0200:2:1217244344.471250:0:9991:0:(niobuf.c:540:ptl_send_rpc()) 
Setup reply buffer: 368 bytes, xid 266023, portal 25
0100:0040:2:1217244344.471269:0:9991:0:(niobuf.c:561:ptl_send_rpc()) 
@@@ send flg=0  [EMAIL PROTECTED] x266023/t0 o101->[EMAIL 
PROTECTED]@o2ib_0:26/25 lens 232/368 e 0 to 100 dl 121724 ref 2 fl Rpc:/0/0 
rc 0/0
0100:0200:2:1217244344.471299:0:9991:0:(niobuf.c:70:ptl_send_buf()) 
Sending 232 bytes to portal 26, xid 266023, offset 0
0100:0200:2:1217244344.471379:0:9991:0:(client.c:1871:ptlrpc_queue_wait())
 @@@ -- sleeping for 10 ticks  [EMAIL PROTECTED] x266023/t0 o101->[EMAIL 
PROTECTED]@o2ib_0:26/25 lens 232/368 e 0 to 100 dl 121724 ref 2 fl Rpc:/0/0 
rc 0/0
0100:0200:2:1217244344.471399:0:9991:0:(client.c:771:ptlrpc_check_reply())
 @@@ rc = 0 for  [EMAIL PROTECTED] x266023/t0 o101->[EMAIL 
PROTECTED]@o2ib_0:26/25 lens 232/368 e 0 to 100 dl 121724 ref 2 fl Rpc:/0/0 
rc 0/0
0100:0200:2:1217244344.471416:0:9991:0:(client.c:771:ptlrpc_check_reply())
 @@@ rc = 0 for  [EMAIL PROTECTED] x266023/t0 o101->[EMAIL 
PROTECTED]@o2ib_0:26/25 lens 232/368 e 0 to 100 dl 121724 ref 2 fl Rpc:/0/0 
rc 0/0
0100:0200:1:1217244344.471440:0:3049:0:(events.c:55:request_out_callback())
 @@@ type 4, status 0  [EMAIL PROTECTED] x266023/t0 o101->[EMAIL 
PROTECTED]@o2ib_0:26/25 lens 232/368 e 0 to 100 dl 121724 ref 2 fl Rpc:/0/0 
rc 0/0
0100:0040:1:1217244344.471458:0:3049:0:(client.c:1526:__ptlrpc_req_finished())
 @@@ refcount now 1  [EMAIL PROTECTED] x266023/t0 o101->[EMAIL 
PROTECTED]@o2ib_0:26/25 lens 232/368 e 0 to 100 dl 121724 ref 2 fl Rpc:/0/0 
rc 0/0
0100:0200:1:1217244344.471565:0:3049:0:(events.c:84:reply_in_callback())
 @@@ type 1, status 0  [EMAIL PROTECTED] x266023/t0 o101->[EMAIL 
PROTECTED]@o2ib_0:26/25 lens 232/368 e 0 to 100 dl 121724 ref 1 fl Rpc:/0/0 
rc 0/0
0100:1000:1:1217244344.471578:0:3049:0:(events.c:112:reply_in_callback())
 @@@ Early reply received: mlen=240 offset=0 replen=368 replied=0 unlinked=0  
[EMAIL PROTECTED] x266023/t0 o101->[EMAIL PROTECTED]@o2ib_0:26/25 lens 232/368 
e 0 to 100 dl 121724 ref 1 fl Rpc:/0/0 rc 0/0
0100:0200:2:1217244344.471652:0:9991:0:(client.c:771:ptlrpc_check_reply())
 @@@ rc = 0 for  [EMAIL PROTECTED] x266023/t0 o101->[EMAIL 
PROTECTED]@o2ib_0:26/25 lens 232/368 e 1 to 100 dl 121724 ref 1 fl Rpc:/0/0 
rc 0/0
0100:0200:2:1217244344.471668:0:9991:0:(client.c:771:ptlrpc_check_reply())
 @@@ rc = 0 for  [EMAIL PROTECTED] x266023/t0 o101->[EMAIL 
PROTECTED]@o2ib_0:26/25 lens 232/368 e 1 to 100 dl 121724 ref 1 fl Rpc:/0/0 
rc 0/0
0100:0100:2:121724.471354:0:9991:0:(client.c:1198:ptlrpc_expire_one_request())
 @@@ timeout (sent at 1217244344, 100s ago)  [EMAIL PROTECTED] x266023/t0 
o101->[EMAIL PROTECTED]@o2ib_0:26/25 lens 232/368 e 1 to 100 dl 121724 ref 
1 fl Rpc:/0/0 rc 0/0
0100:02000400:2:121724.471376:0:9991:0:(client.c:1206:ptlrpc_expire_one_request())
 Request x266023 sent from [EMAIL PROTECTED] to NID [EMAIL PROTECTED] 100s ago 
has timed out (limit 100s).
0100:0200:2:121724.471845:0:9991:0:(events.c:84:reply_in_callback())
 @@@ type 5, status 0  [EMAIL PROTECTED] x266023/t0 o101->[EMAIL 
PROTECTED]@o2ib_0:26/25 lens 232/368 e 1 to 100 dl 121724 ref 1 fl 
Rpc:X/0/0 rc 0/0
0100:0010:2:121724.471859:0:9991:0:(events.c:102:reply_in_callback())
 @@@ unlink  [EMAIL PROTECTED] x266023/t0 o101->[EMAIL PROTECTED]@o2ib_0:26/25 
lens 232/368 e 1 to 100 dl 121724 ref 1 fl Rpc:X/0/0 rc 0/0
0100:0010:2:121724.471970:0:9991:0:(client.c:2091:ptlrpc_abort_inflight())
 @@@ inflight  [EMAIL PROTECTED] x266023/t0 o101->[EMAIL 
PROTECTED]@o2ib_0:26/25 lens 232/368 e 1 to 100 dl 121724 ref 1 fl 
Rpc:X/0/0 rc 0/0
0100:0200:2:121724.472014:0:9991:0:(client.c:771:ptlrpc_check_reply())
 @@@ rc = 1 for  [EMAIL PROTECTED] x266023/t0 o101->[EMAIL 
PROTECTED]@o2ib_0:26/25 lens 232/368 e 1 to 100 dl 121724 ref 1 fl 
Rpc:E