Re: [lustre-discuss] Data migration software

2023-03-22 Thread Stephane Thiell via lustre-discuss
Hi Anna,

We’re about to deploy Lustre/HSM with Phobos for a new large research data 
archival system at Stanford (200PB).

https://github.com/phobos-storage

Phobos is open source and a Lustre copytool is available. Archiving policies 
can be set up via Robinhood like with other HSMs. Robinhood is also open source 
and supports project IDs if you take the patches from GerritHub (like this one: 
https://review.gerrithub.io/c/cea-hpc/robinhood/+/541104 but more are needed, I 
can give you the list if needed). Data restore concurrency should be well 
handled with Lustre/HSM.
A Lustre userspace coordinator named "coordinatool" is required for using 
Phobos in multi-server mode, but it is also freely available on GitHub. We plan 
to have a dedicated Lustre client for the coordinatool.

Hope that helps.

Stéphane


> On Mar 22, 2023, at 7:47 AM, Anna Fuchs via lustre-discuss wrote:
> 
> Dear all,
> 
> if you have a large Lustre storage and a large tape archive and maybe even 
> additionally some in-house cloud storage, which software do you use for more 
> or less automatic data migration, that has good scaling?
> Ideally it somehow supports Lustre project quota and, more importantly, a 
> synchronized catalog to find the data.
> E.g. if the data to be read is on tape, it somehow transparently moves it to 
> the main faster storage (like Lustre) without the user explicitly knowing (at 
> least not required to) where the data has been initially stored.
> If another user wants to access the same shared file, the software would know 
> it is already "buffered" on Lustre and wouldn't read it again from tape.
> Or if the data has not been touched (really processed, not just touch) for a 
> certain period of time, or the user runs out of Lustre quota, but has free 
> archive space, it would be automatically archived on tape and so on.
> Ideally, the software should not cost a billion for a year license or even be 
> open source :)
> 
> Thank you
> Anna Fuchs
> --
> Universität Hamburg
> https://wr.informatik.uni-hamburg.de/people/anna_fuchs
> 
> 
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org



[lustre-discuss] Data migration software

2023-03-22 Thread Anna Fuchs via lustre-discuss

Dear all,

if you have a large Lustre storage and a large tape archive and maybe 
even additionally some in-house cloud storage, which software do you use 
for more or less automatic data migration, that has good scaling?
Ideally it somehow supports Lustre project quota and, more importantly, a 
synchronized catalog to find the data.
E.g. if the data to be read is on tape, it somehow transparently moves 
it to the main faster storage (like Lustre) without the user explicitly 
knowing (at least not required to) where the data has been initially stored.
If another user wants to access the same shared file, the software would 
know it is already "buffered" on Lustre and wouldn't read it again from 
tape.
Or if the data has not been touched (really processed, not just touch) 
for a certain period of time, or the user runs out of Lustre quota, but 
has free archive space, it would be automatically archived on tape and 
so on.
Ideally, the software should not cost a billion for a year license or 
even be open source :)


Thank you
Anna Fuchs
--
Universität Hamburg
https://wr.informatik.uni-hamburg.de/people/anna_fuchs





Re: [lustre-discuss] Lustre project quotas and project IDs

2023-03-22 Thread Andreas Dilger via lustre-discuss
Of course my preference would be a contribution to improving the name-projid 
mapping in the "lfs project" command under LU-13335 so this would also help 
other Lustre users manage their project IDs.

One proposal I made in LU-13335, and would welcome feedback on, is that if a 
name or projid does not exist in /etc/projid, the lfs tool would fall back to 
doing a name/uid lookup in /etc/passwd (or other database as configured in 
/etc/nsswitch.conf).

This would avoid the need to duplicate the full UID database in /etc/projid for 
the common case of projid = uid, and allows using LDAP, NIS, AD, sssd, etc. for 
projid lookup without them having explicit support for a projid database.

This behavior could optionally be configured with a "commented-out" directive 
at the start of /etc/projid, like:

 #lfs fallback: passwd

or "group" or "none".  If all the projects are defined in the passwd database, 
then potentially just this one line is needed in /etc/projid, or not at all if 
"passwd" is the default fallback.
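
To make the proposed lookup order concrete, here is a minimal sketch in Python. It is only an illustration of the idea discussed above, not how lfs behaves today: the function names are hypothetical, the "name:id" line format for /etc/projid is assumed, and the default of "none" when no directive is present is an assumption, since the default is still open in the proposal.

```python
import grp
import pwd


def parse_projid(text):
    """Parse /etc/projid-style text (assumed "name:id" lines).

    Returns (mapping, fallback). The "#lfs fallback:" directive is the
    proposal from this thread; "none" as the default is an assumption.
    """
    mapping = {}
    fallback = "none"
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#lfs fallback:"):
            fallback = line.split(":", 1)[1].strip()
            continue
        if not line or line.startswith("#"):
            continue  # blank line or ordinary comment
        name, _, projid = line.partition(":")
        mapping[name.strip()] = int(projid)
    return mapping, fallback


def name_to_projid(name, mapping, fallback):
    """Resolve a project name: /etc/projid first, then the fallback database."""
    if name in mapping:
        return mapping[name]
    try:
        if fallback == "passwd":
            return pwd.getpwnam(name).pw_uid  # honours nsswitch.conf via libc
        if fallback == "group":
            return grp.getgrnam(name).gr_gid
    except KeyError:
        pass  # name unknown in the fallback database as well
    return None
```

With a file containing only the line "#lfs fallback: passwd", every lookup would go straight to the passwd database, which is the "just this one line" case described above.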

Would this meet your need for using an external database, while still allowing 
your development efforts to produce a solution that helps the Lustre community?

Of course at some point it would be desirable to have a dedicated projid 
database supported by glibc, but that would take much more time and effort 
to implement and deploy, while the passwd/group fallback can be handled 
internally by the lfs command.

Cheers, Andreas

On Mar 17, 2023, at 04:10, Passerini Marco wrote:



Hi Andreas,


I'm talking on the order of tens of thousands of project IDs.

I've been thinking the same as you, that is, doing PROJID=1M + UID etc. 
However, in our case, it might be better to rely on some scripting and an 
external DB, to keep track of the latest added ID, so that we could increment 
the highest value by 1 on new ID creation. The highest value could as well be 
looked up in:


/proc/fs/lustre/osd-ldiskfs/myfs-MDT/quota_slave_dt/acct_project
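
As a rough sketch of the "increment the highest value by 1" idea, the Python below scans accounting text for the largest project ID. It assumes each entry starts with a line of the form "- id: <number>"; that layout, the sample text in the usage note, and the function names are assumptions to check against the real file. An ID that has been handed out but has no usage yet may not be listed there, which is one reason an external DB could still be the safer record.

```python
import re

# Assumed entry layout: "- id:      <number>" at the start of a line.
ID_LINE = re.compile(r"^-\s*id:\s*(\d+)\s*$", re.MULTILINE)


def highest_projid(acct_text):
    """Return the largest project ID found in the text, or None if none."""
    ids = [int(m) for m in ID_LINE.findall(acct_text)]
    return max(ids) if ids else None


def next_projid(acct_text, floor=1):
    """Return highest ID + 1, but never less than `floor` (hypothetical knob)."""
    highest = highest_projid(acct_text)
    if highest is None:
        return floor
    return max(highest + 1, floor)
```

Usage would be along the lines of reading the file into a string and calling next_projid(text, floor=2_000_000) to keep new independent projects above a chosen base.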

Regards,

Marco Passerini


From: Andreas Dilger 
Sent: Thursday, March 16, 2023 11:35:16 PM
To: Passerini Marco
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Lustre project quotas and project IDs

On Mar 16, 2023, at 04:50, Passerini Marco <marco.passer...@cscs.ch> wrote:

By trial and error, I found that, when using project quotas, the maximum ID 
available is 4294967294. Is this correct?

Yes, the "-1" ID is reserved for error conditions.

If I assign quota to a lot of project IDs, is the performance expected to go 
down more than having just a few or is it fixed?

Probably if you have millions or billions of different IDs there would be some 
performance loss, at a minimum just because the quota files will consume a lot 
of disk space and memory to manage.  I don't think we've done specific scaling 
testing for the number of project IDs, but it has worked well for the 
"expected" number of different IDs at production sites (in the 10,000s).

I've recommended to a few sites that want to have a "unified" quota to use e.g. 
PROJID=UID for user directories, PROJID=1M + UID for scratch, and PROJID=2M+N 
for independent projects, just to make the PROJIDs easily identified (at least 
until someone implements LU-13335 to do projid<->name mapping).
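
The numbering scheme above can be sketched in a few lines of Python. This is only an illustration: it assumes "1M" means 1,000,000 (it could equally be read as 2^20), it assumes UIDs stay below that base so the ranges do not overlap, and all function names are hypothetical. The upper bound is the 4294967294 maximum confirmed earlier in this thread.

```python
MILLION = 1_000_000        # assumption: "1M" read as decimal one million
MAX_PROJID = 4294967294    # maximum usable ID; the "-1" ID is reserved


def _check(projid):
    """Reject IDs outside the usable project ID range."""
    if not 0 <= projid <= MAX_PROJID:
        raise ValueError(f"project ID {projid} out of range")
    return projid


def _check_uid(uid):
    """The scheme only stays unambiguous while UIDs are below the base."""
    if not 0 <= uid < MILLION:
        raise ValueError(f"UID {uid} does not fit below {MILLION}")
    return uid


def home_projid(uid):
    return _check(_check_uid(uid))              # PROJID = UID


def scratch_projid(uid):
    return _check(MILLION + _check_uid(uid))    # PROJID = 1M + UID


def project_projid(n):
    if n < 0:
        raise ValueError("project number must be non-negative")
    return _check(2 * MILLION + n)              # PROJID = 2M + N


def classify(projid):
    """Map a project ID back to (kind, number) under the same assumptions."""
    _check(projid)
    if projid < MILLION:
        return ("home", projid)
    if projid < 2 * MILLION:
        return ("scratch", projid - MILLION)
    return ("project", projid - 2 * MILLION)
```

Under these assumptions, scratch_projid(1234) gives 1001234 and classify(1001234) gives ("scratch", 1234), so an ID can be identified by eye until a real name mapping exists.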

How many IDs were you thinking of using?

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud



[lustre-discuss] Improving the performance of Lustre small files, Maybe aggregation small files

2023-03-22 Thread 王烁斌 via lustre-discuss
Hi~


Regarding improving the performance of Lustre small files, are there any 
aggregation schemes for small files? If so, what schemes can be used?


---Shuobin