Re: [lustre-discuss] Data stored in OST [EXT]

2023-06-18 Thread Peter Grandi via lustre-discuss
>> Our last DDN system has OSTs using 14TB disks.

> That's quite popular. If single-digit MB/s transfer rates
> per HDD for HPC clusters are the goal, that's ideal :-). [...]

I was just discussing this with someone, and they pointed out
that it can be a reasonable goal, and indeed it can be: not for
HPC clusters, but for archival (something similar to AWS
Glacier) or even cold storage, where low cost per TB matters
more than high latency or very low bandwidth.

https://blog.dshr.org/2015/03/googles-near-line-storage-offering.html
https://blog.dshr.org/2014/09/more-on-facebooks-cold-storage.html

But given calculations and experience I would still not use
drives larger than 8TB even for that, because the IOPS-per-TB
of larger drives is so low that maintenance operations are hard
to complete within "reasonable" timeframes.

Overall HDDs with > 8TB capacity are probably best regarded as
"tapes" with the ability to do some random positioning.
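To put rough numbers on that, here is a back-of-the-envelope
sketch; the per-HDD IOPS and sequential-rate figures are
round-number assumptions, not measurements:

    #!/usr/bin/env python3
    # Rough model: random IOPS per HDD stay roughly constant as
    # capacity grows, so IOPS-per-TB falls, while even a best-case
    # full-surface scrub takes ever longer. Figures are assumed.
    HDD_IOPS = 150     # nominal 7200rpm random IOPS (assumption)
    SEQ_MBPS = 200     # nominal sequential rate in MB/s (assumption)

    for tb in (2, 4, 8, 14, 18):
        iops_per_tb = HDD_IOPS / tb
        scrub_hours = tb * 1e6 / SEQ_MBPS / 3600  # full read, no load
        print(f"{tb:2d}TB: {iops_per_tb:5.1f} IOPS/TB, "
              f"best-case scrub {scrub_hours:4.1f}h")

At 8TB and above the IOPS-per-TB drops below ~20, and the scrub
figure assumes zero user load; under real load it stretches into
days.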


Re: [lustre-discuss] Data stored in OST [EXT]

2023-06-15 Thread Peter Grandi via lustre-discuss
>>> What would be the problem with large 'datacenter' type HDDs
>>> for an OST (in RAID10 for instance)?

> Very, very low IOPS-per-TB, leading to terrifyingly low speed
> under combined user and maintenance load. [...]

> Our last DDN system has OSTs using 14TB disks.

That's quite popular. If single-digit MB/s transfer rates per
HDD for HPC clusters are the goal, that's ideal :-). Plus those
OSTs from DDN probably use (their slightly better version of)
RAID6, which "complicates" matters.

My guess as to why systems with very low IOPS-per-TB are popular
is that what matters most is the IOPS-per-TB *actually used*.
During the initial usage period the HDDs hold less than 1-2TB
each, mostly in the outer cylinders (a kind of spontaneous
"short stroking") and mostly unfragmented, and maintenance
operations like checking, scrubbing, migration, backup are
endlessly procrastinated, so the storage layer seems to perform
well and to be very cheap, making the purchaser look like a
genius.

Then the HDDs fill up, data reaches the inner cylinders, the
shrinking free space becomes heavily fragmented, latency goes
way up (I have seen Lustre systems with IO latencies of some
*seconds*), user-visible transfer rates go way down (sometimes
below 1MB/s per HDD), and high-IOPS maintenance operations can
no longer be put off; that's usually when I get hired. :-(.
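
A crude model of that collapse; the seek, rotation and
inner-track figures, and the extent sizes, are all illustrative
assumptions:

    #!/usr/bin/env python3
    # Crude model: on a fragmented HDD each extent read costs a
    # seek plus rotational latency plus transfer time, so the
    # effective rate collapses as extents shrink. Figures assumed.
    SEEK_MS  = 8.0    # average seek (assumption)
    ROT_MS   = 4.2    # half a rotation at 7200rpm
    SEQ_MBPS = 120    # inner-cylinder sequential rate (assumption)

    for extent_kb in (4096, 512, 64, 8):
        xfer_ms = extent_kb / 1024 / SEQ_MBPS * 1000
        per_io_ms = SEEK_MS + ROT_MS + xfer_ms
        mbps = (extent_kb / 1024) / (per_io_ms / 1000)
        print(f"{extent_kb:5d}KB extents: ~{mbps:5.1f} MB/s per HDD")

With 8KB extents that comes out well below 1MB/s per HDD,
consistent with the figure above.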


Re: [lustre-discuss] Data stored in OST

2023-06-11 Thread Peter Grandi via lustre-discuss
>>> The usual practice is to use RAID10 for the MDT(s) on
>>> "enterprise" high-endurance SSD, and RAID6 for the OST on
>>> "professional" mixed-load SSDs or "small" (1-2TB at most)
>>> "datacenter" HDDs, fronted by failover-servers.

> What would be the problem with large 'datacenter' type HDDs
> for an OST (in RAID10 for instance)?

Very, very low IOPS-per-TB, leading to terrifyingly low speed
under combined user and maintenance load. Consider for example
18TB HDDs capable of multistream transfer rates of around
2-3MB/s each. I have seen setups where there were not enough
IOPS for the maintenance load (scrubbing and resilvering,
checking, migrating, etc.), never mind for the user load,
especially with a non-trivial percentage of "small" (less than
several MB) files.
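
For a sense of scale, a minimal sketch of rebuild times for such
a drive; the effective resilver rates are assumptions (real
rates under combined load are often at the low end):

    #!/usr/bin/env python3
    # Minimal sketch: lower bound on resilver/rebuild time for one
    # failed drive, capacity divided by effective rate. Rates assumed.
    CAPACITY_TB = 18

    for rate_mbps in (150, 50, 10):  # idle / loaded / thrashing (assumed)
        days = CAPACITY_TB * 1e6 / rate_mbps / 86400
        print(f"at {rate_mbps:3d}MB/s: ~{days:4.1f} days "
              f"to rebuild {CAPACITY_TB}TB")

Multi-day rebuild windows mean long periods of degraded
redundancy on top of the degraded performance.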

https://www.sabi.co.uk/blog/13-two.html?131227#131227
"The issue with disk drives with multi-TB capacities"

That applies to every filesystem, but even more so to Lustre
which is mostly targeted at highly parallel HPC user loads.


Re: [lustre-discuss] Data stored in OST

2023-05-23 Thread BALVERS Martin via lustre-discuss
I have a question about the following comment:
>>> The usual practice is to use RAID10 for the MDT(s) on "enterprise" 
>>> high-endurance SSD, and RAID6 for the OST on "professional" mixed-load SSDs 
>>> or "small" (1-2TB at most) "datacenter" HDDs, fronted by failover-servers.

What would be the problem with large 'datacenter' type HDDs for an OST (in 
RAID10 for instance)?

Thanks,
Martin Balvers

-Original Message-
From: lustre-discuss  On Behalf Of 
Peter Grandi via lustre-discuss
Sent: Monday, May 22, 2023 15:35
To: list Lustre discussion 
Subject: Re: [lustre-discuss] Data stored in OST


>>> On Mon, 22 May 2023 13:08:19 +0530, Nick dan via lustre-discuss 
>>>  said:

> Hi I had one doubt. In lustre, data is divided into stripes and stored 
> in multiple OSTs. So each OST will have some part of data. My question 
> is if one OST fails, will there be data loss?

This is extensively discussed in the Lustre manual with comprehensive 
illustrations:

  https://doc.lustre.org/lustre_manual.xhtml#understandinglustre.storageio
  https://doc.lustre.org/lustre_manual.xhtml#pfl
  https://doc.lustre.org/lustre_manual.xhtml#understandingfailover

The usual practice is to use RAID10 for the MDT(s) on "enterprise" 
high-endurance SSD, and RAID6 for the OST on "professional" mixed-load SSDs or 
"small" (1-2TB at most) "datacenter" HDDs, fronted by failover-servers.

I personally think that it is best to rely on Lustre striping and the "new" 
PFL/FLR layout (across two OST "pools"), and have each OST on a single device, 
and very few OSTs per OSS, when Lustre is used as a "scratch" area for an HPC 
cluster.

  https://doc.lustre.org/lustre_manual.xhtml#flr



Re: [lustre-discuss] Data stored in OST

2023-05-22 Thread Peter Grandi via lustre-discuss
>>> On Mon, 22 May 2023 13:08:19 +0530, Nick dan via lustre-discuss 
>>>  said:

> Hi I had one doubt. In lustre, data is divided into stripes
> and stored in multiple OSTs. So each OST will have some part
> of data. My question is if one OST fails, will there be data
> loss?

This is extensively discussed in the Lustre manual with
comprehensive illustrations:

  https://doc.lustre.org/lustre_manual.xhtml#understandinglustre.storageio
  https://doc.lustre.org/lustre_manual.xhtml#pfl
  https://doc.lustre.org/lustre_manual.xhtml#understandingfailover

The usual practice is to use RAID10 for the MDT(s) on
"enterprise" high-endurance SSD, and RAID6 for the OST on
"professional" mixed-load SSDs or "small" (1-2TB at most)
"datacenter" HDDs, fronted by failover-servers.

I personally think that it is best to rely on Lustre striping
and the "new" PFL/FLR layout (across two OST "pools"), and have
each OST on a single device, and very few OSTs per OSS, when
Lustre is used as a "scratch" area for an HPC cluster.

  https://doc.lustre.org/lustre_manual.xhtml#flr
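
As a purely hypothetical sketch of the "usual practice" two
paragraphs above (device names, array geometry, fsname and MGS
node are all invented; the mdadm and mkfs.lustre options shown
are the standard ones, but check your versions):

    #!/usr/bin/env python3
    # Hypothetical sketch: RAID10 under the MDT, RAID6 under an OST,
    # then Lustre formatting. Device names, geometry, fsname and the
    # MGS NID are invented for illustration; commands are only printed.
    cmds = [
        # 4-way RAID10 on enterprise SSDs for the MDT (devices assumed)
        "mdadm --create /dev/md/mdt0 --level=10 --raid-devices=4 "
        "/dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1",
        # 10-drive RAID6 on small datacenter HDDs for one OST
        "mdadm --create /dev/md/ost0 --level=6 --raid-devices=10 "
        "/dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf "
        "/dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk",
        # format for Lustre, MGS colocated with the MDT here
        "mkfs.lustre --fsname=scratch --mgs --mdt --index=0 /dev/md/mdt0",
        "mkfs.lustre --fsname=scratch --ost --index=0 "
        "--mgsnode=mds01@tcp /dev/md/ost0",
    ]
    for c in cmds:
        print(c)  # review before running; nothing is executed here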
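And a minimal sketch of the PFL/FLR approach from the preceding
paragraph; the directory, file name, pool names, extent boundary
and stripe counts are hypothetical (see lfs-setstripe(1),
lfs-mirror-create(1) and lfs-mirror-resync(1)):

    #!/usr/bin/env python3
    # Hypothetical sketch: a PFL default layout on a directory, plus
    # an FLR mirror of one file across two OST pools. All names and
    # sizes are invented; commands are only printed for review.
    cmds = [
        # PFL: first 64MB on one stripe, the rest striped over 4 OSTs
        "lfs setstripe -E 64M -c 1 -E -1 -c 4 /mnt/scratch/projects",
        # FLR: mirror one file across two OST pools (names assumed)
        "lfs mirror create -N -p fast -N -p slow "
        "/mnt/scratch/projects/big.dat",
        # after later writes one mirror goes stale; resync is manual
        "lfs mirror resync /mnt/scratch/projects/big.dat",
    ]
    for c in cmds:
        print(c)  # review before running; nothing is executed here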


Re: [lustre-discuss] Data stored in OST

2023-05-22 Thread Nick dan via lustre-discuss
Hi

Thank you for your reply

> Yes, the OSTs must provide internal redundancy - RAID-6 typically.
Can RAID-6 be replaced with mirror/RAID0?

Which type of RAID is recommended for MDT and OST?

Also, can you briefly describe how data is read/written when ZFS is used as 
the backend filesystem in Lustre?

Thanks and regards
Nick



On Mon, 22 May 2023 at 13:36, Andreas Dilger  wrote:

> Yes, the OSTs must provide internal redundancy - RAID-6 typically.
>
> There is File Level Redundancy (FLR = mirroring) possible in Lustre file
> layouts, but it is "unmanaged", so users or other system-level tools are
> required to resync FLR files if they are written after mirroring.
>
> Cheers, Andreas
>
> > On May 22, 2023, at 09:39, Nick dan via lustre-discuss <
> lustre-discuss@lists.lustre.org> wrote:
> >
> > 
> > Hi
> >
> > I had one doubt.
> > In lustre, data is divided into stripes and stored in multiple OSTs. So
> each OST will have some part of data.
> > My question is if one OST fails, will there be data loss?
> >
> > Please advise for the same.
> >
> > Thanks and regards
> > Nick


Re: [lustre-discuss] Data stored in OST

2023-05-22 Thread Andreas Dilger via lustre-discuss
Yes, the OSTs must provide internal redundancy - RAID-6 typically. 

There is File Level Redundancy (FLR = mirroring) possible in Lustre file 
layouts, but it is "unmanaged", so users or other system-level tools are 
required to resync FLR files if they are written after mirroring.
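
Since that resync is left to users or tooling, here is a hedged
sketch of what a periodic resync sweep might look like; the
directory path is hypothetical, and error handling is
deliberately minimal:

    #!/usr/bin/env python3
    # Hypothetical resync sweep for FLR files: Lustre does not resync
    # stale mirrors by itself after writes, so something must walk the
    # mirrored files and run "lfs mirror resync". Path is invented.
    import subprocess, sys

    MIRRORED_DIR = "/mnt/scratch/mirrored"  # assumed FLR file location

    # list regular files under the mirrored tree
    find = subprocess.run(
        ["lfs", "find", MIRRORED_DIR, "--type", "f"],
        capture_output=True, text=True, check=True)

    for path in find.stdout.splitlines():
        # resync brings stale mirror components back up to date
        r = subprocess.run(["lfs", "mirror", "resync", path])
        if r.returncode != 0:
            print(f"resync failed: {path}", file=sys.stderr)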

Cheers, Andreas

> On May 22, 2023, at 09:39, Nick dan via lustre-discuss 
>  wrote:
> 
> 
> Hi
> 
> I had one doubt.
> In lustre, data is divided into stripes and stored in multiple OSTs. So each 
> OST will have some part of data. 
> My question is if one OST fails, will there be data loss?
> 
> Please advise for the same.
> 
> Thanks and regards
> Nick


[lustre-discuss] Data stored in OST

2023-05-22 Thread Nick dan via lustre-discuss
Hi

I had one doubt.
In Lustre, data is divided into stripes and stored on multiple OSTs, so
each OST will have some part of the data.
My question is: if one OST fails, will there be data loss?

Please advise for the same.

Thanks and regards
Nick