Re: [lustre-discuss] Data stored in OST [EXT]
>> Our last DDN system has OST's using 14TB disks.

> That's quite popular. If single-digit transfer rates per-HDD
> for HPC clusters are the goal, that's ideal :-). [...]

I was just discussing this with someone and they pointed out that this can be a reasonable goal, and indeed it can be: not for HPC clusters, but for archival (something similar to AWS Glacier) or even cold storage, where low cost per TB matters more than high latency or very low bandwidth.

https://blog.dshr.org/2015/03/googles-near-line-storage-offering.html
https://blog.dshr.org/2014/09/more-on-facebooks-cold-storage.html

But given calculations and experience I would still not use drives larger than 8TB for that, because the IOPS-per-TB of larger drives is so low that I think maintenance operations are hard to complete within "reasonable" timeframes. Overall, HDDs with more than 8TB capacity are probably best regarded as "tapes" with the ability to do some random positioning.

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
Re: [lustre-discuss] Data stored in OST [EXT]
>>> What would be the problem with large 'datacenter' type HDD's
>>> for an OST (in raid10 for instance)?

> Very, very low IOPS-per-TB, leading to terrifyingly low speed
> under combined user and maintenance load. [...]

> Our last DDN system has OST's using 14TB disks.

That's quite popular. If single-digit transfer rates per-HDD for HPC clusters are the goal, that's ideal :-). Also, those DDN OSTs probably use (their slightly better version of) RAID6, which "complicates" matters.

My guess as to why systems with very low IOPS-per-TB are popular is that what matters most is IOPS-per-TB *actually used*. During the initial usage period the HDDs hold less than 1-2TB each, mostly in the outer cylinders (a kind of spontaneous "short stroking") and mostly unfragmented, and maintenance operations like checking, scrubbing, migration and backup are endlessly procrastinated. So the storage layer seems to perform well and to be very cheap, making the purchaser look like a genius.

Then the HDDs fill up, data reaches the inner cylinders, the shrinking free space becomes heavily fragmented, latency goes way up (I have seen Lustre systems with IO latencies of some *seconds*), user-visible transfer rates go way down (sometimes below 1MB/s per HDD), and high-IOPS maintenance operations can no longer be put off. That's usually when I get hired. :-(
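To make the "IOPS-per-TB *actually used*" point concrete, here is a minimal Python sketch. The figure of 100 random IOPS per drive is an assumed ballpark for illustration, not a number from this thread, and the function name is hypothetical. The model also treats per-drive IOPS as constant, which is optimistic: as described above, real drives get slower as data reaches the inner cylinders and free space fragments.

```python
# Assumption: a single HDD delivers roughly the same number of random IOPS
# regardless of its capacity. 100 is a hypothetical round figure.
ASSUMED_DRIVE_IOPS = 100.0


def iops_per_tb_used(drive_iops: float, used_tb: float) -> float:
    """IOPS available per TB of data actually stored on one drive."""
    if used_tb <= 0:
        raise ValueError("used_tb must be positive")
    return drive_iops / used_tb


if __name__ == "__main__":
    # A 14TB drive at different fill levels: same drive, same IOPS budget,
    # but spread over more and more stored data.
    for used_tb in (1, 2, 7, 14):
        ratio = iops_per_tb_used(ASSUMED_DRIVE_IOPS, used_tb)
        print(f"{used_tb:2d} TB used: {ratio:6.1f} IOPS per TB used")
```

Under these assumptions the same drive looks an order of magnitude "slower" per stored TB once it is full than when it holds 1TB, even before fragmentation is taken into account.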
Re: [lustre-discuss] Data stored in OST
>>> The usual practice is to use RAID10 for the MDT(s) on
>>> "enterprise" high-endurance SSD, and RAID6 for the OST on
>>> "professional" mixed-load SSDs or "small" (1-2TB at most)
>>> "datacenter" HDDs, fronted by failover-servers.

> What would be the problem with large 'datacenter' type HDD's
> for an OST (in raid10 for instance)?

Very, very low IOPS-per-TB, leading to terrifyingly low speed under combined user and maintenance load.

Consider for example 18TB HDDs capable of multistream transfer rates of around 2-3MB/s each. I have seen setups like that where there were not enough IOPS for the maintenance load (scrubbing and resilvering, checking, migrating, etc.), never mind for the user load, especially if there is a non-trivial percentage of "small" (less than several MB) files.

https://www.sabi.co.uk/blog/13-two.html?131227#131227
"The issue with disk drives with multi-TB capacities"

That applies to every filesystem, but even more so to Lustre, which is mostly targeted at highly parallel HPC user loads.
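The arithmetic behind the 18TB example can be sketched in a few lines of Python. The 2.5 MB/s rate is simply the midpoint of the "2-3MB/s" multistream figure quoted above. The 200 MB/s and 20 MB/s rates are hypothetical comparison points, and the function name is made up for illustration. The sketch only answers one question: how long does a single full pass over the drive (a scrub, a rebuild, a migration) take at a given sustained rate?

```python
TB = 10**12  # decimal terabyte, as used in drive capacities
MB = 10**6   # decimal megabyte
SECONDS_PER_DAY = 86400


def full_pass_days(capacity_tb: float, rate_mb_s: float) -> float:
    """Days needed to read or rewrite the whole drive once at a fixed rate."""
    seconds = (capacity_tb * TB) / (rate_mb_s * MB)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # 200 and 20 MB/s are assumed comparison rates; 2.5 MB/s is the midpoint
    # of the 2-3MB/s multistream rate mentioned in the message above.
    for rate in (200.0, 20.0, 2.5):
        days = full_pass_days(18, rate)
        print(f"18 TB at {rate:6.1f} MB/s: {days:6.1f} days per full pass")
```

At 2.5 MB/s a single pass over an 18TB drive takes on the order of months, which is the sense in which maintenance operations stop fitting into a reasonable timeframe.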
Re: [lustre-discuss] Data stored in OST
I have a question about the following comment:

>>> The usual practice is to use RAID10 for the MDT(s) on "enterprise"
>>> high-endurance SSD, and RAID6 for the OST on "professional" mixed-load SSDs
>>> or "small" (1-2TB at most) "datacenter" HDDs, fronted by failover-servers.

What would be the problem with large 'datacenter' type HDD's for an OST (in raid10 for instance)?

Thanks,
Martin Balvers

-----Original Message-----
From: lustre-discuss On Behalf Of Peter Grandi via lustre-discuss
Sent: Monday, May 22, 2023 15:35
To: list Lustre discussion
Subject: Re: [lustre-discuss] Data stored in OST
Re: [lustre-discuss] Data stored in OST
>>> On Mon, 22 May 2023 13:08:19 +0530, Nick dan via lustre-discuss
>>> said:

> Hi I had one doubt. In lustre, data is divided into stripes
> and stored in multiple OSTs. So each OST will have some part
> of data. My question is if one OST fails, will there be data
> loss?

This is extensively discussed in the Lustre manual with comprehensive illustrations:

https://doc.lustre.org/lustre_manual.xhtml#understandinglustre.storageio
https://doc.lustre.org/lustre_manual.xhtml#pfl
https://doc.lustre.org/lustre_manual.xhtml#understandingfailover

The usual practice is to use RAID10 for the MDT(s) on "enterprise" high-endurance SSD, and RAID6 for the OST on "professional" mixed-load SSDs or "small" (1-2TB at most) "datacenter" HDDs, fronted by failover-servers.

I personally think that it is best to rely on Lustre striping and the "new" PFL/FLR layout (across two OST "pools"), with each OST on a single device and very few OSTs per OSS, when Lustre is used as a "scratch" area for an HPC cluster.

https://doc.lustre.org/lustre_manual.xhtml#flr
Re: [lustre-discuss] Data stored in OST
Hi

Thank you for your reply.

> Yes, the OSTs must provide internal redundancy - RAID-6 typically

Can RAID-6 be replaced with mirror/RAID0? Which type of RAID is recommended for MDT and OST?

Also, can you briefly explain how data is read and written when ZFS is used as the backend filesystem in Lustre?

Thanks and regards
Nick

On Mon, 22 May 2023 at 13:36, Andreas Dilger wrote:

> Yes, the OSTs must provide internal redundancy - RAID-6 typically.
>
> There is File Level Redundancy (FLR = mirroring) possible in Lustre file
> layouts, but it is "unmanaged", so users or other system-level tools are
> required to resync FLR files if they are written after mirroring.
>
> Cheers, Andreas
Re: [lustre-discuss] Data stored in OST
Yes, the OSTs must provide internal redundancy - RAID-6 typically.

There is File Level Redundancy (FLR = mirroring) possible in Lustre file layouts, but it is "unmanaged", so users or other system-level tools are required to resync FLR files if they are written after mirroring.

Cheers, Andreas

> On May 22, 2023, at 09:39, Nick dan via lustre-discuss wrote:
>
> Hi
>
> I had one doubt.
> In lustre, data is divided into stripes and stored in multiple OSTs. So each
> OST will have some part of data.
> My question is if one OST fails, will there be data loss?
>
> Please advise for the same.
>
> Thanks and regards
> Nick
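To illustrate why a striped file depends on every OST it touches, here is a simplified Python sketch of round-robin (RAID-0 style) striping. It is not Lustre code: the function names are hypothetical, and it models only a plain fixed-stripe layout, ignoring PFL composite layouts and FLR mirrors. It shows which byte ranges of a file live on one stripe object, that is, what would become unreadable if the OST holding that object were lost with no redundancy underneath it.

```python
MiB = 1024 * 1024


def stripe_index(offset: int, stripe_size: int, stripe_count: int) -> int:
    """Index (0-based, within the file's layout) of the stripe holding a byte."""
    return (offset // stripe_size) % stripe_count


def ranges_on_stripe(file_size: int, stripe_size: int,
                     stripe_count: int, index: int) -> list[tuple[int, int]]:
    """Half-open byte ranges [start, end) of the file stored on one stripe."""
    ranges = []
    start = index * stripe_size
    while start < file_size:
        ranges.append((start, min(start + stripe_size, file_size)))
        # The same stripe is used again after one full round over all stripes.
        start += stripe_size * stripe_count
    return ranges


if __name__ == "__main__":
    # Hypothetical example: a 10 MiB file, 1 MiB stripes, striped over 4 OSTs.
    lost = ranges_on_stripe(10 * MiB, 1 * MiB, 4, index=2)
    for start, end in lost:
        print(f"bytes {start}..{end - 1} are on stripe 2")
```

In this model every file with a stripe on the failed OST has holes spread throughout it, which is why the OST itself (or an FLR mirror that has been kept in sync) has to provide the redundancy.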
[lustre-discuss] Data stored in OST
Hi

I had one doubt.
In Lustre, data is divided into stripes and stored in multiple OSTs, so each OST will have some part of the data.
My question is: if one OST fails, will there be data loss?

Please advise for the same.

Thanks and regards
Nick