My own view on this one is that when software is being developed, the
developers test it against 'small' use cases.
At the company which develops the software, the developers will probably have a
desktop machine with a modest amount of RAM and disk space. The company might
also have a small to medium sized HPC cluster.
I know I am stretching things a bit, but I would imagine a lot of effort goes
into verifying the correct operation of the software on a few given data sets.
When the software is released to customers, either commercially or internally
within a company, it is suddenly applied to much larger datasets, and applied
repetitively. Hence the creation of directories filled with thousands of small
files.

My own example of this is from a former job, where wind tunnel data was
captured as thousands of PNG files, each one a frame from a camera. The data
was shipped back to me on a hard drive, and I was asked to store it on an HSM
system with tape as the lowest tier.
I knew that the PNG files had all been processed anyway, and the significant
data had been extracted. The engineers wanted to keep the data 'just in case'.
I knew that keeping thousands of small files is bad for filesystem performance,
and also that on a tape based system the files can end up stored on many tapes,
so if you ever do trigger a recall you have a festival of robot tape loading.
So what I did was zip up all the directories.
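
For anyone wanting to do the same, here is a minimal sketch of the idea in
Python, not the exact procedure I used at the time. It assumes one archive per
top-level directory is what you want. The function names are made up for
illustration, and storing without compression is my own choice here, on the
grounds that PNG data is already compressed.

```python
import zipfile
from pathlib import Path


def zip_directory(src_dir: Path, archive_path: Path) -> int:
    """Pack every regular file under src_dir into a single zip archive.

    Returns the number of files written. ZIP_STORED (no compression) is an
    assumption that suits already-compressed data such as PNG frames.
    """
    count = 0
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for path in sorted(src_dir.rglob("*")):
            if path.is_file():
                # Keep the directory name as the top-level entry so the
                # archive unpacks into one directory.
                arcname = path.relative_to(src_dir.parent).as_posix()
                zf.write(path, arcname=arcname)
                count += 1
    return count


def zip_each_subdirectory(root: Path, out_dir: Path) -> dict[str, int]:
    """Create one archive per immediate subdirectory of root.

    Returns a mapping of subdirectory name to number of files archived.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        # Skip the output directory in case it lives inside root.
        if sub.resolve() == out_dir.resolve():
            continue
        results[sub.name] = zip_directory(sub, out_dir / f"{sub.name}.zip")
    return results
```

The HSM system then sees one large file per directory instead of thousands of
small ones, so a recall should touch far fewer tapes. Check each archive (for
example with ZipFile.testzip()) before deleting any originals.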


-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Jonathan Buzzard
Sent: Tuesday, January 16, 2018 6:26 PM
To: gpfsug main discussion list <[email protected]>
Subject: Re: [gpfsug-discuss] GPFS best practises : end user standpoint

On Tue, 2018-01-16 at 16:35 +0000, Buterbaugh, Kevin L wrote:

[SNIP]

>
> We’re in Tennessee, so not only do we not speak English, we barely
> speak American … y’all will just have to understand, bless your
> hearts!  ;-).
>
> But seriously, like most Universities, we have a ton of users for whom
> English is not their “primary” language, so dealing with “interesting”
> filenames is pretty hard to avoid.  And users’ problems are our
> problems whether or not they’re our problem.
>

User comes with problem, you investigate, find the problem is due to "wacky"
characters, point them to the mandatory training documentation, tell them they
need to rename their files to something sane and take no further action. Sure
English is not their primary language but *they* have chosen to study in an
English speaking country so best to actually embrace that.

I do get it, many of our users are not native English speakers as well.
Yes it's a tough policy but on the other hand pandering to them does them no 
favours either.

[SNIP]

> If you’ve got (bio)medical users using your cluster I don’t see how
> you avoid this … they’re using commercial apps that do this kind of
> stupid stuff (10’s of thousands of files in a directory and the full
> path to each file is longer than the contents of the files
> themselves!).

Well then they have justified the use; aka it's not their fault so you up the
quota for them. Though they could use different, less brain dead software. The
idea is to force a bump in the road so the users are aware that what they are
doing is considered bad practice. Most users have no idea that putting a
million files in a directory is not sensible and, worse, that trying to access
them using a GUI file manager is positively brain dead.

[SNIP]

> OK, so here’s my main question … you’re right that SSD’s are the
> answer … but how do you charge them more?  SSDs are more expensive
> than hard disks, and enterprise SSDs are stupid expensive … and users
> barely want to pay hard drive prices for their storage.  If you’ve got
> the magic answer to how to charge them enough to pay for SSDs I’m sure
> I’m not the only one who’d love to hear how you do it?!?!
>

Give every user a file number quota of one million. Need to store more than one
million files? Then you are going to have to pay $X per extra million files.
Either they cough up the money to continue using their brain dead software or
they switch to less stupid software. If they complain you just say that
enterprise SSDs are stupidly expensive and you are using that space up at an
above average rate and so have to pay the costs.

I am quite sure someone storing 1PB has to pay more than someone storing 1TB, 
so why should someone storing 20 million files not have to pay more than 
someone storing 100k files? The only difference is people are used to paying 
more to store extra bytes and not used to paying more for more files, but that 
is because most sane people don't store millions and millions of files 
necessitating the purchase of large amounts of expensive enterprise SSDs.
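
As a rough illustration of the arithmetic behind such a scheme, the sketch
below works out a charge from a file count. The function name and the example
price are placeholders, since $X is deliberately left open above. Rounding up
to whole blocks of a million files is also an assumption and not something
stated in the policy.

```python
def file_count_charge(
    files_stored: int,
    price_per_extra_million: float,
    free_quota: int = 1_000_000,
) -> float:
    """Charge for every started block of a million files above the free quota.

    price_per_extra_million stands in for the unspecified $X; rounding up to
    whole blocks is an assumption made for this sketch.
    """
    block = 1_000_000
    extra = max(0, files_stored - free_quota)
    blocks = -(-extra // block)  # integer ceiling division
    return blocks * price_per_extra_million


# With a placeholder price of 50 per extra million files:
print(file_count_charge(100_000, 50.0))     # 0.0
print(file_count_charge(20_000_000, 50.0))  # 950.0
```

Under these assumptions the user with 100k files pays nothing extra, while the
user with 20 million files pays for 19 extra blocks.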


JAB.

--
Jonathan A. Buzzard                         Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss