Andreas,

Thanks for responding.

Right now, I am looking at using ZFS and an SSD/NVMe device for the journal disk. I suggested mirroring, but they aren't too keen on losing 50% of their purchased storage.
This particular system will likely not be scaled up at a future date.

It seems like 2.11 may be a good direction if they can wait. I do like the idea of running an MDT-only system in the future. With multiple MDTs, that would be a great match for this scenario, with the ability to grow in the future. And being able to "import" an existing filesystem is awesome. A vote for that!

Brian Andrus


On 11/29/2017 6:08 PM, Dilger, Andreas wrote:
On Nov 29, 2017, at 15:31, Brian Andrus <toomuc...@gmail.com> wrote:
All,

I have always seen lustre as a good solution for large files and not the best 
for many small files.
Recently, I have seen a request for a small lustre system (2 OSSes, 1 MDS) that 
would be for billions of files that average 50k-100k.
This is about 75TB of usable capacity per billion files.  Are you looking at 
HDD or SSD storage?  RAID or mirror?  What kind of client load, and how much 
does this system need to scale in the future?
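As a back-of-envelope check of that capacity figure (the 75KB average is my assumption, the midpoint of the 50k-100k range quoted above):

```python
# Capacity needed for a billion small files, before metadata
# overhead and block rounding.
files = 1_000_000_000
avg_size = 75 * 1000          # assumed 75KB average file size, in bytes
total_tb = files * avg_size / 1000**4
print(total_tb)               # 75.0 TB of raw file data
```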

It seems to me that for this to be 'of worth', the block sizes on disk need 
to be small, but even then, with TCP overhead and inode limitations, it may 
still not perform all that well (compared to larger files).
Even though Lustre does 1MB or 4MB RPCs, it only allocates as much space on the 
OSTs as needed for the file data.  This means 4KB blocks with ldiskfs, and a 
variable (power-of-two) block size with ZFS (64KB or 128KB blocks by default). You 
could constrain ZFS to smaller blocks if needed (e.g. recordsize=32k), or 
enable ZFS compression to try to fit the data into smaller blocks (depending 
on whether your data is compressible).
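For example, those two tunables could be set like this (the dataset name is a placeholder; check the property names against your ZFS version):

```shell
# Constrain ZFS to smaller blocks for a small-file workload
# ("tank/ost0" is an illustrative dataset name).
zfs set recordsize=32k tank/ost0

# Optionally enable lz4 compression so data may fit into fewer blocks
zfs set compression=lz4 tank/ost0
```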

The drawback is that every Lustre file currently needs an MDT inode (1KB+) and 
an OST inode, so Lustre isn't the most efficient for small files.

Am I off here? Have there been some developments in lustre that help this 
scenario (beyond small files being stored on the MDT directly)?
The Data-on-MDT feature (DoM) has landed for 2.11, which seems like it would 
suit your workload well, since it only needs a single MDT inode for small 
files, and reduces the overhead when accessing the file.  It will still be a 
couple of months before 2.11 is released, though you could start testing now if 
you are interested.  Currently DoM is intended to be used together with OSTs, 
but if there is a demand we could look into what is needed to run an MDT-only 
filesystem configuration (some checks in the code that prevent the filesystem 
becoming available before at least one OST is mounted would need to be removed).
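For reference, once 2.11 is available, a DoM layout would be requested with something like the following (a sketch; the exact option syntax may differ in the released version, and the path is illustrative):

```shell
# Store the first 64KB of each file directly on the MDT; anything
# beyond that spills over to a normal OST component.
lfs setstripe -E 64K -L mdt -E -1 -S 1M /mnt/lustre/smallfiles
```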

That said, you could also just set up a single NFS server with ZFS to handle 
the 75TB * N of storage, unless you need highly concurrent access to the files. 
 This would probably be acceptable if you don't need to scale too much (in 
capacity or performance), and don't have a large number of clients connecting.
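A minimal sketch of that single-server alternative, assuming ZFS's built-in NFS export (pool layout, device names, and dataset names are all placeholders):

```shell
# One ZFS pool on a single server, exported over NFS.
zpool create tank raidz2 sda sdb sdc sdd sde sdf
zfs create -o recordsize=32k -o compression=lz4 tank/data
zfs set sharenfs=on tank/data
```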

One of the other features we're currently investigating (not sure how much interest there 
is yet) is to be able to "import" an existing ext4 or ZFS filesystem into 
Lustre as MDT0000 (with DoM), and be able to grow horizontally by adding more MDTs or 
OSTs.  Some work is already being done that will facilitate this in 2.11 (DoM, and OI 
Scrub for ZFS), but more would be needed for this to work.  That would potentially allow 
you to start with a ZFS or ext4 NFS server, and then migrate to Lustre if you need to 
scale it up.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org