On Nov 29, 2017, at 15:31, Brian Andrus <toomuc...@gmail.com> wrote:
> 
> All,
> 
> I have always seen lustre as a good solution for large files and not the best 
> for many small files.
> Recently, I have seen a request for a small lustre system (2 OSSes, 1 MDS) 
> that would be for billions of files that average 50k-100k.

This is about 75TB of usable capacity per billion files.  Are you looking at 
HDD or SSD storage?  RAID or mirror?  What kind of client load, and how much 
does this system need to scale in the future?
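
As a rough check of the 75TB figure, here is a minimal Python sketch.  It 
assumes decimal units (1KB = 1000 bytes, 1TB = 10^12 bytes) and ignores RAID, 
filesystem, and inode overhead; the function name is only illustrative.

```python
def usable_capacity_tb(num_files, avg_file_kb):
    """Raw file data in TB for num_files files of avg_file_kb each.

    Assumes decimal units and no RAID, filesystem, or inode overhead.
    """
    return num_files * avg_file_kb * 1000 / 1e12


# 50k-100k average file size, one billion files
for avg_kb in (50, 75, 100):
    print(avg_kb, "KB avg ->", usable_capacity_tb(1_000_000_000, avg_kb), "TB")
```

The midpoint of the 50k-100k range (75KB) gives 75TB per billion files; the 
ends of the range give 50TB and 100TB.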

> It seems to me, that for this to be 'of worth', the block sizes on disks need 
> to be small, but even then, with tcp overhead and inode limitations, it may 
> still not perform all that well (compared to larger files).

Even though Lustre does 1MB or 4MB RPCs, it only allocates as much space on the 
OSTs as needed for the file data.  This means 4KB blocks with ldiskfs, and 
variable (power-of-two) blocksize on ZFS (64KB or 128KB blocks by default). You 
could constrain ZFS to smaller blocks if needed (e.g. recordsize=32k), or 
enable ZFS compression to try to fit the data into smaller blocks (this depends 
on whether your data is compressible).
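
To get a feel for how block size affects space used by small files, here is a 
toy Python model.  It takes the description above at face value: ldiskfs 
rounds up to 4KB blocks, and ZFS uses one power-of-two block for a file up to 
recordsize, then whole recordsize blocks beyond that.  This is an assumption 
for illustration only; real ZFS allocation also depends on ashift, compression, 
and RAID-Z padding, and the function names are hypothetical.

```python
def ldiskfs_allocated(size, block=4096):
    """Bytes allocated when rounding up to fixed-size blocks."""
    return -(-size // block) * block  # ceiling division


def zfs_allocated(size, recordsize=128 * 1024, min_block=512):
    """Simplified model: one power-of-two block up to recordsize,
    then whole recordsize blocks for larger files (assumption)."""
    if size <= recordsize:
        # next power of two >= size, but never below min_block
        return max(min_block, 1 << (size - 1).bit_length())
    return -(-size // recordsize) * recordsize


for size in (50_000, 75_000, 100_000):
    print(size,
          ldiskfs_allocated(size),
          zfs_allocated(size),                       # recordsize=128k
          zfs_allocated(size, recordsize=32 * 1024)) # recordsize=32k
```

In this model a smaller recordsize only helps for some file sizes (a 75,000 
byte file, for example), so it would be worth measuring against a sample of 
the real data before tuning.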

The drawback is that every Lustre file currently needs an MDT inode (1KB+) and 
an OST inode, so Lustre isn't the most efficient for small files.

> Am I off here? Have there been some developments in lustre that help this 
> scenario (beyond small files being stored on the MDT directly)?

The Data-on-MDT feature (DoM) has landed for 2.11, which seems like it would 
suit your workload well, since it only needs a single MDT inode for small 
files, and reduces the overhead when accessing the file.  It will still be a 
couple of months before 2.11 is released, though you could start testing now if 
you are interested.  Currently DoM is intended to be used together with OSTs, 
but if there is a demand we could look into what is needed to run an MDT-only 
filesystem configuration (some checks in the code that prevent the filesystem 
becoming available before at least one OST is mounted would need to be removed).
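
The inode accounting described above (one MDT inode plus one OST inode per 
regular file, versus a single MDT inode with DoM) can be sketched in Python as 
follows.  The 1024-byte MDT inode size is the "1KB+" figure treated as a lower 
bound, the OST inode size is left out because it is not stated here, and the 
function name is hypothetical.

```python
MDT_INODE_BYTES = 1024  # "1KB+" from the text, used as a lower bound


def inode_overhead(num_files, data_on_mdt=False):
    """Inode counts and minimum MDT inode space for num_files small files."""
    return {
        "mdt_inodes": num_files,
        # With DoM, small files need no OST object at all.
        "ost_inodes": 0 if data_on_mdt else num_files,
        # Decimal TB; excludes file data, which DoM also places on the MDT.
        "mdt_inode_tb_min": num_files * MDT_INODE_BYTES / 1e12,
    }


print(inode_overhead(1_000_000_000))
print(inode_overhead(1_000_000_000, data_on_mdt=True))
```

Note that with DoM the MDT also has to hold the file data itself, so the MDT 
would need to be sized for the data capacity in addition to the inode space.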

That said, you could also just set up a single NFS server with ZFS to handle 
the 75TB * N of storage, unless you need highly concurrent access to the files.  
This would probably be acceptable if you don't need to scale too much (in 
capacity or performance), and don't have a large number of clients connecting.

One of the other features we're currently investigating (not sure how much 
interest there is yet) is to be able to "import" an existing ext4 or ZFS 
filesystem into Lustre as MDT0000 (with DoM), and be able to grow horizontally 
by adding more MDTs or OSTs.  Some work is already being done that will 
facilitate this in 2.11 (DoM, and OI Scrub for ZFS), but more would be needed 
for this to work.  That would potentially allow you to start with a ZFS or ext4 
NFS server, and then migrate to Lustre if you need to scale it up.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
