Andreas,
Thanks for responding.
Right now, I am looking at using ZFS and an SSD/NVMe for the journal
disk. I suggested mirroring, but they aren't too keen on losing 50% of
their purchased storage.
This particular system will likely not be scaled up at a future date.
It seems like 2.11 may be a good direction if they can wait. I do
like the idea of running an MDT-only system in the future. With multiple
MDTs, that would be a great match for this scenario and also have the
ability to grow in the future. And being able to "import" an existing
filesystem is awesome. A vote for that!
Brian Andrus
On 11/29/2017 6:08 PM, Dilger, Andreas wrote:
On Nov 29, 2017, at 15:31, Brian Andrus <toomuc...@gmail.com> wrote:
> All,
> I have always seen Lustre as a good solution for large files and not the best
> for many small files.
> Recently, I have seen a request for a small Lustre system (2 OSSes, 1 MDS) that
> would be for billions of files that average 50k-100k.
This is about 75TB of usable capacity per billion files. Are you looking at
HDD or SSD storage? RAID or mirror? What kind of client load, and how much
does this system need to scale in the future?
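For reference, the 75TB-per-billion figure falls straight out of the file count and the average size; a quick sketch, assuming a 75KB average (the midpoint of the 50k-100k range) and decimal units:

```python
# Rough capacity estimate: 1 billion files at a ~75KB average size
# (midpoint of the stated 50k-100k range; decimal KB/TB throughout).
files = 1_000_000_000
avg_bytes = 75 * 1000            # 75 KB average file size
total_tb = files * avg_bytes / 1000**4
print(total_tb)                  # 75.0 TB of usable capacity per billion files
```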
> It seems to me that for this to be 'of worth', the block sizes on disks need
> to be small, but even then, with TCP overhead and inode limitations, it may
> still not perform all that well (compared to larger files).
Even though Lustre does 1MB or 4MB RPCs, it only allocates as much space on the
OSTs as needed for the file data. This means 4KB blocks with ldiskfs, and
variable (power-of-two) blocksize on ZFS (64KB or 128KB blocks by default). You
could constrain ZFS to smaller blocks if needed (e.g. recordsize=32k), or
enable ZFS compression to try to fit the data into smaller blocks (depending
on whether your data is compressible).
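As a sketch of that tuning (the dataset name "tank/ost0" is a placeholder; recordsize and compression are standard ZFS dataset properties):

```shell
# Constrain ZFS to smaller records for a small-file workload.
zfs set recordsize=32k tank/ost0

# Or enable compression so small files pack into smaller blocks;
# lz4 is cheap enough to leave on by default.
zfs set compression=lz4 tank/ost0

# Verify the settings took effect.
zfs get recordsize,compression tank/ost0
```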
The drawback is that every Lustre file currently needs an MDT inode (1KB+) and
an OST inode, so Lustre isn't the most efficient for small files.
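That per-file metadata cost adds up at this scale. A back-of-the-envelope check (the ~1KB MDT inode is from the text above; the OST inode size is an assumed ballpark, not a measured figure):

```python
# Metadata overhead for 1 billion small files.
files = 1_000_000_000
mdt_inode = 1024          # ~1KB per MDT inode (from the text)
ost_inode = 512           # assumed ballpark for the OST-side inode
overhead_tb = files * (mdt_inode + ost_inode) / 1000**4
avg_file = 75 * 1000      # 75KB average file
pct = 100 * (mdt_inode + ost_inode) / avg_file
print(overhead_tb)        # ~1.5 TB of metadata per billion files
print(round(pct, 1))      # ~2% of each 75KB file
```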
> Am I off here? Have there been some developments in Lustre that help this
> scenario (beyond small files being stored on the MDT directly)?
The Data-on-MDT feature (DoM) has landed for 2.11, which seems like it would
suit your workload well, since it only needs a single MDT inode for small
files, and reduces the overhead when accessing them. It will still be a
couple of months before DoM is released, though you could start testing now if
you are interested. Currently DoM is intended to be used together with OSTs,
but if there is a demand we could look into what is needed to run an MDT-only
filesystem configuration (some checks in the code that prevent the filesystem
from becoming available before at least one OST is mounted would need to be removed).
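If you do try a 2.11 test build, DoM is enabled per directory via a composite layout, roughly along these lines (the path and the 128K cutoff are illustrative, not a recommendation):

```shell
# Illustrative DoM layout: files up to 128KB live entirely on the MDT;
# anything larger spills into a normal OST component.
# "/mnt/lustre/smallfiles" is a placeholder directory.
lfs setstripe -E 128K -L mdt -E -1 /mnt/lustre/smallfiles

# Inspect the resulting layout.
lfs getstripe /mnt/lustre/smallfiles
```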
That said, you could also just set up a single NFS server with ZFS to handle
the 75TB * N of storage, unless you need highly concurrent access to the files.
This would probably be acceptable if you don't need to scale too much (in
capacity or performance), and don't have a large number of clients connecting.
One of the other features we're currently investigating (not sure how much interest there
is yet) is to be able to "import" an existing ext4 or ZFS filesystem into
Lustre as MDT0000 (with DoM), and be able to grow horizontally by adding more MDTs or
OSTs. Some work is already being done that will facilitate this in 2.11 (DoM, and OI
Scrub for ZFS), but more would be needed for this to work. That would potentially allow
you to start with a ZFS or ext4 NFS server, and then migrate to Lustre if you need to
scale it up.
Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org