On May 19, 2011, at 10:28, Kevin Van Maren wrote:
> Dardo D Kleiner - CONTRACTOR wrote:
>> Short answer: of course it works - they're just block devices after all -
>> but you'll find that you won't realize the performance gains you might
>> expect (at least not for an MDT).
>
> Yes. See the email thread "improving metadata performance" and Robin
> Humble's talk at LUG. The MDT disk is rarely the bottleneck (although
> that could change with full size-on-MDS support), which others had
> discovered using a ram-based (tmpfs) MDT.

I will assert that MDT disk performance is rarely the bottleneck only for
filesystem-modifying operations, because the seek latency is largely hidden
by the linear IO of the journal, and because most metadata benchmarks are
run on test filesystems that are empty (i.e. the free inodes are all
contiguous). I think that for real-world usage on aged filesystems, and/or
for cold-cache operations (just after mount, or working sets larger than
fit in RAM), SSDs can help significantly.
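If anyone wants to see that difference for themselves, the simplest check
is to run the same metadata benchmark warm and then cold, after dropping
the caches on the MDS and clients. A rough sketch only - the mdtest
parameters, client count and mount point below are illustrative, not a
recommendation:

  # flush cached metadata (run on the MDS and on the clients)
  sync
  echo 3 > /proc/sys/vm/drop_caches

  # create/stat/unlink many files from many client processes
  mpirun -np 64 mdtest -d /mnt/testfs/mdtest -n 10000 -i 3

On an empty filesystem with a warm cache most of those operations are
served from memory and the journal, which is exactly why the MDT disk
rarely shows up as the bottleneck in such runs.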
> As for putting the entire filesystem on flash, sure that would be pretty
> nifty, but expensive.

Not being able to do failover, with storage on internal PCIe cards, is a
downside. I doubt this will be possible for a long time to come, due to
cost, even if the PCIe cards have external interfaces (as I've heard some
high-end ones do).

>> Aside from simply being fast OSTs, there are several areas that would
>> allow Lustre to take advantage of these kinds of devices:
>>
>> 1) SMP scaling for the MDS - the problem right now is that the low
>> latency of these devices really shines best when you have many threads
>> scattering small I/O. The current (1.8.x) Lustre MDS doesn't do this.

> SMP scaling is a big issue. In Lustre 1.8.x the maximum performance is
> reached with no more than 8 CPUs (maybe fewer) for the MDT -- additional
> CPU cores result in _lower_ performance. There are patches for Lustre 2.x
> to improve SMP scaling, but I haven't tested such a workload.

>> 2) Flashcache/bcache over traditional disk storage (OST or MDT) - this
>> can be done today, of course. There are some interop issues in my
>> testing, but when it works it does what it says it does. It still won't
>> really help an MDT though.
>> 3) Targeted device mapping of the metadata portions of an OST on
>> traditional disk (e.g. extent lists) onto flash.
>>
>> #1 is substantial work (ongoing I believe). #2 is pretty nifty -
>> basically you grow your local page cache beyond RAM, which helps when
>> the "hot" working set is large. #3 is trickier, and though I haven't
>> tried it I understand there's real effort ongoing in this regard.

> flex_bg is in ext4, which allows the inodes to be packed together.

As an FYI, a patch to enable flex_bg (and other ext4 features) by default
was just landed to the master branch for 2.1. It also reduces the number of
inodes created on large OSTs (i.e. pretty much any new OST), and increases
the number of inodes created on the MDT. That is more in line with typical
users of Lustre today, and testing so far has shown that flex_bg reduces
mke2fs and e2fsck time noticeably. The higher MDT inode ratio is also
helpful for flash users, since it uses the space on the MDT more
efficiently.
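Anyone formatting a new MDT before 2.1 can get much the same effect by
passing the ext4 options through at format time. A sketch only - the
option values and device name here are illustrative examples, not the new
defaults:

  # enable flex_bg and a denser inode ratio on a flash MDT (values are examples)
  mkfs.lustre --mgs --mdt --fsname=testfs \
      --mkfsoptions="-O flex_bg -G 256 -i 2048" /dev/sdX

  # confirm the feature made it into the superblock
  dumpe2fs -h /dev/sdX | grep -i features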
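On Dardo's #2 above, for anyone who wants to experiment with flashcache:
the SSD is layered in front of the existing OST block device, and the
resulting device-mapper target is then mounted in place of the raw disk.
A minimal sketch, assuming the flashcache tools are installed and with
device names purely illustrative:

  # writeback cache on the FusionIO card in front of the OST disk
  flashcache_create -p back ostcache /dev/fioa /dev/sdb

  # mount the cached device as the OST instead of the bare /dev/sdb
  mount -t lustre /dev/mapper/ostcache /mnt/ost0

As Dardo says, whether this pays off depends on the "hot" working set being
larger than RAM but small enough to live on the flash, and it still doesn't
do much for an MDT.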
>> Filesystem size in this discussion is mostly irrelevant for an MDT, it's
>> just whether or not the device is big enough for the number of objects
>> (a few million is *not* many). A huge number of clients thrashing about
>> creating/modifying/deleting is where these things have the most
>> potential.
>>
>> - Dardo
>>
>> On 5/16/11 2:58 PM, Carlson, Timothy S wrote:
>>
>>> Folks,
>>>
>>> I know that flash based technology gets talked about from time to time
>>> on the list, but I was wondering if anybody has actually implemented
>>> FusionIO devices for metadata. The last thread I can find on the
>>> mailing list that relates to this topic dates from 3 years ago. The
>>> software driving the Fusion cards has come quite a ways since then, and
>>> I've got good experience using the device as a raw disk. I'm just
>>> fishing around to see if anybody has implemented one of these devices
>>> in a reasonably sized Lustre config, where "reasonably" is left open to
>>> interpretation. I'm thinking >500T and a few million files.
>>>
>>> Thanks!
>>>
>>> Tim

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss