Hi Becky,

I hear what you're saying. I was just hoping that the slowdown would not be quite so dramatic.
Switching to LMDB does offer an improvement: the untar now takes only 7 minutes, which is a pretty good speedup. Still too slow to be practical for our uses, however. Another user, Martin, also offered some suggestions in response to my original post, so I'm going to try those too.

Thanks!

Mark Guz
Senior IT Specialist
Flash Systems & Technology
Office 713-587-1048
Mobile 832-290-8161
[email protected]

From: Becky Ligon <[email protected]>
To: Mark Guz <[email protected]>
Cc: "[email protected]" <[email protected]>
Date: 10/19/2017 01:12 PM
Subject: Re: [Pvfs2-users] Questions about performance
Sent by: [email protected]

Mark,

OrangeFS is NOT optimized for small file I/O, so we don't recommend untarring or compiling on this system. We are working to make small file I/O better, but the system was originally designed for large I/O. To that end, we now have a readahead cache in 2.9.6, which helps with small reads, but we don't have a write cache. We are currently working on some kernel-level page caching mechanisms to help with small writes, but the concern is data consistency amongst compute nodes: the user must be responsible for ensuring that only one node, out of many within one job, needs the most current data. That's a lot to ask of most users, so you see our dilemma.

In your scenario, are you using LMDB or BDB? We have found that LMDB gives much better metadata performance than BDB. We deliver LMDB with the tarball; to use it, add --with-db-backend=LMDB to the configure command. If you want to try the readahead cache, also add --enable-ra-cache to the configure line.

Hope this helps!
Becky Ligon
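Concretely, the two flags Becky mentions would be added to a standard source build along these lines (a minimal sketch: the source directory, install prefix, and make steps are assumptions; the flags themselves are as given above):

    # Rebuild OrangeFS 2.9.6 with the bundled LMDB metadata backend and
    # the client readahead cache. The source path and --prefix are
    # illustrative, not part of the original instructions.
    cd orangefs-2.9.6
    ./configure --prefix=/opt/orangefs \
                --with-db-backend=LMDB \
                --enable-ra-cache
    make
    make install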
Mark: Below are comments from Walt Ligon, one of the original designers of the system:

I would say that OFS is primarily designed to support large parallel computations with large data I/O. It is not intended as a general-purpose file system; in particular, it is not especially efficient for the large numbers of very small I/O tasks typically found in an interactive workload such as tar, gcc, grep, etc. The problem is the latency of sending a request from a single client through the kernel to the servers and back. Reconfiguring the servers really has no effect on latency.

That said, we have been working to improve small file I/O. The version you have includes a "User Interface" that allows an application to bypass the kernel, which can lower access latency; this is potentially suitable for specific applications but less so for common utilities such as tar. In the current version we have added a configurable readahead cache to the client that can improve small-read performance on larger files. We are currently working to take advantage of the Linux page cache to improve very small I/O.

On Thu, Oct 19, 2017 at 11:43 AM, Mark Guz <[email protected]> wrote:

Hi,

I'm running a proof-of-concept OrangeFS setup and have read through all of the previous posts on performance. We are running 2.9.6 on Red Hat 7.4, and the clients use the kernel interface. I'm currently running on one Power 750 server as the host, with 8 dual meta/data servers on it. The clients are a mix of Intel and PPC64 systems, all interconnected by InfiniBand DDR cards in connected mode. The storage backend is a 4Gb FC-attached SSD chassis holding 20 250GB SSDs (not regular drives), of which 8 are assigned to metadata and 8 to data. The network tests well: roughly 7 Gb/s with no retries or errors. We are using bmi_tcp.

I can get great performance for large files, as expected, but small file operations are significantly slower. For example, I can untar linux-4.13.3.tar.xz locally on the filesystem in 14 seconds, while on the OrangeFS mount it takes 10 minutes. I can see the performance difference when playing with stripe sizes and the like while copying monolithic files, but there seems to be a wall once there is a lot of metadata activity. I can see how the network round trips could hurt, but is it reasonable to see a 42x performance drop in such cases? Also, if I compile the kernel on the OrangeFS mount it takes well over 2 hours, with most of the time spent waiting on I/O; compiling locally takes about 20 minutes on the same server.

I've tried running over multiple hosts, running some meta-only and some data-only servers, and running with just 1 meta and 1 data server. I've applied all the system-level optimizations I've found and just cannot reasonably speed up the untar operation. Maybe it's a client thing? There's not much I can see that's configurable about the client, though. It seems to me that I should be able to get better performance, even on small file operations, but I'm kind of stumped. Am I just chasing unicorns, or is it possible to get usable performance for this sort of file activity (untarring, compiling, etc.)?
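For reference, the comparison described above boils down to timing the same extraction on both filesystems, along these lines (a sketch; the mount point and tarball location are assumptions):

    # Baseline on a local filesystem: ~14 seconds in the test above.
    cd /tmp && time tar xf ~/linux-4.13.3.tar.xz

    # Same extraction on the OrangeFS kernel-interface mount: ~10 minutes
    # in the original test, ~7 minutes after switching to LMDB.
    # /mnt/orangefs is an assumed mount point.
    cd /mnt/orangefs && time tar xf ~/linux-4.13.3.tar.xz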
Config below:

<Defaults>
    UnexpectedRequests 256
    EventLogging none
    EnableTracing no
    LogStamp datetime
    BMIModules bmi_tcp
    FlowModules flowproto_multiqueue
    PerfUpdateInterval 10000000
    ServerJobBMITimeoutSecs 30
    ServerJobFlowTimeoutSecs 60
    ClientJobBMITimeoutSecs 30
    ClientJobFlowTimeoutSecs 600
    ClientRetryLimit 5
    ClientRetryDelayMilliSecs 100
    PrecreateBatchSize 0,1024,1024,1024,32,1024,0
    PrecreateLowThreshold 0,256,256,256,16,256,0
    TroveMaxConcurrentIO 16
    <Security>
        TurnOffTimeouts yes
    </Security>
</Defaults>

<Aliases>
    Alias server01 tcp://server01:3334
    Alias server02 tcp://server01:3335
    Alias server03 tcp://server01:3336
    Alias server04 tcp://server01:3337
    Alias server05 tcp://server01:3338
    Alias server06 tcp://server01:3339
    Alias server07 tcp://server01:3340
    Alias server08 tcp://server01:3341
</Aliases>

<ServerOptions>
    Server server01
    DataStorageSpace /usr/local/storage/server01/data
    MetadataStorageSpace /usr/local/storage/server01/meta
    LogFile /var/log/orangefs-server-server01.log
</ServerOptions>
<ServerOptions>
    Server server02
    DataStorageSpace /usr/local/storage/server02/data
    MetadataStorageSpace /usr/local/storage/server02/meta
    LogFile /var/log/orangefs-server-server02.log
</ServerOptions>
<ServerOptions>
    Server server03
    DataStorageSpace /usr/local/storage/server03/data
    MetadataStorageSpace /usr/local/storage/server03/meta
    LogFile /var/log/orangefs-server-server03.log
</ServerOptions>
<ServerOptions>
    Server server04
    DataStorageSpace /usr/local/storage/server04/data
    MetadataStorageSpace /usr/local/storage/server04/meta
    LogFile /var/log/orangefs-server-server04.log
</ServerOptions>
<ServerOptions>
    Server server05
    DataStorageSpace /usr/local/storage/server05/data
    MetadataStorageSpace /usr/local/storage/server05/meta
    LogFile /var/log/orangefs-server-server05.log
</ServerOptions>
<ServerOptions>
    Server server06
    DataStorageSpace /usr/local/storage/server06/data
    MetadataStorageSpace /usr/local/storage/server06/meta
    LogFile /var/log/orangefs-server-server06.log
</ServerOptions>
<ServerOptions>
    Server server07
    DataStorageSpace /usr/local/storage/server07/data
    MetadataStorageSpace /usr/local/storage/server07/meta
    LogFile /var/log/orangefs-server-server07.log
</ServerOptions>
<ServerOptions>
    Server server08
    DataStorageSpace /usr/local/storage/server08/data
    MetadataStorageSpace /usr/local/storage/server08/meta
    LogFile /var/log/orangefs-server-server08.log
</ServerOptions>

<Filesystem>
    Name orangefs
    ID 146181131
    RootHandle 1048576
    FileStuffing yes
    FlowBufferSizeBytes 1048576
    FlowBuffersPerFlow 8
    DistrDirServersInitial 1
    DistrDirServersMax 1
    DistrDirSplitSize 100
    TreeThreshold 16
    <Distribution>
        Name simple_stripe
        Param strip_size
        Value 1048576
    </Distribution>
    <MetaHandleRanges>
        Range server01 3-576460752303423489
        Range server02 576460752303423490-1152921504606846976
        Range server03 1152921504606846977-1729382256910270463
        Range server04 1729382256910270464-2305843009213693950
        Range server05 2305843009213693951-2882303761517117437
        Range server06 2882303761517117438-3458764513820540924
        Range server07 3458764513820540925-4035225266123964411
        Range server08 4035225266123964412-4611686018427387898
    </MetaHandleRanges>
    <DataHandleRanges>
        Range server01 4611686018427387899-5188146770730811385
        Range server02 5188146770730811386-5764607523034234872
        Range server03 5764607523034234873-6341068275337658359
        Range server04 6341068275337658360-6917529027641081846
        Range server05 6917529027641081847-7493989779944505333
        Range server06 7493989779944505334-8070450532247928820
        Range server07 8070450532247928821-8646911284551352307
        Range server08 8646911284551352308-9223372036854775794
    </DataHandleRanges>
    <StorageHints>
        TroveSyncMeta yes
        TroveSyncData no
        TroveMethod alt-aio
        #DirectIOThreadNum 120
        #DirectIOOpsPerQueue 200
        #DBCacheSizeBytes 17179869184
        #AttrCacheSize 10037
        #AttrCacheMaxNumElems 2048
    </StorageHints>
</Filesystem>
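As a quick sanity check on a configuration like this, pvfs2-ping can confirm that the client reaches the filesystem and its servers (the mount point below is an assumption):

    # Verify client-to-server connectivity for the mounted filesystem.
    # /mnt/orangefs is an assumed mount point.
    pvfs2-ping -m /mnt/orangefs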
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
