Mark:

Since you are a flash specialist, I have a question for you.  We are
running tests on a box that has 6 NVMe PCIe cards.  Each of the six cards
is defined with its own filesystem (XFS).  If we run a dd to each
filesystem at the same time, the I/O performance is horrible.  Is there a
way to tune this environment, or can the PCIe bus not handle such a load?
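
Roughly what each test looks like, as a sketch (the mount points, block size,
and count below are placeholders rather than our exact values; oflag=direct
keeps the page cache out of the picture so each stream exercises its own
device):

    for i in 1 2 3 4 5 6; do
        dd if=/dev/zero of=/mnt/nvme$i/testfile bs=1M count=4096 oflag=direct &
    done
    wait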

Becky
OrangeFS Developer

On Thu, Oct 19, 2017 at 4:55 PM, Mark Guz <[email protected]> wrote:

> Hi Becky
>
> I hear what you're saying. I was just hoping that the slowdown would not
> be quite so dramatic.
>
> Switching to LMDB does offer an improvement. The untar now takes only 7
> minutes, which is a pretty good speedup. Still too slow to be practical for
> our uses, however.
>
> Another user, Martin, also offered some suggestions in response to my
> original post, so I'm going to try those as well.
>
> Thanks!
> Mark Guz
>
> Senior IT Specialist
> Flash Systems & Technology
>
> Office  713-587-1048
> Mobile 832-290-8161
> [email protected]
>
> From:        Becky Ligon <[email protected]>
> To:        Mark Guz <[email protected]>
> Cc:        "[email protected]" <[email protected]>
> Date:        10/19/2017 01:12 PM
> Subject:        Re: [Pvfs2-users] Questions about performance
> Sent by:        [email protected]
> ------------------------------
>
>
>
> Mark,
>
> OrangeFS is NOT optimized for small file I/O, thus we don't recommend
> untarring or compiling on this system.
>
> We are working to make small file I/O better, but the system was
> originally designed for large I/O.  To that end, we now have a readahead
> cache in 2.9.6, which helps with small reads, but we don't have a write
> cache.  We are currently working on some kernel-level page caching
> mechanisms to help with small writes, but the concern is with data
> consistency amongst compute nodes.  The user must be responsible for
> ensuring that only one node, out of many within one job, needs the most
> current data.  That's a lot to ask of most users, so you see our dilemma.
>
> In your scenario, are you using LMDB or BDB?  We have found that LMDB
> gives much better metadata performance than BDB.  We deliver LMDB with the
> tarball.  To use it, add --with-db-backend=LMDB to the configure command.
>
> If you want to try the readahead cache, also add --enable-ra-cache to the
> configure line.
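>
> A minimal sketch of such a configure invocation with both options (run from
> the unpacked source tree; any other options you normally pass are up to you):
>
>     ./configure --with-db-backend=LMDB --enable-ra-cache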
>
> Hope this helps!
>
> Becky Ligon
>
>
> Mark:
>
> Below are comments from Walt Ligon, one of the original designers of the
> system:
>
> I would say that OFS is primarily designed to support large parallel
> computations with large data I/O.  It is not intended as a general-purpose
> file system.  In particular, it is not efficient for large numbers of very
> small I/O tasks, as typically found in an interactive workload such as tar,
> gcc, grep, etc.  The problem is the latency of sending a request from a
> single client through the kernel to the servers and back.  Reconfiguring
> the servers, etc. has essentially no effect on that latency.
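>
> As a rough illustration (the file count is an estimate, not a measurement):
> the linux-4.13 source tree holds on the order of 60,000 entries, so an untar
> that takes ~600 seconds comes to roughly 10 ms of round trips and metadata
> work per file; that cost is dominated by request latency, not bandwidth.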
>
> That said, we have been working to improve small file I/O.  In the version
> you have, there is a "User Interface" that allows an application to bypass
> the kernel, which can lower access latency; this is potentially suitable
> for specific applications but less so for common utilities such as tar.  In
> the current version we have added a configurable readahead cache to the
> client that can improve small-read performance for larger files.  We are
> currently working to take advantage of the Linux page cache to improve very
> small I/O.
>
>
>
> On Thu, Oct 19, 2017 at 11:43 AM, Mark Guz <[email protected]> wrote:
> Hi,
>
> I'm running a PoC OrangeFS setup and have read through all of the previous
> posts on performance. We are running 2.9.6 on Red Hat 7.4. The clients are
> using the kernel interface.
>
> I'm currently running on one Power 750 server as the host (with 8 dual
> meta/data servers running).  The clients are a mix of Intel and PPC64
> systems, all interconnected by InfiniBand DDR cards in connected mode.
>
> The storage backend is a 4 Gb FC-attached SSD chassis with 20 x 250 GB SSD
> cards (not regular drives); 8 are assigned to meta and 8 are assigned to
> data.
>
> The network tests well: roughly 7 Gb/s with no retries or errors. We are
> using bmi_tcp.
>
> I can get great performance for large files, as expected, but for small-file
> operations the performance is significantly poorer.
>
> For example, I can untar linux-4.13.3.tar.xz on the local filesystem in
> 14 seconds, while on OrangeFS it takes 10 minutes.
>
> I can see the performance difference when playing with stripe sizes, etc.,
> when copying monolithic files, but there seems to be a wall that gets hit
> when there is a lot of metadata activity.
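>
> (For reference, the stripe-size knob I'm varying is the Value under
> <Distribution> in the config at the bottom; e.g. a 4 MiB strip would be
> Value 4194304 in place of the 1048576 shown there.)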
>
> I can see how the network back-and-forth could impact performance, but is it
> reasonable to see a 42x performance drop in such cases?
>
> Also, if I then try to compile the kernel on OrangeFS, it takes well over
> 2 hours, and most of the time is spent waiting on I/O.  Compiling locally
> takes about 20 minutes on the same server.
>
> I've tried running over multiple hosts, running some meta-only and some
> data-only servers, and running with just 1 meta and 1 data server.  I've
> applied all the system-level optimizations I've found and just cannot
> reasonably speed up the untar operation.
>
> Maybe it's a client thing? There's not much I can see that's configurable
> about the client, though.
>
> It seems to me that I should be able to get better performance, even on
> small file operations, but I'm kind of stumped.
>
> Am I just chasing unicorns, or is it possible to get usable performance for
> this sort of file activity (untarring, compiling, etc.)?
>
> Config below
>
> <Defaults>
>         UnexpectedRequests 256
>         EventLogging none
>         EnableTracing no
>         LogStamp datetime
>         BMIModules bmi_tcp
>         FlowModules flowproto_multiqueue
>         PerfUpdateInterval 10000000
>         ServerJobBMITimeoutSecs 30
>         ServerJobFlowTimeoutSecs 60
>         ClientJobBMITimeoutSecs 30
>         ClientJobFlowTimeoutSecs 600
>         ClientRetryLimit 5
>         ClientRetryDelayMilliSecs 100
>         PrecreateBatchSize 0,1024,1024,1024,32,1024,0
>         PrecreateLowThreshold 0,256,256,256,16,256,0
>         TroveMaxConcurrentIO 16
>         <Security>
>                 TurnOffTimeouts yes
>         </Security>
> </Defaults>
>
> <Aliases>
>         Alias server01 tcp://server01:3334
>         Alias server02 tcp://server01:3335
>         Alias server03 tcp://server01:3336
>         Alias server04 tcp://server01:3337
>         Alias server05 tcp://server01:3338
>         Alias server06 tcp://server01:3339
>         Alias server07 tcp://server01:3340
>         Alias server08 tcp://server01:3341
> </Aliases>
>
> <ServerOptions>
>      Server server01
>      DataStorageSpace /usr/local/storage/server01/data
>      MetadataStorageSpace /usr/local/storage/server01/meta
>      LogFile /var/log/orangefs-server-server01.log
> </ServerOptions>
>
> <ServerOptions>
>      Server server02
>      DataStorageSpace /usr/local/storage/server02/data
>      MetadataStorageSpace /usr/local/storage/server02/meta
>      LogFile /var/log/orangefs-server-server02.log
> </ServerOptions>
>
> <ServerOptions>
>      Server server03
>      DataStorageSpace /usr/local/storage/server03/data
>      MetadataStorageSpace /usr/local/storage/server03/meta
>      LogFile /var/log/orangefs-server-server03.log
> </ServerOptions>
> <ServerOptions>
>      Server server04
>      DataStorageSpace /usr/local/storage/server04/data
>      MetadataStorageSpace /usr/local/storage/server04/meta
>      LogFile /var/log/orangefs-server-server04.log
> </ServerOptions>
> <ServerOptions>
>      Server server05
>      DataStorageSpace /usr/local/storage/server05/data
>      MetadataStorageSpace /usr/local/storage/server05/meta
>      LogFile /var/log/orangefs-server-server05.log
> </ServerOptions>
> <ServerOptions>
>      Server server06
>      DataStorageSpace /usr/local/storage/server06/data
>      MetadataStorageSpace /usr/local/storage/server06/meta
>      LogFile /var/log/orangefs-server-server06.log
> </ServerOptions>
> <ServerOptions>
>      Server server07
>      DataStorageSpace /usr/local/storage/server07/data
>      MetadataStorageSpace /usr/local/storage/server07/meta
>      LogFile /var/log/orangefs-server-server07.log
> </ServerOptions>
> <ServerOptions>
>      Server server08
>      DataStorageSpace /usr/local/storage/server08/data
>      MetadataStorageSpace /usr/local/storage/server08/meta
>      LogFile /var/log/orangefs-server-server08.log
> </ServerOptions>
>
> <Filesystem>
>         Name orangefs
>         ID 146181131
>         RootHandle 1048576
>         FileStuffing yes
>         FlowBufferSizeBytes 1048576
>         FlowBuffersPerFlow  8
>         DistrDirServersInitial 1
>         DistrDirServersMax 1
>         DistrDirSplitSize 100
>         TreeThreshold 16
>         <Distribution>
>                 Name simple_stripe
>                 Param strip_size
>                 Value 1048576
>         </Distribution>
>         <MetaHandleRanges>
>                 Range server01 3-576460752303423489
>                 Range server02 576460752303423490-1152921504606846976
>                 Range server03 1152921504606846977-1729382256910270463
>                 Range server04 1729382256910270464-2305843009213693950
>                 Range server05 2305843009213693951-2882303761517117437
>                 Range server06 2882303761517117438-3458764513820540924
>                 Range server07 3458764513820540925-4035225266123964411
>                 Range server08 4035225266123964412-4611686018427387898
>         </MetaHandleRanges>
>         <DataHandleRanges>
>                 Range server01 4611686018427387899-5188146770730811385
>                 Range server02 5188146770730811386-5764607523034234872
>                 Range server03 5764607523034234873-6341068275337658359
>                 Range server04 6341068275337658360-6917529027641081846
>                 Range server05 6917529027641081847-7493989779944505333
>                 Range server06 7493989779944505334-8070450532247928820
>                 Range server07 8070450532247928821-8646911284551352307
>                 Range server08 8646911284551352308-9223372036854775794
>         </DataHandleRanges>
>         <StorageHints>
>                 TroveSyncMeta yes
>                 TroveSyncData no
>                 TroveMethod alt-aio
>                 #DirectIOThreadNum 120
>                 #DirectIOOpsPerQueue 200
>                 #DBCacheSizeBytes 17179869184
>                 #AttrCacheSize 10037
>                 #AttrCacheMaxNumElems 2048
>         </StorageHints>
> </Filesystem>
>
>
>
>
> _______________________________________________
> Pvfs2-users mailing list
> *[email protected]*
> <[email protected]>
> *http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users*
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.beowulf-2Dunderground.org_mailman_listinfo_pvfs2-2Dusers&d=DwMFaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=X793IODuLpqIji9UhFNQFg&m=q2QoLyJnjytE-wyQH54g9Y2uoAONgwsJA6-A-cgcg-4&s=b2jaOOImqK69C6A9brzTHzdeINjZ2tZcaMNjGNMmJ5M&e=>
>
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
