Hi,
As some of you know, we have been working on a number of scalability
issues. This email gives an overview of the activities since we started
this work in early November.
A summary of the work in progress is visible in Bugzilla:
https://bugzilla.clusterfs.com/showdependencytree.cgi?id=11228
You can drill down into this dependency tree to see patches, design
documents, etc.
1. Client Count Scalability
This work was preceded by studying the scalability of a few common
operations: many clients mounting, opening files, and doing IO. We will
review more use cases, but right now the following issues are being
addressed:
- parallel lock callbacks (instead of sequential for each client) (11301)
- scanning lock lists with skiplists and interval trees instead of
linear lists (10902, 11300)
- a hash for connection retrieval (11013)
- no synchronous IO upon connect (10906)
- remembering the quota master to avoid searching for it (11228)
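
To illustrate the connection hash item (11013), here is a minimal Python
sketch. It is not the Lustre code: the class name, the "uuid" key, and the
record layout are all hypothetical, chosen only to show why a hashed
lookup scales better with client count than scanning a list.

```python
# Hypothetical illustration only; none of these names come from the
# Lustre source tree.

class ConnectionTable:
    """Maps a client identifier to its connection record."""

    def __init__(self):
        # Hash table: client uuid -> connection record.
        self._by_uuid = {}

    def add(self, uuid, conn):
        self._by_uuid[uuid] = conn

    def find(self, uuid):
        # Average O(1) lookup, independent of the number of clients.
        return self._by_uuid.get(uuid)


def find_linear(conn_list, uuid):
    # The pattern being replaced: an O(n) scan over (uuid, conn) pairs.
    for key, conn in conn_list:
        if key == uuid:
            return conn
    return None


# Build the same set of connections in both representations.
clients = [("client-%d" % i, {"id": i}) for i in range(10000)]
table = ConnectionTable()
for uuid, conn in clients:
    table.add(uuid, conn)

# Both approaches return the same record; only the lookup cost differs.
assert table.find("client-9999") is find_linear(clients, "client-9999")
```

With many thousands of clients connecting at once, the linear scan is
repeated per connect, so the total cost grows quadratically; the hashed
version keeps each lookup roughly constant.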
2. File IO scalability
- do not walk page lists to find pages covered by a lock (10718, 20 (the
oldest open bug!))
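
As a rough illustration of the idea, the Python sketch below contrasts
walking every cached page with an indexed range lookup. This email does
not describe how the patches actually do it; the sorted index used here
is an assumption standing in for whatever structure they use, and the
function names are hypothetical. The lock extent is expressed in page
indices for simplicity.

```python
import bisect


def covered_by_walk(cached_pages, first, last):
    # Inspects every cached page: cost grows with the size of the cache.
    return [p for p in cached_pages if first <= p <= last]


def covered_by_index(sorted_pages, first, last):
    # Two binary searches locate the covered range in a sorted index,
    # so the cost depends on log(n) plus the number of matching pages.
    lo = bisect.bisect_left(sorted_pages, first)
    hi = bisect.bisect_right(sorted_pages, last)
    return sorted_pages[lo:hi]


# Page indices currently cached for one file (example data).
cached = sorted([0, 1, 2, 7, 8, 9, 100, 101, 5000])

# A lock covering pages 7..100 selects the same pages either way.
assert covered_by_index(cached, 7, 100) == covered_by_walk(cached, 7, 100)
```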
3. Server based load simulator
We believe that testing our servers with artificial loads will be very
helpful. We constructed a load generator, and hope to make good use of it.
- this simulator will be available with the next 1.6 beta (11334)
- it required us to address Lustre device scalability (11307)
- an alternate load generator based on liblustre is forthcoming (11302)
These issues are all being worked on and are in a variety of stages,
depending on their sizes. We expect that these will alleviate some
problems we have seen on really large systems (Sandia, ORNL). I expect
we will find new scalability issues, and that those, like these, will be
relatively easy to address.
- Peter -
_______________________________________________
Lustre-devel mailing list
https://mail.clusterfs.com/mailman/listinfo/lustre-devel