Hi,

As some of you know, we have been working on a number of scalability issues. This email gives an overview of that work since we started in early November.

A summary of the work in progress is visible in Bugzilla:

https://bugzilla.clusterfs.com/showdependencytree.cgi?id=11228

You can drill down in this outline tree to see patches, design documents, etc.

1. Client Count Scalability

This work was preceded by a study of the scalability of a few common operations: many clients mounting, opening files, and doing IO. We will review more use cases, but right now the following issues are being addressed:

- parallel lock callbacks (instead of sequential for each client) (11301)
- scanning lock lists with skiplists and interval trees instead of linear lists (10902, 11300)
- a hash for connection retrieval (11013)
- no synchronous IO upon connect (10906)
- remembering the quota master to avoid searching for it (11228)
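
To illustrate the first item, here is a minimal sketch of why issuing lock callbacks in parallel helps as the client count grows. It is Python rather than Lustre code, and every name in it (send_callback, revoke_sequential, revoke_parallel, the simulated delay) is hypothetical. With N clients and a round trip of d per callback, the sequential loop costs roughly N * d, while the parallel version costs roughly d as long as enough workers are available.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def send_callback(client_id, delay=0.02):
    """Hypothetical stand-in for one lock callback round trip to a client."""
    time.sleep(delay)  # simulated network latency; not a real RPC
    return client_id   # pretend the client acknowledged the callback

def revoke_sequential(clients):
    """Call back each client in turn; total time grows with len(clients)."""
    return [send_callback(c) for c in clients]

def revoke_parallel(clients, max_workers=32):
    """Issue all callbacks concurrently and collect the acknowledgements.

    pool.map returns results in the same order as the input list.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send_callback, clients))

if __name__ == "__main__":
    clients = list(range(16))
    for fn in (revoke_sequential, revoke_parallel):
        start = time.perf_counter()
        acked = fn(clients)
        elapsed = time.perf_counter() - start
        print(f"{fn.__name__}: {len(acked)} acks in {elapsed:.2f}s")
```

A thread pool is used here only because it is the shortest way to show the idea; the actual patch in 11301 may be structured quite differently.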

2. File IO scalability

- do not walk page lists to find pages covered by a lock (10718, 20 (the oldest open bug!))
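
As a rough sketch of the idea, assume the cached pages of a file are kept as a sorted list of page indices and a lock covers an inclusive range of pages [start, end]. Both of those are assumptions made for illustration, and the function names are hypothetical. Walking the whole list costs time proportional to the number of cached pages, while a binary search over an ordered index only touches the pages the lock actually covers. This is Python, not the Lustre implementation, and the real fix in 10718 may use a different data structure.

```python
from bisect import bisect_left, bisect_right

def pages_covered_linear(pages, start, end):
    """Walk every cached page and keep those inside the lock extent."""
    return [p for p in pages if start <= p <= end]

def pages_covered_sorted(sorted_pages, start, end):
    """Find the covered pages with two binary searches.

    Assumes sorted_pages is in ascending order. The cost is O(log n)
    to locate the range plus the number of pages returned.
    """
    lo = bisect_left(sorted_pages, start)   # first index with page >= start
    hi = bisect_right(sorted_pages, end)    # first index with page > end
    return sorted_pages[lo:hi]

if __name__ == "__main__":
    cached = [0, 3, 4, 9, 17, 18, 40]       # hypothetical cached page indices
    print(pages_covered_linear(cached, 4, 18))
    print(pages_covered_sorted(cached, 4, 18))
```

Both functions return the same pages; only the amount of work needed to find them differs.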

3. Server based load simulator

We believe that testing our servers with artificial loads will be very helpful. We constructed a load generator, and hope to make good use of it.

- this simulator will be available with the next 1.6 beta (11334)
- it required us to address Lustre device scalability (11307)
- an alternate load generator based on liblustre is forthcoming (11302)

These issues are all being worked on and are at various stages, depending on their size. We expect that these fixes will alleviate some problems we have seen on really large systems (Sandia, ORNL). I expect we will find new scalability issues, and that those, like these, will be relatively easily addressed.

- Peter -

_______________________________________________
Lustre-devel mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-devel
