Great question. There's not a simple yes/no answer, so let me go into some
detail about the different ways storage nodes can be selected to handle a
request.
There are a few config settings in the proxy server config that can affect how
nodes are selected for reads. Instead of describing these directly (or pasting
the docs), let me describe it from an implementation perspective.
When the proxy server gets a read request for an object, the proxy looks up in
the ring the storage nodes (object servers) that may know something about that
object. The proxy server builds two lists[1]. The first is for "primary" nodes.
These are the drives where the data is supposed to be. For a replicated storage
policy with three replicas, the primary nodes will be a list of three items[2].
For an 8+4 erasure coded storage policy, it will be the list of 12 nodes where
the EC fragments are supposed to be. The second list the proxy makes is the
list of "handoff" nodes. These are alternative places an object (or fragment)
may be found if they aren't found on a primary node.
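To make that shape concrete, here's a toy sketch. This is not Swift's actual ring code; `ToyRing`, its device layout, and the hashing are invented for illustration. It just shows the two things described above: a fixed list of primaries, and handoffs produced lazily (cf. footnote [1]).

```python
import hashlib

class ToyRing:
    """Toy stand-in for a Swift ring (purely illustrative, not Swift's API).

    Each device dict carries the kind of addressing info mentioned in
    footnote [2]: enough to find one particular drive.
    """

    def __init__(self, devices, replicas=3):
        self.devices = devices
        self.replicas = replicas

    def _start(self, name):
        # Deterministic hash picks where in the device list we start.
        digest = hashlib.md5(name.encode()).hexdigest()
        return int(digest, 16) % len(self.devices)

    def get_nodes(self, name):
        # The fixed list of primary nodes: where the data is supposed to be.
        start = self._start(name)
        return [self.devices[(start + i) % len(self.devices)]
                for i in range(self.replicas)]

    def get_more_nodes(self, name):
        # Handoffs as a lazy generator: non-primary devices, yielded
        # only on demand.
        primary_ids = {d['id'] for d in self.get_nodes(name)}
        for dev in self.devices:
            if dev['id'] not in primary_ids:
                yield dev

devices = [{'id': i, 'ip': '10.0.0.%d' % i, 'port': 6200,
            'device': 'sdb%d' % i} for i in range(6)]
ring = ToyRing(devices, replicas=3)
primaries = ring.get_nodes('AUTH_test/c/o')
handoffs = ring.get_more_nodes('AUTH_test/c/o')  # generator; nothing runs yet
```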
Once the proxy has the list of primary nodes, there are a few ways it can
iterate over that list. The `sorting_method` config option determines this. The
default `shuffle` value means that the list of nodes is randomly shuffled.
When a proxy server makes a connection to a storage node, it tracks how long it
took to create the connection. The `timing` value of `sorting_method` will sort
the list by these saved connection timings. The idea is that a busy server will
take longer to respond to connection requests, and will then get moved lower in
the sorted list. The `affinity` value will cause the list of nodes to be sorted
according to the rules set in the `read_affinity` config option. This allows a
proxy server to specifically prioritize connections that are local (same DC)
and de-prioritize remote connections. The `read_affinity` setting is fantastic
when Swift is deployed with more than one region (i.e. global clusters).
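For example, a proxy-server.conf for a two-region cluster might include something like the following (the region labels and priorities are illustrative; adjust to your own layout):

```ini
[app:proxy-server]
use = egg:swift#proxy
# Sort primary nodes by the read_affinity rules instead of shuffling.
sorting_method = affinity
# Prefer servers in region 1 (lower number = higher priority),
# then fall back to region 2.
read_affinity = r1=100, r2=200
```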
Once the list of primary nodes is sorted, the proxy will start contacting
each in turn until the request can be satisfied. With erasure codes, at least
as many nodes as there are data fragments must be contacted (e.g. 8 in the
8+4 example above), so the sorting method value doesn't do much to change
performance. For
replicas, though, only one node is needed to satisfy a read request. The naive
way to go through the list is: contact the first one, get response, if error,
repeat with the next node in the list.
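In pseudocode-ish Python, that naive walk looks like this (the `fetch` callable is hypothetical, standing in for the proxy's GET to one object server):

```python
def naive_get(sorted_nodes, fetch):
    """Try each node in order; return the first good response.

    `fetch` is a hypothetical callable that contacts one storage node
    and either returns a response or raises IOError.
    """
    for node in sorted_nodes:
        try:
            return fetch(node)
        except IOError:
            continue  # this node failed or timed out; try the next one
    raise IOError('no primary node could satisfy the read')
```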
However, just walking through the list can be really slow. Swift has another
config option, `concurrency_timeout` (which defaults to 0.5 seconds), that sets a delay
before the next request is attempted. Basically, it's a balance between network
connections created and end-user latency. Let's say the `node_timeout` is set
to 5 seconds. If a server can accept a connection with no problem but a disk is
slow, this means that the proxy might start a read request but wait five
seconds before timing out and moving on. Worst case, this could result in a 10
second delay before the last primary node is even asked if it has the data
(first two time out after 5 seconds each). The `concurrency_timeout` means that
the proxy will only wait 500ms before starting that next connection in the
primary node list. Whichever node responds first will be the one that is used
to send data to the client, and the rest are closed and cleaned up.
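A rough sketch of that staggered-start behavior follows. Swift's real implementation uses eventlet green threads and more careful connection cleanup; this threading version is only meant to show the shape of the logic, and `fetch` is again a hypothetical per-node callable.

```python
import queue
import threading

def concurrent_get(nodes, fetch, concurrency_timeout=0.5):
    """Illustrative staggered reads (not Swift's actual implementation).

    Start a request to the first node; if it hasn't answered within
    `concurrency_timeout` seconds, start the next one too. The first
    successful response wins. On an error, move on immediately.
    """
    results = queue.Queue()

    def worker(node):
        try:
            results.put(('ok', fetch(node)))
        except Exception as e:
            results.put(('err', e))

    started = 0
    errors = 0
    while started < len(nodes) or errors < started:
        if started < len(nodes):
            threading.Thread(target=worker, args=(nodes[started],),
                             daemon=True).start()
            started += 1
        try:
            # Once every node has been contacted, just wait for answers.
            timeout = concurrency_timeout if started < len(nodes) else None
            status, value = results.get(timeout=timeout)
        except queue.Empty:
            continue  # stagger delay expired: launch the next request
        if status == 'ok':
            return value  # real Swift also closes the losing connections
        errors += 1
    raise IOError('all nodes failed')
```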
That's an overview of how the proxy chooses which nodes to talk to when
handling a read request. There's a few different options that can be tuned
depending on your particular deployment, but the default values (shuffle, 500ms
concurrency timeout) are really good for most cases.
As a final note, there's also a `write_affinity` setting for the write data
path. This works very similarly to the `read_affinity` setting, but I'm not a
big fan of it. It seems to cause more problems than it solves. It causes the
proxy server
to mix in some local handoff nodes into the primary node list on the write.
This means that all writes in a global cluster will be satisfied in the local
DC, but it doesn't mean the WAN traversal work goes away. Swift's background
consistency process will move the data to the right place, but this is more
expensive than putting it in the right place to start with. I strongly
recommend that you do not use `write_affinity` in your global Swift clusters.
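For reference (and despite the recommendation above), the knobs look roughly like this; the region label is illustrative:

```ini
[app:proxy-server]
use = egg:swift#proxy
# Direct new writes to nodes in region 1 first.
write_affinity = r1
# How many local nodes to consider as write targets.
write_affinity_node_count = 2 * replicas
```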
[1] technically, a list and a lazy-eval'd iterator
[2] I sometimes switch between "node" and "drive" and "server". Each element in
these lists has (IP, port, mount point) for finding a particular drive.
Hope this info helps you understand more about how Swift works and how you can
best tune it for your use.
--John
On 28 Apr 2017, at 2:46, Sameer Kulkarni wrote:
> Hi All,
>
> I had a doubt regarding the work-flow of Swift.
>
> *'For read operation, we need to read from one of the three replicas. We
> are aware that geographical origin of