Re: [Openstack] [Swift] : Work-flow

2017-04-28 Thread John Dickinson
Great question. There's not a simple yes/no answer, so let me go into some 
detail about the different ways storage nodes can be selected to handle a 
request.

There are a few config settings in the proxy server config that can affect how 
nodes are selected for reads. Instead of describing these directly (or pasting 
the docs), let me describe it from an implementation perspective.

When the proxy server gets a read request for an object, the proxy looks up in 
the ring the storage nodes (object servers) that may know something about that 
object. The proxy server builds two lists[1]. The first is for "primary" nodes. 
These are the drives where the data is supposed to be. For a replicated storage 
policy with three replicas, the primary nodes will be a list of three items[2]. 
For an 8+4 erasure coded storage policy, it will be the list of 12 nodes where 
the EC fragments are supposed to be. The second list the proxy makes is the 
list of "handoff" nodes. These are alternative places an object (or fragment) 
may be found when it isn't on a primary node.
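To make that concrete, here's a minimal Python sketch of that ring lookup using Swift's own Ring class. The ring path and the account/container/object names are just examples, not requirements:

    # Rough sketch of the ring lookup the proxy performs.
    from swift.common.ring import Ring

    object_ring = Ring('/etc/swift', ring_name='object')

    # Primary nodes: the drives where the object (or its fragments) should be.
    part, primary_nodes = object_ring.get_nodes('AUTH_test', 'photos', 'cat.jpg')
    for node in primary_nodes:
        print('primary:', node['ip'], node['port'], node['device'])

    # Handoff nodes: lazily evaluated fallbacks, only consulted when the
    # primaries can't satisfy the request.
    for handoff in object_ring.get_more_nodes(part):
        print('handoff:', handoff['ip'], handoff['port'], handoff['device'])
        break  # the generator can walk much of the cluster; stop early here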

Once the proxy has the list of primary nodes, there are a few ways it can 
iterate over that list. The `sorting_method` config option determines this. The 
default `shuffle` value means the list of primary nodes is put in a random order. 
When a proxy server makes a connection to a storage node, it tracks how long it 
took to create the connection. The `timing` value of `sorting_method` will sort 
the list by these saved connection timings. The idea is that a busy server will 
take longer to respond to connection requests, and will then get moved lower in 
the sorted list. The `affinity` value will cause the list of nodes to be sorted 
according to the rules set in the `read_affinity` config option. This allows a 
proxy server to specifically prioritize connections that are local (same DC) 
and de-prioritize remote connections. The `read_affinity` setting is fantastic 
when Swift is deployed with more than one region (i.e. global clusters).
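As a purely illustrative sketch (not the actual proxy code), the three sorting methods amount to something like the following; the config lines in the comment are just one example of what you might put in the proxy server config:

    # Illustrative only -- not the real proxy-server implementation.
    # The matching proxy-server.conf settings would look roughly like:
    #   sorting_method = affinity
    #   read_affinity = r1z1=100, r1=200
    import random

    def sort_primary_nodes(nodes, method, timings=None, affinity_priority=None):
        """Return the primary nodes in the order the proxy would try them."""
        nodes = list(nodes)
        if method == 'shuffle':
            random.shuffle(nodes)  # spread reads randomly across replicas
        elif method == 'timing':
            # prefer nodes that have recently been quickest to accept connections
            nodes.sort(key=lambda n: timings.get((n['ip'], n['port']), 0))
        elif method == 'affinity':
            # lower priority number = tried earlier (e.g. local region/zone first)
            nodes.sort(key=affinity_priority)
        return nodes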

Once the list of primary nodes is sorted, the proxy will start contacting each 
node in turn until the request can be satisfied. With erasure codes, at least as 
many nodes as there are data fragments need to be contacted (e.g. 8 in the 8+4 
example above), so the sorting method doesn't do much to change performance. For 
replicas, though, only one node is needed to satisfy a read request. The naive 
way to go through the list is: contact the first node, get the response, and on 
error repeat with the next node in the list.
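In rough Python terms (with `fetch` standing in for a hypothetical per-node GET helper, not a Swift API), that naive loop looks like this:

    # Naive sequential read: try each primary in sorted order, one at a time.
    def naive_read(sorted_primaries, fetch):
        """fetch(node) is any callable that returns the object body or raises."""
        last_error = Exception('no primary nodes to read from')
        for node in sorted_primaries:
            try:
                return fetch(node)   # may block for up to node_timeout seconds
            except Exception as err:
                last_error = err     # error or timeout: move on to the next node
        raise last_error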

However, just walking through the list can be really slow. Swift has another 
config option, `concurrency_timeout` (which defaults to 0.5 seconds), that sets 
the delay before the next request is attempted. Basically, it's a balance between 
the number of network connections created and end-user latency. Let's say the 
`node_timeout` is set 
to 5 seconds. If a server can accept a connection with no problem but a disk is 
slow, this means that the proxy might start a read request but wait five 
seconds before timing out and moving on. Worst case, this could result in a 10 
second delay before the last primary node is even asked if it has the data 
(the first two time out after 5 seconds each). The `concurrency_timeout` means that 
the proxy will only wait 500ms before starting that next connection in the 
primary node list. Whichever node responds first will be the one that is used 
to send data to the client, and the rest are closed and cleaned up.
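Here's a simplified sketch of that staggered approach using plain threads (the real proxy uses eventlet greenthreads, and `fetch` is again a hypothetical per-node GET helper):

    # Staggered reads: start the next connection attempt after
    # concurrency_timeout instead of waiting out the full node_timeout.
    import queue
    import threading

    def concurrent_read(sorted_primaries, fetch, concurrency_timeout=0.5):
        """fetch(node) returns the object body or raises; first success wins."""
        results = queue.Queue()

        def attempt(node):
            try:
                results.put(('ok', fetch(node)))
            except Exception as err:
                results.put(('err', err))

        started = finished = 0
        for node in sorted_primaries:
            threading.Thread(target=attempt, args=(node,), daemon=True).start()
            started += 1
            try:
                # wait up to concurrency_timeout for an answer before
                # starting the next connection attempt
                status, value = results.get(timeout=concurrency_timeout)
            except queue.Empty:
                continue
            finished += 1
            if status == 'ok':
                return value  # first successful responder wins
        # every attempt has been started; wait out any stragglers
        while finished < started:
            status, value = results.get()
            finished += 1
            if status == 'ok':
                return value
        raise Exception('no primary node could satisfy the read')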

That's an overview of how the proxy chooses which nodes to talk to when 
handling a read request. There are a few different options that can be tuned 
depending on your particular deployment, but the default values (shuffle, 500ms 
concurrency timeout) are really good for most cases.

As a final note, there's also a `write_affinity` setting for the write data 
path. This works very similarly to the `read_affinity` setting, but I'm not a big 
fan of it. It seems to cause more problems than it solves. It causes the proxy server 
to mix in some local handoff nodes into the primary node list on the write. 
This means that all writes in a global cluster will be satisfied in the local 
DC, but it doesn't mean the WAN traversal work goes away. Swift's background 
consistency process will move the data to the right place, but this is more 
expensive than putting it in the right place to start with. I strongly 
recommend that you do not use `write_affinity` in your global Swift clusters.
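For what it's worth, here's a rough sketch of the idea behind it (again illustrative, not the proxy's actual code; `node_count` plays the role of the `write_affinity_node_count` setting, which defaults to 2 * replicas):

    # Illustrative only: what write affinity does to the node list on a PUT.
    # All initial writes land on local nodes; the background consistency
    # process later moves data from local handoffs to the remote primaries.
    def write_affinity_nodes(primaries, handoffs, is_local, node_count):
        """is_local(node) decides locality; node_count caps the local list."""
        nodes = [n for n in primaries if is_local(n)]
        for handoff in handoffs:
            if len(nodes) >= node_count:
                break
            if is_local(handoff):
                nodes.append(handoff)
        return nodes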


[1] technically, a list and a lazy-eval'd iterator
[2] I sometimes switch between "node" and "drive" and "server". Each element in 
these lists has (IP, port, mount point) for finding a particular drive.


Hope this info helps you understand more about how Swift works and how you can 
best tune it for your use.


--John





On 28 Apr 2017, at 2:46, Sameer Kulkarni wrote:

> Hi All,
>
> I had a doubt regarding the work-flow of Swift.
>
> For a read operation, we need to read from one of the three replicas. We
> are aware that the geographical origin of the request is one of the factors
> used to decide which replica to read from (usually the nearest replica). But
> is the load on the nodes containing these replicas also taken into account?
> i.e. will all read requests for the same object from a given location read
> from the same replica, or when load increases (on the node containing that
> replica) will the requests be directed to a different replica?

[Openstack] [Swift] : Work-flow

2017-04-28 Thread Sameer Kulkarni
Hi All,

I had a doubt regarding the work-flow of Swift.

For a read operation, we need to read from one of the three replicas. We
are aware that the geographical origin of the request is one of the factors
used to decide which replica to read from (usually the nearest replica). But
is the load on the nodes containing these replicas also taken into account?
i.e. will all read requests for the same object from a given location read
from the same replica, or when load increases (on the node containing that
replica) will the requests be directed to a different replica?


Cheers,
Sameer