All,

As pointed out in [1], parallel writes can result in incorrect quota
enforcement. [2] was an (unsuccessful) attempt to solve the issue. Some points
about [2]:

In [2], in_progress_writes is updated _after_ we fetch the size. Due to this,
two writes can see the same size, and hence the issue is not solved. What we
should be doing instead is to update in_progress_writes _before_ we fetch the
size. If we do this, it is guaranteed that at least one write sees the other's
size accounted in in_progress_writes. This approach has two issues:

1. Since we have added the current write's size to in_progress_writes, the
current write is already accounted in the in-progress size of the directory.
This is a minor issue and can be solved by subtracting the size of the current
write from the resultant cluster-wide in-progress size of the directory.

2. We might prematurely fail writes even though there is enough space
available. Assume there is 5MB of free space under the quota limit. If two 5MB
writes are issued in parallel, both might fail, as each might see the other's
size already accounted, though neither of them has succeeded. To solve this
issue, I am proposing the following algo:

   * We assign an identity that is unique across the cluster to each write -
say a uuid.
   * Among all the in-progress writes we pick one candidate. The policy can be
any deterministic criterion, for example the smallest of all the uuids. So,
each brick selects a candidate among its own in-progress writes _AND_ the
incoming candidate (see the pseudocode of get_dir_size below for more
clarity). It sends this candidate back along with the size of the directory.
The brick also remembers the last candidate it approved. Clustering
translators like dht pick one write among these replies, using the same logic
the bricks used (a rough sketch of that aggregation follows the pseudocode
below). So, along with the size, we also get a candidate chosen from the
in-progress writes. However, there might be a new write on the brick in the
time window in which we fetch the size, and that write could become the
candidate. Hence we should compare the resultant cluster-wide candidate with
the per-brick candidate. The enforcement logic will be as below:


/* Both enforcer and get_dir_size are executed in the brick process. I've left
out the logic of get_dir_size in cluster translators like dht. */
enforcer ()
{
    /* Note that this logic is executed independently for each directory on
       which a quota limit is set. All the in-progress writes, sizes and
       candidates are valid in the context of that directory.
     */

    my_delta = iov_length (input_iovec, input_count);
    my_id = getuuid();

    add_my_delta_to_in_progress_size ();

    get_dir_size (my_id, &size, &in_progress_size, &cluster_candidate);

    in_progress_size -= my_delta;

    if (((size + my_delta) < quota_limit) &&
        ((size + in_progress_size + my_delta) > quota_limit)) {

          /* we've to choose among in-progress writes */

          brick_candidate = least_of_uuids (directory->in_progress_write_list,
                                            directory->last_winning_candidate);

          if ((my_id == cluster_candidate) && (my_id == brick_candidate)) {
              /* 1. subtract my_delta from per-brick in-progress writes
                 2. add my_delta to per-brick sizes of all parents
                 3. allow-write

                 getting brick_candidate above, 1 and 2 should be done
                 atomically
              */
          } else {
              /* 1. subtract my_delta from per-brick in-progress writes
                 2. fail_write
               */
          }
    } else if ((size + my_delta) < quota_limit) {
              /* 1. subtract my_delta from per-brick in-progress writes
                 2. add my_delta to per-brick sizes of all parents
                 3. allow-write

                 1 and 2 should be done atomically
              */
    } else {

           fail_write ();

    }

}       

get_dir_size (IN incoming_candidate_id, IN directory, OUT *winning_candidate,
              ...)
{
     directory->last_winning_candidate = *winning_candidate =
             least_of_uuids (directory->in_progress_write_list,
                             incoming_candidate_id);

     ....
}
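
To make the dht part I left out above a little more concrete, below is a rough
sketch (not actual dht code) of how the aggregation of get_dir_size replies
could look. The names quota_reply and dht_aggregate_quota_replies are made up
for illustration; only uuid_compare()/uuid_copy() come from libuuid. The point
is just that dht sums the sizes across subvolumes and applies the same
least-uuid ordering the bricks use in least_of_uuids(), so both sides pick
candidates consistently.

#include <stdint.h>
#include <uuid/uuid.h>

/* hypothetical shape of one brick's reply to get_dir_size */
struct quota_reply {
        uint64_t size;              /* on-disk size of the directory  */
        uint64_t in_progress_size;  /* sum of in-progress write sizes */
        uuid_t   candidate;         /* candidate picked by that brick */
};

/* Aggregate replies from all subvolumes: sizes add up, and the cluster-wide
 * candidate is the smallest uuid seen in any reply - the same ordering the
 * bricks use. Assumes count >= 1.
 */
void
dht_aggregate_quota_replies (struct quota_reply *replies, int count,
                             uint64_t *size, uint64_t *in_progress_size,
                             uuid_t cluster_candidate)
{
        int i;

        *size = 0;
        *in_progress_size = 0;
        uuid_copy (cluster_candidate, replies[0].candidate);

        for (i = 0; i < count; i++) {
                *size += replies[i].size;
                *in_progress_size += replies[i].in_progress_size;

                if (uuid_compare (replies[i].candidate,
                                  cluster_candidate) < 0)
                        uuid_copy (cluster_candidate, replies[i].candidate);
        }
}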

Comments?

[1] http://www.gluster.org/pipermail/gluster-devel/2015-May/045194.html
[2] http://review.gluster.org/#/c/6220/

regards,
Raghavendra.