Handling Geo-replication rsync/tar+ssh failures more accurately.
================================================================

Existing:
---------
1. Multiple changelogs are processed together, and their contents are segregated into ENTRY, META and DATA.
2. All Entry and Meta operations are sent to the Slave gsyncd via RPC. Entry and Meta ops are not parallel; they are executed sequentially on the Slave.
3. For Data operations, GFIDs are queued and multiple rsync jobs (default is 3) sync data in parallel (all entries are already available from the previous step). These rsync jobs have no idea which changelog a GFID belongs to.

If rsync/tar+ssh fails to sync, it retries all the steps mentioned above. After MAX retries it adds the whole changelog(s) to the Skipped List (since they are only partially processed).

Problems:
---------
- No clue about Entry Ops failures in the log or Status output.
- For rsync failures, all the steps are repeated even when that is not necessary.
- The entire changelog is skipped even if only one GFID has an issue.
- No way to get an accurate list of failed GFIDs/files.


Planned enhancements:
---------------------
- Log Entry Ops failures in the following format and show them in the Status output. (Separate log file, say /var/log/glusterfs/geo-replication/<MASTER>_<SLAVE_HOST>_<SLAVE>/failures) (Log format: GFID|Changelog|Reason|Details)

For example, 0d5fd80f-e5b5-4a9a-9023-879d730c9b82|1421648492|File exists with different GFID|E 57bad16c-222c-4c5e-80a8-87f77ffc9284 CREATE 33188 0 0 00000000-0000-0000-0000-000000000001/f1

- Make sure to remove a GFID from the DATA list if the Changelog captures an Unlink for a GFID that also has DATA recorded. This avoids rsync failures for these GFIDs (rsync will fail because the source file is unlinked).

- Create a unique list of GFIDs for rsync. (rsync benefits, since this avoids sending duplicate entries to it.)

- When rsync fails, do not repeat steps 1 and 2; retry only step 3.
    1. FIRST RETRY: Stat in Master mount, rsync only the GFIDs with a valid stat.
    2. SECOND RETRY: Retry the first GFID separately and the rest of them all at once. If the first GFID fails, add it to the skip list. If the rest of the batch fails again, do the same thing again (rsync the first GFID in the batch separately, then rsync the rest of the batch). Repeat this step until all GFIDs are processed (each either in the skipped list or synced successfully).
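
To make the two DATA list points above concrete (dropping GFIDs that were unlinked, and sending each GFID to rsync only once), here is a rough sketch. The function name `build_data_list` and its arguments are hypothetical and not taken from gsyncd; it assumes the DATA GFIDs and the Unlink GFIDs of a batch are already available as plain lists of strings.

```python
def build_data_list(data_gfids, unlinked_gfids):
    """Return the GFIDs to hand to rsync (hypothetical helper).

    Drops GFIDs that have an Unlink recorded in the same batch and
    removes duplicates while keeping the original order.
    """
    unlinked = set(unlinked_gfids)
    seen = set()
    result = []
    for gfid in data_gfids:
        if gfid in unlinked or gfid in seen:
            continue  # source file is gone, or GFID is already queued
        seen.add(gfid)
        result.append(gfid)
    return result
```

Keeping the original order is only a convenience for reading logs; a plain set difference would do if ordering does not matter.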

The SECOND RETRY approach may affect geo-rep performance, but only when there is an rsync problem.
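
The retry flow described above could look roughly like the sketch below. All names here are hypothetical: `sync_batch` stands in for one rsync/tar+ssh run over a list of GFIDs and returns True on success, and `stat_on_master` stands in for a stat on the Master mount. GFIDs that fail the stat are simply reported separately, which is an assumption about how they should be treated.

```python
def sync_with_retries(gfids, sync_batch, stat_on_master):
    """Sync GFIDs, isolating bad ones instead of skipping everything.

    sync_batch(list_of_gfids) -> bool and stat_on_master(gfid) -> bool
    are hypothetical callables supplied by the caller.
    Returns (synced, skipped, gone).
    """
    pending = list(gfids)
    if not pending or sync_batch(pending):
        return pending, [], []

    # FIRST RETRY: keep only the GFIDs that still stat on the Master mount.
    valid, gone = [], []
    for gfid in pending:
        (valid if stat_on_master(gfid) else gone).append(gfid)
    if not valid or sync_batch(valid):
        return valid, [], gone

    # SECOND RETRY: sync the first GFID alone, then the rest as one batch;
    # repeat on the remainder until every GFID is synced or skipped.
    synced, skipped = [], []
    rest = valid
    while rest:
        first, rest = rest[0], rest[1:]
        if sync_batch([first]):
            synced.append(first)
        else:
            skipped.append(first)
        if rest and sync_batch(rest):
            synced.extend(rest)
            rest = []
    return synced, skipped, gone
```

In the worst case this issues about two rsync runs per GFID, which is the performance cost mentioned above; it is only paid once the normal run and the first retry have both failed.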

If there are any failures, log them in the Failures log and show the count in the Status output. (Ex: b340deb7-8dd2-4d10-ab26-80acd3ff4954|1421648492|I/O Error|rsync)
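
As an illustration of the proposed GFID|Changelog|Reason|Details format, here is a small sketch for writing and reading such lines. The helper names are hypothetical and nothing here is existing gsyncd code; it assumes that only the Details field may itself contain a "|" character.

```python
from collections import namedtuple

# One line of the proposed failures log.
FailureRecord = namedtuple("FailureRecord", "gfid changelog reason details")


def format_failure(gfid, changelog, reason, details):
    """Build one failures-log line: GFID|Changelog|Reason|Details."""
    return "|".join([gfid, str(changelog), reason, details])


def parse_failure(line):
    """Split a failures-log line back into its four fields."""
    # maxsplit=3 keeps any '|' inside Details intact (assumption above).
    gfid, changelog, reason, details = line.rstrip("\n").split("|", 3)
    return FailureRecord(gfid, changelog, reason, details)
```

The number shown in the Status output could then simply be the count of lines in this file.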

Let me know your thoughts. Thanks

--
regards
Aravinda