Thanks Shyam for your inputs.
regards
Aravinda
On 08/31/2015 07:17 PM, Shyam wrote:
On 08/31/2015 03:17 AM, Aravinda wrote:
The following changes/ideas have been identified to improve Geo-replication
performance. Please add your ideas/issues to the list.
1. Entry stime and Data/Meta stime
----------------------------------
Currently we use only one xattr, called stime, to maintain the state of
sync. When a Geo-replication worker restarts, it starts from that
stime and syncs files.
get_changes from <STIME> to <CURRENT TIME>
perform <ENTRY> operations
perform <META> operations
perform <DATA> operations
If a Data operation fails, the worker crashes, restarts, and reprocesses
the changelogs. Entry, Meta and Data operations are all
retried. If we maintain entry_stime separately, then we can avoid
reprocessing Entry operations which were completed previously.
This seems like a good thing to do.
Here is something more that could be done (I am not well aware of
geo-rep internals, so maybe this cannot be done):
- Why not maintain a 'mark' up to which even ENTRY/META operations are
performed, so that even when failures occur in the ENTRY/META operation
queue, we need to restart only from the mark and not all the way from the
beginning STIME.
Changelogs have to be processed from STIME because they contain both
ENTRY and META, but execution of ENTRY will be skipped if entry_stime is
ahead of STIME.
I am not sure where such a 'mark' can be maintained, unless the
processed get_changes are ordered and written to disk, or ordered
idempotently in memory each time.
STIME is maintained as an xattr on the Master brick root; we can maintain
one more xattr, entry_stime, alongside it.
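The idea above can be sketched as follows. This is a minimal illustration, not the actual geo-rep code; the function and attribute names are assumptions, and persisting entry_stime as a brick-root xattr is only noted in a comment.

```python
# Sketch of skipping completed Entry operations using a separate
# entry_stime marker (names are illustrative, not the real geo-rep code).

class Batch:
    """One changelog window, identified by its end time."""
    def __init__(self, end_time):
        self.end_time = end_time

def process_changelog(batch, stime, entry_stime):
    """Replay one changelog batch.

    `stime` is the last fully synced time; `entry_stime` records how far
    Entry operations alone have progressed.  If entry_stime is ahead of
    stime, the Entry phase for this window already completed and can be
    skipped on retry.
    """
    performed = []
    if batch.end_time > entry_stime:
        performed.append("ENTRY")       # create entries with same GFID via RPC
        entry_stime = batch.end_time    # would be persisted as a brick-root xattr
    performed.append("META")
    performed.append("DATA")            # rsync; may fail and force a retry
    return performed, entry_stime

# First attempt: all three phases run, entry_stime advances to 200.
ops, entry_stime = process_changelog(Batch(200), stime=100, entry_stime=100)
# Retry after a DATA failure: ENTRY is skipped because entry_stime (200)
# is already ahead of stime (100).
ops2, _ = process_changelog(Batch(200), stime=100, entry_stime=entry_stime)
```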
2. In case of Rsync/Tar failure, do not repeat Entry Operations
---------------------------------------------------------------
In case of Rsync/Tar failures, Changelogs are currently reprocessed in
full. Instead, re-trigger only the Rsync/Tar job for the list of files
that failed.
(this is more for my understanding)
I assume that this retry is within the same STIME -> NOW1 period. IOW,
if the re-trigger of the tar/rsync is going to occur in the next sync
interval, then I would assume that ENTRY/META for NOW1 -> NOW would be
repeated, correct? The same is true for the above as well, i.e. all
ENTRY/META operations that are completed between STIME and NOW1 are not
repeated, but events between NOW1 and NOW are, correct?
Syncing files is a two-step operation: Entry creation with the same GFID
using RPC, and syncing data using Rsync. There is an issue with the
existing code: Entry operations also get repeated when only the data
step (rsync) failed. (STIME -> NOW1)
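The proposed retry-only-the-failed-files behaviour could look roughly like this. The helper names and the retry loop are illustrative assumptions; a stub stands in for the real rsync invocation.

```python
# Sketch of re-triggering only the rsync step for files that failed,
# instead of reprocessing the whole changelog window (helper names and
# the retry loop are illustrative assumptions).

def sync_data(files, rsync):
    """Run rsync over `files`; return the subset that failed."""
    return [f for f in files if not rsync(f)]

def sync_with_retry(files, rsync, max_retries=3):
    """Entry creation already happened once via RPC; on rsync failure,
    retry only the failed files rather than repeating Entry operations."""
    pending = list(files)
    for _ in range(max_retries):
        pending = sync_data(pending, rsync)
        if not pending:
            return True
    return False

# Simulated rsync that fails once for "b", then succeeds.
failures = {"b": 1}
def fake_rsync(path):
    if failures.get(path, 0) > 0:
        failures[path] -= 1
        return False
    return True

ok = sync_with_retry(["a", "b", "c"], fake_rsync)
```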
3. Better Rsync Queue
---------------------
Geo-rep currently has an Rsync/Tar queue called PostBox. Sync
jobs (configurable, default is 3) empty the PostBox and feed it
to the Rsync/Tar process. The second sync job may not find any items to
sync, while the first job may be overloaded. To avoid this, introduce a
batch size for the PostBox so that each sync job gets an equal number of
files to sync.
Do you want to consider round-robin of entries to the sync jobs,
something that we did in rebalance, instead of a batch size?
A batch size can again be consumed by a single sync process, the
next batch by the next one, and so on. Maybe a round-robin distribution of
files to sync from the post-box to each sync thread may help.
Looks like a good idea. We need to maintain N queues for N sync
jobs and, while adding an entry to the PostBox, distribute across the N
queues. Is that right?
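The round-robin distribution being discussed could be sketched like this. This is a simplification under stated assumptions: the class name is invented, and the real PostBox would also need locking and worker wake-ups.

```python
# Sketch of a PostBox that round-robins entries across N per-job queues
# so each sync job gets an even share (class name is illustrative; the
# real PostBox would also need locking and worker wake-ups).
from collections import deque

class RoundRobinPostBox:
    def __init__(self, num_jobs):
        self.queues = [deque() for _ in range(num_jobs)]
        self.next_q = 0

    def put(self, entry):
        # Distribute incoming entries across the queues in round-robin order.
        self.queues[self.next_q].append(entry)
        self.next_q = (self.next_q + 1) % len(self.queues)

    def take(self, job_id):
        # Each sync job drains only its own queue.
        q = self.queues[job_id]
        items = list(q)
        q.clear()
        return items

pb = RoundRobinPostBox(num_jobs=3)
for f in ["f0", "f1", "f2", "f3", "f4"]:
    pb.put(f)
shares = [pb.take(i) for i in range(3)]
```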
4. Handling the Tracebacks
--------------------------
Collect the list of Tracebacks that are not yet handled, and look for
the possibility of handling them at run time. With this, worker crashes
will be minimized, so we can avoid the re-initialization and changelog
reprocessing effort.
5. SSH failure handling
-----------------------
If a Slave node goes down, the Master worker connected to it goes
Faulty and restarts. If we can handle SSH failures intelligently, we
can re-establish the SSH connection instead of restarting the Geo-rep
worker. With this change, Active/Passive switching on network failures
can be avoided.
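The reconnect-before-Faulty idea might look like the following retry loop. The function, the backoff timings, and the simulated connection are all illustrative assumptions; only after all attempts fail would the worker go Faulty and trigger the Active/Passive switch.

```python
# Sketch of re-establishing the SSH connection with backoff instead of
# marking the worker Faulty immediately (function name, timings and the
# simulated connection are illustrative assumptions).
import time

def connect_with_retry(connect, max_attempts=3, backoff=0.01):
    """Try `connect()` up to max_attempts times before giving up.
    Only after the final failure would the worker go Faulty and the
    Active/Passive switch be triggered."""
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts:
                raise
            time.sleep(backoff * attempt)  # simple linear backoff

# Simulated slave that is unreachable for the first two attempts.
state = {"down": 2}
def fake_connect():
    if state["down"] > 0:
        state["down"] -= 1
        raise ConnectionError("slave unreachable")
    return "ssh-session"

session = connect_with_retry(fake_connect)
```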
6. On Worker restart, Utilizing Changelogs which are in the .processing directory
---------------------------------------------------------------------------------
On Worker restart, the start time for Geo-rep is the previously updated
stime. Geo-rep re-parses the Changelogs from the Brick backend into the
working directory even though those changelogs were parsed previously,
because the stime was not updated due to sync failures.
1. On Geo-rep restart, delete all files in .processing/cache and
move all the changelogs available in the .processing directory to
.processing/cache
2. In the Changelog API, look for the Changelog file name in the cache
before parsing it
3. If available in the cache, move it back to .processing
4. Else, parse it and generate the parsed changelog in .processing
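The four steps above can be sketched as follows. In-memory dicts stand in for the .processing and .processing/cache directories, and the function names are illustrative assumptions.

```python
# Sketch of steps 1-4: on restart, move already-parsed changelogs to a
# cache and consult it before re-parsing (dicts stand in for the
# .processing and .processing/cache directories; names are illustrative).

def on_restart(processing, cache):
    # Step 1: discard the old cache, then cache everything already parsed.
    cache.clear()
    cache.update(processing)
    processing.clear()

def get_parsed(name, processing, cache, parse):
    # Steps 2-4: reuse the cached parse if present, else parse afresh.
    if name in cache:
        processing[name] = cache.pop(name)   # step 3: move back to .processing
        return processing[name]
    processing[name] = parse(name)           # step 4: parse into .processing
    return processing[name]

parse_calls = []
def parse(name):
    parse_calls.append(name)
    return "parsed:" + name

processing = {"CHANGELOG.1": "parsed:CHANGELOG.1"}
cache = {}
on_restart(processing, cache)
a = get_parsed("CHANGELOG.1", processing, cache, parse)  # cache hit, no re-parse
b = get_parsed("CHANGELOG.2", processing, cache, parse)  # fresh parse
```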
I did not understand the above, but that's probably just me as I am
not fully aware of change log process yet :)
To consume backend Changelogs, Geo-rep registers with the Changelog API,
specifying a working directory. The Changelog API parses the backend
changelogs into a format understood by Geo-rep and copies them to the
working directory; Geo-rep then consumes the Changelogs from there. Each
iteration is "BACKEND CHANGELOGS -> PARSE TO WORKING DIR -> CONSUME".
During the parse process, the Changelog API maintains three directories
inside the working directory: ".processing", ".processed" and ".current".
.current -> Changelogs not yet parsed
.processing -> Changelogs parsed but not yet consumed by Geo-rep
.processed -> Changelogs consumed and synced by Geo-rep
If a Geo-rep worker restarts, we clean up the .processing directory to
prevent Geo-rep from picking up unexpected changelogs. So "BACKEND
CHANGELOGS -> PARSE TO WORKING DIR" is repeated even though the parsed
data is available from the previous run.
While replying to this, I got another idea to simplify Changelog
processing.
- Do not parse/maintain Changelogs in the working directory; instead,
just maintain the list of Changelog files.
- Expose a new Changelog API to parse a changelog:
libgfchangelog.parse(FILE_PATH, CALLBACK)
- Modify Geo-rep to use this new API whenever it needs to parse a
CHANGELOG file.
With this approach, on worker restart only the list of changelog files
is lost, which can easily be regenerated, compared to re-parsing the
Changelogs.
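The proposed flow could be sketched as follows. The libgfchangelog.parse(FILE_PATH, CALLBACK) signature is the one proposed above, but this stand-in parser, the in-memory store, and the record format are all illustrative assumptions.

```python
# Sketch of the proposed flow: keep only a list of changelog file names
# and parse each one on demand via a callback-based API.  The parse()
# function here is a stand-in for the proposed
# libgfchangelog.parse(FILE_PATH, CALLBACK); the store and record
# format are illustrative assumptions.

# In-memory store standing in for on-disk backend changelog files.
CHANGELOG_STORE = {
    "CHANGELOG.1": [
        ("E", "CREATE", "gfid-1"),   # entry operation
        ("D", "gfid-1"),             # data operation
    ],
}

def parse(file_path, callback):
    """Stand-in for libgfchangelog.parse(FILE_PATH, CALLBACK): invoke
    `callback` for every record in the changelog file."""
    for record in CHANGELOG_STORE.get(file_path, []):
        callback(record)

# Geo-rep keeps only the list of file names; on worker restart this
# list is cheap to regenerate, unlike re-parsing every changelog.
pending_files = list(CHANGELOG_STORE)
records = []
for path in pending_files:
    parse(path, records.append)
```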
_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel