Thanks Shyam for your inputs.
regards
Aravinda
On 08/31/2015 07:17 PM, Shyam wrote:
On 08/31/2015 03:17 AM, Aravinda wrote:
The following changes/ideas have been identified to improve Geo-replication
performance. Please add your ideas/issues to the list.
1. Entry stime and Data/Meta stime
----------------------------------
Currently we use only one xattr, called stime, to maintain the state of
sync. When a Geo-replication worker restarts, it starts from that
stime and syncs files.
get_changes from <STIME> to <CURRENT TIME>
perform <ENTRY> operations
perform <META> operations
perform <DATA> operations
If a Data operation fails, the worker crashes, restarts, and reprocesses
the changelogs. Entry, Meta and Data operations are all
retried. If we maintain entry_stime separately, then we can avoid
reprocessing Entry operations which were completed previously.
This seems like a good thing to do.
Here is something more that could be done (I am not well aware of
geo-rep internals, so maybe this cannot be done):
- Why not maintain a 'mark' up to which even ENTRY/META operations are
performed, so that even when failures occur in the ENTRY/META operation
queue, we need to restart only from the mark and not all the way from the
beginning STIME.
Changelogs have to be processed from STIME because they contain both
ENTRY and META, but execution of ENTRY will be skipped if entry_stime is
ahead of STIME.
I am not sure where such a 'mark' can be maintained, unless the
processed get_changes are ordered and written to disk, or ordered
idempotently in memory each time.
STIME is maintained as an xattr on the Master brick root; we can maintain
one more xattr, entry_stime, alongside it.
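The idea above can be sketched as follows. This is a minimal illustration, not the actual geo-rep code; the function and attribute names are assumptions, and persisting entry_stime as a brick-root xattr is only noted in a comment.

```python
# Sketch of skipping completed Entry operations using a separate
# entry_stime marker (names are illustrative, not the real geo-rep code).

class Batch:
    """One changelog window, identified by its end time."""
    def __init__(self, end_time):
        self.end_time = end_time

def process_changelog(batch, stime, entry_stime):
    """Replay one changelog batch.

    `stime` is the last fully synced time; `entry_stime` records how far
    Entry operations alone have progressed.  If entry_stime is ahead of
    stime, the Entry phase for this window already completed and can be
    skipped on retry.
    """
    performed = []
    if batch.end_time > entry_stime:
        performed.append("ENTRY")       # create entries with same GFID via RPC
        entry_stime = batch.end_time    # would be persisted as a brick-root xattr
    performed.append("META")
    performed.append("DATA")            # rsync; may fail and force a retry
    return performed, entry_stime

# First attempt: all three phases run, entry_stime advances to 200.
ops, entry_stime = process_changelog(Batch(200), stime=100, entry_stime=100)
# Retry after a DATA failure: ENTRY is skipped because entry_stime (200)
# is already ahead of stime (100).
ops2, _ = process_changelog(Batch(200), stime=100, entry_stime=entry_stime)
```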
2. In case of Rsync/Tar failure, do not repeat Entry Operations
---------------------------------------------------------------
In case of Rsync/Tar failures, Changelogs are currently reprocessed in
full. Instead, re-trigger only the Rsync/Tar job for the list of files
that failed.
(this is more for my understanding)
I assume that this retry is within the same STIME -> NOW1 period. IOW,
if the re-trigger of the tar/rsync is going to occur in the next sync
interval, then I would assume that ENTRY/META for NOW1 -> NOW would be
repeated, correct? The same is true for the above as well, i.e. all
ENTRY/META operations that are completed between STIME and NOW1 are not
repeated, but events between NOW1 and NOW are, correct?
Syncing files is a two-step operation: Entry creation with the same GFID
using RPC, and syncing data using Rsync. There is an issue with the
existing code: Entry operations also get repeated when only the data
step (rsync) failed. (STIME -> NOW1)
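The proposed retry-only-the-failed-files behaviour could look roughly like this. The helper names and the retry loop are illustrative assumptions; a stub stands in for the real rsync invocation.

```python
# Sketch of re-triggering only the rsync step for files that failed,
# instead of reprocessing the whole changelog window (helper names and
# the retry loop are illustrative assumptions).

def sync_data(files, rsync):
    """Run rsync over `files`; return the subset that failed."""
    return [f for f in files if not rsync(f)]

def sync_with_retry(files, rsync, max_retries=3):
    """Entry creation already happened once via RPC; on rsync failure,
    retry only the failed files rather than repeating Entry operations."""
    pending = list(files)
    for _ in range(max_retries):
        pending = sync_data(pending, rsync)
        if not pending:
            return True
    return False

# Simulated rsync that fails once for "b", then succeeds.
failures = {"b": 1}
def fake_rsync(path):
    if failures.get(path, 0) > 0:
        failures[path] -= 1
        return False
    return True

ok = sync_with_retry(["a", "b", "c"], fake_rsync)
```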
3. Better Rsync Queue
---------------------
Geo-rep currently has an Rsync/Tar queue called PostBox. Sync
jobs (configurable, default is 3) empty the PostBox and feed it
to the Rsync/Tar process. The second sync job may not find any items to
sync, while the first job may be overloaded. To avoid this, introduce a
batch size for the PostBox so that each sync job gets an equal number of
files to sync.
Do you want to consider round-robin of entries to the sync jobs,
something that we did in rebalance, instead of a batch size?
A batch size can again be consumed by a single sync process, the
next batch by the next one, and so on. Maybe a round-robin distribution of
files to sync from the post-box to each sync thread may help.
Looks like a good idea. We need to maintain N queues for N sync
jobs and, while adding an entry to the PostBox, distribute across the N
queues. Is that right?
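The round-robin distribution being discussed could be sketched like this. This is a simplification under stated assumptions: the class name is invented, and the real PostBox would also need locking and worker wake-ups.

```python
# Sketch of a PostBox that round-robins entries across N per-job queues
# so each sync job gets an even share (class name is illustrative; the
# real PostBox would also need locking and worker wake-ups).
from collections import deque

class RoundRobinPostBox:
    def __init__(self, num_jobs):
        self.queues = [deque() for _ in range(num_jobs)]
        self.next_q = 0

    def put(self, entry):
        # Distribute incoming entries across the queues in round-robin order.
        self.queues[self.next_q].append(entry)
        self.next_q = (self.next_q + 1) % len(self.queues)

    def take(self, job_id):
        # Each sync job drains only its own queue.
        q = self.queues[job_id]
        items = list(q)
        q.clear()
        return items

pb = RoundRobinPostBox(num_jobs=3)
for f in ["f0", "f1", "f2", "f3", "f4"]:
    pb.put(f)
shares = [pb.take(i) for i in range(3)]
```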
4. Handling the Tracebacks
--------------------------
Collect the list of Tracebacks that are not yet handled, and look for
the possibility of handling them at run time. With this, worker crashes
will be minimized, so we can avoid the re-initialization and changelog
reprocessing effort.
5. SSH failure handling
-----------------------
If a Slave node goes down, the Master worker connected to it goes
Faulty and restarts. If we can handle SSH failures intelligently, we
can re-establish the SSH connection instead of restarting the Geo-rep
worker. With this change, Active/Passive switching on network failures
can be avoided.
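The reconnect-before-Faulty idea might look like the following retry loop. The function, the backoff timings, and the simulated connection are all illustrative assumptions; only after all attempts fail would the worker go Faulty and trigger the Active/Passive switch.

```python
# Sketch of re-establishing the SSH connection with backoff instead of
# marking the worker Faulty immediately (function name, timings and the
# simulated connection are illustrative assumptions).
import time

def connect_with_retry(connect, max_attempts=3, backoff=0.01):
    """Try `connect()` up to max_attempts times before giving up.
    Only after the final failure would the worker go Faulty and the
    Active/Passive switch be triggered."""
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts:
                raise
            time.sleep(backoff * attempt)  # simple linear backoff

# Simulated slave that is unreachable for the first two attempts.
state = {"down": 2}
def fake_connect():
    if state["down"] > 0:
        state["down"] -= 1
        raise ConnectionError("slave unreachable")
    return "ssh-session"

session = connect_with_retry(fake_connect)
```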
6. On Worker restart, Utilizing Changelogs which are in the .processing directory
---------------------------------------------------------------------------------
On Worker restart, the start time for Geo-rep is the previously updated
stime. Geo-rep re-parses the Changelogs from the Brick backend into the
working directory even though those changelogs were parsed previously,
because the stime was not updated due to sync failures.
1. On Geo-rep restart, delete all files in .processing/cache and
move all the changelogs available in the .processing directory to
.processing/cache
2. In the Changelog API, look for the Changelog file name in the cache
before parsing it
3. If available in the cache, move it back to .processing
4. Else, parse it and generate the parsed changelog in .processing
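The four steps above can be sketched as follows. In-memory dicts stand in for the .processing and .processing/cache directories, and the function names are illustrative assumptions.

```python
# Sketch of steps 1-4: on restart, move already-parsed changelogs to a
# cache and consult it before re-parsing (dicts stand in for the
# .processing and .processing/cache directories; names are illustrative).

def on_restart(processing, cache):
    # Step 1: discard the old cache, then cache everything already parsed.
    cache.clear()
    cache.update(processing)
    processing.clear()

def get_parsed(name, processing, cache, parse):
    # Steps 2-4: reuse the cached parse if present, else parse afresh.
    if name in cache:
        processing[name] = cache.pop(name)   # step 3: move back to .processing
        return processing[name]
    processing[name] = parse(name)           # step 4: parse into .processing
    return processing[name]

parse_calls = []
def parse(name):
    parse_calls.append(name)
    return "parsed:" + name

processing = {"CHANGELOG.1": "parsed:CHANGELOG.1"}
cache = {}
on_restart(processing, cache)
a = get_parsed("CHANGELOG.1", processing, cache, parse)  # cache hit, no re-parse
b = get_parsed("CHANGELOG.2", processing, cache, parse)  # fresh parse
```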
I did not understand the above, but that's probably just me as I am
not fully aware of change log process yet :)
To consume backend Changelogs, Geo-rep registers with the Changelog API,
specifying a working directory. The Changelog API parses the backend
changelogs into a format understood by Geo-rep and copies them to the
working directory; Geo-rep then consumes the Changelogs from there. Each
iteration is "BACKEND CHANGELOGS -> PARSE TO WORKING DIR -> CONSUME".
During the parse process, the Changelog API maintains three directories
inside the working directory: ".processing", ".processed" and ".current".
.current -> Changelogs not yet parsed
.processing -> Changelogs parsed but not yet consumed by Geo-rep
.processed -> Changelogs consumed and synced by Geo-rep
If a Geo-rep worker restarts, we clean up the .processing directory to
prevent Geo-rep from picking up unexpected changelogs. So "BACKEND
CHANGELOGS -> PARSE TO WORKING DIR" is repeated even though the parsed
data is available from the previous run.
While replying to this, I got another idea to simplify Changelog
processing.
- Do not parse/maintain Changelogs in the working directory; instead,
just maintain the list of Changelog files.
- Expose a new Changelog API to parse a changelog:
libgfchangelog.parse(FILE_PATH, CALLBACK)
- Modify Geo-rep to use this new API whenever it needs to parse a
CHANGELOG file.
With this approach, on worker restart only the list of changelog files
is lost, which can easily be regenerated, compared to re-parsing the
Changelogs.
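The proposed flow could be sketched as follows. The libgfchangelog.parse(FILE_PATH, CALLBACK) signature is the one proposed above, but this stand-in parser, the in-memory store, and the record format are all illustrative assumptions.

```python
# Sketch of the proposed flow: keep only a list of changelog file names
# and parse each one on demand via a callback-based API.  The parse()
# function here is a stand-in for the proposed
# libgfchangelog.parse(FILE_PATH, CALLBACK); the store and record
# format are illustrative assumptions.

# In-memory store standing in for on-disk backend changelog files.
CHANGELOG_STORE = {
    "CHANGELOG.1": [
        ("E", "CREATE", "gfid-1"),   # entry operation
        ("D", "gfid-1"),             # data operation
    ],
}

def parse(file_path, callback):
    """Stand-in for libgfchangelog.parse(FILE_PATH, CALLBACK): invoke
    `callback` for every record in the changelog file."""
    for record in CHANGELOG_STORE.get(file_path, []):
        callback(record)

# Geo-rep keeps only the list of file names; on worker restart this
# list is cheap to regenerate, unlike re-parsing every changelog.
pending_files = list(CHANGELOG_STORE)
records = []
for path in pending_files:
    parse(path, records.append)
```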
_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel