Re: Extended logging for rebalance performance analysis

ткаленко кирилл Thu, 07 May 2020 05:37:01 -0700

Hi, that's it.


06.05.2020, 19:50, "Ivan Rakov" <ivan.glu...@gmail.com>:
> Hi,
>
> IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold
>>  duration rebalance of cache group after which partitions distribution is
>>  output, set in milliseconds, default value is 10 minutes.
>
>  Does it mean that if the rebalancing process took less than 10 minutes,
> only a short version of the message (with supplier statistics) will show up?
>
> In general, I have no objections.
>
> On Mon, May 4, 2020 at 10:38 AM ткаленко кирилл <tkalkir...@yandex.ru>
> wrote:
>
>>  Hi, Igniters!
>>
>>  I'd like to share a new small feature in AI [1].
>>
>>  Current rebalance logging does not allow you to quickly answer following
>>  questions:
>>  1)How long was the balance(divided by supplier)?
>>  2)How many records and bytes per supplier were rebalanced?
>>  3)How many times did rebalance restart?
>>  4)Which partitions were rebalanced and from which nodes did they receive
>>  them?
>>  5)When did rebalance for all cache groups end?
>>
>>  What you can see in logs now:
>>
>>  1)Starting rebalance with order of cache groups.
>>  Rebalancing scheduled [order=[ignite-sys-cache, grp1, grp0],
>>  top=AffinityTopologyVersion [topVer=2, minorTopVer=0], force=false,
>>  evt=NODE_JOINED, node=c2146a04-dc23-4bc9-870d-dfbb55c00001]
>>
>>  2)Start rebalance of cache group from a specific supplier, specifying
>>  partition ids and mode - historical or full.
>>  Starting rebalance routine [ignite-sys-cache,
>>  topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0],
>>  supplier=8c525892-703b-4fc4-b28b-b2f139700000, fullPartitions=[0-99],
>>  histPartitions=[]]
>>
>>  3)Getting partial or complete partitions of cache group.
>>  Completed rebalancing [grp=ignite-sys-cache,
>>  supplier=8c525892-703b-4fc4-b28b-b2f139700000,
>>  topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], progress=1/2]
>>  Completed (final) rebalancing [grp=ignite-sys-cache,
>>  supplier=c2146a04-dc23-4bc9-870d-dfbb55c00001,
>>  topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], progress=2/2]
>>
>>  4)End rebalance of cache group.
>>  Completed rebalance future: RebalanceFuture [grp=CacheGroupContext
>>  [grp=ignite-sys-cache], topVer=AffinityTopologyVersion [topVer=2,
>>  minorTopVer=0], rebalanceId=1, routines=1, receivedBytes=1200,
>>  receivedKeys=0, partitionsLeft=0, startTime=1588519707607, endTime=-1,
>>  lastCancelledTime=-1]
>>
>>  Rebalance statistics:
>>
>>  To speed up rebalance analysis, statistics will be output for each cache
>>  group and total for all cache groups.
>>  If duration rebalance for cache group is greater than threshold value,
>>  partition distribution is output.
>>  Statistics will you to analyze duration of the balance for each supplier
>>  to understand which of them has been transmitting data for longest time.
>>
>>  System properties are used to output statistics:
>>
>>  IGNITE_QUIET - to output statistics, value must be false;
>>  IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold
>>  duration rebalance of cache group after which partitions distribution is
>>  output, set in milliseconds, default value is 10 minutes.
>>
>>  Statistics examples:
>>
>>  Successful full and historical rebalance of group cache, without
>>  partitions distribution.
>>  Rebalance information per cache group (successful rebalance): [id=3181548,
>>  name=grp1, startTime=2020-04-13 10:55:16,117, finishTime=2020-04-13
>>  10:55:16,127, d=10 ms, restarted=0] Supplier statistics: [nodeId=0, p=5,
>>  d=10 ms] [nodeId=1, p=5, d=10 ms] Aliases: p - partitions, e - entries, b -
>>  bytes, d - duration, h - historical, nodeId mapping
>>  (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
>>  [1=rebalancing.RebalanceStatisticsTest0]
>>  Rebalance information per cache group (successful rebalance): [id=3181547,
>>  name=grp0, startTime=2020-04-13 15:01:44,000, finishTime=2020-04-13
>>  15:01:44,116, d=116 ms, restarted=0] Supplier statistics: [nodeId=0, hp=10,
>>  he=300, hb=30267, d=116 ms] Aliases: p - partitions, e - entries, b -
>>  bytes, d - duration, h - historical, nodeId mapping
>>  (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0]
>>
>>  Successful full and historical rebalance of group cache, with partitions
>>  distribution.
>>  Rebalance information per cache group (successful rebalance): [id=3181548,
>>  name=grp1, startTime=2020-04-13 10:55:16,117, finishTime=2020-04-13
>>  10:55:16,127, d=10 ms, restarted=0] Supplier statistics: [nodeId=0, p=5,
>>  d=10 ms] [nodeId=1, p=5, d=10 ms] Aliases: p - partitions, e - entries, b -
>>  bytes, d - duration, h - historical, nodeId mapping
>>  (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
>>  [1=rebalancing.RebalanceStatisticsTest0] Rebalance duration was greater
>>  than 5 ms, printing detailed information about partitions distribution
>>  (threshold can be changed by setting number of milliseconds into
>>  IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD) 0 =
>>  [0,bu,su],[1,bu],[2,pr,su] 1 = [0,bu,su],[1,bu],[2,pr,su] 2 =
>>  [0,bu,su],[1,bu],[2,pr,su] 3 = [0,bu,su],[1,bu],[2,pr,su] 4 =
>>  [0,bu,su],[1,bu],[2,pr,su] 5 = [0,bu,su],[1,bu],[2,pr,su] 6 =
>>  [0,bu,su],[1,bu],[2,pr,su] 7 = [0,bu,su],[1,bu],[2,pr,su] 8 =
>>  [0,bu,su],[1,bu],[2,pr,su] 9 = [0,bu,su],[1,bu],[2,pr,su] Aliases: pr -
>>  primary, bu - backup, su - supplier node, h - historical, nodeId mapping
>>  (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
>>  [1=rebalancing.RebalanceStatisticsTest2]
>>  [2=rebalancing.RebalanceStatisticsTest0]
>>  Rebalance information per cache group (successful rebalance): [id=3181547,
>>  name=grp0, startTime=2020-04-13 15:01:44,000, finishTime=2020-04-13
>>  15:01:44,116, d=116 ms, restarted=0] Supplier statistics: [nodeId=0, hp=10,
>>  he=300, hb=30267, d=116 ms] Aliases: p - partitions, e - entries, b -
>>  bytes, d - duration, h - historical, nodeId mapping
>>  (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0] Rebalance
>>  duration was greater than 100 ms, printing detailed information about
>>  partitions distribution (threshold can be changed by setting number of
>>  milliseconds into IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD)
>>  0 = [0,pr,su],[1,bu],[2,bu] h 1 = [0,bu,su],[1,bu],[2,pr] h 2 =
>>  [0,pr,su],[1,bu],[2,bu] h 3 = [0,bu,su],[1,bu],[2,pr] h 4 =
>>  [0,pr,su],[1,bu],[2,bu] h 5 = [0,pr,su],[1,bu],[2,bu] h 6 =
>>  [0,bu,su],[1,bu],[2,pr] h 7 = [0,pr,su],[1,bu],[2,bu] h 8 =
>>  [0,pr,su],[1,bu],[2,bu] h 9 = [0,pr,su],[1,bu],[2,bu] h Aliases: pr -
>>  primary, bu - backup, su - supplier node, h - historical, nodeId mapping
>>  (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0]
>>  [1=rebalancing.RebalanceStatisticsTest2]
>>  [2=rebalancing.RebalanceStatisticsTest1]
>>
>>  Interrupted rebalance of group cache.
>>  Rebalance information per cache group (interrupted rebalance):
>>  [id=644280849, name=default2, startTime=2020-04-13 14:55:24,969,
>>  finishTime=2020-04-13 14:55:24,969, d=0 ms, restarted=0]
>>
>>  Total full and historical rebalance for all cache groups.
>>  Rebalance total information (including successful and not rebalances):
>>  [startTime=2020-04-13 10:55:18,726, finishTime=2020-04-13 10:55:18,780,
>>  d=54 ms] Supplier statistics: [nodeId=0, p=60, e=250, b=25000, d=54 ms]
>>  [nodeId=1, p=60, e=250, b=24945, d=54 ms] Aliases: p - partitions, e -
>>  entries, b - bytes, d - duration, h - historical, nodeId mapping
>>  (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
>>  [1=rebalancing.RebalanceStatisticsTest0]
>>  Rebalance total information (including successful and not rebalances):
>>  [startTime=2020-04-13 15:01:43,822, finishTime=2020-04-13 15:01:44,116,
>>  d=294 ms] Supplier statistics: [nodeId=0, hp=20, he=500, hb=50445, d=294
>>  ms] Aliases: p - partitions, e - entries, b - bytes, d - duration, h -
>>  historical, nodeId mapping (nodeId=id,consistentId)
>>  [0=rebalancing.RebalanceStatisticsTest0]
>>
>>  [1] - https://issues.apache.org/jira/browse/IGNITE-12080

Re: Extended logging for rebalance performance analysis

Reply via email to