MKehayov commented on code in PR #1641:
URL: https://github.com/apache/systemds/pull/1641#discussion_r906696291
##########
src/main/java/org/apache/sysds/runtime/controlprogram/federated/monitoring/services/StatsService.java:
##########
@@ -45,13 +50,20 @@ public static BaseEntityModel getWorkerStatistics(Long id, String address) {
aggFedStats.aggregate((FederatedStatistics.FedStatsCollection)tmp[0]);
parsedStats = new StatsEntityModel(
- id, aggFedStats.cpuUsage, aggFedStats.memoryUsage,
- aggFedStats.heavyHitters, aggFedStats.coordinatorsTrafficBytes);
+ id,
+ new Timestamp(System.currentTimeMillis()),
+ aggFedStats.cpuUsage,
+ aggFedStats.memoryUsage,
+ aggFedStats.jitCompileTime,
+ aggFedStats.heavyHitters,
+ aggFedStats.coordinatorsTrafficBytes,
+ aggFedStats.requestTypeCount);
}
- } catch(DMLRuntimeException dre) {
+ } catch (DMLRuntimeException dre) {
// silently ignore -> caused by offline federated workers
+ log.error("Worker offline: " + dre.getMessage());
} catch (Exception e) {
- throw new RuntimeException(e);
+ log.error("Error: " + e.getMessage());
Review Comment:
Completely agree. For now I have left it as it is, with additional comments.
The problem is that worker stats are gathered on a separate thread: if this
method throws, that thread stops and no further data is collected from any of
the workers, and the only way to recover is to restart the whole app. With
this change, the app can keep running without a restart.
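
To illustrate the failure mode described above, here is a minimal sketch. It
assumes the stats thread is driven by a `ScheduledExecutorService`, which the
PR does not confirm; with that API, an exception escaping a periodic task
suppresses all later runs. `ResilientStatsPoller`, `WorkerStatsFetcher`, and
the worker addresses are hypothetical names used only for this example.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ResilientStatsPoller {

    /** Hypothetical stand-in for the real per-worker statistics call. */
    interface WorkerStatsFetcher {
        void fetch(String address) throws Exception;
    }

    /**
     * Polls every address for the given number of rounds on one scheduler
     * thread and returns how many fetches succeeded. Failures are caught per
     * worker, so an offline worker neither skips the remaining workers nor
     * cancels the following rounds.
     */
    static int poll(List<String> addresses, WorkerStatsFetcher fetcher, int rounds)
            throws InterruptedException {
        AtomicInteger succeeded = new AtomicInteger();
        CountDownLatch remaining = new CountDownLatch(rounds);
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        scheduler.scheduleAtFixedRate(() -> {
            if (remaining.getCount() == 0) {
                return; // all requested rounds are done; ignore late ticks
            }
            for (String address : addresses) {
                try {
                    fetcher.fetch(address);
                    succeeded.incrementAndGet();
                } catch (Exception e) {
                    // Log and continue: rethrowing here would end the periodic task.
                    System.err.println("Stats collection failed for " + address
                            + ": " + e.getMessage());
                }
            }
            remaining.countDown();
        }, 0, 10, TimeUnit.MILLISECONDS);

        remaining.await(5, TimeUnit.SECONDS);
        scheduler.shutdownNow();
        return succeeded.get();
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> workers = List.of("worker-a:8001", "worker-b:8002");
        int ok = poll(workers, address -> {
            if (address.startsWith("worker-b")) {
                throw new IllegalStateException("worker offline");
            }
        }, 3);
        System.out.println("successful fetches: " + ok);
    }
}
```

In this sketch `worker-b` fails in every round, yet `worker-a` is still polled
each time. Moving the `try`/`catch` outside the scheduled task, or rethrowing
from it, would stop polling after the first failure.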