abdullah alamoudi has posted comments on this change. Change subject: [ASTERIXDB-1943][API][STO] Make rebalance idempotent. ......................................................................
Patch Set 12: (15 comments) https://asterix-gerrit.ics.uci.edu/#/c/1821/12/asterixdb/asterix-app/src/main/java/org/apache/asterix/api/http/server/RebalanceApiServlet.java File asterixdb/asterix-app/src/main/java/org/apache/asterix/api/http/server/RebalanceApiServlet.java: PS12, Line 74: Future parametrize Future throughout this class? PS12, Line 77: rebalanceFutureTerminated I don't understand the need for this queue and the use of the countdownlatch? Why not simply use the future to wait for completion? PS12, Line 174: sendResponse(out, jsonResponse, response, HttpResponseStatus.BAD_REQUEST, : "to rebalance a particular dataset, the parameter dataverseName must be given"); : return; Log bad requests? PS12, Line 159: // Parses and check target nodes. : if (nodes == null) { : sendResponse(out, jsonResponse, response, HttpResponseStatus.BAD_REQUEST, "nodes are not given"); : return; : } : String nodesString = StringUtils.strip(nodes, "\"'").trim(); : String[] targetNodes = nodesString.split(","); : if ("".equals(nodesString)) { : sendResponse(out, jsonResponse, response, HttpResponseStatus.BAD_REQUEST, : "target nodes should not be empty"); : return; : } : : // If a user gives parameter datasetName, she should give dataverseName as well. : if (dataverseName == null && datasetName != null) { : sendResponse(out, jsonResponse, response, HttpResponseStatus.BAD_REQUEST, : "to rebalance a particular dataset, the parameter dataverseName must be given"); : return; : } : : // Does not allow rebalancing a metadata dataset. : if (METADATA.equals(dataverseName)) { : sendResponse(out, jsonResponse, response, HttpResponseStatus.BAD_REQUEST, : "cannot rebalance a metadata dataset"); : return; : } why not validate before submitting the task? https://asterix-gerrit.ics.uci.edu/#/c/1821/12/asterixdb/asterix-app/src/main/java/org/apache/asterix/utils/RebalanceUtil.java File asterixdb/asterix-app/src/main/java/org/apache/asterix/utils/RebalanceUtil.java: PS12, Line 132: / TODO(yingyi): in case a crash happens, currently the system will either: : // 1. (crash before metadata switch) think the rebalance is not done, and the target data files are leaked until : // the next rebalance request. : // 2. (crash after metadata switch) think the rebalance is done, and the source data files are leaked; Create an issue? PS12, Line 148: private interface Work why introduce a new interface? why not simply use Callable<Void> ? PS12, Line 167: throw interruptedException; Log too when that happens. maybe along with the number of interrupted attempts? maybe even with suppressing every subsequent interruption? We need those to give us clues when things go bad... Also add tests for this? https://asterix-gerrit.ics.uci.edu/#/c/1821/12/asterixdb/asterix-app/src/test/java/org/apache/asterix/test/common/RebalanceCancellationTestExecutor.java File asterixdb/asterix-app/src/test/java/org/apache/asterix/test/common/RebalanceCancellationTestExecutor.java: PS12, Line 96: rc == 200 || rc == 404 the coverage of this test is weak and could miss regressions. Create tests that ensure both cases are covered. Also cover the test case for the cancellation during the dataset switch! https://asterix-gerrit.ics.uci.edu/#/c/1821/12/asterixdb/asterix-common/src/main/java/org/apache/asterix/common/context/DatasetLifecycleManager.java File asterixdb/asterix-common/src/main/java/org/apache/asterix/common/context/DatasetLifecycleManager.java: PS12, Line 274: No index found with resourceID " + resourceID this error message should be fixed. preferably with an error code! PS12, Line 280: finally { : if (iInfo != null) { : iInfo.untouch(); : } : if (dsr != null) { : dsr.untouch(); : } Can we add a comment explaining in what scenario can each of these two cases happen? https://asterix-gerrit.ics.uci.edu/#/c/1821/12/asterixdb/asterix-transactions/src/main/java/org/apache/asterix/transaction/management/resource/PersistentLocalResourceRepository.java File asterixdb/asterix-transactions/src/main/java/org/apache/asterix/transaction/management/resource/PersistentLocalResourceRepository.java: PS12, Line 219: // In case resourceFile.delete() gets interrupted -- the file may or may not : // exists, we should still invalidate. : // Since it's just a cache invalidation, it should not affect correctness. : resourceCache.invalidate(relativePath); I think this whole method should be atomic. an easy way to do this is to make it un-interruptible. something like what you did in the rebalance servlet? note that as per the comment, this could lead to a resource being deleted locally but not on replica since the code below will not be executed! https://asterix-gerrit.ics.uci.edu/#/c/1821/12/hyracks-fullstack/hyracks/hyracks-storage-am-common/src/main/java/org/apache/hyracks/storage/am/common/api/IIndexDataflowHelper.java File hyracks-fullstack/hyracks/hyracks-storage-am-common/src/main/java/org/apache/hyracks/storage/am/common/api/IIndexDataflowHelper.java: PS12, Line 49: boolean failSilently I think this interface shouldn't change and we can keep the failSilently in the drop index operator node pushable where we swallow the exception. https://asterix-gerrit.ics.uci.edu/#/c/1821/12/hyracks-fullstack/hyracks/hyracks-storage-am-common/src/main/java/org/apache/hyracks/storage/am/common/dataflow/IndexDataflowHelper.java File hyracks-fullstack/hyracks/hyracks-storage-am-common/src/main/java/org/apache/hyracks/storage/am/common/dataflow/IndexDataflowHelper.java: This file will not need to change if we keep the failSilently flag in the index drop operator node pushable. PS12, Line 61: (lr.getPath( keeping the flag outside this, we can easily avoid this NPE too. https://asterix-gerrit.ics.uci.edu/#/c/1821/12/hyracks-fullstack/hyracks/hyracks-storage-am-common/src/main/java/org/apache/hyracks/storage/am/common/dataflow/IndexDropOperatorNodePushable.java File hyracks-fullstack/hyracks/hyracks-storage-am-common/src/main/java/org/apache/hyracks/storage/am/common/dataflow/IndexDropOperatorNodePushable.java: PS12, Line 55: failSliently instead of passing this in. let this method swallow the exception? -- To view, visit https://asterix-gerrit.ics.uci.edu/1821 To unsubscribe, visit https://asterix-gerrit.ics.uci.edu/settings Gerrit-MessageType: comment Gerrit-Change-Id: I0d14a07978e106cd497cc35538fafef318b2fcf7 Gerrit-PatchSet: 12 Gerrit-Project: asterixdb Gerrit-Branch: master Gerrit-Owner: Yingyi Bu <buyin...@gmail.com> Gerrit-Reviewer: Jenkins <jenk...@fulliautomatix.ics.uci.edu> Gerrit-Reviewer: Michael Blow <mb...@apache.org> Gerrit-Reviewer: Till Westmann <ti...@apache.org> Gerrit-Reviewer: Yingyi Bu <buyin...@gmail.com> Gerrit-Reviewer: abdullah alamoudi <bamou...@gmail.com> Gerrit-HasComments: Yes