[GitHub] [hadoop] mukul1987 commented on a change in pull request #1226: HDDS-1610. applyTransaction failure should not be lost on restart.

2019-08-14 Thread GitBox
mukul1987 commented on a change in pull request #1226: HDDS-1610. 
applyTransaction failure should not be lost on restart.
URL: https://github.com/apache/hadoop/pull/1226#discussion_r313772527
 
 

 ##
 File path: 
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/transport/server/ratis/ContainerStateMachine.java
 ##
 @@ -674,30 +681,60 @@ public void notifyIndexUpdate(long term, long index) {
   if (cmdType == Type.WriteChunk || cmdType ==Type.PutSmallFile) {
 builder.setCreateContainerSet(createContainerSet);
   }
+  CompletableFuture applyTransactionFuture =
+  new CompletableFuture<>();
   // Ensure the command gets executed in a separate thread than
   // stateMachineUpdater thread which is calling applyTransaction here.
-  CompletableFuture future = CompletableFuture
-  .supplyAsync(() -> runCommand(requestProto, builder.build()),
+  CompletableFuture future =
+  CompletableFuture.supplyAsync(
+  () -> runCommand(requestProto, builder.build()),
   getCommandExecutor(requestProto));
-
-  future.thenAccept(m -> {
+  future.thenApply(r -> {
 if (trx.getServerRole() == RaftPeerRole.LEADER) {
   long startTime = (long) trx.getStateMachineContext();
   metrics.incPipelineLatency(cmdType,
   Time.monotonicNowNanos() - startTime);
 }
-
-final Long previous =
-applyTransactionCompletionMap
-.put(index, trx.getLogEntry().getTerm());
-Preconditions.checkState(previous == null);
-if (cmdType == Type.WriteChunk || cmdType == Type.PutSmallFile) {
-  metrics.incNumBytesCommittedCount(
+if (r.getResult() != ContainerProtos.Result.SUCCESS) {
+  StorageContainerException sce =
+  new StorageContainerException(r.getMessage(), r.getResult());
+  LOG.error(
+  "gid {} : ApplyTransaction failed. cmd {} logIndex {} msg : "
+  + "{} Container Result: {}", gid, r.getCmdType(), index,
+  r.getMessage(), r.getResult());
+  metrics.incNumApplyTransactionsFails();
+  ratisServer.handleApplyTransactionFailure(gid, trx.getServerRole());
+  // Since the applyTransaction now is completed exceptionally,
+  // before any further snapshot is taken , the exception will be
+  // caught in stateMachineUpdater in Ratis and ratis server will
+  // shutdown.
+  applyTransactionFuture.completeExceptionally(sce);
 
 Review comment:
   lets move the ratisServer.handleApplyTransactionFailure(gid, 
trx.getServerRole()); as the last line in the if block.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[GitHub] [hadoop] mukul1987 commented on a change in pull request #1226: HDDS-1610. applyTransaction failure should not be lost on restart.

2019-08-14 Thread GitBox
mukul1987 commented on a change in pull request #1226: HDDS-1610. 
applyTransaction failure should not be lost on restart.
URL: https://github.com/apache/hadoop/pull/1226#discussion_r313771945
 
 

 ##
 File path: 
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/transport/server/ratis/ContainerStateMachine.java
 ##
 @@ -265,6 +269,13 @@ public void persistContainerSet(OutputStream out) throws 
IOException {
   public long takeSnapshot() throws IOException {
 TermIndex ti = getLastAppliedTermIndex();
 long startTime = Time.monotonicNow();
+if (!isStateMachineHealthy.get()) {
+  String msg =
+  "Failed to take snapshot " + " for " + gid + " as the stateMachine"
+  + " is unhealthy. The last applied index is at " + ti;
 
 Review comment:
   lets log this as well.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[GitHub] [hadoop] mukul1987 commented on a change in pull request #1226: HDDS-1610. applyTransaction failure should not be lost on restart.

2019-08-06 Thread GitBox
mukul1987 commented on a change in pull request #1226: HDDS-1610. 
applyTransaction failure should not be lost on restart.
URL: https://github.com/apache/hadoop/pull/1226#discussion_r311191416
 
 

 ##
 File path: 
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/transport/server/ratis/ContainerStateMachine.java
 ##
 @@ -674,30 +674,54 @@ public void notifyIndexUpdate(long term, long index) {
   if (cmdType == Type.WriteChunk || cmdType ==Type.PutSmallFile) {
 builder.setCreateContainerSet(createContainerSet);
   }
+  CompletableFuture applyTransactionFuture =
+  new CompletableFuture<>();
   // Ensure the command gets executed in a separate thread than
   // stateMachineUpdater thread which is calling applyTransaction here.
-  CompletableFuture future = CompletableFuture
-  .supplyAsync(() -> runCommand(requestProto, builder.build()),
+  CompletableFuture future =
+  CompletableFuture.supplyAsync(
+  () -> runCommandGetResponse(requestProto, builder.build()),
   getCommandExecutor(requestProto));
-
-  future.thenAccept(m -> {
+  future.thenApply(r -> {
 if (trx.getServerRole() == RaftPeerRole.LEADER) {
   long startTime = (long) trx.getStateMachineContext();
   metrics.incPipelineLatency(cmdType,
   Time.monotonicNowNanos() - startTime);
 }
-
-final Long previous =
-applyTransactionCompletionMap
-.put(index, trx.getLogEntry().getTerm());
-Preconditions.checkState(previous == null);
-if (cmdType == Type.WriteChunk || cmdType == Type.PutSmallFile) {
-  metrics.incNumBytesCommittedCount(
+if (r.getResult() != ContainerProtos.Result.SUCCESS) {
+  StorageContainerException sce =
+  new StorageContainerException(r.getMessage(), r.getResult());
+  LOG.error(gid + ": ApplyTransaction failed: cmd " + r.getCmdType()
+  + " logIndex " + index + " Error message: " + r.getMessage()
+  + " Container Result: " + r.getResult());
+  metrics.incNumApplyTransactionsFails();
+  ratisServer.handleApplyTransactionFailure(gid, trx.getServerRole());
+  // Since the applyTransaction now is completed exceptionally,
+  // before any further snapshot is taken , the exception will be
+  // caught in stateMachineUpdater in Ratis and ratis server will
+  // shutdown.
+  applyTransactionFuture.completeExceptionally(sce);
+} else {
+  metrics.incNumBytesWrittenCount(
   requestProto.getWriteChunk().getChunkData().getLen());
+  LOG.debug(gid + ": ApplyTransaction completed: cmd " + r.getCmdType()
 
 Review comment:
   if this is a success, then "" Error message: " + r.getMessage()" will not be 
the right thing to print here.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[GitHub] [hadoop] mukul1987 commented on a change in pull request #1226: HDDS-1610. applyTransaction failure should not be lost on restart.

2019-08-06 Thread GitBox
mukul1987 commented on a change in pull request #1226: HDDS-1610. 
applyTransaction failure should not be lost on restart.
URL: https://github.com/apache/hadoop/pull/1226#discussion_r311191008
 
 

 ##
 File path: 
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/transport/server/ratis/ContainerStateMachine.java
 ##
 @@ -674,30 +674,54 @@ public void notifyIndexUpdate(long term, long index) {
   if (cmdType == Type.WriteChunk || cmdType ==Type.PutSmallFile) {
 builder.setCreateContainerSet(createContainerSet);
   }
+  CompletableFuture applyTransactionFuture =
+  new CompletableFuture<>();
   // Ensure the command gets executed in a separate thread than
   // stateMachineUpdater thread which is calling applyTransaction here.
-  CompletableFuture future = CompletableFuture
-  .supplyAsync(() -> runCommand(requestProto, builder.build()),
+  CompletableFuture future =
+  CompletableFuture.supplyAsync(
+  () -> runCommandGetResponse(requestProto, builder.build()),
   getCommandExecutor(requestProto));
-
-  future.thenAccept(m -> {
+  future.thenApply(r -> {
 if (trx.getServerRole() == RaftPeerRole.LEADER) {
   long startTime = (long) trx.getStateMachineContext();
   metrics.incPipelineLatency(cmdType,
   Time.monotonicNowNanos() - startTime);
 }
-
-final Long previous =
-applyTransactionCompletionMap
-.put(index, trx.getLogEntry().getTerm());
-Preconditions.checkState(previous == null);
-if (cmdType == Type.WriteChunk || cmdType == Type.PutSmallFile) {
-  metrics.incNumBytesCommittedCount(
+if (r.getResult() != ContainerProtos.Result.SUCCESS) {
+  StorageContainerException sce =
+  new StorageContainerException(r.getMessage(), r.getResult());
+  LOG.error(gid + ": ApplyTransaction failed: cmd " + r.getCmdType()
+  + " logIndex " + index + " Error message: " + r.getMessage()
+  + " Container Result: " + r.getResult());
+  metrics.incNumApplyTransactionsFails();
+  ratisServer.handleApplyTransactionFailure(gid, trx.getServerRole());
+  // Since the applyTransaction now is completed exceptionally,
+  // before any further snapshot is taken , the exception will be
+  // caught in stateMachineUpdater in Ratis and ratis server will
+  // shutdown.
+  applyTransactionFuture.completeExceptionally(sce);
+} else {
+  metrics.incNumBytesWrittenCount(
   requestProto.getWriteChunk().getChunkData().getLen());
+  LOG.debug(gid + ": ApplyTransaction completed: cmd " + r.getCmdType()
 
 Review comment:
   same here


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[GitHub] [hadoop] mukul1987 commented on a change in pull request #1226: HDDS-1610. applyTransaction failure should not be lost on restart.

2019-08-06 Thread GitBox
mukul1987 commented on a change in pull request #1226: HDDS-1610. 
applyTransaction failure should not be lost on restart.
URL: https://github.com/apache/hadoop/pull/1226#discussion_r311181834
 
 

 ##
 File path: 
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/client/rpc/TestContainerStateMachineFailures.java
 ##
 @@ -23,6 +23,11 @@
 import org.apache.hadoop.hdds.conf.OzoneConfiguration;
 import org.apache.hadoop.hdds.protocol.datanode.proto.ContainerProtos;
 import org.apache.hadoop.hdds.protocol.proto.HddsProtos;
+import org.apache.hadoop.hdds.scm.XceiverClientManager;
+import org.apache.hadoop.hdds.scm.XceiverClientSpi;
+import 
org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException;
 
 Review comment:
   unused.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[GitHub] [hadoop] mukul1987 commented on a change in pull request #1226: HDDS-1610. applyTransaction failure should not be lost on restart.

2019-08-06 Thread GitBox
mukul1987 commented on a change in pull request #1226: HDDS-1610. 
applyTransaction failure should not be lost on restart.
URL: https://github.com/apache/hadoop/pull/1226#discussion_r311179939
 
 

 ##
 File path: 
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/client/rpc/TestContainerStateMachineFailures.java
 ##
 @@ -270,4 +279,73 @@ public void testUnhealthyContainer() throws Exception {
 Assert.assertEquals(ContainerProtos.Result.CONTAINER_UNHEALTHY,
 dispatcher.dispatch(request.build(), null).getResult());
   }
+
+  @Test
+  public void testAppyTransactionFailure() throws Exception {
+OzoneOutputStream key =
+objectStore.getVolume(volumeName).getBucket(bucketName)
+.createKey("ratis", 1024, ReplicationType.RATIS,
+ReplicationFactor.ONE, new HashMap<>());
+// First write and flush creates a container in the datanode
+key.write("ratis".getBytes());
+key.flush();
+key.write("ratis".getBytes());
+
+//get the name of a valid container
+OmKeyArgs keyArgs = new OmKeyArgs.Builder().setVolumeName(volumeName).
+setBucketName(bucketName).setType(HddsProtos.ReplicationType.RATIS)
+.setFactor(HddsProtos.ReplicationFactor.ONE).setKeyName("ratis")
+.build();
+KeyOutputStream groupOutputStream = (KeyOutputStream) 
key.getOutputStream();
+List locationInfoList =
+groupOutputStream.getLocationInfoList();
+Assert.assertEquals(1, locationInfoList.size());
+OmKeyLocationInfo omKeyLocationInfo = locationInfoList.get(0);
+ContainerData containerData =
+cluster.getHddsDatanodes().get(0).getDatanodeStateMachine()
+.getContainer().getContainerSet()
+.getContainer(omKeyLocationInfo.getContainerID())
+.getContainerData();
+Assert.assertTrue(containerData instanceof KeyValueContainerData);
+KeyValueContainerData keyValueContainerData =
+(KeyValueContainerData) containerData;
+key.close();
+
+long containerID = omKeyLocationInfo.getContainerID();
+// delete the container db file
+FileUtil.fullyDelete(new File(keyValueContainerData.getContainerPath()));
+Pipeline pipeline = cluster.getStorageContainerLocationClient()
+.getContainerWithPipeline(containerID).getPipeline();
+XceiverClientSpi client = xceiverClientManager.acquireClient(pipeline);
+ContainerProtos.ContainerCommandRequestProto.Builder request =
+ContainerProtos.ContainerCommandRequestProto.newBuilder();
+request.setDatanodeUuid(pipeline.getFirstNode().getUuidString());
+request.setCmdType(ContainerProtos.Type.CloseContainer);
+request.setContainerID(containerID);
+request.setCloseContainer(
+ContainerProtos.CloseContainerRequestProto.getDefaultInstance());
+// close container transaction will fail over Ratis and will cause the raft
+try {
+  client.sendCommand(request.build());
+  Assert.fail("Expected exception not thrown");
+} catch (IOException e) {
+}
 
 Review comment:
   What exception are we expecting here ?
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[GitHub] [hadoop] mukul1987 commented on a change in pull request #1226: HDDS-1610. applyTransaction failure should not be lost on restart.

2019-08-06 Thread GitBox
mukul1987 commented on a change in pull request #1226: HDDS-1610. 
applyTransaction failure should not be lost on restart.
URL: https://github.com/apache/hadoop/pull/1226#discussion_r311189155
 
 

 ##
 File path: 
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/transport/server/ratis/ContainerStateMachine.java
 ##
 @@ -674,30 +674,54 @@ public void notifyIndexUpdate(long term, long index) {
   if (cmdType == Type.WriteChunk || cmdType ==Type.PutSmallFile) {
 builder.setCreateContainerSet(createContainerSet);
   }
+  CompletableFuture applyTransactionFuture =
+  new CompletableFuture<>();
   // Ensure the command gets executed in a separate thread than
   // stateMachineUpdater thread which is calling applyTransaction here.
-  CompletableFuture future = CompletableFuture
-  .supplyAsync(() -> runCommand(requestProto, builder.build()),
+  CompletableFuture future =
+  CompletableFuture.supplyAsync(
+  () -> runCommandGetResponse(requestProto, builder.build()),
 
 Review comment:
   lets rename runCommandGetResponse and remove runCommand as all the existing 
caller of the earlier function runCommand can be removed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[GitHub] [hadoop] mukul1987 commented on a change in pull request #1226: HDDS-1610. applyTransaction failure should not be lost on restart.

2019-08-06 Thread GitBox
mukul1987 commented on a change in pull request #1226: HDDS-1610. 
applyTransaction failure should not be lost on restart.
URL: https://github.com/apache/hadoop/pull/1226#discussion_r311184571
 
 

 ##
 File path: 
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/transport/server/ratis/XceiverServerRatis.java
 ##
 @@ -609,6 +609,16 @@ void handleNoLeader(RaftGroupId groupId, RoleInfoProto 
roleInfoProto) {
 handlePipelineFailure(groupId, roleInfoProto);
   }
 
+  void handleApplyTransactionFailure(RaftGroupId groupId,
+  RaftProtos.RaftPeerRole role) {
+UUID dnId = RatisHelper.toDatanodeId(getServer().getId());
+String msg =
+"Ratis Transaction failure in datanode" + dnId + " with role " + role
++ " Triggering pipeline close action.";
+triggerPipelineClose(groupId, msg, 
ClosePipelineInfo.Reason.PIPELINE_FAILED,
+false);
+stop();
 
 Review comment:
   We do not necessarily need to stop the raftServer here, for the other 
container's we can still keep on applying the transaction


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[GitHub] [hadoop] mukul1987 commented on a change in pull request #1226: HDDS-1610. applyTransaction failure should not be lost on restart.

2019-08-06 Thread GitBox
mukul1987 commented on a change in pull request #1226: HDDS-1610. 
applyTransaction failure should not be lost on restart.
URL: https://github.com/apache/hadoop/pull/1226#discussion_r311190425
 
 

 ##
 File path: 
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/transport/server/ratis/ContainerStateMachine.java
 ##
 @@ -674,30 +674,54 @@ public void notifyIndexUpdate(long term, long index) {
   if (cmdType == Type.WriteChunk || cmdType ==Type.PutSmallFile) {
 builder.setCreateContainerSet(createContainerSet);
   }
+  CompletableFuture applyTransactionFuture =
+  new CompletableFuture<>();
   // Ensure the command gets executed in a separate thread than
   // stateMachineUpdater thread which is calling applyTransaction here.
-  CompletableFuture future = CompletableFuture
-  .supplyAsync(() -> runCommand(requestProto, builder.build()),
+  CompletableFuture future =
+  CompletableFuture.supplyAsync(
+  () -> runCommandGetResponse(requestProto, builder.build()),
   getCommandExecutor(requestProto));
-
-  future.thenAccept(m -> {
+  future.thenApply(r -> {
 if (trx.getServerRole() == RaftPeerRole.LEADER) {
   long startTime = (long) trx.getStateMachineContext();
   metrics.incPipelineLatency(cmdType,
   Time.monotonicNowNanos() - startTime);
 }
-
-final Long previous =
-applyTransactionCompletionMap
-.put(index, trx.getLogEntry().getTerm());
-Preconditions.checkState(previous == null);
-if (cmdType == Type.WriteChunk || cmdType == Type.PutSmallFile) {
-  metrics.incNumBytesCommittedCount(
+if (r.getResult() != ContainerProtos.Result.SUCCESS) {
+  StorageContainerException sce =
+  new StorageContainerException(r.getMessage(), r.getResult());
+  LOG.error(gid + ": ApplyTransaction failed: cmd " + r.getCmdType()
 
 Review comment:
   lets convert this to parameterized logging.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[GitHub] [hadoop] mukul1987 commented on a change in pull request #1226: HDDS-1610. applyTransaction failure should not be lost on restart.

2019-08-06 Thread GitBox
mukul1987 commented on a change in pull request #1226: HDDS-1610. 
applyTransaction failure should not be lost on restart.
URL: https://github.com/apache/hadoop/pull/1226#discussion_r311184233
 
 

 ##
 File path: 
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/transport/server/ratis/XceiverServerRatis.java
 ##
 @@ -630,6 +640,10 @@ void handleInstallSnapshotFromLeader(RaftGroupId groupId,
 handlePipelineFailure(groupId, roleInfoProto);
   }
 
+  @VisibleForTesting
+  public boolean isClosed() {
 
 Review comment:
   lets remove this. and this can be replaced with
   
   raftServer.getServer().getLifeCycleState().isClosingOrClosed()


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[GitHub] [hadoop] mukul1987 commented on a change in pull request #1226: HDDS-1610. applyTransaction failure should not be lost on restart.

2019-08-06 Thread GitBox
mukul1987 commented on a change in pull request #1226: HDDS-1610. 
applyTransaction failure should not be lost on restart.
URL: https://github.com/apache/hadoop/pull/1226#discussion_r311183058
 
 

 ##
 File path: 
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/transport/server/ratis/XceiverServerRatis.java
 ##
 @@ -609,6 +609,16 @@ void handleNoLeader(RaftGroupId groupId, RoleInfoProto 
roleInfoProto) {
 handlePipelineFailure(groupId, roleInfoProto);
   }
 
+  void handleApplyTransactionFailure(RaftGroupId groupId,
+  RaftProtos.RaftPeerRole role) {
+UUID dnId = RatisHelper.toDatanodeId(getServer().getId());
+String msg =
+"Ratis Transaction failure in datanode" + dnId + " with role " + role
++ " Triggering pipeline close action.";
+triggerPipelineClose(groupId, msg, 
ClosePipelineInfo.Reason.PIPELINE_FAILED,
 
 Review comment:
   Lets add a new failure reason here as @supratimdeka is suggesting. This will 
help in identifying difference failure via metrics in SCM.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[GitHub] [hadoop] mukul1987 commented on a change in pull request #1226: HDDS-1610. applyTransaction failure should not be lost on restart.

2019-08-06 Thread GitBox
mukul1987 commented on a change in pull request #1226: HDDS-1610. 
applyTransaction failure should not be lost on restart.
URL: https://github.com/apache/hadoop/pull/1226#discussion_r311177427
 
 

 ##
 File path: 
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/transport/server/ratis/ContainerStateMachine.java
 ##
 @@ -674,30 +674,54 @@ public void notifyIndexUpdate(long term, long index) {
   if (cmdType == Type.WriteChunk || cmdType ==Type.PutSmallFile) {
 builder.setCreateContainerSet(createContainerSet);
   }
+  CompletableFuture applyTransactionFuture =
+  new CompletableFuture<>();
   // Ensure the command gets executed in a separate thread than
   // stateMachineUpdater thread which is calling applyTransaction here.
-  CompletableFuture future = CompletableFuture
-  .supplyAsync(() -> runCommand(requestProto, builder.build()),
+  CompletableFuture future =
+  CompletableFuture.supplyAsync(
+  () -> runCommandGetResponse(requestProto, builder.build()),
   getCommandExecutor(requestProto));
-
-  future.thenAccept(m -> {
+  future.thenApply(r -> {
 if (trx.getServerRole() == RaftPeerRole.LEADER) {
   long startTime = (long) trx.getStateMachineContext();
   metrics.incPipelineLatency(cmdType,
   Time.monotonicNowNanos() - startTime);
 }
-
-final Long previous =
-applyTransactionCompletionMap
-.put(index, trx.getLogEntry().getTerm());
-Preconditions.checkState(previous == null);
-if (cmdType == Type.WriteChunk || cmdType == Type.PutSmallFile) {
-  metrics.incNumBytesCommittedCount(
+if (r.getResult() != ContainerProtos.Result.SUCCESS) {
+  StorageContainerException sce =
+  new StorageContainerException(r.getMessage(), r.getResult());
+  LOG.error(gid + ": ApplyTransaction failed: cmd " + r.getCmdType()
+  + " logIndex " + index + " Error message: " + r.getMessage()
+  + " Container Result: " + r.getResult());
+  metrics.incNumApplyTransactionsFails();
+  ratisServer.handleApplyTransactionFailure(gid, trx.getServerRole());
+  // Since the applyTransaction now is completed exceptionally,
+  // before any further snapshot is taken , the exception will be
+  // caught in stateMachineUpdater in Ratis and ratis server will
+  // shutdown.
+  applyTransactionFuture.completeExceptionally(sce);
+} else {
+  metrics.incNumBytesWrittenCount(
   requestProto.getWriteChunk().getChunkData().getLen());
+  LOG.debug(gid + ": ApplyTransaction completed: cmd " + r.getCmdType()
+  + " logIndex " + index + " Error message: " + r.getMessage()
+  + " Container Result: " + r.getResult());
+  applyTransactionFuture.complete(r::toByteString);
+  if (cmdType == Type.WriteChunk || cmdType == Type.PutSmallFile) {
+metrics.incNumBytesCommittedCount(
+requestProto.getWriteChunk().getChunkData().getLen());
+  }
 }
+
+final Long previous = applyTransactionCompletionMap
+.put(index, trx.getLogEntry().getTerm());
+Preconditions.checkState(previous == null);
 updateLastApplied();
-  }).whenComplete((r, t) -> applyTransactionSemaphore.release());
-  return future;
+applyTransactionSemaphore.release();
 
 Review comment:
   This should go inside a finally, to make sure the semaphore is released 
always.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[GitHub] [hadoop] mukul1987 commented on a change in pull request #1226: HDDS-1610. applyTransaction failure should not be lost on restart.

2019-08-06 Thread GitBox
mukul1987 commented on a change in pull request #1226: HDDS-1610. 
applyTransaction failure should not be lost on restart.
URL: https://github.com/apache/hadoop/pull/1226#discussion_r311179755
 
 

 ##
 File path: 
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/client/rpc/TestContainerStateMachineFailures.java
 ##
 @@ -270,4 +279,73 @@ public void testUnhealthyContainer() throws Exception {
 Assert.assertEquals(ContainerProtos.Result.CONTAINER_UNHEALTHY,
 dispatcher.dispatch(request.build(), null).getResult());
   }
+
+  @Test
+  public void testAppyTransactionFailure() throws Exception {
+OzoneOutputStream key =
+objectStore.getVolume(volumeName).getBucket(bucketName)
+.createKey("ratis", 1024, ReplicationType.RATIS,
+ReplicationFactor.ONE, new HashMap<>());
+// First write and flush creates a container in the datanode
+key.write("ratis".getBytes());
+key.flush();
+key.write("ratis".getBytes());
+
+//get the name of a valid container
+OmKeyArgs keyArgs = new OmKeyArgs.Builder().setVolumeName(volumeName).
+setBucketName(bucketName).setType(HddsProtos.ReplicationType.RATIS)
+.setFactor(HddsProtos.ReplicationFactor.ONE).setKeyName("ratis")
+.build();
+KeyOutputStream groupOutputStream = (KeyOutputStream) 
key.getOutputStream();
+List locationInfoList =
+groupOutputStream.getLocationInfoList();
+Assert.assertEquals(1, locationInfoList.size());
+OmKeyLocationInfo omKeyLocationInfo = locationInfoList.get(0);
+ContainerData containerData =
+cluster.getHddsDatanodes().get(0).getDatanodeStateMachine()
+.getContainer().getContainerSet()
+.getContainer(omKeyLocationInfo.getContainerID())
+.getContainerData();
+Assert.assertTrue(containerData instanceof KeyValueContainerData);
+KeyValueContainerData keyValueContainerData =
+(KeyValueContainerData) containerData;
+key.close();
+
+long containerID = omKeyLocationInfo.getContainerID();
+// delete the container db file
+FileUtil.fullyDelete(new File(keyValueContainerData.getContainerPath()));
+Pipeline pipeline = cluster.getStorageContainerLocationClient()
+.getContainerWithPipeline(containerID).getPipeline();
+XceiverClientSpi client = xceiverClientManager.acquireClient(pipeline);
+ContainerProtos.ContainerCommandRequestProto.Builder request =
+ContainerProtos.ContainerCommandRequestProto.newBuilder();
+request.setDatanodeUuid(pipeline.getFirstNode().getUuidString());
+request.setCmdType(ContainerProtos.Type.CloseContainer);
+request.setContainerID(containerID);
+request.setCloseContainer(
+ContainerProtos.CloseContainerRequestProto.getDefaultInstance());
+// close container transaction will fail over Ratis and will cause the raft
 
 Review comment:
   incomplete comment here.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[GitHub] [hadoop] mukul1987 commented on a change in pull request #1226: HDDS-1610. applyTransaction failure should not be lost on restart.

2019-08-06 Thread GitBox
mukul1987 commented on a change in pull request #1226: HDDS-1610. 
applyTransaction failure should not be lost on restart.
URL: https://github.com/apache/hadoop/pull/1226#discussion_r311181106
 
 

 ##
 File path: 
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/client/rpc/TestContainerStateMachineFailures.java
 ##
 @@ -270,4 +279,73 @@ public void testUnhealthyContainer() throws Exception {
 Assert.assertEquals(ContainerProtos.Result.CONTAINER_UNHEALTHY,
 dispatcher.dispatch(request.build(), null).getResult());
   }
+
+  @Test
+  public void testAppyTransactionFailure() throws Exception {
+OzoneOutputStream key =
+objectStore.getVolume(volumeName).getBucket(bucketName)
+.createKey("ratis", 1024, ReplicationType.RATIS,
+ReplicationFactor.ONE, new HashMap<>());
+// First write and flush creates a container in the datanode
+key.write("ratis".getBytes());
+key.flush();
+key.write("ratis".getBytes());
+
+//get the name of a valid container
+OmKeyArgs keyArgs = new OmKeyArgs.Builder().setVolumeName(volumeName).
+setBucketName(bucketName).setType(HddsProtos.ReplicationType.RATIS)
+.setFactor(HddsProtos.ReplicationFactor.ONE).setKeyName("ratis")
+.build();
+KeyOutputStream groupOutputStream = (KeyOutputStream) 
key.getOutputStream();
+List locationInfoList =
+groupOutputStream.getLocationInfoList();
+Assert.assertEquals(1, locationInfoList.size());
+OmKeyLocationInfo omKeyLocationInfo = locationInfoList.get(0);
+ContainerData containerData =
+cluster.getHddsDatanodes().get(0).getDatanodeStateMachine()
+.getContainer().getContainerSet()
+.getContainer(omKeyLocationInfo.getContainerID())
+.getContainerData();
+Assert.assertTrue(containerData instanceof KeyValueContainerData);
+KeyValueContainerData keyValueContainerData =
+(KeyValueContainerData) containerData;
+key.close();
+
+long containerID = omKeyLocationInfo.getContainerID();
+// delete the container db file
+FileUtil.fullyDelete(new File(keyValueContainerData.getContainerPath()));
+Pipeline pipeline = cluster.getStorageContainerLocationClient()
+.getContainerWithPipeline(containerID).getPipeline();
+XceiverClientSpi client = xceiverClientManager.acquireClient(pipeline);
+ContainerProtos.ContainerCommandRequestProto.Builder request =
 
 Review comment:
   How about we append some more data to the key and flush again ? in place of 
a close ?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[GitHub] [hadoop] mukul1987 commented on a change in pull request #1226: HDDS-1610. applyTransaction failure should not be lost on restart.

2019-08-06 Thread GitBox
mukul1987 commented on a change in pull request #1226: HDDS-1610. 
applyTransaction failure should not be lost on restart.
URL: https://github.com/apache/hadoop/pull/1226#discussion_r311179109
 
 

 ##
 File path: 
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/client/rpc/TestContainerStateMachineFailures.java
 ##
 @@ -270,4 +279,73 @@ public void testUnhealthyContainer() throws Exception {
 Assert.assertEquals(ContainerProtos.Result.CONTAINER_UNHEALTHY,
 dispatcher.dispatch(request.build(), null).getResult());
   }
+
+  @Test
+  public void testAppyTransactionFailure() throws Exception {
+OzoneOutputStream key =
+objectStore.getVolume(volumeName).getBucket(bucketName)
+.createKey("ratis", 1024, ReplicationType.RATIS,
+ReplicationFactor.ONE, new HashMap<>());
+// First write and flush creates a container in the datanode
+key.write("ratis".getBytes());
+key.flush();
+key.write("ratis".getBytes());
+
+//get the name of a valid container
+OmKeyArgs keyArgs = new OmKeyArgs.Builder().setVolumeName(volumeName).
 
 Review comment:
   This is unused.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org