[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=309585&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309585 ]

ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 10/Sep/19 07:26
Start Date: 10/Sep/19 07:26
Worklog Time Spent: 10m

Work Description: mukul1987 commented on issue #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#issuecomment-529809131

/label ozone

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------
Worklog Id: (was: 309585)
Time Spent: 10h (was: 9h 50m)

> Undetectable corruption after restart of a datanode
> ---
>
>                 Key: HDDS-1843
>                 URL: https://issues.apache.org/jira/browse/HDDS-1843
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: Ozone Datanode
>    Affects Versions: 0.5.0
>            Reporter: Shashikant Banerjee
>            Assignee: Shashikant Banerjee
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.5.0
>
>         Attachments: HDDS-1843.000.patch
>
>          Time Spent: 10h
>  Remaining Estimate: 0h
>
> Right now, all chunk writes use buffered IO, i.e., the sync flag is disabled
> by default. Similarly, RocksDB metadata updates on the datanode are applied
> to the RocksDB cache first. If both the buffered chunk data and the
> corresponding metadata update are lost across a datanode restart, the
> resulting corruption cannot be detected in a reasonable time frame (not even
> by the container scanner) unless there is a client IO failure or the Recon
> server notices it over time. To at least make the problem detectable, the
> Ratis snapshot on the datanode should sync the RocksDB file; the
> ContainerScanner will then be able to detect the corruption. We can also add
> a metric around the sync to measure how much throughput loss it incurs.
> Thanks [~msingh] for suggesting this.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
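For context on the sync flag the description mentions: with buffered IO the chunk bytes may sit only in OS page cache until the kernel flushes them, so a restart can silently discard them. A minimal JDK-only sketch of the difference (the class, file names, and `sync` parameter are hypothetical stand-ins, not Ozone's actual chunk writer):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch: write a "chunk" and, when sync is enabled, fsync it so the bytes
// survive a crash/restart. With sync disabled (the default described in the
// issue), the data may live only in OS buffers until the kernel flushes it.
public class ChunkWriteSketch {
  public static void writeChunk(Path path, byte[] data, boolean sync)
      throws IOException {
    try (FileChannel ch = FileChannel.open(path,
        StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
      ch.write(ByteBuffer.wrap(data));
      if (sync) {
        ch.force(true);  // flush file data and metadata to the device
      }
    }
  }

  public static void main(String[] args) throws IOException {
    Path p = Files.createTempFile("chunk", ".dat");
    writeChunk(p, new byte[]{1, 2, 3}, true);
    System.out.println("wrote " + Files.size(p) + " bytes durably");
  }
}
```

The metric proposed in the issue would measure exactly the cost of that `force(true)` call (and of the analogous RocksDB sync at Ratis snapshot time), since it blocks until the device acknowledges the write.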
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=309083&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309083 ]

ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 09/Sep/19 17:16
Start Date: 09/Sep/19 17:16
Worklog Time Spent: 10m

Work Description: bshashikant commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364

Issue Time Tracking
-------------------
Worklog Id: (was: 309083)
Time Spent: 9h 50m (was: 9h 40m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=309082&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309082 ]

ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 09/Sep/19 17:15
Start Date: 09/Sep/19 17:15
Worklog Time Spent: 10m

Work Description: bshashikant commented on issue #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#issuecomment-529579125

Thanks @nandakumar131 @mukul1987 @supratimdeka for the reviews. I have committed this change to trunk.

Issue Time Tracking
-------------------
Worklog Id: (was: 309082)
Time Spent: 9h 40m (was: 9.5h)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=307831&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-307831 ]

ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 06/Sep/19 13:09
Start Date: 06/Sep/19 13:09
Worklog Time Spent: 10m

Work Description: hadoop-yetus commented on issue #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#issuecomment-528848045

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Comment |
|:----:|----------:|:--------|:--------|
| 0 | reexec | 39 | Docker mode activated. |
| | _ Prechecks _ | | |
| +1 | dupname | 1 | No case conflicting files found. |
| +1 | @author | 0 | The patch does not contain any @author tags. |
| +1 | test4tests | 0 | The patch appears to include 4 new or modified test files. |
| | _ trunk Compile Tests _ | | |
| 0 | mvndep | 66 | Maven dependency ordering for branch |
| +1 | mvninstall | 588 | trunk passed |
| +1 | compile | 382 | trunk passed |
| +1 | checkstyle | 81 | trunk passed |
| +1 | mvnsite | 0 | trunk passed |
| +1 | shadedclient | 872 | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 179 | trunk passed |
| 0 | spotbugs | 421 | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +1 | findbugs | 616 | trunk passed |
| | _ Patch Compile Tests _ | | |
| 0 | mvndep | 41 | Maven dependency ordering for patch |
| +1 | mvninstall | 575 | the patch passed |
| +1 | compile | 389 | the patch passed |
| +1 | cc | 389 | the patch passed |
| +1 | javac | 389 | the patch passed |
| +1 | checkstyle | 83 | the patch passed |
| +1 | mvnsite | 0 | the patch passed |
| +1 | whitespace | 0 | The patch has no whitespace issues. |
| +1 | shadedclient | 700 | patch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 177 | the patch passed |
| +1 | findbugs | 687 | the patch passed |
| | _ Other Tests _ | | |
| +1 | unit | 298 | hadoop-hdds in the patch passed. |
| -1 | unit | 200 | hadoop-ozone in the patch failed. |
| +1 | asflicense | 48 | The patch does not generate ASF License warnings. |
| | | 6200 | |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.ozone.om.ratis.TestOzoneManagerRatisServer |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | Client=19.03.1 Server=19.03.1 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/9/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/1364 |
| Optional Tests | dupname asflicense compile cc mvnsite javac unit javadoc mvninstall shadedclient findbugs checkstyle |
| uname | Linux 806a3bdb7024 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / 6e4cdf8 |
| Default Java | 1.8.0_222 |
| unit | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/9/artifact/out/patch-unit-hadoop-ozone.txt |
| Test Results | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/9/testReport/ |
| Max. process+thread count | 1298 (vs. ulimit of 5500) |
| modules | C: hadoop-hdds/common hadoop-hdds/container-service hadoop-ozone/integration-test U: . |
| Console output | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/9/console |
| versions | git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1 |
| Powered by | Apache Yetus 0.10.0 http://yetus.apache.org |

This message was automatically generated.
Issue Time Tracking
-------------------
Worklog Id: (was: 307831)
Time Spent: 9.5h (was: 9h 20m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=307767&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-307767 ]

ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 06/Sep/19 11:25
Start Date: 06/Sep/19 11:25
Worklog Time Spent: 10m

Work Description: bshashikant commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364

Issue Time Tracking
-------------------
Worklog Id: (was: 307767)
Time Spent: 9h 20m (was: 9h 10m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=307755&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-307755 ]

ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 06/Sep/19 11:08
Start Date: 06/Sep/19 11:08
Worklog Time Spent: 10m

Work Description: bshashikant commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364

Issue Time Tracking
-------------------
Worklog Id: (was: 307755)
Time Spent: 9h 10m (was: 9h)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=307714&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-307714 ]

ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 06/Sep/19 09:21
Start Date: 06/Sep/19 09:21
Worklog Time Spent: 10m

Work Description: nandakumar131 commented on issue #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#issuecomment-528781001

@bshashikant you might need to rebase the changes on top of HDDS-1561. Even though there is no conflict, the compilation fails.

Issue Time Tracking
-------------------
Worklog Id: (was: 307714)
Time Spent: 9h (was: 8h 50m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=307712&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-307712 ]

ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 06/Sep/19 09:14
Start Date: 06/Sep/19 09:14
Worklog Time Spent: 10m

Work Description: nandakumar131 commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#discussion_r321645502

## File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/ContainerSet.java ##

    @@ -240,14 +240,46 @@ public ContainerReportsProto getContainerReport() throws IOException {
       }

       /**
    -   * Builds the missing container set by taking a diff total no containers
    -   * actually found and number of containers which actually got created.
    +   * Builds the missing container set by taking a diff between total no
    +   * containers actually found and number of containers which actually
    +   * got created. It also validates the BCSID stored in the snapshot file
    +   * for each container as against what is reported in containerScan.
        * This will only be called during the initialization of Datanode Service
        * when it still not a part of any write Pipeline.
    -   * @param createdContainerSet ContainerId set persisted in the Ratis snapshot
    +   * @param container2BCSIDMap Map of containerId to BCSID persisted in the
    +   *                           Ratis snapshot
        */
    -  public void buildMissingContainerSet(Set<Long> createdContainerSet) {
    -    missingContainerSet.addAll(createdContainerSet);
    -    missingContainerSet.removeAll(containerMap.keySet());
    +  public void buildMissingContainerSetAndValidate(
    +      Map<Long, Long> container2BCSIDMap) {
    +    container2BCSIDMap.entrySet().parallelStream().forEach((mapEntry) -> {
    +      long id = mapEntry.getKey();
    +      if (!containerMap.containsKey(id)) {
    +        LOG.warn("Adding container {} to missing container set.", id);
    +        missingContainerSet.add(id);
    +      } else {
    +        Container container = containerMap.get(id);
    +        long containerBCSID = container.getBlockCommitSequenceId();
    +        long snapshotBCSID = mapEntry.getValue();
    +        if (containerBCSID < snapshotBCSID) {
    +          LOG.warn(
    +              "Marking container {} unhealthy as reported BCSID {} is smaller"
    +                  + " than ratis snapshot recorded value {}", id,
    +              containerBCSID, snapshotBCSID);
    +          // Just mark the container unhealthy. Once the DatanodeStateMachine
    +          // thread starts, it will send a container report to SCM where these
    +          // unhealthy containers will be detected.
    +          try {
    +            container.markContainerUnhealthy();
    +          } catch (StorageContainerException sce) {
    +            // The container will still be marked unhealthy in memory even if
    +            // an exception occurs. It won't accept any new transactions and
    +            // will be handled by SCM. Even if the DN restarts, it will still
    +            // be detected as unhealthy as its BCSID won't change.
    +            LOG.info("Unable to persist unhealthy state for container {}", id);

Review comment: This should be `LOG.error`

Issue Time Tracking
-------------------
Worklog Id: (was: 307712)
Time Spent: 8h 50m (was: 8h 40m)
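The validation being reviewed above can be illustrated with a self-contained sketch. The class name, the plain maps, and the two result sets are hypothetical stand-ins for Ozone's ContainerSet, Container, and SCM reporting machinery; only the comparison logic mirrors the patch:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch of buildMissingContainerSetAndValidate: compare the
// containerId -> BCSID map recorded in the Ratis snapshot against the
// containers actually found on disk after a restart.
public class BcsidValidator {

  // Containers whose id appears in the snapshot but not on disk.
  public static final Set<Long> missing = ConcurrentHashMap.newKeySet();
  // Containers found on disk whose BCSID lags the snapshot's BCSID,
  // i.e. committed transactions were lost across the restart.
  public static final Set<Long> unhealthy = ConcurrentHashMap.newKeySet();

  public static void validate(Map<Long, Long> snapshotBcsids,
                              Map<Long, Long> onDiskBcsids) {
    snapshotBcsids.forEach((id, snapshotBcsid) -> {
      Long diskBcsid = onDiskBcsids.get(id);
      if (diskBcsid == null) {
        missing.add(id);        // container vanished entirely
      } else if (diskBcsid < snapshotBcsid) {
        unhealthy.add(id);      // container lost committed transactions
      }
    });
  }

  public static void main(String[] args) {
    Map<Long, Long> snapshot = Map.of(1L, 100L, 2L, 100L, 3L, 100L);
    // Container 3 is gone; container 2 is behind the snapshot BCSID.
    Map<Long, Long> onDisk = Map.of(1L, 100L, 2L, 90L);
    validate(snapshot, onDisk);
    System.out.println("missing=" + missing + " unhealthy=" + unhealthy);
  }
}
```

This is why syncing RocksDB at snapshot time matters: the snapshot's BCSID is only a trustworthy reference point for this comparison if the metadata it is compared against was durably persisted.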
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=307074&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-307074 ]

ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 05/Sep/19 11:11
Start Date: 05/Sep/19 11:11
Worklog Time Spent: 10m

Work Description: hadoop-yetus commented on issue #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#issuecomment-528314940

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Comment |
|:----:|----------:|:--------|:--------|
| 0 | reexec | 41 | Docker mode activated. |
| | _ Prechecks _ | | |
| +1 | dupname | 1 | No case conflicting files found. |
| +1 | @author | 0 | The patch does not contain any @author tags. |
| +1 | test4tests | 0 | The patch appears to include 4 new or modified test files. |
| | _ trunk Compile Tests _ | | |
| 0 | mvndep | 65 | Maven dependency ordering for branch |
| +1 | mvninstall | 581 | trunk passed |
| +1 | compile | 376 | trunk passed |
| +1 | checkstyle | 81 | trunk passed |
| +1 | mvnsite | 0 | trunk passed |
| +1 | shadedclient | 869 | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 178 | trunk passed |
| 0 | spotbugs | 416 | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +1 | findbugs | 611 | trunk passed |
| -0 | patch | 475 | Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary. |
| | _ Patch Compile Tests _ | | |
| 0 | mvndep | 38 | Maven dependency ordering for patch |
| +1 | mvninstall | 538 | the patch passed |
| +1 | compile | 391 | the patch passed |
| +1 | cc | 391 | the patch passed |
| +1 | javac | 391 | the patch passed |
| -0 | checkstyle | 43 | hadoop-hdds: The patch generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) |
| +1 | mvnsite | 0 | the patch passed |
| +1 | whitespace | 0 | The patch has no whitespace issues. |
| +1 | shadedclient | 665 | patch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 175 | the patch passed |
| +1 | findbugs | 631 | the patch passed |
| | _ Other Tests _ | | |
| +1 | unit | 283 | hadoop-hdds in the patch passed. |
| -1 | unit | 2556 | hadoop-ozone in the patch failed. |
| +1 | asflicense | 54 | The patch does not generate ASF License warnings. |
| | | 8415 | |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.ozone.client.rpc.TestMultiBlockWritesWithDnFailures |
| | hadoop.ozone.scm.node.TestQueryNode |
| | hadoop.ozone.container.common.statemachine.commandhandler.TestCloseContainerHandler |
| | hadoop.ozone.TestSecureOzoneCluster |
| | hadoop.ozone.client.rpc.TestCommitWatcher |
| | hadoop.ozone.container.common.statemachine.commandhandler.TestDeleteContainerHandler |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | Client=19.03.1 Server=19.03.1 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/8/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/1364 |
| Optional Tests | dupname asflicense compile cc mvnsite javac unit javadoc mvninstall shadedclient findbugs checkstyle |
| uname | Linux 8b805e111dba 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / 172bcd8 |
| Default Java | 1.8.0_222 |
| checkstyle | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/8/artifact/out/diff-checkstyle-hadoop-hdds.txt |
| unit | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/8/artifact/out/patch-unit-hadoop-ozone.txt |
| Test Results | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/8/testReport/ |
| Max. process+thread count | 5406 (vs. ulimit of 5500) |
| modules | C: hadoop-hdds/common hadoop-hdds/container-service hadoop-ozone/integration-test U: . |
| Console output | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/8/console |
| versions | git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1 |
| Powered by | Apache Yetus 0.10.0 http://yetus.apache.org |

This message was automatically generated.

Issue Time Tracking
-------------------
Worklog Id:
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=307033&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-307033 ]

ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 05/Sep/19 09:46
Start Date: 05/Sep/19 09:46
Worklog Time Spent: 10m

Work Description: hadoop-yetus commented on issue #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#issuecomment-528287642

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Comment |
|:----:|----------:|:--------|:--------|
| 0 | reexec | 39 | Docker mode activated. |
| | _ Prechecks _ | | |
| +1 | dupname | 0 | No case conflicting files found. |
| +1 | @author | 0 | The patch does not contain any @author tags. |
| +1 | test4tests | 0 | The patch appears to include 4 new or modified test files. |
| | _ trunk Compile Tests _ | | |
| 0 | mvndep | 71 | Maven dependency ordering for branch |
| +1 | mvninstall | 720 | trunk passed |
| +1 | compile | 422 | trunk passed |
| +1 | checkstyle | 85 | trunk passed |
| +1 | mvnsite | 0 | trunk passed |
| +1 | shadedclient | 1086 | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 208 | trunk passed |
| 0 | spotbugs | 436 | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +1 | findbugs | 702 | trunk passed |
| -0 | patch | 478 | Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary. |
| | _ Patch Compile Tests _ | | |
| 0 | mvndep | 30 | Maven dependency ordering for patch |
| +1 | mvninstall | 540 | the patch passed |
| +1 | compile | 372 | the patch passed |
| +1 | cc | 372 | the patch passed |
| +1 | javac | 372 | the patch passed |
| +1 | checkstyle | 77 | the patch passed |
| +1 | mvnsite | 0 | the patch passed |
| +1 | whitespace | 0 | The patch has no whitespace issues. |
| +1 | shadedclient | 751 | patch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 178 | the patch passed |
| +1 | findbugs | 654 | the patch passed |
| | _ Other Tests _ | | |
| +1 | unit | 263 | hadoop-hdds in the patch passed. |
| -1 | unit | 2130 | hadoop-ozone in the patch failed. |
| +1 | asflicense | 44 | The patch does not generate ASF License warnings. |
| | | 8527 | |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.ozone.container.common.statemachine.commandhandler.TestBlockDeletion |
| | hadoop.ozone.client.rpc.TestBlockOutputStream |
| | hadoop.ozone.scm.TestContainerSmallFile |
| | hadoop.ozone.TestSecureOzoneCluster |
| | hadoop.ozone.client.rpc.TestMultiBlockWritesWithDnFailures |
| | hadoop.ozone.client.rpc.TestBlockOutputStreamWithFailures |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | Client=19.03.1 Server=19.03.1 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/6/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/1364 |
| Optional Tests | dupname asflicense compile cc mvnsite javac unit javadoc mvninstall shadedclient findbugs checkstyle |
| uname | Linux 71935b58ecf5 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / f347c34 |
| Default Java | 1.8.0_222 |
| unit | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/6/artifact/out/patch-unit-hadoop-ozone.txt |
| Test Results | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/6/testReport/ |
| Max. process+thread count | 4726 (vs. ulimit of 5500) |
| modules | C: hadoop-hdds/common hadoop-hdds/container-service hadoop-ozone/integration-test U: . |
| Console output | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/6/console |
| versions | git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1 |
| Powered by | Apache Yetus 0.10.0 http://yetus.apache.org |

This message was automatically generated.

Issue Time Tracking
-------------------
Worklog Id: (was: 307033)
Time Spent: 8.5h (was: 8h 20m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=306990&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-306990 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 05/Sep/19 08:16 Start Date: 05/Sep/19 08:16 Worklog Time Spent: 10m Work Description: bshashikant commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r321108749 ## File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/ContainerSet.java ## @@ -240,14 +240,37 @@ public ContainerReportsProto getContainerReport() throws IOException { } /** - * Builds the missing container set by taking a diff total no containers - * actually found and number of containers which actually got created. + * Builds the missing container set by taking a diff between total no + * containers actually found and number of containers which actually + * got created. It also validates the BCSID stored in the snapshot file + * for each container as against what is reported in containerScan. * This will only be called during the initialization of Datanode Service * when it still not a part of any write Pipeline. - * @param createdContainerSet ContainerId set persisted in the Ratis snapshot + * @param container2BCSIDMap Map of containerId to BCSID persisted in the + * Ratis snapshot */ - public void buildMissingContainerSet(Set createdContainerSet) { -missingContainerSet.addAll(createdContainerSet); -missingContainerSet.removeAll(containerMap.keySet()); + public void buildMissingContainerSetAndValidate( + Map container2BCSIDMap) throws IOException { Review comment: AFAIK, the missing container determination step is the last step executed as a part of initialization of RaftGroup which calls into loadSnapshot. Just doing it in a separate thread and waiting it in the main thread may not help it to restart faster. 
We can change the code to use a parallel stream to process the map to make it faster.

Issue Time Tracking --- Worklog Id: (was: 306990) Time Spent: 8h 10m (was: 8h)
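The reviewer's suggestion above, validating the snapshot's per-container BCSIDs with a parallel stream instead of a serial loop, can be sketched roughly as follows. This is a minimal illustration only; `findStaleContainers`, its parameter names, and the plain `Map<Long, Long>` shape are hypothetical and not Ozone's actual `ContainerSet` code.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ParallelBcsidCheck {

    // Returns containers whose on-disk BCSID lags the BCSID recorded in the
    // Ratis snapshot -- i.e. committed data that was lost across a restart.
    static List<Long> findStaleContainers(Map<Long, Long> snapshotBcsids,
                                          Map<Long, Long> onDiskBcsids) {
        return snapshotBcsids.entrySet().parallelStream()
            .filter(e -> onDiskBcsids.containsKey(e.getKey()))
            .filter(e -> onDiskBcsids.get(e.getKey()) < e.getValue())
            .map(Map.Entry::getKey)
            .sorted()
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<Long, Long> snapshot = Map.of(1L, 10L, 2L, 5L);
        Map<Long, Long> onDisk = Map.of(1L, 10L, 2L, 3L); // container 2 is behind
        System.out.println(findStaleContainers(snapshot, onDisk)); // [2]
    }
}
```

Since each entry is checked independently, the work parallelizes cleanly; whether it actually speeds up restart depends on map size, as the review thread notes.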
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=306991&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-306991 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 05/Sep/19 08:16 Start Date: 05/Sep/19 08:16 Worklog Time Spent: 10m Work Description: bshashikant commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r321108749 ## File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/ContainerSet.java ## @@ -240,14 +240,37 @@ public ContainerReportsProto getContainerReport() throws IOException { } /** - * Builds the missing container set by taking a diff total no containers - * actually found and number of containers which actually got created. + * Builds the missing container set by taking a diff between total no + * containers actually found and number of containers which actually + * got created. It also validates the BCSID stored in the snapshot file + * for each container as against what is reported in containerScan. * This will only be called during the initialization of Datanode Service * when it still not a part of any write Pipeline. - * @param createdContainerSet ContainerId set persisted in the Ratis snapshot + * @param container2BCSIDMap Map of containerId to BCSID persisted in the + * Ratis snapshot */ - public void buildMissingContainerSet(Set createdContainerSet) { -missingContainerSet.addAll(createdContainerSet); -missingContainerSet.removeAll(containerMap.keySet()); + public void buildMissingContainerSetAndValidate( + Map container2BCSIDMap) throws IOException { Review comment: AFAIK, the missing container determination step is the last step executed as a part of initialization of RaftGroup which calls into loadSnapshot. Just doing it in a separate thread and waiting it in the main thread may not help it to restart faster. 
We can change the code to use a parallel stream to process the map to make it faster.

Issue Time Tracking --- Worklog Id: (was: 306991) Time Spent: 8h 20m (was: 8h 10m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=306964&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-306964 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 05/Sep/19 07:27 Start Date: 05/Sep/19 07:27 Worklog Time Spent: 10m Work Description: bshashikant commented on issue #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#issuecomment-528234226
> a) Do we sync the container db's wal on close? -- The entire rocks db is synced as a part of closing the container.
> b) Also, on restart we should run the scanner; will this identification step run before it? --- Yes, the identification step is executed before the scrubber starts.

Issue Time Tracking --- Worklog Id: (was: 306964) Time Spent: 8h (was: 7h 50m)
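The "sync on close" point in the exchange above is about forcing buffered writes to stable storage before the handle goes away. The sketch below illustrates the idea with plain `java.nio` rather than RocksDB's own API (so it stays self-contained); `FileChannel.force(true)` here plays the role that flushing/syncing the container DB plays in the patch. The class and method names are illustrative, not Ozone code.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SyncOnClose {

    // Write data and fsync before closing, so a crash immediately after
    // close() cannot silently drop the buffered bytes -- the failure mode
    // HDDS-1843 describes for unsynced chunk data and metadata.
    static void writeAndSync(Path p, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(p,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ch.write(ByteBuffer.wrap(data));
            ch.force(true); // flush data and metadata to the device
        }
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("container-wal", ".bin");
        writeAndSync(p, new byte[]{1, 2, 3});
        System.out.println(Files.size(p)); // 3
        Files.deleteIfExists(p);
    }
}
```

The trade-off is the same one the JIRA proposes measuring with a metric: each forced sync adds latency, which is why the patch syncs at Ratis-snapshot boundaries rather than on every write.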
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=306959&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-306959 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 05/Sep/19 07:21 Start Date: 05/Sep/19 07:21 Worklog Time Spent: 10m Work Description: bshashikant commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r321108749 ## File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/ContainerSet.java ## @@ -240,14 +240,37 @@ public ContainerReportsProto getContainerReport() throws IOException { } /** - * Builds the missing container set by taking a diff total no containers - * actually found and number of containers which actually got created. + * Builds the missing container set by taking a diff between total no + * containers actually found and number of containers which actually + * got created. It also validates the BCSID stored in the snapshot file + * for each container as against what is reported in containerScan. * This will only be called during the initialization of Datanode Service * when it still not a part of any write Pipeline. - * @param createdContainerSet ContainerId set persisted in the Ratis snapshot + * @param container2BCSIDMap Map of containerId to BCSID persisted in the + * Ratis snapshot */ - public void buildMissingContainerSet(Set createdContainerSet) { -missingContainerSet.addAll(createdContainerSet); -missingContainerSet.removeAll(containerMap.keySet()); + public void buildMissingContainerSetAndValidate( + Map container2BCSIDMap) throws IOException { Review comment: AFAIK, the missing container determination step is the last step executed as a part of initialization of RaftGroup which calls into loadSnapshot. Just doing it in a separate thread and waiting it in the main thread may not help it to restart faster. 
Issue Time Tracking --- Worklog Id: (was: 306959) Time Spent: 7h 50m (was: 7h 40m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=306957&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-306957 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 05/Sep/19 07:19 Start Date: 05/Sep/19 07:19 Worklog Time Spent: 10m Work Description: bshashikant commented on issue #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#issuecomment-528234226
> The patch generally looks good to me. A couple of questions; they might already be implemented.
> a) Do we sync the container db's wal on close? -- The entire rocks db is synced as a part of closing the container.
> b) Also, on restart we should run the scanner; will this identification step run before it? --- Yes, the identification step is executed before the scrubber starts.

Issue Time Tracking --- Worklog Id: (was: 306957) Time Spent: 7h 40m (was: 7.5h)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=306773&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-306773 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 04/Sep/19 23:01 Start Date: 04/Sep/19 23:01 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on issue #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#issuecomment-528124811

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Comment |
|::|--:|:|:|
| 0 | reexec | 43 | Docker mode activated. |
||| _ Prechecks _ |
| +1 | dupname | 1 | No case conflicting files found. |
| +1 | @author | 0 | The patch does not contain any @author tags. |
| +1 | test4tests | 0 | The patch appears to include 4 new or modified test files. |
||| _ trunk Compile Tests _ |
| 0 | mvndep | 70 | Maven dependency ordering for branch |
| +1 | mvninstall | 617 | trunk passed |
| +1 | compile | 395 | trunk passed |
| +1 | checkstyle | 78 | trunk passed |
| +1 | mvnsite | 0 | trunk passed |
| +1 | shadedclient | 890 | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 176 | trunk passed |
| 0 | spotbugs | 444 | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +1 | findbugs | 646 | trunk passed |
||| _ Patch Compile Tests _ |
| 0 | mvndep | 40 | Maven dependency ordering for patch |
| +1 | mvninstall | 556 | the patch passed |
| +1 | compile | 391 | the patch passed |
| +1 | cc | 391 | the patch passed |
| +1 | javac | 391 | the patch passed |
| +1 | checkstyle | 86 | the patch passed |
| +1 | mvnsite | 0 | the patch passed |
| -1 | whitespace | 0 | The patch has 25 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply |
| +1 | shadedclient | 691 | patch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 180 | the patch passed |
| +1 | findbugs | 669 | the patch passed |
||| _ Other Tests _ |
| +1 | unit | 288 | hadoop-hdds in the patch passed. |
| -1 | unit | 2037 | hadoop-ozone in the patch failed. |
| +1 | asflicense | 52 | The patch does not generate ASF License warnings. |
| | | 8110 | |

| Reason | Tests |
|---:|:--|
| Failed junit tests | hadoop.ozone.client.rpc.TestMultiBlockWritesWithDnFailures |
| | hadoop.ozone.scm.node.TestQueryNode |
| | hadoop.ozone.TestSecureOzoneCluster |
| | hadoop.ozone.client.rpc.TestBlockOutputStream |
| | hadoop.ozone.container.common.statemachine.commandhandler.TestBlockDeletion |
| | hadoop.ozone.client.rpc.TestBlockOutputStreamWithFailures |
| | hadoop.ozone.TestMiniChaosOzoneCluster |
| | hadoop.ozone.om.TestSecureOzoneManager |

| Subsystem | Report/Notes |
|--:|:-|
| Docker | Client=19.03.1 Server=19.03.1 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/5/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/1364 |
| Optional Tests | dupname asflicense compile cc mvnsite javac unit javadoc mvninstall shadedclient findbugs checkstyle |
| uname | Linux 8df624d8e13a 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / 337e9b7 |
| Default Java | 1.8.0_222 |
| whitespace | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/5/artifact/out/whitespace-eol.txt |
| unit | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/5/artifact/out/patch-unit-hadoop-ozone.txt |
| Test Results | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/5/testReport/ |
| Max. process+thread count | 5375 (vs. ulimit of 5500) |
| modules | C: hadoop-hdds/common hadoop-hdds/container-service hadoop-ozone/integration-test U: . |
| Console output | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/5/console |
| versions | git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1 |
| Powered by | Apache Yetus 0.10.0 http://yetus.apache.org |

This message was automatically generated.

Issue Time Tracking --- Worklog Id: (was: 306773) Time Spent: 7.5h (was: 7h 20m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=306630&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-306630 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 04/Sep/19 19:02 Start Date: 04/Sep/19 19:02 Worklog Time Spent: 10m Work Description: mukul1987 commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r320922823 ## File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/ContainerSet.java ## @@ -240,14 +240,37 @@ public ContainerReportsProto getContainerReport() throws IOException { } /** - * Builds the missing container set by taking a diff total no containers - * actually found and number of containers which actually got created. + * Builds the missing container set by taking a diff between total no + * containers actually found and number of containers which actually + * got created. It also validates the BCSID stored in the snapshot file + * for each container as against what is reported in containerScan. * This will only be called during the initialization of Datanode Service * when it still not a part of any write Pipeline. - * @param createdContainerSet ContainerId set persisted in the Ratis snapshot + * @param container2BCSIDMap Map of containerId to BCSID persisted in the + * Ratis snapshot */ - public void buildMissingContainerSet(Set createdContainerSet) { -missingContainerSet.addAll(createdContainerSet); -missingContainerSet.removeAll(containerMap.keySet()); + public void buildMissingContainerSetAndValidate( + Map container2BCSIDMap) throws IOException { Review comment: Lets make this multi threaded, so that on restart, this state is reached a lot faster. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Issue Time Tracking --- Worklog Id: (was: 306630) Time Spent: 7h 20m (was: 7h 10m)
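The hunk quoted in this review replaces the old set-diff (`missingContainerSet.addAll(createdContainerSet); missingContainerSet.removeAll(containerMap.keySet())`) with a version that also validates each container's BCSID against the Ratis snapshot. A hedged, self-contained reconstruction of that logic is sketched below; the class name, the flat `Map<Long, Long>` signature, and the choice to signal a stale BCSID with an `IOException` are illustrative, not the exact `ContainerSet` implementation.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class MissingContainerCheck {

    // snapshotBcsids: containerId -> BCSID persisted in the Ratis snapshot.
    // onDiskBcsids:   containerId -> BCSID actually found on disk at startup.
    static Set<Long> buildMissingSet(Map<Long, Long> snapshotBcsids,
                                     Map<Long, Long> onDiskBcsids)
            throws IOException {
        // Containers recorded as created but never found on disk are missing.
        Set<Long> missing = new HashSet<>(snapshotBcsids.keySet());
        missing.removeAll(onDiskBcsids.keySet());

        // A container that exists but whose BCSID is behind the snapshot
        // lost committed data across the restart -- the corruption this
        // JIRA is about detecting.
        for (Map.Entry<Long, Long> e : onDiskBcsids.entrySet()) {
            Long expected = snapshotBcsids.get(e.getKey());
            if (expected != null && e.getValue() < expected) {
                throw new IOException("Stale BCSID for container " + e.getKey());
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        Map<Long, Long> snapshot = new HashMap<>();
        snapshot.put(1L, 4L);
        snapshot.put(2L, 7L);
        Map<Long, Long> onDisk = new HashMap<>();
        onDisk.put(1L, 4L); // container 2 is absent on disk
        System.out.println(buildMissingSet(snapshot, onDisk)); // [2]
    }
}
```

As the thread discusses, this runs once during datanode startup (inside RaftGroup initialization via loadSnapshot), before the node joins any write pipeline.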
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=306590&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-306590 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 04/Sep/19 18:04 Start Date: 04/Sep/19 18:04 Worklog Time Spent: 10m Work Description: bshashikant commented on issue #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#issuecomment-528017062 /retest

Issue Time Tracking --- Worklog Id: (was: 306590) Time Spent: 7h 10m (was: 7h)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=306572&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-306572 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 04/Sep/19 17:45 Start Date: 04/Sep/19 17:45 Worklog Time Spent: 10m Work Description: bshashikant commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r320889606 ## File path: hadoop-hdds/common/src/main/proto/DatanodeContainerProtocol.proto ## @@ -248,8 +248,13 @@ message ContainerDataProto { optional ContainerType containerType = 10 [default = KeyValueContainer]; } -message ContainerIdSetProto { -repeated int64 containerId = 1; +message Container2BCSIDMapEntryProto { Review comment: address in the next patch.

Issue Time Tracking --- Worklog Id: (was: 306572) Time Spent: 7h (was: 6h 50m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=306570&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-306570 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 04/Sep/19 17:41 Start Date: 04/Sep/19 17:41 Worklog Time Spent: 10m Work Description: bshashikant commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r320887895 ## File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/HddsDispatcher.java ## @@ -329,6 +336,21 @@ private ContainerCommandResponseProto dispatchRequest( } } + private void updateBCSID(Container container, + DispatcherContext dispatcherContext, ContainerProtos.Type cmdType) { +long bcsID = container.getBlockCommitSequenceId(); +long containerId = container.getContainerData().getContainerID(); +Map container2BCSIDMap; +if (dispatcherContext != null && (cmdType == ContainerProtos.Type.PutBlock +|| cmdType == ContainerProtos.Type.PutSmallFile)) { + container2BCSIDMap = dispatcherContext.getContainer2BCSIDMap(); + Preconditions.checkNotNull(container2BCSIDMap); + Preconditions.checkArgument(container2BCSIDMap.containsKey(containerId)); + // updates the latest BCSID on every putBlock or putSmallFile + // transaction over Ratis. + container2BCSIDMap.computeIfPresent(containerId, (u, v) -> v = bcsID); Review comment: will address in the next patch. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Issue Time Tracking --- Worklog Id: (was: 306570) Time Spent: 6h 40m (was: 6.5h)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=306571&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-306571 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 04/Sep/19 17:41 Start Date: 04/Sep/19 17:41 Worklog Time Spent: 10m Work Description: bshashikant commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r320888068

File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/impl/BlockManagerImpl.java

@@ -282,5 +282,4 @@ public void deleteBlock(Container container, BlockID blockID) throws
   public void shutdown() {
     BlockUtils.shutdownCache(ContainerCache.getInstance(config));
   }
-

Review comment: address in the next patch.

Issue Time Tracking --- Worklog Id: (was: 306571) Time Spent: 6h 50m (was: 6h 40m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=306568&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-306568 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 04/Sep/19 17:40 Start Date: 04/Sep/19 17:40 Worklog Time Spent: 10m Work Description: bshashikant commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r320887705

File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/ContainerSet.java

@@ -240,14 +240,37 @@ public ContainerReportsProto getContainerReport() throws IOException {
   }

   /**
-   * Builds the missing container set by taking a diff total no containers
-   * actually found and number of containers which actually got created.
+   * Builds the missing container set by taking a diff between total no
+   * containers actually found and number of containers which actually
+   * got created. It also validates the BCSID stored in the snapshot file
+   * for each container as against what is reported in containerScan.
    * This will only be called during the initialization of Datanode Service
    * when it still not a part of any write Pipeline.
-   * @param createdContainerSet ContainerId set persisted in the Ratis snapshot
+   * @param container2BCSIDMap Map of containerId to BCSID persisted in the
+   *                           Ratis snapshot
    */
-  public void buildMissingContainerSet(Set<Long> createdContainerSet) {
-    missingContainerSet.addAll(createdContainerSet);
-    missingContainerSet.removeAll(containerMap.keySet());
+  public void buildMissingContainerSetAndValidate(
+      Map<Long, Long> container2BCSIDMap) throws IOException {
+    for (Map.Entry<Long, Long> mapEntry : container2BCSIDMap.entrySet()) {
+      long id = mapEntry.getKey();
+      if (!containerMap.containsKey(id)) {
+        LOG.warn("Adding container {} to missing container set.", id);
+        missingContainerSet.add(id);
+      } else {
+        Container container = containerMap.get(id);
+        long containerBCSID = container.getBlockCommitSequenceId();
+        long snapshotBCSID = mapEntry.getValue();
+        if (containerBCSID < snapshotBCSID) {
+          LOG.warn(
+              "Marking container {} unhealthy as reported BCSID {} is smaller"

Review comment: will address in the next patch.

Issue Time Tracking --- Worklog Id: (was: 306568) Time Spent: 6.5h (was: 6h 20m)
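The validation logic in the `buildMissingContainerSetAndValidate` diff above can be sketched in isolation (plain maps standing in for the container set and the persisted Ratis snapshot; all ids and BCSID values are hypothetical): a container present in the snapshot but not on disk is missing, and a container whose on-disk BCSID is behind the snapshot BCSID has lost committed data.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class MissingContainerCheck {

    /** Result of validating the snapshot map against on-disk state. */
    static final class Result {
        final Set<Long> missing = new TreeSet<>();
        final Set<Long> unhealthy = new TreeSet<>();
    }

    // Sketch of the patch's validation pass, decoupled from Ozone types.
    static Result validate(Map<Long, Long> snapshotBcsids,
                           Map<Long, Long> onDiskBcsids) {
        Result r = new Result();
        for (Map.Entry<Long, Long> e : snapshotBcsids.entrySet()) {
            long id = e.getKey();
            Long onDisk = onDiskBcsids.get(id);
            if (onDisk == null) {
                r.missing.add(id);       // container never made it to disk
            } else if (onDisk < e.getValue()) {
                r.unhealthy.add(id);     // on-disk BCSID behind the snapshot
            }
        }
        return r;
    }

    public static void main(String[] args) {
        // Hypothetical state: container 2 was lost entirely,
        // container 3 lost its last committed blocks.
        Result r = validate(Map.of(1L, 10L, 2L, 5L, 3L, 7L),
                            Map.of(1L, 10L, 3L, 4L));
        System.out.println("missing=" + r.missing
            + " unhealthy=" + r.unhealthy); // missing=[2] unhealthy=[3]
    }
}
```

This is only detectable because the snapshot BCSIDs were made durable; with buffered writes alone both sides of the comparison could silently regress together.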
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=306567&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-306567 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 04/Sep/19 17:39 Start Date: 04/Sep/19 17:39 Worklog Time Spent: 10m Work Description: bshashikant commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r320887264

File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/transport/server/ratis/ContainerStateMachine.java

@@ -258,8 +257,9 @@ private long loadSnapshot(SingleFileSnapshotInfo snapshot)
    * @throws IOException
    */
  public void persistContainerSet(OutputStream out) throws IOException {

Review comment: will address it in the next patch.

Issue Time Tracking --- Worklog Id: (was: 306567) Time Spent: 6h 20m (was: 6h 10m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=306566&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-306566 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 04/Sep/19 17:38 Start Date: 04/Sep/19 17:38 Worklog Time Spent: 10m Work Description: bshashikant commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r320886836

File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/HddsDispatcher.java

@@ -329,6 +336,21 @@ private ContainerCommandResponseProto dispatchRequest(
     }
   }

+  private void updateBCSID(Container container,
+      DispatcherContext dispatcherContext, ContainerProtos.Type cmdType) {
+    long bcsID = container.getBlockCommitSequenceId();
+    long containerId = container.getContainerData().getContainerID();
+    Map<Long, Long> container2BCSIDMap;
+    if (dispatcherContext != null && (cmdType == ContainerProtos.Type.PutBlock

Review comment: For all the cmds, the dispatcher context is not set up and will be null. We need to check for specific cmd types to get the context.
Issue Time Tracking --- Worklog Id: (was: 306566) Time Spent: 6h 10m (was: 6h)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=306390&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-306390 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 04/Sep/19 15:59 Start Date: 04/Sep/19 15:59 Worklog Time Spent: 10m Work Description: nandakumar131 commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r320841743

File path: hadoop-hdds/common/src/main/proto/DatanodeContainerProtocol.proto

@@ -248,8 +248,13 @@ message ContainerDataProto {
   optional ContainerType containerType = 10 [default = KeyValueContainer];
 }

-message ContainerIdSetProto {
-  repeated int64 containerId = 1;
+message Container2BCSIDMapEntryProto {

Review comment: Never used, can be removed.

Issue Time Tracking --- Worklog Id: (was: 306390) Time Spent: 5h 50m (was: 5h 40m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=306391&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-306391 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 04/Sep/19 15:59 Start Date: 04/Sep/19 15:59 Worklog Time Spent: 10m Work Description: nandakumar131 commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r320842136

File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/HddsDispatcher.java

@@ -329,6 +336,21 @@ private ContainerCommandResponseProto dispatchRequest(
     }
   }

+  private void updateBCSID(Container container,
+      DispatcherContext dispatcherContext, ContainerProtos.Type cmdType) {
+    long bcsID = container.getBlockCommitSequenceId();
+    long containerId = container.getContainerData().getContainerID();
+    Map<Long, Long> container2BCSIDMap;
+    if (dispatcherContext != null && (cmdType == ContainerProtos.Type.PutBlock
+        || cmdType == ContainerProtos.Type.PutSmallFile)) {
+      container2BCSIDMap = dispatcherContext.getContainer2BCSIDMap();
+      Preconditions.checkNotNull(container2BCSIDMap);
+      Preconditions.checkArgument(container2BCSIDMap.containsKey(containerId));
+      // updates the latest BCSID on every putBlock or putSmallFile
+      // transaction over Ratis.
+      container2BCSIDMap.computeIfPresent(containerId, (u, v) -> v = bcsID);

Review comment: `computeIfPresent` is not needed here, can be replaced with `Map#put`.
Issue Time Tracking --- Worklog Id: (was: 306391) Time Spent: 6h (was: 5h 50m)
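The reviewer's point can be illustrated with a minimal sketch (plain `HashMap`, hypothetical ids and values): once the preconditions guarantee the key is present, `computeIfPresent` with a lambda that ignores the old value is just an obfuscated `put`.

```java
import java.util.HashMap;
import java.util.Map;

public class BcsidUpdateSketch {
    public static void main(String[] args) {
        Map<Long, Long> container2BCSIDMap = new HashMap<>();
        container2BCSIDMap.put(1L, 0L); // container created, initial BCSID 0

        long bcsID = 42L; // hypothetical latest block commit sequence id

        // Patch's form: the lambda returns bcsID regardless of the old value
        // (in the patch it is written "(u, v) -> v = bcsID", where the
        // assignment merely rebinds the lambda parameter).
        container2BCSIDMap.computeIfPresent(1L, (k, v) -> bcsID);

        // Equivalent and simpler, given the key is known to be present:
        container2BCSIDMap.put(1L, bcsID);

        System.out.println(container2BCSIDMap.get(1L)); // 42
    }
}
```

The only behavioral difference is when the key is absent, a case already excluded by `Preconditions.checkArgument(containsKey(...))` in the patch.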
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=305687&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-305687 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 03/Sep/19 16:40 Start Date: 03/Sep/19 16:40 Worklog Time Spent: 10m Work Description: supratimdeka commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r320313492

File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/HddsDispatcher.java

@@ -228,11 +231,14 @@ private ContainerCommandResponseProto dispatchRequest(
       audit(action, eventType, params, AuditEventStatus.FAILURE, sce);
       return ContainerUtils.logAndReturnError(LOG, sce, msg);
     }
-    Preconditions.checkArgument(isWriteStage && containerIdSet != null
+    Preconditions.checkArgument(isWriteStage && container2BCSIDMap != null
         || dispatcherContext == null);
-    if (containerIdSet != null) {
+    if (container2BCSIDMap != null) {
       // adds this container to list of containers created in the pipeline
-      containerIdSet.add(containerID);
+      // with initial BCSID recorded as 0.
+      Preconditions
+          .checkArgument(!container2BCSIDMap.containsKey(containerID));

Review comment: is this assert safe? asking the question because there was no equivalent assert before. trying to imagine a situation where this might be a false alarm - is this possible?
1. container create is successful,
2. container2BCSIDMap.putIfAbsent is successful
3. datanode restarts - container2BCSIDMap is persisted, but the container itself is not persisted (create container involves a rename operation which may not be sync'ed inline)
Issue Time Tracking --- Worklog Id: (was: 305687) Time Spent: 5h 20m (was: 5h 10m)
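One way to picture the scenario the reviewer raises: if the map survives the restart (it was persisted in the Ratis snapshot) but the container create did not reach disk, replaying the create transaction would trip a strict `checkArgument(!map.containsKey(id))`. A tolerant variant (a sketch of the concern, not the patch's code; the container id is hypothetical) records the container idempotently:

```java
import java.util.HashMap;
import java.util.Map;

public class CreateReplaySketch {
    public static void main(String[] args) {
        Map<Long, Long> container2BCSIDMap = new HashMap<>();

        // First apply of the create transaction, before the restart.
        container2BCSIDMap.putIfAbsent(100L, 0L);

        // After restart the snapshot still holds the entry, and Ratis
        // replays the create. putIfAbsent makes the replay a no-op
        // instead of an assertion failure.
        container2BCSIDMap.putIfAbsent(100L, 0L);

        System.out.println(container2BCSIDMap); // {100=0}
    }
}
```

Whether the strict assert or the idempotent form is right depends on whether Ratis can replay a create against a snapshot that already recorded it, which is exactly the question posed above.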
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=305685&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-305685 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 03/Sep/19 16:40 Start Date: 03/Sep/19 16:40 Worklog Time Spent: 10m Work Description: supratimdeka commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r320287270

File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/ContainerSet.java

@@ -240,14 +240,37 @@ public ContainerReportsProto getContainerReport() throws IOException {
   }

   /**
-   * Builds the missing container set by taking a diff total no containers
-   * actually found and number of containers which actually got created.
+   * Builds the missing container set by taking a diff between total no
+   * containers actually found and number of containers which actually
+   * got created. It also validates the BCSID stored in the snapshot file
+   * for each container as against what is reported in containerScan.
    * This will only be called during the initialization of Datanode Service
    * when it still not a part of any write Pipeline.
-   * @param createdContainerSet ContainerId set persisted in the Ratis snapshot
+   * @param container2BCSIDMap Map of containerId to BCSID persisted in the
+   *                           Ratis snapshot
    */
-  public void buildMissingContainerSet(Set<Long> createdContainerSet) {
-    missingContainerSet.addAll(createdContainerSet);
-    missingContainerSet.removeAll(containerMap.keySet());
+  public void buildMissingContainerSetAndValidate(
+      Map<Long, Long> container2BCSIDMap) throws IOException {
+    for (Map.Entry<Long, Long> mapEntry : container2BCSIDMap.entrySet()) {
+      long id = mapEntry.getKey();
+      if (!containerMap.containsKey(id)) {
+        LOG.warn("Adding container {} to missing container set.", id);
+        missingContainerSet.add(id);
+      } else {
+        Container container = containerMap.get(id);
+        long containerBCSID = container.getBlockCommitSequenceId();
+        long snapshotBCSID = mapEntry.getValue();
+        if (containerBCSID < snapshotBCSID) {
+          LOG.warn(
+              "Marking container {} unhealthy as reported BCSID {} is smaller"

Review comment: argument appears to be missing. container not passed as the first argument.

Issue Time Tracking --- Worklog Id: (was: 305685) Time Spent: 5h 10m (was: 5h)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=305686&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-305686 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 03/Sep/19 16:40 Start Date: 03/Sep/19 16:40 Worklog Time Spent: 10m Work Description: supratimdeka commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r320368137

File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/HddsDispatcher.java

@@ -329,6 +336,21 @@ private ContainerCommandResponseProto dispatchRequest(
     }
   }

+  private void updateBCSID(Container container,
+      DispatcherContext dispatcherContext, ContainerProtos.Type cmdType) {
+    long bcsID = container.getBlockCommitSequenceId();
+    long containerId = container.getContainerData().getContainerID();
+    Map<Long, Long> container2BCSIDMap;
+    if (dispatcherContext != null && (cmdType == ContainerProtos.Type.PutBlock

Review comment: an alternative implementation would be to ignore the cmdType. For all requests, read the BCS ID from the map and compare it to the bcsid from the container. if the map value is lower, then update it to the value from the container.

Advantage: updateBCSID function becomes de-coupled from the knowledge of which cmdType changes the bcsid.

Disadvantage: possibly more CPU cost, because every request will pay the cost of reading the bcsid map. but it might be premature to assume that this additional cost will be significant.
Issue Time Tracking --- Worklog Id: (was: 305686)
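The cmdType-agnostic alternative the reviewer sketches could look like this (a sketch with a hypothetical helper and plain map, not the patch's code): raise the recorded BCSID to the container's current value whenever it is lower, for every request.

```java
import java.util.HashMap;
import java.util.Map;

public class CmdTypeAgnosticUpdate {
    // Raise the recorded BCSID to the container's current value if it moved
    // forward; any command type can call this safely. Note: unlike the
    // patch, Map#merge also inserts when the key is absent, whereas the
    // patch asserts the key is already present.
    static void updateBcsid(Map<Long, Long> map, long containerId,
                            long containerBcsid) {
        map.merge(containerId, containerBcsid, Math::max);
    }

    public static void main(String[] args) {
        Map<Long, Long> map = new HashMap<>();
        map.put(1L, 3L);                 // hypothetical starting BCSID
        updateBcsid(map, 1L, 5L);        // a PutBlock advanced the BCSID
        updateBcsid(map, 1L, 2L);        // a stale/lower value cannot regress it
        System.out.println(map.get(1L)); // 5
    }
}
```

The max-based update is what makes this safe to call from every code path: commands that do not advance the BCSID leave the entry unchanged.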
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=305689&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-305689 ]
ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 03/Sep/19 16:40
Start Date: 03/Sep/19 16:40
Worklog Time Spent: 10m
Work Description: supratimdeka commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#discussion_r320359663
## File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/impl/BlockManagerImpl.java ##
@@ -282,5 +282,4 @@ public void deleteBlock(Container container, BlockID blockID) throws public void shutdown() { BlockUtils.shutdownCache(ContainerCache.getInstance(config)); } -
Review comment: unintended whitespace only change?
Issue Time Tracking --- Worklog Id: (was: 305689) Time Spent: 5h 40m (was: 5.5h)
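The fix proposed in the issue description — syncing the on-disk database file when the Ratis snapshot is taken — amounts to an fsync of the metadata file. A minimal, hypothetical sketch using only the JDK (the class, method, and path are illustrative; the real change would target Ozone's RocksDB files inside the container state machine):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative sketch only: force a database file's contents and metadata to
// the storage device so buffered updates survive a datanode restart.
public final class SnapshotFileSync {
  private SnapshotFileSync() { }

  public static void syncFile(Path dbFile) throws IOException {
    try (FileChannel channel = FileChannel.open(dbFile, StandardOpenOption.WRITE)) {
      // force(true) flushes both file content and metadata (an fsync), making
      // the on-disk state consistent with what a scanner will later read.
      channel.force(true);
    }
  }
}
```

Timing this call would also give the sync metric the description suggests for measuring throughput loss.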
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=305688&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-305688 ]
ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 03/Sep/19 16:40
Start Date: 03/Sep/19 16:40
Worklog Time Spent: 10m
Work Description: supratimdeka commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#discussion_r320362456
## File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/transport/server/ratis/ContainerStateMachine.java ##
@@ -258,8 +257,9 @@ private long loadSnapshot(SingleFileSnapshotInfo snapshot) * @throws IOException */ public void persistContainerSet(OutputStream out) throws IOException {
Review comment: should we rename this function as well? given that other instances of containerSet have been renamed.
Issue Time Tracking --- Worklog Id: (was: 305688) Time Spent: 5.5h (was: 5h 20m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=305360&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-305360 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 03/Sep/19 00:34 Start Date: 03/Sep/19 00:34 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on issue #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#issuecomment-527263446 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Comment | |::|--:|:|:| | 0 | reexec | 36 | Docker mode activated. | ||| _ Prechecks _ | | +1 | dupname | 0 | No case conflicting files found. | | +1 | @author | 0 | The patch does not contain any @author tags. | | +1 | test4tests | 0 | The patch appears to include 4 new or modified test files. | ||| _ trunk Compile Tests _ | | 0 | mvndep | 25 | Maven dependency ordering for branch | | +1 | mvninstall | 580 | trunk passed | | +1 | compile | 406 | trunk passed | | +1 | checkstyle | 72 | trunk passed | | +1 | mvnsite | 0 | trunk passed | | +1 | shadedclient | 908 | branch has no errors when building and testing our client artifacts. | | +1 | javadoc | 175 | trunk passed | | 0 | spotbugs | 473 | Used deprecated FindBugs config; considering switching to SpotBugs. | | +1 | findbugs | 693 | trunk passed | ||| _ Patch Compile Tests _ | | 0 | mvndep | 37 | Maven dependency ordering for patch | | +1 | mvninstall | 609 | the patch passed | | +1 | compile | 399 | the patch passed | | +1 | cc | 399 | the patch passed | | +1 | javac | 399 | the patch passed | | -0 | checkstyle | 37 | hadoop-hdds: The patch generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0) | | -0 | checkstyle | 41 | hadoop-ozone: The patch generated 4 new + 0 unchanged - 0 fixed = 4 total (was 0) | | +1 | mvnsite | 0 | the patch passed | | -1 | whitespace | 0 | The patch has 25 line(s) that end in whitespace. Use git apply --whitespace=fix <>. 
Refer https://git-scm.com/docs/git-apply | | +1 | shadedclient | 643 | patch has no errors when building and testing our client artifacts. | | +1 | javadoc | 177 | the patch passed | | -1 | findbugs | 234 | hadoop-hdds generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) | ||| _ Other Tests _ | | +1 | unit | 275 | hadoop-hdds in the patch passed. | | -1 | unit | 2571 | hadoop-ozone in the patch failed. | | +1 | asflicense | 54 | The patch does not generate ASF License warnings. | | | | 8591 | | | Reason | Tests | |---:|:--| | FindBugs | module:hadoop-hdds | | | org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(ContainerProtos$ContainerCommandRequestProto, DispatcherContext) invokes inefficient new Long(long) constructor; use Long.valueOf(long) instead At HddsDispatcher.java:Long(long) constructor; use Long.valueOf(long) instead At HddsDispatcher.java:[line 241] | | Failed junit tests | hadoop.ozone.TestContainerOperations | | | hadoop.ozone.client.rpc.TestOzoneClientRetriesOnException | | | hadoop.ozone.scm.TestGetCommittedBlockLengthAndPutKey | | | hadoop.ozone.scm.TestXceiverClientManager | | | hadoop.ozone.TestContainerStateMachineIdempotency | | | hadoop.ozone.om.TestSecureOzoneManager | | | hadoop.ozone.container.metrics.TestContainerMetrics | | | hadoop.ozone.container.ozoneimpl.TestOzoneContainer | | | hadoop.ozone.client.rpc.TestMultiBlockWritesWithDnFailures | | | hadoop.ozone.scm.TestXceiverClientMetrics | | | hadoop.ozone.scm.TestContainerSmallFile | | | hadoop.ozone.client.rpc.Test2WayCommitInRatis | | | hadoop.ozone.container.TestContainerReplication | | | hadoop.ozone.container.common.statemachine.commandhandler.TestBlockDeletion | | Subsystem | Report/Notes | |--:|:-| | Docker | Client=19.03.1 Server=19.03.1 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/4/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/1364 | | Optional Tests | dupname asflicense compile cc mvnsite 
javac unit javadoc mvninstall shadedclient findbugs checkstyle | | uname | Linux 23e0b2fdc078 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | personality/hadoop.sh | | git revision | trunk / 915cbc9 | | Default Java | 1.8.0_222 | | checkstyle | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/4/artifact/out/diff-checkstyle-hadoop-hdds.txt | | checkstyle | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/4/artifact/out/diff-checkstyle-hadoop-ozone.txt
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=303306&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-303306 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 23:57 Start Date: 28/Aug/19 23:57 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on issue #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#issuecomment-525966992 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Comment | |::|--:|:|:| | 0 | reexec | 144 | Docker mode activated. | ||| _ Prechecks _ | | +1 | dupname | 1 | No case conflicting files found. | | +1 | @author | 0 | The patch does not contain any @author tags. | | +1 | test4tests | 0 | The patch appears to include 4 new or modified test files. | ||| _ trunk Compile Tests _ | | 0 | mvndep | 81 | Maven dependency ordering for branch | | +1 | mvninstall | 688 | trunk passed | | +1 | compile | 401 | trunk passed | | +1 | checkstyle | 86 | trunk passed | | +1 | mvnsite | 0 | trunk passed | | +1 | shadedclient | 924 | branch has no errors when building and testing our client artifacts. | | +1 | javadoc | 182 | trunk passed | | 0 | spotbugs | 444 | Used deprecated FindBugs config; considering switching to SpotBugs. | | +1 | findbugs | 666 | trunk passed | ||| _ Patch Compile Tests _ | | 0 | mvndep | 36 | Maven dependency ordering for patch | | +1 | mvninstall | 593 | the patch passed | | +1 | compile | 417 | the patch passed | | +1 | cc | 417 | the patch passed | | +1 | javac | 417 | the patch passed | | -0 | checkstyle | 42 | hadoop-hdds: The patch generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0) | | -0 | checkstyle | 40 | hadoop-ozone: The patch generated 4 new + 0 unchanged - 0 fixed = 4 total (was 0) | | +1 | mvnsite | 0 | the patch passed | | -1 | whitespace | 0 | The patch has 25 line(s) that end in whitespace. Use git apply --whitespace=fix <>. 
Refer https://git-scm.com/docs/git-apply | | +1 | shadedclient | 752 | patch has no errors when building and testing our client artifacts. | | +1 | javadoc | 179 | the patch passed | | -1 | findbugs | 272 | hadoop-hdds generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) | | -1 | findbugs | 422 | hadoop-ozone in the patch failed. | ||| _ Other Tests _ | | -1 | unit | 1102 | hadoop-hdds in the patch failed. | | -1 | unit | 506 | hadoop-ozone in the patch failed. | | +1 | asflicense | 45 | The patch does not generate ASF License warnings. | | | | 7750 | | | Reason | Tests | |---:|:--| | FindBugs | module:hadoop-hdds | | | org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(ContainerProtos$ContainerCommandRequestProto, DispatcherContext) invokes inefficient new Long(long) constructor; use Long.valueOf(long) instead At HddsDispatcher.java:Long(long) constructor; use Long.valueOf(long) instead At HddsDispatcher.java:[line 241] | | Subsystem | Report/Notes | |--:|:-| | Docker | Client=19.03.1 Server=19.03.1 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/3/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/1364 | | Optional Tests | dupname asflicense compile cc mvnsite javac unit javadoc mvninstall shadedclient findbugs checkstyle | | uname | Linux 8d1d475f5d3b 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | personality/hadoop.sh | | git revision | trunk / 6f2226a | | Default Java | 1.8.0_222 | | checkstyle | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/3/artifact/out/diff-checkstyle-hadoop-hdds.txt | | checkstyle | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/3/artifact/out/diff-checkstyle-hadoop-ozone.txt | | whitespace | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/3/artifact/out/whitespace-eol.txt | | findbugs | 
https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/3/artifact/out/new-findbugs-hadoop-hdds.html | | findbugs | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/3/artifact/out/patch-findbugs-hadoop-ozone.txt | | unit | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/3/artifact/out/patch-unit-hadoop-hdds.txt | | unit | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/3/artifact/out/patch-unit-hadoop-ozone.txt | | Test Results | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/3/testReport/ | | Max. process+thread count | 1371 (vs. ulimit of 5500) | | modules | C: hadoop-hdds/c
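The `-1 whitespace` votes in these reports can be cleared with the `git apply --whitespace=fix` command Yetus points to. A self-contained demonstration in a throwaway repository (file and patch names are made up for illustration):

```shell
set -e
workdir=$(mktemp -d)
cd "$workdir"
git init -q repo
cd repo
git config user.email ci@example.invalid
git config user.name ci
printf 'hello\n' > Example.java
git add Example.java
git commit -qm init
# Introduce a line that ends in whitespace, capture it as a patch, then revert.
printf 'hello\nworld \n' > Example.java
git diff > ../fix.patch
git checkout -q -- Example.java
# --whitespace=fix applies the patch and strips the trailing whitespace
# (a warning is printed to stderr, but the apply succeeds).
git apply --whitespace=fix ../fix.patch
tail -n1 Example.java
```

The last line of the file comes back as `world` with the trailing space removed, which is exactly what the 25 flagged lines in the patch need.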
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=303294&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-303294 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 23:22 Start Date: 28/Aug/19 23:22 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on issue #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#issuecomment-525959389 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Comment | |::|--:|:|:| | 0 | reexec | 46 | Docker mode activated. | ||| _ Prechecks _ | | +1 | dupname | 1 | No case conflicting files found. | | +1 | @author | 0 | The patch does not contain any @author tags. | | +1 | test4tests | 0 | The patch appears to include 4 new or modified test files. | ||| _ trunk Compile Tests _ | | 0 | mvndep | 78 | Maven dependency ordering for branch | | +1 | mvninstall | 667 | trunk passed | | +1 | compile | 426 | trunk passed | | +1 | checkstyle | 75 | trunk passed | | +1 | mvnsite | 0 | trunk passed | | +1 | shadedclient | 956 | branch has no errors when building and testing our client artifacts. | | +1 | javadoc | 183 | trunk passed | | 0 | spotbugs | 432 | Used deprecated FindBugs config; considering switching to SpotBugs. | | +1 | findbugs | 645 | trunk passed | ||| _ Patch Compile Tests _ | | 0 | mvndep | 39 | Maven dependency ordering for patch | | +1 | mvninstall | 561 | the patch passed | | +1 | compile | 410 | the patch passed | | +1 | cc | 410 | the patch passed | | +1 | javac | 410 | the patch passed | | -0 | checkstyle | 40 | hadoop-hdds: The patch generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0) | | -0 | checkstyle | 44 | hadoop-ozone: The patch generated 4 new + 0 unchanged - 0 fixed = 4 total (was 0) | | +1 | mvnsite | 0 | the patch passed | | -1 | whitespace | 0 | The patch has 25 line(s) that end in whitespace. Use git apply --whitespace=fix <>. 
Refer https://git-scm.com/docs/git-apply | | +1 | shadedclient | 738 | patch has no errors when building and testing our client artifacts. | | +1 | javadoc | 187 | the patch passed | | -1 | findbugs | 235 | hadoop-hdds generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) | ||| _ Other Tests _ | | -1 | unit | 1062 | hadoop-hdds in the patch failed. | | -1 | unit | 202 | hadoop-ozone in the patch failed. | | +1 | asflicense | 44 | The patch does not generate ASF License warnings. | | | | 7267 | | | Reason | Tests | |---:|:--| | FindBugs | module:hadoop-hdds | | | org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(ContainerProtos$ContainerCommandRequestProto, DispatcherContext) invokes inefficient new Long(long) constructor; use Long.valueOf(long) instead At HddsDispatcher.java:Long(long) constructor; use Long.valueOf(long) instead At HddsDispatcher.java:[line 241] | | Failed junit tests | hadoop.ozone.om.ratis.TestOzoneManagerDoubleBufferWithOMResponse | | Subsystem | Report/Notes | |--:|:-| | Docker | Client=19.03.1 Server=19.03.1 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/2/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/1364 | | Optional Tests | dupname asflicense compile cc mvnsite javac unit javadoc mvninstall shadedclient findbugs checkstyle | | uname | Linux 6ae0c04c33e0 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | personality/hadoop.sh | | git revision | trunk / 6f2226a | | Default Java | 1.8.0_222 | | checkstyle | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/2/artifact/out/diff-checkstyle-hadoop-hdds.txt | | checkstyle | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/2/artifact/out/diff-checkstyle-hadoop-ozone.txt | | whitespace | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/2/artifact/out/whitespace-eol.txt | | findbugs | 
https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/2/artifact/out/new-findbugs-hadoop-hdds.html | | unit | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/2/artifact/out/patch-unit-hadoop-hdds.txt | | unit | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/2/artifact/out/patch-unit-hadoop-ozone.txt | | Test Results | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/2/testReport/ | | Max. process+thread count | 413 (vs. ulimit of 5500) | | modules | C: hadoop-hdds/common hadoop-hdds/container-service hadoop-ozone/integration-test U: . | | Console output | ht
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302886&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302886 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 14:00 Start Date: 28/Aug/19 14:00 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597852 ## File path: hadoop-hdds/common/src/main/resources/audit.log ## @@ -0,0 +1,25 @@ +2019-08-28 11:36:31,489 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_VOLUME {admin=sbanerjee, owner=sbanerjee, volume=testcontainerstatemachinefailures, creationTime=1566972391485, quotaInBytes=1152921504606846976} | ret=SUCCESS | +2019-08-28 11:36:31,494 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,511 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, acls=[user:sbanerjee:a[ACCESS], group:staff:a[ACCESS], group:everyone:a[ACCESS], group:localaccounts:a[ACCESS], group:_appserverusr:a[ACCESS], group:admin:a[ACCESS], group:_appserveradm:a[ACCESS], group:_lpadmin:a[ACCESS], group:com.apple.sharepoint.group.2:a[ACCESS], group:_appstore:a[ACCESS], group:_lpoperator:a[ACCESS], group:_developer:a[ACCESS], group:_analyticsusers:a[ACCESS], group:com.apple.access_ftp:a[ACCESS], group:com.apple.access_screensharing:a[ACCESS], group:com.apple.access_ssh-disabled:a[ACCESS], group:com.apple.sharepoint.group.1:a[ACCESS]], isVersionEnabled=false, storageType=DISK, creationTime=0} | ret=SUCCESS | +2019-08-28 11:36:31,515 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,519 | INFO | OMAudit | 
user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,561 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=1024, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=null} | ret=SUCCESS | +2019-08-28 11:36:37,850 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=COMMIT_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=10, replicationType=null, replicationFactor=null, keyLocationInfo=[{blockID={containerID=1, localID=102693102652358657}, length=10, offset=0, token=null, pipeline=Pipeline[ Id: 33e5321e-9d61-4d31-94ca-a18a6abc80ba, Nodes: 82c96fdc-06f5-47f4-ab5d-7a2fa599b46e{ip: 192.168.0.64, host: 192.168.0.64, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:ONE, State:OPEN], createVersion=0}], clientID=102693102651244544} | ret=SUCCESS | +2019-08-28 11:36:50,166 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:50,168 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS | Review comment: whitespace:end of line This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Issue Time Tracking --- Worklog Id: (was: 302886) Time Spent: 4h 20m (was: 4h 10m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302884&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302884 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 14:00 Start Date: 28/Aug/19 14:00 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597822 ## File path: hadoop-hdds/common/src/main/resources/audit.log ## @@ -0,0 +1,25 @@ +2019-08-28 11:36:31,489 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_VOLUME {admin=sbanerjee, owner=sbanerjee, volume=testcontainerstatemachinefailures, creationTime=1566972391485, quotaInBytes=1152921504606846976} | ret=SUCCESS | +2019-08-28 11:36:31,494 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,511 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, acls=[user:sbanerjee:a[ACCESS], group:staff:a[ACCESS], group:everyone:a[ACCESS], group:localaccounts:a[ACCESS], group:_appserverusr:a[ACCESS], group:admin:a[ACCESS], group:_appserveradm:a[ACCESS], group:_lpadmin:a[ACCESS], group:com.apple.sharepoint.group.2:a[ACCESS], group:_appstore:a[ACCESS], group:_lpoperator:a[ACCESS], group:_developer:a[ACCESS], group:_analyticsusers:a[ACCESS], group:com.apple.access_ftp:a[ACCESS], group:com.apple.access_screensharing:a[ACCESS], group:com.apple.access_ssh-disabled:a[ACCESS], group:com.apple.sharepoint.group.1:a[ACCESS]], isVersionEnabled=false, storageType=DISK, creationTime=0} | ret=SUCCESS | +2019-08-28 11:36:31,515 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,519 | INFO | OMAudit | 
user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,561 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=1024, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=null} | ret=SUCCESS | +2019-08-28 11:36:37,850 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=COMMIT_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=10, replicationType=null, replicationFactor=null, keyLocationInfo=[{blockID={containerID=1, localID=102693102652358657}, length=10, offset=0, token=null, pipeline=Pipeline[ Id: 33e5321e-9d61-4d31-94ca-a18a6abc80ba, Nodes: 82c96fdc-06f5-47f4-ab5d-7a2fa599b46e{ip: 192.168.0.64, host: 192.168.0.64, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:ONE, State:OPEN], createVersion=0}], clientID=102693102651244544} | ret=SUCCESS |
Review comment: whitespace:end of line
Issue Time Tracking --- Worklog Id: (was: 302884) Time Spent: 4h (was: 3h 50m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302887&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302887 ]

ASF GitHub Bot logged work on HDDS-1843:

Author: ASF GitHub Bot
Created on: 28/Aug/19 14:01
Start Date: 28/Aug/19 14:01
Worklog Time Spent: 10m

Work Description: hadoop-yetus commented on issue #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#issuecomment-525759029

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Comment |
|:----:|----------:|:--------|:--------|
| 0 | reexec | 40 | Docker mode activated. |
||| _ Prechecks _ |
| +1 | dupname | 1 | No case conflicting files found. |
| +1 | @author | 0 | The patch does not contain any @author tags. |
| +1 | test4tests | 0 | The patch appears to include 4 new or modified test files. |
||| _ trunk Compile Tests _ |
| 0 | mvndep | 67 | Maven dependency ordering for branch |
| +1 | mvninstall | 620 | trunk passed |
| +1 | compile | 385 | trunk passed |
| +1 | checkstyle | 76 | trunk passed |
| +1 | mvnsite | 0 | trunk passed |
| +1 | shadedclient | 846 | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 177 | trunk passed |
| 0 | spotbugs | 429 | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +1 | findbugs | 629 | trunk passed |
||| _ Patch Compile Tests _ |
| 0 | mvndep | 36 | Maven dependency ordering for patch |
| +1 | mvninstall | 547 | the patch passed |
| +1 | compile | 393 | the patch passed |
| +1 | cc | 393 | the patch passed |
| +1 | javac | 393 | the patch passed |
| -0 | checkstyle | 39 | hadoop-hdds: The patch generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0) |
| -0 | checkstyle | 42 | hadoop-ozone: The patch generated 4 new + 0 unchanged - 0 fixed = 4 total (was 0) |
| +1 | mvnsite | 0 | the patch passed |
| -1 | whitespace | 0 | The patch has 25 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply |
| +1 | shadedclient | 661 | patch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 175 | the patch passed |
| -1 | findbugs | 213 | hadoop-hdds generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) |
||| _ Other Tests _ |
| -1 | unit | 1060 | hadoop-hdds in the patch failed. |
| -1 | unit | 2707 | hadoop-ozone in the patch failed. |
| +1 | asflicense | 52 | The patch does not generate ASF License warnings. |
| | | 9393 | |

| Reason | Tests |
|-------:|:------|
| FindBugs | module:hadoop-hdds |
| | org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(ContainerProtos$ContainerCommandRequestProto, DispatcherContext) invokes inefficient new Long(long) constructor; use Long.valueOf(long) instead At HddsDispatcher.java:[line 241] |
| Failed junit tests | hadoop.ozone.container.TestContainerReplication |
| | hadoop.ozone.client.rpc.TestBlockOutputStream |
| | hadoop.ozone.scm.node.TestQueryNode |
| | hadoop.ozone.client.rpc.Test2WayCommitInRatis |
| | hadoop.ozone.scm.TestXceiverClientManager |
| | hadoop.ozone.container.ozoneimpl.TestSecureOzoneContainer |
| | hadoop.ozone.container.ozoneimpl.TestOzoneContainer |
| | hadoop.ozone.TestContainerOperations |
| | hadoop.ozone.container.metrics.TestContainerMetrics |
| | hadoop.ozone.scm.TestXceiverClientMetrics |
| | hadoop.ozone.client.rpc.TestMultiBlockWritesWithDnFailures |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | Client=19.03.1 Server=19.03.1 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/1/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/1364 |
| Optional Tests | dupname asflicense compile cc mvnsite javac unit javadoc mvninstall shadedclient findbugs checkstyle |
| uname | Linux 487bb680b6dc 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / 55cc115 |
| Default Java | 1.8.0_222 |
| checkstyle | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/1/artifact/out/diff-checkstyle-hadoop-hdds.txt |
| checkstyle | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/1/artifact/out/diff-checkstyle-hadoop-ozone.txt |
| whitespace | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/1/artifact/out/whitespace-eol.txt |
| findbugs | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/1/artifact/ou
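The FindBugs item above flags `new Long(long)` in HddsDispatcher. A small self-contained sketch of why `Long.valueOf(long)` is the recommended replacement (the class below is illustrative, not the Ozone code):

```java
// Illustration of the FindBugs "inefficient new Long(long)" finding:
// the constructor always allocates a fresh object, while Long.valueOf(long)
// can return a cached instance for values in [-128, 127].
public class BoxingExample {
    // The flagged pattern (the constructor is deprecated in newer JDKs):
    static Long slow(long v) {
        return new Long(v);
    }

    // The recommended replacement:
    static Long fast(long v) {
        return Long.valueOf(v);
    }

    public static void main(String[] args) {
        // valueOf reuses cached boxes for small values...
        System.out.println(Long.valueOf(1L) == Long.valueOf(1L)); // true
        // ...while the constructor always creates a distinct object.
        System.out.println(new Long(1L) == new Long(1L));         // false
    }
}
```

On a hot dispatch path this avoids one small allocation per request, which is exactly why the static analyzer singles it out.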
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302885&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302885 ]

ASF GitHub Bot logged work on HDDS-1843:

Author: ASF GitHub Bot
Created on: 28/Aug/19 14:00
Start Date: 28/Aug/19 14:00
Worklog Time Spent: 10m

Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597836

## File path: hadoop-hdds/common/src/main/resources/audit.log

@@ -0,0 +1,25 @@
+2019-08-28 11:36:31,489 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_VOLUME {admin=sbanerjee, owner=sbanerjee, volume=testcontainerstatemachinefailures, creationTime=1566972391485, quotaInBytes=1152921504606846976} | ret=SUCCESS |
+2019-08-28 11:36:31,494 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS |
+2019-08-28 11:36:31,511 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, acls=[user:sbanerjee:a[ACCESS], group:staff:a[ACCESS], group:everyone:a[ACCESS], group:localaccounts:a[ACCESS], group:_appserverusr:a[ACCESS], group:admin:a[ACCESS], group:_appserveradm:a[ACCESS], group:_lpadmin:a[ACCESS], group:com.apple.sharepoint.group.2:a[ACCESS], group:_appstore:a[ACCESS], group:_lpoperator:a[ACCESS], group:_developer:a[ACCESS], group:_analyticsusers:a[ACCESS], group:com.apple.access_ftp:a[ACCESS], group:com.apple.access_screensharing:a[ACCESS], group:com.apple.access_ssh-disabled:a[ACCESS], group:com.apple.sharepoint.group.1:a[ACCESS]], isVersionEnabled=false, storageType=DISK, creationTime=0} | ret=SUCCESS |
+2019-08-28 11:36:31,515 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS |
+2019-08-28 11:36:31,519 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS |
+2019-08-28 11:36:31,561 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=1024, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=null} | ret=SUCCESS |
+2019-08-28 11:36:37,850 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=COMMIT_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=10, replicationType=null, replicationFactor=null, keyLocationInfo=[{blockID={containerID=1, localID=102693102652358657}, length=10, offset=0, token=null, pipeline=Pipeline[ Id: 33e5321e-9d61-4d31-94ca-a18a6abc80ba, Nodes: 82c96fdc-06f5-47f4-ab5d-7a2fa599b46e{ip: 192.168.0.64, host: 192.168.0.64, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:ONE, State:OPEN], createVersion=0}], clientID=102693102651244544} | ret=SUCCESS |
+2019-08-28 11:36:50,166 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS |

Review comment: whitespace:end of line

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---
Worklog Id: (was: 302885)
Time Spent: 4h 10m (was: 4h)

> Undetectable corruption after restart of a datanode
> ---
>
> Key: HDDS-1843
> URL: https://issues.apache.org/jira/browse/HDDS-1843
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Components: Ozone Datanode
> Affects Versions: 0.5.0
> Reporter: Shashikant Banerjee
> Assignee: Shashikant Banerjee
> Priority: Critical
> Labels: pull-request-available
> Fix For: 0.5.0
> Attachments: HDDS-1843.000.patch
> Time Spent: 4h 10m
> Remaining Estimate: 0h
>
> Right now, all write chunks use BufferedIO, i.e., the sync flag is disabled by default. Also, RocksDB metadata updates are done in the RocksDB cache first at the Datanode. If the buffered chunk data as well as the corresponding metadata update is lost as part of a datanode restart, it will not be possible to detect the corruption (not even with the container scanner) of this
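The `whitespace:end of line` review comments above (and the Yetus `-1 whitespace` vote) can be fixed mechanically when applying the patch; a sketch, using the attached patch file name for illustration:

```shell
# Show which lines in the working tree introduce whitespace errors
# such as trailing blanks:
git diff --check

# Apply a patch while stripping trailing whitespace from added lines
# (the patch file name is illustrative):
git apply --whitespace=fix HDDS-1843.000.patch
```

With `--whitespace=fix`, git rewrites offending added lines as it applies the patch instead of merely warning about them.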
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302882&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302882 ]

ASF GitHub Bot logged work on HDDS-1843:

Author: ASF GitHub Bot
Created on: 28/Aug/19 14:00
Start Date: 28/Aug/19 14:00
Worklog Time Spent: 10m

Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597791

## File path: hadoop-hdds/common/src/main/resources/audit.log

@@ -0,0 +1,25 @@
+2019-08-28 11:36:31,489 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_VOLUME {admin=sbanerjee, owner=sbanerjee, volume=testcontainerstatemachinefailures, creationTime=1566972391485, quotaInBytes=1152921504606846976} | ret=SUCCESS |
+2019-08-28 11:36:31,494 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS |
+2019-08-28 11:36:31,511 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, acls=[user:sbanerjee:a[ACCESS], group:staff:a[ACCESS], group:everyone:a[ACCESS], group:localaccounts:a[ACCESS], group:_appserverusr:a[ACCESS], group:admin:a[ACCESS], group:_appserveradm:a[ACCESS], group:_lpadmin:a[ACCESS], group:com.apple.sharepoint.group.2:a[ACCESS], group:_appstore:a[ACCESS], group:_lpoperator:a[ACCESS], group:_developer:a[ACCESS], group:_analyticsusers:a[ACCESS], group:com.apple.access_ftp:a[ACCESS], group:com.apple.access_screensharing:a[ACCESS], group:com.apple.access_ssh-disabled:a[ACCESS], group:com.apple.sharepoint.group.1:a[ACCESS]], isVersionEnabled=false, storageType=DISK, creationTime=0} | ret=SUCCESS |
+2019-08-28 11:36:31,515 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS |
+2019-08-28 11:36:31,519 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS |

Review comment: whitespace:end of line

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---
Worklog Id: (was: 302882)
Time Spent: 3h 40m (was: 3.5h)

> Undetectable corruption after restart of a datanode
> ---
>
> Key: HDDS-1843
> URL: https://issues.apache.org/jira/browse/HDDS-1843
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Components: Ozone Datanode
> Affects Versions: 0.5.0
> Reporter: Shashikant Banerjee
> Assignee: Shashikant Banerjee
> Priority: Critical
> Labels: pull-request-available
> Fix For: 0.5.0
> Attachments: HDDS-1843.000.patch
> Time Spent: 3h 40m
> Remaining Estimate: 0h
>
> Right now, all write chunks use BufferedIO, i.e., the sync flag is disabled by default. Also, RocksDB metadata updates are done in the RocksDB cache first at the Datanode. If the buffered chunk data as well as the corresponding metadata update is lost as part of a datanode restart, it may not be possible to detect corruption of this nature (not even with the container scanner) in a reasonable time frame, unless there is a client IO failure or the Recon server detects it over time. In order to at least detect the problem, the Ratis snapshot on the datanode should sync the RocksDB file. That way, the ContainerScanner will be able to detect this. We can also add a metric around sync to measure how much of a throughput loss it incurs.
> Thanks [~msingh] for suggesting this.
-- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
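The issue description proposes syncing data to stable storage so a datanode restart cannot silently drop buffered chunk writes. A minimal stdlib sketch of the relevant distinction, assuming nothing about Ozone's actual ChunkManager (class, method, and file names below are illustrative):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Illustrative only: shows the fsync step the issue asks for, not the
// actual Ozone datanode write path.
public class SyncedChunkWrite {
    static void writeChunk(Path chunkFile, byte[] data, boolean sync)
            throws IOException {
        try (FileChannel ch = FileChannel.open(chunkFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ch.write(ByteBuffer.wrap(data));
            if (sync) {
                // force(true) flushes both file contents and metadata to the
                // device, so the bytes survive a process or node restart.
                ch.force(true);
            }
            // Without force(), the bytes may still sit in OS buffers after
            // write() returns and can be lost on a crash -- the window that
            // makes the corruption described here undetectable.
        }
    }

    public static void main(String[] args) throws IOException {
        writeChunk(Paths.get("chunk_1.tmp"), "hello".getBytes(), true);
    }
}
```

The same idea applies to the metadata side: RocksDB updates held only in its write cache need an explicit flush/sync (for example at Ratis snapshot time, as the description suggests) before they can be considered durable.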
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302878&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302878 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 14:00 Start Date: 28/Aug/19 14:00 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597670 ## File path: hadoop-hdds/common/src/main/resources/audit.log ## @@ -0,0 +1,25 @@ +2019-08-28 11:36:31,489 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_VOLUME {admin=sbanerjee, owner=sbanerjee, volume=testcontainerstatemachinefailures, creationTime=1566972391485, quotaInBytes=1152921504606846976} | ret=SUCCESS | +2019-08-28 11:36:31,494 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,511 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, acls=[user:sbanerjee:a[ACCESS], group:staff:a[ACCESS], group:everyone:a[ACCESS], group:localaccounts:a[ACCESS], group:_appserverusr:a[ACCESS], group:admin:a[ACCESS], group:_appserveradm:a[ACCESS], group:_lpadmin:a[ACCESS], group:com.apple.sharepoint.group.2:a[ACCESS], group:_appstore:a[ACCESS], group:_lpoperator:a[ACCESS], group:_developer:a[ACCESS], group:_analyticsusers:a[ACCESS], group:com.apple.access_ftp:a[ACCESS], group:com.apple.access_screensharing:a[ACCESS], group:com.apple.access_ssh-disabled:a[ACCESS], group:com.apple.sharepoint.group.1:a[ACCESS]], isVersionEnabled=false, storageType=DISK, creationTime=0} | ret=SUCCESS | +2019-08-28 11:36:31,515 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,519 | INFO | OMAudit | 
user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,561 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=1024, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=null} | ret=SUCCESS | +2019-08-28 11:36:37,850 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=COMMIT_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=10, replicationType=null, replicationFactor=null, keyLocationInfo=[{blockID={containerID=1, localID=102693102652358657}, length=10, offset=0, token=null, pipeline=Pipeline[ Id: 33e5321e-9d61-4d31-94ca-a18a6abc80ba, Nodes: 82c96fdc-06f5-47f4-ab5d-7a2fa599b46e{ip: 192.168.0.64, host: 192.168.0.64, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:ONE, State:OPEN], createVersion=0}], clientID=102693102651244544} | ret=SUCCESS | +2019-08-28 11:36:50,166 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:50,168 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:50,177 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=1024, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=null} | ret=SUCCESS | +2019-08-28 11:36:56,287 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=COMMIT_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=6, replicationType=null, replicationFactor=null, keyLocationInfo=[{blockID={containerID=2, 
localID=102693103873294339}, length=6, offset=0, token=null, pipeline=Pipeline[ Id: 5c7f9b1c-2fbc-470e-8e56-1e9bbf131bcb, Nodes: 82c96fdc-06f5-47f4-ab5d-7a2fa599b46e{ip: 192.168.0.64, host: 192.168.0.64, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:ONE, State:OPEN], createVersion=0}], clientID=102693103872901122} | ret=SUCCESS | +2019-08-28 11:37:08,500 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:37:08,502 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:37:08,514 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302864&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302864 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 14:00 Start Date: 28/Aug/19 14:00 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597430 ## File path: hadoop-hdds/common/src/main/resources/audit.log ## @@ -0,0 +1,25 @@ +2019-08-28 11:36:31,489 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_VOLUME {admin=sbanerjee, owner=sbanerjee, volume=testcontainerstatemachinefailures, creationTime=1566972391485, quotaInBytes=1152921504606846976} | ret=SUCCESS | +2019-08-28 11:36:31,494 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,511 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, acls=[user:sbanerjee:a[ACCESS], group:staff:a[ACCESS], group:everyone:a[ACCESS], group:localaccounts:a[ACCESS], group:_appserverusr:a[ACCESS], group:admin:a[ACCESS], group:_appserveradm:a[ACCESS], group:_lpadmin:a[ACCESS], group:com.apple.sharepoint.group.2:a[ACCESS], group:_appstore:a[ACCESS], group:_lpoperator:a[ACCESS], group:_developer:a[ACCESS], group:_analyticsusers:a[ACCESS], group:com.apple.access_ftp:a[ACCESS], group:com.apple.access_screensharing:a[ACCESS], group:com.apple.access_ssh-disabled:a[ACCESS], group:com.apple.sharepoint.group.1:a[ACCESS]], isVersionEnabled=false, storageType=DISK, creationTime=0} | ret=SUCCESS | +2019-08-28 11:36:31,515 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,519 | INFO | OMAudit | 
user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,561 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=1024, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=null} | ret=SUCCESS | +2019-08-28 11:36:37,850 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=COMMIT_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=10, replicationType=null, replicationFactor=null, keyLocationInfo=[{blockID={containerID=1, localID=102693102652358657}, length=10, offset=0, token=null, pipeline=Pipeline[ Id: 33e5321e-9d61-4d31-94ca-a18a6abc80ba, Nodes: 82c96fdc-06f5-47f4-ab5d-7a2fa599b46e{ip: 192.168.0.64, host: 192.168.0.64, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:ONE, State:OPEN], createVersion=0}], clientID=102693102651244544} | ret=SUCCESS | +2019-08-28 11:36:50,166 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:50,168 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:50,177 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=1024, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=null} | ret=SUCCESS | +2019-08-28 11:36:56,287 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=COMMIT_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=6, replicationType=null, replicationFactor=null, keyLocationInfo=[{blockID={containerID=2, 
localID=102693103873294339}, length=6, offset=0, token=null, pipeline=Pipeline[ Id: 5c7f9b1c-2fbc-470e-8e56-1e9bbf131bcb, Nodes: 82c96fdc-06f5-47f4-ab5d-7a2fa599b46e{ip: 192.168.0.64, host: 192.168.0.64, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:ONE, State:OPEN], createVersion=0}], clientID=102693103872901122} | ret=SUCCESS | +2019-08-28 11:37:08,500 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | Review comment: whitespace:end of line This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For qu
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302876&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302876 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 14:00 Start Date: 28/Aug/19 14:00 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597642 ## File path: hadoop-hdds/common/src/main/resources/audit.log ## @@ -0,0 +1,25 @@ +2019-08-28 11:36:31,489 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_VOLUME {admin=sbanerjee, owner=sbanerjee, volume=testcontainerstatemachinefailures, creationTime=1566972391485, quotaInBytes=1152921504606846976} | ret=SUCCESS | +2019-08-28 11:36:31,494 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,511 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, acls=[user:sbanerjee:a[ACCESS], group:staff:a[ACCESS], group:everyone:a[ACCESS], group:localaccounts:a[ACCESS], group:_appserverusr:a[ACCESS], group:admin:a[ACCESS], group:_appserveradm:a[ACCESS], group:_lpadmin:a[ACCESS], group:com.apple.sharepoint.group.2:a[ACCESS], group:_appstore:a[ACCESS], group:_lpoperator:a[ACCESS], group:_developer:a[ACCESS], group:_analyticsusers:a[ACCESS], group:com.apple.access_ftp:a[ACCESS], group:com.apple.access_screensharing:a[ACCESS], group:com.apple.access_ssh-disabled:a[ACCESS], group:com.apple.sharepoint.group.1:a[ACCESS]], isVersionEnabled=false, storageType=DISK, creationTime=0} | ret=SUCCESS | +2019-08-28 11:36:31,515 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,519 | INFO | OMAudit | 
user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,561 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=1024, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=null} | ret=SUCCESS | +2019-08-28 11:36:37,850 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=COMMIT_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=10, replicationType=null, replicationFactor=null, keyLocationInfo=[{blockID={containerID=1, localID=102693102652358657}, length=10, offset=0, token=null, pipeline=Pipeline[ Id: 33e5321e-9d61-4d31-94ca-a18a6abc80ba, Nodes: 82c96fdc-06f5-47f4-ab5d-7a2fa599b46e{ip: 192.168.0.64, host: 192.168.0.64, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:ONE, State:OPEN], createVersion=0}], clientID=102693102651244544} | ret=SUCCESS | +2019-08-28 11:36:50,166 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:50,168 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:50,177 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=1024, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=null} | ret=SUCCESS | +2019-08-28 11:36:56,287 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=COMMIT_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=6, replicationType=null, replicationFactor=null, keyLocationInfo=[{blockID={containerID=2, 
localID=102693103873294339}, length=6, offset=0, token=null, pipeline=Pipeline[ Id: 5c7f9b1c-2fbc-470e-8e56-1e9bbf131bcb, Nodes: 82c96fdc-06f5-47f4-ab5d-7a2fa599b46e{ip: 192.168.0.64, host: 192.168.0.64, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:ONE, State:OPEN], createVersion=0}], clientID=102693103872901122} | ret=SUCCESS | +2019-08-28 11:37:08,500 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:37:08,502 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:37:08,514 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302877&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302877 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 14:00 Start Date: 28/Aug/19 14:00 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597660 ## File path: hadoop-hdds/common/src/main/resources/audit.log ## @@ -0,0 +1,25 @@ +2019-08-28 11:36:31,489 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_VOLUME {admin=sbanerjee, owner=sbanerjee, volume=testcontainerstatemachinefailures, creationTime=1566972391485, quotaInBytes=1152921504606846976} | ret=SUCCESS | +2019-08-28 11:36:31,494 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,511 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, acls=[user:sbanerjee:a[ACCESS], group:staff:a[ACCESS], group:everyone:a[ACCESS], group:localaccounts:a[ACCESS], group:_appserverusr:a[ACCESS], group:admin:a[ACCESS], group:_appserveradm:a[ACCESS], group:_lpadmin:a[ACCESS], group:com.apple.sharepoint.group.2:a[ACCESS], group:_appstore:a[ACCESS], group:_lpoperator:a[ACCESS], group:_developer:a[ACCESS], group:_analyticsusers:a[ACCESS], group:com.apple.access_ftp:a[ACCESS], group:com.apple.access_screensharing:a[ACCESS], group:com.apple.access_ssh-disabled:a[ACCESS], group:com.apple.sharepoint.group.1:a[ACCESS]], isVersionEnabled=false, storageType=DISK, creationTime=0} | ret=SUCCESS | +2019-08-28 11:36:31,515 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,519 | INFO | OMAudit | 
user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,561 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=1024, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=null} | ret=SUCCESS | +2019-08-28 11:36:37,850 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=COMMIT_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=10, replicationType=null, replicationFactor=null, keyLocationInfo=[{blockID={containerID=1, localID=102693102652358657}, length=10, offset=0, token=null, pipeline=Pipeline[ Id: 33e5321e-9d61-4d31-94ca-a18a6abc80ba, Nodes: 82c96fdc-06f5-47f4-ab5d-7a2fa599b46e{ip: 192.168.0.64, host: 192.168.0.64, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:ONE, State:OPEN], createVersion=0}], clientID=102693102651244544} | ret=SUCCESS | +2019-08-28 11:36:50,166 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:50,168 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:50,177 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=1024, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=null} | ret=SUCCESS | +2019-08-28 11:36:56,287 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=COMMIT_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=6, replicationType=null, replicationFactor=null, keyLocationInfo=[{blockID={containerID=2, 
localID=102693103873294339}, length=6, offset=0, token=null, pipeline=Pipeline[ Id: 5c7f9b1c-2fbc-470e-8e56-1e9bbf131bcb, Nodes: 82c96fdc-06f5-47f4-ab5d-7a2fa599b46e{ip: 192.168.0.64, host: 192.168.0.64, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:ONE, State:OPEN], createVersion=0}], clientID=102693103872901122} | ret=SUCCESS | +2019-08-28 11:37:08,500 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:37:08,502 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:37:08,514 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302873&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302873 ]
ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 14:00 Start Date: 28/Aug/19 14:00 Worklog Time Spent: 10m
Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597581
## File path: hadoop-hdds/common/src/main/resources/audit.log ##
@@ -0,0 +1,25 @@
+2019-08-28 11:36:31,489 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_VOLUME {admin=sbanerjee, owner=sbanerjee, volume=testcontainerstatemachinefailures, creationTime=1566972391485, quotaInBytes=1152921504606846976} | ret=SUCCESS |
+2019-08-28 11:36:31,494 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS |
+2019-08-28 11:36:31,511 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, acls=[user:sbanerjee:a[ACCESS], group:staff:a[ACCESS], group:everyone:a[ACCESS], group:localaccounts:a[ACCESS], group:_appserverusr:a[ACCESS], group:admin:a[ACCESS], group:_appserveradm:a[ACCESS], group:_lpadmin:a[ACCESS], group:com.apple.sharepoint.group.2:a[ACCESS], group:_appstore:a[ACCESS], group:_lpoperator:a[ACCESS], group:_developer:a[ACCESS], group:_analyticsusers:a[ACCESS], group:com.apple.access_ftp:a[ACCESS], group:com.apple.access_screensharing:a[ACCESS], group:com.apple.access_ssh-disabled:a[ACCESS], group:com.apple.sharepoint.group.1:a[ACCESS]], isVersionEnabled=false, storageType=DISK, creationTime=0} | ret=SUCCESS |
+2019-08-28 11:36:31,515 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS |
+2019-08-28 11:36:31,519 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS |
+2019-08-28 11:36:31,561 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=1024, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=null} | ret=SUCCESS |
+2019-08-28 11:36:37,850 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=COMMIT_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=10, replicationType=null, replicationFactor=null, keyLocationInfo=[{blockID={containerID=1, localID=102693102652358657}, length=10, offset=0, token=null, pipeline=Pipeline[ Id: 33e5321e-9d61-4d31-94ca-a18a6abc80ba, Nodes: 82c96fdc-06f5-47f4-ab5d-7a2fa599b46e{ip: 192.168.0.64, host: 192.168.0.64, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:ONE, State:OPEN], createVersion=0}], clientID=102693102651244544} | ret=SUCCESS |
+2019-08-28 11:36:50,166 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS |
+2019-08-28 11:36:50,168 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS |
+2019-08-28 11:36:50,177 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=1024, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=null} | ret=SUCCESS |
+2019-08-28 11:36:56,287 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=COMMIT_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=6, replicationType=null, replicationFactor=null, keyLocationInfo=[{blockID={containerID=2, localID=102693103873294339}, length=6, offset=0, token=null, pipeline=Pipeline[ Id: 5c7f9b1c-2fbc-470e-8e56-1e9bbf131bcb, Nodes: 82c96fdc-06f5-47f4-ab5d-7a2fa599b46e{ip: 192.168.0.64, host: 192.168.0.64, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:ONE, State:OPEN], createVersion=0}], clientID=102693103872901122} | ret=SUCCESS |
+2019-08-28 11:37:08,500 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS |
+2019-08-28 11:37:08,502 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS |
+2019-08-28 11:37:08,514 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302883&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302883 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 14:00 Start Date: 28/Aug/19 14:00 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597810 ## File path: hadoop-hdds/common/src/main/resources/audit.log ## Review comment: whitespace:end of line This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 302883) Time Spent: 3h 50m (was: 3h 40m)
> Undetectable corruption after restart of a datanode
> ---
> Key: HDDS-1843
> URL: https://issues.apache.org/jira/browse/HDDS-1843
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Components: Ozone Datanode
> Affects Versions: 0.5.0
> Reporter: Shashikant Banerjee
> Assignee: Shashikant Banerjee
> Priority: Critical
> Labels: pull-request-available
> Fix For: 0.5.0
> Attachments: HDDS-1843.000.patch
> Time Spent: 3h 50m
> Remaining Estimate: 0h
>
> Right now, all chunk writes use buffered IO, i.e., the sync flag is disabled by default. RocksDB metadata updates are also applied first to the RocksDB cache on the datanode. If both the buffered chunk data and the corresponding metadata update are lost across a datanode restart, corruption of this kind cannot be detected in a reasonable time frame (not even by the container scanner), unless a client IO failure occurs or the Recon server detects it over time. To at least make the problem detectable, the Ratis snapshot on the datanode should sync the RocksDB file. That way, the ContainerScanner will be able to detect it. We can also add a metric around sync to measure how much throughput loss it incurs.
> Thanks [~msingh] for suggesting this.
--
This message was sent by Atlassian Jira (v8.3.2#803003)
-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
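The proposed fix above (syncing the RocksDB file as part of the Ratis snapshot) boils down to forcing buffered writes onto stable storage instead of leaving them in OS or cache buffers. A minimal sketch of that mechanism using `FileChannel.force(true)`, which issues an fsync; the class and method names here are illustrative, not the actual Ozone or Ratis API:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SyncOnSnapshot {
    // Durably persist a file's contents, the way a snapshot step might
    // persist container metadata (hypothetical helper, not the Ozone API).
    static void writeDurably(Path path, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ch.write(ByteBuffer.wrap(data));
            // force(true) flushes both file data and metadata to disk,
            // so a restart cannot silently drop the buffered write.
            ch.force(true);
        }
    }

    public static void main(String[] args) throws IOException {
        Path meta = Files.createTempFile("container-meta", ".db");
        writeDurably(meta,
                "blockID=1,len=10".getBytes(StandardCharsets.UTF_8));
        System.out.println(
                new String(Files.readAllBytes(meta), StandardCharsets.UTF_8));
    }
}
```

Without the `force(true)` call, the write may sit in buffers and vanish on a crash or restart, which is exactly the window the issue describes; the suggested metric would measure the throughput cost of adding this sync.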
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302875&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302875 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 14:00 Start Date: 28/Aug/19 14:00 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597626 ## File path: hadoop-hdds/common/src/main/resources/audit.log ##
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302874&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302874 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 14:00 Start Date: 28/Aug/19 14:00 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597600 ## File path: hadoop-hdds/common/src/main/resources/audit.log ##
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302866&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302866 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 14:00 Start Date: 28/Aug/19 14:00 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597464 ## File path: hadoop-hdds/common/src/main/resources/audit.log ##
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302869&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302869 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 14:00 Start Date: 28/Aug/19 14:00 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597514 ## File path: hadoop-hdds/common/src/main/resources/audit.log ##
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302868&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302868 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 14:00 Start Date: 28/Aug/19 14:00 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597501 ## File path: hadoop-hdds/common/src/main/resources/audit.log ##
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302871&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302871 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 14:00 Start Date: 28/Aug/19 14:00 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597545 ## File path: hadoop-hdds/common/src/main/resources/audit.log ##
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302880&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302880 ]
ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 28/Aug/19 14:00
Start Date: 28/Aug/19 14:00
Worklog Time Spent: 10m
Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597764
## File path: hadoop-hdds/common/src/main/resources/audit.log ##
+2019-08-28 11:36:31,511 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, acls=[user:sbanerjee:a[ACCESS], group:staff:a[ACCESS], group:everyone:a[ACCESS], group:localaccounts:a[ACCESS], group:_appserverusr:a[ACCESS], group:admin:a[ACCESS], group:_appserveradm:a[ACCESS], group:_lpadmin:a[ACCESS], group:com.apple.sharepoint.group.2:a[ACCESS], group:_appstore:a[ACCESS], group:_lpoperator:a[ACCESS], group:_developer:a[ACCESS], group:_analyticsusers:a[ACCESS], group:com.apple.access_ftp:a[ACCESS], group:com.apple.access_screensharing:a[ACCESS], group:com.apple.access_ssh-disabled:a[ACCESS], group:com.apple.sharepoint.group.1:a[ACCESS]], isVersionEnabled=false, storageType=DISK, creationTime=0} | ret=SUCCESS |
Review comment: whitespace:end of line
Issue Time Tracking --- Worklog Id: (was: 302880) Time Spent: 3h 20m (was: 3h 10m)
> Undetectable corruption after restart of a datanode
> Key: HDDS-1843
> URL: https://issues.apache.org/jira/browse/HDDS-1843
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Components: Ozone Datanode
> Affects Versions: 0.5.0
> Reporter: Shashikant Banerjee
> Assignee: Shashikant Banerjee
> Priority: Critical
> Labels: pull-request-available
> Fix For: 0.5.0
> Attachments: HDDS-1843.000.patch
> Time Spent: 3h 20m
> Remaining Estimate: 0h
>
> Right now, all chunk writes use buffered I/O, i.e. the sync flag is disabled by default. Likewise, RocksDB metadata updates on the datanode land in the RocksDB cache first. If both the buffered chunk data and the corresponding metadata update are lost across a datanode restart, the resulting corruption cannot be detected in a reasonable time frame (not even by the container scanner) unless a client I/O fails or the Recon server eventually notices it. To at least make the problem detectable, the Ratis snapshot on the datanode should sync the RocksDB file; the ContainerScanner will then be able to detect the inconsistency. We can also add a metric around the sync to measure how much throughput loss it incurs.
> Thanks [~msingh] for suggesting this.
-- This message was sent by Atlassian Jira (v8.3.2#803003)
- To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
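The proposed fix — syncing the RocksDB file as part of the Ratis snapshot — boils down to forcing buffered state to disk at snapshot time, plus a timer for the suggested sync metric. A minimal sketch in plain Java of that idea (the class and method names here are illustrative, not the actual Ozone/Ratis API; the real change would flush RocksDB itself):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: at Ratis snapshot time, fsync a datanode file
// (e.g. the container RocksDB file) so acknowledged state survives a
// restart instead of sitting only in OS/RocksDB caches.
public class SnapshotSync {

    // Force a file's data and metadata to the storage device.
    public static void syncFile(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
            ch.force(true); // fsync: durable even across a crash/restart
        }
    }

    public static void main(String[] args) throws IOException {
        Path db = Files.createTempFile("container", ".db");
        Files.write(db, "metadata".getBytes(StandardCharsets.UTF_8)); // buffered write

        long start = System.nanoTime();
        syncFile(db); // the proposed sync point inside takeSnapshot()
        long micros = (System.nanoTime() - start) / 1_000;

        // Recording this duration is the "metric around sync" the issue
        // suggests for measuring the throughput cost.
        System.out.println("synced " + Files.size(db) + " bytes in " + micros + " us");
        Files.delete(db);
    }
}
```

After such a sync, a restart can no longer silently drop the metadata update, so a scanner comparing chunk data against the synced metadata can flag the missing chunk.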
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302881&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302881 ]
ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 28/Aug/19 14:00
Start Date: 28/Aug/19 14:00
Worklog Time Spent: 10m
Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597775
## File path: hadoop-hdds/common/src/main/resources/audit.log ##
+2019-08-28 11:36:31,515 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS |
Review comment: whitespace:end of line
Issue Time Tracking --- Worklog Id: (was: 302881) Time Spent: 3.5h (was: 3h 20m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302879&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302879 ]
ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 28/Aug/19 14:00
Start Date: 28/Aug/19 14:00
Worklog Time Spent: 10m
Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597746
## File path: hadoop-hdds/common/src/main/resources/audit.log ##
+2019-08-28 11:36:31,494 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS |
Review comment: whitespace:end of line
Issue Time Tracking --- Worklog Id: (was: 302879) Time Spent: 3h 10m (was: 3h)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302870&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302870 ]
ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 28/Aug/19 14:00
Start Date: 28/Aug/19 14:00
Worklog Time Spent: 10m
Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597530
## File path: hadoop-hdds/common/src/main/resources/audit.log ##
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302872&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302872 ]
ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 28/Aug/19 14:00
Start Date: 28/Aug/19 14:00
Worklog Time Spent: 10m
Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597560
## File path: hadoop-hdds/common/src/main/resources/audit.log ##
+2019-08-28 11:36:31,489 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_VOLUME {admin=sbanerjee, owner=sbanerjee, volume=testcontainerstatemachinefailures, creationTime=1566972391485, quotaInBytes=1152921504606846976} | ret=SUCCESS |
Review comment: whitespace:end of line
Issue Time Tracking --- Worklog Id: (was: 302872) Time Spent: 2h (was: 1h 50m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302867&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302867 ]
ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 28/Aug/19 14:00
Start Date: 28/Aug/19 14:00
Worklog Time Spent: 10m
Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597486
## File path: hadoop-hdds/common/src/main/resources/audit.log ##
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302863&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302863 ]
ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 28/Aug/19 14:00
Start Date: 28/Aug/19 14:00
Worklog Time Spent: 10m
Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597419
## File path: hadoop-hdds/common/src/main/resources/audit.log ##
+2019-08-28 11:36:56,287 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=COMMIT_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=6, replicationType=null, replicationFactor=null, keyLocationInfo=[{blockID={containerID=2, localID=102693103873294339}, length=6, offset=0, token=null, pipeline=Pipeline[ Id: 5c7f9b1c-2fbc-470e-8e56-1e9bbf131bcb, Nodes: 82c96fdc-06f5-47f4-ab5d-7a2fa599b46e{ip: 192.168.0.64, host: 192.168.0.64, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:ONE, State:OPEN], createVersion=0}], clientID=102693103872901122} | ret=SUCCESS |
Review comment: whitespace:end of line
Issue Time Tracking --- Worklog Id: (was: 30
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302865&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302865 ]
ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 28/Aug/19 14:00
Start Date: 28/Aug/19 14:00
Worklog Time Spent: 10m
Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597444
## File path: hadoop-hdds/common/src/main/resources/audit.log ##
+2019-08-28 11:37:08,502 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS |
Review comment: whitespace:end of line --
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302862&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302862 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 14:00 Start Date: 28/Aug/19 14:00 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597402 ## File path: hadoop-hdds/common/src/main/resources/audit.log ## (quotes the same audit.log hunk as the previous worklog entry) Review comment: whitespace:end of line Issue Time Tracking --- Worklog Id: (was: 302862) Time Spent: 20m (was: 10m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302559&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302559 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 06:17 Start Date: 28/Aug/19 06:17 Worklog Time Spent: 10m Work Description: bshashikant commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 302559) Remaining Estimate: 0h Time Spent: 10m > Undetectable corruption after restart of a datanode > --- > > Key: HDDS-1843 > URL: https://issues.apache.org/jira/browse/HDDS-1843 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Datanode > Affects Versions: 0.5.0 > Reporter: Shashikant Banerjee > Assignee: Shashikant Banerjee > Priority: Critical > Labels: pull-request-available > Fix For: 0.5.0 > > Attachments: HDDS-1843.000.patch > > Time Spent: 10m > Remaining Estimate: 0h > > Right now, all chunk writes use buffered I/O, i.e., the sync flag is disabled by > default. Also, RocksDB metadata updates are applied first to the RocksDB cache > on the Datanode. If both the buffered chunk data and the corresponding metadata > update are lost across a datanode restart, corruption of this nature cannot be > detected (not even by the container scanner) in a reasonable time frame, unless > there is a client I/O failure or the Recon server detects it over time. To at > least detect the problem, the Ratis snapshot on the datanode should sync the > RocksDB file. That way, the ContainerScanner will be able to detect it. We can > also add a metric around sync to measure how much of a throughput loss it > incurs. > Thanks [~msingh] for suggesting this. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
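The description above hinges on the difference between a buffered write (which a crash or restart can silently discard) and a synced write (which survives it). A minimal sketch of that distinction, using only the JDK's `FileChannel` — the class name `ChunkSyncSketch` and the `writeChunk` helper are hypothetical illustrations, not Ozone code:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import static java.nio.file.StandardOpenOption.CREATE;
import static java.nio.file.StandardOpenOption.WRITE;

public class ChunkSyncSketch {

    // Hypothetical helper: writes a chunk buffer to a file. When sync is true,
    // it forces both data and file metadata to stable storage before returning,
    // so the bytes survive a process or node crash. When sync is false, the
    // write may sit in OS buffers -- the scenario this JIRA describes, where a
    // datanode restart can drop both the chunk data and the metadata update.
    static void writeChunk(Path file, ByteBuffer data, boolean sync) throws IOException {
        try (FileChannel ch = FileChannel.open(file, CREATE, WRITE)) {
            while (data.hasRemaining()) {
                ch.write(data);
            }
            if (sync) {
                // Equivalent of fsync(2): flush data and metadata to disk.
                ch.force(true);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path chunk = Paths.get("chunk_1.tmp");
        writeChunk(chunk, ByteBuffer.wrap("hello".getBytes()), true);
        System.out.println(chunk.toFile().length()); // prints 5
    }
}
```

The proposed fix takes the cheaper middle ground: rather than syncing on every chunk write (which the suggested metric would show as a throughput loss), it syncs the RocksDB file at Ratis snapshot time, bounding how much acknowledged state a restart can silently lose.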