[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=309585&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309585 ]

ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 10/Sep/19 07:26
Start Date: 10/Sep/19 07:26
Worklog Time Spent: 10m

Work Description: mukul1987 commented on issue #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#issuecomment-529809131

/label ozone

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------
Worklog Id: (was: 309585)
Time Spent: 10h (was: 9h 50m)

> Undetectable corruption after restart of a datanode
> ---
>
>                 Key: HDDS-1843
>                 URL: https://issues.apache.org/jira/browse/HDDS-1843
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: Ozone Datanode
>    Affects Versions: 0.5.0
>            Reporter: Shashikant Banerjee
>            Assignee: Shashikant Banerjee
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.5.0
>
>         Attachments: HDDS-1843.000.patch
>
>          Time Spent: 10h
>  Remaining Estimate: 0h
>
> Right now, all chunk writes use buffered IO, i.e., the sync flag is disabled
> by default. Similarly, RocksDB metadata updates on the datanode are applied
> to the RocksDB cache first. If both the buffered chunk data and the
> corresponding metadata update are lost across a datanode restart, the
> resulting corruption cannot be detected in a reasonable time frame (not even
> by the container scanner) unless there is a client IO failure or the Recon
> server notices it over time. To at least make the problem detectable, the
> Ratis snapshot on the datanode should sync the RocksDB file; the
> ContainerScanner will then be able to detect the corruption. We can also add
> a metric around the sync to measure how much throughput loss it incurs.
> Thanks [~msingh] for suggesting this.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
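For context on the sync flag the description mentions: with buffered IO the chunk bytes may sit only in OS page cache until the kernel flushes them, so a restart can silently discard them. A minimal JDK-only sketch of the difference (the class, file names, and `sync` parameter are hypothetical stand-ins, not Ozone's actual chunk writer):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch: write a "chunk" and, when sync is enabled, fsync it so the bytes
// survive a crash/restart. With sync disabled (the default described in the
// issue), the data may live only in OS buffers until the kernel flushes it.
public class ChunkWriteSketch {
  public static void writeChunk(Path path, byte[] data, boolean sync)
      throws IOException {
    try (FileChannel ch = FileChannel.open(path,
        StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
      ch.write(ByteBuffer.wrap(data));
      if (sync) {
        ch.force(true);  // flush file data and metadata to the device
      }
    }
  }

  public static void main(String[] args) throws IOException {
    Path p = Files.createTempFile("chunk", ".dat");
    writeChunk(p, new byte[]{1, 2, 3}, true);
    System.out.println("wrote " + Files.size(p) + " bytes durably");
  }
}
```

The metric proposed in the issue would measure exactly the cost of that `force(true)` call (and of the analogous RocksDB sync at Ratis snapshot time), since it blocks until the device acknowledges the write.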
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=309083&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309083 ]

ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 09/Sep/19 17:16
Start Date: 09/Sep/19 17:16
Worklog Time Spent: 10m

Work Description: bshashikant commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364

Issue Time Tracking
-------------------
Worklog Id: (was: 309083)
Time Spent: 9h 50m (was: 9h 40m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=309082&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309082 ]

ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 09/Sep/19 17:15
Start Date: 09/Sep/19 17:15
Worklog Time Spent: 10m

Work Description: bshashikant commented on issue #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#issuecomment-529579125

Thanks @nandakumar131 @mukul1987 @supratimdeka for the reviews. I have committed this change to trunk.

Issue Time Tracking
-------------------
Worklog Id: (was: 309082)
Time Spent: 9h 40m (was: 9.5h)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=307831&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-307831 ]

ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 06/Sep/19 13:09
Start Date: 06/Sep/19 13:09
Worklog Time Spent: 10m

Work Description: hadoop-yetus commented on issue #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#issuecomment-528848045

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Comment |
|:----:|----------:|:--------|:--------|
| 0 | reexec | 39 | Docker mode activated. |
| | _ Prechecks _ | | |
| +1 | dupname | 1 | No case conflicting files found. |
| +1 | @author | 0 | The patch does not contain any @author tags. |
| +1 | test4tests | 0 | The patch appears to include 4 new or modified test files. |
| | _ trunk Compile Tests _ | | |
| 0 | mvndep | 66 | Maven dependency ordering for branch |
| +1 | mvninstall | 588 | trunk passed |
| +1 | compile | 382 | trunk passed |
| +1 | checkstyle | 81 | trunk passed |
| +1 | mvnsite | 0 | trunk passed |
| +1 | shadedclient | 872 | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 179 | trunk passed |
| 0 | spotbugs | 421 | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +1 | findbugs | 616 | trunk passed |
| | _ Patch Compile Tests _ | | |
| 0 | mvndep | 41 | Maven dependency ordering for patch |
| +1 | mvninstall | 575 | the patch passed |
| +1 | compile | 389 | the patch passed |
| +1 | cc | 389 | the patch passed |
| +1 | javac | 389 | the patch passed |
| +1 | checkstyle | 83 | the patch passed |
| +1 | mvnsite | 0 | the patch passed |
| +1 | whitespace | 0 | The patch has no whitespace issues. |
| +1 | shadedclient | 700 | patch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 177 | the patch passed |
| +1 | findbugs | 687 | the patch passed |
| | _ Other Tests _ | | |
| +1 | unit | 298 | hadoop-hdds in the patch passed. |
| -1 | unit | 200 | hadoop-ozone in the patch failed. |
| +1 | asflicense | 48 | The patch does not generate ASF License warnings. |
| | | 6200 | |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.ozone.om.ratis.TestOzoneManagerRatisServer |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | Client=19.03.1 Server=19.03.1 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/9/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/1364 |
| Optional Tests | dupname asflicense compile cc mvnsite javac unit javadoc mvninstall shadedclient findbugs checkstyle |
| uname | Linux 806a3bdb7024 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / 6e4cdf8 |
| Default Java | 1.8.0_222 |
| unit | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/9/artifact/out/patch-unit-hadoop-ozone.txt |
| Test Results | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/9/testReport/ |
| Max. process+thread count | 1298 (vs. ulimit of 5500) |
| modules | C: hadoop-hdds/common hadoop-hdds/container-service hadoop-ozone/integration-test U: . |
| Console output | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/9/console |
| versions | git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1 |
| Powered by | Apache Yetus 0.10.0 http://yetus.apache.org |

This message was automatically generated.
Issue Time Tracking
-------------------
Worklog Id: (was: 307831)
Time Spent: 9.5h (was: 9h 20m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=307767&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-307767 ]

ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 06/Sep/19 11:25
Start Date: 06/Sep/19 11:25
Worklog Time Spent: 10m

Work Description: bshashikant commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364

Issue Time Tracking
-------------------
Worklog Id: (was: 307767)
Time Spent: 9h 20m (was: 9h 10m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=307755&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-307755 ]

ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 06/Sep/19 11:08
Start Date: 06/Sep/19 11:08
Worklog Time Spent: 10m

Work Description: bshashikant commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364

Issue Time Tracking
-------------------
Worklog Id: (was: 307755)
Time Spent: 9h 10m (was: 9h)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=307714&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-307714 ]

ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 06/Sep/19 09:21
Start Date: 06/Sep/19 09:21
Worklog Time Spent: 10m

Work Description: nandakumar131 commented on issue #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#issuecomment-528781001

@bshashikant you might need to rebase the changes on top of HDDS-1561. Even though there is no conflict, the compilation fails.

Issue Time Tracking
-------------------
Worklog Id: (was: 307714)
Time Spent: 9h (was: 8h 50m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=307712&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-307712 ]

ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 06/Sep/19 09:14
Start Date: 06/Sep/19 09:14
Worklog Time Spent: 10m

Work Description: nandakumar131 commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#discussion_r321645502

## File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/ContainerSet.java ##

    @@ -240,14 +240,46 @@ public ContainerReportsProto getContainerReport() throws IOException {
       }

       /**
    -   * Builds the missing container set by taking a diff total no containers
    -   * actually found and number of containers which actually got created.
    +   * Builds the missing container set by taking a diff between total no
    +   * containers actually found and number of containers which actually
    +   * got created. It also validates the BCSID stored in the snapshot file
    +   * for each container as against what is reported in containerScan.
        * This will only be called during the initialization of Datanode Service
        * when it still not a part of any write Pipeline.
    -   * @param createdContainerSet ContainerId set persisted in the Ratis snapshot
    +   * @param container2BCSIDMap Map of containerId to BCSID persisted in the
    +   *                           Ratis snapshot
        */
    -  public void buildMissingContainerSet(Set<Long> createdContainerSet) {
    -    missingContainerSet.addAll(createdContainerSet);
    -    missingContainerSet.removeAll(containerMap.keySet());
    +  public void buildMissingContainerSetAndValidate(
    +      Map<Long, Long> container2BCSIDMap) {
    +    container2BCSIDMap.entrySet().parallelStream().forEach((mapEntry) -> {
    +      long id = mapEntry.getKey();
    +      if (!containerMap.containsKey(id)) {
    +        LOG.warn("Adding container {} to missing container set.", id);
    +        missingContainerSet.add(id);
    +      } else {
    +        Container container = containerMap.get(id);
    +        long containerBCSID = container.getBlockCommitSequenceId();
    +        long snapshotBCSID = mapEntry.getValue();
    +        if (containerBCSID < snapshotBCSID) {
    +          LOG.warn(
    +              "Marking container {} unhealthy as reported BCSID {} is smaller"
    +                  + " than ratis snapshot recorded value {}", id,
    +              containerBCSID, snapshotBCSID);
    +          // Just mark the container unhealthy. Once the DatanodeStateMachine
    +          // thread starts, it will send a container report to SCM where these
    +          // unhealthy containers will be detected.
    +          try {
    +            container.markContainerUnhealthy();
    +          } catch (StorageContainerException sce) {
    +            // The container will still be marked unhealthy in memory even if
    +            // an exception occurs. It won't accept any new transactions and
    +            // will be handled by SCM. Even if the DN restarts, it will still
    +            // be detected as unhealthy as its BCSID won't change.
    +            LOG.info("Unable to persist unhealthy state for container {}", id);

Review comment: This should be `LOG.error`

Issue Time Tracking
-------------------
Worklog Id: (was: 307712)
Time Spent: 8h 50m (was: 8h 40m)
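The validation being reviewed above can be illustrated with a self-contained sketch. The class name, the plain maps, and the two result sets are hypothetical stand-ins for Ozone's ContainerSet, Container, and SCM reporting machinery; only the comparison logic mirrors the patch:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch of buildMissingContainerSetAndValidate: compare the
// containerId -> BCSID map recorded in the Ratis snapshot against the
// containers actually found on disk after a restart.
public class BcsidValidator {

  // Containers whose id appears in the snapshot but not on disk.
  public static final Set<Long> missing = ConcurrentHashMap.newKeySet();
  // Containers found on disk whose BCSID lags the snapshot's BCSID,
  // i.e. committed transactions were lost across the restart.
  public static final Set<Long> unhealthy = ConcurrentHashMap.newKeySet();

  public static void validate(Map<Long, Long> snapshotBcsids,
                              Map<Long, Long> onDiskBcsids) {
    snapshotBcsids.forEach((id, snapshotBcsid) -> {
      Long diskBcsid = onDiskBcsids.get(id);
      if (diskBcsid == null) {
        missing.add(id);        // container vanished entirely
      } else if (diskBcsid < snapshotBcsid) {
        unhealthy.add(id);      // container lost committed transactions
      }
    });
  }

  public static void main(String[] args) {
    Map<Long, Long> snapshot = Map.of(1L, 100L, 2L, 100L, 3L, 100L);
    // Container 3 is gone; container 2 is behind the snapshot BCSID.
    Map<Long, Long> onDisk = Map.of(1L, 100L, 2L, 90L);
    validate(snapshot, onDisk);
    System.out.println("missing=" + missing + " unhealthy=" + unhealthy);
  }
}
```

This is why syncing RocksDB at snapshot time matters: the snapshot's BCSID is only a trustworthy reference point for this comparison if the metadata it is compared against was durably persisted.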
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=307074&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-307074 ]

ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 05/Sep/19 11:11
Start Date: 05/Sep/19 11:11
Worklog Time Spent: 10m

Work Description: hadoop-yetus commented on issue #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#issuecomment-528314940

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Comment |
|:----:|----------:|:--------|:--------|
| 0 | reexec | 41 | Docker mode activated. |
| | _ Prechecks _ | | |
| +1 | dupname | 1 | No case conflicting files found. |
| +1 | @author | 0 | The patch does not contain any @author tags. |
| +1 | test4tests | 0 | The patch appears to include 4 new or modified test files. |
| | _ trunk Compile Tests _ | | |
| 0 | mvndep | 65 | Maven dependency ordering for branch |
| +1 | mvninstall | 581 | trunk passed |
| +1 | compile | 376 | trunk passed |
| +1 | checkstyle | 81 | trunk passed |
| +1 | mvnsite | 0 | trunk passed |
| +1 | shadedclient | 869 | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 178 | trunk passed |
| 0 | spotbugs | 416 | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +1 | findbugs | 611 | trunk passed |
| -0 | patch | 475 | Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary. |
| | _ Patch Compile Tests _ | | |
| 0 | mvndep | 38 | Maven dependency ordering for patch |
| +1 | mvninstall | 538 | the patch passed |
| +1 | compile | 391 | the patch passed |
| +1 | cc | 391 | the patch passed |
| +1 | javac | 391 | the patch passed |
| -0 | checkstyle | 43 | hadoop-hdds: The patch generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) |
| +1 | mvnsite | 0 | the patch passed |
| +1 | whitespace | 0 | The patch has no whitespace issues. |
| +1 | shadedclient | 665 | patch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 175 | the patch passed |
| +1 | findbugs | 631 | the patch passed |
| | _ Other Tests _ | | |
| +1 | unit | 283 | hadoop-hdds in the patch passed. |
| -1 | unit | 2556 | hadoop-ozone in the patch failed. |
| +1 | asflicense | 54 | The patch does not generate ASF License warnings. |
| | | 8415 | |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.ozone.client.rpc.TestMultiBlockWritesWithDnFailures |
| | hadoop.ozone.scm.node.TestQueryNode |
| | hadoop.ozone.container.common.statemachine.commandhandler.TestCloseContainerHandler |
| | hadoop.ozone.TestSecureOzoneCluster |
| | hadoop.ozone.client.rpc.TestCommitWatcher |
| | hadoop.ozone.container.common.statemachine.commandhandler.TestDeleteContainerHandler |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | Client=19.03.1 Server=19.03.1 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/8/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/1364 |
| Optional Tests | dupname asflicense compile cc mvnsite javac unit javadoc mvninstall shadedclient findbugs checkstyle |
| uname | Linux 8b805e111dba 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / 172bcd8 |
| Default Java | 1.8.0_222 |
| checkstyle | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/8/artifact/out/diff-checkstyle-hadoop-hdds.txt |
| unit | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/8/artifact/out/patch-unit-hadoop-ozone.txt |
| Test Results | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/8/testReport/ |
| Max. process+thread count | 5406 (vs. ulimit of 5500) |
| modules | C: hadoop-hdds/common hadoop-hdds/container-service hadoop-ozone/integration-test U: . |
| Console output | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/8/console |
| versions | git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1 |
| Powered by | Apache Yetus 0.10.0 http://yetus.apache.org |

This message was automatically generated.

Issue Time Tracking
-------------------
Worklog Id:
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=307033&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-307033 ]

ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 05/Sep/19 09:46
Start Date: 05/Sep/19 09:46
Worklog Time Spent: 10m

Work Description: hadoop-yetus commented on issue #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#issuecomment-528287642

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Comment |
|:----:|----------:|:--------|:--------|
| 0 | reexec | 39 | Docker mode activated. |
| | _ Prechecks _ | | |
| +1 | dupname | 0 | No case conflicting files found. |
| +1 | @author | 0 | The patch does not contain any @author tags. |
| +1 | test4tests | 0 | The patch appears to include 4 new or modified test files. |
| | _ trunk Compile Tests _ | | |
| 0 | mvndep | 71 | Maven dependency ordering for branch |
| +1 | mvninstall | 720 | trunk passed |
| +1 | compile | 422 | trunk passed |
| +1 | checkstyle | 85 | trunk passed |
| +1 | mvnsite | 0 | trunk passed |
| +1 | shadedclient | 1086 | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 208 | trunk passed |
| 0 | spotbugs | 436 | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +1 | findbugs | 702 | trunk passed |
| -0 | patch | 478 | Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary. |
| | _ Patch Compile Tests _ | | |
| 0 | mvndep | 30 | Maven dependency ordering for patch |
| +1 | mvninstall | 540 | the patch passed |
| +1 | compile | 372 | the patch passed |
| +1 | cc | 372 | the patch passed |
| +1 | javac | 372 | the patch passed |
| +1 | checkstyle | 77 | the patch passed |
| +1 | mvnsite | 0 | the patch passed |
| +1 | whitespace | 0 | The patch has no whitespace issues. |
| +1 | shadedclient | 751 | patch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 178 | the patch passed |
| +1 | findbugs | 654 | the patch passed |
| | _ Other Tests _ | | |
| +1 | unit | 263 | hadoop-hdds in the patch passed. |
| -1 | unit | 2130 | hadoop-ozone in the patch failed. |
| +1 | asflicense | 44 | The patch does not generate ASF License warnings. |
| | | 8527 | |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.ozone.container.common.statemachine.commandhandler.TestBlockDeletion |
| | hadoop.ozone.client.rpc.TestBlockOutputStream |
| | hadoop.ozone.scm.TestContainerSmallFile |
| | hadoop.ozone.TestSecureOzoneCluster |
| | hadoop.ozone.client.rpc.TestMultiBlockWritesWithDnFailures |
| | hadoop.ozone.client.rpc.TestBlockOutputStreamWithFailures |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | Client=19.03.1 Server=19.03.1 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/6/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/1364 |
| Optional Tests | dupname asflicense compile cc mvnsite javac unit javadoc mvninstall shadedclient findbugs checkstyle |
| uname | Linux 71935b58ecf5 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / f347c34 |
| Default Java | 1.8.0_222 |
| unit | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/6/artifact/out/patch-unit-hadoop-ozone.txt |
| Test Results | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/6/testReport/ |
| Max. process+thread count | 4726 (vs. ulimit of 5500) |
| modules | C: hadoop-hdds/common hadoop-hdds/container-service hadoop-ozone/integration-test U: . |
| Console output | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/6/console |
| versions | git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1 |
| Powered by | Apache Yetus 0.10.0 http://yetus.apache.org |

This message was automatically generated.

Issue Time Tracking
-------------------
Worklog Id: (was: 307033)
Time Spent: 8.5h (was: 8h 20m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=306990&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-306990 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 05/Sep/19 08:16 Start Date: 05/Sep/19 08:16 Worklog Time Spent: 10m Work Description: bshashikant commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r321108749 ## File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/ContainerSet.java ## @@ -240,14 +240,37 @@ public ContainerReportsProto getContainerReport() throws IOException { } /** - * Builds the missing container set by taking a diff total no containers - * actually found and number of containers which actually got created. + * Builds the missing container set by taking a diff between total no + * containers actually found and number of containers which actually + * got created. It also validates the BCSID stored in the snapshot file + * for each container as against what is reported in containerScan. * This will only be called during the initialization of Datanode Service * when it still not a part of any write Pipeline. - * @param createdContainerSet ContainerId set persisted in the Ratis snapshot + * @param container2BCSIDMap Map of containerId to BCSID persisted in the + * Ratis snapshot */ - public void buildMissingContainerSet(Set createdContainerSet) { -missingContainerSet.addAll(createdContainerSet); -missingContainerSet.removeAll(containerMap.keySet()); + public void buildMissingContainerSetAndValidate( + Map container2BCSIDMap) throws IOException { Review comment: AFAIK, the missing container determination step is the last step executed as a part of initialization of RaftGroup which calls into loadSnapshot. Just doing it in a separate thread and waiting it in the main thread may not help it to restart faster. 
We can change the code to use a parallel stream to process the map to make it faster.

Issue Time Tracking --- Worklog Id: (was: 306990) Time Spent: 8h 10m (was: 8h)
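The reviewer's suggestion above, validating the snapshot's per-container BCSIDs with a parallel stream instead of a serial loop, can be sketched roughly as follows. This is a minimal illustration only; `findStaleContainers`, its parameter names, and the plain `Map<Long, Long>` shape are hypothetical and not Ozone's actual `ContainerSet` code.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ParallelBcsidCheck {

    // Returns containers whose on-disk BCSID lags the BCSID recorded in the
    // Ratis snapshot -- i.e. committed data that was lost across a restart.
    static List<Long> findStaleContainers(Map<Long, Long> snapshotBcsids,
                                          Map<Long, Long> onDiskBcsids) {
        return snapshotBcsids.entrySet().parallelStream()
            .filter(e -> onDiskBcsids.containsKey(e.getKey()))
            .filter(e -> onDiskBcsids.get(e.getKey()) < e.getValue())
            .map(Map.Entry::getKey)
            .sorted()
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<Long, Long> snapshot = Map.of(1L, 10L, 2L, 5L);
        Map<Long, Long> onDisk = Map.of(1L, 10L, 2L, 3L); // container 2 is behind
        System.out.println(findStaleContainers(snapshot, onDisk)); // [2]
    }
}
```

Since each entry is checked independently, the work parallelizes cleanly; whether it actually speeds up restart depends on map size, as the review thread notes.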
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=306991&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-306991 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 05/Sep/19 08:16 Start Date: 05/Sep/19 08:16 Worklog Time Spent: 10m Work Description: bshashikant commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r321108749 ## File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/ContainerSet.java ## @@ -240,14 +240,37 @@ public ContainerReportsProto getContainerReport() throws IOException { } /** - * Builds the missing container set by taking a diff total no containers - * actually found and number of containers which actually got created. + * Builds the missing container set by taking a diff between total no + * containers actually found and number of containers which actually + * got created. It also validates the BCSID stored in the snapshot file + * for each container as against what is reported in containerScan. * This will only be called during the initialization of Datanode Service * when it still not a part of any write Pipeline. - * @param createdContainerSet ContainerId set persisted in the Ratis snapshot + * @param container2BCSIDMap Map of containerId to BCSID persisted in the + * Ratis snapshot */ - public void buildMissingContainerSet(Set createdContainerSet) { -missingContainerSet.addAll(createdContainerSet); -missingContainerSet.removeAll(containerMap.keySet()); + public void buildMissingContainerSetAndValidate( + Map container2BCSIDMap) throws IOException { Review comment: AFAIK, the missing container determination step is the last step executed as a part of initialization of RaftGroup which calls into loadSnapshot. Just doing it in a separate thread and waiting it in the main thread may not help it to restart faster. 
We can change the code to use a parallel stream to process the map to make it faster.

Issue Time Tracking --- Worklog Id: (was: 306991) Time Spent: 8h 20m (was: 8h 10m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=306964&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-306964 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 05/Sep/19 07:27 Start Date: 05/Sep/19 07:27 Worklog Time Spent: 10m Work Description: bshashikant commented on issue #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#issuecomment-528234226
> a) Do we sync the container db's wal on close? -- The entire rocks db is synced as a part of closing the container.
> b) Also, on restart we should run the scanner; will this identification step run before it? --- Yes, the identification step is executed before the scrubber starts.

Issue Time Tracking --- Worklog Id: (was: 306964) Time Spent: 8h (was: 7h 50m)
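The "sync on close" point in the exchange above is about forcing buffered writes to stable storage before the handle goes away. The sketch below illustrates the idea with plain `java.nio` rather than RocksDB's own API (so it stays self-contained); `FileChannel.force(true)` here plays the role that flushing/syncing the container DB plays in the patch. The class and method names are illustrative, not Ozone code.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SyncOnClose {

    // Write data and fsync before closing, so a crash immediately after
    // close() cannot silently drop the buffered bytes -- the failure mode
    // HDDS-1843 describes for unsynced chunk data and metadata.
    static void writeAndSync(Path p, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(p,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ch.write(ByteBuffer.wrap(data));
            ch.force(true); // flush data and metadata to the device
        }
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("container-wal", ".bin");
        writeAndSync(p, new byte[]{1, 2, 3});
        System.out.println(Files.size(p)); // 3
        Files.deleteIfExists(p);
    }
}
```

The trade-off is the same one the JIRA proposes measuring with a metric: each forced sync adds latency, which is why the patch syncs at Ratis-snapshot boundaries rather than on every write.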
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=306959&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-306959 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 05/Sep/19 07:21 Start Date: 05/Sep/19 07:21 Worklog Time Spent: 10m Work Description: bshashikant commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r321108749 ## File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/ContainerSet.java ## @@ -240,14 +240,37 @@ public ContainerReportsProto getContainerReport() throws IOException { } /** - * Builds the missing container set by taking a diff total no containers - * actually found and number of containers which actually got created. + * Builds the missing container set by taking a diff between total no + * containers actually found and number of containers which actually + * got created. It also validates the BCSID stored in the snapshot file + * for each container as against what is reported in containerScan. * This will only be called during the initialization of Datanode Service * when it still not a part of any write Pipeline. - * @param createdContainerSet ContainerId set persisted in the Ratis snapshot + * @param container2BCSIDMap Map of containerId to BCSID persisted in the + * Ratis snapshot */ - public void buildMissingContainerSet(Set createdContainerSet) { -missingContainerSet.addAll(createdContainerSet); -missingContainerSet.removeAll(containerMap.keySet()); + public void buildMissingContainerSetAndValidate( + Map container2BCSIDMap) throws IOException { Review comment: AFAIK, the missing container determination step is the last step executed as a part of initialization of RaftGroup which calls into loadSnapshot. Just doing it in a separate thread and waiting it in the main thread may not help it to restart faster. 
Issue Time Tracking --- Worklog Id: (was: 306959) Time Spent: 7h 50m (was: 7h 40m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=306957&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-306957 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 05/Sep/19 07:19 Start Date: 05/Sep/19 07:19 Worklog Time Spent: 10m Work Description: bshashikant commented on issue #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#issuecomment-528234226
> The patch generally looks good to me. A couple of questions; they might already be implemented.
> a) Do we sync the container db's wal on close? -- The entire rocks db is synced as a part of closing the container.
> b) Also, on restart we should run the scanner; will this identification step run before it? --- Yes, the identification step is executed before the scrubber starts.

Issue Time Tracking --- Worklog Id: (was: 306957) Time Spent: 7h 40m (was: 7.5h)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=306773&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-306773 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 04/Sep/19 23:01 Start Date: 04/Sep/19 23:01 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on issue #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#issuecomment-528124811

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Comment |
|::|--:|:|:|
| 0 | reexec | 43 | Docker mode activated. |
||| _ Prechecks _ |
| +1 | dupname | 1 | No case conflicting files found. |
| +1 | @author | 0 | The patch does not contain any @author tags. |
| +1 | test4tests | 0 | The patch appears to include 4 new or modified test files. |
||| _ trunk Compile Tests _ |
| 0 | mvndep | 70 | Maven dependency ordering for branch |
| +1 | mvninstall | 617 | trunk passed |
| +1 | compile | 395 | trunk passed |
| +1 | checkstyle | 78 | trunk passed |
| +1 | mvnsite | 0 | trunk passed |
| +1 | shadedclient | 890 | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 176 | trunk passed |
| 0 | spotbugs | 444 | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +1 | findbugs | 646 | trunk passed |
||| _ Patch Compile Tests _ |
| 0 | mvndep | 40 | Maven dependency ordering for patch |
| +1 | mvninstall | 556 | the patch passed |
| +1 | compile | 391 | the patch passed |
| +1 | cc | 391 | the patch passed |
| +1 | javac | 391 | the patch passed |
| +1 | checkstyle | 86 | the patch passed |
| +1 | mvnsite | 0 | the patch passed |
| -1 | whitespace | 0 | The patch has 25 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply |
| +1 | shadedclient | 691 | patch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 180 | the patch passed |
| +1 | findbugs | 669 | the patch passed |
||| _ Other Tests _ |
| +1 | unit | 288 | hadoop-hdds in the patch passed. |
| -1 | unit | 2037 | hadoop-ozone in the patch failed. |
| +1 | asflicense | 52 | The patch does not generate ASF License warnings. |
| | | 8110 | |

| Reason | Tests |
|---:|:--|
| Failed junit tests | hadoop.ozone.client.rpc.TestMultiBlockWritesWithDnFailures |
| | hadoop.ozone.scm.node.TestQueryNode |
| | hadoop.ozone.TestSecureOzoneCluster |
| | hadoop.ozone.client.rpc.TestBlockOutputStream |
| | hadoop.ozone.container.common.statemachine.commandhandler.TestBlockDeletion |
| | hadoop.ozone.client.rpc.TestBlockOutputStreamWithFailures |
| | hadoop.ozone.TestMiniChaosOzoneCluster |
| | hadoop.ozone.om.TestSecureOzoneManager |

| Subsystem | Report/Notes |
|--:|:-|
| Docker | Client=19.03.1 Server=19.03.1 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/5/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/1364 |
| Optional Tests | dupname asflicense compile cc mvnsite javac unit javadoc mvninstall shadedclient findbugs checkstyle |
| uname | Linux 8df624d8e13a 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / 337e9b7 |
| Default Java | 1.8.0_222 |
| whitespace | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/5/artifact/out/whitespace-eol.txt |
| unit | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/5/artifact/out/patch-unit-hadoop-ozone.txt |
| Test Results | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/5/testReport/ |
| Max. process+thread count | 5375 (vs. ulimit of 5500) |
| modules | C: hadoop-hdds/common hadoop-hdds/container-service hadoop-ozone/integration-test U: . |
| Console output | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/5/console |
| versions | git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1 |
| Powered by | Apache Yetus 0.10.0 http://yetus.apache.org |

This message was automatically generated.

Issue Time Tracking --- Worklog Id: (was: 306773) Time Spent: 7.5h (was: 7h 20m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=306630&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-306630 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 04/Sep/19 19:02 Start Date: 04/Sep/19 19:02 Worklog Time Spent: 10m Work Description: mukul1987 commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r320922823 ## File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/ContainerSet.java ## @@ -240,14 +240,37 @@ public ContainerReportsProto getContainerReport() throws IOException { } /** - * Builds the missing container set by taking a diff total no containers - * actually found and number of containers which actually got created. + * Builds the missing container set by taking a diff between total no + * containers actually found and number of containers which actually + * got created. It also validates the BCSID stored in the snapshot file + * for each container as against what is reported in containerScan. * This will only be called during the initialization of Datanode Service * when it still not a part of any write Pipeline. - * @param createdContainerSet ContainerId set persisted in the Ratis snapshot + * @param container2BCSIDMap Map of containerId to BCSID persisted in the + * Ratis snapshot */ - public void buildMissingContainerSet(Set createdContainerSet) { -missingContainerSet.addAll(createdContainerSet); -missingContainerSet.removeAll(containerMap.keySet()); + public void buildMissingContainerSetAndValidate( + Map container2BCSIDMap) throws IOException { Review comment: Lets make this multi threaded, so that on restart, this state is reached a lot faster. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Issue Time Tracking --- Worklog Id: (was: 306630) Time Spent: 7h 20m (was: 7h 10m)
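The hunk quoted in this review replaces the old set-diff (`missingContainerSet.addAll(createdContainerSet); missingContainerSet.removeAll(containerMap.keySet())`) with a version that also validates each container's BCSID against the Ratis snapshot. A hedged, self-contained reconstruction of that logic is sketched below; the class name, the flat `Map<Long, Long>` signature, and the choice to signal a stale BCSID with an `IOException` are illustrative, not the exact `ContainerSet` implementation.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class MissingContainerCheck {

    // snapshotBcsids: containerId -> BCSID persisted in the Ratis snapshot.
    // onDiskBcsids:   containerId -> BCSID actually found on disk at startup.
    static Set<Long> buildMissingSet(Map<Long, Long> snapshotBcsids,
                                     Map<Long, Long> onDiskBcsids)
            throws IOException {
        // Containers recorded as created but never found on disk are missing.
        Set<Long> missing = new HashSet<>(snapshotBcsids.keySet());
        missing.removeAll(onDiskBcsids.keySet());

        // A container that exists but whose BCSID is behind the snapshot
        // lost committed data across the restart -- the corruption this
        // JIRA is about detecting.
        for (Map.Entry<Long, Long> e : onDiskBcsids.entrySet()) {
            Long expected = snapshotBcsids.get(e.getKey());
            if (expected != null && e.getValue() < expected) {
                throw new IOException("Stale BCSID for container " + e.getKey());
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        Map<Long, Long> snapshot = new HashMap<>();
        snapshot.put(1L, 4L);
        snapshot.put(2L, 7L);
        Map<Long, Long> onDisk = new HashMap<>();
        onDisk.put(1L, 4L); // container 2 is absent on disk
        System.out.println(buildMissingSet(snapshot, onDisk)); // [2]
    }
}
```

As the thread discusses, this runs once during datanode startup (inside RaftGroup initialization via loadSnapshot), before the node joins any write pipeline.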
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=306590&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-306590 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 04/Sep/19 18:04 Start Date: 04/Sep/19 18:04 Worklog Time Spent: 10m Work Description: bshashikant commented on issue #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#issuecomment-528017062 /retest

Issue Time Tracking --- Worklog Id: (was: 306590) Time Spent: 7h 10m (was: 7h)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=306572&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-306572 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 04/Sep/19 17:45 Start Date: 04/Sep/19 17:45 Worklog Time Spent: 10m Work Description: bshashikant commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r320889606 ## File path: hadoop-hdds/common/src/main/proto/DatanodeContainerProtocol.proto ## @@ -248,8 +248,13 @@ message ContainerDataProto { optional ContainerType containerType = 10 [default = KeyValueContainer]; } -message ContainerIdSetProto { -repeated int64 containerId = 1; +message Container2BCSIDMapEntryProto { Review comment: address in the next patch.

Issue Time Tracking --- Worklog Id: (was: 306572) Time Spent: 7h (was: 6h 50m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=306570&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-306570 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 04/Sep/19 17:41 Start Date: 04/Sep/19 17:41 Worklog Time Spent: 10m Work Description: bshashikant commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r320887895 ## File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/HddsDispatcher.java ## @@ -329,6 +336,21 @@ private ContainerCommandResponseProto dispatchRequest( } } + private void updateBCSID(Container container, + DispatcherContext dispatcherContext, ContainerProtos.Type cmdType) { +long bcsID = container.getBlockCommitSequenceId(); +long containerId = container.getContainerData().getContainerID(); +Map container2BCSIDMap; +if (dispatcherContext != null && (cmdType == ContainerProtos.Type.PutBlock +|| cmdType == ContainerProtos.Type.PutSmallFile)) { + container2BCSIDMap = dispatcherContext.getContainer2BCSIDMap(); + Preconditions.checkNotNull(container2BCSIDMap); + Preconditions.checkArgument(container2BCSIDMap.containsKey(containerId)); + // updates the latest BCSID on every putBlock or putSmallFile + // transaction over Ratis. + container2BCSIDMap.computeIfPresent(containerId, (u, v) -> v = bcsID); Review comment: will address in the next patch. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Issue Time Tracking --- Worklog Id: (was: 306570) Time Spent: 6h 40m (was: 6.5h)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=306571&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-306571 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 04/Sep/19 17:41 Start Date: 04/Sep/19 17:41 Worklog Time Spent: 10m Work Description: bshashikant commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r320888068

File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/impl/BlockManagerImpl.java

@@ -282,5 +282,4 @@ public void deleteBlock(Container container, BlockID blockID) throws
   public void shutdown() {
     BlockUtils.shutdownCache(ContainerCache.getInstance(config));
   }
-

Review comment: address in the next patch.

Issue Time Tracking --- Worklog Id: (was: 306571) Time Spent: 6h 50m (was: 6h 40m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=306568&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-306568 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 04/Sep/19 17:40 Start Date: 04/Sep/19 17:40 Worklog Time Spent: 10m Work Description: bshashikant commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r320887705

File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/ContainerSet.java

@@ -240,14 +240,37 @@ public ContainerReportsProto getContainerReport() throws IOException {
   }

   /**
-   * Builds the missing container set by taking a diff total no containers
-   * actually found and number of containers which actually got created.
+   * Builds the missing container set by taking a diff between total no
+   * containers actually found and number of containers which actually
+   * got created. It also validates the BCSID stored in the snapshot file
+   * for each container as against what is reported in containerScan.
    * This will only be called during the initialization of Datanode Service
    * when it still not a part of any write Pipeline.
-   * @param createdContainerSet ContainerId set persisted in the Ratis snapshot
+   * @param container2BCSIDMap Map of containerId to BCSID persisted in the
+   *                           Ratis snapshot
    */
-  public void buildMissingContainerSet(Set<Long> createdContainerSet) {
-    missingContainerSet.addAll(createdContainerSet);
-    missingContainerSet.removeAll(containerMap.keySet());
+  public void buildMissingContainerSetAndValidate(
+      Map<Long, Long> container2BCSIDMap) throws IOException {
+    for (Map.Entry<Long, Long> mapEntry : container2BCSIDMap.entrySet()) {
+      long id = mapEntry.getKey();
+      if (!containerMap.containsKey(id)) {
+        LOG.warn("Adding container {} to missing container set.", id);
+        missingContainerSet.add(id);
+      } else {
+        Container container = containerMap.get(id);
+        long containerBCSID = container.getBlockCommitSequenceId();
+        long snapshotBCSID = mapEntry.getValue();
+        if (containerBCSID < snapshotBCSID) {
+          LOG.warn(
+              "Marking container {} unhealthy as reported BCSID {} is smaller"

Review comment: will address in the next patch.

Issue Time Tracking --- Worklog Id: (was: 306568) Time Spent: 6.5h (was: 6h 20m)
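The validation logic in the `buildMissingContainerSetAndValidate` diff above can be sketched in isolation (plain maps standing in for the container set and the persisted Ratis snapshot; all ids and BCSID values are hypothetical): a container present in the snapshot but not on disk is missing, and a container whose on-disk BCSID is behind the snapshot BCSID has lost committed data.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class MissingContainerCheck {

    /** Result of validating the snapshot map against on-disk state. */
    static final class Result {
        final Set<Long> missing = new TreeSet<>();
        final Set<Long> unhealthy = new TreeSet<>();
    }

    // Sketch of the patch's validation pass, decoupled from Ozone types.
    static Result validate(Map<Long, Long> snapshotBcsids,
                           Map<Long, Long> onDiskBcsids) {
        Result r = new Result();
        for (Map.Entry<Long, Long> e : snapshotBcsids.entrySet()) {
            long id = e.getKey();
            Long onDisk = onDiskBcsids.get(id);
            if (onDisk == null) {
                r.missing.add(id);       // container never made it to disk
            } else if (onDisk < e.getValue()) {
                r.unhealthy.add(id);     // on-disk BCSID behind the snapshot
            }
        }
        return r;
    }

    public static void main(String[] args) {
        // Hypothetical state: container 2 was lost entirely,
        // container 3 lost its last committed blocks.
        Result r = validate(Map.of(1L, 10L, 2L, 5L, 3L, 7L),
                            Map.of(1L, 10L, 3L, 4L));
        System.out.println("missing=" + r.missing
            + " unhealthy=" + r.unhealthy); // missing=[2] unhealthy=[3]
    }
}
```

This is only detectable because the snapshot BCSIDs were made durable; with buffered writes alone both sides of the comparison could silently regress together.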
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=306567&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-306567 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 04/Sep/19 17:39 Start Date: 04/Sep/19 17:39 Worklog Time Spent: 10m Work Description: bshashikant commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r320887264

File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/transport/server/ratis/ContainerStateMachine.java

@@ -258,8 +257,9 @@ private long loadSnapshot(SingleFileSnapshotInfo snapshot)
    * @throws IOException
    */
  public void persistContainerSet(OutputStream out) throws IOException {

Review comment: will address it in the next patch.

Issue Time Tracking --- Worklog Id: (was: 306567) Time Spent: 6h 20m (was: 6h 10m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=306566&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-306566 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 04/Sep/19 17:38 Start Date: 04/Sep/19 17:38 Worklog Time Spent: 10m Work Description: bshashikant commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r320886836

File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/HddsDispatcher.java

@@ -329,6 +336,21 @@ private ContainerCommandResponseProto dispatchRequest(
     }
   }

+  private void updateBCSID(Container container,
+      DispatcherContext dispatcherContext, ContainerProtos.Type cmdType) {
+    long bcsID = container.getBlockCommitSequenceId();
+    long containerId = container.getContainerData().getContainerID();
+    Map<Long, Long> container2BCSIDMap;
+    if (dispatcherContext != null && (cmdType == ContainerProtos.Type.PutBlock

Review comment: For all the cmds, the dispatcher context is not set up and will be null. We need to check for specific cmd types to get the context.
Issue Time Tracking --- Worklog Id: (was: 306566) Time Spent: 6h 10m (was: 6h)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=306390&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-306390 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 04/Sep/19 15:59 Start Date: 04/Sep/19 15:59 Worklog Time Spent: 10m Work Description: nandakumar131 commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r320841743

File path: hadoop-hdds/common/src/main/proto/DatanodeContainerProtocol.proto

@@ -248,8 +248,13 @@ message ContainerDataProto {
   optional ContainerType containerType = 10 [default = KeyValueContainer];
 }

-message ContainerIdSetProto {
-  repeated int64 containerId = 1;
+message Container2BCSIDMapEntryProto {

Review comment: Never used, can be removed.

Issue Time Tracking --- Worklog Id: (was: 306390) Time Spent: 5h 50m (was: 5h 40m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=306391&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-306391 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 04/Sep/19 15:59 Start Date: 04/Sep/19 15:59 Worklog Time Spent: 10m Work Description: nandakumar131 commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r320842136

File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/HddsDispatcher.java

@@ -329,6 +336,21 @@ private ContainerCommandResponseProto dispatchRequest(
     }
   }

+  private void updateBCSID(Container container,
+      DispatcherContext dispatcherContext, ContainerProtos.Type cmdType) {
+    long bcsID = container.getBlockCommitSequenceId();
+    long containerId = container.getContainerData().getContainerID();
+    Map<Long, Long> container2BCSIDMap;
+    if (dispatcherContext != null && (cmdType == ContainerProtos.Type.PutBlock
+        || cmdType == ContainerProtos.Type.PutSmallFile)) {
+      container2BCSIDMap = dispatcherContext.getContainer2BCSIDMap();
+      Preconditions.checkNotNull(container2BCSIDMap);
+      Preconditions.checkArgument(container2BCSIDMap.containsKey(containerId));
+      // updates the latest BCSID on every putBlock or putSmallFile
+      // transaction over Ratis.
+      container2BCSIDMap.computeIfPresent(containerId, (u, v) -> v = bcsID);

Review comment: `computeIfPresent` is not needed here, can be replaced with `Map#put`.
Issue Time Tracking --- Worklog Id: (was: 306391) Time Spent: 6h (was: 5h 50m)
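The reviewer's point can be illustrated with a minimal sketch (plain `HashMap`, hypothetical ids and values): once the preconditions guarantee the key is present, `computeIfPresent` with a lambda that ignores the old value is just an obfuscated `put`.

```java
import java.util.HashMap;
import java.util.Map;

public class BcsidUpdateSketch {
    public static void main(String[] args) {
        Map<Long, Long> container2BCSIDMap = new HashMap<>();
        container2BCSIDMap.put(1L, 0L); // container created, initial BCSID 0

        long bcsID = 42L; // hypothetical latest block commit sequence id

        // Patch's form: the lambda returns bcsID regardless of the old value
        // (in the patch it is written "(u, v) -> v = bcsID", where the
        // assignment merely rebinds the lambda parameter).
        container2BCSIDMap.computeIfPresent(1L, (k, v) -> bcsID);

        // Equivalent and simpler, given the key is known to be present:
        container2BCSIDMap.put(1L, bcsID);

        System.out.println(container2BCSIDMap.get(1L)); // 42
    }
}
```

The only behavioral difference is when the key is absent, a case already excluded by `Preconditions.checkArgument(containsKey(...))` in the patch.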
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=305687&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-305687 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 03/Sep/19 16:40 Start Date: 03/Sep/19 16:40 Worklog Time Spent: 10m Work Description: supratimdeka commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r320313492

File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/HddsDispatcher.java

@@ -228,11 +231,14 @@ private ContainerCommandResponseProto dispatchRequest(
       audit(action, eventType, params, AuditEventStatus.FAILURE, sce);
       return ContainerUtils.logAndReturnError(LOG, sce, msg);
     }
-    Preconditions.checkArgument(isWriteStage && containerIdSet != null
+    Preconditions.checkArgument(isWriteStage && container2BCSIDMap != null
         || dispatcherContext == null);
-    if (containerIdSet != null) {
+    if (container2BCSIDMap != null) {
       // adds this container to list of containers created in the pipeline
-      containerIdSet.add(containerID);
+      // with initial BCSID recorded as 0.
+      Preconditions
+          .checkArgument(!container2BCSIDMap.containsKey(containerID));

Review comment: is this assert safe? asking the question because there was no equivalent assert before. trying to imagine a situation where this might be a false alarm - is this possible?
1. container create is successful,
2. container2BCSIDMap.putIfAbsent is successful
3. datanode restarts - container2BCSIDMap is persisted, but the container itself is not persisted (create container involves a rename operation which may not be sync'ed inline)
Issue Time Tracking --- Worklog Id: (was: 305687) Time Spent: 5h 20m (was: 5h 10m)
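One way to picture the scenario the reviewer raises: if the map survives the restart (it was persisted in the Ratis snapshot) but the container create did not reach disk, replaying the create transaction would trip a strict `checkArgument(!map.containsKey(id))`. A tolerant variant (a sketch of the concern, not the patch's code; the container id is hypothetical) records the container idempotently:

```java
import java.util.HashMap;
import java.util.Map;

public class CreateReplaySketch {
    public static void main(String[] args) {
        Map<Long, Long> container2BCSIDMap = new HashMap<>();

        // First apply of the create transaction, before the restart.
        container2BCSIDMap.putIfAbsent(100L, 0L);

        // After restart the snapshot still holds the entry, and Ratis
        // replays the create. putIfAbsent makes the replay a no-op
        // instead of an assertion failure.
        container2BCSIDMap.putIfAbsent(100L, 0L);

        System.out.println(container2BCSIDMap); // {100=0}
    }
}
```

Whether the strict assert or the idempotent form is right depends on whether Ratis can replay a create against a snapshot that already recorded it, which is exactly the question posed above.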
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=305685&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-305685 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 03/Sep/19 16:40 Start Date: 03/Sep/19 16:40 Worklog Time Spent: 10m Work Description: supratimdeka commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r320287270

File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/ContainerSet.java

@@ -240,14 +240,37 @@ public ContainerReportsProto getContainerReport() throws IOException {
   }

   /**
-   * Builds the missing container set by taking a diff total no containers
-   * actually found and number of containers which actually got created.
+   * Builds the missing container set by taking a diff between total no
+   * containers actually found and number of containers which actually
+   * got created. It also validates the BCSID stored in the snapshot file
+   * for each container as against what is reported in containerScan.
    * This will only be called during the initialization of Datanode Service
    * when it still not a part of any write Pipeline.
-   * @param createdContainerSet ContainerId set persisted in the Ratis snapshot
+   * @param container2BCSIDMap Map of containerId to BCSID persisted in the
+   *                           Ratis snapshot
    */
-  public void buildMissingContainerSet(Set<Long> createdContainerSet) {
-    missingContainerSet.addAll(createdContainerSet);
-    missingContainerSet.removeAll(containerMap.keySet());
+  public void buildMissingContainerSetAndValidate(
+      Map<Long, Long> container2BCSIDMap) throws IOException {
+    for (Map.Entry<Long, Long> mapEntry : container2BCSIDMap.entrySet()) {
+      long id = mapEntry.getKey();
+      if (!containerMap.containsKey(id)) {
+        LOG.warn("Adding container {} to missing container set.", id);
+        missingContainerSet.add(id);
+      } else {
+        Container container = containerMap.get(id);
+        long containerBCSID = container.getBlockCommitSequenceId();
+        long snapshotBCSID = mapEntry.getValue();
+        if (containerBCSID < snapshotBCSID) {
+          LOG.warn(
+              "Marking container {} unhealthy as reported BCSID {} is smaller"

Review comment: argument appears to be missing. container not passed as the first argument.

Issue Time Tracking --- Worklog Id: (was: 305685) Time Spent: 5h 10m (was: 5h)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=305686&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-305686 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 03/Sep/19 16:40 Start Date: 03/Sep/19 16:40 Worklog Time Spent: 10m Work Description: supratimdeka commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r320368137

File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/HddsDispatcher.java

@@ -329,6 +336,21 @@ private ContainerCommandResponseProto dispatchRequest(
     }
   }

+  private void updateBCSID(Container container,
+      DispatcherContext dispatcherContext, ContainerProtos.Type cmdType) {
+    long bcsID = container.getBlockCommitSequenceId();
+    long containerId = container.getContainerData().getContainerID();
+    Map<Long, Long> container2BCSIDMap;
+    if (dispatcherContext != null && (cmdType == ContainerProtos.Type.PutBlock

Review comment: an alternative implementation would be to ignore the cmdType. For all requests, read the BCS ID from the map and compare it to the bcsid from the container. if the map value is lower, then update it to the value from the container.

Advantage: updateBCSID function becomes de-coupled from the knowledge of which cmdType changes the bcsid.

Disadvantage: possibly more CPU cost, because every request will pay the cost of reading the bcsid map. but it might be premature to assume that this additional cost will be significant.
Issue Time Tracking --- Worklog Id: (was: 305686)
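The cmdType-agnostic alternative the reviewer sketches could look like this (a sketch with a hypothetical helper and plain map, not the patch's code): raise the recorded BCSID to the container's current value whenever it is lower, for every request.

```java
import java.util.HashMap;
import java.util.Map;

public class CmdTypeAgnosticUpdate {
    // Raise the recorded BCSID to the container's current value if it moved
    // forward; any command type can call this safely. Note: unlike the
    // patch, Map#merge also inserts when the key is absent, whereas the
    // patch asserts the key is already present.
    static void updateBcsid(Map<Long, Long> map, long containerId,
                            long containerBcsid) {
        map.merge(containerId, containerBcsid, Math::max);
    }

    public static void main(String[] args) {
        Map<Long, Long> map = new HashMap<>();
        map.put(1L, 3L);                 // hypothetical starting BCSID
        updateBcsid(map, 1L, 5L);        // a PutBlock advanced the BCSID
        updateBcsid(map, 1L, 2L);        // a stale/lower value cannot regress it
        System.out.println(map.get(1L)); // 5
    }
}
```

The max-based update is what makes this safe to call from every code path: commands that do not advance the BCSID leave the entry unchanged.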
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=305689&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-305689 ]
ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 03/Sep/19 16:40
Start Date: 03/Sep/19 16:40
Worklog Time Spent: 10m
Work Description: supratimdeka commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#discussion_r320359663
## File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/impl/BlockManagerImpl.java ##
@@ -282,5 +282,4 @@ public void deleteBlock(Container container, BlockID blockID) throws public void shutdown() { BlockUtils.shutdownCache(ContainerCache.getInstance(config)); } -
Review comment: unintended whitespace only change?
Issue Time Tracking --- Worklog Id: (was: 305689) Time Spent: 5h 40m (was: 5.5h)
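The fix proposed in the issue description — syncing the on-disk database file when the Ratis snapshot is taken — amounts to an fsync of the metadata file. A minimal, hypothetical sketch using only the JDK (the class, method, and path are illustrative; the real change would target Ozone's RocksDB files inside the container state machine):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative sketch only: force a database file's contents and metadata to
// the storage device so buffered updates survive a datanode restart.
public final class SnapshotFileSync {
  private SnapshotFileSync() { }

  public static void syncFile(Path dbFile) throws IOException {
    try (FileChannel channel = FileChannel.open(dbFile, StandardOpenOption.WRITE)) {
      // force(true) flushes both file content and metadata (an fsync), making
      // the on-disk state consistent with what a scanner will later read.
      channel.force(true);
    }
  }
}
```

Timing this call would also give the sync metric the description suggests for measuring throughput loss.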
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=305688&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-305688 ]
ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 03/Sep/19 16:40
Start Date: 03/Sep/19 16:40
Worklog Time Spent: 10m
Work Description: supratimdeka commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#discussion_r320362456
## File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/transport/server/ratis/ContainerStateMachine.java ##
@@ -258,8 +257,9 @@ private long loadSnapshot(SingleFileSnapshotInfo snapshot) * @throws IOException */ public void persistContainerSet(OutputStream out) throws IOException {
Review comment: should we rename this function as well? given that other instances of containerSet have been renamed.
Issue Time Tracking --- Worklog Id: (was: 305688) Time Spent: 5.5h (was: 5h 20m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=305360&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-305360 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 03/Sep/19 00:34 Start Date: 03/Sep/19 00:34 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on issue #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#issuecomment-527263446 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Comment | |::|--:|:|:| | 0 | reexec | 36 | Docker mode activated. | ||| _ Prechecks _ | | +1 | dupname | 0 | No case conflicting files found. | | +1 | @author | 0 | The patch does not contain any @author tags. | | +1 | test4tests | 0 | The patch appears to include 4 new or modified test files. | ||| _ trunk Compile Tests _ | | 0 | mvndep | 25 | Maven dependency ordering for branch | | +1 | mvninstall | 580 | trunk passed | | +1 | compile | 406 | trunk passed | | +1 | checkstyle | 72 | trunk passed | | +1 | mvnsite | 0 | trunk passed | | +1 | shadedclient | 908 | branch has no errors when building and testing our client artifacts. | | +1 | javadoc | 175 | trunk passed | | 0 | spotbugs | 473 | Used deprecated FindBugs config; considering switching to SpotBugs. | | +1 | findbugs | 693 | trunk passed | ||| _ Patch Compile Tests _ | | 0 | mvndep | 37 | Maven dependency ordering for patch | | +1 | mvninstall | 609 | the patch passed | | +1 | compile | 399 | the patch passed | | +1 | cc | 399 | the patch passed | | +1 | javac | 399 | the patch passed | | -0 | checkstyle | 37 | hadoop-hdds: The patch generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0) | | -0 | checkstyle | 41 | hadoop-ozone: The patch generated 4 new + 0 unchanged - 0 fixed = 4 total (was 0) | | +1 | mvnsite | 0 | the patch passed | | -1 | whitespace | 0 | The patch has 25 line(s) that end in whitespace. Use git apply --whitespace=fix <>. 
Refer https://git-scm.com/docs/git-apply | | +1 | shadedclient | 643 | patch has no errors when building and testing our client artifacts. | | +1 | javadoc | 177 | the patch passed | | -1 | findbugs | 234 | hadoop-hdds generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) | ||| _ Other Tests _ | | +1 | unit | 275 | hadoop-hdds in the patch passed. | | -1 | unit | 2571 | hadoop-ozone in the patch failed. | | +1 | asflicense | 54 | The patch does not generate ASF License warnings. | | | | 8591 | | | Reason | Tests | |---:|:--| | FindBugs | module:hadoop-hdds | | | org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(ContainerProtos$ContainerCommandRequestProto, DispatcherContext) invokes inefficient new Long(long) constructor; use Long.valueOf(long) instead At HddsDispatcher.java:Long(long) constructor; use Long.valueOf(long) instead At HddsDispatcher.java:[line 241] | | Failed junit tests | hadoop.ozone.TestContainerOperations | | | hadoop.ozone.client.rpc.TestOzoneClientRetriesOnException | | | hadoop.ozone.scm.TestGetCommittedBlockLengthAndPutKey | | | hadoop.ozone.scm.TestXceiverClientManager | | | hadoop.ozone.TestContainerStateMachineIdempotency | | | hadoop.ozone.om.TestSecureOzoneManager | | | hadoop.ozone.container.metrics.TestContainerMetrics | | | hadoop.ozone.container.ozoneimpl.TestOzoneContainer | | | hadoop.ozone.client.rpc.TestMultiBlockWritesWithDnFailures | | | hadoop.ozone.scm.TestXceiverClientMetrics | | | hadoop.ozone.scm.TestContainerSmallFile | | | hadoop.ozone.client.rpc.Test2WayCommitInRatis | | | hadoop.ozone.container.TestContainerReplication | | | hadoop.ozone.container.common.statemachine.commandhandler.TestBlockDeletion | | Subsystem | Report/Notes | |--:|:-| | Docker | Client=19.03.1 Server=19.03.1 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/4/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/1364 | | Optional Tests | dupname asflicense compile cc mvnsite 
javac unit javadoc mvninstall shadedclient findbugs checkstyle | | uname | Linux 23e0b2fdc078 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | personality/hadoop.sh | | git revision | trunk / 915cbc9 | | Default Java | 1.8.0_222 | | checkstyle | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/4/artifact/out/diff-checkstyle-hadoop-hdds.txt | | checkstyle | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/4/artifact/out/diff-checkstyle-hadoop-ozone.txt
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=303306&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-303306 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 23:57 Start Date: 28/Aug/19 23:57 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on issue #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#issuecomment-525966992 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Comment | |::|--:|:|:| | 0 | reexec | 144 | Docker mode activated. | ||| _ Prechecks _ | | +1 | dupname | 1 | No case conflicting files found. | | +1 | @author | 0 | The patch does not contain any @author tags. | | +1 | test4tests | 0 | The patch appears to include 4 new or modified test files. | ||| _ trunk Compile Tests _ | | 0 | mvndep | 81 | Maven dependency ordering for branch | | +1 | mvninstall | 688 | trunk passed | | +1 | compile | 401 | trunk passed | | +1 | checkstyle | 86 | trunk passed | | +1 | mvnsite | 0 | trunk passed | | +1 | shadedclient | 924 | branch has no errors when building and testing our client artifacts. | | +1 | javadoc | 182 | trunk passed | | 0 | spotbugs | 444 | Used deprecated FindBugs config; considering switching to SpotBugs. | | +1 | findbugs | 666 | trunk passed | ||| _ Patch Compile Tests _ | | 0 | mvndep | 36 | Maven dependency ordering for patch | | +1 | mvninstall | 593 | the patch passed | | +1 | compile | 417 | the patch passed | | +1 | cc | 417 | the patch passed | | +1 | javac | 417 | the patch passed | | -0 | checkstyle | 42 | hadoop-hdds: The patch generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0) | | -0 | checkstyle | 40 | hadoop-ozone: The patch generated 4 new + 0 unchanged - 0 fixed = 4 total (was 0) | | +1 | mvnsite | 0 | the patch passed | | -1 | whitespace | 0 | The patch has 25 line(s) that end in whitespace. Use git apply --whitespace=fix <>. 
Refer https://git-scm.com/docs/git-apply | | +1 | shadedclient | 752 | patch has no errors when building and testing our client artifacts. | | +1 | javadoc | 179 | the patch passed | | -1 | findbugs | 272 | hadoop-hdds generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) | | -1 | findbugs | 422 | hadoop-ozone in the patch failed. | ||| _ Other Tests _ | | -1 | unit | 1102 | hadoop-hdds in the patch failed. | | -1 | unit | 506 | hadoop-ozone in the patch failed. | | +1 | asflicense | 45 | The patch does not generate ASF License warnings. | | | | 7750 | | | Reason | Tests | |---:|:--| | FindBugs | module:hadoop-hdds | | | org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(ContainerProtos$ContainerCommandRequestProto, DispatcherContext) invokes inefficient new Long(long) constructor; use Long.valueOf(long) instead At HddsDispatcher.java:Long(long) constructor; use Long.valueOf(long) instead At HddsDispatcher.java:[line 241] | | Subsystem | Report/Notes | |--:|:-| | Docker | Client=19.03.1 Server=19.03.1 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/3/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/1364 | | Optional Tests | dupname asflicense compile cc mvnsite javac unit javadoc mvninstall shadedclient findbugs checkstyle | | uname | Linux 8d1d475f5d3b 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | personality/hadoop.sh | | git revision | trunk / 6f2226a | | Default Java | 1.8.0_222 | | checkstyle | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/3/artifact/out/diff-checkstyle-hadoop-hdds.txt | | checkstyle | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/3/artifact/out/diff-checkstyle-hadoop-ozone.txt | | whitespace | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/3/artifact/out/whitespace-eol.txt | | findbugs | 
https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/3/artifact/out/new-findbugs-hadoop-hdds.html | | findbugs | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/3/artifact/out/patch-findbugs-hadoop-ozone.txt | | unit | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/3/artifact/out/patch-unit-hadoop-hdds.txt | | unit | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/3/artifact/out/patch-unit-hadoop-ozone.txt | | Test Results | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/3/testReport/ | | Max. process+thread count | 1371 (vs. ulimit of 5500) | | modules | C: hadoop-hdds/c
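The `-1 whitespace` votes in these reports can be cleared with the `git apply --whitespace=fix` command Yetus points to. A self-contained demonstration in a throwaway repository (file and patch names are made up for illustration):

```shell
set -e
workdir=$(mktemp -d)
cd "$workdir"
git init -q repo
cd repo
git config user.email ci@example.invalid
git config user.name ci
printf 'hello\n' > Example.java
git add Example.java
git commit -qm init
# Introduce a line that ends in whitespace, capture it as a patch, then revert.
printf 'hello\nworld \n' > Example.java
git diff > ../fix.patch
git checkout -q -- Example.java
# --whitespace=fix applies the patch and strips the trailing whitespace
# (a warning is printed to stderr, but the apply succeeds).
git apply --whitespace=fix ../fix.patch
tail -n1 Example.java
```

The last line of the file comes back as `world` with the trailing space removed, which is exactly what the 25 flagged lines in the patch need.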
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=303294&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-303294 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 23:22 Start Date: 28/Aug/19 23:22 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on issue #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#issuecomment-525959389 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Comment | |::|--:|:|:| | 0 | reexec | 46 | Docker mode activated. | ||| _ Prechecks _ | | +1 | dupname | 1 | No case conflicting files found. | | +1 | @author | 0 | The patch does not contain any @author tags. | | +1 | test4tests | 0 | The patch appears to include 4 new or modified test files. | ||| _ trunk Compile Tests _ | | 0 | mvndep | 78 | Maven dependency ordering for branch | | +1 | mvninstall | 667 | trunk passed | | +1 | compile | 426 | trunk passed | | +1 | checkstyle | 75 | trunk passed | | +1 | mvnsite | 0 | trunk passed | | +1 | shadedclient | 956 | branch has no errors when building and testing our client artifacts. | | +1 | javadoc | 183 | trunk passed | | 0 | spotbugs | 432 | Used deprecated FindBugs config; considering switching to SpotBugs. | | +1 | findbugs | 645 | trunk passed | ||| _ Patch Compile Tests _ | | 0 | mvndep | 39 | Maven dependency ordering for patch | | +1 | mvninstall | 561 | the patch passed | | +1 | compile | 410 | the patch passed | | +1 | cc | 410 | the patch passed | | +1 | javac | 410 | the patch passed | | -0 | checkstyle | 40 | hadoop-hdds: The patch generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0) | | -0 | checkstyle | 44 | hadoop-ozone: The patch generated 4 new + 0 unchanged - 0 fixed = 4 total (was 0) | | +1 | mvnsite | 0 | the patch passed | | -1 | whitespace | 0 | The patch has 25 line(s) that end in whitespace. Use git apply --whitespace=fix <>. 
Refer https://git-scm.com/docs/git-apply | | +1 | shadedclient | 738 | patch has no errors when building and testing our client artifacts. | | +1 | javadoc | 187 | the patch passed | | -1 | findbugs | 235 | hadoop-hdds generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) | ||| _ Other Tests _ | | -1 | unit | 1062 | hadoop-hdds in the patch failed. | | -1 | unit | 202 | hadoop-ozone in the patch failed. | | +1 | asflicense | 44 | The patch does not generate ASF License warnings. | | | | 7267 | | | Reason | Tests | |---:|:--| | FindBugs | module:hadoop-hdds | | | org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(ContainerProtos$ContainerCommandRequestProto, DispatcherContext) invokes inefficient new Long(long) constructor; use Long.valueOf(long) instead At HddsDispatcher.java:Long(long) constructor; use Long.valueOf(long) instead At HddsDispatcher.java:[line 241] | | Failed junit tests | hadoop.ozone.om.ratis.TestOzoneManagerDoubleBufferWithOMResponse | | Subsystem | Report/Notes | |--:|:-| | Docker | Client=19.03.1 Server=19.03.1 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/2/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/1364 | | Optional Tests | dupname asflicense compile cc mvnsite javac unit javadoc mvninstall shadedclient findbugs checkstyle | | uname | Linux 6ae0c04c33e0 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | personality/hadoop.sh | | git revision | trunk / 6f2226a | | Default Java | 1.8.0_222 | | checkstyle | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/2/artifact/out/diff-checkstyle-hadoop-hdds.txt | | checkstyle | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/2/artifact/out/diff-checkstyle-hadoop-ozone.txt | | whitespace | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/2/artifact/out/whitespace-eol.txt | | findbugs | 
https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/2/artifact/out/new-findbugs-hadoop-hdds.html | | unit | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/2/artifact/out/patch-unit-hadoop-hdds.txt | | unit | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/2/artifact/out/patch-unit-hadoop-ozone.txt | | Test Results | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/2/testReport/ | | Max. process+thread count | 413 (vs. ulimit of 5500) | | modules | C: hadoop-hdds/common hadoop-hdds/container-service hadoop-ozone/integration-test U: . | | Console output | ht
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302886&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302886 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 14:00 Start Date: 28/Aug/19 14:00 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597852 ## File path: hadoop-hdds/common/src/main/resources/audit.log ## @@ -0,0 +1,25 @@ +2019-08-28 11:36:31,489 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_VOLUME {admin=sbanerjee, owner=sbanerjee, volume=testcontainerstatemachinefailures, creationTime=1566972391485, quotaInBytes=1152921504606846976} | ret=SUCCESS | +2019-08-28 11:36:31,494 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,511 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, acls=[user:sbanerjee:a[ACCESS], group:staff:a[ACCESS], group:everyone:a[ACCESS], group:localaccounts:a[ACCESS], group:_appserverusr:a[ACCESS], group:admin:a[ACCESS], group:_appserveradm:a[ACCESS], group:_lpadmin:a[ACCESS], group:com.apple.sharepoint.group.2:a[ACCESS], group:_appstore:a[ACCESS], group:_lpoperator:a[ACCESS], group:_developer:a[ACCESS], group:_analyticsusers:a[ACCESS], group:com.apple.access_ftp:a[ACCESS], group:com.apple.access_screensharing:a[ACCESS], group:com.apple.access_ssh-disabled:a[ACCESS], group:com.apple.sharepoint.group.1:a[ACCESS]], isVersionEnabled=false, storageType=DISK, creationTime=0} | ret=SUCCESS | +2019-08-28 11:36:31,515 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,519 | INFO | OMAudit | 
user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,561 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=1024, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=null} | ret=SUCCESS | +2019-08-28 11:36:37,850 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=COMMIT_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=10, replicationType=null, replicationFactor=null, keyLocationInfo=[{blockID={containerID=1, localID=102693102652358657}, length=10, offset=0, token=null, pipeline=Pipeline[ Id: 33e5321e-9d61-4d31-94ca-a18a6abc80ba, Nodes: 82c96fdc-06f5-47f4-ab5d-7a2fa599b46e{ip: 192.168.0.64, host: 192.168.0.64, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:ONE, State:OPEN], createVersion=0}], clientID=102693102651244544} | ret=SUCCESS | +2019-08-28 11:36:50,166 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:50,168 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS | Review comment: whitespace:end of line This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Issue Time Tracking --- Worklog Id: (was: 302886) Time Spent: 4h 20m (was: 4h 10m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302884&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302884 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 14:00 Start Date: 28/Aug/19 14:00 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597822 ## File path: hadoop-hdds/common/src/main/resources/audit.log ## @@ -0,0 +1,25 @@ +2019-08-28 11:36:31,489 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_VOLUME {admin=sbanerjee, owner=sbanerjee, volume=testcontainerstatemachinefailures, creationTime=1566972391485, quotaInBytes=1152921504606846976} | ret=SUCCESS | +2019-08-28 11:36:31,494 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,511 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, acls=[user:sbanerjee:a[ACCESS], group:staff:a[ACCESS], group:everyone:a[ACCESS], group:localaccounts:a[ACCESS], group:_appserverusr:a[ACCESS], group:admin:a[ACCESS], group:_appserveradm:a[ACCESS], group:_lpadmin:a[ACCESS], group:com.apple.sharepoint.group.2:a[ACCESS], group:_appstore:a[ACCESS], group:_lpoperator:a[ACCESS], group:_developer:a[ACCESS], group:_analyticsusers:a[ACCESS], group:com.apple.access_ftp:a[ACCESS], group:com.apple.access_screensharing:a[ACCESS], group:com.apple.access_ssh-disabled:a[ACCESS], group:com.apple.sharepoint.group.1:a[ACCESS]], isVersionEnabled=false, storageType=DISK, creationTime=0} | ret=SUCCESS | +2019-08-28 11:36:31,515 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,519 | INFO | OMAudit | 
user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,561 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=1024, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=null} | ret=SUCCESS | +2019-08-28 11:36:37,850 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=COMMIT_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=10, replicationType=null, replicationFactor=null, keyLocationInfo=[{blockID={containerID=1, localID=102693102652358657}, length=10, offset=0, token=null, pipeline=Pipeline[ Id: 33e5321e-9d61-4d31-94ca-a18a6abc80ba, Nodes: 82c96fdc-06f5-47f4-ab5d-7a2fa599b46e{ip: 192.168.0.64, host: 192.168.0.64, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:ONE, State:OPEN], createVersion=0}], clientID=102693102651244544} | ret=SUCCESS |
Review comment: whitespace:end of line
Issue Time Tracking --- Worklog Id: (was: 302884) Time Spent: 4h (was: 3h 50m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302887&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302887 ]

ASF GitHub Bot logged work on HDDS-1843:

Author: ASF GitHub Bot
Created on: 28/Aug/19 14:01
Start Date: 28/Aug/19 14:01
Worklog Time Spent: 10m

Work Description: hadoop-yetus commented on issue #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#issuecomment-525759029

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Comment |
|:----:|----------:|:--------|:--------|
| 0 | reexec | 40 | Docker mode activated. |
||| _ Prechecks _ |
| +1 | dupname | 1 | No case conflicting files found. |
| +1 | @author | 0 | The patch does not contain any @author tags. |
| +1 | test4tests | 0 | The patch appears to include 4 new or modified test files. |
||| _ trunk Compile Tests _ |
| 0 | mvndep | 67 | Maven dependency ordering for branch |
| +1 | mvninstall | 620 | trunk passed |
| +1 | compile | 385 | trunk passed |
| +1 | checkstyle | 76 | trunk passed |
| +1 | mvnsite | 0 | trunk passed |
| +1 | shadedclient | 846 | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 177 | trunk passed |
| 0 | spotbugs | 429 | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +1 | findbugs | 629 | trunk passed |
||| _ Patch Compile Tests _ |
| 0 | mvndep | 36 | Maven dependency ordering for patch |
| +1 | mvninstall | 547 | the patch passed |
| +1 | compile | 393 | the patch passed |
| +1 | cc | 393 | the patch passed |
| +1 | javac | 393 | the patch passed |
| -0 | checkstyle | 39 | hadoop-hdds: The patch generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0) |
| -0 | checkstyle | 42 | hadoop-ozone: The patch generated 4 new + 0 unchanged - 0 fixed = 4 total (was 0) |
| +1 | mvnsite | 0 | the patch passed |
| -1 | whitespace | 0 | The patch has 25 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply |
| +1 | shadedclient | 661 | patch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 175 | the patch passed |
| -1 | findbugs | 213 | hadoop-hdds generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) |
||| _ Other Tests _ |
| -1 | unit | 1060 | hadoop-hdds in the patch failed. |
| -1 | unit | 2707 | hadoop-ozone in the patch failed. |
| +1 | asflicense | 52 | The patch does not generate ASF License warnings. |
| | | 9393 | |

| Reason | Tests |
|-------:|:------|
| FindBugs | module:hadoop-hdds |
| | org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(ContainerProtos$ContainerCommandRequestProto, DispatcherContext) invokes inefficient new Long(long) constructor; use Long.valueOf(long) instead At HddsDispatcher.java:[line 241] |
| Failed junit tests | hadoop.ozone.container.TestContainerReplication |
| | hadoop.ozone.client.rpc.TestBlockOutputStream |
| | hadoop.ozone.scm.node.TestQueryNode |
| | hadoop.ozone.client.rpc.Test2WayCommitInRatis |
| | hadoop.ozone.scm.TestXceiverClientManager |
| | hadoop.ozone.container.ozoneimpl.TestSecureOzoneContainer |
| | hadoop.ozone.container.ozoneimpl.TestOzoneContainer |
| | hadoop.ozone.TestContainerOperations |
| | hadoop.ozone.container.metrics.TestContainerMetrics |
| | hadoop.ozone.scm.TestXceiverClientMetrics |
| | hadoop.ozone.client.rpc.TestMultiBlockWritesWithDnFailures |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | Client=19.03.1 Server=19.03.1 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/1/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/1364 |
| Optional Tests | dupname asflicense compile cc mvnsite javac unit javadoc mvninstall shadedclient findbugs checkstyle |
| uname | Linux 487bb680b6dc 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / 55cc115 |
| Default Java | 1.8.0_222 |
| checkstyle | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/1/artifact/out/diff-checkstyle-hadoop-hdds.txt |
| checkstyle | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/1/artifact/out/diff-checkstyle-hadoop-ozone.txt |
| whitespace | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/1/artifact/out/whitespace-eol.txt |
| findbugs | https://builds.apache.org/job/hadoop-multibranch/job/PR-1364/1/artifact/ou
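The FindBugs item above flags `new Long(long)` in HddsDispatcher. A small self-contained sketch of why `Long.valueOf(long)` is the recommended replacement (the class below is illustrative, not the Ozone code):

```java
// Illustration of the FindBugs "inefficient new Long(long)" finding:
// the constructor always allocates a fresh object, while Long.valueOf(long)
// can return a cached instance for values in [-128, 127].
public class BoxingExample {
    // The flagged pattern (the constructor is deprecated in newer JDKs):
    static Long slow(long v) {
        return new Long(v);
    }

    // The recommended replacement:
    static Long fast(long v) {
        return Long.valueOf(v);
    }

    public static void main(String[] args) {
        // valueOf reuses cached boxes for small values...
        System.out.println(Long.valueOf(1L) == Long.valueOf(1L)); // true
        // ...while the constructor always creates a distinct object.
        System.out.println(new Long(1L) == new Long(1L));         // false
    }
}
```

On a hot dispatch path this avoids one small allocation per request, which is exactly why the static analyzer singles it out.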
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302885&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302885 ]

ASF GitHub Bot logged work on HDDS-1843:

Author: ASF GitHub Bot
Created on: 28/Aug/19 14:00
Start Date: 28/Aug/19 14:00
Worklog Time Spent: 10m

Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597836

## File path: hadoop-hdds/common/src/main/resources/audit.log

@@ -0,0 +1,25 @@
+2019-08-28 11:36:31,489 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_VOLUME {admin=sbanerjee, owner=sbanerjee, volume=testcontainerstatemachinefailures, creationTime=1566972391485, quotaInBytes=1152921504606846976} | ret=SUCCESS |
+2019-08-28 11:36:31,494 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS |
+2019-08-28 11:36:31,511 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, acls=[user:sbanerjee:a[ACCESS], group:staff:a[ACCESS], group:everyone:a[ACCESS], group:localaccounts:a[ACCESS], group:_appserverusr:a[ACCESS], group:admin:a[ACCESS], group:_appserveradm:a[ACCESS], group:_lpadmin:a[ACCESS], group:com.apple.sharepoint.group.2:a[ACCESS], group:_appstore:a[ACCESS], group:_lpoperator:a[ACCESS], group:_developer:a[ACCESS], group:_analyticsusers:a[ACCESS], group:com.apple.access_ftp:a[ACCESS], group:com.apple.access_screensharing:a[ACCESS], group:com.apple.access_ssh-disabled:a[ACCESS], group:com.apple.sharepoint.group.1:a[ACCESS]], isVersionEnabled=false, storageType=DISK, creationTime=0} | ret=SUCCESS |
+2019-08-28 11:36:31,515 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS |
+2019-08-28 11:36:31,519 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS |
+2019-08-28 11:36:31,561 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=1024, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=null} | ret=SUCCESS |
+2019-08-28 11:36:37,850 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=COMMIT_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=10, replicationType=null, replicationFactor=null, keyLocationInfo=[{blockID={containerID=1, localID=102693102652358657}, length=10, offset=0, token=null, pipeline=Pipeline[ Id: 33e5321e-9d61-4d31-94ca-a18a6abc80ba, Nodes: 82c96fdc-06f5-47f4-ab5d-7a2fa599b46e{ip: 192.168.0.64, host: 192.168.0.64, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:ONE, State:OPEN], createVersion=0}], clientID=102693102651244544} | ret=SUCCESS |
+2019-08-28 11:36:50,166 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS |

Review comment: whitespace:end of line

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---
Worklog Id: (was: 302885)
Time Spent: 4h 10m (was: 4h)

> Undetectable corruption after restart of a datanode
> ---
>
> Key: HDDS-1843
> URL: https://issues.apache.org/jira/browse/HDDS-1843
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Components: Ozone Datanode
> Affects Versions: 0.5.0
> Reporter: Shashikant Banerjee
> Assignee: Shashikant Banerjee
> Priority: Critical
> Labels: pull-request-available
> Fix For: 0.5.0
> Attachments: HDDS-1843.000.patch
> Time Spent: 4h 10m
> Remaining Estimate: 0h
>
> Right now, all write chunks use BufferedIO, i.e., the sync flag is disabled by default. Also, RocksDB metadata updates are done in the RocksDB cache first at the Datanode. If the buffered chunk data as well as the corresponding metadata update is lost as part of a datanode restart, it will not be possible to detect the corruption (not even with the container scanner) of this
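The `whitespace:end of line` review comments above (and the Yetus `-1 whitespace` vote) can be fixed mechanically when applying the patch; a sketch, using the attached patch file name for illustration:

```shell
# Show which lines in the working tree introduce whitespace errors
# such as trailing blanks:
git diff --check

# Apply a patch while stripping trailing whitespace from added lines
# (the patch file name is illustrative):
git apply --whitespace=fix HDDS-1843.000.patch
```

With `--whitespace=fix`, git rewrites offending added lines as it applies the patch instead of merely warning about them.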
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302882&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302882 ]

ASF GitHub Bot logged work on HDDS-1843:

Author: ASF GitHub Bot
Created on: 28/Aug/19 14:00
Start Date: 28/Aug/19 14:00
Worklog Time Spent: 10m

Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597791

## File path: hadoop-hdds/common/src/main/resources/audit.log

@@ -0,0 +1,25 @@
+2019-08-28 11:36:31,489 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_VOLUME {admin=sbanerjee, owner=sbanerjee, volume=testcontainerstatemachinefailures, creationTime=1566972391485, quotaInBytes=1152921504606846976} | ret=SUCCESS |
+2019-08-28 11:36:31,494 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS |
+2019-08-28 11:36:31,511 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, acls=[user:sbanerjee:a[ACCESS], group:staff:a[ACCESS], group:everyone:a[ACCESS], group:localaccounts:a[ACCESS], group:_appserverusr:a[ACCESS], group:admin:a[ACCESS], group:_appserveradm:a[ACCESS], group:_lpadmin:a[ACCESS], group:com.apple.sharepoint.group.2:a[ACCESS], group:_appstore:a[ACCESS], group:_lpoperator:a[ACCESS], group:_developer:a[ACCESS], group:_analyticsusers:a[ACCESS], group:com.apple.access_ftp:a[ACCESS], group:com.apple.access_screensharing:a[ACCESS], group:com.apple.access_ssh-disabled:a[ACCESS], group:com.apple.sharepoint.group.1:a[ACCESS]], isVersionEnabled=false, storageType=DISK, creationTime=0} | ret=SUCCESS |
+2019-08-28 11:36:31,515 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS |
+2019-08-28 11:36:31,519 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS |

Review comment: whitespace:end of line

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---
Worklog Id: (was: 302882)
Time Spent: 3h 40m (was: 3.5h)

> Undetectable corruption after restart of a datanode
> ---
>
> Key: HDDS-1843
> URL: https://issues.apache.org/jira/browse/HDDS-1843
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Components: Ozone Datanode
> Affects Versions: 0.5.0
> Reporter: Shashikant Banerjee
> Assignee: Shashikant Banerjee
> Priority: Critical
> Labels: pull-request-available
> Fix For: 0.5.0
> Attachments: HDDS-1843.000.patch
> Time Spent: 3h 40m
> Remaining Estimate: 0h
>
> Right now, all write chunks use BufferedIO, i.e., the sync flag is disabled by default. Also, RocksDB metadata updates are done in the RocksDB cache first at the Datanode. If the buffered chunk data as well as the corresponding metadata update is lost as part of a datanode restart, it may not be possible to detect corruption of this nature (not even with the container scanner) in a reasonable time frame, unless there is a client IO failure or the Recon server detects it over time. In order to at least detect the problem, the Ratis snapshot on the datanode should sync the RocksDB file. That way, the ContainerScanner will be able to detect this. We can also add a metric around sync to measure how much of a throughput loss it incurs.
> Thanks [~msingh] for suggesting this.
-- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
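The issue description proposes syncing data to stable storage so a datanode restart cannot silently drop buffered chunk writes. A minimal stdlib sketch of the relevant distinction, assuming nothing about Ozone's actual ChunkManager (class, method, and file names below are illustrative):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Illustrative only: shows the fsync step the issue asks for, not the
// actual Ozone datanode write path.
public class SyncedChunkWrite {
    static void writeChunk(Path chunkFile, byte[] data, boolean sync)
            throws IOException {
        try (FileChannel ch = FileChannel.open(chunkFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ch.write(ByteBuffer.wrap(data));
            if (sync) {
                // force(true) flushes both file contents and metadata to the
                // device, so the bytes survive a process or node restart.
                ch.force(true);
            }
            // Without force(), the bytes may still sit in OS buffers after
            // write() returns and can be lost on a crash -- the window that
            // makes the corruption described here undetectable.
        }
    }

    public static void main(String[] args) throws IOException {
        writeChunk(Paths.get("chunk_1.tmp"), "hello".getBytes(), true);
    }
}
```

The same idea applies to the metadata side: RocksDB updates held only in its write cache need an explicit flush/sync (for example at Ratis snapshot time, as the description suggests) before they can be considered durable.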
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302878&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302878 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 14:00 Start Date: 28/Aug/19 14:00 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597670 ## File path: hadoop-hdds/common/src/main/resources/audit.log ## @@ -0,0 +1,25 @@ +2019-08-28 11:36:31,489 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_VOLUME {admin=sbanerjee, owner=sbanerjee, volume=testcontainerstatemachinefailures, creationTime=1566972391485, quotaInBytes=1152921504606846976} | ret=SUCCESS | +2019-08-28 11:36:31,494 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,511 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, acls=[user:sbanerjee:a[ACCESS], group:staff:a[ACCESS], group:everyone:a[ACCESS], group:localaccounts:a[ACCESS], group:_appserverusr:a[ACCESS], group:admin:a[ACCESS], group:_appserveradm:a[ACCESS], group:_lpadmin:a[ACCESS], group:com.apple.sharepoint.group.2:a[ACCESS], group:_appstore:a[ACCESS], group:_lpoperator:a[ACCESS], group:_developer:a[ACCESS], group:_analyticsusers:a[ACCESS], group:com.apple.access_ftp:a[ACCESS], group:com.apple.access_screensharing:a[ACCESS], group:com.apple.access_ssh-disabled:a[ACCESS], group:com.apple.sharepoint.group.1:a[ACCESS]], isVersionEnabled=false, storageType=DISK, creationTime=0} | ret=SUCCESS | +2019-08-28 11:36:31,515 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,519 | INFO | OMAudit | 
user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,561 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=1024, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=null} | ret=SUCCESS | +2019-08-28 11:36:37,850 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=COMMIT_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=10, replicationType=null, replicationFactor=null, keyLocationInfo=[{blockID={containerID=1, localID=102693102652358657}, length=10, offset=0, token=null, pipeline=Pipeline[ Id: 33e5321e-9d61-4d31-94ca-a18a6abc80ba, Nodes: 82c96fdc-06f5-47f4-ab5d-7a2fa599b46e{ip: 192.168.0.64, host: 192.168.0.64, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:ONE, State:OPEN], createVersion=0}], clientID=102693102651244544} | ret=SUCCESS | +2019-08-28 11:36:50,166 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:50,168 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:50,177 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=1024, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=null} | ret=SUCCESS | +2019-08-28 11:36:56,287 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=COMMIT_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=6, replicationType=null, replicationFactor=null, keyLocationInfo=[{blockID={containerID=2, 
localID=102693103873294339}, length=6, offset=0, token=null, pipeline=Pipeline[ Id: 5c7f9b1c-2fbc-470e-8e56-1e9bbf131bcb, Nodes: 82c96fdc-06f5-47f4-ab5d-7a2fa599b46e{ip: 192.168.0.64, host: 192.168.0.64, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:ONE, State:OPEN], createVersion=0}], clientID=102693103872901122} | ret=SUCCESS | +2019-08-28 11:37:08,500 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:37:08,502 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:37:08,514 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302864&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302864 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 14:00 Start Date: 28/Aug/19 14:00 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597430 ## File path: hadoop-hdds/common/src/main/resources/audit.log ## @@ -0,0 +1,25 @@ +2019-08-28 11:36:31,489 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_VOLUME {admin=sbanerjee, owner=sbanerjee, volume=testcontainerstatemachinefailures, creationTime=1566972391485, quotaInBytes=1152921504606846976} | ret=SUCCESS | +2019-08-28 11:36:31,494 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,511 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, acls=[user:sbanerjee:a[ACCESS], group:staff:a[ACCESS], group:everyone:a[ACCESS], group:localaccounts:a[ACCESS], group:_appserverusr:a[ACCESS], group:admin:a[ACCESS], group:_appserveradm:a[ACCESS], group:_lpadmin:a[ACCESS], group:com.apple.sharepoint.group.2:a[ACCESS], group:_appstore:a[ACCESS], group:_lpoperator:a[ACCESS], group:_developer:a[ACCESS], group:_analyticsusers:a[ACCESS], group:com.apple.access_ftp:a[ACCESS], group:com.apple.access_screensharing:a[ACCESS], group:com.apple.access_ssh-disabled:a[ACCESS], group:com.apple.sharepoint.group.1:a[ACCESS]], isVersionEnabled=false, storageType=DISK, creationTime=0} | ret=SUCCESS | +2019-08-28 11:36:31,515 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,519 | INFO | OMAudit | 
user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,561 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=1024, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=null} | ret=SUCCESS | +2019-08-28 11:36:37,850 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=COMMIT_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=10, replicationType=null, replicationFactor=null, keyLocationInfo=[{blockID={containerID=1, localID=102693102652358657}, length=10, offset=0, token=null, pipeline=Pipeline[ Id: 33e5321e-9d61-4d31-94ca-a18a6abc80ba, Nodes: 82c96fdc-06f5-47f4-ab5d-7a2fa599b46e{ip: 192.168.0.64, host: 192.168.0.64, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:ONE, State:OPEN], createVersion=0}], clientID=102693102651244544} | ret=SUCCESS | +2019-08-28 11:36:50,166 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:50,168 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:50,177 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=1024, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=null} | ret=SUCCESS | +2019-08-28 11:36:56,287 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=COMMIT_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=6, replicationType=null, replicationFactor=null, keyLocationInfo=[{blockID={containerID=2, 
localID=102693103873294339}, length=6, offset=0, token=null, pipeline=Pipeline[ Id: 5c7f9b1c-2fbc-470e-8e56-1e9bbf131bcb, Nodes: 82c96fdc-06f5-47f4-ab5d-7a2fa599b46e{ip: 192.168.0.64, host: 192.168.0.64, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:ONE, State:OPEN], createVersion=0}], clientID=102693103872901122} | ret=SUCCESS | +2019-08-28 11:37:08,500 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | Review comment: whitespace:end of line This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For qu
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302876&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302876 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 14:00 Start Date: 28/Aug/19 14:00 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597642 ## File path: hadoop-hdds/common/src/main/resources/audit.log ## @@ -0,0 +1,25 @@ +2019-08-28 11:36:31,489 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_VOLUME {admin=sbanerjee, owner=sbanerjee, volume=testcontainerstatemachinefailures, creationTime=1566972391485, quotaInBytes=1152921504606846976} | ret=SUCCESS | +2019-08-28 11:36:31,494 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,511 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, acls=[user:sbanerjee:a[ACCESS], group:staff:a[ACCESS], group:everyone:a[ACCESS], group:localaccounts:a[ACCESS], group:_appserverusr:a[ACCESS], group:admin:a[ACCESS], group:_appserveradm:a[ACCESS], group:_lpadmin:a[ACCESS], group:com.apple.sharepoint.group.2:a[ACCESS], group:_appstore:a[ACCESS], group:_lpoperator:a[ACCESS], group:_developer:a[ACCESS], group:_analyticsusers:a[ACCESS], group:com.apple.access_ftp:a[ACCESS], group:com.apple.access_screensharing:a[ACCESS], group:com.apple.access_ssh-disabled:a[ACCESS], group:com.apple.sharepoint.group.1:a[ACCESS]], isVersionEnabled=false, storageType=DISK, creationTime=0} | ret=SUCCESS | +2019-08-28 11:36:31,515 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,519 | INFO | OMAudit | 
user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,561 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=1024, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=null} | ret=SUCCESS | +2019-08-28 11:36:37,850 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=COMMIT_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=10, replicationType=null, replicationFactor=null, keyLocationInfo=[{blockID={containerID=1, localID=102693102652358657}, length=10, offset=0, token=null, pipeline=Pipeline[ Id: 33e5321e-9d61-4d31-94ca-a18a6abc80ba, Nodes: 82c96fdc-06f5-47f4-ab5d-7a2fa599b46e{ip: 192.168.0.64, host: 192.168.0.64, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:ONE, State:OPEN], createVersion=0}], clientID=102693102651244544} | ret=SUCCESS | +2019-08-28 11:36:50,166 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:50,168 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:50,177 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=1024, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=null} | ret=SUCCESS | +2019-08-28 11:36:56,287 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=COMMIT_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=6, replicationType=null, replicationFactor=null, keyLocationInfo=[{blockID={containerID=2, 
localID=102693103873294339}, length=6, offset=0, token=null, pipeline=Pipeline[ Id: 5c7f9b1c-2fbc-470e-8e56-1e9bbf131bcb, Nodes: 82c96fdc-06f5-47f4-ab5d-7a2fa599b46e{ip: 192.168.0.64, host: 192.168.0.64, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:ONE, State:OPEN], createVersion=0}], clientID=102693103872901122} | ret=SUCCESS | +2019-08-28 11:37:08,500 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:37:08,502 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:37:08,514 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302877&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302877 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 14:00 Start Date: 28/Aug/19 14:00 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597660 ## File path: hadoop-hdds/common/src/main/resources/audit.log ## @@ -0,0 +1,25 @@ +2019-08-28 11:36:31,489 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_VOLUME {admin=sbanerjee, owner=sbanerjee, volume=testcontainerstatemachinefailures, creationTime=1566972391485, quotaInBytes=1152921504606846976} | ret=SUCCESS | +2019-08-28 11:36:31,494 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,511 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, acls=[user:sbanerjee:a[ACCESS], group:staff:a[ACCESS], group:everyone:a[ACCESS], group:localaccounts:a[ACCESS], group:_appserverusr:a[ACCESS], group:admin:a[ACCESS], group:_appserveradm:a[ACCESS], group:_lpadmin:a[ACCESS], group:com.apple.sharepoint.group.2:a[ACCESS], group:_appstore:a[ACCESS], group:_lpoperator:a[ACCESS], group:_developer:a[ACCESS], group:_analyticsusers:a[ACCESS], group:com.apple.access_ftp:a[ACCESS], group:com.apple.access_screensharing:a[ACCESS], group:com.apple.access_ssh-disabled:a[ACCESS], group:com.apple.sharepoint.group.1:a[ACCESS]], isVersionEnabled=false, storageType=DISK, creationTime=0} | ret=SUCCESS | +2019-08-28 11:36:31,515 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,519 | INFO | OMAudit | 
user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:31,561 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=1024, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=null} | ret=SUCCESS | +2019-08-28 11:36:37,850 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=COMMIT_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=10, replicationType=null, replicationFactor=null, keyLocationInfo=[{blockID={containerID=1, localID=102693102652358657}, length=10, offset=0, token=null, pipeline=Pipeline[ Id: 33e5321e-9d61-4d31-94ca-a18a6abc80ba, Nodes: 82c96fdc-06f5-47f4-ab5d-7a2fa599b46e{ip: 192.168.0.64, host: 192.168.0.64, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:ONE, State:OPEN], createVersion=0}], clientID=102693102651244544} | ret=SUCCESS | +2019-08-28 11:36:50,166 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:50,168 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:36:50,177 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=1024, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=null} | ret=SUCCESS | +2019-08-28 11:36:56,287 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=COMMIT_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=6, replicationType=null, replicationFactor=null, keyLocationInfo=[{blockID={containerID=2, 
localID=102693103873294339}, length=6, offset=0, token=null, pipeline=Pipeline[ Id: 5c7f9b1c-2fbc-470e-8e56-1e9bbf131bcb, Nodes: 82c96fdc-06f5-47f4-ab5d-7a2fa599b46e{ip: 192.168.0.64, host: 192.168.0.64, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:ONE, State:OPEN], createVersion=0}], clientID=102693103872901122} | ret=SUCCESS | +2019-08-28 11:37:08,500 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:37:08,502 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS | +2019-08-28 11:37:08,514 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302873&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302873 ]
ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 14:00 Start Date: 28/Aug/19 14:00 Worklog Time Spent: 10m
Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597581
## File path: hadoop-hdds/common/src/main/resources/audit.log ##
@@ -0,0 +1,25 @@
+2019-08-28 11:36:31,489 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_VOLUME {admin=sbanerjee, owner=sbanerjee, volume=testcontainerstatemachinefailures, creationTime=1566972391485, quotaInBytes=1152921504606846976} | ret=SUCCESS |
+2019-08-28 11:36:31,494 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS |
+2019-08-28 11:36:31,511 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, acls=[user:sbanerjee:a[ACCESS], group:staff:a[ACCESS], group:everyone:a[ACCESS], group:localaccounts:a[ACCESS], group:_appserverusr:a[ACCESS], group:admin:a[ACCESS], group:_appserveradm:a[ACCESS], group:_lpadmin:a[ACCESS], group:com.apple.sharepoint.group.2:a[ACCESS], group:_appstore:a[ACCESS], group:_lpoperator:a[ACCESS], group:_developer:a[ACCESS], group:_analyticsusers:a[ACCESS], group:com.apple.access_ftp:a[ACCESS], group:com.apple.access_screensharing:a[ACCESS], group:com.apple.access_ssh-disabled:a[ACCESS], group:com.apple.sharepoint.group.1:a[ACCESS]], isVersionEnabled=false, storageType=DISK, creationTime=0} | ret=SUCCESS |
+2019-08-28 11:36:31,515 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS |
+2019-08-28 11:36:31,519 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS |
+2019-08-28 11:36:31,561 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=1024, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=null} | ret=SUCCESS |
+2019-08-28 11:36:37,850 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=COMMIT_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=10, replicationType=null, replicationFactor=null, keyLocationInfo=[{blockID={containerID=1, localID=102693102652358657}, length=10, offset=0, token=null, pipeline=Pipeline[ Id: 33e5321e-9d61-4d31-94ca-a18a6abc80ba, Nodes: 82c96fdc-06f5-47f4-ab5d-7a2fa599b46e{ip: 192.168.0.64, host: 192.168.0.64, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:ONE, State:OPEN], createVersion=0}], clientID=102693102651244544} | ret=SUCCESS |
+2019-08-28 11:36:50,166 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS |
+2019-08-28 11:36:50,168 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS |
+2019-08-28 11:36:50,177 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=1024, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=null} | ret=SUCCESS |
+2019-08-28 11:36:56,287 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=COMMIT_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=6, replicationType=null, replicationFactor=null, keyLocationInfo=[{blockID={containerID=2, localID=102693103873294339}, length=6, offset=0, token=null, pipeline=Pipeline[ Id: 5c7f9b1c-2fbc-470e-8e56-1e9bbf131bcb, Nodes: 82c96fdc-06f5-47f4-ab5d-7a2fa599b46e{ip: 192.168.0.64, host: 192.168.0.64, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:ONE, State:OPEN], createVersion=0}], clientID=102693103872901122} | ret=SUCCESS |
+2019-08-28 11:37:08,500 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS |
+2019-08-28 11:37:08,502 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS |
+2019-08-28 11:37:08,514 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=ALLOCATE
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302883&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302883 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 14:00 Start Date: 28/Aug/19 14:00 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597810 ## File path: hadoop-hdds/common/src/main/resources/audit.log ## Review comment: whitespace:end of line This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 302883) Time Spent: 3h 50m (was: 3h 40m)
> Undetectable corruption after restart of a datanode
> ---
> Key: HDDS-1843
> URL: https://issues.apache.org/jira/browse/HDDS-1843
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Components: Ozone Datanode
> Affects Versions: 0.5.0
> Reporter: Shashikant Banerjee
> Assignee: Shashikant Banerjee
> Priority: Critical
> Labels: pull-request-available
> Fix For: 0.5.0
> Attachments: HDDS-1843.000.patch
> Time Spent: 3h 50m
> Remaining Estimate: 0h
>
> Right now, all chunk writes use buffered IO, i.e., the sync flag is disabled by default. RocksDB metadata updates are also applied first to the RocksDB cache on the datanode. If both the buffered chunk data and the corresponding metadata update are lost across a datanode restart, corruption of this kind cannot be detected in a reasonable time frame (not even by the container scanner), unless a client IO failure occurs or the Recon server detects it over time. To at least make the problem detectable, the Ratis snapshot on the datanode should sync the RocksDB file. That way, the ContainerScanner will be able to detect it. We can also add a metric around sync to measure how much throughput loss it incurs.
> Thanks [~msingh] for suggesting this.
--
This message was sent by Atlassian Jira (v8.3.2#803003)
-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
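The proposed fix above (syncing the RocksDB file as part of the Ratis snapshot) boils down to forcing buffered writes onto stable storage instead of leaving them in OS or cache buffers. A minimal sketch of that mechanism using `FileChannel.force(true)`, which issues an fsync; the class and method names here are illustrative, not the actual Ozone or Ratis API:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SyncOnSnapshot {
    // Durably persist a file's contents, the way a snapshot step might
    // persist container metadata (hypothetical helper, not the Ozone API).
    static void writeDurably(Path path, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ch.write(ByteBuffer.wrap(data));
            // force(true) flushes both file data and metadata to disk,
            // so a restart cannot silently drop the buffered write.
            ch.force(true);
        }
    }

    public static void main(String[] args) throws IOException {
        Path meta = Files.createTempFile("container-meta", ".db");
        writeDurably(meta,
                "blockID=1,len=10".getBytes(StandardCharsets.UTF_8));
        System.out.println(
                new String(Files.readAllBytes(meta), StandardCharsets.UTF_8));
    }
}
```

Without the `force(true)` call, the write may sit in buffers and vanish on a crash or restart, which is exactly the window the issue describes; the suggested metric would measure the throughput cost of adding this sync.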
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302875&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302875 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 14:00 Start Date: 28/Aug/19 14:00 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597626 ## File path: hadoop-hdds/common/src/main/resources/audit.log ##
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302874&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302874 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 14:00 Start Date: 28/Aug/19 14:00 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597600 ## File path: hadoop-hdds/common/src/main/resources/audit.log ##
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302866&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302866 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 14:00 Start Date: 28/Aug/19 14:00 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597464 ## File path: hadoop-hdds/common/src/main/resources/audit.log ##
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302869&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302869 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 14:00 Start Date: 28/Aug/19 14:00 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597514 ## File path: hadoop-hdds/common/src/main/resources/audit.log ##
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302868&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302868 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 14:00 Start Date: 28/Aug/19 14:00 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597501 ## File path: hadoop-hdds/common/src/main/resources/audit.log ##
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302871&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302871 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 14:00 Start Date: 28/Aug/19 14:00 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597545 ## File path: hadoop-hdds/common/src/main/resources/audit.log ##
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302880&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302880 ]
ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 28/Aug/19 14:00
Start Date: 28/Aug/19 14:00
Worklog Time Spent: 10m
Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597764
## File path: hadoop-hdds/common/src/main/resources/audit.log ##
+2019-08-28 11:36:31,511 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, acls=[user:sbanerjee:a[ACCESS], group:staff:a[ACCESS], group:everyone:a[ACCESS], group:localaccounts:a[ACCESS], group:_appserverusr:a[ACCESS], group:admin:a[ACCESS], group:_appserveradm:a[ACCESS], group:_lpadmin:a[ACCESS], group:com.apple.sharepoint.group.2:a[ACCESS], group:_appstore:a[ACCESS], group:_lpoperator:a[ACCESS], group:_developer:a[ACCESS], group:_analyticsusers:a[ACCESS], group:com.apple.access_ftp:a[ACCESS], group:com.apple.access_screensharing:a[ACCESS], group:com.apple.access_ssh-disabled:a[ACCESS], group:com.apple.sharepoint.group.1:a[ACCESS]], isVersionEnabled=false, storageType=DISK, creationTime=0} | ret=SUCCESS |
Review comment: whitespace:end of line
Issue Time Tracking --- Worklog Id: (was: 302880) Time Spent: 3h 20m (was: 3h 10m)
> Undetectable corruption after restart of a datanode
> Key: HDDS-1843
> URL: https://issues.apache.org/jira/browse/HDDS-1843
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Components: Ozone Datanode
> Affects Versions: 0.5.0
> Reporter: Shashikant Banerjee
> Assignee: Shashikant Banerjee
> Priority: Critical
> Labels: pull-request-available
> Fix For: 0.5.0
> Attachments: HDDS-1843.000.patch
> Time Spent: 3h 20m
> Remaining Estimate: 0h
>
> Right now, all chunk writes use buffered I/O, i.e. the sync flag is disabled by default. Likewise, RocksDB metadata updates on the datanode land in the RocksDB cache first. If both the buffered chunk data and the corresponding metadata update are lost across a datanode restart, the resulting corruption cannot be detected in a reasonable time frame (not even by the container scanner) unless a client I/O fails or the Recon server eventually notices it. To at least make the problem detectable, the Ratis snapshot on the datanode should sync the RocksDB file; the ContainerScanner will then be able to detect the inconsistency. We can also add a metric around the sync to measure how much throughput loss it incurs.
> Thanks [~msingh] for suggesting this.
-- This message was sent by Atlassian Jira (v8.3.2#803003)
- To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
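The proposed fix — syncing the RocksDB file as part of the Ratis snapshot — boils down to forcing buffered state to disk at snapshot time, plus a timer for the suggested sync metric. A minimal sketch in plain Java of that idea (the class and method names here are illustrative, not the actual Ozone/Ratis API; the real change would flush RocksDB itself):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: at Ratis snapshot time, fsync a datanode file
// (e.g. the container RocksDB file) so acknowledged state survives a
// restart instead of sitting only in OS/RocksDB caches.
public class SnapshotSync {

    // Force a file's data and metadata to the storage device.
    public static void syncFile(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
            ch.force(true); // fsync: durable even across a crash/restart
        }
    }

    public static void main(String[] args) throws IOException {
        Path db = Files.createTempFile("container", ".db");
        Files.write(db, "metadata".getBytes(StandardCharsets.UTF_8)); // buffered write

        long start = System.nanoTime();
        syncFile(db); // the proposed sync point inside takeSnapshot()
        long micros = (System.nanoTime() - start) / 1_000;

        // Recording this duration is the "metric around sync" the issue
        // suggests for measuring the throughput cost.
        System.out.println("synced " + Files.size(db) + " bytes in " + micros + " us");
        Files.delete(db);
    }
}
```

After such a sync, a restart can no longer silently drop the metadata update, so a scanner comparing chunk data against the synced metadata can flag the missing chunk.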
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302881&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302881 ]
ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 28/Aug/19 14:00
Start Date: 28/Aug/19 14:00
Worklog Time Spent: 10m
Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597775
## File path: hadoop-hdds/common/src/main/resources/audit.log ##
+2019-08-28 11:36:31,515 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS |
Review comment: whitespace:end of line
Issue Time Tracking --- Worklog Id: (was: 302881) Time Spent: 3.5h (was: 3h 20m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302879&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302879 ]
ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 28/Aug/19 14:00
Start Date: 28/Aug/19 14:00
Worklog Time Spent: 10m
Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597746
## File path: hadoop-hdds/common/src/main/resources/audit.log ##
+2019-08-28 11:36:31,494 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_VOLUME {volume=testcontainerstatemachinefailures} | ret=SUCCESS |
Review comment: whitespace:end of line
Issue Time Tracking --- Worklog Id: (was: 302879) Time Spent: 3h 10m (was: 3h)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302870&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302870 ]
ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 28/Aug/19 14:00
Start Date: 28/Aug/19 14:00
Worklog Time Spent: 10m
Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597530
## File path: hadoop-hdds/common/src/main/resources/audit.log ##
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302872&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302872 ]
ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 28/Aug/19 14:00
Start Date: 28/Aug/19 14:00
Worklog Time Spent: 10m
Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597560
## File path: hadoop-hdds/common/src/main/resources/audit.log ##
+2019-08-28 11:36:31,489 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=CREATE_VOLUME {admin=sbanerjee, owner=sbanerjee, volume=testcontainerstatemachinefailures, creationTime=1566972391485, quotaInBytes=1152921504606846976} | ret=SUCCESS |
Review comment: whitespace:end of line
Issue Time Tracking --- Worklog Id: (was: 302872) Time Spent: 2h (was: 1h 50m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302867&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302867 ]
ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 28/Aug/19 14:00
Start Date: 28/Aug/19 14:00
Worklog Time Spent: 10m
Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597486
## File path: hadoop-hdds/common/src/main/resources/audit.log ##
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302863&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302863 ]
ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 28/Aug/19 14:00
Start Date: 28/Aug/19 14:00
Worklog Time Spent: 10m
Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597419
## File path: hadoop-hdds/common/src/main/resources/audit.log ##
+2019-08-28 11:36:56,287 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=COMMIT_KEY {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures, key=ratis, dataSize=6, replicationType=null, replicationFactor=null, keyLocationInfo=[{blockID={containerID=2, localID=102693103873294339}, length=6, offset=0, token=null, pipeline=Pipeline[ Id: 5c7f9b1c-2fbc-470e-8e56-1e9bbf131bcb, Nodes: 82c96fdc-06f5-47f4-ab5d-7a2fa599b46e{ip: 192.168.0.64, host: 192.168.0.64, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:ONE, State:OPEN], createVersion=0}], clientID=102693103872901122} | ret=SUCCESS |
Review comment: whitespace:end of line
Issue Time Tracking --- Worklog Id: (was: 30
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302865&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302865 ]
ASF GitHub Bot logged work on HDDS-1843:
Author: ASF GitHub Bot
Created on: 28/Aug/19 14:00
Start Date: 28/Aug/19 14:00
Worklog Time Spent: 10m
Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode.
URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597444
## File path: hadoop-hdds/common/src/main/resources/audit.log ##
+2019-08-28 11:37:08,502 | INFO | OMAudit | user=sbanerjee | ip=127.0.0.1 | op=READ_BUCKET {volume=testcontainerstatemachinefailures, bucket=testcontainerstatemachinefailures} | ret=SUCCESS |
Review comment: whitespace:end of line --
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302862&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302862 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 14:00 Start Date: 28/Aug/19 14:00 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364#discussion_r318597402 ## File path: hadoop-hdds/common/src/main/resources/audit.log ## (quotes the same audit.log hunk as the previous worklog entry) Review comment: whitespace:end of line Issue Time Tracking --- Worklog Id: (was: 302862) Time Spent: 20m (was: 10m)
[jira] [Work logged] (HDDS-1843) Undetectable corruption after restart of a datanode
[ https://issues.apache.org/jira/browse/HDDS-1843?focusedWorklogId=302559&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302559 ] ASF GitHub Bot logged work on HDDS-1843: Author: ASF GitHub Bot Created on: 28/Aug/19 06:17 Start Date: 28/Aug/19 06:17 Worklog Time Spent: 10m Work Description: bshashikant commented on pull request #1364: HDDS-1843. Undetectable corruption after restart of a datanode. URL: https://github.com/apache/hadoop/pull/1364 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 302559) Remaining Estimate: 0h Time Spent: 10m > Undetectable corruption after restart of a datanode > --- > > Key: HDDS-1843 > URL: https://issues.apache.org/jira/browse/HDDS-1843 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Datanode > Affects Versions: 0.5.0 > Reporter: Shashikant Banerjee > Assignee: Shashikant Banerjee > Priority: Critical > Labels: pull-request-available > Fix For: 0.5.0 > > Attachments: HDDS-1843.000.patch > > Time Spent: 10m > Remaining Estimate: 0h > > Right now, all chunk writes use buffered I/O, i.e., the sync flag is disabled by > default. Also, RocksDB metadata updates are applied first to the RocksDB cache > on the Datanode. If both the buffered chunk data and the corresponding metadata > update are lost across a datanode restart, corruption of this nature cannot be > detected (not even by the container scanner) in a reasonable time frame, unless > there is a client I/O failure or the Recon server detects it over time. To at > least detect the problem, the Ratis snapshot on the datanode should sync the > RocksDB file. That way, the ContainerScanner will be able to detect it. We can > also add a metric around sync to measure how much of a throughput loss it > incurs. > Thanks [~msingh] for suggesting this. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
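The description above hinges on the difference between a buffered write (which a crash or restart can silently discard) and a synced write (which survives it). A minimal sketch of that distinction, using only the JDK's `FileChannel` — the class name `ChunkSyncSketch` and the `writeChunk` helper are hypothetical illustrations, not Ozone code:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import static java.nio.file.StandardOpenOption.CREATE;
import static java.nio.file.StandardOpenOption.WRITE;

public class ChunkSyncSketch {

    // Hypothetical helper: writes a chunk buffer to a file. When sync is true,
    // it forces both data and file metadata to stable storage before returning,
    // so the bytes survive a process or node crash. When sync is false, the
    // write may sit in OS buffers -- the scenario this JIRA describes, where a
    // datanode restart can drop both the chunk data and the metadata update.
    static void writeChunk(Path file, ByteBuffer data, boolean sync) throws IOException {
        try (FileChannel ch = FileChannel.open(file, CREATE, WRITE)) {
            while (data.hasRemaining()) {
                ch.write(data);
            }
            if (sync) {
                // Equivalent of fsync(2): flush data and metadata to disk.
                ch.force(true);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path chunk = Paths.get("chunk_1.tmp");
        writeChunk(chunk, ByteBuffer.wrap("hello".getBytes()), true);
        System.out.println(chunk.toFile().length()); // prints 5
    }
}
```

The proposed fix takes the cheaper middle ground: rather than syncing on every chunk write (which the suggested metric would show as a throughput loss), it syncs the RocksDB file at Ratis snapshot time, bounding how much acknowledged state a restart can silently lose.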