[jira] [Created] (HDFS-17220) fix same available space policy in AvailableSpaceVolumeChoosingPolicy
Fei Guo created HDFS-17220: -- Summary: fix same available space policy in AvailableSpaceVolumeChoosingPolicy Key: HDFS-17220 URL: https://issues.apache.org/jira/browse/HDFS-17220 Project: Hadoop HDFS Issue Type: Bug Components: hdfs Affects Versions: 3.3.6 Reporter: Fei Guo If all the volumes have the same available space, for example {{1 MB}}, and {{dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold}} is set to 0, we should treat all the volumes equally when choosing among the available volumes. Currently we do not, and we can fix it in {{AvailableSpaceVolumeChoosingPolicy}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
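The intended behavior reads roughly like this sketch (illustrative Java, not the actual Hadoop code; the class name and `choose` signature are assumptions): when the spread between the most-free and least-free volume is within the balanced-space threshold, every volume gets an equal chance.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch: with a threshold of 0, "balanced" means exactly equal
// available space, so all volumes should be chosen uniformly at random.
public class EqualSpaceVolumeChooser {
  private final Random random = new Random();

  // Returns the index of the chosen volume given each volume's available bytes.
  public int choose(long[] availableBytes, long balancedSpaceThreshold) {
    long max = Arrays.stream(availableBytes).max().getAsLong();
    long min = Arrays.stream(availableBytes).min().getAsLong();
    if (max - min <= balancedSpaceThreshold) {
      // All volumes are considered balanced: pick uniformly at random.
      return random.nextInt(availableBytes.length);
    }
    // Otherwise prefer the volume with the most free space (stand-in for the
    // real policy's weighted choice).
    for (int i = 0; i < availableBytes.length; i++) {
      if (availableBytes[i] == max) return i;
    }
    return 0;
  }
}
```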
[jira] [Created] (HDFS-17210) Optimize AvailableSpaceBlockPlacementPolicy
Fei Guo created HDFS-17210: -- Summary: Optimize AvailableSpaceBlockPlacementPolicy Key: HDFS-17210 URL: https://issues.apache.org/jira/browse/HDFS-17210 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs Affects Versions: 3.3.6 Reporter: Fei Guo For now, we may have many nodes whose usage is over 85%, and some nodes' usage is over 90%. If we choose three nodes such as nodeA at 97%, nodeB at 98%, and nodeC at 99%, we actually do not want nodeC (99%) to be chosen: its usage is already very high, and even with only a random chance of selection it will reach 100% soon. So we can directly choose the least-used node (nodeA) when all nodes' usage is over 95% (just an example).
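A minimal sketch of the proposed fallback, assuming a 95% cutoff (all names here are illustrative, not the real `AvailableSpaceBlockPlacementPolicy` API):

```java
// Hypothetical sketch: when every candidate node is above a high-usage
// cutoff, skip the randomized choice and take the least-used node directly.
public class HighUsageFallback {
  static final double HIGH_USAGE_CUTOFF = 0.95; // e.g. 95% used

  // usages[i] is the fraction of capacity used on node i.
  // Returns the index of the node to pick directly, or -1 to defer to the
  // normal randomized placement decision.
  public static int chooseDirectly(double[] usages) {
    int least = 0;
    for (int i = 0; i < usages.length; i++) {
      if (usages[i] <= HIGH_USAGE_CUTOFF) {
        return -1; // at least one node still has headroom: keep the usual policy
      }
      if (usages[i] < usages[least]) least = i;
    }
    return least; // every node is nearly full: take the least-used one
  }
}
```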
[jira] [Created] (HDFS-16357) Fix log format in DFSUtilClient
guo created HDFS-16357: -- Summary: Fix log format in DFSUtilClient Key: HDFS-16357 URL: https://issues.apache.org/jira/browse/HDFS-16357 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs Affects Versions: 3.3.1 Reporter: guo If the address is local, there is an additional space in the log; we can improve it to look proper.
[jira] [Created] (HDFS-16355) Improve block scanner desc
guo created HDFS-16355: -- Summary: Improve block scanner desc Key: HDFS-16355 URL: https://issues.apache.org/jira/browse/HDFS-16355 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs Affects Versions: 3.3.1 Reporter: guo The datanode block scanner will be disabled if `dfs.block.scanner.volume.bytes.per.second` is configured less than or equal to zero; we can improve the description.
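For reference, this is roughly the setting the description should spell out; a value of zero or below turns the scanner off (the shipped default, if I recall hdfs-default.xml correctly, is 1048576 bytes/s):

```xml
<!-- Illustrative hdfs-site.xml fragment: a value <= 0 disables the
     DataNode block scanner entirely. -->
<property>
  <name>dfs.block.scanner.volume.bytes.per.second</name>
  <value>0</value>
</property>
```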
[jira] [Updated] (HDFS-16355) Improve block scanner desc
[ https://issues.apache.org/jira/browse/HDFS-16355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guo updated HDFS-16355: --- Description: datanode block scanner will be disabled if `dfs.block.scanner.volume.bytes.per.second` is configured less than or equal to zero, we can improve the description (was: datanode block scanner will be dissbled if `dfs.block.scanner.volume.bytes.per.second` is configured less then or equal to zero, we can improve the desciption) > Improve block scanner desc > -- > > Key: HDFS-16355 > URL: https://issues.apache.org/jira/browse/HDFS-16355 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Affects Versions: 3.3.1 >Reporter: guo >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > datanode block scanner will be disabled if > `dfs.block.scanner.volume.bytes.per.second` is configured less than or equal > to zero, we can improve the description
[jira] [Created] (HDFS-16351) add path exception information in FSNamesystem
guo created HDFS-16351: -- Summary: add path exception information in FSNamesystem Key: HDFS-16351 URL: https://issues.apache.org/jira/browse/HDFS-16351 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs Affects Versions: 3.3.1 Reporter: guo Add path information to exception messages in FSNamesystem to make them clearer.
[jira] [Created] (HDFS-16347) Fix directory scan throttle default value
guo created HDFS-16347: -- Summary: Fix directory scan throttle default value Key: HDFS-16347 URL: https://issues.apache.org/jira/browse/HDFS-16347 Project: Hadoop HDFS Issue Type: Improvement Components: documentation Affects Versions: 3.3.1 Reporter: guo `dfs.datanode.directoryscan.throttle.limit.ms.per.sec` was changed from `1000` to `-1` by default after HDFS-13947; we can improve the doc.
[jira] [Created] (HDFS-16345) Fix test cases fail in TestBlockStoragePolicy
guo created HDFS-16345: -- Summary: Fix test cases fail in TestBlockStoragePolicy Key: HDFS-16345 URL: https://issues.apache.org/jira/browse/HDFS-16345 Project: Hadoop HDFS Issue Type: Improvement Components: build Affects Versions: 3.3.1 Reporter: guo The test class `TestBlockStoragePolicy` fails frequently with `BindException`, which blocks all normal source code builds; we can improve it. [ERROR] Tests run: 26, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 49.295 s <<< FAILURE! - in org.apache.hadoop.hdfs.TestBlockStoragePolicy [ERROR] testChooseTargetWithTopology(org.apache.hadoop.hdfs.TestBlockStoragePolicy) Time elapsed: 0.551 s <<< ERROR! java.net.BindException: Problem binding to [localhost:43947] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:931) at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:827) at org.apache.hadoop.ipc.Server.bind(Server.java:657) at org.apache.hadoop.ipc.Server$Listener.(Server.java:1352) at org.apache.hadoop.ipc.Server.(Server.java:3252) at org.apache.hadoop.ipc.RPC$Server.(RPC.java:1062) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server.(ProtobufRpcEngine2.java:468) at org.apache.hadoop.ipc.ProtobufRpcEngine2.getServer(ProtobufRpcEngine2.java:371) at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:853) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.(NameNodeRpcServer.java:466) at org.apache.hadoop.hdfs.server.namenode.NameNode.createRpcServer(NameNode.java:860) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:766) 
at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:1017) at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:992) at org.apache.hadoop.hdfs.TestBlockStoragePolicy.testChooseTargetWithTopology(TestBlockStoragePolicy.java:1275) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.ParentRunner.run(ParentRunner.java:413) at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) Caused by: java.net.BindException: Address already in use at sun.nio.ch.Net.bind0(Native Method) at sun.nio.ch.Net.bind(Net.java:461) at sun.nio.ch.Net.bind(Net.java:453) at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:222) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:85) at
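A common remedy for this class of flaky test, sketched below, is to bind to port 0 so the OS hands out a free ephemeral port instead of hard-coding one (whether this fits the cluster setup in `TestBlockStoragePolicy` is an assumption):

```java
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Sketch: ask the OS for a free ephemeral port rather than reusing a fixed
// port like 43947, which another process or test may still hold.
public class EphemeralPortExample {
  public static int findFreePort() throws Exception {
    try (ServerSocket socket = new ServerSocket()) {
      socket.setReuseAddress(true);
      socket.bind(new InetSocketAddress("localhost", 0)); // 0 = any free port
      return socket.getLocalPort();
    }
  }
}
```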
[jira] [Resolved] (HDFS-16340) improve diskbalancer error message
[ https://issues.apache.org/jira/browse/HDFS-16340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guo resolved HDFS-16340. Resolution: Won't Fix > improve diskbalancer error message > -- > > Key: HDFS-16340 > URL: https://issues.apache.org/jira/browse/HDFS-16340 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Affects Versions: 3.3.1 >Reporter: guo >Priority: Minor > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > During disk balancing, when we cannot get JSON from an item, only `Unable to get > json from Item.` is recorded, with no real exception message printed; we > can improve it.
[jira] [Resolved] (HDFS-16342) improve code in KMS
[ https://issues.apache.org/jira/browse/HDFS-16342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guo resolved HDFS-16342. Resolution: Won't Fix > improve code in KMS > --- > > Key: HDFS-16342 > URL: https://issues.apache.org/jira/browse/HDFS-16342 > Project: Hadoop HDFS > Issue Type: Improvement > Components: kms >Affects Versions: 3.3.1 >Reporter: guo >Priority: Minor > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > There is duplicated code in KMS; we can improve it a little
[jira] [Created] (HDFS-16342) improve code in KMS
guo created HDFS-16342: -- Summary: improve code in KMS Key: HDFS-16342 URL: https://issues.apache.org/jira/browse/HDFS-16342 Project: Hadoop HDFS Issue Type: Improvement Components: kms Affects Versions: 3.3.1 Reporter: guo There is duplicated code in KMS; we can improve it a little.
[jira] [Created] (HDFS-16341) Add block placement policy desc
guo created HDFS-16341: -- Summary: Add block placement policy desc Key: HDFS-16341 URL: https://issues.apache.org/jira/browse/HDFS-16341 Project: Hadoop HDFS Issue Type: Improvement Components: documentation Affects Versions: 3.3.1 Reporter: guo Six block placement policies are now supported; we can keep the doc updated.
[jira] [Created] (HDFS-16340) improve diskbalancer error message
guo created HDFS-16340: -- Summary: improve diskbalancer error message Key: HDFS-16340 URL: https://issues.apache.org/jira/browse/HDFS-16340 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs Affects Versions: 3.3.1 Reporter: guo During disk balancing, when we cannot get JSON from an item, only `Unable to get json from Item.` is recorded, with no real exception message printed; we can improve it.
[jira] [Updated] (HDFS-16338) Fix error configuration message in FSImage
[ https://issues.apache.org/jira/browse/HDFS-16338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guo updated HDFS-16338: --- Description: `dfs.namenode.checkpoint.edits.dir` may be different from `dfs.namenode.checkpoint.dir`; if `checkpointEditsDirs` is null or empty, the error message should warn about the edits dir configuration. We can fix it. (was: During import checkpoint , if `checkpointEditsDirs` is null or empty, error message should warn the right configuration, we can fix it.) > Fix error configuration message in FSImage > -- > > Key: HDFS-16338 > URL: https://issues.apache.org/jira/browse/HDFS-16338 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Affects Versions: 3.3.1 >Reporter: guo >Priority: Minor > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > `dfs.namenode.checkpoint.edits.dir` may be different from > `dfs.namenode.checkpoint.dir`; if `checkpointEditsDirs` is null or empty, the > error message should warn about the edits dir configuration. We can fix it.
[jira] [Created] (HDFS-16338) Fix error configuration message in FSImage
guo created HDFS-16338: -- Summary: Fix error configuration message in FSImage Key: HDFS-16338 URL: https://issues.apache.org/jira/browse/HDFS-16338 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs Affects Versions: 3.3.1 Reporter: guo During checkpoint import, if `checkpointEditsDirs` is null or empty, the error message should warn about the right configuration; we can fix it.
[jira] [Updated] (HDFS-16334) correct namenode acl desc
[ https://issues.apache.org/jira/browse/HDFS-16334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guo updated HDFS-16334: --- Description: `dfs.namenode.acls.enabled` is set to `true` by default after HDFS-13505; we can improve the desc (was: `dfs.namenode.acls.enabled` is set to be `true` after HDFS-13505 ,we can improve the desc) > correct namenode acl desc > - > > Key: HDFS-16334 > URL: https://issues.apache.org/jira/browse/HDFS-16334 > Project: Hadoop HDFS > Issue Type: Improvement > Components: documentation >Affects Versions: 3.3.1 >Reporter: guo >Priority: Minor > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > `dfs.namenode.acls.enabled` is set to `true` by default after HDFS-13505; > we can improve the desc
[jira] [Created] (HDFS-16334) correct namenode acl desc
guo created HDFS-16334: -- Summary: correct namenode acl desc Key: HDFS-16334 URL: https://issues.apache.org/jira/browse/HDFS-16334 Project: Hadoop HDFS Issue Type: Improvement Components: documentation Affects Versions: 3.3.1 Reporter: guo `dfs.namenode.acls.enabled` is set to `true` after HDFS-13505; we can improve the desc.
[jira] [Updated] (HDFS-16328) Correct disk balancer param desc
[ https://issues.apache.org/jira/browse/HDFS-16328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guo updated HDFS-16328: --- Description: `dfs.disk.balancer.enabled` is enabled by default after HDFS-13153, we can improve the doc to avoid confusion > Correct disk balancer param desc > > > Key: HDFS-16328 > URL: https://issues.apache.org/jira/browse/HDFS-16328 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Affects Versions: 3.3.1 > Environment: `dfs.disk.balancer.enabled` is enabled by default after > HDFS-13153, we can improve the doc to avoid confusion >Reporter: guo >Priority: Minor > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > `dfs.disk.balancer.enabled` is enabled by default after HDFS-13153, we can > improve the doc to avoid confusion
[jira] [Updated] (HDFS-16328) Correct disk balancer param desc
[ https://issues.apache.org/jira/browse/HDFS-16328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guo updated HDFS-16328: --- Environment: (was: `dfs.disk.balancer.enabled` is enabled by default after HDFS-13153, we can improve the doc to avoid confusion) > Correct disk balancer param desc > > > Key: HDFS-16328 > URL: https://issues.apache.org/jira/browse/HDFS-16328 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Affects Versions: 3.3.1 >Reporter: guo >Priority: Minor > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > `dfs.disk.balancer.enabled` is enabled by default after HDFS-13153, we can > improve the doc to avoid confusion
[jira] [Created] (HDFS-16328) Correct disk balancer param desc
guo created HDFS-16328: -- Summary: Correct disk balancer param desc Key: HDFS-16328 URL: https://issues.apache.org/jira/browse/HDFS-16328 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs Affects Versions: 3.3.1 Environment: `dfs.disk.balancer.enabled` is enabled by default after HDFS-13153, we can improve the doc to avoid confusion Reporter: guo
[jira] [Created] (HDFS-16324) fix error log in BlockManagerSafeMode
guo created HDFS-16324: -- Summary: fix error log in BlockManagerSafeMode Key: HDFS-16324 URL: https://issues.apache.org/jira/browse/HDFS-16324 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs Affects Versions: 3.3.1 Reporter: guo If `recheckInterval` is set to an invalid value, a warning is logged, but the message does not seem proper; we can improve it.
[jira] [Commented] (HDFS-16318) Add exception blockinfo
[ https://issues.apache.org/jira/browse/HDFS-16318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17443341#comment-17443341 ] guo commented on HDFS-16318: Thanks [~hexiaoqiao] for your note, just updated > Add exception blockinfo > --- > > Key: HDFS-16318 > URL: https://issues.apache.org/jira/browse/HDFS-16318 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Affects Versions: 3.3.1 >Reporter: guo >Priority: Minor > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > We may suffer a `Could not obtain the last block location` exception, but we > may be reading more than one file, and the following exception cannot guide us to > find the problem block or DN info. We can add more info in the log to help > us. > `2021-11-12 14:01:59,633 WARN [main] org.apache.hadoop.hdfs.DFSClient: Last > block locations not available. Datanodes might not have reported blocks > completely. Will retry for 3 times` > `2021-11-12 14:02:03,724 WARN [main] org.apache.hadoop.hdfs.DFSClient: Last > block locations not available. Datanodes might not have reported blocks > completely. Will retry for 2 times` > `2021-11-12 14:02:07,726 WARN [main] org.apache.hadoop.hdfs.DFSClient: Last > block locations not available. Datanodes might not have reported blocks > completely. Will retry for 1 times` > `Caused by: java.lang.reflect.InvocationTargetException > at sun.reflect.GeneratedConstructorAccessor19.newInstance(Unknown Source) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:251) > ... 11 more` > `Caused by: java.io.IOException: Could not obtain the last block locations. 
> at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:291) > at org.apache.hadoop.hdfs.DFSInputStream.(DFSInputStream.java:264) > at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1535) > at > org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304) > at > org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312) > at org.apache.hadoop.fs.FilterFileSystem.open(FilterFileSystem.java:162) > at > org.apache.hadoop.fs.viewfs.ChRootedFileSystem.open(ChRootedFileSystem.java:261) > at > org.apache.hadoop.fs.viewfs.ViewFileSystem.open(ViewFileSystem.java:463) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768) > at org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:109) > at > org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67) > at > org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.(CombineHiveRecordReader.java:66) > ... 15 more`
[jira] [Updated] (HDFS-16318) Add exception blockinfo
[ https://issues.apache.org/jira/browse/HDFS-16318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guo updated HDFS-16318: --- Description: We may suffer a `Could not obtain the last block location` exception, but we may be reading more than one file, and the following exception cannot guide us to find the problem block or DN info. We can add more info in the log to help us. `2021-11-12 14:01:59,633 WARN [main] org.apache.hadoop.hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 3 times` `2021-11-12 14:02:03,724 WARN [main] org.apache.hadoop.hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 2 times` `2021-11-12 14:02:07,726 WARN [main] org.apache.hadoop.hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 1 times` `Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.GeneratedConstructorAccessor19.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:251) ... 11 more` `Caused by: java.io.IOException: Could not obtain the last block locations. 
at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:291) at org.apache.hadoop.hdfs.DFSInputStream.(DFSInputStream.java:264) at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1535) at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304) at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312) at org.apache.hadoop.fs.FilterFileSystem.open(FilterFileSystem.java:162) at org.apache.hadoop.fs.viewfs.ChRootedFileSystem.open(ChRootedFileSystem.java:261) at org.apache.hadoop.fs.viewfs.ViewFileSystem.open(ViewFileSystem.java:463) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768) at org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:109) at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67) at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.(CombineHiveRecordReader.java:66) ... 15 more` > Add exception blockinfo > --- > > Key: HDFS-16318 > URL: https://issues.apache.org/jira/browse/HDFS-16318 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Affects Versions: 3.3.1 >Reporter: guo >Priority: Minor > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > We may suffer a `Could not obtain the last block location` exception, but we > may be reading more than one file, and the following exception cannot guide us to > find the problem block or DN info. We can add more info in the log to help > us. > `2021-11-12 14:01:59,633 WARN [main] org.apache.hadoop.hdfs.DFSClient: Last > block locations not available. Datanodes might not have reported blocks > completely. Will retry for 3 times` > `2021-11-12 14:02:03,724 WARN [main] org.apache.hadoop.hdfs.DFSClient: Last > block locations not available. 
Datanodes might not have reported blocks > completely. Will retry for 2 times` > `2021-11-12 14:02:07,726 WARN [main] org.apache.hadoop.hdfs.DFSClient: Last > block locations not available. Datanodes might not have reported blocks > completely. Will retry for 1 times` > `Caused by: java.lang.reflect.InvocationTargetException > at sun.reflect.GeneratedConstructorAccessor19.newInstance(Unknown Source) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:251) > ... 11 more` > `Caused by: java.io.IOException: Could not obtain the last block locations. > at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:291) > at org.apache.hadoop.hdfs.DFSInputStream.(DFSInputStream.java:264) > at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1535) > at > org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304) > at > org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299) > at >
[jira] [Created] (HDFS-16321) Fix invalid config in TestAvailableSpaceRackFaultTolerantBPP
guo created HDFS-16321: -- Summary: Fix invalid config in TestAvailableSpaceRackFaultTolerantBPP Key: HDFS-16321 URL: https://issues.apache.org/jira/browse/HDFS-16321 Project: Hadoop HDFS Issue Type: Improvement Components: test Affects Versions: 3.3.1 Reporter: guo `TestAvailableSpaceRackFaultTolerantBPP` seems to set an invalid param (one valid in `TestAvailableSpaceBlockPlacementPolicy`); we can fix it to avoid further trouble.
[jira] [Created] (HDFS-16318) Add exception blockinfo
guo created HDFS-16318: -- Summary: Add exception blockinfo Key: HDFS-16318 URL: https://issues.apache.org/jira/browse/HDFS-16318 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs Affects Versions: 3.3.1 Reporter: guo
[jira] [Created] (HDFS-16307) improve HdfsBlockPlacementPolicies docs
guo created HDFS-16307: -- Summary: improve HdfsBlockPlacementPolicies docs Key: HDFS-16307 URL: https://issues.apache.org/jira/browse/HDFS-16307 Project: Hadoop HDFS Issue Type: Improvement Components: documentation Affects Versions: 3.3.1 Reporter: guo Improve the readability of the HdfsBlockPlacementPolicies docs.
[jira] [Commented] (HDFS-16277) Improve decision in AvailableSpaceBlockPlacementPolicy
[ https://issues.apache.org/jira/browse/HDFS-16277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432780#comment-17432780 ] guo commented on HDFS-16277: Thanks [~ayushtkn] for your kind review, glad to meet Hadoop here:) > Improve decision in AvailableSpaceBlockPlacementPolicy > -- > > Key: HDFS-16277 > URL: https://issues.apache.org/jira/browse/HDFS-16277 > Project: Hadoop HDFS > Issue Type: Improvement > Components: block placement >Affects Versions: 3.3.1 >Reporter: guo >Assignee: guo >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0, 3.3.2 > > Time Spent: 5h 10m > Remaining Estimate: 0h > > Hi > In a production environment, we may see two or more datanodes whose usage reaches > nearly 100%, for example 99.99%, 98%, 97%. > If we configure `AvailableSpaceBlockPlacementPolicy`, we still have the > chance to choose the 99.99% one (assume it is the highest usage), because we > treat two chosen datanodes as having the same usage if their storage usages > differ by less than 5%. > But this is not what we want, so I suggest we improve the decision.
[jira] [Created] (HDFS-16277) improve decision in AvailableSpaceBlockPlacementPolicy
guo created HDFS-16277: -- Summary: improve decision in AvailableSpaceBlockPlacementPolicy Key: HDFS-16277 URL: https://issues.apache.org/jira/browse/HDFS-16277 Project: Hadoop HDFS Issue Type: Improvement Components: block placement Affects Versions: 3.3.1 Reporter: guo Hi In a production environment, we may see two or more datanodes whose usage reaches nearly 100%, for example 99.99%, 98%, 97%. If we configure `AvailableSpaceBlockPlacementPolicy`, we still have the chance to choose the 99.99% one (assume it is the highest usage), because we treat two chosen datanodes as having the same usage if their storage usages differ by less than 5%. But this is not what we want, so I suggest we improve the decision.
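The suggested improvement could look something like this sketch (illustrative names and thresholds; the real policy's tolerance logic differs in detail): two datanodes within the balance tolerance are normally treated as equal, but once both are above a high-usage cutoff the less-used one must win.

```java
// Hypothetical sketch, not the actual AvailableSpaceBlockPlacementPolicy code.
public class UsageComparator {
  static final double BALANCE_TOLERANCE = 0.05; // usages within 5% count as equal
  static final double HIGH_USAGE = 0.95;        // above this, always discriminate

  // Returns the index (0 or 1) of the node that should be preferred.
  public static int prefer(double usageA, double usageB) {
    boolean bothHigh = usageA > HIGH_USAGE && usageB > HIGH_USAGE;
    if (!bothHigh && Math.abs(usageA - usageB) <= BALANCE_TOLERANCE) {
      return 0; // considered equal; caller may pick either (0 by convention here)
    }
    return usageA <= usageB ? 0 : 1; // prefer the less-used node
  }
}
```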
[jira] [Updated] (HDFS-16029) Divide by zero bug in InstrumentationService.java
[ https://issues.apache.org/jira/browse/HDFS-16029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiyuan GUO updated HDFS-16029: -- Component/s: (was: security) libhdfs > Divide by zero bug in InstrumentationService.java > - > > Key: HDFS-16029 > URL: https://issues.apache.org/jira/browse/HDFS-16029 > Project: Hadoop HDFS > Issue Type: Bug > Components: libhdfs >Reporter: Yiyuan GUO >Priority: Major > Labels: easy-fix, security > > In the file _lib/service/instrumentation/InstrumentationService.java,_ the > method > _Timer.getValues_ has the following > [code|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-httpfs/src/main/java/org/apache/hadoop/lib/service/instrumentation/InstrumentationService.java#L236]: > {code:java} > long[] getValues() { > .. > int limit = (full) ? size : (last + 1); > .. > values[AVG_TOTAL] = values[AVG_TOTAL] / limit; > } > {code} > The variable _limit_ is used as a divisor. However, its value may be equal to > _last + 1,_ which can be zero since _last_ is initialized to -1 in the > [constructor|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-httpfs/src/main/java/org/apache/hadoop/lib/service/instrumentation/InstrumentationService.java#L222]: > {code:java} > public Timer(int size) { > ... > last = -1; > } > {code} > Thus, a divide by zero problem can happen. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16029) Divide by zero bug in InstrumentationService.java
[ https://issues.apache.org/jira/browse/HDFS-16029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiyuan GUO updated HDFS-16029: -- Description: In the file _lib/service/instrumentation/InstrumentationService.java,_ the method _Timer.getValues_ has the following [code|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-httpfs/src/main/java/org/apache/hadoop/lib/service/instrumentation/InstrumentationService.java#L236]: {code:java} long[] getValues() { .. int limit = (full) ? size : (last + 1); .. values[AVG_TOTAL] = values[AVG_TOTAL] / limit; } {code} The variable _limit_ is used as a divisor. However, its value may be equal to _last + 1,_ which can be zero since _last_ is initialized to -1 in the [constructor|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-httpfs/src/main/java/org/apache/hadoop/lib/service/instrumentation/InstrumentationService.java#L222]: {code:java} public Timer(int size) { ... last = -1; } {code} Thus, a divide by zero problem can happen. was: In the file _lib/service/instrumentation/InstrumentationService.java,_ the method _Timer.getValues_ has the following [code|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-httpfs/src/main/java/org/apache/hadoop/lib/service/instrumentation/InstrumentationService.java#L236]: {code:java} long[] getValues() { .. int limit = (full) ? size : (last + 1); .. values[AVG_TOTAL] = values[AVG_TOTAL] / limit; } {code} The variable _limit_ is used as a divisor. However, its value may be equal to _last + 1,_ which can be zero since _last_ is __ initialized to -1 in the [constructor|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-httpfs/src/main/java/org/apache/hadoop/lib/service/instrumentation/InstrumentationService.java#L222]: {code:java} public Timer(int size) { ... 
last = -1; } {code} Thus, a divide by zero problem can happen > Divide by zero bug in InstrumentationService.java > - > > Key: HDFS-16029 > URL: https://issues.apache.org/jira/browse/HDFS-16029 > Project: Hadoop HDFS > Issue Type: Bug > Components: security >Reporter: Yiyuan GUO >Priority: Major > Labels: easy-fix, security > > In the file _lib/service/instrumentation/InstrumentationService.java,_ the > method > _Timer.getValues_ has the following > [code|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-httpfs/src/main/java/org/apache/hadoop/lib/service/instrumentation/InstrumentationService.java#L236]: > {code:java} > long[] getValues() { > .. > int limit = (full) ? size : (last + 1); > .. > values[AVG_TOTAL] = values[AVG_TOTAL] / limit; > } > {code} > The variable _limit_ is used as a divisor. However, its value may be equal to > _last + 1,_ which can be zero since _last_ is initialized to -1 in the > [constructor|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-httpfs/src/main/java/org/apache/hadoop/lib/service/instrumentation/InstrumentationService.java#L222]: > {code:java} > public Timer(int size) { > ... > last = -1; > } > {code} > Thus, a divide by zero problem can happen. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16029) Divide by zero bug in InstrumentationService.java
Yiyuan GUO created HDFS-16029: - Summary: Divide by zero bug in InstrumentationService.java Key: HDFS-16029 URL: https://issues.apache.org/jira/browse/HDFS-16029 Project: Hadoop HDFS Issue Type: Bug Components: security Reporter: Yiyuan GUO In the file _lib/service/instrumentation/InstrumentationService.java,_ the method _Timer.getValues_ has the following [code|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-httpfs/src/main/java/org/apache/hadoop/lib/service/instrumentation/InstrumentationService.java#L236]: {code:java} long[] getValues() { .. int limit = (full) ? size : (last + 1); .. values[AVG_TOTAL] = values[AVG_TOTAL] / limit; } {code} The variable _limit_ is used as a divisor. However, its value may be equal to _last + 1,_ which can be zero since _last_ is initialized to -1 in the [constructor|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-httpfs/src/main/java/org/apache/hadoop/lib/service/instrumentation/InstrumentationService.java#L222]: {code:java} public Timer(int size) { ... last = -1; } {code} Thus, a divide by zero problem can happen. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
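The report boils down to an unguarded division. A minimal reproduction and a guarded variant might look like the following; this is an illustrative sketch with invented names, not the actual Timer class.

```java
// Sketch of the divide-by-zero described in HDFS-16029. `unsafeAvg` mirrors
// the reported code shape; `safeAvg` adds a guard. Names are illustrative.
public class TimerAvg {
    /** As reported: limit becomes 0 when the buffer is empty (last == -1). */
    static long unsafeAvg(long total, boolean full, int size, int last) {
        int limit = full ? size : (last + 1);
        return total / limit;               // ArithmeticException when limit == 0
    }

    /** Guarded variant: report 0 before any sample has been recorded. */
    static long safeAvg(long total, boolean full, int size, int last) {
        int limit = full ? size : (last + 1);
        return limit == 0 ? 0 : total / limit;
    }
}
```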
[jira] [Created] (HDFS-15292) Infinite loop in Lease Manager due to replica is missing in dn
Aaron Guo created HDFS-15292: Summary: Infinite loop in Lease Manager due to replica is missing in dn Key: HDFS-15292 URL: https://issues.apache.org/jira/browse/HDFS-15292 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 3.1.3 Reporter: Aaron Guo In our production environment, we found that the number of under-construction files keeps growing, and the lease manager is trying to release the lease in an infinite loop: {code:java} 2020-04-18 23:10:57,816 WARN namenode.LeaseManager (LeaseManager.java:checkLeases(589)) - Cannot release the path /user/hadoop/myTestFile.txt in the lease [Lease. Holder: go-hdfs-7VVGF3sGvHZcsZZC, pending creates: 1]. It will be retried. org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: DIR* NameSystem.internalReleaseLease: Failed to release lease for file /user/hadoop/myTestFile.txt. Committed blocks are waiting to be minimally replicated. Try again later. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.internalReleaseLease(FSNamesystem.java:3391) at org.apache.hadoop.hdfs.server.namenode.LeaseManager.checkLeases(LeaseManager.java:586) at org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:524) at java.lang.Thread.run(Thread.java:745) {code} This happens because the last block of the file can NOT meet the minimum required replication of 1, an AlreadyBeingCreatedException gets thrown, and the release keeps being retried forever. This infinite loop also causes another issue: the lease manager always tries to release the first lease before going to the next one, so no lease will ever be released. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
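The head-of-line blocking described above can be sketched with a toy model; `LeaseLoop` and `releaseAll` are hypothetical names, and a boolean per lease stands in for whether `internalReleaseLease` succeeds or throws.

```java
import java.util.ArrayDeque;

// Illustrative sketch of the head-of-line blocking in the report: if the
// monitor always retries the first lease and that release keeps failing,
// no later lease is ever reached.
public class LeaseLoop {
    /** Returns how many leases were released within the iteration budget. */
    static int releaseAll(ArrayDeque<Boolean> leases, int maxIterations) {
        int released = 0;
        for (int i = 0; i < maxIterations && !leases.isEmpty(); i++) {
            boolean ok = leases.peekFirst();  // always look at the first lease
            if (ok) {
                leases.pollFirst();           // released, move to the next one
                released++;
            }
            // On failure: "It will be retried" -- the same lease stays at the
            // head, so a permanently failing head starves every other lease.
        }
        return released;
    }
}
```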
[jira] [Commented] (HDFS-15240) Erasure Coding: dirty buffer causes reconstruction block error
[ https://issues.apache.org/jira/browse/HDFS-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17077758#comment-17077758 ] guo commented on HDFS-15240: Nice job, LGTM. > Erasure Coding: dirty buffer causes reconstruction block error > -- > > Key: HDFS-15240 > URL: https://issues.apache.org/jira/browse/HDFS-15240 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, erasure-coding >Affects Versions: 3.3.0 >Reporter: HuangTao >Assignee: HuangTao >Priority: Major > Attachments: HDFS-15240.001.patch, HDFS-15240.002.patch, > HDFS-15240.003.patch, HDFS-15240.004.patch, HDFS-15240.005.patch > > > When reading some LZO files we found some blocks were broken. > I read back all internal blocks (b0-b8) of the block group (RS-6-3-1024k) from > the DN directly, chose 6 blocks (b0-b5) to decode the other 3 blocks (b6', b7', b8'), > and found the longest common sequence (LCS) between b6' (decoded) and > b6 (read from the DN), likewise b7'/b7 and b8'/b8. > After selecting 6 blocks of the block group per combination and > iterating through all cases, I found one case where the length of the LCS is the > block length - 64KB; 64KB is exactly the length of the ByteBuffer used by > StripedBlockReader. So the corrupt reconstruction block is made by a dirty > buffer. > The following log snippet (only 2 of 28 cases shown) is my check program's > output. In my case, I knew the 3rd block was corrupt, so the other 5 blocks were needed > to decode another 3 blocks; I then found the 1st block's LCS substring is the block > length - 64KB. > It means the (0,1,2,4,5,6)th blocks were used to reconstruct the 3rd block, and the > dirty buffer was used before reading the 1st block. > It must be noted that StripedBlockReader reads from offset 0 of the 1st block > after the dirty buffer was used. 
> {code:java} > decode from [0, 2, 3, 4, 5, 7] -> [1, 6, 8] > Check Block(1) first 131072 bytes longest common substring length 4 > Check Block(6) first 131072 bytes longest common substring length 4 > Check Block(8) first 131072 bytes longest common substring length 4 > decode from [0, 2, 3, 4, 5, 6] -> [1, 7, 8] > Check Block(1) first 131072 bytes longest common substring length 65536 > CHECK AGAIN: Block(1) all 27262976 bytes longest common substring length > 27197440 # this one > Check Block(7) first 131072 bytes longest common substring length 4 > Check Block(8) first 131072 bytes longest common substring length 4{code} > Now I know the dirty buffer causes reconstruction block error, but how does > the dirty buffer come about? > After digging into the code and DN log, I found this following DN log is the > root reason. > {code:java} > [INFO] [stripedRead-1017] : Interrupted while waiting for IO on channel > java.nio.channels.SocketChannel[connected local=/:52586 > remote=/:50010]. 18 millis timeout left. 
> [WARN] [StripedBlockReconstruction-199] : Failed to reconstruct striped > block: BP-714356632--1519726836856:blk_-YY_3472979393 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.util.StripedBlockUtil.getNextCompletedStripedRead(StripedBlockUtil.java:314) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.doReadMinimumSources(StripedReader.java:308) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.readMinimumSources(StripedReader.java:269) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:94) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60) > at > java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:834) {code} > Reading from DN may timeout(hold by a future(F)) and output the INFO log, but > the futures that contains the future(F) is cleared, > {code:java} > return new StripingChunkReadResult(futures.remove(future), > StripingChunkReadResult.CANCELLED); {code} > futures.remove(future) cause NPE. So the EC reconstruction is failed. In the > finally phase, the code snippet in *getStripedReader().close()* > {code:java} > reconstructor.freeBuffer(reader.getReadBuffer()); > reader.freeReadBuffer(); > reader.closeBlockReader(); {code} > free buffer firstly, but the StripedBlockReader still holds the buffer and > write it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
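The `futures.remove(future)` NPE described above comes from auto-unboxing: once the map has been cleared, `Map.remove` returns a null `Integer`, and passing it to an `int` parameter throws `NullPointerException`. A minimal standalone demonstration follows; `ReadResult` is a hypothetical stand-in for `StripingChunkReadResult`.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal demonstration of the NPE mechanism: if the futures map no longer
// contains the key, remove() returns null, and unboxing that null Integer
// into an int constructor parameter throws NullPointerException.
public class UnboxNpe {
    static class ReadResult {
        final int index;
        final int state;
        ReadResult(int index, int state) { this.index = index; this.state = state; }
    }

    static ReadResult resultFor(Map<Object, Integer> futures, Object future, int state) {
        // NPE here when `future` is no longer in the map (remove returns null).
        return new ReadResult(futures.remove(future), state);
    }
}
```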
[jira] [Created] (HDFS-14592) Support NIO transferTo semantics in HDFS
Chenzhao Guo created HDFS-14592: --- Summary: Support NIO transferTo semantics in HDFS Key: HDFS-14592 URL: https://issues.apache.org/jira/browse/HDFS-14592 Project: Hadoop HDFS Issue Type: New Feature Reporter: Chenzhao Guo I'm currently developing a Spark shuffle manager based on HDFS. I need to merge some spill files on HDFS into one, or rearrange some HDFS files. An API similar to NIO transferTo, which bypasses user-space memory, would be more efficient than manually reading and writing bytes (the method I'm using at present). So could HDFS implement something like NIO transferTo, making path.transferTo(pathDestination) possible? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
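For reference, this is what the requested semantics look like on a local filesystem with `java.nio`: `FileChannel.transferTo` moves bytes channel-to-channel without a user-space copy loop. This only illustrates the API shape being asked for; HDFS paths are not FileChannels.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Local-filesystem illustration of NIO transferTo semantics: copy src to dst
// channel-to-channel. transferTo may transfer fewer bytes than requested, so
// we loop until the whole file has been moved.
public class TransferToDemo {
    static long copy(Path src, Path dst) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.WRITE,
                                                StandardOpenOption.CREATE,
                                                StandardOpenOption.TRUNCATE_EXISTING)) {
            long transferred = 0;
            long size = in.size();
            while (transferred < size) {
                transferred += in.transferTo(transferred, size - transferred, out);
            }
            return transferred;
        }
    }
}
```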
[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443630#comment-16443630 ] Zephyr Guo commented on HDFS-13243: --- Thank you for reviewing, [~daryn]. There are some mistakes in your summaries. {quote} thread1 is writing and closes the stream thread2 is syncing the stream thread1 commits the block with size -141232- 2054413 thread2 fsyncs with size -2054413- 141232 DNs report block with size 2054413, marked corrupt {quote} {quote} This sounds like a serious client-side issue. {quote} Yes, as I said in the comments above. The client calls sync() with a correct length, but the sync request could be sent after close(). The root cause is that DFSOutputStream#flushOrSync() is not thread-safe. See the following simplified code: {code} synchronized (this) { ... // code before send request } // **We send the request here, but it's not included in the synchronized code block** if (getStreamer().getPersistBlocks().getAndSet(false) || updateLength) { try { dfsClient.namenode.fsync(src, fileId, dfsClient.clientName, lastBlockLength); } catch (IOException ioe) { // Deal with ioe } } synchronized (this) { ... // code after send request } {code} I am not sure how to fix the client side. Could we put the RPC into the synchronized code block directly? Maybe keeping the RPC outside the synchronized code block gives more performance benefit. {quote} We cannot simply return success in some invalid cases: ie. fsync when the file has no blocks, size is negative, size is less than the synced/committed size. That just masks bugs. {quote} TestHFlush.hSyncUpdateLength_00 makes a sync call with no blocks, so I don't think this is a bug case. {quote} Also, we shouldn't need all the new factories. The tests must verify that namesystem calls, in various specific orders, with specific arguments that are good/bad, either succeed or fail. {quote} I can add new test cases to verify namesystem calls in various specific orders. 
I have to mock DFSOutputStream to reproduce the corrupt-block bug, so I need all the new factories. If mocking DFSOutputStream is not necessary in your opinion, I will remove it. > Get CorruptBlock because of calling close and sync in same time > --- > > Key: HDFS-13243 > URL: https://issues.apache.org/jira/browse/HDFS-13243 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.7.2, 3.2.0 >Reporter: Zephyr Guo >Assignee: Zephyr Guo >Priority: Critical > Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch, > HDFS-13243-v3.patch, HDFS-13243-v4.patch, HDFS-13243-v5.patch, > HDFS-13243-v6.patch > > > An HDFS file might get broken because of corrupt block(s) that could be produced > by calling close and sync at the same time. > When the close call was not successful, the UCBlock status would change to > COMMITTED, and if a sync request gets popped from the queue and processed, the sync > operation would change the last block length. > After that, the DataNode would report all received blocks to the NameNode, which will > check the block length of all COMMITTED blocks. But the block length already > differed between what was recorded in NameNode memory and what was reported by the > DataNode, and consequently, the last block is marked as corrupted because of the > inconsistent length. 
> > {panel:title=Log in my hdfs} > 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, > truncateBlock=null, primaryNodeIndex=-1, > replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > for > /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515 > 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > fsync: > /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515 > for DFSClient_NONMAPREDUCE_1077513762_1 > 2018-03-05 04:05:39,761 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* > blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, > primaryNodeIndex=-1, > replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > is not COMPLETE
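The length check visible in the log above ("block is COMMITTED and reported length 2054413 does not match length in block map 141232") can be sketched as a simple predicate. This is an illustrative simplification with invented names, not the actual NameNode code.

```java
// Hypothetical simplification of the check in the log above: a COMMITTED
// block whose reported replica length differs from the NameNode's block map
// entry gets marked corrupt.
public class CommittedLengthCheck {
    static boolean isCorrupt(boolean committed, long lengthInBlockMap, long reportedLength) {
        return committed && reportedLength != lengthInBlockMap;
    }
}
```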
[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433543#comment-16433543 ] Zephyr Guo commented on HDFS-13243: --- I agree with you, [~jojochuang]. Patch v6 does not fix the client side, and I fixed the test case failure. > Get CorruptBlock because of calling close and sync in same time > --- > > Key: HDFS-13243 > URL: https://issues.apache.org/jira/browse/HDFS-13243 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.7.2, 3.2.0 >Reporter: Zephyr Guo >Assignee: Zephyr Guo >Priority: Critical > Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch, > HDFS-13243-v3.patch, HDFS-13243-v4.patch, HDFS-13243-v5.patch, > HDFS-13243-v6.patch > > > An HDFS file might get broken because of corrupt block(s) that could be produced > by calling close and sync at the same time. > When the close call was not successful, the UCBlock status would change to > COMMITTED, and if a sync request gets popped from the queue and processed, the sync > operation would change the last block length. > After that, the DataNode would report all received blocks to the NameNode, which will > check the block length of all COMMITTED blocks. But the block length already > differed between what was recorded in NameNode memory and what was reported by the > DataNode, and consequently, the last block is marked as corrupted because of the > inconsistent length. 
> > {panel:title=Log in my hdfs} > 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, > truncateBlock=null, primaryNodeIndex=-1, > replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > for > /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515 > 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > fsync: > /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515 > for DFSClient_NONMAPREDUCE_1077513762_1 > 2018-03-05 04:05:39,761 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* > blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, > primaryNodeIndex=-1, > replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in > file > /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515 > 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: > blockMap updated: 10.0.0.220:50010 is added to > blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, > primaryNodeIndex=-1, > 
replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > size 2054413 > 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK > NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on > 10.0.0.219:50010 by > hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 because block is > COMMITTED and reported length 2054413 does not match length in block map > 141232 > 2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK > NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on > 10.0.0.218:50010 by > hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 because block is > COMMITTED and reported length 2054413 does not match length in block map > 141232 > 2018-03-05 04:05:40,162 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* > blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, > primaryNodeIndex=-1, > replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > is not COMPLETE (ucState = COMMITTED,
[jira] [Updated] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zephyr Guo updated HDFS-13243: -- Attachment: HDFS-13243-v6.patch > Get CorruptBlock because of calling close and sync in same time > --- > > Key: HDFS-13243 > URL: https://issues.apache.org/jira/browse/HDFS-13243 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.7.2, 3.2.0 >Reporter: Zephyr Guo >Assignee: Zephyr Guo >Priority: Critical > Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch, > HDFS-13243-v3.patch, HDFS-13243-v4.patch, HDFS-13243-v5.patch, > HDFS-13243-v6.patch > > > HDFS File might get broken because of corrupt block(s) that could be produced > by calling close and sync in the same time. > When calling close was not successful, UCBlock status would change to > COMMITTED, and if a sync request gets popped from queue and processed, sync > operation would change the last block length. > After that, DataNode would report all received block to NameNode, and will > check Block length of all COMMITTED Blocks. But the block length was already > different between recorded in NameNode memory and reported by DataNode, and > consequently, the last block is marked as corruptted because of inconsistent > length. 
> > {panel:title=Log in my hdfs} > 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, > truncateBlock=null, primaryNodeIndex=-1, > replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > for > /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515 > 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > fsync: > /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515 > for DFSClient_NONMAPREDUCE_1077513762_1 > 2018-03-05 04:05:39,761 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* > blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, > primaryNodeIndex=-1, > replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in > file > /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515 > 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: > blockMap updated: 10.0.0.220:50010 is added to > blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, > primaryNodeIndex=-1, > 
replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > size 2054413 > 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK > NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on > 10.0.0.219:50010 by > hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 because block is > COMMITTED and reported length 2054413 does not match length in block map > 141232 > 2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK > NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on > 10.0.0.218:50010 by > hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 because block is > COMMITTED and reported length 2054413 does not match length in block map > 141232 > 2018-03-05 04:05:40,162 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* > blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, > primaryNodeIndex=-1, > replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > is not COMPLETE (ucState = COMMITTED, replication# = 3 >= minimum = 2) in > file >
[jira] [Updated] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zephyr Guo updated HDFS-13243: -- Status: Open (was: Patch Available) > Get CorruptBlock because of calling close and sync in same time > --- > > Key: HDFS-13243 > URL: https://issues.apache.org/jira/browse/HDFS-13243 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.7.2, 3.2.0 >Reporter: Zephyr Guo >Assignee: Zephyr Guo >Priority: Critical > Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch, > HDFS-13243-v3.patch, HDFS-13243-v4.patch, HDFS-13243-v5.patch > > > HDFS File might get broken because of corrupt block(s) that could be produced > by calling close and sync in the same time. > When calling close was not successful, UCBlock status would change to > COMMITTED, and if a sync request gets popped from queue and processed, sync > operation would change the last block length. > After that, DataNode would report all received block to NameNode, and will > check Block length of all COMMITTED Blocks. But the block length was already > different between recorded in NameNode memory and reported by DataNode, and > consequently, the last block is marked as corruptted because of inconsistent > length. 
> > {panel:title=Log in my hdfs} > 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, > truncateBlock=null, primaryNodeIndex=-1, > replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > for > /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515 > 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > fsync: > /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515 > for DFSClient_NONMAPREDUCE_1077513762_1 > 2018-03-05 04:05:39,761 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* > blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, > primaryNodeIndex=-1, > replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in > file > /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515 > 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: > blockMap updated: 10.0.0.220:50010 is added to > blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, > primaryNodeIndex=-1, > 
replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > size 2054413 > 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK > NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on > 10.0.0.219:50010 by > hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 because block is > COMMITTED and reported length 2054413 does not match length in block map > 141232 > 2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK > NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on > 10.0.0.218:50010 by > hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 because block is > COMMITTED and reported length 2054413 does not match length in block map > 141232 > 2018-03-05 04:05:40,162 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* > blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, > primaryNodeIndex=-1, > replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > is not COMPLETE (ucState = COMMITTED, replication# = 3 >= minimum = 2) in > file >
[jira] [Updated] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zephyr Guo updated HDFS-13243: -- Status: Patch Available (was: Open) > Get CorruptBlock because of calling close and sync in same time > --- > > Key: HDFS-13243 > URL: https://issues.apache.org/jira/browse/HDFS-13243 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode > Affects Versions: 2.7.2, 3.2.0 > Reporter: Zephyr Guo > Assignee: Zephyr Guo > Priority: Critical > Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch, HDFS-13243-v3.patch, HDFS-13243-v4.patch, HDFS-13243-v5.patch > > > An HDFS file can end up with corrupt block(s) when close and sync are called at the same time. > When the close does not complete successfully, the UCBlock state changes to COMMITTED; if a sync request is then popped from the queue and processed, the sync operation changes the length of the last block. > After that, the DataNode reports all received blocks to the NameNode, which checks the length of every COMMITTED block. By then the length recorded in NameNode memory already differs from the length reported by the DataNode, so the last block is marked as corrupt because of the inconsistent length.
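The race described above can be sketched with a toy model (illustrative Python only, not HDFS source; all class and method names here are invented for the sketch): close() commits the block at its current length, a concurrently queued sync() then extends the on-disk length, and the block report sees the mismatch and flags the replica as corrupt.

```python
# Toy model of the close/sync race (illustrative, not actual HDFS code).

class Block:
    def __init__(self):
        self.state = "UNDER_CONSTRUCTION"
        self.committed_length = None   # length recorded by the NameNode
        self.actual_length = 141232    # length on the DataNode

def close(block):
    # close() commits the block with the length known at this moment
    block.state = "COMMITTED"
    block.committed_length = block.actual_length

def sync(block, new_length):
    # a queued sync still runs and grows the on-disk length
    block.actual_length = new_length

def report(block):
    # block report: a COMMITTED block whose reported length differs
    # from the length in the block map is marked corrupt
    if block.state == "COMMITTED" and block.actual_length != block.committed_length:
        return "corrupt"
    return "ok"

blk = Block()
close(blk)          # commits at length 141232
sync(blk, 2054413)  # the racing sync extends the block to 2054413
print(report(blk))  # prints "corrupt"
```

The lengths 141232 and 2054413 are taken from the log panel above; everything else is a simplification of the state machine the description walks through.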
[jira] [Updated] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zephyr Guo updated HDFS-13243: -- Attachment: (was: HDFS-13243-v5.patch)
[jira] [Updated] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zephyr Guo updated HDFS-13243: -- Attachment: HDFS-13243-v5.patch
[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431670#comment-16431670 ] Zephyr Guo commented on HDFS-13243: --- I have rebased. [~jojochuang]
[jira] [Updated] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zephyr Guo updated HDFS-13243: -- Attachment: HDFS-13243-v5.patch
[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417198#comment-16417198 ] Zephyr Guo commented on HDFS-13243: --- Rebased patch-v3 and attached v4.
[jira] [Updated] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zephyr Guo updated HDFS-13243: -- Attachment: HDFS-13243-v4.patch
[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417167#comment-16417167 ] Zephyr Guo commented on HDFS-13243: --- Hi, [~jojochuang] I attached patch-v3. I moved the RPC call into the synchronized code block and tried to make the mock code as clear as possible. > Get CorruptBlock because of calling close and sync in same time > --- > > Key: HDFS-13243 > URL: https://issues.apache.org/jira/browse/HDFS-13243 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode > Affects Versions: 2.7.2, 3.2.0 > Reporter: Zephyr Guo > Assignee: Zephyr Guo > Priority: Critical > Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch, HDFS-13243-v3.patch > > > An HDFS file might get broken because of a corrupt block produced by calling close and sync at the same time. > When the close call is not successful, the under-construction block's status changes to COMMITTED, and if a sync request is then popped from the queue and processed, the sync operation changes the last block's length. > After that, the DataNode reports all received blocks to the NameNode, which checks the block length of every COMMITTED block. But the block length recorded in NameNode memory now differs from the one reported by the DataNode, and consequently the last block is marked as corrupt because of the inconsistent length.
> {panel:title=Log in my hdfs}
> 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1085498930_11758129{UCState=UNDER_CONSTRUCTION, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} for /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* fsync: /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515 for DFSClient_NONMAPREDUCE_1077513762_1
> 2018-03-05 04:05:39,761 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* blk_1085498930_11758129{UCState=COMMITTED, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in file /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.0.0.220:50010 is added to blk_1085498930_11758129{UCState=COMMITTED, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} size 2054413
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 10.0.0.219:50010 by hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 because block is COMMITTED and reported length 2054413 does not match length in block map 141232
> 2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 10.0.0.218:50010 by hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 because block is COMMITTED and reported length 2054413 does not match length in block map 141232
> 2018-03-05 04:05:40,162 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* blk_1085498930_11758129{UCState=COMMITTED, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} is not COMPLETE (ucState = COMMITTED, replication# = 3 >= minimum = 2) in file
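A toy sketch of the NameNode-side check that fires in the log above may make the failure mode clearer. All names here (ToyBlock, commit, reportReplica) are invented for illustration; this is not HDFS code.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the COMMITTED-length check described above; not HDFS code.
class ToyBlock {
    boolean committed = false;
    long mapLength;                                   // length recorded in the block map
    List<String> corruptReplicas = new ArrayList<>();

    // close() commits the block with whatever length the NameNode currently knows.
    void commit(long len) {
        mapLength = len;
        committed = true;
    }

    // Mirrors the addStoredBlock behavior visible in the log: a COMMITTED block
    // whose reported replica length differs from the block map length is marked corrupt.
    void reportReplica(String dn, long reportedLen) {
        if (committed && reportedLen != mapLength) {
            corruptReplicas.add(dn);
        }
    }
}

public class CloseSyncRace {
    public static void main(String[] args) {
        ToyBlock blk = new ToyBlock();
        blk.commit(141232);                       // close() commits the stale length
        // The racing sync already pushed 2054413 bytes to the DataNodes,
        // so their block reports disagree with the committed length:
        blk.reportReplica("10.0.0.219:50010", 2054413);
        blk.reportReplica("10.0.0.218:50010", 2054413);
        System.out.println(blk.corruptReplicas);  // both replicas flagged corrupt
    }
}
```

Once replicas land in the corrupt-replica map, the block can never satisfy the completeness check, which matches the repeated "is not COMPLETE" lines in the log.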
[jira] [Updated] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zephyr Guo updated HDFS-13243: -- Attachment: HDFS-13243-v3.patch
[jira] [Comment Edited] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392889#comment-16392889 ] Zephyr Guo edited comment on HDFS-13243 at 3/9/18 2:04 PM: --- [~jojochuang] {quote} I suspect this race condition happens because of this unusual setting. (Or it makes the code more prone to this bug.) {quote} The minimal replication is set to 1 in my test case. I agree that this unusual setting makes it more prone to this bug. {quote} If the problem is a client-side race condition, I would recommend fixing it at the client side. {quote} We have to fix the server side as well; we have no way to make all users update their client code, right? I will write a new patch in a few days, thanks for your advice. BTW, why didn't we include dfsClient.namenode.fsync() in the synchronized code block originally? Was that for a performance benefit? If so, is it still necessary to fix the client?
[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392889#comment-16392889 ] Zephyr Guo commented on HDFS-13243: --- [~jojochuang] {quote} I suspect this race condition happens because of this unusual setting. (Or it makes the code more prone to this bug.) {quote} The minimal replication is 1 in my test case. I agree that this unusual setting makes it more prone to this bug. {quote} If the problem is a client-side race condition, I would recommend fixing it at the client side. {quote} We have to fix the server side as well. You have no power to make every user update their client code, right? I will write a new patch in several days, thanks for your advice. BTW, why didn't we include dfsClient.namenode.fsync() in the synchronized code block? Is this for a performance benefit? If so, is it necessary to fix the client? > Get CorruptBlock because of calling close and sync in same time > --- > > Key: HDFS-13243 > URL: https://issues.apache.org/jira/browse/HDFS-13243 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode > Affects Versions: 2.7.2, 3.2.0 > Reporter: Zephyr Guo > Assignee: Zephyr Guo > Priority: Critical > Fix For: 3.2.0 > > Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch > > > An HDFS file might get broken because of a corrupt block produced by calling close and sync at the same time. > When the close call is not successful, the under-construction block's status changes to COMMITTED, and if a sync request is then popped from the queue and processed, the sync operation changes the last block's length. > After that, the DataNode reports all received blocks to the NameNode, which checks the block length of every COMMITTED block. But the block length recorded in NameNode memory now differs from the one reported by the DataNode, and consequently the last block is marked as corrupt because of the inconsistent length.
> > {panel:title=Log in my hdfs} > 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, > truncateBlock=null, primaryNodeIndex=-1, > replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > for > /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515 > 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > fsync: > /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515 > for DFSClient_NONMAPREDUCE_1077513762_1 > 2018-03-05 04:05:39,761 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* > blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, > primaryNodeIndex=-1, > replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in > file > /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515 > 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: > blockMap updated: 10.0.0.220:50010 is added to > blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, > primaryNodeIndex=-1, > 
replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > size 2054413 > 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK > NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on > 10.0.0.219:50010 by > hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 because block is > COMMITTED and reported length 2054413 does not match length in block map > 141232 > 2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK > NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on > 10.0.0.218:50010 by > hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 because block is > COMMITTED and reported length 2054413 does not match length in
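The interleaving in the log above can be reduced to a deterministic toy model (hypothetical classes, not the actual Hadoop code): close() commits the block at its final length, a queued sync then lands with a stale length and rewrites the block map, and the next DataNode report no longer matches — the same 2054413-vs-141232 mismatch the NameNode printed.

```java
// Toy model of the close()/sync() race (hypothetical classes, not Hadoop's).
// The interleaving is replayed as plain method calls, so the outcome is
// deterministic rather than depending on real thread scheduling.
import java.util.ArrayList;
import java.util.List;

class ToyNameNode {
    long committedLength = -1;           // length recorded at COMMITTED time
    final List<String> corrupt = new ArrayList<>();

    void commitBlock(long length) {      // close(): block -> COMMITTED
        committedLength = length;
    }

    // A late fsync can still overwrite the recorded length.
    void fsync(long length) {
        committedLength = length;
    }

    // Mirrors the check in the log: a replica report that disagrees with the
    // committed length marks the replica corrupt.
    void reportReplica(long reportedLength) {
        if (committedLength >= 0 && reportedLength != committedLength) {
            corrupt.add("reported " + reportedLength
                + " does not match " + committedLength);
        }
    }
}

class CloseSyncRace {
    public static void main(String[] args) {
        ToyNameNode nn = new ToyNameNode();

        long lengthSeenBySync = 141232;  // captured before the final flush
        long finalLength = 2054413;      // what close() actually wrote

        nn.commitBlock(finalLength);     // close() commits the block
        nn.fsync(lengthSeenBySync);      // late sync RPC with stale length
        nn.reportReplica(finalLength);   // DataNode reports the real length

        // The committed length no longer matches what the DataNode has.
        System.out.println(nn.corrupt);
    }
}
```

Because the fsync RPC is issued outside the client's lock, nothing orders it before the commit; the model just replays the losing order explicitly.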
[jira] [Comment Edited] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392309#comment-16392309 ] Zephyr Guo edited comment on HDFS-13243 at 3/9/18 2:55 AM: --- [~jojochuang], thanks for reviewing.
{quote} 1. It seems to me the root of the problem is that the client would call fsync() with an incorrect length (shorter than what it is supposed to sync). If that's the case you should fix the client (DFSOutputStream), rather than the NameNode. {quote}
The client calls sync() with a *correct* length, but the sync request can be sent after close(). The root cause is that DFSOutputStream#flushOrSync() is not thread-safe. See the following simplified code:
{code:java}
synchronized (this) {
  ... // code before sending the request
}
// **We send the request here, but it is not inside a synchronized block**
if (getStreamer().getPersistBlocks().getAndSet(false) || updateLength) {
  try {
    dfsClient.namenode.fsync(src, fileId, dfsClient.clientName,
        lastBlockLength);
  } catch (IOException ioe) {
    // Deal with ioe
  }
}
synchronized (this) {
  ... // code after sending the request
}
{code}
{quote} 2. Looking at the log, your minimal replication number is 2, rather than 1. That's very unusual. In my past experience a lot of weird behavior like this could arise when you have that kind of configuration. {quote}
I'm not sure whether data reliability would be affected if minimal replication is set to 1. Do you have any experience with this?
{quote} 3. And why is close() in the picture? IMHO you don't even need to close(). Suppose you block the DataNode heartbeat, and let the client keep the file open and then call sync(); the last block's state remains in COMMITTED. Would that cause the same behavior? {quote}
close() must be called after sync (see the root cause above). If you don't call close(), the last block's state can't change to COMMITTED, right?
{quote} 4. Looking at the patch, I would like to ask you to stay away from using reflection. 
You could refactor FSNamesystem and DFSOutputStream to return a new FSNamesystem/DFSOutputStream object and override them in the test code. That way, you don't need to introduce new configurations too. And it'll be much cleaner. {quote}
I don't understand this. There is no API to set the FSNamesystem implementation in MiniCluster. Could you give me a sample in another test case? I will rewrite this patch.
{quote} 5. I don't understand the following code. {quote}
My fixed code is not the final version, because I don't know whether the current DFSOutputStream#flushOrSync() implementation is intentional. We send the sync RPC without the lock on the client side, maybe for a performance benefit? If it is for a benefit, we only need to fix the server side; if not, we need to fix both the server side and the client. On the server side, we could log a warning for a wrong length and throw an exception for an invalid state. Is this better than the current version?
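One possible client-side shape for closing this window (a sketch only, with hypothetical names — not the actual DFSOutputStream patch) is to decide about the RPC and snapshot the length while holding the lock, and fail fast when a queued sync runs after close():

```java
// Hypothetical sketch of a guarded stream (not the real DFSOutputStream):
// the length snapshot and the closed check happen under the lock, so a sync
// that dequeues after close() is rejected instead of sending a stale length.
import java.io.IOException;

class GuardedStream {
    private boolean closed = false;
    private long lastBlockLength = 0;
    private long syncedLength = -1;      // stands in for the NameNode state

    public synchronized void write(long newLength) {
        lastBlockLength = newLength;
    }

    public void sync() throws IOException {
        long snapshot;
        synchronized (this) {
            if (closed) {
                throw new IOException("stream already closed");
            }
            snapshot = lastBlockLength;  // captured under the lock
        }
        // The RPC stays outside the lock for throughput, but only ever
        // carries the length captured together with the closed check.
        syncedLength = snapshot;
    }

    public synchronized void close() {
        closed = true;
    }

    public long getSyncedLength() {
        return syncedLength;
    }
}
```

This narrows the race but does not fully close it (close() could still slip in between the snapshot and the RPC), which is why the comment also argues for a server-side guard that rejects an invalid update.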
[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390772#comment-16390772 ] Zephyr Guo commented on HDFS-13243: --- Attached a v2 patch to fix the NoSuchMethodException.
[jira] [Updated] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zephyr Guo updated HDFS-13243: -- Attachment: HDFS-13243-v2.patch
[jira] [Updated] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zephyr Guo updated HDFS-13243: -- Description: HDFS File might get broken because of corrupt block(s) that could be produced by calling close and sync at the same time. When calling close was not successful, the UC block status would change to COMMITTED, and if a sync request gets popped from the queue and processed, the sync operation would change the last block length. After that, the DataNode would report all received blocks to the NameNode, and will check the block length of all COMMITTED blocks. But the block length was already different between what is recorded in NameNode memory and what is reported by the DataNode, and consequently, the last block is marked as corrupted because of the inconsistent length.
{panel:title=Log in my hdfs}
2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} for /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* fsync: /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515 for DFSClient_NONMAPREDUCE_1077513762_1
2018-03-05 04:05:39,761 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in file /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.0.0.220:50010 is added to blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} size 2054413
2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 10.0.0.219:50010 by hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 because block is COMMITTED and reported length 2054413 does not match length in block map 141232
2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 10.0.0.218:50010 by hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 because block is COMMITTED and reported length 2054413 does not match length in block map 141232
2018-03-05 04:05:40,162 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} is not COMPLETE (ucState = COMMITTED, replication# = 3 >= minimum = 2) in file /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
{panel}
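The server-side guard floated in the comments (log a warning for a wrong length, throw for an invalid state) could look roughly like this — a hypothetical sketch, not the shipped fix: once a block reaches COMMITTED its recorded length is frozen, so a late fsync carrying a different length is rejected instead of silently rewriting the block map entry.

```java
// Hypothetical server-side guard (not the actual Hadoop patch): a COMMITTED
// block refuses length updates, so a stale fsync fails loudly instead of
// creating the length mismatch seen in the log above.
class CommittedBlock {
    enum State { UNDER_CONSTRUCTION, COMMITTED }

    private State state = State.UNDER_CONSTRUCTION;
    private long length;

    // Called for fsync/length updates; only legal before commit.
    public void updateLength(long newLength) {
        if (state == State.COMMITTED && newLength != length) {
            throw new IllegalStateException("fsync length " + newLength
                + " on COMMITTED block of length " + length);
        }
        length = newLength;
    }

    public void commit() {               // close(): freeze the length
        state = State.COMMITTED;
    }

    public long getLength() {
        return length;
    }
}
```

With this invariant, the interleaving from the log (commit, then a stale fsync) surfaces as an exception on the NameNode instead of a corrupt replica.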
[jira] [Updated] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zephyr Guo updated HDFS-13243: -- Summary: Get CorruptBlock because of calling close and sync in same time (was: File might get broken because of calling close and sync in same time) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-13243) File might get broken because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zephyr Guo updated HDFS-13243: -- Attachment: HDFS-13243-v1.patch
[jira] [Updated] (HDFS-13243) File might get broken because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zephyr Guo updated HDFS-13243: -- Status: Patch Available (was: Open) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HDFS-13243) File might get broken because of calling close and sync in same time
Zephyr Guo created HDFS-13243: - Summary: File might get broken because of calling close and sync in same time Key: HDFS-13243 URL: https://issues.apache.org/jira/browse/HDFS-13243 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.7.2, 3.2.0 Reporter: Zephyr Guo Fix For: 3.2.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] (HDFS-10629) Federation Router
[ https://issues.apache.org/jira/browse/HDFS-10629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847495#comment-15847495 ] Lei Guo commented on HDFS-10629: [~elgoiri] [~jakace], do we have a rough idea of the overhead introduced by the router? > Federation Router > - > > Key: HDFS-10629 > URL: https://issues.apache.org/jira/browse/HDFS-10629 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: hdfs > Reporter: Inigo Goiri > Assignee: Jason Kace > Attachments: HDFS-10629.000.patch, HDFS-10629.001.patch, > HDFS-10629-HDFS-10467-002.patch, HDFS-10629-HDFS-10467-003.patch, > HDFS-10629-HDFS-10467-004.patch, HDFS-10629-HDFS-10467-005.patch, > HDFS-10629-HDFS-10467-006.patch, HDFS-10629-HDFS-10467-007.patch > > > Component that routes calls from the clients to the right Namespace. It > implements {{ClientProtocol}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.
[ https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14549883#comment-14549883 ] Leitao Guo commented on HDFS-7692: -- [~eddyxu], thanks for your comments; please take a look at the new patch. 1. In DataStorage#recoverTransitionRead, log the InterruptedException and rethrow it as InterruptedIOException; 2. In TestDataStorage#testAddStorageDirectoreis, catch the InterruptedException and fail the test case; 3. The multithreading in DataStorage#addStorageLocations() is for one specific namespace, so in TestDataStorage#testAddStorageDirectoreis my intention is to create one thread pool for each namespace. No change here. 4. Rephrased the parameter successVolumes. [~szetszwo], thanks for your comments; please take a look at the new patch. 1. InterruptedException is re-thrown as InterruptedIOException; 2. I think it's a good idea to log the upgrade progress for each dir, but so far we cannot easily get the progress from the current API. Do you think it's necessary to file a new JIRA to follow up on this? DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories. -- Key: HDFS-7692 URL: https://issues.apache.org/jira/browse/HDFS-7692 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Affects Versions: 2.5.2 Reporter: Leitao Guo Assignee: Leitao Guo Labels: BB2015-05-TBR Attachments: HDFS-7692.01.patch, HDFS-7692.02.patch {code:title=DataStorage#addStorageLocations(...)|borderStyle=solid} for (StorageLocation dataDir : dataDirs) { File root = dataDir.getFile(); ... ... bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, startOpt); addBlockPoolStorage(bpid, bpStorage); ... ... successVolumes.add(dataDir); } {code} In the above code the storage directories are analyzed one by one, which is really time consuming when upgrading HDFS on datanodes that have dozens of large volumes.
Multithreaded analysis of dataDirs should be supported here to speed up the upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
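A rough sketch of the approach this comment describes (assumed names, not the actual HDFS-7692 patch): submit one task per storage directory to a thread pool, collect the directories that were analyzed successfully, and rethrow InterruptedException as InterruptedIOException, as discussed above.

```java
import java.io.InterruptedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of parallel storage-directory analysis for one
// namespace. DirTask stands in for the per-directory upgrade work
// (recoverTransitionRead etc. in the real code).
class ParallelDirUpgrade {
    interface DirTask { void analyze(String dir) throws Exception; }

    /** Analyze all dataDirs in parallel; return the directories that succeeded. */
    static List<String> addStorageLocations(List<String> dataDirs, DirTask task)
            throws InterruptedIOException {
        ExecutorService pool = Executors.newFixedThreadPool(
            Math.max(1, Math.min(dataDirs.size(), 8)));
        List<Future<String>> futures = new ArrayList<>();
        for (String dir : dataDirs) {
            // each directory becomes an independent task
            futures.add(pool.submit(() -> { task.analyze(dir); return dir; }));
        }
        List<String> successDirs = new ArrayList<>();
        try {
            for (Future<String> f : futures) {
                try {
                    successDirs.add(f.get());  // wait for each directory
                } catch (ExecutionException e) {
                    // a failed directory is skipped; the others still count
                }
            }
        } catch (InterruptedException e) {
            // rethrow as InterruptedIOException, as discussed in the comment
            Thread.currentThread().interrupt();
            throw (InterruptedIOException)
                new InterruptedIOException("upgrade interrupted").initCause(e);
        } finally {
            pool.shutdown();
        }
        return successDirs;
    }
}
```

One pool per namespace (as the comment intends) keeps the parallelism scoped to a single block pool's directories, so a slow or failing directory in one namespace does not stall another.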
[jira] [Commented] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.
[ https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14549930#comment-14549930 ] Leitao Guo commented on HDFS-7692: -- Sorry, it's my mistake to comment many times here! It seems my network connection is not very good right now... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.
[ https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14549901#comment-14549901 ] Leitao Guo commented on HDFS-7692: -- [~eddyxu], thanks for your comments, please have a check of the new patch. 1.In DataStorage#recoverTransitionRead, log the InterruptedException and rethrow it as InterruptedIOException; 2.In TestDataStorage#testAddStorageDirectoreis, catch InterruptedException then let the test case fail; 3.The multithread in DataStorage#addStorageLocations() is for one specific namespace, so in TestDataStorage#testAddStorageDirectoreis my intention is creating one thread pool for each namespace. Not change here. 4.Re-phrase the parameter successVolumes. [~szetszwo],thanks for your comments, please have a check of the new patch. 1. InterruptedException re-thrown as InterruptedIOException; 2. I think it's a good idea to log the upgrade progress for each dir, but so far, we can not get the progress easily from the current api. Do you think it's necessary to file a new jira to follow this? DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories. -- Key: HDFS-7692 URL: https://issues.apache.org/jira/browse/HDFS-7692 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Affects Versions: 2.5.2 Reporter: Leitao Guo Assignee: Leitao Guo Labels: BB2015-05-TBR Attachments: HDFS-7692.01.patch, HDFS-7692.02.patch {code:title=DataStorage#addStorageLocations(...)|borderStyle=solid} for (StorageLocation dataDir : dataDirs) { File root = dataDir.getFile(); ... ... bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, startOpt); addBlockPoolStorage(bpid, bpStorage); ... ... successVolumes.add(dataDir); } {code} In the above code the storage directories will be analyzed one by one, which is really time consuming when upgrading HDFS with datanodes have dozens of large volumes. 
MultiThread dataDirs analyzing should be supported here to speedup upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.
[ https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14549898#comment-14549898 ] Leitao Guo commented on HDFS-7692: -- [~eddyxu], thanks for your comments, please have a check of the new patch. 1.In DataStorage#recoverTransitionRead, log the InterruptedException and rethrow it as InterruptedIOException; 2.In TestDataStorage#testAddStorageDirectoreis, catch InterruptedException then let the test case fail; 3.The multithread in DataStorage#addStorageLocations() is for one specific namespace, so in TestDataStorage#testAddStorageDirectoreis my intention is creating one thread pool for each namespace. Not change here. 4.Re-phrase the parameter successVolumes. [~szetszwo],thanks for your comments, please have a check of the new patch. 1. InterruptedException re-thrown as InterruptedIOException; 2. I think it's a good idea to log the upgrade progress for each dir, but so far, we can not get the progress easily from the current api. Do you think it's necessary to file a new jira to follow this? DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories. -- Key: HDFS-7692 URL: https://issues.apache.org/jira/browse/HDFS-7692 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Affects Versions: 2.5.2 Reporter: Leitao Guo Assignee: Leitao Guo Labels: BB2015-05-TBR Attachments: HDFS-7692.01.patch, HDFS-7692.02.patch {code:title=DataStorage#addStorageLocations(...)|borderStyle=solid} for (StorageLocation dataDir : dataDirs) { File root = dataDir.getFile(); ... ... bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, startOpt); addBlockPoolStorage(bpid, bpStorage); ... ... successVolumes.add(dataDir); } {code} In the above code the storage directories will be analyzed one by one, which is really time consuming when upgrading HDFS with datanodes have dozens of large volumes. 
MultiThread dataDirs analyzing should be supported here to speedup upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.
[ https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14549907#comment-14549907 ] Leitao Guo commented on HDFS-7692: -- [~eddyxu], thanks for your comments, please have a check of the new patch. 1.In DataStorage#recoverTransitionRead, log the InterruptedException and rethrow it as InterruptedIOException; 2.In TestDataStorage#testAddStorageDirectoreis, catch InterruptedException then let the test case fail; 3.The multithread in DataStorage#addStorageLocations() is for one specific namespace, so in TestDataStorage#testAddStorageDirectoreis my intention is creating one thread pool for each namespace. Not change here. 4.Re-phrase the parameter successVolumes. [~szetszwo],thanks for your comments, please have a check of the new patch. 1. InterruptedException re-thrown as InterruptedIOException; 2. I think it's a good idea to log the upgrade progress for each dir, but so far, we can not get the progress easily from the current api. Do you think it's necessary to file a new jira to follow this? DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories. -- Key: HDFS-7692 URL: https://issues.apache.org/jira/browse/HDFS-7692 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Affects Versions: 2.5.2 Reporter: Leitao Guo Assignee: Leitao Guo Labels: BB2015-05-TBR Attachments: HDFS-7692.01.patch, HDFS-7692.02.patch {code:title=DataStorage#addStorageLocations(...)|borderStyle=solid} for (StorageLocation dataDir : dataDirs) { File root = dataDir.getFile(); ... ... bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, startOpt); addBlockPoolStorage(bpid, bpStorage); ... ... successVolumes.add(dataDir); } {code} In the above code the storage directories will be analyzed one by one, which is really time consuming when upgrading HDFS with datanodes have dozens of large volumes. 
MultiThread dataDirs analyzing should be supported here to speedup upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.
[ https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14549903#comment-14549903 ] Leitao Guo commented on HDFS-7692: -- [~eddyxu], thanks for your comments, please have a check of the new patch. 1.In DataStorage#recoverTransitionRead, log the InterruptedException and rethrow it as InterruptedIOException; 2.In TestDataStorage#testAddStorageDirectoreis, catch InterruptedException then let the test case fail; 3.The multithread in DataStorage#addStorageLocations() is for one specific namespace, so in TestDataStorage#testAddStorageDirectoreis my intention is creating one thread pool for each namespace. Not change here. 4.Re-phrase the parameter successVolumes. [~szetszwo],thanks for your comments, please have a check of the new patch. 1. InterruptedException re-thrown as InterruptedIOException; 2. I think it's a good idea to log the upgrade progress for each dir, but so far, we can not get the progress easily from the current api. Do you think it's necessary to file a new jira to follow this? DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories. -- Key: HDFS-7692 URL: https://issues.apache.org/jira/browse/HDFS-7692 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Affects Versions: 2.5.2 Reporter: Leitao Guo Assignee: Leitao Guo Labels: BB2015-05-TBR Attachments: HDFS-7692.01.patch, HDFS-7692.02.patch {code:title=DataStorage#addStorageLocations(...)|borderStyle=solid} for (StorageLocation dataDir : dataDirs) { File root = dataDir.getFile(); ... ... bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, startOpt); addBlockPoolStorage(bpid, bpStorage); ... ... successVolumes.add(dataDir); } {code} In the above code the storage directories will be analyzed one by one, which is really time consuming when upgrading HDFS with datanodes have dozens of large volumes. 
MultiThread dataDirs analyzing should be supported here to speedup upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.
[ https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14549905#comment-14549905 ] Leitao Guo commented on HDFS-7692: -- [~eddyxu], thanks for your comments, please have a check of the new patch. 1.In DataStorage#recoverTransitionRead, log the InterruptedException and rethrow it as InterruptedIOException; 2.In TestDataStorage#testAddStorageDirectoreis, catch InterruptedException then let the test case fail; 3.The multithread in DataStorage#addStorageLocations() is for one specific namespace, so in TestDataStorage#testAddStorageDirectoreis my intention is creating one thread pool for each namespace. Not change here. 4.Re-phrase the parameter successVolumes. [~szetszwo],thanks for your comments, please have a check of the new patch. 1. InterruptedException re-thrown as InterruptedIOException; 2. I think it's a good idea to log the upgrade progress for each dir, but so far, we can not get the progress easily from the current api. Do you think it's necessary to file a new jira to follow this? DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories. -- Key: HDFS-7692 URL: https://issues.apache.org/jira/browse/HDFS-7692 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Affects Versions: 2.5.2 Reporter: Leitao Guo Assignee: Leitao Guo Labels: BB2015-05-TBR Attachments: HDFS-7692.01.patch, HDFS-7692.02.patch {code:title=DataStorage#addStorageLocations(...)|borderStyle=solid} for (StorageLocation dataDir : dataDirs) { File root = dataDir.getFile(); ... ... bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, startOpt); addBlockPoolStorage(bpid, bpStorage); ... ... successVolumes.add(dataDir); } {code} In the above code the storage directories will be analyzed one by one, which is really time consuming when upgrading HDFS with datanodes have dozens of large volumes. 
MultiThread dataDirs analyzing should be supported here to speedup upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.
[ https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14549896#comment-14549896 ] Leitao Guo commented on HDFS-7692: -- [~eddyxu], thanks for your comments, please have a check of the new patch. 1.In DataStorage#recoverTransitionRead, log the InterruptedException and rethrow it as InterruptedIOException; 2.In TestDataStorage#testAddStorageDirectoreis, catch InterruptedException then let the test case fail; 3.The multithread in DataStorage#addStorageLocations() is for one specific namespace, so in TestDataStorage#testAddStorageDirectoreis my intention is creating one thread pool for each namespace. Not change here. 4.Re-phrase the parameter successVolumes. [~szetszwo],thanks for your comments, please have a check of the new patch. 1. InterruptedException re-thrown as InterruptedIOException; 2. I think it's a good idea to log the upgrade progress for each dir, but so far, we can not get the progress easily from the current api. Do you think it's necessary to file a new jira to follow this? DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories. -- Key: HDFS-7692 URL: https://issues.apache.org/jira/browse/HDFS-7692 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Affects Versions: 2.5.2 Reporter: Leitao Guo Assignee: Leitao Guo Labels: BB2015-05-TBR Attachments: HDFS-7692.01.patch, HDFS-7692.02.patch {code:title=DataStorage#addStorageLocations(...)|borderStyle=solid} for (StorageLocation dataDir : dataDirs) { File root = dataDir.getFile(); ... ... bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, startOpt); addBlockPoolStorage(bpid, bpStorage); ... ... successVolumes.add(dataDir); } {code} In the above code the storage directories will be analyzed one by one, which is really time consuming when upgrading HDFS with datanodes have dozens of large volumes. 
MultiThread dataDirs analyzing should be supported here to speedup upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.
[ https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14549894#comment-14549894 ] Leitao Guo commented on HDFS-7692: -- [~eddyxu], thanks for your comments, please have a check of the new patch. 1.In DataStorage#recoverTransitionRead, log the InterruptedException and rethrow it as InterruptedIOException; 2.In TestDataStorage#testAddStorageDirectoreis, catch InterruptedException then let the test case fail; 3.The multithread in DataStorage#addStorageLocations() is for one specific namespace, so in TestDataStorage#testAddStorageDirectoreis my intention is creating one thread pool for each namespace. Not change here. 4.Re-phrase the parameter successVolumes. [~szetszwo],thanks for your comments, please have a check of the new patch. 1. InterruptedException re-thrown as InterruptedIOException; 2. I think it's a good idea to log the upgrade progress for each dir, but so far, we can not get the progress easily from the current api. Do you think it's necessary to file a new jira to follow this? DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories. -- Key: HDFS-7692 URL: https://issues.apache.org/jira/browse/HDFS-7692 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Affects Versions: 2.5.2 Reporter: Leitao Guo Assignee: Leitao Guo Labels: BB2015-05-TBR Attachments: HDFS-7692.01.patch, HDFS-7692.02.patch {code:title=DataStorage#addStorageLocations(...)|borderStyle=solid} for (StorageLocation dataDir : dataDirs) { File root = dataDir.getFile(); ... ... bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, startOpt); addBlockPoolStorage(bpid, bpStorage); ... ... successVolumes.add(dataDir); } {code} In the above code the storage directories will be analyzed one by one, which is really time consuming when upgrading HDFS with datanodes have dozens of large volumes. 
MultiThread dataDirs analyzing should be supported here to speedup upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.
[ https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14549887#comment-14549887 ] Leitao Guo commented on HDFS-7692: -- [~eddyxu], thanks for your comments, please have a check of the new patch. 1.In DataStorage#recoverTransitionRead, log the InterruptedException and rethrow it as InterruptedIOException; 2.In TestDataStorage#testAddStorageDirectoreis, catch InterruptedException then let the test case fail; 3.The multithread in DataStorage#addStorageLocations() is for one specific namespace, so in TestDataStorage#testAddStorageDirectoreis my intention is creating one thread pool for each namespace. Not change here. 4.Re-phrase the parameter successVolumes. [~szetszwo],thanks for your comments, please have a check of the new patch. 1. InterruptedException re-thrown as InterruptedIOException; 2. I think it's a good idea to log the upgrade progress for each dir, but so far, we can not get the progress easily from the current api. Do you think it's necessary to file a new jira to follow this? DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories. -- Key: HDFS-7692 URL: https://issues.apache.org/jira/browse/HDFS-7692 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Affects Versions: 2.5.2 Reporter: Leitao Guo Assignee: Leitao Guo Labels: BB2015-05-TBR Attachments: HDFS-7692.01.patch, HDFS-7692.02.patch {code:title=DataStorage#addStorageLocations(...)|borderStyle=solid} for (StorageLocation dataDir : dataDirs) { File root = dataDir.getFile(); ... ... bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, startOpt); addBlockPoolStorage(bpid, bpStorage); ... ... successVolumes.add(dataDir); } {code} In the above code the storage directories will be analyzed one by one, which is really time consuming when upgrading HDFS with datanodes have dozens of large volumes. 
MultiThread dataDirs analyzing should be supported here to speedup upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.
[ https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14549889#comment-14549889 ] Leitao Guo commented on HDFS-7692: -- [~eddyxu], thanks for your comments, please have a check of the new patch. 1.In DataStorage#recoverTransitionRead, log the InterruptedException and rethrow it as InterruptedIOException; 2.In TestDataStorage#testAddStorageDirectoreis, catch InterruptedException then let the test case fail; 3.The multithread in DataStorage#addStorageLocations() is for one specific namespace, so in TestDataStorage#testAddStorageDirectoreis my intention is creating one thread pool for each namespace. Not change here. 4.Re-phrase the parameter successVolumes. [~szetszwo],thanks for your comments, please have a check of the new patch. 1. InterruptedException re-thrown as InterruptedIOException; 2. I think it's a good idea to log the upgrade progress for each dir, but so far, we can not get the progress easily from the current api. Do you think it's necessary to file a new jira to follow this? DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories. -- Key: HDFS-7692 URL: https://issues.apache.org/jira/browse/HDFS-7692 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Affects Versions: 2.5.2 Reporter: Leitao Guo Assignee: Leitao Guo Labels: BB2015-05-TBR Attachments: HDFS-7692.01.patch, HDFS-7692.02.patch {code:title=DataStorage#addStorageLocations(...)|borderStyle=solid} for (StorageLocation dataDir : dataDirs) { File root = dataDir.getFile(); ... ... bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, startOpt); addBlockPoolStorage(bpid, bpStorage); ... ... successVolumes.add(dataDir); } {code} In the above code the storage directories will be analyzed one by one, which is really time consuming when upgrading HDFS with datanodes have dozens of large volumes. 
MultiThread dataDirs analyzing should be supported here to speedup upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.
[ https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14549908#comment-14549908 ]

Leitao Guo commented on HDFS-7692:
--
[~eddyxu], thanks for your comments; please review the new patch.
1. In DataStorage#recoverTransitionRead, log the InterruptedException and rethrow it as InterruptedIOException.
2. In TestDataStorage#testAddStorageDirectoreis, catch InterruptedException and fail the test case.
3. The multithreading in DataStorage#addStorageLocations() applies to one specific namespace, so in TestDataStorage#testAddStorageDirectoreis my intention is to create one thread pool per namespace. No change here.
4. Rephrased the parameter successVolumes.

[~szetszwo], thanks for your comments; please review the new patch.
1. InterruptedException is re-thrown as InterruptedIOException.
2. I think it's a good idea to log the upgrade progress for each dir, but we cannot easily get the progress from the current API. Do you think it's necessary to file a new JIRA to follow up on this?

DataStorage#addStorageLocations(...) should support MultiThread to speed up the upgrade of block pool at multi storage directories.
--
Key: HDFS-7692
URL: https://issues.apache.org/jira/browse/HDFS-7692
Project: Hadoop HDFS
Issue Type: Improvement
Components: datanode
Affects Versions: 2.5.2
Reporter: Leitao Guo
Assignee: Leitao Guo
Labels: BB2015-05-TBR
Attachments: HDFS-7692.01.patch, HDFS-7692.02.patch

{code:title=DataStorage#addStorageLocations(...)|borderStyle=solid}
for (StorageLocation dataDir : dataDirs) {
  File root = dataDir.getFile();
  ... ...
  bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, startOpt);
  addBlockPoolStorage(bpid, bpStorage);
  ... ...
  successVolumes.add(dataDir);
}
{code}

In the above code the storage directories are analyzed one by one, which is really time consuming when upgrading HDFS on datanodes that have dozens of large volumes.
Multi-threaded analysis of dataDirs should be supported here to speed up the upgrade.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
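The approach discussed in the comment — submitting each dataDir's analysis to a thread pool (one pool per namespace) and rethrowing InterruptedException as InterruptedIOException — can be sketched roughly as follows. This is an illustrative, self-contained sketch, not the actual patch: names like ParallelDirLoader and loadDataDir are hypothetical stand-ins for the per-directory work that DataStorage performs.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelDirLoader {

    // Hypothetical stand-in for the per-directory analysis/upgrade work
    // (recoverTransitionRead, addBlockPoolStorage, etc. in the real code).
    static String loadDataDir(String dir) {
        return dir;
    }

    static List<String> addStorageLocations(List<String> dataDirs, int threads)
            throws IOException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<String>> futures = new ArrayList<>();
        // Submit every directory up front so they are processed in parallel
        // instead of one by one.
        for (String dir : dataDirs) {
            futures.add(pool.submit(() -> loadDataDir(dir)));
        }
        List<String> successVolumes = new ArrayList<>();
        try {
            // Collect results in submit order.
            for (Future<String> f : futures) {
                successVolumes.add(f.get());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            // Rethrow as InterruptedIOException, as agreed in the review comments.
            throw (InterruptedIOException)
                    new InterruptedIOException("upgrade interrupted").initCause(e);
        } catch (ExecutionException e) {
            throw new IOException(e.getCause());
        } finally {
            pool.shutdown();
        }
        return successVolumes;
    }

    public static void main(String[] args) throws IOException {
        List<String> dirs = List.of("/data/1", "/data/2", "/data/3");
        // Prints [/data/1, /data/2, /data/3]: futures preserve submit order
        // even though the work runs concurrently.
        System.out.println(addStorageLocations(dirs, 2));
    }
}
```

Collecting each Future in submit order keeps the result deterministic while the per-directory work itself runs concurrently, which is where the speedup on datanodes with many volumes comes from.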
[jira] [Updated] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.
[ https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leitao Guo updated HDFS-7692:
-
Attachment: HDFS-7692.02.patch

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.
[ https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14549899#comment-14549899 ] Leitao Guo commented on HDFS-7692: -- [~eddyxu], thanks for your comments, please have a check of the new patch. 1.In DataStorage#recoverTransitionRead, log the InterruptedException and rethrow it as InterruptedIOException; 2.In TestDataStorage#testAddStorageDirectoreis, catch InterruptedException then let the test case fail; 3.The multithread in DataStorage#addStorageLocations() is for one specific namespace, so in TestDataStorage#testAddStorageDirectoreis my intention is creating one thread pool for each namespace. Not change here. 4.Re-phrase the parameter successVolumes. [~szetszwo],thanks for your comments, please have a check of the new patch. 1. InterruptedException re-thrown as InterruptedIOException; 2. I think it's a good idea to log the upgrade progress for each dir, but so far, we can not get the progress easily from the current api. Do you think it's necessary to file a new jira to follow this? DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories. -- Key: HDFS-7692 URL: https://issues.apache.org/jira/browse/HDFS-7692 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Affects Versions: 2.5.2 Reporter: Leitao Guo Assignee: Leitao Guo Labels: BB2015-05-TBR Attachments: HDFS-7692.01.patch, HDFS-7692.02.patch {code:title=DataStorage#addStorageLocations(...)|borderStyle=solid} for (StorageLocation dataDir : dataDirs) { File root = dataDir.getFile(); ... ... bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, startOpt); addBlockPoolStorage(bpid, bpStorage); ... ... successVolumes.add(dataDir); } {code} In the above code the storage directories will be analyzed one by one, which is really time consuming when upgrading HDFS with datanodes have dozens of large volumes. 
MultiThread dataDirs analyzing should be supported here to speedup upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.
[ https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14549902#comment-14549902 ] Leitao Guo commented on HDFS-7692: -- [~eddyxu], thanks for your comments, please have a check of the new patch. 1.In DataStorage#recoverTransitionRead, log the InterruptedException and rethrow it as InterruptedIOException; 2.In TestDataStorage#testAddStorageDirectoreis, catch InterruptedException then let the test case fail; 3.The multithread in DataStorage#addStorageLocations() is for one specific namespace, so in TestDataStorage#testAddStorageDirectoreis my intention is creating one thread pool for each namespace. Not change here. 4.Re-phrase the parameter successVolumes. [~szetszwo],thanks for your comments, please have a check of the new patch. 1. InterruptedException re-thrown as InterruptedIOException; 2. I think it's a good idea to log the upgrade progress for each dir, but so far, we can not get the progress easily from the current api. Do you think it's necessary to file a new jira to follow this? DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories. -- Key: HDFS-7692 URL: https://issues.apache.org/jira/browse/HDFS-7692 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Affects Versions: 2.5.2 Reporter: Leitao Guo Assignee: Leitao Guo Labels: BB2015-05-TBR Attachments: HDFS-7692.01.patch, HDFS-7692.02.patch {code:title=DataStorage#addStorageLocations(...)|borderStyle=solid} for (StorageLocation dataDir : dataDirs) { File root = dataDir.getFile(); ... ... bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, startOpt); addBlockPoolStorage(bpid, bpStorage); ... ... successVolumes.add(dataDir); } {code} In the above code the storage directories will be analyzed one by one, which is really time consuming when upgrading HDFS with datanodes have dozens of large volumes. 
MultiThread dataDirs analyzing should be supported here to speedup upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.
[ https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14549893#comment-14549893 ] Leitao Guo commented on HDFS-7692: -- [~eddyxu], thanks for your comments, please have a check of the new patch. 1.In DataStorage#recoverTransitionRead, log the InterruptedException and rethrow it as InterruptedIOException; 2.In TestDataStorage#testAddStorageDirectoreis, catch InterruptedException then let the test case fail; 3.The multithread in DataStorage#addStorageLocations() is for one specific namespace, so in TestDataStorage#testAddStorageDirectoreis my intention is creating one thread pool for each namespace. Not change here. 4.Re-phrase the parameter successVolumes. [~szetszwo],thanks for your comments, please have a check of the new patch. 1. InterruptedException re-thrown as InterruptedIOException; 2. I think it's a good idea to log the upgrade progress for each dir, but so far, we can not get the progress easily from the current api. Do you think it's necessary to file a new jira to follow this? DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories. -- Key: HDFS-7692 URL: https://issues.apache.org/jira/browse/HDFS-7692 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Affects Versions: 2.5.2 Reporter: Leitao Guo Assignee: Leitao Guo Labels: BB2015-05-TBR Attachments: HDFS-7692.01.patch, HDFS-7692.02.patch {code:title=DataStorage#addStorageLocations(...)|borderStyle=solid} for (StorageLocation dataDir : dataDirs) { File root = dataDir.getFile(); ... ... bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, startOpt); addBlockPoolStorage(bpid, bpStorage); ... ... successVolumes.add(dataDir); } {code} In the above code the storage directories will be analyzed one by one, which is really time consuming when upgrading HDFS with datanodes have dozens of large volumes. 
MultiThread dataDirs analyzing should be supported here to speedup upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.
[ https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14549900#comment-14549900 ] Leitao Guo commented on HDFS-7692: -- [~eddyxu], thanks for your comments, please have a check of the new patch. 1.In DataStorage#recoverTransitionRead, log the InterruptedException and rethrow it as InterruptedIOException; 2.In TestDataStorage#testAddStorageDirectoreis, catch InterruptedException then let the test case fail; 3.The multithread in DataStorage#addStorageLocations() is for one specific namespace, so in TestDataStorage#testAddStorageDirectoreis my intention is creating one thread pool for each namespace. Not change here. 4.Re-phrase the parameter successVolumes. [~szetszwo],thanks for your comments, please have a check of the new patch. 1. InterruptedException re-thrown as InterruptedIOException; 2. I think it's a good idea to log the upgrade progress for each dir, but so far, we can not get the progress easily from the current api. Do you think it's necessary to file a new jira to follow this? DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories. -- Key: HDFS-7692 URL: https://issues.apache.org/jira/browse/HDFS-7692 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Affects Versions: 2.5.2 Reporter: Leitao Guo Assignee: Leitao Guo Labels: BB2015-05-TBR Attachments: HDFS-7692.01.patch, HDFS-7692.02.patch {code:title=DataStorage#addStorageLocations(...)|borderStyle=solid} for (StorageLocation dataDir : dataDirs) { File root = dataDir.getFile(); ... ... bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, startOpt); addBlockPoolStorage(bpid, bpStorage); ... ... successVolumes.add(dataDir); } {code} In the above code the storage directories will be analyzed one by one, which is really time consuming when upgrading HDFS with datanodes have dozens of large volumes. 
MultiThread dataDirs analyzing should be supported here to speedup upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.
[ https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14549895#comment-14549895 ] Leitao Guo commented on HDFS-7692: -- [~eddyxu], thanks for your comments, please have a check of the new patch. 1.In DataStorage#recoverTransitionRead, log the InterruptedException and rethrow it as InterruptedIOException; 2.In TestDataStorage#testAddStorageDirectoreis, catch InterruptedException then let the test case fail; 3.The multithread in DataStorage#addStorageLocations() is for one specific namespace, so in TestDataStorage#testAddStorageDirectoreis my intention is creating one thread pool for each namespace. Not change here. 4.Re-phrase the parameter successVolumes. [~szetszwo],thanks for your comments, please have a check of the new patch. 1. InterruptedException re-thrown as InterruptedIOException; 2. I think it's a good idea to log the upgrade progress for each dir, but so far, we can not get the progress easily from the current api. Do you think it's necessary to file a new jira to follow this? DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories. -- Key: HDFS-7692 URL: https://issues.apache.org/jira/browse/HDFS-7692 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Affects Versions: 2.5.2 Reporter: Leitao Guo Assignee: Leitao Guo Labels: BB2015-05-TBR Attachments: HDFS-7692.01.patch, HDFS-7692.02.patch {code:title=DataStorage#addStorageLocations(...)|borderStyle=solid} for (StorageLocation dataDir : dataDirs) { File root = dataDir.getFile(); ... ... bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, startOpt); addBlockPoolStorage(bpid, bpStorage); ... ... successVolumes.add(dataDir); } {code} In the above code the storage directories will be analyzed one by one, which is really time consuming when upgrading HDFS with datanodes have dozens of large volumes. 
MultiThread dataDirs analyzing should be supported here to speedup upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.
[ https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14549897#comment-14549897 ] Leitao Guo commented on HDFS-7692: -- [~eddyxu], thanks for your comments, please have a check of the new patch. 1.In DataStorage#recoverTransitionRead, log the InterruptedException and rethrow it as InterruptedIOException; 2.In TestDataStorage#testAddStorageDirectoreis, catch InterruptedException then let the test case fail; 3.The multithread in DataStorage#addStorageLocations() is for one specific namespace, so in TestDataStorage#testAddStorageDirectoreis my intention is creating one thread pool for each namespace. Not change here. 4.Re-phrase the parameter successVolumes. [~szetszwo],thanks for your comments, please have a check of the new patch. 1. InterruptedException re-thrown as InterruptedIOException; 2. I think it's a good idea to log the upgrade progress for each dir, but so far, we can not get the progress easily from the current api. Do you think it's necessary to file a new jira to follow this? DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories. -- Key: HDFS-7692 URL: https://issues.apache.org/jira/browse/HDFS-7692 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Affects Versions: 2.5.2 Reporter: Leitao Guo Assignee: Leitao Guo Labels: BB2015-05-TBR Attachments: HDFS-7692.01.patch, HDFS-7692.02.patch {code:title=DataStorage#addStorageLocations(...)|borderStyle=solid} for (StorageLocation dataDir : dataDirs) { File root = dataDir.getFile(); ... ... bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, startOpt); addBlockPoolStorage(bpid, bpStorage); ... ... successVolumes.add(dataDir); } {code} In the above code the storage directories will be analyzed one by one, which is really time consuming when upgrading HDFS with datanodes have dozens of large volumes. 
MultiThread dataDirs analyzing should be supported here to speedup upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.
[ https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14549913#comment-14549913 ]

Leitao Guo commented on HDFS-7692:
--

[~eddyxu], thanks for your comments; please review the new patch.
1. In DataStorage#recoverTransitionRead, log the InterruptedException and rethrow it as an InterruptedIOException.
2. In TestDataStorage#testAddStorageDirectoreis, catch the InterruptedException and fail the test case.
3. The multithreading in DataStorage#addStorageLocations() is scoped to one specific namespace, so in TestDataStorage#testAddStorageDirectoreis my intention is to create one thread pool per namespace. No change here.
4. Re-phrased the parameter successVolumes.

[~szetszwo], thanks for your comments; please review the new patch.
1. InterruptedException is re-thrown as InterruptedIOException.
2. I think it is a good idea to log the upgrade progress for each directory, but so far we cannot easily get the progress from the current API. Do you think it is necessary to file a new JIRA to follow up on this?

DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.
--

Key: HDFS-7692
URL: https://issues.apache.org/jira/browse/HDFS-7692
Project: Hadoop HDFS
Issue Type: Improvement
Components: datanode
Affects Versions: 2.5.2
Reporter: Leitao Guo
Assignee: Leitao Guo
Labels: BB2015-05-TBR
Attachments: HDFS-7692.01.patch, HDFS-7692.02.patch

{code:title=DataStorage#addStorageLocations(...)|borderStyle=solid}
for (StorageLocation dataDir : dataDirs) {
  File root = dataDir.getFile();
  ... ...
  bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, startOpt);
  addBlockPoolStorage(bpid, bpStorage);
  ... ...
  successVolumes.add(dataDir);
}
{code}

In the above code the storage directories are analyzed one by one, which is really time consuming when upgrading HDFS on datanodes that have dozens of large volumes. Multi-threaded analysis of the dataDirs should be supported here to speed up the upgrade.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
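For readers of the digest, the parallelization discussed above can be sketched as follows. This is a minimal, self-contained illustration, not the actual patch: the class name `ParallelStorageUpgrade` and the method `analyzeDir` are hypothetical stand-ins for the real per-volume work (recoverTransitionRead + addBlockPoolStorage), and directories are plain strings rather than StorageLocation objects.

```java
import java.util.*;
import java.util.concurrent.*;

// Hedged sketch: submit each storage directory to a fixed-size pool and
// collect the ones that completed without error, mirroring how the
// sequential loop builds successVolumes.
public class ParallelStorageUpgrade {

    // Stand-in for the per-directory analysis done inside the original loop.
    static String analyzeDir(String dataDir) throws Exception {
        Thread.sleep(10); // simulate slow per-volume disk I/O
        return dataDir;
    }

    public static List<String> addStorageLocations(List<String> dataDirs,
                                                   int nThreads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        List<Future<String>> futures = new ArrayList<>();
        for (String dir : dataDirs) {
            futures.add(pool.submit(() -> analyzeDir(dir)));
        }
        List<String> successVolumes = new ArrayList<>();
        for (Future<String> f : futures) {
            try {
                successVolumes.add(f.get()); // keep only dirs that upgraded cleanly
            } catch (ExecutionException e) {
                // a failed volume is skipped, as in the sequential version
            }
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return successVolumes;
    }

    public static void main(String[] args) throws Exception {
        List<String> dirs = Arrays.asList("/data1", "/data2", "/data3");
        // Futures are read in submit order, so output order matches input order.
        System.out.println(addStorageLocations(dirs, 3)); // prints [/data1, /data2, /data3]
    }
}
```

Because the futures are drained in submission order, the result list stays deterministic even though the per-directory work runs concurrently.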
[jira] [Commented] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.
[ https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310587#comment-14310587 ]

Leitao Guo commented on HDFS-7692:
--

[~eddyxu], thanks for your review and comments! When running the tests after patching, I got a NullPointerException at this line, so I added a check for null != datanode.getConf() here:

{code}
Executors.newFixedThreadPool(null != datanode.getConf()
    ? datanode.getConf().getInt(
        DFSConfigKeys.DFS_DATANODE_DIRECTORYSCAN_THREADS_KEY,
        DFSConfigKeys.DFS_DATANODE_DIRECTORYSCAN_THREADS_DEFAULT)
    : dataDirs.size());
{code}

I will update the patch according to your comments, thanks!
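The fallback logic in the snippet above can be isolated into a small sketch: use the configured directory-scan thread count when a Configuration object is present, otherwise default to one thread per data directory. The class `PoolSizing` and method `poolSize` are hypothetical names for illustration, not part of the patch.

```java
// Hedged sketch of the null-guarded pool sizing discussed in the comment.
public class PoolSizing {

    // configuredThreads stands in for the value read from
    // dfs.datanode.directoryscan.threads; null models a missing Configuration.
    static int poolSize(Integer configuredThreads, int numDataDirs) {
        // No Configuration available -> fall back to dataDirs.size()
        return configuredThreads != null ? configuredThreads : numDataDirs;
    }

    public static void main(String[] args) {
        System.out.println(poolSize(null, 12)); // prints 12 (fallback)
        System.out.println(poolSize(4, 12));    // prints 4 (configured value)
    }
}
```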
[jira] [Commented] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.
[ https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310591#comment-14310591 ]

Leitao Guo commented on HDFS-7692:
--

When upgrading before this patch, I found high CPU utilization (~90%) in our cluster, so I think we had better control the number of threads here. I will run a test to verify this.
[jira] [Commented] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.
[ https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310592#comment-14310592 ]

Leitao Guo commented on HDFS-7692:
--

[~szetszwo], thanks for your comments; I will update the patch ASAP.
[jira] [Updated] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.
[ https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leitao Guo updated HDFS-7692:
--
Attachment: HDFS-7692.01.patch
[jira] [Updated] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.
[ https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leitao Guo updated HDFS-7692:
--
Release Note: Please help review the patch. Thanks!
Status: Patch Available  (was: Open)
[jira] [Updated] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.
[ https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leitao Guo updated HDFS-7692:
--
Description:
{code:title=DataStorage#addStorageLocations(...)|borderStyle=solid}
for (StorageLocation dataDir : dataDirs) {
  File root = dataDir.getFile();
  ... ...
  bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, startOpt);
  addBlockPoolStorage(bpid, bpStorage);
  ... ...
  successVolumes.add(dataDir);
}
{code}
In the above code the storage directories will be analyzed one by one, which is really time consuming when upgrading HDFS with datanodes have dozens of large volumes. MultiThread dataDirs analyzing should be supported here to speedup upgrade.

was:
{code:title=DataStorage#addStorageLocations...)|borderStyle=solid}
for (StorageLocation dataDir : dataDirs) {
  File root = dataDir.getFile();
  ... ...
  bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, startOpt);
  addBlockPoolStorage(bpid, bpStorage);
  ... ...
  successVolumes.add(dataDir);
}
{code}
In the above code the storage directories will be analyzed one by one, which is really time consuming when upgrading HDFS with datanodes have dozens of large volumes. MultiThread dataDirs analyzing should be supported here to speedup upgrade.
[jira] [Updated] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.
[ https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leitao Guo updated HDFS-7692:
--
Summary: DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories. (was: BlockPoolSliceStorage#loadBpStorageDirectories(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.)

{code:title=BlockPoolSliceStorage#loadBpStorageDirectories(...)|borderStyle=solid}
for (File dataDir : dataDirs) {
  if (containsStorageDir(dataDir)) {
    throw new IOException("BlockPoolSliceStorage.recoverTransitionRead: "
        + "attempt to load an used block storage: " + dataDir);
  }
  StorageDirectory sd = loadStorageDirectory(datanode, nsInfo, dataDir, startOpt);
  succeedDirs.add(sd);
}
{code}

In the above code the storage directories are analyzed one by one, which is really time consuming when upgrading HDFS on datanodes that have dozens of large volumes. Multi-threaded analysis of the dataDirs should be supported here to speed up the upgrade.