[jira] [Created] (HDFS-17220) fix same available space policy in AvailableSpaceVolumeChoosingPolicy

2023-10-11 Thread Fei Guo (Jira)
Fei Guo created HDFS-17220:
--

 Summary: fix same available space policy in 
AvailableSpaceVolumeChoosingPolicy
 Key: HDFS-17220
 URL: https://issues.apache.org/jira/browse/HDFS-17220
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs
Affects Versions: 3.3.6
Reporter: Fei Guo


If all the volumes have the same available space, for example {{1 MB}}, and 
{{dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold}}
 is set to 0, then all volumes should be treated equally when choosing an 
available volume. Currently they are not, and we can fix this in 
{{AvailableSpaceVolumeChoosingPolicy}}.
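A minimal sketch of the intended behaviour (illustrative names and logic only, 
not the actual policy internals): when the spread between the fullest and the 
emptiest volume is within the configured threshold, no volume should be 
preferred over another.

{code:java}
// Hedged sketch, not the real AvailableSpaceVolumeChoosingPolicy code: when every
// volume has exactly 1 MB free and the threshold is 0, all volumes are "equal
// enough" and should be chosen uniformly rather than biased.
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

class EqualSpaceVolumeChoiceSketch {
  static int chooseVolume(List<Long> availableBytes, long balancedSpaceThreshold) {
    long max = availableBytes.stream().mapToLong(Long::longValue).max().orElse(0L);
    long min = availableBytes.stream().mapToLong(Long::longValue).min().orElse(0L);
    if (max - min <= balancedSpaceThreshold) {
      // All volumes are within the threshold: pick uniformly, bias none of them.
      return ThreadLocalRandom.current().nextInt(availableBytes.size());
    }
    return 0; // otherwise prefer volumes with more free space (omitted here)
  }
}
{code}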






[jira] [Created] (HDFS-17210) Optimize AvailableSpaceBlockPlacementPolicy

2023-09-25 Thread Fei Guo (Jira)
Fei Guo created HDFS-17210:
--

 Summary: Optimize AvailableSpaceBlockPlacementPolicy
 Key: HDFS-17210
 URL: https://issues.apache.org/jira/browse/HDFS-17210
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs
Affects Versions: 3.3.6
Reporter: Fei Guo


Currently we may have many nodes with usage over 85%, and some nodes over 90%. 
If the candidate nodes are nodeA (97%), nodeB (98%), and nodeC (99%), we do not 
want nodeC (99%) to be chosen, because its usage is already very high; even with 
only a random chance, nodeC will reach 100% soon. So we could directly choose 
the node with the lowest usage (nodeA) when all candidate nodes are above a 
high-usage threshold such as 95% (just an example).
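A sketch of the proposed decision (the 95% cutoff and all names below are only 
examples, not the actual AvailableSpaceBlockPlacementPolicy code):

{code:java}
import java.util.Random;

class HighUsageChoiceSketch {
  record Node(String name, double usagePercent) {}

  static Node choose(Node a, Node b, double highUsageCutoff, Random rand) {
    if (a.usagePercent() >= highUsageCutoff && b.usagePercent() >= highUsageCutoff) {
      // Both candidates are nearly full: always take the less-used one and never
      // give the fuller node a random chance of winning.
      return a.usagePercent() <= b.usagePercent() ? a : b;
    }
    // Otherwise keep the existing probabilistic preference for the emptier node.
    return rand.nextBoolean() ? a : b;
  }

  public static void main(String[] args) {
    Node a = new Node("nodeA", 97), c = new Node("nodeC", 99);
    System.out.println(choose(a, c, 95, new Random()).name()); // always nodeA
  }
}
{code}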






[jira] [Created] (HDFS-16357) Fix log format in DFSUtilClient

2021-11-26 Thread guo (Jira)
guo created HDFS-16357:
--

 Summary: Fix log format in DFSUtilClient
 Key: HDFS-16357
 URL: https://issues.apache.org/jira/browse/HDFS-16357
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs
Affects Versions: 3.3.1
Reporter: guo


If the address is local, there is an extra space in the log message; we can 
improve the formatting so it looks proper.






[jira] [Created] (HDFS-16355) Improve block scanner desc

2021-11-25 Thread guo (Jira)
guo created HDFS-16355:
--

 Summary: Improve block scanner desc
 Key: HDFS-16355
 URL: https://issues.apache.org/jira/browse/HDFS-16355
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs
Affects Versions: 3.3.1
Reporter: guo


The datanode block scanner will be disabled if 
`dfs.block.scanner.volume.bytes.per.second` is configured to a value less than 
or equal to zero; we can improve the description.
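For reference, a hedged illustration of the behaviour being documented (a sketch 
only, not the scanner code itself):

{code:java}
// Sketch: a non-positive value for this property disables the datanode block scanner.
import org.apache.hadoop.conf.Configuration;

class BlockScannerConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setLong("dfs.block.scanner.volume.bytes.per.second", 0L); // 0 or negative => disabled
  }
}
{code}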






[jira] [Updated] (HDFS-16355) Improve block scanner desc

2021-11-25 Thread guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guo updated HDFS-16355:
---
Description: The datanode block scanner will be disabled if 
`dfs.block.scanner.volume.bytes.per.second` is configured to a value less than 
or equal to zero; we can improve the description.  (was: datanode block scanner 
will be dissbled if `dfs.block.scanner.volume.bytes.per.second` is configured less then 
or equal to zero, we can improve the desciption)

> Improve block scanner desc
> --
>
> Key: HDFS-16355
> URL: https://issues.apache.org/jira/browse/HDFS-16355
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.3.1
>Reporter: guo
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The datanode block scanner will be disabled if 
> `dfs.block.scanner.volume.bytes.per.second` is configured to a value less than 
> or equal to zero; we can improve the description.






[jira] [Created] (HDFS-16351) add path exception information in FSNamesystem

2021-11-23 Thread guo (Jira)
guo created HDFS-16351:
--

 Summary: add path exception information in FSNamesystem
 Key: HDFS-16351
 URL: https://issues.apache.org/jira/browse/HDFS-16351
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs
Affects Versions: 3.3.1
Reporter: guo


Add path information to the exception message in FSNamesystem to make the 
message clearer.
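A hedged illustration of the kind of change meant here (the message text and 
variable name are illustrative, not the actual FSNamesystem code):

{code:java}
import java.io.FileNotFoundException;

class PathInExceptionSketch {
  static void check(boolean exists, String src) throws FileNotFoundException {
    if (!exists) {
      // Sketch: include the offending path so the caller can tell which file failed.
      throw new FileNotFoundException("Path does not exist: " + src);
    }
  }
}
{code}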






[jira] [Created] (HDFS-16347) Fix directory scan throttle default value

2021-11-22 Thread guo (Jira)
guo created HDFS-16347:
--

 Summary: Fix directory scan throttle default value
 Key: HDFS-16347
 URL: https://issues.apache.org/jira/browse/HDFS-16347
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: documentation
Affects Versions: 3.3.1
Reporter: guo


The default value of `dfs.datanode.directoryscan.throttle.limit.ms.per.sec` was 
changed from `1000` to `-1` by HDFS-13947; we can update the documentation 
accordingly.






[jira] [Created] (HDFS-16345) Fix test cases fail in TestBlockStoragePolicy

2021-11-21 Thread guo (Jira)
guo created HDFS-16345:
--

 Summary: Fix test cases fail in TestBlockStoragePolicy
 Key: HDFS-16345
 URL: https://issues.apache.org/jira/browse/HDFS-16345
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: build
Affects Versions: 3.3.1
Reporter: guo


The test class `TestBlockStoragePolicy` fails frequently with a `BindException`, 
which blocks normal source builds; we can improve it.
[ERROR] Tests run: 26, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 49.295 s <<< FAILURE! - in org.apache.hadoop.hdfs.TestBlockStoragePolicy
[ERROR] testChooseTargetWithTopology(org.apache.hadoop.hdfs.TestBlockStoragePolicy)  Time elapsed: 0.551 s <<< ERROR!
java.net.BindException: Problem binding to [localhost:43947] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:931)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:827)
at org.apache.hadoop.ipc.Server.bind(Server.java:657)
at org.apache.hadoop.ipc.Server$Listener.(Server.java:1352)
at org.apache.hadoop.ipc.Server.(Server.java:3252)
at org.apache.hadoop.ipc.RPC$Server.(RPC.java:1062)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server.(ProtobufRpcEngine2.java:468)
at org.apache.hadoop.ipc.ProtobufRpcEngine2.getServer(ProtobufRpcEngine2.java:371)
at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:853)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.(NameNodeRpcServer.java:466)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createRpcServer(NameNode.java:860)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:766)
at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:1017)
at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:992)
at org.apache.hadoop.hdfs.TestBlockStoragePolicy.testChooseTargetWithTopology(TestBlockStoragePolicy.java:1275)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
Caused by: java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:461)
at sun.nio.ch.Net.bind(Net.java:453)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:222)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:85) at
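One common way to avoid this class of failure (a sketch under the assumption 
that the test collides on a fixed port, not the actual fix) is to bind to an 
ephemeral port so parallel runs cannot collide:

{code:java}
import java.net.ServerSocket;

class EphemeralPortSketch {
  public static void main(String[] args) throws Exception {
    try (ServerSocket s = new ServerSocket(0)) {   // port 0 = let the OS pick a free port
      System.out.println("bound to free port " + s.getLocalPort());
    }
  }
}
{code}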

[jira] [Resolved] (HDFS-16340) improve diskbalancer error message

2021-11-21 Thread guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guo resolved HDFS-16340.

Resolution: Won't Fix

> improve diskbalancer error message
> --
>
> Key: HDFS-16340
> URL: https://issues.apache.org/jira/browse/HDFS-16340
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.3.1
>Reporter: guo
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> During disk balancing, when we cannot get JSON from an item, only `Unable to get 
> json from Item.` is recorded and no actual exception message is printed; we 
> can improve it.






[jira] [Resolved] (HDFS-16342) improve code in KMS

2021-11-21 Thread guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guo resolved HDFS-16342.

Resolution: Won't Fix

> improve code in KMS
> ---
>
> Key: HDFS-16342
> URL: https://issues.apache.org/jira/browse/HDFS-16342
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: kms
>Affects Versions: 3.3.1
>Reporter: guo
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> There is duplicated code in KMS; we can do a small cleanup.






[jira] [Created] (HDFS-16342) improve code in KMS

2021-11-20 Thread guo (Jira)
guo created HDFS-16342:
--

 Summary: improve code in KMS
 Key: HDFS-16342
 URL: https://issues.apache.org/jira/browse/HDFS-16342
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: kms
Affects Versions: 3.3.1
Reporter: guo


There is duplicated code in KMS; we can do a small cleanup.






[jira] [Created] (HDFS-16341) Add block placement policy desc

2021-11-20 Thread guo (Jira)
guo created HDFS-16341:
--

 Summary: Add block placement policy desc
 Key: HDFS-16341
 URL: https://issues.apache.org/jira/browse/HDFS-16341
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: documentation
Affects Versions: 3.3.1
Reporter: guo


Now that six block placement policies are supported, we should keep the 
documentation up to date.






[jira] [Created] (HDFS-16340) improve diskbalancer error message

2021-11-20 Thread guo (Jira)
guo created HDFS-16340:
--

 Summary: improve diskbalancer error message
 Key: HDFS-16340
 URL: https://issues.apache.org/jira/browse/HDFS-16340
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs
Affects Versions: 3.3.1
Reporter: guo


During disk balancing, when we cannot get JSON from an item, only `Unable to get 
json from Item.` is recorded and no actual exception message is printed; we 
can improve it.
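A hedged sketch of the improvement (illustrative only, not the actual 
DiskBalancer change): pass the caught exception to the logger so the root cause 
appears alongside the generic message.

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class DiskBalancerLogSketch {
  private static final Logger LOG = LoggerFactory.getLogger(DiskBalancerLogSketch.class);

  static void report(Exception cause) {
    // Include the stack trace, not just the generic text.
    LOG.error("Unable to get json from Item.", cause);
  }
}
{code}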






[jira] [Updated] (HDFS-16338) Fix error configuration message in FSImage

2021-11-19 Thread guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guo updated HDFS-16338:
---
Description: `dfs.namenode.checkpoint.edits.dir` may be different from 
`dfs.namenode.checkpoint.dir`; if `checkpointEditsDirs` is null or empty, the 
error message should point to the edits-dir configuration. We can fix it.  (was: 
During import checkpoint , if `checkpointEditsDirs` is null or empty, error 
message should warn the right configuration, we can fix it.)

> Fix error configuration message in FSImage
> --
>
> Key: HDFS-16338
> URL: https://issues.apache.org/jira/browse/HDFS-16338
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.3.1
>Reporter: guo
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> `dfs.namenode.checkpoint.edits.dir` may be different from 
> `dfs.namenode.checkpoint.dir`; if `checkpointEditsDirs` is null or empty, the 
> error message should point to the edits-dir configuration. We can fix it.






[jira] [Created] (HDFS-16338) Fix error configuration message in FSImage

2021-11-19 Thread guo (Jira)
guo created HDFS-16338:
--

 Summary: Fix error configuration message in FSImage
 Key: HDFS-16338
 URL: https://issues.apache.org/jira/browse/HDFS-16338
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs
Affects Versions: 3.3.1
Reporter: guo


During checkpoint import, if `checkpointEditsDirs` is null or empty, the error 
message should point to the correct configuration. We can fix it.
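A minimal sketch of the intended message (illustrative names and wording, not 
the actual FSImage code):

{code:java}
import java.io.IOException;
import java.net.URI;
import java.util.List;

class CheckpointEditsDirsSketch {
  static void validate(List<URI> checkpointEditsDirs) throws IOException {
    if (checkpointEditsDirs == null || checkpointEditsDirs.isEmpty()) {
      // Point at the edits-dir key, not the image checkpoint-dir key.
      throw new IOException(
          "Cannot import checkpoint: \"dfs.namenode.checkpoint.edits.dir\" is not set.");
    }
  }
}
{code}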






[jira] [Updated] (HDFS-16334) correct namenode acl desc

2021-11-18 Thread guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guo updated HDFS-16334:
---
Description: `dfs.namenode.acls.enabled` is set to `true` by default after 
HDFS-13505; we can improve the description.  (was: `dfs.namenode.acls.enabled` is 
set to be `true` after HDFS-13505 ,we can improve the desc)

> correct namenode acl desc
> -
>
> Key: HDFS-16334
> URL: https://issues.apache.org/jira/browse/HDFS-16334
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 3.3.1
>Reporter: guo
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> `dfs.namenode.acls.enabled` is set to `true` by default after HDFS-13505; 
> we can improve the description.






[jira] [Created] (HDFS-16334) correct namenode acl desc

2021-11-18 Thread guo (Jira)
guo created HDFS-16334:
--

 Summary: correct namenode acl desc
 Key: HDFS-16334
 URL: https://issues.apache.org/jira/browse/HDFS-16334
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: documentation
Affects Versions: 3.3.1
Reporter: guo


`dfs.namenode.acls.enabled` is set to `true` after HDFS-13505; we can improve 
the description.






[jira] [Updated] (HDFS-16328) Correct disk balancer param desc

2021-11-17 Thread guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guo updated HDFS-16328:
---
Description: `dfs.disk.balancer.enabled` is enabled by default after 
HDFS-13153; we can improve the documentation to avoid confusion.

> Correct disk balancer param desc
> 
>
> Key: HDFS-16328
> URL: https://issues.apache.org/jira/browse/HDFS-16328
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.3.1
> Environment: `dfs.disk.balancer.enabled` is enabled by default after 
> HDFS-13153, we can improve the doc to avoid confusion
>Reporter: guo
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> `dfs.disk.balancer.enabled` is enabled by default after HDFS-13153; we can 
> improve the documentation to avoid confusion.






[jira] [Updated] (HDFS-16328) Correct disk balancer param desc

2021-11-17 Thread guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guo updated HDFS-16328:
---
Environment: (was: `dfs.disk.balancer.enabled` is enabled by default 
after HDFS-13153, we can improve the doc to avoid confusion)

> Correct disk balancer param desc
> 
>
> Key: HDFS-16328
> URL: https://issues.apache.org/jira/browse/HDFS-16328
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.3.1
>Reporter: guo
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> `dfs.disk.balancer.enabled` is enabled by default after HDFS-13153; we can 
> improve the documentation to avoid confusion.






[jira] [Created] (HDFS-16328) Correct disk balancer param desc

2021-11-17 Thread guo (Jira)
guo created HDFS-16328:
--

 Summary: Correct disk balancer param desc
 Key: HDFS-16328
 URL: https://issues.apache.org/jira/browse/HDFS-16328
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs
Affects Versions: 3.3.1
 Environment: `dfs.disk.balancer.enabled` is enabled by default after 
HDFS-13153, we can improve the doc to avoid confusion
Reporter: guo









[jira] [Created] (HDFS-16324) fix error log in BlockManagerSafeMode

2021-11-15 Thread guo (Jira)
guo created HDFS-16324:
--

 Summary: fix error log in BlockManagerSafeMode
 Key: HDFS-16324
 URL: https://issues.apache.org/jira/browse/HDFS-16324
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs
Affects Versions: 3.3.1
Reporter: guo


If `recheckInterval` is set to an invalid value, a warning is logged, but the 
message is not quite right; we can improve it.






[jira] [Commented] (HDFS-16318) Add exception blockinfo

2021-11-14 Thread guo (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17443341#comment-17443341
 ] 

guo commented on HDFS-16318:


Thanks [~hexiaoqiao] for your note; I have just updated it.

> Add exception blockinfo
> ---
>
> Key: HDFS-16318
> URL: https://issues.apache.org/jira/browse/HDFS-16318
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.3.1
>Reporter: guo
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> We may hit a `Could not obtain the last block location` exception, but since 
> we may be reading more than one file, the following exception cannot guide us 
> to the problematic block or datanode. We can add more information to the log 
> to help us.
> `2021-11-12 14:01:59,633 WARN [main] org.apache.hadoop.hdfs.DFSClient: Last 
> block locations not available. Datanodes might not have reported blocks 
> completely. Will retry for 3 times`
> `2021-11-12 14:02:03,724 WARN [main] org.apache.hadoop.hdfs.DFSClient: Last 
> block locations not available. Datanodes might not have reported blocks 
> completely. Will retry for 2 times`
> `2021-11-12 14:02:07,726 WARN [main] org.apache.hadoop.hdfs.DFSClient: Last 
> block locations not available. Datanodes might not have reported blocks 
> completely. Will retry for 1 times`
> `Caused by: java.lang.reflect.InvocationTargetException
>     at sun.reflect.GeneratedConstructorAccessor19.newInstance(Unknown Source)
>     at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>     at 
> org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:251)
>     ... 11 more`
> `Caused by: java.io.IOException: Could not obtain the last block locations.
>     at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:291)
>     at org.apache.hadoop.hdfs.DFSInputStream.(DFSInputStream.java:264)
>     at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1535)
>     at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304)
>     at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
>     at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>     at 
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312)
>     at org.apache.hadoop.fs.FilterFileSystem.open(FilterFileSystem.java:162)
>     at 
> org.apache.hadoop.fs.viewfs.ChRootedFileSystem.open(ChRootedFileSystem.java:261)
>     at 
> org.apache.hadoop.fs.viewfs.ViewFileSystem.open(ViewFileSystem.java:463)
>     at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
>     at org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:109)
>     at 
> org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
>     at 
> org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.(CombineHiveRecordReader.java:66)
>     ... 15 more`






[jira] [Updated] (HDFS-16318) Add exception blockinfo

2021-11-14 Thread guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guo updated HDFS-16318:
---
Description: 
We may hit a `Could not obtain the last block location` exception, but since we 
may be reading more than one file, the following exception cannot guide us to 
the problematic block or datanode. We can add more information to the log to help us.

`2021-11-12 14:01:59,633 WARN [main] org.apache.hadoop.hdfs.DFSClient: Last 
block locations not available. Datanodes might not have reported blocks 
completely. Will retry for 3 times`
`2021-11-12 14:02:03,724 WARN [main] org.apache.hadoop.hdfs.DFSClient: Last 
block locations not available. Datanodes might not have reported blocks 
completely. Will retry for 2 times`
`2021-11-12 14:02:07,726 WARN [main] org.apache.hadoop.hdfs.DFSClient: Last 
block locations not available. Datanodes might not have reported blocks 
completely. Will retry for 1 times`


`Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.GeneratedConstructorAccessor19.newInstance(Unknown Source)
    at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at 
org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:251)
    ... 11 more`
`Caused by: java.io.IOException: Could not obtain the last block locations.
    at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:291)
    at org.apache.hadoop.hdfs.DFSInputStream.(DFSInputStream.java:264)
    at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1535)
    at 
org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304)
    at 
org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
    at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at 
org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312)
    at org.apache.hadoop.fs.FilterFileSystem.open(FilterFileSystem.java:162)
    at 
org.apache.hadoop.fs.viewfs.ChRootedFileSystem.open(ChRootedFileSystem.java:261)
    at org.apache.hadoop.fs.viewfs.ViewFileSystem.open(ViewFileSystem.java:463)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
    at org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:109)
    at 
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
    at 
org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.(CombineHiveRecordReader.java:66)
    ... 15 more`
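A hedged sketch of the extra information that would help here (the method and 
variable names are illustrative, not the actual DFSClient fields):

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class LastBlockWarnSketch {
  private static final Logger LOG = LoggerFactory.getLogger(LastBlockWarnSketch.class);

  // Sketch: name the file and the last block so the problematic file/block can be
  // identified when many files are being read at once.
  static void warnRetry(String src, Object lastBlock, int retriesLeft) {
    LOG.warn("Last block locations not available for file {} (last block {})."
        + " Datanodes might not have reported blocks completely."
        + " Will retry for {} times", src, lastBlock, retriesLeft);
  }
}
{code}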

> Add exception blockinfo
> ---
>
> Key: HDFS-16318
> URL: https://issues.apache.org/jira/browse/HDFS-16318
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.3.1
>Reporter: guo
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> We may hit a `Could not obtain the last block location` exception, but since 
> we may be reading more than one file, the following exception cannot guide us 
> to the problematic block or datanode. We can add more information to the log 
> to help us.
> `2021-11-12 14:01:59,633 WARN [main] org.apache.hadoop.hdfs.DFSClient: Last 
> block locations not available. Datanodes might not have reported blocks 
> completely. Will retry for 3 times`
> `2021-11-12 14:02:03,724 WARN [main] org.apache.hadoop.hdfs.DFSClient: Last 
> block locations not available. Datanodes might not have reported blocks 
> completely. Will retry for 2 times`
> `2021-11-12 14:02:07,726 WARN [main] org.apache.hadoop.hdfs.DFSClient: Last 
> block locations not available. Datanodes might not have reported blocks 
> completely. Will retry for 1 times`
> `Caused by: java.lang.reflect.InvocationTargetException
>     at sun.reflect.GeneratedConstructorAccessor19.newInstance(Unknown Source)
>     at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>     at 
> org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:251)
>     ... 11 more`
> `Caused by: java.io.IOException: Could not obtain the last block locations.
>     at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:291)
>     at org.apache.hadoop.hdfs.DFSInputStream.(DFSInputStream.java:264)
>     at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1535)
>     at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304)
>     at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
>     at 
> 

[jira] [Created] (HDFS-16321) Fix invalid config in TestAvailableSpaceRackFaultTolerantBPP

2021-11-13 Thread guo (Jira)
guo created HDFS-16321:
--

 Summary: Fix invalid config in 
TestAvailableSpaceRackFaultTolerantBPP 
 Key: HDFS-16321
 URL: https://issues.apache.org/jira/browse/HDFS-16321
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: test
Affects Versions: 3.3.1
Reporter: guo


`TestAvailableSpaceRackFaultTolerantBPP` seems to set an invalid parameter (one 
that is only valid in `TestAvailableSpaceBlockPlacementPolicy`); we can fix it 
to avoid further trouble.






[jira] [Created] (HDFS-16318) Add exception blockinfo

2021-11-12 Thread guo (Jira)
guo created HDFS-16318:
--

 Summary: Add exception blockinfo
 Key: HDFS-16318
 URL: https://issues.apache.org/jira/browse/HDFS-16318
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs
Affects Versions: 3.3.1
Reporter: guo









[jira] [Created] (HDFS-16307) improve HdfsBlockPlacementPolicies docs

2021-11-08 Thread guo (Jira)
guo created HDFS-16307:
--

 Summary: improve HdfsBlockPlacementPolicies docs
 Key: HDFS-16307
 URL: https://issues.apache.org/jira/browse/HDFS-16307
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: documentation
Affects Versions: 3.3.1
Reporter: guo


improve HdfsBlockPlacementPolicies docs readability






[jira] [Commented] (HDFS-16277) Improve decision in AvailableSpaceBlockPlacementPolicy

2021-10-21 Thread guo (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432780#comment-17432780
 ] 

guo commented on HDFS-16277:


Thanks [~ayushtkn] for your kind review; glad to meet Hadoop here :)

> Improve decision in AvailableSpaceBlockPlacementPolicy
> --
>
> Key: HDFS-16277
> URL: https://issues.apache.org/jira/browse/HDFS-16277
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: block placement
>Affects Versions: 3.3.1
>Reporter: guo
>Assignee: guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.2
>
>  Time Spent: 5h 10m
>  Remaining Estimate: 0h
>
> Hi,
> In a production environment we may have two or more datanodes whose usage 
> reaches nearly 100%, for example 99.99%, 98%, and 97%.
> If we configure `AvailableSpaceBlockPlacementPolicy`, we still have a chance of 
> choosing the 99.99% node (assume it has the highest usage), because two chosen 
> datanodes are treated as having the same usage when their storage usage 
> differs by less than 5%.
> This is not what we want, so I suggest we improve the decision.






[jira] [Created] (HDFS-16277) improve decision in AvailableSpaceBlockPlacementPolicy

2021-10-16 Thread guo (Jira)
guo created HDFS-16277:
--

 Summary: improve decision in AvailableSpaceBlockPlacementPolicy
 Key: HDFS-16277
 URL: https://issues.apache.org/jira/browse/HDFS-16277
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: block placement
Affects Versions: 3.3.1
Reporter: guo


Hi,
In a production environment we may have two or more datanodes whose usage 
reaches nearly 100%, for example 99.99%, 98%, and 97%.
If we configure `AvailableSpaceBlockPlacementPolicy`, we still have a chance of 
choosing the 99.99% node (assume it has the highest usage), because two chosen 
datanodes are treated as having the same usage when their storage usage differs 
by less than 5%.
This is not what we want, so I suggest we improve the decision.
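A sketch of the comparison behaviour being described (illustrative only, not the 
actual policy code): with a 5% tolerance, 99.99% and 97% usage compare as equal, 
so the nearly full node can still win the random choice.

{code:java}
class UsageCompareSketch {
  // Returns 0 ("same usage") when the two usages differ by less than the tolerance.
  static int compareByUsage(double usagePercentA, double usagePercentB, double tolerancePercent) {
    if (Math.abs(usagePercentA - usagePercentB) < tolerancePercent) {
      return 0;   // e.g. 99.99 vs 97.0 with a 5% tolerance => treated as equal
    }
    return Double.compare(usagePercentA, usagePercentB);
  }

  public static void main(String[] args) {
    System.out.println(compareByUsage(99.99, 97.0, 5.0)); // prints 0
  }
}
{code}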






[jira] [Updated] (HDFS-16029) Divide by zero bug in InstrumentationService.java

2021-05-19 Thread Yiyuan GUO (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiyuan GUO updated HDFS-16029:
--
Component/s: (was: security)
 libhdfs

> Divide by zero bug in InstrumentationService.java
> -
>
> Key: HDFS-16029
> URL: https://issues.apache.org/jira/browse/HDFS-16029
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: libhdfs
>Reporter: Yiyuan GUO
>Priority: Major
>  Labels: easy-fix, security
>
> In the file _lib/service/instrumentation/InstrumentationService.java,_ the 
> method 
>  _Timer.getValues_ has the following 
> [code|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-httpfs/src/main/java/org/apache/hadoop/lib/service/instrumentation/InstrumentationService.java#L236]:
> {code:java}
> long[] getValues() {
> ..
> int limit = (full) ? size : (last + 1);
> ..
> values[AVG_TOTAL] = values[AVG_TOTAL] / limit;
> }
> {code}
> The variable _limit_ is used as a divisor. However, its value may be equal to 
> _last + 1,_ which can be zero since _last_ is initialized to -1 in the 
> [constructor|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-httpfs/src/main/java/org/apache/hadoop/lib/service/instrumentation/InstrumentationService.java#L222]:
> {code:java}
> public Timer(int size) {
> ...
> last = -1;
> }
> {code}
> Thus, a divide by zero problem can happen.
>   






[jira] [Updated] (HDFS-16029) Divide by zero bug in InstrumentationService.java

2021-05-19 Thread Yiyuan GUO (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiyuan GUO updated HDFS-16029:
--
Description: 
In the file _lib/service/instrumentation/InstrumentationService.java,_ the 
method 
 _Timer.getValues_ has the following 
[code|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-httpfs/src/main/java/org/apache/hadoop/lib/service/instrumentation/InstrumentationService.java#L236]:
{code:java}
long[] getValues() {
..
int limit = (full) ? size : (last + 1);
..
values[AVG_TOTAL] = values[AVG_TOTAL] / limit;
}
{code}
The variable _limit_ is used as a divisor. However, its value may be equal to 
_last + 1,_ which can be zero since _last_ is initialized to -1 in the 
[constructor|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-httpfs/src/main/java/org/apache/hadoop/lib/service/instrumentation/InstrumentationService.java#L222]:
{code:java}
public Timer(int size) {
...
last = -1;
}
{code}
Thus, a divide by zero problem can happen.
  

  was:
In the file _lib/service/instrumentation/InstrumentationService.java,_ the 
method 
_Timer.getValues_ has the following 
[code|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-httpfs/src/main/java/org/apache/hadoop/lib/service/instrumentation/InstrumentationService.java#L236]:
{code:java}
long[] getValues() {
..
int limit = (full) ? size : (last + 1);
..
values[AVG_TOTAL] = values[AVG_TOTAL] / limit;
}
{code}
The variable _limit_ is used as a divisor. However, its value may be equal to 
_last + 1,_ which can be zero since _last_ is __ initialized to -1 in the 
[constructor|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-httpfs/src/main/java/org/apache/hadoop/lib/service/instrumentation/InstrumentationService.java#L222]:
{code:java}
public Timer(int size) {
...
last = -1;
}
{code}
Thus, a divide by zero problem can happen
 


> Divide by zero bug in InstrumentationService.java
> -
>
> Key: HDFS-16029
> URL: https://issues.apache.org/jira/browse/HDFS-16029
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: security
>Reporter: Yiyuan GUO
>Priority: Major
>  Labels: easy-fix, security
>
> In the file _lib/service/instrumentation/InstrumentationService.java,_ the 
> method 
>  _Timer.getValues_ has the following 
> [code|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-httpfs/src/main/java/org/apache/hadoop/lib/service/instrumentation/InstrumentationService.java#L236]:
> {code:java}
> long[] getValues() {
> ..
> int limit = (full) ? size : (last + 1);
> ..
> values[AVG_TOTAL] = values[AVG_TOTAL] / limit;
> }
> {code}
> The variable _limit_ is used as a divisor. However, its value may be equal to 
> _last + 1,_ which can be zero since _last_ is initialized to -1 in the 
> [constructor|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-httpfs/src/main/java/org/apache/hadoop/lib/service/instrumentation/InstrumentationService.java#L222]:
> {code:java}
> public Timer(int size) {
> ...
> last = -1;
> }
> {code}
> Thus, a divide by zero problem can happen.
>   






[jira] [Created] (HDFS-16029) Divide by zero bug in InstrumentationService.java

2021-05-19 Thread Yiyuan GUO (Jira)
Yiyuan GUO created HDFS-16029:
-

 Summary: Divide by zero bug in InstrumentationService.java
 Key: HDFS-16029
 URL: https://issues.apache.org/jira/browse/HDFS-16029
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: security
Reporter: Yiyuan GUO


In the file _lib/service/instrumentation/InstrumentationService.java,_ the 
method 
_Timer.getValues_ has the following 
[code|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-httpfs/src/main/java/org/apache/hadoop/lib/service/instrumentation/InstrumentationService.java#L236]:
{code:java}
long[] getValues() {
..
int limit = (full) ? size : (last + 1);
..
values[AVG_TOTAL] = values[AVG_TOTAL] / limit;
}
{code}
The variable _limit_ is used as a divisor. However, its value may be equal to 
_last + 1,_ which can be zero since _last_ is initialized to -1 in the 
[constructor|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-httpfs/src/main/java/org/apache/hadoop/lib/service/instrumentation/InstrumentationService.java#L222]:
{code:java}
public Timer(int size) {
...
last = -1;
}
{code}
Thus, a divide by zero problem can happen.
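A minimal defensive fix (a sketch, not the committed patch) would be to guard 
against an empty timer before dividing:

{code:java}
// Sketch: skip the average when no sample has been recorded yet (last == -1),
// so limit can never be zero.
int limit = (full) ? size : (last + 1);
if (limit > 0) {
  values[AVG_TOTAL] = values[AVG_TOTAL] / limit;
}
{code}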
 






[jira] [Created] (HDFS-15292) Infinite loop in Lease Manager due to replica is missing in dn

2020-04-21 Thread Aaron Guo (Jira)
Aaron Guo created HDFS-15292:


 Summary: Infinite loop in Lease Manager due to replica is missing 
in dn
 Key: HDFS-15292
 URL: https://issues.apache.org/jira/browse/HDFS-15292
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 3.1.3
Reporter: Aaron Guo


In our production environment, we found that the number of files under 
construction keeps growing, and the lease manager tries to release the lease in 
an infinite loop:
{code:java}
2020-04-18 23:10:57,816 WARN  namenode.LeaseManager 
(LeaseManager.java:checkLeases(589)) - Cannot release the path 
/user/hadoop/myTestFile.txt in the lease [Lease.  Holder: 
go-hdfs-7VVGF3sGvHZcsZZC, pending creates: 1]. It will be retried.
org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: DIR* 
NameSystem.internalReleaseLease: Failed to release lease for file 
/user/hadoop/myTestFile.txt. Committed blocks are waiting to be minimally 
replicated. Try again later.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.internalReleaseLease(FSNamesystem.java:3391)
at 
org.apache.hadoop.hdfs.server.namenode.LeaseManager.checkLeases(LeaseManager.java:586)
at 
org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:524)
at java.lang.Thread.run(Thread.java:745)
{code}
This is because the last block of this file cannot meet the minimum required 
replication of 1, an AlreadyBeingCreatedException is thrown, and the release 
keeps retrying forever.

This infinite loop also causes another issue: because the lease manager always 
tries to release the first lease before moving on to the next one, no lease 
ever gets released.






[jira] [Commented] (HDFS-15240) Erasure Coding: dirty buffer causes reconstruction block error

2020-04-07 Thread guo (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17077758#comment-17077758
 ] 

guo commented on HDFS-15240:


Nice job, LGTM.

> Erasure Coding: dirty buffer causes reconstruction block error
> --
>
> Key: HDFS-15240
> URL: https://issues.apache.org/jira/browse/HDFS-15240
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding
>Affects Versions: 3.3.0
>Reporter: HuangTao
>Assignee: HuangTao
>Priority: Major
> Attachments: HDFS-15240.001.patch, HDFS-15240.002.patch, 
> HDFS-15240.003.patch, HDFS-15240.004.patch, HDFS-15240.005.patch
>
>
> When reading some lzo files we found that some blocks were broken.
> I read back all internal blocks (b0-b8) of the block group (RS-6-3-1024k) from 
> the DN directly, and chose 6 blocks (b0-b5) to decode the other 3 blocks (b6', 
> b7', b8'). Then I found the longest common sequence (LCS) between b6' (decoded) 
> and b6 (read from the DN), and likewise for b7'/b7 and b8'/b8.
> After selecting 6 blocks of the block group in each combination and iterating 
> through all cases, I found one case where the length of the LCS is the block 
> length - 64KB; 64KB is exactly the length of the ByteBuffer used by 
> StripedBlockReader. So the corrupt reconstruction block was produced by a 
> dirty buffer.
> The following log snippet (showing only 2 of the 28 cases) is my check 
> program's output. In my case, I know the 3rd block is corrupt, so 5 other 
> blocks are needed to decode another 3 blocks; I then found that the 1st 
> block's LCS length is the block length - 64KB.
> It means the (0,1,2,4,5,6)th blocks were used to reconstruct the 3rd block, 
> and the dirty buffer was used before reading the 1st block.
> Note that StripedBlockReader reads from offset 0 of the 1st block after the 
> dirty buffer was used.
> {code:java}
> decode from [0, 2, 3, 4, 5, 7] -> [1, 6, 8]
> Check Block(1) first 131072 bytes longest common substring length 4
> Check Block(6) first 131072 bytes longest common substring length 4
> Check Block(8) first 131072 bytes longest common substring length 4
> decode from [0, 2, 3, 4, 5, 6] -> [1, 7, 8]
> Check Block(1) first 131072 bytes longest common substring length 65536
> CHECK AGAIN: Block(1) all 27262976 bytes longest common substring length 
> 27197440  # this one
> Check Block(7) first 131072 bytes longest common substring length 4
> Check Block(8) first 131072 bytes longest common substring length 4{code}
> Now I know the dirty buffer causes the reconstruction block error, but where 
> does the dirty buffer come from?
> After digging into the code and the DN log, I found that the following DN log 
> is the root cause.
> {code:java}
> [INFO] [stripedRead-1017] : Interrupted while waiting for IO on channel 
> java.nio.channels.SocketChannel[connected local=/:52586 
> remote=/:50010]. 18 millis timeout left.
> [WARN] [StripedBlockReconstruction-199] : Failed to reconstruct striped 
> block: BP-714356632--1519726836856:blk_-YY_3472979393
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.util.StripedBlockUtil.getNextCompletedStripedRead(StripedBlockUtil.java:314)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.doReadMinimumSources(StripedReader.java:308)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.readMinimumSources(StripedReader.java:269)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:94)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
> at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:834) {code}
> Reading from the DN may time out (held by a future F) and output the INFO log, 
> but the futures map that contains the future F has already been cleared, 
> {code:java}
> return new StripingChunkReadResult(futures.remove(future),
> StripingChunkReadResult.CANCELLED); {code}
> so futures.remove(future) causes an NPE and the EC reconstruction fails. In the 
> finally phase, the code snippet in *getStripedReader().close()* 
> {code:java}
> reconstructor.freeBuffer(reader.getReadBuffer());
> reader.freeReadBuffer();
> reader.closeBlockReader(); {code}
> frees the buffer first, but the StripedBlockReader still holds the buffer and 
> writes to it.





[jira] [Created] (HDFS-14592) Support NIO transferTo semantics in HDFS

2019-06-20 Thread Chenzhao Guo (JIRA)
Chenzhao Guo created HDFS-14592:
---

 Summary: Support NIO transferTo semantics in HDFS
 Key: HDFS-14592
 URL: https://issues.apache.org/jira/browse/HDFS-14592
 Project: Hadoop HDFS
  Issue Type: New Feature
Reporter: Chenzhao Guo


I'm currently developing a Spark shuffle manager based on HDFS. I need to merge 
some spill files on HDFS into one, or rearrange some HDFS files.

An API similar to NIO transferTo, which bypasses user-space memory, would be 
more efficient than manually reading and writing bytes (the method I'm using at 
present).

So could HDFS implement something like NIO transferTo, making 
path.transferTo(pathDestination) possible? 
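For context, this is the local-filesystem analogue being asked for (a plain 
java.nio example, not an existing HDFS API):

{code:java}
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

class TransferToSketch {
  public static void main(String[] args) throws IOException {
    Path src = Paths.get("spill-0.data");
    Path dst = Paths.get("merged.data");
    try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
         FileChannel out = FileChannel.open(dst, StandardOpenOption.CREATE,
             StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
      long pos = 0, size = in.size();
      while (pos < size) {
        // transferTo lets the kernel move bytes without copying through user space.
        pos += in.transferTo(pos, size - pos, out);
      }
    }
  }
}
{code}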






[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time

2018-04-19 Thread Zephyr Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443630#comment-16443630
 ] 

Zephyr Guo commented on HDFS-13243:
---

Thank you for reviewing, [~daryn].

There are some mistakes in your summaries.
{quote}
thread1 is writing and closes the stream
thread2 is syncing the stream
thread1 calls commits the block with size -141232- 2054413
thread2 fsyncs with size -2054413- 141232
DNs report block with size 2054413, marked corrupt
{quote}

{quote}
This sounds like a serious client-side issue.
{quote}
Yes, as I said in the comments above.
The client calls sync() with a correct length, but the sync request could be 
sent after close().
The root cause is that DFSOutputStream#flushOrSync() is not thread-safe. See the 
following simplified code:
{code}
synchronized (this) {
  ... // code before the request is sent
}

// **The request is sent here, but it is not inside a synchronized block**
if (getStreamer().getPersistBlocks().getAndSet(false) || updateLength) {
  try {
    dfsClient.namenode.fsync(src, fileId, dfsClient.clientName, lastBlockLength);
  } catch (IOException ioe) {
    // Deal with ioe
  }
}

synchronized (this) {
  ... // code after the request is sent
}
{code}

I am not sure how to fix the client side. Could we put the RPC inside the 
synchronized block directly?
Maybe keeping the RPC outside the synchronized block gives more performance 
benefit.

{quote}
We cannot simply return success in some invalid cases: ie. fsync when the file 
has no blocks, size is negative, size is less than less synced/committed size. 
That just masks bugs.
{quote}
TestHFlush.hSyncUpdateLength_00 makes a sync call with no blocks, so I don't 
think this is a bug case.

{quote}
Also, we shouldn't need all the new factories. The tests must verify that 
namesystem calls, in various specific orders, with specific arguments that are 
good/bad, either succeed or fail. 
{quote}

I can add new test cases to verify namesystem calls in various specific orders. 
I have to mock DFSOutputStream to reproduce the corrupt-block bug, so I need 
all the new factories. If mocking DFSOutputStream is not necessary in your 
opinion, I will remove it. 







> Get CorruptBlock because of calling close and sync in same time
> ---
>
> Key: HDFS-13243
> URL: https://issues.apache.org/jira/browse/HDFS-13243
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2, 3.2.0
>Reporter: Zephyr Guo
>Assignee: Zephyr Guo
>Priority: Critical
> Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch, 
> HDFS-13243-v3.patch, HDFS-13243-v4.patch, HDFS-13243-v5.patch, 
> HDFS-13243-v6.patch
>
>
> An HDFS file might get broken because of corrupt block(s) that can be produced 
> by calling close and sync at the same time.
> When the close call is not successful, the UCBlock status changes to 
> COMMITTED, and if a sync request gets popped from the queue and processed, the 
> sync operation changes the last block length.
> After that, the DataNode reports all received blocks to the NameNode, which 
> checks the block length of all COMMITTED blocks. But the block length recorded 
> in NameNode memory already differs from the one reported by the DataNode, and 
> consequently the last block is marked as corrupted because of the inconsistent 
> length.
>  
> {panel:title=Log in my hdfs}
> 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, 
> truncateBlock=null, primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  for 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> fsync: 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
>  for DFSClient_NONMAPREDUCE_1077513762_1
> 2018-03-05 04:05:39,761 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE 

[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time

2018-04-11 Thread Zephyr Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433543#comment-16433543
 ] 

Zephyr Guo commented on HDFS-13243:
---

I agree with you, [~jojochuang]. Patch v6 does not fix the client side, and I 
have fixed the test case failure.

> Get CorruptBlock because of calling close and sync in same time
> ---
>
> Key: HDFS-13243
> URL: https://issues.apache.org/jira/browse/HDFS-13243
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2, 3.2.0
>Reporter: Zephyr Guo
>Assignee: Zephyr Guo
>Priority: Critical
> Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch, 
> HDFS-13243-v3.patch, HDFS-13243-v4.patch, HDFS-13243-v5.patch, 
> HDFS-13243-v6.patch
>
>
> An HDFS file might get broken because of corrupt block(s) that can be produced 
> by calling close and sync at the same time.
> When the close call is not successful, the UCBlock status changes to 
> COMMITTED, and if a sync request gets popped from the queue and processed, the 
> sync operation changes the last block length.
> After that, the DataNode reports all received blocks to the NameNode, which 
> checks the block length of all COMMITTED blocks. But the block length recorded 
> in NameNode memory already differs from the one reported by the DataNode, and 
> consequently the last block is marked as corrupted because of the inconsistent 
> length.
>  
> {panel:title=Log in my hdfs}
> 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, 
> truncateBlock=null, primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  for 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> fsync: 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
>  for DFSClient_NONMAPREDUCE_1077513762_1
> 2018-03-05 04:05:39,761 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in 
> file 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: 
> blockMap updated: 10.0.0.220:50010 is added to 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  size 2054413
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.219:50010 by 
> hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.218:50010 by 
> hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:40,162 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, 

[jira] [Updated] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time

2018-04-11 Thread Zephyr Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zephyr Guo updated HDFS-13243:
--
Attachment: HDFS-13243-v6.patch

> Get CorruptBlock because of calling close and sync in same time
> ---
>
> Key: HDFS-13243
> URL: https://issues.apache.org/jira/browse/HDFS-13243
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2, 3.2.0
>Reporter: Zephyr Guo
>Assignee: Zephyr Guo
>Priority: Critical
> Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch, 
> HDFS-13243-v3.patch, HDFS-13243-v4.patch, HDFS-13243-v5.patch, 
> HDFS-13243-v6.patch
>
>
> An HDFS file might get broken because of corrupt block(s) that can be produced
> by calling close and sync at the same time.
> When the close call is not successful, the UCBlock status changes to
> COMMITTED, and if a sync request is then popped from the queue and processed,
> the sync operation changes the last block length.
> After that, the DataNode reports all received blocks to the NameNode, which
> checks the block length of all COMMITTED blocks. But the block length recorded
> in NameNode memory already differs from the length reported by the DataNode,
> so the last block is marked as corrupted because of the inconsistent length.
>  
> {panel:title=Log in my hdfs}
> 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, 
> truncateBlock=null, primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  for 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> fsync: 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
>  for DFSClient_NONMAPREDUCE_1077513762_1
> 2018-03-05 04:05:39,761 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in 
> file 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: 
> blockMap updated: 10.0.0.220:50010 is added to 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  size 2054413
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.219:50010 by 
> hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.218:50010 by 
> hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:40,162 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 3 >= minimum = 2) in 
> file 
> 

[jira] [Updated] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time

2018-04-10 Thread Zephyr Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zephyr Guo updated HDFS-13243:
--
Status: Open  (was: Patch Available)

> Get CorruptBlock because of calling close and sync in same time
> ---
>
> Key: HDFS-13243
> URL: https://issues.apache.org/jira/browse/HDFS-13243
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2, 3.2.0
>Reporter: Zephyr Guo
>Assignee: Zephyr Guo
>Priority: Critical
> Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch, 
> HDFS-13243-v3.patch, HDFS-13243-v4.patch, HDFS-13243-v5.patch
>
>
> An HDFS file might get broken because of corrupt block(s) that can be produced
> by calling close and sync at the same time.
> When the close call is not successful, the UCBlock status changes to
> COMMITTED, and if a sync request is then popped from the queue and processed,
> the sync operation changes the last block length.
> After that, the DataNode reports all received blocks to the NameNode, which
> checks the block length of all COMMITTED blocks. But the block length recorded
> in NameNode memory already differs from the length reported by the DataNode,
> so the last block is marked as corrupted because of the inconsistent length.
>  
> {panel:title=Log in my hdfs}
> 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, 
> truncateBlock=null, primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  for 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> fsync: 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
>  for DFSClient_NONMAPREDUCE_1077513762_1
> 2018-03-05 04:05:39,761 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in 
> file 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: 
> blockMap updated: 10.0.0.220:50010 is added to 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  size 2054413
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.219:50010 by 
> hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.218:50010 by 
> hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:40,162 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 3 >= minimum = 2) in 
> file 
> 

[jira] [Updated] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time

2018-04-10 Thread Zephyr Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zephyr Guo updated HDFS-13243:
--
Status: Patch Available  (was: Open)

> Get CorruptBlock because of calling close and sync in same time
> ---
>
> Key: HDFS-13243
> URL: https://issues.apache.org/jira/browse/HDFS-13243
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2, 3.2.0
>Reporter: Zephyr Guo
>Assignee: Zephyr Guo
>Priority: Critical
> Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch, 
> HDFS-13243-v3.patch, HDFS-13243-v4.patch, HDFS-13243-v5.patch
>
>
> An HDFS file might get broken because of corrupt block(s) that can be produced
> by calling close and sync at the same time.
> When the close call is not successful, the UCBlock status changes to
> COMMITTED, and if a sync request is then popped from the queue and processed,
> the sync operation changes the last block length.
> After that, the DataNode reports all received blocks to the NameNode, which
> checks the block length of all COMMITTED blocks. But the block length recorded
> in NameNode memory already differs from the length reported by the DataNode,
> so the last block is marked as corrupted because of the inconsistent length.
>  
> {panel:title=Log in my hdfs}
> 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, 
> truncateBlock=null, primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  for 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> fsync: 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
>  for DFSClient_NONMAPREDUCE_1077513762_1
> 2018-03-05 04:05:39,761 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in 
> file 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: 
> blockMap updated: 10.0.0.220:50010 is added to 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  size 2054413
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.219:50010 by 
> hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.218:50010 by 
> hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:40,162 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 3 >= minimum = 2) in 
> file 
> 

[jira] [Updated] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time

2018-04-10 Thread Zephyr Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zephyr Guo updated HDFS-13243:
--
Attachment: (was: HDFS-13243-v5.patch)

> Get CorruptBlock because of calling close and sync in same time
> ---
>
> Key: HDFS-13243
> URL: https://issues.apache.org/jira/browse/HDFS-13243
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2, 3.2.0
>Reporter: Zephyr Guo
>Assignee: Zephyr Guo
>Priority: Critical
> Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch, 
> HDFS-13243-v3.patch, HDFS-13243-v4.patch, HDFS-13243-v5.patch
>
>
> An HDFS file might get broken because of corrupt block(s) that can be produced
> by calling close and sync at the same time.
> When the close call is not successful, the UCBlock status changes to
> COMMITTED, and if a sync request is then popped from the queue and processed,
> the sync operation changes the last block length.
> After that, the DataNode reports all received blocks to the NameNode, which
> checks the block length of all COMMITTED blocks. But the block length recorded
> in NameNode memory already differs from the length reported by the DataNode,
> so the last block is marked as corrupted because of the inconsistent length.
>  
> {panel:title=Log in my hdfs}
> 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, 
> truncateBlock=null, primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  for 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> fsync: 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
>  for DFSClient_NONMAPREDUCE_1077513762_1
> 2018-03-05 04:05:39,761 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in 
> file 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: 
> blockMap updated: 10.0.0.220:50010 is added to 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  size 2054413
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.219:50010 by 
> hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.218:50010 by 
> hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:40,162 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 3 >= minimum = 2) in 
> file 
> 

[jira] [Updated] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time

2018-04-10 Thread Zephyr Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zephyr Guo updated HDFS-13243:
--
Attachment: HDFS-13243-v5.patch

> Get CorruptBlock because of calling close and sync in same time
> ---
>
> Key: HDFS-13243
> URL: https://issues.apache.org/jira/browse/HDFS-13243
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2, 3.2.0
>Reporter: Zephyr Guo
>Assignee: Zephyr Guo
>Priority: Critical
> Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch, 
> HDFS-13243-v3.patch, HDFS-13243-v4.patch, HDFS-13243-v5.patch, 
> HDFS-13243-v5.patch
>
>
> An HDFS file might get broken because of corrupt block(s) that can be produced
> by calling close and sync at the same time.
> When the close call is not successful, the UCBlock status changes to
> COMMITTED, and if a sync request is then popped from the queue and processed,
> the sync operation changes the last block length.
> After that, the DataNode reports all received blocks to the NameNode, which
> checks the block length of all COMMITTED blocks. But the block length recorded
> in NameNode memory already differs from the length reported by the DataNode,
> so the last block is marked as corrupted because of the inconsistent length.
>  
> {panel:title=Log in my hdfs}
> 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, 
> truncateBlock=null, primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  for 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> fsync: 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
>  for DFSClient_NONMAPREDUCE_1077513762_1
> 2018-03-05 04:05:39,761 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in 
> file 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: 
> blockMap updated: 10.0.0.220:50010 is added to 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  size 2054413
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.219:50010 by 
> hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.218:50010 by 
> hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:40,162 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 3 >= minimum = 2) in 
> file 
> 

[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time

2018-04-09 Thread Zephyr Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431670#comment-16431670
 ] 

Zephyr Guo commented on HDFS-13243:
---

I have rebased. [~jojochuang]

> Get CorruptBlock because of calling close and sync in same time
> ---
>
> Key: HDFS-13243
> URL: https://issues.apache.org/jira/browse/HDFS-13243
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2, 3.2.0
>Reporter: Zephyr Guo
>Assignee: Zephyr Guo
>Priority: Critical
> Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch, 
> HDFS-13243-v3.patch, HDFS-13243-v4.patch, HDFS-13243-v5.patch
>
>
> An HDFS file might get broken because of corrupt block(s) that can be produced
> by calling close and sync at the same time.
> When the close call is not successful, the UCBlock status changes to
> COMMITTED, and if a sync request is then popped from the queue and processed,
> the sync operation changes the last block length.
> After that, the DataNode reports all received blocks to the NameNode, which
> checks the block length of all COMMITTED blocks. But the block length recorded
> in NameNode memory already differs from the length reported by the DataNode,
> so the last block is marked as corrupted because of the inconsistent length.
>  
> {panel:title=Log in my hdfs}
> 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, 
> truncateBlock=null, primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  for 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> fsync: 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
>  for DFSClient_NONMAPREDUCE_1077513762_1
> 2018-03-05 04:05:39,761 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in 
> file 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: 
> blockMap updated: 10.0.0.220:50010 is added to 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  size 2054413
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.219:50010 by 
> hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.218:50010 by 
> hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:40,162 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 3 >= minimum = 2) in 
> file 
> 

[jira] [Updated] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time

2018-04-09 Thread Zephyr Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zephyr Guo updated HDFS-13243:
--
Attachment: HDFS-13243-v5.patch

> Get CorruptBlock because of calling close and sync in same time
> ---
>
> Key: HDFS-13243
> URL: https://issues.apache.org/jira/browse/HDFS-13243
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2, 3.2.0
>Reporter: Zephyr Guo
>Assignee: Zephyr Guo
>Priority: Critical
> Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch, 
> HDFS-13243-v3.patch, HDFS-13243-v4.patch, HDFS-13243-v5.patch
>
>
> An HDFS file might get broken because of corrupt block(s) that can be produced
> by calling close and sync at the same time.
> When the close call is not successful, the UCBlock status changes to
> COMMITTED, and if a sync request is then popped from the queue and processed,
> the sync operation changes the last block length.
> After that, the DataNode reports all received blocks to the NameNode, which
> checks the block length of all COMMITTED blocks. But the block length recorded
> in NameNode memory already differs from the length reported by the DataNode,
> so the last block is marked as corrupted because of the inconsistent length.
>  
> {panel:title=Log in my hdfs}
> 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, 
> truncateBlock=null, primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  for 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> fsync: 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
>  for DFSClient_NONMAPREDUCE_1077513762_1
> 2018-03-05 04:05:39,761 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in 
> file 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: 
> blockMap updated: 10.0.0.220:50010 is added to 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  size 2054413
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.219:50010 by 
> hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.218:50010 by 
> hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:40,162 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 3 >= minimum = 2) in 
> file 
> 

[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time

2018-03-28 Thread Zephyr Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417198#comment-16417198
 ] 

Zephyr Guo commented on HDFS-13243:
---

Rebased patch v3 and attached v4.

> Get CorruptBlock because of calling close and sync in same time
> ---
>
> Key: HDFS-13243
> URL: https://issues.apache.org/jira/browse/HDFS-13243
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2, 3.2.0
>Reporter: Zephyr Guo
>Assignee: Zephyr Guo
>Priority: Critical
> Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch, 
> HDFS-13243-v3.patch, HDFS-13243-v4.patch
>
>
> An HDFS file might get broken because of corrupt block(s) that can be produced
> by calling close and sync at the same time.
> When the close call is not successful, the UCBlock status changes to
> COMMITTED, and if a sync request is then popped from the queue and processed,
> the sync operation changes the last block length.
> After that, the DataNode reports all received blocks to the NameNode, which
> checks the block length of all COMMITTED blocks. But the block length recorded
> in NameNode memory already differs from the length reported by the DataNode,
> so the last block is marked as corrupted because of the inconsistent length.
>  
> {panel:title=Log in my hdfs}
> 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, 
> truncateBlock=null, primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  for 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> fsync: 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
>  for DFSClient_NONMAPREDUCE_1077513762_1
> 2018-03-05 04:05:39,761 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in 
> file 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: 
> blockMap updated: 10.0.0.220:50010 is added to 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  size 2054413
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.219:50010 by 
> hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.218:50010 by 
> hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:40,162 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 3 >= minimum = 2) in 
> file 
> 

[jira] [Updated] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time

2018-03-28 Thread Zephyr Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zephyr Guo updated HDFS-13243:
--
Attachment: HDFS-13243-v4.patch

> Get CorruptBlock because of calling close and sync in same time
> ---
>
> Key: HDFS-13243
> URL: https://issues.apache.org/jira/browse/HDFS-13243
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2, 3.2.0
>Reporter: Zephyr Guo
>Assignee: Zephyr Guo
>Priority: Critical
> Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch, 
> HDFS-13243-v3.patch, HDFS-13243-v4.patch
>
>
> An HDFS file might get broken because of corrupt block(s) that can be produced
> by calling close and sync at the same time.
> When the close call is not successful, the UCBlock status changes to
> COMMITTED, and if a sync request is then popped from the queue and processed,
> the sync operation changes the last block length.
> After that, the DataNode reports all received blocks to the NameNode, which
> checks the block length of all COMMITTED blocks. But the block length recorded
> in NameNode memory already differs from the length reported by the DataNode,
> so the last block is marked as corrupted because of the inconsistent length.
>  
> {panel:title=Log in my hdfs}
> 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, 
> truncateBlock=null, primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  for 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> fsync: 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
>  for DFSClient_NONMAPREDUCE_1077513762_1
> 2018-03-05 04:05:39,761 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in 
> file 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: 
> blockMap updated: 10.0.0.220:50010 is added to 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  size 2054413
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.219:50010 by 
> hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.218:50010 by 
> hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:40,162 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 3 >= minimum = 2) in 
> file 
> 

[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time

2018-03-28 Thread Zephyr Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417167#comment-16417167
 ] 

Zephyr Guo commented on HDFS-13243:
---

Hi, [~jojochuang]
I attached patch v3. I moved the RPC call into the synchronized code block, and 
tried my best to keep the mock code clear.
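
The idea, per the comment above, is that the fsync RPC has to be issued while 
still holding the same lock that close() takes, so a commit can no longer slip 
in between the local length update and the NameNode being told about it. A 
self-contained sketch of just that locking pattern (names and structure are 
illustrative only, not the actual DFSOutputStream code; the block lengths are 
taken from the log in the report):

{code:java}
public class SyncUnderLockSketch {
  private final Object streamLock = new Object();
  private long lastBlockLength = 0;
  private boolean committed = false;

  // Models close(): once the last block is committed, its length is fixed.
  void commitBlock() {
    synchronized (streamLock) {
      committed = true;
    }
  }

  // Models hflush/hsync with the fix applied: both the local length update
  // and the "RPC" happen under the same lock, so a racing commitBlock()
  // can only run strictly before or strictly after the whole sync.
  void syncBlockLength(long newLength) {
    synchronized (streamLock) {
      if (committed) {
        return; // block already committed by close(); nothing to report
      }
      lastBlockLength = newLength;
      sendFsyncRpc(lastBlockLength); // previously issued after releasing the lock
    }
  }

  private void sendFsyncRpc(long length) {
    // stands in for dfsClient.namenode.fsync(...) in the real client
    System.out.println("fsync RPC, lastBlockLength=" + length);
  }

  public static void main(String[] args) {
    SyncUnderLockSketch s = new SyncUnderLockSketch();
    s.syncBlockLength(141232);   // length the NameNode ends up with
    s.commitBlock();             // close() commits the block
    s.syncBlockLength(2054413);  // late sync is now a no-op instead of a mismatch
  }
}
{code}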

> Get CorruptBlock because of calling close and sync in same time
> ---
>
> Key: HDFS-13243
> URL: https://issues.apache.org/jira/browse/HDFS-13243
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2, 3.2.0
>Reporter: Zephyr Guo
>Assignee: Zephyr Guo
>Priority: Critical
> Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch, 
> HDFS-13243-v3.patch
>
>
> An HDFS file might get broken because of corrupt block(s) that can be produced
> by calling close and sync at the same time.
> When the close call is not successful, the UCBlock status changes to
> COMMITTED, and if a sync request is then popped from the queue and processed,
> the sync operation changes the last block length.
> After that, the DataNode reports all received blocks to the NameNode, which
> checks the block length of all COMMITTED blocks. But the block length recorded
> in NameNode memory already differs from the length reported by the DataNode,
> so the last block is marked as corrupted because of the inconsistent length.
>  
> {panel:title=Log in my hdfs}
> 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, 
> truncateBlock=null, primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  for 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> fsync: 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
>  for DFSClient_NONMAPREDUCE_1077513762_1
> 2018-03-05 04:05:39,761 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in 
> file 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: 
> blockMap updated: 10.0.0.220:50010 is added to 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  size 2054413
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.219:50010 by 
> hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.218:50010 by 
> hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:40,162 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 3 >= minimum = 2) in 
> file 
> 

[jira] [Updated] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time

2018-03-28 Thread Zephyr Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zephyr Guo updated HDFS-13243:
--
Attachment: HDFS-13243-v3.patch

> Get CorruptBlock because of calling close and sync in same time
> ---
>
> Key: HDFS-13243
> URL: https://issues.apache.org/jira/browse/HDFS-13243
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2, 3.2.0
>Reporter: Zephyr Guo
>Assignee: Zephyr Guo
>Priority: Critical
> Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch, 
> HDFS-13243-v3.patch
>
>
> An HDFS file might get broken because of corrupt block(s) that can be produced
> by calling close and sync at the same time.
> When the close call is not successful, the UCBlock status changes to
> COMMITTED, and if a sync request is then popped from the queue and processed,
> the sync operation changes the last block length.
> After that, the DataNode reports all received blocks to the NameNode, which
> checks the block length of all COMMITTED blocks. But the block length recorded
> in NameNode memory already differs from the length reported by the DataNode,
> so the last block is marked as corrupted because of the inconsistent length.
>  
> {panel:title=Log in my hdfs}
> 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, 
> truncateBlock=null, primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  for 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> fsync: 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
>  for DFSClient_NONMAPREDUCE_1077513762_1
> 2018-03-05 04:05:39,761 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in 
> file 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: 
> blockMap updated: 10.0.0.220:50010 is added to 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  size 2054413
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.219:50010 by 
> hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.218:50010 by 
> hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:40,162 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 3 >= minimum = 2) in 
> file 
> 

[jira] [Comment Edited] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time

2018-03-09 Thread Zephyr Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392889#comment-16392889
 ] 

Zephyr Guo edited comment on HDFS-13243 at 3/9/18 2:04 PM:
---

[~jojochuang]
{quote}
I suspect this race condition happens because of this unusual setting.(or makes 
it more prone to this bug)
{quote}
The minimal replication is set to 1 in my test case. I agree that this unusual 
setting makes it more prone to this bug.

{quote}
If the problem is client side race condition, I would recommend fixing it at 
client side.
{quote}
We have to fix server-side as well. We have no power to let all users update 
their client code, right?
I will write a new patch in several days, thanks for your advice. 

BTW, why wasn't dfsClient.namenode.fsync() included in the synchronized code 
block originally? Was that for a performance benefit? If so, is it still 
necessary to fix the client?
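
For reference, the "unusual setting" discussed here is the NameNode minimum 
replication; a test can lower it roughly like this (a sketch only; the class is 
made up, the config key comes from DFSConfigKeys):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSConfigKeys;
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class MinReplicationTestConf {
  public static Configuration create() {
    Configuration conf = new HdfsConfiguration();
    // dfs.namenode.replication.min = 1, matching the test setup mentioned
    // above; per the discussion this setting only makes the race easier to
    // hit, it is not the root cause.
    conf.setInt(DFSConfigKeys.DFS_NAMENODE_REPLICATION_MIN_KEY, 1);
    return conf;
  }
}
{code}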




was (Author: gzh1992n):
[~jojochuang]
{quote}
I suspect this race condition happens because of this unusual setting.(or makes 
it more prone to this bug)
{quote}
The minimal replication is set to 1 in my test case. I agree that this unusual 
setting makes it more prone to this bug.

{qupte}
If the problem is client side race condition, I would recommend fixing it at 
client side.
{quote}
We have to fix server-side as well. We have no power to let all users update 
their client code, right?
I will write a new patch in several days, thanks for your advice. 

BTW, why don't we include dfsClient.namenode.fsync() into synchronized code 
block formerly?Is this for performance benefit?If that, Is it necessary to fix 
the client?



> Get CorruptBlock because of calling close and sync in same time
> ---
>
> Key: HDFS-13243
> URL: https://issues.apache.org/jira/browse/HDFS-13243
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2, 3.2.0
>Reporter: Zephyr Guo
>Assignee: Zephyr Guo
>Priority: Critical
> Fix For: 3.2.0
>
> Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch
>
>
> HDFS File might get broken because of corrupt block(s) that could be produced 
> by calling close and sync in the same time.
> When calling close was not successful, UCBlock status would change to 
> COMMITTED, and if a sync request gets popped from queue and processed, sync 
> operation would change the last block length.
> After that, DataNode would report all received block to NameNode, and will 
> check Block length of all COMMITTED Blocks. But the block length was already 
> different between recorded in NameNode memory and reported by DataNode, and 
> consequently, the last block is marked as corruptted because of inconsistent 
> length.
>  
> {panel:title=Log in my hdfs}
> 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, 
> truncateBlock=null, primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  for 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> fsync: 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
>  for DFSClient_NONMAPREDUCE_1077513762_1
> 2018-03-05 04:05:39,761 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in 
> file 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: 
> blockMap updated: 10.0.0.220:50010 is added to 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> 

[jira] [Comment Edited] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time

2018-03-09 Thread Zephyr Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392889#comment-16392889
 ] 

Zephyr Guo edited comment on HDFS-13243 at 3/9/18 1:55 PM:
---

[~jojochuang]
{quote}
I suspect this race condition happens because of this unusual setting.(or makes 
it more prone to this bug)
{quote}
The minimal replication is set to 1 in my test case. I agree that this unusual 
setting makes it more prone to this bug.

{quote}
If the problem is client side race condition, I would recommend fixing it at 
client side.
{quote}
We have to fix server-side as well. We have no power to let all users update 
their client code, right?
I will write a new patch in several days, thanks for your advice. 

BTW, why wasn't dfsClient.namenode.fsync() included in the synchronized code 
block in the first place? Was that for a performance benefit? If so, is it 
necessary to fix the client?




was (Author: gzh1992n):
[~jojochuang]
{quote}
I suspect this race condition happens because of this unusual setting.(or makes 
it more prone to this bug)
{quote}
The minimal replication is set to 1 in my test case. I agree that this unusual 
setting makes it more prone to this bug.

{quote}
If the problem is client side race condition, I would recommend fixing it at 
client side.
{quote}
We have to fix server-side as well. You have no power to let all user update 
their client code, right?
I will write a new patch in several days, thanks for your advice. 

BTW, why don't we include dfsClient.namenode.fsync() into synchronized code 
block formerly?Is this for performance benefit?If that, Is it necessary to fix 
the client?



> Get CorruptBlock because of calling close and sync in same time
> ---
>
> Key: HDFS-13243
> URL: https://issues.apache.org/jira/browse/HDFS-13243
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2, 3.2.0
>Reporter: Zephyr Guo
>Assignee: Zephyr Guo
>Priority: Critical
> Fix For: 3.2.0
>
> Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch
>
>
> HDFS File might get broken because of corrupt block(s) that could be produced 
> by calling close and sync in the same time.
> When calling close was not successful, UCBlock status would change to 
> COMMITTED, and if a sync request gets popped from queue and processed, sync 
> operation would change the last block length.
> After that, DataNode would report all received block to NameNode, and will 
> check Block length of all COMMITTED Blocks. But the block length was already 
> different between recorded in NameNode memory and reported by DataNode, and 
> consequently, the last block is marked as corruptted because of inconsistent 
> length.
>  
> {panel:title=Log in my hdfs}
> 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, 
> truncateBlock=null, primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  for 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> fsync: 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
>  for DFSClient_NONMAPREDUCE_1077513762_1
> 2018-03-05 04:05:39,761 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in 
> file 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: 
> blockMap updated: 10.0.0.220:50010 is added to 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> 

[jira] [Comment Edited] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time

2018-03-09 Thread Zephyr Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392889#comment-16392889
 ] 

Zephyr Guo edited comment on HDFS-13243 at 3/9/18 1:42 PM:
---

[~jojochuang]
{quote}
I suspect this race condition happens because of this unusual setting.(or makes 
it more prone to this bug)
{quote}
The minimal replication is set to 1 in my test case. I agree that this unusual 
setting makes it more prone to this bug.

{quote}
If the problem is client side race condition, I would recommend fixing it at 
client side.
{quote}
We have to fix server-side as well. You have no power to let all users update 
their client code, right?
I will write a new patch in several days, thanks for your advice. 

BTW, why wasn't dfsClient.namenode.fsync() included in the synchronized code 
block in the first place? Was that for a performance benefit? If so, is it 
necessary to fix the client?




was (Author: gzh1992n):
[~jojochuang]
{quote}
I suspect this race condition happens because of this unusual setting.(or makes 
it more prone to this bug)
{quote}
The minimal replication is 1 in my test case. I agree that this unusual setting 
makes it more prone to this bug.

{quote}
If the problem is client side race condition, I would recommend fixing it at 
client side.
{quote}
We have to fix server-side as well. You have no power to let all user update 
their client code, right?
I will write a new patch in several days, thanks for your advice. 

BTW, why don't we include dfsClient.namenode.fsync() into synchronized code 
block formerly?Is this for performance benefit?If that, Is it necessary to fix 
the client?



> Get CorruptBlock because of calling close and sync in same time
> ---
>
> Key: HDFS-13243
> URL: https://issues.apache.org/jira/browse/HDFS-13243
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2, 3.2.0
>Reporter: Zephyr Guo
>Assignee: Zephyr Guo
>Priority: Critical
> Fix For: 3.2.0
>
> Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch
>
>
> HDFS File might get broken because of corrupt block(s) that could be produced 
> by calling close and sync in the same time.
> When calling close was not successful, UCBlock status would change to 
> COMMITTED, and if a sync request gets popped from queue and processed, sync 
> operation would change the last block length.
> After that, DataNode would report all received block to NameNode, and will 
> check Block length of all COMMITTED Blocks. But the block length was already 
> different between recorded in NameNode memory and reported by DataNode, and 
> consequently, the last block is marked as corruptted because of inconsistent 
> length.
>  
> {panel:title=Log in my hdfs}
> 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, 
> truncateBlock=null, primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  for 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> fsync: 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
>  for DFSClient_NONMAPREDUCE_1077513762_1
> 2018-03-05 04:05:39,761 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in 
> file 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: 
> blockMap updated: 10.0.0.220:50010 is added to 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> 

[jira] [Comment Edited] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time

2018-03-09 Thread Zephyr Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392889#comment-16392889
 ] 

Zephyr Guo edited comment on HDFS-13243 at 3/9/18 1:41 PM:
---

[~jojochuang]
{quote}
I suspect this race condition happens because of this unusual setting.(or makes 
it more prone to this bug)
{quote}
The minimal replication is 1 in my test case. I agree that this unusual setting 
makes it more prone to this bug.

{quote}
If the problem is client side race condition, I would recommend fixing it at 
client side.
{quote}
We have to fix server-side as well. You have no power to let all users update 
their client code, right?
I will write a new patch in several days, thanks for your advice. 

BTW, why wasn't dfsClient.namenode.fsync() included in the synchronized code 
block in the first place? Was that for a performance benefit? If so, is it 
necessary to fix the client?




was (Author: gzh1992n):
[~jojochuang]
{quote}
I suspect this race condition happens because of this unusual setting.(or makes 
it more prone to this bug)
{quote}
The minimal replication is 1 in my test case. I agree that this unusual setting 
makes it more prone to this bug.

{quote}
If the problem is client side race condition, I would recommend fixing it at 
client side.
{quote}
We have to fix server-side as well. You have no power to let all user update 
their client code, right?
I will write a new patch in serval days, thanks for your advice. 

BTW, why doesn't we include dfsClient.namenode.fsync() into synchronized code 
block?Is this for performance benefit?If that, Is it necessary to fix the 
client?



> Get CorruptBlock because of calling close and sync in same time
> ---
>
> Key: HDFS-13243
> URL: https://issues.apache.org/jira/browse/HDFS-13243
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2, 3.2.0
>Reporter: Zephyr Guo
>Assignee: Zephyr Guo
>Priority: Critical
> Fix For: 3.2.0
>
> Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch
>
>
> HDFS File might get broken because of corrupt block(s) that could be produced 
> by calling close and sync in the same time.
> When calling close was not successful, UCBlock status would change to 
> COMMITTED, and if a sync request gets popped from queue and processed, sync 
> operation would change the last block length.
> After that, DataNode would report all received block to NameNode, and will 
> check Block length of all COMMITTED Blocks. But the block length was already 
> different between recorded in NameNode memory and reported by DataNode, and 
> consequently, the last block is marked as corruptted because of inconsistent 
> length.
>  
> {panel:title=Log in my hdfs}
> 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, 
> truncateBlock=null, primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  for 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> fsync: 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
>  for DFSClient_NONMAPREDUCE_1077513762_1
> 2018-03-05 04:05:39,761 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in 
> file 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: 
> blockMap updated: 10.0.0.220:50010 is added to 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> 

[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time

2018-03-09 Thread Zephyr Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392889#comment-16392889
 ] 

Zephyr Guo commented on HDFS-13243:
---

[~jojochuang]
{quote}
I suspect this race condition happens because of this unusual setting.(or makes 
it more prone to this bug)
{quote}
The minimal replication is 1 in my test case. I agree that this unusual setting 
makes it more prone to this bug.

{quote}
If the problem is client side race condition, I would recommend fixing it at 
client side.
{quote}
We have to fix server-side as well. You have no power to let all users update 
their client code, right?
I will write a new patch in several days, thanks for your advice. 

BTW, why don't we include dfsClient.namenode.fsync() in the synchronized code 
block? Is this for a performance benefit? If so, is it necessary to fix the 
client?



> Get CorruptBlock because of calling close and sync in same time
> ---
>
> Key: HDFS-13243
> URL: https://issues.apache.org/jira/browse/HDFS-13243
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2, 3.2.0
>Reporter: Zephyr Guo
>Assignee: Zephyr Guo
>Priority: Critical
> Fix For: 3.2.0
>
> Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch
>
>
> HDFS File might get broken because of corrupt block(s) that could be produced 
> by calling close and sync in the same time.
> When calling close was not successful, UCBlock status would change to 
> COMMITTED, and if a sync request gets popped from queue and processed, sync 
> operation would change the last block length.
> After that, DataNode would report all received block to NameNode, and will 
> check Block length of all COMMITTED Blocks. But the block length was already 
> different between recorded in NameNode memory and reported by DataNode, and 
> consequently, the last block is marked as corruptted because of inconsistent 
> length.
>  
> {panel:title=Log in my hdfs}
> 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, 
> truncateBlock=null, primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  for 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> fsync: 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
>  for DFSClient_NONMAPREDUCE_1077513762_1
> 2018-03-05 04:05:39,761 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in 
> file 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: 
> blockMap updated: 10.0.0.220:50010 is added to 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  size 2054413
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.219:50010 by 
> hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.218:50010 by 
> hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 because block is 
> COMMITTED and reported length 2054413 does not match length in 

[jira] [Comment Edited] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time

2018-03-08 Thread Zephyr Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392309#comment-16392309
 ] 

Zephyr Guo edited comment on HDFS-13243 at 3/9/18 2:55 AM:
---

{{[~jojochuang], Thanks for reviewing.}}

{quote}
1.It seems to me the root of problem is that client would call fsync() with an 
incorrect length (shorter than what it is supposed to sync). If that's the case 
you should fix the client (DFSOutputStream), rather than the NameNode.
{quote}

The client calls sync() with a *correct* length, but the sync request could be 
sent after close().
The root cause is that DFSOutputStream#flushOrSync() is not thread-safe. See the 
following simplified code:

 
{code:java}
synchronized (this) {
  ... // code before send request
}

// **We send request here, but it's not included in synchronized code block**
 if (getStreamer().getPersistBlocks().getAndSet(false) || updateLength) {
   try {

   dfsClient.namenode.fsync(src, fileId, dfsClient.clientName, 
lastBlockLength); 

   } catch (IOException ioe) {
   // Deal with ioe
   }
}

synchronized (this) {
  ... // code after send request
}
{code}
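
To make the interleaving concrete, here is a small self-contained toy (invented 
class and field names, not HDFS code) with the same shape as the snippet above: 
the length is read under the lock, the RPC carrying it is sent after the lock is 
released, so a concurrent close() can flush the remaining data and commit first, 
and the stale length wins on the server.

{code:java}
// Toy illustration only. sync() snapshots the length inside the lock but
// "sends" it outside the lock; close() flushes the rest of the data and
// commits in between, so the stale fsync overwrites the committed length.
public class FlushCloseRace {
  private final Object lock = new Object();
  private long bytesWritten = 141232;             // data visible at sync time
  private volatile long namenodeBlockLength = -1; // what the "NameNode" believes

  // Analogous to flushOrSync(): snapshot under the lock, send outside it.
  void sync() throws InterruptedException {
    long lengthToSend;
    synchronized (lock) {
      lengthToSend = bytesWritten;               // snapshot: 141232
    }
    Thread.sleep(50);                            // window in which close() can run
    namenodeBlockLength = lengthToSend;          // stale fsync "RPC" arrives last
  }

  // Analogous to close(): flushes the remaining data and commits the block.
  void close() {
    synchronized (lock) {
      bytesWritten = 2054413;                    // remaining buffered data flushed
    }
    System.out.println("close(): block committed with length " + bytesWritten);
  }

  public static void main(String[] args) throws Exception {
    FlushCloseRace stream = new FlushCloseRace();
    Thread syncer = new Thread(() -> {
      try {
        stream.sync();
      } catch (InterruptedException ignored) {
      }
    });
    syncer.start();
    Thread.sleep(10);   // let sync() leave the synchronized block first
    stream.close();
    syncer.join();
    // The DataNode will later report 2054413 bytes, but the stale fsync left
    // the block map at 141232, so the replica looks corrupt.
    System.out.println("NameNode block map length: " + stream.namenodeBlockLength
        + ", actual replica length: " + stream.bytesWritten);
  }
}
{code}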


{quote}
2. Looking at the log, your minimal replication number is 2, rather than 1. 
That's very unusual. In my past experience a lot of weird behavior like this 
could arise when you have that kind of configuration.
{quote}
I'm not sure whether data reliability would be affected if minimal replication 
is set to 1. Do you have any experience with this?

{quote}
3.And why is close() in the picture? IMHO you don't even need to close(). 
Suppose you block DataNode heartbeat, and let client keep the file open and 
then call sync(), the last block's state remains in COMMITTED. Would that cause 
the same behavior?
{quote}
close() must be called after sync (see the root cause above). If you don't call 
close(), the last block's state can't change to COMMITTED, right?

{quote}
4.Looking at the patch, I would like to ask you to stay away from using 
reflection. You could refactor FSNamesystem and DFSOutputStream to return a new 
FSNamesystem/DFSOutputStream object and override them in the test code. That 
way, you don't need to introduce new configurations too. And it'll be much 
cleaner.
{quote}
I don't understand this. There is no API that sets the impl of FSNamesystem in 
MiniCluster. Could you give me a sample in another test case? I will rewrite 
this patch.

{quote}
5.I don't understand the following code.
{quote}
My fixed code is not the final version, because I don't know whether the current 
DFSOutputStream#flushOrSync() impl is OK. We send the sync RPC without the lock 
in the client, maybe for a performance benefit? If it is for a benefit, we just 
fix the server side. If not, we need to fix both the server side and the client.
On the server side, we could log a warning for a wrong length and throw an 
exception for an invalid state. Is this better than the current version?
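
As a rough, hypothetical illustration of that server-side idea (invented names, 
not actual FSNamesystem code), the check could look something like this:

{code:java}
// Hypothetical sketch only: it merely illustrates "log a warning for a wrong
// length and throw for an invalid state" when a sync arrives for a block that
// close() has already committed. None of these names are real HDFS APIs.
public class CommittedLengthGuard {
  enum UcState { UNDER_CONSTRUCTION, COMMITTED, COMPLETE }

  static void checkSyncLength(UcState state, long committedLength, long syncLength) {
    if (state == UcState.COMMITTED && syncLength != committedLength) {
      System.err.println("WARN: fsync carries length " + syncLength
          + " but the block was committed with length " + committedLength);
      throw new IllegalStateException("fsync arrived after close with a stale length");
    }
  }

  public static void main(String[] args) {
    // Still under construction: the sync may legitimately update the length.
    checkSyncLength(UcState.UNDER_CONSTRUCTION, 141232, 2054413);
    // Already committed by close(): reject the stale sync instead of letting
    // the length mismatch mark the replicas as corrupt later.
    try {
      checkSyncLength(UcState.COMMITTED, 2054413, 141232);
    } catch (IllegalStateException expected) {
      System.out.println("rejected: " + expected.getMessage());
    }
  }
}
{code}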

 

 


was (Author: gzh1992n):
{{[~jojochuang], Thanks for reviewing.}}

{quote}
1.It seems to me the root of problem is that client would call fsync() with an 
incorrect length (shorter than what it is supposed to sync). If that's the case 
you should fix the client (DFSOutputStream), rather than the NameNode.
{quote}

Client call sync() with a *corrent* length, but sync request could be sent 
after close().
The root cause is that DFSOutputStream#flushOrSync() is not thread-safe. See 
following simplified code:

 
{code:java}
synchronized (this) {}
 ... // code before send request
}

// **We send request here, but it's not included in synchronized code block**
 if (getStreamer().getPersistBlocks().getAndSet(false) || updateLength) {
 try {

 dfsClient.namenode.fsync(src, fileId, dfsClient.clientName, lastBlockLength); 

 } catch (IOException ioe) {
 // Deal with ioe
 }

synchronized (this) {
  ... // code after send request
}
{code}


{quote}
2. Looking at the log, your minimal replication number is 2, rather than 1. 
That's very unusual. In my past experience a lot of weird behavior like this 
could arise when you have that kind of configuration.
{quote}
I'm not sure that data reliability would be affected  if minimal replication 
set to 1. Do you have some experience about this

{quote}
3.And why is close() in the picture? IMHO you don't even need to close(). 
Suppose you block DataNode heartbeat, and let client keep the file open and 
then call sync(), the last block's state remains in COMMITTED. Would that cause 
the same behavior?
{quote}
close() must be called after sync(see above root cause). If you don't call 
close(), the last block's state can't change to COMMITTED, right? 

{quote}
4.Looking at the patch, I would like to ask you to stay away from using 
reflection. You could refactor FSNamesystem and DFSOutputStream to return a new 
FSNamesystem/DFSOutputStream object and override them in the test code. That 
way, you don't need to introduce new configurations too. And it'll be much 
cleaner.
{quote}
I don't 

[jira] [Comment Edited] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time

2018-03-08 Thread Zephyr Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392309#comment-16392309
 ] 

Zephyr Guo edited comment on HDFS-13243 at 3/9/18 2:54 AM:
---

{{[~jojochuang], Thanks for reviewing.}}

{quote}
1.It seems to me the root of problem is that client would call fsync() with an 
incorrect length (shorter than what it is supposed to sync). If that's the case 
you should fix the client (DFSOutputStream), rather than the NameNode.
{quote}

The client calls sync() with a *correct* length, but the sync request could be 
sent after close().
The root cause is that DFSOutputStream#flushOrSync() is not thread-safe. See the 
following simplified code:

 
{code:java}
synchronized (this) {
  ... // code before send request
}

// **We send request here, but it's not included in synchronized code block**
if (getStreamer().getPersistBlocks().getAndSet(false) || updateLength) {
  try {
    dfsClient.namenode.fsync(src, fileId, dfsClient.clientName,
        lastBlockLength);
  } catch (IOException ioe) {
    // Deal with ioe
  }
}

synchronized (this) {
  ... // code after send request
}
{code}


{quote}
2. Looking at the log, your minimal replication number is 2, rather than 1. 
That's very unusual. In my past experience a lot of weird behavior like this 
could arise when you have that kind of configuration.
{quote}
I'm not sure whether data reliability would be affected if minimal replication 
is set to 1. Do you have any experience with this?

{quote}
3.And why is close() in the picture? IMHO you don't even need to close(). 
Suppose you block DataNode heartbeat, and let client keep the file open and 
then call sync(), the last block's state remains in COMMITTED. Would that cause 
the same behavior?
{quote}
close() must be called after sync(see above root cause). If you don't call 
close(), the last block's state can't change to COMMITTED, right? 

{quote}
4.Looking at the patch, I would like to ask you to stay away from using 
reflection. You could refactor FSNamesystem and DFSOutputStream to return a new 
FSNamesystem/DFSOutputStream object and override them in the test code. That 
way, you don't need to introduce new configurations too. And it'll be much 
cleaner.
{quote}
I don't understand this. There is no API that set impl of FSNamesystem in 
MiniCluster. Could you give me a sample in another test case? I will rewrite 
this patch.

{quote}
5.I don't understand the following code.
{quote}
My fixed code is not the final version, because I don't know whether the current 
DFSOutputStream#flushOrSync() impl is OK. We send the sync RPC without the lock 
in the client, maybe for a performance benefit? If it is for a benefit, we just 
fix the server side. If not, we need to fix both the server side and the client.
On the server side, we could log a warning for a wrong length and throw an 
exception for an invalid state. Is this better than the current version?

 

 


was (Author: gzh1992n):
{{[~jojochuang], Thanks for reviewing.}}

{{{quote}}}

{{1.}}It seems to me the root of problem is that client would call fsync() with 
an incorrect length (shorter than what it is supposed to sync). If that's the 
case you should fix the client (DFSOutputStream), rather than the NameNode.

{{{quote}}}
{{Client call sync() with a *corrent* length, but sync request could be sent 
after close().}}
{{The root cause is that DFSOutputStream#flushOrSync() is not thread-safe. See 
following simplified code:}}

{{synchronized (this) {}}
{{ ... // code before send request}}
{{}}}

{{// **We send request here, but it's not included in synchronized code block**
if (getStreamer().getPersistBlocks().getAndSet(false) || updateLength) \{
  try {
dfsClient.namenode.fsync(src, fileId, dfsClient.clientName,
lastBlockLength);
  } catch (IOException ioe) \{
 // Deal with ioe
  }
}}}

{{synchronized (this) { }}
{{  ... // code after send request}}
{{}}}

{{{quote}}}

2. Looking at the log, your minimal replication number is 2, rather than 1. 
That's very unusual. In my past experience a lot of weird behavior like this 
could arise when you have that kind of configuration.

{{{quote}}}
{{I'm not sure that data reliability would be affected  if minimal replication 
set to 1. Do you have some experience about this?}}
{{{quote}}}

3.And why is close() in the picture? IMHO you don't even need to close(). 
Suppose you block DataNode heartbeat, and let client keep the file open and 
then call sync(), the last block's state remains in COMMITTED. Would that cause 
the same behavior?

{{{quote}}}
{{close() must be called after sync(see above root cause). If you don't call 
close(), the last block's state can't change to COMMITTED, right? }}
{{{quote}}}

4.Looking at the patch, I would like to ask you to stay away from using 
reflection. You could refactor FSNamesystem and DFSOutputStream to return a new 
FSNamesystem/DFSOutputStream object and override them in the test code. That 
way, you don't need to introduce new 

[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time

2018-03-08 Thread Zephyr Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392309#comment-16392309
 ] 

Zephyr Guo commented on HDFS-13243:
---

[~jojochuang], thanks for reviewing.

{quote}
1. It seems to me the root of problem is that client would call fsync() with an 
incorrect length (shorter than what it is supposed to sync). If that's the case 
you should fix the client (DFSOutputStream), rather than the NameNode.
{quote}
The client calls sync() with a *correct* length, but the sync request could be 
sent after close().
The root cause is that DFSOutputStream#flushOrSync() is not thread-safe. See the 
following simplified code:

{code:java}
synchronized (this) {
  ... // code before send request
}

// **We send request here, but it's not included in synchronized code block**
if (getStreamer().getPersistBlocks().getAndSet(false) || updateLength) {
  try {
    dfsClient.namenode.fsync(src, fileId, dfsClient.clientName,
        lastBlockLength);
  } catch (IOException ioe) {
    // Deal with ioe
  }
}

synchronized (this) {
  ... // code after send request
}
{code}

{quote}
2. Looking at the log, your minimal replication number is 2, rather than 1. 
That's very unusual. In my past experience a lot of weird behavior like this 
could arise when you have that kind of configuration.
{quote}
I'm not sure whether data reliability would be affected if minimal replication 
is set to 1. Do you have any experience with this?

{quote}
3. And why is close() in the picture? IMHO you don't even need to close(). 
Suppose you block DataNode heartbeat, and let client keep the file open and 
then call sync(), the last block's state remains in COMMITTED. Would that cause 
the same behavior?
{quote}
close() must be called after sync (see the root cause above). If you don't call 
close(), the last block's state can't change to COMMITTED, right?

{quote}
4. Looking at the patch, I would like to ask you to stay away from using 
reflection. You could refactor FSNamesystem and DFSOutputStream to return a new 
FSNamesystem/DFSOutputStream object and override them in the test code. That 
way, you don't need to introduce new configurations too. And it'll be much 
cleaner.
{quote}
I don't understand this. There is no API that sets the impl of FSNamesystem in 
MiniCluster. Could you give me a sample in another test case? I will rewrite 
this patch.

{quote}
5. I don't understand the following code.
{quote}
My fixed code is not the final version, because I don't know whether the current 
DFSOutputStream#flushOrSync() impl is OK. We send the sync RPC without the lock 
in the client, maybe for a performance benefit? If it is for a benefit, we just 
fix the server side. If not, we need to fix both the server side and the client.
On the server side, we could log a warning for a wrong length and throw an 
exception for an invalid state. Is this better than the current version?

 

> Get CorruptBlock because of calling close and sync in same time
> ---
>
> Key: HDFS-13243
> URL: https://issues.apache.org/jira/browse/HDFS-13243
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2, 3.2.0
>Reporter: Zephyr Guo
>Assignee: Zephyr Guo
>Priority: Critical
> Fix For: 3.2.0
>
> Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch
>
>
> HDFS File might get broken because of corrupt block(s) that could be produced 
> by calling close and sync in the same time.
> When calling close was not successful, UCBlock status would change to 
> COMMITTED, and if a sync request gets popped from queue and processed, sync 
> operation would change the last block length.
> After that, DataNode would report all received block to NameNode, and will 
> check Block length of all COMMITTED Blocks. But the block length was already 
> different between recorded in NameNode memory and reported by DataNode, and 
> consequently, the last block is marked as corruptted because of inconsistent 
> length.
>  
> {panel:title=Log in my hdfs}
> 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, 
> truncateBlock=null, primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  for 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> fsync: 
> 

[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time

2018-03-07 Thread Zephyr Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390772#comment-16390772
 ] 

Zephyr Guo commented on HDFS-13243:
---

Attached the v2 patch to fix the NoSuchMethodException.

> Get CorruptBlock because of calling close and sync in same time
> ---
>
> Key: HDFS-13243
> URL: https://issues.apache.org/jira/browse/HDFS-13243
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2, 3.2.0
>Reporter: Zephyr Guo
>Priority: Critical
> Fix For: 3.2.0
>
> Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch
>
>
> HDFS File might get broken because of corrupt block(s) that could be produced 
> by calling close and sync in the same time.
> When calling close was not successful, UCBlock status would change to 
> COMMITTED, and if a sync request gets popped from queue and processed, sync 
> operation would change the last block length.
> After that, DataNode would report all received block to NameNode, and will 
> check Block length of all COMMITTED Blocks. But the block length was already 
> different between recorded in NameNode memory and reported by DataNode, and 
> consequently, the last block is marked as corruptted because of inconsistent 
> length.
>  
> {panel:title=Log in my hdfs}
> 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, 
> truncateBlock=null, primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  for 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> fsync: 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
>  for DFSClient_NONMAPREDUCE_1077513762_1
> 2018-03-05 04:05:39,761 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in 
> file 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: 
> blockMap updated: 10.0.0.220:50010 is added to 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  size 2054413
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.219:50010 by 
> hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.218:50010 by 
> hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:40,162 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 3 >= minimum = 2) in 
> file 
> 

[jira] [Updated] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time

2018-03-07 Thread Zephyr Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zephyr Guo updated HDFS-13243:
--
Attachment: HDFS-13243-v2.patch

> Get CorruptBlock because of calling close and sync in same time
> ---
>
> Key: HDFS-13243
> URL: https://issues.apache.org/jira/browse/HDFS-13243
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2, 3.2.0
>Reporter: Zephyr Guo
>Priority: Critical
> Fix For: 3.2.0
>
> Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch
>
>
> HDFS File might get broken because of corrupt block(s) that could be produced 
> by calling close and sync in the same time.
> When calling close was not successful, UCBlock status would change to 
> COMMITTED, and if a sync request gets popped from queue and processed, sync 
> operation would change the last block length.
> After that, DataNode would report all received block to NameNode, and will 
> check Block length of all COMMITTED Blocks. But the block length was already 
> different between recorded in NameNode memory and reported by DataNode, and 
> consequently, the last block is marked as corruptted because of inconsistent 
> length.
>  
> {panel:title=Log in my hdfs}
> 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, 
> truncateBlock=null, primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  for 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> fsync: 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
>  for DFSClient_NONMAPREDUCE_1077513762_1
> 2018-03-05 04:05:39,761 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in 
> file 
> /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: 
> blockMap updated: 10.0.0.220:50010 is added to 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  size 2054413
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.219:50010 by 
> hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
> 10.0.0.218:50010 by 
> hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 because block is 
> COMMITTED and reported length 2054413 does not match length in block map 
> 141232
> 2018-03-05 04:05:40,162 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
>  
> ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
>  is not COMPLETE (ucState = COMMITTED, replication# = 3 >= minimum = 2) in 
> file 
> 

[jira] [Updated] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time

2018-03-07 Thread Zephyr Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zephyr Guo updated HDFS-13243:
--
Description: 
An HDFS file might get broken because of corrupt block(s) that could be produced 
by calling close and sync at the same time.

When calling close was not successful, the UCBlock status would change to 
COMMITTED, and if a sync request then gets popped from the queue and processed, 
the sync operation would change the last block length.

After that, the DataNode would report all received blocks to the NameNode, which 
will check the block length of all COMMITTED blocks. But the block length recorded 
in NameNode memory would already differ from the one reported by the DataNode, and 
consequently the last block is marked as corrupted because of the inconsistent 
length.

 
{panel:title=Log in my hdfs}
2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, 
truncateBlock=null, primaryNodeIndex=-1, 
replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
 
ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
 
ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
 for 
/hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* fsync: 
/hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
 for DFSClient_NONMAPREDUCE_1077513762_1
2018-03-05 04:05:39,761 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
primaryNodeIndex=-1, 
replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
 
ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
 
ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
 is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in file 
/hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap 
updated: 10.0.0.220:50010 is added to 
blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
primaryNodeIndex=-1, 
replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
 
ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
 
ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
 size 2054413
2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK 
NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
10.0.0.219:50010 by hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 
because block is COMMITTED and reported length 2054413 does not match length in 
block map 141232
2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK 
NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on 
10.0.0.218:50010 by hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 
because block is COMMITTED and reported length 2054413 does not match length in 
block map 141232
2018-03-05 04:05:40,162 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, 
primaryNodeIndex=-1, 
replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
 
ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW],
 
ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
 is not COMPLETE (ucState = COMMITTED, replication# = 3 >= minimum = 2) in file 
/hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
{panel}

  was:
HDFS File might get broken because of corrupt block(s) that could be produced 
by calling close and sync in the same time.

When calling close was not successful, UCBlock status would change to 
COMMITTED, and if a sync request gets popped from queue and processed, sync 
operation would change the last block length.

After that, DataNode would report all received block to NameNode, and will 
check Block length of all COMMITTED Blocks. But the block length was already 
different between recorded in NameNode memory and reported by DataNode, and 
consequently, the last block is marked as corruptted because of inconsistent 
length.


> Get CorruptBlock because of 

[jira] [Updated] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time

2018-03-07 Thread Zephyr Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zephyr Guo updated HDFS-13243:
--
Summary: Get CorruptBlock because of calling close and sync in same time  
(was: File might get broken because of calling close and sync in same time)

> Get CorruptBlock because of calling close and sync in same time
> ---
>
> Key: HDFS-13243
> URL: https://issues.apache.org/jira/browse/HDFS-13243
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2, 3.2.0
>Reporter: Zephyr Guo
>Priority: Critical
> Fix For: 3.2.0
>
> Attachments: HDFS-13243-v1.patch
>
>
> HDFS File might get broken because of corrupt block(s) that could be produced 
> by calling close and sync in the same time.
> When calling close was not successful, UCBlock status would change to 
> COMMITTED, and if a sync request gets popped from queue and processed, sync 
> operation would change the last block length.
> After that, DataNode would report all received block to NameNode, and will 
> check Block length of all COMMITTED Blocks. But the block length was already 
> different between recorded in NameNode memory and reported by DataNode, and 
> consequently, the last block is marked as corruptted because of inconsistent 
> length.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13243) File might get broken because of calling close and sync in same time

2018-03-07 Thread Zephyr Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zephyr Guo updated HDFS-13243:
--
Attachment: HDFS-13243-v1.patch

> File might get broken because of calling close and sync in same time
> 
>
> Key: HDFS-13243
> URL: https://issues.apache.org/jira/browse/HDFS-13243
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2, 3.2.0
>Reporter: Zephyr Guo
>Priority: Critical
> Fix For: 3.2.0
>
> Attachments: HDFS-13243-v1.patch
>
>
> HDFS File might get broken because of corrupt block(s) that could be produced 
> by calling close and sync in the same time.
> When calling close was not successful, UCBlock status would change to 
> COMMITTED, and if a sync request gets popped from queue and processed, sync 
> operation would change the last block length.
> After that, DataNode would report all received block to NameNode, and will 
> check Block length of all COMMITTED Blocks. But the block length was already 
> different between recorded in NameNode memory and reported by DataNode, and 
> consequently, the last block is marked as corruptted because of inconsistent 
> length.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13243) File might get broken because of calling close and sync in same time

2018-03-07 Thread Zephyr Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zephyr Guo updated HDFS-13243:
--
Status: Patch Available  (was: Open)

> File might get broken because of calling close and sync in same time
> 
>
> Key: HDFS-13243
> URL: https://issues.apache.org/jira/browse/HDFS-13243
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2, 3.2.0
>Reporter: Zephyr Guo
>Priority: Critical
> Fix For: 3.2.0
>
> Attachments: HDFS-13243-v1.patch
>
>
> An HDFS file might get broken because of corrupt block(s) that can be produced 
> by calling close and sync at the same time.
> When the close call is not successful, the UCBlock status changes to 
> COMMITTED, and if a sync request is then popped from the queue and processed, 
> the sync operation changes the last block length.
> After that, the DataNode reports all received blocks to the NameNode, which 
> checks the block length of all COMMITTED blocks. But the block length recorded 
> in NameNode memory already differs from the length reported by the DataNode, so 
> the last block is marked as corrupted because of the inconsistent length.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-13243) File might get broken because of calling close and sync in same time

2018-03-07 Thread Zephyr Guo (JIRA)
Zephyr Guo created HDFS-13243:
-

 Summary: File might get broken because of calling close and sync 
in same time
 Key: HDFS-13243
 URL: https://issues.apache.org/jira/browse/HDFS-13243
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.7.2, 3.2.0
Reporter: Zephyr Guo
 Fix For: 3.2.0


An HDFS file might get broken because of corrupt block(s) that can be produced 
by calling close and sync at the same time.

When the close call is not successful, the UCBlock status changes to COMMITTED, 
and if a sync request is then popped from the queue and processed, the sync 
operation changes the last block length.

After that, the DataNode reports all received blocks to the NameNode, which 
checks the block length of all COMMITTED blocks. But the block length recorded 
in NameNode memory already differs from the length reported by the DataNode, so 
the last block is marked as corrupted because of the inconsistent length.
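
For illustration only, a minimal client-side sketch of the racing calls described 
above, assuming a reachable cluster at hdfs://localhost:8020 (the path and write 
size are arbitrary, not from this report): one thread closes the stream while 
another calls hsync on the same output stream.

{code:java}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CloseSyncRaceDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:8020"), new Configuration());
    FSDataOutputStream out = fs.create(new Path("/tmp/close-sync-race-demo"));
    out.write(new byte[4096]);

    // Thread A: close() moves the last under-construction block towards COMMITTED.
    Thread closer = new Thread(() -> {
      try { out.close(); } catch (Exception e) { e.printStackTrace(); }
    });
    // Thread B: hsync() may still be updating the last block length at the same time.
    Thread syncer = new Thread(() -> {
      try { out.hsync(); } catch (Exception e) { e.printStackTrace(); }
    });

    closer.start();
    syncer.start();
    closer.join();
    syncer.join();
    fs.close();
  }
}
{code}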



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] (HDFS-10629) Federation Router

2017-01-31 Thread Lei Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847495#comment-15847495
 ] 

Lei Guo commented on HDFS-10629:


[~elgoiri] [~jakace], do we have a rough idea of the overhead introduced by the 
router?

> Federation Router
> -
>
> Key: HDFS-10629
> URL: https://issues.apache.org/jira/browse/HDFS-10629
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs
>Reporter: Inigo Goiri
>Assignee: Jason Kace
> Attachments: HDFS-10629.000.patch, HDFS-10629.001.patch, 
> HDFS-10629-HDFS-10467-002.patch, HDFS-10629-HDFS-10467-003.patch, 
> HDFS-10629-HDFS-10467-004.patch, HDFS-10629-HDFS-10467-005.patch, 
> HDFS-10629-HDFS-10467-006.patch, HDFS-10629-HDFS-10467-007.patch
>
>
> Component that routes calls from the clients to the right Namespace. It 
> implements {{ClientProtocol}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.

2015-05-19 Thread Leitao Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14549883#comment-14549883
 ] 

Leitao Guo commented on HDFS-7692:
--

[~eddyxu], thanks for your comments, please check the new patch.
1. In DataStorage#recoverTransitionRead, log the InterruptedException and 
rethrow it as InterruptedIOException;
2. In TestDataStorage#testAddStorageDirectoreis, catch InterruptedException and 
then let the test case fail;
3. The multithreading in DataStorage#addStorageLocations() is for one specific 
namespace, so in TestDataStorage#testAddStorageDirectoreis my intention is to 
create one thread pool for each namespace. No change here.
4. Re-phrased the parameter successVolumes.

[~szetszwo], thanks for your comments, please check the new patch.
1. InterruptedException is re-thrown as InterruptedIOException;
2. I think it's a good idea to log the upgrade progress for each dir, but so far 
we cannot easily get the progress from the current API. Do you think it's 
necessary to file a new JIRA to follow this?
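
For reference, a self-contained sketch of the handling described in point 1 above; 
the class and method names are illustrative only and are not taken from the patch. 
It waits on per-directory tasks, logs an InterruptedException, and rethrows it as 
InterruptedIOException.

{code:java}
import java.io.InterruptedIOException;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

// Hypothetical helper; names are illustrative, not from DataStorage.
final class UpgradeWait {
  static void awaitAll(List<Future<?>> tasks) throws InterruptedIOException {
    for (Future<?> task : tasks) {
      try {
        task.get();
      } catch (InterruptedException e) {
        // Log, restore the interrupt flag, and convert as described in point 1.
        System.err.println("Interrupted while upgrading a storage directory: " + e);
        Thread.currentThread().interrupt();
        InterruptedIOException iioe =
            new InterruptedIOException("Interrupted while analyzing storage directories");
        iioe.initCause(e);
        throw iioe;
      } catch (ExecutionException e) {
        throw new RuntimeException("Storage directory analysis failed", e.getCause());
      }
    }
  }
}
{code}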

 DataStorage#addStorageLocations(...) should support MultiThread to speedup 
 the upgrade of block pool at multi storage directories.
 --

 Key: HDFS-7692
 URL: https://issues.apache.org/jira/browse/HDFS-7692
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 2.5.2
Reporter: Leitao Guo
Assignee: Leitao Guo
  Labels: BB2015-05-TBR
 Attachments: HDFS-7692.01.patch, HDFS-7692.02.patch


 {code:title=DataStorage#addStorageLocations(...)|borderStyle=solid}
 for (StorageLocation dataDir : dataDirs) {
   File root = dataDir.getFile();
   ... ...
   bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, startOpt);
   addBlockPoolStorage(bpid, bpStorage);
   ... ...
   successVolumes.add(dataDir);
 }
 {code}
 In the above code the storage directories are analyzed one by one, which is 
 really time consuming when upgrading HDFS on datanodes that have dozens of 
 large volumes. Multi-threaded analysis of the dataDirs should be supported 
 here to speed up the upgrade.
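
As a self-contained sketch of the direction being proposed (the directory paths 
and the task body are placeholders, not the actual DataStorage logic), each 
directory could be handed to its own task in a thread pool and the successfully 
analyzed ones collected afterwards:

{code:java}
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelDirAnalysisDemo {
  public static void main(String[] args) throws Exception {
    List<File> dataDirs = Arrays.asList(new File("/data1"), new File("/data2"), new File("/data3"));
    ExecutorService pool = Executors.newFixedThreadPool(dataDirs.size());
    List<Future<File>> futures = new ArrayList<>();
    for (File dataDir : dataDirs) {
      futures.add(pool.submit(() -> {
        // Placeholder for recoverTransitionRead(...) / addBlockPoolStorage(...).
        Thread.sleep(100);
        return dataDir;
      }));
    }
    List<File> successVolumes = new ArrayList<>();
    for (Future<File> f : futures) {
      try {
        successVolumes.add(f.get());
      } catch (ExecutionException e) {
        System.err.println("Failed to analyze a storage directory: " + e.getCause());
      }
    }
    pool.shutdown();
    System.out.println("Analyzed successfully: " + successVolumes);
  }
}
{code}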



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.

2015-05-19 Thread Leitao Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14549930#comment-14549930
 ] 

Leitao Guo commented on HDFS-7692:
--

Sorry it's my mistake to comment many times here! It seems that my network 
condition is not very good now...

 DataStorage#addStorageLocations(...) should support MultiThread to speedup 
 the upgrade of block pool at multi storage directories.
 --

 Key: HDFS-7692
 URL: https://issues.apache.org/jira/browse/HDFS-7692
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 2.5.2
Reporter: Leitao Guo
Assignee: Leitao Guo
  Labels: BB2015-05-TBR
 Attachments: HDFS-7692.01.patch, HDFS-7692.02.patch


 {code:title=DataStorage#addStorageLocations(...)|borderStyle=solid}
 for (StorageLocation dataDir : dataDirs) {
   File root = dataDir.getFile();
   ... ...
   bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, startOpt);
   addBlockPoolStorage(bpid, bpStorage);
   ... ...
   successVolumes.add(dataDir);
 }
 {code}
 In the above code the storage directories are analyzed one by one, which is 
 really time consuming when upgrading HDFS on datanodes that have dozens of 
 large volumes. Multi-threaded analysis of the dataDirs should be supported 
 here to speed up the upgrade.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.

2015-05-19 Thread Leitao Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leitao Guo updated HDFS-7692:
-
Attachment: HDFS-7692.02.patch

 DataStorage#addStorageLocations(...) should support MultiThread to speedup 
 the upgrade of block pool at multi storage directories.
 --

 Key: HDFS-7692
 URL: https://issues.apache.org/jira/browse/HDFS-7692
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 2.5.2
Reporter: Leitao Guo
Assignee: Leitao Guo
  Labels: BB2015-05-TBR
 Attachments: HDFS-7692.01.patch, HDFS-7692.02.patch


 {code:title=DataStorage#addStorageLocations(...)|borderStyle=solid}
 for (StorageLocation dataDir : dataDirs) {
   File root = dataDir.getFile();
   ... ...
   bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, startOpt);
   addBlockPoolStorage(bpid, bpStorage);
   ... ...
   successVolumes.add(dataDir);
 }
 {code}
 In the above code the storage directories are analyzed one by one, which is 
 really time consuming when upgrading HDFS on datanodes that have dozens of 
 large volumes. Multi-threaded analysis of the dataDirs should be supported 
 here to speed up the upgrade.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.

2015-05-19 Thread Leitao Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14549895#comment-14549895
 ] 

Leitao Guo commented on HDFS-7692:
--

[~eddyxu], thanks for your comments, please have a check of the new patch.
1.In DataStorage#recoverTransitionRead, log the InterruptedException and 
rethrow it as InterruptedIOException;
2.In TestDataStorage#testAddStorageDirectoreis, catch InterruptedException then 
let the test case fail;
3.The multithread in DataStorage#addStorageLocations() is for one specific 
namespace, so in TestDataStorage#testAddStorageDirectoreis my intention is 
creating one thread pool for each namespace. Not change here.
4.Re-phrase the parameter successVolumes.

[~szetszwo],thanks for your comments, please have a check of the new patch.
1. InterruptedException re-thrown as InterruptedIOException;
2. I think it's a good idea to log the upgrade progress for each dir, but so 
far, we can not get the progress easily from the current api. Do you think it's 
necessary to file a new jira to follow this?

 DataStorage#addStorageLocations(...) should support MultiThread to speedup 
 the upgrade of block pool at multi storage directories.
 --

 Key: HDFS-7692
 URL: https://issues.apache.org/jira/browse/HDFS-7692
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 2.5.2
Reporter: Leitao Guo
Assignee: Leitao Guo
  Labels: BB2015-05-TBR
 Attachments: HDFS-7692.01.patch, HDFS-7692.02.patch


 {code:title=DataStorage#addStorageLocations(...)|borderStyle=solid}
 for (StorageLocation dataDir : dataDirs) {
   File root = dataDir.getFile();
  ... ...
 bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, 
 startOpt);
 addBlockPoolStorage(bpid, bpStorage);
 ... ...
   successVolumes.add(dataDir);
 }
 {code}
 In the above code the storage directories will be analyzed one by one, which 
 is really time consuming when upgrading HDFS with datanodes have dozens of 
 large volumes.  MultiThread dataDirs analyzing should be supported here to 
 speedup upgrade.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.

2015-05-19 Thread Leitao Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14549897#comment-14549897
 ] 

Leitao Guo commented on HDFS-7692:
--

[~eddyxu], thanks for your comments, please have a check of the new patch.
1.In DataStorage#recoverTransitionRead, log the InterruptedException and 
rethrow it as InterruptedIOException;
2.In TestDataStorage#testAddStorageDirectoreis, catch InterruptedException then 
let the test case fail;
3.The multithread in DataStorage#addStorageLocations() is for one specific 
namespace, so in TestDataStorage#testAddStorageDirectoreis my intention is 
creating one thread pool for each namespace. Not change here.
4.Re-phrase the parameter successVolumes.

[~szetszwo],thanks for your comments, please have a check of the new patch.
1. InterruptedException re-thrown as InterruptedIOException;
2. I think it's a good idea to log the upgrade progress for each dir, but so 
far, we can not get the progress easily from the current api. Do you think it's 
necessary to file a new jira to follow this?

 DataStorage#addStorageLocations(...) should support MultiThread to speedup 
 the upgrade of block pool at multi storage directories.
 --

 Key: HDFS-7692
 URL: https://issues.apache.org/jira/browse/HDFS-7692
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 2.5.2
Reporter: Leitao Guo
Assignee: Leitao Guo
  Labels: BB2015-05-TBR
 Attachments: HDFS-7692.01.patch, HDFS-7692.02.patch


 {code:title=DataStorage#addStorageLocations(...)|borderStyle=solid}
 for (StorageLocation dataDir : dataDirs) {
   File root = dataDir.getFile();
  ... ...
 bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, 
 startOpt);
 addBlockPoolStorage(bpid, bpStorage);
 ... ...
   successVolumes.add(dataDir);
 }
 {code}
 In the above code the storage directories will be analyzed one by one, which 
 is really time consuming when upgrading HDFS with datanodes have dozens of 
 large volumes.  MultiThread dataDirs analyzing should be supported here to 
 speedup upgrade.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.

2015-05-19 Thread Leitao Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14549913#comment-14549913
 ] 

Leitao Guo commented on HDFS-7692:
--

[~eddyxu], thanks for your comments, please have a check of the new patch.
1.In DataStorage#recoverTransitionRead, log the InterruptedException and 
rethrow it as InterruptedIOException;
2.In TestDataStorage#testAddStorageDirectoreis, catch InterruptedException then 
let the test case fail;
3.The multithread in DataStorage#addStorageLocations() is for one specific 
namespace, so in TestDataStorage#testAddStorageDirectoreis my intention is 
creating one thread pool for each namespace. Not change here.
4.Re-phrase the parameter successVolumes.

[~szetszwo],thanks for your comments, please have a check of the new patch.
1. InterruptedException re-thrown as InterruptedIOException;
2. I think it's a good idea to log the upgrade progress for each dir, but so 
far, we can not get the progress easily from the current api. Do you think it's 
necessary to file a new jira to follow this?

 DataStorage#addStorageLocations(...) should support MultiThread to speedup 
 the upgrade of block pool at multi storage directories.
 --

 Key: HDFS-7692
 URL: https://issues.apache.org/jira/browse/HDFS-7692
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 2.5.2
Reporter: Leitao Guo
Assignee: Leitao Guo
  Labels: BB2015-05-TBR
 Attachments: HDFS-7692.01.patch, HDFS-7692.02.patch


 {code:title=DataStorage#addStorageLocations(...)|borderStyle=solid}
 for (StorageLocation dataDir : dataDirs) {
   File root = dataDir.getFile();
  ... ...
 bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, 
 startOpt);
 addBlockPoolStorage(bpid, bpStorage);
 ... ...
   successVolumes.add(dataDir);
 }
 {code}
 In the above code the storage directories will be analyzed one by one, which 
 is really time consuming when upgrading HDFS with datanodes have dozens of 
 large volumes.  MultiThread dataDirs analyzing should be supported here to 
 speedup upgrade.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.

2015-05-19 Thread Leitao Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14549910#comment-14549910
 ] 

Leitao Guo commented on HDFS-7692:
--

[~eddyxu], thanks for your comments, please have a check of the new patch.
1.In DataStorage#recoverTransitionRead, log the InterruptedException and 
rethrow it as InterruptedIOException;
2.In TestDataStorage#testAddStorageDirectoreis, catch InterruptedException then 
let the test case fail;
3.The multithread in DataStorage#addStorageLocations() is for one specific 
namespace, so in TestDataStorage#testAddStorageDirectoreis my intention is 
creating one thread pool for each namespace. Not change here.
4.Re-phrase the parameter successVolumes.

[~szetszwo],thanks for your comments, please have a check of the new patch.
1. InterruptedException re-thrown as InterruptedIOException;
2. I think it's a good idea to log the upgrade progress for each dir, but so 
far, we can not get the progress easily from the current api. Do you think it's 
necessary to file a new jira to follow this?

 DataStorage#addStorageLocations(...) should support MultiThread to speedup 
 the upgrade of block pool at multi storage directories.
 --

 Key: HDFS-7692
 URL: https://issues.apache.org/jira/browse/HDFS-7692
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 2.5.2
Reporter: Leitao Guo
Assignee: Leitao Guo
  Labels: BB2015-05-TBR
 Attachments: HDFS-7692.01.patch, HDFS-7692.02.patch


 {code:title=DataStorage#addStorageLocations(...)|borderStyle=solid}
 for (StorageLocation dataDir : dataDirs) {
   File root = dataDir.getFile();
  ... ...
 bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, 
 startOpt);
 addBlockPoolStorage(bpid, bpStorage);
 ... ...
   successVolumes.add(dataDir);
 }
 {code}
 In the above code the storage directories will be analyzed one by one, which 
 is really time consuming when upgrading HDFS with datanodes have dozens of 
 large volumes.  MultiThread dataDirs analyzing should be supported here to 
 speedup upgrade.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.

2015-05-19 Thread Leitao Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14549914#comment-14549914
 ] 

Leitao Guo commented on HDFS-7692:
--

[~eddyxu], thanks for your comments, please have a check of the new patch.
1.In DataStorage#recoverTransitionRead, log the InterruptedException and 
rethrow it as InterruptedIOException;
2.In TestDataStorage#testAddStorageDirectoreis, catch InterruptedException then 
let the test case fail;
3.The multithread in DataStorage#addStorageLocations() is for one specific 
namespace, so in TestDataStorage#testAddStorageDirectoreis my intention is 
creating one thread pool for each namespace. Not change here.
4.Re-phrase the parameter successVolumes.

[~szetszwo],thanks for your comments, please have a check of the new patch.
1. InterruptedException re-thrown as InterruptedIOException;
2. I think it's a good idea to log the upgrade progress for each dir, but so 
far, we can not get the progress easily from the current api. Do you think it's 
necessary to file a new jira to follow this?

 DataStorage#addStorageLocations(...) should support MultiThread to speedup 
 the upgrade of block pool at multi storage directories.
 --

 Key: HDFS-7692
 URL: https://issues.apache.org/jira/browse/HDFS-7692
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 2.5.2
Reporter: Leitao Guo
Assignee: Leitao Guo
  Labels: BB2015-05-TBR
 Attachments: HDFS-7692.01.patch, HDFS-7692.02.patch


 {code:title=DataStorage#addStorageLocations(...)|borderStyle=solid}
 for (StorageLocation dataDir : dataDirs) {
   File root = dataDir.getFile();
  ... ...
 bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, 
 startOpt);
 addBlockPoolStorage(bpid, bpStorage);
 ... ...
   successVolumes.add(dataDir);
 }
 {code}
 In the above code the storage directories will be analyzed one by one, which 
 is really time consuming when upgrading HDFS with datanodes have dozens of 
 large volumes.  MultiThread dataDirs analyzing should be supported here to 
 speedup upgrade.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.

2015-02-06 Thread Leitao Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14310587#comment-14310587
 ] 

Leitao Guo commented on HDFS-7692:
--

[~eddyxu], thanks for your review and comments!

When running the tests after applying the patch, I got a NullPointerException at this 
line, so I added the check of null != datanode.getConf() here.
{code}
Executors.newFixedThreadPool(null != datanode.getConf()
    ? datanode.getConf().getInt(
        DFSConfigKeys.DFS_DATANODE_DIRECTORYSCAN_THREADS_KEY,
        DFSConfigKeys.DFS_DATANODE_DIRECTORYSCAN_THREADS_DEFAULT)
    : dataDirs.size());
{code}

I will update the patch according to your comments, thanks!
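
For clarity, the guard above boils down to the following fallback rule. This is a sketch only, with a plain Map standing in for the real Hadoop Configuration class and made-up key names: if no configuration is available, size the pool by the number of data directories; otherwise read the configured thread count.

{code:title=PoolSizingSketch.java (illustrative sketch, not the patch)|borderStyle=solid}
import java.util.Map;

public class PoolSizingSketch {

  // conf may legitimately be null in some test setups, which is what
  // triggered the NullPointerException described above.
  static int poolSize(Map<String, String> conf, String key, int defaultThreads,
                      int dataDirCount) {
    if (conf == null) {
      return dataDirCount;          // fall back: one thread per directory
    }
    String value = conf.get(key);
    return value != null ? Integer.parseInt(value) : defaultThreads;
  }

  public static void main(String[] args) {
    System.out.println(poolSize(null, "upgrade.threads", 4, 12));   // prints 12
    System.out.println(poolSize(Map.of("upgrade.threads", "8"),
        "upgrade.threads", 4, 12));                                  // prints 8
  }
}
{code}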

 DataStorage#addStorageLocations(...) should support MultiThread to speedup 
 the upgrade of block pool at multi storage directories.
 --

 Key: HDFS-7692
 URL: https://issues.apache.org/jira/browse/HDFS-7692
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 2.5.2
Reporter: Leitao Guo
Assignee: Leitao Guo
 Attachments: HDFS-7692.01.patch


 {code:title=DataStorage#addStorageLocations(...)|borderStyle=solid}
 for (StorageLocation dataDir : dataDirs) {
   File root = dataDir.getFile();
  ... ...
 bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, 
 startOpt);
 addBlockPoolStorage(bpid, bpStorage);
 ... ...
   successVolumes.add(dataDir);
 }
 {code}
 In the above code the storage directories will be analyzed one by one, which 
 is really time consuming when upgrading HDFS with datanodes have dozens of 
 large volumes.  MultiThread dataDirs analyzing should be supported here to 
 speedup upgrade.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.

2015-02-06 Thread Leitao Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14310591#comment-14310591
 ] 

Leitao Guo commented on HDFS-7692:
--

When upgrading before the patch, I found high CPU utilization (~90%) in our 
cluster, so I think we'd better control the number of threads here. I will run a 
test to verify this.
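
A rough sketch of the kind of cap being suggested, with made-up names and numbers (not the actual patch): bound the pool by a configured maximum so a datanode with dozens of volumes does not drive every core to 100%.

{code:title=BoundedUpgradePoolSketch.java (illustrative sketch, not the patch)|borderStyle=solid}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BoundedUpgradePoolSketch {

  // Never more threads than volumes, and never more than the configured cap.
  static ExecutorService newUpgradePool(int dataDirCount, int configuredMax) {
    int threads = Math.max(1, Math.min(dataDirCount, configuredMax));
    return Executors.newFixedThreadPool(threads);
  }

  public static void main(String[] args) {
    ExecutorService pool = newUpgradePool(24, 8);  // hypothetical: 24 volumes, cap of 8
    System.out.println("upgrade pool created with a bounded number of threads");
    pool.shutdown();
  }
}
{code}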

 DataStorage#addStorageLocations(...) should support MultiThread to speedup 
 the upgrade of block pool at multi storage directories.
 --

 Key: HDFS-7692
 URL: https://issues.apache.org/jira/browse/HDFS-7692
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 2.5.2
Reporter: Leitao Guo
Assignee: Leitao Guo
 Attachments: HDFS-7692.01.patch


 {code:title=DataStorage#addStorageLocations(...)|borderStyle=solid}
 for (StorageLocation dataDir : dataDirs) {
   File root = dataDir.getFile();
  ... ...
 bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, 
 startOpt);
 addBlockPoolStorage(bpid, bpStorage);
 ... ...
   successVolumes.add(dataDir);
 }
 {code}
 In the above code the storage directories will be analyzed one by one, which 
 is really time consuming when upgrading HDFS with datanodes have dozens of 
 large volumes.  MultiThread dataDirs analyzing should be supported here to 
 speedup upgrade.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.

2015-02-06 Thread Leitao Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14310592#comment-14310592
 ] 

Leitao Guo commented on HDFS-7692:
--

[~szetszwo], thanks for your comments; I will update the patch ASAP.

 DataStorage#addStorageLocations(...) should support MultiThread to speedup 
 the upgrade of block pool at multi storage directories.
 --

 Key: HDFS-7692
 URL: https://issues.apache.org/jira/browse/HDFS-7692
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 2.5.2
Reporter: Leitao Guo
Assignee: Leitao Guo
 Attachments: HDFS-7692.01.patch


 {code:title=DataStorage#addStorageLocations(...)|borderStyle=solid}
 for (StorageLocation dataDir : dataDirs) {
   File root = dataDir.getFile();
  ... ...
 bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, 
 startOpt);
 addBlockPoolStorage(bpid, bpStorage);
 ... ...
   successVolumes.add(dataDir);
 }
 {code}
 In the above code the storage directories will be analyzed one by one, which 
 is really time consuming when upgrading HDFS with datanodes have dozens of 
 large volumes.  MultiThread dataDirs analyzing should be supported here to 
 speedup upgrade.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.

2015-02-04 Thread Leitao Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leitao Guo updated HDFS-7692:
-
Attachment: HDFS-7692.01.patch

 DataStorage#addStorageLocations(...) should support MultiThread to speedup 
 the upgrade of block pool at multi storage directories.
 --

 Key: HDFS-7692
 URL: https://issues.apache.org/jira/browse/HDFS-7692
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 2.5.2
Reporter: Leitao Guo
 Attachments: HDFS-7692.01.patch


 {code:title=DataStorage#addStorageLocations(...)|borderStyle=solid}
 for (StorageLocation dataDir : dataDirs) {
   File root = dataDir.getFile();
  ... ...
 bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, 
 startOpt);
 addBlockPoolStorage(bpid, bpStorage);
 ... ...
   successVolumes.add(dataDir);
 }
 {code}
 In the above code the storage directories will be analyzed one by one, which 
 is really time consuming when upgrading HDFS with datanodes have dozens of 
 large volumes.  MultiThread dataDirs analyzing should be supported here to 
 speedup upgrade.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.

2015-02-04 Thread Leitao Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leitao Guo updated HDFS-7692:
-
Release Note: Please help review the patch. Thanks!
  Status: Patch Available  (was: Open)

 DataStorage#addStorageLocations(...) should support MultiThread to speedup 
 the upgrade of block pool at multi storage directories.
 --

 Key: HDFS-7692
 URL: https://issues.apache.org/jira/browse/HDFS-7692
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 2.5.2
Reporter: Leitao Guo
 Attachments: HDFS-7692.01.patch


 {code:title=DataStorage#addStorageLocations(...)|borderStyle=solid}
 for (StorageLocation dataDir : dataDirs) {
   File root = dataDir.getFile();
  ... ...
 bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, 
 startOpt);
 addBlockPoolStorage(bpid, bpStorage);
 ... ...
   successVolumes.add(dataDir);
 }
 {code}
 In the above code the storage directories will be analyzed one by one, which 
 is really time consuming when upgrading HDFS with datanodes have dozens of 
 large volumes.  MultiThread dataDirs analyzing should be supported here to 
 speedup upgrade.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.

2015-02-03 Thread Leitao Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leitao Guo updated HDFS-7692:
-
Description: 
{code:title=DataStorage#addStorageLocations(...)|borderStyle=solid}
for (StorageLocation dataDir : dataDirs) {
  File root = dataDir.getFile();
 ... ...
bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, startOpt);
addBlockPoolStorage(bpid, bpStorage);
... ...
  successVolumes.add(dataDir);
}
{code}

In the above code the storage directories will be analyzed one by one, which is 
really time consuming when upgrading HDFS with datanodes have dozens of large 
volumes.  MultiThread dataDirs analyzing should be supported here to speedup 
upgrade.

  was:
{code:title=DataStorage#addStorageLocations...)|borderStyle=solid}
for (StorageLocation dataDir : dataDirs) {
  File root = dataDir.getFile();
 ... ...
bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, startOpt);
addBlockPoolStorage(bpid, bpStorage);
... ...
  successVolumes.add(dataDir);
}
{code}

In the above code the storage directories will be analyzed one by one, which is 
really time consuming when upgrading HDFS with datanodes have dozens of large 
volumes.  MultiThread dataDirs analyzing should be supported here to speedup 
upgrade.


 DataStorage#addStorageLocations(...) should support MultiThread to speedup 
 the upgrade of block pool at multi storage directories.
 --

 Key: HDFS-7692
 URL: https://issues.apache.org/jira/browse/HDFS-7692
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 2.5.2
Reporter: Leitao Guo

 {code:title=DataStorage#addStorageLocations(...)|borderStyle=solid}
 for (StorageLocation dataDir : dataDirs) {
   File root = dataDir.getFile();
  ... ...
 bpStorage.recoverTransitionRead(datanode, nsInfo, bpDataDirs, 
 startOpt);
 addBlockPoolStorage(bpid, bpStorage);
 ... ...
   successVolumes.add(dataDir);
 }
 {code}
 In the above code the storage directories will be analyzed one by one, which 
 is really time consuming when upgrading HDFS with datanodes have dozens of 
 large volumes.  MultiThread dataDirs analyzing should be supported here to 
 speedup upgrade.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7692) DataStorage#addStorageLocations(...) should support MultiThread to speedup the upgrade of block pool at multi storage directories.

2015-02-03 Thread Leitao Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leitao Guo updated HDFS-7692:
-
Summary: DataStorage#addStorageLocations(...) should support MultiThread to 
speedup the upgrade of block pool at multi storage directories.  (was: 
BlockPoolSliceStorage#loadBpStorageDirectories(...) should support MultiThread 
to speedup the upgrade of block pool at multi storage directories.)

 DataStorage#addStorageLocations(...) should support MultiThread to speedup 
 the upgrade of block pool at multi storage directories.
 --

 Key: HDFS-7692
 URL: https://issues.apache.org/jira/browse/HDFS-7692
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 2.5.2
Reporter: Leitao Guo

 {code:title=BlockPoolSliceStorage#loadBpStorageDirectories(...)|borderStyle=solid}
 for (File dataDir : dataDirs) {
   if (containsStorageDir(dataDir)) {
     throw new IOException(
         "BlockPoolSliceStorage.recoverTransitionRead: " +
         "attempt to load an used block storage: " + dataDir);
   }
   StorageDirectory sd =
       loadStorageDirectory(datanode, nsInfo, dataDir, startOpt);
   succeedDirs.add(sd);
 }
 {code}
 In the above code the storage directories will be analyzed one by one, which 
 is really time consuming when upgrading HDFS with datanodes have dozens of 
 large volumes.  MultiThread dataDirs analyzing should be supported here to 
 speedup upgrade.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


  1   2   >